Lost in the Middle - Benchmark Suite v4

A Modular, Reproducible Benchmark Suite for Evaluating Position Bias in Long-Context Language Models


Table of Contents

  1. What is "Lost in the Middle"?
  2. Position Bias Index (PBI)
  3. Architecture Overview
  4. Experiment Descriptions
  5. Quick Start
  6. Kaggle Usage
  7. Output Structure
  8. Results & Graphs
  9. Conclusions & Discussion
  10. Extending the Suite
  11. Citation

1. What is "Lost in the Middle"?

1.1 The Core Phenomenon

The "Lost in the Middle" (LITM) effect, first systematically documented by Liu et al. (2023), describes a critical failure mode in large language models (LLMs) when processing long contexts:

Models perform best when relevant information appears at the beginning or end of a context, and worst when it is buried in the middle.

This creates a characteristic U-shaped accuracy curve when plotting model performance against the position of the target information within a long document.

1.2 Why Does This Happen?

The LITM effect arises from how modern transformer-based LLMs process attention:

  • Attention Dilution: In long sequences, the softmax over attention weights becomes increasingly diffuse. Middle-position tokens receive proportionally less attention mass than edge-position tokens.
  • Positional Bias in Training: Pretraining data often places key information at document boundaries (introductions, summaries). Models learn a positional prior that favors start and end positions.
  • KV Cache Pressure: During autoregressive generation, the key-value cache grows linearly with sequence length. Attention computation over very long contexts becomes noisier in the middle regions.
  • Softmax Saturation: With many tokens competing for attention probability mass, individual middle tokens are "drowned out" by the aggregate signal from surrounding tokens.

1.3 The U-Shaped Curve

When you plot accuracy vs. position, you see a U-shape:

Accuracy
  1.0 |  ●                                     ●
  0.9 |    ●                                 ●
  0.8 |      ●                             ●
  0.7 |        ●                         ●
  0.6 |          ●       ●       ●     ●
  0.5 |            ●       ●       ●
      +----+----+----+----+----+----+----+----+
      0   0.12 0.25 0.37  0.5  0.62 0.75 0.87  1.0
                     Position
  • Position 0.0 (start): High accuracy - primacy bias
  • Position 0.5 (middle): Lowest accuracy - the "lost" zone
  • Position 1.0 (end): High accuracy - recency bias

1.4 Why This Matters

The LITM effect has profound implications for real-world LLM deployments:

  • Retrieval-Augmented Generation (RAG): If a retriever returns relevant documents in the middle of a concatenated prompt, the generator may ignore them.
  • Long-Document QA: Answers hidden in the middle of legal contracts, medical records, or research papers are systematically missed.
  • In-Context Learning: Demonstration examples placed in the middle of a prompt are less effective than those at the start or end.
  • Conversational AI: Critical instructions buried in long chat histories are forgotten.

2. Position Bias Index (PBI)

2.1 Definition

To quantify the LITM effect consistently across experiments, we define the Position Bias Index (PBI):

PBI = (accuracy_start + accuracy_end) / 2 - accuracy_middle

Where:

  • accuracy_start = accuracy when target is at the beginning (position 0.0)
  • accuracy_end = accuracy when target is at the end (position 1.0)
  • accuracy_middle = accuracy when target is at the center (position 0.5)
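
For reference, the computation is a one-liner. The sketch below uses an illustrative function name and input format rather than the exact API of src/metrics.py.

def position_bias_index(accuracy_by_position: dict) -> float:
    """Compute PBI from a mapping of normalized position (0.0-1.0) to accuracy."""
    acc_start = accuracy_by_position[0.0]
    acc_end = accuracy_by_position[1.0]
    acc_middle = accuracy_by_position[0.5]
    return (acc_start + acc_end) / 2 - acc_middle

# Example using the summary values from Section 7.3: (0.94 + 0.90) / 2 - 0.54 = 0.38
print(position_bias_index({0.0: 0.94, 0.5: 0.54, 1.0: 0.90}))  # -> 0.38 (up to rounding)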

2.2 Interpretation

PBI Range        Meaning
PBI > 0.30       Strong U-shaped bias - severe middle degradation
PBI 0.15–0.30    Moderate bias - noticeable middle dip
PBI 0.05–0.15    Weak bias - slight middle dip
PBI ≈ 0.00       Flat curve - no positional bias
PBI < 0.00       Inverted-U - rare, model performs best in the middle

2.3 Why PBI?

PBI is superior to simply reporting "middle accuracy" because:

  1. Separates position bias from overall competence: a model with 95% edge accuracy and 60% middle accuracy gets the same PBI (0.35) as one with 70% edge and 35% middle accuracy.
  2. Captures the full U-shape: It explicitly contrasts edges against center.
  3. Comparable across experiments: KV retrieval, needle-in-haystack, and reasoning tasks all speak the same metric language.

2.4 Expanded Positions

Unlike the original Liu et al. paper (which tested 5 positions), this suite tests 9 positions for finer-grained curve resolution:

Position   Normalized   Description
0          0.000        Absolute start
N/8        0.125        Early
N/4        0.250        Early-middle
3N/8       0.375        Pre-middle
N/2        0.500        Exact middle
5N/8       0.625        Post-middle
3N/4       0.750        Late-middle
7N/8       0.875        Late
N-1        1.000        Absolute end
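
The insertion indices follow directly from N. The sketch below floors each fraction, an assumption inferred from the kv100_pos_* filenames in Section 7.1 (which use 12, 37, 62, and 87 for N = 100) rather than taken from the suite's code.

def gold_positions(n: int) -> list:
    """Return the 9 controlled insertion indices: 0, N/8, ..., 7N/8, N-1 (floored)."""
    return [n * k // 8 for k in range(8)] + [n - 1]

print(gold_positions(100))  # [0, 12, 25, 37, 50, 62, 75, 87, 99]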

3. Architecture Overview

3.1 Module Structure

litm-benchmark-suite-v4/
│
├── src/                          ← Shared infrastructure
│   ├── model_loader.py           # 4-bit quantized model loading
│   ├── generator.py              # Chat-template text generation
│   ├── metrics.py                # PBI, exact-match, numeric scoring
│   ├── plotting.py               # Standardized curve/bar plots
│   └── utils.py                  # JSONL/JSON I/O
│
├── experiments/                  ← Core experiment logic (library)
│   ├── kv_retrieval.py           # Exp 1: UUID key-value pairs
│   ├── needle_in_haystack.py     # Exp 2: Secret code in prose
│   ├── multi_needle.py           # Exp 3: Three simultaneous needles
│   ├── fact_reasoning.py         # Exp 4: Math with buried facts
│   ├── semantic_distractors.py   # Exp 5: Gold among similar facts
│   ├── temporal_narrative.py     # Exp 6: Events in chronology
│   └── conversation_memory.py    # Exp 7: Instruction in chat history
│
├── kaggle/                       ← Standalone Kaggle runners
│   ├── run_exp1a_kv100.py
│   ├── run_exp1b_kv200.py
│   ├── run_exp2_needle.py
│   ├── run_exp3_multi.py
│   ├── run_exp4_reason.py
│   ├── run_exp5_semantic.py
│   ├── run_exp6_narrative.py
│   ├── run_exp7_conversation.py
│   └── run_all_kaggle.py         # Master Kaggle runner
│
├── run_all.py                    # Local master runner
├── config.yaml                   # Hyperparameter configuration
└── requirements.txt              # Dependencies

3.2 Design Philosophy

  • experiments/ = Reusable library modules. Each file exports a run_*() function. Never executed directly.
  • kaggle/ = Thin entry-point scripts. Each imports one experiment module, adds Kaggle-specific paths and CLI args, and runs it.
  • src/ = Shared utilities. Model loading, generation, metrics, plotting - used by everything.

This separation means:

  • You can import experiments.kv_retrieval into your own custom pipeline (see the sketch after this list).
  • You can run any experiment standalone in Kaggle without touching the core logic.
  • You can add a new experiment by writing one module and one Kaggle wrapper.
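
For example, calling the library directly from your own code might look like the sketch below. The keyword arguments of run_kv_retrieval are assumptions based on the run_*(model_name, ..., out_dir) convention described above, not the module's verified signature.

from experiments.kv_retrieval import run_kv_retrieval

# Hypothetical call; check the module for the actual function and parameter names.
run_kv_retrieval(
    model_name="Qwen/Qwen2.5-1.5B-Instruct",
    n_keys=100,
    n_examples=50,
    out_dir="./results/exp1a_kv100",
)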

4. Experiment Descriptions

4.1 Experiment 1: Key-Value Retrieval

Files: experiments/kv_retrieval.py, kaggle/run_exp1a_kv100.py, kaggle/run_exp1b_kv200.py

Motivation

This is the canonical LITM task from Liu et al. (2023). It tests the most basic form of long-context retrieval: given a JSON object with many key-value pairs, can the model extract the value for a specific key?

Methodology

  1. Generate N random UUID key-value pairs (e.g., 100 or 200).
  2. Select one pair as the gold target.
  3. Place the gold pair at 9 controlled positions: 0, N/8, N/4, ..., N-1.
  4. Prompt the model with the full JSON object and ask for the value corresponding to the gold key.
  5. Score with exact-match against the true value.

Prompt Template

Extract the value corresponding to the specified key in the JSON object below.

JSON data:
{"<key1>": "<value1>",
 "<key2>": "<value2>",
 ...}

Key: "<query_key>"
Corresponding value:
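
To make the construction concrete, here is a minimal sketch of steps 1-3 following the template above; it is illustrative, not the exact code in experiments/kv_retrieval.py.

import json
import uuid

def build_kv_example(n_keys: int, gold_index: int):
    """Build one KV-retrieval prompt with the gold pair placed at gold_index."""
    pairs = [(str(uuid.uuid4()), str(uuid.uuid4())) for _ in range(n_keys - 1)]
    gold_key, gold_value = str(uuid.uuid4()), str(uuid.uuid4())
    pairs.insert(gold_index, (gold_key, gold_value))      # controlled position
    prompt = (
        "Extract the value corresponding to the specified key in the JSON object below.\n\n"
        f"JSON data:\n{json.dumps(dict(pairs), indent=1)}\n\n"
        f'Key: "{gold_key}"\nCorresponding value:'
    )
    return prompt, gold_value

prompt, gold_value = build_kv_example(n_keys=100, gold_index=50)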

Why This Task?

  • No reasoning required - pure retrieval. If the model fails, it's unambiguously an attention/position problem.
  • Structured input - JSON provides clear boundaries, eliminating ambiguity about what constitutes "the answer."
  • Scalable - trivial to generate 50, 100, 500, or 1000 keys.

Variants

  • 1A (100 keys): Moderate length. Tests position bias in a medium-length context.
  • 1B (200 keys): Double length. Tests whether bias amplifies with context length.

Expected Results

  • U-shaped curve with strong middle dip.
  • PBI ~ 0.35–0.50 for 1.5B models.
  • PBI higher for 200 keys than 100 keys (longer contexts = worse middle performance).

4.2 Experiment 2: Needle in a Haystack

Files: experiments/needle_in_haystack.py, kaggle/run_exp2_needle.py

Motivation

Tests retrieval from unstructured natural language prose. Unlike KV retrieval (structured), this requires the model to search through fluent text to find a specific fact.

Methodology

  1. Generate a long document of N filler sentences (default: 500) from a pool of generic factual statements.
  2. Insert a "needle" sentence containing a unique secret code at a controlled depth.
  3. Ask the model to extract the secret code.
  4. Score with exact-match.

Example Document

The history of pottery spans thousands of years. [1].
Marine biologists study coral reef ecosystems. [2].
...
The secret code is SECRET-0042. [250].   ← needle at position 250/500
...
Railway engineering requires precise curvature. [500].

Prompt

Read the text and find the secret code.

<document>

What is the secret code? Answer with only the code.
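
A minimal sketch of the depth-controlled insertion (step 2) is shown below; the filler pool and exact formatting used by experiments/needle_in_haystack.py may differ.

def insert_needle(filler_sentences: list, needle: str, depth: float) -> str:
    """Place the needle sentence at a normalized depth in [0.0, 1.0]."""
    index = min(int(depth * len(filler_sentences)), len(filler_sentences))
    return " ".join(filler_sentences[:index] + [needle] + filler_sentences[index:])

document = insert_needle(["Filler sentence."] * 500,
                         "The secret code is SECRET-0042.",
                         depth=0.5)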

Why This Task?

  • Unstructured retrieval - tests whether the model can locate a specific fact in prose, not just structured data.
  • High information density - every sentence is semantically meaningful, creating realistic competition for attention.
  • Scalable to extreme lengths - can test 1K, 2K, or even 10K sentences.

Expected Results

  • U-shaped curve, possibly stronger than KV retrieval because prose is less structured than JSON.
  • PBI ~ 0.30–0.45.

4.3 Experiment 3: Multi-Needle Retrieval

Files: experiments/multi_needle.py, kaggle/run_exp3_multi.py

Motivation

Real documents often contain multiple relevant facts, not just one. This tests whether the model can retrieve all of them simultaneously, and whether position bias affects each needle independently.

Methodology

  1. Generate a long filler document (default: 300 sentences).
  2. Insert three distinct secret codes at three fixed positions:
    • Code A at position 0 (start)
    • Code B at position N/2 (middle)
    • Code C at position N-1 (end)
  3. Ask the model to list all three codes in order.
  4. Score each code independently with exact-match.

Prompt

Read the text and list ALL three secret codes in order.

<document>

Codes:

Why This Task?

  • Tests multi-hop attention - the model must attend to three non-contiguous locations.
  • Reveals asymmetric bias - does the model retrieve start and end needles but miss the middle one?
  • Models real RAG scenarios - multiple retrieved chunks concatenated together.

Expected Results

  • Start code: ~90–100% accuracy
  • End code: ~90–100% accuracy
  • Middle code: ~50–70% accuracy (the "lost" needle)
  • Bar chart showing asymmetric retrieval.

4.4 Experiment 4: Fact-Dependent Reasoning

Files: experiments/fact_reasoning.py, kaggle/run_exp4_reason.py

Motivation

Retrieval is only step one. In real tasks, models must use retrieved facts to perform reasoning (math, inference, decision-making). This tests whether position bias persists when the model must both retrieve a fact and reason with it.

Methodology

  1. Generate a long document of N distractor sentences (default: 300).
  2. Insert one critical fact at a controlled depth about a fictional product with a random price, e.g.:

    "For this order, Zylor apples cost $247/kg."

  3. Ask a math question that requires this fact:

    "According to the document, I buy 12 kg of Zylor apples. What is my total cost?"

  4. The model must (a) find the fictional product's price, (b) multiply by quantity.
  5. Score with exact integer match.
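
Since the final answer is a single integer, scoring can tolerate surrounding text by extracting the last number in the reply. The helper below is a hypothetical sketch of the numeric scoring mentioned for src/metrics.py, not its actual implementation.

import re

def numeric_exact_match(model_answer: str, gold_total: int) -> float:
    """Score 1.0 if the last integer in the answer equals the gold total (illustrative)."""
    numbers = re.findall(r"-?\d+", model_answer.replace(",", ""))
    return 1.0 if numbers and int(numbers[-1]) == gold_total else 0.0

print(numeric_exact_match("The total cost is $2,964.", 12 * 247))  # 1.0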

Critical Design Choice

All products are fictional (e.g., "Zylor apples," "Krynn berries") with random prices ($50–$500). The model cannot answer from parametric knowledge - it MUST read the document.

Why This Task?

  • Reasoning × Retrieval - a failure here could stem from retrieval or from reasoning; comparing against the pure-retrieval experiments helps disentangle the two.
  • More realistic than pure retrieval - most real tasks require using information, not just locating it.
  • Tests compositional generalization - can the model compose retrieved facts with arithmetic?

Expected Results

  • If the model is capable: U-shaped curve, but possibly weaker than pure retrieval because reasoning demands deeper processing. PBI ~ 0.20–0.40.
  • If the model is not capable: Near-chance accuracy across all depths (~10–30%), making position bias statistically undetectable. This itself is a valuable finding - it establishes that LITM effects are observable only when the underlying task lies within the model's competence frontier.

4.5 Experiment 5: Semantic Similarity Distractors

Files: experiments/semantic_distractors.py, kaggle/run_exp5_semantic.py

Motivation

In real documents, the target fact is rarely uniquely distinct. It competes with semantically similar distractors. This tests whether position bias interacts with associative interference.

Methodology

  1. Create a list of N factual statements (default: 80) from the same semantic domain, all about fictional countries with random secret codes.
    • E.g., "The capital of Xyloria is ZENTH-7392.", "The capital of Freloria is VORT-1854.", ...
  2. Insert the gold fact among them at a controlled depth.
  3. Ask a question requiring the secret code from the gold fact.

    "What is the capital of Xyloria? Answer with only the secret code."

  4. The distractors create associative competition - the model must distinguish "Xyloria" from "Freloria," "Zenthar," etc.

Critical Design Choice

All countries are fictional and all codes are random. The model cannot answer from parametric knowledge. It must read the specific line in the document.

Why This Task?

  • Associative interference - similar-looking facts compete for attention.
  • Tests discriminative retrieval - not just "find the needle" but "find the right needle among similar needles."
  • Models RAG with dense semantic overlap - e.g., multiple retrieved passages about related topics.

Expected Results

  • Classic LITM (U-shape) may NOT appear. When distractors are semantically dense and the target is not lexically unique, recency bias can collapse.
  • Instead of U-shape, you may see a monotonic decline or primacy-only pattern: high at start, declining through the document, with no recovery at the end.
  • This would be a notable finding: semantic density destroys the recency advantage because the final items are not distinct enough to "pop" against their neighbors.

4.6 Experiment 6: Temporal Narrative

Files: experiments/temporal_narrative.py, kaggle/run_exp6_narrative.py

Motivation

Documents often have inherent temporal structure (chronologies, logs, histories). Does chronological ordering help or hurt retrieval? Does the model use temporal scaffolding, or does raw position dominate?

Methodology

  1. Generate a timeline of N historical events (default: 100).
    • Filler events drawn from a pool of 30 generic historical events (e.g., "the king issued a decree").
    • 8 statue-unveiling distractors with different materials/locations (e.g., "a bronze statue was unveiled in the town square").
  2. Insert a target event at a controlled depth with a random secret code:

    "Year 1050: a golden statue was unveiled in the central square (CODE: XJ-7392)."

  3. The target is one of 9 statue-unveiling events - not lexically unique. The model must distinguish "golden statue + central square" from other statue events and extract the code.
  4. Ask the model to identify the code.

    "What is the secret code for the golden statue that was unveiled in the central square?"

  5. Score with exact-match against the secret code.

Critical Design Choice

The target is semantically embedded in a family of similar events. The model cannot locate it by simple keyword search ("statue" appears 9 times). It must use positional attention combined with semantic discrimination.

Why This Task?

  • Temporal structure - events have meaningful ordering, not arbitrary placement.
  • Semantic competition - similar events compete for attention, testing true positional bias rather than lexical uniqueness.
  • Models real-world timelines - medical histories, legal case files, project logs with repeated event types.

Expected Results

  • U-shaped curve, but possibly different from unstructured tasks.
  • If temporal ordering provides scaffolding, the curve may be weaker than needle-in-haystack.
  • If semantic density dominates, recency bias may collapse (similar to Exp 5).

4.7 Experiment 7: Conversation Memory

Files: experiments/conversation_memory.py, kaggle/run_exp7_conversation.py

Motivation

Conversational AI must maintain coherence across long dialogue histories. Critical instructions or facts buried in the middle of a chat are frequently "forgotten." This tests dialog-state position bias.

Methodology

  1. Generate a synthetic conversation of N turns (default: 100) between User and Assistant.
    • User messages from a pool of generic questions.
    • Assistant messages from a pool of generic answers.
  2. Insert a critical instruction at a controlled depth:

    User: "Please always remember that my favorite color is MYFAVCOLOR-042. This is very important." Assistant: "I will remember that."

  3. At the end, ask the model to recall the instruction.

    "Based on our conversation, what is my favorite color?"

  4. Score with exact-match against the color code.
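
A minimal sketch of how such a history can be assembled in chat-message form is shown below; the filler question/answer pools used by experiments/conversation_memory.py are simplified to single placeholders here.

def build_conversation(n_turns: int, insert_turn: int, color_code: str) -> list:
    """Build a synthetic chat history with the critical instruction at insert_turn."""
    messages = []
    for turn in range(n_turns):
        if turn == insert_turn:
            messages.append({"role": "user", "content":
                f"Please always remember that my favorite color is {color_code}. "
                "This is very important."})
            messages.append({"role": "assistant", "content": "I will remember that."})
        else:
            messages.append({"role": "user", "content": "Can you tell me something interesting?"})
            messages.append({"role": "assistant", "content": "Here is a general fact you might enjoy."})
    messages.append({"role": "user",
                     "content": "Based on our conversation, what is my favorite color?"})
    return messages

history = build_conversation(n_turns=100, insert_turn=50, color_code="MYFAVCOLOR-042")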

Why This Task?

  • Dialog-specific - tests position bias in the conversational domain.
  • Instruction following - models are explicitly told to "remember." Do they?
  • Models real chatbot failures - system prompts, user preferences, critical warnings buried in history.

Expected Results

  • U-shaped curve, possibly very strong because dialog turns are short and attention can "skip" over middle turns.
  • PBI may be higher than document tasks.

5. Quick Start

5.1 Local Installation

# Clone the repository
git clone https://huggingface.co/abhshkp/litm-benchmark-suite-v4
cd litm-benchmark-suite-v4

# Install dependencies
pip install -r requirements.txt

5.2 Run All Experiments (Local)

python run_all.py \
    --model Qwen/Qwen2.5-1.5B-Instruct \
    --output ./results \
    --n-examples 50

5.3 Run Single Experiment (Local)

python run_all.py \
    --experiments needle \
    --model Qwen/Qwen2.5-1.5B-Instruct \
    --output ./results

5.4 Available CLI Flags

Flag                 Default                       Description
--model              Qwen/Qwen2.5-1.5B-Instruct    HuggingFace model identifier
--output             ./results                     Output directory
--n-examples         50                            Examples per position (KV only)
--n-keys-100         100                           Keys for Exp 1A
--n-keys-200         200                           Keys for Exp 1B
--needle-sentences   500                           Sentences for Exp 2
--experiments        all                           Comma-separated: kv100,kv200,needle,multi,reason,semantic,narrative,conversation
--zip                False                         Create zip archive of results

6. Kaggle Usage

6.1 Run a Single Experiment (Recommended)

Each experiment is self-contained and takes ~15–25 minutes on a T4 GPU.

In a Kaggle notebook cell:

# Cell 1: Clone and install
!git clone https://huggingface.co/abhshkp/litm-benchmark-suite-v4 litm
%cd litm
!pip install -q -r requirements.txt

# Cell 2: Run one experiment
!python kaggle/run_exp2_needle.py \
    --model Qwen/Qwen2.5-1.5B-Instruct \
    --n-examples 30

# Cell 3: Zip and download
import shutil
shutil.make_archive("/kaggle/working/litm_results", "zip", "/kaggle/working/litm_results")

6.2 Run Experiment 6 (Temporal Narrative)

# Cell 1: Clone and install (fresh to get latest code)
!rm -rf litm
!git clone https://huggingface.co/abhshkp/litm-benchmark-suite-v4 litm
%cd litm
!pip install -q -r requirements.txt

# Cell 2: Run Experiment 6
!python kaggle/run_exp6_narrative.py \
    --model Qwen/Qwen2.5-1.5B-Instruct \
    --n-examples 30 \
    --n-events 100

# Cell 3: Zip and download
import shutil
shutil.make_archive("/kaggle/working/litm_results", "zip", "/kaggle/working/litm_results")

6.3 Run All Experiments Overnight

!python kaggle/run_all_kaggle.py \
    --model Qwen/Qwen2.5-1.5B-Instruct \
    --output /kaggle/working/litm_results \
    --n-examples 50

6.4 Kaggle Scripts

Script                             Experiment                         ~Time on T4
kaggle/run_exp1a_kv100.py          KV Retrieval (100 keys)            ~15 min
kaggle/run_exp1b_kv200.py          KV Retrieval (200 keys)            ~25 min
kaggle/run_exp2_needle.py          Needle in Haystack                 ~20 min
kaggle/run_exp3_multi.py           Multi-Needle                       ~15 min
kaggle/run_exp4_reason.py          Fact-Dependent Reasoning           ~20 min
kaggle/run_exp5_semantic.py        Semantic Similarity Distractors    ~15 min
kaggle/run_exp6_narrative.py       Temporal Narrative                 ~15 min
kaggle/run_exp7_conversation.py    Conversation Memory                ~15 min
kaggle/run_all_kaggle.py           All 7 experiments                  ~2–2.5 hrs

7. Output Structure

7.1 Per-Experiment Output

Each experiment produces a folder with the following files:

results/
├── exp1a_kv100/
│   ├── kv100_data.jsonl           # Raw generated examples
│   ├── kv100_pos_0.jsonl          # Predictions: gold at position 0
│   ├── kv100_pos_12.jsonl         # Predictions: gold at position 12
│   ├── kv100_pos_25.jsonl         # Predictions: gold at position 25
│   ├── kv100_pos_37.jsonl         # Predictions: gold at position 37
│   ├── kv100_pos_50.jsonl         # Predictions: gold at position 50 (MIDDLE)
│   ├── kv100_pos_62.jsonl         # Predictions: gold at position 62
│   ├── kv100_pos_75.jsonl         # Predictions: gold at position 75
│   ├── kv100_pos_87.jsonl         # Predictions: gold at position 87
│   ├── kv100_pos_99.jsonl         # Predictions: gold at position 99 (END)
│   ├── kv100_summary.json         # Accuracies + PBI
│   └── kv100_curve.png            # U-shaped accuracy plot
│
├── exp2_needle/
│   ├── needle_depth_0.0.jsonl
│   ├── needle_depth_0.125.jsonl
│   ├── ... (9 depth files)
│   ├── needle_depth_1.0.jsonl
│   ├── needle_summary.json
│   └── needle_curve.png
│
├── ... (exp3-7 follow same pattern)
│
└── master_summary.json              # Aggregated results from all experiments

7.2 JSONL Format

Each .jsonl file contains one record per example:

{"model_answer": "a1b2-c3d4-...", "correct": 1.0, "value": "target-uuid", "gold_position": 50}
{"model_answer": "wrong-guess", "correct": 0.0, "value": "target-uuid", "gold_position": 50}

7.3 Summary JSON Format

{
  "experiment": "kv_retrieval",
  "num_keys": 100,
  "num_examples": 50,
  "positions": {
    "0": 0.94,
    "12": 0.78,
    "25": 0.64,
    "37": 0.58,
    "50": 0.54,
    "62": 0.60,
    "75": 0.70,
    "87": 0.82,
    "99": 0.90
  },
  "pbi": 0.38,
  "time_minutes": 12.5
}
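
To fill in the cross-experiment comparison in Section 8.9, the per-experiment summaries can be collected programmatically. A small sketch based on the summary format above (it does not assume the layout of master_summary.json):

import json
from pathlib import Path

def collect_pbi(results_dir: str) -> dict:
    """Gather PBI values from every *_summary.json under results_dir."""
    pbi_by_experiment = {}
    for summary_path in Path(results_dir).rglob("*_summary.json"):
        summary = json.loads(summary_path.read_text())
        pbi_by_experiment[summary.get("experiment", summary_path.stem)] = summary["pbi"]
    return pbi_by_experiment

print(collect_pbi("./results"))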

7.4 Plot Files

Each experiment saves a .png plot:

  • Curve plots (Experiments 1, 2, 4, 5, 6, 7): X-axis = normalized position, Y-axis = accuracy. Red curve with markers.
  • Bar plots (Experiment 3): X-axis = Start/Middle/End, Y-axis = accuracy.

8. Results & Graphs

[USER TO INSERT OUTPUT GRAPHS HERE]

8.1 Experiment 1A: KV Retrieval (100 keys)

[Upload kv100_curve.png here]

Observations:

8.2 Experiment 1B: KV Retrieval (200 keys)

[Upload kv200_curve.png here]

Observations:

8.3 Experiment 2: Needle in Haystack

[Upload needle_curve.png here]

Observations:

8.4 Experiment 3: Multi-Needle

[Upload multi_bar.png here]

Observations:

8.5 Experiment 4: Fact-Dependent Reasoning

[Upload reason_curve.png here]

Observations:

8.6 Experiment 5: Semantic Similarity Distractors

[Upload semantic_curve.png here]

Observations:

8.7 Experiment 6: Temporal Narrative

[Upload narrative_curve.png here]

Observations:

8.8 Experiment 7: Conversation Memory

[Upload conversation_curve.png here]

Observations:

8.9 Cross-Experiment PBI Comparison

Experiment                 PBI    Edge Accuracy    Middle Accuracy    Classification
KV 100 keys
KV 200 keys
Needle
Multi-Needle (middle)
Fact Reasoning
Semantic Distractors
Temporal Narrative
Conversation Memory

9. Conclusions & Discussion

[USER TO WRITE CONCLUSIONS HERE]

9.1 Key Findings

Summarize the main discoveries from your experiments:

9.2 Implications

What do these results mean for practitioners?

9.3 Limitations

What are the limitations of this study?

9.4 Future Work

What experiments or analyses would strengthen these findings?


10. Extending the Suite

10.1 Add a New Experiment

  1. Create experiments/my_experiment.py with a run_my_experiment(model_name, ..., out_dir) function.
  2. Create kaggle/run_expN_myexperiment.py that calls your function with Kaggle defaults.
  3. Import and add to run_all.py and kaggle/run_all_kaggle.py.
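
A minimal module skeleton, assuming the run_*(model_name, ..., out_dir) convention from Section 3.2; the shared helpers in src/ are left as comments because their exact names are not shown here.

# experiments/my_experiment.py
import json
import os

def run_my_experiment(model_name: str, n_examples: int, out_dir: str) -> dict:
    """Run the task at all 9 controlled depths and write a summary (skeleton only)."""
    os.makedirs(out_dir, exist_ok=True)
    accuracies = {}
    for depth in [k / 8 for k in range(9)]:          # 0.0, 0.125, ..., 1.0
        # 1. build n_examples prompts with the target at `depth`
        # 2. generate answers with the shared generator in src/
        # 3. score with the shared metrics in src/
        accuracies[depth] = 0.0                      # placeholder
    pbi = (accuracies[0.0] + accuracies[1.0]) / 2 - accuracies[0.5]
    summary = {"experiment": "my_experiment", "positions": accuracies, "pbi": pbi}
    with open(os.path.join(out_dir, "my_experiment_summary.json"), "w") as f:
        json.dump(summary, f, indent=2)
    return summary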

10.2 Add a New Model

Change --model to any HuggingFace causal LM:

python run_all.py --model meta-llama/Llama-3.2-1B-Instruct
python run_all.py --model Qwen/Qwen2.5-7B-Instruct

The suite automatically handles 4-bit quantization via bitsandbytes.
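
For reference, 4-bit loading with bitsandbytes typically follows the standard Transformers pattern sketched below; src/model_loader.py may differ in its exact settings.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

def load_4bit(model_name: str):
    """Load a causal LM with NF4 4-bit quantization (generic pattern, not the suite's exact code)."""
    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,
    )
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=quant_config,
        device_map="auto",
    )
    return model, tokenizer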

10.3 Adjust Scale

Increase context length by changing experiment parameters:

python kaggle/run_exp1b_kv200.py --n-keys 500 --n-examples 100
python kaggle/run_exp2_needle.py --n-sentences 1000 --n-examples 50

11. Citation

If you use this benchmark suite in your research, please cite both the original paper and this suite:

@article{liu2023lost,
  title={Lost in the Middle: How Language Models Use Long Contexts},
  author={Liu, Nelson F and Lin, Kevin and Hewitt, John and Paranjape, Ashwin and Bevilacqua, Michele and Petroni, Fabio and Liang, Percy},
  journal={arXiv preprint arXiv:2307.03172},
  year={2023}
}

@software{litm_benchmark_suite_v4,
  title={Lost in the Middle Benchmark Suite v4},
  author={abhshkp},
  year={2026},
  url={https://huggingface.co/abhshkp/litm-benchmark-suite-v4}
}

Acknowledgments

This suite extends the foundational work of Liu et al. (2023) and incorporates community feedback on scalable, modular benchmarking. Built with HuggingFace Transformers, bitsandbytes, and matplotlib.
