Upload README.md

352be1f verified 17 days ago

28.5 kB

	---
	tags:
	- ml-intern
	- lost-in-the-middle
	- long-context
	- position-bias
	- benchmark
	---

	# Lost in the Middle — Benchmark Suite v4

	> A Modular, Reproducible Benchmark Suite for Evaluating Position Bias in Long-Context Language Models

	---

	## Table of Contents

	1. [What is "Lost in the Middle"?](#1-what-is-lost-in-the-middle)
	2. [Position Bias Index (PBI)](#2-position-bias-index-pbi)
	3. [Architecture Overview](#3-architecture-overview)
	4. [Experiment Descriptions](#4-experiment-descriptions)
	5. [Quick Start](#5-quick-start)
	6. [Kaggle Usage](#6-kaggle-usage)
	7. [Output Structure](#7-output-structure)
	8. [Results & Graphs](#8-results--graphs)
	9. [Conclusions & Discussion](#9-conclusions--discussion)
	10. [Extending the Suite](#10-extending-the-suite)
	11. [Citation](#11-citation)

	---

	## 1. What is "Lost in the Middle"?

	### 1.1 The Core Phenomenon

	The "Lost in the Middle" (LITM) effect, first systematically documented by Liu et al. (2023), describes a critical failure mode in large language models (LLMs) when processing long contexts:

	> *Models perform best when relevant information appears at the beginning* or end of a context, and worst when it is buried in the middle.**

	This creates a characteristic U-shaped accuracy curve when plotting model performance against the position of the target information within a long document.

	### 1.2 Why Does This Happen?

	The LITM effect arises from how modern transformer-based LLMs process attention:

	\| Mechanism \| Explanation \|
	\|-----------\|-------------\|
	\| Attention Dilution \| In long sequences, the softmax over attention weights becomes increasingly diffuse. Middle-position tokens receive proportionally less attention mass than edge-position tokens. \|
	\| Positional Bias in Training \| Pretraining data often places key information at document boundaries (introductions, summaries). Models learn a positional prior that favors start and end positions. \|
	\| KV Cache Pressure \| During autoregressive generation, the key-value cache grows linearly with sequence length. Attention computation over very long contexts becomes noisier in the middle regions. \|
	\| Softmax Saturation \| With many tokens competing for attention probability mass, individual middle tokens are "drowned out" by the aggregate signal from surrounding tokens. \|

	### 1.3 The U-Shaped Curve

	When you plot accuracy vs. position, you see a U-shape:

	```
	Accuracy
	1.0 \| ● ●
	0.9 \| ● ●
	0.8 \| ● ●
	0.7 \| ● ●
	0.6 \| ● ● ● ●
	0.5 \| ● ● ●
	+----+----+----+----+----+----+----+----+
	0 0.12 0.25 0.37 0.5 0.62 0.75 0.87 1.0
	Position
	```

	- Position 0.0 (start): High accuracy — primacy bias
	- Position 0.5 (middle): Lowest accuracy — the "lost" zone
	- Position 1.0 (end): High accuracy — recency bias

	### 1.4 Why This Matters

	The LITM effect has profound implications for real-world LLM deployments:

	- Retrieval-Augmented Generation (RAG): If a retriever returns relevant documents in the middle of a concatenated prompt, the generator may ignore them.
	- Long-Document QA: Answers hidden in the middle of legal contracts, medical records, or research papers are systematically missed.
	- In-Context Learning: Demonstration examples placed in the middle of a prompt are less effective than those at the start or end.
	- Conversational AI: Critical instructions buried in long chat histories are forgotten.

	---

	## 2. Position Bias Index (PBI)

	### 2.1 Definition

	To quantify the LITM effect consistently across experiments, we define the Position Bias Index (PBI):

	```
	PBI = (accuracy_start + accuracy_end) / 2 − accuracy_middle
	```

	Where:
	- `accuracy_start` = accuracy when target is at the beginning (position 0.0)
	- `accuracy_end` = accuracy when target is at the end (position 1.0)
	- `accuracy_middle` = accuracy when target is at the center (position 0.5)

	### 2.2 Interpretation

	\| PBI Range \| Meaning \|
	\|-----------\|---------\|
	\| PBI > 0.30 \| Strong U-shaped bias — severe middle degradation \|
	\| PBI = 0.15–0.30 \| Moderate bias — noticeable middle dip \|
	\| PBI = 0.05–0.15 \| Weak bias — slight middle dip \|
	\| PBI ≈ 0.00 \| Flat curve — no positional bias \|
	\| PBI < 0.00 \| Inverted-U — rare, model performs best in middle \|

	### 2.3 Why PBI?

	PBI is superior to simply reporting "middle accuracy" because:
	1. Normalizes for overall model competence: A model with 95% edge accuracy and 60% middle accuracy gets the same PBI as one with 70% edge and 35% middle.
	2. Captures the full U-shape: It explicitly contrasts edges against center.
	3. Comparable across experiments: KV retrieval, needle-in-haystack, and reasoning tasks all speak the same metric language.

	### 2.4 Expanded Positions

	Unlike the original Liu et al. paper (which tested 5 positions), this suite tests 9 positions for finer-grained curve resolution:

	\| Position \| Normalized \| Description \|
	\|----------\|-----------\|-------------\|
	\| 0 \| 0.000 \| Absolute start \|
	\| N/8 \| 0.125 \| Early \|
	\| N/4 \| 0.250 \| Early-middle \|
	\| 3N/8 \| 0.375 \| Pre-middle \|
	\| N/2 \| 0.500 \| Exact middle \|
	\| 5N/8 \| 0.625 \| Post-middle \|
	\| 3N/4 \| 0.750 \| Late-middle \|
	\| 7N/8 \| 0.875 \| Late \|
	\| N−1 \| 1.000 \| Absolute end \|

	---

	## 3. Architecture Overview

	### 3.1 Module Structure

	```
	litm-benchmark-suite-v4/
	│
	├── src/ ← Shared infrastructure
	│ ├── model_loader.py # 4-bit quantized model loading
	│ ├── generator.py # Chat-template text generation
	│ ├── metrics.py # PBI, exact-match, numeric scoring
	│ ├── plotting.py # Standardized curve/bar plots
	│ └── utils.py # JSONL/JSON I/O
	│
	├── experiments/ ← Core experiment logic (library)
	│ ├── kv_retrieval.py # Exp 1: UUID key-value pairs
	│ ├── needle_in_haystack.py # Exp 2: Secret code in prose
	│ ├── multi_needle.py # Exp 3: Three simultaneous needles
	│ ├── fact_reasoning.py # Exp 4: Math with buried facts
	│ ├── semantic_distractors.py # Exp 5: Gold among similar facts
	│ ├── temporal_narrative.py # Exp 6: Events in chronology
	│ └── conversation_memory.py # Exp 7: Instruction in chat history
	│
	├── kaggle/ ← Standalone Kaggle runners
	│ ├── run_exp1a_kv100.py
	│ ├── run_exp1b_kv200.py
	│ ├── run_exp2_needle.py
	│ ├── run_exp3_multi.py
	│ ├── run_exp4_reason.py
	│ ├── run_exp5_semantic.py
	│ ├── run_exp6_narrative.py
	│ ├── run_exp7_conversation.py
	│ └── run_all_kaggle.py # Master Kaggle runner
	│
	├── run_all.py # Local master runner
	├── config.yaml # Hyperparameter configuration
	└── requirements.txt # Dependencies
	```

	### 3.2 Design Philosophy

	- `experiments/` = Reusable library modules. Each file exports a `run_*()` function. Never executed directly.
	- `kaggle/` = Thin entry-point scripts. Each imports one experiment module, adds Kaggle-specific paths and CLI args, and runs it.
	- `src/` = Shared utilities. Model loading, generation, metrics, plotting — used by everything.

	This separation means:
	- You can import `experiments.kv_retrieval` into your own custom pipeline.
	- You can run any experiment standalone in Kaggle without touching the core logic.
	- You can add a new experiment by writing one module and one Kaggle wrapper.

	---

	## 4. Experiment Descriptions

	### 4.1 Experiment 1: Key-Value Retrieval

	Files: `experiments/kv_retrieval.py`, `kaggle/run_exp1a_kv100.py`, `kaggle/run_exp1b_kv200.py`

	#### Motivation
	This is the canonical LITM task from Liu et al. (2023). It tests the most basic form of long-context retrieval: given a JSON object with many key-value pairs, can the model extract the value for a specific key?

	#### Methodology
	1. Generate `N` random UUID key-value pairs (e.g., 100 or 200).
	2. Select one pair as the gold target.
	3. Place the gold pair at 9 controlled positions: 0, N/8, N/4, ..., N−1.
	4. Prompt the model with the full JSON object and ask for the value corresponding to the gold key.
	5. Score with exact-match against the true value.

	#### Prompt Template
	```
	Extract the value corresponding to the specified key in the JSON object below.

	JSON data:
	{"<key1>": "<value1>",
	"<key2>": "<value2>",
	...}

	Key: "<query_key>"
	Corresponding value:
	```

	#### Why This Task?
	- No reasoning required — pure retrieval. If the model fails, it's unambiguously an attention/position problem.
	- Structured input — JSON provides clear boundaries, eliminating ambiguity about what constitutes "the answer."
	- Scalable — trivial to generate 50, 100, 500, or 1000 keys.

	#### Variants
	- 1A (100 keys): Moderate length. Tests position bias in a medium-length context.
	- 1B (200 keys): Double length. Tests whether bias amplifies with context length.

	#### Expected Results
	- U-shaped curve with strong middle dip.
	- PBI ~ 0.35–0.50 for 1.5B models.
	- PBI higher for 200 keys than 100 keys (longer contexts = worse middle performance).

	---

	### 4.2 Experiment 2: Needle in a Haystack

	Files: `experiments/needle_in_haystack.py`, `kaggle/run_exp2_needle.py`

	#### Motivation
	Tests retrieval from unstructured natural language prose. Unlike KV retrieval (structured), this requires the model to search through fluent text to find a specific fact.

	#### Methodology
	1. Generate a long document of `N` filler sentences (default: 500) from a pool of generic factual statements.
	2. Insert a "needle" sentence containing a unique secret code at a controlled depth.
	3. Ask the model to extract the secret code.
	4. Score with exact-match.

	#### Example Document
	```
	The history of pottery spans thousands of years. [1].
	Marine biologists study coral reef ecosystems. [2].
	...
	The secret code is SECRET-0042. [250]. ← needle at position 250/500
	...
	Railway engineering requires precise curvature. [500].
	```

	#### Prompt
	```
	Read the text and find the secret code.

	<document>

	What is the secret code? Answer with only the code.
	```

	#### Why This Task?
	- Unstructured retrieval — tests whether the model can locate a specific fact in prose, not just structured data.
	- High information density — every sentence is semantically meaningful, creating realistic competition for attention.
	- Scalable to extreme lengths — can test 1K, 2K, or even 10K sentences.

	#### Expected Results
	- U-shaped curve, possibly stronger than KV retrieval because prose is less structured than JSON.
	- PBI ~ 0.30–0.45.

	---

	### 4.3 Experiment 3: Multi-Needle Retrieval

	Files: `experiments/multi_needle.py`, `kaggle/run_exp3_multi.py`

	#### Motivation
	Real documents often contain multiple relevant facts, not just one. This tests whether the model can retrieve all of them simultaneously, and whether position bias affects each needle independently.

	#### Methodology
	1. Generate a long filler document (default: 300 sentences).
	2. Insert three distinct secret codes at three fixed positions:
	- Code A at position 0 (start)
	- Code B at position N/2 (middle)
	- Code C at position N−1 (end)
	3. Ask the model to list all three codes in order.
	4. Score each code independently with exact-match.

	#### Prompt
	```
	Read the text and list ALL three secret codes in order.

	<document>

	Codes:
	```

	#### Why This Task?
	- Tests multi-hop attention — the model must attend to three non-contiguous locations.
	- Reveals asymmetric bias — does the model retrieve start and end needles but miss the middle one?
	- Models real RAG scenarios — multiple retrieved chunks concatenated together.

	#### Expected Results
	- Start code: ~90–100% accuracy
	- End code: ~90–100% accuracy
	- Middle code: ~50–70% accuracy (the "lost" needle)
	- Bar chart showing asymmetric retrieval.

	---

	### 4.4 Experiment 4: Fact-Dependent Reasoning

	Files: `experiments/fact_reasoning.py`, `kaggle/run_exp4_reason.py`

	#### Motivation
	Retrieval is only step one. In real tasks, models must use retrieved facts to perform reasoning (math, inference, decision-making). This tests whether position bias persists when the model must both retrieve a fact and reason with it.

	#### Methodology
	1. Generate a long document of `N` distractor sentences (default: 300).
	2. Insert one critical fact at a controlled depth about a fictional product with a random price, e.g.:
	> "For this order, Zylor apples cost $247/kg."
	3. Ask a math question that requires this fact:
	> "According to the document, I buy 12 kg of Zylor apples. What is my total cost?"
	4. The model must (a) find the fictional product's price, (b) multiply by quantity.
	5. Score with exact integer match.

	#### Critical Design Choice
	All products are fictional (e.g., "Zylor apples," "Krynn berries") with random prices ($50–$500). The model cannot answer from parametric knowledge — it MUST read the document.

	#### Why This Task?
	- Reasoning × Retrieval — failure could be either retrieval failure or reasoning failure. This disentangles them.
	- More realistic than pure retrieval — most real tasks require using information, not just locating it.
	- Tests compositional generalization — can the model compose retrieved facts with arithmetic?

	#### Expected Results
	- If the model is capable: U-shaped curve, but possibly weaker than pure retrieval because reasoning demands deeper processing. PBI ~ 0.20–0.40.
	- If the model is not capable: Near-chance accuracy across all depths (~10–30%), making position bias statistically undetectable. This itself is a valuable finding — it establishes that LITM effects are observable only when the underlying task lies within the model's competence frontier.

	---

	### 4.5 Experiment 5: Semantic Similarity Distractors

	Files: `experiments/semantic_distractors.py`, `kaggle/run_exp5_semantic.py`

	#### Motivation
	In real documents, the target fact is rarely uniquely distinct. It competes with semantically similar distractors. This tests whether position bias interacts with associative interference.

	#### Methodology
	1. Create a list of `N` factual statements (default: 80) from the same semantic domain, all about fictional countries with random secret codes.
	- E.g., "The capital of Xyloria is ZENTH-7392.", "The capital of Freloria is VORT-1854.", ...
	2. Insert the gold fact among them at a controlled depth.
	3. Ask a question requiring the secret code from the gold fact.
	> "What is the capital of Xyloria? Answer with only the secret code."
	4. The distractors create associative competition — the model must distinguish "Xyloria" from "Freloria," "Zenthar," etc.

	#### Critical Design Choice
	All countries are fictional and all codes are random. The model cannot answer from parametric knowledge. It must read the specific line in the document.

	#### Why This Task?
	- Associative interference — similar-looking facts compete for attention.
	- Tests discriminative retrieval — not just "find the needle" but "find the right needle among similar needles."
	- Models RAG with dense semantic overlap — e.g., multiple retrieved passages about related topics.

	#### Expected Results
	- Classic LITM (U-shape) may NOT appear. When distractors are semantically dense and the target is not lexically unique, recency bias can collapse.
	- Instead of U-shape, you may see a monotonic decline or primacy-only pattern: high at start, declining through the document, with no recovery at the end.
	- This is a novel finding: semantic density destroys the recency advantage because the final items are not distinct enough to "pop" against their neighbors.

	---

	### 4.6 Experiment 6: Temporal Narrative

	Files: `experiments/temporal_narrative.py`, `kaggle/run_exp6_narrative.py`

	#### Motivation
	Documents often have inherent temporal structure (chronologies, logs, histories). Does chronological ordering help or hurt retrieval? Does the model use temporal scaffolding, or does raw position dominate?

	#### Methodology
	1. Generate a timeline of `N` historical events (default: 100).
	- 30 generic historical events (e.g., "the king issued a decree").
	- 8 statue-unveiling distractors with different materials/locations (e.g., "a bronze statue was unveiled in the town square").
	2. Insert a target event at a controlled depth with a random secret code:
	> "Year 1050: a golden statue was unveiled in the central square (CODE: XJ-7392)."
	3. The target is one of 9 statue-unveiling events — not lexically unique. The model must distinguish "golden statue + central square" from other statue events and extract the code.
	4. Ask the model to identify the code.
	> "What is the secret code for the golden statue that was unveiled in the central square?"
	5. Score with exact-match against the secret code.

	#### Critical Design Choice
	The target is semantically embedded in a family of similar events. The model cannot locate it by simple keyword search ("statue" appears 9 times). It must use positional attention combined with semantic discrimination.

	#### Why This Task?
	- Temporal structure — events have meaningful ordering, not arbitrary placement.
	- Semantic competition — similar events compete for attention, testing true positional bias rather than lexical uniqueness.
	- Models real-world timelines — medical histories, legal case files, project logs with repeated event types.

	#### Expected Results
	- U-shaped curve, but possibly different from unstructured tasks.
	- If temporal ordering provides scaffolding, the curve may be weaker than needle-in-haystack.
	- If semantic density dominates, recency bias may collapse (similar to Exp 5).

	---

	### 4.7 Experiment 7: Conversation Memory

	Files: `experiments/conversation_memory.py`, `kaggle/run_exp7_conversation.py`

	#### Motivation
	Conversational AI must maintain coherence across long dialogue histories. Critical instructions or facts buried in the middle of a chat are frequently "forgotten." This tests dialog-state position bias.

	#### Methodology
	1. Generate a synthetic conversation of `N` turns (default: 100) between User and Assistant.
	- User messages from a pool of generic questions.
	- Assistant messages from a pool of generic answers.
	2. Insert a critical instruction at a controlled depth:
	> User: "Please always remember that my favorite color is MYFAVCOLOR-042. This is very important."
	> Assistant: "I will remember that."
	3. At the end, ask the model to recall the instruction.
	> "Based on our conversation, what is my favorite color?"
	4. Score with exact-match against the color code.

	#### Why This Task?
	- Dialog-specific — tests position bias in the conversational domain.
	- Instruction following — models are explicitly told to "remember." Do they?
	- Models real chatbot failures — system prompts, user preferences, critical warnings buried in history.

	#### Expected Results
	- U-shaped curve, possibly very strong because dialog turns are short and attention can "skip" over middle turns.
	- PBI may be higher than document tasks.

	---

	## 5. Quick Start

	### 5.1 Local Installation

	```bash
	# Clone the repository
	git clone https://huggingface.co/abhshkp/litm-benchmark-suite-v4
	cd litm-benchmark-suite-v4

	# Install dependencies
	pip install -r requirements.txt
	```

	### 5.2 Run All Experiments (Local)

	```bash
	python run_all.py \
	--model Qwen/Qwen2.5-1.5B-Instruct \
	--output ./results \
	--n-examples 50
	```

	### 5.3 Run Single Experiment (Local)

	```bash
	python run_all.py \
	--experiments needle \
	--model Qwen/Qwen2.5-1.5B-Instruct \
	--output ./results
	```

	### 5.4 Available CLI Flags

	\| Flag \| Default \| Description \|
	\|------\|---------\|-------------\|
	\| `--model` \| `Qwen/Qwen2.5-1.5B-Instruct` \| HuggingFace model identifier \|
	\| `--output` \| `./results` \| Output directory \|
	\| `--n-examples` \| `50` \| Examples per position (KV only) \|
	\| `--n-keys-100` \| `100` \| Keys for Exp 1A \|
	\| `--n-keys-200` \| `200` \| Keys for Exp 1B \|
	\| `--needle-sentences` \| `500` \| Sentences for Exp 2 \|
	\| `--experiments` \| `all` \| Comma-separated: `kv100,kv200,needle,multi,reason,semantic,narrative,conversation` \|
	\| `--zip` \| `False` \| Create zip archive of results \|

	---

	## 6. Kaggle Usage

	### 6.1 Run a Single Experiment (Recommended)

	Each experiment is self-contained and takes ~15–25 minutes on a T4 GPU.

	In a Kaggle notebook cell:

	```python
	# Cell 1: Clone and install
	!git clone https://huggingface.co/abhshkp/litm-benchmark-suite-v4 litm
	%cd litm
	!pip install -q -r requirements.txt

	# Cell 2: Run one experiment
	!python kaggle/run_exp2_needle.py \
	--model Qwen/Qwen2.5-1.5B-Instruct \
	--n-examples 30

	# Cell 3: Zip and download
	import shutil
	shutil.make_archive("/kaggle/working/litm_results", "zip", "/kaggle/working/litm_results")
	```

	### 6.2 Run Experiment 6 (Temporal Narrative)

	```python
	# Cell 1: Clone and install (fresh to get latest code)
	!rm -rf litm
	!git clone https://huggingface.co/abhshkp/litm-benchmark-suite-v4 litm
	%cd litm
	!pip install -q -r requirements.txt

	# Cell 2: Run Experiment 6
	!python kaggle/run_exp6_narrative.py \
	--model Qwen/Qwen2.5-1.5B-Instruct \
	--n-examples 30 \
	--n-events 100

	# Cell 3: Zip and download
	import shutil
	shutil.make_archive("/kaggle/working/litm_results", "zip", "/kaggle/working/litm_results")
	```

	### 6.3 Run All Experiments Overnight

	```python
	!python kaggle/run_all_kaggle.py \
	--model Qwen/Qwen2.5-1.5B-Instruct \
	--output /kaggle/working/litm_results \
	--n-examples 50
	```

	### 6.4 Kaggle Scripts

	\| Script \| Experiment \| ~Time on T4 \|
	\|--------\|-----------\|-------------\|
	\| `kaggle/run_exp1a_kv100.py` \| KV Retrieval (100 keys) \| ~15 min \|
	\| `kaggle/run_exp1b_kv200.py` \| KV Retrieval (200 keys) \| ~25 min \|
	\| `kaggle/run_exp2_needle.py` \| Needle in Haystack \| ~20 min \|
	\| `kaggle/run_exp3_multi.py` \| Multi-Needle \| ~15 min \|
	\| `kaggle/run_exp4_reason.py` \| Fact-Dependent Reasoning \| ~20 min \|
	\| `kaggle/run_exp5_semantic.py` \| Semantic Similarity Distractors \| ~15 min \|
	\| `kaggle/run_exp6_narrative.py` \| Temporal Narrative \| ~15 min \|
	\| `kaggle/run_exp7_conversation.py` \| Conversation Memory \| ~15 min \|
	\| `kaggle/run_all_kaggle.py` \| All 7 experiments \| ~2–2.5 hrs \|

	---

	## 7. Output Structure

	### 7.1 Per-Experiment Output

	Each experiment produces a folder with the following files:

	```
	results/
	├── exp1a_kv100/
	│ ├── kv100_data.jsonl # Raw generated examples
	│ ├── kv100_pos_0.jsonl # Predictions: gold at position 0
	│ ├── kv100_pos_12.jsonl # Predictions: gold at position 12
	│ ├── kv100_pos_25.jsonl # Predictions: gold at position 25
	│ ├── kv100_pos_37.jsonl # Predictions: gold at position 37
	│ ├── kv100_pos_50.jsonl # Predictions: gold at position 50 (MIDDLE)
	│ ├── kv100_pos_62.jsonl # Predictions: gold at position 62
	│ ├── kv100_pos_75.jsonl # Predictions: gold at position 75
	│ ├── kv100_pos_87.jsonl # Predictions: gold at position 87
	│ ├── kv100_pos_99.jsonl # Predictions: gold at position 99 (END)
	│ ├── kv100_summary.json # Accuracies + PBI
	│ └── kv100_curve.png # U-shaped accuracy plot
	│
	├── exp2_needle/
	│ ├── needle_depth_0.0.jsonl
	│ ├── needle_depth_0.125.jsonl
	│ ├── ... (9 depth files)
	│ ├── needle_depth_1.0.jsonl
	│ ├── needle_summary.json
	│ └── needle_curve.png
	│
	├── ... (exp3-7 follow same pattern)
	│
	└── master_summary.json # Aggregated results from all experiments
	```

	### 7.2 JSONL Format

	Each `.jsonl` file contains one record per example:

	```json
	{"model_answer": "a1b2-c3d4-...", "correct": 1.0, "value": "target-uuid", "gold_position": 50}
	{"model_answer": "wrong-guess", "correct": 0.0, "value": "target-uuid", "gold_position": 50}
	```

	### 7.3 Summary JSON Format

	```json
	{
	"experiment": "kv_retrieval",
	"num_keys": 100,
	"num_examples": 50,
	"positions": {
	"0": 0.94,
	"12": 0.78,
	"25": 0.64,
	"37": 0.58,
	"50": 0.54,
	"62": 0.60,
	"75": 0.70,
	"87": 0.82,
	"99": 0.90
	},
	"pbi": 0.38,
	"time_minutes": 12.5
	}
	```

	### 7.4 Plot Files

	Each experiment saves a `.png` plot:
	- Curve plots (Experiments 1, 2, 4, 5, 6, 7): X-axis = normalized position, Y-axis = accuracy. Red curve with markers.
	- Bar plots (Experiment 3): X-axis = Start/Middle/End, Y-axis = accuracy.

	---

	## 8. Results & Graphs

	> [USER TO INSERT OUTPUT GRAPHS HERE]

	### 8.1 Experiment 1A: KV Retrieval (100 keys)

	[Upload kv100_curve.png here]

	Observations:

	### 8.2 Experiment 1B: KV Retrieval (200 keys)

	[Upload kv200_curve.png here]

	Observations:

	### 8.3 Experiment 2: Needle in Haystack

	[Upload needle_curve.png here]

	Observations:

	### 8.4 Experiment 3: Multi-Needle

	[Upload multi_bar.png here]

	Observations:

	### 8.5 Experiment 4: Fact-Dependent Reasoning

	[Upload reason_curve.png here]

	Observations:

	### 8.6 Experiment 5: Semantic Similarity Distractors

	[Upload semantic_curve.png here]

	Observations:

	### 8.7 Experiment 6: Temporal Narrative

	[Upload narrative_curve.png here]

	Observations:

	### 8.8 Experiment 7: Conversation Memory

	[Upload conversation_curve.png here]

	Observations:

	### 8.9 Cross-Experiment PBI Comparison

	\| Experiment \| PBI \| Edge Accuracy \| Middle Accuracy \| Classification \|
	\|-----------\|-----\|--------------\|-----------------\|----------------\|
	\| KV 100 keys \| \| \| \| \|
	\| KV 200 keys \| \| \| \| \|
	\| Needle \| \| \| \| \|
	\| Multi-Needle (middle) \| \| \| \| \|
	\| Fact Reasoning \| \| \| \| \|
	\| Semantic Distractors \| \| \| \| \|
	\| Temporal Narrative \| \| \| \| \|
	\| Conversation Memory \| \| \| \| \|

	---

	## 9. Conclusions & Discussion

	> [USER TO WRITE CONCLUSIONS HERE]

	### 9.1 Key Findings

	Summarize the main discoveries from your experiments:

	1.
	2.
	3.

	### 9.2 Implications

	What do these results mean for practitioners?

	### 9.3 Limitations

	What are the limitations of this study?

	### 9.4 Future Work

	What experiments or analyses would strengthen these findings?

	---

	## 10. Extending the Suite

	### 10.1 Add a New Experiment

	1. Create `experiments/my_experiment.py` with a `run_my_experiment(model_name, ..., out_dir)` function.
	2. Create `kaggle/run_expN_myexperiment.py` that calls your function with Kaggle defaults.
	3. Import and add to `run_all.py` and `kaggle/run_all_kaggle.py`.

	### 10.2 Add a New Model

	Change `--model` to any HuggingFace causal LM:

	```bash
	python run_all.py --model meta-llama/Llama-3.2-1B-Instruct
	python run_all.py --model Qwen/Qwen2.5-7B-Instruct
	```

	The suite automatically handles 4-bit quantization via `bitsandbytes`.

	### 10.3 Adjust Scale

	Increase context length by changing experiment parameters:

	```bash
	python kaggle/run_exp1b_kv200.py --n-keys 500 --n-examples 100
	python kaggle/run_exp2_needle.py --n-sentences 1000 --n-examples 50
	```

	---

	## 11. Citation

	If you use this benchmark suite in your research, please cite both the original paper and this suite:

	```bibtex
	@article{liu2023lost,
	title={Lost in the Middle: How Language Models Use Long Contexts},
	author={Liu, Nelson F and Lin, Kevin and Hewitt, John and Paranjape, Ashwin and Bevilacqua, Michele and Petroni, Fabio and Liang, Percy},
	journal={arXiv preprint arXiv:2307.03172},
	year={2023}
	}

	@software{litm_benchmark_suite_v4,
	title={Lost in the Middle Benchmark Suite v4},
	author={abhshkp},
	year={2026},
	url={https://huggingface.co/abhshkp/litm-benchmark-suite-v4}
	}
	```

	---

	## Acknowledgments

	This suite extends the foundational work of Liu et al. (2023) and incorporates community feedback on scalable, modular benchmarking. Built with HuggingFace Transformers, bitsandbytes, and matplotlib.