Lost in the Middle Benchmark Suite v4
A Modular, Reproducible Benchmark Suite for Evaluating Position Bias in Long-Context Language Models
Table of Contents
- What is "Lost in the Middle"?
- Position Bias Index (PBI)
- Architecture Overview
- Experiment Descriptions
- Quick Start
- Kaggle Usage
- Output Structure
- Results & Graphs
- Conclusions & Discussion
- Extending the Suite
- Citation
1. What is "Lost in the Middle"?
1.1 The Core Phenomenon
The "Lost in the Middle" (LITM) effect, first systematically documented by Liu et al. (2023), describes a critical failure mode in large language models (LLMs) when processing long contexts:
Models perform best when relevant information appears at the beginning or end of a context, and worst when it is buried in the middle.
This creates a characteristic U-shaped accuracy curve when plotting model performance against the position of the target information within a long document.
1.2 Why Does This Happen?
The LITM effect arises from how modern transformer-based LLMs process attention:
| Mechanism | Explanation |
|---|---|
| Attention Dilution | In long sequences, the softmax over attention weights becomes increasingly diffuse. Middle-position tokens receive proportionally less attention mass than edge-position tokens. |
| Positional Bias in Training | Pretraining data often places key information at document boundaries (introductions, summaries). Models learn a positional prior that favors start and end positions. |
| KV Cache Pressure | During autoregressive generation, the key-value cache grows linearly with sequence length. Attention computation over very long contexts becomes noisier in the middle regions. |
| Softmax Saturation | With many tokens competing for attention probability mass, individual middle tokens are "drowned out" by the aggregate signal from surrounding tokens. |
1.3 The U-Shaped Curve
When you plot accuracy vs. position, you see a U-shape:
Accuracy
1.0 | *                                 *
0.9 |    *                           *
0.8 |       *                     *
0.7 |          *               *
0.6 |             *         *
0.5 |                *  *  *
    +----+----+----+----+----+----+----+----+
    0  0.125 0.25 0.375 0.5 0.625 0.75 0.875 1.0
                     Position
- Position 0.0 (start): High accuracy (primacy bias)
- Position 0.5 (middle): Lowest accuracy (the "lost" zone)
- Position 1.0 (end): High accuracy (recency bias)
1.4 Why This Matters
The LITM effect has profound implications for real-world LLM deployments:
- Retrieval-Augmented Generation (RAG): If a retriever returns relevant documents in the middle of a concatenated prompt, the generator may ignore them.
- Long-Document QA: Answers hidden in the middle of legal contracts, medical records, or research papers are systematically missed.
- In-Context Learning: Demonstration examples placed in the middle of a prompt are less effective than those at the start or end.
- Conversational AI: Critical instructions buried in long chat histories are forgotten.
2. Position Bias Index (PBI)
2.1 Definition
To quantify the LITM effect consistently across experiments, we define the Position Bias Index (PBI):
PBI = (accuracy_start + accuracy_end) / 2 - accuracy_middle
Where:
- `accuracy_start` = accuracy when the target is at the beginning (position 0.0)
- `accuracy_end` = accuracy when the target is at the end (position 1.0)
- `accuracy_middle` = accuracy when the target is at the center (position 0.5)
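Concretely, PBI can be computed from a map of normalized position to accuracy. This is a minimal sketch (a hypothetical helper; the suite's `src/metrics.py` may organize this differently):

```python
def position_bias_index(acc_by_pos):
    """PBI = mean of the edge accuracies minus the middle accuracy.

    acc_by_pos maps normalized positions (at least 0.0, 0.5, 1.0)
    to accuracy values in [0, 1].
    """
    edge = (acc_by_pos[0.0] + acc_by_pos[1.0]) / 2
    return edge - acc_by_pos[0.5]

# 94% at the start, 54% in the middle, 90% at the end:
print(round(position_bias_index({0.0: 0.94, 0.5: 0.54, 1.0: 0.90}), 2))  # 0.38
```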
2.2 Interpretation
| PBI Range | Meaning |
|---|---|
| PBI > 0.30 | Strong U-shaped bias: severe middle degradation |
| PBI = 0.15–0.30 | Moderate bias: noticeable middle dip |
| PBI = 0.05–0.15 | Weak bias: slight middle dip |
| PBI ≈ 0.00 | Flat curve: no positional bias |
| PBI < 0.00 | Inverted U: rare; the model performs best in the middle |
2.3 Why PBI?
PBI is superior to simply reporting "middle accuracy" because:
- Normalizes for overall model competence: A model with 95% edge accuracy and 60% middle accuracy gets the same PBI as one with 70% edge and 35% middle.
- Captures the full U-shape: It explicitly contrasts edges against center.
- Comparable across experiments: KV retrieval, needle-in-haystack, and reasoning tasks all speak the same metric language.
2.4 Expanded Positions
Unlike the original Liu et al. paper (which tested 5 positions), this suite tests 9 positions for finer-grained curve resolution:
| Position | Normalized | Description |
|---|---|---|
| 0 | 0.000 | Absolute start |
| N/8 | 0.125 | Early |
| N/4 | 0.250 | Early-middle |
| 3N/8 | 0.375 | Pre-middle |
| N/2 | 0.500 | Exact middle |
| 5N/8 | 0.625 | Post-middle |
| 3N/4 | 0.750 | Late-middle |
| 7N/8 | 0.875 | Late |
| N-1 | 1.000 | Absolute end |
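The nine indices can be derived from the context length N. A sketch assuming truncating division, which reproduces indices such as 37 and 87 for N = 100 (the suite may round differently):

```python
def gold_positions(n):
    """Gold-item indices 0, N/8, N/4, ..., 7N/8, N-1 (truncating division)."""
    return [min(i * n // 8, n - 1) for i in range(9)]

print(gold_positions(100))  # [0, 12, 25, 37, 50, 62, 75, 87, 99]
```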
3. Architecture Overview
3.1 Module Structure
litm-benchmark-suite-v4/
│
├── src/                          # Shared infrastructure
│   ├── model_loader.py           # 4-bit quantized model loading
│   ├── generator.py              # Chat-template text generation
│   ├── metrics.py                # PBI, exact-match, numeric scoring
│   ├── plotting.py               # Standardized curve/bar plots
│   └── utils.py                  # JSONL/JSON I/O
│
├── experiments/                  # Core experiment logic (library)
│   ├── kv_retrieval.py           # Exp 1: UUID key-value pairs
│   ├── needle_in_haystack.py     # Exp 2: Secret code in prose
│   ├── multi_needle.py           # Exp 3: Three simultaneous needles
│   ├── fact_reasoning.py         # Exp 4: Math with buried facts
│   ├── semantic_distractors.py   # Exp 5: Gold among similar facts
│   ├── temporal_narrative.py     # Exp 6: Events in chronology
│   └── conversation_memory.py    # Exp 7: Instruction in chat history
│
├── kaggle/                       # Standalone Kaggle runners
│   ├── run_exp1a_kv100.py
│   ├── run_exp1b_kv200.py
│   ├── run_exp2_needle.py
│   ├── run_exp3_multi.py
│   ├── run_exp4_reason.py
│   ├── run_exp5_semantic.py
│   ├── run_exp6_narrative.py
│   ├── run_exp7_conversation.py
│   └── run_all_kaggle.py         # Master Kaggle runner
│
├── run_all.py                    # Local master runner
├── config.yaml                   # Hyperparameter configuration
└── requirements.txt              # Dependencies
3.2 Design Philosophy
- `experiments/` = Reusable library modules. Each file exports a `run_*()` function. Never executed directly.
- `kaggle/` = Thin entry-point scripts. Each imports one experiment module, adds Kaggle-specific paths and CLI args, and runs it.
- `src/` = Shared utilities (model loading, generation, metrics, plotting) used by everything.
This separation means:
- You can import `experiments.kv_retrieval` into your own custom pipeline.
- You can run any experiment standalone in Kaggle without touching the core logic.
- You can add a new experiment by writing one module and one Kaggle wrapper.
4. Experiment Descriptions
4.1 Experiment 1: Key-Value Retrieval
Files: experiments/kv_retrieval.py, kaggle/run_exp1a_kv100.py, kaggle/run_exp1b_kv200.py
Motivation
This is the canonical LITM task from Liu et al. (2023). It tests the most basic form of long-context retrieval: given a JSON object with many key-value pairs, can the model extract the value for a specific key?
Methodology
- Generate `N` random UUID key-value pairs (e.g., 100 or 200).
- Select one pair as the gold target.
- Place the gold pair at 9 controlled positions: 0, N/8, N/4, ..., N-1.
- Prompt the model with the full JSON object and ask for the value corresponding to the gold key.
- Score with exact-match against the true value.
Prompt Template
Extract the value corresponding to the specified key in the JSON object below.
JSON data:
{"<key1>": "<value1>",
"<key2>": "<value2>",
...}
Key: "<query_key>"
Corresponding value:
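A sketch of the example builder, assuming Python 3.7+ dict ordering to control the gold pair's position (`make_kv_example` is a hypothetical helper; `experiments/kv_retrieval.py` may format the prompt differently):

```python
import json
import random
import uuid

def make_kv_example(n_keys, gold_idx, seed=0):
    """Build one KV-retrieval prompt with the gold pair at index gold_idx."""
    rng = random.Random(seed)
    pairs = [
        (str(uuid.UUID(int=rng.getrandbits(128), version=4)),
         str(uuid.UUID(int=rng.getrandbits(128), version=4)))
        for _ in range(n_keys)
    ]
    gold_key, gold_value = pairs[gold_idx]
    # Dicts preserve insertion order, so the gold pair stays at gold_idx.
    prompt = (
        "Extract the value corresponding to the specified key in the JSON "
        "object below.\n\nJSON data:\n" + json.dumps(dict(pairs), indent=1)
        + f'\n\nKey: "{gold_key}"\nCorresponding value:'
    )
    return prompt, gold_value

prompt, gold = make_kv_example(n_keys=100, gold_idx=50)
```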
Why This Task?
- No reasoning required: pure retrieval. If the model fails, it's unambiguously an attention/position problem.
- Structured input: JSON provides clear boundaries, eliminating ambiguity about what constitutes "the answer."
- Scalable: trivial to generate 50, 100, 500, or 1000 keys.
Variants
- 1A (100 keys): Moderate length. Tests position bias in a medium-length context.
- 1B (200 keys): Double length. Tests whether bias amplifies with context length.
Expected Results
- U-shaped curve with strong middle dip.
- PBI ~ 0.35–0.50 for 1.5B models.
- PBI higher for 200 keys than 100 keys (longer contexts = worse middle performance).
4.2 Experiment 2: Needle in a Haystack
Files: experiments/needle_in_haystack.py, kaggle/run_exp2_needle.py
Motivation
Tests retrieval from unstructured natural language prose. Unlike KV retrieval (structured), this requires the model to search through fluent text to find a specific fact.
Methodology
- Generate a long document of `N` filler sentences (default: 500) from a pool of generic factual statements.
- Insert a "needle" sentence containing a unique secret code at a controlled depth.
- Ask the model to extract the secret code.
- Score with exact-match.
Example Document
The history of pottery spans thousands of years. [1].
Marine biologists study coral reef ecosystems. [2].
...
The secret code is SECRET-0042. [250].   <-- needle at position 250/500
...
Railway engineering requires precise curvature. [500].
Prompt
Read the text and find the secret code.
<document>
What is the secret code? Answer with only the code.
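The depth-controlled insertion reduces to a list splice. A sketch (hypothetical helper; the suite's `experiments/needle_in_haystack.py` may join sentences differently):

```python
def build_haystack(fillers, needle, depth):
    """Insert `needle` at fractional `depth`: 0.0 = first, 1.0 = last."""
    idx = min(int(depth * len(fillers)), len(fillers))
    return " ".join(fillers[:idx] + [needle] + fillers[idx:])

fillers = [f"Filler sentence number {i}." for i in range(500)]
doc = build_haystack(fillers, "The secret code is SECRET-0042.", depth=0.5)
```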
Why This Task?
- Unstructured retrieval: tests whether the model can locate a specific fact in prose, not just structured data.
- High information density: every sentence is semantically meaningful, creating realistic competition for attention.
- Scalable to extreme lengths: can test 1K, 2K, or even 10K sentences.
Expected Results
- U-shaped curve, possibly stronger than KV retrieval because prose is less structured than JSON.
- PBI ~ 0.30–0.45.
4.3 Experiment 3: Multi-Needle Retrieval
Files: experiments/multi_needle.py, kaggle/run_exp3_multi.py
Motivation
Real documents often contain multiple relevant facts, not just one. This tests whether the model can retrieve all of them simultaneously, and whether position bias affects each needle independently.
Methodology
- Generate a long filler document (default: 300 sentences).
- Insert three distinct secret codes at three fixed positions:
- Code A at position 0 (start)
- Code B at position N/2 (middle)
  - Code C at position N-1 (end)
- Ask the model to list all three codes in order.
- Score each code independently with exact-match.
Prompt
Read the text and list ALL three secret codes in order.
<document>
Codes:
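Scoring each needle independently reduces to a substring exact-match. A sketch (hypothetical; `src/metrics.py` may normalize whitespace or case first):

```python
def score_multi_needle(answer, codes):
    """Score each secret code independently: 1.0 if it appears verbatim."""
    return {code: float(code in answer) for code in codes}

scores = score_multi_needle(
    "The codes are ALPHA-1111 and GAMMA-3333.",
    ["ALPHA-1111", "BETA-2222", "GAMMA-3333"],
)
print(scores)  # {'ALPHA-1111': 1.0, 'BETA-2222': 0.0, 'GAMMA-3333': 1.0}
```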
Why This Task?
- Tests multi-hop attention: the model must attend to three non-contiguous locations.
- Reveals asymmetric bias: does the model retrieve the start and end needles but miss the middle one?
- Models real RAG scenarios: multiple retrieved chunks concatenated together.
Expected Results
- Start code: ~90–100% accuracy
- End code: ~90–100% accuracy
- Middle code: ~50–70% accuracy (the "lost" needle)
- Bar chart showing asymmetric retrieval.
4.4 Experiment 4: Fact-Dependent Reasoning
Files: experiments/fact_reasoning.py, kaggle/run_exp4_reason.py
Motivation
Retrieval is only step one. In real tasks, models must use retrieved facts to perform reasoning (math, inference, decision-making). This tests whether position bias persists when the model must both retrieve a fact and reason with it.
Methodology
- Generate a long document of `N` distractor sentences (default: 300).
- Insert one critical fact about a fictional product with a random price at a controlled depth, e.g.:
"For this order, Zylor apples cost $247/kg."
- Ask a math question that requires this fact:
"According to the document, I buy 12 kg of Zylor apples. What is my total cost?"
- The model must (a) find the fictional product's price, (b) multiply by quantity.
- Score with exact integer match.
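The generation steps above can be sketched as follows (filler text and helper name are illustrative; `experiments/fact_reasoning.py` may differ):

```python
import random

def make_reasoning_example(n_fillers=300, depth=0.5, seed=0):
    """One fact-dependent math example: buried price fact plus quantity question."""
    rng = random.Random(seed)
    price = rng.randint(50, 500)  # random price blocks parametric shortcuts
    qty = rng.randint(2, 20)
    fact = f"For this order, Zylor apples cost ${price}/kg."
    fillers = [f"Distractor sentence number {i}." for i in range(n_fillers)]
    idx = min(int(depth * n_fillers), n_fillers)
    document = " ".join(fillers[:idx] + [fact] + fillers[idx:])
    question = (f"According to the document, I buy {qty} kg of Zylor apples. "
                "What is my total cost?")
    return document, question, price * qty

doc, question, answer = make_reasoning_example(depth=0.5)
```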
Critical Design Choice
All products are fictional (e.g., "Zylor apples," "Krynn berries") with random prices ($50–$500). The model cannot answer from parametric knowledge: it MUST read the document.
Why This Task?
- Reasoning × Retrieval: failure could be either retrieval failure or reasoning failure. This disentangles them.
- More realistic than pure retrieval: most real tasks require using information, not just locating it.
- Tests compositional generalization: can the model compose retrieved facts with arithmetic?
Expected Results
- If the model is capable: U-shaped curve, but possibly weaker than pure retrieval because reasoning demands deeper processing. PBI ~ 0.20–0.40.
- If the model is not capable: near-chance accuracy across all depths (~10–30%), making position bias statistically undetectable. This itself is a valuable finding: it establishes that LITM effects are observable only when the underlying task lies within the model's competence frontier.
4.5 Experiment 5: Semantic Similarity Distractors
Files: experiments/semantic_distractors.py, kaggle/run_exp5_semantic.py
Motivation
In real documents, the target fact is rarely uniquely distinct. It competes with semantically similar distractors. This tests whether position bias interacts with associative interference.
Methodology
- Create a list of `N` factual statements (default: 80) from the same semantic domain, all about fictional countries with random secret codes.
  - E.g., "The capital of Xyloria is ZENTH-7392.", "The capital of Freloria is VORT-1854.", ...
- Insert the gold fact among them at a controlled depth.
- Ask a question requiring the secret code from the gold fact.
  - "What is the capital of Xyloria? Answer with only the secret code."
- The distractors create associative competition: the model must distinguish "Xyloria" from "Freloria," "Zenthar," etc.
Critical Design Choice
All countries are fictional and all codes are random. The model cannot answer from parametric knowledge. It must read the specific line in the document.
Why This Task?
- Associative interference: similar-looking facts compete for attention.
- Tests discriminative retrieval: not just "find the needle" but "find the right needle among similar needles."
- Models RAG with dense semantic overlap: e.g., multiple retrieved passages about related topics.
Expected Results
- Classic LITM (U-shape) may NOT appear. When distractors are semantically dense and the target is not lexically unique, recency bias can collapse.
- Instead of U-shape, you may see a monotonic decline or primacy-only pattern: high at start, declining through the document, with no recovery at the end.
- This is a novel finding: semantic density destroys the recency advantage because the final items are not distinct enough to "pop" against their neighbors.
4.6 Experiment 6: Temporal Narrative
Files: experiments/temporal_narrative.py, kaggle/run_exp6_narrative.py
Motivation
Documents often have inherent temporal structure (chronologies, logs, histories). Does chronological ordering help or hurt retrieval? Does the model use temporal scaffolding, or does raw position dominate?
Methodology
- Generate a timeline of `N` historical events (default: 100) drawn from:
  - 30 generic historical events (e.g., "the king issued a decree").
  - 8 statue-unveiling distractors with different materials/locations (e.g., "a bronze statue was unveiled in the town square").
- Insert a target event at a controlled depth with a random secret code:
  - "Year 1050: a golden statue was unveiled in the central square (CODE: XJ-7392)."
- The target is one of 9 statue-unveiling events and is not lexically unique. The model must distinguish "golden statue + central square" from the other statue events and extract the code.
- Ask the model to identify the code.
  - "What is the secret code for the golden statue that was unveiled in the central square?"
- Score with exact-match against the secret code.
Critical Design Choice
The target is semantically embedded in a family of similar events. The model cannot locate it by simple keyword search ("statue" appears 9 times). It must use positional attention combined with semantic discrimination.
Why This Task?
- Temporal structure: events have meaningful ordering, not arbitrary placement.
- Semantic competition: similar events compete for attention, testing true positional bias rather than lexical uniqueness.
- Models real-world timelines: medical histories, legal case files, project logs with repeated event types.
Expected Results
- U-shaped curve, but possibly different from unstructured tasks.
- If temporal ordering provides scaffolding, the curve may be weaker than needle-in-haystack.
- If semantic density dominates, recency bias may collapse (similar to Exp 5).
4.7 Experiment 7: Conversation Memory
Files: experiments/conversation_memory.py, kaggle/run_exp7_conversation.py
Motivation
Conversational AI must maintain coherence across long dialogue histories. Critical instructions or facts buried in the middle of a chat are frequently "forgotten." This tests dialog-state position bias.
Methodology
- Generate a synthetic conversation of `N` turns (default: 100) between User and Assistant.
  - User messages from a pool of generic questions.
  - Assistant messages from a pool of generic answers.
- Insert a critical instruction at a controlled depth:
  - User: "Please always remember that my favorite color is MYFAVCOLOR-042. This is very important."
  - Assistant: "I will remember that."
- At the end, ask the model to recall the instruction.
  - "Based on our conversation, what is my favorite color?"
- Score with exact-match against the color code.
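The conversation builder can be sketched like this (a hypothetical helper with placeholder filler turns; the suite's pools and phrasing differ):

```python
def build_conversation(n_turns=100, depth=0.5, code="MYFAVCOLOR-042"):
    """Synthetic chat history with the critical instruction inserted at `depth`."""
    turns = []
    for i in range(n_turns):
        turns.append({"role": "user", "content": f"Generic question {i}?"})
        turns.append({"role": "assistant", "content": f"Generic answer {i}."})
    # Each conversational turn is a user/assistant pair, hence the factor 2.
    insert_at = 2 * min(int(depth * n_turns), n_turns)
    turns[insert_at:insert_at] = [
        {"role": "user",
         "content": f"Please always remember that my favorite color is {code}. "
                    "This is very important."},
        {"role": "assistant", "content": "I will remember that."},
    ]
    turns.append({"role": "user",
                  "content": "Based on our conversation, what is my favorite color?"})
    return turns
```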
Why This Task?
- Dialog-specific: tests position bias in the conversational domain.
- Instruction following: models are explicitly told to "remember." Do they?
- Models real chatbot failures: system prompts, user preferences, critical warnings buried in history.
Expected Results
- U-shaped curve, possibly very strong because dialog turns are short and attention can "skip" over middle turns.
- PBI may be higher than document tasks.
5. Quick Start
5.1 Local Installation
# Clone the repository
git clone https://huggingface.co/abhshkp/litm-benchmark-suite-v4
cd litm-benchmark-suite-v4
# Install dependencies
pip install -r requirements.txt
5.2 Run All Experiments (Local)
python run_all.py \
--model Qwen/Qwen2.5-1.5B-Instruct \
--output ./results \
--n-examples 50
5.3 Run Single Experiment (Local)
python run_all.py \
--experiments needle \
--model Qwen/Qwen2.5-1.5B-Instruct \
--output ./results
5.4 Available CLI Flags
| Flag | Default | Description |
|---|---|---|
| `--model` | `Qwen/Qwen2.5-1.5B-Instruct` | HuggingFace model identifier |
| `--output` | `./results` | Output directory |
| `--n-examples` | `50` | Examples per position (KV only) |
| `--n-keys-100` | `100` | Keys for Exp 1A |
| `--n-keys-200` | `200` | Keys for Exp 1B |
| `--needle-sentences` | `500` | Sentences for Exp 2 |
| `--experiments` | `all` | Comma-separated: `kv100,kv200,needle,multi,reason,semantic,narrative,conversation` |
| `--zip` | `False` | Create zip archive of results |
6. Kaggle Usage
6.1 Run a Single Experiment (Recommended)
Each experiment is self-contained and takes ~15–25 minutes on a T4 GPU.
In a Kaggle notebook cell:
# Cell 1: Clone and install
!git clone https://huggingface.co/abhshkp/litm-benchmark-suite-v4 litm
%cd litm
!pip install -q -r requirements.txt
# Cell 2: Run one experiment
!python kaggle/run_exp2_needle.py \
--model Qwen/Qwen2.5-1.5B-Instruct \
--n-examples 30
# Cell 3: Zip and download
import shutil
shutil.make_archive("/kaggle/working/litm_results", "zip", "/kaggle/working/litm_results")
6.2 Run Experiment 6 (Temporal Narrative)
# Cell 1: Clone and install (fresh to get latest code)
!rm -rf litm
!git clone https://huggingface.co/abhshkp/litm-benchmark-suite-v4 litm
%cd litm
!pip install -q -r requirements.txt
# Cell 2: Run Experiment 6
!python kaggle/run_exp6_narrative.py \
--model Qwen/Qwen2.5-1.5B-Instruct \
--n-examples 30 \
--n-events 100
# Cell 3: Zip and download
import shutil
shutil.make_archive("/kaggle/working/litm_results", "zip", "/kaggle/working/litm_results")
6.3 Run All Experiments Overnight
!python kaggle/run_all_kaggle.py \
--model Qwen/Qwen2.5-1.5B-Instruct \
--output /kaggle/working/litm_results \
--n-examples 50
6.4 Kaggle Scripts
| Script | Experiment | ~Time on T4 |
|---|---|---|
| `kaggle/run_exp1a_kv100.py` | KV Retrieval (100 keys) | ~15 min |
| `kaggle/run_exp1b_kv200.py` | KV Retrieval (200 keys) | ~25 min |
| `kaggle/run_exp2_needle.py` | Needle in Haystack | ~20 min |
| `kaggle/run_exp3_multi.py` | Multi-Needle | ~15 min |
| `kaggle/run_exp4_reason.py` | Fact-Dependent Reasoning | ~20 min |
| `kaggle/run_exp5_semantic.py` | Semantic Similarity Distractors | ~15 min |
| `kaggle/run_exp6_narrative.py` | Temporal Narrative | ~15 min |
| `kaggle/run_exp7_conversation.py` | Conversation Memory | ~15 min |
| `kaggle/run_all_kaggle.py` | All 7 experiments | ~2–2.5 hrs |
7. Output Structure
7.1 Per-Experiment Output
Each experiment produces a folder with the following files:
results/
├── exp1a_kv100/
│   ├── kv100_data.jsonl       # Raw generated examples
│   ├── kv100_pos_0.jsonl      # Predictions: gold at position 0
│   ├── kv100_pos_12.jsonl     # Predictions: gold at position 12
│   ├── kv100_pos_25.jsonl     # Predictions: gold at position 25
│   ├── kv100_pos_37.jsonl     # Predictions: gold at position 37
│   ├── kv100_pos_50.jsonl     # Predictions: gold at position 50 (MIDDLE)
│   ├── kv100_pos_62.jsonl     # Predictions: gold at position 62
│   ├── kv100_pos_75.jsonl     # Predictions: gold at position 75
│   ├── kv100_pos_87.jsonl     # Predictions: gold at position 87
│   ├── kv100_pos_99.jsonl     # Predictions: gold at position 99 (END)
│   ├── kv100_summary.json     # Accuracies + PBI
│   └── kv100_curve.png        # U-shaped accuracy plot
│
├── exp2_needle/
│   ├── needle_depth_0.0.jsonl
│   ├── needle_depth_0.125.jsonl
│   ├── ... (9 depth files)
│   ├── needle_depth_1.0.jsonl
│   ├── needle_summary.json
│   └── needle_curve.png
│
├── ... (exp3-7 follow the same pattern)
│
└── master_summary.json        # Aggregated results from all experiments
7.2 JSONL Format
Each .jsonl file contains one record per example:
{"model_answer": "a1b2-c3d4-...", "correct": 1.0, "value": "target-uuid", "gold_position": 50}
{"model_answer": "wrong-guess", "correct": 0.0, "value": "target-uuid", "gold_position": 50}
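Per-position accuracy can be recovered directly from these records. A sketch (hypothetical helper; `src/metrics.py` may already provide this):

```python
import json
from collections import defaultdict

def accuracy_by_position(jsonl_path):
    """Mean of `correct` grouped by `gold_position` over a predictions file."""
    buckets = defaultdict(list)
    with open(jsonl_path) as fh:
        for line in fh:
            rec = json.loads(line)
            buckets[rec["gold_position"]].append(rec["correct"])
    return {pos: sum(vals) / len(vals) for pos, vals in sorted(buckets.items())}
```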
7.3 Summary JSON Format
{
"experiment": "kv_retrieval",
"num_keys": 100,
"num_examples": 50,
"positions": {
"0": 0.94,
"12": 0.78,
"25": 0.64,
"37": 0.58,
"50": 0.54,
"62": 0.60,
"75": 0.70,
"87": 0.82,
"99": 0.90
},
"pbi": 0.38,
"time_minutes": 12.5
}
7.4 Plot Files
Each experiment saves a .png plot:
- Curve plots (Experiments 1, 2, 4, 5, 6, 7): X-axis = normalized position, Y-axis = accuracy. Red curve with markers.
- Bar plots (Experiment 3): X-axis = Start/Middle/End, Y-axis = accuracy.
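A sketch of the curve-plot convention (a hypothetical helper; `src/plotting.py` defines the suite's actual style), using matplotlib's headless Agg backend so it runs in batch jobs:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; no display needed
import matplotlib.pyplot as plt

def plot_position_curve(acc_by_pos, out_path, title="Accuracy vs. position"):
    """Red curve with markers: normalized position on X, accuracy on Y."""
    xs = sorted(acc_by_pos)
    ys = [acc_by_pos[x] for x in xs]
    plt.figure(figsize=(6, 4))
    plt.plot(xs, ys, color="red", marker="o")
    plt.xlabel("Normalized position of gold item")
    plt.ylabel("Accuracy")
    plt.ylim(0.0, 1.05)
    plt.title(title)
    plt.grid(True, alpha=0.3)
    plt.savefig(out_path, dpi=150, bbox_inches="tight")
    plt.close()
```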
8. Results & Graphs
[USER TO INSERT OUTPUT GRAPHS HERE]
8.1 Experiment 1A: KV Retrieval (100 keys)
[Upload kv100_curve.png here]
Observations:
8.2 Experiment 1B: KV Retrieval (200 keys)
[Upload kv200_curve.png here]
Observations:
8.3 Experiment 2: Needle in Haystack
[Upload needle_curve.png here]
Observations:
8.4 Experiment 3: Multi-Needle
[Upload multi_bar.png here]
Observations:
8.5 Experiment 4: Fact-Dependent Reasoning
[Upload reason_curve.png here]
Observations:
8.6 Experiment 5: Semantic Similarity Distractors
[Upload semantic_curve.png here]
Observations:
8.7 Experiment 6: Temporal Narrative
[Upload narrative_curve.png here]
Observations:
8.8 Experiment 7: Conversation Memory
[Upload conversation_curve.png here]
Observations:
8.9 Cross-Experiment PBI Comparison
| Experiment | PBI | Edge Accuracy | Middle Accuracy | Classification |
|---|---|---|---|---|
| KV 100 keys | | | | |
| KV 200 keys | | | | |
| Needle | | | | |
| Multi-Needle (middle) | | | | |
| Fact Reasoning | | | | |
| Semantic Distractors | | | | |
| Temporal Narrative | | | | |
| Conversation Memory | | | | |
9. Conclusions & Discussion
[USER TO WRITE CONCLUSIONS HERE]
9.1 Key Findings
Summarize the main discoveries from your experiments:
9.2 Implications
What do these results mean for practitioners?
9.3 Limitations
What are the limitations of this study?
9.4 Future Work
What experiments or analyses would strengthen these findings?
10. Extending the Suite
10.1 Add a New Experiment
- Create `experiments/my_experiment.py` with a `run_my_experiment(model_name, ..., out_dir)` function.
- Create `kaggle/run_expN_myexperiment.py` that calls your function with Kaggle defaults.
- Import it and add it to `run_all.py` and `kaggle/run_all_kaggle.py`.
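A minimal module skeleton following these conventions (everything here is a hypothetical template, not the suite's actual code; generation and scoring are elided as comments):

```python
import json
import os

def run_my_experiment(model_name, n_examples=50, out_dir="./results/my_exp"):
    """Skeleton for a new experiment module under experiments/."""
    os.makedirs(out_dir, exist_ok=True)
    positions = [round(i / 8, 3) for i in range(9)]  # 0.0, 0.125, ..., 1.0
    accuracies = {}
    for depth in positions:
        # 1. Build n_examples prompts with the target at `depth`.
        # 2. Generate answers via src.generator (omitted here).
        # 3. Score with exact-match and record accuracy.
        accuracies[str(depth)] = None  # fill in with real scores
    summary = {"experiment": "my_experiment", "model": model_name,
               "positions": accuracies}
    with open(os.path.join(out_dir, "my_summary.json"), "w") as fh:
        json.dump(summary, fh, indent=2)
    return summary
```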
10.2 Add a New Model
Change --model to any HuggingFace causal LM:
python run_all.py --model meta-llama/Llama-3.2-1B-Instruct
python run_all.py --model Qwen/Qwen2.5-7B-Instruct
The suite automatically handles 4-bit quantization via bitsandbytes.
10.3 Adjust Scale
Increase context length by changing experiment parameters:
python kaggle/run_exp1b_kv200.py --n-keys 500 --n-examples 100
python kaggle/run_exp2_needle.py --n-sentences 1000 --n-examples 50
11. Citation
If you use this benchmark suite in your research, please cite both the original paper and this suite:
@article{liu2023lost,
title={Lost in the Middle: How Language Models Use Long Contexts},
author={Liu, Nelson F and Lin, Kevin and Hewitt, John and Paranjape, Ashwin and Bevilacqua, Michele and Petroni, Fabio and Liang, Percy},
journal={arXiv preprint arXiv:2307.03172},
year={2023}
}
@software{litm_benchmark_suite_v4,
title={Lost in the Middle Benchmark Suite v4},
author={abhshkp},
year={2026},
url={https://huggingface.co/abhshkp/litm-benchmark-suite-v4}
}
Acknowledgments
This suite extends the foundational work of Liu et al. (2023) and incorporates community feedback on scalable, modular benchmarking. Built with HuggingFace Transformers, bitsandbytes, and matplotlib.