White paper: Effects of beam search on translation quality and resource consumption.
Executive Summary
This study evaluates the impact of beam search width on translation quality and resource consumption using Straker's Tiri J financial translation model (7B parameters, int8 quantization). Testing 2000 English→Japanese translation tasks across beam sizes 1-5, we measured quality using industry-standard metrics (BLEU, CHRF, BLEURT, TER) and tracked VRAM consumption on NVIDIA RTX 4090 hardware.
Key Findings
Quality Improvements:
- Beam search delivers 6-9% improvement in BLEU/CHRF scores over greedy decoding
- Peak quality achieved at beam size 5, but 93% of the maximum quality gain is reached at beam size 2
- Diminishing returns evident: gains beyond beam size 2 average well under 0.5% BLEU per additional beam
Resource Costs:
- VRAM consumption scales approximately linearly: +7-10% per additional beam
- Beam size 2 requires only 10% additional VRAM for 8% quality improvement
- Beam size 5 demands 33%+ more VRAM than greedy decoding
Optimal Configuration:
- Production environments: Beam size 2 provides best cost/quality balance
- Research applications: Beam size 5 maximizes metric scores
- Resource-constrained deployments: Greedy decoding (beam size 1) for <16GB VRAM
Business Impact
The analysis reveals that beam size 2 represents the optimal sweet spot for production deployment, delivering a substantial quality improvement (8% BLEU gain) with minimal resource overhead (10% VRAM increase). Organizations can capture 93% of the maximum quality gain while maintaining efficient resource utilization, making this configuration ideal for scalable production environments.
1. Introduction
1.1 Hardware Configuration
- GPU: NVIDIA GeForce RTX 4090 (24GB VRAM)
1.2 Model Architecture
- Base Model: tiri-j-fin-7b-v1.1 (7B parameter transformer) - Early Tiri model, no longer used.
- Specialization: English→Japanese translation in the financial domain.
- Optimization: int8 quantization (reduces memory footprint by 4x vs FP32)
- Batch Size: Fixed at 8 sequences
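As a quick sanity check on the 4x figure, a back-of-envelope sketch of weight memory at int8 vs FP32. This counts weights only; activations, KV cache, and framework overhead (which dominate the measured VRAM figures later in this paper) are excluded:

```python
PARAMS = 7_000_000_000  # 7B-parameter model

def weight_memory_gib(bytes_per_param: float) -> float:
    """Weight storage alone, in GiB (ignores activations, KV cache,
    and framework overhead)."""
    return PARAMS * bytes_per_param / 1024**3

fp32_gib = weight_memory_gib(4.0)  # FP32: 4 bytes per parameter
int8_gib = weight_memory_gib(1.0)  # int8: 1 byte per parameter
# fp32_gib / int8_gib is exactly 4.0 -- the 4x reduction cited above
```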
1.3 Beam Search Fundamentals
Beam search is a heuristic search algorithm that explores a graph by expanding the most promising nodes in a limited set. It is often used in optimization problems, particularly in sequence-to-sequence tasks such as translation.
Breadth-First Nature with Limitations: While similar to a breadth-first search, where all possible nodes are expanded at each step, beam search introduces a parameter called "beam width" or "beam size" that limits the number of nodes to explore. This allows the algorithm to strike a balance between efficiency and result quality.
Beam Size: Beam size is a hyperparameter that determines how many of the most probable partial solutions (nodes) are kept at each step. For instance, a beam size of 3 means at most three nodes will be expanded and 'searched' at each step.
Step-by-Step Expansion:
- Start with an initial node or partial solution.
- Expand this partial solution by creating all possible next steps.
- Evaluate each new partial solution using a scoring function, often a probability for sequence tasks.
- Keep only the top "beam size" number of partial solutions with the highest scores.
- Repeat the process with the surviving partial solutions until a complete solution is found or a certain condition is met.
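The steps above can be sketched as a toy implementation. The table-driven `next_probs` model and its token names are hypothetical stand-ins for a real decoder, chosen so that greedy decoding commits to a locally attractive first token and misses the globally best sequence:

```python
import math

# Hypothetical toy "model": maps a partial sequence to next-token
# probabilities. A real translation model would compute these with a decoder.
TABLE = {
    (): {"a": 0.6, "b": 0.4},
    ("a",): {"x": 0.5, "y": 0.5},
    ("a", "x"): {"</s>": 1.0},
    ("a", "y"): {"</s>": 1.0},
    ("b",): {"c": 1.0},
    ("b", "c"): {"</s>": 1.0},
}

def next_probs(seq):
    return TABLE[seq]

def beam_search(next_probs, beam_size, max_len, eos="</s>"):
    """Keep the `beam_size` highest-scoring partial sequences at each step,
    scoring by summed log-probability."""
    beams = [((), 0.0)]  # (partial sequence, cumulative log-prob)
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            if seq and seq[-1] == eos:  # finished hypotheses carry over
                candidates.append((seq, score))
                continue
            for tok, p in next_probs(seq).items():
                candidates.append((seq + (tok,), score + math.log(p)))
        # prune: keep only the top `beam_size` partial solutions
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
        if all(s and s[-1] == eos for s, _ in beams):
            break
    return beams[0]  # best (sequence, log-prob) found

greedy_seq, greedy_score = beam_search(next_probs, beam_size=1, max_len=5)
beam2_seq, beam2_score = beam_search(next_probs, beam_size=2, max_len=5)
# Beam size 2 recovers the higher-probability sequence ("b", "c", "</s>"),
# which greedy decoding misses by committing to "a" at step one.
```

Greedy decoding here follows "a" (probability 0.6) into a 0.3-probability sequence, while beam size 2 keeps "b" alive long enough to find the 0.4-probability sequence — the same mechanism behind the quality gains measured below.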
The tests below outline the effect of beam size on translation quality. We ran our Tiri J model on 2000 previously unseen translation tasks, scored the outputs with industry-standard quality metrics, and correlated the results against resource consumption.
1.4 Metrics Explained
1. BLEU (Bilingual Evaluation Understudy)
- Measures n-gram precision against reference translations
- Range: 0-100 (higher = better)
- Correlates well with human judgment at corpus level
2. CHRF (Character n-gram F-score)
- Evaluates character-level similarity using F-score
- Range: 0-100 (higher = better)
- Particularly effective for languages with complex morphology (e.g., Japanese)
3. BLEURT
- Learned metric using pre-trained BERT models
- Range: 0-1 (higher = better)
- Captures semantic similarity beyond surface forms
4. TER (Translation Edit Rate)
- Measures edit distance (insertions/deletions/substitutions) needed to match reference
- Range: 0-100 (lower = better)
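For intuition, minimal sketches of the core of two of these metrics: clipped n-gram precision (the heart of BLEU, without the brevity penalty or the 4-gram geometric mean) and a simplified TER with no phrase-shift moves. The example sentences are hypothetical:

```python
from collections import Counter

def ngram_precision(hyp, ref, n):
    """Clipped n-gram precision: the core of BLEU. Real BLEU combines
    n=1..4 geometrically and applies a brevity penalty."""
    hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    clipped = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
    return clipped / max(sum(hyp_ngrams.values()), 1)

def ter(hyp, ref):
    """Simplified TER: word-level edit distance / reference length * 100.
    Real TER additionally allows phrase shifts as single edits."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
    return 100 * d[m][n] / max(n, 1)

hyp = "the cat sat".split()
ref = "the cat sat down".split()
p1 = ngram_precision(hyp, ref, 1)  # 1.0: every hypothesis word appears in ref
t = ter(hyp, ref)                  # 25.0: one insertion / 4 reference words
```

Note that the short hypothesis scores perfect unigram precision yet still incurs TER edits — a preview of why BLEU/CHRF and TER can disagree, as they do in Section 2.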
2. Results
2.1 Performance Metrics Tables With % Change Between Beams
Table 1: Absolute Scores
| Beams | BLEU | CHRF | BLEURT | TER |
|---|---|---|---|---|
| 1 | 54.24 | 58.92 | 0.834 | 68.15 |
| 2 | 58.58 | 62.48 | 0.843 | 70.51 |
| 3 | 58.80 | 62.64 | 0.843 | 70.07 |
| 4 | 58.20 | 62.17 | 0.841 | 70.26 |
| 5 | 58.90 | 62.74 | 0.844 | 70.84 |
Table 2: VRAM Requirements
| Beams | VRAM (GB) | % Increase vs Greedy |
|---|---|---|
| 1 | 15.0 | 0% |
| 2 | 16.5 | +10% |
| 3 | 17.5 | +16.67% |
| 4 | 18.5 | +23.33% |
| 5 | 20+ | ≥33.33% |
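The measurements above suggest a roughly linear fit. A sketch, assuming an average increment of ~1.25 GB per extra beam at batch size 8 (individual steps in Table 2 vary between 1.0 and 1.5 GB, so this is a rough fit, not a measurement):

```python
def estimate_vram_gb(beams: int, base_gb: float = 15.0,
                     per_beam_gb: float = 1.25) -> float:
    """Linear approximation of Table 2: greedy-decoding baseline plus
    ~1.25 GB per additional beam at batch size 8 (rough fit)."""
    return base_gb + per_beam_gb * (beams - 1)

for b in range(1, 6):
    print(f"beams={b}: ~{estimate_vram_gb(b):.2f} GB")
```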
Table 3: Calculated Performance
| Beams | BLEU | Δ% vs Prev | CHRF | Δ% vs Prev | BLEURT | Δ% vs Prev | VRAM (GB) | Δ% vs Prev |
|---|---|---|---|---|---|---|---|---|
| 1 | 54.24 | - | 58.92 | - | 0.834 | - | 15.0 | - |
| 2 | 58.58 | +8.00% | 62.48 | +6.04% | 0.843 | +1.08% | 16.5 | +10.00% |
| 3 | 58.80 | +0.38% | 62.64 | +0.26% | 0.843 | 0.00% | 17.5 | +6.06% |
| 4 | 58.20 | -1.02% | 62.17 | -0.75% | 0.841 | -0.24% | 18.5 | +5.71% |
| 5 | 58.90 | +1.20% | 62.74 | +0.92% | 0.844 | +0.36% | 20.0+ | ≥+8.11% |
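The Δ% columns can be reproduced from Table 1's raw scores; a short sketch using the BLEU column:

```python
# BLEU scores from Table 1, keyed by beam size.
bleu = {1: 54.24, 2: 58.58, 3: 58.80, 4: 58.20, 5: 58.90}

def pct_change(prev: float, cur: float) -> float:
    """Percentage change vs the previous beam, rounded as in Table 3."""
    return round(100 * (cur - prev) / prev, 2)

deltas = {b: pct_change(bleu[b - 1], bleu[b]) for b in range(2, 6)}
print(deltas)  # {2: 8.0, 3: 0.38, 4: -1.02, 5: 1.2}
```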
Key Trends:
- BLEU and CHRF improve by 8.6% and 6.5% respectively from Beam 1→5
- TER paradoxically worsens with beam search (3-4% degradation)
- At this batch size, VRAM scales mostly linearly (+1.5GB/beam) until Beam 5 (+33% total)
2.2 Beam Transition Analysis
- 1→2: 8% BLEU gain for a 10% VRAM increase
- 2→3: <0.5% quality gain despite 6% VRAM growth
- 4→5: 1.2% BLEU recovery requires a further 8%+ VRAM
2.3 Critical Thresholds
- Quality Ceiling: Beam 2 achieves 93% of the maximum BLEU gain at less than a third of Beam 5's additional VRAM
- Negative ROI Zone: Beam 4 reduces quality while increasing resource use
- Diminishing Returns: Gains beyond Beam 2 average well under 0.5% BLEU per additional beam
2.4 TER Paradox Analysis
While beam search improves n-gram matching metrics (BLEU/CHRF), its tendency to produce longer translations increases edit distance through two mechanisms:
- Insertion Penalties: Additional words require deletions to match reference length
- Word Order Differences: Longer outputs create more opportunities for word-order and phrasing mismatches against the reference
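A toy illustration of the insertion mechanism, using word-level edit distance (the core of TER, minus phrase shifts). The sentences are hypothetical: the longer hypothesis is a perfectly valid rendering, yet every extra word counts as an edit:

```python
def word_edit_distance(hyp, ref):
    """Word-level Levenshtein distance (insertions/deletions/substitutions)."""
    m, n = len(hyp), len(ref)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if hyp[i - 1] == ref[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + cost)
    return d[m][n]

ref = "profits rose sharply".split()
short_hyp = "profits rose sharply".split()               # exact match
long_hyp = "profits rose very sharply indeed".split()    # valid but wordier

ter_short = 100 * word_edit_distance(short_hyp, ref) / len(ref)  # 0.0
ter_long = 100 * word_edit_distance(long_hyp, ref) / len(ref)    # ~66.7
```

Two extra words against a three-word reference push the edit rate from 0 to roughly 67 — the same dynamic, at smaller scale, behind the 3-4% TER degradation observed with beam search.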
3. Visual Analysis
3.1 Quality-VRAM Tradeoff Curve
Figure 1: Peak BLEU requires 33% more VRAM than baseline, while CHRF plateaus after Beam 3.
3.2 Key Observations from Visualization
- Non-Linear Scaling: Quality metrics plateau after Beam 2 while VRAM continues linear growth
- Metric Consensus: High inter-metric correlation (BLEU/CHRF/BLEURT) validates their reliability in these tests
- Beam 4 Anomaly: Visible as dip in Figure 1's quality curves despite VRAM increase
4. Practical Implications
Deployment Recommendations
| Use Case | Optimal Beams | Rationale |
|---|---|---|
| Basic Production | 2 | Best cost/quality balance |
| Research | 5 | Maximizes metric scores |
| Resource-Constrained Deployment | 1 | Only configuration under 16GB VRAM |
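The table above can be expressed as a small configuration helper. `recommend_beam_size` is a hypothetical function; its thresholds are taken from Table 2's measurements at batch size 8:

```python
def recommend_beam_size(vram_budget_gb: float, research: bool = False) -> int:
    """Map an available-VRAM budget to a beam size, following the
    deployment recommendations table (thresholds from Table 2)."""
    if research and vram_budget_gb >= 20.0:
        return 5  # research: maximizes metric scores
    if vram_budget_gb >= 16.5:
        return 2  # production: best cost/quality balance
    return 1      # greedy decoding for tight VRAM budgets

print(recommend_beam_size(24.0))                  # RTX 4090, production
print(recommend_beam_size(24.0, research=True))   # RTX 4090, research
```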
Hardware Planning Guidelines
- VRAM Budgeting: Each additional beam requires roughly 7-10% additional VRAM headroom on this architecture
- Batch Size Warning: Doubling batch size from 8→16 would require ~30GB VRAM at Beam 5
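The ~30GB figure follows from a back-of-envelope decomposition into a fixed cost (weights plus overhead) and a per-sequence cost that scales with batch size. Both coefficients below are back-fitted assumptions chosen to match the 20GB @ batch 8 measurement and the ~30GB projection, not independent measurements:

```python
FIXED_GB = 10.0    # weights + framework overhead (back-fitted assumption)
PER_SEQ_GB = 1.25  # activation/KV memory per sequence at beam 5 (back-fitted)

def vram_at_beam5(batch_size: int) -> float:
    """Back-of-envelope VRAM at beam size 5: only the per-sequence term
    grows with batch size; the weight footprint stays fixed."""
    return FIXED_GB + PER_SEQ_GB * batch_size

print(vram_at_beam5(8))   # matches the measured ~20 GB
print(vram_at_beam5(16))  # the ~30 GB projection for batch 16
```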
5. Conclusion
This study reveals three fundamental tradeoffs in beam search optimization:
- Quality Gains: Beam search can improve BLEU/CHRF by 6-9%
- Resource Costs: Each beam increases VRAM consumption by 7-10%
- Metric Conflicts: TER behavior in this study suggests beam search produces "different but valid" translations
While beam search enhances translation quality, practitioners must balance metric improvements against resource costs. Beam size 2 emerges as the optimal choice for Tiri Translation applications.
[Link to Tiri: https://www.straker.ai/products/tiri]