# Distance-to-Goal Metrics: Complete Reference Table ## Priority-Ordered Metric Comparison | Priority | Metric Name | Effort | Value | Concept | What It Measures | Key Value | Potential Risks | |---------|-------------|--------|-------|---------|------------------|-----------|-----------------| | **1** | **Cardinality Matching** | 30 min | Very High | Compare row/value counts | "Did you return the right number of things?" | Catches over/under-retrieval early; works universally | None significant; very robust | | **2** | **Value Overlap (Set-based)** | 45 min | Very High | Flatten results to value sets, compute Jaccard | "How many correct values did you find?" | Rewards partial correctness; format-agnostic | May over-reward coincidental matches in large results | | **3** | **Numeric Range Proximity** | 1 hour | High | Logarithmic distance for numbers | "Are your numbers in the right ballpark?" | Critical for COUNT/SUM/AVG queries; rewards order-of-magnitude | Only useful for numeric questions (~40% of dataset) | | **4** | **Row-wise Best Match** | 1.5 hours | High | Find best row pairing between results | "How many rows are correct, ignoring order?" | Handles ORDER BY issues; forgiving for column mismatch | Computationally expensive for large result sets (100+ rows) | | **5** | **Schema Coverage** | 2 hours | Medium-High | Compare tables used in queries | "Are you querying the right tables?" | Guides exploration toward relevant parts of schema | Requires SQL parsing; may reward irrelevant table access | | **6** | **Column Alignment** | 2.5 hours | Medium | Fuzzy match column names | "Do your result columns make semantic sense?" | Helps disambiguate multi-column results | Requires metadata tracking; fuzzy matching can be noisy | | **7** | **Rank Correlation** | 3 hours | Medium | Spearman correlation on ordered results | "Is your ranking correct for TOP-K queries?" | Specific to ordered results; robust to ties | Only applicable to ~20% of questions; needs scipy dependency | | **8** | **SQL Structure** | 6+ hours | Low-Medium | Parse query AST, compare structure | "Is your query syntactically similar?" | Might help with complex multi-join queries | HIGH RISK: Reward hacking; overfitting; ignores semantic equivalence | | **9** | **Execution Plan** | 8+ hours | Low | Compare database query plans | "Does your query execute similarly?" | Theoretical value for optimization | NOT RECOMMENDED: Too complex; DB-specific; doesn't guarantee correctness | --- ## Detailed Conceptual Explanations ### ๐Ÿฅ‡ Tier 1: Must-Have Metrics #### 1. Cardinality Matching **How It Works**: ``` Input: agent_result = [row1, row2, row3, row4, row5] (5 rows) gold_result = [row1, row2, row3] (3 rows) Calculation: difference = |5 - 3| = 2 relative_error = 2 / 3 = 0.667 reward = 1 - min(1.0, 0.667) = 0.333 ``` **Conceptual Meaning**: "Getting the right number of things is the first step to getting the right things." **Value Provided**: - **Early signal**: Before content is right, size can be right - **Universal**: Works for integers (1 value), lists (N values), tables (M rows) - **Catches common errors**: - Missing `GROUP BY` โ†’ too few rows - Cartesian join โ†’ too many rows - Wrong aggregation โ†’ count off by orders of magnitude **Risk Analysis**: โœ… **Very low risk** - No reward hacking opportunities (agent can't game cardinality without semantic progress) - Monotonic with correctness (better results = better cardinality) - Fast to compute (O(1)) --- #### 2. Value Overlap (Set-based) **How It Works**: ``` Question: "Which departments have >50 employees?" Gold: {Engineering, Sales, Marketing} Agent: {Engineering, Sales, HR, Legal} (2 correct, 2 wrong) Calculation: intersection = {Engineering, Sales} = 2 values union = {Engineering, Sales, Marketing, HR, Legal} = 5 values Jaccard = 2/5 = 0.4 ``` **Conceptual Meaning**: "You found some of the right answers; find the rest." **Value Provided**: - **Partial credit**: Agent gets 0.4 reward even though final answer isn't perfect - **Format agnostic**: Flattens everything to atomic values - `[("Engineering", 42)]` โ†’ `{Engineering, 42}` - `[(42, "Engineering")]` โ†’ `{Engineering, 42}` (same!) - **Compositional**: Finding 1 of 3 correct departments = 0.33, finding 2 of 3 = 0.67, finding all 3 = 1.0 **Risk Analysis**: โš ๏ธ **Low-medium risk** - **Coincidental matches**: If gold answer is `{42}` and agent returns entire `employees` table with 100 rows, might contain `42` by chance - **Mitigation**: Combine with cardinality (penalizes returning too many values) - **False precision**: `{42}` vs `{42.0}` vs `{42.001}` might all become `{"42"}` after string conversion --- #### 3. Numeric Range Proximity **How It Works**: ``` Question: "Average salary in Engineering?" Gold: 95000 Agent attempt 1: 87000 (off by 8.4%) relative_error = 8000 / 95000 = 0.084 reward = 1 - log10(1 + 0.084) = 1 - log10(1.084) = 1 - 0.035 = 0.965 Agent attempt 2: 9500 (off by 90%, wrong order of magnitude) relative_error = 85500 / 95000 = 0.9 reward = 1 - log10(1 + 0.9) = 1 - log10(1.9) = 1 - 0.279 = 0.721 Agent attempt 3: 950000 (10x too high) relative_error = 855000 / 95000 = 9.0 reward = 1 - log10(1 + 9.0) = 1 - log10(10) = 1 - 1.0 = 0.0 ``` **Conceptual Meaning**: "Being off by 10% is very different from being off by 10x." **Value Provided**: - **Order-of-magnitude thinking**: Rewards agent for "ballpark correct" before exact - **Logarithmic scale**: - 95k โ†’ 100k (5% error) = high reward (0.98) - 95k โ†’ 190k (100% error) = medium reward (0.70) - 95k โ†’ 950k (900% error) = no reward (0.05) - **Natural for SQL**: COUNT, SUM, AVG queries often close but not exact on first try **Risk Analysis**: โš ๏ธ **Medium risk** - **Only useful for ~40% of questions**: Text/categorical answers get no benefit - **Multiple numbers**: If result has `[42, 100, 5]` and gold is `[42]`, which number to compare? - Solution: Use closest match for each gold number - **Zero-handling**: `gold=0, agent=1` is infinitely far; needs special case --- #### 4. Row-wise Best Match **How It Works**: ``` Question: "Top 3 departments by size" Gold: [(Engineering, 65), (Sales, 58), (Marketing, 52)] Agent: [(Marketing, 52), (Engineering, 65), (Sales, 58)] # Wrong order! Process: For each gold row, find best matching agent row: Gold row 1: (Engineering, 65) vs Agent row 1: (Marketing, 52) โ†’ 0/2 match = 0.0 vs Agent row 2: (Engineering, 65) โ†’ 2/2 match = 1.0 โœ“ vs Agent row 3: (Sales, 58) โ†’ 0/2 match = 0.0 Best match: 1.0 Gold row 2: (Sales, 58) Best match: 1.0 (agent row 3) Gold row 3: (Marketing, 52) Best match: 1.0 (agent row 1) Final reward: (1.0 + 1.0 + 1.0) / 3 = 1.0 ``` **Conceptual Meaning**: "You got the right rows, just in the wrong orderโ€”that's still mostly correct." **Value Provided**: - **Order-invariant**: Catches `ORDER BY` mistakes without penalizing heavily - **Extra columns forgiven**: If agent returns `(Engineering, 65, 95000)` and gold is `(Engineering, 65)`, first 2 columns match โ†’ 0.67 reward - **Partial row matches**: Agent got department name right but count wrong โ†’ 0.5 reward per row **Risk Analysis**: โš ๏ธ **Medium risk** - **Computationally expensive**: O(Mร—N) comparisons for M gold rows and N agent rows - For 100-row results: 10,000 comparisons per reward calculation - Mitigation: Limit to first 20 rows - **Ambiguous matching**: If gold has duplicate rows, which agent row should match which? --- ### ๐Ÿฅˆ Tier 2: Nice-to-Have Metrics #### 5. Schema Coverage **How It Works**: ``` Question: "How many employees in Engineering?" Gold query: SELECT COUNT(*) FROM employees WHERE department='Engineering' Tables used: {employees} Agent query: SELECT COUNT(*) FROM employees e JOIN departments d ON e.dept_id = d.id WHERE d.name='Engineering' Tables used: {employees, departments} Calculation: intersection = {employees} = 1 union = {employees, departments} = 2 Jaccard = 1/2 = 0.5 Penalty for extra table: 0.1 * 1 = 0.1 Final: 0.5 - 0.1 = 0.4 ``` **Conceptual Meaning**: "You're exploring the right part of the database." **Value Provided**: - **Exploration guidance**: Early signal before query results are correct - **Helps with multi-hop**: "You found `employees`, now look at `departments`" - **Penalizes shotgun approach**: Agent that queries every table gets low reward **Risk Analysis**: โš ๏ธ **Medium-high risk** - **Multiple valid paths**: Simple query might use 1 table, complex query uses 3โ€”both correct - **Irrelevant table penalty**: What if agent explores `departments` first before finding `employees`? Gets penalized for valid exploration - **Requires SQL parsing**: Dependency on `sqlparse` library; edge cases in parsing --- #### 6. Column Alignment **How It Works**: ``` Question: "List departments and their average salaries" Gold columns: [department_name, avg_salary] Agent columns: [dept, average_compensation] Fuzzy matching: "department_name" vs "dept" โ†’ SequenceMatcher = 0.65 (partial match) โœ“ "avg_salary" vs "average_compensation" โ†’ SequenceMatcher = 0.45 (weak match) โœ— Reward: 1/2 columns matched = 0.5 ``` **Conceptual Meaning**: "Your columns have the right semantic meaning." **Value Provided**: - **Disambiguates multi-column results**: If result has `[42, 100, 5]`, which column is the answer? - **Catches projection errors**: Agent did `SELECT *` when should've done `SELECT department, COUNT(*)` - **Fuzzy matching helps**: "dept" matches "department", "emp_id" matches "employee_id" **Risk Analysis**: โš ๏ธ **High risk** - **Requires metadata**: Need to track column names from query results (not always available in raw SQLite) - **Fuzzy matching noise**: "count" matches "country" (0.7 similarity), "id" matches "bid" (0.67 similarity) - **Aliasing issues**: `SELECT COUNT(*) AS total` vs `SELECT COUNT(*) AS num_employees`โ€”both mean the same thing --- #### 7. Rank Correlation **How It Works**: ``` Question: "Top 5 products by revenue" Gold: [ProductA: $1M, ProductB: $900K, ProductC: $800K, ProductD: $750K, ProductE: $700K] Ranks: [1, 2, 3, 4, 5] Agent: [ProductA: $1M, ProductC: $850K, ProductB: $880K, ProductE: $710K, ProductD: $740K] Ranks: [1, 3, 2, 5, 4] Spearman correlation: Rank differences: [0, -1, +1, -1, +1] Correlation coefficient: 0.9 Reward: (0.9 + 1) / 2 = 0.95 ``` **Conceptual Meaning**: "You got the relative ordering mostly right." **Value Provided**: - **Specific to TOP-K queries**: ~20% of Spider questions involve ranking - **Robust to ties**: Handles "tied for 2nd place" correctly - **Partial credit for ordering**: Top 3 correct but bottom 2 swapped โ†’ still high reward **Risk Analysis**: โš ๏ธ **Medium risk** - **Limited applicability**: Only works for ordered results - **Requires scipy**: Heavy dependency just for one metric - **Rank vs. value confusion**: Agent might get ranking right but values wrong (or vice versa) --- ### ๐Ÿฅ‰ Tier 3: Avoid (High Risk, Low Value) #### 8. SQL Structure Similarity **How It Works**: ``` Gold query: SELECT d.name, COUNT(*) FROM employees e JOIN departments d ON e.dept_id = d.id GROUP BY d.name Agent query: SELECT department, COUNT(*) FROM employees GROUP BY department Structural comparison: Tables: {employees} vs {employees, departments} โ†’ 0.5 Joins: {(employees, departments)} vs {} โ†’ 0.0 Aggregates: {COUNT} vs {COUNT} โ†’ 1.0 Group By: {d.name} vs {department} โ†’ 0.5 Weighted average: 0.5 ``` **Conceptual Meaning**: "Your query looks syntactically similar to the gold query." **Risk Analysis**: ๐Ÿ›‘ **VERY HIGH RISK - DO NOT IMPLEMENT** - **Reward hacking**: Agent learns to copy SQL structure without understanding semantics - **Multiple valid solutions**: Gold uses JOIN, agent uses subqueryโ€”both correct, but structure reward penalizes - **Overfitting**: Agent optimizes for "looking like gold query" instead of "getting right answer" - **Research evidence**: SQL-TRAIL paper found structure-based rewards hurt generalization --- #### 9. Execution Plan Similarity **How It Works**: ``` Gold query execution plan: 1. Scan departments (10 rows) 2. Scan employees (1000 rows) 3. Hash join (O(N)) 4. Aggregate (O(N)) Agent query execution plan: 1. Scan employees (1000 rows) 2. Nested loop with departments (O(Nยฒ)) 3. Aggregate (O(N)) Similarity: 2/4 steps similar = 0.5 ``` **Conceptual Meaning**: "Your query executes in a similar way." **Risk Analysis**: ๐Ÿ›‘ **VERY HIGH RISK - DO NOT IMPLEMENT** - **Database-specific**: SQLite plans differ from PostgreSQL plans - **Doesn't guarantee correctness**: Two queries with similar plans can have different results - **Computationally expensive**: Running EXPLAIN on every query doubles execution time - **Complexity**: Comparing tree structures is non-trivial - **No research evidence**: No prior work shows this helps RL training --- ## ๐Ÿงฎ Combining Multiple Metrics into Final Reward ### The Challenge You have multiple distance-to-goal metrics. How do you combine them into a single scalar reward? ``` Current state: cardinality_score = 0.8 value_overlap_score = 0.6 numeric_range_score = 0.9 row_match_score = 0.7 Need: single_reward = ??? ``` --- ## ๐ŸŽฏ Method 1: Weighted Average (RECOMMENDED for MVP) **Formula**: ```python def weighted_average_reward(scores, weights): """Simple weighted average of applicable metrics.""" total_weight = sum(weights.values()) return sum(scores[k] * weights[k] for k in scores) / total_weight ``` **Example Implementation**: ```python def compute_progress_reward(agent_result, gold_result, agent_query=None): # Compute all metrics scores = { 'cardinality': cardinality_reward(agent_result, gold_result), 'value_overlap': value_overlap_reward(agent_result, gold_result), 'numeric_range': numeric_range_reward(agent_result, gold_result), 'row_match': rowwise_best_match(agent_result, gold_result), } # Fixed weights (tune these!) weights = { 'cardinality': 0.25, 'value_overlap': 0.40, # Highest weight (most universal) 'numeric_range': 0.15, 'row_match': 0.20, } return weighted_average_reward(scores, weights) ``` **Pros**: - โœ… Simple to implement and understand - โœ… Easy to tune (adjust weights based on training performance) - โœ… Monotonic (if individual metrics improve, final reward improves) - โœ… Bounded [0, 1] **Cons**: - โš ๏ธ Fixed weights might not be optimal for all question types - โš ๏ธ Treats all metrics as equally important (regardless of context) **When to use**: Default choice for MVP. Start here. --- ## ๐ŸŽฏ Method 2: Adaptive Weighting by Question Type **Formula**: ```python def adaptive_weighted_reward(scores, question_metadata): """Adjust weights based on question characteristics.""" # Detect question type is_numeric = has_numeric_answer(gold_result) is_multirow = len(gold_result) > 1 is_ordered = "TOP" in question.upper() or "ORDER BY" in gold_query # Adaptive weights weights = { 'cardinality': 0.25, # Always important 'value_overlap': 0.40 if not is_numeric else 0.30, 'numeric_range': 0.30 if is_numeric else 0.0, # Only for numeric 'row_match': 0.20 if is_multirow else 0.10, } # Normalize total = sum(weights.values()) weights = {k: v/total for k, v in weights.items()} return sum(scores[k] * weights[k] for k in scores) ``` **Example**: ```python Question: "Average salary in Engineering?" (numeric, single-value) โ†’ weights: cardinality=0.25, value_overlap=0.30, numeric_range=0.35, row_match=0.10 Question: "List all departments with >50 employees" (text, multi-row) โ†’ weights: cardinality=0.25, value_overlap=0.45, numeric_range=0.0, row_match=0.30 ``` **Pros**: - โœ… More accurate reward signal for different question types - โœ… Automatically disables irrelevant metrics (e.g., numeric_range for text questions) - โœ… Can tune weights per question type independently **Cons**: - โš ๏ธ More complex to implement (need question type detection) - โš ๏ธ More hyperparameters to tune - โš ๏ธ Risk of over-engineering **When to use**: If fixed weighting shows poor performance on specific question types. --- ## ๐ŸŽฏ Method 3: Max Pooling (Optimistic) **Formula**: ```python def max_pooling_reward(scores): """Take the best metric (optimistic reward).""" return max(scores.values()) ``` **Example**: ``` scores = { 'cardinality': 0.3, # Wrong row count 'value_overlap': 0.8, # Found most values! 'numeric_range': 0.4, # Numbers off 'row_match': 0.5, # Some rows match } reward = max(0.3, 0.8, 0.4, 0.5) = 0.8 ``` **Conceptual Meaning**: "Give credit for whatever the agent did best." **Pros**: - โœ… Very forgiving (agent gets credit for any progress) - โœ… Encourages diverse exploration strategies - โœ… Simple to implement **Cons**: - ๐Ÿ›‘ **Too lenient**: Agent might game easiest metric - ๐Ÿ›‘ **Non-compositional**: Doesn't reward improving multiple aspects simultaneously - ๐Ÿ›‘ **Unstable gradients**: Reward can jump dramatically between steps **When to use**: If agent is struggling to learn anything (extremely sparse rewards). Use as temporary scaffolding, then switch to weighted average. --- ## ๐ŸŽฏ Method 4: Minimum Threshold + Average (Strict) **Formula**: ```python def threshold_average_reward(scores, thresholds): """All metrics must meet threshold; then take average.""" # Check all thresholds for metric, score in scores.items(): if score < thresholds.get(metric, 0.0): return 0.0 # Fail if any metric below threshold # All thresholds met โ†’ return average return sum(scores.values()) / len(scores) ``` **Example**: ```python scores = { 'cardinality': 0.9, 'value_overlap': 0.7, 'numeric_range': 0.3, # Below threshold! } thresholds = { 'cardinality': 0.5, 'value_overlap': 0.5, 'numeric_range': 0.5, } # numeric_range (0.3) < threshold (0.5) โ†’ return 0.0 ``` **Conceptual Meaning**: "You must do reasonably well on all aspects to get any reward." **Pros**: - โœ… Prevents over-optimization of single metric - โœ… Encourages balanced progress **Cons**: - ๐Ÿ›‘ **Too strict**: Might be too hard for early training - ๐Ÿ›‘ **Cliff dynamics**: Slight improvement might not change reward at all - ๐Ÿ›‘ **Threshold tuning**: Requires careful calibration **When to use**: If agent is gaming one metric while ignoring others. Use as penalty mechanism. --- ## ๐ŸŽฏ Method 5: Hierarchical (Coarse-to-Fine) **Formula**: ```python def hierarchical_reward(scores): """First get cardinality right, then content, then structure.""" # Layer 1: Cardinality (must be >0.5 to proceed) if scores['cardinality'] < 0.5: return scores['cardinality'] * 0.3 # Low reward, stuck at layer 1 # Layer 2: Content (must be >0.5 to proceed) content_score = (scores['value_overlap'] + scores['numeric_range']) / 2 if content_score < 0.5: return 0.3 + content_score * 0.4 # Medium reward, stuck at layer 2 # Layer 3: Structure (all metrics combined) final_score = ( 0.2 * scores['cardinality'] + 0.4 * content_score + 0.4 * scores['row_match'] ) return 0.5 + final_score * 0.5 # High reward, layer 3 ``` **Conceptual Meaning**: "Learn to get the count right first, then the values, then the structure." **Visualization**: ``` Reward progression: 0.0 โ”€โ”€โ”€โ”€โ”€โ”€โ”€ 0.3 โ”€โ”€โ”€โ”€โ”€โ”€โ”€ 0.7 โ”€โ”€โ”€โ”€โ”€โ”€โ”€ 1.0 โ†‘ โ†‘ โ†‘ Cardinality Content Structure correct correct correct ``` **Pros**: - โœ… Natural curriculum (easier tasks first) - โœ… Clear progression signal - โœ… Prevents agent from over-optimizing structure before content **Cons**: - โš ๏ธ More complex logic - โš ๏ธ Requires careful threshold tuning - โš ๏ธ Might slow down learning if thresholds too strict **When to use**: If agent learns poorly with flat reward. Provides curriculum learning. --- ## ๐ŸŽฏ Method 6: Product (Multiplicative) **Formula**: ```python def product_reward(scores): """Multiply all metrics (all must be good).""" product = 1.0 for score in scores.values(): product *= score return product ``` **Example**: ``` scores = { 'cardinality': 0.9, 'value_overlap': 0.8, 'numeric_range': 0.7, } reward = 0.9 ร— 0.8 ร— 0.7 = 0.504 ``` **Conceptual Meaning**: "All aspects must be good; weak performance on any metric drags down total." **Pros**: - โœ… Encourages balanced improvement - โœ… Penalizes weak performance on any dimension **Cons**: - ๐Ÿ›‘ **Too strict**: Single low score (0.1) makes entire reward near zero - ๐Ÿ›‘ **Vanishing gradients**: Product of small numbers becomes very small - ๐Ÿ›‘ **Not bounded predictably**: Can produce very small rewards even for good progress **When to use**: Rarely. Only if you need extremely strict "all-or-nothing" reward. --- ## ๐ŸŽฏ Method 7: Percentile Aggregation (Robust) **Formula**: ```python def percentile_reward(scores, percentile=50): """Use median (or other percentile) of all metrics.""" import numpy as np return np.percentile(list(scores.values()), percentile) ``` **Example**: ``` scores = [0.9, 0.8, 0.3, 0.7, 0.6] sorted = [0.3, 0.6, 0.7, 0.8, 0.9] percentile_50 (median) = 0.7 percentile_75 = 0.8 percentile_25 = 0.6 ``` **Conceptual Meaning**: "Reward based on typical performance, ignoring outliers." **Pros**: - โœ… Robust to outlier metrics (one very low or very high score doesn't dominate) - โœ… Simple to implement - โœ… Tunable (change percentile to be more/less strict) **Cons**: - โš ๏ธ Less interpretable than weighted average - โš ๏ธ Ignores some information (throws away best and worst scores) **When to use**: If one metric is noisy or unreliable, use median to ignore it. --- ## ๐Ÿ“Š Comparison Table: Combination Methods | Method | Complexity | Interpretability | Robustness | Training Stability | Best Use Case | |--------|------------|------------------|------------|-------------------|---------------| | **Weighted Average** | Low | High | Medium | High | **MVP default** | | **Adaptive Weighting** | Medium | Medium | High | High | Different question types need different signals | | **Max Pooling** | Low | Medium | Low | Low | Agent struggling to learn anything | | **Threshold + Average** | Medium | High | Medium | Medium | Agent gaming one metric | | **Hierarchical** | High | High | High | Medium | Want curriculum learning | | **Product** | Low | Low | Low | Low | All aspects must be perfect (rare) | | **Percentile** | Low | Low | High | High | One metric is noisy/unreliable | --- ## ๐ŸŽฌ Recommended Implementation Strategy ### Phase 1: Start Simple (Week 1) ```python def compute_progress_reward(agent_result, gold_result): """Initial implementation: weighted average of 3 metrics.""" scores = { 'cardinality': cardinality_reward(agent_result, gold_result), 'value_overlap': value_overlap_reward(agent_result, gold_result), 'numeric_range': numeric_range_reward(agent_result, gold_result), } weights = {'cardinality': 0.25, 'value_overlap': 0.50, 'numeric_range': 0.25} return weighted_average_reward(scores, weights) ``` **Why**: Simple, interpretable, easy to debug. --- ### Phase 2: Add Context (Week 2, if needed) ```python def compute_progress_reward(agent_result, gold_result, question_type): """Adaptive weighting based on question type.""" scores = { 'cardinality': cardinality_reward(agent_result, gold_result), 'value_overlap': value_overlap_reward(agent_result, gold_result), 'numeric_range': numeric_range_reward(agent_result, gold_result), 'row_match': rowwise_best_match(agent_result, gold_result), } # Adapt weights if question_type == 'numeric': weights = {'cardinality': 0.2, 'value_overlap': 0.3, 'numeric_range': 0.4, 'row_match': 0.1} elif question_type == 'multirow': weights = {'cardinality': 0.25, 'value_overlap': 0.4, 'numeric_range': 0.05, 'row_match': 0.3} else: # default weights = {'cardinality': 0.25, 'value_overlap': 0.5, 'numeric_range': 0.15, 'row_match': 0.1} return weighted_average_reward(scores, weights) ``` **Why**: Improves signal quality without major complexity. --- ### Phase 3: Add Safeguards (Week 3, if agent is gaming) ```python def compute_progress_reward(agent_result, gold_result, question_type): """Weighted average with anti-gaming measures.""" scores = compute_all_scores(agent_result, gold_result) # Anti-gaming: if cardinality is way off, cap other rewards if scores['cardinality'] < 0.3: # Agent is nowhere close on size โ†’ limit credit for content return scores['cardinality'] * 0.5 # Anti-gaming: if value overlap is low, cap row match if scores['value_overlap'] < 0.4: scores['row_match'] *= 0.5 # Penalize structure if content is wrong # Standard weighted average return weighted_average_reward(scores, get_weights(question_type)) ``` **Why**: Prevents reward hacking while keeping interpretability. --- ## ๐Ÿงช How to Validate Your Combination Method Create test suite: ```python # Test 1: Perfect match assert compute_reward(gold, gold) == 1.0 # Test 2: Completely wrong assert compute_reward(random_result, gold) < 0.2 # Test 3: Monotonicity (better result โ†’ higher reward) result_v1 = partially_correct_result() # 30% right result_v2 = more_correct_result() # 60% right result_v3 = mostly_correct_result() # 90% right assert compute_reward(result_v1, gold) < compute_reward(result_v2, gold) < compute_reward(result_v3, gold) # Test 4: Bounded [0, 1] for _ in range(100): random_result = generate_random_result() reward = compute_reward(random_result, gold) assert 0.0 <= reward <= 1.0 # Test 5: Insensitive to format (same values, different structure) result_format_a = [("Engineering", 42)] result_format_b = [(42, "Engineering")] assert abs(compute_reward(result_format_a, gold) - compute_reward(result_format_b, gold)) < 0.1 # Allow small difference ``` --- ## ๐Ÿ’ก Final Recommendation **For MVP (Phase 1-3)**: - Use **Method 1: Weighted Average** with metrics #1-3 (cardinality, value overlap, numeric range) - Fixed weights: `{0.25, 0.50, 0.25}` **If training shows issues**: - Add **Method 2: Adaptive Weighting** based on question type - Add metric #4 (row-wise match) **If agent games rewards**: - Add threshold checks from **Method 4** - Add anti-gaming logic from Phase 3 example **Never use**: - Method 3 (Max Pooling) - too gameable - Method 6 (Product) - too strict, vanishing gradients - Metrics #8-9 (SQL structure, execution plans) - research shows they hurt **Bottom line**: Start simple (weighted average of 3 metrics), add complexity only when needed, always validate with test suite.