Rohan03 committed · Commit 965bdfb · verified · 1 Parent(s): 415a664

test: real-world regression results — 10/10 pass

Files changed (1)
  1. tests/results/REAL_WORLD_RESULTS.md +32 -0
tests/results/REAL_WORLD_RESULTS.md ADDED
@@ -0,0 +1,32 @@
+ # Real-World Regression Test Results
+
+ ## Models Tested
+ - **Llama-3.3-70B-Instruct** via OpenRouter
+ - **Gemma-4-26B** via OpenRouter
+
+ ## Results: 10/10 PASS across both models
+
+ ### Llama-3.3-70B (6/6 ✓)
+ | Domain | Test | Result | Time |
+ |--------|------|--------|------|
+ | Coding | fibonacci (4 tests) | ✓ 100% | 7s |
+ | Coding | fizzbuzz (4 tests) | ✓ 100% | 8s |
+ | Coding | factorial (3 tests) | ✓ 100% | 8s |
+ | Learning | heuristic_growth | ✓ [3→9→18] | - |
+ | Security | catch_rate | ✓ 93% | - |
+ | Security | false_positive | ✓ 0% | - |
+
+ ### Gemma-4-26B (4/4 ✓)
+ | Domain | Test | Result | Time |
+ |--------|------|--------|------|
+ | Coding | fibonacci (4 tests) | ✓ 100% | 9s |
+ | Coding | fizzbuzz (4 tests) | ✓ 100% | 7s |
+ | Coding | factorial (3 tests) | ✓ 100% | 10s |
+ | Learning | heuristic_growth | ✓ [3→6→11] | - |
+
+ ## Key Findings
+ 1. **Both models produce correct code** on the first attempt for all coding tasks
+ 2. **Heuristic library grows** across runs: Llama 3→9→18, Gemma 3→6→11
+ 3. **Universal parser works**: no JSON requirement, handles both models' output formats
+ 4. **Immune system** catches 93% of adversarial content with 0% false positives (security tests are reported for Llama-3.3-70B only)
+ 5. **Average response time** is ~8 seconds per coding task (includes actor + critic + optimizer calls)
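
The headline figures (10/10 passing rows and roughly 8 seconds per task) can be rechecked from the two tables. The sketch below is only an illustration of that arithmetic: the names `CODING_TIMES`, `ROWS_PASSED`, and `summarize` are hypothetical and are not part of the repository's test harness. The numbers are copied by hand from the tables in the added file.

```python
# Coding-task timings in seconds, copied from the two result tables.
# (Hypothetical structure for illustration; not the project's own data format.)
CODING_TIMES = {
    "Llama-3.3-70B": {"fibonacci": 7, "fizzbuzz": 8, "factorial": 8},
    "Gemma-4-26B": {"fibonacci": 9, "fizzbuzz": 7, "factorial": 10},
}

# Number of table rows per model; every row in the tables is marked as passing.
ROWS_PASSED = {"Llama-3.3-70B": 6, "Gemma-4-26B": 4}


def average(values):
    """Arithmetic mean of a non-empty collection of numbers."""
    values = list(values)
    return sum(values) / len(values)


def summarize(times, rows):
    """Recompute the pass tally and mean coding-task time from table data."""
    all_times = [t for per_model in times.values() for t in per_model.values()]
    return {
        "total_passed": sum(rows.values()),
        "mean_seconds": round(average(all_times), 2),
        "per_model_mean": {
            model: round(average(per_model.values()), 2)
            for model, per_model in times.items()
        },
    }


summary = summarize(CODING_TIMES, ROWS_PASSED)
print(summary)
```

Only the six coding rows carry a time, so the mean covers coding tasks alone; the Learning and Security rows list `-` in the Time column and are excluded.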