## 12. Baseline Inference Script

`inference.py` uses an OpenAI-compatible client with configurable provider settings to run any LLM (default: `meta-llama/Llama-3.3-70B-Instruct` via the Hugging Face router) as a zero-shot SRE agent against all 3 tasks and reports structured scores.
### Environment Variables

| Variable | Default | Description |
|---|---|---|
| `HF_TOKEN` | *(required)* | API key for the LLM provider |
| `API_BASE_URL` | `https://router.huggingface.co/v1` | API endpoint |
| `MODEL_NAME` | `meta-llama/Llama-3.3-70B-Instruct` | Model identifier |
| `ENV_URL` | `http://localhost:7860` | LogTriageEnv server |
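As a rough illustration, the table above maps to provider configuration along these lines — a minimal sketch in which `load_config` is a hypothetical helper, not taken from `inference.py`:

```python
import os

def load_config(environ=os.environ) -> dict:
    """Collect provider settings, falling back to the documented defaults."""
    if "HF_TOKEN" not in environ:
        raise RuntimeError("HF_TOKEN is required")
    return {
        "api_key": environ["HF_TOKEN"],
        "api_base_url": environ.get("API_BASE_URL", "https://router.huggingface.co/v1"),
        "model_name": environ.get("MODEL_NAME", "meta-llama/Llama-3.3-70B-Instruct"),
        "env_url": environ.get("ENV_URL", "http://localhost:7860"),
    }
```

Because the base URL and model name are plain strings, pointing the same script at Groq, the HF router, or any other OpenAI-compatible endpoint is just a matter of swapping environment variables.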
### Key Features

- **System prompt** — Structured SRE triage persona with an action schema enforced via JSON output
- **Conversation history** — Bounded to 8 turns to stay within context limits
- **Fallback logic** — Heuristic fallback if the LLM output fails to parse or the call errors; avoids episode crashes
- **Step rate limiting** — 200ms sleep between steps to avoid provider rate limits
- **Health check** — Validates the environment is reachable before running tasks
- **Seeded reproducibility** — All tasks run with `seed=42`
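The fallback logic can be sketched as follows — a hypothetical `parse_action` helper that returns a safe default action whenever the model's JSON output doesn't parse (the `observe` action name is illustrative, not from the script):

```python
import json

def parse_action(raw: str) -> dict:
    """Parse the model's JSON action; fall back to a safe default on failure."""
    try:
        action = json.loads(raw)
        if isinstance(action, dict) and "action" in action:
            return action
    except json.JSONDecodeError:
        pass
    # Heuristic fallback so a malformed reply doesn't crash the episode.
    return {"action": "observe"}
```

Catching the parse failure at this boundary is what lets a full 3-task run complete even when the provider occasionally returns prose instead of JSON.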
### Usage

```bash
export HF_TOKEN=your_key_here
export API_BASE_URL=https://api.groq.com/openai/v1  # or HF router
export MODEL_NAME=llama-3.3-70b-versatile

python inference.py
```
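Before the tasks run, the health check listed above confirms the `ENV_URL` server is answering. A stdlib-only sketch of that idea — the `/health` path is an assumption, not confirmed by this README:

```python
import urllib.request
import urllib.error

def env_reachable(env_url: str, timeout: float = 5.0) -> bool:
    """Return True if the LogTriageEnv server responds before tasks start."""
    try:
        # NOTE: /health is an assumed endpoint; the actual path may differ.
        with urllib.request.urlopen(f"{env_url}/health", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False
```

Failing fast here is cheaper than discovering mid-episode that the environment server was never started.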
### Output

The script prints a per-task score bar and returns a JSON block with the full breakdown:

```json
{
  "api_base_url": "https://api.groq.com/openai/v1",
  "model_name": "llama-3.3-70b-versatile",
  "seed": 42,
  "results": [
    { "task_id": "single_crash", "score": 1.0, "steps_taken": 5, "breakdown": {} },
    { "task_id": "cascading_failure", "score": 0.65, "steps_taken": 9, "breakdown": {} },
    { "task_id": "silent_degradation", "score": 1.0, "steps_taken": 12, "breakdown": {} }
  ],
  "average_score": 0.8833,
  "runtime_seconds": 45.2
}
```
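For reference, the `average_score` field in the sample above is consistent with an unweighted mean of the per-task scores (the four-decimal rounding is an assumption from the sample output):

```python
def average_score(results: list[dict]) -> float:
    # Unweighted mean of per-task scores, rounded to 4 decimal places.
    return round(sum(r["score"] for r in results) / len(results), 4)
```

With the three sample scores `1.0`, `0.65`, and `1.0`, this yields `0.8833`, matching the JSON block.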
---