---
title: Clinical Note Scribe
emoji: 🏥
colorFrom: blue
colorTo: green
sdk: docker
pinned: false
---
# Clinical Note Scribe

An OpenEnv-compliant environment for evaluating AI agents on clinical SOAP-note generation from doctor–patient transcripts.

Built for the Meta × Hugging Face OpenEnv Hackathon.
Medical documentation is one of the most time-consuming parts of a doctor's day. After every patient visit, clinicians spend significant time converting spoken conversations into structured clinical notes — time that could be spent on patient care instead. Clinical Note Scribe is a reinforcement learning training environment where an AI agent learns to do exactly that: listen to a doctor–patient conversation and produce a well-structured, accurate, and safe clinical note in SOAP format.
## Interesting part

The interesting part of this project is the frontend we built, which shows live the reward assigned to each submission.
## Three levels of difficulty

Tasks range from a routine check-up all the way to a chaotic ER visit with overlapping symptoms and urgent orders — each with its own grader and reward logic.
## Environment Description

A doctor–patient conversation is recorded as a text transcript. The agent's goal is to read the transcript along with structured patient context (demographics, medications, labs) and produce a clinically accurate, concise SOAP note (Subjective, Objective, Assessment, Plan).

The agent interacts through a standard `reset()` / `step()` / `state()` API. Three action types are available: submit a full note, request clarification, or revise a single section. A multi-signal reward function scores each submission on clinical accuracy, conciseness, safe language, and structural validity, with penalties for excessive steps or invalid actions.
## Observation Space

| Field | Type | Description |
|---|---|---|
| `transcript` | `str` | Full doctor–patient transcript for the current task |
| `task_id` | `str` | Unique identifier for the active task |
| `patient_context` | `dict[str, Any]` | Structured patient demographics, conditions, medications, allergies, and labs |
| `current_draft` | `Optional[str]` | The agent's most recent SOAP-note draft (null until first submission or revision) |
| `errors_so_far` | `list[str]` | Accumulated error/feedback messages from prior invalid actions |
| `step_count` | `int` | Number of steps taken so far in the current episode (0 at reset) |
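For orientation, a fresh observation might look like the sketch below. The top-level field names come from the table above; the `patient_context` keys and all values are illustrative, not actual environment output.

```python
# Illustrative observation right after reset() — top-level field names match
# the table above; patient_context keys and all values are invented.
example_obs = {
    "transcript": "Doctor: What brings you in today?\nPatient: I've had a cold for a few days...",
    "task_id": "easy_routine_checkup",
    "patient_context": {
        "age": 34,
        "sex": "F",
        "conditions": [],
        "medications": [],
        "allergies": [],
        "labs": {},
    },
    "current_draft": None,   # null until the first submission or revision
    "errors_so_far": [],     # no invalid actions yet
    "step_count": 0,         # step counter starts at 0
}
```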
## Action Space

| Field | Type | Description |
|---|---|---|
| `action_type` | `Literal["submit_note", "request_clarify", "revise_section"]` | Required. The kind of action the agent is taking |
| `soap_note` | `Optional[SOAPNote]` | Complete SOAP note — required when `action_type == "submit_note"` |
| `section` | `Optional[Literal["S", "O", "A", "P"]]` | Which SOAP section to revise — required when `action_type == "revise_section"` |
| `revision_text` | `Optional[str]` | Replacement text for the section — required when `action_type == "revise_section"` |
| `clarify_question` | `Optional[str]` | Free-text question — required when `action_type == "request_clarify"` |
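As a sketch, the two non-submission actions can be written as plain dicts. The field names come from the table above; the text values are invented for illustration.

```python
# Hypothetical revise_section action — replaces the Plan ("P") section.
revise_action = {
    "action_type": "revise_section",
    "section": "P",
    "revision_text": "Increase lisinopril to 40 mg daily; follow up in 2 weeks.",
}

# Hypothetical request_clarify action — asks the environment a free-text question.
clarify_action = {
    "action_type": "request_clarify",
    "clarify_question": "Did the patient take nitroglycerin at home before arrival?",
}
```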
## SOAPNote Schema

| Field | Type | Description |
|---|---|---|
| `subjective` | `str` | Patient's self-reported symptoms, history, and concerns |
| `objective` | `str` | Clinician's measurable findings — vitals, exam, labs, imaging |
| `assessment` | `str` | Differential diagnoses and clinical reasoning |
| `plan` | `str` | Treatment plan, medications, follow-ups, referrals |
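Putting the Action Space and SOAPNote tables together, a complete `submit_note` action could look like the following. All clinical text here is invented for illustration.

```python
# Hypothetical submit_note action carrying a full SOAPNote — all four
# sections are populated, which the format_valid reward signal requires.
submit_action = {
    "action_type": "submit_note",
    "soap_note": {
        "subjective": "34-year-old female reports three days of nasal congestion and sore throat.",
        "objective": "BP 118/76, temp 37.1 C; oropharynx mildly erythematous, lungs clear.",
        "assessment": "Findings consistent with a viral upper respiratory infection.",
        "plan": "Supportive care, fluids, rest; return if symptoms worsen or persist beyond 10 days.",
    },
}
```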
## Tasks

### 🟢 Easy — Routine Check-Up

**Task ID:** `easy_routine_checkup` · **Max steps:** 5

A 6-turn dialogue about a common cold and blood pressure screening for a 34-year-old female. Straightforward clinical picture with no complications.

### 🟡 Medium — Chronic Disease Follow-Up

**Task ID:** `medium_chronic_disease_followup` · **Max steps:** 8

A 14-turn follow-up visit for a 58-year-old male with Type 2 Diabetes and Hypertension. Includes HbA1c lab review (7.2% → 7.8%), medication adjustments (adding glipizide 5 mg, uptitrating lisinopril 20 → 40 mg), a 2-week statin gap, and dietary counselling around restaurant meals.

### 🔴 Hard — Complex ER Visit

**Task ID:** `hard_complex_er_visit` · **Max steps:** 10

A rapid 20-turn emergency-room encounter for a 72-year-old female with CAD, AFib, and CKD Stage 3. Overlapping chest pain and shortness of breath with a dual ACS vs PE differential. Includes a patient self-contradiction (denied then admitted nitroglycerin use at home), contrast dye allergy complicating CT-PA workup (V/Q scan ordered instead), elevated D-dimer (1840 ng/mL), and Cardiac ICU admission.
## Reward Function

`value = clamp(weighted_sum - deductions, 0.0, 1.0)`

| Signal | Weight | Criteria |
|---|---|---|
| `grader_score` | × 0.60 | Clinical accuracy from task-specific grader |
| `conciseness_bonus` | × 0.10 | 1.0 if total SOAP note ≤ 400 words |
| `safe_language_score` | × 0.15 | 1.0 if no unsafe-certainty phrases detected |
| `format_valid` | × 0.15 | 1.0 if all four SOAP fields are non-empty |

| Deduction | Rate | Trigger |
|---|---|---|
| Step penalty | −0.05 | Per step beyond 3 (penalises excessive clarification) |
| Error penalty | −0.10 | Per invalid action in `errors_so_far` |
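The aggregation step can be sketched in Python. The weights and deduction rates are taken from the tables above; the component scores themselves come from the task-specific graders, so this reproduces only the final combination, not the real implementation in `environment/reward.py`.

```python
def compute_reward(grader_score: float,
                   conciseness_bonus: float,
                   safe_language_score: float,
                   format_valid: float,
                   step_count: int,
                   num_errors: int) -> float:
    """Combine the four weighted signals, subtract deductions, clamp to [0, 1]."""
    weighted_sum = (0.60 * grader_score
                    + 0.10 * conciseness_bonus
                    + 0.15 * safe_language_score
                    + 0.15 * format_valid)
    # -0.05 per step beyond 3, -0.10 per invalid action accumulated so far
    deductions = 0.05 * max(0, step_count - 3) + 0.10 * num_errors
    return min(max(weighted_sum - deductions, 0.0), 1.0)
```

Under this formula, a perfect note submitted within 3 steps and with no invalid actions scores 1.0; each extra step costs 0.05 and each invalid action 0.10.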
## Installation

### Prerequisites

- Python 3.11+
- An OpenAI-compatible API key (set as `HF_TOKEN`)

### Local Setup

```bash
# Clone the repository
git clone https://github.com/<your-org>/meta-huggingface-hackathon-team-silver-orca.git
cd meta-huggingface-hackathon-team-silver-orca

# Create a virtual environment (optional but recommended)
python -m venv venv
# On Linux/macOS:
source venv/bin/activate
# On Windows:
# venv\Scripts\activate

# Install dependencies
pip install -r requirements.txt
```
### Docker Setup

```bash
docker build -t meta-huggingface-hackathon-team-silver-orca .
```
## Usage

### 1. Start the Environment Server

The environment runs as a REST API. Start the server before running the agent.

**Using Python:**

```bash
uvicorn server.app:app --host 0.0.0.0 --port 7860
```

**Using Docker:**

```bash
docker run -p 7860:7860 meta-huggingface-hackathon-team-silver-orca
```

### 2. Run the Agent (Inference)

In another terminal, run the baseline inference script, which interacts with the running environment.

**On Linux/macOS:**

```bash
export HF_TOKEN="sk-..."
export MODEL_NAME="gpt-4o-mini"  # or any OpenAI-compatible model
export API_BASE_URL="https://api.openai.com/v1"

python inference.py
```

**On Windows (PowerShell):**

```powershell
$env:HF_TOKEN="sk-..."
$env:MODEL_NAME="gpt-4o-mini"
$env:API_BASE_URL="https://api.openai.com/v1"

python inference.py
```
## API Endpoints

| Method | Path | Description |
|---|---|---|
| `GET` | `/health` | Liveness probe → `{"status": "ok"}` |
| `POST` | `/reset` | Start a new episode → `Observation` |
| `POST` | `/step` | Submit an action → `{observation, reward, done, info}` |
| `GET` | `/state` | Inspect environment state → `EnvironmentState` |
## Baseline Scores

Scores obtained using `gpt-4o-mini` with `temperature=0.2` via `inference.py`:

| Task | Difficulty | Score |
|---|---|---|
| `easy_routine_checkup` | 🟢 Easy | 0.8520 |
| `medium_chronic_disease_followup` | 🟡 Medium | 0.7450 |
| `hard_complex_er_visit` | 🔴 Hard | 0.5110 |
| **Average** | | **0.7026** |
**Note:** These baseline scores use dynamic clinical graders that check for explicit diagnoses and strict formatting. Scores will vary with the specific LLM used.
## Structured Logging

Every episode emits JSON log lines to stdout, which are scraped by the OpenEnv validator:

```json
{"event": "START", "task_id": "easy_routine_checkup", "timestamp": 1700000000.0}
{"event": "STEP", "step": 1, "action_type": "submit_note", "reward": 0.82}
{"event": "END", "task_id": "easy_routine_checkup", "final_score": 0.82}
```
## Project Structure

```
meta-huggingface-hackathon-team-silver-orca/
├── openenv.yaml           — OpenEnv spec metadata + graders
├── inference.py           — Baseline inference (OpenAI client, all 3 tasks)
├── Dockerfile             — Containerised server (port 7860)
├── README.md              — This file
├── requirements.txt
│
├── environment/
│   ├── __init__.py
│   ├── models.py          — Pydantic v2 models (Observation, Action, Reward, …)
│   ├── env.py             — ClinicalNoteScribeEnv (reset/step/state)
│   ├── reward.py          — Multi-signal reward function
│   └── tasks/
│       ├── __init__.py    — Task & grader registries
│       ├── task_easy.py   — Routine check-up + grader stub
│       ├── task_medium.py — Chronic disease follow-up + grader stub
│       └── task_hard.py   — Complex ER visit + grader stub
│
├── server/
│   ├── __init__.py
│   ├── app.py             — FastAPI application
│   └── routes.py          — API route definitions
│
└── data/
    ├── transcripts/
    │   ├── easy.txt       — 6-turn routine check-up transcript
    │   ├── medium.txt     — 14-turn chronic disease follow-up transcript
    │   └── hard.txt       — 20-turn complex ER visit transcript
    └── clarify_answers.json — Clarification Q&A lookup (10 entries)
```
## License

MIT