fix: serialize bug_metadata as JSON to fix pyarrow mixed-type error 4668456 shank committed 12 days ago
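The fix this commit describes can be sketched as follows; `normalize_bug_metadata` and the record shape are illustrative assumptions, not the repo's actual code:

```python
import json

def normalize_bug_metadata(records):
    """Serialize non-string bug_metadata values to JSON strings.

    pyarrow infers a single type per column when building a Dataset; a
    column mixing dicts and strings raises a mixed-type error, so every
    value is coerced to a JSON string up front.
    """
    for row in records:
        meta = row.get("bug_metadata")
        if not isinstance(meta, str):
            row["bug_metadata"] = json.dumps(meta)
    return records
```

The column can then be parsed back with `json.loads` inside the training loop when the structured fields are needed.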
fix: upgrade bitsandbytes>=0.49.0 (triton.ops), switch to Qwen2.5-Coder-3B a2fa47a shank committed 12 days ago
fix: torch at build time, remove mergekit (conflicts accelerate/peft/trl) 2bfaf77 shank committed 12 days ago
fix: remove wandb - click conflict with gradio causes resolution-too-deep 2005cd2 shank committed 13 days ago
chore: normalize dataset inputs and fix mergekit dependency for TRL 0.14.0 e67270e shank committed 13 days ago
Auto-detect GPU: bfloat16+batch2+gen8 on A100, float16+batch1+gen4 on T4 — same script works on both ea6fe4e shank committed 13 days ago
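A minimal sketch of this auto-detect logic, assuming it keys off CUDA compute capability (A100 is 8.0, T4 is 7.5); the function and key names are hypothetical, the values mirror the commit message:

```python
def gpu_profile(major_capability: int) -> dict:
    # Ampere and newer (compute capability >= 8, e.g. A100) have hardware
    # bfloat16 support, so they get the larger batch and more generations;
    # older cards such as the T4 (capability 7.5) fall back to float16.
    if major_capability >= 8:
        return {"dtype": "bfloat16", "batch_size": 2, "num_generations": 8}
    return {"dtype": "float16", "batch_size": 1, "num_generations": 4}
```

At runtime the argument would come from `torch.cuda.get_device_capability(0)[0]`, letting the same script run unmodified on both GPUs.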
Reduce max_completion_length to 160 for T4 speed: target 1000 steps in <8hrs 9487853 shank committed 13 days ago
Optimize for Kaggle P100: float16, batch=1, grad_accum=8, num_gen=4, max_completion=256, lora_r=8 73f957d shank committed 13 days ago
Fix GRPOConfig: rename max_new_tokens to max_completion_length for trl==0.14.0 8b16369 shank committed 13 days ago
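The rename can be illustrated with a small compat shim; `adapt_grpo_kwargs` is a hypothetical helper, while the two key names come straight from the commit:

```python
def adapt_grpo_kwargs(kwargs: dict) -> dict:
    # trl==0.14.0 names the generation-length field on GRPOConfig
    # max_completion_length; translate the older max_new_tokens key so
    # pre-existing configs keep working against the pinned version.
    kwargs = dict(kwargs)  # copy so the caller's dict is not mutated
    if "max_new_tokens" in kwargs:
        kwargs["max_completion_length"] = kwargs.pop("max_new_tokens")
    return kwargs
```

The result would then be passed to `GRPOConfig(**adapt_grpo_kwargs(cfg))`.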
Stabilize Space runtime: pin ML deps and disable runtime package drift 663b8db shank committed 13 days ago
Pin torch to cu121 build + use model.device instead of hardcoded cuda string 8f291e0 shank committed 13 days ago
Replace unsloth with bitsandbytes+peft: fixes CUDA driver incompatibility on HF A100 c325ad7 shank committed 13 days ago
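The replacement path can be sketched as a standard 4-bit + LoRA setup; the target modules, alpha, and compute dtype are assumptions (only the model name and lora_r=8 appear elsewhere in this history):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit NF4 quantization via bitsandbytes: plain CUDA kernels with no
# unsloth runtime patching, so it tolerates the HF A100 driver stack.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-Coder-3B",
    quantization_config=bnb_config,
    device_map="auto",
)
# LoRA adapters via peft; rank matches the lora_r=8 from the P100 commit.
lora = LoraConfig(
    r=8,
    lora_alpha=16,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
```

This keeps the whole fine-tuning path inside the mainline accelerate/peft/trl stack instead of depending on unsloth's patched kernels.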
Reduce training to 500 steps with tightened curriculum for A10G budget ba8df98 shank committed 13 days ago
Optimize for A100 80GB: 8 generations, batch 4, lr 2e-5, dense logging 2b1fbf3 shank committed 13 days ago
Reduce training to 500 steps with tightened curriculum for A10G budget 3152fa9 shank committed 13 days ago