---
license: apache-2.0
base_model: unsloth/Qwen2.5-Coder-32B-Instruct-bnb-4bit
language:
- en
library_name: unsloth
tags:
- amd
- rocm
- hip
- cuda
- code-generation
- lablab-ai
- ghost-coder
- mi300x
- unsloth
---

# Ghost-Coder: Autonomous CUDA-to-HIP Translator 👻

**Ghost-Coder** is a specialized, agent-ready LLM designed to bridge NVIDIA's CUDA and AMD's open ROCm ecosystems. Developed for the **Lablab.ai AMD Developer Hackathon (2026)**, this model serves as the "brain" of a self-healing agentic workflow that translates, compiles, and iterates on GPU kernels.

## 🚀 Overview
Ghost-Coder isn't just a translator; it's an engineer. By fine-tuning **Qwen2.5-Coder-32B** on the **CASS (CUDA-to-HIP)** mapping dataset, we've produced a model that understands the deep structural nuances of GPU programming, from shared-memory primitives to warp-level synchronization.

### 💎 Hardware & Framework
- **Training Hardware:** AMD Instinct™ MI300X VF (192GB HBM3)
- **Framework:** [Unsloth](https://github.com/unslothai/unsloth) (Optimized for 2x faster ROCm fine-tuning)
- **Optimization:** 4-bit QLoRA with a 4,096-token context window.

## 🧠 Model Highlights
- **High-Fidelity Mapping:** Precise translation of `cuda*` APIs to their corresponding `hip*` counterparts.
- **Agent-Ready:** Optimized to parse `hipcc` compiler error logs and self-correct syntax or logic errors in real time.
- **Massive Scale:** Built on the 32B-parameter Qwen2.5-Coder foundation for stronger C++ reasoning than smaller 7B models.
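
The API-mapping portion of this task can be sketched as a purely textual rename, in the spirit of tools like `hipify-perl`. This is a hypothetical baseline (the mapping table below is a tiny, illustrative subset, not the model's actual mechanism); a fine-tuned model is needed precisely because renames alone miss structural differences such as wavefront size:

```python
import re

# Illustrative subset of the cuda* -> hip* API mapping.
# Real tools such as hipify-perl cover hundreds of calls; Ghost-Coder
# additionally handles the structural changes a rename cannot.
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaMemcpyDeviceToHost": "hipMemcpyDeviceToHost",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
}

def naive_translate(cuda_src: str) -> str:
    """Purely textual cuda* -> hip* renaming: the baseline a model must beat."""
    # Longest names first so "cudaMemcpyHostToDevice" wins over "cudaMemcpy".
    names = sorted(map(re.escape, CUDA_TO_HIP), key=len, reverse=True)
    pattern = re.compile("|".join(names))
    return pattern.sub(lambda m: CUDA_TO_HIP[m.group(0)], cuda_src)
```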

## 🛠️ Training Specifications
To balance generalization against overfitting, the model underwent a short, high-throughput training sprint:

| Parameter | Configuration |
| :--- | :--- |
| **Total Steps** | 200 (Optimized Sprint) |
| **Global Batch Size** | 64 |
| **Learning Rate** | 2e-4 |
| **VRAM Utilization** | ~158GB / 192GB |
| **Dataset** | 12,800+ Curated CUDA-to-HIP Pairs |
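
As a quick sanity check on the schedule above: 200 steps at a global batch of 64 covers the 12,800-pair dataset exactly once, i.e. roughly one epoch (a back-of-envelope sketch; the actual sample count may exceed the "12,800+" figure slightly):

```python
# Back-of-envelope check on the training schedule from the table above.
total_steps = 200
global_batch_size = 64
dataset_pairs = 12_800  # "12,800+ curated pairs"

samples_seen = total_steps * global_batch_size  # samples processed in the sprint
epochs = samples_seen / dataset_pairs           # fraction of the dataset covered

print(samples_seen, epochs)  # 12800 samples, ~1 epoch
```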

## 🏁 Intended Use (The Ghost-Harness)
This model is designed to work within the **Ghost-Harness** agentic loop:
1. **Input:** User provides a raw `.cu` (CUDA) file.
2. **Action:** Ghost-Coder generates a `.cpp` (HIP) translation.
3. **Validation:** The harness runs `hipcc` on the output.
4. **Self-Healing:** If compilation fails, the error logs are fed back to Ghost-Coder for an iterative fix.
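
The steps above can be sketched roughly as follows. This is an illustrative harness, not the actual Ghost-Harness code; `translate` is a hypothetical callable wrapping model inference, and the `hipcc` invocation assumes a ROCm toolchain on `PATH`:

```python
import pathlib
import subprocess
import tempfile

def compile_hip(hip_src: str) -> tuple[bool, str]:
    """Step 3: try to compile the candidate translation with hipcc."""
    with tempfile.TemporaryDirectory() as tmp:
        src = pathlib.Path(tmp) / "kernel.cpp"
        src.write_text(hip_src)
        proc = subprocess.run(
            ["hipcc", "-c", str(src), "-o", str(pathlib.Path(tmp) / "kernel.o")],
            capture_output=True, text=True,
        )
        return proc.returncode == 0, proc.stderr

def self_heal(cuda_src: str, translate, compile_fn=compile_hip, max_rounds: int = 3) -> str:
    """Steps 2-4: translate, validate, and feed errors back until it compiles."""
    hip_src = translate(cuda_src, error_log=None)  # first-pass translation
    for _ in range(max_rounds):
        ok, log = compile_fn(hip_src)
        if ok:
            return hip_src  # compiles cleanly; done
        hip_src = translate(cuda_src, error_log=log)  # iterative fix
    return hip_src  # best effort after max_rounds
```

Making `compile_fn` injectable keeps the loop testable on machines without a ROCm toolchain.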

## 📝 Acknowledgments
Special thanks to **AMD** for the world-class compute and **Lablab.ai** for hosting the "Build Across the AI Stack" challenge.

---
*Developed by Talha*