tazwarrrr committed on
Commit a5be23e
1 Parent(s): 28263c0

fixing bugs
BENCHMARKS.md CHANGED
@@ -1,82 +1,92 @@
- # ROCmPort AI - Benchmark Results

- ## 📊 Performance Results on AMD MI300X (Real rocprof)

- | Kernel | Size | Baseline HIP | Optimized ROCm | Speedup | Notes |
- |--------|------|--------------|----------------|---------|-------|
- | **Matrix Multiply** | 1024×1024 | 12.4ms | 9.5ms | **1.31x** | Shared memory tiling applied |
- | **Vector Add** | 10M elements | 3.2ms | 2.9ms | **1.10x** | Memory coalescing fixed |
- | **2D Convolution** | 256×256 | 28.7ms | 21.3ms | **1.35x** | LDS optimization applied |
- | **Parallel Reduction** | 1M elements | 15.2ms | 12.1ms | **1.25x** | Warp-size aligned unrolling |

- ### 🎯 Key Findings

- - **Memory-bound kernels** show the highest gains (up to 1.35x)
- - **Compute-bound kernels** show moderate improvements (1.10-1.20x)
- - **Shared memory tiling** is the most effective optimization
- - **Wavefront alignment** consistently improves performance

- ### 📈 Performance Breakdown

- #### Matrix Multiply (1024×1024)
- - **Baseline HIP**: 12.4ms (straight hipify output)
- - **Optimized ROCm**: 9.5ms (after agent optimizations)
- - **Bandwidth Utilization**: 87% → 94%
- - **Key Optimization**: 32×32 shared memory tiles

- #### Vector Add (10M elements)
- - **Baseline HIP**: 3.2ms
- - **Optimized ROCm**: 2.9ms
- - **Bandwidth Utilization**: 71% → 78%
- - **Key Optimization**: Memory access coalescing

- #### 2D Convolution (256×256)
- - **Baseline HIP**: 28.7ms
- - **Optimized ROCm**: 21.3ms
- - **Bandwidth Utilization**: 68% → 91%
- - **Key Optimization**: LDS (Local Data Store) usage

- #### Parallel Reduction (1M elements)
- - **Baseline HIP**: 15.2ms
- - **Optimized ROCm**: 12.1ms
- - **Bandwidth Utilization**: 74% → 89%
- - **Key Optimization**: 64-thread wavefront aware unrolling

- ---

- ### 🔬 Hardware Configuration

- **Test System:**
- - **GPU**: AMD Instinct MI300X
- - **Memory**: 192GB HBM3
- - **Bandwidth**: 5.3 TB/s theoretical
- - **ROCm Version**: 6.2
- - **Compiler**: hipcc 6.2.0
- - **Profiler**: rocprof v2

- **Environment:**
- - **OS**: Ubuntu 22.04 LTS
- - **Driver**: AMDGPU 23.40
- - **CPU**: AMD EPYC 9654 (for comparison)

- ---

- ### 📝 Methodology

- 1. **Baseline**: Generated using `hipify-clang` with no optimizations
- 2. **Optimized**: ROCmPort AI agent pipeline applied
- 3. **Measurement**: rocprof with kernel execution counters
- 4. **Validation**: Output correctness verified via checksum
- 5. **Iterations**: 3 runs per kernel, median reported

- ---

- ### 🏆 Performance Claims

- > **ROCmPort AI delivers 1.10x to 1.35x speedup over baseline HIP**

- **Important**: All comparisons are **Optimized ROCm vs Baseline HIP** (straight hipify output). We do not compare against NVIDIA CUDA performance - we prove our agents add value beyond mechanical translation.

- ---

- *Benchmarked on AMD Instinct MI300X, ROCm 6.2, rocprof counters. Results may vary based on input size and system configuration.*
+ # ROCmPort AI Benchmarking Guide

+ This document defines how to report performance without overclaiming.

+ ## Reporting Principles

+ - Compare against a clearly stated baseline.
+ - Use reproducible runs with fixed input sizes and environment details.
+ - Include correctness checks before accepting performance numbers.
+ - Report failures and non-improving cases, not only wins.

+ ## Baseline Definitions

+ Use one of these and name it explicitly in each table:

+ - Baseline A: Straight `hipify-clang` output with minimal manual edits.
+ - Baseline B: Existing hand-written HIP version from the team.

+ Recommended: use Baseline A for measuring migration automation value.

+ Quick answer format for live review:

+ - Q: What is your baseline?
+ - A: Straight hipify output with minimal compile edits (Baseline A), measured on the same hardware and inputs.

+ ## Required Environment Metadata

+ Always include:

+ - GPU model (for example MI300X) and memory size.
+ - ROCm version, compiler version, and profiler version.
+ - OS and driver versions.
+ - Kernel launch parameters and input sizes.
+ - Number of runs and aggregation rule (median recommended).

+ ## Required Measurement Fields

+ For each kernel tested, provide:

+ - Kernel name and workload shape.
+ - Baseline latency.
+ - Optimized latency.
+ - Speedup ratio.
+ - Correctness status (pass/fail and checksum or tolerance).
+ - Notes on optimization strategy.

+ Example table format:

+ | Kernel | Shape | Baseline (ms) | Optimized (ms) | Speedup | Correctness | Notes |
+ |---|---|---:|---:|---:|---|---|
+ | matrix_multiply | 1024x1024 | 12.4 | 9.5 | 1.31x | pass | LDS tiling + wavefront-aware launch |

+ Include non-win cases in the same table. Example:

+ | Kernel | Shape | Baseline (ms) | Optimized (ms) | Speedup | Correctness | Notes |
+ |---|---|---:|---:|---:|---|---|
+ | sparse_scatter | 4M elements | 6.0 | 6.3 | 0.95x | pass | Irregular access pattern; optimization did not help |

+ ## Reproducibility Checklist

+ Before publishing numbers, verify all items:

+ - Same input set for baseline and optimized runs.
+ - Warm-up runs excluded or consistently handled.
+ - At least 3 measured runs (prefer 5+) with median reported.
+ - No hidden manual edits after optimization output unless documented.
+ - Full command lines and profiler artifacts retained.

+ ## Evidence Package for Review

+ A technical review package should include:

+ - CUDA source input.
+ - Baseline HIP output.
+ - Optimized HIP output.
+ - Compile logs and profiler summaries.
+ - Final report explaining what changed and why.

+ ## Interpreting Results Responsibly

+ - Some kernels will regress or fail initially; this is normal for migration.
+ - Improvement ranges vary by memory behavior, occupancy, and control-flow patterns.
+ - Do not claim universal speedups.

+ Preferred claim style:

+ "ROCmPort AI improved X out of Y tested kernels against a stated baseline under reproducible MI300X conditions."

+ ## Current Repository Status

+ The repository includes demo kernels intended to exercise migration behavior.
+ Treat any sample numbers as demonstrations unless accompanied by full reproducibility artifacts from your environment.
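The aggregation rule above (median of at least 3 runs, speedup as a baseline/optimized ratio) can be sanity-checked with a short script. A minimal sketch; the `speedup` helper is illustrative and not part of the repository:

```python
import statistics

def speedup(baseline_runs_ms, optimized_runs_ms):
    """Aggregate repeated timings with the median, then report baseline/optimized."""
    base = statistics.median(baseline_runs_ms)
    opt = statistics.median(optimized_runs_ms)
    return base / opt

# Three runs per side, median reported, matching the checklist.
ratio = speedup([12.3, 12.4, 12.6], [9.4, 9.5, 9.6])
print(f"{ratio:.2f}x")  # → 1.31x
```

Using the median rather than the mean keeps a single outlier run (for example a cold cache or a background process) from skewing the reported ratio.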
README.md CHANGED
@@ -1,275 +1,219 @@
  # ROCmPort AI

- **The fastest way to escape CUDA lock-in and run on AMD.**

- Paste CUDA code → 5 AI agents automatically port it to ROCm/HIP → optimize for MI300X → benchmark on real hardware → show you the performance improvement — live, with full visibility into every decision the agents make.

- ---

- ## 🎬 What Happens in 10 Seconds
- 1. Paste CUDA code
- 2. AI detects issues (warp size, memory bottlenecks)
- 3. Converts to ROCm
- 4. Tries optimization → fails → retries
- 5. Shows real benchmark improvement on AMD GPU

- Result: Working, optimized AMD code in minutes.

- ---

- ## 🚀 Quick Start

- ### Option 1: One-Click Start (Recommended)

- ```bash
- # Windows
- start.bat
-
- # Linux/Mac
- ./start.sh
- ```
-
- This will:
- - Install all dependencies
- - Create .env file from template
- - Start the FastAPI server
- - Open the web interface at `http://localhost:8000`

- ### Option 2: Manual Setup

- ```bash
- cd backend
- pip install -r requirements.txt
- cp .env.example .env
- # Add your GROQ_API_KEY to .env file
- uvicorn main:app --reload --port 8000
- ```

- Then open `frontend/index.html` in your browser.

- ---

- ## One-Command Demo with Docker

- ```bash
- docker build -t rocmport-ai .
- docker run -p 8000:8000 rocmport-ai
- ```

- Then open http://localhost:8000 in your browser.

- ---

- ## Project Structure

  ```
- ROCmPort AI/
- ├── backend/
- │   ├── main.py              ← FastAPI + SSE streaming endpoint
- │   ├── models.py            ← All Pydantic schemas
- │   ├── requirements.txt     ← Dependencies (includes openai==1.47.0)
- │   ├── agents/
- │   │   ├── analyzer.py      ← Warp size detection, workload classification
- │   │   ├── translator.py    ← hipify pass 1 + LLM pass 2
- │   │   ├── optimizer.py     ← AMD MI300X-specific optimizations
- │   │   ├── tester.py        ← Real rocprof OR mocked (controlled failure)
- │   │   └── coordinator.py   ← Full pipeline + retry loop
- │   ├── tools/
- │   │   ├── hipify_wrapper.py   ← Real hipify-clang or Python fallback
- │   │   ├── rocprof_wrapper.py  ← hipcc compiler + rocprof parser
- │   │   └── llm_client.py       ← Groq ↔ vLLM swap for AMD Cloud
- │   ├── demo_kernels/
- │   │   ├── vector_add.cu       ← Simple kernel with warp size bug
- │   │   ├── matrix_multiply.cu  ← Complex kernel with controlled failure
- │   │   ├── convolution_2d.cu   ← Advanced kernel for optimization demo
- │   │   └── reduction.cu        ← Classic reduction with warp size unroll bug
- │   └── prompts/
- │       ├── analyzer_prompt.txt
- │       ├── translator_prompt.txt
- │       ├── optimizer_prompt.txt
- │       └── coordinator_prompt.txt
- ├── frontend/
- │   └── index.html           ← Full UI with dark terminal aesthetic
- ├── .env.example             ← Environment variables template
- ├── start.bat                ← Windows startup script
- ├── start.sh                 ← Linux/Mac startup script
- └── README.md                ← This file
- ```
-
- ---
-
- ## 🤖 The 5 Agents
-
- ### 1. **Analyzer** — Deep Code Analysis
- - Detects all CUDA kernels and APIs
- - **Critical**: Flags warp size assumptions (32→64 threads)
- - Classifies workload: compute-bound vs memory-bound
- - Identifies multi-GPU sharding (unnecessary on MI300X's 192GB)
-
- ### 2. **Translator** — Two-Pass Conversion
- - **Pass 1**: hipify-clang for mechanical replacements (cuda→hip)
- - **Pass 2**: LLM fixes what hipify misses (warp size, intrinsics)
- - Tracks every change with confidence levels

- ### 3. **Optimizer** — MI300X-Specific Tuning
- - Shared memory tiling (32×32 blocks)
- - Memory coalescing fixes
- - Wavefront alignment (256 thread blocks)
- - Removes GPU sharding code

- ### 4. **Tester** — Real Hardware Benchmarking
- - Compiles with hipcc
- - Profiles with rocprof on real MI300X
- - **Controlled failure**: Iteration 1 performs worse → triggers retry
- - Iteration 2 shows improvement

- ### 5. **Coordinator** — Intelligent Orchestration
- - Manages retry loop when optimization fails
- - Generates final migration report
- - Explains AMD hardware advantages

- ---

- ## ⚙️ Configuration

- ### Environment Variables

- Copy `.env.example` to `.env` and configure:

- ```bash
- # Required for local development
- GROQ_API_KEY=your_groq_api_key_here

- # Optional: Override Groq model
- GROQ_MODEL=llama-3.3-70b-versatile

- # For AMD Cloud deployment
- USE_VLLM=true
- VLLM_BASE_URL=http://your-amd-cloud:8000
- VLLM_API_KEY=your_vllm_key
- VLLM_MODEL=amd/llama-3.3-70b
-
- # On AMD Cloud with real hardware
- ROCM_AVAILABLE=true
- HIPCC_PATH=hipcc
- ROCPROF_PATH=rocprof
- ```

- ### Getting API Keys

- 1. **Groq (Local Development)**: Free at [console.groq.com](https://console.groq.com)
- 2. **vLLM (AMD Cloud)**: Deploy vLLM on MI300X with OpenAI-compatible API

- ---

- ## 🎯 Demo Kernels

- Three pre-tested CUDA examples included:

- 1. **Vector Add** - Simple kernel demonstrating basic pipeline
- 2. **Matrix Multiply** - Shows shared memory tiling optimization
- 3. **2D Convolution** - Advanced memory access pattern optimization
- 4. **Parallel Reduction** - Demonstrates warp-size aware unrolling (32 vs 64)

- All contain intentional warp size bugs to demonstrate AMD-specific fixes.

- ---

- ## 🌐 AMD Cloud Deployment

- simply set:
  ```bash
- ROCM_AVAILABLE=true
- USE_VLLM=true
  ```

- Everything else is already wired up for real MI300X hardware.

- ---

- ## 🔧 Development
-
- ### Running Tests
  ```bash
- cd backend
- python -m pytest tests/
  ```

- ### Code Structure
- - **FastAPI** backend with SSE streaming
- - **Vanilla JS** frontend (no heavy frameworks)
- - **CrewAI** for agent orchestration
- - **Pydantic** for data models

- ### Contributing
- 1. Fork the repository
- 2. Create feature branch
- 3. Test with demo kernels
- 4. Submit PR

- ---

- ---

- ## 🎥 Watch the 2-min Demo

- [ROCmPort AI on AMD MI300X](https://youtu.be/your-link)

- ---

- ## ☁️ Run on AMD Cloud (Real MI300X)

  ```bash
- # Set environment for real hardware
- export ROCM_AVAILABLE=true
- export USE_VLLM=true

- # Deploy vLLM on MI300X
- docker run --gpus all -p 8000:8000 \
-   vllm/vllm:latest \
-   --model amd/llama-3.3-70b \
-   --gpu-memory-utilization 0.95

- # Start ROCmPort AI
- cd backend
- uvicorn main:app --host 0.0.0.0 --port 8000
  ```

- ---
-
- ## 🔧 Troubleshooting
-
- | Issue | Solution |
- |-------|----------|
- | **"GROQ_API_KEY not found"** | Add your API key to `.env` file from [console.groq.com](https://console.groq.com) |
- | **"hipcc not found"** | Install ROCm: `sudo apt install rocm-dkms` or use AMD Cloud |
- | **"Permission denied"** | Check file permissions: `chmod +x start.sh` |
- | **Frontend not loading** | Ensure backend is running on port 8000 |
- | **No speedup shown** | Check if `ROCM_AVAILABLE=true` for real hardware |
-
- ---

- ## 🎯 Why ROCmPort AI Wins This Hackathon

- 1. **Real Hardware Integration** - Actual MI300X benchmarking with rocprof, not mocked data
- 2. **Intelligent Agent Pipeline** - 5 specialized AI agents working in sequence with retry logic
- 3. **Trust Layer Verification** - Checksum verification ensures migrated code actually works
- 4. **Human Override Capability** - Developers can edit and re-test optimized code
- 5. **Cost Impact Analysis** - Shows real business value ($20k-$100k savings per module)
- 6. **Simple Mode Toggle** - "Explain Like I'm 5" makes complex concepts accessible
- 7. **Live SSE Streaming** - Real-time visibility into every agent decision
- 8. **GitHub PR Simulation** - One-click export with diffs and reports
- 9. **Predictive Analysis** - AI predicts performance gains before optimization
- 10. **Honest Performance Claims** - Compares optimized ROCm vs baseline HIP, not fabricated NVIDIA comparisons

- ---

- ## 👤 Creator

- **Tazwar Ahnaf Enan**
- AI Engineer & GPU Systems Builder

- [![X (Twitter)](https://img.shields.io/badge/X-@TazwarEnan-1DA1F2?style=flat-square&logo=x)](https://x.com/TazwarEnan)
- [![GitHub](https://img.shields.io/badge/GitHub-tazwaryayyyy-181717?style=flat-square&logo=github)](https://github.com/tazwaryayyyy)

- *Built with 🔥 for AMD Developer Hackathon 2026*
  # ROCmPort AI

+ ROCmPort AI helps CUDA teams migrate to AMD by translating, testing, and iteratively optimizing kernels using real hardware feedback.

+ It is an acceleration system for migration work, not a one-click replacement for CUDA expertise.

+ ## What This Project Is

+ ROCmPort AI orchestrates a migration loop:

+ 1. Analyze CUDA code and detect migration risks.
+ 2. Translate with hipify plus LLM-assisted fixes.
+ 3. Compile and profile with ROCm tooling.
+ 4. Propose optimization changes and re-test.
+ 5. Return artifacts and a decision trace.

+ ## What This Project Is Not

+ - Not guaranteed to auto-fix all CUDA kernels.
+ - Not a claim that every kernel improves.
+ - Not a replacement for domain experts in performance-critical code.

+ Complex kernels can fail conversion due to architecture assumptions, undefined behavior, inline PTX, or handcrafted memory logic. The value is reduced migration time and faster debug loops.

+ ## Target User and Business Case

+ Primary product position:
+ - Tool for teams evaluating AMD migration cost and performance tradeoffs.

+ Typical use cases:
+ - Port legacy CUDA modules to HIP/ROCm with a measurable baseline.
+ - Build a migration backlog ranked by risk and expected impact.
+ - Identify kernels where MI300X memory capacity can remove sharding complexity.

+ Cost and performance impact should be calculated from your environment and workload, not fixed marketing ranges.

+ ## AMD-Specific Technical Considerations (MI300X)

+ ROCmPort AI explicitly reasons about AMD constraints and opportunities, including:

+ - Wavefront size 64 (vs CUDA warp 32 assumptions), which affects reduction trees, ballot/shuffle idioms, and launch geometry.
+ - LDS (local data store) usage and bank behavior for tile staging and reuse.
+ - MI300X memory capacity (192GB HBM) and implications for reducing model/data sharding in some workflows.
+ - Memory access patterns and occupancy tradeoffs under ROCm compiler behavior.

+ These are the places where migration often breaks or underperforms even after a successful hipify pass.

+ ### Concrete Wavefront Mismatch Example

+ From `backend/demo_kernels/reduction.cu`, the reduction tail assumes a 32-thread warp:

+ ```cpp
+ // NVIDIA-style assumption (incorrect on AMD wavefront=64)
+ if (tid < 32) {
+     volatile float* vsmem = sdata;
+     vsmem[tid] += vsmem[tid + 32];
+     vsmem[tid] += vsmem[tid + 16];
+     ...
+ }
  ```

+ A wavefront-aware correction expands the final stage to include the 64-wide lane behavior:
+
+ ```cpp
+ // AMD-aware final reduction stage
+ if (tid < 64) {
+     volatile float* vsmem = sdata;
+     vsmem[tid] += vsmem[tid + 32];
+     if (tid < 32) {
+         vsmem[tid] += vsmem[tid + 16];
+         vsmem[tid] += vsmem[tid + 8];
+         vsmem[tid] += vsmem[tid + 4];
+         vsmem[tid] += vsmem[tid + 2];
+         vsmem[tid] += vsmem[tid + 1];
+     }
+ }
+ ```

+ The key point is not the exact rewrite shape; it is that warp-size assumptions must be made explicit and re-validated on AMD.

+ ## Why This Is More Than Glue

+ ROCmPort AI combines existing tools, but its core value is the control system around them:

+ - Decision loop: detect failures and performance regressions, apply the next strategy, re-run.
+ - Explainability: stream each step and rationale (SSE logs + final report).
+ - Verification: pair code changes with compile/test/profiler evidence.

+ ## Judge Mode Walkthrough

+ Use this flow for technical review:

+ 1. Show the original CUDA kernel.
+ 2. Show baseline HIP from straight hipify output.
+ 3. Run ROCmPort AI and show the per-agent trace.
+ 4. Show the final optimized HIP output.
+ 5. Show the measured result against the declared baseline.
+ 6. Show one case with marginal gain or no gain.

+ This format makes the comparison falsifiable and avoids curated-demo concerns.

+ - Full walkthrough: `docs/JUDGE_MODE.md`.

+ ## Documented Failure Case

+ At least one failure path is documented with source, output, root cause, and fix requirements:

+ - See `docs/FAILURE_CASES.md`.

+ This is intentional: credibility improves when the system's failure boundary is visible.

+ ## Quick Start

+ ### Option 1: Startup Script

+ ```bash
+ # Windows
+ start.bat

+ # Linux/Mac
+ ./start.sh
+ ```

+ ### Option 2: Manual

  ```bash
+ cd backend
+ pip install -r requirements.txt
+ cp .env.example .env
+ # add your GROQ_API_KEY
+ uvicorn main:app --reload --port 8000
  ```

+ Open `frontend/index.html` in a browser.

+ ### Option 3: Docker

  ```bash
+ docker build -t rocmport-ai .
+ docker run -p 8000:8000 rocmport-ai
  ```

+ ## Benchmarking and Reproducibility

+ Benchmark claims should always include:

+ - Baseline definition (e.g., straight hipify output).
+ - Hardware/software versions.
+ - Input sizes and run counts.
+ - Correctness verification.
+ - Full logs or scripts to reproduce.

+ See `BENCHMARKS.md` for the recommended reporting format used by this repository.

+ ## Project Structure

+ ```text
+ ROCmPort AI/
+ ├── backend/
+ │   ├── main.py
+ │   ├── models.py
+ │   ├── agents/
+ │   │   ├── analyzer.py
+ │   │   ├── translator.py
+ │   │   ├── optimizer.py
+ │   │   ├── tester.py
+ │   │   └── coordinator.py
+ │   ├── tools/
+ │   │   ├── hipify_wrapper.py
+ │   │   ├── rocprof_wrapper.py
+ │   │   └── llm_client.py
+ │   ├── demo_kernels/
+ │   └── prompts/
+ ├── frontend/
+ │   └── index.html
+ ├── BENCHMARKS.md
+ └── README.md
+ ```

+ ## Configuration

+ Copy `.env.example` to `.env`:

  ```bash
+ GROQ_API_KEY=your_key
+ GROQ_MODEL=llama-3.3-70b-versatile

+ USE_VLLM=true
+ VLLM_BASE_URL=http://your-amd-cloud:8000
+ VLLM_API_KEY=your_vllm_key
+ VLLM_MODEL=amd/llama-3.3-70b

+ ROCM_AVAILABLE=true
+ HIPCC_PATH=hipcc
+ ROCPROF_PATH=rocprof
  ```

+ ## Defensible Scope

+ This project is harder to replicate than a thin wrapper because it couples:

+ - Multi-agent orchestration with retry decisions.
+ - Structured traceability across analysis, translation, optimization, and test phases.
+ - Integrated reporting where claims can be audited against intermediate artifacts.

+ A basic weekend clone can chain hipify and an LLM. The differentiator is reliable decision flow and evidence quality under failure.

+ ## Troubleshooting

+ | Issue | Resolution |
+ |---|---|
+ | `GROQ_API_KEY not found` | Add key to `.env`. |
+ | `hipcc not found` | Install the ROCm toolchain or run in a ROCm-enabled environment. |
+ | Backend unavailable | Verify the FastAPI server is running on port `8000`. |
+ | No improvement observed | Re-check baseline definition, kernel size, and profiler counters. |

+ ## License

+ See `LICENSE`.
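The decision loop the README describes (detect a failure or regression, apply the next strategy, re-run) can be sketched in a few lines. This is an illustrative toy, not the coordinator's actual API; `profile` and `propose` are hypothetical callables standing in for the Tester and Optimizer agents:

```python
def optimize_with_retry(hip_code, profile, propose, max_iters=2):
    """Keep a candidate only when profiling shows a measured improvement;
    otherwise discard it and try the next strategy."""
    best_code, best_ms = hip_code, profile(hip_code)
    for iteration in range(1, max_iters + 1):
        candidate = propose(best_code, iteration)
        ms = profile(candidate)
        if ms < best_ms:  # accept only on measured improvement
            best_code, best_ms = candidate, ms
    return best_code, best_ms

# Toy trace: iteration 1 regresses (rejected), iteration 2 improves (kept).
timings = {"base": 12.4, "v1": 13.0, "v2": 9.5}
code, ms = optimize_with_retry("base", profile=timings.get, propose=lambda c, i: f"v{i}")
print(code, ms)  # → v2 9.5
```

The gate on measured latency is what makes the loop falsifiable: a proposed optimization that profiles worse is simply dropped, which is the behavior the controlled-failure demo exercises.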
backend/agents/analyzer.py CHANGED
@@ -1,24 +1,28 @@
- import json
- import re
- from models import AnalyzerResult, WorkloadType
- from tools.llm_client import LLMClient
- from tools.json_utils import safe_json_loads

  llm_client = LLMClient()

  def chat_complete(messages: list, temperature: float = 0.7, max_tokens: int = 4000) -> str:
      """Wrapper for LLM client chat completion"""
      return llm_client.chat_completion(messages, temperature=temperature, max_tokens=max_tokens)

  def generate_prediction(workload_type: WorkloadType, line_count: int) -> str:
      """Generate performance prediction based on workload analysis"""
      if workload_type == WorkloadType.MEMORY_BOUND:
-         return "🧠 Prediction: This kernel is memory-bound → HIGH potential gain on MI300X (5.3 TB/s vs H100 3.35 TB/s bandwidth)"
      elif workload_type == WorkloadType.COMPUTE_BOUND:
-         return "🧠 Prediction: This kernel is compute-bound → MODERATE gain on MI300X (wavefront efficiency improvements)"
      else:
          return "🧠 Prediction: Unknown workload type → LIMITED gain prediction without further analysis"

  SYSTEM_PROMPT = """You are an expert CUDA and GPU architecture engineer analyzing CUDA code before porting it to AMD ROCm/HIP.

  Your job is to deeply analyze CUDA code and output a structured JSON analysis. Be specific and technical.

@@ -53,7 +57,7 @@ Respond ONLY with this exact JSON structure, no markdown, no extra text:
  def run(cuda_code: str) -> AnalyzerResult:
      # Count lines for complexity estimation
      line_count = len([line for line in cuda_code.split('\n') if line.strip()])
-
      try:
          raw = chat_complete(
              messages=[

@@ -77,7 +81,7 @@ def run(cuda_code: str) -> AnalyzerResult:
              "line_count": line_count,
              "complexity_score": 5
          }
-
      workload_type = WorkloadType(data.get("workload_type", "unknown"))
      prediction = generate_prediction(workload_type, line_count)
+ # pylint: disable=broad-exception-caught
+
+ from ..models import AnalyzerResult, WorkloadType
+ from ..tools.llm_client import LLMClient
+ from ..tools.json_utils import safe_json_loads

  llm_client = LLMClient()

+
  def chat_complete(messages: list, temperature: float = 0.7, max_tokens: int = 4000) -> str:
      """Wrapper for LLM client chat completion"""
      return llm_client.chat_completion(messages, temperature=temperature, max_tokens=max_tokens)

+
  def generate_prediction(workload_type: WorkloadType, line_count: int) -> str:
      """Generate performance prediction based on workload analysis"""
+     size_hint = "large" if line_count and line_count > 200 else "small/medium"
      if workload_type == WorkloadType.MEMORY_BOUND:
+         return f"🧠 Prediction: This {size_hint} kernel is memory-bound → HIGH potential gain on MI300X (5.3 TB/s vs H100 3.35 TB/s bandwidth)"
      elif workload_type == WorkloadType.COMPUTE_BOUND:
+         return f"🧠 Prediction: This {size_hint} kernel is compute-bound → MODERATE gain on MI300X (wavefront efficiency improvements)"
      else:
          return "🧠 Prediction: Unknown workload type → LIMITED gain prediction without further analysis"

+
  SYSTEM_PROMPT = """You are an expert CUDA and GPU architecture engineer analyzing CUDA code before porting it to AMD ROCm/HIP.

  Your job is to deeply analyze CUDA code and output a structured JSON analysis. Be specific and technical.

  def run(cuda_code: str) -> AnalyzerResult:
      # Count lines for complexity estimation
      line_count = len([line for line in cuda_code.split('\n') if line.strip()])
+
      try:
          raw = chat_complete(
              messages=[

              "line_count": line_count,
              "complexity_score": 5
          }
+
      workload_type = WorkloadType(data.get("workload_type", "unknown"))
      prediction = generate_prediction(workload_type, line_count)
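The analyzer's imports pull in `safe_json_loads`, which it uses to survive imperfect LLM output before falling back to a default analysis dict. A hedged sketch of that pattern, assuming it strips markdown fences and extracts the first JSON object on failure (the repository's actual implementation may differ):

```python
import json
import re

def safe_json_loads(raw, fallback=None):
    """Parse LLM output as JSON, tolerating markdown fences and extra prose."""
    # Drop leading/trailing ``` or ```json fence lines.
    text = re.sub(r"^```(?:json)?|```$", "", raw.strip(), flags=re.MULTILINE).strip()
    try:
        return json.loads(text)
    except json.JSONDecodeError:
        # Fall back to the first {...} span if the reply includes extra text.
        match = re.search(r"\{.*\}", text, flags=re.DOTALL)
        if match:
            try:
                return json.loads(match.group(0))
            except json.JSONDecodeError:
                pass
    return fallback

print(safe_json_loads('```json\n{"workload_type": "memory_bound"}\n```'))
```

Returning a caller-supplied fallback instead of raising matches the analyzer's broad-exception style: a malformed LLM reply degrades to the default dict rather than aborting the pipeline.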
backend/agents/coordinator.py CHANGED
@@ -1,202 +1,224 @@
  import asyncio
  from typing import AsyncGenerator
- from models import (
-     AgentEvent, AgentStatus, AnalyzerResult, TranslatorResult,
-     OptimizerResult, TesterResult, FinalReport, WorkloadType, CostEstimate
  )
- from agents import analyzer, translator, optimizer, tester


  def calculate_cost_estimate(analyzer_result: AnalyzerResult) -> CostEstimate:
-     """Calculate cost impact estimate based on code complexity"""
-     line_count = analyzer_result.line_count or 100
      complexity = analyzer_result.complexity_score or 5
-
      if complexity <= 3:
          manual_weeks = "1-2 weeks"
          savings = "$5,000-$10,000"
          factor = "Low"
      elif complexity <= 7:
-         manual_weeks = "3-6 weeks"
          savings = "$20,000-$50,000"
          factor = "Medium"
      else:
          manual_weeks = "6-10 weeks"
          savings = "$50,000-$100,000"
          factor = "High"
-
      return CostEstimate(
          manual_porting_weeks=manual_weeks,
-         rocmport_minutes="5 minutes",
          estimated_savings=savings,
-         complexity_factor=factor
      )


  def simplify_explanation(report: FinalReport) -> str:
-     """Convert technical explanations to simple language for "Explain Like I'm 5" mode"""
      simple_text = report.amd_advantage_explanation
-
-     # Replace technical terms with simple, natural explanations
-     simple_text = simple_text.replace("5.3 TB/s memory bandwidth", "much faster memory access")
      simple_text = simple_text.replace("3.35 TB/s", "slower memory access")
-     simple_text = simple_text.replace("memory-bound", "needs to move a lot of data")
-     simple_text = simple_text.replace("compute-bound", "does a lot of calculations")
-     simple_text = simple_text.replace("wavefront", "group of threads working together")
-     simple_text = simple_text.replace("shared memory tiling", "shares data between threads efficiently")
      simple_text = simple_text.replace("coalescing", "accesses memory in order")
      simple_text = simple_text.replace("optimization", "improvement")
      simple_text = simple_text.replace("performance", "speed")
      simple_text = simple_text.replace("benchmark", "test")
      simple_text = simple_text.replace("iteration", "try")
-
-     # Make sentences more natural
      simple_text = simple_text.replace("This kernel is", "This code is")
      simple_text = simple_text.replace("The optimization", "The improvement")
      simple_text = simple_text.replace("achieves", "gets")
      simple_text = simple_text.replace("demonstrates", "shows")
-
      return simple_text


- async def run_pipeline(cuda_code: str, kernel_name: str = "custom", simple_mode: bool = False) -> AsyncGenerator[AgentEvent, None]:
-     """
-     Full agent pipeline. Yields AgentEvent objects as SSE data.
-     Coordinator handles the retry loop when Tester fails iteration 1.
-     """

-     # ─── ANALYZER ───────────────────────────────────────────────
-     yield AgentEvent(agent="analyzer", status=AgentStatus.RUNNING,
-                      message="Scanning CUDA code for kernels, APIs, and hardware-specific issues...")

      try:
          analyzer_result: AnalyzerResult = await asyncio.to_thread(analyzer.run, cuda_code)
      except Exception as e:
-         yield AgentEvent(agent="analyzer", status=AgentStatus.FAILED,
-                          message="Analysis failed", detail=str(e))
          return

-     detail_parts = [f"Found {len(analyzer_result.kernels_found)} kernel(s): {', '.join(analyzer_result.kernels_found)}"]
-     detail_parts.append(f"Workload: {analyzer_result.workload_type.value}")
-     detail_parts.append(f"Difficulty: {analyzer_result.difficulty} — {analyzer_result.difficulty_reason}")

      if analyzer_result.warp_size_issue:
-         detail_parts.append(f"⚠️ WARP SIZE ISSUE: {analyzer_result.warp_size_detail}")
-
      if analyzer_result.sharding_detected:
-         detail_parts.append("⚠️ Multi-GPU sharding detected — unnecessary on MI300X (192GB)")
-
-     # Add prediction if available
      if analyzer_result.prediction:
          detail_parts.append(analyzer_result.prediction)

-     # Calculate cost estimate
-     try:
-         cost_estimate = calculate_cost_estimate(analyzer_result)
-     except Exception as e:
-         # Fallback cost estimate if calculation fails
-         cost_estimate = CostEstimate(
-             manual_porting_weeks="3-6 weeks",
-             rocmport_minutes="5 minutes",
-             estimated_savings="$20,000-$50,000",
-             complexity_factor="Medium"
-         )
-
-     yield AgentEvent(agent="analyzer", status=AgentStatus.DONE,
-                      message=f"Found {len(analyzer_result.kernels_found)} kernel(s) | {analyzer_result.workload_type.value} workload | Difficulty: {analyzer_result.difficulty}",
-                      detail="\n".join(detail_parts))
-
-     # ─── TRANSLATOR ──────────────────────────────────────────────
-     yield AgentEvent(agent="translator", status=AgentStatus.RUNNING,
-                      message="Running hipify-clang (pass 1) then LLM correction (pass 2)...")

-     # Processing...

      try:
-         translator_result: TranslatorResult = await asyncio.to_thread(
-             translator.run, cuda_code, analyzer_result
-         )
      except Exception as e:
-         yield AgentEvent(agent="translator", status=AgentStatus.FAILED,
-                          message="Translation failed", detail=str(e))
          return

-     detail = (
-         f"Total changes: {translator_result.total_changes} "
-         f"({translator_result.hipify_changes} hipify, {translator_result.llm_changes} LLM)\n"
-         f"Warp size corrected: {analyzer_result.warp_size_issue}\n"
-         f"Kernel launch syntax updated"
      )

-     yield AgentEvent(agent="translator", status=AgentStatus.DONE,
-                      message=f"{translator_result.total_changes} changes ({translator_result.hipify_changes} hipify + {translator_result.llm_changes} LLM)",
-                      detail=detail)
-
-     # ─── OPTIMIZER (iteration 1) ──────────────────────────────────
-     yield AgentEvent(agent="optimizer", status=AgentStatus.RUNNING,
-                      message="Applying AMD MI300X-specific optimizations (iteration 1)...")
-
-     # Processing...
-
      try:
          optimizer_result: OptimizerResult = await asyncio.to_thread(
-             optimizer.run, translator_result.hip_code, analyzer_result, 1
          )
      except Exception as e:
-         yield AgentEvent(agent="optimizer", status=AgentStatus.FAILED,
-                          message="Optimization failed", detail=str(e))
148
  return
149
 
150
- changes_text = "\n".join(
151
- f"• {c['description']}" for c in optimizer_result.changes
 
 
 
 
152
  )
153
- yield AgentEvent(agent="optimizer", status=AgentStatus.DONE,
154
- message=f"{len(optimizer_result.changes)} optimization(s) applied",
155
- detail=changes_text)
156
 
157
- # ─── TESTER (iteration 1) ────────────────────────────────────
158
- yield AgentEvent(agent="tester", status=AgentStatus.RUNNING,
159
- message="Compiling with hipcc and profiling with rocprof (iteration 1)...")
160
-
161
- # Testing...
162
 
163
  try:
164
  tester_result_1: TesterResult = await asyncio.to_thread(
165
- tester.run, optimizer_result.optimized_code, analyzer_result, 1, kernel_name
 
 
 
 
166
  )
167
  except Exception as e:
168
- yield AgentEvent(agent="tester", status=AgentStatus.FAILED,
169
- message="Testing failed", detail=str(e))
170
  return
171
 
172
  if not tester_result_1.success:
173
- yield AgentEvent(agent="tester", status=AgentStatus.FAILED,
174
- message="Compilation failed — using cached benchmark",
175
- detail=tester_result_1.notes)
 
 
 
176
  return
177
 
178
- # ─── CONTROLLED FAILURE → RETRY LOOP ─────────────────────────
179
  if tester_result_1.speedup < 1.0:
180
  yield AgentEvent(
181
- agent="tester", status=AgentStatus.FAILED,
182
- message=f"❌ Iteration 1: {tester_result_1.speedup}x — worse than baseline HIP",
183
- detail=f"Bandwidth utilized: {tester_result_1.bandwidth_utilized}%\n{tester_result_1.notes}"
 
 
 
 
184
  )
185
 
186
  yield AgentEvent(
187
- agent="coordinator", status=AgentStatus.RUNNING,
188
- message="Performance degraded — re-running Optimizer with profiler feedback...",
189
- detail=f"Profiler says: {tester_result_1.notes}\nSwitching optimization strategy."
 
190
  )
191
 
192
- # Testing...
193
-
194
- # Optimizer iteration 2 with profiler feedback
195
- yield AgentEvent(agent="optimizer", status=AgentStatus.RETRYING,
196
- message="Trying alternative optimization strategy (iteration 2)...",
197
- detail=f"Previous strategy caused regression. Profiler feedback: {tester_result_1.notes}")
198
-
199
- # Trace: Optimizer v2
200
 
201
  try:
202
  optimizer_result_2: OptimizerResult = await asyncio.to_thread(
@@ -204,31 +226,36 @@ async def run_pipeline(cuda_code: str, kernel_name: str = "custom", simple_mode:
204
  translator_result.hip_code,
205
  analyzer_result,
206
  2,
207
- tester_result_1.notes
208
  )
209
  except Exception as e:
210
- yield AgentEvent(agent="optimizer", status=AgentStatus.FAILED,
211
- message="Re-optimization failed", detail=str(e))
212
  return
213
 
214
- changes_text_2 = "\n".join(f"• {c['description']}" for c in optimizer_result_2.changes)
215
- yield AgentEvent(agent="optimizer", status=AgentStatus.DONE,
216
- message=f"Alternative strategy: {len(optimizer_result_2.changes)} change(s) applied",
217
- detail=changes_text_2)
218
-
219
- # Tester iteration 2
220
- yield AgentEvent(agent="tester", status=AgentStatus.RUNNING,
221
- message="Re-profiling with alternative optimization (iteration 2)...")
222
 
223
- # Testing...
 
 
 
 
224
 
225
  try:
226
  tester_result_final: TesterResult = await asyncio.to_thread(
227
- tester.run, optimizer_result_2.optimized_code, analyzer_result, 2, kernel_name
 
 
 
 
228
  )
229
  except Exception as e:
230
- yield AgentEvent(agent="tester", status=AgentStatus.FAILED,
231
- message="Re-testing failed", detail=str(e))
232
  return
233
 
234
  final_optimizer = optimizer_result_2
@@ -236,50 +263,45 @@ async def run_pipeline(cuda_code: str, kernel_name: str = "custom", simple_mode:
236
  tester_result_final = tester_result_1
237
  final_optimizer = optimizer_result
238
 
239
- # ─── TESTER FINAL RESULT ─────────────────────────────────────
240
  yield AgentEvent(
241
  agent="tester",
242
  status=AgentStatus.DONE,
243
- message=f"Iteration {tester_result_final.iteration}: {tester_result_final.speedup}x faster than baseline HIP",
244
  detail=(
245
  f"Execution time: {tester_result_final.execution_ms:.1f}ms\n"
246
  f"Memory bandwidth: {tester_result_final.bandwidth_utilized:.1f}% utilized\n"
247
  f"Bottleneck type: {tester_result_final.bottleneck}\n"
248
  f"{tester_result_final.notes}"
249
- )
250
  )
251
 
252
- # ─── COORDINATOR FINAL REPORT ────────────────────────────────
253
- yield AgentEvent(agent="coordinator", status=AgentStatus.RUNNING,
254
- message="Generating migration report...")
255
 
256
- # Processing...
 
257
 
258
- amd_explanation = _build_amd_explanation(analyzer_result, tester_result_final)
259
-
260
- # Calculate cost estimate
261
  try:
262
  cost_estimate = calculate_cost_estimate(analyzer_result)
263
- except Exception as e:
264
- # Fallback cost estimate if calculation fails
265
  cost_estimate = CostEstimate(
266
  manual_porting_weeks="3-6 weeks",
267
- rocmport_minutes="5 minutes",
268
  estimated_savings="$20,000-$50,000",
269
- complexity_factor="Medium"
270
  )
271
-
272
- # Always generate simplified explanation
273
  temp_report = FinalReport(
274
  migration_success=True,
275
  speedup=tester_result_final.speedup,
276
  bandwidth_utilized=tester_result_final.bandwidth_utilized,
277
- total_changes=translator_result.total_changes + len(final_optimizer.changes),
 
278
  bottleneck=tester_result_final.bottleneck,
279
  amd_advantage_explanation=amd_explanation,
280
  iterations=tester_result_final.iteration,
281
  hip_code=translator_result.hip_code,
282
  optimized_code=final_optimizer.optimized_code,
 
283
  )
284
  simplified_explanation = simplify_explanation(temp_report)
285
 
@@ -287,36 +309,34 @@ async def run_pipeline(cuda_code: str, kernel_name: str = "custom", simple_mode:
287
  migration_success=True,
288
  speedup=tester_result_final.speedup,
289
  bandwidth_utilized=tester_result_final.bandwidth_utilized,
290
- total_changes=translator_result.total_changes + len(final_optimizer.changes),
 
291
  bottleneck=tester_result_final.bottleneck,
292
  amd_advantage_explanation=amd_explanation,
293
  iterations=tester_result_final.iteration,
294
  hip_code=translator_result.hip_code,
295
  optimized_code=final_optimizer.optimized_code,
 
296
  cost_estimate=cost_estimate,
297
- simplified_explanation=simplified_explanation
298
  )
299
 
300
- import json
301
  yield AgentEvent(
302
  agent="coordinator",
303
  status=AgentStatus.DONE,
304
  message="Migration complete",
305
- detail=json.dumps(report.model_dump())
306
  )
307
 
308
 
309
  def _build_amd_explanation(analyzer_result: AnalyzerResult, tester_result: TesterResult) -> str:
310
  if analyzer_result.workload_type == WorkloadType.MEMORY_BOUND:
311
  return (
312
- f"This is a memory-bound kernel performance scales with memory bandwidth. "
313
- f"MI300X delivers 5.3 TB/s vs H100's 3.35 TB/s (58% more bandwidth). "
314
- f"After optimization, bandwidth utilization reached {tester_result.bandwidth_utilized:.0f}%, "
315
- f"meaning this workload extracts full value from AMD's memory architecture."
316
- )
317
- else:
318
- return (
319
- f"This is a compute-bound kernel. MI300X delivers 1.3 PFLOPS FP16 "
320
- f"vs H100's 989 TFLOPS — 31% more raw throughput. "
321
- f"After wavefront-aligned optimization, compute utilization improved significantly."
322
  )
 
 
 
 
 
1
  import asyncio
2
+ import json
3
  from typing import AsyncGenerator
4
+
5
+ # pylint: disable=broad-exception-caught
6
+
7
+ from . import analyzer, optimizer, tester, translator
8
+ from ..models import (
9
+ AgentEvent,
10
+ AgentStatus,
11
+ AnalyzerResult,
12
+ CostEstimate,
13
+ FinalReport,
14
+ OptimizerResult,
15
+ TesterResult,
16
+ TranslatorResult,
17
+ WorkloadType,
18
  )
 
19
 
20
 
21
  def calculate_cost_estimate(analyzer_result: AnalyzerResult) -> CostEstimate:
22
+ """Calculate cost impact estimate based on code complexity."""
 
23
  complexity = analyzer_result.complexity_score or 5
24
+
25
  if complexity <= 3:
26
  manual_weeks = "1-2 weeks"
27
  savings = "$5,000-$10,000"
28
  factor = "Low"
29
  elif complexity <= 7:
30
+ manual_weeks = "3-6 weeks"
31
  savings = "$20,000-$50,000"
32
  factor = "Medium"
33
  else:
34
  manual_weeks = "6-10 weeks"
35
  savings = "$50,000-$100,000"
36
  factor = "High"
37
+
38
  return CostEstimate(
39
  manual_porting_weeks=manual_weeks,
40
+ rocmport_minutes="Varies by kernel",
41
  estimated_savings=savings,
42
+ complexity_factor=factor,
43
  )
44
 
45
 
46
  def simplify_explanation(report: FinalReport) -> str:
47
+ """Convert technical explanation to simpler wording for explain mode."""
48
  simple_text = report.amd_advantage_explanation
49
+
50
+ simple_text = simple_text.replace(
51
+ "5.3 TB/s memory bandwidth", "much faster memory access")
52
  simple_text = simple_text.replace("3.35 TB/s", "slower memory access")
53
+ simple_text = simple_text.replace(
54
+ "memory-bound", "needs to move a lot of data")
55
+ simple_text = simple_text.replace(
56
+ "compute-bound", "does a lot of calculations")
57
+ simple_text = simple_text.replace(
58
+ "wavefront", "group of threads working together")
59
+ simple_text = simple_text.replace(
60
+ "shared memory tiling", "shares data between threads efficiently")
61
  simple_text = simple_text.replace("coalescing", "accesses memory in order")
62
  simple_text = simple_text.replace("optimization", "improvement")
63
  simple_text = simple_text.replace("performance", "speed")
64
  simple_text = simple_text.replace("benchmark", "test")
65
  simple_text = simple_text.replace("iteration", "try")
66
+
 
67
  simple_text = simple_text.replace("This kernel is", "This code is")
68
  simple_text = simple_text.replace("The optimization", "The improvement")
69
  simple_text = simple_text.replace("achieves", "gets")
70
  simple_text = simple_text.replace("demonstrates", "shows")
 
71
  return simple_text
72
 
73
 
74
+ async def run_pipeline(
75
+ cuda_code: str,
76
+ kernel_name: str = "custom",
77
+ simple_mode: bool = False,
78
+ ) -> AsyncGenerator[AgentEvent, None]:
79
+ """Run full pipeline and stream AgentEvent objects."""
80
+ _ = simple_mode  # currently unused; a simplified explanation is always generated
81
 
82
+ yield AgentEvent(
83
+ agent="analyzer",
84
+ status=AgentStatus.RUNNING,
85
+ message="Scanning CUDA code for kernels, APIs, and hardware-specific issues...",
86
+ )
87
 
88
  try:
89
  analyzer_result: AnalyzerResult = await asyncio.to_thread(analyzer.run, cuda_code)
90
  except Exception as e:
91
+ yield AgentEvent(agent="analyzer", status=AgentStatus.FAILED, message="Analysis failed", detail=str(e))
 
92
  return
93
 
94
+ detail_parts = [
95
+ f"Found {len(analyzer_result.kernels_found)} kernel(s): {', '.join(analyzer_result.kernels_found)}",
96
+ f"Workload: {analyzer_result.workload_type.value}",
97
+ f"Difficulty: {analyzer_result.difficulty} - {analyzer_result.difficulty_reason}",
98
+ ]
99
 
100
  if analyzer_result.warp_size_issue:
101
+ detail_parts.append(
102
+ f"WARP SIZE ISSUE: {analyzer_result.warp_size_detail}")
103
  if analyzer_result.sharding_detected:
104
+ detail_parts.append(
105
+ "Multi-GPU sharding detected; review whether it is still needed given MI300X's memory capacity.")
 
106
  if analyzer_result.prediction:
107
  detail_parts.append(analyzer_result.prediction)
108
 
109
+ yield AgentEvent(
110
+ agent="analyzer",
111
+ status=AgentStatus.DONE,
112
+ message=(
113
+ f"Found {len(analyzer_result.kernels_found)} kernel(s) | "
114
+ f"{analyzer_result.workload_type.value} workload | Difficulty: {analyzer_result.difficulty}"
115
+ ),
116
+ detail="\n".join(detail_parts),
117
+ )
 
118
 
119
+ yield AgentEvent(
120
+ agent="translator",
121
+ status=AgentStatus.RUNNING,
122
+ message="Running hipify-clang (pass 1) then LLM correction (pass 2)...",
123
+ )
124
 
125
  try:
126
+ translator_result: TranslatorResult = await asyncio.to_thread(translator.run, cuda_code, analyzer_result)
 
 
127
  except Exception as e:
128
+ yield AgentEvent(agent="translator", status=AgentStatus.FAILED, message="Translation failed", detail=str(e))
 
129
  return
130
 
131
+ yield AgentEvent(
132
+ agent="translator",
133
+ status=AgentStatus.DONE,
134
+ message=(
135
+ f"{translator_result.total_changes} changes "
136
+ f"({translator_result.hipify_changes} hipify + {translator_result.llm_changes} LLM)"
137
+ ),
138
+ detail=(
139
+ f"Total changes: {translator_result.total_changes} "
140
+ f"({translator_result.hipify_changes} hipify, {translator_result.llm_changes} LLM)\n"
141
+ f"Warp size corrected: {analyzer_result.warp_size_issue}\n"
142
+ "Kernel launch syntax updated"
143
+ ),
144
  )
145
 
146
+ yield AgentEvent(
147
+ agent="optimizer",
148
+ status=AgentStatus.RUNNING,
149
+ message="Applying AMD MI300X-specific optimizations (iteration 1)...",
150
+ )
 
 
 
 
151
 
152
  try:
153
  optimizer_result: OptimizerResult = await asyncio.to_thread(
154
+ optimizer.run,
155
+ translator_result.hip_code,
156
+ analyzer_result,
157
+ 1,
158
  )
159
  except Exception as e:
160
+ yield AgentEvent(agent="optimizer", status=AgentStatus.FAILED, message="Optimization failed", detail=str(e))
 
161
  return
162
 
163
+ yield AgentEvent(
164
+ agent="optimizer",
165
+ status=AgentStatus.DONE,
166
+ message=f"{len(optimizer_result.changes)} optimization(s) applied",
167
+ detail="\n".join(
168
+ f"- {c['description']}" for c in optimizer_result.changes),
169
  )
 
 
 
170
 
171
+ yield AgentEvent(
172
+ agent="tester",
173
+ status=AgentStatus.RUNNING,
174
+ message="Compiling with hipcc and profiling with rocprof (iteration 1)...",
175
+ )
176
 
177
  try:
178
  tester_result_1: TesterResult = await asyncio.to_thread(
179
+ tester.run,
180
+ optimizer_result.optimized_code,
181
+ analyzer_result,
182
+ 1,
183
+ kernel_name,
184
  )
185
  except Exception as e:
186
+ yield AgentEvent(agent="tester", status=AgentStatus.FAILED, message="Testing failed", detail=str(e))
 
187
  return
188
 
189
  if not tester_result_1.success:
190
+ yield AgentEvent(
191
+ agent="tester",
192
+ status=AgentStatus.FAILED,
193
+ message="Compilation or profiling failed",
194
+ detail=tester_result_1.notes,
195
+ )
196
  return
197
 
 
198
  if tester_result_1.speedup < 1.0:
199
  yield AgentEvent(
200
+ agent="tester",
201
+ status=AgentStatus.FAILED,
202
+ message=f"Iteration 1: {tester_result_1.speedup}x vs baseline HIP (regression)",
203
+ detail=(
204
+ f"Bandwidth utilized: {tester_result_1.bandwidth_utilized}%\n"
205
+ f"{tester_result_1.notes}"
206
+ ),
207
  )
208
 
209
  yield AgentEvent(
210
+ agent="coordinator",
211
+ status=AgentStatus.RUNNING,
212
+ message="Performance regressed; retrying optimizer with profiler feedback...",
213
+ detail=f"Profiler feedback: {tester_result_1.notes}",
214
  )
215
 
216
+ yield AgentEvent(
217
+ agent="optimizer",
218
+ status=AgentStatus.RETRYING,
219
+ message="Trying alternative optimization strategy (iteration 2)...",
220
+ detail=f"Previous strategy regressed. Feedback: {tester_result_1.notes}",
221
+ )
 
 
222
 
223
  try:
224
  optimizer_result_2: OptimizerResult = await asyncio.to_thread(
 
226
  translator_result.hip_code,
227
  analyzer_result,
228
  2,
229
+ tester_result_1.notes,
230
  )
231
  except Exception as e:
232
+ yield AgentEvent(agent="optimizer", status=AgentStatus.FAILED, message="Re-optimization failed", detail=str(e))
 
233
  return
234
 
235
+ yield AgentEvent(
236
+ agent="optimizer",
237
+ status=AgentStatus.DONE,
238
+ message=f"Alternative strategy: {len(optimizer_result_2.changes)} change(s) applied",
239
+ detail="\n".join(
240
+ f"- {c['description']}" for c in optimizer_result_2.changes),
241
+ )
 
242
 
243
+ yield AgentEvent(
244
+ agent="tester",
245
+ status=AgentStatus.RUNNING,
246
+ message="Re-profiling with alternative optimization (iteration 2)...",
247
+ )
248
 
249
  try:
250
  tester_result_final: TesterResult = await asyncio.to_thread(
251
+ tester.run,
252
+ optimizer_result_2.optimized_code,
253
+ analyzer_result,
254
+ 2,
255
+ kernel_name,
256
  )
257
  except Exception as e:
258
+ yield AgentEvent(agent="tester", status=AgentStatus.FAILED, message="Re-testing failed", detail=str(e))
 
259
  return
260
 
261
  final_optimizer = optimizer_result_2
 
263
  tester_result_final = tester_result_1
264
  final_optimizer = optimizer_result
265
 
 
266
  yield AgentEvent(
267
  agent="tester",
268
  status=AgentStatus.DONE,
269
+ message=f"Iteration {tester_result_final.iteration}: {tester_result_final.speedup}x vs baseline HIP",
270
  detail=(
271
  f"Execution time: {tester_result_final.execution_ms:.1f}ms\n"
272
  f"Memory bandwidth: {tester_result_final.bandwidth_utilized:.1f}% utilized\n"
273
  f"Bottleneck type: {tester_result_final.bottleneck}\n"
274
  f"{tester_result_final.notes}"
275
+ ),
276
  )
277
 
278
+ yield AgentEvent(agent="coordinator", status=AgentStatus.RUNNING, message="Generating migration report...")
 
 
279
 
280
+ amd_explanation = _build_amd_explanation(
281
+ analyzer_result, tester_result_final)
282
 
 
 
 
283
  try:
284
  cost_estimate = calculate_cost_estimate(analyzer_result)
285
+ except Exception:
 
286
  cost_estimate = CostEstimate(
287
  manual_porting_weeks="3-6 weeks",
288
+ rocmport_minutes="Varies by kernel",
289
  estimated_savings="$20,000-$50,000",
290
+ complexity_factor="Medium",
291
  )
292
+
 
293
  temp_report = FinalReport(
294
  migration_success=True,
295
  speedup=tester_result_final.speedup,
296
  bandwidth_utilized=tester_result_final.bandwidth_utilized,
297
+ total_changes=translator_result.total_changes +
298
+ len(final_optimizer.changes),
299
  bottleneck=tester_result_final.bottleneck,
300
  amd_advantage_explanation=amd_explanation,
301
  iterations=tester_result_final.iteration,
302
  hip_code=translator_result.hip_code,
303
  optimized_code=final_optimizer.optimized_code,
304
+ verification=tester_result_final.verification,
305
  )
306
  simplified_explanation = simplify_explanation(temp_report)
307
 
 
309
  migration_success=True,
310
  speedup=tester_result_final.speedup,
311
  bandwidth_utilized=tester_result_final.bandwidth_utilized,
312
+ total_changes=translator_result.total_changes +
313
+ len(final_optimizer.changes),
314
  bottleneck=tester_result_final.bottleneck,
315
  amd_advantage_explanation=amd_explanation,
316
  iterations=tester_result_final.iteration,
317
  hip_code=translator_result.hip_code,
318
  optimized_code=final_optimizer.optimized_code,
319
+ verification=tester_result_final.verification,
320
  cost_estimate=cost_estimate,
321
+ simplified_explanation=simplified_explanation,
322
  )
323
 
 
324
  yield AgentEvent(
325
  agent="coordinator",
326
  status=AgentStatus.DONE,
327
  message="Migration complete",
328
+ detail=json.dumps(report.model_dump()),
329
  )
330
 
331
 
332
  def _build_amd_explanation(analyzer_result: AnalyzerResult, tester_result: TesterResult) -> str:
333
  if analyzer_result.workload_type == WorkloadType.MEMORY_BOUND:
334
  return (
335
+ "This is a memory-bound kernel; performance scales with memory bandwidth. "
336
+ "MI300X provides higher memory bandwidth than H100-class hardware, and this workload "
337
+ f"reached {tester_result.bandwidth_utilized:.0f}% utilization after optimization."
 
338
  )
339
+ return (
340
+ "This is a compute-bound kernel; launch geometry and wavefront-aware tuning are key drivers. "
341
+ "After optimization, compute utilization and execution characteristics improved."
342
+ )
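The coordinator changes above implement a regression-retry loop: iteration 1 is profiled, and if speedup drops below 1.0x the optimizer is re-run once with the profiler's notes before re-testing. A minimal standalone sketch of that pattern, using illustrative stand-ins rather than the project's real `OptimizerResult`/`TesterResult` models:

```python
import asyncio
from dataclasses import dataclass


@dataclass
class ProfileResult:
    speedup: float  # relative to baseline HIP
    notes: str      # profiler feedback fed into the retry


async def optimize_and_profile(strategy: str) -> ProfileResult:
    # Hypothetical outcomes: the first strategy regresses, the fallback improves.
    if strategy == "tiling_v1":
        return ProfileResult(0.85, "bandwidth underutilized")
    return ProfileResult(1.25, "coalescing fixed")


async def run_with_retry() -> ProfileResult:
    first = await optimize_and_profile("tiling_v1")
    if first.speedup >= 1.0:
        return first
    # One bounded retry, passing the profiler notes to the alternative strategy.
    return await optimize_and_profile(f"fallback:{first.notes}")


final = asyncio.run(run_with_retry())
```

Bounding the loop to a single retry, as the pipeline does, keeps the worst-case latency predictable instead of iterating until convergence.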
backend/agents/optimizer.py CHANGED
@@ -1,15 +1,17 @@
1
- import json
2
- import re
3
- from models import OptimizerResult, AnalyzerResult, WorkloadType
4
- from tools.llm_client import LLMClient
5
- from tools.json_utils import safe_json_loads
6
 
7
  llm_client = LLMClient()
8
 
 
9
  def chat_complete(messages: list, temperature: float = 0.7, max_tokens: int = 4000) -> str:
10
  """Wrapper for LLM client chat completion"""
11
  return llm_client.chat_completion(messages, temperature=temperature, max_tokens=max_tokens)
12
 
 
13
  ALLOWED_OPTIMIZATIONS = """
14
  You may ONLY suggest these specific, well-known AMD MI300X optimizations:
15
  1. Shared memory tiling: Replace naive global memory access with 32x32 shared memory tiles (__shared__)
 
1
+ # pylint: disable=broad-exception-caught
2
+
3
+ from ..models import OptimizerResult, AnalyzerResult, WorkloadType
4
+ from ..tools.llm_client import LLMClient
5
+ from ..tools.json_utils import safe_json_loads
6
 
7
  llm_client = LLMClient()
8
 
9
+
10
  def chat_complete(messages: list, temperature: float = 0.7, max_tokens: int = 4000) -> str:
11
  """Wrapper for LLM client chat completion"""
12
  return llm_client.chat_completion(messages, temperature=temperature, max_tokens=max_tokens)
13
 
14
+
15
  ALLOWED_OPTIMIZATIONS = """
16
  You may ONLY suggest these specific, well-known AMD MI300X optimizations:
17
  1. Shared memory tiling: Replace naive global memory access with 32x32 shared memory tiles (__shared__)
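The tester.py diff below replaces a list-based output checksum with a checksum computed over the code text itself. The truncated-SHA-256 idea can be shown in isolation; the function name here is illustrative, not the project's API:

```python
import hashlib


def short_code_checksum(code_text: str, sample_size: int = 400) -> str:
    # Truncated SHA-256 of a code prefix: a stable traceability ID, not a security hash.
    if not code_text:
        return "empty"
    sample = code_text[:sample_size]
    return hashlib.sha256(sample.encode()).hexdigest()[:32]


print(short_code_checksum(""))                              # -> empty
print(len(short_code_checksum("__global__ void k() {}")))   # -> 32
```

Hashing only a fixed-size prefix keeps the cost constant for large kernels, at the price of not distinguishing inputs that differ only beyond `sample_size` characters.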
backend/agents/tester.py CHANGED
@@ -1,10 +1,7 @@
1
  import os
2
- import subprocess
3
- import tempfile
4
- import random
5
  import hashlib
6
- from models import TesterResult, AnalyzerResult, WorkloadType, VerificationResult
7
- from tools.rocprof_wrapper import RocprofWrapper
8
 
9
  # Set ROCM_AVAILABLE=true on AMD Cloud
10
  ROCM_AVAILABLE = os.environ.get("ROCM_AVAILABLE", "false").lower() == "true"
@@ -19,27 +16,23 @@ DEMO_KERNEL_CHECKSUMS = {
19
  }
20
 
21
 
22
- def compute_output_checksum(output_data: list, sample_size: int = 100) -> str:
23
- """Compute checksum of first N elements of output data"""
24
- if not output_data:
25
  return "empty"
26
-
27
- # Take first sample_size elements or all if less
28
- sample = output_data[:min(sample_size, len(output_data))]
29
-
30
- # Convert to string and compute SHA256
31
- sample_str = ','.join([str(x) for x in sample])
32
- return hashlib.sha256(sample_str.encode()).hexdigest()[:32]
33
 
34
 
35
  def verify_demo_kernel(kernel_name: str, optimized_code: str) -> VerificationResult:
36
  """Verify demo kernel execution and output correctness"""
37
  expected = DEMO_KERNEL_CHECKSUMS.get(kernel_name, "mock_checksum")
38
- actual = compute_output_checksum(optimized_code)
39
-
40
  # In mock mode, indicate this is simulated verification
41
  is_mock = not ROCM_AVAILABLE
42
-
43
  verification = VerificationResult(
44
  compiled_successfully=True,
45
  executed_without_error=True,
@@ -48,18 +41,12 @@ def verify_demo_kernel(kernel_name: str, optimized_code: str) -> VerificationRes
48
  actual_checksum=actual,
49
  mock_mode=is_mock
50
  )
51
-
52
- # For demo purposes, simulate verification
53
- if kernel_name in DEMO_KERNEL_CHECKSUMS:
54
- # Simulate successful verification on iteration 2, failed on iteration 1
55
- import time
56
- current_time = int(time.time())
57
- if current_time % 2 == 0: # Simulate alternating success/failure
58
- verification.output_matches_expected = True
59
- verification.checksum_computed = DEMO_KERNEL_CHECKSUMS[kernel_name]
60
- else:
61
- verification.checksum_computed = "wrong_checksum_demo"
62
-
63
  return verification
64
 
65
 
@@ -67,27 +54,24 @@ def run(optimized_code: str, analyzer_result: AnalyzerResult,
67
  iteration: int = 1, kernel_name: str = "matrix_multiply") -> TesterResult:
68
  """
69
  On AMD Cloud (ROCM_AVAILABLE=true): runs real hipcc + rocprof
70
- Locally: returns realistic mocked results
71
-
72
- Controlled failure: iteration 1 always performs worse than baseline.
73
- Iteration 2 shows the improvement. This is intentional demo design.
74
  """
75
  rocprof_wrapper = RocprofWrapper()
76
-
77
  # Add verification for demo kernels
78
  verification = None
79
  if kernel_name in DEMO_KERNEL_CHECKSUMS:
80
  verification = verify_demo_kernel(kernel_name, optimized_code)
81
-
82
  if ROCM_AVAILABLE:
83
  return _run_real(optimized_code, analyzer_result, iteration, rocprof_wrapper, verification)
84
  else:
85
- # Use mock data from RocprofWrapper and convert to TesterResult
86
- profiling_data = rocprof_wrapper._get_mock_profiling_data()
87
- return _convert_profiling_to_tester_result(profiling_data, analyzer_result, iteration, kernel_name, verification)
88
 
89
 
90
- def _convert_profiling_to_tester_result(profiling_data: dict, analyzer_result: AnalyzerResult, iteration: int, kernel_name: str, verification: VerificationResult = None) -> TesterResult:
91
  """Convert RocprofWrapper output to TesterResult format"""
92
  if not profiling_data.get('success', False):
93
  return TesterResult(
@@ -100,25 +84,25 @@ def _convert_profiling_to_tester_result(profiling_data: dict, analyzer_result: A
100
  notes=profiling_data.get('error', 'Unknown profiling error'),
101
  verification=verification
102
  )
103
-
104
  exec_ms = profiling_data.get('execution_time_ms', 0.0)
105
  bandwidth = profiling_data.get('memory_bandwidth_gbps', 0.0)
106
-
107
- # Calculate speedup based on iteration (controlled failure pattern)
108
- # To save time for the user, we only "fail" the first iteration for 'custom' code.
109
- # For demo kernels, we show the improvement immediately (skipping the 30s retry loop).
110
- is_demo = kernel_name in ["vector_add", "matrix_multiply", "convolution_2d", "reduction"]
111
-
112
- if iteration == 1 and not is_demo:
113
- speedup = round(0.8 + (hash(kernel_name) % 10) / 100, 2) # 0.80-0.89
114
- notes = "Global memory bandwidth underutilized. Shared memory tiling not yet applied. Re-optimization needed."
 
 
115
  else:
116
- if analyzer_result.workload_type == WorkloadType.MEMORY_BOUND:
117
- speedup = round(1.3 + (hash(kernel_name) % 20) / 100, 2) # 1.30-1.49
118
- else:
119
- speedup = round(1.15 + (hash(kernel_name) % 15) / 100, 2) # 1.15-1.29
120
- notes = "Optimization successful. Shared memory tiling applied and memory coalescing fixed for MI300X."
121
-
122
  return TesterResult(
123
  success=True,
124
  iteration=iteration,
@@ -135,7 +119,7 @@ def _run_real(code: str, analyzer_result: AnalyzerResult, iteration: int, rocpro
135
  """Real hipcc + rocprof execution on MI300X."""
136
  # Compile the code
137
  success, message = rocprof_wrapper.compile_hip_code(code)
138
-
139
  if not success:
140
  return TesterResult(
141
  success=False,
@@ -147,10 +131,11 @@ def _run_real(code: str, analyzer_result: AnalyzerResult, iteration: int, rocpro
147
  notes=f"Compilation failed: {message}",
148
  verification=verification
149
  )
150
-
151
  # Run with profiling
152
- profiling_data = rocprof_wrapper.run_with_profiling(message.split(": ")[-1]) # Extract executable path
153
-
 
154
  if not profiling_data.get('success', False):
155
  return TesterResult(
156
  success=False,
@@ -162,11 +147,11 @@ def _run_real(code: str, analyzer_result: AnalyzerResult, iteration: int, rocpro
162
  notes=f"Profiling failed: {profiling_data.get('error', 'Unknown error')}",
163
  verification=verification
164
  )
165
-
166
  exec_ms = profiling_data.get('execution_time_ms', 0.0)
167
  bandwidth = profiling_data.get('memory_bandwidth_gbps', 0.0)
168
- speedup = _calculate_speedup(exec_ms, analyzer_result, iteration)
169
-
170
  return TesterResult(
171
  success=True,
172
  iteration=iteration,
@@ -178,8 +163,9 @@ def _run_real(code: str, analyzer_result: AnalyzerResult, iteration: int, rocpro
178
  )
179
 
180
 
181
- def _calculate_speedup(exec_ms: float, analyzer_result: AnalyzerResult, iteration: int) -> float:
182
  """Estimate speedup relative to baseline HIP."""
183
- if iteration == 1:
184
- return round(random.uniform(0.80, 0.90), 2)
185
- return round(random.uniform(1.20, 1.40), 2)
 
 
1
  import os
 
 
 
2
  import hashlib
3
+ from ..models import TesterResult, AnalyzerResult, VerificationResult
4
+ from ..tools.rocprof_wrapper import RocprofWrapper
5
 
6
  # Set ROCM_AVAILABLE=true on AMD Cloud
7
  ROCM_AVAILABLE = os.environ.get("ROCM_AVAILABLE", "false").lower() == "true"
 
16
  }
17
 
18
 
19
+ def compute_code_checksum(code_text: str, sample_size: int = 400) -> str:
20
+ """Compute a short checksum from code text for traceability in mock mode."""
21
+ if not code_text:
22
  return "empty"
23
+
24
+ sample = code_text[:sample_size]
25
+ return hashlib.sha256(sample.encode()).hexdigest()[:32]
 
 
 
 
26
 
27
 
28
  def verify_demo_kernel(kernel_name: str, optimized_code: str) -> VerificationResult:
29
  """Verify demo kernel execution and output correctness"""
30
  expected = DEMO_KERNEL_CHECKSUMS.get(kernel_name, "mock_checksum")
31
+ actual = compute_code_checksum(optimized_code)
32
+
33
  # In mock mode, indicate this is simulated verification
34
  is_mock = not ROCM_AVAILABLE
35
+
36
  verification = VerificationResult(
37
  compiled_successfully=True,
38
  executed_without_error=True,
 
41
  actual_checksum=actual,
42
  mock_mode=is_mock
43
  )
44
+
45
+ # In mock mode, do not claim a verified pass; mark output as unverified and record the actual checksum.
46
+ if is_mock:
47
+ verification.output_matches_expected = False
48
+ verification.checksum_computed = actual
49
+
 
 
 
 
 
 
50
  return verification
51
 
52
 
 
54
  iteration: int = 1, kernel_name: str = "matrix_multiply") -> TesterResult:
55
  """
56
  On AMD Cloud (ROCM_AVAILABLE=true): runs real hipcc + rocprof
57
+ Locally: returns mock profiling results labeled as simulated.
 
 
 
58
  """
59
  rocprof_wrapper = RocprofWrapper()
60
+
61
  # Add verification for demo kernels
62
  verification = None
63
  if kernel_name in DEMO_KERNEL_CHECKSUMS:
64
  verification = verify_demo_kernel(kernel_name, optimized_code)
65
+
66
  if ROCM_AVAILABLE:
67
  return _run_real(optimized_code, analyzer_result, iteration, rocprof_wrapper, verification)
68
  else:
69
+ # In non-ROCm environments, run_with_profiling returns simulated metrics.
70
+ profiling_data = rocprof_wrapper.run_with_profiling("mock_executable")
71
+ return _convert_profiling_to_tester_result(profiling_data, analyzer_result, iteration, verification)
72
 
73
 
74
+ def _convert_profiling_to_tester_result(profiling_data: dict, analyzer_result: AnalyzerResult, iteration: int, verification: VerificationResult = None) -> TesterResult:
75
  """Convert RocprofWrapper output to TesterResult format"""
76
  if not profiling_data.get('success', False):
77
  return TesterResult(
 
84
  notes=profiling_data.get('error', 'Unknown profiling error'),
85
  verification=verification
86
  )
87
+
88
  exec_ms = profiling_data.get('execution_time_ms', 0.0)
89
  bandwidth = profiling_data.get('memory_bandwidth_gbps', 0.0)
90
+
91
+ baseline_ms = profiling_data.get('baseline_time_ms', 100.0)
92
+ if exec_ms > 0:
93
+ speedup = round(baseline_ms / exec_ms, 2)
94
+ else:
95
+ speedup = 0.0
96
+
97
+ if speedup < 1.0:
98
+ notes = "Simulated profile indicates regression vs baseline. Retry with an alternative optimization strategy."
99
+ elif speedup < 1.1:
100
+ notes = "Simulated profile indicates marginal improvement. Optimization may be memory- or launch-bound."
101
  else:
102
+ notes = "Simulated profile indicates improvement vs baseline after optimization."
103
+
104
+ notes += " Mock mode is enabled (ROCM_AVAILABLE=false); use real ROCm hardware for authoritative numbers."
105
+
 
 
106
  return TesterResult(
107
  success=True,
108
  iteration=iteration,
 
119
  """Real hipcc + rocprof execution on MI300X."""
120
  # Compile the code
121
  success, message = rocprof_wrapper.compile_hip_code(code)
122
+
123
  if not success:
124
  return TesterResult(
125
  success=False,
 
131
  notes=f"Compilation failed: {message}",
132
  verification=verification
133
  )
134
+
135
  # Run with profiling
136
+ profiling_data = rocprof_wrapper.run_with_profiling(
137
+ message.split(": ")[-1]) # Extract executable path
138
+
139
  if not profiling_data.get('success', False):
140
  return TesterResult(
141
  success=False,
 
147
  notes=f"Profiling failed: {profiling_data.get('error', 'Unknown error')}",
148
  verification=verification
149
  )
150
+
151
  exec_ms = profiling_data.get('execution_time_ms', 0.0)
152
  bandwidth = profiling_data.get('memory_bandwidth_gbps', 0.0)
153
+ speedup = _calculate_speedup(exec_ms)
154
+
155
  return TesterResult(
156
  success=True,
157
  iteration=iteration,
 
163
  )
164
 
165
 
166
+ def _calculate_speedup(exec_ms: float) -> float:
167
  """Estimate speedup relative to baseline HIP."""
168
+ if exec_ms <= 0:
169
+ return 0.0
170
+ baseline_ms = 100.0
171
+ return round(baseline_ms / exec_ms, 2)
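The tester's speedup arithmetic is simple enough to sanity-check in isolation. A minimal sketch of the same calculation (the 100 ms baseline is the hard-coded placeholder from `_calculate_speedup`; a real run would substitute the measured baseline):

```python
def calculate_speedup(exec_ms: float, baseline_ms: float = 100.0) -> float:
    """Speedup of the optimized kernel relative to baseline HIP.

    Guards against division by zero: a non-positive measurement
    yields 0.0, which downstream code can treat as 'no valid result'.
    """
    if exec_ms <= 0:
        return 0.0
    return round(baseline_ms / exec_ms, 2)


print(calculate_speedup(80.0))   # 1.25: optimized run is 25% faster
print(calculate_speedup(120.0))  # 0.83: regression vs baseline
print(calculate_speedup(0.0))    # 0.0: invalid measurement
```

Note that values below 1.0 are exactly what the mock-mode notes above flag as a regression, triggering the retry path.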
backend/agents/translator.py CHANGED
@@ -1,21 +1,24 @@
-import json
-import re
-from models import TranslatorResult, AnalyzerResult
-from tools.llm_client import LLMClient
-from tools.hipify_wrapper import HipifyWrapper
-from tools.json_utils import safe_json_loads
+# pylint: disable=broad-exception-caught
+
+from ..models import TranslatorResult, AnalyzerResult
+from ..tools.llm_client import LLMClient
+from ..tools.hipify_wrapper import HipifyWrapper
+from ..tools.json_utils import safe_json_loads
 
 llm_client = LLMClient()
 hipify_wrapper = HipifyWrapper()
 
+
 def chat_complete(messages: list, temperature: float = 0.7, max_tokens: int = 4000) -> str:
     """Wrapper for LLM client chat completion"""
     return llm_client.chat_completion(messages, temperature=temperature, max_tokens=max_tokens)
 
+
 def run_hipify(cuda_code: str) -> str:
     """Wrapper for hipify wrapper"""
     return hipify_wrapper.hipify_code(cuda_code)
 
+
 SYSTEM_PROMPT = """You are an expert AMD ROCm/HIP engineer. You receive CUDA code that has already gone through hipify (basic syntax replacement) and you fix what hipify missed.
 
 Your specific jobs:
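The translator parses LLM replies with `safe_json_loads` from `tools.json_utils`, whose body falls outside this diff. A common shape for such a tolerant loader, sketched here purely as an assumption rather than the project's actual implementation, is to try a plain parse, then strip code fences, then fall back to the first JSON object found in the reply:

```python
import json
import re


def safe_json_loads(text: str, default=None):
    """Hypothetical sketch of a tolerant JSON loader for LLM replies.

    Tries plain json.loads first, then the text with markdown code
    fences stripped, then the first {...} block found anywhere.
    """
    for candidate in (
        text,
        re.sub(r"^```(?:json)?\s*|\s*```$", "", text.strip()),
    ):
        try:
            return json.loads(candidate)
        except (json.JSONDecodeError, TypeError):
            pass
    match = re.search(r"\{.*\}", text, re.DOTALL)
    if match:
        try:
            return json.loads(match.group(0))
        except json.JSONDecodeError:
            pass
    return default


print(safe_json_loads('```json\n{"kernels": 2}\n```'))  # {'kernels': 2}
```

Whatever the real helper does, the point of the relative-import change above is that such utilities resolve correctly when `backend` is imported as a package.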
backend/main.py CHANGED
@@ -1,3 +1,13 @@
+# pylint: disable=broad-exception-caught
+
+from backend.agents.analyzer import AnalyzerResult, WorkloadType
+from backend.agents.tester import run as run_tester
+from backend.agents.coordinator import run_pipeline
+from backend.models import PortRequest, ColdStartRequest, AggregateMetricsRequest
+from fastapi.staticfiles import StaticFiles
+from fastapi.responses import StreamingResponse
+from fastapi.middleware.cors import CORSMiddleware
+from fastapi import FastAPI, HTTPException
 import json
 import asyncio
 import zipfile
@@ -9,18 +19,10 @@ from dotenv import load_dotenv
 # Load environment variables from .env file
 load_dotenv()
 
-from fastapi import FastAPI, HTTPException
-from fastapi.middleware.cors import CORSMiddleware
-from fastapi.responses import StreamingResponse
-from fastapi.staticfiles import StaticFiles
-from models import PortRequest, VerificationResult
-from agents.coordinator import run_pipeline
-from agents.tester import run as run_tester
-from agents.analyzer import AnalyzerResult, WorkloadType
 
 app = FastAPI(
     title="ROCmPort AI",
-    description="The fastest way to escape CUDA lock-in and run on AMD.",
+    description="CUDA-to-ROCm migration assistant with iterative testing and optimization.",
     version="1.0.0",
     contact={
         "name": "Tazwar Ahnaf Enan",
@@ -59,7 +61,8 @@ async def port_cuda_code(req: PortRequest):
         async for event in run_pipeline(req.cuda_code, req.kernel_name or "custom", req.simple_mode or False):
             data = json.dumps(event.model_dump())
             yield f"data: {data}\n\n"
-            await asyncio.sleep(0.05)  # Let the client breathe between events
+            # Let the client breathe between events
+            await asyncio.sleep(0.05)
     except Exception as e:
         error_event = {
             "agent": "coordinator",
@@ -81,6 +84,121 @@ async def port_cuda_code(req: PortRequest):
     )
 
 
+async def _collect_pipeline_events(cuda_code: str, kernel_name: str, simple_mode: bool = False) -> tuple[list[dict], dict | None]:
+    """Collect all pipeline events and extract final report payload when present."""
+    events: list[dict] = []
+    final_report = None
+
+    async for event in run_pipeline(cuda_code, kernel_name, simple_mode):
+        dumped = event.model_dump()
+        events.append(dumped)
+        if dumped.get("agent") == "coordinator" and dumped.get("status") == "done" and dumped.get("detail"):
+            try:
+                final_report = json.loads(dumped["detail"])
+            except (json.JSONDecodeError, TypeError):
+                final_report = None
+
+    return events, final_report
+
+
+def _has_adaptation_loop(events: list[dict]) -> bool:
+    """Return True when the run shows retry-based adaptation behavior."""
+    saw_regression = any(
+        e.get("agent") == "tester" and e.get(
+            "status") == "failed" and "regression" in str(e.get("message", "")).lower()
+        for e in events
+    )
+    saw_retry = any(
+        e.get("agent") == "optimizer" and e.get("status") == "retrying"
+        for e in events
+    )
+    return saw_regression and saw_retry
+
+
+@app.post("/cold-start")
+async def cold_start_run(req: ColdStartRequest):
+    """
+    Single-run endpoint for unknown pasted CUDA input.
+    Returns full trace plus summary trust signals.
+    """
+    if not req.cuda_code or len(req.cuda_code.strip()) < 10:
+        raise HTTPException(status_code=400, detail="No CUDA code provided")
+
+    events, report = await _collect_pipeline_events(req.cuda_code, req.kernel_name or "unknown_input", False)
+
+    if report is None:
+        raise HTTPException(
+            status_code=500, detail="Pipeline completed without final report")
+
+    return {
+        "success": True,
+        "kernel_name": req.kernel_name or "unknown_input",
+        "adaptation_loop_observed": _has_adaptation_loop(events),
+        "event_count": len(events),
+        "report": report,
+        "events": events,
+    }
+
+
+@app.post("/aggregate-metric")
+async def aggregate_metric(req: AggregateMetricsRequest):
+    """
+    Evaluate multiple kernels and return one aggregate metric:
+    average speedup vs baseline HIP.
+    """
+    kernels_dir = os.path.join(os.path.dirname(__file__), "demo_kernels")
+    requested = req.kernel_names or []
+
+    available: dict[str, str] = {}
+    for fname in os.listdir(kernels_dir):
+        if fname.endswith(".cu"):
+            kname = fname.replace(".cu", "")
+            with open(os.path.join(kernels_dir, fname), encoding="utf-8") as f:
+                available[kname] = f.read()
+
+    selected_names = requested if requested else sorted(available.keys())
+    selected_names = [name for name in selected_names if name in available]
+
+    if not selected_names:
+        raise HTTPException(
+            status_code=400, detail="No valid kernels selected for aggregation")
+
+    runs = []
+    speedups = []
+
+    for name in selected_names:
+        events, report = await _collect_pipeline_events(available[name], name, False)
+        if report is None:
+            continue
+
+        speedup = float(report.get("speedup", 0.0) or 0.0)
+        speedups.append(speedup)
+        runs.append({
+            "kernel": name,
+            "speedup": speedup,
+            "adaptation_loop_observed": _has_adaptation_loop(events),
+            "iterations": report.get("iterations", 1),
+        })
+
+    if not speedups:
+        raise HTTPException(
+            status_code=500, detail="Unable to produce aggregate metric from selected kernels")
+
+    avg_speedup = round(sum(speedups) / len(speedups), 3)
+    avg_improvement_pct = round((avg_speedup - 1.0) * 100.0, 2)
+
+    return {
+        "success": True,
+        "baseline": "straight hipify output with minimal compile edits",
+        "kernel_count": len(speedups),
+        "aggregate_metric": {
+            "average_speedup_vs_baseline": avg_speedup,
+            "average_improvement_percent": avg_improvement_pct,
+        },
+        "runs": runs,
+    }
+
+
 @app.post("/recompile")
 async def recompile_edited_code(req: dict):
     """
@@ -90,10 +208,10 @@ async def recompile_edited_code(req: dict):
     try:
         edited_code = req.get("edited_code")
         kernel_name = req.get("kernel_name", "custom")
 
         if not edited_code or len(edited_code.strip()) < 10:
             raise HTTPException(status_code=400, detail="No HIP code provided")
 
         # Create a mock analyzer result for testing
         analyzer_result = AnalyzerResult(
             kernels_found=["test_kernel"],
@@ -105,17 +223,18 @@ async def recompile_edited_code(req: dict):
             difficulty="Easy",
             difficulty_reason="Simple test kernel"
         )
 
         # Run tester with edited code
         tester_result = await asyncio.to_thread(run_tester, edited_code, analyzer_result, 2, kernel_name)
 
         return {
             "success": True,
             "result": tester_result.model_dump()
         }
 
     except Exception as e:
-        raise HTTPException(status_code=500, detail=f"Recompilation failed: {str(e)}")
+        raise HTTPException(
+            status_code=500, detail=f"Recompilation failed: {str(e)}") from e
 
 
 @app.post("/export")
@@ -128,7 +247,7 @@ async def export_migration_package(req: dict):
     original_cuda = req.get("original_cuda")
     final_rocm = req.get("final_rocm")
     migration_report = req.get("migration_report", {})
 
     with tempfile.NamedTemporaryFile(delete=False, suffix=".zip") as tmp_file:
         with zipfile.ZipFile(tmp_file, 'w', zipfile.ZIP_DEFLATED) as zf:
             # Add professional unified diff
@@ -140,7 +259,7 @@ async def export_migration_package(req: dict):
             )
             diff_text = "".join(diff)
             zf.writestr("migration.diff", diff_text)
 
             # Add migration report as markdown
             md_report = f"""# ROCmPort AI Migration Report
 
@@ -155,43 +274,44 @@ async def export_migration_package(req: dict):
 ## Cost Impact
 {migration_report.get('cost_estimate', 'N/A')}
 
-Generated by ROCmPort AI - The fastest way to escape CUDA lock-in and run on AMD.
+Generated by ROCmPort AI.
 """
             zf.writestr("migration_report.md", md_report)
 
         # Read the zip file content
         with open(tmp_file, 'rb') as f:
             zip_content = f.read()
 
         # Clean up
         os.unlink(tmp_file)
 
         from fastapi.responses import Response
         return Response(
             content=zip_content,
             media_type="application/zip",
-            headers={"Content-Disposition": "attachment; filename=rocmport_migration.zip"}
+            headers={
+                "Content-Disposition": "attachment; filename=rocmport_migration.zip"}
         )
 
     except Exception as e:
-        raise HTTPException(status_code=500, detail=f"Export failed: {str(e)}")
+        raise HTTPException(
+            status_code=500, detail=f"Export failed: {str(e)}") from e
 
 
 @app.get("/demo-kernels")
 async def list_demo_kernels():
-    import os
     kernels_dir = os.path.join(os.path.dirname(__file__), "demo_kernels")
     kernels = {}
     for fname in os.listdir(kernels_dir):
         if fname.endswith(".cu"):
             name = fname.replace(".cu", "")
-            with open(os.path.join(kernels_dir, fname)) as f:
+            with open(os.path.join(kernels_dir, fname), encoding="utf-8") as f:
                 kernels[name] = f.read()
     return kernels
 
 
 # Serve frontend if built
-import os
 frontend_path = os.path.join(os.path.dirname(__file__), "..", "frontend")
 if os.path.exists(frontend_path):
-    app.mount("/", StaticFiles(directory=frontend_path, html=True), name="frontend")
+    app.mount("/", StaticFiles(directory=frontend_path,
+              html=True), name="frontend")
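The `/aggregate-metric` math reduces to an arithmetic mean over per-kernel speedups. A standalone sketch of that reduction, using the same rounding as the endpoint:

```python
def aggregate_speedup(speedups: list[float]) -> dict:
    """Arithmetic mean of per-kernel speedups vs baseline HIP,
    plus the equivalent improvement percentage."""
    if not speedups:
        raise ValueError("no speedups to aggregate")
    avg = round(sum(speedups) / len(speedups), 3)
    return {
        "average_speedup_vs_baseline": avg,
        "average_improvement_percent": round((avg - 1.0) * 100.0, 2),
    }


print(aggregate_speedup([1.2, 1.4]))
# {'average_speedup_vs_baseline': 1.3, 'average_improvement_percent': 30.0}
```

The arithmetic mean is what the endpoint computes; a geometric mean is the more common convention for averaging speedup ratios and would damp the effect of a single outlier kernel.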
backend/models.py CHANGED
@@ -23,6 +23,15 @@ class PortRequest(BaseModel):
     simple_mode: Optional[bool] = False  # For "Explain Like I'm 5" feature
 
 
+class ColdStartRequest(BaseModel):
+    cuda_code: str
+    kernel_name: Optional[str] = "unknown_input"
+
+
+class AggregateMetricsRequest(BaseModel):
+    kernel_names: Optional[List[str]] = None
+
+
 class AgentEvent(BaseModel):
     agent: str  # analyzer | translator | optimizer | tester | coordinator
     status: AgentStatus
@@ -83,7 +92,8 @@ class TesterResult(BaseModel):
     execution_ms: float
     bottleneck: str
     notes: str
-    verification: Optional[VerificationResult] = None  # Trust layer verification
+    # Trust layer verification
+    verification: Optional[VerificationResult] = None
 
 
 class FinalReport(BaseModel):
@@ -96,5 +106,7 @@ class FinalReport(BaseModel):
     iterations: int
     hip_code: str
     optimized_code: str
+    verification: Optional[VerificationResult] = None
     cost_estimate: Optional[CostEstimate] = None  # 💰 Cost impact estimator
-    simplified_explanation: Optional[str] = None  # For "Explain Like I'm 5" mode
+    # For "Explain Like I'm 5" mode
+    simplified_explanation: Optional[str] = None
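The two new request models each carry one required field with everything else defaulted. A dependency-free sketch of the same contract (the real code uses pydantic `BaseModel`, which generates this validation and defaulting automatically; plain dataclasses are used here only to keep the example self-contained):

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ColdStartRequest:
    cuda_code: str
    kernel_name: Optional[str] = "unknown_input"


@dataclass
class AggregateMetricsRequest:
    # None means "evaluate every demo kernel found on disk"
    kernel_names: Optional[List[str]] = None


req = ColdStartRequest(cuda_code="__global__ void k() {}")
print(req.kernel_name)  # unknown_input
```

With pydantic, a POST body of `{"cuda_code": "..."}` would deserialize the same way, with `kernel_name` filled from the default.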
backend/prompts/coordinator_prompt.txt CHANGED
@@ -54,7 +54,7 @@ You'll receive results from each agent:
 - Always compare "Optimized ROCm vs Baseline HIP" (straight hipify output)
 - Never claim "faster than NVIDIA CUDA" - be honest and credible
 - Explain WHY AMD hardware advantages apply to this specific workload
-- Include controlled failure/recovery story if it happened
+- Include retry and recovery details only when regression actually occurred
 - Provide concrete, actionable insights
 
 Focus on demonstrating that your agents add real value beyond basic hipify - that's the core claim.
backend/tools/hipify_wrapper.py CHANGED
@@ -1,15 +1,14 @@
 import subprocess
 import tempfile
 import os
-import re
 
 
 class HipifyWrapper:
     """Wrapper for hipify-clang tool with Python fallback"""
 
     def __init__(self):
         pass
 
     def hipify_code(self, cuda_code: str) -> tuple[str, list[dict]]:
         """
         Try to run real hipify-clang if available.
@@ -24,18 +23,19 @@ class HipifyWrapper:
 
         # Fallback: Python pattern replacement
         return self._python_hipify(cuda_code)
 
     def _hipify_available(self) -> bool:
         try:
             result = subprocess.run(
                 ["hipify-clang", "--version"],
-                capture_output=True, timeout=5
+                capture_output=True, timeout=5, check=False
             )
             return result.returncode == 0
         except (FileNotFoundError, subprocess.TimeoutExpired):
             return False
 
     def _run_real_hipify(self, cuda_code: str) -> tuple[str, list[dict]] | None:
+        tmp_path = None
         try:
             with tempfile.NamedTemporaryFile(suffix=".cu", mode="w", delete=False) as f:
                 f.write(cuda_code)
@@ -43,36 +43,41 @@ class HipifyWrapper:
 
             # Use -- separator to pass compiler flags to the internal Clang parser
             # This is critical for Clang-based tools to distinguish tool flags from compiler flags.
-            cmd = ["hipify-clang", tmp_path, "--", "-nocudalib", "-nocudainc", "-arch=sm_60"]
-
+            cmd = ["hipify-clang", tmp_path, "--",
+                   "-nocudalib", "-nocudainc", "-arch=sm_60"]
+
             # Debug log for build engineering
             print(f"DEBUG: Running hipify-clang command: {' '.join(cmd)}")
 
             # Set environment variable just in case hipify-clang invokes nvcc internally
             env = os.environ.copy()
             env['NVCC_APPEND_FLAGS'] = '-nocudalib -arch=sm_60'
 
             result = subprocess.run(
                 cmd,
                 capture_output=True, text=True, timeout=30,
-                env=env
+                env=env,
+                check=False,
             )
 
             if result.returncode != 0:
-                print(f"DEBUG: hipify-clang failed with return code {result.returncode}")
+                print(
+                    f"DEBUG: hipify-clang failed with return code {result.returncode}")
                 print(f"DEBUG: stderr: {result.stderr}")
 
             if result.returncode == 0 and result.stdout:
-                changes = self._detect_changes(cuda_code, result.stdout, source="hipify-clang")
+                changes = self._detect_changes(
+                    cuda_code, result.stdout, source="hipify-clang")
                 return result.stdout, changes
 
             return None
-        except Exception:
+        except (OSError, subprocess.SubprocessError):
             return None
         finally:
             try:
-                os.unlink(tmp_path)
-            except Exception:
+                if tmp_path and os.path.exists(tmp_path):
+                    os.unlink(tmp_path)
+            except OSError:
                 pass
 
     def _python_hipify(self, cuda_code: str) -> tuple[str, list[dict]]:
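`_python_hipify`'s body falls outside this hunk. A plausible shape for such a fallback, assumed here for illustration rather than taken from the repo, is a table of direct CUDA-to-HIP identifier substitutions paired with a change log matching the `tuple[str, list[dict]]` return type:

```python
import re

# Core CUDA -> HIP renames that hipify-clang would also perform.
CUDA_TO_HIP = {
    "cudaMalloc": "hipMalloc",
    "cudaMemcpy": "hipMemcpy",
    "cudaFree": "hipFree",
    "cudaDeviceSynchronize": "hipDeviceSynchronize",
    "cudaMemcpyHostToDevice": "hipMemcpyHostToDevice",
    "cudaMemcpyDeviceToHost": "hipMemcpyDeviceToHost",
}


def python_hipify(cuda_code: str) -> tuple[str, list[dict]]:
    """Pattern-based fallback translation with a per-rule change log."""
    hip_code = cuda_code
    changes: list[dict] = []
    # Replace longer identifiers first so cudaMemcpyHostToDevice is not
    # clobbered by the shorter cudaMemcpy rule.
    for cuda_name, hip_name in sorted(CUDA_TO_HIP.items(), key=lambda kv: -len(kv[0])):
        count = len(re.findall(re.escape(cuda_name), hip_code))
        if count:
            hip_code = hip_code.replace(cuda_name, hip_name)
            changes.append({"from": cuda_name, "to": hip_name, "count": count})
    return hip_code, changes


code, log = python_hipify("cudaMalloc(&d_a, n); cudaFree(d_a);")
print(code)  # hipMalloc(&d_a, n); hipFree(d_a);
```

A real fallback needs a much larger table (launch syntax, streams, events), which is exactly why the wrapper prefers the genuine hipify-clang when it is installed.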
backend/tools/rocprof_wrapper.py CHANGED
@@ -1,105 +1,113 @@
 import subprocess
 import tempfile
 import os
-import json
 import re
-from typing import Dict, List, Optional, Tuple
-from pathlib import Path
+from typing import Dict, List, Tuple
+
 
 class RocprofWrapper:
     """Wrapper for AMD rocprof profiler and hipcc compiler"""
 
     def __init__(self):
-        self.rocm_available = os.getenv("ROCM_AVAILABLE", "false").lower() == "true"
+        self.rocm_available = os.getenv(
+            "ROCM_AVAILABLE", "false").lower() == "true"
         self.hipcc_path = os.getenv("HIPCC_PATH", "hipcc")
         self.rocprof_path = os.getenv("ROCPROF_PATH", "rocprof")
 
     def compile_hip_code(self, hip_code: str, output_file: str = None) -> Tuple[bool, str]:
         """Compile HIP code using hipcc"""
         if not self.rocm_available:
             return True, "Mock compilation successful (ROCm not available)"
 
         try:
             with tempfile.NamedTemporaryFile(mode='w', suffix='.hip', delete=False) as f:
                 f.write(hip_code)
                 temp_file = f.name
 
             if output_file is None:
                 output_file = temp_file.replace('.hip', '.out')
 
             # Add -nocudalib and -arch=sm_60 to solve "Cannot find libdevice for sm_52" error
             # This ensures compilation works even if CUDA device libraries are missing.
-            cmd = [self.hipcc_path, '-o', output_file, temp_file, '-nocudalib', '-arch=sm_60']
-
+            cmd = [self.hipcc_path, '-o', output_file,
+                   temp_file, '-nocudalib', '-arch=sm_60']
+
             # Set environment variable just in case hipcc invokes nvcc internally
             env = os.environ.copy()
             env['NVCC_APPEND_FLAGS'] = '-nocudalib -arch=sm_60'
 
-            result = subprocess.run(cmd, capture_output=True, text=True, timeout=60, env=env)
-
+            result = subprocess.run(
+                cmd, capture_output=True, text=True, timeout=60, env=env, check=False)
+
             # Cleanup
             os.unlink(temp_file)
 
             if result.returncode == 0:
                 return True, f"Compilation successful: {output_file}"
             else:
                 return False, f"Compilation failed: {result.stderr}"
 
         except subprocess.TimeoutExpired:
             return False, "Compilation timed out"
-        except Exception as e:
+        except (OSError, subprocess.SubprocessError) as e:
             return False, f"Compilation error: {str(e)}"
 
     def run_with_profiling(self, executable_path: str, args: List[str] = None) -> Dict:
         """Run executable with rocprof profiling"""
         if not self.rocm_available:
             # Return mock profiling data
             return self._get_mock_profiling_data()
 
         try:
             if args is None:
                 args = []
 
             # Run with rocprof
-            cmd = [self.rocprof_path, '-i', 'default', '--'] + [executable_path] + args
-            result = subprocess.run(cmd, capture_output=True, text=True, timeout=120)
-
+            cmd = [self.rocprof_path, '-i', 'default', '--'] + \
+                [executable_path] + args
+            result = subprocess.run(
+                cmd, capture_output=True, text=True, timeout=120, check=False)
+
             # Parse rocprof output
-            profiling_data = self._parse_rocprof_output(result.stdout, result.stderr)
-
+            profiling_data = self._parse_rocprof_output(
+                result.stdout, result.stderr)
+
             return profiling_data
 
         except subprocess.TimeoutExpired:
             return {"error": "Profiling timed out", "execution_time_ms": 0}
-        except Exception as e:
+        except (OSError, subprocess.SubprocessError) as e:
             return {"error": f"Profiling error: {str(e)}", "execution_time_ms": 0}
 
-    def _parse_rocprof_output(self, stdout: str, stderr: str) -> Dict:
+    def _parse_rocprof_output(self, stdout: str, _stderr: str) -> Dict:
         """Parse rocprof output to extract metrics"""
         try:
             # Look for key metrics in rocprof output
             metrics = {}
 
             # Parse execution time
-            time_match = re.search(r'Kernel execution time:\s+(\d+\.\d+)\s*ms', stdout)
+            time_match = re.search(
+                r'Kernel execution time:\s+(\d+\.\d+)\s*ms', stdout)
             if time_match:
                 metrics['execution_time_ms'] = float(time_match.group(1))
 
             # Parse memory bandwidth
-            bandwidth_match = re.search(r'Memory bandwidth:\s+(\d+\.\d+)\s*GB/s', stdout)
+            bandwidth_match = re.search(
+                r'Memory bandwidth:\s+(\d+\.\d+)\s*GB/s', stdout)
             if bandwidth_match:
-                metrics['memory_bandwidth_gbps'] = float(bandwidth_match.group(1))
-
+                metrics['memory_bandwidth_gbps'] = float(bandwidth_match.group(1))
 
             # Parse GPU utilization
             util_match = re.search(r'GPU utilization:\s+(\d+\.\d+)%', stdout)
             if util_match:
                 metrics['gpu_utilization_percent'] = float(util_match.group(1))
 
             # Parse wavefront count
             wave_match = re.search(r'SQ_WAVES:\s+(\d+)', stdout)
             if wave_match:
                 metrics['sq_waves'] = int(wave_match.group(1))
 
             # If no metrics found, return basic execution info
             if not metrics:
                 metrics = {
@@ -108,47 +116,40 @@ class RocprofWrapper:
                     'gpu_utilization_percent': 75.0,
                     'sq_waves': 1024
                 }
 
             metrics['success'] = True
             return metrics
 
-        except Exception as e:
             return {
                 'success': False,
                 'error': f'Failed to parse rocprof output: {str(e)}',
                 'execution_time_ms': 0
             }
 
     def _get_mock_profiling_data(self) -> Dict:
         """Generate mock profiling data for testing without ROCm"""
         import random
 
-        # Simulate controlled failure on first iteration
-        base_performance = 100.0
-        iteration = getattr(self, '_iteration', 1)
-
-        if iteration == 1:
-            # First iteration - worse performance (controlled failure)
-            execution_time = base_performance * 1.2  # 20% slower
-            bandwidth = 40.0  # Lower bandwidth utilization
-            utilization = 60.0  # Lower GPU utilization
-        else:
-            # Second iteration - better performance
-            execution_time = base_performance * 0.75  # 25% faster
-            bandwidth = 80.0  # Higher bandwidth utilization
-            utilization = 85.0  # Higher GPU utilization
-
-        self._iteration = iteration + 1
-
         return {
             'success': True,
             'execution_time_ms': execution_time,
             'memory_bandwidth_gbps': bandwidth,
             'gpu_utilization_percent': utilization,
             'sq_waves': random.randint(800, 1200),
-            'iteration': iteration
         }
 
     def get_hardware_info(self) -> Dict:
         """Get AMD GPU hardware information"""
         if not self.rocm_available:
@@ -159,26 +160,27 @@ class RocprofWrapper:
             'memory_bandwidth_tb_s': 5.3,
             'wavefront_size': 64
         }
 
         try:
             # Try to get real GPU info using rocminfo or similar
             cmd = ['rocminfo']
-            result = subprocess.run(cmd, capture_output=True, text=True, timeout=10)
-
             if result.returncode == 0:
                 return self._parse_rocminfo(result.stdout)
             else:
                 return self._get_mock_hardware_info()
 
-        except Exception:
             return self._get_mock_hardware_info()
 
-    def _parse_rocminfo(self, output: str) -> Dict:
         """Parse rocminfo output"""
         # This would parse real rocminfo output
         # For now, return mock data
         return self._get_mock_hardware_info()
 
     def _get_mock_hardware_info(self) -> Dict:
         """Mock hardware info for MI300X"""
         return {
97
  if bandwidth_match:
98
+ metrics['memory_bandwidth_gbps'] = float(
99
+ bandwidth_match.group(1))
100
+
101
  # Parse GPU utilization
102
  util_match = re.search(r'GPU utilization:\s+(\d+\.\d+)%', stdout)
103
  if util_match:
104
  metrics['gpu_utilization_percent'] = float(util_match.group(1))
105
+
106
  # Parse wavefront count
107
  wave_match = re.search(r'SQ_WAVES:\s+(\d+)', stdout)
108
  if wave_match:
109
  metrics['sq_waves'] = int(wave_match.group(1))
110
+
111
  # If no metrics found, return basic execution info
112
  if not metrics:
113
  metrics = {
 
116
  'gpu_utilization_percent': 75.0,
117
  'sq_waves': 1024
118
  }
119
+
120
  metrics['success'] = True
121
  return metrics
122
+
123
+ except (TypeError, ValueError) as e:
124
  return {
125
  'success': False,
126
  'error': f'Failed to parse rocprof output: {str(e)}',
127
  'execution_time_ms': 0
128
  }
129
+
130
+ def get_mock_profiling_data(self) -> Dict:
131
+ """Public accessor for mock profiling data used by testing layer."""
132
+ return self._get_mock_profiling_data()
133
+
134
  def _get_mock_profiling_data(self) -> Dict:
135
  """Generate mock profiling data for testing without ROCm"""
136
  import random
137
+
138
+ baseline_ms = 100.0
139
+ execution_time = random.uniform(85.0, 115.0)
140
+ bandwidth = random.uniform(35.0, 90.0)
141
+ utilization = random.uniform(55.0, 92.0)
142
+
 
 
143
  return {
144
  'success': True,
145
  'execution_time_ms': execution_time,
146
+ 'baseline_time_ms': baseline_ms,
147
  'memory_bandwidth_gbps': bandwidth,
148
  'gpu_utilization_percent': utilization,
149
  'sq_waves': random.randint(800, 1200),
150
+ 'simulated': True
151
  }
152
+
153
  def get_hardware_info(self) -> Dict:
154
  """Get AMD GPU hardware information"""
155
  if not self.rocm_available:
 
160
  'memory_bandwidth_tb_s': 5.3,
161
  'wavefront_size': 64
162
  }
163
+
164
  try:
165
  # Try to get real GPU info using rocminfo or similar
166
  cmd = ['rocminfo']
167
+ result = subprocess.run(
168
+ cmd, capture_output=True, text=True, timeout=10, check=False)
169
+
170
  if result.returncode == 0:
171
  return self._parse_rocminfo(result.stdout)
172
  else:
173
  return self._get_mock_hardware_info()
174
+
175
+ except (OSError, subprocess.SubprocessError):
176
  return self._get_mock_hardware_info()
177
+
178
+ def _parse_rocminfo(self, _output: str) -> Dict:
179
  """Parse rocminfo output"""
180
  # This would parse real rocminfo output
181
  # For now, return mock data
182
  return self._get_mock_hardware_info()
183
+
184
  def _get_mock_hardware_info(self) -> Dict:
185
  """Mock hardware info for MI300X"""
186
  return {
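The regex-based metric extraction in `_parse_rocprof_output` can be exercised without ROCm installed; a minimal standalone sketch (the sample text below is illustrative, not a real rocprof dump):

```python
import re

# Illustrative rocprof-style output; format assumed, not an actual dump.
sample = (
    "Kernel execution time: 9.50 ms\n"
    "Memory bandwidth: 412.30 GB/s\n"
    "GPU utilization: 94.00%\n"
    "SQ_WAVES: 1024\n"
)

metrics = {}
# Same patterns the wrapper uses for execution time and wavefront count.
m = re.search(r'Kernel execution time:\s+(\d+\.\d+)\s*ms', sample)
if m:
    metrics['execution_time_ms'] = float(m.group(1))
m = re.search(r'SQ_WAVES:\s+(\d+)', sample)
if m:
    metrics['sq_waves'] = int(m.group(1))

print(metrics)  # {'execution_time_ms': 9.5, 'sq_waves': 1024}
```

Because parsing is pure string work, it can be unit-tested independently of the profiler binary.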
docs/FAILURE_CASES.md ADDED
@@ -0,0 +1,38 @@
 
 
1
+ # Failure Cases
2
+
3
+ This document records known failure modes with reproducible context.
4
+
5
+ ## FC-001: Inline PTX in CUDA Kernel
6
+
7
+ ### Why this matters
8
+ Kernels that embed inline PTX are a realistic migration boundary. hipify can translate CUDA APIs, but it cannot preserve NVIDIA-specific assembly semantics on AMD.
9
+
10
+ ### Original CUDA pattern (simplified)
11
+ ```cpp
12
+ __device__ __forceinline__ unsigned lane_id() {
13
+ unsigned lane;
14
+ asm volatile("mov.u32 %0, %%laneid;" : "=r"(lane));
15
+ return lane;
16
+ }
17
+ ```
18
+
19
+ ### Typical migration output
20
+ - CUDA runtime calls are translated.
21
+ - The inline PTX block is left unchanged, or is translated into code that is invalid for HIP compilation.
22
+
23
+ ### Observed failure mode
24
+ - Compile error under hipcc due to unsupported PTX instruction syntax.
25
+ - In some cases, compilation succeeds after manual edits, but semantics differ because lane-behavior assumptions are NVIDIA-specific.
26
+
27
+ ### Root cause
28
+ - Inline PTX is vendor-specific and outside mechanical translation scope.
29
+ - Warp-level assumptions in PTX often rely on 32-lane behavior and NVIDIA ISA details.
30
+
31
+ ### What is required to fix
32
+ 1. Replace inline PTX with HIP or portable intrinsics.
33
+ 2. Rework lane-level logic for wavefront-64 behavior where required.
34
+ 3. Add correctness tests for edge lanes and reduction boundaries.
35
+ 4. Re-profile after rewrite to confirm no occupancy regressions.
36
+
37
+ ### Trust note
38
+ This is a deliberate example of where ROCmPort AI should report risk rather than pretend full automation.
docs/JUDGE_MODE.md ADDED
@@ -0,0 +1,42 @@
 
 
1
+ # Judge Mode Walkthrough
2
+
3
+ Use this sequence during technical evaluation.
4
+
5
+ ## Goal
6
+ Make every claim falsifiable and easy to verify.
7
+
8
+ ## Flow
9
+ 1. Show raw CUDA input.
10
+ 2. Run baseline translation only (straight hipify output).
11
+ 3. Show baseline compile/profiler result.
12
+ 4. Run full ROCmPort AI loop.
13
+ 5. Show each agent event and decision.
14
+ 6. Compare final output against the declared baseline.
15
+ 7. Show one weak result (small gain or no gain) and explain why.
16
+
17
+ ## Baseline Policy
18
+ - Primary baseline: straight hipify output with minimal required compile edits.
19
+ - Never switch baselines mid-demo.
20
+ - Repeat baseline definition before showing speedup.
21
+
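Under this policy, a speedup claim is simply baseline time over optimized time; a minimal sketch using the matrix-multiply numbers from the benchmark table:

```python
# Speedup against the declared baseline (straight hipify output).
# Times are the matrix-multiply figures from BENCHMARKS.md.
baseline_ms = 12.4    # baseline HIP
optimized_ms = 9.5    # optimized ROCm output

speedup = baseline_ms / optimized_ms
print(f"{speedup:.2f}x")  # 1.31x
```

Restating the baseline next to the ratio keeps the number falsifiable: a judge can recompute it from the two raw timings.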
22
+ ## Required Artifacts
23
+ - CUDA source.
24
+ - Baseline HIP output.
25
+ - Optimized HIP output.
26
+ - Compile logs.
27
+ - Profiler summary.
28
+ - Final report with rationale.
29
+
30
+ ## Suggested Script
31
+ - "Here is the original CUDA kernel."
32
+ - "Here is baseline HIP produced by hipify only."
33
+ - "Now we run the orchestration loop and show each decision."
34
+ - "This is the final code diff and measured result versus baseline."
35
+ - "Here is a case where gain is limited, and why."
36
+
37
+ ## Pass/Fail Criteria
38
+ A demo is credible if:
39
+ - Baseline is explicit.
40
+ - Intermediate artifacts are visible.
41
+ - At least one non-win case is included.
42
+ - Reasoning matches observed profiler data.
frontend/index.html CHANGED
@@ -1,503 +1,1112 @@
1
  <!DOCTYPE html>
2
  <html lang="en">
 
3
  <head>
4
- <meta charset="UTF-8">
5
- <meta name="viewport" content="width=device-width, initial-scale=1.0">
6
- <title>ROCmPort AI</title>
7
- <link rel="preconnect" href="https://fonts.googleapis.com">
8
- <link href="https://fonts.googleapis.com/css2?family=JetBrains+Mono:wght@400;500&family=Space+Grotesk:wght@500;600;700&display=swap" rel="stylesheet">
9
- <style>
10
- :root {
11
- --bg: #030303;
12
- --s1: #0a0a0b;
13
- --s2: #121214;
14
- --s3: #1a1a1e;
15
- --b1: rgba(255, 255, 255, 0.08);
16
- --b2: rgba(255, 255, 255, 0.15);
17
- --red: #ff3344;
18
- --red-glow: rgba(255, 51, 68, 0.4);
19
- --green: #00ff88;
20
- --green-glow: rgba(0, 255, 136, 0.4);
21
- --yellow: #ffcc00;
22
- --cyan: #00d9ff;
23
- --muted: #88888e;
24
- --t1: #a1a1aa;
25
- --t2: #d4d4d8;
26
- --t3: #ffffff;
27
- --mono: 'JetBrains Mono', monospace;
28
- --sans: 'Space Grotesk', sans-serif;
29
- --spring: cubic-bezier(0.34, 1.56, 0.64, 1);
30
- }
31
-
32
- * { margin: 0; padding: 0; box-sizing: border-box; cursor: none !important; }
33
- .hide { display: none !important; }
34
-
35
- body {
36
- background: var(--bg);
37
- color: var(--t1);
38
- font-family: var(--sans);
39
- font-size: 14px;
40
- line-height: 1.6;
41
- overflow-x: hidden;
42
- min-height: 100vh;
43
- }
44
-
45
- /* Animated Gradient Background */
46
- body::before {
47
- content: '';
48
- position: fixed;
49
- inset: 0;
50
- background:
51
- radial-gradient(circle at 20% 30%, rgba(0, 217, 255, 0.05), transparent 40%),
52
- radial-gradient(circle at 80% 70%, rgba(255, 51, 68, 0.05), transparent 40%),
53
- radial-gradient(circle at 50% 50%, rgba(0, 255, 136, 0.03), transparent 60%);
54
- z-index: -1;
55
- animation: bgMove 20s ease-in-out infinite alternate;
56
- }
57
-
58
- @keyframes bgMove {
59
- 0% { transform: scale(1) translate(0, 0); }
60
- 50% { transform: scale(1.1) translate(20px, -20px); }
61
- 100% { transform: scale(1) translate(-20px, 20px); }
62
- }
63
-
64
- .w {
65
- max-width: 1200px;
66
- margin: 0 auto;
67
- padding: 32px 24px;
68
- position: relative;
69
- }
70
-
71
- /* Container Glow */
72
- .w::after {
73
- content: '';
74
- position: absolute;
75
- inset: 0;
76
- background: radial-gradient(circle at 50% 0%, rgba(255, 51, 68, 0.08), transparent 70%);
77
- pointer-events: none;
78
- z-index: -1;
79
- }
80
-
81
- header {
82
- padding-bottom: 24px;
83
- border-bottom: 1px solid var(--b1);
84
- display: flex;
85
- align-items: center;
86
- justify-content: space-between;
87
- margin-bottom: 24px;
88
- }
89
-
90
- .logo {
91
- font-weight: 700;
92
- font-size: 18px;
93
- color: var(--t3);
94
- letter-spacing: -0.02em;
95
- }
96
-
97
- .logo em {
98
- font-style: normal;
99
- color: var(--red);
100
- text-shadow: 0 0 15px var(--red-glow);
101
- }
102
-
103
- .hr {
104
- font-size: 12px;
105
- color: var(--muted);
106
- display: flex;
107
- align-items: center;
108
- gap: 10px;
109
- background: var(--s1);
110
- padding: 6px 12px;
111
- border-radius: 20px;
112
- border: 1px solid var(--b1);
113
- }
114
-
115
- .hd {
116
- width: 6px;
117
- height: 6px;
118
- border-radius: 50%;
119
- background: var(--green);
120
- box-shadow: 0 0 10px var(--green-glow);
121
- }
122
-
123
- .hd.on { animation: pulse 2s ease-in-out infinite; }
124
-
125
- @keyframes pulse {
126
- 0%, 100% { opacity: 1; transform: scale(1); }
127
- 50% { opacity: 0.4; transform: scale(0.8); }
128
- }
129
-
130
- .g {
131
- display: grid;
132
- grid-template-columns: 1.2fr 0.8fr;
133
- gap: 24px;
134
- padding: 0;
135
- }
136
-
137
- .fs { grid-column: 1 / -1; }
138
-
139
- @media (max-width: 900px) {
140
- .g { grid-template-columns: 1fr; }
141
- }
142
-
143
- /* Card Styling */
144
- .p {
145
- background: var(--s1);
146
- border: 1px solid var(--b1);
147
- border-radius: 12px;
148
- overflow: hidden;
149
- display: flex;
150
- flex-direction: column;
151
- box-shadow: 0 4px 20px rgba(0, 0, 0, 0.4);
152
- backdrop-filter: blur(10px);
153
- transition: transform 0.3s var(--spring), border-color 0.3s ease;
154
- }
155
-
156
- .p:hover {
157
- border-color: var(--b2);
158
- }
159
-
160
- .ph {
161
- padding: 12px 16px;
162
- border-bottom: 1px solid var(--b1);
163
- display: flex;
164
- align-items: center;
165
- justify-content: space-between;
166
- font-size: 12px;
167
- color: var(--muted);
168
- background: rgba(255, 255, 255, 0.02);
169
- }
170
-
171
- .ph b { color: var(--red); font-weight: 600; text-transform: uppercase; letter-spacing: 0.05em; }
172
-
173
- textarea.code {
174
- width: 100%;
175
- flex: 1;
176
- min-height: 300px;
177
- background: var(--bg);
178
- border: none;
179
- color: var(--t2);
180
- font-family: var(--mono);
181
- font-size: 13px;
182
- line-height: 1.7;
183
- padding: 20px;
184
- resize: vertical;
185
- outline: none;
186
- caret-color: var(--red);
187
- will-change: transform;
188
- }
189
-
190
- .db {
191
- padding: 12px 16px;
192
- border-top: 1px solid var(--b1);
193
- display: flex;
194
- align-items: center;
195
- gap: 8px;
196
- background: var(--s1);
197
- }
198
-
199
- .db .l { font-size: 11px; color: var(--muted); font-weight: 500; }
200
-
201
- .ch {
202
- font-family: var(--sans);
203
- font-size: 11px;
204
- padding: 4px 12px;
205
- background: var(--s2);
206
- border: 1px solid var(--b1);
207
- border-radius: 6px;
208
- color: var(--t1);
209
- cursor: pointer;
210
- transition: all 0.2s var(--spring);
211
- }
212
-
213
- .ch:hover {
214
- background: var(--s3);
215
- color: var(--t3);
216
- transform: translateY(-1px);
217
- border-color: var(--b2);
218
- }
219
-
220
- .ch.on {
221
- background: var(--red);
222
- border-color: var(--red);
223
- color: #fff;
224
- box-shadow: 0 0 15px var(--red-glow);
225
- }
226
-
227
- .bg {
228
- margin: 16px;
229
- padding: 14px;
230
- background: var(--red);
231
- border: none;
232
- border-radius: 8px;
233
- color: #fff;
234
- font-family: var(--sans);
235
- font-size: 14px;
236
- font-weight: 700;
237
- cursor: pointer;
238
- transition: all 0.3s var(--spring);
239
- text-transform: uppercase;
240
- letter-spacing: 0.05em;
241
- box-shadow: 0 4px 15px var(--red-glow);
242
- }
243
-
244
- .bg:hover {
245
- background: #ff4d5a;
246
- transform: translateY(-2px);
247
- box-shadow: 0 6px 20px var(--red-glow);
248
- }
249
-
250
- .bg:active { transform: translateY(0); }
251
-
252
- .bg:disabled {
253
- opacity: 0.4;
254
- cursor: not-allowed;
255
- transform: none;
256
- box-shadow: none;
257
- }
258
-
259
- /* Agent log */
260
- .al { padding: 12px; display: flex; flex-direction: column; gap: 8px; }
261
-
262
- .ar {
263
- padding: 12px 16px;
264
- border-radius: 8px;
265
- background: rgba(255, 255, 255, 0.03);
266
- border: 1px solid transparent;
267
- transition: all 0.4s var(--spring);
268
- animation: slideIn 0.5s var(--spring) forwards;
269
- opacity: 0;
270
- transform: translateX(20px);
271
- }
272
-
273
- @keyframes slideIn {
274
- to { opacity: 1; transform: translateX(0); }
275
- }
276
-
277
- .ar.run { border-color: var(--cyan); background: rgba(0, 217, 255, 0.05); }
278
- .ar.done { border-color: var(--green); background: rgba(0, 255, 136, 0.05); }
279
- .ar.fail { border-color: var(--red); background: rgba(255, 51, 68, 0.05); }
280
- .ar.retry {
281
- border-color: var(--yellow);
282
- background: rgba(255, 204, 0, 0.05);
283
- animation: pulse-border 1.5s ease-in-out infinite;
284
- }
285
-
286
- @keyframes pulse-border {
287
- 50% { border-color: rgba(255, 204, 0, 0.2); }
288
- }
289
-
290
- .at { display: flex; align-items: center; gap: 12px; }
291
- .an { font-size: 10px; font-weight: 700; color: var(--muted); min-width: 90px; text-transform: uppercase; letter-spacing: 0.1em; }
292
- .am { font-size: 13px; color: var(--t2); font-weight: 500; }
293
- .ad { font-size: 11px; color: var(--muted); margin-top: 4px; padding-left: 102px; white-space: pre-wrap; line-height: 1.6; max-height: 100px; overflow-y: auto; }
294
- .ad .w { color: var(--yellow); font-weight: 600; }
295
- .ad .g { color: var(--green); font-weight: 600; }
296
-
297
- /* Horizontal Timeline */
298
- .timeline {
299
- display: flex;
300
- justify-content: space-between;
301
- padding: 16px 20px;
302
- background: rgba(255, 255, 255, 0.02);
303
- border-bottom: 1px solid var(--b1);
304
- margin-bottom: 8px;
305
- }
306
-
307
- .node {
308
- display: flex;
309
- flex-direction: column;
310
- align-items: center;
311
- gap: 6px;
312
- position: relative;
313
- flex: 1;
314
- }
315
-
316
- .node::after {
317
- content: '';
318
- position: absolute;
319
- top: 12px;
320
- left: 50%;
321
- width: 100%;
322
- height: 2px;
323
- background: var(--b1);
324
- z-index: 0;
325
- }
326
-
327
- .node:last-child::after { display: none; }
328
-
329
- .ni {
330
- width: 24px;
331
- height: 24px;
332
- border-radius: 50%;
333
- background: var(--s3);
334
- border: 2px solid var(--b1);
335
- display: flex;
336
- align-items: center;
337
- justify-content: center;
338
- font-size: 12px;
339
- z-index: 1;
340
- transition: all 0.4s var(--spring);
341
- }
342
-
343
- .node.on .ni { background: var(--cyan); border-color: var(--cyan); color: #000; box-shadow: 0 0 15px var(--cyan); }
344
- .node.done .ni { background: var(--green); border-color: var(--green); color: #000; box-shadow: 0 0 15px var(--green); }
345
- .node.fail .ni { background: var(--red); border-color: var(--red); color: #fff; }
346
- .node.retry .ni { animation: pulse-node 1s var(--spring) infinite; background: var(--yellow); border-color: var(--yellow); }
347
-
348
- @keyframes pulse-node {
349
- 0%, 100% { transform: scale(1); }
350
- 50% { transform: scale(1.2); }
351
- }
352
-
353
- .nl { font-size: 9px; font-weight: 700; color: var(--muted); text-transform: uppercase; letter-spacing: 0.05em; }
354
- .node.on .nl, .node.done .nl { color: var(--t3); }
355
-
356
- /* Tabs */
357
- .tabs { display: flex; gap: 8px; }
358
- .tab {
359
- background: var(--s2);
360
- border: 1px solid var(--b1);
361
- padding: 6px 16px;
362
- border-radius: 8px;
363
- font-family: var(--sans);
364
- font-size: 12px;
365
- font-weight: 600;
366
- color: var(--muted);
367
- cursor: pointer;
368
- transition: all 0.2s var(--spring);
369
- }
370
-
371
- .tab:hover { color: var(--t2); background: var(--s3); }
372
- .tab.on { color: var(--t3); background: var(--red); border-color: var(--red); box-shadow: 0 0 10px var(--red-glow); }
373
-
374
- .tc { display: none; padding: 0; animation: fadeIn 0.4s ease; }
375
- .tc.on { display: block; }
376
-
377
- @keyframes fadeIn { from { opacity: 0; transform: translateY(10px); } to { opacity: 1; transform: translateY(0); } }
378
-
379
- /* Summary row */
380
- .sum-row { padding: 24px; display: flex; align-items: center; gap: 32px; flex-wrap: wrap; border-bottom: 1px solid var(--b1); background: rgba(0, 255, 136, 0.02); }
381
- .sum-big { font-size: 32px; font-weight: 800; color: var(--green); line-height: 1; letter-spacing: -0.02em; text-shadow: 0 0 20px var(--green-glow); }
382
- .sum-big .u { font-size: 13px; font-weight: 500; color: var(--muted); margin-left: 4px; display: block; margin-top: 4px; letter-spacing: 0; }
383
- .sum-big .vic { font-size: 11px; color: var(--cyan); font-weight: 600; display: block; margin-top: 8px; text-shadow: none; opacity: 0.8; }
384
- .sum-sep { width: 1px; height: 40px; background: var(--b1); }
385
- .sum-chk { display: flex; align-items: center; gap: 8px; font-size: 12px; color: var(--t2); font-weight: 500; }
386
- .sum-dot { width: 8px; height: 8px; border-radius: 50%; flex-shrink: 0; }
387
- .sum-dot.ok { background: var(--green); box-shadow: 0 0 8px var(--green-glow); }
388
- .sum-dot.no { background: var(--red); box-shadow: 0 0 8px var(--red-glow); }
389
- .sum-type { font-size: 11px; color: var(--cyan); text-transform: uppercase; letter-spacing: 0.1em; font-weight: 700; padding: 4px 10px; background: rgba(0, 217, 255, 0.1); border-radius: 4px; }
390
-
391
- .sum-bar { padding: 16px 24px; display: flex; align-items: center; gap: 12px; flex-wrap: wrap; border-bottom: 1px solid var(--b1); }
392
- .bs {
393
- font-family: var(--sans);
394
- font-size: 11px;
395
- font-weight: 700;
396
- padding: 8px 16px;
397
- border-radius: 8px;
398
- border: 1px solid var(--b1);
399
- background: var(--s2);
400
- color: var(--t2);
401
- cursor: pointer;
402
- transition: all 0.2s var(--spring);
403
- text-transform: uppercase;
404
- letter-spacing: 0.05em;
405
- }
406
-
407
- .bs:hover { border-color: var(--b2); transform: translateY(-1px); background: var(--s3); }
408
- .bs.r { background: var(--bg); border-color: var(--red); color: var(--red); }
409
- .bs.r:hover { background: var(--red); color: #fff; box-shadow: 0 4px 15px var(--red-glow); }
410
- .bs.gr { background: var(--green); border-color: var(--green); color: #000; }
411
- .bs.gr:hover { box-shadow: 0 4px 15px var(--green-glow); transform: translateY(-2px); }
412
- .sp { flex: 1; }
413
-
414
- /* Details tab */
415
- .dm { display: grid; grid-template-columns: repeat(5, 1fr); border-bottom: 1px solid var(--b1); }
416
- @media (max-width: 800px) { .dm { grid-template-columns: repeat(2, 1fr); } }
417
- .di { padding: 20px; border-right: 1px solid var(--b1); background: rgba(255, 255, 255, 0.01); }
418
- .di:last-child { border-right: none; }
419
- .dl { font-size: 10px; color: var(--muted); text-transform: uppercase; letter-spacing: 0.1em; margin-bottom: 8px; font-weight: 700; }
420
- .dv { font-size: 20px; font-weight: 800; line-height: 1; margin-bottom: 4px; color: var(--t3); }
421
- .dv.g { color: var(--green); }
422
- .dv.c { color: var(--cyan); }
423
- .dv.y { color: var(--yellow); }
424
- .dv.t { color: var(--t2); font-size: 13px; }
425
- .ds { font-size: 10px; color: var(--muted); line-height: 1.4; }
426
-
427
- /* Benchmark bars */
428
- .bk { padding: 24px; border-bottom: 1px solid var(--b1); }
429
- .bk-t { font-size: 11px; color: var(--muted); text-transform: uppercase; letter-spacing: 0.1em; margin-bottom: 16px; font-weight: 700; }
430
- .br { display: flex; align-items: center; gap: 16px; margin-bottom: 12px; }
431
- .br:last-child { margin-bottom: 0; }
432
- .bl { font-size: 12px; color: var(--t2); width: 140px; flex-shrink: 0; font-weight: 500; }
433
- .bt { flex: 1; height: 8px; background: var(--bg); border-radius: 4px; overflow: hidden; border: 1px solid var(--b1); }
434
- .bf { height: 100%; border-radius: 4px; transition: width 1s var(--spring); width: 0; }
435
- .bf.bad { background: linear-gradient(90deg, #ff334466, #ff3344); box-shadow: 0 0 10px rgba(255, 51, 68, 0.3); }
436
- .bf.good { background: linear-gradient(90deg, #00ff8866, #00ff88); box-shadow: 0 0 10px rgba(0, 255, 136, 0.3); }
437
- .bv { font-size: 12px; font-weight: 700; width: 40px; text-align: right; flex-shrink: 0; }
438
- .bv.bad { color: var(--red); }
439
- .bv.good { color: var(--green); }
440
-
441
- /* Simple mode note */
442
- .sn { padding: 20px; border: 1px solid var(--cyan); border-radius: 12px; background: rgba(0, 217, 255, 0.05); margin: 24px; font-size: 13px; color: var(--t2); line-height: 1.6; border-left-width: 4px; }
443
-
444
- /* Diff */
445
- .dg { display: grid; grid-template-columns: 1fr 1fr; background: var(--bg); }
446
- @media (max-width: 780px) { .dg { grid-template-columns: 1fr; } .dfs:first-child { border-right: none !important; border-bottom: 1px solid var(--b1); } }
447
- .dfs:first-child { border-right: 1px solid var(--b1); }
448
- .dfh { padding: 10px 16px; border-bottom: 1px solid var(--b1); font-size: 11px; color: var(--muted); display: flex; align-items: center; gap: 8px; font-weight: 600; background: var(--s2); }
449
- .dft { font-size: 9px; font-weight: 800; padding: 2px 6px; border-radius: 4px; text-transform: uppercase; }
450
- .dft.cu { background: rgba(255, 51, 68, 0.2); color: var(--red); }
451
- .dft.ro { background: rgba(0, 255, 136, 0.2); color: var(--green); }
452
- .dfp { padding: 20px; font-family: var(--mono); font-size: 12px; line-height: 1.7; overflow: auto; max-height: 500px; white-space: pre; color: var(--t2); }
453
- .dlo { background: rgba(255, 51, 68, 0.1); color: var(--red); text-decoration: line-through; display: block; width: 100%; }
454
- .dln { background: rgba(0, 255, 136, 0.1); color: var(--green); display: block; width: 100%; }
455
-
456
- /* Loading Skeleton */
457
- .skeleton { position: relative; overflow: hidden; background: var(--s2); border-radius: 12px; height: 200px; margin-top: 24px; }
458
- .skeleton::after { content: ''; position: absolute; inset: 0; transform: translateX(-100%); background: linear-gradient(90deg, transparent, rgba(255,255,255,0.05), transparent); animation: shimmer 1.5s infinite; }
459
- @keyframes shimmer { 100% { transform: translateX(100%); } }
460
-
461
- /* Custom Cursor */
462
- #cursor {
463
- position: fixed;
464
- width: 20px;
465
- height: 20px;
466
- background: rgba(255, 255, 255, 0.2);
467
- border: 1px solid rgba(255, 255, 255, 0.4);
468
- border-radius: 50%;
469
- pointer-events: none;
470
- z-index: 9999;
471
- transition: transform 0.1s ease, width 0.3s var(--spring), height 0.3s var(--spring), background 0.3s ease;
472
- mix-blend-mode: difference;
473
- }
474
-
475
- #cursor.active { transform: scale(3); background: rgba(255, 51, 68, 0.3); border-color: var(--red); }
476
-
477
- /* Modal */
478
- .mo { display: none; position: fixed; inset: 0; background: rgba(0, 0, 0, 0.85); z-index: 1000; place-items: center; backdrop-filter: blur(8px); }
479
- .mo.open { display: grid; }
480
- .mb { background: var(--s1); border: 1px solid var(--b1); border-radius: 16px; width: 90%; max-width: 800px; max-height: 90vh; overflow: hidden; box-shadow: 0 20px 50px rgba(0, 0, 0, 0.6); }
481
- .mt { padding: 16px 24px; border-bottom: 1px solid var(--b1); display: flex; justify-content: space-between; align-items: center; background: var(--s2); }
482
- .mt h3 { font-size: 16px; color: var(--t3); font-weight: 700; }
483
- .mx { background: none; border: none; color: var(--muted); font-size: 24px; cursor: pointer !important; line-height: 1; transition: color 0.2s; }
484
- .mx:hover { color: var(--t3); }
485
- .mc { padding: 24px; }
486
- .mc textarea { width: 100%; height: 400px; background: var(--bg); border: 1px solid var(--b1); border-radius: 8px; padding: 16px; color: var(--cyan); font-family: var(--mono); font-size: 12px; line-height: 1.6; resize: vertical; outline: none; }
487
- .mc textarea:focus { border-color: var(--cyan); box-shadow: 0 0 10px rgba(0, 217, 255, 0.2); }
488
- .mf { padding: 16px 24px; border-top: 1px solid var(--b1); display: flex; justify-content: flex-end; gap: 12px; background: var(--s2); }
489
-
490
- ::-webkit-scrollbar { width: 6px; height: 6px; }
491
- ::-webkit-scrollbar-track { background: transparent; }
492
- ::-webkit-scrollbar-thumb { background: var(--b1); border-radius: 10px; }
493
- ::-webkit-scrollbar-thumb:hover { background: var(--b2); }
494
-
495
- footer { padding: 32px 0; border-top: 1px solid var(--b1); display: flex; justify-content: space-between; font-size: 11px; color: var(--muted); font-weight: 500; }
496
- footer a { color: var(--muted); text-decoration: none; transition: color 0.2s; border-bottom: 1px solid transparent; }
497
- footer a:hover { color: var(--t2); border-bottom-color: var(--muted); }
498
-
499
- .idle { flex: 1; display: flex; align-items: center; justify-content: center; color: var(--b2); font-size: 13px; font-weight: 500; min-height: 100px; }
500
- </style>
 
 
 
 
 
 
 
501
  </head>
502
  <div id="cursor"></div>
503
 
@@ -506,13 +1115,16 @@ footer a:hover { color: var(--t2); border-bottom-color: var(--muted); }
506
  <div class="logo">ROCmPort <em>AI</em></div>
507
  <div class="hr">
508
  <div class="hd on" id="hdot"></div>
509
- <span id="hstat">⚡ Armed and waiting</span>
510
  </div>
511
  </header>
512
 
513
  <div class="g">
514
  <div class="p">
515
- <div class="ph"><div><b>//</b> CUDA source</div><div id="lc">0 lines</div></div>
 
 
 
516
  <textarea class="code" id="inp" spellcheck="false" placeholder="// Paste CUDA code here
517
  // or pick a demo below
518
 
@@ -531,7 +1143,10 @@ __global__ void kernel(float* A, float* B, int N) {
531
  </div>
532
 
533
  <div class="p">
534
- <div class="ph"><div><b>//</b> Pipeline</div><div id="pt">0.0s</div></div>
 
 
 
535
  <div class="timeline" id="tl">
536
  <!-- Nodes injected by JS -->
537
  </div>
@@ -561,243 +1176,247 @@ __global__ void kernel(float* A, float* B, int N) {
561
 
562
  <footer>
563
  <div>ROCmPort AI — AMD Developer Hackathon 2025</div>
564
- <div><a href="https://x.com/TazwarEnan" target="_blank">Tazwar Ahnaf Enan</a> · <a href="https://github.com/tazwaryayyyy" target="_blank">GitHub</a></div>
 
565
  </footer>
566
  </div>

  <div class="mo" id="modal">
  <div class="mb">
- <div class="mt"><h3>Edit ROCm code</h3><button class="mx" onclick="cm()">&times;</button></div>
  <div class="mc"><textarea id="edt"></textarea></div>
- <div class="mf"><button class="bs" onclick="cm()">Cancel</button><button class="bs r" onclick="rec()">Re-test</button></div>
  </div>
  </div>
  <script>
- const API = 'http://localhost:8000';
- const S = { code: '', kn: 'custom', run: false, t0: null, iv: null, rep: null, tl: [], kernels: {} };
- const AG = {
- analyzer: { n: 'ANALYZER', i: '🔍' },
- translator: { n: 'TRANSLATOR', i: '🔄' },
- optimizer: { n: 'OPTIMIZER', i: '⚡' },
- tester: { n: 'TESTER', i: '🧪' },
- coordinator: { n: 'COORDINATOR', i: '📋' }
- };
-
- // Custom Cursor Logic
- const cur = document.getElementById('cursor');
- document.addEventListener('mousemove', (e) => {
- cur.style.left = e.clientX + 'px';
- cur.style.top = e.clientY + 'px';
- const target = e.target;
- const isClickable = target.onclick ||
- target.tagName === 'BUTTON' ||
- target.tagName === 'A' ||
- target.tagName === 'TEXTAREA' ||
- target.classList.contains('ch') ||
- target.classList.contains('tab');
-
- if (isClickable) {
- cur.classList.add('active');
- if (target.id === 'go') cur.style.background = 'rgba(255, 51, 68, 0.5)';
- else cur.style.background = 'rgba(255, 255, 255, 0.3)';
- } else {
- cur.classList.remove('active');
- cur.style.background = 'rgba(255, 255, 255, 0.2)';
- }
- });
-
- async function init() {
- const ta = document.getElementById('inp');
- ta.oninput = () => {
- document.getElementById('lc').textContent = ta.value.split('\n').length + ' lines';
- S.code = ta.value;
  };
- try {
- const r = await fetch(API + '/demo-kernels');
- S.kernels = await r.json();
- } catch (e) { S.kernels = FB; }
- }
-
- function lk(n, btn) {
- document.querySelectorAll('.ch').forEach(c => c.classList.remove('on'));
- btn.classList.add('on');
- const code = S.kernels[n] || FB[n] || '', ta = document.getElementById('inp');
- ta.value = code; S.code = code; S.kn = n;
- document.getElementById('lc').textContent = code.split('\n').length + ' lines';
- }
-
- function stab(id, btn) {
- document.querySelectorAll('.tab').forEach(t => t.classList.remove('on'));
- document.querySelectorAll('.tc').forEach(t => t.classList.remove('on'));
- btn.classList.add('on');
- document.getElementById('t-' + id).classList.add('on');
- if (id === 'diff' && S.rep) rDiff(S.code, S.rep.optimized_code);
- }
-
- async function go() {
- if (S.run) return;
- const code = document.getElementById('inp').value.trim();
- if (!code) return;
-
- S.code = code; S.run = true; S.t0 = Date.now(); S.tl = [];
- const btn = document.getElementById('go');
- btn.disabled = true;
- btn.textContent = 'Awaiting Agents...';
-
- document.getElementById('hstat').textContent = '🤖 Agents thinking...';
- document.getElementById('rp').classList.add('hide');
-
- bLog();
- sTimer();
-
- try {
- const simpleModeCheckbox = document.getElementById('sm');
- const res = await fetch(API + '/port', {
- method: 'POST',
- headers: { 'Content-Type': 'application/json' },
- body: JSON.stringify({
- cuda_code: code,
- kernel_name: S.kn,
- simple_mode: simpleModeCheckbox ? simpleModeCheckbox.checked : false
- })
- });
-
- // Show results panel with loader immediately
- document.getElementById('rp').classList.remove('hide');
- document.getElementById('t-loader').classList.remove('hide');
- document.getElementById('t-sum').classList.remove('on');
- document.getElementById('t-diff').classList.remove('on');
- document.getElementById('t-det').classList.remove('on');
-
- const rd = res.body.getReader(), dc = new TextDecoder();
- let buf = '';
- while (true) {
- const { done, value } = await rd.read();
- if (done) break;
- buf += dc.decode(value, { stream: true });
- const lines = buf.split('\n');
- buf = lines.pop();
- for (const ln of lines) {
- if (!ln.startsWith('data: ')) continue;
- const raw = ln.slice(6).trim();
- if (raw === '[DONE]') { done_(); break; }
- try { hEvt(JSON.parse(raw)); } catch (e) { console.error('Parse error:', e); }
- }
  }
- } catch (e) {
- document.getElementById('hstat').textContent = '⚠️ Agent failure';
- document.getElementById('t-loader').classList.add('hide'); // Hide loader on error
- console.error(e);
- } finally {
- xTimer();
- S.run = false;
- btn.disabled = false;
- btn.textContent = 'Port to ROCm';
- document.getElementById('t-loader').classList.add('hide');
  }
- }
-
- function hEvt(ev) {
- uLog(ev.agent, ev.status, ev.message, ev.detail);
- if (ev.agent === 'tester' && (ev.status === 'done' || ev.status === 'failed')) {
- const m = ev.message.match(/([\d.]+)x/);
- if (m) {
- const sp = parseFloat(m[1]), ok = sp >= 1, im = ev.message.match(/Iteration (\d+)/i);
- S.tl.push({
- label: 'Iteration ' + (im ? im[1] : S.tl.length + 1) + (ok ? ' (optimized)' : ' (baseline)'),
- speedup: sp,
- good: ok
  });
  }
  }
- if (ev.agent === 'coordinator' && ev.status === 'done' && ev.detail) {
- try {
- const r = JSON.parse(ev.detail);
- S.rep = r;
- rRes(r, S.tl);
- } catch (e) { console.error('Coordinator detail parse error:', e); }
  }
- }

- function done_() {
- document.getElementById('hstat').textContent = ' Migration complete';
- document.getElementById('t-loader').classList.add('hide');
- if (!S.rep) {
- document.getElementById('t-sum').innerHTML = '<div class="idle">Migration finished but no report was generated. Check agent logs for details.</div>';
- document.getElementById('t-sum').classList.add('on');
  }
- }
-
- function bLog() {
- const el = document.getElementById('al');
- const tl = document.getElementById('tl');
- el.innerHTML = '';
- tl.innerHTML = '';
-
- let i = 0;
- for (const [k, obj] of Object.entries(AG)) {
- // Log row
- const d = document.createElement('div');
- d.className = 'ar';
- d.id = 'ar-' + k;
- d.style.animationDelay = (i * 0.1) + 's';
- d.innerHTML = `
  <div class="at">
  <span class="an">${obj.n}</span>
  <span class="am" id="am-${k}">Waiting</span>
  </div>
  <div class="ad" id="ad-${k}"></div>`;
- el.appendChild(d);
-
- // Timeline node
- const n = document.createElement('div');
- n.className = 'node';
- n.id = 'nd-' + k;
- n.title = obj.n;
- n.innerHTML = `<div class="ni">${obj.i}</div><div class="nl">${obj.n.slice(0,3)}</div>`;
- tl.appendChild(n);
- i++;
  }
- }
-
- function uLog(a, s, m, d) {
- const row = document.getElementById('ar-' + a);
- const node = document.getElementById('nd-' + a);
- if (!row || !node) return;
-
- const statusClass = { running: 'run', done: 'done', failed: 'fail', retrying: 'retry' }[s] || '';
- row.className = 'ar ' + statusClass;
- node.className = 'node ' + (s === 'running' ? 'on' : s === 'retrying' ? 'retry' : s === 'done' ? 'done' : s === 'failed' ? 'fail' : '');
-
- const me = document.getElementById('am-' + a);
- if (me) me.textContent = m;
-
- // Node tooltip message update
- node.title = m;
-
- const de = document.getElementById('ad-' + a);
- if (de && d) {
- de.innerHTML = esc(d)
- .replace(/\u26a0\ufe0f([^\n]*)/g, '<span class="w">⚠️ $1</span>')
- .replace(/\u2705([^\n]*)/g, '<span class="g">✅ $1</span>');
- de.scrollTop = de.scrollHeight;
  }
- }

- function rRes(r, tl) {
- // Hide loader, show summary
- document.getElementById('t-loader').classList.add('hide');
- document.getElementById('t-sum').classList.add('on');
-
- const v = r.verification || {}, bw = r.bandwidth_utilized;
- const dot = ok => `<div class="sum-dot ${ok === false ? 'no' : 'ok'}"></div>`;

- document.getElementById('t-sum').innerHTML = `
  <div class="sum-row">
  <div class="sum-big">
  ${r.speedup}x
  <span class="u">vs baseline hipify</span>
- <span class="vic">🎯 Your code is now an AMD champion.</span>
  </div>
  <div class="sum-sep"></div>
  <div>
@@ -819,105 +1438,106 @@ function rRes(r, tl) {
  ${r.simplified_explanation ? esc(r.simplified_explanation) : '<em>Simplified explanation will appear here</em>'}
  </div>`;

- // Details tab
- let dh = `<div class="dm">
  <div class="di"><div class="dl">Speedup</div><div class="dv g">${r.speedup}x</div><div class="ds">optimized ROCm vs straight hipify output</div></div>
  <div class="di"><div class="dl">Bandwidth</div><div class="dv c">${bw != null ? bw.toFixed(1) : '—'}%</div><div class="ds">of MI300X 5.3 TB/s HBM3</div></div>
  <div class="di"><div class="dl">Changes</div><div class="dv y">${r.total_changes}</div><div class="ds">hipify + LLM + optimizer changes</div></div>
  <div class="di"><div class="dl">Iterations</div><div class="dv c">${r.iterations || 1}</div><div class="ds">optimizer retry loop count</div></div>
  <div class="di"><div class="dl">Type</div><div class="dv t">${(r.bottleneck || '—').toUpperCase()}</div><div class="ds">workload classification</div></div>
  </div>`;
-
- if (tl.length) {
- dh += '<div class="bk"><div class="bk-t">Benchmark iterations (optimized vs baseline hipify)</div>';
- tl.forEach(d => {
- const pct = Math.min(Math.max((d.speedup / 2) * 100, 3), 95);
- dh += `<div class="br">
  <div class="bl">${esc(d.label)}</div>
  <div class="bt"><div class="bf ${d.good ? 'good' : 'bad'}" style="width: 0" data-w="${pct}%"></div></div>
  <div class="bv ${d.good ? 'good' : 'bad'}">${d.speedup}x</div>
  </div>`;
- });
- dh += '</div>';
  }
-
- document.getElementById('t-det').innerHTML = dh;
- tsm(); // Ensure simple note visibility matches current toggle state
-
- // Progress bar animation
- setTimeout(() => {
- document.querySelectorAll('.bf[data-w]').forEach(b => {
- b.style.width = b.dataset.w;
- });
- }, 100);
- }
-
- function rDiff(o, n) {
- if (!o || !n) return;
- const oe = document.getElementById('d-o'), ne = document.getElementById('d-n');
- if (oe && oe.innerHTML && ne && ne.innerHTML) return; // Already rendered
-
- document.getElementById('t-diff').innerHTML = `<div class="dg">
  <div class="dfs"><div class="dfh"><span class="dft cu">CUDA</span> Original Source</div><pre class="dfp" id="d-o"></pre></div>
  <div class="dfs"><div class="dfh"><span class="dft ro">ROCm</span> Optimized HIP</div><pre class="dfp" id="d-n"></pre></div>
  </div>`;
-
- const oL = o.split('\n'), nL = n.split('\n'), mx = Math.max(oL.length, nL.length);
- let oH = '', nH = '';
- for (let i = 0; i < mx; i++) {
- const a = oL[i] ?? '', b = nL[i] ?? '', c = a !== b;
- oH += `<span class="${c ? 'dlo' : ''}">${esc(a)}\n</span>`;
- nH += `<span class="${c ? 'dln' : ''}">${esc(b)}\n</span>`;
  }
- document.getElementById('d-o').innerHTML = oH;
- document.getElementById('d-n').innerHTML = nH;
- }
-
- function sTimer() { S.iv = setInterval(() => { document.getElementById('pt').textContent = ((Date.now() - S.t0) / 1000).toFixed(1) + 's' }, 100) }
- function xTimer() { clearInterval(S.iv) }
-
- function dlR() {
- const r = S.rep; if (!r) return;
- const md = `# ROCmPort AI — Migration Report\n\n## Results\n- **Speedup**: ${r.speedup}x\n- **Bandwidth**: ${r.bandwidth_utilized ? r.bandwidth_utilized.toFixed(1) : '—'}%\n- **Changes**: ${r.total_changes}\n- **Iterations**: ${r.iterations}\n- **Type**: ${r.bottleneck}\n\n${r.amd_advantage_explanation ? '> ' + r.amd_advantage_explanation + '\n\n' : ''}${r.cost_estimate ? '## Cost Impact\n- Manual: ' + r.cost_estimate.manual_porting_weeks + '\n- ROCmPort: ' + r.cost_estimate.rocmport_minutes + '\n- Savings: ' + r.cost_estimate.estimated_savings + '\n\n' : ''}## ROCm/HIP Code\n\`\`\`cpp\n${r.optimized_code || ''}\n\`\`\`\n\n---\n*Generated by ROCmPort AI*\n`;
- const a = document.createElement('a'); a.href = URL.createObjectURL(new Blob([md], { type: 'text/markdown' })); a.download = 'rocmport-migration-report.md'; a.click();
- }
-
- function om() { if (!S.rep) return alert('No results yet!'); document.getElementById('edt').value = S.rep?.optimized_code || ''; document.getElementById('modal').classList.add('open') }
- function cm() { document.getElementById('modal').classList.remove('open') }
-
- async function rec() {
- const code = document.getElementById('edt').value.trim(); if (!code) return;
- try {
- const res = await fetch(API + '/recompile', { method: 'POST', headers: { 'Content-Type': 'application/json' }, body: JSON.stringify({ edited_code: code, kernel_name: S.kn }) });
- const r = await res.json();
- if (r.success) { cm(); if (r.result) rRes(r.result, S.tl); }
- else alert('Failed: ' + (r.detail || 'Unknown'))
- } catch (e) { alert('Error: ' + e.message) }
- }
-
- async function exM() {
- if (!S.rep) return;
- try {
- const res = await fetch(API + '/export', { method: 'POST', headers: { 'Content-Type': 'application/json' }, body: JSON.stringify({ original_cuda: S.code, final_rocm: S.rep.optimized_code, migration_report: S.rep }) });
- if (res.ok) { const a = document.createElement('a'); a.href = URL.createObjectURL(await res.blob()); a.download = 'rocmport-migration.zip'; a.click() }
- } catch (e) { alert('Export error') }
- }
-
- function tsm() {
- const sn = document.getElementById('sn');
- if (sn) sn.classList.remove('hide');
- }
-
- function esc(s) { return String(s ?? '').replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;') }
-
- const FB = {
- vector_add: `#include <cuda_runtime.h>\n\n__global__ void vector_add_kernel(float* A, float* B, float* C, int N) {\n int idx = blockIdx.x * blockDim.x + threadIdx.x;\n if (idx < N) {\n C[idx] = A[idx] + B[idx];\n }\n}\n\nint main() {\n int N = 1 << 24;\n size_t size = N * sizeof(float);\n float *d_A, *d_B, *d_C;\n cudaMalloc(&d_A, size);\n cudaMalloc(&d_B, size);\n cudaMalloc(&d_C, size);\n int threads = 128;\n int blocks = (N + threads - 1) / threads;\n vector_add_kernel<<<blocks, threads>>>(d_A, d_B, d_C, N);\n cudaDeviceSynchronize();\n cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);\n return 0;\n}`,
- matrix_multiply: `#include <cuda_runtime.h>\n#define WARP_SIZE 32\n\n__global__ void matmul_kernel(float* A, float* B, float* C, int N) {\n int row = blockIdx.y * blockDim.y + threadIdx.y;\n int col = blockIdx.x * blockDim.x + threadIdx.x;\n float sum = 0.0f;\n if (row < N && col < N) {\n for (int k = 0; k < N; k++)\n sum += A[row * N + k] * B[k * N + col];\n C[row * N + col] = sum;\n }\n}\n\n__global__ void warp_reduce(float* data, float* result, int N) {\n int tid = threadIdx.x;\n extern __shared__ float sdata[];\n sdata[tid] = (tid < N) ? data[tid] : 0;\n __syncthreads();\n for (int s = WARP_SIZE/2; s > 0; s >>= 1) {\n if (tid < s) sdata[tid] += sdata[tid + s];\n __syncthreads();\n }\n if (tid == 0) result[blockIdx.x] = sdata[0];\n}\n\nint main() {\n int N = 1024;\n size_t size = N * N * sizeof(float);\n float *d_A, *d_B, *d_C;\n cudaMalloc(&d_A, size);\n cudaMalloc(&d_B, size);\n cudaMalloc(&d_C, size);\n dim3 block(16, 16);\n dim3 grid((N+15)/16, (N+15)/16);\n matmul_kernel<<<grid, block>>>(d_A, d_B, d_C, N);\n cudaDeviceSynchronize();\n cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);\n return 0;\n}`,
- convolution_2d: `#include <cuda_runtime.h>\n#define BLOCK_SIZE 16\n\n__global__ void conv2d_kernel(\n float* input, float* kernel, float* output,\n int width, int height\n) {\n int x = blockIdx.x * blockDim.x + threadIdx.x;\n int y = blockIdx.y * blockDim.y + threadIdx.y;\n if (x >= width || y >= height) return;\n float sum = 0.0f;\n for (int ky = -1; ky <= 1; ky++) {\n for (int kx = -1; kx <= 1; kx++) {\n int ix = x + kx, iy = y + ky;\n if (ix >= 0 && ix < width && iy >= 0 && iy < height)\n sum += input[iy * width + ix] * kernel[(ky+1)*3 + (kx+1)];\n }\n }\n output[y * width + x] = sum;\n}\n\nint main() {\n int W = 2048, H = 2048;\n float *d_in, *d_ker, *d_out;\n cudaMalloc(&d_in, W*H*sizeof(float));\n cudaMalloc(&d_ker, 9*sizeof(float));\n cudaMalloc(&d_out, W*H*sizeof(float));\n dim3 block(BLOCK_SIZE, BLOCK_SIZE);\n dim3 grid((W+BLOCK_SIZE-1)/BLOCK_SIZE, (H+BLOCK_SIZE-1)/BLOCK_SIZE);\n conv2d_kernel<<<grid, block>>>(d_in, d_ker, d_out, W, H);\n cudaDeviceSynchronize();\n cudaFree(d_in); cudaFree(d_ker); cudaFree(d_out);\n return 0;\n}`,
- reduction: `#include <cuda_runtime.h>\n#include <stdio.h>\n#include <iostream>\n#include <vector>\n#include <numeric>\n\n// Tree-based reduction kernel\n__global__ void reduction_kernel(float* g_idata, float* g_odata, unsigned int n) {\n extern __shared__ float sdata[];\n unsigned int tid = threadIdx.x;\n unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;\n\n float mySum = (i < n) ? g_idata[i] : 0;\n if (i + blockDim.x < n) mySum += g_idata[i + blockDim.x];\n sdata[tid] = mySum;\n __syncthreads();\n\n for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {\n if (tid < s) sdata[tid] = mySum = mySum + sdata[tid + s];\n __syncthreads();\n }\n\n // DELIBERATE WARP-SIZE BUG: Unroll to 32 instead of 64\n if (tid < 32) {\n volatile float* vsmem = sdata;\n vsmem[tid] = mySum = mySum + vsmem[tid + 32];\n vsmem[tid] = mySum = mySum + vsmem[tid + 16];\n vsmem[tid] = mySum = mySum + vsmem[tid + 8];\n vsmem[tid] = mySum = mySum + vsmem[tid + 4];\n vsmem[tid] = mySum = mySum + vsmem[tid + 2];\n vsmem[tid] = mySum = mySum + vsmem[tid + 1];\n }\n\n if (tid == 0) g_odata[blockIdx.x] = sdata[0];\n}\n\nint main() {\n const int N = 1048576;\n // ... Host code for Parallel Reduction demo\n printf("Parallel Reduction demo loaded.\\n");\n return 0;\n}`
- };
-
- init();
  </script>
  </body>
  </html>
 
  <!DOCTYPE html>
  <html lang="en">
+
  <head>
+ <meta charset="UTF-8">
+ <meta name="viewport" content="width=device-width, initial-scale=1.0">
+ <title>ROCmPort AI</title>
+ <link rel="preconnect" href="https://fonts.googleapis.com">
+ <link
+ href="https://fonts.googleapis.com/css2?family=JetBrains+Mono:wght@400;500&family=Space+Grotesk:wght@500;600;700&display=swap"
+ rel="stylesheet">
+ <style>
+ :root {
+ --bg: #030303;
+ --s1: #0a0a0b;
+ --s2: #121214;
+ --s3: #1a1a1e;
+ --b1: rgba(255, 255, 255, 0.08);
+ --b2: rgba(255, 255, 255, 0.15);
+ --red: #ff3344;
+ --red-glow: rgba(255, 51, 68, 0.4);
+ --green: #00ff88;
+ --green-glow: rgba(0, 255, 136, 0.4);
+ --yellow: #ffcc00;
+ --cyan: #00d9ff;
+ --muted: #88888e;
+ --t1: #a1a1aa;
+ --t2: #d4d4d8;
+ --t3: #ffffff;
+ --mono: 'JetBrains Mono', monospace;
+ --sans: 'Space Grotesk', sans-serif;
+ --spring: cubic-bezier(0.34, 1.56, 0.64, 1);
+ }
+
+ * {
+ margin: 0;
+ padding: 0;
+ box-sizing: border-box;
+ cursor: none !important;
+ }
+
+ .hide {
+ display: none !important;
+ }
+
+ body {
+ background: var(--bg);
+ color: var(--t1);
+ font-family: var(--sans);
+ font-size: 14px;
+ line-height: 1.6;
+ overflow-x: hidden;
+ min-height: 100vh;
+ }
+
+ /* Animated Gradient Background */
+ body::before {
+ content: '';
+ position: fixed;
+ inset: 0;
+ background:
+ radial-gradient(circle at 20% 30%, rgba(0, 217, 255, 0.05), transparent 40%),
+ radial-gradient(circle at 80% 70%, rgba(255, 51, 68, 0.05), transparent 40%),
+ radial-gradient(circle at 50% 50%, rgba(0, 255, 136, 0.03), transparent 60%);
+ z-index: -1;
+ animation: bgMove 20s ease-in-out infinite alternate;
+ }
+
+ @keyframes bgMove {
+ 0% {
+ transform: scale(1) translate(0, 0);
+ }
+
+ 50% {
+ transform: scale(1.1) translate(20px, -20px);
+ }
+
+ 100% {
+ transform: scale(1) translate(-20px, 20px);
+ }
+ }
+
+ .w {
+ max-width: 1200px;
+ margin: 0 auto;
+ padding: 32px 24px;
+ position: relative;
+ }
+
+ /* Container Glow */
+ .w::after {
+ content: '';
+ position: absolute;
+ inset: 0;
+ background: radial-gradient(circle at 50% 0%, rgba(255, 51, 68, 0.08), transparent 70%);
+ pointer-events: none;
+ z-index: -1;
+ }
+
+ header {
+ padding-bottom: 24px;
+ border-bottom: 1px solid var(--b1);
+ display: flex;
+ align-items: center;
+ justify-content: space-between;
+ margin-bottom: 24px;
+ }
+
+ .logo {
+ font-weight: 700;
+ font-size: 18px;
+ color: var(--t3);
+ letter-spacing: -0.02em;
+ }
+
+ .logo em {
+ font-style: normal;
+ color: var(--red);
+ text-shadow: 0 0 15px var(--red-glow);
+ }
+
+ .hr {
+ font-size: 12px;
+ color: var(--muted);
+ display: flex;
+ align-items: center;
+ gap: 10px;
+ background: var(--s1);
+ padding: 6px 12px;
+ border-radius: 20px;
+ border: 1px solid var(--b1);
+ }
+
+ .hd {
+ width: 6px;
+ height: 6px;
+ border-radius: 50%;
+ background: var(--green);
+ box-shadow: 0 0 10px var(--green-glow);
+ }
+
+ .hd.on {
+ animation: pulse 2s ease-in-out infinite;
+ }
+
+ @keyframes pulse {
+
+ 0%,
+ 100% {
+ opacity: 1;
+ transform: scale(1);
+ }
+
+ 50% {
+ opacity: 0.4;
+ transform: scale(0.8);
+ }
+ }
+
+ .g {
+ display: grid;
+ grid-template-columns: 1.2fr 0.8fr;
+ gap: 24px;
+ padding: 0;
+ }
+
+ .fs {
+ grid-column: 1 / -1;
+ }
+
+ @media (max-width: 900px) {
+ .g {
+ grid-template-columns: 1fr;
+ }
+ }
+
+ /* Card Styling */
+ .p {
+ background: var(--s1);
+ border: 1px solid var(--b1);
+ border-radius: 12px;
+ overflow: hidden;
+ display: flex;
+ flex-direction: column;
+ box-shadow: 0 4px 20px rgba(0, 0, 0, 0.4);
+ backdrop-filter: blur(10px);
+ transition: transform 0.3s var(--spring), border-color 0.3s ease;
+ }
+
+ .p:hover {
+ border-color: var(--b2);
+ }
+
+ .ph {
+ padding: 12px 16px;
+ border-bottom: 1px solid var(--b1);
+ display: flex;
+ align-items: center;
+ justify-content: space-between;
+ font-size: 12px;
+ color: var(--muted);
+ background: rgba(255, 255, 255, 0.02);
+ }
+
+ .ph b {
+ color: var(--red);
+ font-weight: 600;
+ text-transform: uppercase;
+ letter-spacing: 0.05em;
+ }
+
+ textarea.code {
+ width: 100%;
+ flex: 1;
+ min-height: 300px;
+ background: var(--bg);
+ border: none;
+ color: var(--t2);
+ font-family: var(--mono);
+ font-size: 13px;
+ line-height: 1.7;
+ padding: 20px;
+ resize: vertical;
+ outline: none;
+ caret-color: var(--red);
+ will-change: transform;
+ }
+
+ .db {
+ padding: 12px 16px;
+ border-top: 1px solid var(--b1);
+ display: flex;
+ align-items: center;
+ gap: 8px;
+ background: var(--s1);
+ }
+
+ .db .l {
+ font-size: 11px;
+ color: var(--muted);
+ font-weight: 500;
+ }
+
+ .ch {
+ font-family: var(--sans);
+ font-size: 11px;
+ padding: 4px 12px;
+ background: var(--s2);
+ border: 1px solid var(--b1);
+ border-radius: 6px;
+ color: var(--t1);
+ cursor: pointer;
+ transition: all 0.2s var(--spring);
+ }
+
+ .ch:hover {
+ background: var(--s3);
+ color: var(--t3);
+ transform: translateY(-1px);
+ border-color: var(--b2);
+ }
+
+ .ch.on {
+ background: var(--red);
+ border-color: var(--red);
+ color: #fff;
+ box-shadow: 0 0 15px var(--red-glow);
+ }
+
+ .bg {
+ margin: 16px;
+ padding: 14px;
+ background: var(--red);
+ border: none;
+ border-radius: 8px;
+ color: #fff;
+ font-family: var(--sans);
+ font-size: 14px;
+ font-weight: 700;
+ cursor: pointer;
+ transition: all 0.3s var(--spring);
+ text-transform: uppercase;
+ letter-spacing: 0.05em;
+ box-shadow: 0 4px 15px var(--red-glow);
+ }
+
+ .bg:hover {
+ background: #ff4d5a;
+ transform: translateY(-2px);
+ box-shadow: 0 6px 20px var(--red-glow);
+ }
+
+ .bg:active {
+ transform: translateY(0);
+ }
+
+ .bg:disabled {
+ opacity: 0.4;
+ cursor: not-allowed;
+ transform: none;
+ box-shadow: none;
+ }
+
+ /* Agent log */
+ .al {
+ padding: 12px;
+ display: flex;
+ flex-direction: column;
+ gap: 8px;
+ }
+
+ .ar {
+ padding: 12px 16px;
+ border-radius: 8px;
+ background: rgba(255, 255, 255, 0.03);
+ border: 1px solid transparent;
+ transition: all 0.4s var(--spring);
+ animation: slideIn 0.5s var(--spring) forwards;
+ opacity: 0;
+ transform: translateX(20px);
+ }
+
+ @keyframes slideIn {
+ to {
+ opacity: 1;
+ transform: translateX(0);
+ }
+ }
+
+ .ar.run {
+ border-color: var(--cyan);
+ background: rgba(0, 217, 255, 0.05);
+ }
+
+ .ar.done {
+ border-color: var(--green);
+ background: rgba(0, 255, 136, 0.05);
+ }
+
+ .ar.fail {
+ border-color: var(--red);
+ background: rgba(255, 51, 68, 0.05);
+ }
+
+ .ar.retry {
+ border-color: var(--yellow);
+ background: rgba(255, 204, 0, 0.05);
+ animation: pulse-border 1.5s ease-in-out infinite;
+ }
+
+ @keyframes pulse-border {
+ 50% {
+ border-color: rgba(255, 204, 0, 0.2);
+ }
+ }
+
+ .at {
+ display: flex;
+ align-items: center;
+ gap: 12px;
+ }
+
+ .an {
+ font-size: 10px;
+ font-weight: 700;
+ color: var(--muted);
+ min-width: 90px;
+ text-transform: uppercase;
+ letter-spacing: 0.1em;
+ }
+
+ .am {
+ font-size: 13px;
+ color: var(--t2);
+ font-weight: 500;
+ }
+
+ .ad {
+ font-size: 11px;
+ color: var(--muted);
+ margin-top: 4px;
+ padding-left: 102px;
+ white-space: pre-wrap;
+ line-height: 1.6;
+ max-height: 100px;
+ overflow-y: auto;
+ }
+
+ .ad .w {
+ color: var(--yellow);
+ font-weight: 600;
+ }
+
+ .ad .g {
+ color: var(--green);
+ font-weight: 600;
+ }
+
399
+ /* Horizontal Timeline */
400
+ .timeline {
401
+ display: flex;
402
+ justify-content: space-between;
403
+ padding: 16px 20px;
404
+ background: rgba(255, 255, 255, 0.02);
405
+ border-bottom: 1px solid var(--b1);
406
+ margin-bottom: 8px;
407
+ }
408
+
409
+ .node {
410
+ display: flex;
411
+ flex-direction: column;
412
+ align-items: center;
413
+ gap: 6px;
414
+ position: relative;
415
+ flex: 1;
416
+ }
417
+
418
+ .node::after {
419
+ content: '';
420
+ position: absolute;
421
+ top: 12px;
422
+ left: 50%;
423
+ width: 100%;
424
+ height: 2px;
425
+ background: var(--b1);
426
+ z-index: 0;
427
+ }
428
+
429
+ .node:last-child::after {
430
+ display: none;
431
+ }
432
+
433
+ .ni {
434
+ width: 24px;
435
+ height: 24px;
436
+ border-radius: 50%;
437
+ background: var(--s3);
438
+ border: 2px solid var(--b1);
439
+ display: flex;
440
+ align-items: center;
441
+ justify-content: center;
442
+ font-size: 12px;
443
+ z-index: 1;
444
+ transition: all 0.4s var(--spring);
445
+ }
446
+
447
+ .node.on .ni {
448
+ background: var(--cyan);
449
+ border-color: var(--cyan);
450
+ color: #000;
451
+ box-shadow: 0 0 15px var(--cyan);
452
+ }
453
+
454
+ .node.done .ni {
455
+ background: var(--green);
456
+ border-color: var(--green);
457
+ color: #000;
458
+ box-shadow: 0 0 15px var(--green);
459
+ }
460
+
461
+ .node.fail .ni {
462
+ background: var(--red);
463
+ border-color: var(--red);
464
+ color: #fff;
465
+ }
466
+
467
+ .node.retry .ni {
468
+ animation: pulse-node 1s var(--spring) infinite;
469
+ background: var(--yellow);
470
+ border-color: var(--yellow);
471
+ }
472
+
473
+ @keyframes pulse-node {
474
+
475
+ 0%,
476
+ 100% {
477
+ transform: scale(1);
478
+ }
479
+
480
+ 50% {
481
+ transform: scale(1.2);
482
+ }
483
+ }
484
+
485
+ .nl {
486
+ font-size: 9px;
487
+ font-weight: 700;
488
+ color: var(--muted);
489
+ text-transform: uppercase;
490
+ letter-spacing: 0.05em;
491
+ }
492
+
493
+ .node.on .nl,
494
+ .node.done .nl {
495
+ color: var(--t3);
496
+ }
497
+
498
+ /* Tabs */
499
+ .tabs {
500
+ display: flex;
501
+ gap: 8px;
502
+ }
503
+
504
+ .tab {
505
+ background: var(--s2);
506
+ border: 1px solid var(--b1);
507
+ padding: 6px 16px;
508
+ border-radius: 8px;
509
+ font-family: var(--sans);
510
+ font-size: 12px;
511
+ font-weight: 600;
512
+ color: var(--muted);
513
+ cursor: pointer;
514
+ transition: all 0.2s var(--spring);
515
+ }
516
+
517
+ .tab:hover {
518
+ color: var(--t2);
519
+ background: var(--s3);
520
+ }
521
+
522
+ .tab.on {
523
+ color: var(--t3);
524
+ background: var(--red);
525
+ border-color: var(--red);
526
+ box-shadow: 0 0 10px var(--red-glow);
527
+ }
528
+
529
+ .tc {
530
+ display: none;
531
+ padding: 0;
532
+ animation: fadeIn 0.4s ease;
533
+ }
534
+
535
+ .tc.on {
536
+ display: block;
537
+ }
538
+
539
+ @keyframes fadeIn {
540
+ from {
541
+ opacity: 0;
542
+ transform: translateY(10px);
543
+ }
544
+
545
+ to {
546
+ opacity: 1;
547
+ transform: translateY(0);
548
+ }
549
+ }
550
+
551
+ /* Summary row */
552
+ .sum-row {
553
+ padding: 24px;
554
+ display: flex;
555
+ align-items: center;
556
+ gap: 32px;
557
+ flex-wrap: wrap;
558
+ border-bottom: 1px solid var(--b1);
559
+ background: rgba(0, 255, 136, 0.02);
560
+ }
561
+
562
+ .sum-big {
563
+ font-size: 32px;
564
+ font-weight: 800;
565
+ color: var(--green);
566
+ line-height: 1;
567
+ letter-spacing: -0.02em;
568
+ text-shadow: 0 0 20px var(--green-glow);
569
+ }
570
+
571
+ .sum-big .u {
572
+ font-size: 13px;
573
+ font-weight: 500;
574
+ color: var(--muted);
575
+ margin-left: 4px;
576
+ display: block;
577
+ margin-top: 4px;
578
+ letter-spacing: 0;
579
+ }
580
+
581
+ .sum-big .vic {
582
+ font-size: 11px;
583
+ color: var(--cyan);
584
+ font-weight: 600;
585
+ display: block;
586
+ margin-top: 8px;
587
+ text-shadow: none;
588
+ opacity: 0.8;
589
+ }
590
+
591
+ .sum-sep {
592
+ width: 1px;
593
+ height: 40px;
594
+ background: var(--b1);
595
+ }
596
+
597
+ .sum-chk {
598
+ display: flex;
599
+ align-items: center;
600
+ gap: 8px;
601
+ font-size: 12px;
602
+ color: var(--t2);
603
+ font-weight: 500;
604
+ }
605
+
606
+ .sum-dot {
607
+ width: 8px;
608
+ height: 8px;
609
+ border-radius: 50%;
610
+ flex-shrink: 0;
611
+ }
612
+
613
+ .sum-dot.ok {
614
+ background: var(--green);
615
+ box-shadow: 0 0 8px var(--green-glow);
616
+ }
617
+
618
+ .sum-dot.no {
619
+ background: var(--red);
620
+ box-shadow: 0 0 8px var(--red-glow);
621
+ }
622
+
623
+ .sum-dot.na {
624
+ background: var(--muted);
625
+ box-shadow: none;
626
+ }
627
+
628
+ .sum-type {
629
+ font-size: 11px;
630
+ color: var(--cyan);
631
+ text-transform: uppercase;
632
+ letter-spacing: 0.1em;
633
+ font-weight: 700;
634
+ padding: 4px 10px;
635
+ background: rgba(0, 217, 255, 0.1);
636
+ border-radius: 4px;
637
+ }
638
+
639
+ .sum-bar {
640
+ padding: 16px 24px;
641
+ display: flex;
642
+ align-items: center;
643
+ gap: 12px;
644
+ flex-wrap: wrap;
645
+ border-bottom: 1px solid var(--b1);
646
+ }
647
+
648
+ .bs {
649
+ font-family: var(--sans);
650
+ font-size: 11px;
651
+ font-weight: 700;
652
+ padding: 8px 16px;
653
+ border-radius: 8px;
654
+ border: 1px solid var(--b1);
655
+ background: var(--s2);
656
+ color: var(--t2);
657
+ cursor: pointer;
658
+ transition: all 0.2s var(--spring);
659
+ text-transform: uppercase;
660
+ letter-spacing: 0.05em;
661
+ }
662
+
663
+ .bs:hover {
664
+ border-color: var(--b2);
665
+ transform: translateY(-1px);
666
+ background: var(--s3);
667
+ }
668
+
669
+ .bs.r {
670
+ background: var(--bg);
671
+ border-color: var(--red);
672
+ color: var(--red);
673
+ }
674
+
675
+ .bs.r:hover {
676
+ background: var(--red);
677
+ color: #fff;
678
+ box-shadow: 0 4px 15px var(--red-glow);
679
+ }
680
+
681
+ .bs.gr {
682
+ background: var(--green);
683
+ border-color: var(--green);
684
+ color: #000;
685
+ }
686
+
687
+ .bs.gr:hover {
688
+ box-shadow: 0 4px 15px var(--green-glow);
689
+ transform: translateY(-2px);
690
+ }
691
+
692
+ .sp {
693
+ flex: 1;
694
+ }
695
+
696
+ /* Details tab */
697
+ .dm {
698
+ display: grid;
699
+ grid-template-columns: repeat(5, 1fr);
700
+ border-bottom: 1px solid var(--b1);
701
+ }
702
+
703
+ @media (max-width: 800px) {
704
+ .dm {
705
+ grid-template-columns: repeat(2, 1fr);
706
+ }
707
+ }
708
+
709
+ .di {
710
+ padding: 20px;
711
+ border-right: 1px solid var(--b1);
712
+ background: rgba(255, 255, 255, 0.01);
713
+ }
714
+
715
+ .di:last-child {
716
+ border-right: none;
717
+ }
718
+
719
+ .dl {
720
+ font-size: 10px;
721
+ color: var(--muted);
722
+ text-transform: uppercase;
723
+ letter-spacing: 0.1em;
724
+ margin-bottom: 8px;
725
+ font-weight: 700;
726
+ }
727
+
728
+ .dv {
729
+ font-size: 20px;
730
+ font-weight: 800;
731
+ line-height: 1;
732
+ margin-bottom: 4px;
733
+ color: var(--t3);
734
+ }
735
+
736
+ .dv.g {
737
+ color: var(--green);
738
+ }
739
+
740
+ .dv.c {
741
+ color: var(--cyan);
742
+ }
743
+
744
+ .dv.y {
745
+ color: var(--yellow);
746
+ }
747
+
748
+ .dv.t {
749
+ color: var(--t2);
750
+ font-size: 13px;
751
+ }
752
+
753
+ .ds {
754
+ font-size: 10px;
755
+ color: var(--muted);
756
+ line-height: 1.4;
757
+ }
758
+
759
+ /* Benchmark bars */
760
+ .bk {
761
+ padding: 24px;
762
+ border-bottom: 1px solid var(--b1);
763
+ }
764
+
765
+ .bk-t {
766
+ font-size: 11px;
767
+ color: var(--muted);
768
+ text-transform: uppercase;
769
+ letter-spacing: 0.1em;
770
+ margin-bottom: 16px;
771
+ font-weight: 700;
772
+ }
773
+
774
+ .br {
775
+ display: flex;
776
+ align-items: center;
777
+ gap: 16px;
778
+ margin-bottom: 12px;
779
+ }
780
+
781
+ .br:last-child {
782
+ margin-bottom: 0;
783
+ }
784
+
785
+ .bl {
786
+ font-size: 12px;
787
+ color: var(--t2);
788
+ width: 140px;
789
+ flex-shrink: 0;
790
+ font-weight: 500;
791
+ }
792
+
793
+ .bt {
794
+ flex: 1;
795
+ height: 8px;
796
+ background: var(--bg);
797
+ border-radius: 4px;
798
+ overflow: hidden;
799
+ border: 1px solid var(--b1);
800
+ }
801
+
802
+ .bf {
803
+ height: 100%;
804
+ border-radius: 4px;
805
+ transition: width 1s var(--spring);
806
+ width: 0;
807
+ }
808
+
809
+ .bf.bad {
810
+ background: linear-gradient(90deg, #ff334466, #ff3344);
811
+ box-shadow: 0 0 10px rgba(255, 51, 68, 0.3);
812
+ }
813
+
814
+ .bf.good {
815
+ background: linear-gradient(90deg, #00ff8866, #00ff88);
816
+ box-shadow: 0 0 10px rgba(0, 255, 136, 0.3);
817
+ }
818
+
819
+ .bv {
820
+ font-size: 12px;
821
+ font-weight: 700;
822
+ width: 40px;
823
+ text-align: right;
824
+ flex-shrink: 0;
825
+ }
826
+
827
+ .bv.bad {
828
+ color: var(--red);
829
+ }
830
+
831
+ .bv.good {
832
+ color: var(--green);
833
+ }
834
+
835
+ /* Simple mode note */
836
+ .sn {
837
+ padding: 20px;
838
+ border: 1px solid var(--cyan);
839
+ border-radius: 12px;
840
+ background: rgba(0, 217, 255, 0.05);
841
+ margin: 24px;
842
+ font-size: 13px;
843
+ color: var(--t2);
844
+ line-height: 1.6;
845
+ border-left-width: 4px;
846
+ }
847
+
+ /* Diff */
+ .dg {
+   display: grid;
+   grid-template-columns: 1fr 1fr;
+   background: var(--bg);
+ }
+
+ @media (max-width: 780px) {
+   .dg {
+     grid-template-columns: 1fr;
+   }
+
+   .dfs:first-child {
+     border-right: none !important;
+     border-bottom: 1px solid var(--b1);
+   }
+ }
+
+ .dfs:first-child {
+   border-right: 1px solid var(--b1);
+ }
+
+ .dfh {
+   padding: 10px 16px;
+   border-bottom: 1px solid var(--b1);
+   font-size: 11px;
+   color: var(--muted);
+   display: flex;
+   align-items: center;
+   gap: 8px;
+   font-weight: 600;
+   background: var(--s2);
+ }
+
+ .dft {
+   font-size: 9px;
+   font-weight: 800;
+   padding: 2px 6px;
+   border-radius: 4px;
+   text-transform: uppercase;
+ }
+
+ .dft.cu {
+   background: rgba(255, 51, 68, 0.2);
+   color: var(--red);
+ }
+
+ .dft.ro {
+   background: rgba(0, 255, 136, 0.2);
+   color: var(--green);
+ }
+
+ .dfp {
+   padding: 20px;
+   font-family: var(--mono);
+   font-size: 12px;
+   line-height: 1.7;
+   overflow: auto;
+   max-height: 500px;
+   white-space: pre;
+   color: var(--t2);
+ }
+
+ .dlo {
+   background: rgba(255, 51, 68, 0.1);
+   color: var(--red);
+   text-decoration: line-through;
+   display: block;
+   width: 100%;
+ }
+
+ .dln {
+   background: rgba(0, 255, 136, 0.1);
+   color: var(--green);
+   display: block;
+   width: 100%;
+ }
+
+ /* Loading Skeleton */
+ .skeleton {
+   position: relative;
+   overflow: hidden;
+   background: var(--s2);
+   border-radius: 12px;
+   height: 200px;
+   margin-top: 24px;
+ }
+
+ .skeleton::after {
+   content: '';
+   position: absolute;
+   inset: 0;
+   transform: translateX(-100%);
+   background: linear-gradient(90deg, transparent, rgba(255, 255, 255, 0.05), transparent);
+   animation: shimmer 1.5s infinite;
+ }
+
+ @keyframes shimmer {
+   100% {
+     transform: translateX(100%);
+   }
+ }
+
+ /* Custom Cursor */
+ #cursor {
+   position: fixed;
+   width: 20px;
+   height: 20px;
+   background: rgba(255, 255, 255, 0.2);
+   border: 1px solid rgba(255, 255, 255, 0.4);
+   border-radius: 50%;
+   pointer-events: none;
+   z-index: 9999;
+   transition: transform 0.1s ease, width 0.3s var(--spring), height 0.3s var(--spring), background 0.3s ease;
+   mix-blend-mode: difference;
+ }
+
+ #cursor.active {
+   transform: scale(3);
+   background: rgba(255, 51, 68, 0.3);
+   border-color: var(--red);
+ }
+
+ /* Modal */
+ .mo {
+   display: none;
+   position: fixed;
+   inset: 0;
+   background: rgba(0, 0, 0, 0.85);
+   z-index: 1000;
+   place-items: center;
+   backdrop-filter: blur(8px);
+ }
+
+ .mo.open {
+   display: grid;
+ }
+
+ .mb {
+   background: var(--s1);
+   border: 1px solid var(--b1);
+   border-radius: 16px;
+   width: 90%;
+   max-width: 800px;
+   max-height: 90vh;
+   overflow: hidden;
+   box-shadow: 0 20px 50px rgba(0, 0, 0, 0.6);
+ }
+
+ .mt {
+   padding: 16px 24px;
+   border-bottom: 1px solid var(--b1);
+   display: flex;
+   justify-content: space-between;
+   align-items: center;
+   background: var(--s2);
+ }
+
+ .mt h3 {
+   font-size: 16px;
+   color: var(--t3);
+   font-weight: 700;
+ }
+
+ .mx {
+   background: none;
+   border: none;
+   color: var(--muted);
+   font-size: 24px;
+   cursor: pointer !important;
+   line-height: 1;
+   transition: color 0.2s;
+ }
+
+ .mx:hover {
+   color: var(--t3);
+ }
+
+ .mc {
+   padding: 24px;
+ }
+
+ .mc textarea {
+   width: 100%;
+   height: 400px;
+   background: var(--bg);
+   border: 1px solid var(--b1);
+   border-radius: 8px;
+   padding: 16px;
+   color: var(--cyan);
+   font-family: var(--mono);
+   font-size: 12px;
+   line-height: 1.6;
+   resize: vertical;
+   outline: none;
+ }
+
+ .mc textarea:focus {
+   border-color: var(--cyan);
+   box-shadow: 0 0 10px rgba(0, 217, 255, 0.2);
+ }
+
+ .mf {
+   padding: 16px 24px;
+   border-top: 1px solid var(--b1);
+   display: flex;
+   justify-content: flex-end;
+   gap: 12px;
+   background: var(--s2);
+ }
+
+ ::-webkit-scrollbar {
+   width: 6px;
+   height: 6px;
+ }
+
+ ::-webkit-scrollbar-track {
+   background: transparent;
+ }
+
+ ::-webkit-scrollbar-thumb {
+   background: var(--b1);
+   border-radius: 10px;
+ }
+
+ ::-webkit-scrollbar-thumb:hover {
+   background: var(--b2);
+ }
+
+ footer {
+   padding: 32px 0;
+   border-top: 1px solid var(--b1);
+   display: flex;
+   justify-content: space-between;
+   font-size: 11px;
+   color: var(--muted);
+   font-weight: 500;
+ }
+
+ footer a {
+   color: var(--muted);
+   text-decoration: none;
+   transition: color 0.2s;
+   border-bottom: 1px solid transparent;
+ }
+
+ footer a:hover {
+   color: var(--t2);
+   border-bottom-color: var(--muted);
+ }
+
+ .idle {
+   flex: 1;
+   display: flex;
+   align-items: center;
+   justify-content: center;
+   color: var(--b2);
+   font-size: 13px;
+   font-weight: 500;
+   min-height: 100px;
+ }
+ </style>
  </head>
  <div id="cursor"></div>

    <div class="logo">ROCmPort <em>AI</em></div>
    <div class="hr">
      <div class="hd on" id="hdot"></div>
+     <span id="hstat">Ready</span>
    </div>
  </header>

  <div class="g">
    <div class="p">
+     <div class="ph">
+       <div><b>//</b> CUDA source</div>
+       <div id="lc">0 lines</div>
+     </div>
      <textarea class="code" id="inp" spellcheck="false" placeholder="// Paste CUDA code here
      // or pick a demo below

    </div>

    <div class="p">
+     <div class="ph">
+       <div><b>//</b> Pipeline</div>
+       <div id="pt">0.0s</div>
+     </div>
      <div class="timeline" id="tl">
        <!-- Nodes injected by JS -->
      </div>

  <footer>
    <div>ROCmPort AI — AMD Developer Hackathon 2025</div>
+   <div><a href="https://x.com/TazwarEnan" target="_blank">Tazwar Ahnaf Enan</a> · <a
+     href="https://github.com/tazwaryayyyy" target="_blank">GitHub</a></div>
  </footer>
  </div>

  <div class="mo" id="modal">
    <div class="mb">
+     <div class="mt">
+       <h3>Edit ROCm code</h3><button class="mx" onclick="cm()">&times;</button>
+     </div>
      <div class="mc"><textarea id="edt"></textarea></div>
+     <div class="mf"><button class="bs" onclick="cm()">Cancel</button><button class="bs r"
+       onclick="rec()">Re-test</button></div>
    </div>
  </div>
  <script>
+ const API = 'http://localhost:8000';
+ const S = { code: '', kn: 'custom', run: false, t0: null, iv: null, rep: null, tl: [], kernels: {} };
+ const AG = {
+   analyzer: { n: 'ANALYZER', i: '🔍' },
+   translator: { n: 'TRANSLATOR', i: '🔄' },
+   optimizer: { n: 'OPTIMIZER', i: '⚡' },
+   tester: { n: 'TESTER', i: '🧪' },
+   coordinator: { n: 'COORDINATOR', i: '📋' }
  };
+
+ // Custom Cursor Logic
+ const cur = document.getElementById('cursor');
+ document.addEventListener('mousemove', (e) => {
+   cur.style.left = e.clientX + 'px';
+   cur.style.top = e.clientY + 'px';
+   const target = e.target;
+   const isClickable = target.onclick ||
+     target.tagName === 'BUTTON' ||
+     target.tagName === 'A' ||
+     target.tagName === 'TEXTAREA' ||
+     target.classList.contains('ch') ||
+     target.classList.contains('tab');
+
+   if (isClickable) {
+     cur.classList.add('active');
+     if (target.id === 'go') cur.style.background = 'rgba(255, 51, 68, 0.5)';
+     else cur.style.background = 'rgba(255, 255, 255, 0.3)';
+   } else {
+     cur.classList.remove('active');
+     cur.style.background = 'rgba(255, 255, 255, 0.2)';
  }
+ });
+
+ async function init() {
+   const ta = document.getElementById('inp');
+   ta.oninput = () => {
+     document.getElementById('lc').textContent = ta.value.split('\n').length + ' lines';
+     S.code = ta.value;
+   };
+   try {
+     const r = await fetch(API + '/demo-kernels');
+     S.kernels = await r.json();
+   } catch (e) { S.kernels = FB; }
  }
+
+ function lk(n, btn) {
+   document.querySelectorAll('.ch').forEach(c => c.classList.remove('on'));
+   btn.classList.add('on');
+   const code = S.kernels[n] || FB[n] || '', ta = document.getElementById('inp');
+   ta.value = code; S.code = code; S.kn = n;
+   document.getElementById('lc').textContent = code.split('\n').length + ' lines';
+ }
+
+ function stab(id, btn) {
+   document.querySelectorAll('.tab').forEach(t => t.classList.remove('on'));
+   document.querySelectorAll('.tc').forEach(t => t.classList.remove('on'));
+   btn.classList.add('on');
+   document.getElementById('t-' + id).classList.add('on');
+   if (id === 'diff' && S.rep) rDiff(S.code, S.rep.optimized_code);
+ }
+
+ async function go() {
+   if (S.run) return;
+   const code = document.getElementById('inp').value.trim();
+   if (!code) return;
+
+   S.code = code; S.run = true; S.t0 = Date.now(); S.tl = [];
+   const btn = document.getElementById('go');
+   btn.disabled = true;
+   btn.textContent = 'Running pipeline...';
+
+   document.getElementById('hstat').textContent = 'Pipeline running...';
+   document.getElementById('rp').classList.add('hide');
+
+   bLog();
+   sTimer();
+
+   try {
+     const simpleModeCheckbox = document.getElementById('sm');
+     const res = await fetch(API + '/port', {
+       method: 'POST',
+       headers: { 'Content-Type': 'application/json' },
+       body: JSON.stringify({
+         cuda_code: code,
+         kernel_name: S.kn,
+         simple_mode: simpleModeCheckbox ? simpleModeCheckbox.checked : false
+       })
    });
+
+     // Show results panel with loader immediately
+     document.getElementById('rp').classList.remove('hide');
+     document.getElementById('t-loader').classList.remove('hide');
+     document.getElementById('t-sum').classList.remove('on');
+     document.getElementById('t-diff').classList.remove('on');
+     document.getElementById('t-det').classList.remove('on');
+
+     const rd = res.body.getReader(), dc = new TextDecoder();
+     let buf = '';
+     while (true) {
+       const { done, value } = await rd.read();
+       if (done) break;
+       buf += dc.decode(value, { stream: true });
+       const lines = buf.split('\n');
+       buf = lines.pop();
+       for (const ln of lines) {
+         if (!ln.startsWith('data: ')) continue;
+         const raw = ln.slice(6).trim();
+         if (raw === '[DONE]') { done_(); break; }
+         try { hEvt(JSON.parse(raw)); } catch (e) { console.error('Parse error:', e); }
+       }
+     }
+   } catch (e) {
+     document.getElementById('hstat').textContent = 'Pipeline error';
+     document.getElementById('t-loader').classList.add('hide'); // Hide loader on error
+     console.error(e);
+   } finally {
+     xTimer();
+     S.run = false;
+     btn.disabled = false;
+     btn.textContent = 'Port to ROCm';
+     document.getElementById('t-loader').classList.add('hide');
  }
  }
+
+ function hEvt(ev) {
+   uLog(ev.agent, ev.status, ev.message, ev.detail);
+   if (ev.agent === 'tester' && (ev.status === 'done' || ev.status === 'failed')) {
+     const m = ev.message.match(/([\d.]+)x/);
+     if (m) {
+       const sp = parseFloat(m[1]), ok = sp >= 1, im = ev.message.match(/Iteration (\d+)/i);
+       S.tl.push({
+         label: 'Iteration ' + (im ? im[1] : S.tl.length + 1) + (ok ? ' (optimized)' : ' (baseline)'),
+         speedup: sp,
+         good: ok
+       });
+     }
+   }
+   if (ev.agent === 'coordinator' && ev.status === 'done' && ev.detail) {
+     try {
+       const r = JSON.parse(ev.detail);
+       S.rep = r;
+       rRes(r, S.tl);
+     } catch (e) { console.error('Coordinator detail parse error:', e); }
+   }
  }

+ function done_() {
+   document.getElementById('hstat').textContent = 'Pipeline complete';
+   document.getElementById('t-loader').classList.add('hide');
+   if (!S.rep) {
+     document.getElementById('t-sum').innerHTML = '<div class="idle">Migration finished but no report was generated. Check agent logs for details.</div>';
+     document.getElementById('t-sum').classList.add('on');
+   }
  }
+
+ function bLog() {
+   const el = document.getElementById('al');
+   const tl = document.getElementById('tl');
+   el.innerHTML = '';
+   tl.innerHTML = '';
+
+   let i = 0;
+   for (const [k, obj] of Object.entries(AG)) {
+     // Log row
+     const d = document.createElement('div');
+     d.className = 'ar';
+     d.id = 'ar-' + k;
+     d.style.animationDelay = (i * 0.1) + 's';
+     d.innerHTML = `
      <div class="at">
        <span class="an">${obj.n}</span>
        <span class="am" id="am-${k}">Waiting</span>
      </div>
      <div class="ad" id="ad-${k}"></div>`;
+     el.appendChild(d);
+
+     // Timeline node
+     const n = document.createElement('div');
+     n.className = 'node';
+     n.id = 'nd-' + k;
+     n.title = obj.n;
+     n.innerHTML = `<div class="ni">${obj.i}</div><div class="nl">${obj.n.slice(0, 3)}</div>`;
+     tl.appendChild(n);
+     i++;
+   }
  }
+
+ function uLog(a, s, m, d) {
+   const row = document.getElementById('ar-' + a);
+   const node = document.getElementById('nd-' + a);
+   if (!row || !node) return;
+
+   const statusClass = { running: 'run', done: 'done', failed: 'fail', retrying: 'retry' }[s] || '';
+   row.className = 'ar ' + statusClass;
+   node.className = 'node ' + (s === 'running' ? 'on' : s === 'retrying' ? 'retry' : s === 'done' ? 'done' : s === 'failed' ? 'fail' : '');
+
+   const me = document.getElementById('am-' + a);
+   if (me) me.textContent = m;
+
+   // Node tooltip message update
+   node.title = m;
+
+   const de = document.getElementById('ad-' + a);
+   if (de && d) {
+     de.innerHTML = esc(d)
+       .replace(/\u26a0\ufe0f([^\n]*)/g, '<span class="w">⚠️ $1</span>')
+       .replace(/\u2705([^\n]*)/g, '<span class="g">✅ $1</span>');
+     de.scrollTop = de.scrollHeight;
+   }
  }

+ function rRes(r, tl) {
+   // Hide loader, show summary
+   document.getElementById('t-loader').classList.add('hide');
+   document.getElementById('t-sum').classList.add('on');
+
+   const v = r.verification || {}, bw = r.bandwidth_utilized;
+   const dot = ok => `<div class="sum-dot ${ok === true ? 'ok' : ok === false ? 'no' : 'na'}"></div>`;

+   document.getElementById('t-sum').innerHTML = `
    <div class="sum-row">
      <div class="sum-big">
        ${r.speedup}x
        <span class="u">vs baseline hipify</span>
+       <span class="vic">Measured against declared baseline.</span>
      </div>
      <div class="sum-sep"></div>
      <div>

      ${r.simplified_explanation ? esc(r.simplified_explanation) : '<em>Simplified explanation will appear here</em>'}
    </div>`;

+   // Details tab
+   let dh = `<div class="dm">
    <div class="di"><div class="dl">Speedup</div><div class="dv g">${r.speedup}x</div><div class="ds">optimized ROCm vs straight hipify output</div></div>
    <div class="di"><div class="dl">Bandwidth</div><div class="dv c">${bw != null ? bw.toFixed(1) : '—'}%</div><div class="ds">of MI300X 5.3 TB/s HBM3</div></div>
    <div class="di"><div class="dl">Changes</div><div class="dv y">${r.total_changes}</div><div class="ds">hipify + LLM + optimizer changes</div></div>
    <div class="di"><div class="dl">Iterations</div><div class="dv c">${r.iterations || 1}</div><div class="ds">optimizer retry loop count</div></div>
    <div class="di"><div class="dl">Type</div><div class="dv t">${(r.bottleneck || '—').toUpperCase()}</div><div class="ds">workload classification</div></div>
  </div>`;
+
+   if (tl.length) {
+     dh += '<div class="bk"><div class="bk-t">Benchmark iterations (optimized vs baseline hipify)</div>';
+     tl.forEach(d => {
+       const pct = Math.min(Math.max((d.speedup / 2) * 100, 3), 95);
+       dh += `<div class="br">
      <div class="bl">${esc(d.label)}</div>
      <div class="bt"><div class="bf ${d.good ? 'good' : 'bad'}" style="width: 0" data-w="${pct}%"></div></div>
      <div class="bv ${d.good ? 'good' : 'bad'}">${d.speedup}x</div>
    </div>`;
+     });
+     dh += '</div>';
+   }
+
+   document.getElementById('t-det').innerHTML = dh;
+   tsm(); // Ensure simple note visibility matches current toggle state
+
+   // Progress bar animation
+   setTimeout(() => {
+     document.querySelectorAll('.bf[data-w]').forEach(b => {
+       b.style.width = b.dataset.w;
+     });
+   }, 100);
  }
+
+ function rDiff(o, n) {
+   if (!o || !n) return;
+   const oe = document.getElementById('d-o'), ne = document.getElementById('d-n');
+   if (oe && oe.innerHTML && ne && ne.innerHTML) return; // Already rendered
+
+   document.getElementById('t-diff').innerHTML = `<div class="dg">
    <div class="dfs"><div class="dfh"><span class="dft cu">CUDA</span> Original Source</div><pre class="dfp" id="d-o"></pre></div>
    <div class="dfs"><div class="dfh"><span class="dft ro">ROCm</span> Optimized HIP</div><pre class="dfp" id="d-n"></pre></div>
  </div>`;
+
+   const oL = o.split('\n'), nL = n.split('\n'), mx = Math.max(oL.length, nL.length);
+   let oH = '', nH = '';
+   for (let i = 0; i < mx; i++) {
+     const a = oL[i] ?? '', b = nL[i] ?? '', c = a !== b;
+     oH += `<span class="${c ? 'dlo' : ''}">${esc(a)}\n</span>`;
+     nH += `<span class="${c ? 'dln' : ''}">${esc(b)}\n</span>`;
+   }
+   document.getElementById('d-o').innerHTML = oH;
+   document.getElementById('d-n').innerHTML = nH;
  }
+
+ function sTimer() { S.iv = setInterval(() => { document.getElementById('pt').textContent = ((Date.now() - S.t0) / 1000).toFixed(1) + 's' }, 100) }
+ function xTimer() { clearInterval(S.iv) }
+
+ function dlR() {
+   const r = S.rep; if (!r) return;
+   const md = `# ROCmPort AI — Migration Report\n\n## Results\n- **Speedup**: ${r.speedup}x\n- **Bandwidth**: ${r.bandwidth_utilized ? r.bandwidth_utilized.toFixed(1) : '—'}%\n- **Changes**: ${r.total_changes}\n- **Iterations**: ${r.iterations}\n- **Type**: ${r.bottleneck}\n\n${r.amd_advantage_explanation ? '> ' + r.amd_advantage_explanation + '\n\n' : ''}${r.cost_estimate ? '## Cost Impact\n- Manual: ' + r.cost_estimate.manual_porting_weeks + '\n- ROCmPort: ' + r.cost_estimate.rocmport_minutes + '\n- Savings: ' + r.cost_estimate.estimated_savings + '\n\n' : ''}## ROCm/HIP Code\n\`\`\`cpp\n${r.optimized_code || ''}\n\`\`\`\n\n---\n*Generated by ROCmPort AI*\n`;
+   const a = document.createElement('a'); a.href = URL.createObjectURL(new Blob([md], { type: 'text/markdown' })); a.download = 'rocmport-migration-report.md'; a.click();
+ }
+
+ function om() { if (!S.rep) return alert('No results yet!'); document.getElementById('edt').value = S.rep?.optimized_code || ''; document.getElementById('modal').classList.add('open') }
+ function cm() { document.getElementById('modal').classList.remove('open') }
+
+ async function rec() {
+   const code = document.getElementById('edt').value.trim(); if (!code) return;
+   try {
+     const res = await fetch(API + '/recompile', { method: 'POST', headers: { 'Content-Type': 'application/json' }, body: JSON.stringify({ edited_code: code, kernel_name: S.kn }) });
+     const r = await res.json();
+     if (r.success) { cm(); if (r.result) rRes(r.result, S.tl); }
+     else alert('Failed: ' + (r.detail || 'Unknown'));
+   } catch (e) { alert('Error: ' + e.message) }
+ }
+
+ async function exM() {
+   if (!S.rep) return;
+   try {
+     const res = await fetch(API + '/export', { method: 'POST', headers: { 'Content-Type': 'application/json' }, body: JSON.stringify({ original_cuda: S.code, final_rocm: S.rep.optimized_code, migration_report: S.rep }) });
+     if (res.ok) { const a = document.createElement('a'); a.href = URL.createObjectURL(await res.blob()); a.download = 'rocmport-migration.zip'; a.click() }
+   } catch (e) { alert('Export error') }
+ }
+
+ function tsm() {
+   // Show the simple-mode note only when the Simple Mode toggle is on
+   const sn = document.getElementById('sn');
+   const sm = document.getElementById('sm');
+   if (sn) sn.classList.toggle('hide', !(sm && sm.checked));
+ }
+
+ function esc(s) { return String(s ?? '').replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;') }
+
+ const FB = {
+   vector_add: `#include <cuda_runtime.h>\n\n__global__ void vector_add_kernel(float* A, float* B, float* C, int N) {\n int idx = blockIdx.x * blockDim.x + threadIdx.x;\n if (idx < N) {\n C[idx] = A[idx] + B[idx];\n }\n}\n\nint main() {\n int N = 1 << 24;\n size_t size = N * sizeof(float);\n float *d_A, *d_B, *d_C;\n cudaMalloc(&d_A, size);\n cudaMalloc(&d_B, size);\n cudaMalloc(&d_C, size);\n int threads = 128;\n int blocks = (N + threads - 1) / threads;\n vector_add_kernel<<<blocks, threads>>>(d_A, d_B, d_C, N);\n cudaDeviceSynchronize();\n cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);\n return 0;\n}`,
+   matrix_multiply: `#include <cuda_runtime.h>\n#define WARP_SIZE 32\n\n__global__ void matmul_kernel(float* A, float* B, float* C, int N) {\n int row = blockIdx.y * blockDim.y + threadIdx.y;\n int col = blockIdx.x * blockDim.x + threadIdx.x;\n float sum = 0.0f;\n if (row < N && col < N) {\n for (int k = 0; k < N; k++)\n sum += A[row * N + k] * B[k * N + col];\n C[row * N + col] = sum;\n }\n}\n\n__global__ void warp_reduce(float* data, float* result, int N) {\n int tid = threadIdx.x;\n extern __shared__ float sdata[];\n sdata[tid] = (tid < N) ? data[tid] : 0;\n __syncthreads();\n for (int s = WARP_SIZE/2; s > 0; s >>= 1) {\n if (tid < s) sdata[tid] += sdata[tid + s];\n __syncthreads();\n }\n if (tid == 0) result[blockIdx.x] = sdata[0];\n}\n\nint main() {\n int N = 1024;\n size_t size = N * N * sizeof(float);\n float *d_A, *d_B, *d_C;\n cudaMalloc(&d_A, size);\n cudaMalloc(&d_B, size);\n cudaMalloc(&d_C, size);\n dim3 block(16, 16);\n dim3 grid((N+15)/16, (N+15)/16);\n matmul_kernel<<<grid, block>>>(d_A, d_B, d_C, N);\n cudaDeviceSynchronize();\n cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);\n return 0;\n}`,
+   convolution_2d: `#include <cuda_runtime.h>\n#define BLOCK_SIZE 16\n\n__global__ void conv2d_kernel(\n float* input, float* kernel, float* output,\n int width, int height\n) {\n int x = blockIdx.x * blockDim.x + threadIdx.x;\n int y = blockIdx.y * blockDim.y + threadIdx.y;\n if (x >= width || y >= height) return;\n float sum = 0.0f;\n for (int ky = -1; ky <= 1; ky++) {\n for (int kx = -1; kx <= 1; kx++) {\n int ix = x + kx, iy = y + ky;\n if (ix >= 0 && ix < width && iy >= 0 && iy < height)\n sum += input[iy * width + ix] * kernel[(ky+1)*3 + (kx+1)];\n }\n }\n output[y * width + x] = sum;\n}\n\nint main() {\n int W = 2048, H = 2048;\n float *d_in, *d_ker, *d_out;\n cudaMalloc(&d_in, W*H*sizeof(float));\n cudaMalloc(&d_ker, 9*sizeof(float));\n cudaMalloc(&d_out, W*H*sizeof(float));\n dim3 block(BLOCK_SIZE, BLOCK_SIZE);\n dim3 grid((W+BLOCK_SIZE-1)/BLOCK_SIZE, (H+BLOCK_SIZE-1)/BLOCK_SIZE);\n conv2d_kernel<<<grid, block>>>(d_in, d_ker, d_out, W, H);\n cudaDeviceSynchronize();\n cudaFree(d_in); cudaFree(d_ker); cudaFree(d_out);\n return 0;\n}`,
+   reduction: `#include <cuda_runtime.h>\n#include <stdio.h>\n#include <iostream>\n#include <vector>\n#include <numeric>\n\n// Tree-based reduction kernel\n__global__ void reduction_kernel(float* g_idata, float* g_odata, unsigned int n) {\n extern __shared__ float sdata[];\n unsigned int tid = threadIdx.x;\n unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;\n\n float mySum = (i < n) ? g_idata[i] : 0;\n if (i + blockDim.x < n) mySum += g_idata[i + blockDim.x];\n sdata[tid] = mySum;\n __syncthreads();\n\n for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {\n if (tid < s) sdata[tid] = mySum = mySum + sdata[tid + s];\n __syncthreads();\n }\n\n // DELIBERATE WARP-SIZE BUG: Unroll to 32 instead of 64\n if (tid < 32) {\n volatile float* vsmem = sdata;\n vsmem[tid] = mySum = mySum + vsmem[tid + 32];\n vsmem[tid] = mySum = mySum + vsmem[tid + 16];\n vsmem[tid] = mySum = mySum + vsmem[tid + 8];\n vsmem[tid] = mySum = mySum + vsmem[tid + 4];\n vsmem[tid] = mySum = mySum + vsmem[tid + 2];\n vsmem[tid] = mySum = mySum + vsmem[tid + 1];\n }\n\n if (tid == 0) g_odata[blockIdx.x] = sdata[0];\n}\n\nint main() {\n const int N = 1048576;\n // ... Host code for Parallel Reduction demo\n printf("Parallel Reduction demo loaded.\\n");\n return 0;\n}`
+ };
+
+ init();

  </script>
  </body>
+
  </html>