tazwarrrr committed on
Commit
3de7600
·
1 Parent(s): b521314

fixing docs proof

README.md CHANGED
@@ -233,9 +233,26 @@ A basic weekend clone can chain hipify and an LLM. The differentiator is reliabl
 | Backend unavailable | Verify FastAPI server is running on port `8000`. |
 | No improvement observed | Re-check baseline definition, kernel size, and profiler counters. |

- ## License
-
- See `LICENSE`.
 ## ✅ Live Results on AMD Instinct MI300X

@@ -249,3 +266,7 @@ All demo kernels migrated, compiled, and profiled on real MI300X hardware (AMD D
 | convolution_2d | 13 | warp-32 + LDS padding | ✅ Compiled |

 `data_source: real_rocm` — verified on AMD DevCloud MI300X instance.
 | Backend unavailable | Verify FastAPI server is running on port `8000`. |
 | No improvement observed | Re-check baseline definition, kernel size, and profiler counters. |

+ ## Why Not Just Use hipify?
+
+ hipify-clang is AMD's official translation tool. ROCmPort AI uses it as a first pass. The problem is what hipify cannot catch.
+
+ **The reduction kernel example:**
+
+ hipify successfully translates `reduction.cu` — it compiles, it runs, it returns a result. No errors. But the result is silently wrong on AMD hardware.
+
+ The root cause: line 59 assumes `warpSize=32` in the final unrolled reduction stage. On AMD, wavefront size is 64. Lanes 32–63 are skipped entirely in the final summation. The output looks plausible but is numerically incorrect.
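The failure mode is easy to reproduce off the GPU. Below is a host-side Python sketch (illustrative only, not project code) that folds 64 partial sums the way a warp-32 unrolled final stage would, versus a wavefront-64-aware fold:

```python
# Illustrative simulation of the warp-32 assumption on a 64-wide wavefront.
# Each "lane" holds a partial sum; a warp-32 unrolled final stage folds
# only lanes 0-31, so lanes 32-63 never reach the result.

WAVEFRONT = 64  # AMD MI300X wavefront width

def final_stage_warp32(lanes):
    """Buggy final reduction: folds only the first 32 lanes (NVIDIA warpSize)."""
    acc = lanes[:32]
    step = 16
    while step >= 1:
        acc = [acc[i] + acc[i + step] for i in range(step)]
        step //= 2
    return acc[0]

def final_stage_wave64(lanes):
    """Wavefront-64-aware final reduction: folds all 64 lanes."""
    acc = lanes[:]
    step = WAVEFRONT // 2
    while step >= 1:
        acc = [acc[i] + acc[i + step] for i in range(step)]
        step //= 2
    return acc[0]

partials = list(range(1, WAVEFRONT + 1))   # lanes hold 1..64
print(final_stage_warp32(partials))        # 528: lanes 32-63 silently dropped
print(final_stage_wave64(partials))        # 2080: the true sum of 1..64
```

Both versions run without error, which is exactly why the bug survives a "does it compile and execute" check.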
+
+ hipify has no knowledge of this. It performs mechanical API renaming. It cannot reason about hardware architecture assumptions baked into kernel logic.
+
+ ROCmPort AI catches this before execution:
+
+ - Static scanner flags line 59 as CRITICAL risk: "hardcoded warp-32 conditional — assumes NVIDIA warpSize=32. On AMD wavefront=64 this silently skips lanes 32–63"
+ - LLM correction pass rewrites the final reduction stage to be wavefront-64 aware
+ - Compiler + rocprof verification confirms the fix compiles and executes correctly on gfx942
+
+ This is the gap between "it compiles" and "it is correct."
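The kind of flag the static scanner raises can be sketched as a toy pattern check (hypothetical; the patterns and message here are illustrative, not the project's actual rule set):

```python
import re

# Toy scan: flag source lines that hard-code the NVIDIA warp width.
WARP32_PATTERNS = [
    re.compile(r"\bwarpSize\s*==\s*32\b"),
    re.compile(r"\b(tid|lane|threadIdx\.x)\s*<\s*32\b"),
]

def scan_warp32(source: str):
    """Return (line, severity, message) findings for warp-32 assumptions."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if any(p.search(line) for p in WARP32_PATTERNS):
            findings.append((
                lineno,
                "CRITICAL",
                "hardcoded warp-32 conditional: assumes NVIDIA warpSize=32; "
                "on AMD wavefront=64 this silently skips lanes 32-63",
            ))
    return findings

kernel_line = "if (tid < 32) { sdata[tid] += sdata[tid + 32]; }"
print(scan_warp32(kernel_line))
```

A real scanner would work on a parsed AST rather than raw lines, but the principle is the same: the check is about hardware assumptions, not API names.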

 ## ✅ Live Results on AMD Instinct MI300X

 | convolution_2d | 13 | warp-32 + LDS padding | ✅ Compiled |

 `data_source: real_rocm` — verified on AMD DevCloud MI300X instance.
+
+ ## License
+
+ See `LICENSE`.
backend/agents/coordinator.py CHANGED
@@ -24,15 +24,15 @@ def calculate_cost_estimate(analyzer_result: AnalyzerResult) -> CostEstimate:
 if complexity <= 3:
     manual_weeks = "1-2 weeks"
-     savings = "$5,000-$10,000"
     factor = "Low"
 elif complexity <= 7:
     manual_weeks = "3-6 weeks"
-     savings = "$20,000-$50,000"
     factor = "Medium"
 else:
     manual_weeks = "6-10 weeks"
-     savings = "$50,000-$100,000"
     factor = "High"

 return CostEstimate(
@@ -77,8 +77,6 @@ async def run_pipeline(
 simple_mode: bool = False,
 ) -> AsyncGenerator[AgentEvent, None]:
     """Run full pipeline and stream AgentEvent objects."""
-     _ = simple_mode
-
     yield AgentEvent(
         agent="analyzer",
         status=AgentStatus.RUNNING,
 
 if complexity <= 3:
     manual_weeks = "1-2 weeks"
+     savings = f"~{complexity * 5}-{complexity * 10} eng-days × team rate (complexity {complexity}/10)"
     factor = "Low"
 elif complexity <= 7:
     manual_weeks = "3-6 weeks"
+     savings = f"~{complexity * 5}-{complexity * 10} eng-days × team rate (complexity {complexity}/10)"
     factor = "Medium"
 else:
     manual_weeks = "6-10 weeks"
+     savings = f"~{complexity * 5}-{complexity * 10} eng-days × team rate (complexity {complexity}/10)"
     factor = "High"

 return CostEstimate(
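The new `savings` string replaces fixed dollar bands with an estimate tied to the complexity score. Re-evaluating that f-string in isolation shows what it produces:

```python
def savings_estimate(complexity: int) -> str:
    # Mirrors the f-string in the diff above: effort scales with the
    # 1-10 complexity score instead of a hardcoded dollar range.
    return (
        f"~{complexity * 5}-{complexity * 10} eng-days × team rate "
        f"(complexity {complexity}/10)"
    )

print(savings_estimate(3))   # ~15-30 eng-days × team rate (complexity 3/10)
print(savings_estimate(8))   # ~40-80 eng-days × team rate (complexity 8/10)
```

Expressing savings in engineer-days times the buyer's own rate avoids presenting invented dollar figures as data.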
 
 simple_mode: bool = False,
 ) -> AsyncGenerator[AgentEvent, None]:
     """Run full pipeline and stream AgentEvent objects."""

     yield AgentEvent(
         agent="analyzer",
         status=AgentStatus.RUNNING,
backend/tools/rocprof_wrapper.py CHANGED
@@ -56,7 +56,7 @@ class RocprofWrapper:
     """Run executable with rocprof profiling"""
     if not self.rocm_available:
         # Return mock profiling data
-         return self._get_mock_profiling_data()

     try:
         if args is None:
 
     """Run executable with rocprof profiling"""
     if not self.rocm_available:
         # Return mock profiling data
+         return self.get_mock_profiling_data()

     try:
         if args is None:
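The fallback shape of the wrapper, reduced to a minimal standalone sketch (the `profile` method name and the mock payload fields are hypothetical; `RocprofWrapper`, `rocm_available`, and `get_mock_profiling_data` come from the diff):

```python
import shutil

class RocprofWrapper:
    def __init__(self):
        # Treat ROCm as available when the rocprof binary is on PATH.
        self.rocm_available = shutil.which("rocprof") is not None

    def get_mock_profiling_data(self):
        # Clearly labeled synthetic counters for hosts without ROCm,
        # so downstream consumers can tell mock data from real runs.
        return {"data_source": "mock", "kernel_time_us": 0.0}

    def profile(self, executable, args=None):
        """Run executable with rocprof profiling, or fall back to mock data."""
        if not self.rocm_available:
            # Return mock profiling data
            return self.get_mock_profiling_data()
        if args is None:
            args = []
        # ... real rocprof invocation would go here ...
        return {"data_source": "real_rocm", "cmd": [executable, *args]}
```

Tagging every result with `data_source` is what lets the README claim `data_source: real_rocm` honestly.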
docs/FAILURE_CASES.md CHANGED
@@ -36,3 +36,23 @@ __device__ __forceinline__ unsigned lane_id() {

 ### Trust note
 This is a deliberate example of where ROCmPort AI should report risk, not pretend full automation.
+
+ ## Failure Case: Library-Heavy CUDA Code (CUB, Thrust, cuDNN)
+
+ **Input type**: CUDA kernels that call into CUB, Thrust, or cuDNN directly
+
+ **Example pattern**:
+ ```cpp
+ #include <cub/cub.cuh>
+ cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_in, d_out, num_items);
+ ```
+
+ **What happens**: hipify-clang renames the include to `<hipcub/hipcub.hpp>` and the namespace to `hipcub`. ROCmPort AI passes this through. The translation is mechanically correct.
+
+ **The limitation**: hipCUB API coverage is not 1:1 with CUB. Some primitives behave differently under ROCm, and performance characteristics differ significantly due to wavefront width. ROCmPort AI does not currently benchmark library calls against rocPRIM equivalents.
+
+ **What ROCmPort AI does**: flags the library dependency in the static scan, marks it HIGH risk, and recommends manual review by a ROCm-experienced engineer.
+
+ **What ROCmPort AI does not do**: guarantee correctness or performance parity for library-heavy code without human validation.
+
+ **Fix requirement**: Manual comparison of CUB vs hipCUB primitive behavior for the specific use case, or replacement with rocPRIM equivalents.
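The mechanical rename described in "What happens" can be illustrated with a naive substitution (hipify-clang actually rewrites the clang AST; this sketch only reproduces the visible outcome for the example above):

```python
# Naive illustration of the CUB -> hipCUB rename outcome.
# hipify-clang does this via AST rewriting, not string replacement.
RENAMES = {
    "#include <cub/cub.cuh>": "#include <hipcub/hipcub.hpp>",
    "cub::": "hipcub::",
}

def sketch_hipify(source: str) -> str:
    """Apply the include and namespace renames to a CUDA source string."""
    for old, new in RENAMES.items():
        source = source.replace(old, new)
    return source

cuda_src = (
    "#include <cub/cub.cuh>\n"
    "cub::DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, "
    "d_in, d_out, num_items);\n"
)
print(sketch_hipify(cuda_src))
```

The point of the sketch is that the rename succeeds trivially; nothing in it checks whether the hipCUB primitive actually matches the CUB one in behavior or performance, which is exactly the gap flagged above.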