|
|
| <html> |
| <head> |
| <meta charset="UTF-8"> |
| </head> |
| <style> |
| |
| table td { vertical-align: top; } |
| |
| .stack-trie { white-space: nowrap; font-family: monospace; } |
| .stack-trie ul { padding-left: 1ch; } |
| .stack-trie li { margin-left: 1ch; list-style-type: none; } |
| .stack-trie .marker { |
| cursor: pointer; |
| } |
| .stack-trie .marker.collapsed::before { |
| content: "+ "; |
| } |
| .stack-trie .marker:not(.collapsed)::before { |
| content: "- "; |
| } |
| .stack-trie a { text-decoration: none; } |
| .stack-trie a:hover { text-decoration: underline; } |
| .status-missing { background-color: purple; color: white; } |
| .status-error { background-color: red; color: white; } |
| .status-empty { background-color: white; color: black; } |
| .status-ok { background-color: green; color: white; } |
| .status-break { background-color: lime; color: black; } |
| summary::-webkit-details-marker { color: #00ACF3; font-size: 125%; margin-right: 2px; } |
| summary:focus { outline-style: none; } |
| article > details > summary { font-size: 28px; margin-top: 16px; } |
| details > p { margin-left: 24px; } |
| details details summary { font-size: 16px; } |
| |
| </style> |
| <script> |
| |
| function toggleList(toggleItem) { |
| const listItem = toggleItem.parentNode; |
| const nestedList = listItem.querySelector('ul'); |
| if (nestedList) { |
| nestedList.style.display = nestedList.style.display === 'none' ? 'block' : 'none'; |
| |
| |
| toggleItem.classList.toggle('collapsed'); |
| } |
| } |
| |
| </script> |
| <body> |
| <div> |
|
|
| <h2>Stack trie</h2> |
| <p> |
| The <strong>stack trie</strong> is a way of getting a quick orientation on where all the |
| compilations in a model take place, especially if you are compiling a codebase you are unfamiliar with. |
| It is a tree of stack frames, covering all stacks that triggered PT2 compilation. If only a single |
| stack is in the tree, you will simply see a plain list of frames (most recent call last). With |
| multiple stacks, at every point where two stacks diverge from a common prefix, we increase |
| the indentation of the list and show a separate sub-list per sub-tree. |
| </p> |
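| <p> |
| As a rough illustration of the structure described above, the following sketch builds a trie from |
| a few stacks and renders it with indentation that grows only where stacks diverge. The function |
| names <code>buildStackTrie</code> and <code>renderStackTrie</code> and the sample frame names are |
| hypothetical; this is not the code that generated this page, and it ignores the compile id links |
| attached to frames. |
| </p> |

```javascript
// Hypothetical sketch: each stack is an array of frame descriptions, outermost frame first.
function buildStackTrie(stacks) {
  const root = { frame: null, children: new Map() };
  for (const stack of stacks) {
    let node = root;
    for (const frame of stack) {
      // Shared prefixes reuse the same node; a new node is created at the first divergence.
      if (!node.children.has(frame)) {
        node.children.set(frame, { frame, children: new Map() });
      }
      node = node.children.get(frame);
    }
  }
  return root;
}

// Render the trie as text lines; indentation increases only below a node with several children.
function renderStackTrie(node, depth = 0, lines = []) {
  const branching = node.children.size > 1;
  for (const child of node.children.values()) {
    const childDepth = branching ? depth + 1 : depth;
    lines.push('  '.repeat(childDepth) + child.frame);
    renderStackTrie(child, childDepth, lines);
  }
  return lines;
}

// Example with made-up frames: two stacks share "main" and "train", then diverge.
const exampleTrie = buildStackTrie([
  ['main', 'train', 'forward'],
  ['main', 'train', 'loss'],
]);
console.log(renderStackTrie(exampleTrie).join('\n'));
```

| <p> |
| With a single stack the output is a flat list of frames, matching the plain-list case described above. |
| </p> |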
| <p> |
| Links to particular compilations are color coded by status: |
| <span class="status-ok">[Success]</span>, |
| <span class="status-break">[Success with restart (e.g., graph break)]</span>, |
| <span class="status-empty">[Empty graph]</span>, |
| <span class="status-error">[Error]</span>, |
| <span class="status-missing">[Metrics were missing]</span> |
| </p> |
| <details><summary>Stack</summary><div class='stack-trie'><ul><li>/shared_volume/repos/quark/bench_qdq.py:161 in &lt;module&gt;<br> mean, median = do_bench(run_scaled_fake_quantize_comp, kwargs_scaled_fake_quantize, num_runs=num_runs, num_warmup=num_warmup, name="quark qdq")</li> |
| <li>/shared_volume/repos/quark/bench_qdq.py:70 in do_bench<br> f(**kwargs)</li> |
| <li><a href='#[0/0]' class='status-ok'>[0/0]</a> /shared_volume/repos/quark/bench_qdq.py:7 in run_scaled_fake_quantize<br> </li> |
| </ul></div></details> |
| </div> |
| <div> |
|
|
| <h2>IR dumps</h2> |
| <p> |
| The <strong>IR dumps</strong> collect the intermediate products dumped at various points of the PT2 |
| compilation process. The products are organized by compile id and, within each compile id, sorted in |
| chronological order. |
| </p> |
| <p> |
| A <strong>compile id</strong> uniquely identifies a particular compilation inside a PT2 |
| program. It is traditionally written as <code>[x/y]</code>, where the <strong>frame id</strong> x |
| identifies the particular Python frame which we are compiling, and the <strong>frame compile |
| id</strong> y identifies how many times we've recompiled this same frame. For example, |
| <code>[0/0]</code> refers to the very first frame compiled by PT2; <code>[0/1]</code> refers to the |
| first recompilation of this frame, while <code>[1/0]</code> refers to a different frame, with a |
| distinct code cache, which we are compiling next (perhaps because of a graph break). Although |
| Dynamo treats distinct frames as completely unrelated, a frame compilation could overlap with another |
| frame; for example, if you graph break in an inlined function, Dynamo will typically try to compile |
| the nested frame again as an inner frame. You can identify the hierarchical relationship between |
| frames by looking at the stack trie above. |
| </p> |
| <p> |
| In some situations, the compile id will have an extra signifier <code>[x/y_z]</code>, where z is the |
| <strong>attempt</strong> for this particular (re)compilation. Certain conditions will cause Dynamo to |
| restart analysis, when Dynamo discovers that it needs to undo a decision it previously made. The most |
| common cause of a restart is a graph break in an inlined function call, which forces Dynamo to restart |
| and avoid inlining the function in the first place. |
| </p> |
| <p> |
| When compiled autograd is enabled, the compile id will include a prefix signifier <code>[!a/x/y]</code>, |
| where a is the <strong>compiled autograd id</strong>. For instance, <code>[!0/-/-]</code> refers |
| to the first graph captured by compiled autograd. It is then traced by torch.compile as <code>[!0/x/y_z]</code>. |
| </p> |
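| <p> |
| To make the notation concrete, here is a minimal sketch that splits a compile id string into its |
| parts, covering the <code>[x/y]</code>, <code>[x/y_z]</code> and <code>[!a/x/y]</code> forms described |
| above. The function name <code>parseCompileId</code> and the field names are hypothetical, and the |
| pattern is an assumption based only on the forms shown on this page, not PT2's own parser. |
| </p> |

```javascript
// Hypothetical helper: parse strings such as "[0/1]", "[0/1_2]", "[!0/-/-]" or "[!0/3/1_2]".
function parseCompileId(text) {
  // Optional "!a/" prefix, then frame id, frame compile id, and an optional "_attempt" suffix.
  // A "-" stands for a missing frame id or frame compile id.
  const pattern = /^\[(?:!(\d+)\/)?(\d+|-)\/(\d+|-)(?:_(\d+))?\]$/;
  const match = pattern.exec(text);
  if (match === null) return null;
  // Absent parts and "-" placeholders are reported as null.
  const toNumber = (s) => (s === undefined || s === '-' ? null : Number(s));
  return {
    compiledAutogradId: toNumber(match[1]),
    frameId: toNumber(match[2]),
    frameCompileId: toNumber(match[3]),
    attempt: toNumber(match[4]),
  };
}

console.log(parseCompileId('[0/1]'));
console.log(parseCompileId('[!0/-/-]'));
```

| <p> |
| Strings that do not match any of these forms yield <code>null</code> rather than a partial result. |
| </p> |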
| <p> |
| Here is a high-level description of PT2's compilation phases and the intermediate products each |
| phase generates: |
| </p> |
| <ol> |
| <li><em>Optional:</em> If compiled autograd is enabled, and we are processing a backward call, compiled autograd will trace the autograd graph from the autograd engine and produce an FX graph <code>compiled_autograd_graph</code> that Dynamo will then trace. Otherwise, Dynamo will directly trace the user's bytecode.</li> |
| <li>Dynamo symbolically evaluates the Python bytecode of a program, producing <code>dynamo_output_graph</code>.</li> |
| <li><em>Optional:</em> If <code>optimize_ddp</code> is enabled, the DDPOptimizer will split the Dynamo output graph to improve the pipelining of communications. Each split subgraph is <code>optimize_ddp_split_child_submod</code>, and the high-level graph that plumbs the subgraphs together is <code>optimize_ddp_split_graph</code>. If there are multiple splits, each subsequent build product will be produced multiple times, once for each split.</li> |
| <li>AOTAutograd traces the (possibly split) Dynamo output graph, producing an <code>aot_joint_graph</code> if backward is enabled. It then partitions the graph into <code>aot_forward_graph</code> and <code>aot_backward_graph</code>. If training is not needed, there may only be an <code>aot_inference_graph</code>.</li> |
| <li>Inductor will apply some post-grad FX passes, producing <code>inductor_post_grad_graph</code>.</li> |
| <li>Inductor will perform code generation, producing the final <code>inductor_output_code</code> which will be executed at runtime. This output is a valid Python program and can be directly run.</li> |
| </ol> |
|
|
|
|
| <h2> Chromium Events </h2> |
| <p> |
| PT2 emits <a href='chromium_events.json'>Chromium Trace Events</a> as JSON for specific events during compilation. |
| You can download the file and view it in a tool like <a href='https://ui.perfetto.dev/'>Perfetto</a>. |
| </p> |
|
|
| <p> |
| Build products below: |
| </p> |
| <ul> |
|
|
| <li><a id="[0/0]">[0/0]</a> |
| <ul> |
| |
| <li><a href="-_0_0_0/dynamo_output_graph_0.txt">-_0_0_0/dynamo_output_graph_0.txt</a> (0)</li> |
| |
| <li><a href="-_0_0_0/inductor_pre_grad_graph_1.txt">-_0_0_0/inductor_pre_grad_graph_1.txt</a> (1)</li> |
| |
| <li><a href="-_0_0_0/before_recompile_pre_grad_2.txt">-_0_0_0/before_recompile_pre_grad_2.txt</a> (2)</li> |
| |
| <li><a href="-_0_0_0/after_recompile_pre_grad_3.txt">-_0_0_0/after_recompile_pre_grad_3.txt</a> (3)</li> |
| |
| <li><a href="-_0_0_0/aot_forward_graph_fw_metadata_4.txt">-_0_0_0/aot_forward_graph_fw_metadata_4.txt</a> (4)</li> |
| |
| <li><a href="-_0_0_0/aot_inference_graph_5.txt">-_0_0_0/aot_inference_graph_5.txt</a> (5)</li> |
| |
| <li><a href="-_0_0_0/torch._functorch.config_6.txt">-_0_0_0/torch._functorch.config_6.txt</a> (6)</li> |
| |
| <li><a href="-_0_0_0/inductor_output_code_ch44xxkifazlcpkp6mi44xhqeej2j5mbgwmesiwx6y3oajzmixxp_7.html">-_0_0_0/inductor_output_code_ch44xxkifazlcpkp6mi44xhqeej2j5mbgwmesiwx6y3oajzmixxp_7.html</a> (7)</li> |
| |
| <li><a href="-_0_0_0/fx_graph_cache_hit_8.json">-_0_0_0/fx_graph_cache_hit_8.json</a> ✅ (8)</li> |
| |
| <li><a href="-_0_0_0/aotautograd_cache_bypass_9.json">-_0_0_0/aotautograd_cache_bypass_9.json</a> ❓ (9)</li> |
| |
| <li><a href="-_0_0_0/dynamo_cpp_guards_str_10.txt">-_0_0_0/dynamo_cpp_guards_str_10.txt</a> (10)</li> |
| |
| <li><a href="-_0_0_0/compilation_metrics_11.html">-_0_0_0/compilation_metrics_11.html</a> (11)</li> |
| |
| </ul> |
| </li> |
|
|
| </ul> |
| </div> |
|
|
|
|
|
|
|
|
|
|
|
|
| <script> |
| document.addEventListener('DOMContentLoaded', function() { |
| |
| |
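| // Propagate the current page's query parameters to every relative link. |
| const queryParams = new URLSearchParams(window.location.search); |
| // Nothing to propagate: leave the links untouched. |
| if (queryParams.size === 0) return; |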
| |
| function appendQueryParams(url) { |
| // Resolve the (possibly relative) href against the page URL. |
| const newURL = new URL((new Request(url)).url); |
| const newSearchParams = new URLSearchParams(newURL.searchParams); |
| |
| |
| for (const [key, value] of queryParams) { |
| newSearchParams.set(key, value); |
| } |
| |
| newURL.search = newSearchParams; |
| return newURL; |
| } |
| |
| |
| const relativeLinks = document.querySelectorAll('a[href]:not([href^="http://"]):not([href^="https://"]):not([href^="\#"])'); |
| |
| |
| relativeLinks.forEach((link) => { |
| link.setAttribute("href", appendQueryParams(link.getAttribute("href"))) |
| }); |
| }); |
| </script> |
|
|
| </body> |
| </html> |
|
|