---
license: apache-2.0
tags:
- code
- multi-language
- pretraining-data
---

# code-graph-v4

Packaged git clones, with full git history, for the graphjepa / code-transformer project.

## Contents

- `clones_csharp_full.tar.gz`
- `clones_java_full.tar.gz`
- `clones_javascript_full.tar.gz`
- `clones_python_full.tar.gz`
- `clones_typescript_full.tar.gz`

Each tarball contains `{language}/{repo_id}/...`. Extract it anywhere, then
point the parser at the extracted directory.
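
To make the layout concrete, here is a minimal sketch of walking an extracted directory. It assumes every repository sits exactly two levels below the extraction root, as `{language}/{repo_id}`; the function name `iter_repos` is hypothetical and not part of the project.

```python
from pathlib import Path
from typing import Iterator, Tuple


def iter_repos(root: str) -> Iterator[Tuple[str, str, Path]]:
    """Yield (language, repo_id, path) for each repo under an extracted root.

    Assumes the `{language}/{repo_id}/...` layout described above.
    """
    for lang_dir in sorted(Path(root).iterdir()):
        if not lang_dir.is_dir():
            continue  # skip stray files next to the language folders
        for repo_dir in sorted(lang_dir.iterdir()):
            if repo_dir.is_dir():
                yield lang_dir.name, repo_dir.name, repo_dir
```

Something like `for lang, repo_id, path in iter_repos("./data_multilang"): ...` could then hand each `path` to the parser.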

## On the receiving (big) machine

```python
import tarfile
from pathlib import Path
from huggingface_hub import hf_hub_download

# Download one tarball from the Hub into the current directory.
path = hf_hub_download(
    repo_id="IDMedicine/code-graph-v4",
    filename="clones_python_full.tar.gz",
    repo_type="model",
    local_dir=".",
)
# Extract it (equivalent to `tar -xzf $path -C ./data_multilang/`).
out_dir = Path("./data_multilang")
out_dir.mkdir(parents=True, exist_ok=True)
with tarfile.open(path, "r:gz") as tar:
    tar.extractall(out_dir)
# Then process each repo with build_bundle.py (needs include_git=True for
# temporal processing; or single-snapshot parsing if code-only).
```
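
The example above fetches a single language. A rough sketch for fetching all five `_full` tarballs listed under Contents might look like the following. The helper names `tarball_name` and `download_all` are hypothetical, and the filename pattern is inferred from the Contents list rather than documented anywhere.

```python
from typing import List

# Languages taken from the Contents list.
LANGUAGES = ["csharp", "java", "javascript", "python", "typescript"]


def tarball_name(language: str, variant: str = "full") -> str:
    """Build a filename following the clones_{language}_{variant} pattern."""
    return f"clones_{language}_{variant}.tar.gz"


def download_all(local_dir: str = ".") -> List[str]:
    """Download every `_full` tarball and return the local file paths."""
    # Imported here so tarball_name stays usable without the dependency.
    from huggingface_hub import hf_hub_download

    return [
        hf_hub_download(
            repo_id="IDMedicine/code-graph-v4",
            filename=tarball_name(lang),
            repo_type="model",
            local_dir=local_dir,
        )
        for lang in LANGUAGES
    ]
```

Downloading sequentially keeps the sketch simple; the archives carry full git history, so check free disk space before calling `download_all()`.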

## Limitations

- If packaged without `.git` (the `_code` variants), **no temporal
  processing is possible** downstream; only single-snapshot SSL is available.
- If packaged with `.git` (the `_full` variants), tarballs are larger
  but the full commit history is preserved for `build_bundle.py`.
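
Since the two variants differ only in whether `.git` is present, a script can check for it before choosing a processing path. The sketch below assumes history lives in a `.git` entry at the top of each repo directory. The function names and the `"temporal"` / `"snapshot"` labels are hypothetical and do not reflect the actual `build_bundle.py` interface.

```python
from pathlib import Path


def has_git_history(repo_dir: str) -> bool:
    """Return True if the repo directory contains a `.git` entry."""
    return (Path(repo_dir) / ".git").exists()


def processing_mode(repo_dir: str) -> str:
    """Pick "temporal" when git history is present, otherwise "snapshot"."""
    return "temporal" if has_git_history(repo_dir) else "snapshot"
```

A caller might pass `include_git=True` downstream only when `processing_mode(path) == "temporal"`.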