Ashira Pitchayapakayakul committed
Commit e161478 · 1 Parent(s): 8056cbe

feat(harvest): lift source-side length caps 6K/8K → 100K/200K chars

User intent, "take all of it" (เอาทั้งหมดมาเลย): capture full-length pairs from upstream
datasets so long-context training material (SWE-Bench traces, Toucan
agent loops, OpenCodeReasoning derivations, multi-file IaC, full
stacktraces) is preserved end-to-end.

Old caps were 6000 prompt + 8000 response chars (~3.5K tokens combined).
That truncated meaningful long-context signal at HARVEST time, before
any downstream stage could see it. The measured distribution of the
currently harvested data shows max=1486 tokens and p99=1364, i.e. the
6K/8K cap, not the natural data shape, was the binding constraint.
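
A minimal sketch of how that length distribution can be re-measured over
one harvested JSONL shard. The "prompt"/"response" field names match the
records written by the workers in this commit; the shard path and the
~4 chars/token heuristic are illustrative assumptions, not pipeline code:

# Sketch: approximate length percentiles for one harvested JSONL shard.
import json

def length_stats(path: str) -> dict:
    tok_estimates = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            row = json.loads(line)
            chars = len(row.get("prompt", "")) + len(row.get("response", ""))
            tok_estimates.append(chars // 4)  # rough chars-to-tokens estimate
    if not tok_estimates:
        return {"n": 0}
    tok_estimates.sort()
    return {
        "n": len(tok_estimates),
        "p99": tok_estimates[int(0.99 * (len(tok_estimates) - 1))],
        "max": tok_estimates[-1],
    }

print(length_stats("data/harvest/shard-000.jsonl"))  # assumed path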

New caps:
prompt 100,000 chars (~25K tokens)
response 200,000 chars (~50K tokens)
combined ≤ ~75K tokens — covers SWE-Bench (≤30K typical), Toucan
(≤8K), OpenCodeReasoning-2 (≤15K), full IaC modules. Outer bound
preserved to prevent runaway storage on full-repo dumps.
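
The token figures above follow the usual ~4 characters/token rule of
thumb; a quick sanity check of that arithmetic (the ratio is an assumed
heuristic, not a tokenizer measurement):

PROMPT_CAP_CHARS = 100_000
RESPONSE_CAP_CHARS = 200_000
CHARS_PER_TOKEN = 4  # rough heuristic

prompt_tokens = PROMPT_CAP_CHARS // CHARS_PER_TOKEN      # ~25K
response_tokens = RESPONSE_CAP_CHARS // CHARS_PER_TOKEN  # ~50K
print(prompt_tokens + response_tokens)                   # ~75K combined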

Updated 10 scripts:
HARVEST (writes to bulk-mirror jsonl + commits to HF dataset shards):
bin/v2/streaming-mirror-worker.sh primary streaming HF mirror
bin/v2/bulk-mirror-worker.sh legacy non-streaming HF download
bin/v2/build-data-pipeline.sh curated SFT/Tools/Agent/DPO matrix
bin/dataset-mirror.sh alternate mirror path
bin/parquet-direct-ingest.sh direct parquet ingest

SYNTHETIC (generates pairs that flow into dataset):
bin/v2/magpie-self-instruct.py Magpie self-instruct synthesis
bin/v2/tool-trace-collector.py agent tool-use trace capture
bin/v2/verify-trace-generator.py verification trace synthesis
bin/v2/self-refine-loop.py self-refinement loop output
bin/v2/sdft-trainer.py SDFT distilled output writer

NOT changed:
- Existing 921M pairs already harvested under old cap stay as-is
  (dedup keys on prompt SHA-256, so re-harvest skips dupes; getting
  longer versions of legacy rows would need a dedup-DB rebuild; see
  the sketch after this list).
- Training-time seq_len cap unchanged: v1.5 stays seq=4K (Kaggle
T4×2 budget binding constraint), v2 stays seq=16K (H200+72B
budget). Long pairs > those caps get truncated at training time
by axolotl's data collator — that's the correct layer to drop
them, not at harvest.
- len(p)<20 / len(r)<30 minimum-length filter retained (drops
trivial garbage rows).
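
A minimal sketch of the prompt-keyed SHA-256 dedup check referenced in
the first item above. The real pipeline keeps its keys in a dedup DB;
the whitespace normalization and the in-memory set here are assumptions
for illustration only:

import hashlib

def prompt_key(prompt: str) -> str:
    normalized = " ".join(prompt.split())  # assumed normalization
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

seen: set[str] = set()  # stand-in for the dedup DB

def keep_pair(prompt: str) -> bool:
    """Return True the first time a prompt is seen, False for duplicates."""
    key = prompt_key(prompt)
    if key in seen:
        return False  # re-harvested duplicate: the existing legacy row wins
    seen.add(key)
    return True

Because the key is computed on the prompt alone, a re-harvested pair with
a longer response hashes to the same key and is skipped, which is why
picking up longer versions of legacy rows would require a dedup-DB rebuild.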

Going forward, every new harvest tick captures full-length material.
Once compute scales (Lightning H200 in the v2 path), the seq=16K
training will start using the long pairs effectively.

bin/dataset-mirror.sh CHANGED
@@ -313,8 +313,8 @@ for src_id, slug in SOURCES:
         if not r and row.get("answer"):
             r = str(row["answer"])

-        p = str(p).strip()[:6000]
-        r = str(r).strip()[:8000]
+        p = str(p).strip()[:100000]
+        r = str(r).strip()[:200000]
         if len(p) < 20 or len(r) < 30:
             continue
         if not is_relevant(p, r):
bin/parquet-direct-ingest.sh CHANGED
@@ -96,7 +96,7 @@ try:
     row = {c: table.column(c)[i].as_py() for c in cols}
     # Detect schema by available columns + extract prompt+response
     if 'text' in cols:
-        text = str(row.get('text','') or '')[:8000]
+        text = str(row.get('text','') or '')[:200000]
         if len(text) < 500: skipped += 1; continue
         # Web-text quality filter
         if not any(s in text for s in ('?','\`\`\`','# ','## ')) and not any(s in text.lower() for s in ('step ','first,','to solve','function ','def ','class ')):
@@ -110,10 +110,10 @@ try:
         response = text
     elif 'instruction' in cols and 'response' in cols:
         prompt = str(row.get('instruction','') or '')[:4000]
-        response = str(row.get('response','') or '')[:8000]
+        response = str(row.get('response','') or '')[:200000]
         if len(prompt) < 30 or len(response) < 30: skipped += 1; continue
     elif 'content' in cols and 'language' in cols:
-        code = str(row.get('content','') or '')[:6000]
+        code = str(row.get('content','') or '')[:100000]
         lang = str(row.get('language','') or 'code')
         if len(code) < 80 or len(code) > 6000: skipped += 1; continue
         prompt = f'Explain this {lang} code:'
@@ -130,7 +130,7 @@ try:
             'ts': time.time(),
             'source': f'parquet:{src_repo}',
             'parquet_shard': '\$shard_name',
-            'prompt': prompt[:8000],
+            'prompt': prompt[:100000],
             'response': response[:12000],
         }, ensure_ascii=False) + '\n')
         written += 1
bin/v2/build-data-pipeline.sh CHANGED
@@ -106,7 +106,7 @@ with open(out_path, "w") as f:
     if u and a: p, r = u, a

     if not p or not r: continue
-    p, r = str(p)[:6000].strip(), str(r)[:8000].strip()
+    p, r = str(p)[:100000].strip(), str(r)[:200000].strip()

     # Sanitize: drop polluted/PII/secrets/refusals
     v = filter_pair(p, r)
bin/v2/bulk-mirror-worker.sh CHANGED
@@ -66,7 +66,7 @@ with open(out_path, "w") as f:
     a = next((m.get("content","") or m.get("value","") for m in msgs if m.get("role") in ("assistant","gpt") or m.get("from") in ("assistant","gpt")), "")
     if u and a: p, r = u, a
     if not p or not r: continue
-    p, r = str(p)[:6000].strip(), str(r)[:8000].strip()
+    p, r = str(p)[:100000].strip(), str(r)[:200000].strip()
     if len(p) < 20 or len(r) < 30: continue
     v = filter_pair(p, r)
     if not v["keep"]: continue
bin/v2/magpie-self-instruct.py CHANGED
@@ -154,8 +154,8 @@ def main():
         continue

     fout.write(json.dumps({
-        "prompt": user_q[:6000],
-        "response": asst_r[:8000],
+        "prompt": user_q[:100000],
+        "response": asst_r[:200000],
         "source": f"magpie-{use_model}",
         "domain_persona": sys_prompt,
         "ts": datetime.utcnow().isoformat(),
bin/v2/sdft-trainer.py CHANGED
@@ -121,8 +121,8 @@ def process(prompt: str, gold: str) -> dict | None:
     if not filter_pair(prompt, distilled)["keep"]:
         return None
     return {
-        "prompt": prompt[:6000],
-        "response": distilled[:6000],
+        "prompt": prompt[:100000],
+        "response": distilled[:200000],
         "source": "sdft",
         "meta": {
             "y_hat_len": len(y_hat),
bin/v2/self-refine-loop.py CHANGED
@@ -113,8 +113,8 @@ def process(prompt: str) -> dict | None:
         return None

     return {
-        "prompt": prompt[:6000],
-        "response": answer[:6000],
+        "prompt": prompt[:100000],
+        "response": answer[:200000],
         "source": "self-refine",
         "meta": {
             "iterations_used": len(history),
bin/v2/streaming-mirror-worker.sh CHANGED
@@ -114,7 +114,7 @@ with open(out_path, "a") as f:
         p, r = t[:cut].strip(), t[cut:].strip()
     else:
         continue
-    p = str(p)[:6000].strip(); r = str(r)[:8000].strip()
+    p = str(p)[:100000].strip(); r = str(r)[:200000].strip()
     if len(p) < 20 or len(r) < 30: continue
     v = filter_pair(p, r)
     if not v["keep"]: continue
bin/v2/tool-trace-collector.py CHANGED
@@ -132,7 +132,7 @@ def _trace_to_pair(prompt_ctx: str, traces: list[dict]) -> dict | None:
         return None
     return {
         "prompt": prompt_ctx[:4000],
-        "response": asst_text[:6000],
+        "response": asst_text[:200000],
         "source": "tool-trace",
         "meta": {
             "n_calls": len(traces),
bin/v2/verify-trace-generator.py CHANGED
@@ -157,8 +157,8 @@ def synthesize_trace(prompt: str, gold: str) -> dict | None:
         return None

     return {
-        "prompt": prompt[:6000],
-        "response": trace[:8000],
+        "prompt": prompt[:100000],
+        "response": trace[:200000],
         "source": "verify-trace",
         "meta": {"domain": domain, "n_probes": len(probes)},
     }