Public Benchmark Progress

Date: 2026-04-01 UTC

Confirmed Real Public Benchmark Result

  • Public occlusion proxy: ManiSkill PickClutterYCB-v1
  • Strongest adapter-specific result so far:
    • summary: /workspace/workspace/reports/maniskill_pickclutter_smoke_v5_eval_tuned_softerpref/public_benchmark_package_summary.json
    • trunk_only_ft = 0.04
    • adapter_noop = 0.04
    • adapter_active_ft = 0.62
    • delta_active_vs_trunk = +0.58
    • 95% CI = [0.44, 0.72]
    • intervention_rate = 1.0
    • non_base_selection_rate = 1.0
  • Interpretation:
    • this is a real adapter-specific sign of life on a public occlusion benchmark
    • the gain does not come from a stronger shared trunk: adapter_noop stays flat at the trunk_only_ft level (0.04), so the improvement appears only when the adapter is active
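
The interpretation above rests on two simple checks: the active-vs-trunk delta, and whether adapter_noop moves away from the trunk baseline. A minimal sketch of that arithmetic follows, using the rates reported above. The function name and the `flat_tol` tolerance are hypothetical and not part of the evaluation package; the sketch does not reproduce the 95% CI, whose method is not stated here.

```python
# Success rates copied from the summary bullets above.
rates = {
    "trunk_only_ft": 0.04,
    "adapter_noop": 0.04,
    "adapter_active_ft": 0.62,
}


def adapter_specific_gain(rates, flat_tol=0.02):
    """Return (delta_active_vs_trunk, noop_is_flat).

    flat_tol is a hypothetical tolerance for calling adapter_noop "flat";
    it is an assumption of this sketch, not a value from the report.
    """
    delta = rates["adapter_active_ft"] - rates["trunk_only_ft"]
    noop_shift = abs(rates["adapter_noop"] - rates["trunk_only_ft"])
    return delta, noop_shift <= flat_tol


delta, noop_is_flat = adapter_specific_gain(rates)
print(f"delta_active_vs_trunk = {delta:+.2f}, adapter_noop flat: {noop_is_flat}")
```

If adapter_noop had risen along with adapter_active_ft, the second value would be False and the gain could not be attributed to the adapter alone.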

BEHAVIOR Bag Proxy Investigation

Target public task family:

  • official BEHAVIOR grocery-store bag/container retrieval proxy
  • primary candidate: paying_for_purchases
  • stricter but currently unusable candidate: buy_basic_garden_tools

Environment used:

  • BEHAVIOR assets: /workspace/workspace/BEHAVIOR-1K
  • venv used for probes: /workspace/envs/behavior

Findings:

  • buy_basic_garden_tools is blocked by official scene-task geometry:
    • repeated failure on ontop ['rake.n.03_1', 'grocery_shelf.n.01_1']
    • even with whitelist attempts, the sampler never found a valid shelf placement
  • paying_for_purchases gets further, although it still fails at sampling (see below):
    • grocery_store_convenience, grocery_store_cafe, and grocery_store_asian all load
    • object scope binds the real task objects:
      • shopping_basket.n.01_1
      • money.n.01_1
      • checkout.n.03_1
      • floor.n.01_1
  • Root sampler bug:
    • official online sampling fails on the floor / agent chain
    • without patching, the blocking warning is:
      • Room type [grocery_store] ... floor.n.01_1: , checkout.n.03_1: grocery_store_0
    • after removing the agent-on-floor condition from the sampler pipeline, the next blocker is:
      • ontop ['shopping_basket.n.01_1', 'floor.n.01_1'] False
  • Critical state-probe result:
    • even when object bindings exist, the sampled movable objects remain parked at their far-away import positions
    • observed example on grocery_store_asian:
      • basket position near [120, 120, -80]
      • money position near [115, 115, -85]
      • apples position near [110, 110, -90] and [105, 105, -95]
    • money inside basket = False
    • apple1 inside basket = False
    • apple2 inside basket = False
  • Conclusion:
    • as of 2026-04-01, the BEHAVIOR bag proxy is not yet usable as a fair evaluation track in this workspace
    • the public task objects bind, but the online sampler does not produce a valid initial scene for training or evaluation
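
The state-probe finding above can be turned into a quick sanity check to run before any future BEHAVIOR attempt: flag every movable object that still sits far from the scene. The sketch below is standalone Python over the positions observed above and calls no BEHAVIOR or OmniGibson API. Measuring distance from the world origin, the `max_distance` threshold, and the mapping of the two apple positions to apple1 and apple2 are all assumptions made for illustration.

```python
import math

# Approximate positions copied from the grocery_store_asian probe above.
# Which apple has which position is an assumption of this sketch.
observed = {
    "basket": [120, 120, -80],
    "money": [115, 115, -85],
    "apple1": [110, 110, -90],
    "apple2": [105, 105, -95],
}


def parked_objects(positions, max_distance=50.0):
    """Return sorted names of objects farther than max_distance from the origin.

    max_distance is a hypothetical threshold; a real check would compare
    against the bounds of the loaded scene rather than the world origin.
    """
    return sorted(
        name
        for name, pos in positions.items()
        if math.hypot(*pos) > max_distance
    )


still_parked = parked_objects(observed)
print("objects still at import positions:", still_parked)
```

A sampled scene would only be worth evaluating once this list comes back empty and the inside-basket predicates hold.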

Garment / Cloth Proxy Status

  • GarmentLab repo cloned:
    • /workspace/workspace/GarmentLab
  • Immediate constraint:
    • the repo expects Isaac Sim 4.0.0 plus external Google Drive assets
  • Current status:
    • code inspected only
    • no runnable public cloth benchmark execution completed yet in this workspace

Next Public Proxy Candidates

Given the BEHAVIOR blocker, the next-lightest public candidates already available locally are:

  • OpenCabinetDrawer-v1
    • public ManiSkill task
    • good container reveal / access proxy
  • PutEggplantInBasketScene-v1
    • public ManiSkill bridge-dataset task
    • public basket / container interaction proxy
  • PutSpoonOnTableClothInScene-v1
    • public ManiSkill bridge-dataset cloth interaction proxy

Immediate Recommendation

  • Keep the confirmed PickClutterYCB-v1 result as the anchor public success case.
  • Do not spend more time on BEHAVIOR online sampling until either:
    • a cached valid scene instance is created, or
    • the sampler is patched deeply enough to place container objects correctly instead of leaving them at far-away import positions.
  • Pivot the next train/eval smoke to a lighter public ManiSkill proxy before returning to BEHAVIOR.