anurag008w committed on
Commit d721339 · 1 Parent(s): 33980b3

Fail fingerprint pass on transient file races

Files changed (3)
  1. CHANGELOG.md +38 -0
  2. openclaw-sync.py +39 -16
  3. start.sh +7 -2
CHANGELOG.md CHANGED
@@ -2,6 +2,44 @@
 
 All notable changes to this project will be documented in this file.
 
+## [1.5.0] - 2026-05-13
+
+### Added
+
+- **NVIDIA key-rotation support** — added `nvidia-key-rotator.cjs` wiring and startup integration so deployments can rotate NVIDIA credentials similarly to other provider key-rotation flows.
+- **Cloudflare keep-alive automation** — added/expanded `cloudflare-keepalive-setup.py` flow and startup wiring to provision keep-alive through Cloudflare Worker automation instead of the older UptimeRobot-first approach.
+- **Sync metadata marker model** — introduced a structured workspace marker `(file_count, total_size, newest_mtime, metadata_hash)` to support stronger change introspection in sync code.
+
+### Changed
+
+- **Workspace sync script rename finalized** — `workspace-sync.py` flow was migrated to `openclaw-sync.py` in Docker/startup/docs so restore/sync behavior is centralized under one script.
+- **Sync trigger behavior hardened for config churn** — OpenClaw config sync now debounces until JSON settles before immediate sync, reducing false/partial syncs during rapid config writes.
+- **Gateway restart flow now saves state first** — restart path was updated to run a pre-restart one-shot state sync so gateway reloads are less likely to drop recent state.
+- **Shutdown backup now uses a two-step pass** — graceful shutdown now attempts `sync-once-settled` then a final `sync-once` pass to better capture last-second writes.
+- **Telegram allowlist simplified** — consolidated Telegram allowlist into `TELEGRAM_ALLOWED_USERS` and aligned docs/examples.
+- **Plugin startup behavior aligned** — startup-installed plugins are synced into `plugins.allow` before gateway launch so runtime-installed plugins are recognized cleanly.
+- **Cloudflare proxy path matured** — multiple iterations improved fetch/proxy behavior (header handling, endpoint scoping, API root routing, URL parsing, and logging noise reduction), then simplified unstable undici patching paths.
+- **Health dashboard polish** — sync timestamps now show local time, footer credits were corrected, and status rendering/docs were updated for the Cloudflare keep-alive model.
+- **CI workflow churn documented** — GitHub workflow files for HF sync were added/renamed/cleaned multiple times as space/repo naming stabilized.
+
+### Fixed
+
+- **Missed rapid backup updates** — sync logic now relies on content fingerprint checks for no-op decisions so same-second or quick successive changes are less likely to be skipped.
+- **Non-deterministic metadata hashing** — metadata hashing now iterates paths deterministically to avoid hash jitter from traversal order.
+- **Transient file race sync failures** — sync fingerprinting/snapshot copy paths now tolerate transient `OSError` (file rotated/deleted mid-scan) instead of aborting the whole sync pass.
+- **State restore migration edge cases** — restore flow includes migration/cleanup behavior for legacy hidden state paths and stale backup entries.
+- **Startup/env robustness** — fixed shell export formatting/syntax issues (e.g., NVIDIA/XAI lines) and unbound-variable pitfalls in startup scripts.
+- **Proxy runtime errors and noise** — fixed specific proxy runtime issues (including `UND_ERR_INVALID_ARG`, fetch duplex handling, and upstream error visibility) and reduced noisy stdout logs that interfered with clean process output.
+- **HF workflow/repo reference mismatches** — corrected and later cleaned workflow repository references during repo migration/restructure.
+
+### Docs
+
+- README/.env/security docs were refreshed across multiple commits to reflect:
+  - Cloudflare keep-alive replacing UptimeRobot setup path,
+  - updated secrets and startup environment behavior,
+  - provider/key-rotation options,
+  - backup/sync behavior and troubleshooting guidance.
+
 ## [1.4.0] - 2026-04-25
 
 ### Added
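Reviewer note: the changelog entries on the sync metadata marker and deterministic hashing can be illustrated with a minimal sketch. It works on in-memory `(relative_path, size, mtime_ns)` tuples instead of a real directory scan, and `marker_from_entries` and `FileMeta` are hypothetical names, not part of `openclaw-sync.py`. It shows a rename that leaves the three aggregate fields unchanged but alters the hash, and that input order does not affect the result.

```python
import hashlib
from typing import Iterable

# Hypothetical stand-in for one scanned file: (relative_path, size, mtime_ns).
FileMeta = tuple[str, int, int]
WorkspaceMarker = tuple[int, int, int, str]


def marker_from_entries(entries: Iterable[FileMeta]) -> WorkspaceMarker:
    """Build (file_count, total_size, newest_mtime, metadata_hash)."""
    file_count = 0
    total_size = 0
    newest_mtime = 0
    hasher = hashlib.sha256()
    # Sorting makes the hash independent of traversal order.
    for rel, size, mtime_ns in sorted(entries):
        file_count += 1
        total_size += size
        newest_mtime = max(newest_mtime, mtime_ns)
        # NUL separators keep adjacent fields from running together.
        for field in (rel, str(size), str(mtime_ns)):
            hasher.update(field.encode("utf-8"))
            hasher.update(b"\0")
    return (file_count, total_size, newest_mtime, hasher.hexdigest())


before = [("a.txt", 10, 100), ("b.txt", 20, 200)]
# Same files, but a.txt was moved with its size and mtime preserved.
renamed = [("notes/a.txt", 10, 100), ("b.txt", 20, 200)]

m_before = marker_from_entries(before)
m_renamed = marker_from_entries(renamed)
m_shuffled = marker_from_entries(reversed(before))

print(m_before[:3] == m_renamed[:3])  # aggregates alone cannot tell them apart
print(m_before[3] != m_renamed[3])    # the per-file hash can
print(m_before == m_shuffled)         # order of discovery does not matter
```

The sketch sorts by relative path string, whereas the diff sorts `Path` objects from `rglob`; the two orderings may differ in detail, but either one is stable across runs.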
openclaw-sync.py CHANGED
@@ -18,6 +18,7 @@ import sys
 import tempfile
 import threading
 import time
+from typing import TypeAlias
 from pathlib import Path
 
 os.environ.setdefault("HF_HUB_DISABLE_PROGRESS_BARS", "1")
@@ -76,6 +77,7 @@ RESET_MARKER = WORKSPACE / ".reset_credentials"
 HF_API = HfApi(token=HF_TOKEN) if HF_TOKEN else None
 STOP_EVENT = threading.Event()
 _REPO_ID_CACHE: str | None = None
+WorkspaceMarker: TypeAlias = tuple[int, int, int, str]
 
 
 def write_status(status: str, message: str) -> None:
@@ -280,14 +282,15 @@ def file_marker(path: Path) -> tuple[int, int, int]:
     return (1, int(stat.st_size), int(stat.st_mtime_ns))
 
 
-def metadata_marker(root: Path) -> tuple[int, int, int]:
+def metadata_marker(root: Path) -> WorkspaceMarker:
     if not root.exists():
-        return (0, 0, 0)
+        return (0, 0, 0, "")
 
     file_count = 0
     total_size = 0
     newest_mtime = 0
-    for path in root.rglob("*"):
+    metadata_hasher = hashlib.sha256()
+    for path in sorted(root.rglob("*")):
         if not path.is_file():
             continue
         rel = path.relative_to(root).as_posix()
@@ -298,9 +301,17 @@ def metadata_marker(root: Path) -> tuple[int, int, int]:
         except OSError:
             continue
         file_count += 1
-        total_size += int(stat.st_size)
-        newest_mtime = max(newest_mtime, int(stat.st_mtime_ns))
-    return (file_count, total_size, newest_mtime)
+        size = int(stat.st_size)
+        mtime_ns = int(stat.st_mtime_ns)
+        total_size += size
+        newest_mtime = max(newest_mtime, mtime_ns)
+        metadata_hasher.update(rel.encode("utf-8"))
+        metadata_hasher.update(b"\0")
+        metadata_hasher.update(str(size).encode("ascii"))
+        metadata_hasher.update(b"\0")
+        metadata_hasher.update(str(mtime_ns).encode("ascii"))
+        metadata_hasher.update(b"\0")
+    return (file_count, total_size, newest_mtime, metadata_hasher.hexdigest())
 
 
 def fingerprint_dir(root: Path) -> str:
@@ -313,9 +324,16 @@ def fingerprint_dir(root: Path) -> str:
         if _should_exclude(rel, path):
             continue
         hasher.update(rel.encode("utf-8"))
-        with path.open("rb") as handle:
-            for chunk in iter(lambda: handle.read(1024 * 1024), b""):
-                hasher.update(chunk)
+        try:
+            with path.open("rb") as handle:
+                for chunk in iter(lambda: handle.read(1024 * 1024), b""):
+                    hasher.update(chunk)
+        except (FileNotFoundError, IsADirectoryError, NotADirectoryError):
+            # Fingerprint must represent a complete view of the workspace.
+            # Retry next sync pass instead of silently hashing a partial tree.
+            raise RuntimeError(
+                f"Workspace changed while hashing {rel}; retrying next sync pass."
+            )
     return hasher.hexdigest()
 
 
@@ -331,7 +349,13 @@ def create_snapshot_dir(source_root: Path) -> Path:
             target.mkdir(parents=True, exist_ok=True)
             continue
         target.parent.mkdir(parents=True, exist_ok=True)
-        shutil.copy2(path, target)
+        try:
+            shutil.copy2(path, target)
+        except (FileNotFoundError, IsADirectoryError, NotADirectoryError):
+            # Do not upload a partial snapshot; let caller retry on next loop.
+            raise RuntimeError(
+                f"Snapshot changed while copying {rel_posix}; retrying next sync pass."
+            )
     return staging_root
 
 
@@ -396,16 +420,15 @@ def restore_workspace() -> bool:
 
 def _sync_once_unlocked(
     last_fingerprint: str | None = None,
-    last_marker: tuple[int, int, int] | None = None,
-) -> tuple[str, tuple[int, int, int]]:
+    last_marker: WorkspaceMarker | None = None,
+) -> tuple[str, WorkspaceMarker]:
     if not HF_TOKEN:
         write_status("disabled", "HF_TOKEN is not configured.")
-        return (last_fingerprint or "", last_marker or (0, 0, 0))
+        return (last_fingerprint or "", last_marker or (0, 0, 0, ""))
 
     snapshot_state_into_workspace()
     repo_id = ensure_repo_exists()
     current_marker = metadata_marker(WORKSPACE)
-
     if last_marker is not None and current_marker == last_marker:
         write_status("synced", "No workspace changes detected.")
         return (last_fingerprint or "", current_marker)
@@ -444,8 +467,8 @@ def _sync_once_unlocked(
 
 def sync_once(
     last_fingerprint: str | None = None,
-    last_marker: tuple[int, int, int] | None = None,
-) -> tuple[str, tuple[int, int, int]]:
+    last_marker: WorkspaceMarker | None = None,
+) -> tuple[str, WorkspaceMarker]:
     SYNC_LOCK_FILE.parent.mkdir(parents=True, exist_ok=True)
     with SYNC_LOCK_FILE.open("w", encoding="utf-8") as lock_handle:
         fcntl.flock(lock_handle, fcntl.LOCK_EX)
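Reviewer note: with this change `fingerprint_dir` and `create_snapshot_dir` raise `RuntimeError` when a file disappears between listing and reading, so a pass either covers the whole tree or fails outright. How the sync loop reacts is not shown in this diff. The following is only a sketch of one possible caller, assuming it keeps the previous fingerprint and retries on its next iteration. `fingerprint_paths` and `try_fingerprint` are hypothetical helpers, and the file list is passed in explicitly so the race can be reproduced deterministically.

```python
from __future__ import annotations

import hashlib
import tempfile
from pathlib import Path


def fingerprint_paths(root: Path, paths: list[Path]) -> str:
    """Hash path names and contents; fail if any listed file has vanished."""
    hasher = hashlib.sha256()
    for path in sorted(paths):
        rel = path.relative_to(root).as_posix()
        hasher.update(rel.encode("utf-8"))
        try:
            hasher.update(path.read_bytes())
        except (FileNotFoundError, IsADirectoryError, NotADirectoryError) as exc:
            # A partial hash would misrepresent the workspace, so give up.
            raise RuntimeError(
                f"Workspace changed while hashing {rel}; retrying next sync pass."
            ) from exc
    return hasher.hexdigest()


def try_fingerprint(root: Path, paths: list[Path], previous: str) -> tuple[str, bool]:
    """Return (fingerprint, ok); keep the previous value when the pass fails."""
    try:
        return fingerprint_paths(root, paths), True
    except RuntimeError as exc:
        print(f"sync pass skipped: {exc}")
        return previous, False


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "keep.txt").write_text("hello", encoding="utf-8")
    # rotated.log is listed but never created, mimicking a file that was
    # deleted after the directory scan and before the read.
    stale_listing = [root / "keep.txt", root / "rotated.log"]
    failed = try_fingerprint(root, stale_listing, previous="old-fingerprint")
    succeeded = try_fingerprint(root, [root / "keep.txt"], previous="old-fingerprint")
```

Keeping the old fingerprint on failure means the next pass still sees a difference and tries again, rather than recording a hash of an incomplete tree.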
start.sh CHANGED
@@ -540,10 +540,15 @@ fi
 # ── Trap SIGTERM for graceful shutdown ──
 graceful_shutdown() {
   echo "Shutting down..."
-  if [ -f "/home/node/app/openclaw-sync.py" ]; then
+  if [ -f "/home/node/app/openclaw-sync.py" ] && [ -n "${HF_TOKEN:-}" ]; then
     echo "Saving state before exit..."
+    timeout 8s python3 /home/node/app/openclaw-sync.py sync-once-settled || \
+      echo "Warning: could not complete settled shutdown sync"
+    sleep 1
     python3 /home/node/app/openclaw-sync.py sync-once || \
-      echo "Warning: could not complete shutdown sync"
+      echo "Warning: could not complete final shutdown sync"
+  elif [ -f "/home/node/app/openclaw-sync.py" ]; then
+    echo "HF_TOKEN not set; skipping shutdown backup sync."
   fi
   kill $(jobs -p) 2>/dev/null
   exit 0
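Reviewer note: the two-step shutdown above (a time-boxed `sync-once-settled`, a short pause, then an unconditional `sync-once`) could be mirrored in Python roughly as follows. This is a sketch under assumptions, not code from the repository: `run_step` and `shutdown_backup` are hypothetical names, and `subprocess.run(timeout=...)` kills the child on expiry, whereas the shell `timeout 8s` sends SIGTERM by default.

```python
from __future__ import annotations

import subprocess
import sys
import time
from typing import Sequence


def run_step(cmd: Sequence[str], timeout_s: float | None = None) -> bool:
    """Run one command; True only if it exits 0 within the time limit."""
    try:
        return subprocess.run(list(cmd), timeout=timeout_s).returncode == 0
    except subprocess.TimeoutExpired:
        return False


def shutdown_backup(
    settled_cmd: Sequence[str],
    final_cmd: Sequence[str],
    settle_timeout_s: float = 8.0,
    pause_s: float = 1.0,
) -> tuple[bool, bool]:
    """Time-boxed settled pass, short pause, then a final pass regardless."""
    settled_ok = run_step(settled_cmd, settle_timeout_s)
    if not settled_ok:
        print("Warning: could not complete settled shutdown sync")
    time.sleep(pause_s)
    # The final pass always runs so late writes still get a chance to sync.
    final_ok = run_step(final_cmd)
    if not final_ok:
        print("Warning: could not complete final shutdown sync")
    return (settled_ok, final_ok)


# Placeholder commands stand in for the two openclaw-sync.py invocations.
exit_ok = [sys.executable, "-c", "raise SystemExit(0)"]
exit_fail = [sys.executable, "-c", "raise SystemExit(3)"]
result = shutdown_backup(exit_fail, exit_ok, pause_s=0.0)
```

Bounding only the first pass keeps a slow settle wait from consuming the whole shutdown grace period while still leaving room for the final sync.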