Step 10j (Python side): cache contract + --reextract command

Implements the two-tier contract from docs/CACHE_CONTRACT.md (extension
repo, locked at step 9):

  cache_schema       on-disk shape; mismatch -> force rebuild
  id_rules           integer, bumped when extraction rules change
  id_rules_signature SHA-256 over canonical rule text; catches drift
                     when the integer bump is forgotten
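
A minimal sketch of the drift check, assuming the signature hashes each
rule's pattern text and flags in a fixed order. The rule tables and the
function name here are hypothetical stand-ins, not the code in rcjav.ids:

```python
import hashlib
import re

# Hypothetical rule tables; the real ones live in rcjav.ids.
ID_RES = [re.compile(r"([A-Z]{2,5})-(\d{3,5})")]
PART_RES = [re.compile(r"[-_ ](?:cd|part)(\d)", re.IGNORECASE)]


def sketch_rules_signature(id_res, part_res):
    """Return a SHA-256 hex digest over each rule's group, flags and pattern."""
    h = hashlib.sha256()
    for group, rules in (("id", id_res), ("part", part_res)):
        for rx in rules:
            h.update(f"{group}\t{rx.flags}\t{rx.pattern}\n".encode("utf-8"))
    return h.hexdigest()


before = sketch_rules_signature(ID_RES, PART_RES)
# A user adds a part pattern but nobody bumps the integer version.
after = sketch_rules_signature(ID_RES, PART_RES + [re.compile(r"_pt(\d)")])
drift_detected = before != after
```

The integer stays the cheap, human-readable signal; the digest is the
safety net for edits that never touched it.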

New constants in rcjav/cache.py:

  CACHE_SCHEMA_VERSION = 1
  ID_RULES_VERSION = 1     (the legacy "version: 3" cache reads as
                            id_rules: 0 after in-place migration)

New helpers:

  rcjav.ids.current_rules_signature()
      SHA-256 over the canonical text of every rule that influences
      a jav_id: built-in regexes, BUILTIN_PART_RES, PART_RES (which
      captures user-added part patterns), and FC2 handling.

  rcjav.cache.load_cache(signature=None)
      Reads cache.json. Legacy `version: 3` headers get an in-place
      header upgrade with no forced rescan; the cache is stamped as
      `id_rules: 0` + signature "legacy" so it surfaces as
      "stale by rules" in cache_state. Schema mismatch on the new
      header still forces a rebuild.

  rcjav.cache.cache_state(cache, signature)
      Classifies a cache as "fresh" / "stale_by_rules" /
      "schema_mismatch". Drives the three-state extension UX.

  rcjav.cache.stamp_current_rules(cache, signature)
      Updates id_rules and id_rules_signature in place. Called after
      a successful full scan or --reextract.
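
How the three cache helpers might fit together, as a rough sketch. The
header key names follow the contract fields above; dropping the old
`version` key on upgrade and the `sketch_` function names are
assumptions, not the actual rcjav.cache code:

```python
CACHE_SCHEMA_VERSION = 1  # values quoted in this commit message
ID_RULES_VERSION = 1


def sketch_upgrade_legacy_header(cache):
    """Restamp a legacy `version: 3` header in place; file entries are kept."""
    if cache.get("version") == 3 and "cache_schema" not in cache:
        cache.pop("version")  # assumption: the old key is removed
        cache["cache_schema"] = CACHE_SCHEMA_VERSION
        cache["id_rules"] = 0
        cache["id_rules_signature"] = "legacy"
    return cache


def sketch_cache_state(cache, signature):
    """Classify a cache; the schema check takes priority over the rule check."""
    if cache.get("cache_schema") != CACHE_SCHEMA_VERSION:
        return "schema_mismatch"
    if (cache.get("id_rules") != ID_RULES_VERSION
            or cache.get("id_rules_signature") != signature):
        return "stale_by_rules"
    return "fresh"


def sketch_stamp(cache, signature):
    """Record the current rule version and signature in the header."""
    cache["id_rules"] = ID_RULES_VERSION
    cache["id_rules_signature"] = signature
```

Under these assumptions a migrated legacy cache reads as
"stale_by_rules" until it is stamped, and only a differing
cache_schema yields "schema_mismatch".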

New CLI command:

  rc-jav.py --reextract

Walks `cache["remotes"][r]["files"]` against the live rule set and
updates `jav_id` in place. It makes no rclone calls, so it is a fast
path (seconds on a 7k-file cache). Reports changed/unchanged/dropped
counts per remote; "dropped" counts entries whose old jav_id no longer
matches any rule and is cleared. Stamps current rules into the saved
cache.
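
A caller such as the extension host could consume the JSON summary
roughly like this. The field names come from the summary dict built in
the --reextract branch of this commit; the function name and the sample
values are made up for illustration:

```python
import json

# Illustrative payload only; the numbers are not real output.
SAMPLE = json.dumps({
    "ok": True, "changed": 2, "unchanged": 5, "dropped": 1, "total": 8,
    "id_rules_signature": "abc",
    "remotes": [{"remote": "gdrive", "changed": 2, "unchanged": 5,
                 "dropped": 1, "files": 9}],
})


def summarize_reextract(raw):
    """Return (entries touched, one human-readable line per remote)."""
    data = json.loads(raw)
    if not data.get("ok"):
        raise ValueError("re-extract did not report ok")
    lines = [
        f"{r['remote']}: {r['changed']} changed, {r['dropped']} dropped, "
        f"{r['files']} files"
        for r in data["remotes"]
    ]
    return data["changed"] + data["dropped"], lines


touched, lines = summarize_reextract(SAMPLE)
```

Note that "total" sums the three counters, so it can be smaller than
"files" when a remote holds entries that never had an ID.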

--scan (full, no --scan-since) now also stamps current rules.
--scan --scan-since deliberately does NOT stamp: it only re-walks
recently-modified files, so older entries may still carry jav_ids
from previous rules; cache stays "stale by rules" until a full scan
or --reextract.
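
The stamping policy reduces to one guard. A sketch under the assumption
that the header keys match the contract field names; `sketch_finish_scan`
is a hypothetical helper, not a function in this commit:

```python
def sketch_finish_scan(cache, signature, rules_version, scan_since=None):
    """Stamp the header only when every entry was re-walked (full scan)."""
    if not scan_since:
        cache["id_rules"] = rules_version
        cache["id_rules_signature"] = signature
    return cache


legacy = {"id_rules": 0, "id_rules_signature": "legacy"}
incremental = sketch_finish_scan(dict(legacy), "abc", 1, scan_since="24h")
full = sketch_finish_scan(dict(legacy), "abc", 1)
```

After the incremental run the header still carries the legacy stamp, so
the cache keeps reporting as stale by rules.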

Verified:
  - python rc-jav.py --reextract --format json on the live 7124-file
    cache → 0 changes (existing IDs already canonical), cache.json
    rewritten with new header
  - cache_state on the cache after migration and --reextract → "fresh"
  - tests + fixtures + --help all pass

Extension-side (host's cache_status response + options-cache.js
three-state UX + Re-extract IDs button) ships in a separate commit
in the extension repo.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
admin
2026-05-22 22:07:13 +02:00
parent 1cc2c38128
commit 33c495ad57
4 changed files with 233 additions and 21 deletions
+88 -1
@@ -404,6 +404,11 @@ def main():
                     help="Relative path of the file to rename (within --remote).")
     ap.add_argument("--new-path", metavar="PATH",
                     help="New relative path after rename (within --remote).")
+    ap.add_argument("--reextract", action="store_true",
+                    help="Walk cache.json and recompute jav_id on every file entry "
+                         "using the current ID extraction rules. No rclone calls — "
+                         "fast path for picking up rule changes without re-scanning. "
+                         "Outputs JSON when --format json, plain otherwise.")
     ap.add_argument("--basic", action="store_true",
                     help="Plain text output, no rich tables/panels/progress bars. "
                          "Useful for piping or low-bandwidth terminals.")
@@ -480,6 +485,79 @@ def main():
         args.catalog = list(DEFAULT_CATALOG)
     # --library-issues: read-only cache scan for non-canonical filenames.
+    # --reextract: rebuild jav_id values from current rules without re-scanning.
+    if args.reextract:
+        from rcjav.ids import current_rules_signature
+        from rcjav.cache import stamp_current_rules
+        sig = current_rules_signature()
+        cache = load_cache(sig)
+        changed = 0
+        unchanged = 0
+        dropped = 0
+        per_remote = []
+        for remote, entry in (cache.get("remotes") or {}).items():
+            r_changed = 0
+            r_unchanged = 0
+            r_dropped = 0
+            files = entry.get("files") or []
+            for f in files:
+                old_id = f.get("jav_id") or ""
+                new_id = extract_id(Path(f.get("path", "")).name)
+                if new_id is None:
+                    if old_id:
+                        f["jav_id"] = ""
+                        r_dropped += 1
+                    continue
+                if new_id != old_id:
+                    f["jav_id"] = new_id
+                    r_changed += 1
+                else:
+                    r_unchanged += 1
+            changed += r_changed
+            unchanged += r_unchanged
+            dropped += r_dropped
+            per_remote.append({
+                "remote": remote,
+                "changed": r_changed,
+                "unchanged": r_unchanged,
+                "dropped": r_dropped,
+                "files": len(files),
+            })
+        stamp_current_rules(cache, sig)
+        save_cache(cache)
+        summary = {
+            "ok": True,
+            "changed": changed,
+            "unchanged": unchanged,
+            "dropped": dropped,
+            "total": changed + unchanged + dropped,
+            "id_rules_signature": sig,
+            "remotes": per_remote,
+        }
+        if args.format == "json" or BASIC:
+            print(json.dumps(summary))
+        else:
+            console.print(Panel(
+                f"[bold]Re-extracted IDs against current rules[/]\n"
+                f" changed: [yellow]{changed:,}[/]\n"
+                f" unchanged: [dim]{unchanged:,}[/]\n"
+                f" dropped: [red]{dropped:,}[/]\n"
+                f" total: {summary['total']:,}",
+                title="Re-extract", border_style="green"))
+            if per_remote:
+                from rich.table import Table as _Tbl
+                t = _Tbl(title="Per-remote", show_lines=False)
+                t.add_column("Remote", style="cyan")
+                t.add_column("Changed", justify="right", style="yellow")
+                t.add_column("Unchanged", justify="right", style="dim")
+                t.add_column("Dropped", justify="right", style="red")
+                t.add_column("Files", justify="right")
+                for r in per_remote:
+                    t.add_row(r["remote"], f"{r['changed']:,}", f"{r['unchanged']:,}",
+                              f"{r['dropped']:,}", f"{r['files']:,}")
+                console.print(t)
+        sys.exit(0)
     if args.library_issues:
         cache = load_cache()
         issues = find_library_issues(cache)
@@ -546,7 +624,10 @@ def main():
                 console.print(f"[red]invalid --scan-since value: {args.scan_since!r} "
                               f"(expected e.g. 24h, 7d, 30m, 90s)[/]")
                 sys.exit(2)
-        cache = load_cache()
+        from rcjav.ids import current_rules_signature
+        from rcjav.cache import stamp_current_rules
+        _scan_sig = current_rules_signature()
+        cache = load_cache(_scan_sig)
         cache_meta: dict[str, dict] = {}
         skipped: list[tuple[str, str]] = []
         t0 = time.perf_counter()
@@ -566,6 +647,12 @@ def main():
                 use_cache=not args.no_cache, force_update=True,
                 cache_meta=cache_meta, scan_since=scan_since))
         if not args.no_cache:
+            # Stamp current rules only on a FULL scan. An incremental
+            # (--scan-since) only re-walked some files; older files in the
+            # cache may still have jav_ids from the previous rule set, so the
+            # cache remains "stale by rules" until a full scan or --reextract.
+            if not scan_since:
+                stamp_current_rules(cache, _scan_sig)
             save_cache(cache)
         elapsed = time.perf_counter() - t0
         if BASIC: