Step 10j (Python side): cache contract + --reextract command

Implements the two-tier contract from docs/CACHE_CONTRACT.md (extension
repo, locked at step 9):

  cache_schema       on-disk shape; mismatch -> force rebuild
  id_rules           bumps when extraction rules change
  id_rules_signature sha256 over canonical rule text; catches drift
                     when the integer bump is forgotten

New constants in rcjav/cache.py:

  CACHE_SCHEMA_VERSION = 1
  ID_RULES_VERSION = 1     (the legacy "version: 3" cache reads as
                            id_rules: 0 after in-place migration)

New helpers:

  rcjav.ids.current_rules_signature()
      Sha256 over the canonical text of every rule that influences
      a jav_id: built-in regexes, BUILTIN_PART_RES, PART_RES (which
      captures user-added part patterns), FC2 handling.

  rcjav.cache.load_cache(signature=None)
      Reads cache.json. Legacy `version: 3` headers get an in-place
      header upgrade with no forced rescan; the cache is stamped as
      `id_rules: 0` + signature "legacy" so it surfaces as
      "stale by rules" in cache_state. Schema mismatch on the new
      header still forces a rebuild.

  rcjav.cache.cache_state(cache, signature)
      Classifies a cache as "fresh" / "stale_by_rules" /
      "schema_mismatch". Drives the three-state extension UX.

  rcjav.cache.stamp_current_rules(cache, signature)
      Updates id_rules and id_rules_signature in place. Called after
      a successful full scan or --reextract.

New CLI command:

  rc-jav.py --reextract

Walks `cache["remotes"][r]["files"]` against the live rule set and
updates `jav_id` in place. No rclone calls — fast path (seconds on
a 7k-file cache). Reports changed/unchanged/dropped per remote.
Stamps current rules into the saved cache.

--scan (full, no --scan-since) now also stamps current rules.
--scan --scan-since deliberately does NOT stamp: it only re-walks
recently-modified files, so older entries may still carry jav_ids
from previous rules; cache stays "stale by rules" until a full scan
or --reextract.

Verified:
  - python rc-jav.py --reextract --format json on the live 7124-file
    cache → 0 changes (existing IDs already canonical), cache.json
    rewritten with new header
  - cache_state on the post-migration cache → "fresh"
  - tests + fixtures + --help all pass

Extension-side (host's cache_status response + options-cache.js
three-state UX + Re-extract IDs button) ships in a separate commit
in the extension repo.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
This commit is contained in:
admin
2026-05-22 22:07:13 +02:00
parent 1cc2c38128
commit 33c495ad57
4 changed files with 233 additions and 21 deletions
+5
View File
@@ -82,11 +82,15 @@ from rcjav.dupes import ( # noqa: F401
from rcjav.cache import ( # noqa: F401 from rcjav.cache import ( # noqa: F401
CACHE_PATH, CACHE_PATH,
CACHE_VERSION, CACHE_VERSION,
CACHE_SCHEMA_VERSION,
ID_RULES_VERSION,
CACHE_STALE_HOURS, CACHE_STALE_HOURS,
load_cache, load_cache,
save_cache, save_cache,
cache_age_hours, cache_age_hours,
fmt_age, fmt_age,
cache_state,
stamp_current_rules,
) )
from rcjav.ids import ( # noqa: F401 from rcjav.ids import ( # noqa: F401
PRIMARY_ID_RE, PRIMARY_ID_RE,
@@ -103,4 +107,5 @@ from rcjav.ids import ( # noqa: F401
normalize_id, normalize_id,
describe_id_match, describe_id_match,
expand_range, expand_range,
current_rules_signature,
) )
+108 -20
View File
@@ -2,10 +2,20 @@
This module owns the on-disk cache contract: where the file lives, This module owns the on-disk cache contract: where the file lives,
what the header looks like, and how mismatches are handled. The what the header looks like, and how mismatches are handled. The
current shape predates the two-tier `cache_schema` + `id_rules` split contract is the two-tier `cache_schema` + `id_rules` model from
documented in docs/CACHE_CONTRACT.md (extension repo) — step 10j docs/CACHE_CONTRACT.md (extension repo).
implements that contract; until then this is the legacy
`version: 3` reader. cache_schema on-disk shape. Mismatch -> force rebuild.
id_rules integer; bumps when extraction rules change.
Mismatch -> mark stale, allow lazy re-extract.
id_rules_signature sha256 over canonical rule text (see
rcjav.ids.current_rules_signature). Belt-and-
braces drift check that catches a forgotten
`id_rules` bump.
Legacy users on `version: 3` get an in-place header upgrade with no
forced rescan; the cache is marked as `id_rules: 0` so it shows up
as "stale by rules" until they Re-extract IDs.
""" """
from __future__ import annotations from __future__ import annotations
@@ -19,29 +29,107 @@ from pathlib import Path
# Lives next to rc-jav.py at the repo root. # Lives next to rc-jav.py at the repo root.
CACHE_PATH = Path(__file__).resolve().parents[1] / "cache.json" CACHE_PATH = Path(__file__).resolve().parents[1] / "cache.json"
CACHE_VERSION = 3 # bumped: extract_id handles bracket-wrapped IDs + no-hyphen fallback
CACHE_STALE_HOURS = 24 CACHE_STALE_HOURS = 24
# Two-tier version contract (see docs/CACHE_CONTRACT.md):
CACHE_SCHEMA_VERSION = 1 # on-disk shape; bump = force rebuild
ID_RULES_VERSION = 1 # extraction rules; bump = mark stale (lazy re-extract)
# Legacy alias preserved for any external caller that still imports it.
# Maps to CACHE_SCHEMA_VERSION + ID_RULES_VERSION under the new contract.
CACHE_VERSION = 3
def _fresh_cache(signature: str = "unknown") -> dict:
return {
"cache_schema": CACHE_SCHEMA_VERSION,
"id_rules": ID_RULES_VERSION,
"id_rules_signature": signature,
"remotes": {},
}
def _migrate_legacy_v3(data: dict) -> dict:
"""Translate a legacy `version: 3` cache to the new header in place.
Sets `id_rules: 0` so the cache reads as "stale by rules" — user
sees the new amber state and can opt into a fast Re-extract without
a rclone re-scan.
"""
return {
"cache_schema": CACHE_SCHEMA_VERSION,
"id_rules": 0,
"id_rules_signature": "legacy",
"remotes": data.get("remotes", {}),
}
def load_cache(current_signature: str | None = None) -> dict:
"""Read and (if necessary) migrate cache.json.
`current_signature` is the value of `rcjav.ids.current_rules_signature()`
captured by the caller. It's only stamped into the header when this
function has to mint a *fresh* cache; when migrating legacy data we
deliberately stamp `"legacy"` so the cache reads as stale-by-rules.
"""
fresh_sig = current_signature or "unknown"
def load_cache() -> dict:
if not CACHE_PATH.exists(): if not CACHE_PATH.exists():
return {"version": CACHE_VERSION, "remotes": {}} return _fresh_cache(fresh_sig)
try: try:
data = json.loads(CACHE_PATH.read_text(encoding="utf-8")) data = json.loads(CACHE_PATH.read_text(encoding="utf-8"))
if (
not isinstance(data, dict)
or data.get("version") != CACHE_VERSION
or not isinstance(data.get("remotes"), dict)
):
if isinstance(data, dict) and "version" in data and data["version"] != CACHE_VERSION:
sys.stderr.write(
f"[warn] cache version mismatch (got {data['version']}, "
f"expected {CACHE_VERSION}); forcing full rescan.\n"
)
return {"version": CACHE_VERSION, "remotes": {}}
return data
except (json.JSONDecodeError, OSError): except (json.JSONDecodeError, OSError):
return {"version": CACHE_VERSION, "remotes": {}} return _fresh_cache(fresh_sig)
if not isinstance(data, dict) or not isinstance(data.get("remotes"), dict):
return _fresh_cache(fresh_sig)
# Legacy header: { "version": 3, "remotes": {...} } — migrate in place.
if "version" in data and "cache_schema" not in data:
if data.get("version") == 3:
return _migrate_legacy_v3(data)
sys.stderr.write(
f"[warn] unknown legacy cache version {data.get('version')!r}; "
f"rebuilding.\n"
)
return _fresh_cache(fresh_sig)
# New header: validate schema. Mismatch = force rebuild (per contract).
if data.get("cache_schema") != CACHE_SCHEMA_VERSION:
sys.stderr.write(
f"[warn] cache_schema mismatch (got {data.get('cache_schema')!r}, "
f"expected {CACHE_SCHEMA_VERSION}); forcing full rescan.\n"
)
return _fresh_cache(fresh_sig)
return data
def cache_state(cache: dict, current_signature: str) -> str:
"""Classify a cache dict against the live rule set.
Returns one of: "fresh", "stale_by_rules", "schema_mismatch".
"schema_mismatch" should normally never reach the caller — load_cache
already rebuilds. It's reported for diagnostics flows that read
cache.json directly without going through load_cache.
"""
if cache.get("cache_schema") != CACHE_SCHEMA_VERSION:
return "schema_mismatch"
rules_match = cache.get("id_rules") == ID_RULES_VERSION
sig_match = cache.get("id_rules_signature") == current_signature
return "fresh" if (rules_match and sig_match) else "stale_by_rules"
def stamp_current_rules(cache: dict, current_signature: str) -> None:
"""Stamp `id_rules` and `id_rules_signature` to current values in place.
Use after a successful re-extract or full scan completes against the
live rule set.
"""
cache["id_rules"] = ID_RULES_VERSION
cache["id_rules_signature"] = current_signature
def save_cache(cache: dict) -> None: def save_cache(cache: dict) -> None:
+88 -1
View File
@@ -404,6 +404,11 @@ def main():
help="Relative path of the file to rename (within --remote).") help="Relative path of the file to rename (within --remote).")
ap.add_argument("--new-path", metavar="PATH", ap.add_argument("--new-path", metavar="PATH",
help="New relative path after rename (within --remote).") help="New relative path after rename (within --remote).")
ap.add_argument("--reextract", action="store_true",
help="Walk cache.json and recompute jav_id on every file entry "
"using the current ID extraction rules. No rclone calls — "
"fast path for picking up rule changes without re-scanning. "
"Outputs JSON when --format json, plain otherwise.")
ap.add_argument("--basic", action="store_true", ap.add_argument("--basic", action="store_true",
help="Plain text output, no rich tables/panels/progress bars. " help="Plain text output, no rich tables/panels/progress bars. "
"Useful for piping or low-bandwidth terminals.") "Useful for piping or low-bandwidth terminals.")
@@ -480,6 +485,79 @@ def main():
args.catalog = list(DEFAULT_CATALOG) args.catalog = list(DEFAULT_CATALOG)
# --library-issues: read-only cache scan for non-canonical filenames. # --library-issues: read-only cache scan for non-canonical filenames.
# --reextract: rebuild jav_id values from current rules without re-scanning.
if args.reextract:
from rcjav.ids import current_rules_signature
from rcjav.cache import stamp_current_rules
sig = current_rules_signature()
cache = load_cache(sig)
changed = 0
unchanged = 0
dropped = 0
per_remote = []
for remote, entry in (cache.get("remotes") or {}).items():
r_changed = 0
r_unchanged = 0
r_dropped = 0
files = entry.get("files") or []
for f in files:
old_id = f.get("jav_id") or ""
new_id = extract_id(Path(f.get("path", "")).name)
if new_id is None:
if old_id:
f["jav_id"] = ""
r_dropped += 1
continue
if new_id != old_id:
f["jav_id"] = new_id
r_changed += 1
else:
r_unchanged += 1
changed += r_changed
unchanged += r_unchanged
dropped += r_dropped
per_remote.append({
"remote": remote,
"changed": r_changed,
"unchanged": r_unchanged,
"dropped": r_dropped,
"files": len(files),
})
stamp_current_rules(cache, sig)
save_cache(cache)
summary = {
"ok": True,
"changed": changed,
"unchanged": unchanged,
"dropped": dropped,
"total": changed + unchanged + dropped,
"id_rules_signature": sig,
"remotes": per_remote,
}
if args.format == "json" or BASIC:
print(json.dumps(summary))
else:
console.print(Panel(
f"[bold]Re-extracted IDs against current rules[/]\n"
f" changed: [yellow]{changed:,}[/]\n"
f" unchanged: [dim]{unchanged:,}[/]\n"
f" dropped: [red]{dropped:,}[/]\n"
f" total: {summary['total']:,}",
title="Re-extract", border_style="green"))
if per_remote:
from rich.table import Table as _Tbl
t = _Tbl(title="Per-remote", show_lines=False)
t.add_column("Remote", style="cyan")
t.add_column("Changed", justify="right", style="yellow")
t.add_column("Unchanged", justify="right", style="dim")
t.add_column("Dropped", justify="right", style="red")
t.add_column("Files", justify="right")
for r in per_remote:
t.add_row(r["remote"], f"{r['changed']:,}", f"{r['unchanged']:,}",
f"{r['dropped']:,}", f"{r['files']:,}")
console.print(t)
sys.exit(0)
if args.library_issues: if args.library_issues:
cache = load_cache() cache = load_cache()
issues = find_library_issues(cache) issues = find_library_issues(cache)
@@ -546,7 +624,10 @@ def main():
console.print(f"[red]invalid --scan-since value: {args.scan_since!r} " console.print(f"[red]invalid --scan-since value: {args.scan_since!r} "
f"(expected e.g. 24h, 7d, 30m, 90s)[/]") f"(expected e.g. 24h, 7d, 30m, 90s)[/]")
sys.exit(2) sys.exit(2)
cache = load_cache() from rcjav.ids import current_rules_signature
from rcjav.cache import stamp_current_rules
_scan_sig = current_rules_signature()
cache = load_cache(_scan_sig)
cache_meta: dict[str, dict] = {} cache_meta: dict[str, dict] = {}
skipped: list[tuple[str, str]] = [] skipped: list[tuple[str, str]] = []
t0 = time.perf_counter() t0 = time.perf_counter()
@@ -566,6 +647,12 @@ def main():
use_cache=not args.no_cache, force_update=True, use_cache=not args.no_cache, force_update=True,
cache_meta=cache_meta, scan_since=scan_since)) cache_meta=cache_meta, scan_since=scan_since))
if not args.no_cache: if not args.no_cache:
# Stamp current rules only on a FULL scan. An incremental
# (--scan-since) only re-walked some files; older files in the
# cache may still have jav_ids from the previous rule set, so the
# cache remains "stale by rules" until a full scan or --reextract.
if not scan_since:
stamp_current_rules(cache, _scan_sig)
save_cache(cache) save_cache(cache)
elapsed = time.perf_counter() - t0 elapsed = time.perf_counter() - t0
if BASIC: if BASIC:
+32
View File
@@ -231,6 +231,38 @@ def describe_id_match(display_query: str, matched_query: str, matched_id: str,
} }
def current_rules_signature() -> str:
"""Sha256 over the canonical text of every rule that influences a jav_id.
Includes built-in regex sources, BUILTIN_PART_RES sources, and PART_RES
(which captures user-added part patterns applied by
`configure_part_patterns`). Output prefixed with `sha256:` so callers can
sniff the algorithm without re-deriving it.
Stable across invocations: dict is dumped with sort_keys=True. Bumping a
regex changes the digest; reordering BUILTIN_PART_RES also changes it
(order is part of the contract because part-detection short-circuits).
"""
import hashlib
import json as _json
data = {
"schema": 1, # bump when this signature schema itself changes
"primary": PRIMARY_ID_RE.pattern,
"compound": COMPOUND_ID_RE.pattern,
"fallback": FALLBACK_ID_RE.pattern,
"nohyphen": _NOHYPHEN_ID_RE.pattern,
"bracket": _BRACKET_ID_RE.pattern,
"variant": _VARIANT_SUFFIX_RE.pattern,
"xofy": _XOFY_PRIORITY_RE.pattern,
"resolution_tag": _RESOLUTION_TAG_RE.pattern,
"builtin_part_res": [r.pattern for r in BUILTIN_PART_RES],
"part_res": [r.pattern for r in PART_RES],
"fc2_handling": "fc2_to_ppv",
}
text = _json.dumps(data, sort_keys=True, ensure_ascii=False)
return "sha256:" + hashlib.sha256(text.encode("utf-8")).hexdigest()
def expand_range(raw: str) -> list[str] | None: def expand_range(raw: str) -> list[str] | None:
"""Expand a bracket range like 'IPZZ-[820-860]' into individual ID strings. """Expand a bracket range like 'IPZZ-[820-860]' into individual ID strings.
Returns None if no range marker present.""" Returns None if no range marker present."""