# Cache contract: design (Step 9)

Status: **design only**. No code changes yet. Step 10 implements.

This document is the source of truth for `cache.json` versioning and the rebuild policy that both the Python `rc-jav.py` CLI and the browser extension follow. It supersedes the single `CACHE_VERSION` constant currently in `rc-jav.py`.

## Why split CACHE_VERSION

`CACHE_VERSION = 3` in `rc-jav.py` today is a single integer that covers two unrelated things:

1. **Schema**: the shape of `cache.json` itself (top-level keys, nested object shape, what fields a file entry carries).
2. **Rules**: the ID extraction logic (`extract_id`, normalizers, part detectors, FC2-PPV handling). These influence the `jav_id` field stored inside file entries.

Conflating them has a real cost:

- The last `CACHE_VERSION` bump (`3`, comment "extract_id handles bracket-wrapped IDs + no-hyphen fallback") was a **rules** change. It forced every user to do a full library rescan, which on large remotes can take 30+ minutes per remote, even though the file entries' shape was unchanged.
- A user who hasn't pulled the new rules can't tell from `cache.json` whether their existing cache is "wrong shape" (unusable) or "stale IDs" (usable but missing some matches).
- The extension can't surface the distinction in its UI either, so the Cache & Scans pane shows a single "stale" state regardless.

## Two-tier model

| Tier           | Bumps when                                                        | Effect                                                |
|----------------|-------------------------------------------------------------------|-------------------------------------------------------|
| `cache_schema` | The `cache.json` structure changes (new field, removed key, etc.) | **Force rebuild.** Cache is unusable.                 |
| `id_rules`     | Any extraction rule changes: regex, normalizer, part detector     | **Mark stale.** Cache stays usable; offer re-extract. |

`cache_schema` corresponds to the current `CACHE_VERSION` semantics (force rebuild).
`id_rules` is new and has weaker semantics: the cache is still readable, the file list is still accurate, and only the derived `jav_id` field may be wrong for some entries.

## Cache header shape (new)

```json
{
  "cache_schema": 1,
  "id_rules": 4,
  "id_rules_signature": "sha256:…",
  "remotes": { … }
}
```

Notes:

- `cache_schema` starts at `1` for the new contract. Migration from the legacy `version: 3` field is a one-shot read-side translation in `load_cache()` (see Migration below).
- `id_rules` is a monotonic counter. Bump on every change to the rules listed under "What counts as a rules change" below.
- `id_rules_signature` is a sha256 over the canonical text of the rule definitions (regex source strings + normalizer formats + part detector patterns + FC2 handling toggle). It's a **belt-and-braces check**: if a developer forgets to bump `id_rules`, the signature catches drift. If a user has local custom rules in `config.json`, the signature also drifts and is treated as a stale rules state.

## What counts as a rules change

Anything that influences the `jav_id` value stored in a file entry:

- `PRIMARY_ID_RE`, `COMPOUND_ID_RE`, `FALLBACK_ID_RE`, `_NOHYPHEN_ID_RE`, `_BRACKET_ID_RE` in `rc-jav.py`
- Built-in part detectors (`BUILTIN_PART_RES`) and detection order
- FC2-PPV normalization branch
- `detect_part_from_stem` and `part_key` behavior
- `extract_id`'s overall control flow (variant-letter detection, width-preserving padding, etc.)
- User-added normalizers from `config.json` (`id_normalizers`)
- User-added part patterns from `config.json` (`partPatterns`)

**Not** a rules change (no `id_rules` bump):

- Bug fixes to non-extraction code paths (`save_cache`, `walk_remote`, `find_dupes`, keep-ranking logic, output formatting)
- Changes to extension-side display, since the extension never edits `cache.json`
- Adding a new shared fixture case to `fixtures/`

## Behavior matrix

| User's cache                  | `cache_schema`    | `id_rules`                    | Action |
|-------------------------------|-------------------|-------------------------------|--------|
| Fresh / matches both          | =                 | =                             | Use as-is. |
| Schema mismatch               | ≠                 | (any)                         | **Force rebuild.** Same as today's `CACHE_VERSION` mismatch. |
| Schema match, rules stale     | =                 | ≠ or sig drift                | **Mark stale.** Use file list as-is; warn that some `jav_id`s may be out-of-date; offer "Re-extract IDs" (cheap, no remote scan). |
| Legacy `version: 3` (no new)  | (translated to =) | (translated to `0`, i.e. ≠)   | One-shot migration: replace header in place, do not force rebuild. The cache then reads as stale by rules. |

"Re-extract IDs" is a new fast path: walk the existing `files[]` array and recompute `jav_id` on each entry using the current rules. No network or rclone call. It costs O(N) regex matches against N filenames, which is seconds even for large libraries.

## Migration from `version: 3`

`load_cache()` becomes:

```python
import json


def load_cache() -> dict:
    if not CACHE_PATH.exists():
        return _fresh_cache()
    try:
        # Assumes CACHE_PATH is a pathlib.Path, as the .exists() call suggests.
        data = json.loads(CACHE_PATH.read_text(encoding="utf-8"))
    except (OSError, ValueError):  # unreadable file or invalid JSON
        return _fresh_cache()

    # Legacy header: { "version": 3, "remotes": {...} }
    # Translate in place with id_rules = 0 so the user sees "stale", not "wipe".
    if "version" in data and "cache_schema" not in data:
        if data.get("version") == 3:
            data = {
                "cache_schema": CACHE_SCHEMA_VERSION,
                "id_rules": 0,  # forces "stale by rules" amber
                "id_rules_signature": "legacy",
                "remotes": data.get("remotes", {}),
            }
        else:
            return _fresh_cache()  # unknown legacy version → wipe

    # New header validation
    if data.get("cache_schema") != CACHE_SCHEMA_VERSION:
        return _fresh_cache()
    return data
```

Users with `version: 3` get an in-place upgrade with no rescan. The cache shows up as "stale by rules" until they click Re-extract IDs.

## Extension UX (Cache & Scans pane)

Three states instead of today's two:

| State                  | Color   | Pane copy | Action button |
|------------------------|---------|-----------|---------------|
| Fresh                  | green ✓ | "Cache up to date." | "Re-scan" (manual) |
| Stale by rules         | amber ! | "ID extraction rules have changed since this cache was built. Some IDs may be out of date." | **"Re-extract IDs"** (fast) |
| Schema mismatch / wipe | red ✗   | "Cache version is unreadable. A full re-scan is required." | "Re-scan now" |

The background script keeps its `cache-status` message. The response gains:

```js
{
  ok: true,
  cache_exists: true,
  cache_schema: 1,
  id_rules: 4,
  id_rules_current: 4,
  id_rules_match: true,
  id_rules_signature_match: true,
  // existing fields preserved: remotes, warnings, etc.
}
```

`renderCacheStatus` in `options-cache.js` reads these and picks the state. Tests live in fixtures or in `options-cache.js` mocks (no need to extend the JSON corpus for this).

## Open questions

1. **Where does the user's "id_rules_signature" come from?** The signature must be computable from a single canonical text. Easiest: sha256 over a sorted JSON dump of `{primary_re_source, compound_re_source, fallback_re_source, nohyphen_re_source, bracket_re_source, part_res_sources, fc2_handling: "enabled", user_normalizers, user_part_patterns}`.
   Punt on exact shape until step 10.
2. **Should the extension trigger Re-extract IDs?** Yes: the extension sends `chrome.runtime.sendMessage({ type: "reextract-ids" })`, the background script forwards it to the host, and the host calls a new `rc-jav.py --reextract` command that walks `cache.json` without re-listing the remote.
3. **Per-remote tracking?** As specified here, `id_rules` is a single top-level integer. It could go per-remote (`remotes[name].id_rules_at_scan`) so "Re-extract IDs" can be triggered on a single remote. Recommend storing per-remote and computing top-level "stale by rules" as "any remote.id_rules_at_scan < id_rules_current". Defer detailed design to step 10.
4. **Custom rules in config.json.** When a user adds a normalizer, `id_rules_signature` drifts and their cache appears stale. That's correct: their `jav_id`s really are out of date. But the global `id_rules` integer didn't change. UI copy should distinguish "rules updated upstream" from "your custom rules changed".

## Out of scope (step 9)

- Actually implementing the new header: step 10.
- Re-extract IDs CLI/host wiring: step 10.
- Bumping `cache_schema` to `1` and shipping new write code: step 10.
- Cache compaction, partial scans, incremental updates: separate work.

## Reference

- Current `CACHE_VERSION` constant: `D:\DEV\Project\rclone-jav\rc-jav.py` line 376.
- `load_cache()` / `save_cache()` around line 416 of the same file.
- Extension consumer: `options-cache.js` `renderCacheStatus`, message type `cache-status` in `background.js`.
- Shared fixture corpus that exercises the rule set: `D:\DEV\Project\rclone-jav\fixtures\`.
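
The behavior matrix can be read as a small decision function. The sketch below is illustrative only: `classify_cache` and the state strings it returns are hypothetical names, and the two constants simply reuse the values from the header example rather than anything that exists in `rc-jav.py` today. It assumes the header has already passed through the legacy translation in `load_cache()`.

```python
# Hypothetical constants, taken from the header example in this document.
CACHE_SCHEMA_VERSION = 1
ID_RULES_VERSION = 4


def classify_cache(header: dict, current_signature: str) -> str:
    """Map a cache header to "fresh", "stale", or "rebuild"."""
    # Schema mismatch wins over everything else: the cache is unusable.
    if header.get("cache_schema") != CACHE_SCHEMA_VERSION:
        return "rebuild"
    # Counter mismatch: the file list is fine, jav_id values may be outdated.
    if header.get("id_rules") != ID_RULES_VERSION:
        return "stale"
    # Belt-and-braces: the counter matches but the rule text drifted.
    if header.get("id_rules_signature") != current_signature:
        return "stale"
    return "fresh"
```

A header produced by the legacy translation (`id_rules: 0`, signature `"legacy"`) falls into the "stale" branch, which matches the amber state described for migrated caches.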
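
The "Re-extract IDs" fast path might look roughly like the following. This is a sketch under assumptions this document does not fix: each remote is assumed to hold a `files` list whose entries carry a `name` field next to `jav_id`, and `toy_extract_id` is a made-up stand-in for the real `extract_id`, not a copy of its rules.

```python
import re
from typing import Callable, Optional


def reextract_ids(cache: dict, extract_id: Callable[[str], Optional[str]]) -> int:
    """Recompute jav_id for every cached file entry; return how many changed."""
    changed = 0
    for remote in cache.get("remotes", {}).values():
        for entry in remote.get("files", []):  # assumed entry layout
            new_id = extract_id(entry["name"])
            if entry.get("jav_id") != new_id:
                entry["jav_id"] = new_id
                changed += 1
    return changed


def toy_extract_id(name: str) -> Optional[str]:
    """Placeholder extractor for the demo only."""
    match = re.search(r"([A-Za-z]{2,5})-(\d{3,5})", name)
    return f"{match.group(1).upper()}-{match.group(2)}" if match else None


cache = {
    "remotes": {
        "demo": {
            "files": [
                {"name": "[abc-123] clip.mp4", "jav_id": None},   # missed by old rules
                {"name": "XYZ-001.mkv", "jav_id": "XYZ-001"},     # already correct
            ]
        }
    }
}
changed = reextract_ids(cache, toy_extract_id)
```

The loop only touches data already in memory, which is why no rclone or network call is needed; the caller would still have to write the updated header and entries back with `save_cache`.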
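
For open question 1, one possible way to compute the signature is sketched here. The function name `rules_signature` and the regex strings are placeholders, not the real patterns; only the overall idea (sha256 over a sorted JSON dump, prefixed with `sha256:`) comes from this document.

```python
import hashlib
import json


def rules_signature(rule_sources: dict) -> str:
    """Return "sha256:<hex>" over a canonical JSON dump of the rule sources."""
    canonical = json.dumps(
        rule_sources,
        sort_keys=True,          # dict key order must not affect the hash
        separators=(",", ":"),   # fixed separators, no incidental whitespace
        ensure_ascii=False,
    )
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Placeholder rule sources for illustration only.
builtin = {
    "primary_re_source": r"[A-Z]+-\d+",
    "part_res_sources": [r"cd(\d)", r"part(\d)"],
    "fc2_handling": "enabled",
    "user_normalizers": [],
    "user_part_patterns": [],
}

sig_a = rules_signature(builtin)
# Same content with keys inserted in reverse order.
sig_b = rules_signature(dict(reversed(list(builtin.items()))))
# A user-added normalizer changes the signature.
sig_c = rules_signature({**builtin, "user_normalizers": ["custom"]})
```

`sort_keys=True` reorders dictionary keys but leaves lists alone, so the order of `part_res_sources` still feeds into the hash. That fits the rule that detection order counts as a rules change.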