ext-rclone-jav/docs/CACHE_CONTRACT.md
admin 41e9a500d0 Step 9: cache contract design doc
Adds docs/CACHE_CONTRACT.md defining the two-tier replacement for
today's single CACHE_VERSION=3 constant:

  cache_schema       force rebuild on mismatch (today's semantics)
  id_rules           mark stale, allow lazy re-extract w/o rescan
  id_rules_signature sha256 over canonical text of all extraction
                     rule sources (regexes, normalizers, part
                     detectors, FC2 handling, user-config rules)
                     as a belt-and-braces drift check

Documents:

  - new cache.json header shape
  - one-shot in-place migration for legacy `version: 3` users (no
    forced rescan)
  - behavior matrix for the three resulting states
  - extension UX: fresh / stale-by-rules amber / schema-mismatch red
  - new "Re-extract IDs" action that walks files[] in place and
    never touches rclone
  - what counts as a rules change vs. unrelated code change
  - open questions deferred to step 10 (per-remote tracking,
    custom-rules signature handling, host wiring)

No code changes — step 10 implements. This commit only locks the
contract so step 10 has a single source of truth for both the
Python and extension sides.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 21:37:01 +02:00


Cache contract — design (Step 9)

Status: design only. No code changes yet. Step 10 implements.

This document is the source of truth for cache.json versioning and the rebuild policy that both the Python rc-jav.py CLI and the browser extension follow. It supersedes the single CACHE_VERSION constant currently in rc-jav.py.

Why split CACHE_VERSION

CACHE_VERSION = 3 in rc-jav.py today is a single integer that covers two unrelated things:

  1. Schema — the shape of cache.json itself (top-level keys, nested object shape, what fields a file entry carries).
  2. Rules — the ID extraction logic (extract_id, normalizers, part detectors, FC2-PPV handling). These influence the jav_id field stored inside file entries.

Conflating them has a real cost:

  • The last CACHE_VERSION bump (to 3, with the comment "extract_id handles bracket-wrapped IDs + no-hyphen fallback") was a rules change. It forced every user into a full library rescan, which can take 30+ minutes per remote on large remotes, even though the shape of the file entries was unchanged.
  • A user who hasn't pulled the new rules can't tell from cache.json whether their existing cache is "wrong shape" (unusable) or "stale IDs" (usable but missing some matches).
  • The extension can't surface the distinction in its UI either, so the Cache & Scans pane shows a single "stale" state regardless.

Two-tier model

| Tier | Bumps when | Effect |
| --- | --- | --- |
| cache_schema | The cache.json structure changes (new field, removed key, etc.) | Force rebuild. Cache is unusable. |
| id_rules | Any extraction rule changes (regex, normalizer, part detector) | Mark stale. Cache stays usable; offer re-extract. |

cache_schema corresponds to the current CACHE_VERSION semantics (force rebuild). id_rules is new and has weaker semantics — the cache is still readable, the file list is still accurate, only the derived jav_id field may be wrong for some entries.

Cache header shape (new)

{
  "cache_schema": 1,
  "id_rules": 4,
  "id_rules_signature": "sha256:…",
  "remotes": {  }
}

Notes:

  • cache_schema starts at 1 for the new contract. Migration from the legacy version: 3 field is a one-shot read-side translation in load_cache() (see Migration below).
  • id_rules is a monotonic counter. Bump on every change to the rules listed under "What counts as a rules change" below.
  • id_rules_signature is a sha256 over the canonical text of the rule definitions (regex source strings + normalizer formats + part detector patterns + FC2 handling toggle). It's a belt-and-braces check: if a developer forgets to bump id_rules, the signature catches the drift. If a user has local custom rules in config.json, the signature drifts as well, and the cache is treated as stale by rules.
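
One way the signature could be computed is sketched below. This is a minimal sketch, not the step 10 implementation: the function name `compute_rules_signature`, the dictionary keys, and the example patterns are all hypothetical, and the exact input shape is still an open question.

```python
import hashlib
import json


def compute_rules_signature(rule_sources: dict) -> str:
    """Return 'sha256:<hex>' over a canonical JSON dump of the rule sources."""
    # sort_keys plus fixed separators make the text independent of dict
    # insertion order and whitespace, so identical rules hash identically.
    canonical = json.dumps(
        rule_sources, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical rule sources for illustration; the real patterns live in rc-jav.py.
rules = {
    "primary_re_source": r"([A-Z]{2,6})-(\d{2,5})",
    "part_res_sources": [r"[-_ ]cd(\d)", r"[-_ ]part(\d)"],
    "fc2_handling": "enabled",
    "user_normalizers": [],
}
signature = compute_rules_signature(rules)
```

Because list order is preserved in the dump, reordering part detectors changes the signature, which matches the note that detection order is part of the rule set.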

What counts as a rules change

Anything that influences the jav_id value stored in a file entry:

  • PRIMARY_ID_RE, COMPOUND_ID_RE, FALLBACK_ID_RE, _NOHYPHEN_ID_RE, _BRACKET_ID_RE in rc-jav.py
  • Built-in part detectors (BUILTIN_PART_RES) and detection order
  • FC2-PPV normalization branch
  • detect_part_from_stem and part_key behavior
  • extract_id's overall control flow (variant-letter detection, width-preserving padding, etc.)
  • User-added normalizers from config.json (id_normalizers)
  • User-added part patterns from config.json (partPatterns)

Not a rules change (no id_rules bump):

  • Bug fixes to non-extraction code paths (save_cache, walk_remote, find_dupes, keep-ranking logic, output formatting)
  • Changes to extension-side display, since the extension never edits cache.json
  • Adding a new shared fixture case to fixtures/

Behavior matrix

| User's cache | cache_schema | id_rules | Action |
| --- | --- | --- | --- |
| Fresh / matches both | = | = | Use as-is. |
| Schema mismatch | ≠ | (any) | Force rebuild. Same as today's CACHE_VERSION mismatch. |
| Schema match, rules stale | = | ≠ or sig drift | Mark stale. Use file list as-is; warn that some jav_ids may be out-of-date; offer "Re-extract IDs" (cheap, no remote scan). |
| Legacy version: 3 (no new header) | (translated to =) | (translated to 0, i.e. stale) | One-shot migration: replace header in place, do not force rebuild. The cache then reads as stale by rules. |
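
The matrix can be read as a small decision function. The sketch below assumes the header has already passed through the legacy migration in load_cache(); `CacheState` and `classify_cache` are hypothetical names, not existing code in rc-jav.py.

```python
from enum import Enum


class CacheState(Enum):
    FRESH = "fresh"
    STALE_BY_RULES = "stale_by_rules"
    SCHEMA_MISMATCH = "schema_mismatch"


def classify_cache(
    header: dict, current_schema: int, current_rules: int, current_signature: str
) -> CacheState:
    """Map a (migrated) cache header onto one of the three states."""
    # Schema is checked first: a mismatch makes every other field meaningless.
    if header.get("cache_schema") != current_schema:
        return CacheState.SCHEMA_MISMATCH
    # Either a counter mismatch or signature drift marks the cache stale.
    if (
        header.get("id_rules") != current_rules
        or header.get("id_rules_signature") != current_signature
    ):
        return CacheState.STALE_BY_RULES
    return CacheState.FRESH
```

A migrated legacy header (id_rules 0, signature "legacy") falls into the stale-by-rules branch, which is the intended outcome of the one-shot migration.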

"Re-extract IDs" is a new fast path: walk the existing files[] array and recompute jav_id on each entry using the current rules. No network or rclone call. Costs O(N) regex against N filenames, which is seconds even for large libraries.
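
A rough sketch of that fast path follows. It assumes each remote holds a `files` list whose entries carry a `path` key next to `jav_id`; those key names, the `reextract_ids` function, and the toy `demo_extract` stand-in are assumptions for illustration. The real extractor is extract_id in rc-jav.py.

```python
import re
from typing import Callable, Optional


def reextract_ids(
    cache: dict,
    extract_id: Callable[[str], Optional[str]],
    current_rules: int,
    current_signature: str,
) -> int:
    """Recompute jav_id for every cached entry in place; return how many changed."""
    changed = 0
    for remote in cache.get("remotes", {}).values():
        for entry in remote.get("files", []):
            new_id = extract_id(entry["path"])  # pure string work, no rclone call
            if entry.get("jav_id") != new_id:
                entry["jav_id"] = new_id
                changed += 1
    # Stamp the header so the cache reads as fresh afterwards.
    cache["id_rules"] = current_rules
    cache["id_rules_signature"] = current_signature
    return changed


def demo_extract(path: str) -> Optional[str]:
    """Toy stand-in for extract_id, only to make the sketch runnable."""
    m = re.search(r"([A-Za-z]{2,6})-?(\d{3,5})", path)
    return f"{m.group(1).upper()}-{m.group(2)}" if m else None
```

Returning the change count gives the caller something to report, for example how many IDs were corrected by the new rules.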

Migration from version: 3

load_cache() becomes:

def load_cache() -> dict:
    if not CACHE_PATH.exists():
        return _fresh_cache()
    try:
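        # The read call is an assumption; the original sketch elides the argument.
        data = json.loads(CACHE_PATH.read_text(encoding="utf-8"))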
    except Exception:
        return _fresh_cache()

    # Legacy header: { "version": 3, "remotes": {...} }
    # Translate in place. Treat as stale-by-rules so the user sees "stale", not "wipe".
    if "version" in data and "cache_schema" not in data:
        if data.get("version") == 3:
            data = {
                "cache_schema": CACHE_SCHEMA_VERSION,
                "id_rules": 0,           # forces "stale by rules" amber
                "id_rules_signature": "legacy",
                "remotes": data.get("remotes", {}),
            }
        else:
            return _fresh_cache()  # unknown legacy version → wipe

    # New header validation
    if data.get("cache_schema") != CACHE_SCHEMA_VERSION:
        return _fresh_cache()

    return data

Users with version: 3 get an in-place upgrade with no rescan. The cache shows up as "stale by rules" until they click Re-extract IDs.

Extension UX (Cache & Scans pane)

Three states instead of today's two:

| State | Color | Pane copy | Action button |
| --- | --- | --- | --- |
| Fresh | green ✓ | "Cache up to date." | "Re-scan" (manual) |
| Stale by rules | amber ! | "ID extraction rules have changed since this cache was built. Some IDs may be out of date." | "Re-extract IDs" (fast) |
| Schema mismatch / wipe | red ✗ | "Cache version is unreadable. A full re-scan is required." | "Re-scan now" |

The background script keeps its existing cache-status message. The response gains these fields:

{
  ok: true,
  cache_exists: true,
  cache_schema: 1,
  id_rules: 4,
  id_rules_current: 4,
  id_rules_match: true,
  id_rules_signature_match: true,
  // existing fields preserved: remotes, warnings, etc.
}

renderCacheStatus in options-cache.js reads these and picks the state. Tests live in fixtures or in options-cache.js mocks (no need to extend the JSON corpus for this).
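
On the Python side, the new response fields could be derived from the cache header roughly as follows. This is a sketch under the assumption that the host builds the cache-status payload; `cache_status_fields` is a hypothetical helper, and the existing fields (remotes, warnings, etc.) would be merged in separately.

```python
from typing import Optional


def cache_status_fields(
    header: Optional[dict], current_rules: int, current_signature: str
) -> dict:
    """Build the new cache-status fields from a loaded header (None if no cache)."""
    if header is None:
        return {"ok": True, "cache_exists": False}
    return {
        "ok": True,
        "cache_exists": True,
        "cache_schema": header.get("cache_schema"),
        "id_rules": header.get("id_rules"),
        "id_rules_current": current_rules,
        # Counter and signature are reported separately so the UI can
        # tell an upstream bump apart from signature-only drift.
        "id_rules_match": header.get("id_rules") == current_rules,
        "id_rules_signature_match": header.get("id_rules_signature")
        == current_signature,
    }
```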

Open questions

  1. Where does the user's "id_rules_signature" come from? The signature must be computable from a single canonical text. Easiest: sha256 over a sorted JSON dump of {primary_re_source, compound_re_source, fallback_re_source, nohyphen_re_source, bracket_re_source, part_res_sources, fc2_handling: "enabled", user_normalizers, user_part_patterns}. Punt on exact shape until step 10.
  2. Should the extension trigger Re-extract IDs? Yes — chrome.runtime.sendMessage({ type: "reextract-ids" }), background forwards to host, host calls a new rc-jav.py --reextract command that walks cache.json without re-listing the remote.
  3. Per-remote tracking? Today id_rules would be a single top-level integer. Could go per-remote (remotes[name].id_rules_at_scan) so "Re-extract IDs" can be triggered on a single remote. Recommend storing per-remote and computing top-level "stale by rules" as "any remote.id_rules_at_scan < id_rules_current". Defer detailed design to step 10.
  4. Custom rules in config.json. When a user adds a normalizer, id_rules_signature drifts and their cache appears stale. That's correct — their jav_ids really are out of date. But the global id_rules integer didn't change. UI copy should distinguish "rules updated upstream" from "your custom rules changed".
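
Questions 3 and 4 can be illustrated with two small helpers. Both are sketches with hypothetical names: `stale_remotes` assumes the per-remote field `id_rules_at_scan` proposed above and treats a missing value as 0, and `stale_reason` is only a heuristic, since signature drift with a matching counter could also mean a forgotten id_rules bump.

```python
from typing import List, Optional


def stale_remotes(cache: dict, id_rules_current: int) -> List[str]:
    """Names of remotes whose IDs were extracted under older rules."""
    return sorted(
        name
        for name, remote in cache.get("remotes", {}).items()
        # A remote without the field is assumed to predate per-remote tracking.
        if remote.get("id_rules_at_scan", 0) < id_rules_current
    )


def stale_reason(id_rules_match: bool, id_rules_signature_match: bool) -> Optional[str]:
    """Pick a UI copy variant: 'upstream', 'custom', or None when fresh."""
    if not id_rules_match:
        return "upstream"  # the shipped id_rules counter moved
    if not id_rules_signature_match:
        return "custom"  # counter matches, so local rules likely changed
    return None
```

The top-level "stale by rules" state is then simply whether `stale_remotes` returns a non-empty list.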

Out of scope (step 9)

  • Actually implementing the new header — that's step 10.
  • Re-extract IDs CLI/host wiring — step 10.
  • Bumping cache_schema to 1 and shipping new write code — step 10.
  • Cache compaction, partial scans, incremental updates — separate work.

Reference

  • Current CACHE_VERSION constant: D:\DEV\Project\rclone-jav\rc-jav.py line 376.
  • load_cache() / save_cache() around line 416 of the same file.
  • Extension consumer: options-cache.js renderCacheStatus, message type cache-status in background.js.
  • Shared fixture corpus that exercises the rule set: D:\DEV\Project\rclone-jav\fixtures\.