Adds docs/CACHE_CONTRACT.md defining the two-tier replacement for
today's single CACHE_VERSION=3 constant:
  cache_schema        force rebuild on mismatch (today's semantics)
  id_rules            mark stale, allow lazy re-extract w/o rescan
  id_rules_signature  sha256 over canonical text of all extraction
                      rule sources (regexes, normalizers, part
                      detectors, FC2 handling, user-config rules)
                      as a belt-and-braces drift check
Documents:
- new cache.json header shape
- one-shot in-place migration for legacy `version: 3` users (no
forced rescan)
- behavior matrix for the three resulting states
- extension UX: fresh / stale-by-rules amber / schema-mismatch red
- new "Re-extract IDs" action that walks files[] in place and
never touches rclone
- what counts as a rules change vs. unrelated code change
- open questions deferred to step 10 (per-remote tracking,
custom-rules signature handling, host wiring)
No code changes — step 10 implements. This commit only locks the
contract so step 10 has a single source of truth for both the
Python and extension sides.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Cache contract — design (Step 9)
Status: design only. No code changes yet. Step 10 implements.
This document is the source of truth for cache.json versioning and
the rebuild policy that both the Python rc-jav.py CLI and the
browser extension follow. It supersedes the single CACHE_VERSION
constant currently in rc-jav.py.
Why split CACHE_VERSION
CACHE_VERSION = 3 in rc-jav.py today is a single integer that
covers two unrelated things:
- Schema: the shape of cache.json itself (top-level keys, nested object shape, what fields a file entry carries).
- Rules: the ID extraction logic (extract_id, normalizers, part detectors, FC2-PPV handling). These influence the jav_id field stored inside file entries.
Conflating them has a real cost:
- The last CACHE_VERSION bump (3, comment "extract_id handles bracket-wrapped IDs + no-hyphen fallback") was a rules change. It forced every user to do a full library rescan, which on large remotes can take 30+ minutes per remote, even though the file entries' shape was unchanged.
- A user who hasn't pulled the new rules can't tell from cache.json whether their existing cache is "wrong shape" (unusable) or "stale IDs" (usable but missing some matches).
- The extension can't surface the distinction in its UI either, so the Cache & Scans pane shows a single "stale" state regardless.
Two-tier model
| Tier | Bumps when | Effect |
|---|---|---|
| cache_schema | The cache.json structure changes (new field, removed key, etc.) | Force rebuild. Cache is unusable. |
| id_rules | Any extraction rule changes (regex, normalizer, part detector) | Mark stale. Cache stays usable; offer re-extract. |
cache_schema corresponds to the current CACHE_VERSION semantics
(force rebuild). id_rules is new and has weaker semantics — the
cache is still readable, the file list is still accurate, only the
derived jav_id field may be wrong for some entries.
Cache header shape (new)
    {
      "cache_schema": 1,
      "id_rules": 4,
      "id_rules_signature": "sha256:…",
      "remotes": { … }
    }
Notes:
- cache_schema starts at 1 for the new contract. Migration from the legacy version: 3 field is a one-shot read-side translation in load_cache() (see Migration below).
- id_rules is a monotonic counter. Bump on every change to the rules listed under "What counts as a rules change" below.
- id_rules_signature is a sha256 over the canonical text of the rule definitions (regex source strings + normalizer fmts + part detector patterns + FC2 handling toggle). It's a belt-and-braces check: if a developer forgets to bump id_rules, the signature catches drift. If a user has local custom rules in config.json, the signature also drifts and is treated as a stale rules state.
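One way the signature in the last note could be computed, as a minimal sketch. The function name compute_rules_signature and the example patterns are hypothetical; the real rule sources live in rc-jav.py and the exact input shape is left to step 10.

```python
import hashlib
import json


def compute_rules_signature(rule_sources: dict) -> str:
    """Return "sha256:<hex>" over a canonical JSON dump of the rule sources."""
    # sort_keys and fixed separators make the dump independent of dict
    # insertion order and whitespace, so equal rules hash equally.
    canonical = json.dumps(
        rule_sources, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical rule sources, for illustration only.
example_rules = {
    "primary_re_source": r"([A-Z]{2,6})-(\d{2,5})",
    "fc2_handling": "enabled",
    "user_normalizers": [],
    "user_part_patterns": [],
}
signature = compute_rules_signature(example_rules)
```

Only dict keys are sorted. List order is preserved in the dump, which fits the contract here because detection order counts as part of the rules.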
What counts as a rules change
Anything that influences the jav_id value stored in a file entry:
- PRIMARY_ID_RE, COMPOUND_ID_RE, FALLBACK_ID_RE, _NOHYPHEN_ID_RE, _BRACKET_ID_RE in rc-jav.py
- Built-in part detectors (BUILTIN_PART_RES) and detection order
- FC2-PPV normalization branch
- detect_part_from_stem and part_key behavior
- extract_id's overall control flow (variant-letter detection, width-preserving padding, etc.)
- User-added normalizers from config.json (id_normalizers)
- User-added part patterns from config.json (partPatterns)
Not a rules change (no id_rules bump):
- Bug fixes to non-extraction code paths (save_cache, walk_remote, find_dupes, keep-ranking logic, output formatting)
- Changes to extension-side display, since the extension never edits cache.json
- Adding a new shared fixture case to fixtures/
Behavior matrix
| User's cache | cache_schema | id_rules | Action |
|---|---|---|---|
| Fresh / matches both | = | = | Use as-is. |
| Schema mismatch | ≠ | (any) | Force rebuild. Same as today's CACHE_VERSION mismatch. |
| Schema match, rules stale | = | ≠ or sig drift | Mark stale. Use file list as-is; warn that some jav_ids may be out-of-date; offer "Re-extract IDs" (cheap, no remote scan). |
| Legacy version: 3 (no new header) | (translated to =) | (translated to 0, so ≠) | One-shot migration: replace header in place, do not force rebuild. The cache then reads as stale by rules. |
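The matrix can be read as a small decision function. The sketch below is one possible encoding, not the step 10 implementation: ID_RULES_VERSION and the three state strings are hypothetical names, and the constant values are taken from the header example.

```python
CACHE_SCHEMA_VERSION = 1  # value from the header example
ID_RULES_VERSION = 4      # hypothetical constant; value from the header example


def classify_cache(header: dict, current_signature: str) -> str:
    """Return "schema_mismatch", "stale_by_rules", or "fresh"."""
    # Schema is checked first: a mismatch makes the rules columns irrelevant.
    if header.get("cache_schema") != CACHE_SCHEMA_VERSION:
        return "schema_mismatch"
    # Either a counter mismatch or signature drift marks the cache stale.
    if header.get("id_rules") != ID_RULES_VERSION:
        return "stale_by_rules"
    if header.get("id_rules_signature") != current_signature:
        return "stale_by_rules"
    return "fresh"
```

A migrated legacy header (id_rules 0, signature "legacy") falls into the stale-by-rules branch without any special case.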
"Re-extract IDs" is a new fast path: walk the existing files[] array
and recompute jav_id on each entry using the current rules. No
network or rclone call. Costs O(N) regex against N filenames, which
is seconds even for large libraries.
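A minimal sketch of that fast path, under stated assumptions: each remote holds a files list, each entry stores its filename under a name key, and the real extract_id from rc-jav.py is passed in as a callable. The entry shape and toy_extract_id are illustrative only.

```python
import re
from typing import Callable, Optional


def reextract_ids(cache: dict, extract_id: Callable[[str], Optional[str]]) -> int:
    """Recompute jav_id for every cached file entry; return how many changed."""
    changed = 0
    for remote in cache.get("remotes", {}).values():
        for entry in remote.get("files", []):
            # Pure string work on the cached filename: no network, no rclone.
            new_id = extract_id(entry["name"])
            if entry.get("jav_id") != new_id:
                entry["jav_id"] = new_id
                changed += 1
    return changed


# Toy stand-in for the real extract_id in rc-jav.py.
_TOY_ID_RE = re.compile(r"([A-Za-z]{2,6})-?(\d{3,5})")


def toy_extract_id(name: str) -> Optional[str]:
    m = _TOY_ID_RE.search(name)
    return f"{m.group(1).upper()}-{m.group(2)}" if m else None


cache = {"remotes": {"r1": {"files": [
    {"name": "[ABC-123] title.mp4", "jav_id": None},
    {"name": "xyz456.mkv", "jav_id": "XYZ-456"},
    {"name": "notes.txt", "jav_id": None},
]}}}
changed = reextract_ids(cache, toy_extract_id)
```

The sketch leaves the header alone. Updating id_rules and id_rules_signature after a successful pass would be the caller's job.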
Migration from version: 3
load_cache() becomes:
    def load_cache() -> dict:
        if not CACHE_PATH.exists():
            return _fresh_cache()
        try:
            data = json.loads(...)
        except Exception:
            return _fresh_cache()

        # Legacy header: { "version": 3, "remotes": {...} }
        # Translate in place. Keep the file list but mark the rules as stale,
        # so the user sees "stale", not "wipe".
        if "version" in data and "cache_schema" not in data:
            if data.get("version") == 3:
                data = {
                    "cache_schema": CACHE_SCHEMA_VERSION,
                    "id_rules": 0,  # forces "stale by rules" amber
                    "id_rules_signature": "legacy",
                    "remotes": data.get("remotes", {}),
                }
            else:
                return _fresh_cache()  # unknown legacy version → wipe

        # New header validation
        if data.get("cache_schema") != CACHE_SCHEMA_VERSION:
            return _fresh_cache()
        return data
Users with version: 3 get an in-place upgrade with no rescan. The
cache shows up as "stale by rules" until they click Re-extract IDs.
Extension UX (Cache & Scans pane)
Three states instead of today's two:
| State | Color | Pane copy | Action button |
|---|---|---|---|
| Fresh | green ✓ | "Cache up to date." | "Re-scan" (manual) |
| Stale by rules | amber ! | "ID extraction rules have changed since this cache was built. Some IDs may be out of date." | "Re-extract IDs" (fast) |
| Schema mismatch / wipe | red ✗ | "Cache version is unreadable. A full re-scan is required." | "Re-scan now" |
The background script keeps its existing cache-status message. The response gains these fields:
    {
      ok: true,
      cache_exists: true,
      cache_schema: 1,
      id_rules: 4,
      id_rules_current: 4,
      id_rules_match: true,
      id_rules_signature_match: true,
      // existing fields preserved: remotes, warnings, etc.
    }
renderCacheStatus in options-cache.js reads these and picks the
state. Tests live in fixtures or in options-cache.js mocks (no need
to extend the JSON corpus for this).
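On the Python side, the extra response fields could be derived from the header roughly as below. This is a sketch only: host wiring is deferred to step 10, build_cache_status is a hypothetical name, and the cache_exists: false branch is an assumption about how a missing cache would be reported.

```python
from typing import Optional


def build_cache_status(
    header: Optional[dict], id_rules_current: int, current_signature: str
) -> dict:
    """Build the new cache-status fields from a loaded cache header."""
    if header is None:
        # Assumed shape for "no cache on disk"; not specified in this contract.
        return {"ok": True, "cache_exists": False}
    return {
        "ok": True,
        "cache_exists": True,
        "cache_schema": header.get("cache_schema"),
        "id_rules": header.get("id_rules"),
        "id_rules_current": id_rules_current,
        "id_rules_match": header.get("id_rules") == id_rules_current,
        "id_rules_signature_match": (
            header.get("id_rules_signature") == current_signature
        ),
    }


# A freshly migrated legacy cache, checked against current rules version 4.
legacy = {"cache_schema": 1, "id_rules": 0, "id_rules_signature": "legacy"}
status = build_cache_status(legacy, 4, "sha256:abc")
```

Keeping the two match flags separate lets the pane tell a counter bump from signature drift, which matters for the custom-rules question below.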
Open questions
- Where does the user's "id_rules_signature" come from? The signature must be computable from a single canonical text. Easiest: sha256 over a sorted JSON dump of {primary_re_source, compound_re_source, fallback_re_source, nohyphen_re_source, bracket_re_source, part_res_sources, fc2_handling: "enabled", user_normalizers, user_part_patterns}. Punt on exact shape until step 10.
- Should the extension trigger Re-extract IDs? Yes: chrome.runtime.sendMessage({ type: "reextract-ids" }), background forwards to host, host calls a new rc-jav.py --reextract command that walks cache.json without re-listing the remote.
- Per-remote tracking? Today id_rules would be a single top-level integer. Could go per-remote (remotes[name].id_rules_at_scan) so "Re-extract IDs" can be triggered on a single remote. Recommend storing per-remote and computing top-level "stale by rules" as "any remote.id_rules_at_scan < id_rules_current". Defer detailed design to step 10.
- Custom rules in config.json. When a user adds a normalizer, id_rules_signature drifts and their cache appears stale. That's correct: their jav_ids really are out of date. But the global id_rules integer didn't change. UI copy should distinguish "rules updated upstream" from "your custom rules changed".
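The per-remote recommendation above could look like this, as a sketch. The id_rules_at_scan field is the one proposed in the question; treating a missing field as 0 (stale) is an assumption, and both function names are hypothetical.

```python
def stale_remotes(cache: dict, id_rules_current: int) -> list:
    """Names of remotes whose IDs were extracted under older rules."""
    return sorted(
        name
        for name, remote in cache.get("remotes", {}).items()
        # Assumption: a remote without id_rules_at_scan counts as stale.
        if remote.get("id_rules_at_scan", 0) < id_rules_current
    )


def is_stale_by_rules(cache: dict, id_rules_current: int) -> bool:
    """Top-level state: stale if any remote lags the current rules."""
    return bool(stale_remotes(cache, id_rules_current))


example_cache = {"remotes": {
    "gdrive": {"id_rules_at_scan": 4},
    "onedrive": {"id_rules_at_scan": 3},
    "legacy": {},
}}
lagging = stale_remotes(example_cache, 4)
```

Returning the names rather than a bare boolean lets "Re-extract IDs" target only the remotes that need it.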
Out of scope (step 9)
- Actually implementing the new header: that's step 10.
- Re-extract IDs CLI/host wiring: step 10.
- Bumping cache_schema to 1 and shipping new write code: step 10.
- Cache compaction, partial scans, incremental updates: separate work.
Reference
- Current CACHE_VERSION constant: D:\DEV\Project\rclone-jav\rc-jav.py line 376.
- load_cache() / save_cache() around line 416 of the same file.
- Extension consumer: options-cache.js renderCacheStatus, message type cache-status in background.js.
- Shared fixture corpus that exercises the rule set: D:\DEV\Project\rclone-jav\fixtures\.