ext-rclone-jav/docs/CACHE_CONTRACT.md
admin 41e9a500d0 Step 9: cache contract design doc
Adds docs/CACHE_CONTRACT.md defining the two-tier replacement for
today's single CACHE_VERSION=3 constant:

  cache_schema       force rebuild on mismatch (today's semantics)
  id_rules           mark stale, allow lazy re-extract w/o rescan
  id_rules_signature sha256 over canonical text of all extraction
                     rule sources (regexes, normalizers, part
                     detectors, FC2 handling, user-config rules)
                     as a belt-and-braces drift check

Documents:

  - new cache.json header shape
  - one-shot in-place migration for legacy `version: 3` users (no
    forced rescan)
  - behavior matrix for the three resulting states
  - extension UX: fresh / stale-by-rules amber / schema-mismatch red
  - new "Re-extract IDs" action that walks files[] in place and
    never touches rclone
  - what counts as a rules change vs. unrelated code change
  - open questions deferred to step 10 (per-remote tracking,
    custom-rules signature handling, host wiring)

No code changes — step 10 implements. This commit only locks the
contract so step 10 has a single source of truth for both the
Python and extension sides.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 21:37:01 +02:00


# Cache contract — design (Step 9)
Status: **design only**. No code changes yet. Step 10 implements.

This document is the source of truth for `cache.json` versioning and
the rebuild policy that both the Python `rc-jav.py` CLI and the
browser extension follow. It supersedes the single `CACHE_VERSION`
constant currently in `rc-jav.py`.
## Why split CACHE_VERSION
`CACHE_VERSION = 3` in `rc-jav.py` today is a single integer that
covers two unrelated things:
1. **Schema** — the shape of `cache.json` itself (top-level keys,
nested object shape, what fields a file entry carries).
2. **Rules** — the ID extraction logic (`extract_id`, normalizers,
part detectors, FC2-PPV handling). These influence the `jav_id`
field stored inside file entries.

Conflating them has a real cost:
- The last `CACHE_VERSION` bump (`3`, comment "extract_id handles
bracket-wrapped IDs + no-hyphen fallback") was a **rules** change.
It forced every user to do a full library rescan, which on large
remotes can take 30+ minutes per remote, even though the file
entries' shape was unchanged.
- A user who hasn't pulled the new rules can't tell from cache.json
whether their existing cache is "wrong shape" (unusable) or "stale
IDs" (usable but missing some matches).
- The extension can't surface the distinction in its UI either, so
the Cache & Scans pane shows a single "stale" state regardless.
## Two-tier model
| Tier | Bumps when | Effect |
|-------------------|--------------------------------------------------------------------|-------------------------------------------------------|
| `cache_schema` | The cache.json structure changes (new field, removed key, etc.) | **Force rebuild.** Cache is unusable. |
| `id_rules` | Any extraction rule changes — regex, normalizer, part detector | **Mark stale.** Cache stays usable; offer re-extract. |

`cache_schema` corresponds to the current `CACHE_VERSION` semantics
(force rebuild). `id_rules` is new and has weaker semantics — the
cache is still readable, the file list is still accurate, only the
derived `jav_id` field may be wrong for some entries.
## Cache header shape (new)
```json
{
  "cache_schema": 1,
  "id_rules": 4,
  "id_rules_signature": "sha256:…",
  "remotes": { }
}
```
Notes:
- `cache_schema` starts at `1` for the new contract. Migration from
the legacy `version: 3` field is a one-shot read-side translation
in `load_cache()` (see Migration below).
- `id_rules` is a monotonic counter. Bump on every change to the
rules listed under "What counts as a rules change" below.
- `id_rules_signature` is a sha256 over the canonical text of the
rule definitions (regex source strings, normalizer formats, part
detector patterns, and the FC2 handling toggle). It's a
**belt-and-braces check**: if a developer forgets to bump `id_rules`,
the signature catches the drift. If a user has local custom rules in
`config.json`, the signature also drifts and is treated as a stale
rules state.
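
The signature described above could be computed roughly as follows. This is a minimal sketch, assuming the rule sources are first collected into a plain dict. `rules_signature` and the example regex strings are hypothetical, not the names or patterns in `rc-jav.py`, and the exact input shape is still an open question for step 10.

```python
import hashlib
import json


def rules_signature(rule_sources: dict) -> str:
    """Return 'sha256:<hex>' over a canonical JSON dump of the rule sources."""
    # sort_keys plus fixed separators make the text independent of dict
    # insertion order and of whitespace choices. List order is preserved,
    # which matters because part-detector order is part of the rules.
    canonical = json.dumps(
        rule_sources, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"sha256:{digest}"


# Hypothetical rule sources for illustration only.
example = {
    "primary_re_source": r"([A-Z]{2,6})-(\d{2,5})",
    "part_res_sources": [r"[-_ ]cd(\d)", r"[-_ ]part(\d)"],
    "fc2_handling": "enabled",
    "user_normalizers": [],
}
sig = rules_signature(example)
```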
## What counts as a rules change
Anything that influences the `jav_id` value stored in a file entry:
- `PRIMARY_ID_RE`, `COMPOUND_ID_RE`, `FALLBACK_ID_RE`,
`_NOHYPHEN_ID_RE`, `_BRACKET_ID_RE` in `rc-jav.py`
- Built-in part detectors (`BUILTIN_PART_RES`) and detection order
- FC2-PPV normalization branch
- `detect_part_from_stem` and `part_key` behavior
- `extract_id`'s overall control flow (variant-letter detection,
width-preserving padding, etc.)
- User-added normalizers from `config.json` (`id_normalizers`)
- User-added part patterns from `config.json` (`partPatterns`)

**Not** a rules change (no `id_rules` bump):
- Bug fixes to non-extraction code paths (`save_cache`, `walk_remote`,
`find_dupes`, keep-ranking logic, output formatting)
- Changes to extension-side display, since the extension never edits
cache.json
- Adding a new shared fixture case to `fixtures/`
## Behavior matrix
| User's cache | `cache_schema` | `id_rules` | Action |
|-------------------------------|-------------------|-------------------|-------------------------------------------------------------------|
| Fresh / matches both | = | = | Use as-is. |
| Schema mismatch | ≠ | (any) | **Force rebuild.** Same as today's `CACHE_VERSION` mismatch. |
| Schema match, rules stale | = | ≠ or sig drift | **Mark stale.** Use file list as-is; warn that some `jav_id`s may be out-of-date; offer "Re-extract IDs" (cheap, no remote scan). |
| Legacy `version: 3` (no new header) | (translated to =) | (translated to `0`, so ≠) | One-shot migration: replace header in place, do not force rebuild. The cache then reads as "stale by rules" until IDs are re-extracted. |

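
The matrix can be read as a small decision function. The sketch below is one possible reading, assuming legacy headers have already been translated by `load_cache()`. `classify_cache`, its parameter names, and the state strings are hypothetical, and `"sha256:abc"` is a placeholder rather than a real digest.

```python
def classify_cache(header: dict, schema_now: int, rules_now: int, sig_now: str) -> str:
    """Return 'schema_mismatch', 'stale_rules', or 'fresh' for a cache header."""
    # Schema is checked first: a mismatch makes the rules columns irrelevant.
    if header.get("cache_schema") != schema_now:
        return "schema_mismatch"
    # Either a counter mismatch or signature drift marks the cache stale.
    if header.get("id_rules") != rules_now:
        return "stale_rules"
    if header.get("id_rules_signature") != sig_now:
        return "stale_rules"
    return "fresh"


current = dict(schema_now=1, rules_now=4, sig_now="sha256:abc")
fresh = {"cache_schema": 1, "id_rules": 4, "id_rules_signature": "sha256:abc"}
migrated = {"cache_schema": 1, "id_rules": 0, "id_rules_signature": "legacy"}

state_fresh = classify_cache(fresh, **current)        # "fresh"
state_migrated = classify_cache(migrated, **current)  # "stale_rules"
```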
"Re-extract IDs" is a new fast path: walk the existing `files[]` array
and recompute `jav_id` on each entry using the current rules. No
network or rclone call. Costs O(N) regex against N filenames, which
is seconds even for large libraries.
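
A rough sketch of that fast path, under stated assumptions: file entries are taken to live at `remotes[name]["files"]` and to carry a `name` field, neither of which this document specifies, and `toy_extract_id` is a stand-in for the real `extract_id` in `rc-jav.py`.

```python
import re
from typing import Callable, Optional


def reextract_ids(cache: dict, extract_id: Callable[[str], Optional[str]]) -> int:
    """Recompute jav_id for every file entry in place; return how many changed."""
    changed = 0
    for remote in cache.get("remotes", {}).values():
        for entry in remote.get("files", []):
            new_id = extract_id(entry["name"])  # pure string work, no rclone call
            if entry.get("jav_id") != new_id:
                entry["jav_id"] = new_id
                changed += 1
    return changed


# Toy extractor for illustration; far simpler than the real rule set.
_TOY_ID_RE = re.compile(r"([A-Za-z]{2,6})-?(\d{3,5})")


def toy_extract_id(name: str) -> Optional[str]:
    m = _TOY_ID_RE.search(name)
    return f"{m.group(1).upper()}-{m.group(2)}" if m else None


cache = {"remotes": {"gdrive": {"files": [
    {"name": "[ABC-123] title.mp4", "jav_id": None},
    {"name": "xyz456.mkv", "jav_id": None},
    {"name": "notes.txt", "jav_id": None},
]}}}
n_changed = reextract_ids(cache, toy_extract_id)  # 2 entries gain an ID
```

A real implementation would presumably also rewrite `id_rules` and `id_rules_signature` in the header afterwards so the cache stops reading as stale.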
## Migration from `version: 3`
`load_cache()` becomes:
```python
def load_cache() -> dict:
    if not CACHE_PATH.exists():
        return _fresh_cache()
    try:
        data = json.loads(...)
    except Exception:
        return _fresh_cache()
    # Valid JSON that is not an object (e.g. a list) is unusable.
    if not isinstance(data, dict):
        return _fresh_cache()
    # Legacy header: { "version": 3, "remotes": {...} }
    # Translate in place. id_rules = 0 never matches the current counter,
    # so the user sees "stale by rules", not a wipe.
    if "version" in data and "cache_schema" not in data:
        if data.get("version") == 3:
            data = {
                "cache_schema": CACHE_SCHEMA_VERSION,
                "id_rules": 0,  # forces "stale by rules" amber
                "id_rules_signature": "legacy",
                "remotes": data.get("remotes", {}),
            }
        else:
            return _fresh_cache()  # unknown legacy version → wipe
    # New header validation
    if data.get("cache_schema") != CACHE_SCHEMA_VERSION:
        return _fresh_cache()
    return data
```
Users with `version: 3` get an in-place upgrade with no rescan. The
cache shows up as "stale by rules" until they click Re-extract IDs.
## Extension UX (Cache & Scans pane)
Three states instead of today's two:
| State | Color | Pane copy | Action button |
|-------------------------|----------|------------------------------------------------------------------------------------|------------------------------|
| Fresh | green ✓ | "Cache up to date." | "Re-scan" (manual) |
| Stale by rules | amber ! | "ID extraction rules have changed since this cache was built. Some IDs may be out of date." | **"Re-extract IDs"** (fast) |
| Schema mismatch / wipe | red ✗ | "Cache version is unreadable. A full re-scan is required." | "Re-scan now" |

The background script keeps its existing `cache-status` message. The response gains these fields:
```js
{
  ok: true,
  cache_exists: true,
  cache_schema: 1,
  id_rules: 4,
  id_rules_current: 4,
  id_rules_match: true,
  id_rules_signature_match: true,
  // existing fields preserved: remotes, warnings, etc.
}
```
`renderCacheStatus` in `options-cache.js` reads these and picks the
state. Tests live in fixtures or in `options-cache.js` mocks (no need
to extend the JSON corpus for this).
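
On the host side, the new response fields might be derived from the loaded header along these lines. This is only a sketch: `cache_status_fields` is a hypothetical helper, the shape returned when no cache exists is an assumption, and the existing fields (`remotes`, `warnings`, etc.) are left out.

```python
from typing import Optional


def cache_status_fields(header: Optional[dict], rules_now: int, sig_now: str) -> dict:
    """Build the new cache-status fields from a loaded cache header."""
    if header is None:
        # Assumed shape for "no cache on disk"; not specified in this document.
        return {"ok": True, "cache_exists": False}
    return {
        "ok": True,
        "cache_exists": True,
        "cache_schema": header.get("cache_schema"),
        "id_rules": header.get("id_rules"),
        "id_rules_current": rules_now,
        "id_rules_match": header.get("id_rules") == rules_now,
        "id_rules_signature_match": header.get("id_rules_signature") == sig_now,
    }


# Placeholder signatures, not real digests.
header = {"cache_schema": 1, "id_rules": 3, "id_rules_signature": "sha256:old"}
status = cache_status_fields(header, rules_now=4, sig_now="sha256:new")
```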
## Open questions
1. **Where does the user's "id_rules_signature" come from?** The
signature must be computable from a single canonical text. Easiest:
sha256 over a sorted JSON dump of `{primary_re_source, compound_re_source,
fallback_re_source, nohyphen_re_source, bracket_re_source,
part_res_sources, fc2_handling: "enabled", user_normalizers,
user_part_patterns}`. Punt on exact shape until step 10.
2. **Should the extension trigger Re-extract IDs?** Yes —
`chrome.runtime.sendMessage({ type: "reextract-ids" })`, background
forwards to host, host calls a new `rc-jav.py --reextract` command
that walks cache.json without re-listing the remote.
3. **Per-remote tracking?** Today `id_rules` would be a single top-level
integer. Could go per-remote (`remotes[name].id_rules_at_scan`) so
"Re-extract IDs" can be triggered on a single remote. Recommend
storing per-remote and computing top-level "stale by rules" as
"any remote.id_rules_at_scan < id_rules_current". Defer detailed
design to step 10.
4. **Custom rules in config.json.** When a user adds a normalizer,
`id_rules_signature` drifts and their cache appears stale. That's
correct — their `jav_id`s really are out of date. But the global
`id_rules` integer didn't change. UI copy should distinguish
"rules updated upstream" from "your custom rules changed".
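
Questions 3 and 4 can be illustrated together. The sketch below assumes the per-remote field `id_rules_at_scan` recommended above and treats a missing value as `0`. `stale_remotes`, `stale_reason`, and the reason strings are hypothetical. Note that signature drift with an unchanged counter could also mean a developer forgot to bump `id_rules`, so "custom" is only the likely cause.

```python
from typing import List


def stale_remotes(cache: dict, rules_now: int) -> List[str]:
    """Names of remotes whose last scan used older rules than rules_now."""
    return sorted(
        name
        for name, remote in cache.get("remotes", {}).items()
        if remote.get("id_rules_at_scan", 0) < rules_now
    )


def stale_reason(header: dict, rules_now: int, sig_now: str) -> str:
    """'upstream' if the counter moved, 'custom' if only the signature drifted."""
    if header.get("id_rules") != rules_now:
        return "upstream"
    if header.get("id_rules_signature") != sig_now:
        return "custom"
    return "none"


cache = {"remotes": {
    "gdrive": {"id_rules_at_scan": 4},
    "b2": {"id_rules_at_scan": 3},
    "old": {},  # scanned before per-remote tracking existed
}}
stale = stale_remotes(cache, rules_now=4)  # ["b2", "old"]
```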
## Out of scope (step 9)
- Actually implementing the new header — that's step 10.
- Re-extract IDs CLI/host wiring — step 10.
- Bumping `cache_schema` to `1` and shipping new write code — step 10.
- Cache compaction, partial scans, incremental updates — separate work.
## Reference
- Current `CACHE_VERSION` constant: `D:\DEV\Project\rclone-jav\rc-jav.py`
line 376.
- `load_cache()` / `save_cache()` around line 416 of the same file.
- Extension consumer: `options-cache.js` `renderCacheStatus`, message
type `cache-status` in `background.js`.
- Shared fixture corpus that exercises the rule set:
`D:\DEV\Project\rclone-jav\fixtures\`.