Step 9: cache contract design doc
Adds docs/CACHE_CONTRACT.md defining the two-tier replacement for
today's single CACHE_VERSION=3 constant:
- cache_schema: force rebuild on mismatch (today's semantics)
- id_rules: mark stale, allow lazy re-extract w/o rescan
- id_rules_signature: sha256 over canonical text of all extraction
  rule sources (regexes, normalizers, part detectors, FC2 handling,
  user-config rules) as a belt-and-braces drift check
Documents:
- new cache.json header shape
- one-shot in-place migration for legacy `version: 3` users (no
forced rescan)
- behavior matrix for the three resulting states
- extension UX: fresh / stale-by-rules amber / schema-mismatch red
- new "Re-extract IDs" action that walks files[] in place and
never touches rclone
- what counts as a rules change vs. unrelated code change
- open questions deferred to step 10 (per-remote tracking,
custom-rules signature handling, host wiring)
No code changes — step 10 implements. This commit only locks the
contract so step 10 has a single source of truth for both the
Python and extension sides.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@@ -139,14 +139,14 @@ Done in rc-jav catalog loading. Catalog CSV/XML paths are normalized from Window
|
|||||||
6b. **options.js split — Library Issues extraction.** `options.js` 2356 → 1903 lines. New file: `options-library-issues.js` (453 lines) — covers `lastLibraryIssues`, `_libraryIssuesDirty`, `renderLibraryIssues`, `_closeLibraryIssues`, and the bottom IIFE that wraps `_optScanTimer` / `_setOptScanningState` / `_pollOptProgress` for optimization-scan progress polling. Block was fully self-contained (no external callers of its identifiers). Reads `_configuredScanRoots` / `_cacheSkippedByRemote` / calls `rememberConfiguredScanRoots` from `options-cache.js` — same cross-file binding pattern proven in step 6. Script-tag order in `options.html`: cache → dupe-review → library-issues → options.js. `node --check` passes on each file and on concatenation; line count of concat (3133) matches pre-split total exactly.
|
||||||
7a. **Bulk Check standalone window.** New `bulk-check.{html,js,css}` opened as detached `chrome.windows.create({ type: 'popup', width: 640, height: 540 })`. Launcher = 📋 icon button in popup header next to ⚙ Options; click sends `open-bulk-check` message to background and closes the popup. Background owns the lifecycle: `openBulkCheckWindow()` reads `chrome.storage.session.bulkCheckWindowId`; existing id → `chrome.windows.update({ focused, drawAttention })`; failure or no id → create new window + stash id. `chrome.windows.onRemoved` clears the stale id on close. Last-paste persisted to `chrome.storage.local.bulkCheckLastPaste` (debounced 500ms), restored on window open. `quickMode` read from settings on each run (parity with old options behavior). Removed the Bulk ID Check fieldset from `options.html` (Library Review pane description updated to note the relocation) and its handlers from `options.js` (1903 → 1852 lines). No manifest permission changes needed.
|
||||||
8. **Shared fixture corpus.** Seeded `D:\DEV\Project\rclone-jav\fixtures\` (top-level in the Python repo, conceptually shared with this extension). Files: `filename-extraction.json` (12 cases, Python `extract_id` contract), `query-extraction.json` (10 cases, extension `content.js` `normalizeId` contract), `shared-normalization.json` (5 cases, both sides must agree), `README.md`, and a self-contained Python runner `run.py` (no third-party deps; imports `rc-jav.py` in place). All 17 Python-side cases pass against current `rc-jav.py`. The runner uses `|` and `->` instead of `·` and `→` so it works on Windows cp1252 consoles. Documented one intentional divergence: the extension normalizes the compact `FC2PPV1841460` form (page-title surface) while Python `extract_id` does not (filename surface — compact form doesn't appear on disk). No Node-side runner today — `content.js` lives in an injected IIFE and importing it would require duplicating regexes; the JSON corpus is the canonical spec until that lands.
|
||||||
|
9. **Cache contract design — shipped as a design doc, not code.** `docs/CACHE_CONTRACT.md` defines a two-tier model that splits today's single `CACHE_VERSION = 3` into `cache_schema` (force rebuild on mismatch) and `id_rules` (mark stale, allow lazy re-extract without re-scanning). Adds `id_rules_signature` (sha256 over canonical text of all extraction-rule sources, including user-added normalizers from config.json) as a belt-and-braces drift check. Specifies the new cache header shape, a one-shot in-place migration for users on legacy `version: 3` (no forced rescan), the behavior matrix for the three resulting states, and the extension's three-state UX (fresh / stale-by-rules amber / schema-mismatch red) with a new "Re-extract IDs" action that walks `files[]` in place and never touches rclone. Step 10 implements; step 9 only locks the contract.
|
||||||
|
|
||||||
(Step 4 in the plan is a paired-extraction sub-task of step 6; folded into step 6 ship.)
|
||||||
|
|
||||||
**Pending (in execution order):**
|
||||||
|
|
||||||
- **Step 6c — finish options.js split (optional).** Remaining options.js (1852 lines) still holds: settings load/save, backup/restore, recent activity, search test bench, adapters, ID normalizers, part detectors, element picker, overlay previews, diagnostics, profiles, paths, and the bottom-entry IIFE. Candidates for extraction: Diagnostics (~250 lines), Profiles (~265 lines), Adapters + ID normalizers + Part detectors as a "rules editors" file (~330 lines combined). Diminishing returns past this point — bottom IIFE + load/save core should stay in `options.js` as the entry point.
|
||||||
- **Step 9 — Cache contract design.** CACHE_VERSION already exists (currently 3). Add ID_RULES_VERSION concept: schema bump = force rebuild, rules bump = warn-and-mark-stale.
|
- **Step 10 — `rc-jav.py` module split** into `rcjav/` package (ids, cache, dupes, catalog, rclone_io, output, cli). Keep `rc-jav.py` as thin entrypoint that imports from `rcjav.cli.main`. Step 10 is also where the cache-contract design from step 9 gets implemented: split `CACHE_VERSION` into `cache_schema` + `id_rules` + `id_rules_signature`, add the legacy-`version: 3` in-place migration, add a `--reextract` CLI flag that walks `files[]` without re-listing remotes, and update the extension's `cache-status` consumer (`options-cache.js`) to render the three-state UX from `docs/CACHE_CONTRACT.md`.
|
||||||
- **Step 10 — `rc-jav.py` module split** into `rcjav/` package (ids, cache, dupes, catalog, rclone_io, output, cli). Keep `rc-jav.py` as thin entrypoint that imports from `rcjav.cli.main`.
|
|
||||||
- **Step 11 — Host fast-path benchmark and decide.** Measure popup search latency under (a) idle Python and (b) Python actively scanning. If host fast path is the only thing keeping popup responsive under scan = narrow to dict lookup only and document. If not needed = delete entirely.
|
||||||
|
|
||||||
**Architecture (locked — do not relitigate):**
|
||||||
|
|||||||
@@ -0,0 +1,211 @@
# Cache contract: design (Step 9)

Status: **design only**. No code changes yet. Step 10 implements.

This document is the source of truth for `cache.json` versioning and
the rebuild policy that both the Python `rc-jav.py` CLI and the
browser extension follow. It supersedes the single `CACHE_VERSION`
constant currently in `rc-jav.py`.

## Why split CACHE_VERSION

`CACHE_VERSION = 3` in `rc-jav.py` today is a single integer that
covers two unrelated things:

1. **Schema**: the shape of `cache.json` itself (top-level keys,
   nested object shape, what fields a file entry carries).
2. **Rules**: the ID extraction logic (`extract_id`, normalizers,
   part detectors, FC2-PPV handling). These influence the `jav_id`
   field stored inside file entries.

Conflating them has a real cost:

- The last `CACHE_VERSION` bump (`3`, comment "extract_id handles
  bracket-wrapped IDs + no-hyphen fallback") was a **rules** change.
  It forced every user to do a full library rescan, which on large
  remotes can take 30+ minutes per remote, even though the file
  entries' shape was unchanged.
- A user who hasn't pulled the new rules can't tell from cache.json
  whether their existing cache is "wrong shape" (unusable) or "stale
  IDs" (usable but missing some matches).
- The extension can't surface the distinction in its UI either, so
  the Cache & Scans pane shows a single "stale" state regardless.

## Two-tier model

| Tier           | Bumps when                                                        | Effect                                                |
|----------------|-------------------------------------------------------------------|-------------------------------------------------------|
| `cache_schema` | The cache.json structure changes (new field, removed key, etc.)   | **Force rebuild.** Cache is unusable.                 |
| `id_rules`     | Any extraction rule changes (regex, normalizer, part detector)    | **Mark stale.** Cache stays usable; offer re-extract. |

`cache_schema` corresponds to the current `CACHE_VERSION` semantics
(force rebuild). `id_rules` is new and has weaker semantics: the
cache is still readable, the file list is still accurate, only the
derived `jav_id` field may be wrong for some entries.

## Cache header shape (new)

```json
{
  "cache_schema": 1,
  "id_rules": 4,
  "id_rules_signature": "sha256:…",
  "remotes": { … }
}
```

Notes:

- `cache_schema` starts at `1` for the new contract. Migration from
  the legacy `version: 3` field is a one-shot read-side translation
  in `load_cache()` (see Migration below).
- `id_rules` is a monotonic counter. Bump on every change to the
  rules listed under "What counts as a rules change" below.
- `id_rules_signature` is a sha256 over the canonical text of the
  rule definitions (regex source strings + normalizer fmts + part
  detector patterns + FC2 handling toggle). It's a **belt-and-braces
  check**: if a developer forgets to bump `id_rules`, the signature
  catches drift. If a user has local custom rules in `config.json`,
  the signature also drifts and is treated as a stale rules state.
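
A minimal sketch of how such a signature could be computed, assuming the rule sources are gathered into one dictionary and serialized as sorted-key JSON. The helper name `compute_id_rules_signature`, the dictionary keys, and the example patterns are hypothetical; the exact canonical shape is left to step 10.

```python
import hashlib
import json


def compute_id_rules_signature(rule_sources: dict) -> str:
    """Return "sha256:<hex>" over a canonical JSON dump of the rule sources.

    Hypothetical helper: the real field set is not fixed by this document.
    """
    # sort_keys plus fixed separators make the text independent of dict
    # ordering and whitespace, so equal rules always hash to the same value.
    canonical = json.dumps(
        rule_sources, sort_keys=True, separators=(",", ":"), ensure_ascii=False
    )
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"sha256:{digest}"


# Illustrative inputs only; these are not the regexes used by rc-jav.py.
builtin = {
    "primary_re_source": r"([A-Z]{2,6})-(\d{2,5})",
    "part_res_sources": [r"cd(\d)", r"part(\d)"],
    "fc2_handling": "enabled",
    "user_normalizers": [],
}
with_custom = dict(builtin, user_normalizers=[{"match": "^abc", "fmt": "ABC-{n}"}])

sig_builtin = compute_id_rules_signature(builtin)
sig_reordered = compute_id_rules_signature(dict(reversed(list(builtin.items()))))
sig_custom = compute_id_rules_signature(with_custom)
```

Key order does not affect the result, while adding a user normalizer does. Lists are deliberately left in their given order, since detection order is itself listed as a rules change.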

## What counts as a rules change

Anything that influences the `jav_id` value stored in a file entry:

- `PRIMARY_ID_RE`, `COMPOUND_ID_RE`, `FALLBACK_ID_RE`,
  `_NOHYPHEN_ID_RE`, `_BRACKET_ID_RE` in `rc-jav.py`
- Built-in part detectors (`BUILTIN_PART_RES`) and detection order
- FC2-PPV normalization branch
- `detect_part_from_stem` and `part_key` behavior
- `extract_id`'s overall control flow (variant-letter detection,
  width-preserving padding, etc.)
- User-added normalizers from `config.json` (`id_normalizers`)
- User-added part patterns from `config.json` (`partPatterns`)

**Not** a rules change (no `id_rules` bump):

- Bug fixes to non-extraction code paths (`save_cache`, `walk_remote`,
  `find_dupes`, keep-ranking logic, output formatting)
- Changes to extension-side display, since the extension never edits
  cache.json
- Adding a new shared fixture case to `fixtures/`

## Behavior matrix

| User's cache                        | `cache_schema`    | `id_rules`                  | Action |
|-------------------------------------|-------------------|-----------------------------|--------|
| Fresh / matches both                | =                 | =                           | Use as-is. |
| Schema mismatch                     | ≠                 | (any)                       | **Force rebuild.** Same as today's `CACHE_VERSION` mismatch. |
| Schema match, rules stale           | =                 | ≠ or sig drift              | **Mark stale.** Use file list as-is; warn that some `jav_id`s may be out-of-date; offer "Re-extract IDs" (cheap, no remote scan). |
| Legacy `version: 3` (no new header) | (translated to =) | (translated to `0`, so ≠)   | One-shot migration: replace header in place, do not force rebuild. The cache then reads as stale by rules until IDs are re-extracted. |

"Re-extract IDs" is a new fast path: walk the existing `files[]` array
and recompute `jav_id` on each entry using the current rules. No
network or rclone call. The cost is O(N) regex matches over N
filenames, which takes seconds even for large libraries.
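
The fast path could look roughly like the sketch below. It assumes each remote holds a `files` list whose entries carry a `name` and a `jav_id`; the `name` key, the `reextract_ids` function, and the stand-in `demo_extract_id` regex are assumptions for illustration, not the `rc-jav.py` implementation.

```python
import re
from typing import Callable, Optional

# Stand-in extractor for the demo only; the real rules live in rc-jav.py.
_DEMO_ID_RE = re.compile(r"([A-Za-z]{2,6})-?(\d{2,5})")


def demo_extract_id(filename: str) -> Optional[str]:
    m = _DEMO_ID_RE.search(filename)
    return f"{m.group(1).upper()}-{m.group(2)}" if m else None


def reextract_ids(
    cache: dict,
    extract_id: Callable[[str], Optional[str]],
    id_rules_current: int,
    signature_current: str,
) -> int:
    """Recompute jav_id for every cached file entry; return how many changed.

    Only the in-memory cache dict is touched: no rclone call, no network.
    """
    changed = 0
    for remote in cache.get("remotes", {}).values():
        for entry in remote.get("files", []):
            new_id = extract_id(entry["name"])  # "name" key is assumed
            if entry.get("jav_id") != new_id:
                entry["jav_id"] = new_id
                changed += 1
    # Stamp the header so the cache reads as fresh afterwards.
    cache["id_rules"] = id_rules_current
    cache["id_rules_signature"] = signature_current
    return changed


cache = {
    "cache_schema": 1,
    "id_rules": 0,
    "id_rules_signature": "legacy",
    "remotes": {
        "gdrive": {
            "files": [
                {"name": "[ABC-123] title.mp4", "jav_id": None},
                {"name": "xyz456.mkv", "jav_id": "XYZ-456"},
            ]
        }
    },
}
changed = reextract_ids(cache, demo_extract_id, 4, "sha256:demo")
```

The walk is a single pass over the cached entries, which is why it stays cheap compared with re-listing a remote.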

## Migration from `version: 3`

`load_cache()` becomes:

```python
def load_cache() -> dict:
    if not CACHE_PATH.exists():
        return _fresh_cache()
    try:
        data = json.loads(...)
    except Exception:
        return _fresh_cache()

    # A cache file that is valid JSON but not an object is unusable.
    if not isinstance(data, dict):
        return _fresh_cache()

    # Legacy header: { "version": 3, "remotes": {...} }
    # Translate in place. Mark the rules as stale so the user sees
    # "stale", not "wipe".
    if "version" in data and "cache_schema" not in data:
        if data.get("version") == 3:
            data = {
                "cache_schema": CACHE_SCHEMA_VERSION,
                "id_rules": 0,  # forces "stale by rules" amber
                "id_rules_signature": "legacy",
                "remotes": data.get("remotes", {}),
            }
        else:
            return _fresh_cache()  # unknown legacy version → wipe

    # New header validation
    if data.get("cache_schema") != CACHE_SCHEMA_VERSION:
        return _fresh_cache()

    return data
```

Users with `version: 3` get an in-place upgrade with no rescan. The
cache shows up as "stale by rules" until they click Re-extract IDs.

## Extension UX (Cache & Scans pane)

Three states instead of today's two:

| State                  | Color   | Pane copy                                                                                   | Action button               |
|------------------------|---------|---------------------------------------------------------------------------------------------|-----------------------------|
| Fresh                  | green ✓ | "Cache up to date."                                                                         | "Re-scan" (manual)          |
| Stale by rules         | amber ! | "ID extraction rules have changed since this cache was built. Some IDs may be out of date." | **"Re-extract IDs"** (fast) |
| Schema mismatch / wipe | red ✗   | "Cache version is unreadable. A full re-scan is required."                                  | "Re-scan now"               |

Background still has `cache-status` message. Response gains:

```js
{
  ok: true,
  cache_exists: true,
  cache_schema: 1,
  id_rules: 4,
  id_rules_current: 4,
  id_rules_match: true,
  id_rules_signature_match: true,
  // existing fields preserved: remotes, warnings, etc.
}
```

`renderCacheStatus` in `options-cache.js` reads these and picks the
state. Tests live in fixtures or in `options-cache.js` mocks (no need
to extend the JSON corpus for this).
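
The state choice reduces to a small decision function. The sketch below is written in Python as a language-neutral reference; the real consumer is `renderCacheStatus` in JavaScript. `pick_cache_state`, `EXPECTED_SCHEMA`, the state labels, and the separate `no-cache` result for a missing cache are all assumptions, not part of the contract.

```python
EXPECTED_SCHEMA = 1  # assumed to be known to the consumer


def pick_cache_state(resp: dict, expected_schema: int = EXPECTED_SCHEMA) -> str:
    """Map a cache-status response to one of the UX states."""
    if not resp.get("ok") or not resp.get("cache_exists"):
        # Assumption: a missing cache is reported separately from the
        # three states described in the table.
        return "no-cache"
    if resp.get("cache_schema") != expected_schema:
        return "schema-mismatch"  # red
    # Either a counter mismatch or a signature drift marks the cache stale.
    if not (resp.get("id_rules_match") and resp.get("id_rules_signature_match")):
        return "stale-by-rules"  # amber
    return "fresh"  # green


fresh = pick_cache_state({
    "ok": True, "cache_exists": True, "cache_schema": 1,
    "id_rules_match": True, "id_rules_signature_match": True,
})
drifted = pick_cache_state({
    "ok": True, "cache_exists": True, "cache_schema": 1,
    "id_rules_match": True, "id_rules_signature_match": False,
})
wrong_schema = pick_cache_state({
    "ok": True, "cache_exists": True, "cache_schema": 2,
    "id_rules_match": True, "id_rules_signature_match": True,
})
```

Checking the schema before the rules mirrors the behavior matrix, where a schema mismatch wins regardless of the `id_rules` value.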

## Open questions

1. **Where does the user's "id_rules_signature" come from?** The
   signature must be computable from a single canonical text. Easiest:
   sha256 over a sorted JSON dump of `{primary_re_source, compound_re_source,
   fallback_re_source, nohyphen_re_source, bracket_re_source,
   part_res_sources, fc2_handling: "enabled", user_normalizers,
   user_part_patterns}`. Punt on exact shape until step 10.
2. **Should the extension trigger Re-extract IDs?** Yes:
   `chrome.runtime.sendMessage({ type: "reextract-ids" })`, background
   forwards to host, host calls a new `rc-jav.py --reextract` command
   that walks cache.json without re-listing the remote.
3. **Per-remote tracking?** Today `id_rules` would be a single top-level
   integer. Could go per-remote (`remotes[name].id_rules_at_scan`) so
   "Re-extract IDs" can be triggered on a single remote. Recommend
   storing per-remote and computing top-level "stale by rules" as
   "any remote.id_rules_at_scan < id_rules_current". Defer detailed
   design to step 10.
4. **Custom rules in config.json.** When a user adds a normalizer,
   `id_rules_signature` drifts and their cache appears stale. That's
   correct: their `jav_id`s really are out of date. But the global
   `id_rules` integer didn't change. UI copy should distinguish
   "rules updated upstream" from "your custom rules changed".
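
The per-remote recommendation in question 3 can be sketched as follows. The helper names `stale_remotes` and `is_stale_by_rules` are hypothetical, and treating a remote with no recorded `id_rules_at_scan` as stale is an assumption; the detailed design is deferred to step 10.

```python
from typing import List


def stale_remotes(cache: dict, id_rules_current: int) -> List[str]:
    """Return the names of remotes scanned under older extraction rules."""
    stale = []
    for name, remote in cache.get("remotes", {}).items():
        # Assumption: a remote with no recorded value counts as stale.
        if remote.get("id_rules_at_scan", 0) < id_rules_current:
            stale.append(name)
    return sorted(stale)


def is_stale_by_rules(cache: dict, id_rules_current: int) -> bool:
    """Top-level state: stale if any remote lags the current rules."""
    return bool(stale_remotes(cache, id_rules_current))


demo_cache = {
    "remotes": {
        "a": {"id_rules_at_scan": 4},
        "b": {"id_rules_at_scan": 3},
        "c": {},
    }
}
lagging = stale_remotes(demo_cache, 4)
```

Returning the list of lagging remotes, rather than only a boolean, is what would let "Re-extract IDs" target a single remote.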

## Out of scope (step 9)

- Actually implementing the new header: that's step 10.
- Re-extract IDs CLI/host wiring: step 10.
- Bumping `cache_schema` to `1` and shipping new write code: step 10.
- Cache compaction, partial scans, incremental updates: separate work.

## Reference

- Current `CACHE_VERSION` constant: `D:\DEV\Project\rclone-jav\rc-jav.py`
  line 376.
- `load_cache()` / `save_cache()` around line 416 of the same file.
- Extension consumer: `options-cache.js` `renderCacheStatus`, message
  type `cache-status` in `background.js`.
- Shared fixture corpus that exercises the rule set:
  `D:\DEV\Project\rclone-jav\fixtures\`.