Implements the two-tier contract from docs/CACHE_CONTRACT.md (extension
repo, locked at step 9):
  cache_schema         on-disk shape; mismatch -> force rebuild
  id_rules             bumps when extraction rules change
  id_rules_signature   sha256 over canonical rule text; catches drift
                       when the integer bump is forgotten
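
A minimal sketch of how a signature like id_rules_signature can catch
drift when the integer bump is forgotten. The function name
`rules_signature`, the rule names, and the patterns are hypothetical,
and sorting by name is only one possible canonical form; the real
canonical text is whatever rcjav.ids defines.

```python
import hashlib


def rules_signature(rule_texts):
    """Return a sha256 hex digest over rule texts in a canonical order."""
    h = hashlib.sha256()
    for name, text in sorted(rule_texts.items()):
        # NUL separators keep ("ab", "c") distinct from ("a", "bc").
        h.update(name.encode("utf-8") + b"\0" + text.encode("utf-8") + b"\0")
    return h.hexdigest()


# Hypothetical rule texts, not the project's real patterns.
rules_v1 = {"primary": r"[A-Z]{2,5}-\d{3,5}", "part": r"(?:cd|pt)(\d)"}
rules_edited = dict(rules_v1, part=r"(?:cd|pt|disc)(\d)")

sig_v1 = rules_signature(rules_v1)
sig_edited = rules_signature(rules_edited)
```

Editing any pattern changes the digest even if the integer version is
left alone, which is the drift the second tier is meant to surface.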
New constants in rcjav/cache.py:
  CACHE_SCHEMA_VERSION = 1
  ID_RULES_VERSION = 1  (the legacy "version: 3" cache reads as
                        id_rules: 0 after in-place migration)
New helpers:
  rcjav.ids.current_rules_signature()
      Sha256 over the canonical text of every rule that influences
      a jav_id: built-in regexes, BUILTIN_PART_RES, PART_RES (which
      captures user-added part patterns), FC2 handling.
  rcjav.cache.load_cache(signature=None)
      Reads cache.json. Legacy `version: 3` headers get an in-place
      header upgrade with no forced rescan; the cache is stamped as
      `id_rules: 0` + signature "legacy" so it surfaces as
      "stale by rules" in cache_state. Schema mismatch on the new
      header still forces a rebuild.
  rcjav.cache.cache_state(cache, signature)
      Classifies a cache as "fresh" / "stale_by_rules" /
      "schema_mismatch". Drives the three-state extension UX.
  rcjav.cache.stamp_current_rules(cache, signature)
      Updates id_rules and id_rules_signature in place. Called after
      a successful full scan or --reextract.
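
A rough sketch of the three-state classification and the legacy header
upgrade described above. It assumes the header fields are top-level keys
of the cache dict; `upgrade_legacy_header` is a hypothetical name for
the migration step, and the bodies illustrate the contract rather than
reproduce the code in rcjav/cache.py.

```python
CACHE_SCHEMA_VERSION = 1
ID_RULES_VERSION = 1


def upgrade_legacy_header(cache):
    """Rewrite a legacy `version: 3` header in place; entries stay untouched."""
    if cache.get("version") == 3 and "cache_schema" not in cache:
        del cache["version"]
        cache["cache_schema"] = CACHE_SCHEMA_VERSION
        cache["id_rules"] = 0  # guaranteed to differ from ID_RULES_VERSION
        cache["id_rules_signature"] = "legacy"
    return cache


def cache_state(cache, signature):
    """Classify a cache as fresh, stale_by_rules, or schema_mismatch."""
    if cache.get("cache_schema") != CACHE_SCHEMA_VERSION:
        return "schema_mismatch"
    if (cache.get("id_rules") != ID_RULES_VERSION
            or cache.get("id_rules_signature") != signature):
        return "stale_by_rules"
    return "fresh"


def stamp_current_rules(cache, signature):
    """Record the current rule version and signature in place."""
    cache["id_rules"] = ID_RULES_VERSION
    cache["id_rules_signature"] = signature


legacy = upgrade_legacy_header({"version": 3, "remotes": {}})
state_before = cache_state(legacy, "abc123")   # "stale_by_rules"
stamp_current_rules(legacy, "abc123")
state_after = cache_state(legacy, "abc123")    # "fresh"
```

Checking the schema first means a shape mismatch always wins over rule
staleness, which matches "schema mismatch still forces a rebuild".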
New CLI command:
  rc-jav.py --reextract
      Walks `cache["remotes"][r]["files"]` against the live rule set
      and updates `jav_id` in place. Makes no rclone calls, so it is
      a fast path (seconds on a 7k-file cache). Reports
      changed/unchanged/dropped per remote. Stamps current rules
      into the saved cache.

--scan (full, no --scan-since) now also stamps current rules.

--scan --scan-since deliberately does NOT stamp: it only re-walks
recently-modified files, so older entries may still carry jav_ids
from previous rules; the cache stays "stale by rules" until a full
scan or --reextract.
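
A sketch of the --reextract walk, under stated assumptions: each remote's
"files" is taken to be a list of dicts with "name" and "jav_id" keys,
"dropped" is taken to mean an entry whose ID no longer matches any rule,
and `extract_id` here is a simplified stand-in, not rcjav.ids.extract_id.

```python
import re

# Hypothetical stand-in for the real extraction rules.
_ID_RE = re.compile(r"([A-Za-z]{2,5})-?(\d{3,5})")


def extract_id(name):
    m = _ID_RE.search(name)
    return f"{m.group(1).upper()}-{m.group(2)}" if m else None


def reextract(cache):
    """Recompute jav_id for every cached entry; no remote calls involved."""
    report = {}
    for remote, data in cache["remotes"].items():
        counts = {"changed": 0, "unchanged": 0, "dropped": 0}
        for entry in data["files"]:
            new_id = extract_id(entry["name"])
            if new_id == entry.get("jav_id"):
                counts["unchanged"] += 1
            elif new_id is None:
                counts["dropped"] += 1
            else:
                counts["changed"] += 1
            entry["jav_id"] = new_id  # update in place
        report[remote] = counts
    return report


cache = {"remotes": {"r1": {"files": [
    {"name": "abc-123.mp4", "jav_id": "ABC-123"},
    {"name": "xyz456.mkv", "jav_id": "XYZ456"},
    {"name": "notes.txt", "jav_id": "OLD-001"},
]}}}
report = reextract(cache)
```

Because the walk touches only the in-memory cache, its cost scales with
the number of cached entries rather than with remote listing time.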
Verified:
- python rc-jav.py --reextract --format json on the live 7124-file
cache → 0 changes (existing IDs already canonical), cache.json
rewritten with new header
- cache_state on the post-migration cache → "fresh"
- tests + fixtures + --help all pass
Extension-side (host's cache_status response + options-cache.js
three-state UX + Re-extract IDs button) ships in a separate commit
in the extension repo.
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pulls the duplicate-detection and keep-ranking surface out of
rc-jav.py:
  DEFAULT_KEEP_RANKING
  _KEEP_RANKING (module global)
  decide_keep_with_reason
  decide_keep
  find_dupes
  _SUSPICIOUS_MULTIPART_TAIL_RE
  describe_dupe_risks
  find_variant_alerts
Same mutable-rebound pattern as PART_RES: `_KEEP_RANKING` is now
configured via `set_keep_ranking(dict)` rather than a `global` write
in rc-jav.py's main(). Reads happen only inside the module that owns
the binding, so callers never see a stale snapshot.
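
The setter pattern can be sketched as below. The ranking keys and the
`rank_of` reader are hypothetical; only `set_keep_ranking`,
`_KEEP_RANKING`, and `DEFAULT_KEEP_RANKING` are names from this commit,
and their real contents may differ.

```python
# Shape of a module that owns its own configurable global.
DEFAULT_KEEP_RANKING = {"resolution": 3, "size": 2, "name": 1}  # hypothetical
_KEEP_RANKING = dict(DEFAULT_KEEP_RANKING)


def set_keep_ranking(ranking):
    """Rebind the module global; the only write path for callers."""
    global _KEEP_RANKING
    _KEEP_RANKING = dict(ranking)  # copy so later caller edits do not leak in


def rank_of(key):
    """Hypothetical reader: looks up the global at call time, never a copy."""
    return _KEEP_RANKING.get(key, 0)


before = rank_of("size")
set_keep_ranking({"size": 9})
after = rank_of("size")
```

Since every read resolves `_KEEP_RANKING` inside the owning module at
call time, a rebind through the setter is visible to the next call.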
rc-jav.py: 1972 → 1763 lines (209 extracted).
rcjav/dupes.py: 244 lines.
Verified:
- python rc-jav.py --help → ok
- python fixtures/run.py → 17/17 cases pass
- python -m unittest tests.test_rules → 5/5 OK
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Pulls CACHE_PATH, CACHE_VERSION, CACHE_STALE_HOURS, load_cache,
save_cache, cache_age_hours, and fmt_age out of rc-jav.py and into a
new self-contained module. No behavior change.
rc-jav.py: 2019 → 1972 lines.
The new module's CACHE_PATH is
`Path(__file__).resolve().parents[1] / "cache.json"`, which keeps the
file at the repo root next to rc-jav.py (one directory above the
package). That matches the legacy location,
`Path(__file__).resolve().parent / "cache.json"`.
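
A quick check of that path arithmetic, using pure paths and a
hypothetical /repo checkout location so nothing touches the filesystem
(which is also why `resolve()` is left out here):

```python
from pathlib import PurePosixPath

# Hypothetical checkout location; only the relative layout matters.
script = PurePosixPath("/repo/rc-jav.py")
module = PurePosixPath("/repo/rcjav/cache.py")

legacy_path = script.parent / "cache.json"    # old location, from the script
new_path = module.parents[1] / "cache.json"   # new location, from the package
```

`parents[0]` is the package directory and `parents[1]` is the repo
root, so both expressions name the same file.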
rcjav/__init__.py now re-exports the cache public surface alongside
the model and ids surface.
Verified:
- python rc-jav.py --help → ok
- python fixtures/run.py → 17/17 cases pass
- python -m unittest tests.test_rules → 5/5 OK
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Carves the first slice out of the monolithic rc-jav.py (now 2017
lines, was 2230). Two new modules:
  rcjav/model.py  FileEntry dataclass, the one shared shape that
                  every other submodule will need.
  rcjav/ids.py    Single source of truth for everything that
                  influences a FileEntry.jav_id: PRIMARY_ID_RE,
                  FALLBACK_ID_RE, COMPOUND_ID_RE, BUILTIN_PART_RES,
                  configure_part_patterns, detect_part,
                  detect_part_from_stem, part_key, extract_id,
                  normalize_id, describe_id_match, expand_range,
                  plus the supporting "private" regexes
                  (_BRACKET_ID_RE, _RESOLUTION_TAG_RE, etc.) that
                  other code in rc-jav.py still reads.
rcjav/__init__.py re-exports the public surface so future external
consumers can `from rcjav import extract_id` without caring which
submodule it lives in.
rc-jav.py drops the inline ID block and pulls everything from
rcjav.ids via a single import statement. PART_RES is intentionally
NOT imported — it's mutated by configure_part_patterns at runtime, so
a captured top-level reference would go stale. A small helper
`_current_part_res()` reads it dynamically via `_rcjav_ids.PART_RES`.
fixtures/run.py fix: synthesized importlib module name changed from
"rcjav" (which now collides with the real package directory) to
"rcjav_script". Also prepends ROOT to sys.path so rc-jav.py's
`from rcjav.model import …` resolves when run as
`python fixtures/run.py`.
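
A sketch of loading a hyphenated script under a module name that does
not collide with a package. `load_script` is a hypothetical helper and
the temp file stands in for rc-jav.py; fixtures/run.py may be structured
differently.

```python
import importlib.util
import sys
import tempfile
from pathlib import Path


def load_script(path, module_name):
    """Import a script file under an explicit, non-colliding module name."""
    root = str(Path(path).resolve().parent)
    if root not in sys.path:
        # Lets the script's own package imports resolve from its directory.
        sys.path.insert(0, root)
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    sys.modules[module_name] = module
    spec.loader.exec_module(module)
    return module


with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "rc-jav.py"
    script.write_text("ANSWER = 42\n", encoding="utf-8")
    mod = load_script(script, "rcjav_script")
```

Registering the script as "rcjav_script" leaves the name "rcjav" free
for the real package, so imports of the package inside the script are
not shadowed by the script itself.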
Verified:
- python rc-jav.py --help → usage banner prints
- python fixtures/run.py → 17/17 cases pass
- python -m unittest tests.test_rules → 5/5 OK
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>