Commit Graph

13 Commits

Author SHA1 Message Date
admin 8d6bdb81af Add Node-side fixture runner — both sides now exercise the corpus
Mirrors `content.js` normalizeId() in a self-contained
`fixtures/run-node.mjs`. Loads `query-extraction.json` and
`shared-normalization.json` and asserts each case the same way the
Python runner does.

content.js can't be imported directly — it lives inside an injected
IIFE in the extension — so the runner duplicates the regexes
(ID_RE_DASHED, ID_RE_UNDASHED, BUILTIN_ID_NORMALIZERS). An inline
comment and a README update both flag that the copies must be kept in
sync.

Why this matters: `shared-normalization.json` now actually catches
cross-side drift. A case that passes one side but fails the other is
the canary — without a Node runner, the contract was aspirational.

Verified:
  $ node fixtures/run-node.mjs
  query-extraction.json     -> normalizeId (10 cases): 10 passed
  shared-normalization.json -> normalizeId (5 cases):  5 passed
  OK: all 15 cases passed
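
A minimal Python sketch of the case-by-case assertion loop that both
runners are described as performing. The fixture field names (`input`,
`expected`), the regex, and the zero-padding rule are assumptions made
for illustration; they are not the actual contents of the fixture files
or of normalizeId().

```python
import json
import re

# Hypothetical stand-in for the real ID regexes (assumption).
ID_RE = re.compile(r"^([A-Za-z]+)-?(\d+)$")


def normalize_id(raw):
    """Return a dashed, upper-cased ID, or None when nothing matches."""
    match = ID_RE.match(raw.strip())
    if not match:
        return None
    prefix, digits = match.groups()
    # Zero-padding to three digits is an illustrative rule only.
    return f"{prefix.upper()}-{int(digits):03d}"


def run_cases(cases):
    """Check every case; return (passed_count, list_of_failures)."""
    failures = []
    for case in cases:
        got = normalize_id(case["input"])
        if got != case["expected"]:
            failures.append((case["input"], case["expected"], got))
    return len(cases) - len(failures), failures


# Inline sample in place of a fixture file (hypothetical shape).
SAMPLE = json.loads(
    '[{"input": "abc123", "expected": "ABC-123"},'
    ' {"input": "ABC-7", "expected": "ABC-007"},'
    ' {"input": "???", "expected": null}]'
)

passed, failures = run_cases(SAMPLE)
print(f"{passed} passed, {len(failures)} failed")
```

Running the same case list through a second implementation and
comparing the failure lists is what would surface cross-side drift.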

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 11:18:52 +02:00
admin b9a24b3fb5 Step 11: benchmark host fast-path, decision = keep
Adds benchmarks/host-fast-path.py and benchmarks/README.md.

The benchmark compares two paths for a cached single-ID search:
  1. fast-path: in-process dict walk inside the native host
     (handle_cached_search_fast in rcjav-host.py)
  2. subprocess: shell out to `rc-jav.py --search ID --cache --format json`

Idle baseline against the live 7124-file cache (5 queries × 5 iter):

  fast-path:   median 0.46ms  p95 0.61ms  max 0.72ms
  subprocess:  median 919ms   p95 1233ms  max 1385ms
  median speedup: 2000x
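
The median / p95 / max columns above can be produced with a small
helper like the one below. This is a sketch, not the contents of
benchmarks/host-fast-path.py: the function names and the nearest-rank
percentile method are assumptions.

```python
import statistics
import time


def time_calls(fn, iterations):
    """Call fn repeatedly and return per-call wall time in milliseconds."""
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return samples


def summarize(samples_ms):
    """Median, nearest-rank 95th percentile, and max of the samples."""
    ordered = sorted(samples_ms)
    # Nearest-rank p95: ceil(0.95 * n), computed with integer arithmetic.
    rank = (95 * len(ordered) + 99) // 100
    return {
        "median": statistics.median(ordered),
        "p95": ordered[rank - 1],
        "max": ordered[-1],
    }


# Deterministic example input: 1.0 ms .. 20.0 ms.
stats = summarize([float(n) for n in range(1, 21)])
print(stats)

# Timing a trivial callable, just to show the call shape.
measured = summarize(time_calls(lambda: sum(range(1000)), 25))
print(sorted(measured))
```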

Decision: keep the fast path. The ~920ms subprocess cost is dominated
by Python interpreter startup plus parsing the 1.3MB cache.json. That
cost is structural: it is paid even when the machine is idle, not just
while a scan is running. The "Python actively scanning" condition from
the original roadmap doesn't change the verdict; it would only make the
subprocess path slower while leaving the in-process path unaffected
(the fast path doesn't touch the scanning process).

The fast path is already correctly scoped — bails out for wildcards,
ranges, name searches, and --quick mode. Narrowing further would just
push more queries through the slow path with no upside.

Possible follow-up (not in scope here): memoize _load_host_cache with
mtime-based invalidation so the fast path doesn't reparse cache.json
on every call. Current per-call median (0.46ms) is already fast enough
that this is optional.
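
One way the mtime-based memoization could look, as a sketch only. The
function name `load_cache_memoized` and the module-level memo dict are
hypothetical; the real `_load_host_cache` is not shown in this log.

```python
import json
import os
import tempfile

# Single-entry memo: path, the mtime seen at parse time, parsed data.
_cache_memo = {"path": None, "mtime_ns": None, "data": None}


def load_cache_memoized(path):
    """Reparse the JSON file only when its mtime has changed."""
    mtime_ns = os.stat(path).st_mtime_ns
    if _cache_memo["path"] == path and _cache_memo["mtime_ns"] == mtime_ns:
        return _cache_memo["data"]
    with open(path, "r", encoding="utf-8") as fh:
        data = json.load(fh)
    _cache_memo.update(path=path, mtime_ns=mtime_ns, data=data)
    return data


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cache.json")
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"files": 1}, fh)

    first = load_cache_memoized(path)
    second = load_cache_memoized(path)
    same_object = first is second  # second call is served from the memo

    # Rewrite the file and push mtime forward explicitly so the demo
    # does not depend on filesystem timestamp granularity.
    with open(path, "w", encoding="utf-8") as fh:
        json.dump({"files": 2}, fh)
    st = os.stat(path)
    os.utime(path, ns=(st.st_atime_ns, st.st_mtime_ns + 1_000_000_000))

    third = load_cache_memoized(path)

print(same_object, third)
```

On filesystems with coarse timestamps, a rewrite within the same tick
would go unnoticed; comparing file size alongside mtime is a common way
to narrow that window.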

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-23 11:12:33 +02:00
admin 66f82eb214 Add --print-rules-info CLI flag for host cache freshness lookup 2026-05-22 22:08:05 +02:00
admin 33c495ad57 Step 10j (Python side): cache contract + --reextract command
Implements the two-tier contract from docs/CACHE_CONTRACT.md (extension
repo, locked at step 9):

  cache_schema       on-disk shape; mismatch -> force rebuild
  id_rules           bumps when extraction rules change
  id_rules_signature sha256 over canonical rule text; catches drift
                     when the integer bump is forgotten

New constants in rcjav/cache.py:

  CACHE_SCHEMA_VERSION = 1
  ID_RULES_VERSION = 1     (the legacy "version: 3" cache reads as
                            id_rules: 0 after in-place migration)

New helpers:

  rcjav.ids.current_rules_signature()
      SHA-256 over the canonical text of every rule that influences
      a jav_id: built-in regexes, BUILTIN_PART_RES, PART_RES (which
      captures user-added part patterns), and FC2 handling.

  rcjav.cache.load_cache(signature=None)
      Reads cache.json. Legacy `version: 3` headers get an in-place
      header upgrade with no forced rescan; the cache is stamped as
      `id_rules: 0` + signature "legacy" so it surfaces as
      "stale by rules" in cache_state. Schema mismatch on the new
      header still forces a rebuild.

  rcjav.cache.cache_state(cache, signature)
      Classifies a cache as "fresh" / "stale_by_rules" /
      "schema_mismatch". Drives the three-state extension UX.

  rcjav.cache.stamp_current_rules(cache, signature)
      Updates id_rules and id_rules_signature in place. Called after
      a successful full scan or --reextract.
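
The helpers above can be sketched as follows. The header field names
and the three state strings come from this message; the comparison
order, the newline-joined canonical form, and the sample rule strings
are assumptions, and `rules_signature` is a hypothetical stand-in for
current_rules_signature().

```python
import hashlib

CACHE_SCHEMA_VERSION = 1
ID_RULES_VERSION = 1


def rules_signature(rule_texts):
    """SHA-256 over rule texts joined in order (assumed canonical form)."""
    canonical = "\n".join(rule_texts)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def cache_state(cache, signature):
    """Classify a cache header as fresh, stale_by_rules, or schema_mismatch."""
    if cache.get("cache_schema") != CACHE_SCHEMA_VERSION:
        return "schema_mismatch"
    if (cache.get("id_rules") != ID_RULES_VERSION
            or cache.get("id_rules_signature") != signature):
        return "stale_by_rules"
    return "fresh"


def stamp_current_rules(cache, signature):
    """Record the current rule version and signature in place."""
    cache["id_rules"] = ID_RULES_VERSION
    cache["id_rules_signature"] = signature


# Placeholder rule texts, not the real patterns.
sig = rules_signature([r"[A-Z]+-\d+", r"part(\d)"])

# A migrated legacy cache: new schema header, rules marked as legacy.
cache = {"cache_schema": 1, "id_rules": 0, "id_rules_signature": "legacy"}
before = cache_state(cache, sig)
stamp_current_rules(cache, sig)
after = cache_state(cache, sig)

# Editing a rule without bumping the integer still changes the signature.
drifted = cache_state(cache, rules_signature([r"[A-Z]+-\d+", r"cd(\d)"]))
print(before, after, drifted)
```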

New CLI command:

  rc-jav.py --reextract

Walks `cache["remotes"][r]["files"]` against the live rule set and
updates `jav_id` in place. No rclone calls — fast path (seconds on
a 7k-file cache). Reports changed/unchanged/dropped per remote.
Stamps current rules into the saved cache.
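
A sketch of that walk, under stated assumptions: each remote's `files`
is taken to be a list of dicts with `name` and `jav_id` keys, "dropped"
is taken to mean the live rules no longer yield an ID, and the regex is
a placeholder rather than the real rule set.

```python
import re

# Placeholder extraction rule (assumption, not the rcjav.ids rules).
ID_RE = re.compile(r"([A-Za-z]{2,6})-?(\d{2,5})")


def extract_id(name):
    """Return PREFIX-digits for the first match, or None."""
    match = ID_RE.search(name)
    if not match:
        return None
    prefix, digits = match.groups()
    return f"{prefix.upper()}-{digits}"


def reextract(cache):
    """Recompute jav_id for every cached file; no remote calls involved."""
    report = {}
    for remote, payload in cache["remotes"].items():
        counts = {"changed": 0, "unchanged": 0, "dropped": 0}
        for entry in payload["files"]:
            old_id = entry.get("jav_id")
            new_id = extract_id(entry["name"])
            if new_id == old_id:
                counts["unchanged"] += 1
            elif new_id is None:
                counts["dropped"] += 1
            else:
                counts["changed"] += 1
            entry["jav_id"] = new_id  # update in place
        report[remote] = counts
    return report


cache = {
    "remotes": {
        "demo": {
            "files": [
                {"name": "abc-123.mp4", "jav_id": "ABC-123"},
                {"name": "xyz045.mkv", "jav_id": "XYZ-45"},
                {"name": "notes.txt", "jav_id": "OLD-001"},
            ]
        }
    }
}

report = reextract(cache)
print(report)
```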

--scan (full, no --scan-since) now also stamps current rules.
--scan --scan-since deliberately does NOT stamp: it only re-walks
recently-modified files, so older entries may still carry jav_ids
from previous rules; cache stays "stale by rules" until a full scan
or --reextract.

Verified:
  - python rc-jav.py --reextract --format json on the live 7124-file
    cache → 0 changes (existing IDs already canonical), cache.json
    rewritten with new header
  - cache_state on the post-migration cache → "fresh"
  - tests + fixtures + --help all pass

Extension-side (host's cache_status response + options-cache.js
three-state UX + Re-extract IDs button) ships in a separate commit
in the extension repo.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 22:07:13 +02:00
admin 1cc2c38128 Step 10i: rc-jav.py becomes a thin shim; main() lives in rcjav/cli.py
The real entrypoint moved into rcjav/cli.py (845 lines: imports + the
remaining top-level glue + collectors + main()). rc-jav.py is now a
25-line shim that does:

  - `from rcjav import *` to re-export the package surface for callers
    that load this script via importlib.spec_from_file_location
    (tests/test_rules.py, fixtures/run.py, the native-messaging host
    via importlib).
  - `from rcjav.cli import main` and call it under `__main__`.
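
For context, a hyphenated script such as rc-jav.py cannot be imported
with a plain `import` statement, which is why callers go through
importlib. The sketch below loads a throwaway script the same way; the
file name `rc-demo.py` and module name `rcdemo_script` are made up for
the demo.

```python
import importlib.util
import os
import sys
import tempfile


def load_script(path, module_name):
    """Load a .py file under an explicit module name."""
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    # Register before executing so imports inside the script can find it.
    sys.modules[module_name] = module
    spec.loader.exec_module(module)
    return module


with tempfile.TemporaryDirectory() as tmp:
    script = os.path.join(tmp, "rc-demo.py")
    with open(script, "w", encoding="utf-8") as fh:
        fh.write(
            "def main():\n"
            "    return 'ran main'\n"
            "\n"
            "if __name__ == '__main__':\n"
            "    main()\n"
        )
    mod = load_script(script, "rcdemo_script")
    result = mod.main()

print(mod.__name__, result)
```

Because the module is loaded under a synthesized name, the
`__main__` guard does not fire; the caller invokes `main()` explicitly.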

Verified all four entry points:
  - python rc-jav.py --help              → ok (legacy CLI invocation)
  - python -m rcjav.cli --help           → ok (package-direct)
  - python fixtures/run.py               → 17/17 cases pass
  - python -m unittest tests.test_rules  → 5/5 OK

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 22:01:52 +02:00
admin fb5700cdab Step 10h: extract renderers + file outputs into rcjav/output.py 2026-05-22 22:00:22 +02:00
admin 550482a7a2 Step 10g: extract library issues + renaming into rcjav/library.py 2026-05-22 21:54:49 +02:00
admin 90054e4d0b Step 10f: extract rclone subprocess wrappers into rcjav/rclone_io.py 2026-05-22 21:53:36 +02:00
admin 41f7c80f1b Step 10e: extract WinCatalog ingest into rcjav/catalog.py 2026-05-22 21:51:09 +02:00
admin 8d636ec633 Step 10d: extract dupes/keep-ranking into rcjav/dupes.py
Pulls the duplicate-detection and keep-ranking surface out of
rc-jav.py:

  DEFAULT_KEEP_RANKING
  _KEEP_RANKING (module global)
  decide_keep_with_reason
  decide_keep
  find_dupes
  _SUSPICIOUS_MULTIPART_TAIL_RE
  describe_dupe_risks
  find_variant_alerts

Same mutable-rebound pattern as PART_RES: `_KEEP_RANKING` is now
configured via `set_keep_ranking(dict)` rather than a `global` write
in rc-jav.py's main(). Reads happen only inside the module that owns
the binding, so callers never see a stale snapshot.

rc-jav.py: 1972 → 1763 lines (209 extracted).
rcjav/dupes.py: 244 lines.

Verified:
  - python rc-jav.py --help              → ok
  - python fixtures/run.py               → 17/17 cases pass
  - python -m unittest tests.test_rules  → 5/5 OK

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 21:49:14 +02:00
admin f03d032336 Step 10c: extract cache I/O into rcjav/cache.py
Pulls CACHE_PATH, CACHE_VERSION, CACHE_STALE_HOURS, load_cache,
save_cache, cache_age_hours, and fmt_age out of rc-jav.py and into a
new self-contained module. No behavior change.

rc-jav.py: 2019 → 1972 lines.

The new module's `CACHE_PATH = Path(__file__).resolve().parents[1] /
"cache.json"` keeps the file at the repo root next to rc-jav.py (one
directory above the package), matching the legacy `Path(__file__).
resolve().parent / "cache.json"` location.
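
The equivalence of the two expressions can be checked with pure paths.
The `/repo` prefix is a placeholder, and `.resolve()` is omitted
because pure paths do not touch the filesystem.

```python
from pathlib import PurePosixPath

# New location of the constant: inside the package.
module_file = PurePosixPath("/repo/rcjav/cache.py")
# Legacy location: the script at the repo root.
script_file = PurePosixPath("/repo/rc-jav.py")

# parents[0] is /repo/rcjav, parents[1] is /repo.
new_cache_path = module_file.parents[1] / "cache.json"
old_cache_path = script_file.parent / "cache.json"

print(new_cache_path, old_cache_path)
```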

rcjav/__init__.py now re-exports the cache public surface alongside
the model and ids surface.

Verified:
  - python rc-jav.py --help              → ok
  - python fixtures/run.py               → 17/17 cases pass
  - python -m unittest tests.test_rules  → 5/5 OK

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 21:46:20 +02:00
admin ba57b7fd21 Step 10a + 10b: scaffold rcjav/ package, extract ID rules
Carves the first slice out of the monolithic rc-jav.py (now 2017
lines, was 2230). Two new modules:

  rcjav/model.py    FileEntry dataclass — the one shared shape that
                    every other submodule will need.
  rcjav/ids.py      Single source of truth for everything that
                    influences a FileEntry.jav_id: PRIMARY_ID_RE,
                    FALLBACK_ID_RE, COMPOUND_ID_RE, BUILTIN_PART_RES,
                    configure_part_patterns, detect_part,
                    detect_part_from_stem, part_key, extract_id,
                    normalize_id, describe_id_match, expand_range,
                    plus the supporting "private" regexes
                    (_BRACKET_ID_RE, _RESOLUTION_TAG_RE, etc.) that
                    other code in rc-jav.py still reads.

rcjav/__init__.py re-exports the public surface so future external
consumers can `from rcjav import extract_id` without caring which
submodule it lives in.

rc-jav.py drops the inline ID block and pulls everything from
rcjav.ids via a single import statement. PART_RES is intentionally
NOT imported — it's mutated by configure_part_patterns at runtime, so
a captured top-level reference would go stale. A small helper
`_current_part_res()` reads it dynamically via `_rcjav_ids.PART_RES`.
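
The staleness problem can be reproduced in a few lines. This sketch
builds a throwaway module in memory and assumes that
configure_part_patterns rebinds PART_RES rather than mutating the list
in place; the module name `demo_ids` and the pattern strings are
placeholders.

```python
import types

# In-memory stand-in for a module that owns a rebindable global.
ids = types.ModuleType("demo_ids")
exec(
    "PART_RES = ['builtin']\n"
    "def configure_part_patterns(extra):\n"
    "    global PART_RES\n"
    "    PART_RES = PART_RES + list(extra)\n",
    ids.__dict__,
)

# Equivalent to `from demo_ids import PART_RES`: copies the reference.
captured = ids.PART_RES


def current_part_res():
    """Attribute lookup at call time always sees the latest binding."""
    return ids.PART_RES


ids.configure_part_patterns(["user"])

print(captured, current_part_res())
```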

fixtures/run.py fix: synthesized importlib module name changed from
"rcjav" (which now collides with the real package directory) to
"rcjav_script". Also prepends ROOT to sys.path so rc-jav.py's
`from rcjav.model import …` resolves when run as
`python fixtures/run.py`.

Verified:
  - python rc-jav.py --help              → usage banner prints
  - python fixtures/run.py               → 17/17 cases pass
  - python -m unittest tests.test_rules  → 5/5 OK

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
2026-05-22 21:43:57 +02:00
admin e029e898e9 Initial snapshot before step 10 package split 2026-05-22 21:39:09 +02:00