Carves the first slice out of the monolithic rc-jav.py (now 2017
lines, was 2230). Two new modules:
  rcjav/model.py  FileEntry dataclass, the one shared shape that
                  every other submodule will need.
  rcjav/ids.py    Single source of truth for everything that
                  influences a FileEntry.jav_id: PRIMARY_ID_RE,
                  FALLBACK_ID_RE, COMPOUND_ID_RE, BUILTIN_PART_RES,
                  configure_part_patterns, detect_part,
                  detect_part_from_stem, part_key, extract_id,
                  normalize_id, describe_id_match, expand_range,
                  plus the supporting "private" regexes
                  (_BRACKET_ID_RE, _RESOLUTION_TAG_RE, etc.) that
                  other code in rc-jav.py still reads.
rcjav/__init__.py re-exports the public surface so future external
consumers can `from rcjav import extract_id` without caring which
submodule it lives in.
rc-jav.py drops the inline ID block and pulls everything from
rcjav.ids via a single import statement. PART_RES is intentionally
NOT imported — it's mutated by configure_part_patterns at runtime, so
a captured top-level reference would go stale. A small helper
`_current_part_res()` reads it dynamically via `_rcjav_ids.PART_RES`.
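
The stale-reference hazard can be reproduced in isolation. The sketch below uses a throwaway module object standing in for rcjav.ids and assumes configure_part_patterns rebinds the PART_RES global to a new list rather than mutating it in place; the pattern values and function bodies are hypothetical.

```python
import types

# Stand-in for rcjav.ids; the real module's contents are not reproduced here.
ids = types.ModuleType("fake_ids")
ids.PART_RES = ["builtin"]


def configure_part_patterns(extra):
    # Assumption: the real function rebinds the module global to a new list.
    ids.PART_RES = ids.PART_RES + list(extra)


# Equivalent of `from rcjav.ids import PART_RES`: the binding is copied once.
captured = ids.PART_RES


def current_part_res():
    # Equivalent of the _current_part_res() helper: attribute lookup at call time.
    return ids.PART_RES


configure_part_patterns(["user-pattern"])

print(captured)            # still the list that existed at import time
print(current_part_res())  # reflects the reconfigured patterns
```

If the real function mutated the list in place instead, the captured reference would stay valid; the dynamic lookup is correct under either behavior.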
fixtures/run.py fix: synthesized importlib module name changed from
"rcjav" (which now collides with the real package directory) to
"rcjav_script". Also prepends ROOT to sys.path so rc-jav.py's
`from rcjav.model import …` resolves when run as
`python fixtures/run.py`.
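
A minimal sketch of how a runner can load a hyphenated script under a synthesized module name, using only standard importlib calls. The helper name load_script is hypothetical, and using the script's own directory as ROOT is an assumption about the repository layout.

```python
import importlib.util
import sys
from pathlib import Path


def load_script(path, module_name="rcjav_script"):
    """Import a script whose filename (e.g. rc-jav.py) is not a valid module name."""
    path = Path(path).resolve()
    root = str(path.parent)  # assumption: the script sits at the repo root
    # Prepend the root so the script's `from rcjav... import` lines resolve.
    if root not in sys.path:
        sys.path.insert(0, root)
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    # Register under the synthesized name, not "rcjav", so the real package
    # directory is not shadowed in sys.modules.
    sys.modules[module_name] = module
    spec.loader.exec_module(module)
    return module
```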
Verified:
- python rc-jav.py --help → usage banner prints
- python fixtures/run.py → 17/17 cases pass
- python -m unittest tests.test_rules → 5/5 OK
Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
Shared JAV ID fixture corpus
JSON cases shared between the Python rc-jav.py CLI and the browser
extension at D:\DEV\Extensions\Production\rclone-jav\. Each side
reads the cases relevant to its own extraction surface.
Files
| File | Domain | Consumer | Notes |
|---|---|---|---|
| `filename-extraction.json` | filename | Python `extract_id(name)` | Has `#partN` expectations for multipart files |
| `query-extraction.json` | query | Extension `content.js` `normalizeId` | Looser context; extension never emits part suffix |
| `shared-normalization.json` | shared | BOTH | Contract: any mismatch here is a bug, not a fixture issue |
All files share the same shape:
{
"version": 1,
"domain": "…",
"description": "…",
"case_schema": { … },
"cases": [
{ "name": "…", "input": "…", "expected": "…" }
]
}
expected: null means "no ID should be detected".
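
The shape above can be checked mechanically. The following is a sketch only: check_fixture is a hypothetical helper, the sample case is made up for illustration, and the rule that expected must be a string or null is an assumption drawn from the description above rather than from case_schema.

```python
import json

REQUIRED_TOP = ("version", "domain", "description", "case_schema", "cases")
REQUIRED_CASE = ("name", "input", "expected")


def check_fixture(doc):
    """Return a list of shape problems; an empty list means the file conforms."""
    problems = [f"missing top-level key: {k}" for k in REQUIRED_TOP if k not in doc]
    for i, case in enumerate(doc.get("cases", [])):
        for k in REQUIRED_CASE:
            if k not in case:
                problems.append(f"case {i}: missing {k}")
        # JSON null arrives as None, meaning "no ID should be detected".
        exp = case.get("expected")
        if exp is not None and not isinstance(exp, str):
            problems.append(f"case {i}: expected must be a string or null")
    return problems


# Illustrative document, not taken from the real corpus.
sample = json.loads("""
{
  "version": 1,
  "domain": "filename",
  "description": "illustrative only",
  "case_schema": {},
  "cases": [
    {"name": "no id present", "input": "holiday.mp4", "expected": null}
  ]
}
""")
print(check_fixture(sample))
```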
Running the Python side
python fixtures/run.py
The runner imports rc-jav.py in place, exercises extract_id against
filename-extraction.json, and normalize_id against
shared-normalization.json. Exit code is non-zero on any failure.
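
The core loop of such a runner can be sketched as follows. This is not the contents of fixtures/run.py: run_cases, report, and fake_extract are hypothetical names, and fake_extract is a toy stand-in for the real extract_id.

```python
def run_cases(cases, fn):
    """Apply fn to each case input and collect (name, expected, actual) mismatches."""
    failures = []
    for case in cases:
        actual = fn(case["input"])
        if actual != case["expected"]:
            failures.append((case["name"], case["expected"], actual))
    return failures


def report(failures, total):
    """Print one line per failure plus a summary; return a process exit code."""
    for name, expected, actual in failures:
        print(f"FAIL {name}: expected {expected!r}, got {actual!r}")
    print(f"{total - len(failures)}/{total} cases pass")
    return 1 if failures else 0


def fake_extract(name):
    # Toy stand-in: treats any dashed stem as an ID. The real rules live in rcjav.ids.
    stem = name.rsplit(".", 1)[0].upper()
    return stem if "-" in stem else None


demo_cases = [
    {"name": "dashed id", "input": "abc-123.mp4", "expected": "ABC-123"},
    {"name": "no id", "input": "notes.txt", "expected": None},
]
failures = run_cases(demo_cases, fake_extract)
exit_code = report(failures, len(demo_cases))  # a runner would pass this to sys.exit
```

Because only the case name is printed on failure, a descriptive name is what makes a failing line actionable.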
Running the extension side
No automated runner today. content.js lives inside an IIFE that the
browser injects into pages, so importing it from Node would require
either an extraction refactor or a duplicated copy of the regex. Until
that lands, treat query-extraction.json and shared-normalization.json
as the canonical specification: if you touch ID_RE_DASHED,
ID_RE_UNDASHED, or BUILTIN_ID_NORMALIZERS in content.js, eyeball
this corpus and confirm the cases still describe expected behavior.
Adding a case
- Pick the file matching the surface you're testing.
- Append a `{ "name", "input", "expected" }` entry. Keep `name` descriptive; it's the only label shown when the runner fails.
- If the case exercises a guarantee both sides must honor, add it to `shared-normalization.json` as well.
- Run `python fixtures/run.py` to confirm Python still passes.
Known cross-side divergences (intentional)
These are NOT bugs — they reflect the different surfaces each side extracts from. Recorded here so future contributors don't try to "fix" them.
- `FC2PPV1841460` compact form (no dashes). The extension's `BUILTIN_ID_NORMALIZERS` in `content.js` rewrites this to `FC2-PPV-1841460` when seen in page titles. Python `extract_id` does NOT, because the compact form doesn't realistically appear in filenames on disk. Hence the case lives in `query-extraction.json` only, not in `filename-extraction.json` or `shared-normalization.json`.
If a case belongs to one side's contract but not the other's, file it
under the specific domain (filename- or query-) — not under
shared-.
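
For readers on the Python side, the query-only rewrite described above can be illustrated as below. This is a Python rendering for explanation only: the real rule lives in `BUILTIN_ID_NORMALIZERS` in `content.js`, and the regex and function name here are assumptions, not a port of that code.

```python
import re

# Hypothetical pattern; the extension's actual normalizer is not reproduced here.
_COMPACT_FC2_RE = re.compile(r"^FC2PPV(\d+)$", re.IGNORECASE)


def normalize_query_id(raw):
    """Query-side rewrite: compact FC2PPV<digits> becomes FC2-PPV-<digits>."""
    m = _COMPACT_FC2_RE.match(raw.strip())
    # Anything that is not the compact form is returned unchanged.
    return f"FC2-PPV-{m.group(1)}" if m else raw


print(normalize_query_id("FC2PPV1841460"))  # FC2-PPV-1841460
```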
Ownership
This directory lives in the Python repo only because the Python repo is the more stable root. Conceptually it's joint property of both codebases. Don't add anything Python-specific to the JSON files — keep them tool-neutral.