rc-jav
Read-only duplicate scanner for JAV files across rclone remotes. Groups files by JAV ID (e.g. SSIS-001) and reports which copy to keep based on priority rules.
Priority rules
- Video files inside configured VIP folders win first. Default VIP folder: ClearJAV.
- If no VIP-folder video exists, Source always wins regardless of resolution/size.
- .ts files rank below other video containers, even when the transport-stream copy is larger.
- If no Source copy exists in the group, largest file size wins among the remaining Targets.
- Suggestions only: the script never deletes. Cleanup is manual.
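The ordering above can be pictured as a sort key. This is a minimal sketch, not the script's actual code: the Copy class, keep_rank, and pick_keep are hypothetical names, and placing the .ts penalty after the Source check is an assumption, since the rules do not say how the two interact.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

VIP_FOLDERS = ("ClearJAV",)  # default VIP folder named in the rules above


@dataclass
class Copy:
    """Hypothetical record for one file in a duplicate group."""
    path: str
    size: int
    is_source: bool


def keep_rank(copy, vip_folders=VIP_FOLDERS):
    """Build a tuple where a larger value means a stronger KEEP candidate."""
    p = PurePosixPath(copy.path)
    in_vip = any(part in vip_folders for part in p.parts[:-1])
    is_ts = p.suffix.lower() == ".ts"
    # Assumed order: VIP folder, then Source, then non-.ts container, then size.
    return (in_vip, copy.is_source, not is_ts, copy.size)


def pick_keep(copies):
    """Return the copy that would be suggested as KEEP under this sketch."""
    return max(copies, key=keep_rank)
```

With two Target copies, a 5 GB .mp4 outranks a 9 GB .ts under this key, because the container check is compared before size.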
ID matching
Filename stem is matched against:
- Primary: ^([A-Za-z]+)-(\d+) matches SSIS-001, MIDV-123, ABP-456
- Compound: ^(\w+(?:-\w+)+)-(\d+) matches FC2-PPV-4894535, HEYZO-HD-1234
- Fallback: ^([A-Za-z0-9]+)-(\d+) matches 1pondo-123, carib-456
IDs are normalized to uppercase with leading zeros stripped from the number (so ssis-001 == SSIS-1 == SSIS-001). Anything after the ID (for example " - Actress [1080p]") is ignored for matching.
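A minimal sketch of the matching and normalization described here, assuming the three patterns are tried in the listed order and that part suffixes are handled separately. The names ID_PATTERNS and normalize_id are illustrative, not taken from the script.

```python
import re

# The three patterns from the list above, tried in order (assumed).
ID_PATTERNS = [
    re.compile(r"^([A-Za-z]+)-(\d+)"),      # primary
    re.compile(r"^(\w+(?:-\w+)+)-(\d+)"),   # compound
    re.compile(r"^([A-Za-z0-9]+)-(\d+)"),   # fallback
]


def normalize_id(stem):
    """Return the normalized ID for a filename stem, or None if nothing matches."""
    for pattern in ID_PATTERNS:
        m = pattern.match(stem)
        if m:
            prefix, number = m.groups()
            # Uppercase the prefix; int() drops leading zeros from the number.
            return f"{prefix.upper()}-{int(number)}"
    return None


print(normalize_id("ssis-001 - Actress [1080p]"))  # SSIS-1
print(normalize_id("FC2-PPV-4894535"))             # FC2-PPV-4894535
print(normalize_id("holiday video"))               # None
```

Because the patterns are anchored at the start and matched with match(), trailing text after the ID never affects the result.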
Part-suffix handling
Multi-part files (_1, _2, -1, -2, _A, _B, .1of4, (1), -pt1, -part1, -cd1, -disc1, trailing N) are normalized as {ID}#partN so they do not collide as false duplicates. Searching the base ID still finds all parts. Lettered _A / _B suffixes become part 1 / part 2.
Add more suffix shapes with repeatable --part-pattern regexes. The first capture group is the part number or one part letter and the pattern runs against the filename stem:
python rc-jav.py --scan --part-pattern '[-_ ]side[-_ ]?(A|B)$'
python rc-jav.py --part-pattern '_([CD])$' --save
Saved rules live in config.json as part_patterns. The extension Options page has the same custom part detector list for host-triggered searches, duplicate review, and cache rebuilds.
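To illustrate how a custom pattern's first capture group could turn into a part number, here is a hedged sketch. The function name part_number, the case-insensitive matching, and the A=1, B=2 letter mapping beyond _A / _B are assumptions for illustration only.

```python
import re


def part_number(stem, patterns):
    """Return the part number found by the first matching pattern, else None."""
    for pattern in patterns:
        m = re.search(pattern, stem, flags=re.IGNORECASE)
        if not m:
            continue
        token = m.group(1)
        if token.isdigit():
            return int(token)
        if len(token) == 1 and token.isalpha():
            # Single letter: A -> 1, B -> 2, and so on (assumed mapping).
            return ord(token.upper()) - ord("A") + 1
    return None


patterns = [r"[-_ ]side[-_ ]?(A|B)$", r"_([CD])$"]
print(part_number("SSIS-001_side_B", patterns))  # 2
print(part_number("SSIS-001_C", patterns))       # 3
print(part_number("SSIS-001", patterns))         # None
```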
Files with no parseable ID are listed under "Skipped" at the end so you can spot misnamed files.
Rule checks
Focused rule tests cover ID extraction, multipart grouping safety, and duplicate KEEP ranking:
python -B -m unittest discover -s tests -v
Usage
python rc-jav.py \
--source cq:personal-files/ClearJAV/ichika-matsumoto \
--target cq:personal-files/JAV/TMP
Flags:
- --source/-s REMOTE: priority remote path. Repeat for multiple.
- --target/-t REMOTE: non-priority remote path. Repeat for multiple.
- --format {console,txt,csv,json,all}: default console. Non-console formats write to --output-dir.
- --output-dir DIR: default ./reports.
- --no-color: disable ANSI colors but keep rich layout (tables, panels).
- --basic: plain text output, no rich tables/panels. Progress ticks every 25 files on stderr. Useful for piping or simple terminals.
- --rclone-bin PATH: path to rclone executable (default: rclone on PATH). Example: --rclone-bin C:\Programs\rclone\rclone.exe.
- --clearjav: shortcut that sets source = DEFAULT_SOURCE, target = DEFAULT_TARGET. Equivalent to --source cq:personal-files/ClearJAV --target cq:personal-files/JAV/TMP. Combine with --source/--target to override one side.
Examples:
# full library dupe scan, one flag
python rc-jav.py --clearjav
# same but only check one actress folder against TMP
python rc-jav.py --clearjav --source cq:personal-files/ClearJAV/ichika-matsumoto
Search mode
Check whether a JAV ID already exists in your library before downloading:
python rc-jav.py --search SSIS-001
python rc-jav.py --search SSIS-001 --search FC2-PPV-4894535
# wildcards (quote to avoid shell glob expansion)
python rc-jav.py --search "IPZZ-*"
python rc-jav.py --search "FC2-PPV-*"
python rc-jav.py --search "SSIS-???" # exact 3-digit numeric
Wildcard syntax: * (any chars) and ? (one char), case-insensitive. Matches against normalized IDs in the index, including #partN suffixes automatically.
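One way to picture the wildcard match is with Python's fnmatch module. This is a sketch only: wildcard_hits is a hypothetical name, and how the real index stores zero padding or part suffixes is not documented here. It handles * and ? only.

```python
from fnmatch import fnmatchcase


def wildcard_hits(pattern, index_keys):
    """Return index keys whose base ID matches a * / ? pattern, ignoring case."""
    wanted = pattern.upper()
    hits = []
    for key in index_keys:
        base_id = key.split("#", 1)[0]  # drop a "#partN" suffix before matching
        if fnmatchcase(base_id.upper(), wanted):
            hits.append(key)
    return hits


index = ["IPZZ-860", "IPZZ-861#part1", "IPZZ-861#part2", "SSIS-123"]
print(wildcard_hits("ipzz-*", index))
# ['IPZZ-860', 'IPZZ-861#part1', 'IPZZ-861#part2']
```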
Range syntax: [N-M] inclusive both ends. Works inside any prefix.
python rc-jav.py --search "IPZZ-[820-860]"
python rc-jav.py --search "FC2-PPV-[4894500-4894600]"
python rc-jav.py --search "MIDV-[001-010]" # zero-padding preserved
Quote in PowerShell/bash so [...] reaches Python literally. Reversed ranges (860-820) auto-swap.
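The range behaviour (inclusive bounds, preserved zero padding, auto-swap) can be sketched as a small expansion helper. expand_range is a hypothetical name, and taking the pad width from the shorter bound is an assumption about how padding is preserved.

```python
import re

RANGE_RE = re.compile(r"^(.*)\[(\d+)-(\d+)\](.*)$")


def expand_range(query):
    """Expand 'PREFIX-[N-M]' into a list of IDs; return [query] if no range."""
    m = RANGE_RE.match(query)
    if not m:
        return [query]
    prefix, first, second, suffix = m.groups()
    lo, hi = sorted((int(first), int(second)))  # reversed ranges auto-swap
    width = min(len(first), len(second))        # assumed zero-padding rule
    return [f"{prefix}{str(n).zfill(width)}{suffix}" for n in range(lo, hi + 1)]


print(expand_range("MIDV-[001-003]"))  # ['MIDV-001', 'MIDV-002', 'MIDV-003']
print(expand_range("IPZZ-[822-820]"))  # ['IPZZ-820', 'IPZZ-821', 'IPZZ-822']
```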
With no --source / --target flags, only DEFAULT_TARGET (TMP) is scanned — the typical case for "do I already have this in my unsorted pile?". Pass --source cq:personal-files/ClearJAV to also check the priority library. Edit DEFAULT_SOURCE / DEFAULT_TARGET at the top of the script to change defaults. Remote scans are recursive.
Exit code: 0 if every query had at least one hit, 1 otherwise — useful for shell automation.
Name search (--name)
Substring search against filenames (case-insensitive). Find all files by actress, studio, tag, anything that appears in the filename.
python rc-jav.py --name Ichika
python rc-jav.py --name "Ichika Matsumoto"
python rc-jav.py --name Ichika --name Yui # OR — files matching either
python rc-jav.py --name "Mat*" # glob wildcard
python rc-jav.py --search IPZZ-860 --name Ichika # both — separate result blocks
- Multiple --name tokens = OR. Use one combined --name "foo bar" for AND/exact-substring.
- Matches against the filename stem only (not folder names).
- Auto-routes to cached mode because substring globs can't be server-side filtered on most backends. Pass -q to force quick anyway (slower).
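The matching rules above could look roughly like this in Python. It is an illustrative sketch: name_matches is a hypothetical helper, and treating a glob token as unanchored (wrapped in * on both sides) is an assumption.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def name_matches(path, tokens):
    """True if any token matches the filename stem (OR semantics, case-insensitive)."""
    stem = PurePosixPath(path).stem.lower()  # folder names are ignored
    for token in tokens:
        t = token.lower()
        if "*" in t or "?" in t:
            # Glob token: assumed to match anywhere in the stem.
            if fnmatchcase(stem, f"*{t}*"):
                return True
        elif t in stem:
            return True
    return False


path = "ClearJAV/ichika/SSIS-001 - Ichika Matsumoto.mp4"
print(name_matches(path, ["Yui", "Ichika"]))            # True
print(name_matches("ichika/SSIS-001.mp4", ["Ichika"]))  # False
```

The second call returns False because only the stem SSIS-001 is checked, not the ichika folder.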
Smart search mode (auto quick / cached)
The script auto-picks the right execution path per query and prints which one it chose:
| Query shape | Picked mode | Reason |
|---|---|---|
| Single exact ID (IPZZ-860) | quick | live rclone --include, ~1–2s even on huge trees |
| Wildcard (IPZZ-*, SSIS-???) | cached | reliable normalized matching |
| Range (IPZZ-[820-860]) | cached | avoids N rclone calls |
| Multiple --search flags | cached | warmup amortizes |
Override:
- --quick/-q: force live rclone lookup (skips cache).
- --cache: force cache (builds it if cold).
Quick mode never reads or writes the cache. Cache mode honors --update and --no-cache as before.
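The routing table and overrides can be summarized as a small decision function. This is a sketch under assumptions: pick_mode and its parameters are hypothetical, and giving --quick precedence when both overrides are passed is a guess.

```python
def pick_mode(search_queries, name_tokens=(), force_quick=False, force_cache=False):
    """Return 'quick' or 'cached' for a set of queries, following the table above."""
    if force_quick:   # --quick / -q (precedence over --cache is assumed)
        return "quick"
    if force_cache:   # --cache
        return "cached"
    if name_tokens:   # --name searches route to the cache
        return "cached"
    if len(search_queries) != 1:  # multiple --search flags
        return "cached"
    query = search_queries[0]
    if any(ch in query for ch in "*?["):  # wildcard or range syntax
        return "cached"
    return "quick"    # single exact ID


print(pick_mode(["IPZZ-860"]))        # quick
print(pick_mode(["IPZZ-[820-860]"]))  # cached
```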
Cache
Search mode caches each remote's file list in ./cache.json next to the script. Subsequent searches are near-instant.
- First run: scans + writes cache.
- Later runs: reads cache (banner shows CACHED 14m (154 files)).
- --update/-u: force re-scan + overwrite cache for the requested remotes.
- --no-cache: bypass cache (no read, no write).
- Stale warning when cache is older than 24h; the cache is still used, marked CACHED-STALE.
- Ctrl+C during a scan: rclone is terminated, cache for in-flight remote is NOT written.
Delete cache.json to reset everything.
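As an illustration of the 24h staleness rule and the banner text, here is a hedged sketch. cache_status is a hypothetical helper; the minute format follows the CACHED 14m (154 files) example, while the hour format for older caches is an assumption.

```python
import time

STALE_AFTER_SECONDS = 24 * 60 * 60  # the 24h threshold from the list above


def cache_status(scanned_at, file_count, now=None):
    """Format a banner label from a cache timestamp (seconds since epoch)."""
    now = time.time() if now is None else now
    age = max(0, now - scanned_at)
    label = "CACHED-STALE" if age > STALE_AFTER_SECONDS else "CACHED"
    minutes = int(age // 60)
    # Minutes below one hour, whole hours otherwise (hour format is assumed).
    age_text = f"{minutes}m" if minutes < 60 else f"{minutes // 60}h"
    return f"{label} {age_text} ({file_count} files)"


print(cache_status(0, 154, now=14 * 60))   # CACHED 14m (154 files)
print(cache_status(0, 10, now=25 * 3600))  # CACHED-STALE 25h (10 files)
```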
Saving defaults (--save)
Persist --source, --target, --catalog, and/or --part-pattern to config.json so you don't have to type them every run.
# set default target
python rc-jav.py --target cq:personal-files/JAV/TMP --save
# set source + multiple targets at once
python rc-jav.py --source cq:personal-files/ClearJAV ^
--target cq:personal-files/JAV/TMP ^
--target cq:personal-files/JAV/SORTED ^
--save
# inspect
type config.json
Only the keys you explicitly pass are written — running --save --target X won't wipe a saved default_source. Delete config.json to reset to the hardcoded defaults at the top of rc-jav.py.
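The "only write what was passed" behaviour amounts to a merge into the existing file. A minimal sketch, assuming a flat JSON object: save_defaults is a hypothetical name, and default_target is an assumed key (default_source and part_patterns are the only key names mentioned in this README).

```python
import json
from pathlib import Path


def save_defaults(config_path, **passed):
    """Merge explicitly passed settings into config.json, keeping other keys."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    # None means "flag not given on this run", so the saved value is left alone.
    config.update({key: value for key, value in passed.items() if value is not None})
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config
```

Calling save_defaults("config.json", default_target=["cq:personal-files/JAV/TMP"]) after an earlier save of default_source would leave the saved source in place.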
Scan-only (--scan)
Refresh the cache without running a search or dupe report — useful for Task Scheduler / cron pre-warming.
# default: refresh DEFAULT_TARGET (TMP)
python rc-jav.py --scan
# refresh both source and target
python rc-jav.py --scan --source cq:personal-files/ClearJAV --target cq:personal-files/JAV/TMP
# nightly via Task Scheduler
schtasks /Create /SC DAILY /ST 03:00 /TN "rc-jav nightly scan" ^
/TR "python D:\DEV\Project\rclone-jav\rc-jav.py --scan --basic"
--scan always overwrites the cache for the remotes you list. Exit 0 = success, non-zero = rclone failure.
PowerShell example that branches on the --search exit code:
python rc-jav.py --search MIDV-999 ; if ($LASTEXITCODE -eq 0) { "have it" } else { "download" }
WinCatalog integration
WinCatalog's native .wcat format is proprietary, so the script reads its exports instead.
- In WinCatalog: File → Export → choose CSV or XML.
- Save into the wincatalog/ folder next to the script. All *.csv and *.xml files there are auto-loaded, so drop in as many discs as you want.
- Run as normal: python rc-jav.py --search IPZZ-860
- Override or add extra paths with --catalog PATH (file or folder, repeatable).
- To change the default folder, edit DEFAULT_CATALOG at the top of the script.
Re-export when your catalog changes; the script re-reads on every run (catalog data is not cached — it's already a local file).
Role of catalog hits:
- Search: shown as rows with source label Catalog. The disc/volume name is encoded into the path so you know which offline backup holds the file.
- Dupe mode: catalog entries appear in groups for awareness but are never marked KEEP or DELETE? because they're offline and can't be touched. A group is only flagged as a dupe when 2+ rclone copies exist.
CSV column auto-detection (case-insensitive, first match wins):
- Name: Name, File Name, Filename, Title
- Path: Path, Full Path, Location, Folder
- Size: Size, File Size, Bytes, Size (bytes)
- Disc: Disc, Disc Name, Disc Label, Volume, Source, Catalog, Media
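The column detection can be sketched as an alias lookup over the CSV header. This is illustrative only: COLUMN_ALIASES and detect_columns are hypothetical names, and "first match wins" is read here as the first alias in the listed order.

```python
COLUMN_ALIASES = {
    "name": ["Name", "File Name", "Filename", "Title"],
    "path": ["Path", "Full Path", "Location", "Folder"],
    "size": ["Size", "File Size", "Bytes", "Size (bytes)"],
    "disc": ["Disc", "Disc Name", "Disc Label", "Volume", "Source", "Catalog", "Media"],
}


def detect_columns(header):
    """Map each logical field to the actual header column, case-insensitively."""
    by_lower = {column.strip().lower(): column for column in header}
    found = {}
    for field, aliases in COLUMN_ALIASES.items():
        for alias in aliases:  # first alias present in the header wins
            if alias.lower() in by_lower:
                found[field] = by_lower[alias.lower()]
                break
    return found


print(detect_columns(["FILE NAME", "Full Path", "Bytes", "Volume"]))
# {'name': 'FILE NAME', 'path': 'Full Path', 'size': 'Bytes', 'disc': 'Volume'}
```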
XML: walks the tree and treats <File> / <f> nodes inside <Disc> / <Catalog> / <Volume> containers as file entries, following <Folder> nesting.
Requirements
- Python 3.9+
- pip install rich (used for progress bars + themed output)
- rclone on PATH with the relevant remotes configured.
UI
- Live per-file progress bar during scans (rclone size --json for total, then rclone lsf --files-only -R --format pst streamed).
- Banner panel showing run mode + per-remote cache status.
- Rich tables for search hits and duplicate groups.
- --no-color for plain output (CI, piping).
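For reference, each line streamed by rclone lsf --format pst carries path, size, and modification time, separated by ";" unless --separator is changed. A minimal parsing sketch, with parse_lsf_line as a hypothetical helper and the default separator assumed:

```python
def parse_lsf_line(line):
    """Split one 'path;size;modtime' line from rclone lsf --format pst."""
    # rsplit from the right so a ';' inside the path does not break parsing.
    path, size, modtime = line.rstrip("\r\n").rsplit(";", 2)
    return path, int(size), modtime


print(parse_lsf_line("ichika/SSIS-001.mp4;1234;2024-01-02 03:04:05\n"))
# ('ichika/SSIS-001.mp4', 1234, '2024-01-02 03:04:05')
```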
Roadmap
- Phase 1 (current): report duplicates + search.
- Phase 2: --apply mode that runs rclone delete on DELETE? candidates behind a confirmation gate.
- Phase 3: resolution-aware tiebreakers, move-to-review folder, scheduled runs.