# rc-jav

Read-only duplicate scanner for JAV files across rclone remotes. Groups files by JAV ID (e.g. `SSIS-001`) and reports which copy to keep based on priority rules.

## Priority rules

1. Video files inside configured **VIP folders** win first. Default VIP folder: `ClearJAV`.
2. If no VIP-folder video exists, **Source always wins** regardless of resolution/size.
3. `.ts` files rank below other video containers, even when the transport-stream copy is larger.
4. If no Source copy exists in the group, **largest file size wins** among the remaining Targets.
5. Suggestions only: the script never deletes anything. Cleanup is manual.

## ID matching

The filename stem is matched against:

- Primary: `^([A-Za-z]+)-(\d+)` (`SSIS-001`, `MIDV-123`, `ABP-456`)
- Compound: `^(\w+(?:-\w+)+)-(\d+)` (`FC2-PPV-4894535`, `HEYZO-HD-1234`)
- Fallback: `^([A-Za-z0-9]+)-(\d+)` (`1pondo-123`, `carib-456`)

IDs are normalized to uppercase with leading zeros stripped from the number (so `ssis-001` == `SSIS-1` == `SSIS-001`). Anything after the ID (` - Actress [1080p]`) is ignored for matching.

### Part-suffix handling

Multi-part files (`_1`, `_2`, `-1`, `-2`, `_A`, `_B`, `.1of4`, ` (1)`, `-pt1`, `-part1`, `-cd1`, `-disc1`, trailing ` N`) are normalized as `{ID}#partN` so they do not collide as false duplicates. Searching the base ID still finds all parts. Lettered `_A` / `_B` suffixes become part 1 / part 2.

Add more suffix shapes with repeatable `--part-pattern` regexes. The first capture group is the part number or one part letter, and the pattern runs against the filename stem:

```powershell
python rc-jav.py --scan --part-pattern '[-_ ]side[-_ ]?(A|B)$'
python rc-jav.py --part-pattern '_([CD])$' --save
```

Saved rules live in `config.json` as `part_patterns`. The extension Options page has the same custom part detector list for host-triggered searches, duplicate review, and cache rebuilds.

Files with no parseable ID are listed under "Skipped" at the end so you can spot misnamed files.
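The ID matching rules above can be sketched in a few lines of Python. This is a minimal illustration, not the script's actual code: the function name `normalize_id` is hypothetical, and trying the three patterns in the listed order is an assumption. Part-suffix detection is deliberately left out.

```python
import re
from typing import Optional

# The three stem patterns from the "ID matching" list.
# Assumption: they are tried in the order listed (primary, compound, fallback).
ID_PATTERNS = [
    re.compile(r"^([A-Za-z]+)-(\d+)"),      # primary:  SSIS-001
    re.compile(r"^(\w+(?:-\w+)+)-(\d+)"),   # compound: FC2-PPV-4894535
    re.compile(r"^([A-Za-z0-9]+)-(\d+)"),   # fallback: 1pondo-123
]


def normalize_id(stem: str) -> Optional[str]:
    """Return the normalized ID for a filename stem, or None if none parses.

    Hypothetical helper: uppercases the prefix and strips leading zeros
    from the number, as described in the text above.
    """
    for pattern in ID_PATTERNS:
        match = pattern.match(stem)
        if match:
            prefix, number = match.groups()
            # int() drops leading zeros: "001" -> 1
            return f"{prefix.upper()}-{int(number)}"
    return None  # such a file would end up under "Skipped"


if __name__ == "__main__":
    for stem in ["ssis-001 - Actress [1080p]", "FC2-PPV-4894535", "holiday video"]:
        print(stem, "->", normalize_id(stem))
```

Because every pattern is anchored at the start of the stem, trailing text such as ` - Actress [1080p]` never takes part in the match.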
### Rule checks

Focused rule tests cover ID extraction, multipart grouping safety, and duplicate KEEP ranking:

```powershell
python -B -m unittest discover -s tests -v
```

## Usage

```
python rc-jav.py \
  --source cq:personal-files/ClearJAV/ichika-matsumoto \
  --target cq:personal-files/JAV/TMP
```

Flags:

- `--source` / `-s REMOTE`: priority remote path. Repeat for multiple.
- `--target` / `-t REMOTE`: non-priority remote path. Repeat for multiple.
- `--format {console,txt,csv,json,all}`: default `console`. Non-console formats write to `--output-dir`.
- `--output-dir DIR`: default `./reports`.
- `--no-color`: disable ANSI colors but keep rich layout (tables, panels).
- `--basic`: plain text output, no rich tables/panels. Progress ticks every 25 files on stderr. Useful for piping or simple terminals.
- `--rclone-bin PATH`: path to rclone executable (default: `rclone` on PATH). Example: `--rclone-bin C:\Programs\rclone\rclone.exe`.
- `--clearjav`: shortcut that sets source = `DEFAULT_SOURCE` and target = `DEFAULT_TARGET`. Equivalent to `--source cq:personal-files/ClearJAV --target cq:personal-files/JAV/TMP`. Combine with `--source`/`--target` to override one side.

Examples:

```
# full library dupe scan, one flag
python rc-jav.py --clearjav

# same but only check one actress folder against TMP
python rc-jav.py --clearjav --source cq:personal-files/ClearJAV/ichika-matsumoto
```

## Search mode

Check whether a JAV ID already exists in your library before downloading:

```
python rc-jav.py --search SSIS-001
python rc-jav.py --search SSIS-001 --search FC2-PPV-4894535

# wildcards (quote to avoid shell glob expansion)
python rc-jav.py --search "IPZZ-*"
python rc-jav.py --search "FC2-PPV-*"
python rc-jav.py --search "SSIS-???"   # exact 3-digit numeric
```

Wildcard syntax: `*` (any chars) and `?` (one char), case-insensitive. Matches against normalized IDs in the index, including `#partN` suffixes automatically.

Range syntax: `[N-M]`, inclusive at both ends. Works inside any prefix.
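As a rough illustration of the wildcard and range syntax described above, the sketch below tests one query against one normalized ID. The function name `query_matches`, the use of `fnmatch`, and the choice to drop the `#partN` suffix before comparing are assumptions made for this example, not the script's actual code.

```python
import re
from fnmatch import fnmatchcase

# Matches a trailing "[N-M]" range, e.g. "IPZZ-[820-860]".
RANGE_RE = re.compile(r"^(.*)\[(\d+)-(\d+)\]$")


def query_matches(query: str, normalized_id: str) -> bool:
    """Hypothetical matcher for exact, wildcard, and range queries."""
    query = query.upper()                      # queries are case-insensitive
    base_id = normalized_id.split("#", 1)[0]   # assumption: ignore "#partN"

    range_match = RANGE_RE.match(query)
    if range_match:
        prefix = range_match.group(1)
        low, high = int(range_match.group(2)), int(range_match.group(3))
        if low > high:                         # reversed ranges auto-swap
            low, high = high, low
        if not base_id.startswith(prefix):
            return False
        tail = base_id[len(prefix):]
        # Compare numerically, so padding in the query does not matter.
        return tail.isdecimal() and low <= int(tail) <= high

    # "*" matches any run of characters, "?" matches exactly one.
    return fnmatchcase(base_id, query)


if __name__ == "__main__":
    print(query_matches("IPZZ-[820-860]", "IPZZ-830#part2"))
    print(query_matches("FC2-PPV-*", "FC2-PPV-4894535"))
```

The command-line form of the same range queries follows.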
```
python rc-jav.py --search "IPZZ-[820-860]"
python rc-jav.py --search "FC2-PPV-[4894500-4894600]"
python rc-jav.py --search "MIDV-[001-010]"   # zero-padding preserved
```

Quote in PowerShell/bash so `[...]` reaches Python literally. Reversed ranges (`860-820`) auto-swap.

With no `--source` / `--target` flags, only `DEFAULT_TARGET` (TMP) is scanned, which is the typical case for "do I already have this in my unsorted pile?". Pass `--source cq:personal-files/ClearJAV` to also check the priority library. Edit `DEFAULT_SOURCE` / `DEFAULT_TARGET` at the top of the script to change defaults. Remote scans are recursive.

Exit code: `0` if every query had at least one hit, `1` otherwise. This is useful for shell automation.

## Name search (`--name`)

Substring search against filenames (case-insensitive). Find all files by actress, studio, tag, or anything else that appears in the filename.

```
python rc-jav.py --name Ichika
python rc-jav.py --name "Ichika Matsumoto"
python rc-jav.py --name Ichika --name Yui          # OR: files matching either
python rc-jav.py --name "Mat*"                     # glob wildcard
python rc-jav.py --search IPZZ-860 --name Ichika   # both: separate result blocks
```

- Multiple `--name` tokens = OR. Use one combined `--name "foo bar"` for AND/exact-substring.
- Matches against the filename stem only (not folder names).
- Auto-routes to **cached** mode because substring globs can't be server-side filtered on most backends. Pass `-q` to force quick anyway (slower).

### Smart search mode (auto quick / cached)

The script auto-picks the right execution path per query and prints which one it chose:

| Query shape | Picked mode | Reason |
|---|---|---|
| Single exact ID (`IPZZ-860`) | quick | live rclone `--include`, ~1–2s even on huge trees |
| Wildcard (`IPZZ-*`, `SSIS-???`) | cached | reliable normalized matching |
| Range (`IPZZ-[820-860]`) | cached | avoids N rclone calls |
| Multiple `--search` flags | cached | warmup amortizes |

Override:

- `--quick` / `-q`: force live rclone lookup (skips cache).
- `--cache`: force cache (builds it if cold).

Quick mode never reads or writes the cache. Cache mode honors `--update` and `--no-cache` (see Cache below).

### Cache

Search mode caches each remote's file list in `./cache.json` next to the script. Subsequent searches are near-instant.

- First run: scans + writes cache.
- Later runs: reads cache (banner shows `CACHED 14m (154 files)`).
- `--update` / `-u`: force re-scan + overwrite cache for the requested remotes.
- `--no-cache`: bypass cache (no read, no write).
- Stale warning when the cache is older than 24h. It is still used, but marked `CACHED-STALE`.
- Ctrl+C during a scan: rclone is terminated, and the cache for the in-flight remote is NOT written.

Delete `cache.json` to reset everything.

### Saving defaults (`--save`)

Persist `--source`, `--target`, `--catalog`, and/or `--part-pattern` to `config.json` so you don't have to type them every run.

```
# set default target
python rc-jav.py --target cq:personal-files/JAV/TMP --save

# set source + multiple targets at once
python rc-jav.py --source cq:personal-files/ClearJAV ^
  --target cq:personal-files/JAV/TMP ^
  --target cq:personal-files/JAV/SORTED ^
  --save

# inspect
type config.json
```

Only the keys you explicitly pass are written, so running `--save --target X` won't wipe a saved `default_source`. Delete `config.json` to reset to the hardcoded defaults at the top of `rc-jav.py`.

### Scan-only (`--scan`)

Refresh the cache without running a search or dupe report. Useful for Task Scheduler / cron pre-warming.

```
# default: refresh DEFAULT_TARGET (TMP)
python rc-jav.py --scan

# refresh both source and target
python rc-jav.py --scan --source cq:personal-files/ClearJAV --target cq:personal-files/JAV/TMP

# nightly via Task Scheduler
schtasks /Create /SC DAILY /ST 03:00 /TN "rc-jav nightly scan" ^
  /TR "python D:\DEV\Project\rclone-jav\rc-jav.py --scan --basic"
```

`--scan` always overwrites the cache for the remotes you list. Exit 0 = success, non-zero = rclone failure.
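The exit codes also make the script easy to drive from Python. The sketch below is one possible wrapper, assuming `rc-jav.py` sits in the current directory; the helper names `search_command` and `command_succeeded` are hypothetical and not part of the script.

```python
import subprocess
import sys
from typing import List


def search_command(query: str, script: str = "rc-jav.py") -> List[str]:
    """Build the argv for a single --search lookup (hypothetical helper)."""
    return [sys.executable, script, "--search", query]


def command_succeeded(argv: List[str]) -> bool:
    """Run a command and report whether it exited with code 0."""
    result = subprocess.run(argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    return result.returncode == 0


if __name__ == "__main__":
    # Exit code 0 from --search means every query had at least one hit.
    if command_succeeded(search_command("MIDV-999")):
        print("have it")
    else:
        print("download")
```

Keeping the command builder separate from the runner makes it easy to reuse the same check for `--scan`, where a non-zero exit signals an rclone failure rather than a missing file.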
In PowerShell, the `--search` exit code can gate a download:

```
python rc-jav.py --search MIDV-999 ; if ($LASTEXITCODE -eq 0) { "have it" } else { "download" }
```

## WinCatalog integration

WinCatalog's native `.wcat` format is proprietary, so the script reads its exports instead.

1. In WinCatalog: **File → Export** → choose **CSV** or **XML**.
2. Save into the `wincatalog/` folder next to the script. All `*.csv` and `*.xml` files there are auto-loaded, so drop in as many discs as you want.
3. Run as normal: `python rc-jav.py --search IPZZ-860`
4. Override or add extra paths with `--catalog PATH` (file or folder, repeatable).
5. To change the default folder, edit `DEFAULT_CATALOG` at the top of the script.

Re-export when your catalog changes; the script re-reads on every run (catalog data is **not** cached because it is already a local file).

**Role of catalog hits:**

- Search: shown as rows with source label `Catalog`. The disc/volume name is encoded into the path so you know which offline backup holds the file.
- Dupe mode: catalog entries appear in groups for awareness but are **never marked KEEP or DELETE?** because they are offline and can't be touched. A group is only flagged as a dupe when 2+ rclone copies exist.

**CSV column auto-detection** (case-insensitive, first match wins):

- Name: `Name`, `File Name`, `Filename`, `Title`
- Path: `Path`, `Full Path`, `Location`, `Folder`
- Size: `Size`, `File Size`, `Bytes`, `Size (bytes)`
- Disc: `Disc`, `Disc Name`, `Disc Label`, `Volume`, `Source`, `Catalog`, `Media`

XML: walks the tree, treats `` / `` nodes inside `` / `` / `` containers, with `` nesting.

## Requirements

- Python 3.9+
- `pip install rich` (used for progress bars + themed output)
- `rclone` on `PATH` with the relevant remotes configured.

## UI

- Live per-file progress bar during scans (`rclone size --json` for total, then `rclone lsf --files-only -R --format pst` streamed).
- Banner panel showing run mode + per-remote cache status.
- Rich tables for search hits and duplicate groups.
- `--no-color` for uncolored output that keeps the rich layout; `--basic` for plain text (CI, piping).

## Roadmap

- Phase 1 (current): report duplicates + search.
- Phase 2: `--apply` mode that runs `rclone delete` on `DELETE?` candidates behind a confirmation gate.
- Phase 3: resolution-aware tiebreakers, move-to-review folder, scheduled runs.