# rc-jav

Read-only duplicate scanner for JAV files across rclone remotes. Groups files by JAV ID (e.g. `SSIS-001`) and reports which copy to keep based on priority rules.

## Priority rules

1. Video files inside configured **VIP folders** win first. Default VIP folder: `ClearJAV`.
2. If no VIP-folder video exists, **Source always wins** regardless of resolution/size.
3. `.ts` files rank below other video containers, even when the transport-stream copy is larger.
4. If no Source copy exists in the group, **largest file size wins** among the remaining Targets.
5. Suggestions only: the script never deletes anything. Cleanup is manual.

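The ranking above can be sketched as a sort key. This is a minimal illustration, not the script's actual code: `FileCopy`, `keep_key`, `pick_keep`, and the `VIDEO_EXTS` list are hypothetical names, and it assumes the rules apply strictly in the listed order (VIP folder, then Source, then non-`.ts`, then size).

```python
from dataclasses import dataclass
from pathlib import PurePosixPath

# Assumed set of video extensions; the real script may use a different list.
VIDEO_EXTS = {".mp4", ".mkv", ".avi", ".wmv", ".ts"}


@dataclass
class FileCopy:
    path: str        # path on the remote, e.g. "cq:personal-files/JAV/TMP/SSIS-001.mp4"
    size: int        # size in bytes
    is_source: bool  # True when the copy came from a --source remote


def keep_key(copy, vip_folders=("ClearJAV",)):
    """Build a tuple that sorts the preferred copy last (highest)."""
    p = PurePosixPath(copy.path)
    ext = p.suffix.lower()
    in_vip = ext in VIDEO_EXTS and any(part in vip_folders for part in p.parts)
    # Tuple order mirrors rules 1-4: VIP, Source, not .ts, larger size.
    return (in_vip, copy.is_source, ext != ".ts", copy.size)


def pick_keep(copies, vip_folders=("ClearJAV",)):
    """Return the copy that would be suggested as KEEP under these assumptions."""
    return max(copies, key=lambda c: keep_key(c, vip_folders))
```

Under this ordering a 5 KB `.mp4` Target beats a 9 KB `.ts` Target, and any Source copy beats both unless a VIP-folder video exists.
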
## ID matching

The filename stem is matched against:

- Primary: `^([A-Za-z]+)-(\d+)` matches `SSIS-001`, `MIDV-123`, `ABP-456`
- Compound: `^(\w+(?:-\w+)+)-(\d+)` matches `FC2-PPV-4894535`, `HEYZO-HD-1234`
- Fallback: `^([A-Za-z0-9]+)-(\d+)` matches `1pondo-123`, `carib-456`

IDs are normalized to uppercase with leading zeros stripped from the number (so `ssis-001` == `SSIS-1` == `SSIS-001`). Anything after the ID (` - Actress [1080p]`) is ignored for matching.

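A minimal sketch of the matching and normalization described above, using the three regexes as written. The function name `extract_id`, the choice to try the patterns in listed order, and the zero-stripped canonical spelling (`SSIS-1`) are assumptions for illustration.

```python
import re

# The three patterns from the list above, tried in the order shown (assumed).
ID_PATTERNS = [
    re.compile(r"^([A-Za-z]+)-(\d+)"),       # Primary
    re.compile(r"^(\w+(?:-\w+)+)-(\d+)"),    # Compound
    re.compile(r"^([A-Za-z0-9]+)-(\d+)"),    # Fallback
]


def extract_id(stem):
    """Return a normalized ID for a filename stem, or None if nothing matches."""
    for pattern in ID_PATTERNS:
        m = pattern.match(stem)
        if m:
            prefix, number = m.groups()
            # Uppercase the prefix; int() drops leading zeros from the number.
            return f"{prefix.upper()}-{int(number)}"
    return None
```

With this sketch, `extract_id("ssis-001")` and `extract_id("SSIS-001 - Actress [1080p]")` both yield the same key, while a stem without a dash-number pattern returns `None` and would end up in the "Skipped" list.
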
### Part-suffix handling

Multi-part files (`_1`, `_2`, `-1`, `-2`, `_A`, `_B`, `.1of4`, ` (1)`, `-pt1`, `-part1`, `-cd1`, `-disc1`, trailing ` N`) are normalized as `{ID}#partN` so they do not collide as false duplicates. Searching the base ID still finds all parts. Lettered `_A` / `_B` suffixes become part 1 / part 2.

Add more suffix shapes with repeatable `--part-pattern` regexes. Each pattern runs against the filename stem, and its first capture group must be the part number or a single part letter:

```powershell
python rc-jav.py --scan --part-pattern '[-_ ]side[-_ ]?(A|B)$'
python rc-jav.py --part-pattern '_([CD])$' --save
```

Saved rules live in `config.json` as `part_patterns`. The extension Options page has the same custom part detector list for host-triggered searches, duplicate review, and cache rebuilds.

Files with no parseable ID are listed under "Skipped" at the end so you can spot misnamed files.

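As a rough illustration of how a custom part pattern could turn into a `{ID}#partN` key, the sketch below applies user-supplied regexes to a stem. `part_number` and `part_key` are hypothetical helpers, and case-insensitive matching is an assumption; the built-in suffix list is not reproduced here.

```python
import re


def part_number(stem, patterns):
    """Return the part number captured by the first matching pattern, else None."""
    for raw in patterns:
        m = re.search(raw, stem, flags=re.IGNORECASE)  # case-insensitivity is assumed
        if not m:
            continue
        token = m.group(1)
        if token.isdigit():
            return int(token)
        if len(token) == 1 and token.isalpha():
            # Letters map to numbers: A -> 1, B -> 2, ...
            return ord(token.upper()) - ord("A") + 1
    return None


def part_key(base_id, stem, patterns):
    """Append '#partN' to the base ID when a part suffix is detected."""
    n = part_number(stem, patterns)
    return base_id if n is None else f"{base_id}#part{n}"
```

For example, `part_key("SSIS-1", "SSIS-001 side-A", [r"[-_ ]side[-_ ]?(A|B)$"])` gives a part-1 key, while a stem with no suffix keeps the bare ID.
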
### Rule checks

Focused rule tests cover ID extraction, multipart grouping safety, and duplicate KEEP ranking:

```powershell
python -B -m unittest discover -s tests -v
```

## Usage

```
python rc-jav.py \
  --source cq:personal-files/ClearJAV/ichika-matsumoto \
  --target cq:personal-files/JAV/TMP
```

Flags:
- `--source` / `-s REMOTE`: priority remote path. Repeat for multiple.
- `--target` / `-t REMOTE`: non-priority remote path. Repeat for multiple.
- `--format {console,txt,csv,json,all}`: default `console`. Non-console formats write to `--output-dir`.
- `--output-dir DIR`: default `./reports`.
- `--no-color`: disable ANSI colors but keep rich layout (tables, panels).
- `--basic`: plain text output, no rich tables/panels. Progress ticks every 25 files on stderr. Useful for piping or simple terminals.
- `--rclone-bin PATH`: path to rclone executable (default: `rclone` on PATH). Example: `--rclone-bin C:\Programs\rclone\rclone.exe`.
- `--clearjav`: shortcut that sets source = `DEFAULT_SOURCE` and target = `DEFAULT_TARGET`. Equivalent to `--source cq:personal-files/ClearJAV --target cq:personal-files/JAV/TMP`. Combine with `--source`/`--target` to override one side.

Examples:

```
# full library dupe scan, one flag
python rc-jav.py --clearjav

# same but only check one actress folder against TMP
python rc-jav.py --clearjav --source cq:personal-files/ClearJAV/ichika-matsumoto
```

## Search mode

Check whether a JAV ID already exists in your library before downloading:

```
python rc-jav.py --search SSIS-001
python rc-jav.py --search SSIS-001 --search FC2-PPV-4894535

# wildcards (quote to avoid shell glob expansion)
python rc-jav.py --search "IPZZ-*"
python rc-jav.py --search "FC2-PPV-*"
python rc-jav.py --search "SSIS-???"   # exactly three characters after the prefix
```

Wildcard syntax: `*` (any chars) and `?` (one char), case-insensitive. Patterns are matched against the normalized IDs in the index, and `#partN` suffixes are included automatically.

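One way to picture the wildcard matching is with Python's standard `fnmatch` module. This is a sketch under assumptions: `wildcard_hits` is a hypothetical helper, and it assumes multi-part entries are indexed under keys of the form `{ID}#partN`, so the part suffix is dropped before comparing.

```python
from fnmatch import fnmatchcase


def wildcard_hits(pattern, index_keys):
    """Return index keys whose base ID matches a '*' / '?' pattern, ignoring case."""
    wanted = pattern.upper()
    hits = []
    for key in index_keys:
        # Compare only the base ID so "IPZZ-861#part2" still matches "IPZZ-*".
        base = key.split("#part", 1)[0]
        if fnmatchcase(base.upper(), wanted):
            hits.append(key)
    return hits
```

Uppercasing both sides and calling `fnmatchcase` keeps the comparison case-insensitive on every platform, whereas plain `fnmatch.fnmatch` follows the operating system's case rules.
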
Range syntax: `[N-M]`, inclusive at both ends. Works inside any prefix.

```
python rc-jav.py --search "IPZZ-[820-860]"
python rc-jav.py --search "FC2-PPV-[4894500-4894600]"
python rc-jav.py --search "MIDV-[001-010]"   # zero-padding preserved
```

Quote the pattern in PowerShell/bash so `[...]` reaches Python literally. Reversed ranges (`860-820`) auto-swap.

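The range behavior (inclusive ends, auto-swap, preserved zero-padding) can be sketched as a simple expansion into explicit IDs. `expand_range` is a hypothetical helper; the script may well compare numbers directly instead of expanding, so treat this only as an illustration of the semantics.

```python
import re

# Matches one "[N-M]" group anywhere in the query.
RANGE_RE = re.compile(r"^(.*)\[(\d+)-(\d+)\](.*)$")


def expand_range(query):
    """Expand 'PREFIX-[N-M]' into a list of IDs; return [query] if no range is present."""
    m = RANGE_RE.match(query)
    if not m:
        return [query]
    head, lo_text, hi_text, tail = m.groups()
    lo, hi = sorted((int(lo_text), int(hi_text)))   # reversed ranges auto-swap
    width = min(len(lo_text), len(hi_text))         # keep the zero-padding as typed
    return [f"{head}{n:0{width}d}{tail}" for n in range(lo, hi + 1)]
```

For instance, `expand_range("MIDV-[001-003]")` yields three zero-padded IDs, and `expand_range("IPZZ-[822-820]")` yields the same list as the forward range.
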
With no `--source` / `--target` flags, only `DEFAULT_TARGET` (TMP) is scanned, which covers the typical "do I already have this in my unsorted pile?" check. Pass `--source cq:personal-files/ClearJAV` to also check the priority library. Edit `DEFAULT_SOURCE` / `DEFAULT_TARGET` at the top of the script to change defaults. Remote scans are recursive.

Exit code: `0` if every query had at least one hit, `1` otherwise, which is useful for shell automation.

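The exit code can also be consumed from another Python program. The wrapper below is a small sketch: `already_have` is a hypothetical name, the script path is assumed to be `rc-jav.py` in the working directory, and the `run` parameter exists only so the call can be swapped out in tests.

```python
import subprocess
import sys


def already_have(jav_id, script="rc-jav.py", run=subprocess.run):
    """Return True when the search exits with code 0 (every query had a hit)."""
    result = run([sys.executable, script, "--search", jav_id])
    return result.returncode == 0
```

A download helper could call `already_have("MIDV-999")` first and skip the download when it returns `True`.
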
## Name search (`--name`)

Substring search against filenames (case-insensitive). Finds all files by actress, studio, tag, or anything else that appears in the filename.

```
python rc-jav.py --name Ichika
python rc-jav.py --name "Ichika Matsumoto"
python rc-jav.py --name Ichika --name Yui          # OR: files matching either
python rc-jav.py --name "Mat*"                     # glob wildcard
python rc-jav.py --search IPZZ-860 --name Ichika   # both: separate result blocks
```

- Multiple `--name` tokens = OR. Use one combined `--name "foo bar"` for AND/exact-substring.
- Matches against the filename stem only (not folder names).
- Auto-routes to **cached** mode because substring globs can't be server-side filtered on most backends. Pass `-q` to force quick anyway (slower).

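The name-search rules above (OR across tokens, stem only, case-insensitive) might look roughly like this. `name_matches` is a hypothetical helper, and treating a wildcard token as a glob that may match anywhere in the stem is an assumption about how `"Mat*"` behaves.

```python
from fnmatch import fnmatchcase
from pathlib import PurePosixPath


def name_matches(path, tokens):
    """Return True when any token matches the filename stem (folders are ignored)."""
    stem = PurePosixPath(path).stem.lower()
    for token in tokens:
        needle = token.lower()
        if "*" in needle or "?" in needle:
            # Assumed: wildcard tokens are globs that can match anywhere in the stem.
            if fnmatchcase(stem, f"*{needle}*"):
                return True
        elif needle in stem:
            return True
    return False
```

Because only the stem is inspected, a file stored under an `Ichika/` folder does not match `--name Ichika` unless the filename itself contains the name.
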
### Smart search mode (auto quick / cached)

The script auto-picks the right execution path per query and prints which one it chose:

| Query shape | Picked mode | Reason |
|---|---|---|
| Single exact ID (`IPZZ-860`) | quick | live rclone `--include`, ~1–2s even on huge trees |
| Wildcard (`IPZZ-*`, `SSIS-???`) | cached | reliable normalized matching |
| Range (`IPZZ-[820-860]`) | cached | avoids N rclone calls |
| Multiple `--search` flags | cached | cache warmup cost is amortized across queries |

Override:
- `--quick` / `-q`: force live rclone lookup (skips cache).
- `--cache`: force cache (builds it if cold).

Quick mode never reads or writes the cache. Cache mode honors `--update` and `--no-cache` as described in the Cache section.

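The routing table can be read as a small decision function. The sketch below is an assumption-laden paraphrase: `pick_mode` and its parameters are hypothetical, and giving `--quick` precedence over `--cache` when both are passed is a guess.

```python
def pick_mode(searches, names=(), force_quick=False, force_cache=False):
    """Choose 'quick' or 'cached' following the table and overrides above."""
    if force_quick:          # --quick / -q
        return "quick"
    if force_cache:          # --cache
        return "cached"
    if names or len(searches) != 1:
        # Name searches and multiple --search flags go through the cache.
        return "cached"
    query = searches[0]
    if any(ch in query for ch in "*?["):
        # Wildcards and [N-M] ranges need normalized matching against the index.
        return "cached"
    return "quick"
```

Only a single exact ID with no overrides lands on the quick path; everything else falls back to the cache.
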
### Cache

Search mode caches each remote's file list in `./cache.json` next to the script. Subsequent searches are near-instant.

- First run: scans + writes cache.
- Later runs: reads cache (banner shows `CACHED 14m (154 files)`).
- `--update` / `-u`: force re-scan + overwrite cache for the requested remotes.
- `--no-cache`: bypass cache (no read, no write).
- Stale warning when the cache is older than 24h. It is still used, but marked `CACHED-STALE`.
- Ctrl+C during a scan: rclone is terminated, and the cache for the in-flight remote is NOT written.

Delete `cache.json` to reset everything.

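The banner status and the 24h staleness rule could be derived from a stored scan timestamp along these lines. `cache_label` is a hypothetical helper, and the idea that each remote's cache entry records a Unix timestamp and a file count is an assumption about the layout of `cache.json`.

```python
import time

STALE_AFTER = 24 * 60 * 60  # 24 hours, in seconds


def cache_label(scanned_at, file_count, now=None):
    """Format a banner label such as 'CACHED 14m (154 files)'."""
    now = time.time() if now is None else now
    age = max(0, int(now - scanned_at))
    status = "CACHED-STALE" if age > STALE_AFTER else "CACHED"
    # Show minutes for young caches and whole hours otherwise.
    age_text = f"{age // 60}m" if age < 3600 else f"{age // 3600}h"
    return f"{status} {age_text} ({file_count} files)"
```

A stale cache is only relabeled here, matching the note above that stale data is still used.
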
### Saving defaults (--save)

Persist `--source`, `--target`, `--catalog`, and/or `--part-pattern` to `config.json` so you don't have to type them every run.

```
# set default target
python rc-jav.py --target cq:personal-files/JAV/TMP --save

# set source + multiple targets at once
python rc-jav.py --source cq:personal-files/ClearJAV ^
  --target cq:personal-files/JAV/TMP ^
  --target cq:personal-files/JAV/SORTED ^
  --save

# inspect
type config.json
```

Only the keys you explicitly pass are written: running `--save --target X` won't wipe a saved `default_source`. Delete `config.json` to reset to the hardcoded defaults at the top of `rc-jav.py`.

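The "only explicitly passed keys are written" behavior amounts to a merge into the existing file. The sketch below assumes a flat JSON object; `save_defaults` is a hypothetical helper, and `default_target` is an assumed key name (only `default_source` and `part_patterns` are named in this README).

```python
import json
from pathlib import Path


def save_defaults(config_path, **passed):
    """Merge explicitly passed settings into config.json, leaving other keys alone."""
    path = Path(config_path)
    config = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    # Flags that were not passed arrive as None and are skipped.
    config.update({key: value for key, value in passed.items() if value is not None})
    path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config
```

Calling it once with `default_source=[...]` and later with only `default_target=[...]` leaves both keys in the file.
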
### Scan-only (--scan)

Refresh the cache without running a search or dupe report. Useful for Task Scheduler / cron pre-warming.

```
# default: refresh DEFAULT_TARGET (TMP)
python rc-jav.py --scan

# refresh both source and target
python rc-jav.py --scan --source cq:personal-files/ClearJAV --target cq:personal-files/JAV/TMP

# nightly via Task Scheduler
schtasks /Create /SC DAILY /ST 03:00 /TN "rc-jav nightly scan" ^
  /TR "python D:\DEV\Project\rclone-jav\rc-jav.py --scan --basic"
```

`--scan` always overwrites the cache for the remotes you list. Exit 0 = success, non-zero = rclone failure.

The `--search` exit code can drive a download decision from PowerShell:

```
python rc-jav.py --search MIDV-999 ; if ($LASTEXITCODE -eq 0) { "have it" } else { "download" }
```

## WinCatalog integration

WinCatalog's native `.wcat` format is proprietary, so the script reads its exports instead.

1. In WinCatalog: **File → Export** → choose **CSV** or **XML**.
2. Save into the `wincatalog/` folder next to the script. All `*.csv` and `*.xml` files there are auto-loaded, so you can drop in as many discs as you want.
3. Run as normal: `python rc-jav.py --search IPZZ-860`
4. Override or add extra paths with `--catalog PATH` (file or folder, repeatable).
5. To change the default folder, edit `DEFAULT_CATALOG` at the top of the script.

Re-export when your catalog changes; the script re-reads on every run (catalog data is **not** cached because it's already a local file).

**Role of catalog hits:**
- Search: shown as rows with source label `Catalog`. The disc/volume name is encoded into the path so you know which offline backup holds the file.
- Dupe mode: catalog entries appear in groups for awareness but are **never marked KEEP or DELETE?** because they're offline and can't be touched. A group is only flagged as a dupe when 2+ rclone copies exist.

**CSV column auto-detection** (case-insensitive, first match wins):
- Name: `Name`, `File Name`, `Filename`, `Title`
- Path: `Path`, `Full Path`, `Location`, `Folder`
- Size: `Size`, `File Size`, `Bytes`, `Size (bytes)`
- Disc: `Disc`, `Disc Name`, `Disc Label`, `Volume`, `Source`, `Catalog`, `Media`

XML: walks the tree and treats `<File>` / `<f>` nodes inside `<Disc>` / `<Catalog>` / `<Volume>` containers as file entries, with `<Folder>` nesting.

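The CSV column auto-detection can be sketched with the alias lists above. `COLUMN_ALIASES` and `detect_columns` are hypothetical names, and "first match wins" is interpreted here as "the first alias in each list that appears in the header", which is an assumption.

```python
import csv
import io

# Alias lists copied from the README, in priority order.
COLUMN_ALIASES = {
    "name": ["Name", "File Name", "Filename", "Title"],
    "path": ["Path", "Full Path", "Location", "Folder"],
    "size": ["Size", "File Size", "Bytes", "Size (bytes)"],
    "disc": ["Disc", "Disc Name", "Disc Label", "Volume", "Source", "Catalog", "Media"],
}


def detect_columns(header):
    """Map each logical field to the actual header name, ignoring case."""
    by_lower = {col.strip().lower(): col for col in header}
    found = {}
    for field, aliases in COLUMN_ALIASES.items():
        for alias in aliases:
            if alias.lower() in by_lower:
                found[field] = by_lower[alias.lower()]
                break
    return found


# Usage with a tiny in-memory export (made-up sample data).
sample = io.StringIO("File Name,Location,Bytes,Volume\nSSIS-001.mp4,/JAV,123,Disc07\n")
reader = csv.DictReader(sample)
columns = detect_columns(reader.fieldnames)
row = next(reader)
print(row[columns["name"]], row[columns["disc"]])
```

Fields with no matching header are simply absent from the returned mapping, so callers can decide whether a missing size or disc column is fatal.
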
## Requirements

- Python 3.9+
- `pip install rich` (used for progress bars + themed output)
- `rclone` on `PATH` with the relevant remotes configured.

## UI

- Live per-file progress bar during scans (`rclone size --json` for total, then `rclone lsf --files-only -R --format pst` streamed).
- Banner panel showing run mode + per-remote cache status.
- Rich tables for search hits and duplicate groups.
- `--no-color` for uncolored output and `--basic` for plain text (CI, piping).

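With `--format pst`, `rclone lsf` prints path, size, and modification time on each line, separated by `;` by default. A streamed line could be parsed roughly as follows; `parse_lsf_line` is a hypothetical helper, and splitting from the right is a precaution in case a path itself contains `;`.

```python
from datetime import datetime


def parse_lsf_line(line):
    """Split one 'path;size;modtime' line from `rclone lsf --format pst`."""
    # rsplit keeps any ';' inside the path intact.
    path, size, modtime = line.rstrip("\n").rsplit(";", 2)
    return path, int(size), datetime.strptime(modtime, "%Y-%m-%d %H:%M:%S")
```

Each parsed line would advance the progress bar by one file, with the total taken from the earlier `rclone size --json` call.
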
## Roadmap

- Phase 1 (current): report duplicates + search.
- Phase 2: `--apply` mode that runs `rclone delete` on `DELETE?` candidates behind a confirmation gate.
- Phase 3: resolution-aware tiebreakers, move-to-review folder, scheduled runs.