# rc-jav
Read-only duplicate scanner for JAV files across rclone remotes. Groups files by JAV ID (e.g. `SSIS-001`) and reports which copy to keep based on priority rules.
## Priority rules
1. Video files inside configured **VIP folders** win first. Default VIP folder: `ClearJAV`.
2. If no VIP-folder video exists, **Source always wins** regardless of resolution/size.
3. `.ts` files rank below other video containers, even when the transport-stream copy is larger.
4. If no Source copy exists in the group, **largest file size wins** among the remaining Targets.
5. Suggestions only — script never deletes. Manual cleanup.
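The rules above can be read as a single sort key. The sketch below is one plausible interpretation, not the script's actual code: the `Copy` class, the `role` labels, and the exact precedence between rule 2 and rule 3 (here a `.ts` Source still beats a non-`.ts` Target) are assumptions made for illustration.

```python
from dataclasses import dataclass

VIP_FOLDERS = ("ClearJAV",)  # default VIP folder named in the rules


@dataclass
class Copy:
    path: str  # full remote path, e.g. "cq:personal-files/JAV/TMP/SSIS-001.mp4"
    role: str  # "source" or "target" (hypothetical labels)
    size: int  # bytes


def keep_key(copy: Copy) -> tuple:
    # Split "remote:dir/dir/file" into folder names, dropping the filename.
    folders = copy.path.replace(":", "/").split("/")[:-1]
    in_vip = any(folder in VIP_FOLDERS for folder in folders)
    is_ts = copy.path.lower().endswith(".ts")
    # Compared left to right: VIP folder, Source role, non-.ts container, size.
    return (in_vip, copy.role == "source", not is_ts, copy.size)


def pick_keep(copies: list[Copy]) -> Copy:
    """Return the copy that would be suggested as KEEP under this reading."""
    return max(copies, key=keep_key)
```

Because tuples compare element by element, size only matters when every earlier rule ties, which matches "largest file size wins among the remaining Targets".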
## ID matching
Filename stem is matched against:
- Primary: `^([A-Za-z]+)-(\d+)`, e.g. `SSIS-001`, `MIDV-123`, `ABP-456`
- Compound: `^(\w+(?:-\w+)+)-(\d+)`, e.g. `FC2-PPV-4894535`, `HEYZO-HD-1234`
- Fallback: `^([A-Za-z0-9]+)-(\d+)`, e.g. `1pondo-123`, `carib-456`
IDs normalized to uppercase with leading zeros stripped from the number (so `ssis-001` == `SSIS-1` == `SSIS-001`). Anything after the ID (` - Actress [1080p]`) is ignored for matching.
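A minimal sketch of the matching and normalization described above. The three regexes are copied from the list; trying them in the listed order and the function name `normalize_id` are assumptions, not confirmed details of `rc-jav.py`.

```python
import re
from typing import Optional

ID_PATTERNS = [
    re.compile(r"^([A-Za-z]+)-(\d+)"),      # Primary
    re.compile(r"^(\w+(?:-\w+)+)-(\d+)"),   # Compound
    re.compile(r"^([A-Za-z0-9]+)-(\d+)"),   # Fallback
]


def normalize_id(stem: str) -> Optional[str]:
    """Return the normalized ID for a filename stem, or None if nothing matches."""
    for pattern in ID_PATTERNS:
        m = pattern.match(stem)
        if m:
            prefix, number = m.group(1), m.group(2)
            # Uppercase the prefix; int() drops leading zeros from the number.
            return f"{prefix.upper()}-{int(number)}"
    return None
```

Since the patterns are anchored only at the start, trailing text such as ` - Actress [1080p]` does not affect the result. A stem for which this returns `None` corresponds to a file that would end up under "Skipped".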
### Part-suffix handling
Multi-part files (`_1`, `_2`, `-1`, `-2`, `_A`, `_B`, `.1of4`, ` (1)`, `-pt1`, `-part1`, `-cd1`, `-disc1`, trailing ` N`) are normalized as `{ID}#partN` so they do not collide as false duplicates. Searching the base ID still finds all parts. Lettered `_A` / `_B` suffixes become part 1 / part 2.
Add more suffix shapes with repeatable `--part-pattern` regexes. The first capture group is the part number or one part letter and the pattern runs against the filename stem:
```powershell
python rc-jav.py --scan --part-pattern '[-_ ]side[-_ ]?(A|B)$'
python rc-jav.py --part-pattern '_([CD])$' --save
```
Saved rules live in `config.json` as `part_patterns`. The extension Options page has the same custom part detector list for host-triggered searches, duplicate review, and cache rebuilds.
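To illustrate how a custom part pattern might be applied, here is a hedged sketch. `part_from_stem` and `part_key` are hypothetical helper names, and mapping letters beyond `A`/`B` (for example `C` to part 3) is an assumption extrapolated from the `_A` / `_B` rule.

```python
import re
from typing import Optional


def part_from_stem(stem: str, patterns: list[str]) -> Optional[int]:
    """Return the part number found by the first matching pattern, if any."""
    for pattern in patterns:
        m = re.search(pattern, stem)
        if not m:
            continue
        token = m.group(1)  # first capture group: a number or one letter
        if token.isdigit():
            return int(token)
        if len(token) == 1 and token.isascii() and token.isalpha():
            return ord(token.upper()) - ord("A") + 1  # A -> 1, B -> 2, ...
    return None


def part_key(base_id: str, part: Optional[int]) -> str:
    """Build the `{ID}#partN` grouping key; single-part files keep the base ID."""
    return f"{base_id}#part{part}" if part else base_id
```

With the pattern from the first command above, `part_from_stem("SSIS-001-sideB", [r"[-_ ]side[-_ ]?(A|B)$"])` yields part 2, so the file would be grouped under `SSIS-1#part2` rather than colliding with `SSIS-1`.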
Files with no parseable ID are listed under "Skipped" at the end so you can spot misnamed files.
### Rule checks
Focused rule tests cover ID extraction, multipart grouping safety, and duplicate KEEP ranking:
```powershell
python -B -m unittest discover -s tests -v
```
## Usage
```
python rc-jav.py \
--source cq:personal-files/ClearJAV/ichika-matsumoto \
--target cq:personal-files/JAV/TMP
```
Flags:
- `--source` / `-s REMOTE` — priority remote path. Repeat for multiple.
- `--target` / `-t REMOTE` — non-priority remote path. Repeat for multiple.
- `--format {console,txt,csv,json,all}` — default `console`. Non-console formats write to `--output-dir`.
- `--output-dir DIR` — default `./reports`.
- `--no-color` — disable ANSI colors but keep rich layout (tables, panels).
- `--basic` — plain text output, no rich tables/panels. Progress ticks every 25 files on stderr. Useful for piping or simple terminals.
- `--rclone-bin PATH` — path to rclone executable (default: `rclone` on PATH). Example: `--rclone-bin C:\Programs\rclone\rclone.exe`.
- `--clearjav` — shortcut: sets source = `DEFAULT_SOURCE`, target = `DEFAULT_TARGET`. Equivalent to `--source cq:personal-files/ClearJAV --target cq:personal-files/JAV/TMP`. Combine with `--source`/`--target` to override one side.
Examples:
```
# full library dupe scan, one flag
python rc-jav.py --clearjav
# same but only check one actress folder against TMP
python rc-jav.py --clearjav --source cq:personal-files/ClearJAV/ichika-matsumoto
```
## Search mode
Check whether a JAV ID already exists in your library before downloading:
```
python rc-jav.py --search SSIS-001
python rc-jav.py --search SSIS-001 --search FC2-PPV-4894535
# wildcards (quote to avoid shell glob expansion)
python rc-jav.py --search "IPZZ-*"
python rc-jav.py --search "FC2-PPV-*"
python rc-jav.py --search "SSIS-???" # exactly three characters after the prefix
```
Wildcard syntax: `*` (any chars) and `?` (one char), case-insensitive. Matches against normalized IDs in the index, including `#partN` suffixes automatically.
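A rough sketch of this wildcard matching using the standard library's `fnmatch`. The function name and the shape of the index (a flat list of normalized IDs) are assumptions; the real script may store and match IDs differently.

```python
from fnmatch import fnmatchcase


def wildcard_hits(pattern: str, index_ids: list[str]) -> list[str]:
    """Return index IDs whose base ID matches the pattern, case-insensitively."""
    wanted = pattern.upper()
    hits = []
    for full_id in index_ids:
        # Match on the base ID so "ID#partN" entries are found automatically.
        base = full_id.split("#", 1)[0]
        if fnmatchcase(base.upper(), wanted):
            hits.append(full_id)
    return hits
```

For example, `wildcard_hits("ipzz-*", ["IPZZ-860", "IPZZ-861#part2", "SSIS-1"])` keeps the first two entries.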
Range syntax: `[N-M]` inclusive both ends. Works inside any prefix.
```
python rc-jav.py --search "IPZZ-[820-860]"
python rc-jav.py --search "FC2-PPV-[4894500-4894600]"
python rc-jav.py --search "MIDV-[001-010]" # zero-padding preserved
```
Quote in PowerShell/bash so `[...]` reaches Python literally. Reversed ranges (`860-820`) auto-swap.
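The range behavior can be sketched as a small expansion step. This is an illustration under assumptions: `expand_range` is a hypothetical name, and padding each number to the width of the shorter written bound is only one way to get "zero-padding preserved".

```python
import re

RANGE_RE = re.compile(r"^(.*)\[(\d+)-(\d+)\](.*)$")


def expand_range(query: str) -> list[str]:
    """Expand "PREFIX-[N-M]" into one ID per number, inclusive on both ends."""
    m = RANGE_RE.match(query)
    if not m:
        return [query]  # not a range query, pass through unchanged
    head, a, b, tail = m.groups()
    lo, hi = sorted((int(a), int(b)))  # reversed ranges auto-swap
    width = min(len(a), len(b))        # keep zero-padding as written
    return [f"{head}{str(n).zfill(width)}{tail}" for n in range(lo, hi + 1)]
```

`expand_range("MIDV-[001-003]")` gives `MIDV-001`, `MIDV-002`, `MIDV-003`, and `expand_range("IPZZ-[860-858]")` gives the same list as `[858-860]`.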
With no `--source` / `--target` flags, only `DEFAULT_TARGET` (TMP) is scanned — the typical case for "do I already have this in my unsorted pile?". Pass `--source cq:personal-files/ClearJAV` to also check the priority library. Edit `DEFAULT_SOURCE` / `DEFAULT_TARGET` at the top of the script to change defaults. Remote scans are recursive.
Exit code: `0` if every query had at least one hit, `1` otherwise — useful for shell automation.
## Name search (`--name`)
Substring search against filenames (case-insensitive). Find all files by actress, studio, tag, anything that appears in the filename.
```
python rc-jav.py --name Ichika
python rc-jav.py --name "Ichika Matsumoto"
python rc-jav.py --name Ichika --name Yui # OR — files matching either
python rc-jav.py --name "Mat*" # glob wildcard
python rc-jav.py --search IPZZ-860 --name Ichika # both — separate result blocks
```
- Multiple `--name` tokens = OR. Use one combined `--name "foo bar"` for AND/exact-substring.
- Matches against the filename stem only (not folder names).
- Auto-routes to **cached** mode because substring globs can't be server-side filtered on most backends. Pass `-q` to force quick anyway (slower).
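The plain-substring part of these rules can be sketched as follows. The function name is hypothetical and glob tokens such as `"Mat*"` are deliberately left out, since how the script anchors them is not stated here.

```python
from pathlib import PurePosixPath


def name_hits(tokens: list[str], paths: list[str]) -> list[str]:
    """Return paths whose filename stem contains any token (case-insensitive OR)."""
    needles = [token.lower() for token in tokens]
    hits = []
    for path in paths:
        # Only the stem is searched, so folder names never produce a hit.
        stem = PurePosixPath(path).stem.lower()
        if any(needle in stem for needle in needles):
            hits.append(path)
    return hits
```

A single token such as `"Ichika Matsumoto"` must appear as one contiguous substring, which is why combining words into one `--name` value behaves like an AND.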
### Smart search mode (auto quick / cached)
The script auto-picks the right execution path per query and prints which one it chose:
| Query shape | Picked mode | Reason |
|---|---|---|
| Single exact ID (`IPZZ-860`) | quick | live rclone `--include`, ~12s even on huge trees |
| Wildcard (`IPZZ-*`, `SSIS-???`) | cached | reliable normalized matching |
| Range (`IPZZ-[820-860]`) | cached | avoids N rclone calls |
| Multiple `--search` flags | cached | warmup amortizes |
Override:
- `--quick` / `-q` — force live rclone lookup (skips cache).
- `--cache` — force cache (builds it if cold).
Quick mode never reads or writes the cache. Cache mode honors `--update` and `--no-cache` as before.
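The selection table and overrides can be summarized in a few lines. This is a sketch of the decision logic as documented, with a hypothetical function name and argument shape; it does not reproduce the script's internals.

```python
import re
from typing import Optional


def pick_mode(search: list[str], names: list[str], force: Optional[str] = None) -> str:
    """Return "quick" or "cached" for a set of --search / --name queries."""
    if force in ("quick", "cached"):  # --quick / --cache override everything
        return force
    if names:                         # --name queries route to the cache
        return "cached"
    if len(search) != 1:              # multiple --search flags: warmup amortizes
        return "cached"
    query = search[0]
    if "*" in query or "?" in query:  # wildcard
        return "cached"
    if re.search(r"\[\d+-\d+\]", query):  # range
        return "cached"
    return "quick"                    # single exact ID
```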
### Cache
Search mode caches each remote's file list in `./cache.json` next to the script. Subsequent searches are near-instant.
- First run: scans + writes cache.
- Later runs: reads cache (banner shows `CACHED 14m (154 files)`).
- `--update` / `-u`: force re-scan + overwrite cache for the requested remotes.
- `--no-cache`: bypass cache (no read, no write).
- Stale warning when cache is older than 24h — still used, marked `CACHED-STALE`.
- Ctrl+C during a scan: rclone is terminated, cache for in-flight remote is NOT written.
Delete `cache.json` to reset everything.
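As an illustration of the 24h staleness rule and the banner text, here is a small sketch. The layout of `cache.json` is not documented here, so the function takes a plain timestamp and file count; the name `cache_banner` and the hour-based age format for older caches are assumptions.

```python
STALE_AFTER_SECONDS = 24 * 60 * 60  # stale warning threshold from the list above


def cache_banner(scanned_at: float, file_count: int, now: float) -> str:
    """Format a per-remote cache status such as "CACHED 14m (154 files)"."""
    age = max(0, int(now - scanned_at))
    # A stale cache is still used, only labeled differently.
    label = "CACHED-STALE" if age > STALE_AFTER_SECONDS else "CACHED"
    age_text = f"{age // 60}m" if age < 3600 else f"{age // 3600}h"
    return f"{label} {age_text} ({file_count} files)"
```

In real use `now` would come from `time.time()` and `scanned_at` from whatever timestamp the cache stores for that remote.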
### Saving defaults (--save)
Persist `--source`, `--target`, `--catalog`, and/or `--part-pattern` to `config.json` so you don't have to type them every run.
```
# set default target
python rc-jav.py --target cq:personal-files/JAV/TMP --save
# set source + multiple targets at once
python rc-jav.py --source cq:personal-files/ClearJAV ^
--target cq:personal-files/JAV/TMP ^
--target cq:personal-files/JAV/SORTED ^
--save
# inspect
type config.json
```
Only the keys you explicitly pass are written — running `--save --target X` won't wipe a saved `default_source`. Delete `config.json` to reset to the hardcoded defaults at the top of `rc-jav.py`.
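The "only explicitly passed keys are written" behavior amounts to a merge rather than an overwrite. A minimal sketch, assuming a flat JSON object; `save_defaults` is a hypothetical name and `default_target` is an assumed key (only `default_source` and `part_patterns` are named in this README).

```python
import json
from pathlib import Path


def save_defaults(config_path: Path, **passed) -> dict:
    """Merge explicitly passed settings into config.json and return the result."""
    config = {}
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    for key, value in passed.items():
        if value is not None:  # flags that were not given stay untouched
            config[key] = value
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config
```

Calling it with only `default_target=[...]` leaves a previously saved `default_source` in place.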
### Scan-only (--scan)
Refresh the cache without running a search or dupe report — useful for Task Scheduler / cron pre-warming.
```
# default: refresh DEFAULT_TARGET (TMP)
python rc-jav.py --scan
# refresh both source and target
python rc-jav.py --scan --source cq:personal-files/ClearJAV --target cq:personal-files/JAV/TMP
# nightly via Task Scheduler
schtasks /Create /SC DAILY /ST 03:00 /TN "rc-jav nightly scan" ^
/TR "python D:\DEV\Project\rclone-jav\rc-jav.py --scan --basic"
```
`--scan` always overwrites the cache for the remotes you list. Exit 0 = success, non-zero = rclone failure.
The search-mode exit code described under Search mode can be checked the same way in PowerShell:
```
python rc-jav.py --search MIDV-999 ; if ($LASTEXITCODE -eq 0) { "have it" } else { "download" }
```
## WinCatalog integration
WinCatalog's native `.wcat` format is proprietary, so the script reads its exports instead.
1. In WinCatalog: **File → Export** → choose **CSV** or **XML**.
2. Save into the `wincatalog/` folder next to the script. All `*.csv` and `*.xml` files there are auto-loaded — drop in as many discs as you want.
3. Run as normal: `python rc-jav.py --search IPZZ-860`
4. Override or add extra paths with `--catalog PATH` (file or folder, repeatable).
5. To change the default folder, edit `DEFAULT_CATALOG` at the top of the script.
Re-export when your catalog changes; the script re-reads on every run (catalog data is **not** cached — it's already a local file).
**Role of catalog hits:**
- Search: shown as rows with source label `Catalog`. The disc/volume name is encoded into the path so you know which offline backup holds the file.
- Dupe mode: catalog entries appear in groups for awareness but are **never marked KEEP or DELETE?** — they're offline, can't be touched. A group is only flagged as a dupe when 2+ rclone copies exist.
**CSV column auto-detection** (case-insensitive, first match wins):
- Name: `Name`, `File Name`, `Filename`, `Title`
- Path: `Path`, `Full Path`, `Location`, `Folder`
- Size: `Size`, `File Size`, `Bytes`, `Size (bytes)`
- Disc: `Disc`, `Disc Name`, `Disc Label`, `Volume`, `Source`, `Catalog`, `Media`
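A sketch of how this header detection could work. The candidate lists are copied from above; reading "first match wins" as "first candidate in list order" and the name `detect_columns` are assumptions.

```python
import csv
import io

COLUMN_CANDIDATES = {
    "name": ["Name", "File Name", "Filename", "Title"],
    "path": ["Path", "Full Path", "Location", "Folder"],
    "size": ["Size", "File Size", "Bytes", "Size (bytes)"],
    "disc": ["Disc", "Disc Name", "Disc Label", "Volume", "Source", "Catalog", "Media"],
}


def detect_columns(header: list[str]) -> dict:
    """Map each logical field to the actual CSV column name, where one exists."""
    by_lower = {}
    for column in header:
        by_lower.setdefault(column.strip().lower(), column)
    found = {}
    for field, candidates in COLUMN_CANDIDATES.items():
        for candidate in candidates:  # earlier candidates take priority
            if candidate.lower() in by_lower:
                found[field] = by_lower[candidate.lower()]
                break
    return found


# Example with an in-memory export; a real export would be opened from wincatalog/.
sample = "FILE NAME,Location,Bytes,Volume\nSSIS-001.mp4,/JAV,123,Disc07\n"
reader = csv.DictReader(io.StringIO(sample))
columns = detect_columns(reader.fieldnames)
rows = [{field: row[col] for field, col in columns.items()} for row in reader]
```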
XML: walks the tree and treats `<File>` / `<f>` nodes inside `<Disc>` / `<Catalog>` / `<Volume>` containers as file entries, following `<Folder>` nesting.
## Requirements
- Python 3.9+
- `pip install rich` (used for progress bars + themed output)
- `rclone` on `PATH` with the relevant remotes configured.
## UI
- Live per-file progress bar during scans (`rclone size --json` for total, then `rclone lsf --files-only -R --format pst` streamed).
- Banner panel showing run mode + per-remote cache status.
- Rich tables for search hits and duplicate groups.
- `--no-color` to drop ANSI colors while keeping the rich layout; `--basic` for plain text output (CI, piping).
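For reference, each line streamed by `rclone lsf --format pst` carries path, size, and modification time joined by rclone's default `;` separator. The parser below is a hedged sketch of consuming that stream; `Entry` and `parse_lsf_line` are hypothetical names, not taken from the script.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    path: str
    size: int      # bytes
    modified: str  # e.g. "2024-01-02 03:04:05"


def parse_lsf_line(line: str) -> Entry:
    """Parse one "path;size;time" line from `rclone lsf --format pst`."""
    # Split from the right so a ";" inside the path does not break parsing.
    path, size, modified = line.rstrip("\r\n").rsplit(";", 2)
    return Entry(path, int(size), modified)
```

Advancing the progress bar once per parsed line, against the total from `rclone size --json`, would give the per-file progress described above.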
## Roadmap
- Phase 1 (current): report duplicates + search.
- Phase 2: `--apply` mode that runs `rclone delete` on `DELETE?` candidates behind a confirmation gate.
- Phase 3: resolution-aware tiebreakers, move-to-review folder, scheduled runs.