admin f7fc15b17c Sync working tree before initial Gitea push
Includes:
- cli.py path fix (parents[1]) for config/catalog resolution
- Library cleanup feature design docs (TODO.md, mockup)
- Audit + bug-queue markdowns from May 2026 reliability pass
- .gitignore expanded for transient artifacts
2026-05-26 22:35:42 +02:00

# Bug Report — Native Host — audit-snapshot-2026-05-24T15-55Z.md
Snapshot: audit-snapshot-2026-05-24T15-55Z.md
Required-reading docs read: AGENTS.md / mockup / CACHE_CONTRACT.md / bug-audit-plan.md / project memory
Auditor agent: fresh Explore agent (chunk 2 auditor)
Verifier agents: fresh Explore agents per candidate, blind context, stricter contract-check prompt + external-vs-internal-input rule
**Chunk 2 calibration note:** Moderate verification yielded 2 confirmed bugs + 1 demoted (M→L) with 40% pure-rejection rate (2/5 REFUTED). Auditor's recurring weaknesses: (1) flagging gate logic that's fail-SAFE as if it were fail-OPEN (C-1), (2) ignoring browser/protocol-level caps when worrying about host-side validation (C-2). Stricter verifier prompt with external-input + protocol-spec checks caught both false positives. **Light candidates were NOT verified per audit-plan stop condition** (>30% rejection → halt L verification). See `bugs-candidates-host.md` for unverified L list (C-6, C-7, C-8, C-9, C-10, C-11) and Needs Input C-12.
---
## Severe (S)
(none flagged by auditor in this chunk)
---
## Moderate (M)
### M-1 — post_discord_alert blocks main message loop for up to 5 s
- **File:** `D:\DEV\Extensions\Production\rclone-jav\host\rcjav-host.py:174-289` (post_discord_alert refactored into `_discord_post_worker` + `_build_discord_body` helpers + public `post_discord_alert` thin wrapper after M-4 fix; was line 174-217 pre-fix), with callsites in `handle_test_alerts_config` + 4 main-loop sites (conn_close abnormal, read_message exception, handler exception, write_message exception)
- **Symptom (one sentence):** When a handler exception or abnormal port close fires AND the Discord webhook URL is configured AND Discord is slow/unreachable, the main message loop blocks for up to 5 seconds inside `urllib.request.urlopen(timeout=5)`, delaying the failure response to the extension by the same 5 s.
- **Why it's a bug:** All 5 callsites of `post_discord_alert` execute on the main thread that runs the native messaging loop. Callsites 2-5 are rate-limited via `_alert_rate_limited()` (LAST_ALERT_FILE check at line 184-185), so only the first exception in each 10-minute window blocks. Callsite 1 (`handle_test_alerts_config`) deliberately deletes LAST_ALERT_FILE to bypass rate limiting (line 258) before calling `post_discord_alert`, so every Test (host) button click is a guaranteed 5 s main-thread block when Discord is slow. During the block, the extension's RPC promise hangs waiting for the response.
- **Reproduction:**
1. Input: configure Discord webhook URL pointing at a slow/down endpoint (or kill network). Open Setup → Alerts → click Test (host).
2. Expected: test fires asynchronously; UI returns immediately with "sent (still pending)" or similar
3. Actual: Options page hangs ~5 s waiting for the host's RPC response, because host's main loop is blocked in urlopen
- **Suggested fix sketch:** spawn a background thread for `urlopen` (fire-and-forget), or use a 1 s timeout instead of 5 s, or move webhook delivery into a worker queue consumed by a dedicated thread. Mirror the extension-side webhook post pattern (which already uses `fetch().catch(...)` without blocking the SW event loop).
- **Verifier agent:** fresh Explore, blind context, stricter prompt
- **Verifier verdict:** CONFIRMED
- **Verifier confidence:** high
- **Contract refs verifier read:** native messaging response timing expectations; threading model of `main()`
- **Mirror check needed in:** extension-side `postDiscordAlert` in background.js — already non-blocking (uses fetch), but verify pattern consistency
- **Status:** fixed
- **Fix:** `D:\DEV\Extensions\Production\rclone-jav\host\rcjav-host.py:174-289` — refactored post_discord_alert into shared internal worker (`_discord_post_worker`) + helper (`_build_discord_body`). Two public modes: (a) `post_discord_alert(...)` spawns daemon thread, returns immediately (used by 4 main-loop callsites: conn_close, read_error, handler_exception, write_error — each now passes `alert_source` label for analytics); (b) `handle_test_alerts_config` builds payload, spawns same worker with event+holder, waits 6 s, returns synchronous pass/fail or explicit timeout error `"Discord webhook timed out after 6s; background post may still complete (see events.log)"`. Worker logs every outcome via `log_event("discord_post", ok=, status=, error=, alert_kind=, alert_source=, elapsed_ms=)` — visibility preserved despite async execution. Error text capped at 120 chars; never logs webhook URL or full payload. Main message loop no longer blocks on Discord. Manifest bumped 0.1.38 → 0.1.39. Python syntax verified via `py_compile`. Worker mechanics smoke-tested in isolation: bogus URL → 404 ok:False; bad domain → URLError ok:False with reason captured; fire-and-forget mode (no event/holder) → no raise. Test button still returns synchronous pass/fail for user experience.
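
The two modes described in the fix (fire-and-forget for the main-loop callsites, bounded wait for the Test button) can be illustrated with a minimal sketch. This is not the code in `rcjav-host.py`: the function names, the injected `send` callable (standing in for the real `urlopen` call), and the result keys are assumptions chosen for illustration.

```python
import threading
import time


def _post_worker(send, payload, done=None, holder=None):
    """Run the blocking send off the main thread; never raise into the caller."""
    started = time.monotonic()
    try:
        status = send(payload)  # hypothetical callable returning an HTTP status
        result = {"ok": 200 <= status < 300, "status": status, "error": None}
    except Exception as exc:
        # Cap the error text, as the fix notes describe (120 chars).
        result = {"ok": False, "status": None, "error": str(exc)[:120]}
    result["elapsed_ms"] = int((time.monotonic() - started) * 1000)
    if holder is not None:
        holder.update(result)  # publish the result before signalling
    if done is not None:
        done.set()


def post_alert_async(send, payload):
    """Fire-and-forget mode: returns immediately, result is only logged."""
    thread = threading.Thread(target=_post_worker, args=(send, payload), daemon=True)
    thread.start()
    return thread


def post_alert_and_wait(send, payload, wait_s=6.0):
    """Test-button mode: same worker, but wait a bounded time for the outcome."""
    done, holder = threading.Event(), {}
    threading.Thread(
        target=_post_worker, args=(send, payload, done, holder), daemon=True
    ).start()
    if not done.wait(timeout=wait_s):
        return {
            "ok": False,
            "error": f"timed out after {wait_s:g}s; background post may still complete",
        }
    return dict(holder)
```

Because the holder is filled before the event is set, a caller that wakes from `done.wait()` always sees a complete result. In the timeout branch the daemon thread keeps running, which matches the "background post may still complete" wording of the real error message.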
### M-2 — handle_scan returns success before _scan_worker can detect Popen failure
- **File:** `D:\DEV\Extensions\Production\rclone-jav\host\rcjav-host.py:2235-2264` (handle_scan) + `:2053-2110` (_scan_worker Popen path) + `:2211-2220` (_scan_worker exception path)
- **Symptom (one sentence):** When `subprocess.Popen` in `_scan_worker` fails (python missing, rc-jav.py path wrong, permission denied, etc.), `handle_scan` has already returned `{"ok": True, "started": True}` to the extension because the thread was started but had not yet executed Popen; extension shows "scan started" for 1-2 seconds before the next `scan-progress` poll surfaces the actual error.
- **Why it's a bug:** `handle_scan` calls `thread.start()` at line 2263 then returns at line 2264 without waiting for Popen to succeed. If Popen raises (line 2092-2098) the worker's exception handler writes `scan_ok: false, error: ...` to SCAN_STATE_FILE (line 2211-2220) — but the extension already received `ok: true` and only learns of the failure on the next progress poll. Race window: short (1-2 s typically) but user-visible — UI shows "scan started" then suddenly "scan failed" with cryptic OS-level error.
- **Reproduction:**
1. Input: trigger Rebuild Cache from extension while python is not on PATH (or rc-jav.py path mis-set, or cwd has permission issue)
2. Expected: handle_scan returns an error immediately so extension can show clear message before any "started" state
3. Actual: extension shows "scan started" briefly → next poll → "scan failed: FileNotFoundError" or similar OS error
- **Suggested fix sketch:** validate Popen preconditions synchronously in `handle_scan` before returning (python exists, rc-jav.py exists, cwd writable). OR use a sync event/queue from worker to handle_scan so it can wait briefly for the first state-file write before returning.
- **Verifier agent:** fresh Explore, blind context, stricter prompt
- **Verifier verdict:** CONFIRMED
- **Verifier confidence:** very high (100%)
- **Contract refs verifier read:** _scan_worker exception path; SCAN_STATE_FILE write timing; handle_scan_progress detection logic
- **Mirror check needed in:** none — Popen race specific to scan path; other RPCs run handlers synchronously
- **Status:** fixed
- **Fix:** `D:\DEV\Extensions\Production\rclone-jav\host\rcjav-host.py:2053-2305` — added per-invocation `spawn_event` (threading.Event) + `spawn_result` dict, both passed from `handle_scan` into `_scan_worker`. Worker sets `spawn_result["spawn_ok"] = True` immediately after `subprocess.Popen` returns OR `spawn_ok = False` + `error` on exception, then sets event. `handle_scan` waits up to 500 ms via `spawn_event.wait(timeout=0.5)` then branches: spawn_ok=True → `{ok: true, started: true}`; spawn_ok=False → `{ok: false, started: false, error}`; timeout → `{ok: true, started: true, startup_pending: true}` (backward compatible — existing UI ignores the new key). Per-invocation holder isolates the handoff from globals (`_scan_proc`) and state file (UI/progress surface) so cross-invocation contamination is impossible. Manifest bumped 0.1.36 → 0.1.37. Python syntax verified via `py_compile`. Threading harness smoke-tested in isolation: success → `{spawn_ok: True}` + event set; Popen fail (nonexistent binary) → `{spawn_ok: False, error: "[WinError 2] ..."}` + event set; slow Popen → event NOT set after 500 ms (timeout branch fires). All 3 cases behave correctly. **Runtime repro verified** via temporary instrumentation (injected `raise FileNotFoundError("simulated spawn fail")` immediately before the `subprocess.Popen` line in `_scan_worker`, reloaded extension, triggered Rebuild Cache, UI showed `scan failed: FileNotFoundError: simulated spawn fail` synchronously with no misleading "scan started" flash). Instrumentation reverted post-test; manifest stayed at 0.1.37 because no code-of-record change. **Note:** the bad-rcjavPath test (point Setup → rcjavPath to non-existent path) does NOT exercise this fix path — that goes through Popen success → rc-jav.py exits 2 → existing async exception handler. M-3 specifically targets Popen-itself-raising, which is reachable via Python-on-PATH missing, OS permission denied at spawn time, or analogous OS-level interference. 
Use the instrumented-raise technique for any future regression test.
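
The per-invocation handoff described in the fix can be sketched as follows. This is a simplified illustration under assumptions, not the production code: the `_sketch` function names and the `cmd` parameter are hypothetical, and the real worker also parses progress and writes SCAN_STATE_FILE, which is omitted here.

```python
import subprocess
import sys
import threading


def scan_worker_sketch(cmd, spawn_event, spawn_result):
    """Spawn the child, report spawn success/failure once, then keep working."""
    try:
        proc = subprocess.Popen(cmd, stdout=subprocess.DEVNULL, stderr=subprocess.PIPE)
    except Exception as exc:
        spawn_result.update(spawn_ok=False, error=f"{type(exc).__name__}: {exc}")
        spawn_event.set()
        return
    spawn_result["spawn_ok"] = True
    spawn_event.set()
    for _line in proc.stderr:  # stand-in for the real progress parsing
        pass
    proc.wait()


def handle_scan_sketch(cmd, wait_s=0.5):
    """Wait briefly for the spawn outcome before answering the RPC."""
    spawn_event, spawn_result = threading.Event(), {}  # fresh per invocation
    threading.Thread(
        target=scan_worker_sketch, args=(cmd, spawn_event, spawn_result), daemon=True
    ).start()
    if not spawn_event.wait(timeout=wait_s):
        # Spawn outcome still unknown: keep the old optimistic reply plus a flag.
        return {"ok": True, "started": True, "startup_pending": True}
    if spawn_result.get("spawn_ok"):
        return {"ok": True, "started": True}
    return {
        "ok": False,
        "started": False,
        "error": spawn_result.get("error", "unknown spawn failure"),
    }
```

Calling `handle_scan_sketch(["no-such-binary"])` returns the `ok: False` branch because `Popen` raises before the event is set. This sketch reproduces the failure only because the binary itself is missing; a bad script path with a valid interpreter still spawns successfully, as the note above explains.
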
---
## Light (L)
### L-1 — Stderr blocking read freezes progress display for up to 5 s on rc-jav stall
- **File:** `D:\DEV\Extensions\Production\rclone-jav\host\rcjav-host.py:2053-2227` (_scan_worker), specifically `:2101` (stderr iterator loop), `:2267-2275` (deferred kill)
- **Symptom (one sentence):** When rc-jav.py stalls mid-scan (e.g. rclone blocked on unresponsive remote), the `for raw in proc.stderr:` iterator at line 2101 blocks until either a stderr line arrives or proc exits — during which the scan-state file is not updated, so the extension's progress display shows stale state for up to 5 s (until the deferred-kill mechanism forces proc.terminate).
- **Why it's a bug (demoted from M to L):** Originally flagged as M. The re-verifier confirmed the blocking is real, but no data loss occurs, cancel still works (delayed by up to 5 s until terminate fires), and no zombie process is left behind. This is a pure UX progress freeze, not workflow-breaking.
- **Reproduction:**
1. Input: rclone remote becomes unresponsive mid-scan
2. Expected: progress display updates with "stalled, will cancel in <N>s" indicator, OR heartbeat that resumes when remote recovers
3. Actual: progress frozen for 5 s, then deferred kill fires, scan marked complete with last-known progress
- **Suggested fix sketch:** add a watchdog timer that emits a heartbeat to SCAN_STATE_FILE every 1-2 s while stderr is silent, OR make stderr reads non-blocking (select/poll does not work on pipes on Windows, so a reader thread feeding a queue is the cross-platform route)
- **Verifier agent:** fresh Explore, blind context, stricter prompt
- **Verifier verdict:** PARTIAL — symptom real, severity originally over-stated
- **Verifier confidence:** high (100%)
- **Contract refs verifier read:** cancel path; deferred-kill behavior; SCAN_STATE_FILE update timing
- **Mirror check needed in:** none
- **Status:** open
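
One way the suggested heartbeat could look, as a minimal sketch: a reader thread drains stderr into a queue, and the consumer uses the queue timeout as its heartbeat tick. The names `watch_stderr`, `on_line`, and `on_heartbeat` are hypothetical; in the host, `on_heartbeat` would presumably refresh SCAN_STATE_FILE, which is not shown here.

```python
import queue
import subprocess
import sys
import threading


def _pump(stream, q):
    """Blocking reader: runs in its own thread so the consumer can time out."""
    for raw in stream:
        q.put(raw)
    q.put(None)  # sentinel: stream reached EOF


def watch_stderr(proc, on_line, on_heartbeat, heartbeat_s=1.0):
    """Forward stderr lines; call on_heartbeat whenever stderr stays silent."""
    q = queue.Queue()
    threading.Thread(target=_pump, args=(proc.stderr, q), daemon=True).start()
    while True:
        try:
            raw = q.get(timeout=heartbeat_s)
        except queue.Empty:
            on_heartbeat()  # child is silent; signal that we are still alive
            continue
        if raw is None:
            break
        on_line(raw)
    return proc.wait()
```

The loop only ends at EOF, so a stalled child still depends on the existing cancel and deferred-kill path to terminate it. The heartbeat changes what the progress display can show during the stall, not how the stall is resolved.
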
---
## Needs Input (N)
(C-12 from candidates was N — _load_host_cache memoization key collision — left unverified per stop condition; candidate scratch retains it)
---
## False Positives (discarded)
- `host/rcjav-host.py:1216-1221` (_path_in_allowed_prefixes case-sensitivity) — flagged as Moderate "security bypass via uppercase remote". REFUTED. The gate is fail-SAFE, not fail-OPEN: case-mismatch causes the comparison to fail, which REJECTS the operation. No bypass possible. Verifier noted a related usability issue (legitimate uppercase paths get confusing rejection) but that's a UX gap, not a security bug.
- `host/rcjav-host.py:306-316` (read_message unbounded length prefix) — flagged as Moderate "DoS via 4 GiB length". REFUTED. Chrome native messaging protocol caps extension-to-host messages at 64 MiB browser-side per Chrome dev docs. Non-Brave processes cannot write to host stdin (it's piped by the browser into the host child process). The theoretical 4 GiB read cannot actually be triggered through any practical attack surface. Pure defensive-coding gap, not a real DoS.
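
For reference, closing the defensive-coding gap noted in the last item would be small. The sketch below assumes the standard native messaging framing (a 4-byte length in native byte order followed by UTF-8 JSON). The function name, the cap constant, and the error handling are illustrative assumptions, not the existing `read_message`.

```python
import io
import json
import struct

# Assumption: mirror the 64 MiB browser-side cap cited above.
MAX_MESSAGE_BYTES = 64 * 1024 * 1024


def read_message_capped(stream, max_bytes=MAX_MESSAGE_BYTES):
    """Read one length-prefixed JSON message, refusing oversized declared lengths."""
    header = stream.read(4)
    if len(header) == 0:
        return None  # clean EOF: the browser closed the pipe
    if len(header) < 4:
        raise ValueError("truncated length prefix")
    (length,) = struct.unpack("=I", header)  # 32-bit unsigned, native byte order
    if length > max_bytes:
        # Reject before allocating or reading the body.
        raise ValueError(f"declared length {length} exceeds cap {max_bytes}")
    body = stream.read(length)
    if len(body) < length:
        raise ValueError("truncated message body")
    return json.loads(body.decode("utf-8"))
```

In the host this would be called with `sys.stdin.buffer`. The sketch takes any binary stream so it can be exercised with `io.BytesIO`.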