admin f7fc15b17c Sync working tree before initial Gitea push
Includes:
- cli.py path fix (parents[1]) for config/catalog resolution
- Library cleanup feature design docs (TODO.md, mockup)
- Audit + bug-queue markdowns from May 2026 reliability pass
- .gitignore expanded for transient artifacts
2026-05-26 22:35:42 +02:00


Bug Report — Native Host — audit-snapshot-2026-05-24T15-55Z.md

  • Snapshot: audit-snapshot-2026-05-24T15-55Z.md
  • Required-reading docs read: AGENTS.md / mockup / CACHE_CONTRACT.md / bug-audit-plan.md / project memory
  • Auditor agent: fresh Explore agent (chunk 2 auditor)
  • Verifier agents: fresh Explore agents per candidate, blind context, stricter contract-check prompt + external-vs-internal-input rule

Chunk 2 calibration note: Moderate verification yielded 2 confirmed bugs + 1 demoted (M→L) with 40% pure-rejection rate (2/5 REFUTED). Auditor's recurring weaknesses: (1) flagging gate logic that's fail-SAFE as if it were fail-OPEN (C-1), (2) ignoring browser/protocol-level caps when worrying about host-side validation (C-2). Stricter verifier prompt with external-input + protocol-spec checks caught both false positives. Light candidates were NOT verified per audit-plan stop condition (>30% rejection → halt L verification). See bugs-candidates-host.md for unverified L list (C-6, C-7, C-8, C-9, C-10, C-11) and Needs Input C-12.


Severe (S)

(none flagged by auditor in this chunk)


Moderate (M)

M-1 — post_discord_alert blocks main message loop for up to 5 s

  • File: D:\DEV\Extensions\Production\rclone-jav\host\rcjav-host.py:174-289 (post_discord_alert refactored into _discord_post_worker + _build_discord_body helpers + public post_discord_alert thin wrapper by this fix; was lines 174-217 pre-fix), with callsites in handle_test_alerts_config + 4 main-loop sites (conn_close abnormal, read_message exception, handler exception, write_message exception)
  • Symptom (one sentence): When a handler exception or abnormal port close fires AND the Discord webhook URL is configured AND Discord is slow/unreachable, the main message loop blocks for up to 5 seconds inside urllib.request.urlopen(timeout=5), delaying the failure response to the extension by the same 5 s.
  • Why it's a bug: All 5 callsites of post_discord_alert execute on the main thread that runs the native messaging loop. Callsites 2-5 are rate-limited via _alert_rate_limited() (LAST_ALERT_FILE check at lines 184-185), so only the first exception per 10-minute window blocks. Callsite 1 (handle_test_alerts_config) deliberately deletes LAST_ALERT_FILE to bypass rate limiting (line 258) before calling post_discord_alert, so every Test (host) button click is a guaranteed 5 s main-thread block when Discord is slow. During the block, the extension's RPC promise hangs waiting for the response.
  • Reproduction:
    1. Input: configure Discord webhook URL pointing at a slow/down endpoint (or kill network). Open Setup → Alerts → click Test (host).
    2. Expected: test fires asynchronously; UI returns immediately with "sent (still pending)" or similar
    3. Actual: Options page hangs ~5 s waiting for the host's RPC response, because host's main loop is blocked in urlopen
  • Suggested fix sketch: spawn a background thread for urlopen (fire-and-forget), or use a 1 s timeout instead of 5 s, or move webhook delivery into a worker queue consumed by a dedicated thread. Mirror the extension-side webhook post pattern (which already uses fetch().catch(...) without blocking the SW event loop).
  • Verifier agent: fresh Explore, blind context, stricter prompt
  • Verifier verdict: CONFIRMED
  • Verifier confidence: high
  • Contract refs verifier read: native messaging response timing expectations; threading model of main()
  • Mirror check needed in: extension-side postDiscordAlert in background.js — already non-blocking (uses fetch), but verify pattern consistency
  • Status: fixed
  • Fix: D:\DEV\Extensions\Production\rclone-jav\host\rcjav-host.py:174-289 — refactored post_discord_alert into shared internal worker (_discord_post_worker) + helper (_build_discord_body). Two public modes: (a) post_discord_alert(...) spawns daemon thread, returns immediately (used by 4 main-loop callsites: conn_close, read_error, handler_exception, write_error — each now passes alert_source label for analytics); (b) handle_test_alerts_config builds payload, spawns same worker with event+holder, waits 6 s, returns synchronous pass/fail or explicit timeout error "Discord webhook timed out after 6s; background post may still complete (see events.log)". Worker logs every outcome via log_event("discord_post", ok=, status=, error=, alert_kind=, alert_source=, elapsed_ms=) — visibility preserved despite async execution. Error text capped at 120 chars; never logs webhook URL or full payload. Main message loop no longer blocks on Discord. Manifest bumped 0.1.38 → 0.1.39. Python syntax verified via py_compile. Worker mechanics smoke-tested in isolation: bogus URL → 404 ok:False; bad domain → URLError ok:False with reason captured; fire-and-forget mode (no event/holder) → no raise. Test button still returns synchronous pass/fail for user experience.
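
The two delivery modes described in the M-1 fix can be sketched as below. This is a minimal illustration, not the code in rcjav-host.py: the function names (discord_post_worker, post_alert_async, post_alert_and_wait) and the injectable send callable are assumptions made so the sketch runs without a network. In the real host, send would wrap the urllib.request.urlopen call.

```python
import threading
import time


def discord_post_worker(send, body, holder=None, done=None):
    """Run send(body) and record the outcome; never raises to the caller."""
    started = time.monotonic()
    result = {"ok": False, "error": None}
    try:
        send(body)  # hypothetical callable, e.g. a urlopen(..., timeout=5) wrapper
        result["ok"] = True
    except Exception as exc:
        result["error"] = str(exc)[:120]  # cap error text at 120 chars
    result["elapsed_ms"] = int((time.monotonic() - started) * 1000)
    if holder is not None:
        holder.update(result)  # publish the result before signalling
    if done is not None:
        done.set()


def post_alert_async(send, body):
    """Fire-and-forget mode: returns immediately, a daemon thread does the I/O."""
    thread = threading.Thread(
        target=discord_post_worker, args=(send, body), daemon=True
    )
    thread.start()
    return thread


def post_alert_and_wait(send, body, wait_s=6.0):
    """Test-button mode: wait up to wait_s for a synchronous pass/fail."""
    holder, done = {}, threading.Event()
    threading.Thread(
        target=discord_post_worker, args=(send, body, holder, done), daemon=True
    ).start()
    if not done.wait(timeout=wait_s):
        return {
            "ok": False,
            "error": f"timed out after {wait_s:g}s; background post may still complete",
        }
    return dict(holder)
```

Because the worker fills the holder before setting the event, a caller that sees the event set can read the result without further locking. The wait in test mode (6 s) is deliberately longer than the 5 s urlopen timeout, so a timeout error from the worker normally arrives before the wait expires.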

M-2 — handle_scan returns success before _scan_worker can detect Popen failure

  • File: D:\DEV\Extensions\Production\rclone-jav\host\rcjav-host.py:2235-2264 (handle_scan) + :2053-2110 (_scan_worker Popen path) + :2211-2220 (_scan_worker exception path)
  • Symptom (one sentence): When subprocess.Popen in _scan_worker fails (python missing, rc-jav.py path wrong, permission denied, etc.), handle_scan has already returned {"ok": True, "started": True} to the extension because the thread was started but had not yet executed Popen; extension shows "scan started" for 1-2 seconds before the next scan-progress poll surfaces the actual error.
  • Why it's a bug: handle_scan calls thread.start() at line 2263 then returns at line 2264 without waiting for Popen to succeed. If Popen raises (line 2092-2098) the worker's exception handler writes scan_ok: false, error: ... to SCAN_STATE_FILE (line 2211-2220) — but the extension already received ok: true and only learns of the failure on the next progress poll. Race window: short (1-2 s typically) but user-visible — UI shows "scan started" then suddenly "scan failed" with cryptic OS-level error.
  • Reproduction:
    1. Input: trigger Rebuild Cache from extension while python is not on PATH (or rc-jav.py path mis-set, or cwd has permission issue)
    2. Expected: handle_scan returns an error immediately so extension can show clear message before any "started" state
    3. Actual: extension shows "scan started" briefly → next poll → "scan failed: FileNotFoundError" or similar OS error
  • Suggested fix sketch: validate Popen preconditions synchronously in handle_scan before returning (python exists, rc-jav.py exists, cwd writable). OR use a sync event/queue from worker to handle_scan so it can wait briefly for the first state-file write before returning.
  • Verifier agent: fresh Explore, blind context, stricter prompt
  • Verifier verdict: CONFIRMED
  • Verifier confidence: very high (100%)
  • Contract refs verifier read: _scan_worker exception path; SCAN_STATE_FILE write timing; handle_scan_progress detection logic
  • Mirror check needed in: none — Popen race specific to scan path; other RPCs run handlers synchronously
  • Status: fixed
  • Fix: D:\DEV\Extensions\Production\rclone-jav\host\rcjav-host.py:2053-2305. Added per-invocation spawn_event (threading.Event) + spawn_result dict, both passed from handle_scan into _scan_worker. Worker sets spawn_result["spawn_ok"] = True immediately after subprocess.Popen returns, or spawn_ok = False + error on exception, then sets the event. handle_scan waits up to 500 ms via spawn_event.wait(timeout=0.5), then branches: spawn_ok=True → {ok: true, started: true}; spawn_ok=False → {ok: false, started: false, error}; timeout → {ok: true, started: true, startup_pending: true} (backward compatible: existing UI ignores the new key). Per-invocation holder isolates the handoff from globals (_scan_proc) and state file (UI/progress surface), so cross-invocation contamination is impossible. Manifest bumped 0.1.36 → 0.1.37. Python syntax verified via py_compile. Threading harness smoke-tested in isolation: success → {spawn_ok: True} + event set; Popen fail (nonexistent binary) → {spawn_ok: False, error: "[WinError 2] ..."} + event set; slow Popen → event NOT set after 500 ms (timeout branch fires). All 3 cases behave correctly. Runtime repro verified via temporary instrumentation: injected raise FileNotFoundError("simulated spawn fail") immediately before the subprocess.Popen line in _scan_worker, reloaded extension, triggered Rebuild Cache; UI showed scan failed: FileNotFoundError: simulated spawn fail synchronously, with no misleading "scan started" flash. Instrumentation reverted post-test; manifest stayed at 0.1.37 because no code-of-record change. Note: the bad-rcjavPath test (point Setup → rcjavPath to a non-existent path) does NOT exercise this fix path; that goes through Popen success → rc-jav.py exits 2 → existing async exception handler. This fix specifically targets Popen itself raising, which is reachable via Python missing from PATH, OS permission denied at spawn time, or analogous OS-level interference. Use the instrumented-raise technique for any future regression test.
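
The spawn handshake in the M-2 fix can be sketched as follows. This is a simplified stand-in, assuming a bare command list and catching only OSError (which covers FileNotFoundError and PermissionError). The real _scan_worker also streams stderr and writes SCAN_STATE_FILE, which the sketch omits.

```python
import subprocess
import sys
import threading


def scan_worker(cmd, spawn_event, spawn_result):
    """Spawn cmd, report the spawn outcome exactly once, then wait for the child."""
    try:
        proc = subprocess.Popen(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
    except OSError as exc:
        spawn_result["spawn_ok"] = False
        spawn_result["error"] = f"{type(exc).__name__}: {exc}"
        spawn_event.set()
        return
    spawn_result["spawn_ok"] = True
    spawn_event.set()  # signal only after the result dict is populated
    proc.wait()


def handle_scan(cmd, wait_s=0.5):
    """Start the worker and wait briefly for the spawn verdict before replying."""
    spawn_event = threading.Event()  # per-invocation, never shared via globals
    spawn_result = {}
    threading.Thread(
        target=scan_worker, args=(cmd, spawn_event, spawn_result), daemon=True
    ).start()
    if not spawn_event.wait(timeout=wait_s):
        # Popen has neither returned nor raised yet: report "started, pending".
        return {"ok": True, "started": True, "startup_pending": True}
    if spawn_result.get("spawn_ok"):
        return {"ok": True, "started": True}
    return {"ok": False, "started": False, "error": spawn_result.get("error")}
```

Calling handle_scan(["no-such-binary"]) returns an ok: False response carrying the OS error text, while a valid command returns ok: True once Popen has returned.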

Light (L)

L-1 — Stderr blocking read freezes progress display for up to 5 s on rc-jav stall

  • File: D:\DEV\Extensions\Production\rclone-jav\host\rcjav-host.py:2053-2227 (_scan_worker), specifically :2101 (stderr iterator loop), :2267-2275 (deferred kill)
  • Symptom (one sentence): When rc-jav.py stalls mid-scan (e.g. rclone blocked on unresponsive remote), the for raw in proc.stderr: iterator at line 2101 blocks until either a stderr line arrives or proc exits — during which the scan-state file is not updated, so the extension's progress display shows stale state for up to 5 s (until the deferred-kill mechanism forces proc.terminate).
  • Why it's a bug (demoted from M to L): Originally flagged as M. Re-verifier confirmed the blocking is real but: no data loss occurs, cancel still works (delayed by up to 5 s as terminate fires), zombie process not left behind. Pure UX progress-freeze, not workflow-breaking.
  • Reproduction:
    1. Input: rclone remote becomes unresponsive mid-scan
    2. Expected: progress display updates with "stalled, will cancel in N s" indicator, OR heartbeat that resumes when remote recovers
    3. Actual: progress frozen for 5 s, then deferred kill fires, scan marked complete with last-known progress
  • Suggested fix sketch: add a watchdog timer that emits a heartbeat to SCAN_STATE_FILE every 1-2 s while stderr is silent, OR use non-blocking stderr reads with select/poll (cross-platform via threading)
  • Verifier agent: fresh Explore, blind context, stricter prompt
  • Verifier verdict: PARTIAL — symptom real, severity originally over-stated
  • Verifier confidence: high (100%)
  • Contract refs verifier read: cancel path; deferred-kill behavior; SCAN_STATE_FILE update timing
  • Mirror check needed in: none
  • Status: open
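
One way to realise the heartbeat suggested for L-1 is sketched below: a reader thread feeds stderr lines into a queue, and the consuming loop emits a heartbeat whenever the queue stays empty for silence_s seconds. The names pump_lines and watch_stderr and the on_line / on_heartbeat callbacks are hypothetical. In the host, the heartbeat callback would presumably refresh SCAN_STATE_FILE.

```python
import queue
import subprocess
import sys
import threading


def pump_lines(stream, out_q):
    """Blocking reader; runs on its own thread so the consumer never blocks."""
    for raw in stream:
        out_q.put(raw)
    out_q.put(None)  # sentinel: stream closed (child exited or closed stderr)


def watch_stderr(proc, on_line, on_heartbeat, silence_s=1.0):
    """Deliver stderr lines to on_line; call on_heartbeat during silent periods."""
    out_q = queue.Queue()
    threading.Thread(
        target=pump_lines, args=(proc.stderr, out_q), daemon=True
    ).start()
    while True:
        try:
            raw = out_q.get(timeout=silence_s)
        except queue.Empty:
            on_heartbeat()  # no output for silence_s seconds: report liveness
            continue
        if raw is None:
            break
        on_line(raw.rstrip("\n"))
    proc.wait()
```

The thread-plus-queue form is used here rather than select/poll because select does not work on pipe handles on Windows, which the file paths above suggest is the target platform.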

Needs Input (N)

(C-12 from candidates was N — _load_host_cache memoization key collision — left unverified per stop condition; candidate scratch retains it)


False Positives (discarded)

  • host/rcjav-host.py:1216-1221 (_path_in_allowed_prefixes case-sensitivity) — flagged as Moderate "security bypass via uppercase remote". REFUTED. The gate is fail-SAFE, not fail-OPEN: case-mismatch causes the comparison to fail, which REJECTS the operation. No bypass possible. Verifier noted a related usability issue (legitimate uppercase paths get confusing rejection) but that's a UX gap, not a security bug.

  • host/rcjav-host.py:306-316 (read_message unbounded length prefix) — flagged as Moderate "DoS via 4 GiB length". REFUTED. Chrome native messaging protocol caps extension-to-host messages at 64 MiB browser-side per Chrome dev docs. Non-Brave processes cannot write to host stdin (it's piped by the browser into the host child process). The theoretical 4 GiB read cannot actually be triggered through any practical attack surface. Pure defensive-coding gap, not a real DoS.
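
If the defensive-coding gap in read_message is ever closed, a bounded read could look like the sketch below. It assumes the standard native messaging framing (a 32-bit length in native byte order followed by UTF-8 JSON) and a hypothetical MAX_MESSAGE_BYTES constant set to the browser-side cap cited above. It is not the host's current read_message.

```python
import io
import json
import struct

# Assumed cap, mirroring the 64 MiB browser-side limit cited above.
MAX_MESSAGE_BYTES = 64 * 1024 * 1024


def read_message(stream, max_bytes=MAX_MESSAGE_BYTES):
    """Read one length-prefixed JSON message; return None on clean EOF."""
    header = stream.read(4)
    if not header:
        return None  # browser closed the pipe
    if len(header) != 4:
        raise ValueError("truncated length prefix")
    (length,) = struct.unpack("=I", header)  # 32-bit length, native byte order
    if length > max_bytes:
        # Reject before allocating or reading the body.
        raise ValueError(f"message length {length} exceeds cap {max_bytes}")
    payload = stream.read(length)
    if len(payload) != length:
        raise ValueError("truncated message body")
    return json.loads(payload.decode("utf-8"))
```

The sketch expects a buffered binary stream such as sys.stdin.buffer, whose read(n) blocks until n bytes arrive or EOF is reached. An unbuffered raw stream could return short reads and would need a read loop.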