mirror of
https://github.com/home-assistant/supervisor.git
synced 2026-05-19 06:08:51 +01:00
0ac8b42062
* Rework Supervisor connectivity check with coalescing and force flag Previously, a failed connectivity probe could strand Supervisor in a "no connectivity" state indefinitely. After an Ethernet reconnect, a probe kicked by NetworkManager's connectivity transition could race with CoreDNS being restarted (due to DNS locals changing), time out on DNS, and leave supervisor.connectivity = False. The retry that _on_dns_container_running was meant to fire landed inside the 5 s JobThrottle window from the just-failed probe and was silently dropped, since JobThrottle.THROTTLE drops rather than waits. The rework replaces the @Job(throttle=THROTTLE) decorator and the public connectivity setter with a single authoritative state-updating method: - check_and_update_connectivity(force=False) is the only path that runs the HTTP probe and updates the cached state. Concurrent callers coalesce onto a single in-flight probe. A min-interval throttle lives inside the method and reuses the cached result within window instead of dropping calls. - request_connectivity_check(force=False) is a fire-and-forget wrapper for signal handlers (D-Bus, plugin callbacks) that must return quickly without blocking signal dispatch on the HTTP round-trip. - force=True bypasses the min-interval and, when a probe is in flight, sets a trailing-rerun flag so the owning task runs one more probe after the current one completes. Used for signals that carry fresh state-change information (NM connectivity transition to FULL, DNS container RUNNING, startup, post-NTP sync). - _update_connectivity is the sole writer of the cached flag and emits SUPERVISOR_CONNECTIVITY_CHANGE only on actual transitions. Call sites migrate accordingly. The opportunistic supervisor.connectivity = False writes in update_apparmor, updater.fetch_data, os.manager, and addon_pwned error paths are replaced with request_connectivity_check() calls so the probe remains authoritative - an endpoint-specific failure no longer lies about the overall connectivity state. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Propagate connectivity-probe cancellation and skip last-check on cancel Awaiting an asyncio.Task does not propagate cancellation INTO the task, so the previous owner-doesn't-shield comment was misleading: a cancelled owner left the spawned probe running orphaned, and the next caller could start a second probe alongside it. The owner now explicitly cancels and awaits the probe on CancelledError before re-raising. The last-check timestamp is also moved out of the finally block so a cancelled probe does not leave a "fresh result just ran" cache behind that would short-circuit the next non-forced caller. A regression test exercises both: that owner cancellation clears the in-flight reference and leaves the timestamp untouched, and that a subsequent non-forced check therefore still actually probes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Clarify why post-NTP-sync forces a connectivity probe The previous comment claimed the last-check timestamp may be unreliable after a time jump, but _connectivity_last_check uses loop.time() which is monotonic and unaffected by wall-clock corrections. The real reason to force a fresh probe is TLS validation: certificates that appeared expired or not-yet-valid before the system clock was corrected may now verify, so a probe that just failed with an SSL error can succeed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Add debug logs to Supervisor connectivity probe paths The original stuck-offline bug was hard to spot in logs because the silent throttle-drop and the cached state had no audit trail. With debug-level logging at each decision point, a future investigation can reconstruct from a single log file: - who requested a check (force flag distinguishes signal-driven probes from precondition / opportunistic-error-path requests) - why a probe did not actually run (in-flight coalesce, cached within min-interval, owner cancellation) - when a forced rerun was queued and when it ran (the precise failure mode that stranded the supervisor in the original incident) - when the cached state actually flipped (with the previous value in the message so transitions are visible) All new lines are debug-level. The existing _do_connectivity_check "failed" / "succeeded" lines are kept unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Skip system-checks fan-out in test_events_on_issue_changes The test asserts that apply_suggestion fires an ISSUE_REMOVED event. ISSUE_REMOVED is fired by dismiss_issue inside FixupBase.__call__, before apply_suggestion calls healthcheck. The healthcheck call afterwards is incidental to this test's intent, but it fans out into check_system() which runs CheckDNSServer (A and AAAA) - real aiodns query_dns() probes against the NetworkManager mock's stub nameserver 192.168.30.1 that each hit the default ~10 s aiodns timeout. The file took ~21 s to run. The slowness has been latent since #3818 (Aug 2022), which added the apply_suggestion step at the end of test_events_on_issue_changes two days after the DNS check landed in its current form (#3811). The default 24 h JobThrottle on CheckDNSServer.run_check tends to mask the cost in full-suite runs once any earlier test has tripped the throttle, which is likely why this slipped through. Mock coresys.resolution.healthcheck for just this one apply_suggestion call rather than introducing a file-wide DNS mock. The patch is local to the slow call site and the test's assertion is unaffected. The file drops from ~21 s to ~2.5 s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>