supervisor

mirror of https://github.com/home-assistant/supervisor.git synced 2026-05-30 11:33:05 +01:00

Author	SHA1	Message	Date
Stefan Agner	0ac8b42062	Rework Supervisor connectivity check with coalescing and force flag (#6765 ) * Rework Supervisor connectivity check with coalescing and force flag Previously, a failed connectivity probe could strand Supervisor in a "no connectivity" state indefinitely. After an Ethernet reconnect, a probe kicked by NetworkManager's connectivity transition could race with CoreDNS being restarted (due to DNS locals changing), time out on DNS, and leave supervisor.connectivity = False. The retry that _on_dns_container_running was meant to fire landed inside the 5 s JobThrottle window from the just-failed probe and was silently dropped, since JobThrottle.THROTTLE drops rather than waits. The rework replaces the @Job(throttle=THROTTLE) decorator and the public connectivity setter with a single authoritative state-updating method: - check_and_update_connectivity(force=False) is the only path that runs the HTTP probe and updates the cached state. Concurrent callers coalesce onto a single in-flight probe. A min-interval throttle lives inside the method and reuses the cached result within window instead of dropping calls. - request_connectivity_check(force=False) is a fire-and-forget wrapper for signal handlers (D-Bus, plugin callbacks) that must return quickly without blocking signal dispatch on the HTTP round-trip. - force=True bypasses the min-interval and, when a probe is in flight, sets a trailing-rerun flag so the owning task runs one more probe after the current one completes. Used for signals that carry fresh state-change information (NM connectivity transition to FULL, DNS container RUNNING, startup, post-NTP sync). - _update_connectivity is the sole writer of the cached flag and emits SUPERVISOR_CONNECTIVITY_CHANGE only on actual transitions. Call sites migrate accordingly. 
The opportunistic supervisor.connectivity = False writes in update_apparmor, updater.fetch_data, os.manager, and addon_pwned error paths are replaced with request_connectivity_check() calls so the probe remains authoritative - an endpoint-specific failure no longer lies about the overall connectivity state. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Propagate connectivity-probe cancellation and skip last-check on cancel Awaiting an asyncio.Task does not propagate cancellation INTO the task, so the previous owner-doesn't-shield comment was misleading: a cancelled owner left the spawned probe running orphaned, and the next caller could start a second probe alongside it. The owner now explicitly cancels and awaits the probe on CancelledError before re-raising. The last-check timestamp is also moved out of the finally block so a cancelled probe does not leave a "fresh result just ran" cache behind that would short-circuit the next non-forced caller. A regression test exercises both: that owner cancellation clears the in-flight reference and leaves the timestamp untouched, and that a subsequent non-forced check therefore still actually probes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Clarify why post-NTP-sync forces a connectivity probe The previous comment claimed the last-check timestamp may be unreliable after a time jump, but _connectivity_last_check uses loop.time() which is monotonic and unaffected by wall-clock corrections. The real reason to force a fresh probe is TLS validation: certificates that appeared expired or not-yet-valid before the system clock was corrected may now verify, so a probe that just failed with an SSL error can succeed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Add debug logs to Supervisor connectivity probe paths The original stuck-offline bug was hard to spot in logs because the silent throttle-drop and the cached state had no audit trail. 
With debug-level logging at each decision point, a future investigation can reconstruct from a single log file: - who requested a check (force flag distinguishes signal-driven probes from precondition / opportunistic-error-path requests) - why a probe did not actually run (in-flight coalesce, cached within min-interval, owner cancellation) - when a forced rerun was queued and when it ran (the precise failure mode that stranded the supervisor in the original incident) - when the cached state actually flipped (with the previous value in the message so transitions are visible) All new lines are debug-level. The existing _do_connectivity_check "failed" / "succeeded" lines are kept unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Skip system-checks fan-out in test_events_on_issue_changes The test asserts that apply_suggestion fires an ISSUE_REMOVED event. ISSUE_REMOVED is fired by dismiss_issue inside FixupBase.__call__, before apply_suggestion calls healthcheck. The healthcheck call afterwards is incidental to this test's intent, but it fans out into check_system() which runs CheckDNSServer (A and AAAA) - real aiodns query_dns() probes against the NetworkManager mock's stub nameserver 192.168.30.1 that each hit the default ~10 s aiodns timeout. The file took ~21 s to run. The slowness has been latent since #3818 (Aug 2022), which added the apply_suggestion step at the end of test_events_on_issue_changes two days after the DNS check landed in its current form (#3811). The default 24 h JobThrottle on CheckDNSServer.run_check tends to mask the cost in full-suite runs once any earlier test has tripped the throttle, which is likely why this slipped through. Mock coresys.resolution.healthcheck for just this one apply_suggestion call rather than introducing a file-wide DNS mock. The patch is local to the slow call site and the test's assertion is unaffected. The file drops from ~21 s to ~2.5 s. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 10:14:13 +02:00
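
The coalescing, min-interval and force behaviour described in the commit above can be illustrated with a minimal asyncio sketch. This is not the Supervisor implementation: the class name `ConnectivityState`, the injected `probe` callable, and the private attribute names are hypothetical. Only the method names `check_and_update_connectivity`, `request_connectivity_check` and `_update_connectivity` are taken from the commit message.

```python
from __future__ import annotations

import asyncio
from collections.abc import Callable, Coroutine
from typing import Any


class ConnectivityState:
    """Hypothetical stand-in for the state holder; not Supervisor's class."""

    def __init__(
        self,
        probe: Callable[[], Coroutine[Any, Any, bool]],
        min_interval: float = 5.0,
    ) -> None:
        self._probe = probe  # assumed: coroutine function returning True/False
        self._min_interval = min_interval
        self._connectivity: bool | None = None
        self._last_check: float | None = None  # loop.time(), monotonic
        self._in_flight: asyncio.Task[bool] | None = None
        self._rerun = False
        self._background: set[asyncio.Task[Any]] = set()
        self.transitions: list[bool] = []  # stands in for the change event

    def _update_connectivity(self, state: bool) -> None:
        # Sole writer of the cached flag; records only actual transitions.
        if state != self._connectivity:
            self._connectivity = state
            self.transitions.append(state)

    async def check_and_update_connectivity(self, force: bool = False) -> bool | None:
        loop = asyncio.get_running_loop()
        if self._in_flight is not None:
            if force:
                self._rerun = True  # owner runs one more probe afterwards
            # Coalesce onto the running probe. The waiter gets that probe's
            # result (not the trailing rerun's) and sees CancelledError if
            # the probe is cancelled: simplifications of this sketch.
            return await asyncio.shield(self._in_flight)
        if (
            not force
            and self._last_check is not None
            and loop.time() - self._last_check < self._min_interval
        ):
            return self._connectivity  # reuse cached result, do not drop
        while True:
            self._rerun = False
            task = loop.create_task(self._probe())
            self._in_flight = task
            try:
                result = await task
            except asyncio.CancelledError:
                # Make sure the probe is cancelled and finished before
                # re-raising, so no probe is left running without an owner.
                task.cancel()
                await asyncio.gather(task, return_exceptions=True)
                raise
            finally:
                self._in_flight = None
            # Recorded only after a completed probe, never on cancellation.
            self._last_check = loop.time()
            self._update_connectivity(result)
            if not self._rerun:
                return result

    def request_connectivity_check(self, force: bool = False) -> asyncio.Task[Any]:
        # Fire-and-forget wrapper for handlers that must return quickly.
        task = asyncio.get_running_loop().create_task(
            self.check_and_update_connectivity(force=force)
        )
        self._background.add(task)
        task.add_done_callback(self._background.discard)
        return task
```

In this sketch a forced call that arrives while a probe is running does not start a second concurrent probe. It only sets the rerun flag, and the owning call loops once more after the current probe completes.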
Mike Degatano	ba8c49935b	Refactor internal addon references to app/apps (#6717 ) * Rename addon→app in docstrings and comments Updates all docstrings and inline comments across supervisor/ and tests/ to use the new app/apps terminology. No runtime behaviour is changed by this commit. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Rename addon→app in code (variables, args, class names, functions) Renames all internal Python identifiers from addon/addons to app/apps: - Variable and argument names - Function and method names - Class names (Addon→App, AddonManager→AppManager, DockerAddon→DockerApp, all exception, check, and fixup classes, etc.) - String literals used as Python identifiers (pytest fixtures, parametrize param names, patch.object attribute strings, URL route match_info keys) External API contracts are preserved: JSON keys, error codes, discovery protocol fields, TypedDict/attr.s field names. Import module paths (supervisor/addons/) are also unchanged. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Fix partial backup/restore API to remap addons key to apps The external API accepts `addons` as the request body key (since ATTR_APPS = "addons"), but do_backup_partial and do_restore_partial now take an `apps` parameter after the rename. The **body expansion in both endpoints would pass `addons=...`, causing a TypeError. Remap the key before expansion in both backup_partial and restore_partial: if ATTR_APPS in body: body["apps"] = body.pop(ATTR_APPS) Also adds test_restore_partial_with_addons_key to verify the restore path correctly receives apps= when addons is passed in the request body. This path had no existing test coverage. 
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Fix merge error * Adjust AppLoggerAdapter to use app_name Co-authored-by: Stefan Agner <stefan@agner.ch> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: Stefan Agner <stefan@agner.ch>	2026-04-14 16:47:20 +02:00
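
The key remap described in the commit above can be shown in isolation. In this sketch, `ATTR_APPS = "addons"` and the remap statement come from the commit message. The function bodies and signatures are simplified, hypothetical stand-ins for the real endpoint and backup manager method.

```python
from __future__ import annotations

from typing import Any

ATTR_APPS = "addons"  # external API key, as stated in the commit message


def do_backup_partial(name: str, apps: list[str] | None = None) -> dict[str, Any]:
    # Hypothetical stand-in: the real method takes more parameters.
    return {"name": name, "apps": apps or []}


def backup_partial(body: dict[str, Any]) -> dict[str, Any]:
    body = dict(body)  # copy so the caller's request body is not mutated
    # Remap the external key to the renamed internal parameter before the
    # ** expansion; without this, addons=... raises TypeError.
    if ATTR_APPS in body:
        body["apps"] = body.pop(ATTR_APPS)
    return do_backup_partial(**body)
```

Calling `backup_partial({"name": "nightly", "addons": ["core_ssh"]})` reaches `do_backup_partial` with `apps=["core_ssh"]`, while the request body keeps using the external `addons` key.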
Mike Degatano	a122b5f1e9	Migrate info, events and container logs to aiodocker (#6514 ) * Migrate info and events to aiodocker * Migrate container logs to aiodocker * Fix dns plugin loop test * Fix mocking for docker info * Fixes from feedback * Harden monitor error handling * Deleted failing tests because they were not useful	2026-02-03 18:36:41 +01:00
Stefan Agner	9a0f530a2f	Add Supervisor connectivity check after DNS restart (#6005 ) * Add Supervisor connectivity check after DNS restart When the DNS plug-in gets restarted, check Supervisor connectivity in case the DNS plug-in configuration change influenced Supervisor connectivity. This is helpful when a DHCP server gets started after Home Assistant is up. In that case the network-provided DNS server (local DNS server) becomes available after the DNS plug-in restart. Without this change, the Supervisor connectivity will remain false until a Job triggers a connectivity check, for example the periodic update check (which causes an updater and store reload) by Core. * Fix pytest and add coverage for new functionality	2025-07-10 11:08:10 +02:00
Stefan Agner	953f7d01d7	Improve DNS plug-in restart (#5999 ) * Improve DNS plug-in restart Instead of simply going by PrimaryConnection change, use the DnsManager Configuration property. This property is ultimately used to write the DNS plug-in configuration, so it is really the relevant information we pass on to the plug-in. * Check for changes and restart DNS plugin * Check for changes in plug-in DNS Cache last local (NetworkManager) provided DNS servers. Check against this DNS server list when deciding when to restart the DNS plug-in. * Check connectivity unthrottled in certain situations * Fix pytest * Fix pytest * Improve test coverage for DNS plugins restart functionality * Apply suggestions from code review Co-authored-by: Mike Degatano <michael.degatano@gmail.com> * Debounce local DNS changes and event based connectivity checks * Remove connection check logic * Remove unthrottled connectivity check * Fix delayed call * Store restart task and cancel in case a restart is running * Improve DNS configuration change tests * Remove stale code * Improve DNS plug-in tests, less mocking * Cover multiple private functions at once Improve tests around notify_locals_changed() to cover multiple functions at once. --------- Co-authored-by: Mike Degatano <michael.degatano@gmail.com>	2025-07-09 11:35:03 +02:00
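
The commit above mentions caching the last local DNS server list, debouncing changes, and cancelling a restart that is still running. A generic asyncio sketch of that pattern follows. It is an illustration under assumptions, not the plug-in's code: the class name `DnsRestartDebouncer`, the injected `restart` coroutine function, and the `delay` value are hypothetical, and only the method name `notify_locals_changed` appears in the commit message.

```python
from __future__ import annotations

import asyncio
from collections.abc import Callable, Coroutine
from typing import Any


class DnsRestartDebouncer:
    """Hypothetical helper; names and delay are illustrative assumptions."""

    def __init__(
        self,
        restart: Callable[[], Coroutine[Any, Any, None]],
        delay: float = 1.0,
    ) -> None:
        self._restart = restart  # assumed: coroutine function doing the restart
        self._delay = delay
        self._cached_locals: list[str] | None = None
        self._timer: asyncio.TimerHandle | None = None
        self._restart_task: asyncio.Task[None] | None = None

    def notify_locals_changed(self, locals_: list[str]) -> None:
        # Ignore notifications that do not change the cached server list.
        if locals_ == self._cached_locals:
            return
        self._cached_locals = list(locals_)
        # Debounce: a newer change replaces any pending delayed restart.
        if self._timer is not None:
            self._timer.cancel()
        loop = asyncio.get_running_loop()
        self._timer = loop.call_later(self._delay, self._start_restart)

    def _start_restart(self) -> None:
        self._timer = None
        # Cancel a restart that is still running before starting a new one.
        if self._restart_task is not None and not self._restart_task.done():
            self._restart_task.cancel()
        self._restart_task = asyncio.get_running_loop().create_task(self._restart())
```

With this shape, a burst of notifications inside the delay window results in a single restart that uses the most recent server list.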
Mike Degatano	0e8ace949a	Fix mypy issues in `plugins` and `resolution` (#5946 ) * Fix mypy issues in plugins * Fix mypy issues in resolution module * fix misses in resolution check * Fix signatures on evaluate methods * nitpick fix suggestions	2025-06-16 14:12:47 -04:00
Stefan Agner	f6faa18409	Bump pre-commit ruff to 0.5.7 and reformat (#5242 ) It seems that the codebase is not formatted with the latest ruff version. This PR reformats the codebase with ruff 0.5.7.	2024-08-13 20:53:56 +02:00
Mike Degatano	3cc6bd19ad	Mark system as unhealthy on OSError Bad message errors (#4750 ) * Bad message error marks system as unhealthy * Finish adding test cases for changes * Rename test file for uniqueness * bad_message to oserror_bad_message * Omit some checks and check for network mounts	2023-12-21 18:05:29 +01:00
Mike Degatano	1611beccd1	Add job group execution limit option (#4457 ) * Add job group execution limit option * Fix pylint issues * Assign variable before usage * Cleanup jobs when done * Remove isinstance check for performance * Explicitly raise from None * Add some more documentation info	2023-08-08 16:49:17 -04:00
Mike Degatano	1f92ab42ca	Reduce executor code for docker (#4438 ) * Reduce executor code for docker * Fix pylint errors and move import/export image * Fix test and a couple other risky executor calls * Fix dataclass and return * Fix test case and add one for corrupt docker * Add some coverage * Undo changes to docker manager startup	2023-07-18 11:39:39 -04:00
Mike Degatano	d19166bb86	Docker events based watchdog and docker healthchecks (#3725 ) * Docker events based watchdog * Separate monitor from DockerAPI since it needs coresys * Move monitor into dockerAPI * Fix properties on coresys * Add watchdog tests * Added tests * pylint issue * Current state failures test * Thread-safe event processing * Use labels property	2022-07-15 09:21:59 +02:00
Mike Degatano	8bb4596d04	Add API option to disable fallback DNS (#3586 ) * Add API option to disable fallback DNS * DNS unsupported evaluation and fallback in sentry	2022-04-25 18:15:40 +02:00
Mike Degatano	f3e2ccce43	Create issue for detected DNS server problem (#3578 ) * Create issue for detected DNS server problem * Validate behavior on restart as well * tls:// not supported, remove check * Move DNS server checks into resolution checks * Revert all changes to plugins.dns * Run DNS server checks if affected * Mock aiodns query during all checks tests	2022-04-21 10:55:49 +02:00

13 Commits