* Rework Supervisor connectivity check with coalescing and force flag
Previously, a failed connectivity probe could strand Supervisor in a
"no connectivity" state indefinitely. After an Ethernet reconnect, a
probe kicked by NetworkManager's connectivity transition could race
with CoreDNS being restarted (due to DNS locals changing), time out on
DNS, and leave supervisor.connectivity = False. The retry that
_on_dns_container_running was meant to fire landed inside the 5 s
JobThrottle window from the just-failed probe and was silently dropped,
since JobThrottle.THROTTLE drops rather than waits.
The rework replaces the @Job(throttle=THROTTLE) decorator and the
public connectivity setter with a single authoritative state-updating
method:
- check_and_update_connectivity(force=False) is the only path that
runs the HTTP probe and updates the cached state. Concurrent callers
coalesce onto a single in-flight probe. A min-interval throttle
lives inside the method and reuses the cached result within window
instead of dropping calls.
- request_connectivity_check(force=False) is a fire-and-forget wrapper
for signal handlers (D-Bus, plugin callbacks) that must return
quickly without blocking signal dispatch on the HTTP round-trip.
- force=True bypasses the min-interval and, when a probe is in flight,
sets a trailing-rerun flag so the owning task runs one more probe
after the current one completes. Used for signals that carry fresh
state-change information (NM connectivity transition to FULL, DNS
container RUNNING, startup, post-NTP sync).
- _update_connectivity is the sole writer of the cached flag and
emits SUPERVISOR_CONNECTIVITY_CHANGE only on actual transitions.
Call sites migrate accordingly. The opportunistic
supervisor.connectivity = False writes in update_apparmor,
updater.fetch_data, os.manager, and addon_pwned error paths are
replaced with request_connectivity_check() calls so the probe remains
authoritative - an endpoint-specific failure no longer lies about the
overall connectivity state.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Propagate connectivity-probe cancellation and skip last-check on cancel
Awaiting an asyncio.Task does not propagate cancellation INTO the task,
so the previous owner-doesn't-shield comment was misleading: a cancelled
owner left the spawned probe running orphaned, and the next caller could
start a second probe alongside it. The owner now explicitly cancels and
awaits the probe on CancelledError before re-raising.
The last-check timestamp is also moved out of the finally block so a
cancelled probe does not leave a "fresh result just ran" cache behind
that would short-circuit the next non-forced caller.
A regression test exercises both: that owner cancellation clears the
in-flight reference and leaves the timestamp untouched, and that a
subsequent non-forced check therefore still actually probes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Clarify why post-NTP-sync forces a connectivity probe
The previous comment claimed the last-check timestamp may be unreliable
after a time jump, but _connectivity_last_check uses loop.time() which
is monotonic and unaffected by wall-clock corrections. The real reason
to force a fresh probe is TLS validation: certificates that appeared
expired or not-yet-valid before the system clock was corrected may now
verify, so a probe that just failed with an SSL error can succeed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Add debug logs to Supervisor connectivity probe paths
The original stuck-offline bug was hard to spot in logs because the
silent throttle-drop and the cached state had no audit trail. With
debug-level logging at each decision point, a future investigation can
reconstruct from a single log file:
- who requested a check (force flag distinguishes signal-driven probes
from precondition / opportunistic-error-path requests)
- why a probe did not actually run (in-flight coalesce, cached within
min-interval, owner cancellation)
- when a forced rerun was queued and when it ran (the precise failure
mode that stranded the supervisor in the original incident)
- when the cached state actually flipped (with the previous value in
the message so transitions are visible)
All new lines are debug-level. The existing _do_connectivity_check
"failed" / "succeeded" lines are kept unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Skip system-checks fan-out in test_events_on_issue_changes
The test asserts that apply_suggestion fires an ISSUE_REMOVED event.
ISSUE_REMOVED is fired by dismiss_issue inside FixupBase.__call__, before
apply_suggestion calls healthcheck. The healthcheck call afterwards is
incidental to this test's intent, but it fans out into check_system()
which runs CheckDNSServer (A and AAAA) - real aiodns query_dns() probes
against the NetworkManager mock's stub nameserver 192.168.30.1 that each
hit the default ~10 s aiodns timeout. The file took ~21 s to run.
The slowness has been latent since #3818 (Aug 2022), which added the
apply_suggestion step at the end of test_events_on_issue_changes two
days after the DNS check landed in its current form (#3811). The default
24 h JobThrottle on CheckDNSServer.run_check tends to mask the cost in
full-suite runs once any earlier test has tripped the throttle, which is
likely why this slipped through.
Mock coresys.resolution.healthcheck for just this one apply_suggestion
call rather than introducing a file-wide DNS mock. The patch is local to
the slow call site and the test's assertion is unaffected. The file
drops from ~21 s to ~2.5 s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Rename addon→app in docstrings and comments
Updates all docstrings and inline comments across supervisor/ and
tests/ to use the new app/apps terminology. No runtime behaviour
is changed by this commit.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Rename addon→app in code (variables, args, class names, functions)
Renames all internal Python identifiers from addon/addons to app/apps:
- Variable and argument names
- Function and method names
- Class names (Addon→App, AddonManager→AppManager, DockerAddon→DockerApp,
all exception, check, and fixup classes, etc.)
- String literals used as Python identifiers (pytest fixtures,
parametrize param names, patch.object attribute strings,
URL route match_info keys)
External API contracts are preserved: JSON keys, error codes,
discovery protocol fields, TypedDict/attr.s field names.
Import module paths (supervisor/addons/) are also unchanged.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Fix partial backup/restore API to remap addons key to apps
The external API accepts `addons` as the request body key (since
ATTR_APPS = "addons"), but do_backup_partial and do_restore_partial
now take an `apps` parameter after the rename. The **body expansion
in both endpoints would pass `addons=...` causing a TypeError.
Remap the key before expansion in both backup_partial and
restore_partial:
if ATTR_APPS in body:
body["apps"] = body.pop(ATTR_APPS)
Also adds test_restore_partial_with_addons_key to verify the restore
path correctly receives apps= when addons is passed in the request
body. This path had no existing test coverage.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Fix merge error
* Adjust AppLoggerAdapter to use app_name
Co-authored-by: Stefan Agner <stefan@agner.ch>
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Stefan Agner <stefan@agner.ch>
* Migrate info and events to aiodocker
* Migrate container logs to aiodocker
* Fix dns plugin loop test
* Fix mocking for docker info
* Fixes from feedback
* Harden monitor error handling
* Deleted failing tests because they were not useful
* Add Supervisor connectivity check after DNS restart
When the DNS plug-in got restarted, check Supervisor connectivity
in case the DNS plug-in configuration change influenced Supervisor
connectivity. This is helpful when a DHCP server gets started after
Home Assistant is up. In that case the network provided DNS server
(local DNS server) becomes available after the DNS plug-in restart.
Without this change, the Supervisor connectivity will remain false
until the a Job triggers a connectivity check, for example the
periodic update check (which causes a updater and store reload) by
Core.
* Fix pytest and add coverage for new functionality
* Improve DNS plug-in restart
Instead of simply go by PrimaryConnectioon change, use the DnsManager
Configuration property. This property is ultimately used to write the
DNS plug-in configuration, so it is really the relevant information
we pass on to the plug-in.
* Check for changes and restart DNS plugin
* Check for changes in plug-in DNS
Cache last local (NetworkManager) provided DNS servers. Check against
this DNS server list when deciding when to restart the DNS plug-in.
* Check connectivity unthrottled in certain situations
* Fix pytest
* Fix pytest
* Improve test coverage for DNS plugins restart functionality
* Apply suggestions from code review
Co-authored-by: Mike Degatano <michael.degatano@gmail.com>
* Debounce local DNS changes and event based connectivity checks
* Remove connection check logic
* Remove unthrottled connectivity check
* Fix delayed call
* Store restart task and cancel in case a restart is running
* Improve DNS configuration change tests
* Remove stale code
* Improve DNS plug-in tests, less mocking
* Cover multiple private functions at once
Improve tests around notify_locals_changed() to cover multiple
functions at once.
---------
Co-authored-by: Mike Degatano <michael.degatano@gmail.com>
* Bad message error marks system as unhealthy
* Finish adding test cases for changes
* Rename test file for uniqueness
* bad_message to oserror_bad_message
* Omit some checks and check for network mounts
* Add job group execution limit option
* Fix pylint issues
* Assign variable before usage
* Cleanup jobs when done
* Remove isinstance check for performance
* Explicitly raise from None
* Add some more documentation info
* Reduce executor code for docker
* Fix pylint errors and move import/export image
* Fix test and a couple other risky executor calls
* Fix dataclass and return
* Fix test case and add one for corrupt docker
* Add some coverage
* Undo changes to docker manager startup
* Docker events based watchdog
* Separate monitor from DockerAPI since it needs coresys
* Move monitor into dockerAPI
* Fix properties on coresys
* Add watchdog tests
* Added tests
* pylint issue
* Current state failures test
* Thread-safe event processing
* Use labels property
* Create issue for detected DNS server problem
* Validate behavior on restart as well
* tls:// not supported, remove check
* Move DNS server checks into resolution checks
* Revert all changes to plugins.dns
* Run DNS server checks if affected
* Mock aiodns query during all checks tests