* Replace fixed-duration sleeps after bus events with gather
Several tests use ``await asyncio.sleep(...)`` to "wait for the
listener to run" after firing a bus event. The fixed duration is
real wall-clock time and the wait can be indeterministic — if the
handler chain happens to need slightly more time on a busy CI
runner, the assertion races the handler.
``Bus.fire_event`` returns the listener tasks since #6252; capture
and ``await asyncio.gather(*tasks)`` instead of sleeping. Touches
test_bus.py (the bus tests were poking scheduling instead of
verifying their assertions), test_home_assistant_watchdog.py,
test_plugin_base.py, addons/test_manager.py, docker/test_addon.py,
and test_store_execute_reload.py.
Other cleanups in the same spirit:
- ``_fire_test_event`` in addons/test_addon.py becomes ``async def``
and gathers the listener tasks itself, so its 17 call sites
collapse to a single ``await _fire_test_event(...)``.
- The two test_store_execute_reload.py sites that used the private
``_update_connectivity()`` helper are reworked to set the cached
connectivity flag directly and fire the event themselves so they
can gather the listener tasks the same way.
- The two ``sleep(1)`` post-pull drains in docker/test_interface.py
collapse to ``sleep(0)`` (handler tasks are already gathered
inside pull_image), saving ~2s.
- The ``sleep(0.01)`` waits inside ``container_events()`` task
bodies (api/test_addons.py, api/test_store.py,
backups/test_manager.py) are just one-yield-to-the-parent and
become ``sleep(0)``.
Switching to ``gather`` exposes a few latent test mocks that were
silently swallowing TypeErrors as background-task failures before:
- ``CGroup.add_devices_allowed`` is ``async def`` but was patched
as a plain MagicMock in docker/test_addon.py — now patched via
``new_callable=AsyncMock``.
- The watchdog does ``await (await self.start())`` /
``await (await self.restart())`` because ``App.start`` /
``App.restart`` return ``asyncio.Task``. The mocks in
addons/test_addon.py (test_app_watchdog, test_watchdog_on_stop,
test_watchdog_during_attach) needed
``AsyncMock(return_value=<settled future>)`` to mirror that
shape rather than a plain MagicMock.
* Factor bus.fire_event + gather pattern into a helper
Per review feedback, the ``await asyncio.gather(*coresys.bus.fire_event(...))``
incantation was scattered across many call sites. Add
``tests.common.fire_bus_event`` that takes the coresys, event and data,
fires the event and awaits the spawned listener tasks. Convert all
matching sites to use it, including the ``_fire_test_event`` wrapper
in addons/test_addon.py which now just builds the
``DockerContainerStateEvent`` and delegates.
* Rework Supervisor connectivity check with coalescing and force flag
Previously, a failed connectivity probe could strand Supervisor in a
"no connectivity" state indefinitely. After an Ethernet reconnect, a
probe kicked by NetworkManager's connectivity transition could race
with CoreDNS being restarted (due to DNS locals changing), time out on
DNS, and leave supervisor.connectivity = False. The retry that
_on_dns_container_running was meant to fire landed inside the 5 s
JobThrottle window from the just-failed probe and was silently dropped,
since JobThrottle.THROTTLE drops rather than waits.
The rework replaces the @Job(throttle=THROTTLE) decorator and the
public connectivity setter with a single authoritative state-updating
method:
- check_and_update_connectivity(force=False) is the only path that
runs the HTTP probe and updates the cached state. Concurrent callers
coalesce onto a single in-flight probe. A min-interval throttle
lives inside the method and reuses the cached result within window
instead of dropping calls.
- request_connectivity_check(force=False) is a fire-and-forget wrapper
for signal handlers (D-Bus, plugin callbacks) that must return
quickly without blocking signal dispatch on the HTTP round-trip.
- force=True bypasses the min-interval and, when a probe is in flight,
sets a trailing-rerun flag so the owning task runs one more probe
after the current one completes. Used for signals that carry fresh
state-change information (NM connectivity transition to FULL, DNS
container RUNNING, startup, post-NTP sync).
- _update_connectivity is the sole writer of the cached flag and
emits SUPERVISOR_CONNECTIVITY_CHANGE only on actual transitions.
Call sites migrate accordingly. The opportunistic
supervisor.connectivity = False writes in update_apparmor,
updater.fetch_data, os.manager, and addon_pwned error paths are
replaced with request_connectivity_check() calls so the probe remains
authoritative - an endpoint-specific failure no longer lies about the
overall connectivity state.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Propagate connectivity-probe cancellation and skip last-check on cancel
Awaiting an asyncio.Task does not propagate cancellation INTO the task,
so the previous owner-doesn't-shield comment was misleading: a cancelled
owner left the spawned probe running orphaned, and the next caller could
start a second probe alongside it. The owner now explicitly cancels and
awaits the probe on CancelledError before re-raising.
The last-check timestamp is also moved out of the finally block so a
cancelled probe does not leave a "fresh result just ran" cache behind
that would short-circuit the next non-forced caller.
A regression test exercises both: that owner cancellation clears the
in-flight reference and leaves the timestamp untouched, and that a
subsequent non-forced check therefore still actually probes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Clarify why post-NTP-sync forces a connectivity probe
The previous comment claimed the last-check timestamp may be unreliable
after a time jump, but _connectivity_last_check uses loop.time() which
is monotonic and unaffected by wall-clock corrections. The real reason
to force a fresh probe is TLS validation: certificates that appeared
expired or not-yet-valid before the system clock was corrected may now
verify, so a probe that just failed with an SSL error can succeed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Add debug logs to Supervisor connectivity probe paths
The original stuck-offline bug was hard to spot in logs because the
silent throttle-drop and the cached state had no audit trail. With
debug-level logging at each decision point, a future investigation can
reconstruct from a single log file:
- who requested a check (force flag distinguishes signal-driven probes
from precondition / opportunistic-error-path requests)
- why a probe did not actually run (in-flight coalesce, cached within
min-interval, owner cancellation)
- when a forced rerun was queued and when it ran (the precise failure
mode that stranded the supervisor in the original incident)
- when the cached state actually flipped (with the previous value in
the message so transitions are visible)
All new lines are debug-level. The existing _do_connectivity_check
"failed" / "succeeded" lines are kept unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Skip system-checks fan-out in test_events_on_issue_changes
The test asserts that apply_suggestion fires an ISSUE_REMOVED event.
ISSUE_REMOVED is fired by dismiss_issue inside FixupBase.__call__, before
apply_suggestion calls healthcheck. The healthcheck call afterwards is
incidental to this test's intent, but it fans out into check_system()
which runs CheckDNSServer (A and AAAA) - real aiodns query_dns() probes
against the NetworkManager mock's stub nameserver 192.168.30.1 that each
hit the default ~10 s aiodns timeout. The file took ~21 s to run.
The slowness has been latent since #3818 (Aug 2022), which added the
apply_suggestion step at the end of test_events_on_issue_changes two
days after the DNS check landed in its current form (#3811). The default
24 h JobThrottle on CheckDNSServer.run_check tends to mask the cost in
full-suite runs once any earlier test has tripped the throttle, which is
likely why this slipped through.
Mock coresys.resolution.healthcheck for just this one apply_suggestion
call rather than introducing a file-wide DNS mock. The patch is local to
the slow call site and the test's assertion is unaffected. The file
drops from ~21 s to ~2.5 s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Triaging SUPERVISOR-1JWK turned up a missed port conflict:
RE_PORT_CONFLICT_ERROR only matched one of the Docker daemon's
port-in-use message shapes. The two variants produced by current moby
— "Bind for <ip>:<port> failed: port is already allocated" from
portallocator and "failed to bind host port <ip>:<port>/<proto>:
address already in use" from osallocator — fell through to
DockerAPIError, got re-raised as AppUnknownError, and the watchdog
shipped them to Sentry as unknown errors.
Widen the regex to match all known shapes (including the older form
embedding the container endpoint, still observed from older daemons
and wrappers), anchored on the "failed to set up container networking"
prefix and one of the "address already in use" or "port is already
allocated" suffixes. Log the raw Docker message at debug level before
converting, so curious users can still see the exact upstream text
(host IP, container endpoint, protocol) when investigating which
process is holding the port.
The watchdog's _restart_after_problem now catches AppPortConflict
explicitly ahead of the generic AppsError handler: log a warning,
break the retry loop, do not call async_capture_exception. A port
conflict is an environment condition — another process grabbed the
port while the add-on was down — so retrying cannot make it succeed
and reporting to Sentry is noise.
With port conflicts now raised as typed APIError subclasses at the
detection site, the DockerAPIError → format_message() rewrite fallback
in api_return_error has no work left. Drop the fallback and delete
supervisor/utils/log_format.py along with its tests; the module only
ever handled port-conflict prose.
Fixes SUPERVISOR-1JWK
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Rename addon→app in docstrings and comments
Updates all docstrings and inline comments across supervisor/ and
tests/ to use the new app/apps terminology. No runtime behaviour
is changed by this commit.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Rename addon→app in code (variables, args, class names, functions)
Renames all internal Python identifiers from addon/addons to app/apps:
- Variable and argument names
- Function and method names
- Class names (Addon→App, AddonManager→AppManager, DockerAddon→DockerApp,
all exception, check, and fixup classes, etc.)
- String literals used as Python identifiers (pytest fixtures,
parametrize param names, patch.object attribute strings,
URL route match_info keys)
External API contracts are preserved: JSON keys, error codes,
discovery protocol fields, TypedDict/attr.s field names.
Import module paths (supervisor/addons/) are also unchanged.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Fix partial backup/restore API to remap addons key to apps
The external API accepts `addons` as the request body key (since
ATTR_APPS = "addons"), but do_backup_partial and do_restore_partial
now take an `apps` parameter after the rename. The **body expansion
in both endpoints would pass `addons=...` causing a TypeError.
Remap the key before expansion in both backup_partial and
restore_partial:
if ATTR_APPS in body:
body["apps"] = body.pop(ATTR_APPS)
Also adds test_restore_partial_with_addons_key to verify the restore
path correctly receives apps= when addons is passed in the request
body. This path had no existing test coverage.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Fix merge error
* Adjust AppLoggerAdapter to use app_name
Co-authored-by: Stefan Agner <stefan@agner.ch>
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Stefan Agner <stefan@agner.ch>
* Respect auto-update setting for plug-in auto-updates
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
* Also skip auto-updating plug-ins in decorator
* Raise if auto-update flag is not set and plug-in is not up to date
---------
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
* Migrate info and events to aiodocker
* Migrate container logs to aiodocker
* Fix dns plugin loop test
* Fix mocking for docker info
* Fixes from feedback
* Harden monitor error handling
* Deleted failing tests because they were not useful
The aiodocker 0.25.0 upgrade (PR #6448) changed how DockerError handles
the message parameter. The library now extracts the message string from
Docker API JSON responses before passing it to DockerError, rather than
passing the entire dict.
The port conflict detection tests were written before this change and
incorrectly passed dicts to DockerError. This caused TypeErrors when
the port conflict detection code tried to match err.message with a
regex, expecting a string but receiving a dict.
Update both test_addon_start_port_conflict_error and
test_observer_start_port_conflict to pass message strings directly,
matching the real aiodocker 0.25.0 behavior.
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
* Map port conflict on start error into a known error
* Apply suggestions from code review
* Run ruff format
---------
Co-authored-by: Stefan Agner <stefan@agner.ch>
* Migrate all docker container interactions to aiodocker
* Remove containers_legacy since its no longer used
* Add back remove color logic
* Revert accidental invert of conditional in setup_network
* Fix typos found by copilot
* Apply suggestions from code review
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Revert "Apply suggestions from code review"
This reverts commit 0a475433ea.
---------
Co-authored-by: Stefan Agner <stefan@agner.ch>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Fix private registry authentication for aiodocker image pulls
After PR #6252 migrated image pulling from dockerpy to aiodocker,
private registry authentication stopped working. The old _docker_login()
method stored credentials in ~/.docker/config.json via dockerpy, but
aiodocker doesn't read that file - it requires credentials passed
explicitly via the auth parameter.
Changes:
- Remove unused _docker_login() method (dockerpy login was ineffective)
- Pass credentials directly to pull_image() via new auth parameter
- Add auth parameter to DockerAPI.pull_image() method
- Add unit tests for Docker Hub and custom registry authentication
Fixes#6345🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Ignore protected access in test
* Fix plug-in pull test
* Fix HA core tests
---------
Co-authored-by: Claude <noreply@anthropic.com>
* Migrate images from dockerpy to aiodocker
* Add missing coverage and fix bug in repair
* Bind libraries to different files and refactor images.pull
* Use the same socket again
Try using the same socket again.
* Fix pytest
---------
Co-authored-by: Stefan Agner <stefan@agner.ch>
* Formally deprecate CodeNotary build config
* Remove CodeNotary specific integrity checking
The current code is specific to how CodeNotary was doing integrity
checking. A future integrity checking mechanism likely will work
differently (e.g. through EROFS based containers). Remove the current
code to make way for a future implementation.
* Drop CodeNotary integrity fixups
* Drop unused tests
* Fix pytest
* Fix pytest
* Remove CodeNotary related exceptions and handling
Remove CodeNotary related exceptions and handling from the Docker
interface.
* Drop unnecessary comment
* Remove Codenotary specific IssueType/SuggestionType
* Drop Codenotary specific environment and secret reference
* Remove unused constants
* Introduce APIGone exception for removed APIs
Introduce a new exception class APIGone to indicate that certain API
features have been removed and are no longer available. Update the
security integrity check endpoint to raise this new exception instead
of a generic APIError, providing clearer communication to clients that
the feature has been intentionally removed.
* Drop content trust
A cosign based signature verification will likely be named differently
to avoid confusion with existing implementations. For now, remove the
content trust option entirely.
* Drop code sign test
* Remove source_mods/content_trust evaluations
* Remove content_trust reference in bootstrap.py
* Fix security tests
* Drop unused tests
* Drop codenotary from schema
Since we have "remove extra" in voluptuous, we can remove the
codenotary field from the addon schema.
* Remove content_trust from tests
* Remove content_trust unsupported reason
* Remove unnecessary comment
* Remove unrelated pytest
* Remove unrelated fixtures
* Write cidfiles of Docker containers and mount them individually to /run/cid
There is no standard way to get the container ID in the container
itself, which can be needed for instance for #6006. The usual pattern is
to use the --cidfile argument of Docker CLI and mount the generated file
to the container. However, this is feature of Docker CLI and we can't
use it when creating the containers via API. To get container ID to
implement native logging in e.g. Core as well, we need the help of the
Supervisor.
This change implements similar feature fully in Supervisor's DockerAPI
class that orchestrates lifetime of all containers managed by
Supervisor. The files are created in the SUPERVISOR_DATA directory, as
it needs to be persisted between reboots, just as the instances of
Docker containers are.
Supervisor's cidfile must be created when starting the Supervisor
itself, for that see home-assistant/operating-system#4276.
* Address review comments, fix mounting of the cidfile
* Stop refreshing the update information on outdated OS versions
Add `JobCondition.OS_SUPPORTED` to the updater job to avoid
refreshing update information when the OS version is unsupported.
This effectively freezes installations on unsupported OS versions
and blocks Supervisor updates. Once deployed, this ensures that any
Supervisor will always run on at least the minimum supported OS
version.
This requires to move the OS version check before Supervisor updater
initialization to allow the `JobCondition.OS_SUPPORTED` to work
correctly.
* Run only OS version check in setup loads
Instead of running a full system evaluation, only run the OS version
check right after the OS manager is loaded. This allows the
updater job condition to work correctly without running the full
system evaluation, which is not needed at this point.
* Prevent Core and Add-on updates on unsupported OS versions
Also prevent Home Assistant Core and Add-on updates on unsupported OS
versions. We could imply `JobCondition.SUPERVISOR_UPDATED` whenever
OS is outdated, but this would also prevent the OS update itself. So
we need this separate condition everywhere where
`JobCondition.SUPERVISOR_UPDATED` is used except for OS updates.
It should also be safe to let the add-on store update, we simply
don't allow the add-on to be installed or updated if the OS is
outdated.
* Remove unnecessary Host info update
It seems that the CPE information are already loaded in the HostInfo
object. Remove the unnecessary update call.
* Fix pytest
* Delay refreshing of update data
Delay refreshing of update data until after setup phase. This allows to
use the JobCondition.OS_SUPPORTED safely. We still have to fetch the
updater data in case OS information is outdated. This typically happens
on device wipe.
Note also that plug-ins will automatically refresh updater data in case
it is missing the latest version information.
This will reverse the order of updates when there are new plug-in and
Supervisor update information available (e.g. on first startup):
Previously the updater data got refreshed before the plug-in started,
which caused them to update first. Then the Supervisor got update in
startup phase. Now the updater data gets refreshed in startup phase,
which then causes the Supervisor to update first before the plug-ins
get updated after Supervisor restart.
* Fix pytest
* Fix updater tests
* Add new tests to verify that updater reload is skipped
* Fix pylint
* Apply suggestions from code review
Co-authored-by: Mike Degatano <michael.degatano@gmail.com>
* Add debug message when we delay version fetch
---------
Co-authored-by: Mike Degatano <michael.degatano@gmail.com>
* Send progress updates during image pull for install/update
* Add extra to tests about job APIs
* Sent out of date progress to sentry and combine done event
* Pulling container image layer
* Add Supervisor connectivity check after DNS restart
When the DNS plug-in got restarted, check Supervisor connectivity
in case the DNS plug-in configuration change influenced Supervisor
connectivity. This is helpful when a DHCP server gets started after
Home Assistant is up. In that case the network provided DNS server
(local DNS server) becomes available after the DNS plug-in restart.
Without this change, the Supervisor connectivity will remain false
until the a Job triggers a connectivity check, for example the
periodic update check (which causes a updater and store reload) by
Core.
* Fix pytest and add coverage for new functionality
* Improve DNS plug-in restart
Instead of simply go by PrimaryConnectioon change, use the DnsManager
Configuration property. This property is ultimately used to write the
DNS plug-in configuration, so it is really the relevant information
we pass on to the plug-in.
* Check for changes and restart DNS plugin
* Check for changes in plug-in DNS
Cache last local (NetworkManager) provided DNS servers. Check against
this DNS server list when deciding when to restart the DNS plug-in.
* Check connectivity unthrottled in certain situations
* Fix pytest
* Fix pytest
* Improve test coverage for DNS plugins restart functionality
* Apply suggestions from code review
Co-authored-by: Mike Degatano <michael.degatano@gmail.com>
* Debounce local DNS changes and event based connectivity checks
* Remove connection check logic
* Remove unthrottled connectivity check
* Fix delayed call
* Store restart task and cancel in case a restart is running
* Improve DNS configuration change tests
* Remove stale code
* Improve DNS plug-in tests, less mocking
* Cover multiple private functions at once
Improve tests around notify_locals_changed() to cover multiple
functions at once.
---------
Co-authored-by: Mike Degatano <michael.degatano@gmail.com>
* Recreate aiohttp ClientSession after DNS plug-in load
Create a temporary ClientSession early in case we need to load version
information from the internet. This doesn't use the final DNS setup
and hence might fail to load in certain situations since we don't have
the fallback mechanims in place yet. But if the DNS container image
is present, we'll continue the setup and load the DNS plug-in. We then
can recreate the ClientSession such that it uses the DNS plug-in.
This works around an issue with aiodns, which today doesn't reload
`resolv.conf` automatically when it changes. This lead to Supervisor
using the initial `resolv.conf` as created by Docker. It meant that
we did not use the DNS plug-in (and its fallback capabilities) in
Supervisor. Also it meant that changes to the DNS setup at runtime
did not propagate to the aiohttp ClientSession (as observed in #5332).
* Mock aiohttp.ClientSession for all tests
Currently in several places pytest actually uses the aiohttp
ClientSession and reaches out to the internet. This is not ideal
for unit tests and should be avoided.
This creates several new fixtures to aid this effort: The `websession`
fixture simply returns a mocked aiohttp.ClientSession, which can be
used whenever a function is tested which needs the global websession.
A separate new fixture to mock the connectivity check named
`supervisor_internet` since this is often used through the Job
decorator which require INTERNET_SYSTEM.
And the `mock_update_data` uses the already existing update json
test data from the fixture directory instead of loading the data
from the internet.
* Log ClientSession nameserver information
When recreating the aiohttp ClientSession, log information what
nameservers exactly are going to be used.
* Refuse ClientSession initialization when API is available
Previous attempts to reinitialize the ClientSession have shown
use of the ClientSession after it was closed due to API requets
being handled in parallel to the reinitialization (see #5851).
Make sure this is not possible by refusing to reinitialize the
ClientSession when the API is available.
* Fix pytests
Also sure we don't create aiohttp ClientSession objects unnecessarily.
* Apply suggestions from code review
Co-authored-by: Jan Čermák <sairon@users.noreply.github.com>
---------
Co-authored-by: Jan Čermák <sairon@users.noreply.github.com>
* Delay inital version fetch until there is connectivity
* Add test
* Only mock get not whole websession object
* drive delayed fetch off of supervisor connectivity not host
* Fix test to not rely on sleep guessing to track tasks
* Use fixture to remove job throttle temporarily
* Allow adoption of existing data disk
* Fix existing tests
* Add test cases and fix image issues
* Fix addon build test
* Run checks during setup not startup
* Addon load mimics plugin and HA load for docker part
* Default image accessible in except
* Migrate to Ruff for lint and format
* Fix pylint issues
* DBus property sets into normal awaitable methods
* Fix tests relying on separate tasks in connect
* Fixes from feedback
* Bad message error marks system as unhealthy
* Finish adding test cases for changes
* Rename test file for uniqueness
* bad_message to oserror_bad_message
* Omit some checks and check for network mounts
* Remove race with watchdog during backup, restore and update
* Fix pylint issues and test
* Stop after image pull during update
* Add test for max failed attempts for plugin watchdog
* Add job group execution limit option
* Fix pylint issues
* Assign variable before usage
* Cleanup jobs when done
* Remove isinstance check for performance
* Explicitly raise from None
* Add some more documentation info
* Reduce executor code for docker
* Fix pylint errors and move import/export image
* Fix test and a couple other risky executor calls
* Fix dataclass and return
* Fix test case and add one for corrupt docker
* Add some coverage
* Undo changes to docker manager startup
* Add update freeze option
* Freeze to auto update and plugin condition
* Add tests
* Add supervisor_version evaluation
* OS updates require supervisor up to date
* Run version check during startup
* Docker events based watchdog
* Separate monitor from DockerAPI since it needs coresys
* Move monitor into dockerAPI
* Fix properties on coresys
* Add watchdog tests
* Added tests
* pylint issue
* Current state failures test
* Thread-safe event processing
* Use labels property
* Create issue for detected DNS server problem
* Validate behavior on restart as well
* tls:// not supported, remove check
* Move DNS server checks into resolution checks
* Revert all changes to plugins.dns
* Run DNS server checks if affected
* Mock aiodns query during all checks tests