PR #6726 removed the early return after a HomeAssistantError from the
post-update get_config() call so that a Core that stopped responding
after an update would correctly trigger a rollback. That early return
was, however, also load-bearing for the backup restore flow:
Backup.restore_homeassistant() stops and removes Core before invoking
core.update(target_version) and starts Core later in its own
await_home_assistant_restart stage. With Core not running, _update()
correctly skips the start step, but the unconditional post-update
get_config() now always raises, sets error_state, and triggers a
spurious rollback that re-pulls the previous image and leaves the
system on the wrong version after the restore completes.
Return early from update() when Core was not running on entry. The
caller is responsible for starting Core and there is no live API to
health-check at this point. Genuine update failures (Core was running,
update broke it) are unaffected and still roll back.
Also rename the local rollback to rollback_version for clarity.
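The early return described above can be sketched roughly as follows. This is a simplified illustration, not the Supervisor implementation: `FakeCore`, its attributes and the `healthy` flag are hypothetical stand-ins for the real update, start and get_config() steps.

```python
import asyncio


class FakeCore:
    """Hypothetical stand-in for the Core manager; not Supervisor's class."""

    def __init__(self, running: bool, healthy: bool = True) -> None:
        self.running = running
        self.healthy = healthy
        self.version = "old"
        self.rolled_back = False

    async def update(self, target_version: str) -> None:
        was_running = self.running
        rollback_version = self.version
        self.version = target_version  # stands in for pulling the new image

        if not was_running:
            # The caller (for example a restore) starts Core later, so
            # there is no live API to health-check yet: no rollback.
            return

        if not self.healthy:  # stands in for a failing get_config()
            self.version = rollback_version
            self.rolled_back = True


restore = FakeCore(running=False)
asyncio.run(restore.update("new"))
print(restore.version, restore.rolled_back)
```

A Core that was running and fails the health check still takes the rollback branch in this sketch.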
The UNIX_SOCKET_CORE_API feature flag has been the only way to opt into
Unix socket communication between Supervisor and Home Assistant Core.
Now that the implementation has settled, enable it by default for Core
versions at or above 2026.5.1. Versions in the supported range below
that (down to CORE_UNIX_SOCKET_MIN_VERSION) continue to require the
feature flag, preserving the existing opt-in behavior for early dev
builds.
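A minimal sketch of the gating rule, under stated assumptions: versions are plain dotted numbers (dev suffixes are not handled), the minimum version is passed in because its value is not given here, and versions below the minimum are assumed never to use the socket. The function names are illustrative only.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Parse a plain dotted version such as "2026.5.1" into a tuple."""
    return tuple(int(part) for part in version.split("."))


DEFAULT_ON_VERSION = parse_version("2026.5.1")


def use_unix_socket(core_version: str, min_version: str, feature_flag: bool) -> bool:
    """Decide whether the Unix socket is used for a given Core version."""
    version = parse_version(core_version)
    if version < parse_version(min_version):
        return False  # assumption: below the minimum the socket is never used
    if version >= DEFAULT_ON_VERSION:
        return True  # enabled by default, no flag needed
    return feature_flag  # supported range below the default: opt-in only


# "2026.1.0" is a placeholder minimum, not the real constant's value.
print(use_unix_socket("2026.5.1", "2026.1.0", feature_flag=False))
```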
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This flattens the Docker image from 9 layers to 5 by using a multi-stage
build that squashes layers into logical blocks. The first layer on top
of the base image adds system-wide packages and uv, and should be
preserved between releases. uv weighs roughly 50 MB and is not updated
often; if it were, it might be wise to move it into the next layer or a
separate one. The next layer adds all Supervisor Python code and
dependencies.
This means that unless the base image or packages installed in the first
stage are changed (or in other words, only Supervisor code is changed),
only a single layer is pulled from the repository. Previously, a release
generally resulted in a pull of all 4 following layers, as a change in
the requirements alone invalidated the layers after it. The fetched
payload size remains roughly the same.
* Replace fixed-duration sleeps after bus events with gather
Several tests use ``await asyncio.sleep(...)`` to "wait for the
listener to run" after firing a bus event. The fixed duration is
real wall-clock time and the wait is nondeterministic: if the
handler chain happens to need slightly more time on a busy CI
runner, the assertion races the handler.
``Bus.fire_event`` returns the listener tasks since #6252; capture
and ``await asyncio.gather(*tasks)`` instead of sleeping. Touches
test_bus.py (the bus tests were poking scheduling instead of
verifying their assertions), test_home_assistant_watchdog.py,
test_plugin_base.py, addons/test_manager.py, docker/test_addon.py,
and test_store_execute_reload.py.
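The pattern can be illustrated with a toy bus. `MiniBus` below is a hypothetical stand-in that only mimics the property relied on here (fire_event returning the spawned listener tasks); it is not the Supervisor `Bus` class.

```python
import asyncio
from collections.abc import Callable, Coroutine
from typing import Any

Listener = Callable[[str], Coroutine[Any, Any, None]]


class MiniBus:
    """Toy event bus whose fire_event returns the spawned listener tasks."""

    def __init__(self) -> None:
        self._listeners: list[Listener] = []

    def register(self, listener: Listener) -> None:
        self._listeners.append(listener)

    def fire_event(self, data: str) -> list[asyncio.Task[None]]:
        return [asyncio.create_task(listener(data)) for listener in self._listeners]


async def demo() -> list[str]:
    seen: list[str] = []

    async def listener(data: str) -> None:
        await asyncio.sleep(0)  # yield at least once, like a real handler
        seen.append(data)

    bus = MiniBus()
    bus.register(listener)
    # Wait for exactly the work that was spawned instead of sleeping.
    await asyncio.gather(*bus.fire_event("started"))
    return seen


print(asyncio.run(demo()))
```

Because gather re-raises a listener's exception, a failing handler now fails the test rather than ending up as an unobserved background-task error.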
Other cleanups in the same spirit:
- ``_fire_test_event`` in addons/test_addon.py becomes ``async def``
and gathers the listener tasks itself, so its 17 call sites
collapse to a single ``await _fire_test_event(...)``.
- The two test_store_execute_reload.py sites that used the private
``_update_connectivity()`` helper are reworked to set the cached
connectivity flag directly and fire the event themselves so they
can gather the listener tasks the same way.
- The two ``sleep(1)`` post-pull drains in docker/test_interface.py
collapse to ``sleep(0)`` (handler tasks are already gathered
inside pull_image), saving ~2s.
- The ``sleep(0.01)`` waits inside ``container_events()`` task
bodies (api/test_addons.py, api/test_store.py,
backups/test_manager.py) only need to yield once to the parent
and become ``sleep(0)``.
Switching to ``gather`` exposes a few latent test mocks that were
silently swallowing TypeErrors as background-task failures before:
- ``CGroup.add_devices_allowed`` is ``async def`` but was patched
as a plain MagicMock in docker/test_addon.py — now patched via
``new_callable=AsyncMock``.
- The watchdog does ``await (await self.start())`` /
``await (await self.restart())`` because ``App.start`` /
``App.restart`` return ``asyncio.Task``. The mocks in
addons/test_addon.py (test_app_watchdog, test_watchdog_on_stop,
test_watchdog_during_attach) needed
``AsyncMock(return_value=<settled future>)`` to mirror that
shape rather than a plain MagicMock.
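A small sketch of the double-await mock shape, using only unittest.mock from the standard library. `watchdog_restart` is a hypothetical reduction of the watchdog call, not the real method.

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock


async def watchdog_restart(app) -> str:
    # start() is awaited to obtain a task or future, which is awaited in turn.
    return await (await app.start())


async def demo_async_mock() -> str:
    settled = asyncio.get_running_loop().create_future()
    settled.set_result("started")
    app = MagicMock()
    app.start = AsyncMock(return_value=settled)  # mirrors the real shape
    return await watchdog_restart(app)


async def demo_plain_mock() -> bool:
    app = MagicMock()  # app.start() returns a MagicMock, which is not awaitable
    try:
        await watchdog_restart(app)
    except TypeError:
        return True
    return False


print(asyncio.run(demo_async_mock()))
```

With the plain MagicMock the inner await raises TypeError, which is the failure that stayed hidden while nothing awaited the listener task.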
* Factor bus.fire_event + gather pattern into a helper
Per review feedback, the ``await asyncio.gather(*coresys.bus.fire_event(...))``
incantation was scattered across many call sites. Add
``tests.common.fire_bus_event`` that takes the coresys, event and data,
fires the event and awaits the spawned listener tasks. Convert all
matching sites to use it, including the ``_fire_test_event`` wrapper
in addons/test_addon.py which now just builds the
``DockerContainerStateEvent`` and delegates.
* Improve and extend frontend probe after update with WebSocket check
The post-update health check introduced in #6311 added
HomeAssistantAPI.check_frontend_available, which fetched the frontend
through the existing Supervisor-internal API connection to Core.
Since #6742 that connection optionally runs over a Unix socket with
no authentication, so the request no longer exercises the same
transport, auth and routing path that an external HTTP client uses.
Move the frontend probe out of HomeAssistantAPI into a small
frontend_check module that talks to Core's TCP endpoints via the
plain websession with no authentication, mirroring what an external
client would see.
While doing this, extend the post-update verification to also probe
the WebSocket endpoint: open /api/websocket and confirm the first
frame is the auth_required text message. This catches the kind of
WebSocket breakage seen in #6802, where api/config still listed
websocket_api as loaded and GET / still returned HTML, but the
WebSocket handshake completed with an immediate close frame and the
frontend was unusable.
The component check now also requires "http" to be loaded, in
addition to "frontend" and "websocket_api", and iterates so every
missing component is logged.
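The component check could look roughly like the sketch below. The function name, log level and log wording are assumptions for illustration; only the three required component names come from the description above.

```python
import logging

_LOGGER = logging.getLogger("frontend_check_sketch")

REQUIRED_COMPONENTS = ("frontend", "http", "websocket_api")


def missing_components(loaded: list[str]) -> list[str]:
    """Return every required component that is absent, logging each one."""
    missing = [name for name in REQUIRED_COMPONENTS if name not in loaded]
    for name in missing:
        # Log every missing component rather than stopping at the first.
        _LOGGER.error("Core did not load the %s component", name)
    return missing


print(missing_components(["frontend", "mqtt"]))
```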
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Address review feedback on WebSocket probe
- Wrap ws_connect in asyncio.wait_for so the handshake has an explicit
bounded timeout (the global websession's default timeout would
otherwise apply).
- Validate that the auth_required payload is a JSON object before
calling .get("type"); a list/string would otherwise raise
AttributeError at runtime.
- Add a regression test covering a non-dict JSON payload.
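The payload validation can be sketched as a pure function. `is_auth_required` is a hypothetical helper name; the sketch covers only the type check on the first frame, not the connection or timeout handling.

```python
import json


def is_auth_required(frame_text: str) -> bool:
    """Return True if the first WebSocket text frame is an auth_required message."""
    try:
        payload = json.loads(frame_text)
    except ValueError:  # json.JSONDecodeError is a ValueError subclass
        return False
    # A list or string has no .get(), so check the type before using it.
    return isinstance(payload, dict) and payload.get("type") == "auth_required"


print(is_auth_required('{"type": "auth_required"}'))
print(is_auth_required('["auth_required"]'))
```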
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The log level of the fire-and-forget _async_send_command path was raised
from DEBUG to WARNING in #6725 for better visibility. In practice it's noisy during
normal Core lifecycle events (restart, update): Supervisor fires
supervisor_job_start/supervisor_job_end events towards Core while the
container is intentionally not running, and each event logs a warning.
The DEBUG line from the API layer just above ("Core container is not
running") already explains the cause, so the WARNING just restates it.
Synchronous async_send_command callers still see raised exceptions, so
genuine failures that callers care about are not hidden. Restores the
original DEBUG level introduced together with the raise-on-failure
behavior in #6553.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Migrate persisted 'addon' field to 'app' in discovery and services config
Rename the 'addon' key to 'app' in persisted configuration files for
discovery messages (discovery.json), service modules (services.json),
and supervisor config (supervisor.json), as part of the broader
addon->app terminology migration.
Changes:
- Add ATTR_ADDON = "addon" to const.py for V1 API compat/migration
- Add ATTR_ADDONS_CUSTOM_LIST = "addons_custom_list" to const.py for migration
- Change ATTR_APPS_CUSTOM_LIST value from "addons_custom_list" to "apps_custom_list"
- Add _migrate_supervisor_config() schema pre-processor in validate.py to
transparently load old supervisor.json files using the old key
- Add ATTR_ADDON to services/const.py; change ATTR_APP value to "app"
- Add _migrate_addon_to_app() pre-processors to MQTT, MySQL, and discovery
schemas to load old config files that used the "addon" key
- Rename Message.addon -> Message.app in Discovery and update all references
- Keep hassio_push/discovery payload using "addon" key for HA compatibility
- GET /services/{service} and GET /discovery: V1 returns "addon" key,
V2 returns "app" key, via dedicated _v1 handler methods following the
backups/store pattern, registered with AppVersion guards in
_register_services() and _register_discovery()
- Broaden FileConfiguration schema type annotation to accept vol.All
validators in addition to vol.Schema
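A key-renaming pre-processor of this kind might look like the sketch below. It is a plain-Python illustration with a hypothetical name and body, not the actual _migrate_addon_to_app(); in the real schemas such a function would run ahead of the schema via vol.All.

```python
from typing import Any


def migrate_addon_to_app(config: Any) -> Any:
    """Return config with a legacy "addon" key renamed to "app"."""
    if not isinstance(config, dict) or "addon" not in config:
        return config  # nothing to migrate; let the schema validate it
    migrated = {key: value for key, value in config.items() if key != "addon"}
    # Assumption: if both keys are present, the new "app" key wins.
    migrated.setdefault("app", config["addon"])
    return migrated


print(migrate_addon_to_app({"addon": "core_mosquitto", "host": "core-mosquitto"}))
```

Returning a new dict leaves the caller's data untouched, so a failed validation afterwards cannot leave a half-migrated structure behind.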
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Add schema migration tests for addon->app config key rename
Test that backwards-compatible migration of old 'addon'/'addons_custom_list'
keys to 'app'/'apps_custom_list' works correctly in all affected schemas,
and that the new keys are accepted without modification.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Add an __init__ to discovery tests
* Add app_api_client_with_prefix fixture and update V1/V2 tests
Move the app-level V1/V2 fixture to tests/api/conftest.py as
app_api_client_with_prefix for use across any endpoint that requires
app-level credentials (services_role, app.discovery, etc.).
- Add app_api_client_with_prefix fixture to conftest.py
- Update test_set_service_already_provided and test_del_service_not_provided
to use app_api_client_with_prefix (covers both v1 and v2)
- Add test_get_service_v1_v2_keys asserting addon/app key per version
- Update test_api_discovery_forbidden, test_api_send_del_discovery,
test_api_invalid_discovery to use app_api_client_with_prefix
- Split test_discovery_not_found into test_discovery_not_found_get
(uses api_client_with_prefix, GET requires homeassistant) and
test_discovery_not_found_delete (uses app_api_client_with_prefix)
- Add test_get_discovery_v1_v2_keys asserting addon/app key per version
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Per CLAUDE.md, plain test_* functions are the project style; class-
based test grouping is considered legacy. Convert the 24 test methods
in test_pull_progress.py (TestLayerProgress, TestImagePullProgress)
to module-level functions — none of them used self, so the rewrite is
mechanical.
Also rename three helper classes whose names accidentally matched
pytest's Test* collection pattern, even though they are fakes/fixtures
rather than test cases:
- TestAddon -> FakeApp (data holder used as a fake App in pwned tests)
- TestDockerInterface -> FakeDockerInterface (fixture/inner helper in
docker tests)
The two DBusServiceMock subclasses named TestInterface already had
__test__ = False and are left alone.
CI and tox both passed ``--timeout=10`` explicitly, but a plain local
``pytest`` had no timeout — a hung asyncio task or stuck D-Bus signal
handler could stall a developer's run indefinitely while passing CI.
Move the timeout into ``[tool.pytest.ini_options]`` so it applies
everywhere (pytest auto-discovers ``pyproject.toml`` in the repo
root) and drop the now-redundant ``--timeout=10`` flags from
``ci.yaml`` and ``tox.ini``. The full suite already fits comfortably
under 10s per test, and ``@pytest.mark.timeout(N)`` remains
available for per-test overrides if a specific test ever needs more
headroom.
The pytest config sets ``asyncio_mode = "auto"``, which already
auto-marks every ``async def test_*`` as a coroutine test. The 38
``@pytest.mark.asyncio`` decorators sprinkled across the suite were
no-ops kept around from before that flag was set. Remove them along
with the now-unused ``import pytest`` lines they were the only
consumer of.
Pure mechanical cleanup; no test behavior changes.
The cleanup of a leftover cidfile path before creating a new container
silently suppressed any OSError from rmdir/unlink. When that cleanup
fails (e.g. the path is a non-empty directory or still busy from a
pending bind unmount), the subsequent touch() raises IsADirectoryError
with no breadcrumb explaining why the path was in an unexpected state.
Replace the bare suppress(OSError) with an explicit error log so the
underlying failure is visible in the Supervisor log when the follow-up
touch() blows up. Behavior is otherwise unchanged: a failed cleanup
still falls through to touch() as before.
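A minimal sketch of the logging cleanup, assuming a hypothetical helper name and log message. The boolean return value exists only so the example can show the outcome; the real code simply falls through to touch().

```python
import logging
import tempfile
from pathlib import Path

_LOGGER = logging.getLogger("cidfile_sketch")


def remove_leftover(path: Path) -> bool:
    """Try to remove a leftover path; log instead of hiding a failure."""
    if not path.exists():
        return True
    try:
        if path.is_dir():
            path.rmdir()
        else:
            path.unlink()
    except OSError as err:
        # Previously suppressed; now the underlying cause is visible.
        _LOGGER.error("Can't clean up leftover cidfile %s: %s", path, err)
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    leftover = Path(tmp) / "app.cid"
    leftover.mkdir()
    (leftover / "stale").touch()  # non-empty directory: rmdir() will fail
    print(remove_leftover(leftover))
```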
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up on #6739: with HassioError now logged and captured by Sentry
in api_process, BackupMountDownError surfaced as an "unexpected" 400
with a noisy log entry and a Sentry event (SUPERVISOR-1JXW), even
though the user had simply asked to back up to a mount that was not
currently available.
Map this through properly so the API returns a clean, structured 400:
- Make BackupMountDownError inherit from APIError, with error_key
"backup_mount_down", message_template "Backup mount '{mount}' is
down", and the mount name in extra_fields. Clients now get a
normalized, translatable message and a stable key instead of the raw
"<name> is down, cannot back-up to it" / "...cannot copy to it"
strings.
- Simplify both raise sites in BackupManager (_check_location and
_copy_to_location) to just pass mount=. @api_process turns the
result into a 400 without logging or Sentry capture, since this is
now a modeled client-state error rather than an unexpected one.
The mount being down is a runtime state issue users hit when their
NAS/CIFS share is briefly unreachable, not a Supervisor bug worth
paging on.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The builder workflow used a blanket `cancel-in-progress: true`, which is
fine for PR runs but harmful on `main`: when several PRs merge in quick
succession and one of them touches `requirements.txt`, the wheels
publish step from the in-flight run gets killed mid-upload. Subsequent
CI runs (and downstream consumers) then fail to install the wheels for
the latest requirements.
Scope `cancel-in-progress` to `pull_request` events so pushes to `main`
queue behind each other through the existing concurrency group, while
PRs still collapse to the latest commit as before.
#6765 renamed Supervisor.check_connectivity to
check_and_update_connectivity, but the mocked_setup_loads fixture in
tests/test_core.py still patched the old name. The patch.object call
raised AttributeError at fixture setup, erroring out the
test_setup_app_file_read_error_not_captured test before it could run.
Update the patch target to the new method name so Core.setup() sees an
AsyncMock for the connectivity probe again.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Treat JobException as a client-side API error
Job condition guards (system not running, no free space, etc.) and
concurrency rejections (another job in flight) raised by the @Job
decorator are explicit precondition failures with descriptive messages,
not unexpected errors. JobException inheriting HassioError directly
meant api_process caught them in its HassioError branch — which since
#6739 logs them as unexpected and captures them to Sentry.
Inherit APIError instead so api_process surfaces these through its
APIError branch with the original message and skips the
unexpected-error path. Status stays at APIError's default 400, so the
API contract is unchanged.
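The effect of the new base class can be sketched with a toy dispatcher. The class bodies, the `handle` function and the fallback message are hypothetical; the sketch assumes the APIError branch is checked before the generic HassioError branch.

```python
class HassioError(Exception):
    """Stand-in for the root Supervisor exception."""


class APIError(HassioError):
    """Stand-in for a modeled client-facing error."""

    status = 400


class JobException(APIError):
    """After the change: job guard failures are modeled API errors."""


captured: list[Exception] = []


def handle(err: HassioError) -> tuple[int, str]:
    """Toy version of the error branches in an API wrapper."""
    if isinstance(err, APIError):
        return err.status, str(err)  # original message, no capture
    captured.append(err)  # stands in for logging plus Sentry capture
    return 400, "unexpected error"  # placeholder text for this sketch


print(handle(JobException("not enough free space")), len(captured))
```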
Extended test_backup_immediate_errors to assert async_capture_exception
is not called for the freeze and free-space condition guards.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Silence too-many-ancestors on plugin job error mixins
The plugin-specific job error subclasses (CliJobError, ObserverJobError,
MulticastJobError, CoreDNSJobError, AudioJobError) cross pylint's
too-many-ancestors threshold once JobException inherits APIError. Add
the same `# pylint: disable=too-many-ancestors` already used on the
ResolutionNotFound subclasses with similar diamond inheritance.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Disable too-many-ancestors globally instead of per class
The pylint config already disables every other too-many-* rule "for the
sake of readability", but kept too-many-ancestors and forced inline
disables on diamond-inherited exception classes (the ResolutionNotFound
subclasses, and now five plugin job error mixins after the JobException
APIError change).
Add too-many-ancestors to the global disable list and drop all eight
inline annotations.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up on #6739: with HassioError now logged and captured by Sentry
in api_process, hostname rejections from systemd-hostnamed surfaced as
"unexpected" 400s with noisy log entries and a Sentry event, even
though the user had simply submitted an invalid hostname.
Map this through properly so the API returns a clean, structured 400:
- Split ErrorType.INVALID_ARGS out of DBusInterfaceMethodError into its
own DBusInvalidArgsError. The cases previously collapsed there are
semantically different: UNKNOWN_METHOD / INVALID_SIGNATURE mean the
call is broken (method missing or types wrong); INVALID_ARGS means
the call is valid but the service rejected an argument's value.
- Add HostInvalidHostnameError(HostError, APIError) with error_key and
extra_fields so clients get a normalized message and a stable key
rather than systemd's raw "Invalid static hostname '...'" text.
- Translate DBusInvalidArgsError to HostInvalidHostnameError in
SystemControl.set_hostname. @api_process turns the result into a 400
without logging or Sentry capture, since this is now a modeled
client-input error rather than an unexpected one.
Validation continues to live in hostnamed (hostname_is_valid() in
systemd's src/basic/hostname-util.c); Supervisor only translates the
rejection.
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Rework Supervisor connectivity check with coalescing and force flag
Previously, a failed connectivity probe could strand Supervisor in a
"no connectivity" state indefinitely. After an Ethernet reconnect, a
probe kicked by NetworkManager's connectivity transition could race
with CoreDNS being restarted (due to DNS locals changing), time out on
DNS, and leave supervisor.connectivity = False. The retry that
_on_dns_container_running was meant to fire landed inside the 5 s
JobThrottle window from the just-failed probe and was silently dropped,
since JobThrottle.THROTTLE drops rather than waits.
The rework replaces the @Job(throttle=THROTTLE) decorator and the
public connectivity setter with a single authoritative state-updating
method:
- check_and_update_connectivity(force=False) is the only path that
runs the HTTP probe and updates the cached state. Concurrent callers
coalesce onto a single in-flight probe. A min-interval throttle
lives inside the method and reuses the cached result within the
window instead of dropping calls.
- request_connectivity_check(force=False) is a fire-and-forget wrapper
for signal handlers (D-Bus, plugin callbacks) that must return
quickly without blocking signal dispatch on the HTTP round-trip.
- force=True bypasses the min-interval and, when a probe is in flight,
sets a trailing-rerun flag so the owning task runs one more probe
after the current one completes. Used for signals that carry fresh
state-change information (NM connectivity transition to FULL, DNS
container RUNNING, startup, post-NTP sync).
- _update_connectivity is the sole writer of the cached flag and
emits SUPERVISOR_CONNECTIVITY_CHANGE only on actual transitions.
Call sites migrate accordingly. The opportunistic
supervisor.connectivity = False writes in update_apparmor,
updater.fetch_data, os.manager, and addon_pwned error paths are
replaced with request_connectivity_check() calls so the probe remains
authoritative - an endpoint-specific failure no longer lies about the
overall connectivity state.
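The coalescing, min-interval and trailing-rerun behavior can be illustrated with a toy model. `ConnectivityChecker` and its attribute names are hypothetical, the probe is assumed to return a bool and not raise, and owner cancellation is only handled crudely; this is a sketch of the idea, not the Supervisor code.

```python
import asyncio
from collections.abc import Awaitable, Callable


class ConnectivityChecker:
    """Toy model of a coalescing connectivity check (names are illustrative)."""

    def __init__(
        self, probe: Callable[[], Awaitable[bool]], min_interval: float = 5.0
    ) -> None:
        self._probe = probe  # assumed to return a bool and not raise
        self._min_interval = min_interval
        self.connectivity = False
        self._last_check: float | None = None
        self._in_flight: asyncio.Future[bool] | None = None
        self._rerun = False

    async def check_and_update_connectivity(self, force: bool = False) -> bool:
        loop = asyncio.get_running_loop()
        if self._in_flight is not None:
            if force:
                self._rerun = True  # ask the owner for one more probe
            # Coalesce onto the probe that is already running.
            return await asyncio.shield(self._in_flight)
        if (
            not force
            and self._last_check is not None
            and loop.time() - self._last_check < self._min_interval
        ):
            return self.connectivity  # reuse the cached result
        future: asyncio.Future[bool] = loop.create_future()
        self._in_flight = future
        try:
            while True:
                self._rerun = False
                self.connectivity = await self._probe()
                self._last_check = loop.time()
                if not self._rerun:
                    break
            future.set_result(self.connectivity)
            return self.connectivity
        finally:
            self._in_flight = None
            if not future.done():
                future.cancel()  # owner failed or was cancelled


async def demo() -> list[int]:
    calls = 0

    async def probe() -> bool:
        nonlocal calls
        calls += 1
        await asyncio.sleep(0)
        return True

    checker = ConnectivityChecker(probe)
    counts = []
    await asyncio.gather(*(checker.check_and_update_connectivity() for _ in range(3)))
    counts.append(calls)  # three concurrent callers share one probe
    await checker.check_and_update_connectivity()
    counts.append(calls)  # within the interval: cached, no new probe
    await asyncio.gather(
        checker.check_and_update_connectivity(force=True),
        checker.check_and_update_connectivity(force=True),
    )
    counts.append(calls)  # forced probe plus one trailing rerun
    return counts


print(asyncio.run(demo()))
```

The recorded probe counts show that callers inside the window get the cached value rather than being dropped, and that a forced request arriving mid-probe produces exactly one extra probe.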
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Propagate connectivity-probe cancellation and skip last-check on cancel
Awaiting an asyncio.Task does not propagate cancellation INTO the task,
so the previous owner-doesn't-shield comment was misleading: a cancelled
owner left the spawned probe running orphaned, and the next caller could
start a second probe alongside it. The owner now explicitly cancels and
awaits the probe on CancelledError before re-raising.
The last-check timestamp is also moved out of the finally block so a
cancelled probe does not leave a "fresh result just ran" cache behind
that would short-circuit the next non-forced caller.
A regression test exercises both: that owner cancellation clears the
in-flight reference and leaves the timestamp untouched, and that a
subsequent non-forced check therefore still actually probes.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Clarify why post-NTP-sync forces a connectivity probe
The previous comment claimed the last-check timestamp may be unreliable
after a time jump, but _connectivity_last_check uses loop.time() which
is monotonic and unaffected by wall-clock corrections. The real reason
to force a fresh probe is TLS validation: certificates that appeared
expired or not-yet-valid before the system clock was corrected may now
verify, so a probe that just failed with an SSL error can succeed.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Add debug logs to Supervisor connectivity probe paths
The original stuck-offline bug was hard to spot in logs because the
silent throttle-drop and the cached state had no audit trail. With
debug-level logging at each decision point, a future investigation can
reconstruct from a single log file:
- who requested a check (force flag distinguishes signal-driven probes
from precondition / opportunistic-error-path requests)
- why a probe did not actually run (in-flight coalesce, cached within
min-interval, owner cancellation)
- when a forced rerun was queued and when it ran (the precise failure
mode that stranded the supervisor in the original incident)
- when the cached state actually flipped (with the previous value in
the message so transitions are visible)
All new lines are debug-level. The existing _do_connectivity_check
"failed" / "succeeded" lines are kept unchanged.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Skip system-checks fan-out in test_events_on_issue_changes
The test asserts that apply_suggestion fires an ISSUE_REMOVED event.
ISSUE_REMOVED is fired by dismiss_issue inside FixupBase.__call__, before
apply_suggestion calls healthcheck. The healthcheck call afterwards is
incidental to this test's intent, but it fans out into check_system()
which runs CheckDNSServer (A and AAAA) - real aiodns query_dns() probes
against the NetworkManager mock's stub nameserver 192.168.30.1 that each
hit the default ~10 s aiodns timeout. The file took ~21 s to run.
The slowness has been latent since #3818 (Aug 2022), which added the
apply_suggestion step at the end of test_events_on_issue_changes two
days after the DNS check landed in its current form (#3811). The default
24 h JobThrottle on CheckDNSServer.run_check tends to mask the cost in
full-suite runs once any earlier test has tripped the throttle, which is
likely why this slipped through.
Mock coresys.resolution.healthcheck for just this one apply_suggestion
call rather than introducing a file-wide DNS mock. The patch is local to
the slow call site and the test's assertion is unaffected. The file
drops from ~21 s to ~2.5 s.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Refactor Supervisor network reattach path
On fresh startup the Supervisor Docker network is created and known
plugin containers are re-attached. Plugin containers (observer, cli,
dns, audio) legitimately don't exist yet at that point, which produced
noisy ERROR lines before the exception was suppressed by the caller.
- attach_container_by_name() now raises DockerNotFound silently on 404
and DockerError without implicit logging on other Docker API errors.
- _create_supervisor_network() iterates all managed containers in a
single loop using explicit try/except, replacing three separate
suppress(DockerError) blocks. Missing containers are logged at DEBUG,
unexpected Docker errors at ERROR.
- Drop the alias argument on the reattach path. Docker adds the
container name as an implicit network alias, and inter-container
lookups go through ExtraHosts (/etc/hosts), not Docker DNS, so the
explicit alias list was cosmetic and inconsistent with the
first-create path anyway.
- Consolidate AUDIO_DOCKER_NAME, CLI_DOCKER_NAME, DNS_DOCKER_NAME in
supervisor/const.py alongside the existing OBSERVER_DOCKER_NAME and
SUPERVISOR_DOCKER_NAME constants.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Escalate network attach failures and handle Supervisor specially
Pull the Supervisor container out of the reattach loop since it must
exist — Supervisor is running the code. Any failure attaching it is a
real problem, so log at CRITICAL with exc_info so Sentry captures the
full traceback.
For plugin containers, escalate non-404 errors from ERROR to CRITICAL
(also with exc_info). A DockerError there typically means Docker
itself is unhealthy, which affects the whole system and warrants a
Sentry report. Missing plugin containers (DockerNotFound) continue to
be a DEBUG log since they're expected on fresh install.
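The plugin reattach loop can be sketched as below. The exception classes are local stand-ins, the attach callable is synchronous for brevity, and the function name and log wording are assumptions; only the DEBUG versus CRITICAL split comes from the description above.

```python
import logging
from collections.abc import Callable, Iterable

_LOGGER = logging.getLogger("network_sketch")


class DockerError(Exception):
    """Stand-in for a generic Docker API failure."""


class DockerNotFound(DockerError):
    """Stand-in for a 404 from Docker: the container does not exist."""


def reattach_plugins(
    names: Iterable[str], attach: Callable[[str], None]
) -> list[str]:
    """Attach each plugin container, returning the names that succeeded."""
    attached = []
    for name in names:
        try:
            attach(name)
        except DockerNotFound:
            # Expected on a fresh install: the plugin is not created yet.
            _LOGGER.debug("Container %s not found, skipping", name)
        except DockerError:
            # Likely an unhealthy Docker daemon: keep the traceback.
            _LOGGER.critical("Can't attach %s to network", name, exc_info=True)
        else:
            attached.append(name)
    return attached


def fake_attach(name: str) -> None:
    if name == "dns":
        raise DockerNotFound(name)
    if name == "audio":
        raise DockerError(name)


print(reattach_plugins(["observer", "cli", "dns", "audio"], fake_attach))
```

Catching the subclass first matters: with the order reversed, a missing container would be reported at CRITICAL.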
Addresses review feedback on #6760.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Handle add-on filesystem errors gracefully and reduce Sentry noise
Add AddonFileReadError for add-on metadata read failures (long_description,
refresh_path_cache) caused by filesystem errors like EBADMSG (errno 74).
The new exception calls check_oserror() to mark the system unhealthy via
the resolution system, then raises a translatable API error so callers
get a proper error response instead of an unhandled OSError.
Fixes SUPERVISOR-BC6 (548K events from the API path) and
SUPERVISOR-BZJ (from the startup/load path).
In core.py setup(), skip reporting exceptions to Sentry when the error
has already been handled by the resolution system. This is detected by
checking if a new unhealthy reason was added during the task execution
(e.g. via check_oserror). In that case the user is already notified, so
we log at error level (no stack trace) instead of critical (which would
also send to Sentry via the LoggingIntegration) and skip the explicit
capture_exception call.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Skip Sentry capture for AppFileReadError in setup()
Replace the unhealthy-state comparison logic with an explicit
`except AppFileReadError` clause. The error is already reported to
the user via the resolution system (check_oserror adds an unhealthy
reason), so capturing it to Sentry just adds noise.
Log at error level without stack trace instead of critical to avoid
the LoggingIntegration picking it up.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Add tests for AppFileReadError and setup() Sentry handling
Test that long_description and refresh_path_cache raise AppFileReadError
and mark the system unhealthy for EBADMSG errors, and raise without
marking unhealthy for other OSError types.
Also test Core.setup() to verify AppFileReadError is handled without
Sentry capture while other exceptions are captured as before.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The core_proxy middleware was Supervisor's side of the original
Core-to-Supervisor proxy auth scheme: when pre-2023.3.4 Home Assistant
Core forwarded a user-issued request to Supervisor under its
privileged supervisor token, it tagged the request with X-Hass-User-ID
and X-Hass-Is-Admin headers identifying the upstream user. Supervisor
inspected those headers, plus a header adjacency heuristic
fingerprinting aiohttp's proxy header layout, to distinguish forwarded
requests from native Core calls and reject proxied requests that
lacked user identity.
Core 2023.3.4 (PR home-assistant/core#89379) replaced that scheme:
Core now does the path-level gating itself before proxying and no
longer sends the X-Hass-* headers, so the middleware short-circuits
for any Core newer than that. With the 2-year Core support policy
introduced in #6148, every supported installation is well past
2023.3.4, making the middleware unreachable in practice.
Drop the middleware along with its now-unused supporting code: the
_CORE_VERSION constant, the supervisor_frontend pattern field (a
duplicate of the frontend asset list already exempted via
no_security_check), and the AwesomeVersion / LANDINGPAGE /
version_is_new_enough imports it relied on. The frontend asset bypass
itself is unchanged — it still lives in no_security_check.
* Refactor API registration to support v1/v2 via shared methods
- Add AppVersion StrEnum (V1, V2) to supervisor/api/const.py
- Replace self.v2_app with self._v2_app and expose a versions property
(dict[AppVersion, web.Application]) computed dynamically so that test
fixtures reassigning self.webapp are automatically reflected in V1
- All _register_* methods now accept a required app: web.Application
parameter; version-specific routes are gated with
"if app is self.versions[AppVersion.V1/V2]:"
- load() loops over enabled_versions (V1 always, V2 when feature-flagged)
and calls each registration method once per version, no duplication
- Static resources are registered before webapp.add_subapp() to avoid
registering into a frozen router
- add_subapp uses self.webapp directly for readability
- Fold _register_v2_apps/_register_v2_backups/_register_v2_store into
their respective unified methods; remove the now-defunct _register_v2_*
helpers and the _api_apps/_api_backups/_api_store instance vars
- _register_proxy and _register_ingress updated to accept app; legacy
/homeassistant/* proxy routes gated behind V1 conditional
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Add dual v1/v2 parametrization to API tests
All 163 tests across 17 API modules that register identically on both
v1 and v2 now run against both versions via api_client_with_prefix.
- tests/api/conftest.py: advanced_logs_tester switched to
api_client_with_prefix so log-endpoint tests are auto-parametrized;
accepts optional v2_path_prefix kwarg for paths that differ by version
- tests/api/test_{auth,discovery,dns,docker,hardware,host,ingress,
jobs,mounts,network,os,resolution,security,services,supervisor}.py:
api_client -> api_client_with_prefix with path prefix unpacking
- supervisor/api/__init__.py: _register_panel() moved outside the
version loop -- frontend static assets are V1-only
- tests/api/test_panel.py: kept on plain api_client (V1-only)
Tests intentionally kept V1-only:
- auth/discovery: use indirect api_client parametrize for addon context
- homeassistant: all tests call legacy /homeassistant/* paths (V1-only)
- jobs (4 tests): inner @Job-decorated classes register names into a
module-level set; re-running the same test raises RuntimeError
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Extend dual v1/v2 parametrization to homeassistant and jobs tests
tests/api/conftest.py:
- Add core_api_client_with_root fixture parametrized over three paths:
v1-core: /core/... (canonical v1 path)
v1-legacy: /homeassistant/... (legacy v1 alias, same handlers)
v2-core: /v2/core/... (canonical v2 path)
tests/api/test_homeassistant.py:
- Switch all 17 api_client tests to core_api_client_with_root so each
test runs against all three access paths (v1 canonical, v1 legacy
alias, v2 canonical), exercising every registered route
tests/api/test_jobs.py:
- Promote four inner TestClass definitions to module-level helpers
(_JobsTreeTestHelper, _JobManualCleanupTestHelper,
_JobsSortedTestHelper, _JobWithErrorTestHelper) so that @Job name
registration into the global _JOB_NAMES set only happens once at
import time rather than on each parametrized test run
- Replace closure references to outer-scope coresys with self.coresys
- Use api_client_with_prefix for dual-version coverage
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Fix typo
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Return proper API errors for mqtt/mysql service conflicts
After #6739 added unexpected-error logging and Sentry capture to the
api_process wrappers, SUPERVISOR-1JTQ and SUPERVISOR-1JWM surfaced as
user-triggered service conflicts that were being treated as unexpected
errors:
- POST /services/{mqtt,mysql} when another app already provides the
service.
- DELETE /services/{mqtt,mysql} when no app currently provides it.
Both paths raised a generic ServicesError, which the API layer turned
into an opaque HTTP 400 without a translation key, and which #6739 now
also logs and captures via Sentry.
Introduce ServiceAlreadyProvidedError (409 Conflict) and
ServiceNotProvidedError (404 Not Found) as new-style API exceptions with
translation keys and extra_fields, plus a shared APIConflict base class
for future 409 responses. The mqtt and mysql service modules now raise
these instead, so the API returns structured, translatable responses and
these expected user conflicts stop being captured as bugs.
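A rough sketch of how such exceptions could be shaped. The class names come
from this commit; everything else (the `error_key` attribute, the key
strings, the constructor arguments, the message text and the
`error_response` helper) is an assumption made for illustration and may
differ from the real code.

```python
from __future__ import annotations


class APIError(Exception):
    """Simplified stand-in for the API error base class."""

    status = 400

    def __init__(
        self,
        message: str,
        *,
        error_key: str | None = None,  # hypothetical translation key attribute
        extra_fields: dict[str, str] | None = None,
    ) -> None:
        super().__init__(message)
        self.error_key = error_key
        self.extra_fields = extra_fields or {}


class APIConflict(APIError):
    """Shared base for 409 Conflict responses."""

    status = 409


class ServiceAlreadyProvidedError(APIConflict):
    def __init__(self, service: str, provider: str) -> None:
        super().__init__(
            f"Service {service} is already provided by {provider}",
            error_key="service_already_provided",  # hypothetical key
            extra_fields={"service": service, "provider": provider},
        )


class ServiceNotProvidedError(APIError):
    status = 404

    def __init__(self, service: str) -> None:
        super().__init__(
            f"Service {service} is not provided by any app",
            error_key="service_not_provided",  # hypothetical key
            extra_fields={"service": service},
        )


def error_response(err: APIError) -> tuple[int, dict]:
    """Build a structured (status, body) pair from a typed API error."""
    body: dict = {"result": "error", "message": str(err)}
    if err.error_key:
        body["error_key"] = err.error_key
        body["extra_fields"] = err.extra_fields
    return err.status, body
```

With this shape a client can branch on the status code and the key instead
of parsing the message text.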
Fixes SUPERVISOR-1JTQ
Fixes SUPERVISOR-1JWM
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Don't log handled errors verbosely
Missing or already-present service information is a well-handled error
with a clear API response. The client is supposed to handle these errors,
so there is no need to log them verbosely.
---------
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Triaging SUPERVISOR-1JWK turned up a missed port conflict:
RE_PORT_CONFLICT_ERROR only matched one of the Docker daemon's
port-in-use message shapes. The two variants produced by current moby
— "Bind for <ip>:<port> failed: port is already allocated" from
portallocator and "failed to bind host port <ip>:<port>/<proto>:
address already in use" from osallocator — fell through to
DockerAPIError, got re-raised as AppUnknownError, and the watchdog
shipped them to Sentry as unknown errors.
Widen the regex to match all known shapes (including the older form
embedding the container endpoint, still observed from older daemons
and wrappers), anchored on the "failed to set up container networking"
prefix and one of the "address already in use" or "port is already
allocated" suffixes. Log the raw Docker message at debug level before
converting, so curious users can still see the exact upstream text
(host IP, container endpoint, protocol) when investigating which
process is holding the port.
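The matching rule described above (fixed prefix, one of two suffixes) might
look roughly like the following. The pattern is a hypothetical
approximation, not the actual RE_PORT_CONFLICT_ERROR, and the sample
messages are hand-built from the fragments quoted in this message rather
than copied from a real daemon.

```python
import re

# Hypothetical pattern for illustration; not the real RE_PORT_CONFLICT_ERROR.
PORT_CONFLICT_PATTERN = re.compile(
    r"^failed to set up container networking:.*"
    r"(?:address already in use|port is already allocated)$",
    re.DOTALL,
)

# Hand-built samples combining the prefix with the two quoted variants.
SAMPLE_PORTALLOCATOR = (
    "failed to set up container networking: "
    "Bind for 0.0.0.0:8080 failed: port is already allocated"
)
SAMPLE_OSALLOCATOR = (
    "failed to set up container networking: "
    "failed to bind host port 0.0.0.0:1883/tcp: address already in use"
)


def is_port_conflict(message: str) -> bool:
    """Return True if a Docker error message looks like a port conflict."""
    return PORT_CONFLICT_PATTERN.match(message.strip()) is not None
```

Anchoring on both ends keeps unrelated networking failures, which share the
prefix but end differently, out of the port-conflict path.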
The watchdog's _restart_after_problem now catches AppPortConflict
explicitly ahead of the generic AppsError handler: log a warning,
break the retry loop, do not call async_capture_exception. A port
conflict is an environment condition — another process grabbed the
port while the add-on was down — so retrying cannot make it succeed
and reporting to Sentry is noise.
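The handler ordering matters because AppPortConflict must be caught before
the broader handler. A simplified sketch of that control flow, assuming
AppPortConflict subclasses AppsError; the function signature, the retry
count and the behaviour of the generic branch are assumptions, not the real
_restart_after_problem.

```python
import asyncio
import logging

_LOGGER = logging.getLogger(__name__)


class AppsError(Exception):
    """Stand-in for the generic app error."""


class AppPortConflict(AppsError):
    """Stand-in for the typed port conflict error."""


async def restart_after_problem(
    start, capture, max_attempts: int = 5, retry_delay: float = 0
) -> int:
    """Try to start an app; return the number of attempts made."""
    attempts = 0
    while attempts < max_attempts:
        attempts += 1
        try:
            await start()
        except AppPortConflict as err:
            # Environment condition: warn, stop retrying, no Sentry capture.
            _LOGGER.warning("Port conflict, giving up restart: %s", err)
            break
        except AppsError as err:
            # Assumed generic path: report the error and try again.
            capture(err)
            await asyncio.sleep(retry_delay)
        else:
            break
    return attempts
```

If the two except clauses were swapped, the generic handler would swallow
the port conflict and the specific branch would never run.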
With port conflicts now raised as typed APIError subclasses at the
detection site, the DockerAPIError → format_message() rewrite fallback
in api_return_error has no work left. Drop the fallback and delete
supervisor/utils/log_format.py along with its tests; the module only
ever handled port-conflict prose.
Fixes SUPERVISOR-1JWK
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Switch the Sentry noise filter in filter_data to call the existing
check_exception_chain helper instead of an inline loop. One shared
utility for "does the chain contain this type" matches what the
reviewer suggested and removes a bit of duplication.
While touching check_exception_chain:
- Walk __cause__ instead of __context__. __cause__ is what Python sets
when code uses `raise B() from a`, which is the explicit "caused by"
signal we actually want to match. __context__ can also include
unrelated in-flight exceptions from surrounding except blocks.
Every existing call site in Supervisor uses `raise X from err`, which
sets both attributes, so switching is behaviour-preserving for all
current callers.
- Replace the `Any` type of object_type with
`type[BaseException] | tuple[type[BaseException], ...]`, which is
what isinstance/issubclass actually accept and lets mypy catch
misuse at the call site.
- Replace `issubclass(type(err), object_type)` with `isinstance`, which
is the idiomatic form and honours virtual subclasses.
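For illustration, a helper with the behaviour described in the list above
could look like this. It is a sketch rather than the actual Supervisor
implementation; `explicit_chain` and `implicit_chain` are hypothetical
demo functions showing the difference between __cause__ and __context__.

```python
from __future__ import annotations


def check_exception_chain(
    err: BaseException,
    object_type: type[BaseException] | tuple[type[BaseException], ...],
) -> bool:
    """Return True if err or any exception in its __cause__ chain matches."""
    current: BaseException | None = err
    while current is not None:
        if isinstance(current, object_type):
            return True
        current = current.__cause__
    return False


def explicit_chain() -> BaseException:
    """`raise ... from err` sets __cause__ (and __context__)."""
    try:
        try:
            raise KeyError("inner")
        except KeyError as err:
            raise ValueError("outer") from err
    except ValueError as outer:
        return outer


def implicit_chain() -> BaseException:
    """Raising inside an except block without `from` sets only __context__."""
    try:
        try:
            raise KeyError("inner")
        except KeyError:
            raise ValueError("outer")
    except ValueError as outer:
        return outer
```

Only the explicit chain matches KeyError here; the implicit one carries the
KeyError in __context__ alone and is no longer picked up.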
Review feedback from #6732.
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
When get_config() raised HomeAssistantError after a Core update, the
except block set error_state and fell through to the frontend check,
which referenced an unbound `data` variable and raised UnboundLocalError.
That aborted the update with a JobException and skipped the rollback
path entirely.
Move the frontend checks into an else branch of the try/except so they
only run when get_config() succeeds. When it fails, error_state is set
and control falls through to the rollback logic below, which is what
PR #6726 intended.
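The shape of the fix, reduced to its control flow. This is a sketch under
assumptions: `post_update_needs_rollback` is a hypothetical name, and the
"components" lookup is an assumed stand-in for the frontend checks, whose
details are not spelled out here.

```python
class HomeAssistantError(Exception):
    """Stand-in for Supervisor's exception of the same name."""


def post_update_needs_rollback(get_config) -> bool:
    """Return True when the post-update checks call for a rollback."""
    error_state = False
    try:
        data = get_config()
    except HomeAssistantError:
        # API not responding: flag the error and skip the frontend checks.
        error_state = True
    else:
        # Runs only when get_config() succeeded, so `data` is always bound.
        if "frontend" not in data.get("components", []):
            error_state = True
    return error_state
```

With the checks in the else branch, a failing get_config() can no longer
reach code that reads `data`.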
Fixes SUPERVISOR-1JVX
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Detect container registry rate limits uniformly
Container registry rate limits reach Supervisor in three distinct shapes:
1. HTTP 429 from the daemon - recognised today, but the exception and
resolution issue are hardcoded to Docker Hub. Since Core/Supervisor/
plugin images all live on ghcr.io now, virtually every 429 we see in
the field is actually a GHCR throttle that we mislabel. The biggest
Sentry issue (SUPERVISOR-16BK) has >115k events / >93k users, all
pulling a ghcr.io image, yet each user is told to "log into
Docker Hub".
2. HTTP 500 with 'toomanyrequests' in the body - not recognised. Docker
daemons before 28.3.0 wrap upstream 429s as 500 (fixed upstream by
moby/moby 23fa0ae74a, "Cleanup http status error checks"). The large
fleet on older daemons still produces this shape.
3. JSON error event during a streaming pull - not recognised. Once the
daemon starts writing the 200 OK response body the status is locked
in, so rate limits that land during layer download arrive as plain
text in the pull stream. Happens on all recent daemon versions -
SUPERVISOR-13FQ (>16k events) and SUPERVISOR-13E0 (>8k events) are
two large examples.
Cases 2 and 3 propagate as plain DockerError, bypass the 429 detection in
install() entirely, never produce a DOCKER_RATELIMIT resolution issue, and
generate large amounts of Sentry noise. Case 1 is detected but routes
every GHCR 429 through Docker-Hub-specific messaging and suggestions.
Changes:
- Add DockerRegistryRateLimitExceeded as the common base class and
GithubContainerRegistryRateLimitExceeded alongside the existing
DockerHubRateLimitExceeded. All extend APITooManyRequests so callers
and retry logic can key off a single type.
- Add GITHUB_RATELIMIT IssueType so GHCR failures don't show the
"log in to Docker Hub" suggestion that DOCKER_RATELIMIT carries.
- PullLogEntry.exception now maps stream errors containing
'toomanyrequests' to DockerRegistryRateLimitExceeded (case 3).
- docker/interface.py:install() routes all three cases through a single
_registry_rate_limit_exception() helper that picks the right issue
type, suggestion and exception subclass based on the image's registry.
- utils/sentry.py filters APITooManyRequests (and anything wrapping it
via __cause__) in capture_exception / async_capture_exception. One
point of policy, every caller benefits.
Callers (supervisor.update(), plugin manager, homeassistant core) are
unchanged - UPDATE_FAILED issues still get created alongside the
registry-specific rate limit issue, giving users the full picture.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Consolidate Sentry noise filtering in one before_send hook
Move the APITooManyRequests filter from capture_exception /
async_capture_exception wrappers into the existing filter_data
before_send hook in supervisor/misc/filter.py, alongside the
AddonConfigurationError filter.
One isinstance tuple check instead of multiple layers, and every path
that reaches Sentry (including logging-integration and excepthook
captures, not just our explicit wrappers) now gets the same treatment.
The filter walks the __cause__ chain so wrapped rate-limit errors
(e.g. DockerHubRateLimitExceeded inside SupervisorUpdateError) still
get filtered. A debug log is emitted on each dropped event for
observability.
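As an illustration of the hook shape: Sentry calls before_send with the
event and a hint dict, the hint may carry an "exc_info" tuple, and
returning None drops the event. The sketch below covers only the
rate-limit branch; the real filter_data contains other logic, and the
exception classes here are local stand-ins.

```python
from __future__ import annotations

import logging

_LOGGER = logging.getLogger(__name__)


class APITooManyRequests(Exception):
    """Stand-in for the rate limit base error."""


class DockerHubRateLimitExceeded(APITooManyRequests):
    """Stand-in for the Docker Hub throttle error."""


class SupervisorUpdateError(Exception):
    """Stand-in for a wrapping error."""


# Exception types whose events are dropped (single tuple for isinstance).
FILTERED_EXCEPTIONS = (APITooManyRequests,)


def filter_data(event: dict, hint: dict) -> dict | None:
    """before_send hook: drop events caused by filtered exceptions."""
    exc_info = hint.get("exc_info")
    err = exc_info[1] if exc_info else None
    while err is not None:
        if isinstance(err, FILTERED_EXCEPTIONS):
            _LOGGER.debug("Dropping Sentry event for %s", type(err).__name__)
            return None
        err = err.__cause__
    return event


def wrapped_rate_limit() -> BaseException:
    """Build a rate limit error wrapped via `raise ... from`."""
    try:
        try:
            raise DockerHubRateLimitExceeded("throttled")
        except DockerHubRateLimitExceeded as err:
            raise SupervisorUpdateError("update failed") from err
    except SupervisorUpdateError as outer:
        return outer
```

Placing the check in before_send means events from the logging integration
pass through the same filter as explicit capture calls.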
Review feedback from mdegat01 on #6732.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Drop GITHUB_RATELIMIT resolution issue
There is no actionable remediation for a GHCR rate limit - logging in
doesn't lift the quota the way it does for Docker Hub, and the cap is
on the authenticated account anyway. A resolution issue that just tells
the user "you were rate limited" adds UI noise without helping them.
Keep the GithubContainerRegistryRateLimitExceeded exception - retry
logic and the Sentry filter still key off it - but don't create a
resolution issue. A log entry from the exception constructor is
sufficient. Docker Hub still gets DOCKER_RATELIMIT + registry-login
suggestion since that is actionable.
Review feedback on #6732.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
With devcontainer 6, dbus-daemon is installed in the container, which
the tests require. The latest version also supports disabling AppArmor
via the `SUPERVISOR_UNCONFINED` environment variable.
Raise HomeAssistantWSConnectionError instead of HomeAssistantAPIError
for WebSocket handshake failures. The broader HomeAssistantAPIError was
not caught by the fire-and-forget send path which only catches
HomeAssistantWSError, resulting in "Task exception was never retrieved"
errors when Core's WebSocket endpoint isn't ready.
Additionally, narrow the retry catch in connect_websocket from
HomeAssistantAPIError to HomeAssistantAuthError. The broad catch caused
connection errors (not auth failures) to trigger unnecessary token
refreshes and retries, spamming "Updated Home Assistant API token" logs.
Also raise the log level for failed fire-and-forget WebSocket commands
from debug to warning for better visibility.
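Why the exception type matters for the fire-and-forget path, as a small
sketch. The class hierarchy below is an assumption consistent with the
description above (the WS connection error is a kind of WS error, which is
a kind of API error); `send_fire_and_forget` and `demo` are hypothetical
names.

```python
import asyncio
import logging

_LOGGER = logging.getLogger(__name__)


class HomeAssistantAPIError(Exception):
    """Broad API error (stand-in)."""


class HomeAssistantWSError(HomeAssistantAPIError):
    """WebSocket error (stand-in, assumed subclass of the API error)."""


class HomeAssistantWSConnectionError(HomeAssistantWSError):
    """WebSocket connection or handshake failure (stand-in)."""


async def send_fire_and_forget(send) -> bool:
    """Run a send coroutine; log and swallow WebSocket errors only."""
    try:
        await send()
    except HomeAssistantWSError as err:
        _LOGGER.warning("Can't send WebSocket command: %s", err)
        return False
    return True


async def _handshake_fails() -> None:
    raise HomeAssistantWSConnectionError("Core WebSocket not ready")


def demo() -> bool:
    """The narrower error is caught, so no exception escapes the task."""
    return asyncio.run(send_fire_and_forget(_handshake_fails))
```

Had _handshake_fails raised the broader HomeAssistantAPIError instead, the
except clause would not match and the error would escape the task.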
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Log unexpected errors in api_process wrappers
The `api_process` and `api_process_raw` decorators silently swallowed
any `HassioError` that bubbled up from endpoint handlers, returning
`"Unknown error, see Supervisor logs"` to the caller while logging
nothing. This made the response message actively misleading: e.g. when
an endpoint touching D-Bus hit `DBusNotConnectedError` (raised without
a message by `@dbus_connected`), Core would surface
`SupervisorBadRequestError: Unknown error, see Supervisor logs` and
the Supervisor logs would contain no trace of it.
Log the caught `HassioError` with traceback before delegating to
`api_return_error` so the "see Supervisor logs" hint is actually
actionable. The `APIError` branch is left alone — those carry explicit
status codes and messages set by Supervisor code and are already
visible in the response.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Capture unexpected API errors to Sentry
Non-APIError HassioError exceptions reaching api_process indicate
missing error handling in the endpoint handler. In addition to the
logging added in the previous commit, also send these to Sentry so
they surface as actionable issues rather than silently returning
"Unknown error, see Supervisor logs" to the caller.
* Drop capture exception from set_boot_slot
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Use Unix socket for Supervisor to Core communication
Reintroduce Unix socket support for Supervisor-to-Core communication
(reverted in #6735) with the addition of a feature flag gate. The
feature is now controlled by the `core_unix_socket` feature flag and
disabled by default.
When enabled and Core version supports it, Supervisor communicates with
Core via a Unix socket at /run/os/core.sock instead of TCP. This
eliminates the need for access token authentication on the socket path,
as Core authenticates the peer by the socket connection itself.
Key changes:
- Add FeatureFlag.CORE_UNIX_SOCKET to gate the feature
- HomeAssistantAPI: transport-aware session/url/websocket management
- WSClient: separate connect() (Unix, no auth) and connect_with_auth()
(TCP) class methods with proper error handling
- APIProxy delegates websocket setup to api.connect_websocket()
- Container state tracking for Unix session lifecycle
- CI builder mounts /run/supervisor for integration tests
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Sort feature flags alphabetically
* Drop per-call max_msg_size from WSClient
Hardcode the WebSocket message size cap to 64 MB in WSClient and remove
the parameter from WSClient.connect, connect_with_auth, _ws_connect,
and HomeAssistantAPI.connect_websocket. This was only ever overridden
by APIProxy, so threading it through four layers was unnecessary.
max_msg_size is a cap, not a pre-allocation; aiohttp only grows buffers
to the size of actual incoming messages. Supervisor's own control
channel never approaches 64 MB, so unifying the limit has no runtime
cost.
Addresses review feedback on #6742.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Add versioned v2 API with apps terminology
Introduce a v2 API sub-app mounted at /v2 that uses 'apps' terminology
throughout, while keeping v1 fully backward-compatible.
Key changes:
- Add ATTR_ADDONS = 'addons' constant alongside ATTR_APPS = 'apps' so
backup file data (which must remain 'addons' for backward compat) and
v2 API responses can use distinct constants
- Add FeatureFlag.SUPERVISOR_V2_API to gate v2 route registration
- Mount aiohttp sub-app at /v2 in RestAPI.load() when flag is enabled
- Add _AppSecurityPatterns frozen dataclass and _V1_PATTERNS/_V2_PATTERNS
with strict per-version regex sets (no cross-version matching)
- Add _register_v2_apps, _register_v2_backups, _register_v2_store route
registration methods
- Add v1 thin wrapper methods (*_v1) for all affected endpoints so
business logic lives in the canonical v2 methods
- Extract _info_data() helper in APIApps so v1 closure can bypass
@api_process and still catch APIAppNotInstalled for store routing
- Add _rename_apps_to_addons_in_backups(), _process_location_in_body(),
_all_store_apps_info() shared helpers to eliminate duplication
- Add api_client_v2, api_client_with_prefix, app_api_client_with_root,
store_app_api_client_with_root parameterized test fixtures
- Add test_v2_api_disabled_without_feature_flag
- Parameterize backup, addons, and store tests to cover both v1 and v2
paths
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Fix pylint false positive for re.Pattern C extension methods
re.Pattern methods (match, search, etc.) are C extension methods.
Pylint cannot detect them via static analysis when re.Pattern is used
as a type annotation in a dataclass field, producing false E1101
no-member errors. Add generated-members to inform pylint these members
exist.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* pylint and feedback fixes
* Copilot suggested fixes
* Minor feedback fixes
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Delay old image cleanup until after health checks on Core update
Move the old Docker image cleanup from inside _update() to after the
post-update health checks (frontend loaded and accessible). This keeps
the previous version's image available locally when a rollback is
needed, avoiding a potentially slow re-download.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Add test assertions for old image cleanup timing on Core update
Verify that the old Docker image is cleaned up only after health checks
pass, and not when a rollback is triggered.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Fix missing rollback when get_config fails after Core update
The early return after setting error_state skipped the rollback block,
leaving the system on a broken new version when the API stopped
responding after update. The other health check failure paths correctly
fall through to the rollback logic; this was the only one that didn't.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>