supervisor

mirror of https://github.com/home-assistant/supervisor.git synced 2026-05-24 08:38:48 +01:00

Author	SHA1	Message	Date
Stefan Agner	267fc6cd71	mounts: make is_mounted honest about server reachability (#6838 ) * mounts: use softerr for NFS instead of soft Switch the NFS mount option from `soft` to `softerr`. For HAOS-style supervisor mounts (media, share, backup — not the root filesystem) the error semantics matter: * `softerr` returns `ETIMEDOUT` on timeout instead of `EIO` (`soft`). `EIO` is indistinguishable from "the disk is dying"; tools like SQLite, restic, rsync, ffmpeg tend to treat it as a hard storage failure (mark database corrupt, abort backup with a hard error, etc.). `ETIMEDOUT` is unambiguously "the network/server is gone, transient" and is more commonly handled as retry-later. Supervisor can also surface a clear "server unreachable" notification rather than a generic I/O error. * `softerr` was added in kernel 5.10 precisely to give the fail-fast behavior of `soft` with a distinct errno so well-behaved apps can do the right thing. * For writes-must-not-be-lost use cases (databases, paid storage, evidence-grade logging) one would want `hard,intr` and a different recovery story. HAOS NFS mounts are not that — they're add-on storage where "the share went offline, try again later" is the correct user-visible behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * mounts: make is_mounted honest about server reachability systemd's "active/mounted" state is derived from /proc/self/mountinfo and doesn't reflect whether the backing server actually answers. For CIFS in particular, smb3_reconfigure never contacts the server, so a reload of a dead share returns active/mounted with no recovery attempted. PR #4882 added a Path.is_mount() cross-check to catch this, but Path.is_mount() relies on os.stat() of the mountpoint root — and both NFS (softreval) and CIFS (cached root inode attrs) can serve that stat from local state without going to the wire, so it lies in exactly the dead-share scenario it was meant to detect. 
The visible failure: an API reload of a backup mount whose server had gone away "succeeded", supervisor then kicked off sys_backups.reload() against the dead share, and the executor fanned out hundreds of "Task exception was never retrieved" OSError(112) tracebacks as each backup tarball's stat() parked in the kernel and failed. Replace the local-state checks with a statvfs() probe in NetworkMount.is_mounted(). os.statvfs() returns per-filesystem data (free blocks, total blocks) that has no client-side cache in either kernel — neither cifs_statfs() nor nfs_statfs() has an early-return on cache freshness; both build and send a real FSSTAT / QUERY_FS_INFO request. So the kernel either reaches the server or gives up with ETIMEDOUT / EHOSTDOWN / ECONNABORTED. is_mounted() now reflects actual reachability, finally fulfilling PR #4882's stated intent. With is_mounted() honest, the reload/restart machinery falls out naturally: Mount.reload()'s fast path is just `if await self.is_mounted()`, the post-reload check is the same call, and the "reload succeeded systemd-wise but probe failed" branch collapses into the existing "not mounted after reload, try restart" branch. mount(), _restart() and update() already call is_mounted(); they now get the probe for free. To make the probe meaningful, the network mount option strings get explicit kernel-side timeouts: * NFS switches from `soft` to `softerr` so timeouts surface as ETIMEDOUT rather than EIO. EIO is indistinguishable from "disk dying" and gets misinterpreted by SQLite/restic/rsync as a hard storage failure; ETIMEDOUT is unambiguously transient. * CIFS gains `soft,echo_interval=10,retrans=0`, giving a ~30s per-operation detection budget (3 x echo_interval since last server response) that matches the NFS budget from `timeo=100,retrans=2`. Both protocols now fail bounded operations in roughly the same time. 
The probe is intentionally not wrapped in an asyncio timeout: the kernel-side bound is authoritative, and an asyncio timeout would only orphan the executor thread without unblocking the syscall. The probe emits debug logs with timing so the ~30s syscall wait on a dead share is visible in LOGLEVEL=debug traces instead of appearing as a hang. Tests: * mock_is_mount fixture extended to also patch os.statvfs so existing tests that rely on a healthy mount don't need to know about the probe. * New manager test split into _healthy_skips_systemd (probe succeeds, fast path, no systemd call) and _probe_failure_triggers_systemd_reload (probe fails, escalation runs). API reload test covers both paths. * Existing tests that simulated "mount is down" via mock_is_mount.return_value=False updated to simulate probe failure via OSError(EHOSTDOWN), since is_mount is no longer the signal. * mounts: drive systemd job waits via JobRemoved instead of state polling `_update_state_await` had a race that has been present since network mounts were introduced (#4269) and survived every subsequent rewrite (#4733, `c75f36305`): the wait function reads the unit's ActiveState after the systemd dispatch returns, then matches against a target set that almost always includes the pre-call state (ACTIVE). If systemd has not yet started dispatching the queued job — and for fast operations like CIFS reload (smb3_reconfigure completes in milliseconds with no server contact) it routinely has not — the wait matches immediately on the pre-call state and returns "done" before the operation has actually begun. The race was invisible until the probe commit started doing honest network work after the wait returned, at which point we spent a full ~30s NFS probe wait on a mount that systemd was still in the middle of restarting. Switch the job-dispatching paths (mount/unmount/reload/restart) to systemd's `JobRemoved` Manager signal. 
The pattern: async with sys_dbus.systemd.job_removed() as jobs: job_path = await sys_dbus.systemd.restart_unit(...) await jobs.wait_for_job(job_path) is structurally race-free: subscription is set up before dispatch, the job_path is allocated synchronously inside the dispatch call, and any JobRemoved for that path after subscription is queued. Even for operations that complete faster than we can process the dispatch return value (the CIFS reload case), the signal cannot be missed. `_update_state_await` is kept for `load()` — that path observes existing state without having dispatched a job, so the PropertiesChanged-driven wait is the right primitive there. * mounts: verify the path is a mount point before trusting statvfs The userspace probe trusted statvfs() alone to declare a network mount healthy. statvfs is uncacheable client-side for both NFS and CIFS, so on a real mount it forces a wire RPC — but if the mount is gone from the kernel mount table (e.g., after a restart cycle whose umount succeeded but whose mount step failed, leaving the unit ACTIVE in systemd's view), statvfs() operates on a plain directory on the underlying root filesystem and happily returns the root fs's stats. The mount is reported healthy when in fact it doesn't exist. Add a pre-check using `Path.is_mount` — a parent-vs-path `st_dev` comparison via stat() — to detect "this path is not actually a mount point" before issuing statvfs. For the ghost-mount case (path on the root fs) both stats are local and return immediately. For a real mount the path-stat may cross into the filesystem driver, but the result is correct in either case and the statvfs that follows catches server-unreachable mounts that is_mount can't. The full is_mounted contract is now: ACTIVE per systemd, present as a mount point in our namespace, and server-reachable per statvfs. 
* mounts: skip wasted probes when systemd already reported job failed JobRemoved tells us whether the systemd job completed successfully — we just weren't using it. The previous reload/restart paths ran their post-job probe unconditionally, including the case where systemd had already told us the operation failed. On a dead NFS share the kernel is in transport-reconnect churn for tens of seconds after a killed mount helper exits, so that probe takes 90+ seconds — and confirms exactly what systemd already said. Plain wasted time. * `Mount.reload()` now uses the systemd job result. On "done" we still probe (CIFS reload is local-only — smb3_reconfigure returns "done" even against a dead server, so the probe is the only check that catches the lying-CIFS case). On anything else (failed, timeout, canceled, dependency, skipped, or our own timeout returning None) we escalate directly to `_restart()` without probing. * `Mount._restart()` skips the probe entirely on non-"done" results. The mount is definitively not active in that case and the probe would just spend another 30-90s on a dead share. Also bump the JobRemoved wait timeout from UPDATE_STATE_TIMEOUT (40s, sized for a single-helper invocation) to a new SYSTEMD_JOB_TIMEOUT (90s). RestartUnit runs as stop + start, each bounded by the unit's TimeoutSec (35s from #6834), so the worst case is ~70s plus systemd queue dispatch. The previous 40s budget caused supervisor to time out exactly one second before JobRemoved fired in the observed dead-NFS restart cycle. UPDATE_STATE_TIMEOUT is kept at 40s for the PropertiesChanged-driven wait in `load()`, where the layered timeout invariant from #6827 still applies. * dbus: filter signals at the wrapper level, drop JobRemovedSignal class Mike's review on #6838: instead of carrying a bespoke JobRemovedSignal context-manager class, push the filtering concept into the generic DBusSignalWrapper. 
Any caller that subscribes to a broadcast signal but only cares about specific payloads can pass a predicate. * DBus.signal() gains an optional `message_filter: Callable[..., bool]`. * DBusSignalWrapper.wait_for_signal() loops past messages where the filter returns False, returning the next match. * JobRemovedSignal goes away. supervisor/dbus/systemd.py exposes a small factory `job_removed_filter(get_job_path)` that returns the matching predicate; the getter indirection lets callers subscribe before the dispatch returns the job path (which is required to be race-free — see the previous commit). Mount._run_systemd_job() switches to: job_path: str \| None = None async with systemd.connected_dbus.signal( DBUS_SIGNAL_SYSTEMD_JOB_REMOVED, job_removed_filter(lambda: job_path), ) as signal: job_path = await dispatch _id, _path, _unit, result = await signal.wait_for_signal() The behavior is identical to the previous JobRemovedSignal-based implementation. The signal queue still buffers messages received after AddMatch is installed, so subscribing before dispatch keeps the race-free guarantee. --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-20 09:29:53 +02:00
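The getter indirection in `job_removed_filter(get_job_path)` can be illustrated without D-Bus. In the sketch below an `asyncio.Queue` stands in for the buffered signal subscription, and `wait_for_signal` and `run_job` are simplified stand-ins rather than the real DBusSignalWrapper methods; the job paths and unit names are made up.

```python
import asyncio
from typing import Callable, Optional, Tuple

JobRemoved = Tuple[int, str, str, str]  # (job id, job path, unit, result)


def job_removed_filter(get_job_path: Callable[[], Optional[str]]) -> Callable[..., bool]:
    """Build a predicate that matches JobRemoved messages for one job path."""

    def _matches(_job_id: int, path: str, _unit: str, _result: str) -> bool:
        # The path is looked up at match time, so the predicate can be
        # created before the dispatch call has returned the job path.
        return path == get_job_path()

    return _matches


async def wait_for_signal(queue: "asyncio.Queue[JobRemoved]", message_filter) -> JobRemoved:
    """Loop past messages the filter rejects and return the next match."""
    while True:
        message = await queue.get()
        if message_filter(*message):
            return message


async def run_job() -> str:
    queue: "asyncio.Queue[JobRemoved]" = asyncio.Queue()
    job_path: Optional[str] = None
    message_filter = job_removed_filter(lambda: job_path)  # subscribe first

    # Signals can be buffered before we even know our own job path.
    queue.put_nowait((1, "/job/1", "other.mount", "done"))
    queue.put_nowait((2, "/job/2", "backup.mount", "failed"))
    job_path = "/job/2"  # "dispatch" returns the path afterwards

    _id, _path, _unit, result = await wait_for_signal(queue, message_filter)
    return result
```

Because the predicate reads `job_path` lazily, the unrelated `/job/1` message is skipped and the buffered `/job/2` result is still delivered, which is the property the subscribe-before-dispatch ordering relies on.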
Mike Degatano	f8880a72be	Rename addon/addons to app/apps in filenames and imports (#6837 ) * Rename addon/addons to app/apps in filenames and imports Continues the addon→app terminology migration (#6786). Renames all source files, test files, fixture files, and directories that contained 'addon'/'addons' in their names, and updates all imports accordingly. Resolution check files in supervisor/resolution/checks/ that were renamed override the slug property to preserve the existing API contract (slugs are exposed via the resolution info API and used to run checks by name). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Rename add-on.json fixture --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-13 20:55:46 +02:00
Stefan Agner	67258dea4a	Skip post-update health check when Core was not running on entry (#6821 ) PR #6726 removed the early return after a HomeAssistantError from the post-update get_config() call so that a Core that stopped responding after an update would correctly trigger a rollback. That early return was, however, also load-bearing for the backup restore flow: Backup.restore_homeassistant() stops and removes Core before invoking core.update(target_version) and starts Core later in its own await_home_assistant_restart stage. With Core not running, _update() correctly skips the start step, but the unconditional post-update get_config() now always raises, sets error_state, and triggers a spurious rollback that re-pulls the previous image and leaves the system on the wrong version after the restore completes. Return early from update() when Core was not running on entry. The caller is responsible for starting Core and there is no live API to health-check at this point. Genuine update failures (Core was running, update broke it) are unaffected and still roll back. Also rename the local rollback to rollback_version for clarity.	2026-05-07 11:27:28 +02:00
Stefan Agner	c772a9bbb0	Replace fixed-duration sleeps after bus events with gather (#6803) * Replace fixed-duration sleeps after bus events with gather Several tests use ``await asyncio.sleep(...)`` to "wait for the listener to run" after firing a bus event. The fixed duration is real wall-clock time and the wait can be nondeterministic: if the handler chain happens to need slightly more time on a busy CI runner, the assertion races the handler. ``Bus.fire_event`` returns the listener tasks since #6252; capture and ``await asyncio.gather(*tasks)`` instead of sleeping. Touches test_bus.py (the bus tests were poking scheduling instead of verifying their assertions), test_home_assistant_watchdog.py, test_plugin_base.py, addons/test_manager.py, docker/test_addon.py, and test_store_execute_reload.py. Other cleanups in the same spirit: - ``_fire_test_event`` in addons/test_addon.py becomes ``async def`` and gathers the listener tasks itself, so its 17 call sites collapse to a single ``await _fire_test_event(...)``. - The two test_store_execute_reload.py sites that used the private ``_update_connectivity()`` helper are reworked to set the cached connectivity flag directly and fire the event themselves so they can gather the listener tasks the same way. - The two ``sleep(1)`` post-pull drains in docker/test_interface.py collapse to ``sleep(0)`` (handler tasks are already gathered inside pull_image), saving ~2s. - The ``sleep(0.01)`` waits inside ``container_events()`` task bodies (api/test_addons.py, api/test_store.py, backups/test_manager.py) are just one-yield-to-the-parent and become ``sleep(0)``. Switching to ``gather`` exposes a few latent test mocks that were silently swallowing TypeErrors as background-task failures before: - ``CGroup.add_devices_allowed`` is ``async def`` but was patched as a plain MagicMock in docker/test_addon.py; it is now patched via ``new_callable=AsyncMock``. 
- The watchdog does ``await (await self.start())`` / ``await (await self.restart())`` because ``App.start`` / ``App.restart`` return ``asyncio.Task``. The mocks in addons/test_addon.py (test_app_watchdog, test_watchdog_on_stop, test_watchdog_during_attach) needed ``AsyncMock(return_value=<settled future>)`` to mirror that shape rather than a plain MagicMock. Factor bus.fire_event + gather pattern into a helper Per review feedback, the ``await asyncio.gather(*coresys.bus.fire_event(...))`` incantation was scattered across many call sites. Add ``tests.common.fire_bus_event`` that takes the coresys, event and data, fires the event and awaits the spawned listener tasks. Convert all matching sites to use it, including the ``_fire_test_event`` wrapper in addons/test_addon.py which now just builds the ``DockerContainerStateEvent`` and delegates.	2026-05-06 12:02:28 +02:00
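The fire-then-gather pattern from this commit can be sketched with a toy bus. The `Bus` class below is a stand-in, not Supervisor's implementation, and this `fire_bus_event` takes the bus directly instead of the coresys object that the real `tests.common.fire_bus_event` is described as taking.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List

Listener = Callable[[Any], Awaitable[None]]


class Bus:
    """Toy event bus whose fire_event returns the spawned listener tasks."""

    def __init__(self) -> None:
        self._listeners: Dict[str, List[Listener]] = {}

    def register_event(self, event: str, callback: Listener) -> None:
        self._listeners.setdefault(event, []).append(callback)

    def fire_event(self, event: str, data: Any) -> List["asyncio.Task[None]"]:
        return [
            asyncio.create_task(callback(data))
            for callback in self._listeners.get(event, [])
        ]


async def fire_bus_event(bus: Bus, event: str, data: Any) -> None:
    """Fire the event and wait until every listener task has finished."""
    await asyncio.gather(*bus.fire_event(event, data))


async def demo() -> List[str]:
    seen: List[str] = []

    async def listener(data: str) -> None:
        await asyncio.sleep(0)  # yield once, like a real handler chain might
        seen.append(data)

    bus = Bus()
    bus.register_event("container_state", listener)
    await fire_bus_event(bus, "container_state", "running")
    return seen  # populated deterministically, no wall-clock sleep needed
```

Since `gather` propagates listener exceptions to the caller, a mock with the wrong shape fails the test directly instead of being logged as a background-task error, which matches the latent mock problems the commit reports.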
Stefan Agner	ad1a9115d8	Improve and extend frontend probe after update with WebSocket check (#6811 ) * Improve and extend frontend probe after update with WebSocket check The post-update health check introduced in #6311 added HomeAssistantAPI.check_frontend_available, which fetched the frontend through the existing Supervisor-internal API connection to Core. Since #6742 that connection optionally runs over a Unix socket with no authentication, so the request no longer exercises the same transport, auth and routing path that an external HTTP client uses. Move the frontend probe out of HomeAssistantAPI into a small frontend_check module that talks to Core's TCP endpoints via the plain websession with no authentication, mirroring what an external client would see. While doing this, extend the post-update verification to also probe the WebSocket endpoint: open /api/websocket and confirm the first frame is the auth_required text message. This catches the kind of WebSocket breakage seen in #6802, where api/config still listed websocket_api as loaded and GET / still returned HTML, but the WebSocket handshake completed with an immediate close frame and the frontend was unusable. The component check now also requires "http" to be loaded, in addition to "frontend" and "websocket_api", and iterates so every missing component is logged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Address review feedback on WebSocket probe - Wrap ws_connect in asyncio.wait_for so the handshake has an explicit bounded timeout (the global websession's default timeout would otherwise apply). - Validate that the auth_required payload is a JSON object before calling .get("type"); a list/string would otherwise raise AttributeError at runtime. - Add a regression test covering a non-dict JSON payload. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-05-06 10:54:05 +02:00
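The payload validation mentioned in the review follow-up (check for a JSON object before calling `.get("type")`) could look roughly like the sketch below. The function name `is_auth_required_frame` is hypothetical, and the surrounding `ws_connect` and timeout handling are intentionally omitted.

```python
import json


def is_auth_required_frame(raw: str) -> bool:
    """Return True if a text frame is a JSON object of type auth_required."""
    try:
        payload = json.loads(raw)
    except ValueError:
        # json.JSONDecodeError is a ValueError subclass: not JSON at all.
        return False
    # A list or string is valid JSON but has no .get(), so reject it
    # explicitly instead of letting AttributeError escape.
    if not isinstance(payload, dict):
        return False
    return payload.get("type") == "auth_required"
```

A frame such as `["auth_required"]` is the regression case: it parses as JSON but is rejected by the type check rather than crashing the probe.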
Mike Degatano	eb3c388618	Migrate persisted 'addon' field to 'app' in config files (#6786 ) * Migrate persisted 'addon' field to 'app' in discovery and services config Rename the 'addon' key to 'app' in persisted configuration files for discovery messages (discovery.json), service modules (services.json), and supervisor config (supervisor.json), as part of the broader addon->app terminology migration. Changes: - Add ATTR_ADDON = "addon" to const.py for V1 API compat/migration - Add ATTR_ADDONS_CUSTOM_LIST = "addons_custom_list" to const.py for migration - Change ATTR_APPS_CUSTOM_LIST value from "addons_custom_list" to "apps_custom_list" - Add _migrate_supervisor_config() schema pre-processor in validate.py to transparently load old supervisor.json files using the old key - Add ATTR_ADDON to services/const.py; change ATTR_APP value to "app" - Add _migrate_addon_to_app() pre-processors to MQTT, MySQL, and discovery schemas to load old config files that used the "addon" key - Rename Message.addon -> Message.app in Discovery and update all references - Keep hassio_push/discovery payload using "addon" key for HA compatibility - GET /services/{service} and GET /discovery: V1 returns "addon" key, V2 returns "app" key, via dedicated _v1 handler methods following the backups/store pattern, registered with AppVersion guards in _register_services() and _register_discovery() - Broaden FileConfiguration schema type annotation to accept vol.All validators in addition to vol.Schema Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Add schema migration tests for addon->app config key rename Test that backwards-compatible migration of old 'addon'/'addons_custom_list' keys to 'app'/'apps_custom_list' works correctly in all affected schemas, and that the new keys are accepted without modification. 
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Add an __init__ to discovery tests * Add app_api_client_with_prefix fixture and update V1/V2 tests Move the app-level V1/V2 fixture to tests/api/conftest.py as app_api_client_with_prefix for use across any endpoint that requires app-level credentials (services_role, app.discovery, etc.). - Add app_api_client_with_prefix fixture to conftest.py - Update test_set_service_already_provided and test_del_service_not_provided to use app_api_client_with_prefix (covers both v1 and v2) - Add test_get_service_v1_v2_keys asserting addon/app key per version - Update test_api_discovery_forbidden, test_api_send_del_discovery, test_api_invalid_discovery to use app_api_client_with_prefix - Split test_discovery_not_found into test_discovery_not_found_get (uses api_client_with_prefix, GET requires homeassistant) and test_discovery_not_found_delete (uses app_api_client_with_prefix) - Add test_get_discovery_v1_v2_keys asserting addon/app key per version Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-05-05 11:18:47 +02:00
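A schema pre-processor of the kind described in this commit can be sketched as a plain function that runs before validation. The name `migrate_addon_to_app` mirrors the commit's `_migrate_addon_to_app()`, but the body is an assumption; in particular, keeping an existing "app" value when both keys are present is a choice made for this sketch, not something the commit message states.

```python
from typing import Any


def migrate_addon_to_app(data: Any) -> Any:
    """Rename a persisted 'addon' key to 'app' before schema validation."""
    # Leave non-dict input alone so the schema can report the real error.
    if not isinstance(data, dict) or "addon" not in data:
        return data
    migrated = dict(data)  # do not mutate the caller's object
    old_value = migrated.pop("addon")
    # Assumption: if a file somehow has both keys, the new key wins.
    migrated.setdefault("app", old_value)
    return migrated
```

Files that already use the new key pass through unchanged, which is the second property the migration tests in this commit are described as checking.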
Stefan Agner	f8dbafe0bb	Drop redundant @pytest.mark.asyncio decorators (#6795 ) The pytest config sets ``asyncio_mode = "auto"``, which already auto-marks every ``async def test_*`` as a coroutine test. The 38 ``@pytest.mark.asyncio`` decorators sprinkled across the suite were no-ops kept around from before that flag was set. Remove them along with the now-unused ``import pytest`` lines they were the only consumer of. Pure mechanical cleanup; no test behavior changes.	2026-05-04 14:48:18 +02:00
Stefan Agner	61faa73be5	Return proper API errors when backup mount is down (#6785 ) Follow-up on #6739: with HassioError now logged and captured by Sentry in api_process, BackupMountDownError surfaced as an "unexpected" 400 with a noisy log entry and a Sentry event (SUPERVISOR-1JXW), even though the user had simply asked to back up to a mount that was not currently available. Map this through properly so the API returns a clean, structured 400: - Make BackupMountDownError inherit from APIError, with error_key "backup_mount_down", message_template "Backup mount '{mount}' is down", and the mount name in extra_fields. Clients now get a normalized, translatable message and a stable key instead of the raw "<name> is down, cannot back-up to it" / "...cannot copy to it" strings. - Simplify both raise sites in BackupManager (_check_location and _copy_to_location) to just pass mount=. @api_process turns the result into a 400 without logging or Sentry capture, since this is now a modeled client-state error rather than an unexpected one. The mount being down is a runtime state issue users hit when their NAS/CIFS share is briefly unreachable, not a Supervisor bug worth paging on. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 22:13:11 +02:00
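The structured error described above can be sketched with a simplified base class. `APIError` and `to_response` here are stand-ins whose shape is assumed for illustration; only the `error_key`, `message_template` and `mount` extra field come from the commit message.

```python
from typing import Any, Dict, Optional, Tuple


class APIError(Exception):
    """Simplified stand-in for Supervisor's APIError (assumed shape)."""

    status = 400
    error_key: Optional[str] = None
    message_template: Optional[str] = None

    def __init__(self, message: Optional[str] = None, **extra_fields: Any) -> None:
        self.extra_fields = extra_fields
        if message is None and self.message_template is not None:
            message = self.message_template.format(**extra_fields)
        super().__init__(message)


class BackupMountDownError(APIError):
    error_key = "backup_mount_down"
    message_template = "Backup mount '{mount}' is down"


def to_response(err: APIError) -> Tuple[int, Dict[str, Any]]:
    """Hypothetical rendering of a modeled error into status and JSON body."""
    return err.status, {
        "result": "error",
        "message": str(err),
        "error_key": err.error_key,
        "extra_fields": err.extra_fields,
    }
```

With this shape a raise site only needs `raise BackupMountDownError(mount=name)`, and a client can translate on the stable key instead of parsing the message text.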
Stefan Agner	33ab5b55f8	Treat JobException as a client-side API error (#6777 ) * Treat JobException as a client-side API error Job condition guards (system not running, no free space, etc.) and concurrency rejections (another job in flight) raised by the @Job decorator are explicit precondition failures with descriptive messages, not unexpected errors. JobException inheriting HassioError directly meant api_process caught them in its HassioError branch — which since #6739 logs them as unexpected and captures them to Sentry. Inherit APIError instead so api_process surfaces these through its APIError branch with the original message and skips the unexpected-error path. Status stays at APIError's default 400, so the API contract is unchanged. Extended test_backup_immediate_errors to assert async_capture_exception is not called for the freeze and free-space condition guards. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Silence too-many-ancestors on plugin job error mixins The plugin-specific job error subclasses (CliJobError, ObserverJobError, MulticastJobError, CoreDNSJobError, AudioJobError) cross pylint's too-many-ancestors threshold once JobException inherits APIError. Add the same `# pylint: disable=too-many-ancestors` already used on the ResolutionNotFound subclasses with similar diamond inheritance. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Disable too-many-ancestors globally instead of per class The pylint config already disables every other too-many-* rule "for the sake of readability", but kept too-many-ancestors and forced inline disables on diamond-inherited exception classes (the ResolutionNotFound subclasses, and now five plugin job error mixins after the JobException APIError change). Add too-many-ancestors to the global disable list and drop all eight inline annotations. 
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 10:21:13 +02:00
Stefan Agner	9923b8580b	Return proper API errors for invalid hostnames (#6776 ) Follow-up on #6739: with HassioError now logged and captured by Sentry in api_process, hostname rejections from systemd-hostnamed surfaced as "unexpected" 400s with noisy log entries and a Sentry event, even though the user had simply submitted an invalid hostname. Map this through properly so the API returns a clean, structured 400: - Split ErrorType.INVALID_ARGS out of DBusInterfaceMethodError into its own DBusInvalidArgsError. The two cases collapsed there before are semantically different: UNKNOWN_METHOD / INVALID_SIGNATURE mean the call is broken (method missing or types wrong); INVALID_ARGS means the call is valid but the service rejected an argument's value. - Add HostInvalidHostnameError(HostError, APIError) with error_key and extra_fields so clients get a normalized message and a stable key rather than systemd's raw "Invalid static hostname '...'" text. - Translate DBusInvalidArgsError to HostInvalidHostnameError in SystemControl.set_hostname. @api_process turns the result into a 400 without logging or Sentry capture, since this is now a modeled client-input error rather than an unexpected one. Validation continues to live in hostnamed (hostname_is_valid() in systemd's src/basic/hostname-util.c); Supervisor only translates the rejection. Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-29 10:19:27 +02:00
Mike Degatano	bc24fb5449	Refactor API registration to support v1/v2 via shared methods (#6769 ) * Refactor API registration to support v1/v2 via shared methods - Add AppVersion StrEnum (V1, V2) to supervisor/api/const.py - Replace self.v2_app with self._v2_app and expose a versions property (dict[AppVersion, web.Application]) computed dynamically so that test fixtures reassigning self.webapp are automatically reflected in V1 - All _register_* methods now accept a required app: web.Application parameter; version-specific routes are gated with "if app is self.versions[AppVersion.V1/V2]:" - load() loops over enabled_versions (V1 always, V2 when feature-flagged) and calls each registration method once per version, no duplication - Static resources are registered before webapp.add_subapp() to avoid registering into a frozen router - add_subapp uses self.webapp directly for readability - Fold _register_v2_apps/_register_v2_backups/_register_v2_store into their respective unified methods; remove the now-defunct _register_v2_* helpers and the _api_apps/_api_backups/_api_store instance vars - _register_proxy and _register_ingress updated to accept app; legacy /homeassistant/* proxy routes gated behind V1 conditional Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Add dual v1/v2 parametrization to API tests All 163 tests across 17 API modules that register identically on both v1 and v2 now run against both versions via api_client_with_prefix. 
- tests/api/conftest.py: advanced_logs_tester switched to api_client_with_prefix so log-endpoint tests are auto-parametrized; accepts optional v2_path_prefix kwarg for paths that differ by version - tests/api/test_{auth,discovery,dns,docker,hardware,host,ingress, jobs,mounts,network,os,resolution,security,services,supervisor}.py: api_client -> api_client_with_prefix with path prefix unpacking - supervisor/api/__init__.py: _register_panel() moved outside the version loop -- frontend static assets are V1-only - tests/api/test_panel.py: kept on plain api_client (V1-only) Tests intentionally kept V1-only: - auth/discovery: use indirect api_client parametrize for addon context - homeassistant: all tests call legacy /homeassistant/* paths (V1-only) - jobs (4 tests): inner @Job-decorated classes register names into a module-level set; re-running the same test raises RuntimeError Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Extend dual v1/v2 parametrization to homeassistant and jobs tests tests/api/conftest.py: - Add core_api_client_with_root fixture parametrized over three paths: v1-core: /core/... (canonical v1 path) v1-legacy: /homeassistant/... (legacy v1 alias, same handlers) v2-core: /v2/core/... 
(canonical v2 path) tests/api/test_homeassistant.py: - Switch all 17 api_client tests to core_api_client_with_root so each test runs against all three access paths (v1 canonical, v1 legacy alias, v2 canonical), exercising every registered route tests/api/test_jobs.py: - Promote four inner TestClass definitions to module-level helpers (_JobsTreeTestHelper, _JobManualCleanupTestHelper, _JobsSortedTestHelper, _JobWithErrorTestHelper) so that @Job name registration into the global _JOB_NAMES set only happens once at import time rather than on each parametrized test run - Replace closure references to outer-scope coresys with self.coresys - Use api_client_with_prefix for dual-version coverage Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Fix typo Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2026-04-27 23:39:47 +02:00
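The shared-registration idea (one `_register_*` method called once per enabled version, with version-specific routes gated by an identity check) can be sketched without aiohttp. Plain lists stand in for `web.Application`, `AppVersion` is written as a `str`-based Enum rather than a StrEnum so it runs on older Python versions, and the route names are illustrative.

```python
from enum import Enum
from typing import Dict, List


class AppVersion(str, Enum):
    V1 = "v1"
    V2 = "v2"


class RestAPI:
    def __init__(self, v2_enabled: bool) -> None:
        self.webapp: List[str] = []   # stand-in for the v1 web.Application
        self._v2_app: List[str] = []  # stand-in for the /v2 sub-app
        self._v2_enabled = v2_enabled

    @property
    def versions(self) -> Dict[AppVersion, List[str]]:
        # Computed on access, so reassigning self.webapp is reflected in V1.
        return {AppVersion.V1: self.webapp, AppVersion.V2: self._v2_app}

    def _register_core(self, app: List[str]) -> None:
        app.append("/core/info")  # registered identically on both versions
        if app is self.versions[AppVersion.V1]:
            app.append("/homeassistant/info")  # legacy alias, V1 only

    def load(self) -> None:
        enabled = [AppVersion.V1]
        if self._v2_enabled:
            enabled.append(AppVersion.V2)
        for version in enabled:
            self._register_core(self.versions[version])
```

The identity comparison (`is`) rather than equality matters here: two empty applications would compare equal as lists, but they are still distinct objects.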
Stefan Agner	61ca2524b2	Return proper API errors for mqtt/mysql service conflicts (#6767) * Return proper API errors for mqtt/mysql service conflicts After #6739 added unexpected-error logging and Sentry capture to the api_process wrappers, SUPERVISOR-1JTQ and SUPERVISOR-1JWM surfaced as user-triggered service conflicts that were being treated as unexpected errors: - POST /services/{mqtt,mysql} when another app already provides the service. - DELETE /services/{mqtt,mysql} when no app currently provides it. Both paths raised a generic ServicesError, which the API layer turned into an opaque HTTP 400 without a translation key, and which #6739 now also logs and captures via Sentry. Introduce ServiceAlreadyProvidedError (409 Conflict) and ServiceNotProvidedError (404 Not Found) as new-style API exceptions with translation keys and extra_fields, plus a shared APIConflict base class for future 409 responses. The mqtt and mysql service modules now raise these instead, so the API returns structured, translatable responses and these expected user conflicts stop being captured as bugs. Fixes SUPERVISOR-1JTQ Fixes SUPERVISOR-1JWM Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Don't log handled errors verbosely Missing/already present service information are well-handled errors with clear API responses. The client is supposed to handle these errors. No need to log verbosely. --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-24 21:56:12 +02:00
Stefan Agner	ff90e4b817	Fix UnboundLocalError when Core API fails post-update (#6761 ) When get_config() raised HomeAssistantError after a Core update, the except block set error_state and fell through to the frontend check, which referenced an unbound `data` variable and raised UnboundLocalError. That aborted the update with a JobException and skipped the rollback path entirely. Move the frontend checks into an else branch of the try/except so they only run when get_config() succeeds. When it fails, error_state is set and control falls through to the rollback logic below, which is what PR #6726 intended. Fixes SUPERVISOR-1JVX Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>	2026-04-23 15:05:40 +02:00
Stefan Agner	7fb621234e	Add Unix socket support for Core communication with feature flag (#6742 ) * Use Unix socket for Supervisor to Core communication Reintroduce Unix socket support for Supervisor-to-Core communication (reverted in #6735) with the addition of a feature flag gate. The feature is now controlled by the `core_unix_socket` feature flag and disabled by default. When enabled and Core version supports it, Supervisor communicates with Core via a Unix socket at /run/os/core.sock instead of TCP. This eliminates the need for access token authentication on the socket path, as Core authenticates the peer by the socket connection itself. Key changes: - Add FeatureFlag.CORE_UNIX_SOCKET to gate the feature - HomeAssistantAPI: transport-aware session/url/websocket management - WSClient: separate connect() (Unix, no auth) and connect_with_auth() (TCP) class methods with proper error handling - APIProxy delegates websocket setup to api.connect_websocket() - Container state tracking for Unix session lifecycle - CI builder mounts /run/supervisor for integration tests Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Sort feature flags alphabetically * Drop per-call max_msg_size from WSClient Hardcode the WebSocket message size cap to 64 MB in WSClient and remove the parameter from WSClient.connect, connect_with_auth, _ws_connect, and HomeAssistantAPI.connect_websocket. This was only ever overridden by APIProxy, so threading it through four layers was unnecessary. max_msg_size is a cap, not a pre-allocation; aiohttp only grows buffers to the size of actual incoming messages. Supervisor's own control channel never approaches 64 MB, so unifying the limit has no runtime cost. Addresses review feedback on #6742. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-21 15:03:05 +02:00
Mike Degatano	56abe94d74	Add versioned v2 API with apps terminology (#6741 ) * Add versioned v2 API with apps terminology Introduce a v2 API sub-app mounted at /v2 that uses 'apps' terminology throughout, while keeping v1 fully backward-compatible. Key changes: - Add ATTR_ADDONS = 'addons' constant alongside ATTR_APPS = 'apps' so backup file data (which must remain 'addons' for backward compat) and v2 API responses can use distinct constants - Add FeatureFlag.SUPERVISOR_V2_API to gate v2 route registration - Mount aiohttp sub-app at /v2 in RestAPI.load() when flag is enabled - Add _AppSecurityPatterns frozen dataclass and _V1_PATTERNS/_V2_PATTERNS with strict per-version regex sets (no cross-version matching) - Add _register_v2_apps, _register_v2_backups, _register_v2_store route registration methods - Add v1 thin wrapper methods (_v1) for all affected endpoints so business logic lives in the canonical v2 methods - Extract _info_data() helper in APIApps so v1 closure can bypass @api_process and still catch APIAppNotInstalled for store routing - Add _rename_apps_to_addons_in_backups(), _process_location_in_body(), _all_store_apps_info() shared helpers to eliminate duplication - Add api_client_v2, api_client_with_prefix, app_api_client_with_root, store_app_api_client_with_root parameterized test fixtures - Add test_v2_api_disabled_without_feature_flag - Parameterize backup, addons, and store tests to cover both v1 and v2 paths Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Fix pylint false positive for re.Pattern C extension methods re.Pattern methods (match, search, etc.) are C extension methods. Pylint cannot detect them via static analysis when re.Pattern is used as a type annotation in a dataclass field, producing false E1101 no-member errors. Add generated-members to inform pylint these members exist. 
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * pylint and feedback fixes * Copilot suggested fixes * Minor feedback fixes --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-20 21:19:27 +02:00
Stefan Agner	38ddb3df54	Fix Core update rollback: delay image cleanup and fix missing rollback path (#6726 ) * Delay old image cleanup until after health checks on Core update Move the old Docker image cleanup from inside _update() to after the post-update health checks (frontend loaded and accessible). This keeps the previous version's image available locally when a rollback is needed, avoiding a potentially slow re-download. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Add test assertions for old image cleanup timing on Core update Verify that the old Docker image is cleaned up only after health checks pass, and not when a rollback is triggered. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Fix missing rollback when get_config fails after Core update The early return after setting error_state skipped the rollback block, leaving the system on a broken new version when the API stopped responding after update. The other health check failure paths correctly fall through to the rollback logic; this was the only one that didn't. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-17 10:57:13 +02:00
Stefan Agner	a504d85745	Remove double newlines from build and check config output (#6743 ) * Remove double newlines from build output The log lines from run command already have newline characters, so joining them with "\n" adds extra newlines. Joining them with an empty string preserves the original formatting of the logs. * Remove double newlines from check config output The log lines from run command already have newline characters, so joining them with "\n" adds extra newlines. Joining them with an empty string preserves the original formatting of the logs. * Fix pytest	2026-04-16 17:18:05 +02:00
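The double-newline effect described in a504d85745 is easy to reproduce with plain strings. The sample log lines below are made up for illustration; `join_log_lines` is a hypothetical helper, not a Supervisor function.

```python
def join_log_lines(lines: list[str]) -> str:
    """Join log lines that already end in a newline without adding more."""
    return "".join(lines)


# Hypothetical output of a run command: each line keeps its trailing newline.
sample = ["Step 1/3\n", "Step 2/3\n"]

# Joining with "\n" inserts an extra blank line between entries.
doubled = "\n".join(sample)

# Joining with the empty string preserves the original formatting.
clean = join_log_lines(sample)
```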
Mike Degatano	1218326af3	Add development feature toggle system (#6719 ) * Add experimental feature toggle system Introduces an ExperimentalFeature enum and feature_flags config to allow toggling experimental features via the supervisor options API. The first feature flag is 'supervisor_v2_api' to gate the upcoming V2 API. Absent keys in options request = no change (partial update, consistent with existing options APIs). The info endpoint always returns all known feature flags and their current state for discoverability. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * ExperimentalFeature -> FeatureFlag * Use explicit value of StrEnum to be typesafe Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Minor comment improvement Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: Stefan Agner <stefan@agner.ch> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2026-04-15 13:13:45 +02:00
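The partial-update semantics described in 1218326af3 (absent keys mean no change, info always lists every known flag) might look roughly like this. It is a sketch under assumptions: `apply_options` and `info` are hypothetical names, the default of `False` for unset flags is assumed, and a `str`-mixin `Enum` is used here for portability where the commit mentions `StrEnum`. The second flag name is taken from the later Unix socket commit in this log.

```python
from enum import Enum


class FeatureFlag(str, Enum):
    """Known feature flags (values as named in the commit messages)."""

    CORE_UNIX_SOCKET = "core_unix_socket"
    SUPERVISOR_V2_API = "supervisor_v2_api"


def apply_options(
    current: dict[FeatureFlag, bool], request: dict[str, bool]
) -> dict[FeatureFlag, bool]:
    """Apply a partial update: keys absent from `request` stay unchanged."""
    updated = dict(current)
    for key, enabled in request.items():
        # FeatureFlag(key) raises ValueError for unknown flag names.
        updated[FeatureFlag(key)] = enabled
    return updated


def info(current: dict[FeatureFlag, bool]) -> dict[str, bool]:
    """Report every known flag, defaulting unset ones to False (assumed)."""
    return {flag.value: current.get(flag, False) for flag in FeatureFlag}
```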
Mike Degatano	ba8c49935b	Refactor internal addon references to app/apps (#6717 ) * Rename addon→app in docstrings and comments Updates all docstrings and inline comments across supervisor/ and tests/ to use the new app/apps terminology. No runtime behaviour is changed by this commit. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Rename addon→app in code (variables, args, class names, functions) Renames all internal Python identifiers from addon/addons to app/apps: - Variable and argument names - Function and method names - Class names (Addon→App, AddonManager→AppManager, DockerAddon→DockerApp, all exception, check, and fixup classes, etc.) - String literals used as Python identifiers (pytest fixtures, parametrize param names, patch.object attribute strings, URL route match_info keys) External API contracts are preserved: JSON keys, error codes, discovery protocol fields, TypedDict/attr.s field names. Import module paths (supervisor/addons/) are also unchanged. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> * Fix partial backup/restore API to remap addons key to apps The external API accepts `addons` as the request body key (since ATTR_APPS = "addons"), but do_backup_partial and do_restore_partial now take an `apps` parameter after the rename. The *body expansion in both endpoints would pass `addons=...` causing a TypeError. Remap the key before expansion in both backup_partial and restore_partial: if ATTR_APPS in body: body["apps"] = body.pop(ATTR_APPS) Also adds test_restore_partial_with_addons_key to verify the restore path correctly receives apps= when addons is passed in the request body. This path had no existing test coverage. 
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Fix merge error * Adjust AppLoggerAdapter to use app_name Co-authored-by: Stefan Agner <stefan@agner.ch> --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: Stefan Agner <stefan@agner.ch>	2026-04-14 16:47:20 +02:00
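The key remapping quoted in ba8c49935b can be shown end to end with a stub. `ATTR_APPS = "addons"` and the two-line remap come from the commit message; the body of `do_backup_partial` and its `name` parameter are hypothetical placeholders so the example runs on its own.

```python
ATTR_APPS = "addons"  # external API key, unchanged for backward compatibility


def do_backup_partial(name=None, apps=None):
    """Stub for the renamed internal function, which now takes `apps`."""
    return {"name": name, "apps": apps}


def backup_partial(body: dict):
    """Remap the external `addons` key before expanding the body."""
    body = dict(body)  # avoid mutating the caller's dict
    if ATTR_APPS in body:
        body["apps"] = body.pop(ATTR_APPS)
    # Without the remap, **body would pass addons=... and raise TypeError.
    return do_backup_partial(**body)
```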
Stefan Agner	5c5428fde3	Revert "Use Unix socket for Supervisor to Core communication (#6590 )" (#6735 ) This reverts commit `28fa0b35bd`.	2026-04-14 12:28:02 +02:00
Stefan Agner	28fa0b35bd	Use Unix socket for Supervisor to Core communication (#6590 ) * Use Unix socket for Supervisor to Core communication Switch internal Supervisor-to-Core HTTP and WebSocket communication from TCP (port 8123) to a Unix domain socket. The existing /run/supervisor directory on the host (already mounted at /run/os inside the Supervisor container) is bind-mounted into the Core container at /run/supervisor. Core receives the socket path via the SUPERVISOR_CORE_API_SOCKET environment variable, creates the socket there, and Supervisor connects to it via aiohttp.UnixConnector at /run/os/core.sock. Since the Unix socket is only reachable by processes on the same host, requests arriving over it are implicitly trusted and authenticated as the existing Supervisor system user. This removes the token round-trip where Supervisor had to obtain and send Bearer tokens on every Core API call. WebSocket connections are likewise authenticated implicitly, skipping the auth_required/auth handshake. Key design decisions: - Version-gated by CORE_UNIX_SOCKET_MIN_VERSION so older Core versions transparently continue using TCP with token auth - LANDINGPAGE is explicitly excluded (not a CalVer version) - Hard-fails with a clear error if the socket file is unexpectedly missing when Unix socket communication is expected - WSClient.connect() for Unix socket (no auth) and WSClient.connect_with_auth() for TCP (token auth) separate the two connection modes cleanly - Token refresh always uses the TCP websession since it is inherently a TCP/Bearer-auth operation - Logs which transport (Unix socket vs TCP) is being used on first request Closes #6626 Related Core PR: home-assistant/core#163907 Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Close WebSocket on handshake failure and validate auth_required Ensure the underlying WebSocket connection is closed before raising when the handshake produces an unexpected message. 
Also validate that the first TCP message is auth_required before sending credentials. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Fix pylint protected-access warnings in tests Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Check running container env before using Unix socket Split use_unix_socket into two properties to handle the Supervisor upgrade transition where Core is still running with a container started by the old Supervisor (without SUPERVISOR_CORE_API_SOCKET): - supports_unix_socket: version check only, used when creating the Core container to decide whether to set the env var - use_unix_socket: version check + running container env check, used for communication decisions This ensures TCP fallback during the upgrade transition while still hard-failing if the socket is missing after Supervisor configured Core to use it. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Improve Core API communication logging and error handling - Remove transport log from make_request that logged before Core container was attached, causing misleading connection logs - Log "Connected to Core via ..." once on first successful API response in get_api_state, when the transport is actually known - Remove explicit socket existence check from session property, let aiohttp UnixConnector produce natural connection errors during Core startup (same as TCP connection refused) - Add validation in get_core_state matching get_config pattern - Restore make_request docstring Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Guard Core API requests with container running check Add is_running() check to make_request and connect_websocket so no HTTP or WebSocket connection is attempted when the Core container is not running. This avoids misleading connection attempts during Supervisor startup before Core is ready. 
Also make use_unix_socket raise if container metadata is not available instead of silently falling back to TCP. This is a defensive check since is_running() guards should prevent reaching this state. Add attached property to DockerInterface to expose whether container metadata has been loaded. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Reset Core API connection state on container stop Listen for Core container STOPPED/FAILED events to reset the connection state: clear the _core_connected flag so the transport is logged again on next successful connection, and close any stale Unix socket session. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Only mount /run/supervisor if we use it * Fix pytest errors * Remove redundant is_running check from ingress panel update The is_running() guard in update_hass_panel is now redundant since make_request checks is_running() internally. Also mock is_running in the websession test fixture since tests using it need make_request to proceed past the container running check. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Bind mount /run/supervisor to Supervisor /run/os Home Assistant OS (as well as the Supervised run scripts) bind mount /run/supervisor to /run/os in Supervisor. Since we reuse this location for the communication socket between Supervisor and Core, we need to also bind mount /run/supervisor to Supervisor /run/os in CI. * Wrap WebSocket handshake errors in HomeAssistantAPIError Unexpected exceptions during the WebSocket handshake (KeyError, ValueError, TypeError from malformed messages) are now wrapped in HomeAssistantAPIError inside WSClient.connect/connect_with_auth. This means callers only need to catch HomeAssistantAPIError. Remove the now-unnecessary except (RuntimeError, ValueError, TypeError) from proxy _websocket_client and add a proper error message to the APIError per review feedback. 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Narrow WebSocket handshake exception handling Replace broad `except Exception` with specific exception types that can actually occur during the WebSocket handshake: KeyError (missing dict keys), ValueError (bad JSON), TypeError (non-text WS message), aiohttp.ClientError (connection errors), and TimeoutError. This avoids silently wrapping programming errors into HomeAssistantAPIError. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Remove unused create_mountpoint from MountBindOptions The field was added but never used. The /run/supervisor host path is guaranteed to exist since HAOS creates it for the Supervisor container mount, so auto-creating the mountpoint is unnecessary. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Clear stale access token before raising on final retry Move token clear before the attempt check in connect_websocket so the stale token is always discarded, even when raising on the final attempt. Without this, the next call would reuse the cached bad token via _ensure_access_token's fast path, wasting a round-trip. 
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Add tests for Unix socket communication and Core API Add tests for the new Unix socket communication path and improve existing test coverage: - Version-based supports_unix_socket and env-based use_unix_socket - api_url/ws_url transport selection - Connection lifecycle: connected log after restart, ignoring unrelated container events - get_api_state/check_api_state parameterized across versions, responses, and error cases - make_request is_running guard and TCP flow with real token fetch - connect_websocket for both Unix and TCP (with token verification) - WSClient.connect/connect_with_auth handshake success, errors, cleanup on failure, and close with pending futures Consolidate existing tests into parameterized form and drop synthetic tests that covered very little. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>	2026-04-10 15:09:38 +02:00
Mike Degatano	941f7cd2be	Change addons to apps in all user-facing strings (#6696 ) * Change addons to apps in all user-facing strings * Fix grammar in errors * Apply suggestions from code review Co-authored-by: Jan Čermák <sairon@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Stefan Agner <stefan@agner.ch> --------- Co-authored-by: Jan Čermák <sairon@users.noreply.github.com> Co-authored-by: Stefan Agner <stefan@agner.ch>	2026-04-07 18:54:40 +02:00
Stefan Agner	667bd62742	Remove CLI command hint from unknown error messages (#6684 ) * Remove CLI command hint from unknown error messages Since #6303 introduced specific error messages for many cases, the generic "check with 'ha supervisor logs'" hint in unknown error messages is no longer as useful. Remove the CLI command part while keeping the "Check supervisor logs for details" rider. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> * Use consistently "Supervisor logs" with capitalization Co-authored-by: Jan Čermák <sairon@users.noreply.github.com> --------- Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Co-authored-by: Jan Čermák <sairon@users.noreply.github.com>	2026-03-31 18:09:14 +02:00
Stefan Agner	9e0d3fe461	Return 401 for non-Basic Authorization headers on /auth endpoint (#6612 ) aiohttp's BasicAuth.decode() raises ValueError for any non-Basic auth method (e.g. Bearer tokens). This propagated as an unhandled exception, causing a 500 response instead of the expected 401 Unauthorized. Catch the ValueError in _process_basic() and raise HTTPUnauthorized with the WWW-Authenticate realm header so clients get a proper 401 response. Fixes SUPERVISOR-BFG Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>	2026-03-04 15:55:49 -05:00
Stefan Agner	0ef71d1dd1	Drop unsupported architectures and machines, create issue for affected apps (#6607 ) * Drop unsupported architectures and machines from Supervisor Since #5620 Supervisor no longer updates the version information on unsupported architectures and machines. This means users can no longer update to a newer version of Supervisor since that PR got released. Furthermore since #6347 we also no longer build for these architectures. With this, any code related to these architectures becomes dead code and should be removed. This commit removes all references to the deprecated architectures and machines from Supervisor. This affects the following architectures: - armhf - armv7 - i386 And the following machines: - odroid-xu - qemuarm - qemux86 - raspberrypi - raspberrypi2 - raspberrypi3 - raspberrypi4 - tinker * Create issue if an app using a deprecated architecture is installed This adds a check to the resolution system to detect if an app is installed that uses a deprecated architecture. If so, it will show a warning to the user and recommend them to uninstall the app. * Formally deprecate machine add-on configs as well Not only deprecate add-on configs for unsupported architectures, but also for unsupported machines. * For installed add-ons architecture must always exist Fail hard in case of missing architecture, as this is a required field for installed add-ons. This will prevent the Supervisor from running with an unsupported configuration and causing further issues down the line.	2026-03-04 10:59:14 +01:00
Stefan Agner	2627d55873	Add default verbose timestamps for plugin logs (#6598 ) * Use verbose log output for plug-ins All three plug-ins which support logging (dns, multicast and audio) should use the verbose log format by default to make sure the log lines are annotated with timestamp. Introduce a new flag default_verbose for advanced logs. * Use default_verbose for host logs as well Use the new default_verbose flag for advanced logs, to make it more explicit that we want timestamps for host logs as well.	2026-03-03 11:58:11 +01:00
Jan Čermák	6a955527f3	Ensure dt_utc in /os/info always returns current time (#6602 ) The /os/info API endpoint has been using D-Bus property TimeUSec which got cached between requests, so the time returned was not always the same as current time on the host system at the time of the request. Since there's no reason to use D-Bus API for the time, as Supervisor runs on the same machine and time is global, simply format current datetime object with Python and return it in the response. Fixes #6581	2026-02-27 17:59:11 +01:00
Stefan Agner	7f6327e94e	Handle missing Accept header in host logs (#6594 ) * Handle missing Accept header in host logs Avoid indexing request headers directly in the host advanced logs handler when Accept is absent, preventing KeyError crashes on valid requests without that header. Fixes SUPERVISOR-1939. * Add pytest	2026-02-26 11:30:08 +01:00
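The failure mode fixed in 7f6327e94e comes down to indexing versus `.get()` on a header mapping. The sketch below uses a plain dict and a hypothetical `log_format` helper; the `text/x-log` media type and the "verbose"/"plain" labels are assumptions for illustration, and real aiohttp headers are a case-insensitive multidict rather than a dict.

```python
from collections.abc import Mapping

VERBOSE_TYPE = "text/x-log"  # assumed media type for verbose log output


def log_format(headers: Mapping[str, str]) -> str:
    """Pick a log format without assuming the Accept header is present."""
    # headers["Accept"] would raise KeyError on a request without the header.
    accept = headers.get("Accept", "")
    return "verbose" if VERBOSE_TYPE in accept else "plain"
```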
Mike Degatano	9f00b6e34f	Ensure uuid of dismissed suggestion/issue matches an existing one (#6582 ) * Ensure uuid of dismissed suggestion/issue matches an existing one * Fix lint, test and feedback issues * Adjust existing tests and remove new ones for not found errors * fix device access issue usage	2026-02-25 10:26:44 +01:00
Stefan Agner	3147d080a2	Unify Core user handling with HomeAssistantUser model (#6558 ) * Unify Core user listing with HomeAssistantUser model Replace the ingress-specific IngressSessionDataUser with a general HomeAssistantUser dataclass that models the Core config/auth/list WS response. This deduplicates the WS call (previously in both auth.py and module.py) into a single HomeAssistant.list_users() method. - Add HomeAssistantUser dataclass with fields matching Core's user API - Remove get_users() and its unnecessary 5-minute Job throttle - Auth and ingress consumers both use HomeAssistant.list_users() - Auth API endpoint uses typed attribute access instead of dict keys - Migrate session serialization from legacy "displayname" to "name" - Accept both keys in schema/deserialization for backwards compat - Add test for loading persisted sessions with legacy displayname key Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Tighten list_users() to trust Core's auth/list contract Core's config/auth/list WS command always returns a list, never None. Replace the silent `if not raw: return []` (which also swallowed empty lists) with an assert, remove the dead AuthListUsersNoneResponseError exception class, and document the HomeAssistantWSError contract in the docstring. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Remove | None from async_send_command return type The WebSocket result is always set from data["result"] in _receive_json, never explicitly to None. Remove the misleading | None from the return type of both WSClient and HomeAssistantWebSocket async_send_command, and drop the now-unnecessary assert in list_users. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Use HomeAssistantWSConnectionError in _ensure_connected _ensure_connected and connect_with_auth raise on connection-level failures, so use the more specific HomeAssistantWSConnectionError instead of the broad HomeAssistantWSError. 
This allows callers to distinguish connection errors from Core API errors (e.g. unsuccessful WebSocket command responses). Also document that _ensure_connected can propagate HomeAssistantAuthError from ensure_access_token. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Remove user list cache from _find_user_by_id Drop the _list_of_users cache to avoid stale auth data in ingress session creation. The method now fetches users fresh each time and returns None on any API error instead of serving potentially outdated cached results. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-17 18:31:08 +01:00
Stefan Agner	da800b8889	Simplify HomeAssistantWebSocket and raise on connection errors (#6553 ) * Raise HomeAssistantWSError when Core WebSocket is unreachable Previously, async_send_command silently returned None when Home Assistant Core was not reachable, leading to misleading error messages downstream (e.g. "returned invalid response of None instead of a list of users"). Refactor _can_send to _ensure_connected which now raises HomeAssistantWSError on connection failures while still returning False for silent-skip cases (shutdown, unsupported version). async_send_message catches the exception to preserve fire-and-forget behavior. Update callers that don't handle HomeAssistantWSError: _hardware_events and addon auto-update in tasks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Simplify HomeAssistantWebSocket command/message distinction The WebSocket layer had a confusing split between "messages" (fire-and-forget) and "commands" (request/response) that didn't reflect Home Assistant Core's architecture where everything is just a WS command. - Remove dead WSClient.async_send_message (never called) - Rename async_send_message → _async_send_command (private, fire-and-forget) - Rename send_message → send_command (sync wrapper) - Simplify _ensure_connected: drop message param, always raise on failure - Simplify async_send_command: always raise on connection errors - Remove MIN_VERSION gating (minimum supported Core is now 2024.2+) - Remove begin_backup/end_backup version guards for Core < 2022.1.0 - Add debug logging for silently ignored connection errors Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com> * Wait for Core to come up before backup This is crucial since the WebSocket command to Core now fails with the new error handling if Core is not running yet. * Wait for Core install job instead * Use CLI to fetch jobs instead of Supervisor API The Supervisor API needs authentication token, which we have not available at this point in the workflow. 
Instead of fetching the token, we can use the CLI, which is available in the container. --------- Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-12 09:20:23 +01:00
Stefan Agner	66228f976d	Use session.request() instead of getattr dispatch in HomeAssistantAPI (#6541 ) Replace the dynamic `getattr(self.sys_websession, method)(...)` pattern with the explicit `self.sys_websession.request(method, ...)` call. This is type-safe and avoids runtime failures from typos in method names. Also wrap the timeout parameter in `aiohttp.ClientTimeout` for consistency with the typed `request()` signature. Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>	2026-02-10 09:43:55 +01:00
Tom Quist	4d8d44721d	Fix MCP API proxy support for streaming and headers (#6461 ) * Fix MCP API proxy support for streaming and headers This commit fixes two issues with using the core API core/api/mcp through the API proxy: 1. Streaming support: The proxy now detects text/event-stream responses and properly streams them instead of buffering all data. This is required for MCP's Server-Sent Events (SSE) transport. 2. Header forwarding: Added MCP-required headers to the forwarded headers: - Accept: Required for content negotiation - Last-Event-ID: Required for resuming broken SSE connections - Mcp-Session-Id: Required for session management across requests The proxy now also preserves MCP-related response headers (Mcp-Session-Id) and sets X-Accel-Buffering to "no" for streaming responses to prevent buffering by intermediate proxies. Tests added to verify: - MCP headers are properly forwarded to Home Assistant - Streaming responses (text/event-stream) are handled correctly - Response headers are preserved * Refactor: reuse stream logic for SSE responses (#3) * Fix ruff format + cover streaming payload error * Fix merge error * Address review comments (headers / streaming proxy) (#4) * Address review: header handling for streaming/non-streaming * Forward MCP-Protocol-Version and Origin headers * Do not forward Origin header through API proxy (#5) --------- Co-authored-by: Stefan Agner <stefan@agner.ch>	2026-02-04 17:28:11 +01:00
Stefan Agner	a849050369	Improve CpuArch type safety with explicit conversions (#6524 ) The CpuArch enum was being used inconsistently throughout the codebase, with some code expecting enum values and other code expecting strings. This caused type checking issues and potential runtime errors. Changes: - Fix match_base() to return CpuArch enum instead of str - Add explicit string conversions using !s formatting where arch values are used in f-strings (build.py, model.py) - Convert CpuArch to str explicitly in contexts requiring strings (docker/addon.py, misc/filter.py) - Update all tests to use CpuArch enum values instead of strings - Update test mocks to return CpuArch enum values This ensures type consistency and improves MyPy type checking accuracy across the architecture detection and management code. Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-02-04 11:34:23 +01:00
Stefan Agner	7ad9a911e8	Add DELETE method support to /core/api proxy (#6521 ) The Supervisor's /core/api proxy previously only supported GET and POST methods, returning 405 Method Not Allowed for DELETE requests. This prevented addons from calling Home Assistant Core REST API endpoints that require DELETE methods, such as deleting automations, scripts, or scenes. The underlying proxy implementation already supported passing through any HTTP method via request.method.lower(), so only the route registration was needed. Fixes #6509 Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>	2026-02-03 11:51:59 +01:00
Stefan Agner	6957341c3e	Refactor Docker pull progress with registry manifest fetcher (#6379 ) * Use count-based progress for Docker image pulls Refactor Docker image pull progress to use a simpler count-based approach where each layer contributes equally (100% / total_layers) regardless of size. This replaces the previous size-weighted calculation that was susceptible to progress regression. The core issue was that Docker rate-limits concurrent downloads (~3 at a time) and reports layer sizes only when downloading starts. With size- weighted progress, large layers appearing late would cause progress to drop dramatically (e.g., 59% -> 29%) as the total size increased. The new approach: - Each layer contributes equally to overall progress - Per-layer progress: 70% download weight, 30% extraction weight - Progress only starts after first "Downloading" event (when layer count is known) - Always caps at 99% - job completion handles final 100% This simplifies the code by moving progress tracking to a dedicated module (pull_progress.py) and removing complex size-based scaling logic that tried to account for unknown layer sizes. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> * Exclude already-existing layers from pull progress calculation Layers that already exist locally should not count towards download progress since there's nothing to download for them. Only layers that need pulling are included in the progress calculation. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> * Add registry manifest fetcher for size-based pull progress Fetch image manifests directly from container registries before pulling to get accurate layer sizes upfront. This enables size-weighted progress tracking where each layer contributes proportionally to its byte size, rather than equal weight per layer. 
Key changes: - Add RegistryManifestFetcher that handles auth discovery via WWW-Authenticate headers, token fetching with optional credentials, and multi-arch manifest list resolution - Update ImagePullProgress to accept manifest layer sizes via set_manifest() and calculate size-weighted progress - Fall back to count-based progress when manifest fetch fails - Pre-populate layer sizes from manifest when creating layer trackers The manifest fetcher supports ghcr.io, Docker Hub, and private registries by using credentials from Docker config when available. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> * Clamp progress to 100 to prevent floating point precision issues Floating point arithmetic in weighted progress calculations can produce values slightly above 100 (e.g., 100.00000000000001). This causes validation errors when the progress value is checked. Add min(100, ...) clamping to both size-weighted and count-based progress calculations to ensure the result never exceeds 100. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> * Use sys_websession for manifest fetcher instead of creating new session Reuse the existing CoreSys websession for registry manifest requests instead of creating a new aiohttp session. This improves performance and follows the established pattern used throughout the codebase. 
🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> * Make platform parameter required and warn on missing platform - Make platform a required parameter in get_manifest() and _fetch_manifest() since it's always provided by the calling code - Return None and log warning when requested platform is not found in multi-arch manifest list, instead of falling back to first manifest which could be the wrong architecture 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> * Log manifest fetch failures at warning level Users will notice degraded progress tracking when manifest fetch fails, so log at warning level to help diagnose issues. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> * Add pylint disable comments for protected access in manifest tests 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> * Separate download_current and total_size updates in pull progress Update download_current and total_size independently in the DOWNLOADING handler. This ensures download_current is updated even when total is not yet available. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> * Reject invalid platform format in manifest selection --------- Co-authored-by: Claude <noreply@anthropic.com>	2026-02-02 15:56:24 +01:00
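The progress scheme described in commit 6957341c3e can be sketched roughly as follows. This is a minimal illustration, not the code in `pull_progress.py`: the names `LayerState` and `overall_progress` are hypothetical, layer progress is modelled as plain fractions, and the 99% cap from the count-based description is applied to both modes.

```python
from dataclasses import dataclass
from typing import List, Optional

DOWNLOAD_WEIGHT = 70.0  # percent of a layer's progress attributed to download
EXTRACT_WEIGHT = 30.0   # percent of a layer's progress attributed to extraction


@dataclass
class LayerState:
    """Hypothetical per-layer tracker (not Supervisor's actual class)."""

    downloaded: float = 0.0     # fraction downloaded, 0.0..1.0
    extracted: float = 0.0      # fraction extracted, 0.0..1.0
    size: Optional[int] = None  # layer size in bytes from the manifest, if known

    def percent(self) -> float:
        # Per-layer progress: 70% download weight, 30% extraction weight.
        return DOWNLOAD_WEIGHT * self.downloaded + EXTRACT_WEIGHT * self.extracted


def overall_progress(layers: List[LayerState]) -> float:
    """Size-weighted when every layer size is known, count-based otherwise."""
    if not layers:
        return 0.0
    total = sum(layer.size or 0 for layer in layers)
    if total > 0 and all(layer.size is not None for layer in layers):
        # Each layer contributes proportionally to its byte size.
        value = sum(layer.percent() * layer.size / total for layer in layers)
    else:
        # Fallback: each layer contributes equally (100% / total_layers).
        value = sum(layer.percent() for layer in layers) / len(layers)
    # Clamp so floating point drift cannot overshoot; the job itself reports 100%.
    return min(99.0, value)
```

Because the total size is fixed up front in the size-weighted branch, a large layer that starts downloading late cannot enlarge the denominator, which is the regression the commit message describes.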
dependabot[bot]	2a4890e2b0	Bump aiodocker from 0.24.0 to 0.25.0 (#6448 ) * Bump aiodocker from 0.24.0 to 0.25.0 Bumps [aiodocker](https://github.com/aio-libs/aiodocker) from 0.24.0 to 0.25.0. - [Release notes](https://github.com/aio-libs/aiodocker/releases) - [Changelog](https://github.com/aio-libs/aiodocker/blob/main/CHANGES.rst) - [Commits](https://github.com/aio-libs/aiodocker/compare/v0.24.0...v0.25.0) --- updated-dependencies: - dependency-name: aiodocker dependency-version: 0.25.0 dependency-type: direct:production update-type: version-update:semver-minor ... Signed-off-by: dependabot[bot] <support@github.com> * Update to new timeout configuration * Fix pytest failure --------- Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> Co-authored-by: Mike Degatano <michael.degatano@gmail.com> Co-authored-by: Stefan Agner <stefan@agner.ch>	2026-01-30 09:39:06 +01:00
Stefan Agner	a2db716a5f	Check frontend availability after Home Assistant Core updates (#6311 ) * Check frontend availability after Home Assistant Core updates Add verification that the frontend is actually accessible at "/" after core updates to ensure the web interface is serving properly, not just that the API endpoints respond. Previously, the update verification only checked API endpoints and whether the frontend component was loaded. This could miss cases where the API is responsive but the frontend fails to serve the UI. Changes: - Add check_frontend_available() method to HomeAssistantAPI that fetches the root path and verifies it returns HTML content - Integrate frontend check into core update verification flow after confirming the frontend component is loaded - Trigger automatic rollback if frontend is inaccessible after update - Fix blocking I/O calls in rollback log file handling to use async executor 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> * Avoid checking frontend if config data is None * Improve pytest tests * Make sure Core returns a valid config * Remove Core version check in frontend availability test The call site already makes sure that an actual Home Assistant Core instance is running before calling the frontend availability test. So this is rather redundant. Simplify the code by removing the version check and update tests accordingly. * Add test coverage for get_config --------- Co-authored-by: Claude <noreply@anthropic.com>	2026-01-29 09:06:45 +01:00
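The frontend check described in commit a2db716a5f (fetch the root path and verify that it returns HTML) might reduce to a decision like the one below. This is only a sketch: `looks_like_frontend` is a hypothetical helper, and requiring a 200 status together with a `text/html` Content-Type is an assumption about what "returns HTML content" means, not the actual logic of `check_frontend_available()`.

```python
from typing import Optional


def looks_like_frontend(status: int, content_type: Optional[str]) -> bool:
    """Return True if a response for "/" plausibly is the served web UI.

    Assumption: a healthy frontend answers 200 with a text/html media type.
    """
    if status != 200 or not content_type:
        return False
    # Ignore parameters such as "; charset=utf-8" and compare case-insensitively.
    media_type = content_type.split(";", 1)[0].strip().lower()
    return media_type == "text/html"
```

A caller would pass the status code and Content-Type header of the response for `/` and treat a `False` result as grounds for the rollback the commit mentions.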
David Rapan	641b205ee7	Add configurable interface route metric (#6447 ) * Add route_metric attribute to IpProperties class Signed-off-by: David Rapan <david@rapan.cz> * Refactor dbus setting IP constants Signed-off-by: David Rapan <david@rapan.cz> * Add route metric Signed-off-by: David Rapan <david@rapan.cz> * Merge test_api_network_interface_info Signed-off-by: David Rapan <david@rapan.cz> * Add test case for route metric update Signed-off-by: David Rapan <david@rapan.cz> --------- Signed-off-by: David Rapan <david@rapan.cz>	2026-01-28 13:08:36 +01:00
AlCalzone	df8201ca33	Update `get_docker_args()` to return `mounts` not `volumes` (#6499 ) * Update `get_docker_args()` to return `mounts` not `volumes` * fix more mocks to return PurePaths	2026-01-27 15:00:33 -05:00
Mike Degatano	909a2dda2f	Migrate (almost) all docker container interactions to aiodocker (#6489 ) * Migrate all docker container interactions to aiodocker * Remove containers_legacy since its no longer used * Add back remove color logic * Revert accidental invert of conditional in setup_network * Fix typos found by copilot * Apply suggestions from code review Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Revert "Apply suggestions from code review" This reverts commit `0a475433ea`. --------- Co-authored-by: Stefan Agner <stefan@agner.ch> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2026-01-27 12:42:17 +01:00
Mike Degatano	1d1a8cdad3	Add API to force repository repair (#6439 ) * Add API to force repository repair * Fix inheritance for error * Fix absolute import	2026-01-06 16:01:48 +01:00
Mike Degatano	d23bc291d5	Migrate create container to aiodocker (#6415 ) * Migrate create container to aiodocker * Fix extra hosts transformation * Env not Environment * Fix tests * Fixes from feedback --------- Co-authored-by: Jan Čermák <sairon@users.noreply.github.com>	2025-12-15 09:57:30 +01:00
Jan Čermák	cd4e7f2530	Remove the option to revert to `overlay2` driver (#6399 ) OS Agent will no longer support migrating to the overlay2 driver due to reasons explained in home-assistant/os-agent#245. Remove it from the Docker API as well.	2025-12-05 14:45:56 +01:00
Stefan Agner	5d02b09a0d	Fix addon options reset to defaults (#6397 ) Co-authored-by: Claude <noreply@anthropic.com>	2025-12-05 13:53:51 +01:00
Mike Degatano	81b7e54b18	Remove unknown errors from addons and auth (#6303 ) * Remove unknown errors from addons * Remove customized unknown error types * Fix docker ratelimit exception and tests * Fix stats test and add more for known errors * Add defined error for when build fails * Fixes from feedback * Fix mypy issues * Fix test failure due to rename * Change auth reset error message	2025-12-03 18:11:51 +01:00
Stefan Agner	fa490210cd	Improve CpuArch type safety across codebase (#6372 ) Co-authored-by: Claude <noreply@anthropic.com>	2025-12-01 19:56:05 +01:00
Mike Degatano	6302c7d394	Fix progress when using containerd snapshotter (#6357 ) * Fix progress when using containerd snapshotter * Add test for tiny image download under containerd-snapshotter * Fix API tests after progress allocation change * Fix test for auth changes * Apply suggestions from code review Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: Stefan Agner <stefan@agner.ch> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>	2025-11-27 16:26:22 +01:00
Jan Čermák	f55fd891e9	Add API endpoint for migrating Docker storage driver (#6361 ) Implement Supervisor API for home-assistant/os-agent#238, adding possibility to schedule migration either to Containerd overlayfs driver, or migration to the graph overlay2 driver, once the device is rebooted the next time. While it's technically in the DBus OS interface, in Supervisor's abstraction it makes more sense to put it under `/docker` endpoints.	2025-11-27 16:02:39 +01:00
Jan Čermák	5ed0c85168	Add optional no_colors query parameter to advanced logs endpoints (#6326 ) Add support for `no_colors` query parameter on all advanced logs API endpoints, allowing users to optionally strip ANSI color sequences from log output. This complements the existing color stripping on /latest endpoints added in #6319.	2025-11-21 09:29:15 +01:00
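Stripping ANSI color sequences from log output, as the `no_colors` parameter in commit 5ed0c85168 does, can be illustrated with a small regular expression. This is a sketch under assumptions: `strip_colors` is a hypothetical name, and the pattern covers only SGR sequences (`ESC [ ... m`), which may be narrower or broader than what Supervisor actually removes.

```python
import re

# Matches SGR (color/style) escape sequences such as "\x1b[31m" or "\x1b[0m".
ANSI_COLOR_RE = re.compile(r"\x1b\[[0-9;]*m")


def strip_colors(line: str) -> str:
    """Remove ANSI color sequences from a single log line."""
    return ANSI_COLOR_RE.sub("", line)
```

For example, `strip_colors("\x1b[31mERROR\x1b[0m done")` yields `"ERROR done"`, while lines without escape sequences pass through unchanged.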

264 Commits