* Replace fixed-duration sleeps after bus events with gather
Several tests use ``await asyncio.sleep(...)`` to "wait for the
listener to run" after firing a bus event. The fixed duration costs
real wall-clock time and the wait is nondeterministic: if the
handler chain needs slightly more time on a busy CI runner, the
assertion races the handler.
``Bus.fire_event`` has returned the listener tasks since #6252; capture
them and ``await asyncio.gather(*tasks)`` instead of sleeping. Touches
test_bus.py (the bus tests were poking scheduling instead of
verifying their assertions), test_home_assistant_watchdog.py,
test_plugin_base.py, addons/test_manager.py, docker/test_addon.py,
and test_store_execute_reload.py.
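A minimal sketch of the pattern, assuming a bus whose ``fire_event`` returns the tasks it schedules. The ``Bus`` class below is a hypothetical stand-in for illustration, not the Supervisor implementation:

```python
import asyncio


class Bus:
    """Hypothetical stand-in for the Supervisor bus."""

    def __init__(self) -> None:
        self._listeners = []

    def register(self, callback) -> None:
        self._listeners.append(callback)

    def fire_event(self, data) -> list:
        # Schedule every listener and hand the tasks back to the caller.
        return [asyncio.create_task(cb(data)) for cb in self._listeners]


async def demo() -> list:
    seen = []

    async def listener(data) -> None:
        await asyncio.sleep(0)  # a handler that yields at least once
        seen.append(data)

    bus = Bus()
    bus.register(listener)
    # Wait on the listener tasks themselves instead of a wall-clock sleep.
    await asyncio.gather(*bus.fire_event("started"))
    return seen
```

Because the test awaits the tasks directly, the assertion that follows cannot run before the handlers finish, however slow the runner is.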
Other cleanups in the same spirit:
- ``_fire_test_event`` in addons/test_addon.py becomes ``async def``
and gathers the listener tasks itself, so its 17 call sites
collapse to a single ``await _fire_test_event(...)``.
- The two test_store_execute_reload.py sites that used the private
``_update_connectivity()`` helper are reworked to set the cached
connectivity flag directly and fire the event themselves so they
can gather the listener tasks the same way.
- The two ``sleep(1)`` post-pull drains in docker/test_interface.py
collapse to ``sleep(0)`` (handler tasks are already gathered
inside pull_image), saving ~2s.
- The ``sleep(0.01)`` waits inside ``container_events()`` task
bodies (api/test_addons.py, api/test_store.py,
backups/test_manager.py) are just one-yield-to-the-parent and
become ``sleep(0)``.
Switching to ``gather`` exposes a few latent test mocks that were
silently swallowing TypeErrors as background-task failures before:
- ``CGroup.add_devices_allowed`` is ``async def`` but was patched
as a plain MagicMock in docker/test_addon.py — now patched via
``new_callable=AsyncMock``.
- The watchdog does ``await (await self.start())`` /
``await (await self.restart())`` because ``App.start`` /
``App.restart`` return ``asyncio.Task``. The mocks in
addons/test_addon.py (test_app_watchdog, test_watchdog_on_stop,
test_watchdog_during_attach) needed
``AsyncMock(return_value=<settled future>)`` to mirror that
shape rather than a plain MagicMock.
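The mock shape for the double await can be sketched as follows. ``watchdog_restart`` is a hypothetical reduction of the watchdog call, written only to show why the mock has to return an awaitable from an awaitable:

```python
import asyncio
from unittest.mock import AsyncMock, MagicMock


async def watchdog_restart(app) -> str:
    # First await runs the coroutine, second await waits for the
    # task/future that the coroutine returned.
    return await (await app.restart())


async def demo() -> str:
    # A future that is already resolved stands in for the asyncio.Task.
    settled = asyncio.get_running_loop().create_future()
    settled.set_result("restarted")

    app = MagicMock()
    app.restart = AsyncMock(return_value=settled)
    return await watchdog_restart(app)
```

With a plain ``MagicMock`` in place of the ``AsyncMock``, the first ``await`` would raise a TypeError, which is the failure that used to be swallowed in a background task.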
* Factor bus.fire_event + gather pattern into a helper
Per review feedback, the ``await asyncio.gather(*coresys.bus.fire_event(...))``
incantation was scattered across many call sites. Add
``tests.common.fire_bus_event`` that takes the coresys, event and data,
fires the event and awaits the spawned listener tasks. Convert all
matching sites to use it, including the ``_fire_test_event`` wrapper
in addons/test_addon.py which now just builds the
``DockerContainerStateEvent`` and delegates.
Per CLAUDE.md, plain test_* functions are the project style; class-
based test grouping is considered legacy. Convert the 24 test methods
in test_pull_progress.py (TestLayerProgress, TestImagePullProgress)
to module-level functions — none of them used self, so the rewrite is
mechanical.
Also rename three helper classes whose names accidentally matched
pytest's Test* collection pattern, even though they are fakes/fixtures
rather than test cases:
- TestAddon -> FakeApp (data holder used as a fake App in pwned tests)
- TestDockerInterface -> FakeDockerInterface (fixture/inner helper in
docker tests)
The two DBusServiceMock subclasses named TestInterface already had
__test__ = False and are left alone.
* Detect container registry rate limits uniformly
Container registry rate limits reach Supervisor in three distinct shapes:
1. HTTP 429 from the daemon - recognised today, but the exception and
resolution issue are hardcoded to Docker Hub. Since Core/Supervisor/
plugin images all live on ghcr.io now, virtually every 429 we see in
the field is actually a GHCR throttle that we mislabel. The biggest
Sentry issue (SUPERVISOR-16BK) has >115k events / >93k users, all
pulling a ghcr.io image, yet each user is told to "log into
Docker Hub".
2. HTTP 500 with 'toomanyrequests' in the body - not recognised. Docker
daemons before 28.3.0 wrap upstream 429s as 500 (fixed upstream by
moby/moby 23fa0ae74a, "Cleanup http status error checks"). The large
fleet on older daemons still produces this shape.
3. JSON error event during a streaming pull - not recognised. Once the
daemon starts writing the 200 OK response body the status is locked
in, so rate limits that land during layer download arrive as plain
text in the pull stream. Happens on all recent daemon versions -
SUPERVISOR-13FQ (>16k events) and SUPERVISOR-13E0 (>8k events) are
two large examples.
Cases 2 and 3 propagate as plain DockerError, bypass the 429 detection in
install() entirely, never produce a DOCKER_RATELIMIT resolution issue, and
generate large amounts of Sentry noise. Case 1 is detected but routes
every GHCR 429 through Docker-Hub-specific messaging and suggestions.
Changes:
- Add DockerRegistryRateLimitExceeded as the common base class and
GithubContainerRegistryRateLimitExceeded alongside the existing
DockerHubRateLimitExceeded. All extend APITooManyRequests so callers
and retry logic can key off a single type.
- Add GITHUB_RATELIMIT IssueType so GHCR failures don't show the
"log in to Docker Hub" suggestion that DOCKER_RATELIMIT carries.
- PullLogEntry.exception now maps stream errors containing
'toomanyrequests' to DockerRegistryRateLimitExceeded (case 3).
- docker/interface.py:install() routes all three cases through a single
_registry_rate_limit_exception() helper that picks the right issue
type, suggestion and exception subclass based on the image's registry.
- utils/sentry.py filters APITooManyRequests (and anything wrapping it
via __cause__) in capture_exception / async_capture_exception. One
point of policy, every caller benefits.
Callers (supervisor.update(), plugin manager, homeassistant core) are
unchanged - UPDATE_FAILED issues still get created alongside the
registry-specific rate limit issue, giving users the full picture.
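A rough sketch of how the three shapes can funnel into one exception hierarchy. The class names follow this message, but the bodies, ``is_rate_limit`` and ``rate_limit_exception`` are illustrative assumptions, and the registry matching is deliberately simplified to explicit prefixes:

```python
from typing import Optional


class APITooManyRequests(Exception):
    """Stand-in for the Supervisor exception of the same name."""


class DockerRegistryRateLimitExceeded(APITooManyRequests):
    """Common base for registry rate limits."""


class DockerHubRateLimitExceeded(DockerRegistryRateLimitExceeded):
    """Docker Hub throttle."""


class GithubContainerRegistryRateLimitExceeded(DockerRegistryRateLimitExceeded):
    """GHCR throttle."""


def is_rate_limit(status: Optional[int], message: str) -> bool:
    """Recognise all three shapes; status is None for stream errors."""
    if status == 429:
        return True  # case 1: daemon passes the 429 through
    # case 2 (wrapped as HTTP 500) and case 3 (error in the pull stream)
    return "toomanyrequests" in message.lower()


def rate_limit_exception(image: str) -> type:
    """Pick the exception class from the image's registry (simplified)."""
    if image.startswith("ghcr.io/"):
        return GithubContainerRegistryRateLimitExceeded
    if image.startswith("docker.io/"):
        return DockerHubRateLimitExceeded
    return DockerRegistryRateLimitExceeded
```

Callers that only care whether a pull was throttled can catch ``APITooManyRequests`` and ignore which registry produced it.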
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Consolidate Sentry noise filtering in one before_send hook
Move the APITooManyRequests filter from capture_exception /
async_capture_exception wrappers into the existing filter_data
before_send hook in supervisor/misc/filter.py, alongside the
AddonConfigurationError filter.
One isinstance tuple check instead of multiple layers, and every path
that reaches Sentry (including logging-integration and excepthook
captures, not just our explicit wrappers) now gets the same treatment.
The filter walks the __cause__ chain so wrapped rate-limit errors
(e.g. DockerHubRateLimitExceeded inside SupervisorUpdateError) still
get filtered. A debug log is emitted on each dropped event for
observability.
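The cause-chain walk can be sketched like this. ``before_send_sketch`` is a hypothetical name (the real hook is ``filter_data``), ``APITooManyRequests`` is a local stand-in, and the debug log is omitted; the ``(event, hint)`` signature and returning ``None`` to drop an event follow the Sentry SDK's ``before_send`` contract:

```python
from typing import Optional


class APITooManyRequests(Exception):
    """Stand-in for the Supervisor exception of the same name."""


def is_rate_limited(err: Optional[BaseException]) -> bool:
    """Return True if err or anything in its __cause__ chain is a rate limit."""
    seen = set()
    while err is not None and id(err) not in seen:
        if isinstance(err, APITooManyRequests):
            return True
        seen.add(id(err))  # guard against a cyclic cause chain
        err = err.__cause__
    return False


def before_send_sketch(event: dict, hint: dict) -> Optional[dict]:
    """Drop events whose exception chain contains a rate limit error."""
    exc_info = hint.get("exc_info")
    if exc_info and is_rate_limited(exc_info[1]):
        return None  # returning None tells Sentry to discard the event
    return event
```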
Review feedback from mdegat01 on #6732.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Drop GITHUB_RATELIMIT resolution issue
There is no actionable remediation for a GHCR rate limit - logging in
doesn't lift the quota the way it does for Docker Hub, and the cap is
on the authenticated account anyway. A resolution issue that just tells
the user "you were rate limited" adds UI noise without helping them.
Keep the GithubContainerRegistryRateLimitExceeded exception - retry
logic and the Sentry filter still key off it - but don't create a
resolution issue. A log entry from the exception constructor is
sufficient. Docker Hub still gets DOCKER_RATELIMIT + registry-login
suggestion since that is actionable.
Review feedback on #6732.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Use Unix socket for Supervisor to Core communication
Reintroduce Unix socket support for Supervisor-to-Core communication
(reverted in #6735) with the addition of a feature flag gate. The
feature is now controlled by the `core_unix_socket` feature flag and
disabled by default.
When the flag is enabled and the Core version supports it, Supervisor communicates with
Core via a Unix socket at /run/os/core.sock instead of TCP. This
eliminates the need for access token authentication on the socket path,
as Core authenticates the peer by the socket connection itself.
Key changes:
- Add FeatureFlag.CORE_UNIX_SOCKET to gate the feature
- HomeAssistantAPI: transport-aware session/url/websocket management
- WSClient: separate connect() (Unix, no auth) and connect_with_auth()
(TCP) class methods with proper error handling
- APIProxy delegates websocket setup to api.connect_websocket()
- Container state tracking for Unix session lifecycle
- CI builder mounts /run/supervisor for integration tests
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Sort feature flags alphabetically
* Drop per-call max_msg_size from WSClient
Hardcode the WebSocket message size cap to 64 MB in WSClient and remove
the parameter from WSClient.connect, connect_with_auth, _ws_connect,
and HomeAssistantAPI.connect_websocket. This was only ever overridden
by APIProxy, so threading it through four layers was unnecessary.
max_msg_size is a cap, not a pre-allocation; aiohttp only grows buffers
to the size of actual incoming messages. Supervisor's own control
channel never approaches 64 MB, so unifying the limit has no runtime
cost.
Addresses review feedback on #6742.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Rename addon→app in docstrings and comments
Updates all docstrings and inline comments across supervisor/ and
tests/ to use the new app/apps terminology. No runtime behaviour
is changed by this commit.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Rename addon→app in code (variables, args, class names, functions)
Renames all internal Python identifiers from addon/addons to app/apps:
- Variable and argument names
- Function and method names
- Class names (Addon→App, AddonManager→AppManager, DockerAddon→DockerApp,
all exception, check, and fixup classes, etc.)
- String literals used as Python identifiers (pytest fixtures,
parametrize param names, patch.object attribute strings,
URL route match_info keys)
External API contracts are preserved: JSON keys, error codes,
discovery protocol fields, TypedDict/attr.s field names.
Import module paths (supervisor/addons/) are also unchanged.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Fix partial backup/restore API to remap addons key to apps
The external API accepts `addons` as the request body key (since
ATTR_APPS = "addons"), but do_backup_partial and do_restore_partial
now take an `apps` parameter after the rename. The **body expansion
in both endpoints would pass `addons=...` causing a TypeError.
Remap the key before expansion in both backup_partial and
restore_partial:
    if ATTR_APPS in body:
        body["apps"] = body.pop(ATTR_APPS)
Also adds test_restore_partial_with_addons_key to verify the restore
path correctly receives apps= when addons is passed in the request
body. This path had no existing test coverage.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
* Fix merge error
* Adjust AppLoggerAdapter to use app_name
Co-authored-by: Stefan Agner <stefan@agner.ch>
---------
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Stefan Agner <stefan@agner.ch>
* Use Unix socket for Supervisor to Core communication
Switch internal Supervisor-to-Core HTTP and WebSocket communication
from TCP (port 8123) to a Unix domain socket.
The existing /run/supervisor directory on the host (already mounted
at /run/os inside the Supervisor container) is bind-mounted into the
Core container at /run/supervisor. Core receives the socket path via
the SUPERVISOR_CORE_API_SOCKET environment variable, creates the
socket there, and Supervisor connects to it via aiohttp.UnixConnector
at /run/os/core.sock.
Since the Unix socket is only reachable by processes on the same host,
requests arriving over it are implicitly trusted and authenticated as
the existing Supervisor system user. This removes the token round-trip
where Supervisor had to obtain and send Bearer tokens on every Core
API call. WebSocket connections are likewise authenticated implicitly,
skipping the auth_required/auth handshake.
Key design decisions:
- Version-gated by CORE_UNIX_SOCKET_MIN_VERSION so older Core
versions transparently continue using TCP with token auth
- LANDINGPAGE is explicitly excluded (not a CalVer version)
- Hard-fails with a clear error if the socket file is unexpectedly
missing when Unix socket communication is expected
- WSClient.connect() for Unix socket (no auth) and
WSClient.connect_with_auth() for TCP (token auth) separate the
two connection modes cleanly
- Token refresh always uses the TCP websession since it is inherently
a TCP/Bearer-auth operation
- Logs which transport (Unix socket vs TCP) is being used on first
request
Closes #6626
Related Core PR: home-assistant/core#163907
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Close WebSocket on handshake failure and validate auth_required
Ensure the underlying WebSocket connection is closed before raising
when the handshake produces an unexpected message. Also validate that
the first TCP message is auth_required before sending credentials.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Fix pylint protected-access warnings in tests
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Check running container env before using Unix socket
Split use_unix_socket into two properties to handle the Supervisor
upgrade transition where Core is still running with a container
started by the old Supervisor (without SUPERVISOR_CORE_API_SOCKET):
- supports_unix_socket: version check only, used when creating the
Core container to decide whether to set the env var
- use_unix_socket: version check + running container env check, used
for communication decisions
This ensures TCP fallback during the upgrade transition while still
hard-failing if the socket is missing after Supervisor configured
Core to use it.
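The split can be sketched as two properties. ``CoreState`` and ``MIN_VERSION`` are hypothetical: the real value of CORE_UNIX_SOCKET_MIN_VERSION is not stated here, and the real properties live on the Supervisor's Core objects rather than on a dataclass:

```python
from dataclasses import dataclass
from typing import Optional

ENV_SOCKET = "SUPERVISOR_CORE_API_SOCKET"
MIN_VERSION = (2026, 1)  # placeholder value, chosen only for the example


@dataclass
class CoreState:
    """Hypothetical snapshot of what Supervisor knows about Core."""

    version: Optional[tuple]  # None models a non-CalVer build such as landingpage
    running_env: dict  # environment of the container that is running now

    @property
    def supports_unix_socket(self) -> bool:
        # Version check only: decides whether to set the env var on create.
        return self.version is not None and self.version >= MIN_VERSION

    @property
    def use_unix_socket(self) -> bool:
        # Version check plus what the running container was started with.
        return self.supports_unix_socket and ENV_SOCKET in self.running_env
```

A Core container started by the old Supervisor has a new enough version but no env var, so ``supports_unix_socket`` is true while ``use_unix_socket`` stays false and TCP is used until the container is recreated.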
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Improve Core API communication logging and error handling
- Remove transport log from make_request that logged before Core
container was attached, causing misleading connection logs
- Log "Connected to Core via ..." once on first successful API response
in get_api_state, when the transport is actually known
- Remove explicit socket existence check from session property, let
aiohttp UnixConnector produce natural connection errors during
Core startup (same as TCP connection refused)
- Add validation in get_core_state matching get_config pattern
- Restore make_request docstring
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Guard Core API requests with container running check
Add is_running() check to make_request and connect_websocket so no
HTTP or WebSocket connection is attempted when the Core container is
not running. This avoids misleading connection attempts during
Supervisor startup before Core is ready.
Also make use_unix_socket raise if container metadata is not available
instead of silently falling back to TCP. This is a defensive check
since is_running() guards should prevent reaching this state.
Add attached property to DockerInterface to expose whether container
metadata has been loaded.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Reset Core API connection state on container stop
Listen for Core container STOPPED/FAILED events to reset the
connection state: clear the _core_connected flag so the transport
is logged again on next successful connection, and close any stale
Unix socket session.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Only mount /run/supervisor if we use it
* Fix pytest errors
* Remove redundant is_running check from ingress panel update
The is_running() guard in update_hass_panel is now redundant since
make_request checks is_running() internally. Also mock is_running
in the websession test fixture since tests using it need make_request
to proceed past the container running check.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Bind mount /run/supervisor to Supervisor /run/os
Home Assistant OS (as well as the Supervised run scripts) bind mount
/run/supervisor to /run/os in Supervisor. Since we reuse this location
for the communication socket between Supervisor and Core, we need to
also bind mount /run/supervisor to Supervisor /run/os in CI.
* Wrap WebSocket handshake errors in HomeAssistantAPIError
Unexpected exceptions during the WebSocket handshake (KeyError,
ValueError, TypeError from malformed messages) are now wrapped in
HomeAssistantAPIError inside WSClient.connect/connect_with_auth.
This means callers only need to catch HomeAssistantAPIError.
Remove the now-unnecessary except (RuntimeError, ValueError,
TypeError) from proxy _websocket_client and add a proper error
message to the APIError per review feedback.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Narrow WebSocket handshake exception handling
Replace broad `except Exception` with specific exception types that
can actually occur during the WebSocket handshake: KeyError (missing
dict keys), ValueError (bad JSON), TypeError (non-text WS message),
aiohttp.ClientError (connection errors), and TimeoutError. This
avoids silently wrapping programming errors into HomeAssistantAPIError.
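A reduced sketch of the narrowed handler, using only the standard library. ``parse_handshake`` is a hypothetical helper and ``HomeAssistantAPIError`` a local stand-in; the aiohttp.ClientError and TimeoutError branches are left out because no network call is made here:

```python
import json


class HomeAssistantAPIError(Exception):
    """Stand-in for the Supervisor exception of the same name."""


def parse_handshake(raw) -> str:
    """Return the message type of the first handshake frame."""
    try:
        message = json.loads(raw)  # ValueError on bad JSON, TypeError on non-text
        return message["type"]  # KeyError when the key is missing
    except (KeyError, ValueError, TypeError) as err:
        raise HomeAssistantAPIError(f"Unexpected handshake message: {err!r}") from err
```

Anything outside the listed types, such as an AttributeError from a bug in the handler itself, still propagates unchanged.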
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Remove unused create_mountpoint from MountBindOptions
The field was added but never used. The /run/supervisor host path
is guaranteed to exist since HAOS creates it for the Supervisor
container mount, so auto-creating the mountpoint is unnecessary.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Clear stale access token before raising on final retry
Move token clear before the attempt check in connect_websocket so
the stale token is always discarded, even when raising on the final
attempt. Without this, the next call would reuse the cached bad token
via _ensure_access_token's fast path, wasting a round-trip.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Add tests for Unix socket communication and Core API
Add tests for the new Unix socket communication path and improve
existing test coverage:
- Version-based supports_unix_socket and env-based use_unix_socket
- api_url/ws_url transport selection
- Connection lifecycle: connected log after restart, ignoring
unrelated container events
- get_api_state/check_api_state parameterized across versions,
responses, and error cases
- make_request is_running guard and TCP flow with real token fetch
- connect_websocket for both Unix and TCP (with token verification)
- WSClient.connect/connect_with_auth handshake success, errors,
cleanup on failure, and close with pending futures
Consolidate existing tests into parameterized form and drop synthetic
tests that covered very little.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Improve Docker network test coverage and infrastructure
Add test cases for enable_ipv6=None (no user setting) to
test_network_recreation, verifying existing behavior where None
leaves the network unchanged. Use pytest.param with descriptive IDs
for better test readability.
Add create_network_mock side_effect to the docker fixture so network
creation returns realistic metadata built from the provided params.
Remove redundant manual create mock setups from individual tests.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Enable IPv6 on Supervisor network by default for all installations
Previously, IPv6 was only enabled by default for new installations
(when enable_ipv6 config was None). Existing installations with
IPv4-only networks were left unchanged unless the user explicitly
set enable_ipv6 to true.
Now, when no explicit IPv6 setting exists, the network is migrated
to dual-stack on next boot. The same safety checks apply: migration
is blocked if user containers are running and requires a reboot.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Add specific error message for registry authentication failures
When a Docker image pull fails with 401 Unauthorized and registry
credentials are configured, raise DockerRegistryAuthError instead of
a generic DockerError. This surfaces a clear message to the user
("Docker registry authentication failed for <registry>. Check your
registry credentials") instead of "An unknown error occurred with
addon <name>".
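The decision can be sketched as below. ``pull_error`` is a hypothetical helper, and making ``DockerRegistryAuthError`` a subclass of ``DockerError`` is an assumption made for the example:

```python
class DockerError(Exception):
    """Stand-in for the generic Supervisor Docker error."""


class DockerRegistryAuthError(DockerError):
    """Stand-in; the subclass relationship is assumed."""


def pull_error(status: int, registry: str, has_credentials: bool) -> DockerError:
    """Map a failed pull to the most specific error available."""
    if status == 401 and has_credentials:
        return DockerRegistryAuthError(
            f"Docker registry authentication failed for {registry}. "
            "Check your registry credentials"
        )
    # Without configured credentials a 401 says nothing about the user's setup.
    return DockerError(f"Image pull failed with HTTP {status}")
```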
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Add tests for registry authentication error handling
Test that a 401 during image pull raises DockerRegistryAuthError when
credentials are configured, and falls back to generic DockerError
when no credentials are present.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Add tests for addon install/update/rebuild auth failure handling
Test that DockerRegistryAuthError propagates correctly through
addon install, update, and rebuild paths without being wrapped
in a generic AddonUnknownError.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
---------
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
aiodocker derives ServerAddress for X-Registry-Auth by doing
image.partition("/"). For Docker Hub images like
"homeassistant/amd64-supervisor", this extracts "homeassistant"
(the namespace) instead of "docker.io" (the registry).
With the classic graphdriver image store, ServerAddress was never
checked and credentials were sent regardless. With the containerd
image store (default since Docker v29 / HAOS 15), the resolver
compares ServerAddress against the actual registry host and silently
drops credentials on mismatch, falling back to anonymous access.
Fix by prefixing Docker Hub images with "docker.io/" when registry
credentials are configured, so aiodocker sets ServerAddress correctly.
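The effect of the prefix can be shown with plain string handling. ``server_address`` reproduces the ``partition("/")`` derivation described above; ``qualify_docker_hub`` is a hypothetical helper with a simplified registry check:

```python
def server_address(image: str) -> str:
    """Everything before the first slash, as described for aiodocker."""
    return image.partition("/")[0]


def qualify_docker_hub(image: str) -> str:
    """Prefix images that name no registry with docker.io (simplified check)."""
    first, sep, _ = image.partition("/")
    has_registry = bool(sep) and ("." in first or ":" in first or first == "localhost")
    return image if has_registry else f"docker.io/{image}"
```

With the prefix in place the same derivation yields the registry host instead of the namespace.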
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
* Drop unsupported architectures and machines from Supervisor
Since #5620 Supervisor no longer updates the version information on
unsupported architectures and machines. This means users on those
systems have been unable to update to newer versions of Supervisor
since that PR was released.
Furthermore since #6347 we also no longer build for these
architectures. With this, any code related to these architectures
becomes dead code and should be removed.
This commit removes all references to the deprecated architectures and
machines from Supervisor.
This affects the following architectures:
- armhf
- armv7
- i386
And the following machines:
- odroid-xu
- qemuarm
- qemux86
- raspberrypi
- raspberrypi2
- raspberrypi3
- raspberrypi4
- tinker
* Create issue if an app using a deprecated architecture is installed
This adds a check to the resolution system that detects whether an
installed app uses a deprecated architecture. If so, it shows a
warning to the user and recommends uninstalling the app.
* Formally deprecate machine add-on configs as well
Not only deprecate add-on configs for unsupported architectures, but
also for unsupported machines.
* For installed add-ons architecture must always exist
Fail hard in case of missing architecture, as this is a required field
for installed add-ons. This will prevent the Supervisor from running
with an unsupported configuration and causing further issues down the
line.
* Fix environment variable type errors by converting IP addresses to strings
Environment variables must be strings, but IPv4Address and IPv4Network
objects were being passed directly to container environment dictionaries,
causing typeguard validation errors.
Changes:
- Convert IPv4Address objects to strings in homeassistant.py for
SUPERVISOR and HASSIO environment variables
- Convert IPv4Network object to string in observer.py for
NETWORK_MASK environment variable
- Update tests to expect string values instead of IP objects in
environment dictionaries
- Remove unused ip_network import from test_observer.py
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* Use explicit string conversion for extra_hosts IP addresses
Use the !s format specifier in the f-string to explicitly convert
IPv4Address objects to strings when building the ExtraHosts list.
While f-strings implicitly convert objects to strings, using !s makes
the conversion explicit and consistent with the environment variable
fixes in the previous commit.
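A small illustration of both conversions. The addresses and the ``supervisor`` host name are made-up example values:

```python
from ipaddress import IPv4Address, IPv4Network

supervisor_ip = IPv4Address("172.30.32.2")  # example value
network = IPv4Network("172.30.32.0/23")  # example value

# Environment values must be str, so convert explicitly.
environment = {
    "SUPERVISOR": str(supervisor_ip),
    "NETWORK_MASK": str(network),
}

# !s applies str() inside the f-string, matching the explicit style above.
extra_hosts = [f"supervisor:{supervisor_ip!s}"]
```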
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
---------
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
The manifest fetcher was using docker.io as the registry API endpoint,
but Docker Hub's actual registry API is at registry-1.docker.io. When
trying to access https://docker.io/v2/..., requests were being redirected
to https://www.docker.com/ (the marketing site), which returned HTML
instead of JSON, causing manifest fetching to fail.
This matches exactly what Docker itself does internally - see
daemon/pkg/registry/config.go:49 where Docker hardcodes
DefaultRegistryHost = "registry-1.docker.io" for registry operations.
Changes:
- Add DOCKER_HUB_API constant for the actual API endpoint
- Add _get_api_endpoint() helper to translate docker.io to
registry-1.docker.io for HTTP API calls
- Update _get_auth_token() and _fetch_manifest() to use the API endpoint
- Keep docker.io as the registry identifier for naming and credentials
- Add tests to verify the API endpoint translation
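The translation amounts to a single lookup. The function names below are illustrative (the real helper is the private ``_get_api_endpoint()``), and the URL layout follows the registry v2 API:

```python
DOCKER_HUB = "docker.io"  # identifier used for naming and credentials
DOCKER_HUB_API = "registry-1.docker.io"  # host that serves the registry API


def get_api_endpoint(registry: str) -> str:
    """Translate the registry identifier into the host used for HTTP calls."""
    return DOCKER_HUB_API if registry == DOCKER_HUB else registry


def manifest_url(registry: str, repository: str, reference: str) -> str:
    """Build a v2 manifest URL for the given image reference."""
    return f"https://{get_api_endpoint(registry)}/v2/{repository}/manifests/{reference}"
```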
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
* Migrate info and events to aiodocker
* Migrate container logs to aiodocker
* Fix dns plugin loop test
* Fix mocking for docker info
* Fixes from feedback
* Harden monitor error handling
* Deleted failing tests because they were not useful
* Use count-based progress for Docker image pulls
Refactor Docker image pull progress to use a simpler count-based approach
where each layer contributes equally (100% / total_layers) regardless of
size. This replaces the previous size-weighted calculation that was
susceptible to progress regression.
The core issue was that Docker rate-limits concurrent downloads (~3 at a
time) and reports layer sizes only when downloading starts. With size-
weighted progress, large layers appearing late would cause progress to
drop dramatically (e.g., 59% -> 29%) as the total size increased.
The new approach:
- Each layer contributes equally to overall progress
- Per-layer progress: 70% download weight, 30% extraction weight
- Progress only starts after first "Downloading" event (when layer
count is known)
- Always caps at 99% - job completion handles final 100%
This simplifies the code by moving progress tracking to a dedicated
module (pull_progress.py) and removing complex size-based scaling logic
that tried to account for unknown layer sizes.
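The arithmetic of the count-based scheme can be sketched as below. The function names and the representation of a layer as a pair of fractions are assumptions for the example, not the API of pull_progress.py:

```python
def layer_progress(downloaded: float, extracted: float) -> float:
    """Per-layer percentage: 70% download weight, 30% extraction weight.

    Both arguments are fractions between 0.0 and 1.0.
    """
    return 70.0 * downloaded + 30.0 * extracted


def pull_progress(layers: list) -> float:
    """Overall percentage where every layer counts equally, capped at 99."""
    if not layers:
        return 0.0  # layer count not known yet
    total = sum(layer_progress(d, e) for d, e in layers) / len(layers)
    return min(99.0, total)  # the job itself reports the final 100%
```

A large layer that appears late only adds one more equal share to the average, so the total size never changes under the progress bar the way it did with size weighting.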
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Exclude already-existing layers from pull progress calculation
Layers that already exist locally should not count towards download
progress since there's nothing to download for them. Only layers that
need pulling are included in the progress calculation.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Add registry manifest fetcher for size-based pull progress
Fetch image manifests directly from container registries before pulling
to get accurate layer sizes upfront. This enables size-weighted progress
tracking where each layer contributes proportionally to its byte size,
rather than equal weight per layer.
Key changes:
- Add RegistryManifestFetcher that handles auth discovery via
WWW-Authenticate headers, token fetching with optional credentials,
and multi-arch manifest list resolution
- Update ImagePullProgress to accept manifest layer sizes via
set_manifest() and calculate size-weighted progress
- Fall back to count-based progress when manifest fetch fails
- Pre-populate layer sizes from manifest when creating layer trackers
The manifest fetcher supports ghcr.io, Docker Hub, and private
registries by using credentials from Docker config when available.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Clamp progress to 100 to prevent floating point precision issues
Floating point arithmetic in weighted progress calculations can produce
values slightly above 100 (e.g., 100.00000000000001). This causes
validation errors when the progress value is checked.
Add min(100, ...) clamping to both size-weighted and count-based
progress calculations to ensure the result never exceeds 100.
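A sketch of the clamp on a size-weighted sum. ``weighted_progress`` and its input shape are hypothetical:

```python
def weighted_progress(layers: list) -> float:
    """Size-weighted percentage over (size_bytes, percent_done) pairs."""
    total = sum(size for size, _ in layers)
    if total == 0:
        return 0.0
    value = sum(size / total * percent for size, percent in layers)
    # Summing weighted floats can overshoot by a rounding error, so clamp.
    return min(100.0, value)
```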
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Use sys_websession for manifest fetcher instead of creating new session
Reuse the existing CoreSys websession for registry manifest requests
instead of creating a new aiohttp session. This improves performance
and follows the established pattern used throughout the codebase.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Make platform parameter required and warn on missing platform
- Make platform a required parameter in get_manifest() and _fetch_manifest()
since it's always provided by the calling code
- Return None and log warning when requested platform is not found in
multi-arch manifest list, instead of falling back to first manifest
which could be the wrong architecture
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Log manifest fetch failures at warning level
Users will notice degraded progress tracking when manifest fetch fails,
so log at warning level to help diagnose issues.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Add pylint disable comments for protected access in manifest tests
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Separate download_current and total_size updates in pull progress
Update download_current and total_size independently in the DOWNLOADING
handler. This ensures download_current is updated even when total is
not yet available.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Reject invalid platform format in manifest selection
---------
Co-authored-by: Claude <noreply@anthropic.com>
* Migrate all docker container interactions to aiodocker
* Remove containers_legacy since it's no longer used
* Add back remove color logic
* Revert accidental invert of conditional in setup_network
* Fix typos found by copilot
* Apply suggestions from code review
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Revert "Apply suggestions from code review"
This reverts commit 0a475433ea.
---------
Co-authored-by: Stefan Agner <stefan@agner.ch>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Introduce a new option `duplicate_log_file` in the HA Core configuration that
sets the environment variable `HA_DUPLICATE_LOG_FILE=1` for the Core container
when enabled. This serves as a flag for Core to enable the legacy log file
alongside the standard logging, which is handled by the systemd journal.
* Disable timeout for Docker image pull operations
The aiodocker migration introduced a regression where image pulls could
time out during slow downloads. The session-level timeout (900s total)
was being applied to pull operations, but docker-py explicitly sets
timeout=None for pulls, allowing them to run indefinitely.
When aiodocker receives timeout=None, it converts it to
ClientTimeout(total=None), which aiohttp treats as "no timeout"
(returns TimerNoop instead of enforcing a timeout).
This fixes TimeoutError exceptions that could occur during installation
on systems with slow network connections or when pulling large images.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Fix pytests
---------
Co-authored-by: Claude <noreply@anthropic.com>
The aiodocker images.import_image() method returns a coroutine that
needs to be awaited, but the code was iterating over it directly,
causing "TypeError: 'coroutine' object is not iterable".
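The failure mode can be reproduced with the standard library alone. In this sketch `import_image_stub` is a hypothetical stand-in for the real call, not the aiodocker API itself.

```python
import asyncio


async def import_image_stub() -> list[dict[str, str]]:
    """Stand-in for an async API call that returns a list (hypothetical)."""
    return [{"stream": "Loaded image: example:latest"}]


async def broken() -> str:
    """Iterate over the coroutine object itself and report the error."""
    coro = import_image_stub()
    try:
        for _ in coro:
            pass
    except TypeError as err:
        return str(err)
    finally:
        coro.close()  # avoid a "never awaited" warning
    return "no error"


async def fixed() -> list[dict[str, str]]:
    """Await first, then iterate over the returned list."""
    return [entry for entry in await import_image_stub()]


error_message = asyncio.run(broken())
entries = asyncio.run(fixed())
```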
Fixes SUPERVISOR-13D9
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude <noreply@anthropic.com>
* Use Docker's official registry domain detection logic
Replace the custom IMAGE_WITH_HOST regex with a proper implementation
based on Docker's reference parser (vendor/github.com/distribution/
reference/normalize.go).
Changes:
- Change DOCKER_HUB from "hub.docker.com" to "docker.io" (official default)
- Add DOCKER_HUB_LEGACY for backward compatibility with "hub.docker.com"
- Add IMAGE_DOMAIN_REGEX and get_domain() function that properly detects:
- localhost (with optional port)
- Domains with "." (e.g., ghcr.io, 127.0.0.1)
- Domains with ":" port (e.g., myregistry:5000)
- IPv6 addresses (e.g., [::1]:5000)
- Update credential handling to support both docker.io and hub.docker.com
- Add comprehensive tests for domain detection
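The detection rule listed above can be sketched without a regex. This is a simplified illustration covering only the listed cases, not the implementation from this change; the function name `sketch_registry_from_image` and the choice to return `None` for Docker Hub images are assumptions.

```python
from typing import Optional


def sketch_registry_from_image(image: str) -> Optional[str]:
    """Return the registry host of an image reference, or None for Docker Hub.

    Hypothetical helper: the part before the first "/" counts as a registry
    only if it contains "." or ":", or is exactly "localhost".
    """
    first, sep, _rest = image.partition("/")
    if not sep:
        # No "/" at all, e.g. "ubuntu:22.04": treated as a Docker Hub image.
        return None
    if "." in first or ":" in first or first == "localhost":
        return first
    # Something like "homeassistant/amd64-base": a Docker Hub namespace.
    return None
```

Note that a tag such as `ubuntu:22.04` contains ":" but no "/", so it is never mistaken for a registry with a port.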
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Refactor Docker domain detection to utils module
Move get_domain function to supervisor/docker/utils.py and rename it
to get_domain_from_image for consistency with get_registry_for_image.
Use named group in the regex for better readability.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Rename domain to registry for consistency
Use consistent "registry" terminology throughout the codebase:
- Rename get_domain_from_image to get_registry_from_image
- Rename IMAGE_DOMAIN_REGEX to IMAGE_REGISTRY_REGEX
- Update named group from "domain" to "registry"
- Update all related comments and variable names
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
---------
Co-authored-by: Claude <noreply@anthropic.com>
* Fix progress when using containerd snapshotter
* Add test for tiny image download under containerd-snapshotter
* Fix API tests after progress allocation change
* Fix test for auth changes
* Apply suggestions from code review
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: Stefan Agner <stefan@agner.ch>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Pass registry credentials to add-on build for private base images
When building add-ons that use a base image from a private registry,
the build would fail because credentials configured via the Supervisor
API were not passed to the Docker-in-Docker build container.
This fix:
- Adds get_docker_config_json() to generate a Docker config.json with
registry credentials for the base image
- Creates a temporary config file and mounts it into the build container
at /root/.docker/config.json so BuildKit can authenticate when pulling
the base image
- Cleans up the temporary file after build completes
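For orientation, a sketch of what such a config file body can look like. It uses the common Docker `config.json` layout with an `auths` map and a base64-encoded `user:password` token; the function name and signature are hypothetical and do not reflect the actual `get_docker_config_json()` implementation.

```python
import base64
import json


def sketch_docker_config_json(registry: str, username: str, password: str) -> str:
    """Build a Docker config.json body with credentials for one registry."""
    # The "auth" value is base64("username:password").
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return json.dumps({"auths": {registry: {"auth": token}}})


config = sketch_docker_config_json("ghcr.io", "user", "secret")
```

The resulting string would be written to the temporary file that is mounted into the build container and removed once the build finishes.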
Fixes #6354
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Fix pylint errors
* Apply suggestions from code review
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Refactor registry credential extraction into shared helper
Extract duplicate logic for determining which registry matches an image
into a shared `get_registry_for_image()` method in `DockerConfig`. This
method is now used by both `DockerInterface._get_credentials()` and
`AddonBuild.get_docker_config_json()`.
Move `DOCKER_HUB` and `IMAGE_WITH_HOST` constants to `docker/const.py`
to avoid circular imports between manager.py and interface.py.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Apply suggestions from code review
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* ruff format
* Document raises
---------
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Mike Degatano <michael.degatano@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Fix private registry authentication for aiodocker image pulls
After PR #6252 migrated image pulling from dockerpy to aiodocker,
private registry authentication stopped working. The old _docker_login()
method stored credentials in ~/.docker/config.json via dockerpy, but
aiodocker doesn't read that file - it requires credentials passed
explicitly via the auth parameter.
Changes:
- Remove unused _docker_login() method (dockerpy login was ineffective)
- Pass credentials directly to pull_image() via new auth parameter
- Add auth parameter to DockerAPI.pull_image() method
- Add unit tests for Docker Hub and custom registry authentication
Fixes #6345
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Ignore protected access in test
* Fix plug-in pull test
* Fix HA core tests
---------
Co-authored-by: Claude <noreply@anthropic.com>
* Handle pull events with complete progress details only
Under certain circumstances, Docker seems to send pull events with
incomplete progress details (i.e., missing 'current' or 'total' fields).
In practice, we've observed an empty dictionary for progress details
as well as a missing 'total' field (while 'current' was present).
All such events were observed with Docker 28.3.3 using the old, default
Docker graph backend.
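A minimal sketch of such a guard, assuming pull events are dicts with a `progressDetail` entry. The function name is hypothetical and the real handler may be structured differently.

```python
from typing import Any, Optional


def sketch_complete_progress(event: dict[str, Any]) -> Optional[tuple[int, int]]:
    """Return (current, total) only when both fields are present, else None."""
    detail = event.get("progressDetail") or {}
    current = detail.get("current")
    total = detail.get("total")
    if current is None or total is None:
        # Empty dict or a missing field: skip this event for progress purposes.
        return None
    return current, total
```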
* Fix docstring/comment
* Migrate images from dockerpy to aiodocker
* Add missing coverage and fix bug in repair
* Bind libraries to different files and refactor images.pull
* Use the same socket again
Try using the same socket again.
* Fix pytest
---------
Co-authored-by: Stefan Agner <stefan@agner.ch>
* Fix docker image pull progress blocked by small layers
Small Docker layers (typically <100 bytes) can skip the downloading phase
entirely, going directly from "Pulling fs layer" to "Download complete"
without emitting any progress events with byte counts. This caused the
aggregate progress calculation to block indefinitely, as it required all
layer jobs to have their `extra` field populated with byte counts before
proceeding.
The issue manifested as parent job progress jumping from 0% to 97.9% after
long delays, as seen when a 96-byte layer held up progress reporting for
~50 seconds until it finally reached the "Extracting" phase.
Set a minimal `extra` field (current=1, total=1) when layers reach
"Download complete" without having gone through the downloading phase.
This allows the aggregate progress calculation to proceed immediately
while still correctly representing the layer as 100% downloaded.
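The effect can be illustrated with a simplified model of the aggregate calculation. Both functions below are hypothetical sketches: each layer is represented only by its `extra` dict (or `None` while no byte counts are known), which is an assumption about the data shape.

```python
from typing import Optional

Extra = Optional[dict[str, int]]


def sketch_on_download_complete(extra: Extra) -> dict[str, int]:
    """Give a layer that skipped the downloading phase a placeholder byte count."""
    if extra is None:
        return {"current": 1, "total": 1}
    return extra


def sketch_aggregate_progress(layers: list[Extra]) -> Optional[float]:
    """Return the download percentage, or None while any layer lacks byte counts."""
    known = [extra for extra in layers if extra is not None]
    if len(known) != len(layers):
        return None  # blocked: at least one layer has no byte counts yet
    total = sum(extra["total"] for extra in known)
    current = sum(extra["current"] for extra in known)
    return 100.0 * current / total if total else None


big_layer = {"current": 50, "total": 200}
blocked = sketch_aggregate_progress([big_layer, None])
unblocked = sketch_aggregate_progress([big_layer, sketch_on_download_complete(None)])
```

With the placeholder in place, the tiny layer no longer holds back the aggregate, and its contribution to the total is negligible.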
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Update test to capture issue correctly
* Improve pytest
* Fix pytest comment
* Fix pylint warning
---------
Co-authored-by: Claude <noreply@anthropic.com>
* Formally deprecate CodeNotary build config
* Remove CodeNotary specific integrity checking
The current code is specific to how CodeNotary was doing integrity
checking. A future integrity checking mechanism likely will work
differently (e.g. through EROFS based containers). Remove the current
code to make way for a future implementation.
* Drop CodeNotary integrity fixups
* Drop unused tests
* Fix pytest
* Fix pytest
* Remove CodeNotary related exceptions and handling
Remove CodeNotary related exceptions and handling from the Docker
interface.
* Drop unnecessary comment
* Remove Codenotary specific IssueType/SuggestionType
* Drop Codenotary specific environment and secret reference
* Remove unused constants
* Introduce APIGone exception for removed APIs
Introduce a new exception class APIGone to indicate that certain API
features have been removed and are no longer available. Update the
security integrity check endpoint to raise this new exception instead
of a generic APIError, providing clearer communication to clients that
the feature has been intentionally removed.
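A sketch of how such an exception could be modelled. The `status` attribute and the mapping to HTTP 410 Gone are assumptions for illustration, and `APIError` here is a stand-in for the project's own base class.

```python
from http import HTTPStatus


class APIError(Exception):
    """Generic API error (stand-in for the project's base class)."""

    status = HTTPStatus.BAD_REQUEST  # assumption: default status for API errors


class APIGone(APIError):
    """Raised when an API feature has been removed."""

    status = HTTPStatus.GONE  # assumption: removed features map to HTTP 410
```

Because `APIGone` subclasses `APIError`, existing handlers that catch the generic error keep working, while clients can tell a removed feature apart by its status.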
* Drop content trust
A cosign based signature verification will likely be named differently
to avoid confusion with existing implementations. For now, remove the
content trust option entirely.
* Drop code sign test
* Remove source_mods/content_trust evaluations
* Remove content_trust reference in bootstrap.py
* Fix security tests
* Drop unused tests
* Drop codenotary from schema
Since we have "remove extra" in voluptuous, we can remove the
codenotary field from the addon schema.
* Remove content_trust from tests
* Remove content_trust unsupported reason
* Remove unnecessary comment
* Remove unrelated pytest
* Remove unrelated fixtures
* Add support for ulimit in addon config
Similar to docker-compose, this adds support for setting ulimits
for addons via the addon config. This is useful e.g. for InfluxDB
which on its own does not support setting higher open file descriptor
limits, but recommends increasing limits on the host.
* Make soft and hard limit mandatory if ulimit is a dict
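A sketch of the validation rule in plain Python. The key names `soft` and `hard`, and the convention that a bare integer sets both limits (as in docker-compose), are assumptions; the function name is hypothetical and the real schema may differ.

```python
def sketch_normalize_ulimit(value: object) -> dict[str, int]:
    """Normalize one ulimit entry to a dict with "soft" and "hard" keys."""
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(value, int) and not isinstance(value, bool):
        return {"soft": value, "hard": value}
    if isinstance(value, dict):
        missing = {"soft", "hard"} - set(value)
        if missing:
            raise ValueError(f"ulimit dict is missing: {sorted(missing)}")
        return {"soft": int(value["soft"]), "hard": int(value["hard"])}
    raise ValueError(f"unsupported ulimit value: {value!r}")
```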
* Add progress reporting to addon, HA and Supervisor updates
* Fix assert in test
* Add progress to addon, core, supervisor updates/installs
* Fix double install bug in addons install
* Remove initial_install and re-arrange order of load
* Fix CID file handling to prevent directory creation
It seems that under certain conditions Docker creates a directory
instead of a file for the CID file. This change ensures that
the CID file is always created as a file, and any existing directory
is removed before creating the file.
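A minimal sketch of that behaviour using only the standard library. The function name and the temporary path in the usage part are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def sketch_write_cidfile(path: Path, container_id: str) -> None:
    """Write the container ID to path, replacing a directory if one is in the way."""
    if path.is_dir():
        # A directory was created where the file should be: remove it first.
        shutil.rmtree(path)
    path.write_text(container_id, encoding="utf-8")


with tempfile.TemporaryDirectory() as tmp:
    cidfile = Path(tmp) / "example.cid"
    cidfile.mkdir()  # simulate the stray directory
    sketch_write_cidfile(cidfile, "abc123")
    was_file = cidfile.is_file()
    written = cidfile.read_text(encoding="utf-8")
```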
* Fix tests
* Fix pytest
* Write cidfiles of Docker containers and mount them individually to /run/cid
There is no standard way to get the container ID in the container
itself, which can be needed for instance for #6006. The usual pattern is
to use the --cidfile argument of Docker CLI and mount the generated file
to the container. However, this is a feature of the Docker CLI and we can't
use it when creating the containers via API. To get container ID to
implement native logging in e.g. Core as well, we need the help of the
Supervisor.
This change implements a similar feature fully in Supervisor's DockerAPI
class that orchestrates lifetime of all containers managed by
Supervisor. The files are created in the SUPERVISOR_DATA directory, as
they need to be persisted between reboots, just as the instances of
Docker containers are.
Supervisor's cidfile must be created when starting the Supervisor
itself, for that see home-assistant/operating-system#4276.
* Address review comments, fix mounting of the cidfile
* Send progress updates during image pull for install/update
* Add extra to tests about job APIs
* Send out-of-date progress to Sentry and combine done event
* Pulling container image layer
* Enable IPv6 by default for new installations
Enable IPv6 by default for new Supervisor installations. Let's also
make the `enable_ipv6` attribute nullable, so we can distinguish
between "not set" and "set to false".
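The tri-state idea can be sketched as follows. The function name is hypothetical, and how an unset value is resolved is left to the caller here, since the description above does not spell it out.

```python
from typing import Optional


def sketch_effective_ipv6(enable_ipv6: Optional[bool], default: bool) -> bool:
    """Resolve the nullable setting: None means "not set", so the default applies."""
    if enable_ipv6 is None:
        return default
    # An explicit True or False always wins over the default.
    return enable_ipv6
```

An explicit `False` stays distinguishable from "not set", which a plain boolean could not express.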
* Add pytest
* Add log message that system restart is required for IPv6 changes
* Fix API pytest
* Create resolution center issue when reboot is required
* Order log after actual setter call
Configurable and w/ migrations between IPv4-Only and Dual-Stack
Signed-off-by: David Rapan <david@rapan.cz>
Co-authored-by: Stefan Agner <stefan@agner.ch>