* Fix Docker exec exit code handling by using detach=False
When executing commands inside containers using `container_run_inside()`,
the exec metadata did not contain a valid exit code because `detach=True`
starts the exec in the background and returns immediately before completion.
Root cause: With `detach=True`, Docker's exec start() returns an awaitable
that yields output bytes. However, the await only waits for the HTTP/REST
call to complete, NOT for the actual exec command to finish. The command
continues running in the background after the HTTP response is received.
Calling `inspect()` immediately after returns `ExitCode: None` because
the exec hasn't completed yet.
Solution: Use `detach=False` which returns a Stream object that:
- Automatically waits for exec completion by reading from the stream
- Provides actual command output (not just empty bytes)
- Makes exit code immediately available after stream closes
- No polling needed
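For illustration, a minimal sketch of the new flow (function signature and error handling are illustrative, assuming aiodocker's Exec/Stream interface):

```python
import aiodocker

async def run_inside(container: aiodocker.containers.DockerContainer, cmd: str) -> tuple[int, bytes]:
    """Run a command in a container and return (exit_code, output)."""
    execute = await container.exec(cmd=["/bin/sh", "-c", cmd], stdout=True, stderr=True)
    output = b""
    # detach=False returns a Stream; reading it to EOF waits for the exec to finish
    async with execute.start(detach=False) as stream:
        while msg := await stream.read_out():
            output += msg.data
    # Exit code is available once the stream has closed
    inspect = await execute.inspect()
    exit_code = inspect.get("ExitCode")
    if exit_code is None:  # defensive: should not happen after the stream closed
        raise RuntimeError("Exec finished without an exit code")
    return exit_code, output
```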
Changes:
- Switch from `detach=True` to `detach=False` in container_run_inside()
- Read output from stream using async context manager
- Add defensive validation to ensure ExitCode is never None
- Update tests to mock the Stream interface using AsyncMock
- Add debug log showing exit code after command execution
Fixes #6518
Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
* Address review feedback
---------
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
The Supervisor's /core/api proxy previously only supported GET and POST
methods, returning 405 Method Not Allowed for DELETE requests. This
prevented addons from calling Home Assistant Core REST API endpoints
that require DELETE methods, such as deleting automations, scripts,
or scenes.
The underlying proxy implementation already supported passing through
any HTTP method via request.method.lower(), so only the route
registration was needed.
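A sketch of the kind of route registration involved (paths and handler names are illustrative; aiohttp is what the Supervisor API uses):

```python
from aiohttp import web

def register_core_proxy(app: web.Application, handler) -> None:
    """Route all relevant methods to the same generic proxy handler."""
    app.add_routes(
        [
            web.get("/core/api/{path:.+}", handler),
            web.post("/core/api/{path:.+}", handler),
            # Newly registered: DELETE now reaches the same pass-through handler
            web.delete("/core/api/{path:.+}", handler),
        ]
    )
```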
Fixes #6509
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
The aiodocker 0.25.0 upgrade (PR #6448) changed how DockerError handles
the message parameter. The library now extracts the message string from
Docker API JSON responses before passing it to DockerError, rather than
passing the entire dict.
The port conflict detection tests were written before this change and
incorrectly passed dicts to DockerError. This caused TypeErrors when
the port conflict detection code tried to match err.message with a
regex, expecting a string but receiving a dict.
Update both test_addon_start_port_conflict_error and
test_observer_start_port_conflict to pass message strings directly,
matching the real aiodocker 0.25.0 behavior.
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
* Map port conflict on start error into a known error
* Apply suggestions from code review
* Run ruff format
---------
Co-authored-by: Stefan Agner <stefan@agner.ch>
* Use count-based progress for Docker image pulls
Refactor Docker image pull progress to use a simpler count-based approach
where each layer contributes equally (100% / total_layers) regardless of
size. This replaces the previous size-weighted calculation that was
susceptible to progress regression.
The core issue was that Docker rate-limits concurrent downloads (~3 at a
time) and reports layer sizes only when downloading starts. With size-
weighted progress, large layers appearing late would cause progress to
drop dramatically (e.g., 59% -> 29%) as the total size increased.
The new approach:
- Each layer contributes equally to overall progress
- Per-layer progress: 70% download weight, 30% extraction weight
- Progress only starts after first "Downloading" event (when layer
count is known)
- Always caps at 99% - job completion handles final 100%
This simplifies the code by moving progress tracking to a dedicated
module (pull_progress.py) and removing complex size-based scaling logic
that tried to account for unknown layer sizes.
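A rough sketch of the count-based calculation (names are illustrative; the real logic lives in pull_progress.py):

```python
DOWNLOAD_WEIGHT = 0.7
EXTRACT_WEIGHT = 0.3

def layer_progress(download_done: float, extract_done: float) -> float:
    """Per-layer progress in 0..1, weighted 70% download / 30% extraction."""
    return download_done * DOWNLOAD_WEIGHT + extract_done * EXTRACT_WEIGHT

def overall_progress(layers: list[tuple[float, float]]) -> float:
    """Each layer contributes equally; capped at 99% until the job completes."""
    if not layers:
        return 0.0
    percent = sum(layer_progress(d, e) for d, e in layers) / len(layers) * 100
    return min(percent, 99.0)
```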
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Exclude already-existing layers from pull progress calculation
Layers that already exist locally should not count towards download
progress since there's nothing to download for them. Only layers that
need pulling are included in the progress calculation.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Add registry manifest fetcher for size-based pull progress
Fetch image manifests directly from container registries before pulling
to get accurate layer sizes upfront. This enables size-weighted progress
tracking where each layer contributes proportionally to its byte size,
rather than equal weight per layer.
Key changes:
- Add RegistryManifestFetcher that handles auth discovery via
WWW-Authenticate headers, token fetching with optional credentials,
and multi-arch manifest list resolution
- Update ImagePullProgress to accept manifest layer sizes via
set_manifest() and calculate size-weighted progress
- Fall back to count-based progress when manifest fetch fails
- Pre-populate layer sizes from manifest when creating layer trackers
The manifest fetcher supports ghcr.io, Docker Hub, and private
registries by using credentials from Docker config when available.
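The token flow roughly looks like this (simplified sketch using aiohttp; anonymous pull only, while the real fetcher also passes registry credentials when fetching the token):

```python
import aiohttp

ACCEPT = (
    "application/vnd.oci.image.index.v1+json, "
    "application/vnd.docker.distribution.manifest.list.v2+json, "
    "application/vnd.docker.distribution.manifest.v2+json"
)

async def fetch_manifest(session: aiohttp.ClientSession, registry: str, image: str, tag: str) -> dict:
    """Fetch a manifest, following the Bearer token challenge if required."""
    url = f"https://{registry}/v2/{image}/manifests/{tag}"
    headers = {"Accept": ACCEPT}
    async with session.get(url, headers=headers) as resp:
        if resp.status == 200:
            return await resp.json(content_type=None)
        challenge = resp.headers.get("WWW-Authenticate", "")
    # Parse realm/service/scope from the WWW-Authenticate Bearer challenge
    params = {}
    for part in challenge.removeprefix("Bearer ").split(","):
        key, _, value = part.strip().partition("=")
        params[key] = value.strip('"')
    token_url = params.pop("realm")
    async with session.get(token_url, params=params) as resp:
        token = (await resp.json())["token"]
    headers["Authorization"] = f"Bearer {token}"
    async with session.get(url, headers=headers) as resp:
        resp.raise_for_status()
        return await resp.json(content_type=None)
```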
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Clamp progress to 100 to prevent floating point precision issues
Floating point arithmetic in weighted progress calculations can produce
values slightly above 100 (e.g., 100.00000000000001). This causes
validation errors when the progress value is checked.
Add min(100, ...) clamping to both size-weighted and count-based
progress calculations to ensure the result never exceeds 100.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Use sys_websession for manifest fetcher instead of creating new session
Reuse the existing CoreSys websession for registry manifest requests
instead of creating a new aiohttp session. This improves performance
and follows the established pattern used throughout the codebase.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Make platform parameter required and warn on missing platform
- Make platform a required parameter in get_manifest() and _fetch_manifest()
since it's always provided by the calling code
- Return None and log warning when requested platform is not found in
multi-arch manifest list, instead of falling back to first manifest
which could be the wrong architecture
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Log manifest fetch failures at warning level
Users will notice degraded progress tracking when manifest fetch fails,
so log at warning level to help diagnose issues.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Add pylint disable comments for protected access in manifest tests
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Separate download_current and total_size updates in pull progress
Update download_current and total_size independently in the DOWNLOADING
handler. This ensures download_current is updated even when total is
not yet available.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Reject invalid platform format in manifest selection
---------
Co-authored-by: Claude <noreply@anthropic.com>
During system shutdown (reboot/poweroff), the watchdog was incorrectly
detecting the Home Assistant Core container as failed and attempting to
restart it. This occurred because Docker was stopping all containers in
parallel with Supervisor's own shutdown sequence, causing the watchdog
to trigger while add-ons were still being stopped.
This led to an abrupt termination of Core before it could cleanly shut
down its SQLite database, resulting in a warning on the next startup:
"The system could not validate that the sqlite3 database was shutdown
cleanly".
The fix registers a supervisor state change listener that unregisters
the watchdog when entering any shutdown state (SHUTDOWN, STOPPING, or
CLOSE). This prevents restart attempts during both user-initiated
reboots (via API) and external shutdown signals (Docker SIGTERM,
console reboot commands).
Since SHUTDOWN, STOPPING, and CLOSE are terminal states with no reverse
transition back to RUNNING, no re-registration logic is needed.
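Schematically (names are illustrative; the actual implementation hooks into the Supervisor's event bus):

```python
from enum import Enum

class CoreState(str, Enum):
    RUNNING = "running"
    SHUTDOWN = "shutdown"
    STOPPING = "stopping"
    CLOSE = "close"

SHUTDOWN_STATES = {CoreState.SHUTDOWN, CoreState.STOPPING, CoreState.CLOSE}

class CoreWatchdog:
    """Minimal stand-in for the Core container watchdog."""

    def __init__(self) -> None:
        self._active = True

    def on_supervisor_state_change(self, state: CoreState) -> None:
        if state in SHUTDOWN_STATES:
            # Terminal states with no transition back to RUNNING, so simply
            # unregister; no re-registration logic is needed.
            self._active = False

    def on_container_stopped(self) -> None:
        if self._active:
            ...  # would restart Home Assistant Core here
```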
Fixes #6511
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
The test was failing intermittently in CI because concurrent async
operations in asyncio.gather() were getting slightly different
timestamps (microseconds apart) despite being inside a time_machine
context.
When test2.execute() calls were timestamped at start+2ms due to async
scheduling delays, they weren't cleaned up in the final test block
(cutoff = start+1ms), causing a false rate limit error.
Fix by using tick=False to completely freeze time during the gather,
ensuring all 4 calls get the exact same timestamp.
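The pattern, roughly (the rate-limited call is a placeholder for the job under test):

```python
import asyncio
from datetime import datetime, timezone

import time_machine

async def limited_call() -> None:
    """Placeholder for the rate-limited execute() call under test."""

async def test_rate_limited_calls_share_timestamp() -> None:
    # tick=False freezes time completely, so all four gathered calls are
    # timestamped identically and fall inside the same rate-limit window.
    with time_machine.travel(datetime(2024, 1, 1, tzinfo=timezone.utc), tick=False):
        await asyncio.gather(*(limited_call() for _ in range(4)))
```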
Co-authored-by: Claude Sonnet 4.5 <noreply@anthropic.com>
* Check frontend availability after Home Assistant Core updates
Add verification that the frontend is actually accessible at "/" after core
updates to ensure the web interface is serving properly, not just that the
API endpoints respond.
Previously, the update verification only checked API endpoints and whether
the frontend component was loaded. This could miss cases where the API is
responsive but the frontend fails to serve the UI.
Changes:
- Add check_frontend_available() method to HomeAssistantAPI that fetches
the root path and verifies it returns HTML content
- Integrate frontend check into core update verification flow after
confirming the frontend component is loaded
- Trigger automatic rollback if frontend is inaccessible after update
- Fix blocking I/O calls in rollback log file handling to use async
executor
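The frontend check is essentially the following (sketch; the real method lives on HomeAssistantAPI and reuses the Supervisor's session handling):

```python
import aiohttp

async def check_frontend_available(session: aiohttp.ClientSession, base_url: str) -> bool:
    """Return True when "/" responds with HTML, i.e. the UI is actually served."""
    try:
        async with session.get(f"{base_url}/", allow_redirects=True) as resp:
            if resp.status != 200:
                return False
            return "text/html" in resp.headers.get("Content-Type", "")
    except aiohttp.ClientError:
        return False
```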
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Avoid checking frontend if config data is None
* Improve pytest tests
* Make sure Core returns a valid config
* Remove Core version check in frontend availability test
The call site already makes sure that an actual Home Assistant Core
instance is running before calling the frontend availability test.
So this is rather redundant. Simplify the code by removing the version
check and update tests accordingly.
* Add test coverage for get_config
---------
Co-authored-by: Claude <noreply@anthropic.com>
* Add route_metric attribute to IpProperties class
Signed-off-by: David Rapan <david@rapan.cz>
* Refactor dbus setting IP constants
Signed-off-by: David Rapan <david@rapan.cz>
* Add route metric
Signed-off-by: David Rapan <david@rapan.cz>
* Merge test_api_network_interface_info
Signed-off-by: David Rapan <david@rapan.cz>
* Add test case for route metric update
Signed-off-by: David Rapan <david@rapan.cz>
---------
Signed-off-by: David Rapan <david@rapan.cz>
* Migrate all docker container interactions to aiodocker
* Remove containers_legacy since it's no longer used
* Add back remove color logic
* Revert accidental invert of conditional in setup_network
* Fix typos found by copilot
* Apply suggestions from code review
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Revert "Apply suggestions from code review"
This reverts commit 0a475433ea.
---------
Co-authored-by: Stefan Agner <stefan@agner.ch>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Fix too short timeouts for Synology NAS
With Home Assistant Core 2025.12.x updates available, the STARTUP_API_RESPONSE_TIMEOUT that the Supervisor is willing to wait (before assuming a startup failure and rolling back the entire Core update) turned out to be too short on less powerful hosts. The problem has been seen on Synology NAS machines running Home Assistant on the side (as in my case). I doubled the timeout from 3 to 6 minutes and the upgrade to Core 2025.12.1 now works on my Synology DS723+; the update took 4 min 56 s, so the increase was clearly needed.
* Fix tests for increased API Timeout
* Increase the timeout to 10 minutes
* Increase the timeout in tests
---------
Co-authored-by: Jan Čermák <sairon@users.noreply.github.com>
Introduce a new option `duplicate_log_file` in the HA Core configuration that, when
enabled, sets the environment variable `HA_DUPLICATE_LOG_FILE=1` for the Core
container. This serves as a flag for Core to enable the legacy log file alongside
the standard logging handled by the systemd journal.
Add AttributeError to the exception handler in the git pull operation.
This catches the case where a repository exists but has no 'origin'
remote configured, which can happen if the remote was renamed or
deleted by the user or due to repository corruption.
When this error occurs, it now creates a CORRUPT_REPOSITORY issue with
an EXECUTE_RESET suggestion, triggering the auto-fix mechanism to
re-clone the repository.
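Roughly the shape of the change (sketch, assuming GitPython's attribute-style remote access):

```python
from git import GitCommandError, Repo

def pull_repository(repo: Repo) -> bool:
    """Return False when the repository should be flagged corrupt and re-cloned."""
    try:
        repo.remotes.origin.fetch()
    except (GitCommandError, AttributeError):
        # AttributeError: the repository exists but has no 'origin' remote
        # (renamed, deleted by the user, or corrupted).
        return False
    return True
```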
Fixes SUPERVISOR-69Z
Fixes SUPERVISOR-172C
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude <noreply@anthropic.com>
OS Agent will no longer support migrating to the overlay2 driver for the reasons
explained in home-assistant/os-agent#245. Remove it from the Docker API as
well.
* Disable timeout for Docker image pull operations
The aiodocker migration introduced a regression where image pulls could
timeout during slow downloads. The session-level timeout (900s total)
was being applied to pull operations, but docker-py explicitly sets
timeout=None for pulls, allowing them to run indefinitely.
When aiodocker receives timeout=None, it converts it to
ClientTimeout(total=None), which aiohttp treats as "no timeout"
(returns TimerNoop instead of enforcing a timeout).
This fixes TimeoutError exceptions that could occur during installation
on systems with slow network connections or when pulling large images.
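The aiohttp behavior the fix relies on, shown in isolation (illustrative request; the real code passes the timeout through aiodocker):

```python
import aiohttp

async def fetch_without_total_timeout(url: str) -> bytes:
    """A per-request ClientTimeout(total=None) overrides the session-wide limit."""
    session_timeout = aiohttp.ClientTimeout(total=900)  # Supervisor-style session default
    async with aiohttp.ClientSession(timeout=session_timeout) as session:
        # total=None means aiohttp enforces no overall deadline for this request,
        # which is what long-running image pulls need.
        async with session.get(url, timeout=aiohttp.ClientTimeout(total=None)) as resp:
            return await resp.read()
```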
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Fix pytests
---------
Co-authored-by: Claude <noreply@anthropic.com>
* Remove unknown errors from addons
* Remove customized unknown error types
* Fix docker ratelimit exception and tests
* Fix stats test and add more for known errors
* Add defined error for when build fails
* Fixes from feedback
* Fix mypy issues
* Fix test failure due to rename
* Change auth reset error message
* Fix typing for IPv6 addr-gen-mode and ip6-privacy settings
* Fix ConnectionStateType typing
* Rename ConnectionStateType to ConnectionState
The extra type suffix is unnecessary.
* Apply suggestions from code review
Co-authored-by: Jan Čermák <sairon@users.noreply.github.com>
---------
Co-authored-by: Jan Čermák <sairon@users.noreply.github.com>
The aiodocker images.import_image() method returns a coroutine that
needs to be awaited, but the code was iterating over it directly,
causing "TypeError: 'coroutine' object is not iterable".
Fixes SUPERVISOR-13D9
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-authored-by: Claude <noreply@anthropic.com>
* Use Docker's official registry domain detection logic
Replace the custom IMAGE_WITH_HOST regex with a proper implementation
based on Docker's reference parser (vendor/github.com/distribution/
reference/normalize.go).
Changes:
- Change DOCKER_HUB from "hub.docker.com" to "docker.io" (official default)
- Add DOCKER_HUB_LEGACY for backward compatibility with "hub.docker.com"
- Add IMAGE_DOMAIN_REGEX and get_domain() function that properly detects:
- localhost (with optional port)
- Domains with "." (e.g., ghcr.io, 127.0.0.1)
- Domains with ":" port (e.g., myregistry:5000)
- IPv6 addresses (e.g., [::1]:5000)
- Update credential handling to support both docker.io and hub.docker.com
- Add comprehensive tests for domain detection
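A simplified Python port of that heuristic (not the exact Supervisor regex):

```python
import re

DOCKER_HUB = "docker.io"

# The part before the first "/" is a registry only if it is "localhost",
# a bracketed IPv6 literal, or contains a "." or ":".
IMAGE_DOMAIN_REGEX = re.compile(
    r"^(?P<domain>localhost(?::\d+)?|\[[0-9A-Fa-f:]+\](?::\d+)?|[^/]+[.:][^/]*)/"
)

def get_domain(image: str) -> str:
    match = IMAGE_DOMAIN_REGEX.match(image)
    return match.group("domain") if match else DOCKER_HUB

assert get_domain("ghcr.io/home-assistant/supervisor") == "ghcr.io"
assert get_domain("myregistry:5000/app") == "myregistry:5000"
assert get_domain("[::1]:5000/app") == "[::1]:5000"
assert get_domain("library/ubuntu") == DOCKER_HUB
```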
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Refactor Docker domain detection to utils module
Move get_domain function to supervisor/docker/utils.py and rename it
to get_domain_from_image for consistency with get_registry_for_image.
Use named group in the regex for better readability.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Rename domain to registry for consistency
Use consistent "registry" terminology throughout the codebase:
- Rename get_domain_from_image to get_registry_from_image
- Rename IMAGE_DOMAIN_REGEX to IMAGE_REGISTRY_REGEX
- Update named group from "domain" to "registry"
- Update all related comments and variable names
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
---------
Co-authored-by: Claude <noreply@anthropic.com>
* Fix progress when using containerd snapshotter
* Add test for tiny image download under containerd-snapshotter
* Fix API tests after progress allocation change
* Fix test for auth changes
* Apply suggestions from code review
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: Stefan Agner <stefan@agner.ch>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Implement the Supervisor API for home-assistant/os-agent#238, adding the possibility to
schedule a migration to either the containerd overlayfs driver or the overlay2 graph
driver, taking effect on the next reboot. While it technically lives in the D-Bus OS
interface, in Supervisor's abstraction it makes more sense to put it under the
`/docker` endpoints.
* Pass registry credentials to add-on build for private base images
When building add-ons that use a base image from a private registry,
the build would fail because credentials configured via the Supervisor
API were not passed to the Docker-in-Docker build container.
This fix:
- Adds get_docker_config_json() to generate a Docker config.json with
registry credentials for the base image
- Creates a temporary config file and mounts it into the build container
at /root/.docker/config.json so BuildKit can authenticate when pulling
the base image
- Cleans up the temporary file after build completes
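The generated file is the standard Docker config.json auth format, roughly (helper name per the description above; credential lookup simplified):

```python
import base64
import json

def get_docker_config_json(registry: str, username: str, password: str) -> str:
    """Build a minimal config.json BuildKit can use to pull the private base image."""
    token = base64.b64encode(f"{username}:{password}".encode()).decode()
    return json.dumps({"auths": {registry: {"auth": token}}})

# Written to a temporary file and mounted at /root/.docker/config.json
# inside the Docker-in-Docker build container.
```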
Fixes #6354
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Fix pylint errors
* Apply suggestions from code review
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Refactor registry credential extraction into shared helper
Extract duplicate logic for determining which registry matches an image
into a shared `get_registry_for_image()` method in `DockerConfig`. This
method is now used by both `DockerInterface._get_credentials()` and
`AddonBuild.get_docker_config_json()`.
Move `DOCKER_HUB` and `IMAGE_WITH_HOST` constants to `docker/const.py`
to avoid circular imports between manager.py and interface.py.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Apply suggestions from code review
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* ruff format
* Document raises
---------
Co-authored-by: Claude <noreply@anthropic.com>
Co-authored-by: Mike Degatano <michael.degatano@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Fix private registry authentication for aiodocker image pulls
After PR #6252 migrated image pulling from dockerpy to aiodocker,
private registry authentication stopped working. The old _docker_login()
method stored credentials in ~/.docker/config.json via dockerpy, but
aiodocker doesn't read that file - it requires credentials passed
explicitly via the auth parameter.
Changes:
- Remove unused _docker_login() method (dockerpy login was ineffective)
- Pass credentials directly to pull_image() via new auth parameter
- Add auth parameter to DockerAPI.pull_image() method
- Add unit tests for Docker Hub and custom registry authentication
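Passing credentials explicitly looks roughly like this (sketch; credential lookup and registry matching omitted):

```python
import aiodocker

async def pull_private_image(image: str, username: str, password: str) -> None:
    docker = aiodocker.Docker()
    try:
        # aiodocker encodes the auth dict into the X-Registry-Auth header;
        # nothing is read from ~/.docker/config.json.
        await docker.images.pull(image, auth={"username": username, "password": password})
    finally:
        await docker.close()
```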
Fixes #6345
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Ignore protected access in test
* Fix plug-in pull test
* Fix HA core tests
---------
Co-authored-by: Claude <noreply@anthropic.com>
* Drop Debian 12 from supported OS list
With the deprecation of the Home Assistant Supervised installation method,
Debian 12 is no longer supported. This change removes Debian 12
from the list of supported operating systems in the evaluation logic.
* Improve tests
* Deprecate i386, armhf and armv7 Supervisor architectures
* Exclude Core from architecture deprecation checks
This still allows downloading the latest available Core version, even
on deprecated systems.
* Fix pytest
Add support for `no_colors` query parameter on all advanced logs API endpoints,
allowing users to optionally strip ANSI color sequences from log output. This
complements the existing color stripping on /latest endpoints added in #6319.
* Handle pull events with complete progress details only
Under certain circumstances, Docker seems to send pull events with
incomplete progress details (i.e., missing 'current' or 'total' fields).
In practice, we've observed an empty dictionary for progress details
as well as a missing 'total' field (while 'current' was present).
All events were observed with Docker 28.3.3 using the old, default Docker graph
backend.
* Fix docstring/comment
* Fix call_at to use event loop time base instead of Unix timestamp
The CoreSys.call_at method was incorrectly passing Unix timestamps
directly to asyncio.loop.call_at(), which expects times in the event
loop's monotonic time base. This caused scheduled jobs to be scheduled
approximately 55 years in the future (the difference between Unix epoch
time and monotonic time since boot).
The bug was masked by time-machine 2.19.0, which patched time.monotonic()
and caused loop.time() to return Unix timestamps. Time-machine 3.0.0
removed this patching (as it caused event loop freezes), exposing the bug.
Fix by converting the datetime to event loop time base:
- Calculate delay from current Unix time to scheduled Unix time
- Add delay to current event loop time to get scheduled loop time
Also simplify test_job_scheduled_at to avoid time-machine's async
context managers, following the pattern of test_job_scheduled_delay.
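The conversion at the heart of the fix (sketch; the real CoreSys.call_at wraps more bookkeeping):

```python
import asyncio
from datetime import datetime, timezone

def call_at(loop: asyncio.AbstractEventLoop, when: datetime, callback, *args) -> asyncio.TimerHandle:
    """Schedule a callback for a wall-clock datetime on an asyncio loop."""
    # loop.call_at() expects the loop's monotonic time base, not a Unix timestamp,
    # so convert via the delay between "now" and the requested wall-clock time.
    delay = when.timestamp() - datetime.now(timezone.utc).timestamp()
    return loop.call_at(loop.time() + delay, callback, *args)
```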
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Add comment about datetime in the past
---------
Co-authored-by: Claude <noreply@anthropic.com>
* Strip ANSI escape color sequences from /latest log responses
Strip ANSI sequences of CSI commands [1] used for log coloring from
/latest log endpoints. These endpoints were primarily designed for log
downloads, where colors are usually not wanted. Add an optional
argument for stripping the colors from the logs and enable it for the
/latest endpoints.
[1] https://en.wikipedia.org/wiki/ANSI_escape_code#CSIsection
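The stripping itself boils down to a CSI regex like the following (illustrative helper):

```python
import re

# ESC '[' + parameter bytes + intermediate bytes + a final byte (CSI sequence)
ANSI_CSI_RE = re.compile(r"\x1b\[[0-?]*[ -/]*[@-~]")

def strip_ansi_colors(line: str) -> str:
    """Remove CSI-based coloring from a log line."""
    return ANSI_CSI_RE.sub("", line)

assert strip_ansi_colors("\x1b[32mINFO\x1b[0m Starting") == "INFO Starting"
```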
* Refactor advanced logs' tests to use fixture factory
Introduce an `advanced_logs_tester` fixture to simplify testing of advanced logs
in the API tests, declaring all the needed fixtures in a single place.
* Migrate images from dockerpy to aiodocker
* Add missing coverage and fix bug in repair
* Bind libraries to different files and refactor images.pull
* Use the same socket again
Try using the same socket again.
* Fix pytest
---------
Co-authored-by: Stefan Agner <stefan@agner.ch>
* Add context to Sentry events during setup phase
Since not all properties are safe to access, the current code avoids
adding any context during the initialization and setup phase. However,
quite a few reports originate during the setup phase. This change adds some
context to events during the setup phase as well, to make debugging easier.
* Drop default arch (not available during setup)
* Avoid adding Content-Type to non-body responses
The current code sets the content-type header for all responses
to the result's content_type property if upstream does not set a
content_type. The default value for content_type is
"application/octet-stream".
For responses that do not have a body (like 204 No Content or
304 Not Modified), setting a content-type header is unnecessary and
potentially misleading. Follow HTTP standards by only adding the
content-type header to responses that actually contain a body.
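In aiohttp terms, the rule is roughly the following (sketch; the real proxy code works on the proxied upstream response):

```python
from aiohttp import hdrs, web

NO_BODY_STATUSES = frozenset({204, 304})

def maybe_set_content_type(response: web.Response, upstream_content_type: str) -> None:
    """Only add a Content-Type header when the status code allows a body."""
    if response.status in NO_BODY_STATUSES or response.status < 200:
        return
    if hdrs.CONTENT_TYPE not in response.headers:
        response.headers[hdrs.CONTENT_TYPE] = upstream_content_type
```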
* Add pytest for ingress proxy
* Preserve Content-Type header for HEAD requests in ingress API
* Fix docker image pull progress blocked by small layers
Small Docker layers (typically <100 bytes) can skip the downloading phase
entirely, going directly from "Pulling fs layer" to "Download complete"
without emitting any progress events with byte counts. This caused the
aggregate progress calculation to block indefinitely, as it required all
layer jobs to have their `extra` field populated with byte counts before
proceeding.
The issue manifested as parent job progress jumping from 0% to 97.9% after
long delays, as seen when a 96-byte layer held up progress reporting for
~50 seconds until it finally reached the "Extracting" phase.
Set a minimal `extra` field (current=1, total=1) when layers reach
"Download complete" without having gone through the downloading phase.
This allows the aggregate progress calculation to proceed immediately
while still correctly representing the layer as 100% downloaded.
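Schematically (the job structure is a stand-in for the real per-layer job objects):

```python
from dataclasses import dataclass

@dataclass
class LayerJob:
    """Minimal stand-in for a per-layer pull progress job."""
    name: str
    extra: dict | None = None

def on_download_complete(job: LayerJob) -> None:
    if job.extra is None:
        # The layer skipped the "Downloading" phase entirely (typically <100 bytes);
        # a synthetic 1/1 byte count lets the aggregate calculation proceed while
        # still representing the layer as 100% downloaded.
        job.extra = {"current": 1, "total": 1}
```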
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude <noreply@anthropic.com>
* Update test to capture issue correctly
* Improve pytest
* Fix pytest comment
* Fix pylint warning
---------
Co-authored-by: Claude <noreply@anthropic.com>
* Formally deprecate CodeNotary build config
* Remove CodeNotary specific integrity checking
The current code is specific to how CodeNotary was doing integrity
checking. A future integrity checking mechanism likely will work
differently (e.g. through EROFS based containers). Remove the current
code to make way for a future implementation.
* Drop CodeNotary integrity fixups
* Drop unused tests
* Fix pytest
* Fix pytest
* Remove CodeNotary related exceptions and handling
Remove CodeNotary related exceptions and handling from the Docker
interface.
* Drop unnecessary comment
* Remove Codenotary specific IssueType/SuggestionType
* Drop Codenotary specific environment and secret reference
* Remove unused constants
* Introduce APIGone exception for removed APIs
Introduce a new exception class APIGone to indicate that certain API
features have been removed and are no longer available. Update the
security integrity check endpoint to raise this new exception instead
of a generic APIError, providing clearer communication to clients that
the feature has been intentionally removed.
* Drop content trust
A cosign based signature verification will likely be named differently
to avoid confusion with existing implementations. For now, remove the
content trust option entirely.
* Drop code sign test
* Remove source_mods/content_trust evaluations
* Remove content_trust reference in bootstrap.py
* Fix security tests
* Drop unused tests
* Drop codenotary from schema
Since we have "remove extra" in voluptuous, we can remove the
codenotary field from the addon schema.
* Remove content_trust from tests
* Remove content_trust unsupported reason
* Remove unnecessary comment
* Remove unrelated pytest
* Remove unrelated fixtures
When fetching host info while UDisks2 was not connected over D-Bus, an exception was raised, disabling add-on management via Home Assistant (and other GUI functions that need the host info status).