mirror of https://github.com/home-assistant/supervisor.git synced 2026-05-30 11:33:05 +01:00
Commit Graph

13 Commits

Author SHA1 Message Date
Stefan Agner 0ac8b42062 Rework Supervisor connectivity check with coalescing and force flag (#6765)
* Rework Supervisor connectivity check with coalescing and force flag

Previously, a failed connectivity probe could strand Supervisor in a
"no connectivity" state indefinitely. After an Ethernet reconnect, a
probe kicked by NetworkManager's connectivity transition could race
with CoreDNS being restarted (due to DNS locals changing), time out on
DNS, and leave supervisor.connectivity = False. The retry that
_on_dns_container_running was meant to fire landed inside the 5 s
JobThrottle window from the just-failed probe and was silently dropped,
since JobThrottle.THROTTLE drops rather than waits.

The rework replaces the @Job(throttle=THROTTLE) decorator and the
public connectivity setter with a single authoritative state-updating
method:

- check_and_update_connectivity(force=False) is the only path that
  runs the HTTP probe and updates the cached state. Concurrent callers
  coalesce onto a single in-flight probe. A min-interval throttle
  lives inside the method and reuses the cached result within the window
  instead of dropping calls.
- request_connectivity_check(force=False) is a fire-and-forget wrapper
  for signal handlers (D-Bus, plugin callbacks) that must return
  quickly without blocking signal dispatch on the HTTP round-trip.
- force=True bypasses the min-interval and, when a probe is in flight,
  sets a trailing-rerun flag so the owning task runs one more probe
  after the current one completes. Used for signals that carry fresh
  state-change information (NM connectivity transition to FULL, DNS
  container RUNNING, startup, post-NTP sync).
- _update_connectivity is the sole writer of the cached flag and
  emits SUPERVISOR_CONNECTIVITY_CHANGE only on actual transitions.
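
The coalescing, min-interval, and trailing-rerun behaviour listed above can
be sketched roughly as follows. This is a minimal illustration, not the
Supervisor code: the ConnectivityChecker class, the injected probe callable,
and the transitions list (standing in for the change event) are all
hypothetical, and owner-cancellation handling is left out.

```python
import asyncio


class ConnectivityChecker:
    """Hypothetical sketch of a coalescing connectivity check."""

    def __init__(self, probe, min_interval=5.0):
        self._probe = probe        # async callable returning bool (assumption)
        self._min_interval = min_interval
        self._task = None          # in-flight probe task, if any
        self._rerun = False        # trailing-rerun flag set by force=True
        self._last_check = None    # loop.time() of the last finished probe
        self.connectivity = None
        self.transitions = []      # stands in for emitting a change event

    def _update_connectivity(self, state):
        # Sole writer of the cached flag; records only real transitions.
        if state != self.connectivity:
            self.connectivity = state
            self.transitions.append(state)

    async def _run(self):
        loop = asyncio.get_running_loop()
        while True:
            state = await self._probe()
            self._last_check = loop.time()
            self._update_connectivity(state)
            if not self._rerun:
                return
            self._rerun = False    # a forced caller joined: probe once more

    async def check_and_update_connectivity(self, force=False):
        loop = asyncio.get_running_loop()
        task = self._task
        if task is not None and not task.done():
            if force:
                self._rerun = True
            # shield() so a cancelled joiner does not cancel the shared probe.
            await asyncio.shield(task)
            return self.connectivity
        if (
            not force
            and self._last_check is not None
            and loop.time() - self._last_check < self._min_interval
        ):
            return self.connectivity  # reuse the cached result, do not drop
        self._rerun = False
        task = self._task = loop.create_task(self._run())
        await task
        return self.connectivity


async def demo():
    calls = 0

    async def probe():
        nonlocal calls
        calls += 1
        await asyncio.sleep(0.01)
        return True

    checker = ConnectivityChecker(probe)
    # Three concurrent callers share one probe.
    await asyncio.gather(
        *(checker.check_and_update_connectivity() for _ in range(3))
    )
    coalesced = calls
    # A non-forced call inside the window reuses the cached state.
    await checker.check_and_update_connectivity()
    throttled = calls
    # A forced call that joins an in-flight probe queues one rerun.
    first = asyncio.create_task(checker.check_and_update_connectivity(force=True))
    await asyncio.sleep(0)
    await checker.check_and_update_connectivity(force=True)
    await first
    return coalesced, throttled, calls, checker.transitions


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

In this sketch a forced caller that joins an in-flight probe never starts a
second concurrent probe; it only asks the owning task to loop once more.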

Call sites migrate accordingly. The opportunistic
supervisor.connectivity = False writes in update_apparmor,
updater.fetch_data, os.manager, and addon_pwned error paths are
replaced with request_connectivity_check() calls so the probe remains
authoritative - an endpoint-specific failure no longer lies about the
overall connectivity state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Propagate connectivity-probe cancellation and skip last-check on cancel

Awaiting an asyncio.Task does not propagate cancellation INTO the task,
so the previous owner-doesn't-shield comment was misleading: a cancelled
owner left the spawned probe running orphaned, and the next caller could
start a second probe alongside it. The owner now explicitly cancels and
awaits the probe on CancelledError before re-raising.

The last-check timestamp is also moved out of the finally block so a
cancelled probe does not leave a "fresh result just ran" cache behind
that would short-circuit the next non-forced caller.

A regression test exercises both: that owner cancellation clears the
in-flight reference and leaves the timestamp untouched, and that a
subsequent non-forced check therefore still actually probes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Clarify why post-NTP-sync forces a connectivity probe

The previous comment claimed the last-check timestamp may be unreliable
after a time jump, but _connectivity_last_check uses loop.time() which
is monotonic and unaffected by wall-clock corrections. The real reason
to force a fresh probe is TLS validation: certificates that appeared
expired or not-yet-valid before the system clock was corrected may now
verify, so a probe that just failed with an SSL error can succeed.
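
To illustrate why the timestamp itself is not the problem, here is a small
sketch comparing a min-interval check driven by a wall clock with one driven
by a monotonic clock. MinInterval and FakeClock are hypothetical helpers
invented for this example; the fake clocks only simulate an NTP correction.

```python
class FakeClock:
    """Hypothetical controllable clock used only for this illustration."""

    def __init__(self, now):
        self.now = now

    def __call__(self):
        return self.now


class MinInterval:
    """Hypothetical freshness check against an injected clock."""

    def __init__(self, clock, interval):
        self._clock = clock
        self._interval = interval
        self._last = None

    def mark(self):
        self._last = self._clock()

    def is_fresh(self):
        return (
            self._last is not None
            and self._clock() - self._last < self._interval
        )


wall = FakeClock(0.0)     # wall clock before NTP sync (simulated)
mono = FakeClock(100.0)   # monotonic clock, like loop.time()

wall_throttle = MinInterval(wall, interval=5.0)
mono_throttle = MinInterval(mono, interval=5.0)
wall_throttle.mark()
mono_throttle.mark()

# One second passes, then NTP corrects the wall clock by a large amount.
mono.now += 1.0
wall.now += 1.0 + 1_700_000_000.0

wall_fresh = wall_throttle.is_fresh()   # distorted by the time jump
mono_fresh = mono_throttle.is_fresh()   # unaffected by the time jump
```

Only the wall-clock variant is disturbed by the correction, which matches the
point above: with a monotonic timestamp, the reason to force a probe after
NTP sync has to be something else, here TLS validation.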

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add debug logs to Supervisor connectivity probe paths

The original stuck-offline bug was hard to spot in logs because the
silent throttle-drop and the cached state had no audit trail. With
debug-level logging at each decision point, a future investigation can
reconstruct from a single log file:

- who requested a check (force flag distinguishes signal-driven probes
  from precondition / opportunistic-error-path requests)
- why a probe did not actually run (in-flight coalesce, cached within
  min-interval, owner cancellation)
- when a forced rerun was queued and when it ran (the precise failure
  mode that stranded the supervisor in the original incident)
- when the cached state actually flipped (with the previous value in
  the message so transitions are visible)

All new lines are debug-level. The existing _do_connectivity_check
"failed" / "succeeded" lines are kept unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Skip system-checks fan-out in test_events_on_issue_changes

The test asserts that apply_suggestion fires an ISSUE_REMOVED event.
ISSUE_REMOVED is fired by dismiss_issue inside FixupBase.__call__, before
apply_suggestion calls healthcheck. The healthcheck call afterwards is
incidental to this test's intent, but it fans out into check_system()
which runs CheckDNSServer (A and AAAA) - real aiodns query_dns() probes
against the NetworkManager mock's stub nameserver 192.168.30.1 that each
hit the default ~10 s aiodns timeout. The file took ~21 s to run.

The slowness has been latent since #3818 (Aug 2022), which added the
apply_suggestion step at the end of test_events_on_issue_changes two
days after the DNS check landed in its current form (#3811). The default
24 h JobThrottle on CheckDNSServer.run_check tends to mask the cost in
full-suite runs once any earlier test has tripped the throttle, which is
likely why this slipped through.

Mock coresys.resolution.healthcheck for just this one apply_suggestion
call rather than introducing a file-wide DNS mock. The patch is local to
the slow call site and the test's assertion is unaffected. The file
drops from ~21 s to ~2.5 s.
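
A minimal sketch of this kind of call-site-local mock, using only
unittest.mock from the standard library. FakeResolution is a hypothetical
stand-in for coresys.resolution, and the ten-second sleep merely imitates the
slow DNS probes; the real test's fixtures and event names differ.

```python
import asyncio
from unittest.mock import AsyncMock, patch


class FakeResolution:
    """Hypothetical stand-in for coresys.resolution."""

    def __init__(self):
        self.events = []

    async def healthcheck(self):
        await asyncio.sleep(10)  # imitates the slow DNS probes

    async def apply_suggestion(self):
        self.events.append("issue_removed")  # fired before the healthcheck
        await self.healthcheck()


async def run_test():
    resolution = FakeResolution()
    fake_healthcheck = AsyncMock()
    # Patch only for this one call; the original method is restored on exit.
    with patch.object(resolution, "healthcheck", new=fake_healthcheck):
        await resolution.apply_suggestion()
    fake_healthcheck.assert_awaited_once()
    return resolution.events


if __name__ == "__main__":
    print(asyncio.run(run_test()))
```

Because the event is recorded before healthcheck is awaited, the assertion on
the event is unaffected by the mock.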

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 10:14:13 +02:00
Mike Degatano ba8c49935b Refactor internal addon references to app/apps (#6717)
* Rename addon→app in docstrings and comments

Updates all docstrings and inline comments across supervisor/ and
tests/ to use the new app/apps terminology. No runtime behaviour
is changed by this commit.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Rename addon→app in code (variables, args, class names, functions)

Renames all internal Python identifiers from addon/addons to app/apps:
- Variable and argument names
- Function and method names
- Class names (Addon→App, AddonManager→AppManager, DockerAddon→DockerApp,
  all exception, check, and fixup classes, etc.)
- String literals used as Python identifiers (pytest fixtures,
  parametrize param names, patch.object attribute strings,
  URL route match_info keys)

External API contracts are preserved: JSON keys, error codes,
discovery protocol fields, TypedDict/attr.s field names.
Import module paths (supervisor/addons/) are also unchanged.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix partial backup/restore API to remap addons key to apps

The external API accepts `addons` as the request body key (since
ATTR_APPS = "addons"), but do_backup_partial and do_restore_partial
now take an `apps` parameter after the rename. The **body expansion
in both endpoints would pass `addons=...` causing a TypeError.

Remap the key before expansion in both backup_partial and
restore_partial:

    if ATTR_APPS in body:
        body["apps"] = body.pop(ATTR_APPS)

Also adds test_restore_partial_with_addons_key to verify the restore
path correctly receives apps= when addons is passed in the request
body. This path had no existing test coverage.
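
The failure and the remap can be reproduced in isolation with a small sketch.
The function bodies, the copy of the body dict, and the request values below
are hypothetical; only ATTR_APPS = "addons" and the remap itself come from
the description above.

```python
ATTR_APPS = "addons"  # external request key, as stated above


def do_backup_partial(name=None, apps=None):
    """Hypothetical stand-in for the renamed internal method."""
    return {"name": name, "apps": apps}


def backup_partial(body):
    """Hypothetical endpoint handling with the remap applied."""
    body = dict(body)  # copy so the caller's dict is not mutated
    if ATTR_APPS in body:
        body["apps"] = body.pop(ATTR_APPS)
    return do_backup_partial(**body)


request_body = {"name": "nightly", "addons": ["core_ssh"]}

# Without the remap, **body passes addons=... and Python rejects it.
try:
    do_backup_partial(**request_body)
    unmapped_error = None
except TypeError as err:
    unmapped_error = err

# With the remap, the internal method receives apps=...
result = backup_partial(request_body)
```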

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

* Fix merge error

* Adjust AppLoggerAdapter to use app_name

Co-authored-by: Stefan Agner <stefan@agner.ch>

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Stefan Agner <stefan@agner.ch>
2026-04-14 16:47:20 +02:00
Mike Degatano a122b5f1e9 Migrate info, events and container logs to aiodocker (#6514)
* Migrate info and events to aiodocker

* Migrate container logs to aiodocker

* Fix dns plugin loop test

* Fix mocking for docker info

* Fixes from feedback

* Harden monitor error handling

* Deleted failing tests because they were not useful
2026-02-03 18:36:41 +01:00
Stefan Agner 9a0f530a2f Add Supervisor connectivity check after DNS restart (#6005)
* Add Supervisor connectivity check after DNS restart

When the DNS plug-in got restarted, check Supervisor connectivity
in case the DNS plug-in configuration change influenced Supervisor
connectivity. This is helpful when a DHCP server gets started after
Home Assistant is up. In that case the network provided DNS server
(local DNS server) becomes available after the DNS plug-in restart.

Without this change, the Supervisor connectivity will remain false
until a Job triggers a connectivity check, for example the
periodic update check (which causes an updater and store reload) by
Core.

* Fix pytest and add coverage for new functionality
2025-07-10 11:08:10 +02:00
Stefan Agner 953f7d01d7 Improve DNS plug-in restart (#5999)
* Improve DNS plug-in restart

Instead of simply going by PrimaryConnection changes, use the DnsManager
Configuration property. This property is ultimately used to write the
DNS plug-in configuration, so it is really the relevant information
we pass on to the plug-in.

* Check for changes and restart DNS plugin

* Check for changes in plug-in DNS

Cache last local (NetworkManager) provided DNS servers. Check against
this DNS server list when deciding when to restart the DNS plug-in.
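
The cache-and-compare idea can be sketched as below. LocalsTracker and its
method name are hypothetical, and treating a reordered list as a change is an
assumption of this example rather than something stated above.

```python
class LocalsTracker:
    """Hypothetical cache of the last locally provided DNS servers."""

    def __init__(self):
        self._cached_locals = None

    def locals_changed(self, servers):
        """Return True (and update the cache) only when the list differs."""
        servers = list(servers)
        if servers == self._cached_locals:
            return False
        self._cached_locals = servers
        return True


tracker = LocalsTracker()
first = tracker.locals_changed(["dns://192.168.30.1"])
repeat = tracker.locals_changed(["dns://192.168.30.1"])
changed = tracker.locals_changed(["dns://192.168.30.1", "dns://192.168.30.2"])
```

A caller would restart the DNS plug-in only when the method returns True, so
repeated notifications with an unchanged server list cause no restart.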

* Check connectivity unthrottled in certain situations

* Fix pytest

* Fix pytest

* Improve test coverage for DNS plugins restart functionality

* Apply suggestions from code review

Co-authored-by: Mike Degatano <michael.degatano@gmail.com>

* Debounce local DNS changes and event based connectivity checks

* Remove connection check logic

* Remove unthrottled connectivity check

* Fix delayed call

* Store restart task and cancel in case a restart is running

* Improve DNS configuration change tests

* Remove stale code

* Improve DNS plug-in tests, less mocking

* Cover multiple private functions at once

Improve tests around notify_locals_changed() to cover multiple
functions at once.

---------

Co-authored-by: Mike Degatano <michael.degatano@gmail.com>
2025-07-09 11:35:03 +02:00
Mike Degatano 0e8ace949a Fix mypy issues in plugins and resolution (#5946)
* Fix mypy issues in plugins

* Fix mypy issues in resolution module

* fix misses in resolution check

* Fix signatures on evaluate methods

* nitpick fix suggestions
2025-06-16 14:12:47 -04:00
Stefan Agner f6faa18409 Bump pre-commit ruff to 0.5.7 and reformat (#5242)
It seems that the codebase is not formatted with the latest ruff
version. This PR reformats the codebase with ruff 0.5.7.
2024-08-13 20:53:56 +02:00
Mike Degatano 3cc6bd19ad Mark system as unhealthy on OSError Bad message errors (#4750)
* Bad message error marks system as unhealthy

* Finish adding test cases for changes

* Rename test file for uniqueness

* bad_message to oserror_bad_message

* Omit some checks and check for network mounts
2023-12-21 18:05:29 +01:00
Mike Degatano 1611beccd1 Add job group execution limit option (#4457)
* Add job group execution limit option

* Fix pylint issues

* Assign variable before usage

* Cleanup jobs when done

* Remove isinstance check for performance

* Explicitly raise from None

* Add some more documentation info
2023-08-08 16:49:17 -04:00
Mike Degatano 1f92ab42ca Reduce executor code for docker (#4438)
* Reduce executor code for docker

* Fix pylint errors and move import/export image

* Fix test and a couple other risky executor calls

* Fix dataclass and return

* Fix test case and add one for corrupt docker

* Add some coverage

* Undo changes to docker manager startup
2023-07-18 11:39:39 -04:00
Mike Degatano d19166bb86 Docker events based watchdog and docker healthchecks (#3725)
* Docker events based watchdog

* Separate monitor from DockerAPI since it needs coresys

* Move monitor into dockerAPI

* Fix properties on coresys

* Add watchdog tests

* Added tests

* pylint issue

* Current state failures test

* Thread-safe event processing

* Use labels property
2022-07-15 09:21:59 +02:00
Mike Degatano 8bb4596d04 Add API option to disable fallback DNS (#3586)
* Add API option to disable fallback DNS

* DNS unsupported evaluation and fallback in sentry
2022-04-25 18:15:40 +02:00
Mike Degatano f3e2ccce43 Create issue for detected DNS server problem (#3578)
* Create issue for detected DNS server problem

* Validate behavior on restart as well

* tls:// not supported, remove check

* Move DNS server checks into resolution checks

* Revert all changes to plugins.dns

* Run DNS server checks if affected

* Mock aiodns query during all checks tests
2022-04-21 10:55:49 +02:00