mirror of
https://github.com/home-assistant/supervisor.git
synced 2026-05-19 14:18:53 +01:00
0ac8b42062
* Rework Supervisor connectivity check with coalescing and force flag

Previously, a failed connectivity probe could strand Supervisor in a "no connectivity" state indefinitely. After an Ethernet reconnect, a probe kicked by NetworkManager's connectivity transition could race with CoreDNS being restarted (due to DNS locals changing), time out on DNS, and leave supervisor.connectivity = False. The retry that _on_dns_container_running was meant to fire landed inside the 5 s JobThrottle window from the just-failed probe and was silently dropped, since JobThrottle.THROTTLE drops rather than waits.

The rework replaces the @Job(throttle=THROTTLE) decorator and the public connectivity setter with a single authoritative state-updating method:

- check_and_update_connectivity(force=False) is the only path that runs the HTTP probe and updates the cached state. Concurrent callers coalesce onto a single in-flight probe. A min-interval throttle lives inside the method and reuses the cached result within window instead of dropping calls.
- request_connectivity_check(force=False) is a fire-and-forget wrapper for signal handlers (D-Bus, plugin callbacks) that must return quickly without blocking signal dispatch on the HTTP round-trip.
- force=True bypasses the min-interval and, when a probe is in flight, sets a trailing-rerun flag so the owning task runs one more probe after the current one completes. Used for signals that carry fresh state-change information (NM connectivity transition to FULL, DNS container RUNNING, startup, post-NTP sync).
- _update_connectivity is the sole writer of the cached flag and emits SUPERVISOR_CONNECTIVITY_CHANGE only on actual transitions.

Call sites migrate accordingly. The opportunistic supervisor.connectivity = False writes in update_apparmor, updater.fetch_data, os.manager, and addon_pwned error paths are replaced with request_connectivity_check() calls so the probe remains authoritative - an endpoint-specific failure no longer lies about the overall connectivity state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Propagate connectivity-probe cancellation and skip last-check on cancel

Awaiting an asyncio.Task does not propagate cancellation INTO the task, so the previous owner-doesn't-shield comment was misleading: a cancelled owner left the spawned probe running orphaned, and the next caller could start a second probe alongside it. The owner now explicitly cancels and awaits the probe on CancelledError before re-raising.

The last-check timestamp is also moved out of the finally block so a cancelled probe does not leave a "fresh result just ran" cache behind that would short-circuit the next non-forced caller.

A regression test exercises both: that owner cancellation clears the in-flight reference and leaves the timestamp untouched, and that a subsequent non-forced check therefore still actually probes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Clarify why post-NTP-sync forces a connectivity probe

The previous comment claimed the last-check timestamp may be unreliable after a time jump, but _connectivity_last_check uses loop.time() which is monotonic and unaffected by wall-clock corrections. The real reason to force a fresh probe is TLS validation: certificates that appeared expired or not-yet-valid before the system clock was corrected may now verify, so a probe that just failed with an SSL error can succeed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add debug logs to Supervisor connectivity probe paths

The original stuck-offline bug was hard to spot in logs because the silent throttle-drop and the cached state had no audit trail. With debug-level logging at each decision point, a future investigation can reconstruct from a single log file:

- who requested a check (force flag distinguishes signal-driven probes from precondition / opportunistic-error-path requests)
- why a probe did not actually run (in-flight coalesce, cached within min-interval, owner cancellation)
- when a forced rerun was queued and when it ran (the precise failure mode that stranded the supervisor in the original incident)
- when the cached state actually flipped (with the previous value in the message so transitions are visible)

All new lines are debug-level. The existing _do_connectivity_check "failed" / "succeeded" lines are kept unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Skip system-checks fan-out in test_events_on_issue_changes

The test asserts that apply_suggestion fires an ISSUE_REMOVED event. ISSUE_REMOVED is fired by dismiss_issue inside FixupBase.__call__, before apply_suggestion calls healthcheck. The healthcheck call afterwards is incidental to this test's intent, but it fans out into check_system() which runs CheckDNSServer (A and AAAA) - real aiodns query_dns() probes against the NetworkManager mock's stub nameserver 192.168.30.1 that each hit the default ~10 s aiodns timeout. The file took ~21 s to run.

The slowness has been latent since #3818 (Aug 2022), which added the apply_suggestion step at the end of test_events_on_issue_changes two days after the DNS check landed in its current form (#3811). The default 24 h JobThrottle on CheckDNSServer.run_check tends to mask the cost in full-suite runs once any earlier test has tripped the throttle, which is likely why this slipped through.

Mock coresys.resolution.healthcheck for just this one apply_suggestion call rather than introducing a file-wide DNS mock. The patch is local to the slow call site and the test's assertion is unaffected. The file drops from ~21 s to ~2.5 s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
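The coalescing, min-interval, and trailing-rerun behaviour described in the first commit can be illustrated with a minimal sketch. This is not the Supervisor implementation: the class name CoalescingChecker, its check() method, and the probe and min_interval parameters are all hypothetical, and the sketch shields waiters from cancellation rather than reproducing the owner-cancels-probe handling described in the second commit.

```python
import asyncio
from collections.abc import Awaitable, Callable


class CoalescingChecker:
    """Hypothetical single-flight checker with a trailing-rerun flag."""

    def __init__(
        self, probe: Callable[[], Awaitable[bool]], min_interval: float = 5.0
    ) -> None:
        self._probe = probe
        self._min_interval = min_interval
        self._state = True
        self._last_check: float | None = None
        self._inflight: asyncio.Task[bool] | None = None
        self._rerun = False

    async def check(self, force: bool = False) -> bool:
        """Return the cached state, probing only when needed."""
        if self._inflight is None:
            now = asyncio.get_running_loop().time()
            if (
                not force
                and self._last_check is not None
                and now - self._last_check < self._min_interval
            ):
                # Within the min-interval: reuse the cached result.
                return self._state
            self._inflight = asyncio.create_task(self._run())
        elif force:
            # A probe is already running: request one more pass after it.
            self._rerun = True
        # shield() keeps one cancelled caller from cancelling the shared task.
        return await asyncio.shield(self._inflight)

    async def _run(self) -> bool:
        """Run the probe, repeating once per batch of forced requests."""
        try:
            while True:
                self._rerun = False
                self._state = await self._probe()
                # loop.time() is monotonic, so wall-clock jumps do not matter.
                self._last_check = asyncio.get_running_loop().time()
                if not self._rerun:
                    return self._state
        finally:
            self._inflight = None
```

In this sketch, any number of forced requests arriving during one probe collapse into a single extra probe, while non-forced callers simply join the probe that is already in flight.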
269 lines
9.0 KiB
Python
"""Test supervisor object."""

import asyncio
import errno
from unittest.mock import AsyncMock, MagicMock, Mock, patch

from aiohttp import ClientTimeout
from aiohttp.client_exceptions import ClientError
from awesomeversion import AwesomeVersion
import pytest

from supervisor.const import BusEvent, UpdateChannel
from supervisor.coresys import CoreSys
from supervisor.docker.supervisor import DockerSupervisor
from supervisor.exceptions import (
    DockerError,
    SupervisorAppArmorError,
    SupervisorUpdateError,
)
from supervisor.host.apparmor import AppArmorControl
from supervisor.resolution.const import ContextType, IssueType
from supervisor.resolution.data import Issue

from tests.common import MockResponse


@pytest.mark.parametrize(
    "side_effect,connectivity", [(ClientError(), False), (None, True)]
)
async def test_connectivity_check(
    coresys: CoreSys,
    websession: MagicMock,
    side_effect: Exception | None,
    connectivity: bool,
):
    """Test connectivity check updates state based on probe outcome."""
    assert coresys.supervisor.connectivity is True

    websession.head = AsyncMock(side_effect=side_effect)
    await coresys.supervisor.check_and_update_connectivity(force=True)

    assert coresys.supervisor.connectivity is connectivity


async def test_connectivity_check_min_interval_when_connected(
    coresys: CoreSys, websession: MagicMock
):
    """Non-forced checks within the min-interval use the cached state."""
    websession.head = AsyncMock()

    # First call runs the probe.
    await coresys.supervisor.check_and_update_connectivity()
    assert websession.head.call_count == 1

    # Second call within the (10 min) window should not hit the network.
    await coresys.supervisor.check_and_update_connectivity()
    assert websession.head.call_count == 1


async def test_connectivity_check_force_bypasses_min_interval(
    coresys: CoreSys, websession: MagicMock
):
    """force=True skips the min-interval short-circuit."""
    websession.head = AsyncMock()

    await coresys.supervisor.check_and_update_connectivity()
    assert websession.head.call_count == 1

    await coresys.supervisor.check_and_update_connectivity(force=True)
    assert websession.head.call_count == 2


async def test_connectivity_check_coalesces_concurrent_callers(
    coresys: CoreSys, websession: MagicMock
):
    """Concurrent callers await the same in-flight probe instead of each firing one."""
    probe_started = asyncio.Event()
    probe_release = asyncio.Event()

    async def slow_head(*args, **kwargs):
        probe_started.set()
        await probe_release.wait()

    websession.head = AsyncMock(side_effect=slow_head)

    first = asyncio.create_task(
        coresys.supervisor.check_and_update_connectivity(force=True)
    )
    await probe_started.wait()

    # Kick off a pile of additional callers while the first probe is in flight.
    concurrent = [
        asyncio.create_task(coresys.supervisor.check_and_update_connectivity())
        for _ in range(5)
    ]
    # Let them all reach the in-flight await.
    await asyncio.sleep(0)

    probe_release.set()
    await asyncio.gather(first, *concurrent)

    assert websession.head.call_count == 1


async def test_connectivity_check_force_during_in_flight_triggers_rerun(
    coresys: CoreSys, websession: MagicMock
):
    """A force signal arriving while a probe is in flight queues exactly one rerun."""
    probe_started = asyncio.Event()
    probe_release = asyncio.Event()

    async def first_then_fast(*args, **kwargs):
        if websession.head.call_count == 1:
            probe_started.set()
            await probe_release.wait()

    websession.head = AsyncMock(side_effect=first_then_fast)

    first = asyncio.create_task(
        coresys.supervisor.check_and_update_connectivity(force=True)
    )
    await probe_started.wait()

    # Forced call while a probe is in flight should set the rerun flag.
    forced = asyncio.create_task(
        coresys.supervisor.check_and_update_connectivity(force=True)
    )
    # Non-forced calls must NOT queue a rerun.
    cheap = asyncio.create_task(coresys.supervisor.check_and_update_connectivity())
    await asyncio.sleep(0)

    probe_release.set()
    await asyncio.gather(first, forced, cheap)

    assert websession.head.call_count == 2


async def test_connectivity_check_owner_cancellation_cancels_probe(
    coresys: CoreSys, websession: MagicMock
):
    """Owner cancellation propagates to the probe and skips updating last-check."""
    probe_started = asyncio.Event()
    probe_release = asyncio.Event()

    async def slow_head(*args, **kwargs):
        probe_started.set()
        await probe_release.wait()

    websession.head = AsyncMock(side_effect=slow_head)
    last_check_before = coresys.supervisor._connectivity_last_check  # pylint: disable=protected-access

    owner = asyncio.create_task(
        coresys.supervisor.check_and_update_connectivity(force=True)
    )
    await probe_started.wait()

    owner.cancel()
    with pytest.raises(asyncio.CancelledError):
        await owner

    # Owner cancellation must cancel the spawned probe, not orphan it,
    # and the cached last-check timestamp must NOT advance.
    assert coresys.supervisor._connectivity_check is None  # pylint: disable=protected-access
    assert coresys.supervisor._connectivity_last_check == last_check_before  # pylint: disable=protected-access

    # A subsequent non-forced call must therefore still run a probe.
    websession.head = AsyncMock()
    await coresys.supervisor.check_and_update_connectivity()
    assert websession.head.call_count == 1


async def test_update_connectivity_fires_event_on_change(coresys: CoreSys):
    """SUPERVISOR_CONNECTIVITY_CHANGE fires only when the cached value changes."""
    events: list[bool] = []

    async def listener(state: bool) -> None:
        events.append(state)

    coresys.bus.register_event(BusEvent.SUPERVISOR_CONNECTIVITY_CHANGE, listener)

    # Same value: no event.
    coresys.supervisor._update_connectivity(True)  # pylint: disable=protected-access
    # Change to False: one event.
    coresys.supervisor._update_connectivity(False)  # pylint: disable=protected-access
    # Change back to True: another event.
    coresys.supervisor._update_connectivity(True)  # pylint: disable=protected-access
    await asyncio.sleep(0)

    assert events == [False, True]


async def test_request_connectivity_check_is_fire_and_forget(
    coresys: CoreSys, websession: MagicMock
):
    """request_connectivity_check schedules a check that runs asynchronously."""
    websession.head = AsyncMock()

    # Synchronous call must return without awaiting the HTTP probe.
    result = coresys.supervisor.request_connectivity_check(force=True)
    assert result is None

    # Yield until the scheduled task has had a chance to complete.
    for _ in range(5):
        await asyncio.sleep(0)

    assert websession.head.call_count == 1


async def test_update_failed(coresys: CoreSys, capture_exception: Mock):
    """Test update failure."""
    # pylint: disable-next=protected-access
    coresys.updater._data.setdefault("image", {})["supervisor"] = (
        "ghcr.io/home-assistant/aarch64-hassio-supervisor"
    )
    err = DockerError()
    with (
        patch.object(DockerSupervisor, "install", side_effect=err),
        patch.object(type(coresys.supervisor), "update_apparmor"),
        pytest.raises(SupervisorUpdateError),
    ):
        await coresys.supervisor.update(AwesomeVersion("1.0"))

    capture_exception.assert_called_once_with(err)
    assert (
        Issue(IssueType.UPDATE_FAILED, ContextType.SUPERVISOR)
        in coresys.resolution.issues
    )


@pytest.mark.parametrize(
    "channel", [UpdateChannel.STABLE, UpdateChannel.BETA, UpdateChannel.DEV]
)
async def test_update_apparmor(
    coresys: CoreSys, channel: UpdateChannel, websession: MagicMock, tmp_supervisor_data
):
    """Test updating apparmor."""
    websession.get = Mock(return_value=MockResponse())
    coresys.updater.channel = channel
    with (
        patch.object(AppArmorControl, "load_profile") as load_profile,
    ):
        await coresys.supervisor.update_apparmor()

    websession.get.assert_called_once_with(
        f"https://version.home-assistant.io/apparmor_{channel}.txt",
        timeout=ClientTimeout(total=10),
    )
    load_profile.assert_called_once()


async def test_update_apparmor_error(
    coresys: CoreSys, websession: MagicMock, tmp_supervisor_data
):
    """Test error updating apparmor profile."""
    websession.get = Mock(return_value=MockResponse())
    with (
        patch.object(AppArmorControl, "load_profile"),
        patch("supervisor.supervisor.Path.write_text", side_effect=(err := OSError())),
    ):
        err.errno = errno.EBUSY
        with pytest.raises(SupervisorAppArmorError):
            await coresys.supervisor.update_apparmor()
        assert coresys.core.healthy is True

        err.errno = errno.EBADMSG
        with pytest.raises(SupervisorAppArmorError):
            await coresys.supervisor.update_apparmor()
        assert coresys.core.healthy is False