mirror of https://github.com/home-assistant/supervisor.git synced 2026-05-19 14:18:53 +01:00
supervisor/tests/test_supervisor.py
Stefan Agner 0ac8b42062 Rework Supervisor connectivity check with coalescing and force flag (#6765)
* Rework Supervisor connectivity check with coalescing and force flag

Previously, a failed connectivity probe could strand Supervisor in a
"no connectivity" state indefinitely. After an Ethernet reconnect, a
probe kicked by NetworkManager's connectivity transition could race
with CoreDNS being restarted (due to DNS locals changing), time out on
DNS, and leave supervisor.connectivity = False. The retry that
_on_dns_container_running was meant to fire landed inside the 5 s
JobThrottle window from the just-failed probe and was silently dropped,
since JobThrottle.THROTTLE drops rather than waits.

The rework replaces the @Job(throttle=THROTTLE) decorator and the
public connectivity setter with a single authoritative state-updating
method:

- check_and_update_connectivity(force=False) is the only path that
  runs the HTTP probe and updates the cached state. Concurrent callers
  coalesce onto a single in-flight probe. A min-interval throttle
  lives inside the method and reuses the cached result within the
  window instead of dropping calls.
- request_connectivity_check(force=False) is a fire-and-forget wrapper
  for signal handlers (D-Bus, plugin callbacks) that must return
  quickly without blocking signal dispatch on the HTTP round-trip.
- force=True bypasses the min-interval and, when a probe is in flight,
  sets a trailing-rerun flag so the owning task runs one more probe
  after the current one completes. Used for signals that carry fresh
  state-change information (NM connectivity transition to FULL, DNS
  container RUNNING, startup, post-NTP sync).
- _update_connectivity is the sole writer of the cached flag and
  emits SUPERVISOR_CONNECTIVITY_CHANGE only on actual transitions.
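
The coalescing and trailing-rerun behaviour described in the list above
can be sketched roughly as follows. This is a minimal illustration, not
the Supervisor implementation: the class name CoalescingChecker, its
attribute names, and the injected probe callable are all hypothetical,
and cancellation handling is left out entirely.

```python
import asyncio
from collections.abc import Awaitable, Callable


class CoalescingChecker:
    """Hypothetical stand-in for the connectivity state holder."""

    def __init__(
        self, probe: Callable[[], Awaitable[bool]], min_interval: float = 600.0
    ) -> None:
        self._probe = probe  # async callable that returns True when online
        self._min_interval = min_interval  # seconds; value is an assumption
        self._task = None  # the single in-flight probe task, if any
        self._rerun = False  # set by forced callers while a probe runs
        self._last_check = None  # loop.time() of the last completed probe
        self.connectivity = True

    async def _run(self) -> None:
        try:
            while True:
                self._rerun = False
                self.connectivity = await self._probe()
                self._last_check = asyncio.get_running_loop().time()
                if not self._rerun:
                    return
                # A forced caller arrived mid-probe: run one more probe.
        finally:
            self._task = None

    async def check(self, force: bool = False) -> bool:
        if self._task is None:
            now = asyncio.get_running_loop().time()
            fresh = (
                self._last_check is not None
                and now - self._last_check < self._min_interval
            )
            if fresh and not force:
                # Within the min-interval: reuse the cached result.
                return self.connectivity
            self._task = asyncio.create_task(self._run())
        elif force:
            # Probe already in flight: queue exactly one trailing rerun.
            self._rerun = True
        # Every caller waits on the same task; shield keeps a cancelled
        # waiter from cancelling the shared probe in this sketch.
        await asyncio.shield(self._task)
        return self.connectivity
```

In this sketch any number of forced callers arriving during one probe
collapse into a single rerun, because the flag is a boolean rather than
a counter.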

Call sites migrate accordingly. The opportunistic
supervisor.connectivity = False writes in update_apparmor,
updater.fetch_data, os.manager, and addon_pwned error paths are
replaced with request_connectivity_check() calls so the probe remains
authoritative - an endpoint-specific failure no longer lies about the
overall connectivity state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Propagate connectivity-probe cancellation and skip last-check on cancel

Awaiting an asyncio.Task does not propagate cancellation INTO the task,
so the previous owner-doesn't-shield comment was misleading: a cancelled
owner left the spawned probe running orphaned, and the next caller could
start a second probe alongside it. The owner now explicitly cancels and
awaits the probe on CancelledError before re-raising.

The last-check timestamp is also moved out of the finally block so a
cancelled probe does not leave a "fresh result just ran" cache behind
that would short-circuit the next non-forced caller.

A regression test exercises both: that owner cancellation clears the
in-flight reference and leaves the timestamp untouched, and that a
subsequent non-forced check therefore still actually probes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Clarify why post-NTP-sync forces a connectivity probe

The previous comment claimed the last-check timestamp may be unreliable
after a time jump, but _connectivity_last_check uses loop.time() which
is monotonic and unaffected by wall-clock corrections. The real reason
to force a fresh probe is TLS validation: certificates that appeared
expired or not-yet-valid before the system clock was corrected may now
verify, so a probe that just failed with an SSL error can succeed.
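
The TLS argument can be illustrated with a small sketch. The helper
name within_validity and all dates below are made up for illustration;
real certificate validation involves far more than this window check.

```python
from datetime import datetime, timezone


def within_validity(not_before: datetime, not_after: datetime, now: datetime) -> bool:
    """Return True when `now` falls inside the certificate validity window."""
    return not_before <= now <= not_after


# Hypothetical certificate validity window.
not_before = datetime(2026, 1, 1, tzinfo=timezone.utc)
not_after = datetime(2026, 12, 31, tzinfo=timezone.utc)

# Hypothetical wall-clock readings before and after NTP corrects the clock.
clock_before_sync = datetime(1970, 1, 1, tzinfo=timezone.utc)
clock_after_sync = datetime(2026, 4, 29, tzinfo=timezone.utc)

# Before the sync the certificate looks not-yet-valid; afterwards it is valid.
valid_before_sync = within_validity(not_before, not_after, clock_before_sync)
valid_after_sync = within_validity(not_before, not_after, clock_after_sync)
```

A probe made with the uncorrected clock fails validation even though
the network is reachable, which is why a cached failure from before the
sync is worth discarding.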

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add debug logs to Supervisor connectivity probe paths

The original stuck-offline bug was hard to spot in logs because the
silent throttle-drop and the cached state had no audit trail. With
debug-level logging at each decision point, a future investigation can
reconstruct from a single log file:

- who requested a check (force flag distinguishes signal-driven probes
  from precondition / opportunistic-error-path requests)
- why a probe did not actually run (in-flight coalesce, cached within
  min-interval, owner cancellation)
- when a forced rerun was queued and when it ran (the precise failure
  mode that stranded the supervisor in the original incident)
- when the cached state actually flipped (with the previous value in
  the message so transitions are visible)

All new lines are debug-level. The existing _do_connectivity_check
"failed" / "succeeded" lines are kept unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Skip system-checks fan-out in test_events_on_issue_changes

The test asserts that apply_suggestion fires an ISSUE_REMOVED event.
ISSUE_REMOVED is fired by dismiss_issue inside FixupBase.__call__, before
apply_suggestion calls healthcheck. The healthcheck call afterwards is
incidental to this test's intent, but it fans out into check_system()
which runs CheckDNSServer (A and AAAA) - real aiodns query_dns() probes
against the NetworkManager mock's stub nameserver 192.168.30.1 that each
hit the default ~10 s aiodns timeout. The file took ~21 s to run.

The slowness has been latent since #3818 (Aug 2022), which added the
apply_suggestion step at the end of test_events_on_issue_changes two
days after the DNS check landed in its current form (#3811). The default
24 h JobThrottle on CheckDNSServer.run_check tends to mask the cost in
full-suite runs once any earlier test has tripped the throttle, which is
likely why this slipped through.

Mock coresys.resolution.healthcheck for just this one apply_suggestion
call rather than introducing a file-wide DNS mock. The patch is local to
the slow call site and the test's assertion is unaffected. The file
drops from ~21 s to ~2.5 s.
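
The local-mock approach can be sketched as below. The Resolution class
here is a hypothetical stand-in written for this example, not the
Supervisor resolution manager; only patch.object and AsyncMock are real
unittest.mock APIs.

```python
import asyncio
from unittest.mock import AsyncMock, patch


class Resolution:
    """Hypothetical stand-in with a slow healthcheck."""

    def __init__(self) -> None:
        self.events: list[str] = []

    async def healthcheck(self) -> None:
        # Stands in for the slow DNS probes described above.
        await asyncio.sleep(10)

    async def apply_suggestion(self) -> None:
        # The event under test fires before the incidental healthcheck.
        self.events.append("issue_removed")
        await self.healthcheck()


async def run_test() -> list[str]:
    resolution = Resolution()
    healthcheck = AsyncMock()
    # Patch only for this one call instead of mocking DNS everywhere.
    with patch.object(resolution, "healthcheck", healthcheck):
        await resolution.apply_suggestion()
    healthcheck.assert_awaited_once()
    return resolution.events
```

Scoping the patch to the single slow call keeps the rest of the file
exercising the unmocked code paths.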

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 10:14:13 +02:00


"""Test supervisor object."""

import asyncio
import errno
from unittest.mock import AsyncMock, MagicMock, Mock, patch

from aiohttp import ClientTimeout
from aiohttp.client_exceptions import ClientError
from awesomeversion import AwesomeVersion
import pytest

from supervisor.const import BusEvent, UpdateChannel
from supervisor.coresys import CoreSys
from supervisor.docker.supervisor import DockerSupervisor
from supervisor.exceptions import (
    DockerError,
    SupervisorAppArmorError,
    SupervisorUpdateError,
)
from supervisor.host.apparmor import AppArmorControl
from supervisor.resolution.const import ContextType, IssueType
from supervisor.resolution.data import Issue

from tests.common import MockResponse


@pytest.mark.parametrize(
    "side_effect,connectivity", [(ClientError(), False), (None, True)]
)
async def test_connectivity_check(
    coresys: CoreSys,
    websession: MagicMock,
    side_effect: Exception | None,
    connectivity: bool,
):
    """Test connectivity check updates state based on probe outcome."""
    assert coresys.supervisor.connectivity is True

    websession.head = AsyncMock(side_effect=side_effect)
    await coresys.supervisor.check_and_update_connectivity(force=True)

    assert coresys.supervisor.connectivity is connectivity


async def test_connectivity_check_min_interval_when_connected(
    coresys: CoreSys, websession: MagicMock
):
    """Non-forced checks within the min-interval use the cached state."""
    websession.head = AsyncMock()

    # First call runs the probe.
    await coresys.supervisor.check_and_update_connectivity()
    assert websession.head.call_count == 1

    # Second call within the (10 min) window should not hit the network.
    await coresys.supervisor.check_and_update_connectivity()
    assert websession.head.call_count == 1


async def test_connectivity_check_force_bypasses_min_interval(
    coresys: CoreSys, websession: MagicMock
):
    """force=True skips the min-interval short-circuit."""
    websession.head = AsyncMock()

    await coresys.supervisor.check_and_update_connectivity()
    assert websession.head.call_count == 1

    await coresys.supervisor.check_and_update_connectivity(force=True)
    assert websession.head.call_count == 2


async def test_connectivity_check_coalesces_concurrent_callers(
    coresys: CoreSys, websession: MagicMock
):
    """Concurrent callers await the same in-flight probe instead of each firing one."""
    probe_started = asyncio.Event()
    probe_release = asyncio.Event()

    async def slow_head(*args, **kwargs):
        probe_started.set()
        await probe_release.wait()

    websession.head = AsyncMock(side_effect=slow_head)

    first = asyncio.create_task(
        coresys.supervisor.check_and_update_connectivity(force=True)
    )
    await probe_started.wait()

    # Kick off a pile of additional callers while the first probe is in flight.
    concurrent = [
        asyncio.create_task(coresys.supervisor.check_and_update_connectivity())
        for _ in range(5)
    ]
    # Let them all reach the in-flight await.
    await asyncio.sleep(0)

    probe_release.set()
    await asyncio.gather(first, *concurrent)

    assert websession.head.call_count == 1


async def test_connectivity_check_force_during_in_flight_triggers_rerun(
    coresys: CoreSys, websession: MagicMock
):
    """A force signal arriving while a probe is in flight queues exactly one rerun."""
    probe_started = asyncio.Event()
    probe_release = asyncio.Event()

    async def first_then_fast(*args, **kwargs):
        if websession.head.call_count == 1:
            probe_started.set()
            await probe_release.wait()

    websession.head = AsyncMock(side_effect=first_then_fast)

    first = asyncio.create_task(
        coresys.supervisor.check_and_update_connectivity(force=True)
    )
    await probe_started.wait()

    # Forced call while a probe is in flight should set the rerun flag.
    forced = asyncio.create_task(
        coresys.supervisor.check_and_update_connectivity(force=True)
    )
    # Non-forced calls must NOT queue a rerun.
    cheap = asyncio.create_task(coresys.supervisor.check_and_update_connectivity())
    await asyncio.sleep(0)

    probe_release.set()
    await asyncio.gather(first, forced, cheap)

    assert websession.head.call_count == 2


async def test_connectivity_check_owner_cancellation_cancels_probe(
    coresys: CoreSys, websession: MagicMock
):
    """Owner cancellation propagates to the probe and skips updating last-check."""
    probe_started = asyncio.Event()
    probe_release = asyncio.Event()

    async def slow_head(*args, **kwargs):
        probe_started.set()
        await probe_release.wait()

    websession.head = AsyncMock(side_effect=slow_head)

    last_check_before = coresys.supervisor._connectivity_last_check  # pylint: disable=protected-access

    owner = asyncio.create_task(
        coresys.supervisor.check_and_update_connectivity(force=True)
    )
    await probe_started.wait()

    owner.cancel()
    with pytest.raises(asyncio.CancelledError):
        await owner

    # Owner cancellation must cancel the spawned probe, not orphan it,
    # and the cached last-check timestamp must NOT advance.
    assert coresys.supervisor._connectivity_check is None  # pylint: disable=protected-access
    assert coresys.supervisor._connectivity_last_check == last_check_before  # pylint: disable=protected-access

    # A subsequent non-forced call must therefore still run a probe.
    websession.head = AsyncMock()
    await coresys.supervisor.check_and_update_connectivity()
    assert websession.head.call_count == 1


async def test_update_connectivity_fires_event_on_change(coresys: CoreSys):
    """SUPERVISOR_CONNECTIVITY_CHANGE fires only when the cached value changes."""
    events: list[bool] = []

    async def listener(state: bool) -> None:
        events.append(state)

    coresys.bus.register_event(BusEvent.SUPERVISOR_CONNECTIVITY_CHANGE, listener)

    # Same value: no event.
    coresys.supervisor._update_connectivity(True)  # pylint: disable=protected-access
    # Change to False: one event.
    coresys.supervisor._update_connectivity(False)  # pylint: disable=protected-access
    # Change back to True: another event.
    coresys.supervisor._update_connectivity(True)  # pylint: disable=protected-access

    await asyncio.sleep(0)
    assert events == [False, True]


async def test_request_connectivity_check_is_fire_and_forget(
    coresys: CoreSys, websession: MagicMock
):
    """request_connectivity_check schedules a check that runs asynchronously."""
    websession.head = AsyncMock()

    # Synchronous call must return without awaiting the HTTP probe.
    result = coresys.supervisor.request_connectivity_check(force=True)
    assert result is None

    # Yield until the scheduled task has had a chance to complete.
    for _ in range(5):
        await asyncio.sleep(0)

    assert websession.head.call_count == 1


async def test_update_failed(coresys: CoreSys, capture_exception: Mock):
    """Test update failure."""
    # pylint: disable-next=protected-access
    coresys.updater._data.setdefault("image", {})["supervisor"] = (
        "ghcr.io/home-assistant/aarch64-hassio-supervisor"
    )
    err = DockerError()
    with (
        patch.object(DockerSupervisor, "install", side_effect=err),
        patch.object(type(coresys.supervisor), "update_apparmor"),
        pytest.raises(SupervisorUpdateError),
    ):
        await coresys.supervisor.update(AwesomeVersion("1.0"))

    capture_exception.assert_called_once_with(err)
    assert (
        Issue(IssueType.UPDATE_FAILED, ContextType.SUPERVISOR)
        in coresys.resolution.issues
    )


@pytest.mark.parametrize(
    "channel", [UpdateChannel.STABLE, UpdateChannel.BETA, UpdateChannel.DEV]
)
async def test_update_apparmor(
    coresys: CoreSys, channel: UpdateChannel, websession: MagicMock, tmp_supervisor_data
):
    """Test updating apparmor."""
    websession.get = Mock(return_value=MockResponse())
    coresys.updater.channel = channel
    with (
        patch.object(AppArmorControl, "load_profile") as load_profile,
    ):
        await coresys.supervisor.update_apparmor()

        websession.get.assert_called_once_with(
            f"https://version.home-assistant.io/apparmor_{channel}.txt",
            timeout=ClientTimeout(total=10),
        )
        load_profile.assert_called_once()


async def test_update_apparmor_error(
    coresys: CoreSys, websession: MagicMock, tmp_supervisor_data
):
    """Test error updating apparmor profile."""
    websession.get = Mock(return_value=MockResponse())
    with (
        patch.object(AppArmorControl, "load_profile"),
        patch("supervisor.supervisor.Path.write_text", side_effect=(err := OSError())),
    ):
        err.errno = errno.EBUSY
        with pytest.raises(SupervisorAppArmorError):
            await coresys.supervisor.update_apparmor()
        assert coresys.core.healthy is True

        err.errno = errno.EBADMSG
        with pytest.raises(SupervisorAppArmorError):
            await coresys.supervisor.update_apparmor()
        assert coresys.core.healthy is False