1
0
mirror of https://github.com/home-assistant/supervisor.git synced 2026-05-19 22:28:52 +01:00
Files
supervisor/tests/plugins/test_dns.py
T
Stefan Agner 0ac8b42062 Rework Supervisor connectivity check with coalescing and force flag (#6765)
* Rework Supervisor connectivity check with coalescing and force flag

Previously, a failed connectivity probe could strand Supervisor in a
"no connectivity" state indefinitely. After an Ethernet reconnect, a
probe kicked by NetworkManager's connectivity transition could race
with CoreDNS being restarted (due to DNS locals changing), time out on
DNS, and leave supervisor.connectivity = False. The retry that
_on_dns_container_running was meant to fire landed inside the 5 s
JobThrottle window from the just-failed probe and was silently dropped,
since JobThrottle.THROTTLE drops rather than waits.

The rework replaces the @Job(throttle=THROTTLE) decorator and the
public connectivity setter with a single authoritative state-updating
method:

- check_and_update_connectivity(force=False) is the only path that
  runs the HTTP probe and updates the cached state. Concurrent callers
  coalesce onto a single in-flight probe. A min-interval throttle
  lives inside the method and reuses the cached result within window
  instead of dropping calls.
- request_connectivity_check(force=False) is a fire-and-forget wrapper
  for signal handlers (D-Bus, plugin callbacks) that must return
  quickly without blocking signal dispatch on the HTTP round-trip.
- force=True bypasses the min-interval and, when a probe is in flight,
  sets a trailing-rerun flag so the owning task runs one more probe
  after the current one completes. Used for signals that carry fresh
  state-change information (NM connectivity transition to FULL, DNS
  container RUNNING, startup, post-NTP sync).
- _update_connectivity is the sole writer of the cached flag and
  emits SUPERVISOR_CONNECTIVITY_CHANGE only on actual transitions.

Call sites migrate accordingly. The opportunistic
supervisor.connectivity = False writes in update_apparmor,
updater.fetch_data, os.manager, and addon_pwned error paths are
replaced with request_connectivity_check() calls so the probe remains
authoritative - an endpoint-specific failure no longer lies about the
overall connectivity state.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Propagate connectivity-probe cancellation and skip last-check on cancel

Awaiting an asyncio.Task does not propagate cancellation INTO the task,
so the previous owner-doesn't-shield comment was misleading: a cancelled
owner left the spawned probe running orphaned, and the next caller could
start a second probe alongside it. The owner now explicitly cancels and
awaits the probe on CancelledError before re-raising.

The last-check timestamp is also moved out of the finally block so a
cancelled probe does not leave a "fresh result just ran" cache behind
that would short-circuit the next non-forced caller.

A regression test exercises both: that owner cancellation clears the
in-flight reference and leaves the timestamp untouched, and that a
subsequent non-forced check therefore still actually probes.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Clarify why post-NTP-sync forces a connectivity probe

The previous comment claimed the last-check timestamp may be unreliable
after a time jump, but _connectivity_last_check uses loop.time() which
is monotonic and unaffected by wall-clock corrections. The real reason
to force a fresh probe is TLS validation: certificates that appeared
expired or not-yet-valid before the system clock was corrected may now
verify, so a probe that just failed with an SSL error can succeed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add debug logs to Supervisor connectivity probe paths

The original stuck-offline bug was hard to spot in logs because the
silent throttle-drop and the cached state had no audit trail. With
debug-level logging at each decision point, a future investigation can
reconstruct from a single log file:

- who requested a check (force flag distinguishes signal-driven probes
  from precondition / opportunistic-error-path requests)
- why a probe did not actually run (in-flight coalesce, cached within
  min-interval, owner cancellation)
- when a forced rerun was queued and when it ran (the precise failure
  mode that stranded the supervisor in the original incident)
- when the cached state actually flipped (with the previous value in
  the message so transitions are visible)

All new lines are debug-level. The existing _do_connectivity_check
"failed" / "succeeded" lines are kept unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Skip system-checks fan-out in test_events_on_issue_changes

The test asserts that apply_suggestion fires an ISSUE_REMOVED event.
ISSUE_REMOVED is fired by dismiss_issue inside FixupBase.__call__, before
apply_suggestion calls healthcheck. The healthcheck call afterwards is
incidental to this test's intent, but it fans out into check_system()
which runs CheckDNSServer (A and AAAA) - real aiodns query_dns() probes
against the NetworkManager mock's stub nameserver 192.168.30.1 that each
hit the default ~10 s aiodns timeout. The file took ~21 s to run.

The slowness has been latent since #3818 (Aug 2022), which added the
apply_suggestion step at the end of test_events_on_issue_changes two
days after the DNS check landed in its current form (#3811). The default
24 h JobThrottle on CheckDNSServer.run_check tends to mask the cost in
full-suite runs once any earlier test has tripped the throttle, which is
likely why this slipped through.

Mock coresys.resolution.healthcheck for just this one apply_suggestion
call rather than introducing a file-wide DNS mock. The patch is local to
the slow call site and the test's assertion is unaffected. The file
drops from ~21 s to ~2.5 s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-04-29 10:14:13 +02:00

479 lines
16 KiB
Python

"""Test DNS plugin."""
import asyncio
import errno
from ipaddress import IPv4Address
from pathlib import Path
from unittest.mock import AsyncMock, Mock, patch
from aiodocker.containers import DockerContainer
import pytest
from supervisor.const import BusEvent, LogLevel
from supervisor.coresys import CoreSys
from supervisor.docker.const import ContainerState
from supervisor.docker.dns import DockerDNS
from supervisor.docker.monitor import DockerContainerStateEvent
from supervisor.plugins.dns import HostEntry, PluginDns
from supervisor.resolution.const import ContextType, IssueType, SuggestionType
from supervisor.resolution.data import Issue, Suggestion
@pytest.fixture(name="docker_interface")
async def fixture_docker_interface() -> tuple[AsyncMock, AsyncMock]:
"""Mock docker interface methods."""
with (
patch.object(DockerDNS, "run") as run,
patch.object(DockerDNS, "restart") as restart,
):
yield (run, restart)
@pytest.fixture(name="write_json")
async def fixture_write_json() -> Mock:
"""Mock json file writer."""
with patch("supervisor.plugins.dns.write_json_file") as write_json_file:
yield write_json_file
@pytest.fixture(name="mock_call_later")
def fixture_mock_call_later(coresys: CoreSys):
"""Mock sys_call_later with zero delay for testing."""
def mock_call_later(_delay, *args, **kwargs) -> asyncio.TimerHandle:
"""Mock to remove delay."""
return coresys.call_later(0, *args, **kwargs)
return mock_call_later
async def test_config_write(
coresys: CoreSys,
docker_interface: tuple[AsyncMock, AsyncMock],
write_json: Mock,
):
"""Test config write on DNS start and restart."""
assert coresys.plugins.dns.locals == ["dns://192.168.30.1"]
coresys.plugins.dns.servers = ["dns://1.1.1.1", "dns://8.8.8.8"]
await coresys.plugins.dns.start()
docker_interface[0].assert_called_once()
docker_interface[1].assert_not_called()
write_json.assert_called_once_with(
Path("/data/dns/coredns.json"),
{
"servers": ["dns://1.1.1.1", "dns://8.8.8.8"],
"locals": ["dns://192.168.30.1"],
"fallback": True,
"debug": False,
},
)
docker_interface[0].reset_mock()
write_json.reset_mock()
coresys.plugins.dns.servers = ["dns://8.8.8.8"]
coresys.plugins.dns.fallback = False
coresys.config.logging = LogLevel.DEBUG
await coresys.plugins.dns.restart()
docker_interface[0].assert_not_called()
docker_interface[1].assert_called_once()
write_json.assert_called_once_with(
Path("/data/dns/coredns.json"),
{
"servers": ["dns://8.8.8.8"],
"locals": ["dns://192.168.30.1"],
"fallback": False,
"debug": True,
},
)
async def test_reset(coresys: CoreSys):
"""Test reset returns dns plugin to defaults."""
coresys.plugins.dns.servers = ["dns://1.1.1.1", "dns://8.8.8.8"]
coresys.plugins.dns.fallback = False
coresys.plugins.dns._loop = True # pylint: disable=protected-access
assert len(coresys.apps.installed) == 0
with (
patch.object(type(coresys.plugins.dns.hosts), "unlink") as unlink,
patch.object(type(coresys.plugins.dns), "write_hosts") as write_hosts,
):
await coresys.plugins.dns.reset()
assert coresys.plugins.dns.servers == []
assert coresys.plugins.dns.fallback is True
assert coresys.plugins.dns._loop is False # pylint: disable=protected-access
unlink.assert_called_once()
write_hosts.assert_called_once()
# Verify the hosts data structure is properly initialized
# pylint: disable=protected-access
assert coresys.plugins.dns._hosts == [
HostEntry(
ip_address=IPv4Address("127.0.0.1"),
names=["localhost", "localhost.local.hass.io"],
),
HostEntry(
ip_address=IPv4Address("172.30.32.2"),
names=[
"hassio",
"hassio.local.hass.io",
"supervisor",
"supervisor.local.hass.io",
],
),
HostEntry(
ip_address=IPv4Address("172.30.32.1"),
names=[
"homeassistant",
"homeassistant.local.hass.io",
"home-assistant",
"home-assistant.local.hass.io",
],
),
HostEntry(
ip_address=IPv4Address("172.30.32.3"),
names=["dns", "dns.local.hass.io"],
),
HostEntry(
ip_address=IPv4Address("172.30.32.6"),
names=["observer", "observer.local.hass.io"],
),
]
async def test_loop_detection_on_failure(coresys: CoreSys, container: DockerContainer):
"""Test loop detection when coredns fails."""
assert len(coresys.resolution.issues) == 0
assert len(coresys.resolution.suggestions) == 0
with (
patch.object(DockerDNS, "attach"),
patch.object(DockerDNS, "is_running", return_value=True),
):
await coresys.plugins.dns.load()
with (
patch.object(PluginDns, "rebuild") as rebuild,
patch.object(
DockerDNS,
"current_state",
side_effect=[
ContainerState.FAILED,
ContainerState.FAILED,
],
),
):
coresys.bus.fire_event(
BusEvent.DOCKER_CONTAINER_STATE_CHANGE,
DockerContainerStateEvent(
name="hassio_dns",
state=ContainerState.FAILED,
id="abc123",
time=1,
),
)
await asyncio.sleep(0)
assert len(coresys.resolution.issues) == 0
assert len(coresys.resolution.suggestions) == 0
rebuild.assert_called_once()
rebuild.reset_mock()
container.log.return_value = ["plugin/loop: Loop"]
coresys.bus.fire_event(
BusEvent.DOCKER_CONTAINER_STATE_CHANGE,
DockerContainerStateEvent(
name="hassio_dns",
state=ContainerState.FAILED,
id="abc123",
time=1,
),
)
await asyncio.sleep(0)
assert coresys.resolution.issues == [
Issue(IssueType.DNS_LOOP, ContextType.PLUGIN, "dns")
]
assert coresys.resolution.suggestions == [
Suggestion(SuggestionType.EXECUTE_RESET, ContextType.PLUGIN, "dns")
]
rebuild.assert_called_once()
async def test_load_error(
coresys: CoreSys, caplog: pytest.LogCaptureFixture, container
):
"""Test error reading config files during load."""
with (
patch("supervisor.plugins.dns.Path.read_text", side_effect=(err := OSError())),
patch("supervisor.plugins.dns.Path.write_text", side_effect=err),
):
err.errno = errno.EBUSY
await coresys.plugins.dns.load()
assert "Can't read resolve.tmpl" in caplog.text
assert "Can't read hosts.tmpl" in caplog.text
assert coresys.core.healthy is True
caplog.clear()
err.errno = errno.EBADMSG
await coresys.plugins.dns.load()
assert "Can't read resolve.tmpl" in caplog.text
assert "Can't read hosts.tmpl" in caplog.text
assert coresys.core.healthy is False
async def test_load_error_writing_resolv(
coresys: CoreSys, caplog: pytest.LogCaptureFixture, container
):
"""Test error writing resolv during load."""
with patch(
"supervisor.plugins.dns.Path.write_text", side_effect=(err := OSError())
):
err.errno = errno.EBUSY
await coresys.plugins.dns.load()
assert "Can't write/update /etc/resolv.conf" in caplog.text
assert coresys.core.healthy is True
caplog.clear()
err.errno = errno.EBADMSG
await coresys.plugins.dns.load()
assert "Can't write/update /etc/resolv.conf" in caplog.text
assert coresys.core.healthy is False
async def test_notify_locals_changed_end_to_end_with_changes_and_running(
coresys: CoreSys, mock_call_later
):
"""Test notify_locals_changed end-to-end: local DNS changes detected and plugin restarted."""
dns_plugin = coresys.plugins.dns
# Set cached locals to something different from current network state
current_locals = dns_plugin._compute_locals()
dns_plugin._cached_locals = (
["dns://192.168.1.1"]
if current_locals != ["dns://192.168.1.1"]
else ["dns://192.168.1.2"]
)
with (
patch.object(dns_plugin, "restart") as mock_restart,
patch.object(dns_plugin.instance, "is_running", return_value=True),
patch.object(dns_plugin, "sys_call_later", new=mock_call_later),
):
# Call notify_locals_changed
dns_plugin.notify_locals_changed()
# Wait for the async task to complete
await asyncio.sleep(0.1)
# Verify restart was called and cached locals were updated
mock_restart.assert_called_once()
assert dns_plugin._cached_locals == current_locals
async def test_notify_locals_changed_end_to_end_with_changes_but_not_running(
coresys: CoreSys, mock_call_later
):
"""Test notify_locals_changed end-to-end: local DNS changes detected but plugin not running."""
dns_plugin = coresys.plugins.dns
# Set cached locals to something different from current network state
current_locals = dns_plugin._compute_locals()
dns_plugin._cached_locals = (
["dns://192.168.1.1"]
if current_locals != ["dns://192.168.1.1"]
else ["dns://192.168.1.2"]
)
with (
patch.object(dns_plugin, "restart") as mock_restart,
patch.object(dns_plugin.instance, "is_running", return_value=False),
patch.object(dns_plugin, "sys_call_later", new=mock_call_later),
):
# Call notify_locals_changed
dns_plugin.notify_locals_changed()
# Wait for the async task to complete
await asyncio.sleep(0.1)
# Verify restart was NOT called but cached locals were still updated
mock_restart.assert_not_called()
assert dns_plugin._cached_locals == current_locals
async def test_notify_locals_changed_end_to_end_no_changes(
coresys: CoreSys, mock_call_later
):
"""Test notify_locals_changed end-to-end: no local DNS changes detected."""
dns_plugin = coresys.plugins.dns
# Set cached locals to match current network state
current_locals = dns_plugin._compute_locals()
dns_plugin._cached_locals = current_locals
with (
patch.object(dns_plugin, "restart") as mock_restart,
patch.object(dns_plugin, "sys_call_later", new=mock_call_later),
):
# Call notify_locals_changed
dns_plugin.notify_locals_changed()
# Wait for the async task to complete
await asyncio.sleep(0.1)
# Verify restart was NOT called since no changes
mock_restart.assert_not_called()
assert dns_plugin._cached_locals == current_locals
async def test_notify_locals_changed_debouncing_cancels_previous_timer(
coresys: CoreSys,
):
"""Test notify_locals_changed debouncing cancels previous timer before creating new one."""
dns_plugin = coresys.plugins.dns
# Set cached locals to trigger change detection
current_locals = dns_plugin._compute_locals()
dns_plugin._cached_locals = (
["dns://192.168.1.1"]
if current_locals != ["dns://192.168.1.1"]
else ["dns://192.168.1.2"]
)
call_count = 0
handles = []
def mock_call_later_with_tracking(_delay, *args, **kwargs) -> asyncio.TimerHandle:
"""Mock to remove delay and track calls."""
nonlocal call_count
call_count += 1
handle = coresys.call_later(0, *args, **kwargs)
handles.append(handle)
return handle
with (
patch.object(dns_plugin, "restart") as mock_restart,
patch.object(dns_plugin.instance, "is_running", return_value=True),
patch.object(dns_plugin, "sys_call_later", new=mock_call_later_with_tracking),
):
# First call sets up timer
dns_plugin.notify_locals_changed()
assert call_count == 1
first_handle = dns_plugin._locals_changed_handle
assert first_handle is not None
# Second call should cancel first timer and create new one
dns_plugin.notify_locals_changed()
assert call_count == 2
second_handle = dns_plugin._locals_changed_handle
assert second_handle is not None
assert first_handle != second_handle
# Wait for the async task to complete
await asyncio.sleep(0.1)
# Verify restart was called once for the final timer
mock_restart.assert_called_once()
assert dns_plugin._cached_locals == current_locals
async def test_stop_cancels_pending_timers_and_tasks(coresys: CoreSys):
"""Test stop cancels pending locals change timers and restart tasks to prevent resource leaks."""
dns_plugin = coresys.plugins.dns
mock_timer_handle = Mock()
mock_task_handle = Mock()
dns_plugin._locals_changed_handle = mock_timer_handle
dns_plugin._restart_after_locals_change_handle = mock_task_handle
with patch.object(dns_plugin.instance, "stop"):
await dns_plugin.stop()
# Should cancel pending timer and task, then clean up
mock_timer_handle.cancel.assert_called_once()
mock_task_handle.cancel.assert_called_once()
assert dns_plugin._locals_changed_handle is None
assert dns_plugin._restart_after_locals_change_handle is None
async def test_dns_restart_triggers_connectivity_check(coresys: CoreSys):
"""Test end-to-end that DNS container restart triggers connectivity check."""
dns_plugin = coresys.plugins.dns
# Load the plugin to register the event listener
with (
patch.object(type(dns_plugin.instance), "attach"),
patch.object(type(dns_plugin.instance), "is_running", return_value=True),
):
await dns_plugin.load()
# Verify listener was registered (connectivity check listener should be stored)
assert dns_plugin._connectivity_check_listener is not None
# Create event to signal when connectivity check is requested
connectivity_check_event = asyncio.Event()
# Mock the fire-and-forget request to set the event when called
def mock_request_connectivity_check(*, force: bool = False):
connectivity_check_event.set()
with (
patch.object(
coresys.supervisor,
"request_connectivity_check",
side_effect=mock_request_connectivity_check,
),
patch("supervisor.plugins.dns.asyncio.sleep") as mock_sleep,
):
# Fire the DNS container state change event through bus system
coresys.bus.fire_event(
BusEvent.DOCKER_CONTAINER_STATE_CHANGE,
DockerContainerStateEvent(
name="hassio_dns",
state=ContainerState.RUNNING,
id="test_id",
time=1234567890,
),
)
# Wait for connectivity check to be called
await asyncio.wait_for(connectivity_check_event.wait(), timeout=1.0)
# Verify sleep was called with correct delay
mock_sleep.assert_called_once_with(5)
# Reset and test that other containers don't trigger check
connectivity_check_event.clear()
mock_sleep.reset_mock()
# Fire event for different container
coresys.bus.fire_event(
BusEvent.DOCKER_CONTAINER_STATE_CHANGE,
DockerContainerStateEvent(
name="hassio_homeassistant",
state=ContainerState.RUNNING,
id="test_id",
time=1234567890,
),
)
# Wait a bit and verify connectivity check was NOT triggered
try:
await asyncio.wait_for(connectivity_check_event.wait(), timeout=0.1)
assert False, (
"Connectivity check should not have been called for other containers"
)
except TimeoutError:
# This is expected - connectivity check should not be called
pass
# Verify sleep was not called for other containers
mock_sleep.assert_not_called()