mirror of
https://github.com/home-assistant/supervisor.git
synced 2026-05-19 22:28:52 +01:00
0ac8b42062
* Rework Supervisor connectivity check with coalescing and force flag Previously, a failed connectivity probe could strand Supervisor in a "no connectivity" state indefinitely. After an Ethernet reconnect, a probe kicked by NetworkManager's connectivity transition could race with CoreDNS being restarted (due to DNS locals changing), time out on DNS, and leave supervisor.connectivity = False. The retry that _on_dns_container_running was meant to fire landed inside the 5 s JobThrottle window from the just-failed probe and was silently dropped, since JobThrottle.THROTTLE drops rather than waits. The rework replaces the @Job(throttle=THROTTLE) decorator and the public connectivity setter with a single authoritative state-updating method: - check_and_update_connectivity(force=False) is the only path that runs the HTTP probe and updates the cached state. Concurrent callers coalesce onto a single in-flight probe. A min-interval throttle lives inside the method and reuses the cached result within window instead of dropping calls. - request_connectivity_check(force=False) is a fire-and-forget wrapper for signal handlers (D-Bus, plugin callbacks) that must return quickly without blocking signal dispatch on the HTTP round-trip. - force=True bypasses the min-interval and, when a probe is in flight, sets a trailing-rerun flag so the owning task runs one more probe after the current one completes. Used for signals that carry fresh state-change information (NM connectivity transition to FULL, DNS container RUNNING, startup, post-NTP sync). - _update_connectivity is the sole writer of the cached flag and emits SUPERVISOR_CONNECTIVITY_CHANGE only on actual transitions. Call sites migrate accordingly. The opportunistic supervisor.connectivity = False writes in update_apparmor, updater.fetch_data, os.manager, and addon_pwned error paths are replaced with request_connectivity_check() calls so the probe remains authoritative - an endpoint-specific failure no longer lies about the overall connectivity state. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Propagate connectivity-probe cancellation and skip last-check on cancel Awaiting an asyncio.Task does not propagate cancellation INTO the task, so the previous owner-doesn't-shield comment was misleading: a cancelled owner left the spawned probe running orphaned, and the next caller could start a second probe alongside it. The owner now explicitly cancels and awaits the probe on CancelledError before re-raising. The last-check timestamp is also moved out of the finally block so a cancelled probe does not leave a "fresh result just ran" cache behind that would short-circuit the next non-forced caller. A regression test exercises both: that owner cancellation clears the in-flight reference and leaves the timestamp untouched, and that a subsequent non-forced check therefore still actually probes. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Clarify why post-NTP-sync forces a connectivity probe The previous comment claimed the last-check timestamp may be unreliable after a time jump, but _connectivity_last_check uses loop.time() which is monotonic and unaffected by wall-clock corrections. The real reason to force a fresh probe is TLS validation: certificates that appeared expired or not-yet-valid before the system clock was corrected may now verify, so a probe that just failed with an SSL error can succeed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Add debug logs to Supervisor connectivity probe paths The original stuck-offline bug was hard to spot in logs because the silent throttle-drop and the cached state had no audit trail. With debug-level logging at each decision point, a future investigation can reconstruct from a single log file: - who requested a check (force flag distinguishes signal-driven probes from precondition / opportunistic-error-path requests) - why a probe did not actually run (in-flight coalesce, cached within min-interval, owner cancellation) - when a forced rerun was queued and when it ran (the precise failure mode that stranded the supervisor in the original incident) - when the cached state actually flipped (with the previous value in the message so transitions are visible) All new lines are debug-level. The existing _do_connectivity_check "failed" / "succeeded" lines are kept unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> * Skip system-checks fan-out in test_events_on_issue_changes The test asserts that apply_suggestion fires an ISSUE_REMOVED event. ISSUE_REMOVED is fired by dismiss_issue inside FixupBase.__call__, before apply_suggestion calls healthcheck. The healthcheck call afterwards is incidental to this test's intent, but it fans out into check_system() which runs CheckDNSServer (A and AAAA) - real aiodns query_dns() probes against the NetworkManager mock's stub nameserver 192.168.30.1 that each hit the default ~10 s aiodns timeout. The file took ~21 s to run. The slowness has been latent since #3818 (Aug 2022), which added the apply_suggestion step at the end of test_events_on_issue_changes two days after the DNS check landed in its current form (#3811). The default 24 h JobThrottle on CheckDNSServer.run_check tends to mask the cost in full-suite runs once any earlier test has tripped the throttle, which is likely why this slipped through. Mock coresys.resolution.healthcheck for just this one apply_suggestion call rather than introducing a file-wide DNS mock. The patch is local to the slow call site and the test's assertion is unaffected. The file drops from ~21 s to ~2.5 s. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com> --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
479 lines
16 KiB
Python
479 lines
16 KiB
Python
"""Test DNS plugin."""
|
|
|
|
import asyncio
|
|
import errno
|
|
from ipaddress import IPv4Address
|
|
from pathlib import Path
|
|
from unittest.mock import AsyncMock, Mock, patch
|
|
|
|
from aiodocker.containers import DockerContainer
|
|
import pytest
|
|
|
|
from supervisor.const import BusEvent, LogLevel
|
|
from supervisor.coresys import CoreSys
|
|
from supervisor.docker.const import ContainerState
|
|
from supervisor.docker.dns import DockerDNS
|
|
from supervisor.docker.monitor import DockerContainerStateEvent
|
|
from supervisor.plugins.dns import HostEntry, PluginDns
|
|
from supervisor.resolution.const import ContextType, IssueType, SuggestionType
|
|
from supervisor.resolution.data import Issue, Suggestion
|
|
|
|
|
|
@pytest.fixture(name="docker_interface")
|
|
async def fixture_docker_interface() -> tuple[AsyncMock, AsyncMock]:
|
|
"""Mock docker interface methods."""
|
|
with (
|
|
patch.object(DockerDNS, "run") as run,
|
|
patch.object(DockerDNS, "restart") as restart,
|
|
):
|
|
yield (run, restart)
|
|
|
|
|
|
@pytest.fixture(name="write_json")
|
|
async def fixture_write_json() -> Mock:
|
|
"""Mock json file writer."""
|
|
with patch("supervisor.plugins.dns.write_json_file") as write_json_file:
|
|
yield write_json_file
|
|
|
|
|
|
@pytest.fixture(name="mock_call_later")
|
|
def fixture_mock_call_later(coresys: CoreSys):
|
|
"""Mock sys_call_later with zero delay for testing."""
|
|
|
|
def mock_call_later(_delay, *args, **kwargs) -> asyncio.TimerHandle:
|
|
"""Mock to remove delay."""
|
|
return coresys.call_later(0, *args, **kwargs)
|
|
|
|
return mock_call_later
|
|
|
|
|
|
async def test_config_write(
|
|
coresys: CoreSys,
|
|
docker_interface: tuple[AsyncMock, AsyncMock],
|
|
write_json: Mock,
|
|
):
|
|
"""Test config write on DNS start and restart."""
|
|
assert coresys.plugins.dns.locals == ["dns://192.168.30.1"]
|
|
coresys.plugins.dns.servers = ["dns://1.1.1.1", "dns://8.8.8.8"]
|
|
|
|
await coresys.plugins.dns.start()
|
|
docker_interface[0].assert_called_once()
|
|
docker_interface[1].assert_not_called()
|
|
|
|
write_json.assert_called_once_with(
|
|
Path("/data/dns/coredns.json"),
|
|
{
|
|
"servers": ["dns://1.1.1.1", "dns://8.8.8.8"],
|
|
"locals": ["dns://192.168.30.1"],
|
|
"fallback": True,
|
|
"debug": False,
|
|
},
|
|
)
|
|
|
|
docker_interface[0].reset_mock()
|
|
write_json.reset_mock()
|
|
coresys.plugins.dns.servers = ["dns://8.8.8.8"]
|
|
coresys.plugins.dns.fallback = False
|
|
coresys.config.logging = LogLevel.DEBUG
|
|
|
|
await coresys.plugins.dns.restart()
|
|
docker_interface[0].assert_not_called()
|
|
docker_interface[1].assert_called_once()
|
|
|
|
write_json.assert_called_once_with(
|
|
Path("/data/dns/coredns.json"),
|
|
{
|
|
"servers": ["dns://8.8.8.8"],
|
|
"locals": ["dns://192.168.30.1"],
|
|
"fallback": False,
|
|
"debug": True,
|
|
},
|
|
)
|
|
|
|
|
|
async def test_reset(coresys: CoreSys):
|
|
"""Test reset returns dns plugin to defaults."""
|
|
coresys.plugins.dns.servers = ["dns://1.1.1.1", "dns://8.8.8.8"]
|
|
coresys.plugins.dns.fallback = False
|
|
coresys.plugins.dns._loop = True # pylint: disable=protected-access
|
|
assert len(coresys.apps.installed) == 0
|
|
|
|
with (
|
|
patch.object(type(coresys.plugins.dns.hosts), "unlink") as unlink,
|
|
patch.object(type(coresys.plugins.dns), "write_hosts") as write_hosts,
|
|
):
|
|
await coresys.plugins.dns.reset()
|
|
|
|
assert coresys.plugins.dns.servers == []
|
|
assert coresys.plugins.dns.fallback is True
|
|
assert coresys.plugins.dns._loop is False # pylint: disable=protected-access
|
|
unlink.assert_called_once()
|
|
write_hosts.assert_called_once()
|
|
|
|
# Verify the hosts data structure is properly initialized
|
|
# pylint: disable=protected-access
|
|
assert coresys.plugins.dns._hosts == [
|
|
HostEntry(
|
|
ip_address=IPv4Address("127.0.0.1"),
|
|
names=["localhost", "localhost.local.hass.io"],
|
|
),
|
|
HostEntry(
|
|
ip_address=IPv4Address("172.30.32.2"),
|
|
names=[
|
|
"hassio",
|
|
"hassio.local.hass.io",
|
|
"supervisor",
|
|
"supervisor.local.hass.io",
|
|
],
|
|
),
|
|
HostEntry(
|
|
ip_address=IPv4Address("172.30.32.1"),
|
|
names=[
|
|
"homeassistant",
|
|
"homeassistant.local.hass.io",
|
|
"home-assistant",
|
|
"home-assistant.local.hass.io",
|
|
],
|
|
),
|
|
HostEntry(
|
|
ip_address=IPv4Address("172.30.32.3"),
|
|
names=["dns", "dns.local.hass.io"],
|
|
),
|
|
HostEntry(
|
|
ip_address=IPv4Address("172.30.32.6"),
|
|
names=["observer", "observer.local.hass.io"],
|
|
),
|
|
]
|
|
|
|
|
|
async def test_loop_detection_on_failure(coresys: CoreSys, container: DockerContainer):
|
|
"""Test loop detection when coredns fails."""
|
|
assert len(coresys.resolution.issues) == 0
|
|
assert len(coresys.resolution.suggestions) == 0
|
|
|
|
with (
|
|
patch.object(DockerDNS, "attach"),
|
|
patch.object(DockerDNS, "is_running", return_value=True),
|
|
):
|
|
await coresys.plugins.dns.load()
|
|
|
|
with (
|
|
patch.object(PluginDns, "rebuild") as rebuild,
|
|
patch.object(
|
|
DockerDNS,
|
|
"current_state",
|
|
side_effect=[
|
|
ContainerState.FAILED,
|
|
ContainerState.FAILED,
|
|
],
|
|
),
|
|
):
|
|
coresys.bus.fire_event(
|
|
BusEvent.DOCKER_CONTAINER_STATE_CHANGE,
|
|
DockerContainerStateEvent(
|
|
name="hassio_dns",
|
|
state=ContainerState.FAILED,
|
|
id="abc123",
|
|
time=1,
|
|
),
|
|
)
|
|
await asyncio.sleep(0)
|
|
assert len(coresys.resolution.issues) == 0
|
|
assert len(coresys.resolution.suggestions) == 0
|
|
rebuild.assert_called_once()
|
|
|
|
rebuild.reset_mock()
|
|
container.log.return_value = ["plugin/loop: Loop"]
|
|
coresys.bus.fire_event(
|
|
BusEvent.DOCKER_CONTAINER_STATE_CHANGE,
|
|
DockerContainerStateEvent(
|
|
name="hassio_dns",
|
|
state=ContainerState.FAILED,
|
|
id="abc123",
|
|
time=1,
|
|
),
|
|
)
|
|
await asyncio.sleep(0)
|
|
assert coresys.resolution.issues == [
|
|
Issue(IssueType.DNS_LOOP, ContextType.PLUGIN, "dns")
|
|
]
|
|
assert coresys.resolution.suggestions == [
|
|
Suggestion(SuggestionType.EXECUTE_RESET, ContextType.PLUGIN, "dns")
|
|
]
|
|
rebuild.assert_called_once()
|
|
|
|
|
|
async def test_load_error(
|
|
coresys: CoreSys, caplog: pytest.LogCaptureFixture, container
|
|
):
|
|
"""Test error reading config files during load."""
|
|
with (
|
|
patch("supervisor.plugins.dns.Path.read_text", side_effect=(err := OSError())),
|
|
patch("supervisor.plugins.dns.Path.write_text", side_effect=err),
|
|
):
|
|
err.errno = errno.EBUSY
|
|
await coresys.plugins.dns.load()
|
|
|
|
assert "Can't read resolve.tmpl" in caplog.text
|
|
assert "Can't read hosts.tmpl" in caplog.text
|
|
assert coresys.core.healthy is True
|
|
|
|
caplog.clear()
|
|
err.errno = errno.EBADMSG
|
|
await coresys.plugins.dns.load()
|
|
|
|
assert "Can't read resolve.tmpl" in caplog.text
|
|
assert "Can't read hosts.tmpl" in caplog.text
|
|
assert coresys.core.healthy is False
|
|
|
|
|
|
async def test_load_error_writing_resolv(
|
|
coresys: CoreSys, caplog: pytest.LogCaptureFixture, container
|
|
):
|
|
"""Test error writing resolv during load."""
|
|
with patch(
|
|
"supervisor.plugins.dns.Path.write_text", side_effect=(err := OSError())
|
|
):
|
|
err.errno = errno.EBUSY
|
|
await coresys.plugins.dns.load()
|
|
|
|
assert "Can't write/update /etc/resolv.conf" in caplog.text
|
|
assert coresys.core.healthy is True
|
|
|
|
caplog.clear()
|
|
err.errno = errno.EBADMSG
|
|
await coresys.plugins.dns.load()
|
|
|
|
assert "Can't write/update /etc/resolv.conf" in caplog.text
|
|
assert coresys.core.healthy is False
|
|
|
|
|
|
async def test_notify_locals_changed_end_to_end_with_changes_and_running(
|
|
coresys: CoreSys, mock_call_later
|
|
):
|
|
"""Test notify_locals_changed end-to-end: local DNS changes detected and plugin restarted."""
|
|
dns_plugin = coresys.plugins.dns
|
|
|
|
# Set cached locals to something different from current network state
|
|
current_locals = dns_plugin._compute_locals()
|
|
dns_plugin._cached_locals = (
|
|
["dns://192.168.1.1"]
|
|
if current_locals != ["dns://192.168.1.1"]
|
|
else ["dns://192.168.1.2"]
|
|
)
|
|
|
|
with (
|
|
patch.object(dns_plugin, "restart") as mock_restart,
|
|
patch.object(dns_plugin.instance, "is_running", return_value=True),
|
|
patch.object(dns_plugin, "sys_call_later", new=mock_call_later),
|
|
):
|
|
# Call notify_locals_changed
|
|
dns_plugin.notify_locals_changed()
|
|
|
|
# Wait for the async task to complete
|
|
await asyncio.sleep(0.1)
|
|
|
|
# Verify restart was called and cached locals were updated
|
|
mock_restart.assert_called_once()
|
|
assert dns_plugin._cached_locals == current_locals
|
|
|
|
|
|
async def test_notify_locals_changed_end_to_end_with_changes_but_not_running(
|
|
coresys: CoreSys, mock_call_later
|
|
):
|
|
"""Test notify_locals_changed end-to-end: local DNS changes detected but plugin not running."""
|
|
dns_plugin = coresys.plugins.dns
|
|
|
|
# Set cached locals to something different from current network state
|
|
current_locals = dns_plugin._compute_locals()
|
|
dns_plugin._cached_locals = (
|
|
["dns://192.168.1.1"]
|
|
if current_locals != ["dns://192.168.1.1"]
|
|
else ["dns://192.168.1.2"]
|
|
)
|
|
|
|
with (
|
|
patch.object(dns_plugin, "restart") as mock_restart,
|
|
patch.object(dns_plugin.instance, "is_running", return_value=False),
|
|
patch.object(dns_plugin, "sys_call_later", new=mock_call_later),
|
|
):
|
|
# Call notify_locals_changed
|
|
dns_plugin.notify_locals_changed()
|
|
|
|
# Wait for the async task to complete
|
|
await asyncio.sleep(0.1)
|
|
|
|
# Verify restart was NOT called but cached locals were still updated
|
|
mock_restart.assert_not_called()
|
|
assert dns_plugin._cached_locals == current_locals
|
|
|
|
|
|
async def test_notify_locals_changed_end_to_end_no_changes(
|
|
coresys: CoreSys, mock_call_later
|
|
):
|
|
"""Test notify_locals_changed end-to-end: no local DNS changes detected."""
|
|
dns_plugin = coresys.plugins.dns
|
|
|
|
# Set cached locals to match current network state
|
|
current_locals = dns_plugin._compute_locals()
|
|
dns_plugin._cached_locals = current_locals
|
|
|
|
with (
|
|
patch.object(dns_plugin, "restart") as mock_restart,
|
|
patch.object(dns_plugin, "sys_call_later", new=mock_call_later),
|
|
):
|
|
# Call notify_locals_changed
|
|
dns_plugin.notify_locals_changed()
|
|
|
|
# Wait for the async task to complete
|
|
await asyncio.sleep(0.1)
|
|
|
|
# Verify restart was NOT called since no changes
|
|
mock_restart.assert_not_called()
|
|
assert dns_plugin._cached_locals == current_locals
|
|
|
|
|
|
async def test_notify_locals_changed_debouncing_cancels_previous_timer(
|
|
coresys: CoreSys,
|
|
):
|
|
"""Test notify_locals_changed debouncing cancels previous timer before creating new one."""
|
|
dns_plugin = coresys.plugins.dns
|
|
|
|
# Set cached locals to trigger change detection
|
|
current_locals = dns_plugin._compute_locals()
|
|
dns_plugin._cached_locals = (
|
|
["dns://192.168.1.1"]
|
|
if current_locals != ["dns://192.168.1.1"]
|
|
else ["dns://192.168.1.2"]
|
|
)
|
|
|
|
call_count = 0
|
|
handles = []
|
|
|
|
def mock_call_later_with_tracking(_delay, *args, **kwargs) -> asyncio.TimerHandle:
|
|
"""Mock to remove delay and track calls."""
|
|
nonlocal call_count
|
|
call_count += 1
|
|
handle = coresys.call_later(0, *args, **kwargs)
|
|
handles.append(handle)
|
|
return handle
|
|
|
|
with (
|
|
patch.object(dns_plugin, "restart") as mock_restart,
|
|
patch.object(dns_plugin.instance, "is_running", return_value=True),
|
|
patch.object(dns_plugin, "sys_call_later", new=mock_call_later_with_tracking),
|
|
):
|
|
# First call sets up timer
|
|
dns_plugin.notify_locals_changed()
|
|
assert call_count == 1
|
|
first_handle = dns_plugin._locals_changed_handle
|
|
assert first_handle is not None
|
|
|
|
# Second call should cancel first timer and create new one
|
|
dns_plugin.notify_locals_changed()
|
|
assert call_count == 2
|
|
second_handle = dns_plugin._locals_changed_handle
|
|
assert second_handle is not None
|
|
assert first_handle != second_handle
|
|
|
|
# Wait for the async task to complete
|
|
await asyncio.sleep(0.1)
|
|
|
|
# Verify restart was called once for the final timer
|
|
mock_restart.assert_called_once()
|
|
assert dns_plugin._cached_locals == current_locals
|
|
|
|
|
|
async def test_stop_cancels_pending_timers_and_tasks(coresys: CoreSys):
|
|
"""Test stop cancels pending locals change timers and restart tasks to prevent resource leaks."""
|
|
dns_plugin = coresys.plugins.dns
|
|
|
|
mock_timer_handle = Mock()
|
|
mock_task_handle = Mock()
|
|
dns_plugin._locals_changed_handle = mock_timer_handle
|
|
dns_plugin._restart_after_locals_change_handle = mock_task_handle
|
|
|
|
with patch.object(dns_plugin.instance, "stop"):
|
|
await dns_plugin.stop()
|
|
|
|
# Should cancel pending timer and task, then clean up
|
|
mock_timer_handle.cancel.assert_called_once()
|
|
mock_task_handle.cancel.assert_called_once()
|
|
assert dns_plugin._locals_changed_handle is None
|
|
assert dns_plugin._restart_after_locals_change_handle is None
|
|
|
|
|
|
async def test_dns_restart_triggers_connectivity_check(coresys: CoreSys):
|
|
"""Test end-to-end that DNS container restart triggers connectivity check."""
|
|
dns_plugin = coresys.plugins.dns
|
|
|
|
# Load the plugin to register the event listener
|
|
with (
|
|
patch.object(type(dns_plugin.instance), "attach"),
|
|
patch.object(type(dns_plugin.instance), "is_running", return_value=True),
|
|
):
|
|
await dns_plugin.load()
|
|
|
|
# Verify listener was registered (connectivity check listener should be stored)
|
|
assert dns_plugin._connectivity_check_listener is not None
|
|
|
|
# Create event to signal when connectivity check is requested
|
|
connectivity_check_event = asyncio.Event()
|
|
|
|
# Mock the fire-and-forget request to set the event when called
|
|
def mock_request_connectivity_check(*, force: bool = False):
|
|
connectivity_check_event.set()
|
|
|
|
with (
|
|
patch.object(
|
|
coresys.supervisor,
|
|
"request_connectivity_check",
|
|
side_effect=mock_request_connectivity_check,
|
|
),
|
|
patch("supervisor.plugins.dns.asyncio.sleep") as mock_sleep,
|
|
):
|
|
# Fire the DNS container state change event through bus system
|
|
coresys.bus.fire_event(
|
|
BusEvent.DOCKER_CONTAINER_STATE_CHANGE,
|
|
DockerContainerStateEvent(
|
|
name="hassio_dns",
|
|
state=ContainerState.RUNNING,
|
|
id="test_id",
|
|
time=1234567890,
|
|
),
|
|
)
|
|
|
|
# Wait for connectivity check to be called
|
|
await asyncio.wait_for(connectivity_check_event.wait(), timeout=1.0)
|
|
|
|
# Verify sleep was called with correct delay
|
|
mock_sleep.assert_called_once_with(5)
|
|
|
|
# Reset and test that other containers don't trigger check
|
|
connectivity_check_event.clear()
|
|
mock_sleep.reset_mock()
|
|
|
|
# Fire event for different container
|
|
coresys.bus.fire_event(
|
|
BusEvent.DOCKER_CONTAINER_STATE_CHANGE,
|
|
DockerContainerStateEvent(
|
|
name="hassio_homeassistant",
|
|
state=ContainerState.RUNNING,
|
|
id="test_id",
|
|
time=1234567890,
|
|
),
|
|
)
|
|
|
|
# Wait a bit and verify connectivity check was NOT triggered
|
|
try:
|
|
await asyncio.wait_for(connectivity_check_event.wait(), timeout=0.1)
|
|
assert False, (
|
|
"Connectivity check should not have been called for other containers"
|
|
)
|
|
except TimeoutError:
|
|
# This is expected - connectivity check should not be called
|
|
pass
|
|
|
|
# Verify sleep was not called for other containers
|
|
mock_sleep.assert_not_called()
|