1
0
mirror of https://github.com/home-assistant/operating-system.git synced 2026-05-19 06:58:54 +01:00
Files
Stefan Agner d76f3f4a6c systemd: Increase runtime watchdog timeout to 5 minutes (#4705)
Previously the watchdog timeout was set to "default", which makes
systemd adopt whatever timeout the underlying hardware driver
advertises. In practice this means very different behavior across
platforms: common x86-64 watchdogs (iTCO_wdt, i6300esb on virtual
systems) default to around 30 seconds, while the Raspberry Pi BCM2835
watchdog uses 15 seconds.

These short timeouts have proven too aggressive in the presence of
stalled network storage. PID1 enters the kernel through paths that can
park it in D-state on a dead NFS mount: mount_enter_mounting() calls
chase()/open_tree(), is_dir()/mkdir_p_label() on bind mount sources,
and mkdir_p_label()/unit_warn_if_dir_nonempty() on destinations all
end up in nfs_lookup/nfs_getattr/nfs_readdir, waiting for an RPC major
timeout. While PID1 is blocked it cannot ping the watchdog, and the
system reboots even though the underlying issue is a network storage
stall, not a kernel hang. Reported upstream as
https://github.com/systemd/systemd/issues/42050.

Set RuntimeWatchdogSec to 5 minutes to align all platforms on a
single, conservative value that tolerates NFS/CIFS RPC timeouts while
still recovering from genuine PID1 hangs.

A note on hardware limits: some watchdog drivers expose a
max_hw_heartbeat_ms instead of max_timeout. In that case the watchdog
core in drivers/watchdog/watchdog_dev.c arms a kernel timer that pings
the hardware in the background and multiplexes a longer userspace
timeout on top, so a 300s request is accepted even when the silicon
counter cannot represent it. The BCM2835 watchdog used on all
Raspberry Pi variants is exactly this case: the PM_WDOG_TIME_SET
register caps at ~16s of hardware heartbeat, but the driver migrated
to the max_hw_heartbeat_ms path in v6.8 (commit f33f5b1fd1be
"watchdog: bcm2835_wdt: Fix WDIOC_SETTIMEOUT handling", Dec 2023), so
WDIOC_SETTIMEOUT(300) is honored — the core re-pings the chip every
~15s automatically and only stops once userspace has been silent for
the full 300s. Drivers that still use max_timeout directly will reject
300s with -EINVAL and systemd falls back to the driver's own value;
this is no worse than the previous "default" behavior.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2026-05-12 16:45:02 +02:00
..
2018-06-12 14:31:50 +00:00
2018-04-11 15:40:15 +02:00