Engineering a Sub-Second Broadcast Failover That Actually Holds Up at 02:00 AM

By the KAVANA engineering team — June 2026

There is a particular kind of silence that broadcast engineers dread. It is not the silence of dead air — you can hear that immediately. It is the silence of a station that is technically on air, outputting carrier, but playing nothing useful: no audio, a stuck frame, a repeated clip from forty minutes ago. That silence can last a long time if nobody catches it. At 02:00 AM at a county-level station with no overnight operator, it can last until the morning shift arrives at 06:00.

We have been on the receiving end of those phone calls. This post explains what we built to stop getting them, and why the engineering choices we made are less obvious than they might appear.

The Industry Baseline and Why It Was Not Good Enough

Broadcast automation platforms from NexGen, WideOrbit, and their peers have failover capabilities. Their marketed failover windows — when they publish them at all — typically range from 5 to 30 seconds. Some implementations are faster in practice; some are considerably slower depending on how the system detects failure and how it coordinates handoff between primary and backup.

Five seconds of dead air is a long time in broadcast. At a music station it is noticeable. At a news station during a live segment it is a significant technical incident. At a station that is being monitored by a regulatory body that cares about continuous coverage obligations, it is a compliance event.

Thirty seconds of dead air is not a failover. It is an outage.

We did not set out to engineer a sub-second failover because we wanted a number to put on a marketing page. We set out to engineer one because we had a specific incident that made the previous architecture obviously wrong, and because the stations we serve cannot tolerate what the incumbent vendors consider acceptable.

The Incident That Made Us Take This Seriously

In the early hours of a Wednesday morning, the main playout PC at a county-level station in Hunan province lost power unexpectedly. Not a graceful shutdown — a hard power cut, the kind that happens when a UPS battery degrades silently over two years and then fails at the worst possible moment.

The station's backup system was a separate PC running our software in a hot-standby configuration. The theory was correct: if primary fails, backup takes over. The practice revealed three problems we had not anticipated.

First, the failure detection mechanism at the time relied on a heartbeat that was checked on a five-second interval. The backup system did not know the primary had died until the heartbeat missed three consecutive checks — fifteen seconds of dead air before the backup even attempted to take over.
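
The arithmetic behind that number is worth spelling out. The following is a minimal sketch, assuming the backup declares the primary dead on the Nth consecutive missed check and that the failure can land anywhere between two checks. The function name and parameters are illustrative and not taken from the original software.

```python
def detection_window(check_interval_s: float, missed_checks: int) -> tuple[float, float]:
    """Return (best_case, worst_case) seconds from failure to detection.

    Assumption: the primary is declared dead at the Nth consecutive missed
    check. A failure just after a check waits almost N full intervals; a
    failure just before a check waits about N - 1 intervals.
    """
    if check_interval_s <= 0 or missed_checks < 1:
        raise ValueError("interval must be positive and missed_checks >= 1")
    worst_case = check_interval_s * missed_checks
    best_case = check_interval_s * (missed_checks - 1)
    return best_case, worst_case


# The old configuration: a 5-second check interval and 3 missed checks.
old_best, old_worst = detection_window(check_interval_s=5.0, missed_checks=3)
```

With a five-second interval and three missed checks, the window runs from 10 to 15 seconds, and that is before any takeover work has started.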

Second, the backup system's takeover sequence involved re-reading the current playlist state from a shared network drive. The same power event that had cut power to the primary PC had also briefly interrupted the network switch. The backup system spent additional seconds in retry loops trying to read a file from a path that was momentarily unreachable.

Third, and this is the one that embarrassed us most, the backup system successfully took over and began playing audio, and approximately four minutes later the primary system's UPS rebooted and the primary came back online. Both systems were now playing audio simultaneously. The signal at the transmitter was a mix of two different program streams at slightly different timing offsets. Nobody caught this for eleven minutes: when the station's emergency contact number rang at 02:17 AM, nobody answered.

We fixed all three problems. The process of fixing them taught us things that changed how we think about broadcast reliability architecture.

Why We Chose Direct Machine Interconnect Over Cloud-Managed HA

The conventional modern answer to broadcast failover is cloud-managed high availability: a cloud orchestration layer monitors both machines, detects failure, and coordinates the handoff. This pattern works well for web services and database replicas. It has a structural problem in broadcast.

Cloud-managed HA introduces a third party into the critical path: the cloud coordination service itself. If the cloud service is unreachable — because the station's internet connection has failed, because the cloud provider is having an availability event, because a BGP routing change has increased latency past the detection threshold — the failover logic breaks down. For a station with a poor or intermittent internet connection (which describes a significant fraction of the stations we serve), cloud-managed HA is not a safety net. It is an additional dependency that can fail at the worst time.

We chose direct machine interconnect instead. The primary and backup machines communicate over a dedicated local network link: a physical cable, not WiFi, and not routed through the station's main switch. The backup machine monitors the primary continuously over that link, makes the failure determination itself, and executes the handover, all without any cloud coordination.

The tradeoff is that you lose the advantages of a managed third-party orchestrator: automatic remediation, cloud-based audit logs, integration with a commercial monitoring platform. We accept those tradeoffs for our target environment, where internet reliability cannot be assumed and where the cost of a cloud HA subscription would be a meaningful fraction of the station's IT budget.

The wav9 Audio Firewall as a Failover Integrity Check

Detecting that the primary machine has failed is the easy part of failover. The harder part is verifying that the backup machine's output is actually good before committing to it.

This matters because of a failure mode we call "successful but wrong." The backup system starts playing. Audio is flowing. But the audio is wrong: it is a segment from two hours ago because the playlist synchronization was incomplete, or it is the wrong language track because a file encoding error was not detected at ingest, or it is technically audio but the level is 12 dB below nominal because a gain stage is misconfigured.

A failover that switches from broken primary output to broken backup output has not solved the problem. It has just changed the nature of the problem and reset the clock.

We built what we call the wav9 audio firewall as a per-frame content inspection layer that runs below the playout engine on both machines. Before the backup system commits its output to the transmitter chain, the wav9 layer performs a series of integrity checks on the first few seconds of audio: level validation against programmed targets, silence detection, and a format integrity check that verifies the audio is what the playlist metadata claims it is. If the integrity checks fail, the backup system does not commit its output — it continues retrying with a different segment, a fallback emergency program, or a pre-cached backup track.
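
To make the level and silence checks concrete, here is a minimal sketch in Python. It is not the wav9 implementation and does not follow the wav9 specification. The thresholds, function names, and the use of plain RMS level are all assumptions chosen for illustration, and the format integrity check is left out entirely.

```python
import math
from typing import Iterable, Optional, Sequence

SILENCE_FLOOR_DBFS = -60.0  # hypothetical: anything quieter counts as silence
LEVEL_TOLERANCE_DB = 6.0    # hypothetical: allowed deviation from the target


def rms_dbfs(samples: Sequence[float]) -> float:
    """RMS level of float samples in [-1.0, 1.0], in dB relative to full scale."""
    if not samples:
        return float("-inf")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")
    return 10.0 * math.log10(mean_square)


def passes_integrity(samples: Sequence[float], target_dbfs: float) -> bool:
    """Reject silence, then reject levels too far from the programmed target."""
    level = rms_dbfs(samples)
    if level < SILENCE_FLOOR_DBFS:
        return False
    return abs(level - target_dbfs) <= LEVEL_TOLERANCE_DB


def first_valid_source(
    candidates: Iterable[tuple[str, Sequence[float]]], target_dbfs: float
) -> Optional[str]:
    """Walk the fallback chain in order; return the first source that passes."""
    for name, samples in candidates:
        if passes_integrity(samples, target_dbfs):
            return name
    return None
```

In this sketch the caller would pass the scheduled segment first, then the fallback emergency program, then a pre-cached backup track, so that nothing reaches the transmitter chain until one candidate passes. A source running 12 dB below nominal fails the tolerance check and the chain moves on.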

The wav9 specification is published as an open standard at our GitHub organization. We made it open because we think the broadcast industry needs an interoperable audio integrity layer and because making it proprietary does not serve anyone's interests except ours.

How KAVANA-DOG Actually Makes the Failure Determination

KAVANA-DOG is our watchdog process. It deserves more explanation than "it monitors the broadcast chain" because the specifics of how it makes the failure determination are where the sub-second window is either achieved or lost.

DOG runs on the backup machine as a separate process with a higher OS scheduling priority than the playout application. It monitors the primary machine through three independent channels simultaneously: a heartbeat signal from the primary's DOG process, a TCP-level health check on the primary's playout application port, and an audio-level monitor that checks whether the primary's audio output is actually producing signal at the expected level.

The failure condition is defined as: any two of these three channels showing a failure state simultaneously, sustained for more than 200 milliseconds. We chose the two-of-three logic to eliminate single-channel false positives (a heartbeat packet that gets dropped by a busy switch, a momentary audio level dip during a natural pause). We chose 200 milliseconds as the sustained threshold based on empirical testing of what can cause a false positive versus what is actually a failure.
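
The two-of-three rule with a sustain threshold can be sketched as a small state machine. This is an illustration of the logic described above, not DOG's actual code. The class and parameter names are hypothetical, and the three boolean inputs stand in for whatever the real channel monitors report.

```python
from dataclasses import dataclass
from typing import Optional

SUSTAIN_S = 0.200  # failure must persist longer than this (from the text)
QUORUM = 2         # at least two of the three channels must be failing


@dataclass
class TwoOfThreeDetector:
    sustain_s: float = SUSTAIN_S
    _failing_since: Optional[float] = None  # when the quorum first failed

    def update(self, now: float, heartbeat_ok: bool, tcp_ok: bool, audio_ok: bool) -> bool:
        """Feed one sample of channel states; return True when failover should start.

        `now` should come from a monotonic clock such as time.monotonic().
        """
        failures = [heartbeat_ok, tcp_ok, audio_ok].count(False)
        if failures < QUORUM:
            # A single failing channel (or none) resets the timer.
            self._failing_since = None
            return False
        if self._failing_since is None:
            self._failing_since = now
        return (now - self._failing_since) > self.sustain_s
```

A single dropped heartbeat never starts the timer, and a quorum failure that clears within 200 milliseconds resets it, which is the behavior that removes the two false-positive cases mentioned above.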

When the failure condition is met, DOG on the backup machine initiates handover. The handover sequence is pre-computed: at startup, and periodically thereafter, DOG preloads the current playlist state and computes the segment that should be playing right now and what should follow it. The handover does not involve reading from the network. It does not involve calling a remote API. It executes from pre-cached local state. This is why the handover can complete in well under a second rather than in the several seconds a network-dependent handover would require.
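
As an illustration of what "pre-computed from local state" can mean, the sketch below finds the segment that should be on air, the offset into it, and the segment that follows, using only an in-memory playlist. The `Segment` fields, the seconds-since-midnight time base, and the function name are assumptions made for this example. The real cached state may look quite different.

```python
from bisect import bisect_right
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Segment:
    start_s: float     # scheduled start, seconds since midnight (assumed unit)
    duration_s: float  # scheduled length in seconds
    path: str          # local, pre-cached file path (no network access)


def handover_point(
    playlist: list[Segment], now_s: float
) -> Optional[tuple[Segment, float, Optional[Segment]]]:
    """Return (current segment, offset into it, next segment) or None.

    `playlist` must be sorted by start_s. Only cached data is consulted.
    """
    starts = [seg.start_s for seg in playlist]
    i = bisect_right(starts, now_s) - 1
    if i < 0:
        return None  # before the first scheduled segment
    current = playlist[i]
    offset = now_s - current.start_s
    if offset >= current.duration_s:
        return None  # gap in the schedule or past the end
    following = playlist[i + 1] if i + 1 < len(playlist) else None
    return current, offset, following
```

When the lookup returns None, a real system would presumably fall through to an emergency program. That decision is outside the scope of this sketch.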

The total elapsed time from the moment the failure condition is met to the moment the backup system's audio is committed to the transmitter chain is, in our production measurements across several hundred handover events over the past three years, consistently under 800 milliseconds. In favorable conditions it is under 400 milliseconds.

The Mistakes We Made and What They Cost Us

We have made three significant mistakes in this system's history that are worth describing honestly.

Mistake one: the false positive storm. An early version of DOG used a single-channel failure determination. During a period of network congestion caused by a firmware update running on the station's router, DOG was seeing intermittent heartbeat failures. Over a four-hour window, it triggered thirty-seven failover events — oscillating between primary and backup every few minutes. The audio output during this period was discontinuous but technically present. The station's compliance log looked like a strobe. We learned from this that single-channel failure detection is not acceptable and implemented the two-of-three logic described above.

Mistake two: synchronized failure. Two stations were running the same software version, which had a bug in the audio level monitoring code. At each station, the DOG processes on the two machines could, under certain conditions, simultaneously decide that the other machine's output was failing, even when both were actually healthy. This created a race condition where both machines attempted to become primary at once. The resolution was to implement a deterministic arbitration protocol: in a dual-primary conflict, the machine with the lexicographically earlier hardware ID yields. This is not an elegant solution, but it is predictable.
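
The arbitration rule is small enough to show in full. This is a sketch of the rule as stated, with a hypothetical function name. It assumes hardware IDs are unique strings in a consistent case and format, since Python compares strings by code point.

```python
def should_yield(own_hardware_id: str, peer_hardware_id: str) -> bool:
    """In a dual-primary conflict, the lexicographically earlier ID yields."""
    if own_hardware_id == peer_hardware_id:
        # Identical IDs would leave the conflict unresolved on both sides.
        raise ValueError("hardware IDs must be unique")
    return own_hardware_id < peer_hardware_id
```

Because both machines evaluate the same comparison on the same pair of IDs, exactly one of them yields, and no message exchange is needed at the moment of conflict.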

Mistake three: the stale state problem. Pre-caching the playlist state for fast handover creates an obvious issue: the cache can become stale. If the primary machine has been running for six hours and the cached state in DOG was last refreshed four hours ago, the backup's pre-computed handover point will be wrong. We did not initially implement aggressive enough cache refresh logic. A station experienced a handover where the backup began playing a segment that had already aired three hours earlier. We now refresh the pre-cached state every sixty seconds and immediately before any scheduled segment boundary.
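
The refresh policy in the last sentence can be expressed as a single scheduling function. The sketch below is one possible reading of it. The one-second lead time before a boundary is an assumption, since the text only says "immediately before", and the names are hypothetical.

```python
REFRESH_INTERVAL_S = 60.0  # periodic refresh (from the text)
BOUNDARY_LEAD_S = 1.0      # hypothetical lead time before a segment boundary


def next_refresh_at(last_refresh_s: float, boundaries_s: list[float]) -> float:
    """Return the time of the next cache refresh.

    The next refresh is whichever comes first: the periodic deadline, or the
    lead-time point before an upcoming segment boundary that has not yet
    been covered by a refresh.
    """
    periodic = last_refresh_s + REFRESH_INTERVAL_S
    pre_boundary = [
        b - BOUNDARY_LEAD_S
        for b in boundaries_s
        if b - BOUNDARY_LEAD_S > last_refresh_s
    ]
    return min([periodic] + pre_boundary)
```

Under this policy the cached state is never more than sixty seconds old, and it is always rebuilt shortly before the point where a stale cache would do the most damage.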

What This Architecture Does Not Solve

We want to be honest about the limits of what we have built.

Sub-second failover between two machines at a single site does not protect against site-level failures: building power loss affecting both machines, a facility fire, a network failure that takes down both the primary and backup simultaneously. For site-level resilience you need a geographically separate backup site, which is a different architectural problem and a different cost discussion.

The system also does not protect against content failures that happen on both machines simultaneously. If a corrupted audio file is in the playlist on both primary and backup, both machines will have the same problem.

And the monitoring and remote management capability — KAVANA-MGR's ability to see and control the station remotely — depends on the station's network connectivity. DOG can operate autonomously at the local level without network access, but the human-in-the-loop for diagnosing complex failures still needs to be able to reach the machine. We use a reverse SSH tunnel architecture for this, which works even through NAT and restrictive firewalls, but it requires at least intermittent internet connectivity.

The 02:00 AM Question

We started this post with a specific scenario: a failure at 02:00 AM when no operator is on site. What actually happens in that scenario with the current system?

DOG declares the failure once two of its three monitoring channels have shown a failure state for more than 200 milliseconds. The backup system's output, having passed the wav9 integrity check, is committed to the transmitter chain within 800 milliseconds of that declaration. The switchover is therefore complete roughly one second after the failure at the latest, assuming the failure is clean. If the failure is messy (the primary is producing corrupted audio rather than no audio), the detection may take a few hundred milliseconds longer as DOG waits for the audio-level monitor to confirm the anomaly.

DOG simultaneously sends a status report over the reverse SSH tunnel to the monitoring endpoint — which for most of our deployments is a server at the broadcasting group's central facility. The on-call engineer (if there is one) receives an alert. For the stations that have nobody on call, the incident is logged and visible when the morning shift arrives.

Is this perfect? No. Is it meaningfully better than five to thirty seconds of dead air followed by a manual intervention? Yes. For the stations we serve, that difference is the difference between a footnote in the morning debrief and a regulatory compliance incident.

If you are evaluating broadcast failover architectures and want to understand the technical details further, the KAVANA-DOG documentation covers the monitoring and arbitration protocol, and the wav9 spec covers the audio integrity verification layer. We are also reachable at international@kavanafm.com for direct technical questions.

KAVANA is developed by Hunan ShengGuang Technology Co., Ltd. (湖南声广科技有限公司), incorporated 2012, team active since 2005. We hold a broadcast production and distribution license (湘字第00565号) and operate under Chinese cybersecurity Level 3 certification. Technical documentation and open specifications: github.com/kavanafm.