What Actually Breaks in a Broadcast Audio Codec Pipeline (and How to Design Around It)

By the KAVANA engineering team — June 2026

The phrase "codec pipeline" suggests a problem with a defined beginning and end: audio goes in one side, audio comes out the other, and the codecs are the boxes in between. In practice, a broadcast audio codec pipeline is a sequence of format conversions, container handoffs, level adjustments, and buffer transitions that each introduce their own failure modes — and the failures compound in ways that are not obvious from looking at any individual stage in isolation.

We have been building and operating broadcast automation systems for twenty years, across dozens of county-level and regional stations in China. The codec questions that seem simple in a studio or a test environment reveal their complexity at scale, over long unattended broadcast runs, with heterogeneous source material that arrives from production workflows that were not designed with your playout system in mind. This post is an honest account of what breaks, why it breaks, and what we built to avoid it.

The Stages of a Broadcast Codec Pipeline and Where Each One Fails

It helps to be explicit about what a typical broadcast codec pipeline actually contains. The stages are not always labeled as stages — in most playout systems they are implicit — but they are all present, and each one has its own failure modes.

Source ingest is the first stage: audio arrives from a content production system, a network feed, a DAW export, or an AI synthesis pipeline. The formats are heterogeneous. MP4 containers with AAC audio are common from video production workflows. WAV files (PCM, multiple sample rates and bit depths) come from radio production software and audio editors. FLAC and MP3 arrive from music libraries. The occasional oddity — a Windows Media audio file from a legacy production system, an AIFF from a macOS DAW — appears often enough to require handling.

Ingest normalization converts the source material into whatever internal format the playout system works with. This stage is where metadata loss first becomes a risk: container metadata that was meaningful in the source format (title, artist, duration, loudness measurements embedded by the production system) may survive or may not depending on how the conversion is implemented.

Playout buffering is the stage where audio is read from storage, decoded, and placed into the memory buffer that the playout engine uses to ensure continuous output. This is where sample rate mismatches and resampling jitter appear.

Output encoding takes the buffered audio and encodes it for the specific transmission chain — FM analog processing, DAB digital, IP stream, or some combination. This is where AAC and MP2 frame errors typically occur.

Transmission is the final stage, where the encoded audio leaves the broadcast machine for the transmitter, the streaming encoder, or both.

Each of these stages has characteristic failure modes. The diagnostic difficulty is that a failure at one stage often does not produce a visible error at that stage — the failure propagates downstream and manifests as a different kind of problem several stages later.

Metadata Loss: the Failure That Hides Until You Need the Data

Broadcast audio metadata — title, duration, loudness, production date, rights information — is embedded differently in different container formats. ID3 tags in MP3, Vorbis comments in FLAC, QuickTime atoms in MP4, BWF chunks in WAV. The information is the same in principle; the encoding is format-specific.

When audio is converted at ingest from its source format to the playout system's internal format, metadata survives only if the conversion explicitly handles the mapping. A conversion that extracts audio and writes it to WAV without BWF chunks loses the loudness measurement that was embedded by the production system's mastering process. If your downstream normalization relies on reading an embedded loudness value rather than re-measuring, it is now working without the data it expected — and depending on how it handles a missing value, it may apply incorrect processing, silently skip normalization, or fail in a way that is not immediately obvious.
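One way to avoid the silent-skip outcome is to make the missing-value case an explicit branch. The following is a minimal sketch, not the KAVANA implementation; the names `GainDecision` and `decide_gain` are hypothetical, and the -23 LUFS target is the one used later in this post.

```python
from dataclasses import dataclass
from typing import Optional

TARGET_LUFS = -23.0  # local target used elsewhere in this post


@dataclass
class GainDecision:
    gain_db: Optional[float]   # None means "do not process yet"
    needs_measurement: bool    # True when the embedded value was lost


def decide_gain(embedded_lufs: Optional[float]) -> GainDecision:
    """Return an explicit decision instead of silently skipping normalization."""
    if embedded_lufs is None:
        # The loudness value did not survive conversion: ask for a re-measurement
        # rather than guessing a gain or passing the file through untouched.
        return GainDecision(gain_db=None, needs_measurement=True)
    return GainDecision(gain_db=TARGET_LUFS - embedded_lufs, needs_measurement=False)
```

The design choice here is only that a missing value produces a distinct, inspectable result, so the caller cannot mistake "no data" for "already at target".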

The duration metadata problem is subtler. Most playout systems read audio duration from container metadata at schedule-loading time rather than decoding each file in advance. A WAV file with a correct duration in its header presents no problem. A WAV file whose duration metadata was written incorrectly by a production tool — written before all audio data was flushed to the file, for example, which is a known bug in some versions of popular DAW software — will appear to be the correct duration in the schedule, but will end early or play past its scheduled slot depending on the actual versus reported length.

Duration metadata mismatches cause the most visible and listener-audible failures in a broadcast pipeline. A segment that ends two minutes before its scheduled slot leaves dead air or forces an unscheduled early transition. A segment that runs over its scheduled slot means the next element starts late, and the timing error accumulates through the rest of the hour.
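A cheap ingest-time check for this class of problem is to compare the data-chunk size declared in the WAV header with the number of bytes actually present in the file. The sketch below uses only the standard RIFF/WAVE layout. It assumes the data chunk is the last chunk in the file, which is a simplification, and the function names and the 50 ms tolerance are illustrative choices rather than anything specified by a KAVANA product.

```python
import struct
from typing import Tuple


def declared_vs_present_seconds(wav: bytes) -> Tuple[float, float]:
    """Return (duration declared by the header, duration of bytes present).

    Simplifying assumption: the data chunk is the last chunk in the file.
    """
    if wav[:4] != b"RIFF" or wav[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    pos, byte_rate = 12, 0
    while pos + 8 <= len(wav):
        chunk_id, size = struct.unpack_from("<4sI", wav, pos)
        body = pos + 8
        if chunk_id == b"fmt ":
            # fmt layout: format tag (2), channels (2), sample rate (4), byte rate (4)
            byte_rate = struct.unpack_from("<I", wav, body + 8)[0]
        elif chunk_id == b"data":
            if byte_rate == 0:
                raise ValueError("no usable fmt chunk before data")
            present = len(wav) - body
            return size / byte_rate, present / byte_rate
        pos = body + size + (size & 1)  # RIFF chunks are padded to even length
    raise ValueError("no data chunk found")


def duration_is_trustworthy(wav: bytes, tolerance_s: float = 0.05) -> bool:
    """True when the header and the file contents agree within the tolerance."""
    declared, present = declared_vs_present_seconds(wav)
    return abs(declared - present) <= tolerance_s
```

A file that fails this check would be re-measured by decoding it fully before its duration is allowed into the schedule.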

Loudness Leak: When Normalization Succeeds at One Stage and Fails at Another

Loudness normalization applied at ingest is correct for the material as it exists at the time of ingest. It is not necessarily correct for the material as it will be processed at subsequent stages.

A specific failure mode we have encountered: audio normalized to -23 LUFS at ingest is passed through a DSP chain at the output stage that includes a hardware broadcast processor, a common piece of equipment at FM stations that applies loudness enhancement and limiting for the transmission chain. A broadcast processor that is configured aggressively for loudness enhancement can add 6 to 10 LU of effective loudness to the output. The ingest normalization was correct; the transmitter output is dramatically louder than intended because the broadcast processor is not aware of the normalization target.

The loudness leak can also occur in the other direction. Audio from a network feed that arrives pre-normalized by the upstream facility passes through the ingest stage without re-normalization (because it appears to be at target). But the upstream facility's normalization target was -24 LUFS rather than -23 LUFS, and its true peak ceiling was -3 dBTP rather than -1 dBTP. The material is nominally normalized, but it is 1 LU quieter than the local target, and transitions from this material to locally produced content produce an audible level jump.

There is no general solution to loudness leak that does not require measuring at every stage. The practical approach — which is what we implement in KAVANA-MGR — is to normalize at ingest, measure at the output stage, and flag material that falls outside the expected range for review. The output stage measurement is not a re-normalization pass; applying loudness processing to pre-processed material often produces artifacts. It is a monitoring pass that surfaces inconsistencies that were not caught at ingest.
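The monitoring pass can be sketched as a comparison against the target with a tolerance, producing a flag rather than a gain change. The 0.5 LU tolerance and the function names below are assumptions made for illustration; they are not values taken from KAVANA-MGR.

```python
TARGET_LUFS = -23.0   # local target from the text
TOLERANCE_LU = 0.5    # hypothetical review threshold


def loudness_offset_lu(measured_lufs: float, target_lufs: float = TARGET_LUFS) -> float:
    """Signed difference from target in LU (negative means quieter than target)."""
    return measured_lufs - target_lufs


def needs_review(measured_lufs: float,
                 target_lufs: float = TARGET_LUFS,
                 tolerance_lu: float = TOLERANCE_LU) -> bool:
    """Flag material for review; this function never alters the audio."""
    return abs(loudness_offset_lu(measured_lufs, target_lufs)) > tolerance_lu
```

With these numbers, a feed measured at -24 LUFS would be flagged because it sits 1 LU below target, while material at -23.2 LUFS would pass.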

Resampling Jitter: the Problem You Cannot Hear Until the Encoding Stage

Broadcast production audio exists at multiple standard sample rates: 44.1 kHz (CD standard, common in music production), 48 kHz (professional broadcast standard, required for some transmission chains), 96 kHz (high-resolution production). A broadcast system that receives material at mixed sample rates must resample everything to a single rate before encoding for transmission.

Resampling is a mathematically well-understood operation. A good resampling implementation — using a proper windowed sinc filter with adequate filter length — introduces inaudible artifacts. A poor resampling implementation — particularly the linear interpolation that some playout systems use because it is computationally cheap — introduces audible aliasing in the presence of high-frequency content and produces jitter artifacts at the encoding stage.
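The difference between the two approaches can be shown with a small pure-Python sketch. It is deliberately slow and is only meant to illustrate the technique; a production system would use an optimized resampling library. The Hann window, the half-width of 16 samples, and all function names are choices made for this example. The kernel is suitable for upsampling only (for example 44.1 kHz to 48 kHz); downsampling would also require lowering the filter cutoff.

```python
import math
from typing import Callable, List, Sequence


def linear_sample(signal: Sequence[float], t: float) -> float:
    """Linear interpolation between the two neighbouring input samples."""
    i = math.floor(t)
    frac = t - i
    nxt = signal[min(i + 1, len(signal) - 1)]
    return signal[i] * (1.0 - frac) + nxt * frac


def windowed_sinc_sample(signal: Sequence[float], t: float, half_width: int = 16) -> float:
    """Band-limited interpolation at fractional index t with a Hann-windowed sinc."""
    center = math.floor(t)
    acc = 0.0
    for n in range(center - half_width + 1, center + half_width + 1):
        if 0 <= n < len(signal):
            d = t - n
            sinc = 1.0 if d == 0 else math.sin(math.pi * d) / (math.pi * d)
            window = 0.5 * (1.0 + math.cos(math.pi * d / half_width))
            acc += signal[n] * sinc * window
    return acc


def resample(signal: Sequence[float], src_rate: int, dst_rate: int,
             interp: Callable[[Sequence[float], float], float]) -> List[float]:
    """Evaluate the input at each output sample's position on the input time axis."""
    n_out = int(len(signal) * dst_rate / src_rate)
    return [interp(signal, k * src_rate / dst_rate) for k in range(n_out)]
```

Resampling a 10 kHz tone from 44.1 kHz to 48 kHz with each interpolator and comparing the result against the ideal tone makes the gap visible: the linear version attenuates the tone and leaves image components, while the windowed sinc version stays close to the ideal away from the edges of the buffer.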

The jitter problem specifically occurs because AAC and MP2 encoders operate on fixed-length audio frames. AAC uses frames of 1024 PCM samples; MP2 uses frames of 1152 PCM samples. An encoder that receives audio at a nominal 48 kHz, but with small sample-timing errors introduced by a poor resampler, produces frames whose actual audio content varies slightly from the declared frame duration (about 21.3 ms for AAC and 24 ms for MP2 at 48 kHz). Most decoders handle this without perceptible effect. Some decoders, particularly in broadcast monitoring chains, produce drift that accumulates over long broadcast runs and eventually results in a buffer underrun or a synchronization loss.
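The arithmetic behind frame duration and accumulated drift is simple enough to sketch. The frame sizes come from the text; the rate error used in the usage note is a hypothetical figure chosen to show the scale of the effect, not a measurement.

```python
def frame_duration_ms(samples_per_frame: int, sample_rate_hz: int) -> float:
    """Nominal duration of one fixed-length codec frame."""
    return 1000.0 * samples_per_frame / sample_rate_hz


def drift_ms(hours: float, rate_error_ppm: float) -> float:
    """Timing drift accumulated when the effective sample rate is off by rate_error_ppm."""
    return hours * 3600.0 * 1000.0 * rate_error_ppm / 1_000_000.0
```

For example, `frame_duration_ms(1152, 48000)` gives 24 ms for an MP2 frame, and `drift_ms(10, 10)` gives 360 ms, so an assumed 10 ppm effective rate error would consume fifteen MP2 frames of buffer over a ten-hour run.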

The practical solution is to resample at ingest rather than at encoding, using a high-quality resampler, and to convert all material to a single sample rate, typically 48 kHz for broadcast, before it enters the playout queue. This eliminates the resampling jitter problem at the encoding stage by ensuring the encoder receives audio at a consistent sample rate throughout. The tradeoff is storage: 48 kHz PCM WAV files are about 9 percent larger than 44.1 kHz files at the same bit depth and channel count, and a station with a large music library may need to plan for the additional storage requirement.
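The storage tradeoff can be estimated directly from the PCM parameters. This is a back-of-the-envelope sketch that ignores header and metadata overhead; the function name is illustrative.

```python
def pcm_bytes(seconds: float, sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Size of the raw PCM payload, excluding container and metadata overhead."""
    return int(seconds * sample_rate_hz * (bits_per_sample // 8) * channels)


# One hour of 16-bit stereo at each rate
hour_48k = pcm_bytes(3600, 48000, 16, 2)
hour_44k = pcm_bytes(3600, 44100, 16, 2)
growth = hour_48k / hour_44k  # ratio of the two sample rates
```

The ratio depends only on the sample rates (48000 / 44100, a little under 1.09). Moving to a wider sample format at the same time, such as 32-bit float instead of 16-bit integer, has a much larger effect on storage than the rate change.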

The Single Trusted Intermediate Format and Why It Matters

The failures described above — metadata loss, loudness leak, resampling jitter — each have individual solutions. But the pattern underlying all of them is the same: when audio passes through multiple conversions in a heterogeneous pipeline, each conversion is a potential point of information loss or transformation error, and the errors from multiple stages compound.

The architectural solution we arrived at after years of operating these systems is to enforce a single trusted intermediate format throughout the playout pipeline. Every audio file that enters the playout queue, regardless of its source format, is converted to this format at ingest and stored in this format. The playout engine, the output encoder, and the monitoring chain all operate on audio in this format exclusively.

For KAVANA, that intermediate format is WAV9: 48 kHz, 32-bit float, PCM, with a mandatory BWF metadata chunk that carries loudness measurements, duration, source information, and provenance. The wav9-spec is published and versioned. Any conversion tool that produces a WAV9-compliant file is a valid source; any system that consumes WAV9 can rely on the invariants the format guarantees.

The enforcement of a single format removes a class of failures entirely. If every audio file in the playout queue is 48 kHz, there is no resampling at encoding time. If every file carries a BWF loudness chunk, there is no missing metadata at normalization time. If every file has a verified duration in its header (WAV9 requires duration verification at write time), there are no schedule timing errors from incorrect duration metadata.
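A gatekeeper for these invariants might look like the sketch below. It checks only the properties named in this post (48 kHz, 32-bit float, a BWF chunk present) against the standard RIFF/WAVE layout, where format tag 3 denotes IEEE float and `bext` is the BWF broadcast extension chunk. The published wav9-spec is the authority and may require more than this. The function name `wav9_problems` is hypothetical, and files that use the extensible format tag are not handled.

```python
import struct
from typing import Dict, List

WAVE_FORMAT_IEEE_FLOAT = 3


def wav9_problems(wav: bytes) -> List[str]:
    """List the ways a file violates the invariants named in the text (empty if none)."""
    if wav[:4] != b"RIFF" or wav[8:12] != b"WAVE":
        return ["not a RIFF/WAVE file"]
    chunks: Dict[bytes, bytes] = {}
    pos = 12
    while pos + 8 <= len(wav):
        chunk_id, size = struct.unpack_from("<4sI", wav, pos)
        chunks.setdefault(chunk_id, wav[pos + 8:pos + 8 + size])
        pos += 8 + size + (size & 1)  # chunks are padded to even length
    problems: List[str] = []
    fmt = chunks.get(b"fmt ")
    if fmt is None or len(fmt) < 16:
        problems.append("missing or short fmt chunk")
    else:
        tag, _channels, rate, _byte_rate, _align, bits = struct.unpack_from("<HHIIHH", fmt)
        if tag != WAVE_FORMAT_IEEE_FLOAT:
            problems.append(f"format tag {tag} is not IEEE float")
        if rate != 48000:
            problems.append(f"sample rate {rate} is not 48000")
        if bits != 32:
            problems.append(f"{bits} bits per sample, expected 32")
    if b"bext" not in chunks:
        problems.append("missing bext (BWF) chunk")
    return problems
```

Returning a list of problems rather than a boolean makes the rejection reason available to the ingest log.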

The cost of enforcement is the ingest pipeline. Every source format must be converted to WAV9 before it enters the queue, which requires maintaining conversion support for a broad range of input formats. In practice, this is manageable: the conversion is a batch operation that runs at ingest time, not in the real-time playout path, and the set of formats that actually appears in broadcast production environments is smaller than the theoretical maximum.

Encoding Stage Failures: AAC Frame Errors and MP2 Bit Rate Mismatches

The output encoding stage — converting the playout audio to the format required by the transmission chain — has its own failure modes that are distinct from the pipeline failures described above.

AAC encoding failures typically manifest as frame errors: a frame that the encoder could not process correctly, which the decoder handles by dropping the frame (producing a brief silence) or by reconstructing it from adjacent frames (producing a brief artifact). Frame errors in continuous broadcast operation are usually caused by one of three things: audio data that arrives at the encoder faster or slower than the encoder's input buffer expects (a buffer underflow or overflow), audio with amplitude values outside the encoder's expected range (not clamped to [-1.0, 1.0] in float representation), or an encoder that accumulates state across long runs and eventually produces a frame it cannot correctly encode.
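The second cause, out-of-range float samples, is the easiest of the three to guard against before the encoder sees the audio. The sketch below counts offending samples and applies a hard clamp as a last line of defence. The function names are illustrative, and replacing NaN with silence is a choice made for this example. Hard clamping is itself audible clipping, so a nonzero count is better treated as a signal to fix the upstream gain than as a normal operating condition.

```python
import math
from typing import Iterable, List


def count_out_of_range(samples: Iterable[float]) -> int:
    """Count samples outside [-1.0, 1.0]; NaN also fails the comparison and is counted."""
    return sum(1 for s in samples if not (-1.0 <= s <= 1.0))


def clamp_samples(samples: Iterable[float]) -> List[float]:
    """Clamp to [-1.0, 1.0], replacing NaN with silence."""
    out: List[float] = []
    for s in samples:
        if math.isnan(s):
            out.append(0.0)
        else:
            out.append(max(-1.0, min(1.0, s)))
    return out
```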

The third cause, encoder state accumulation, is particularly insidious because it produces failures that are time-correlated with broadcast run length rather than content-correlated. An encoder that works perfectly for the first eight hours of a broadcast day and then starts producing occasional frame errors around the ten-hour mark is exhibiting this behavior. The solution is controlled encoder restarts at schedule boundaries. The point is not to recover from a failure that has already happened; a clean restart resets the accumulated state before it becomes a problem.

MP2 bit rate mismatches cause a different failure mode. MP2 is the encoding format used for DAB digital broadcast in many markets and is also used in some IP stream configurations. A bit rate mismatch — the encoder configured for 192 kbps but the transmission chain expecting 256 kbps, for example — does not always produce an immediate error. Some decoders accept the lower bit rate and decode it without complaint; others flag a compliance error; others silently fail. The silent failure mode is the worst outcome, because the transmission appears healthy from the encoder's perspective but the received audio at the decoder is degraded or absent.
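Because the silent case gives no error at the encoder, it helps to check the configuration explicitly before going on air. The sketch below compares the two configured rates and also computes the expected frame size, using the standard MPEG-1 Layer II relationship of 144 × bit rate / sample rate for an unpadded frame. The function names are illustrative.

```python
from typing import Optional


def mp2_frame_bytes(bitrate_bps: int, sample_rate_hz: int) -> int:
    """Unpadded MPEG-1 Layer II frame size in bytes (1152 samples per frame)."""
    return 144 * bitrate_bps // sample_rate_hz


def bitrate_mismatch(encoder_kbps: int, chain_kbps: int) -> Optional[str]:
    """Return a description of the mismatch, or None when the two settings agree."""
    if encoder_kbps == chain_kbps:
        return None
    return (f"encoder is set to {encoder_kbps} kbps "
            f"but the transmission chain expects {chain_kbps} kbps")
```

At 48 kHz, a 192 kbps stream has 576-byte frames and a 256 kbps stream has 768-byte frames, so a downstream stage that assumes a fixed frame size will misread a stream encoded at the other rate.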

In KAVANA-DOG, we address the encoder state accumulation problem with controlled restarts at hour boundaries in long broadcast runs. This is a pragmatic engineering decision: we do not know exactly when a given encoder instance will accumulate enough state to become unreliable, so we restart at regular intervals rather than waiting for failure. The restart is designed to be imperceptible to the listener — it occurs at a natural break in the audio, typically at the transition between segments, and the new encoder instance is warm before the old one is stopped.
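The scheduling side of this idea can be sketched as follows: find the next hour boundary, then choose the segment transition closest to it within a window. This is an illustration of the approach described above, not KAVANA-DOG code; the function names and the two-minute window are assumptions, and the warm-up and handover of the new encoder instance are outside the scope of the sketch.

```python
from datetime import datetime, timedelta
from typing import Optional, Sequence


def next_hour_boundary(now: datetime) -> datetime:
    """The next top of the hour strictly after `now`."""
    return now.replace(minute=0, second=0, microsecond=0) + timedelta(hours=1)


def pick_restart_time(boundary: datetime,
                      transitions: Sequence[datetime],
                      window: timedelta = timedelta(minutes=2)) -> Optional[datetime]:
    """Choose the segment transition closest to the boundary, within the window.

    Returns None when no transition is close enough, in which case the restart
    would be deferred to the next boundary rather than forced mid-segment.
    """
    candidates = [t for t in transitions if abs(t - boundary) <= window]
    if not candidates:
        return None
    return min(candidates, key=lambda t: abs(t - boundary))
```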

The Network Feed Problem: When the Codec Pipeline Starts Upstream

Many broadcast stations, particularly at the county and regional level, carry content from network feeds — provincial-level or national-level content distributed by a broadcaster that is upstream in the production hierarchy. This content arrives encoded, not as raw PCM, and it arrives with processing decisions already applied by the upstream facility.

The network feed creates a codec pipeline problem that is genuinely difficult to solve cleanly. The upstream content has been encoded (typically AAC or MP2), transmitted across a network (introducing potential packet loss or jitter), decoded at the receiving station, and then needs to be re-encoded for local transmission. Each decode-encode cycle introduces generation loss — the artifacts accumulated by lossy encoding that compound across multiple encode-decode cycles.

For content that cycles through two or three generations — produced as WAV, encoded for network distribution, decoded at the affiliate, re-encoded for local transmission — the generation loss is typically inaudible at practical bit rates. For content that cycles through more generations, or for content that was encoded at a low bit rate to begin with, generation loss becomes audible.

The cleaner solution where it is available is to receive network feeds as PCM (the upstream facility sends unencoded audio over a high-bandwidth connection) or to accept lossless encoding (FLAC over the network connection). Neither option is universally available from Chinese network feed sources. Where neither is available, the practical approach is to minimize the number of decode-encode cycles and to use the highest feasible bit rate at each encoding stage.

The KAVANA integration with network feeds ingests decoded audio through the standard WAV9 pipeline where possible, applying the same normalization and format verification that local content receives. For feeds that cannot be decoded cleanly — feeds that arrive in formats with non-standard packetization, for example — we log the format and flag the content for manual review rather than attempting automated conversion that might produce corrupted WAV9 output.

Putting It Together: What a Healthy Codec Pipeline Looks Like in Operation

A broadcast codec pipeline that is working correctly is invisible. The presenter talks, the music plays, the news package airs, and none of the engineers think about codec format conversion, because there is nothing to think about.

The indicators that a pipeline is healthy are mostly negative: no duration anomalies in the playout schedule, no loudness complaints from listeners or regulators, no frame errors in the encoder log, no resampling artifacts reported by the monitoring chain. This makes the pipeline hard to evaluate proactively — it is a system designed to prevent failures, and the absence of failures is not dramatic.

What we monitor in production, through KAVANA-DOG and KAVANA-MGR, is the accumulation of small signals that precede failures: a steady growth in encoder output latency that suggests state accumulation, a drift in the average loudness of the output stage that suggests the normalization pipeline is not catching some content category, a growing fraction of audio files with missing BWF metadata that suggests the ingest pipeline is not being applied consistently. These signals, caught early, allow corrective action before a listener-audible failure occurs.
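Detecting a "steady growth" signal of this kind amounts to estimating a trend over a window of recent measurements. A least-squares slope is one simple way to do that. The sketch below is generic; the function names and the threshold are hypothetical and would need tuning against real encoder behaviour.

```python
from typing import Sequence


def trend_per_sample(values: Sequence[float]) -> float:
    """Least-squares slope of evenly spaced measurements (units per sample interval)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two measurements")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def is_creeping(values: Sequence[float], max_slope: float) -> bool:
    """True when the measurements are rising faster than the allowed slope."""
    return trend_per_sample(values) > max_slope
```

The same function applies to any of the signals listed above, for example hourly encoder latency readings, hourly average output loudness, or the daily fraction of files missing BWF metadata.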

The technical documentation for the WAV9 intermediate format specification is published at github.com/kavanafm/wav9-spec. The DOG monitoring configuration guide and MGR playout integration documentation are available through the product documentation. Stations with specific codec pipeline questions for their transmission chain are welcome to contact us at international@kavanafm.com.

KAVANA is developed by Hunan ShengGuang Technology Co., Ltd. (湖南声广科技有限公司), incorporated 2012, team active since 2005. We hold a broadcast production and distribution license (湘字第00565号) and operate under Chinese cybersecurity Level 3 certification. Technical documentation and open specifications: github.com/kavanafm.