Local GPU vs Public Cloud for Broadcast AI: the Math We Ran for 500 Stations
By the KAVANA engineering team — June 2026
When we started having the local-versus-cloud conversation with stations in 2023, the standard advice from AI infrastructure consultants was unanimous: use the cloud. Marginal cost pricing, no upfront capital, elasticity, managed infrastructure, pay only for what you use. All of that is accurate. None of it is the right frame for how most broadcast stations actually work.
We have now run cost and capability analyses for approximately 500 stations across the range of Chinese broadcasting — county-level radio, city-level FM, regional television, provincial broadcasting groups. This post explains what we found, the math behind it, where the cloud is genuinely better, and where the numbers point clearly in the other direction.
The Baseline Numbers: a County-Level Station, Three Years
Start with a concrete and representative case: a county-level radio station, approximately 200,000 listeners, two channels, eighteen hours of daily programming, using AI synthesis for news updates, weather, traffic, time calls, and sponsored content. Not a heavy AI user by broadcast standards, but a real production workload.
Estimated daily synthesis volume: approximately 45 minutes of completed audio output, representing around 35,000 characters of synthesized text after editing. That is a typical figure for this station profile — it does not include music, live programming, or pre-recorded content that does not go through AI synthesis.
Alibaba Cloud's published pricing for CosyVoice 3 synthesis (the model we use as our primary cloud pipeline) is approximately 0.035 RMB per 100 characters. At 35,000 characters per day, that is 12.25 RMB per day, or roughly 4,500 RMB per year. At today's exchange rates, that comes to approximately $620 USD per year in synthesis API costs.
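For readers who want to plug in their own volume, the arithmetic above can be sketched in a few lines. The rate and volume are the approximate figures quoted in this post; the function name is ours and purely illustrative.

```python
# Approximate figures quoted in this post; substitute your own.
RMB_PER_100_CHARS = 0.035   # cloud synthesis rate per 100 characters
CHARS_PER_DAY = 35_000      # county-level daily synthesis volume
DAYS_PER_YEAR = 365


def annual_api_cost_rmb(chars_per_day: float,
                        rate_per_100: float = RMB_PER_100_CHARS) -> float:
    """Yearly synthesis API spend for a steady daily character volume."""
    return chars_per_day / 100 * rate_per_100 * DAYS_PER_YEAR


daily_cost = CHARS_PER_DAY / 100 * RMB_PER_100_CHARS
annual_cost = annual_api_cost_rmb(CHARS_PER_DAY)
print(f"daily:  {daily_cost:.2f} RMB")
print(f"annual: {annual_cost:,.0f} RMB")
```

The unrounded annual figure comes out slightly under the 4,500 RMB quoted above, which is rounded up.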
That is the number that sounds low and makes the cloud argument straightforward. But synthesis API cost is not the full picture.
The full cloud picture for this station includes: synthesis API costs ($620/year), inference latency overhead (we will return to this), data transfer costs if audio files are being round-tripped through cloud storage, the operational dependency on internet connectivity, and the fact that the synthesis workload at this station is not evenly distributed — there are peak windows around the morning news cycle and drive time where the demand profile is compressed and latency matters.
The local GPU picture for the same station: an RTX 5090 consumer GPU currently lists at approximately 16,000 RMB ($2,200 USD). Combined with a host system — we recommend the Intel Core Ultra 9 285K platform, which runs around 12,000 RMB ($1,650 USD) with 96 GB RAM and a compatible motherboard — the hardware investment is approximately 28,000 RMB ($3,850 USD). Annual power consumption for the GPU running at typical broadcast synthesis workloads is around 400 kWh, or roughly 300 RMB per year at average Chinese commercial electricity rates. Three-year total cost of local GPU: approximately 29,000 RMB ($4,000 USD) including power, assuming no significant maintenance costs.
Three-year total cost of cloud at this station's volume: approximately 13,500 RMB ($1,860 USD).
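At these numbers, cloud is cheaper for this station profile over three years. The 28,000 RMB of hardware is recovered at roughly 4,200 RMB per year (the cloud bill avoided, minus power), so the local GPU does not pay off until the seventh year, assuming stable pricing.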
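The three-year comparison for the county-level station can be reproduced with a minimal sketch like the one below. It uses only the approximate hardware, power, and API figures quoted above, and it makes the same simplifying assumptions as the text: no maintenance cost on the local side, and no data transfer or storage charges on the cloud side. The function names are illustrative.

```python
# All figures are the approximate ones quoted in this post.
GPU_RMB = 16_000
HOST_RMB = 12_000
POWER_RMB_PER_YEAR = 300
API_RMB_PER_100_CHARS = 0.035
DAYS_PER_YEAR = 365


def local_tco_rmb(years: float) -> float:
    """Hardware up front plus power; maintenance assumed to be zero."""
    return GPU_RMB + HOST_RMB + POWER_RMB_PER_YEAR * years


def cloud_tco_rmb(chars_per_day: float, years: float) -> float:
    """Synthesis API charges only; transfer and storage are ignored."""
    return chars_per_day / 100 * API_RMB_PER_100_CHARS * DAYS_PER_YEAR * years


county_local = local_tco_rmb(3)
county_cloud = cloud_tco_rmb(35_000, 3)
print(f"local 3-year: {county_local:,.0f} RMB")
print(f"cloud 3-year: {county_cloud:,.0f} RMB")
```

The outputs differ from the figures in the text only by rounding.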
Where the Math Flips: Volume and Latency
The county-level station is not where the economics flip. The city-level FM station or the regional broadcasting group is.
Consider a city-level FM station with three channels, producing AI content at four times the county-level volume: 140,000 characters per day across all channels. Three-year cloud cost at the same per-character rate: approximately 54,000 RMB ($7,400 USD). Three-year local GPU cost: still approximately 29,000 RMB, because hardware is a fixed cost and power is a small fraction of the total, while API charges grow in direct proportion to volume.
At city-level FM volumes, local GPU is cheaper than cloud within the first two years. For a provincial broadcasting group running eight channels, the crossover happens faster.
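The crossover point can be estimated by dividing the hardware cost by the net annual saving (cloud bill avoided, minus power). The sketch below does that for the city-level volume, using the approximate figures from this post and assuming stable pricing and a single GPU that can absorb the volume. The function name is illustrative.

```python
HARDWARE_RMB = 28_000          # GPU plus host system
POWER_RMB_PER_YEAR = 300
API_RMB_PER_100_CHARS = 0.035
DAYS_PER_YEAR = 365


def payback_years(chars_per_day: float) -> float:
    """Years until avoided API spend covers the hardware purchase."""
    annual_cloud = chars_per_day / 100 * API_RMB_PER_100_CHARS * DAYS_PER_YEAR
    net_saving = annual_cloud - POWER_RMB_PER_YEAR
    if net_saving <= 0:
        return float("inf")    # cloud never costs more than local power
    return HARDWARE_RMB / net_saving


city_payback = payback_years(140_000)
print(f"city-level payback: {city_payback:.1f} years")
```

Changing the volume argument is a quick way to test other station profiles before running a fuller model.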
The latency dimension adds a different calculation. Cloud synthesis over a typical Chinese broadband connection adds 300 to 800 milliseconds of round-trip latency before the synthesis computation even begins. For pre-produced content that runs through the synthesis pipeline hours before air, this latency is irrelevant. For content that is synthesized close to air time — traffic updates produced twelve minutes before a drive-time slot, breaking news synthesized and cleared for a news window — that latency matters.
Local GPU synthesis in our production deployments runs at RTF (real-time factor) values between 0.06 and 0.12 on a current GPU — meaning a 60-second audio segment synthesizes in 4 to 7 seconds. There is no network round-trip, no API authentication overhead, no queuing behind other customers' requests. For time-sensitive content production, local inference is not just cheaper at high volumes — it is faster in practice, and consistently so.
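Real-time factor is simply synthesis time divided by audio duration, so the figures above translate directly into wall-clock estimates. The sketch below applies the quoted RTF range to a 60-second segment and to the 45 minutes of daily output from the county-level example. It ignores model loading, retakes, and queuing, so treat the results as lower bounds.

```python
def synthesis_seconds(audio_seconds: float, rtf: float) -> float:
    """Wall-clock synthesis time for a clip at a given real-time factor."""
    return audio_seconds * rtf


fast = synthesis_seconds(60, 0.06)            # best case for a 60 s segment
slow = synthesis_seconds(60, 0.12)            # worst case for a 60 s segment
daily_gpu_seconds = synthesis_seconds(45 * 60, 0.12)

print(f"60 s segment: {fast:.1f} to {slow:.1f} s")
print(f"45 min of daily output: {daily_gpu_seconds / 60:.1f} min of GPU time")
```

Even at the slower end of the range, a county-level day of synthesis occupies the GPU for only a few minutes.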
Non-Cost Considerations That Actually Change the Decision
The cost analysis is the one that gets the attention in budget discussions. It is not the only dimension that matters for broadcast stations, and for some stations it is not the most important one.
Connectivity dependency. A cloud synthesis pipeline requires working internet connectivity to produce content. For county-level stations — which make up the bulk of our customer base and which are often located in areas with less reliable connectivity than city-center facilities — a cloud-only synthesis architecture means that connectivity interruptions also interrupt content production. Local GPU synthesis has no network dependency. The synthesis pipeline runs whether or not the internet connection is working.
This is not a theoretical concern. We have customers in mountainous regions of Yunnan and Tibet where the internet connection at the broadcast facility is a 4G link with meaningful outage rates. For those stations, cloud synthesis is not a viable primary architecture regardless of the cost analysis.
Data residency and content security. Broadcast content — particularly news content, traffic data, and content involving named individuals or organizations — is subject to data residency requirements under Chinese cybersecurity law. Content that is sent to a cloud API is, by definition, leaving the local facility. For most routine broadcast content, this is not a practical issue. For content categories that touch sensitive information, local synthesis eliminates the question.
Regulatory inspection readiness. Broadcasting organizations in China operate under content oversight requirements that include the ability to demonstrate where content was processed and stored. Local synthesis provides a clean answer to this question. Cloud synthesis requires documentation of the data processing chain that can be more complex to produce on demand.
Emergency operation. Our sub-second failover architecture is designed to maintain broadcast continuity through local failures. Local GPU synthesis fits this architecture naturally — the synthesis capability stays online through the same conditions that might interrupt internet connectivity. A cloud synthesis dependency creates a gap in the failover logic.
The Hardware We Actually Recommend and Why
We have evaluated multiple GPU configurations for broadcast synthesis workloads. Our current standard recommendation is the RTX 5090 with Intel Core Ultra 9 285K, and the rationale is worth explaining because it is not obvious from the spec sheet.
The RTX 5090 has 32 GB of GDDR7 VRAM. For our production synthesis stack — which runs CosyVoice 2 (our OmniVoice pipeline) alongside Kokoro for secondary synthesis and a lightweight ASR model for content verification — the VRAM requirement with all models loaded is approximately 18 to 22 GB. The 5090 provides comfortable headroom for running the full synthesis stack plus the content review models simultaneously, without model swapping. The RTX 4090 (24 GB VRAM) works for most configurations but is tight when running concurrent synthesis and ASR. Earlier generations require model swapping that adds latency.
The Intel Core Ultra 9 285K is a P+E hybrid architecture with 24 cores. The relevant characteristic for our workload is the large L3 cache and the fast single-thread performance. Synthesis model inference is GPU-bound, but the scheduling overhead — managing multiple synthesis requests, coordinating the content review pipeline, running the broadcast playout process simultaneously — benefits from the CPU architecture. We have also tested AMD Ryzen 9 9950X in this role; performance is similar and it is a viable alternative.
The 96 GB RAM specification is not about current synthesis requirements. It is headroom for the AI document processing and mixing utilities that increasingly run alongside synthesis — script generation, content analysis, news summarization. A station that starts with synthesis and later adds AI script generation will be glad for the memory headroom.
This hardware also runs the Kokoro TTS pipeline, our fast-path synthesis option for content that needs to synthesize in under two seconds rather than the four to seven seconds of the full-quality pipeline. Kokoro's 97-voice library covers the range of voice types needed for a multiformat station.
When Cloud Is Genuinely the Right Answer
We should be honest about where cloud synthesis makes more sense than local GPU.
A station with very low AI synthesis volume — a community station that uses AI voice only for after-hours filler programming and produces less than 10,000 characters per day — is not going to recover the cost of local GPU hardware through synthesis savings. For that station profile, cloud API pricing is genuinely cheaper on a total cost basis, likely indefinitely.
A station that is just beginning to experiment with AI voice and wants to pilot the workflow before committing to infrastructure investment should start with the cloud pipeline. We support both paths and the workflow is identical — switching from cloud to local synthesis is a configuration change, not an architectural change.
A production house — not a broadcaster, but an organization producing audio content for distribution through third parties — may have a different calculus. If the organization does not run continuous broadcast operations, the capital cost of GPU hardware does not amortize the same way it does for a 24/7 broadcast facility.
We also use cloud synthesis (primarily Alibaba Cloud CosyVoice 3) as the fallback path in our production stack. When local GPU synthesis is unavailable — hardware maintenance, a GPU failure that has not yet been addressed — the pipeline routes to the cloud API automatically. The cloud API is also the right choice for voice cloning that uses MiniMax's multi-speaker model, which we do not run locally.
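The local-first, cloud-fallback routing described above can be sketched as an ordered list of backends tried in turn. This is an illustration of the pattern, not KAVANA's production code: the exception class, the function names, and the two stand-in backends are all hypothetical.

```python
from typing import Callable, Sequence, Tuple


class SynthesisUnavailable(Exception):
    """Raised by a backend that cannot serve the request right now."""


Backend = Callable[[str], bytes]


def synthesize_with_fallback(
    text: str, backends: Sequence[Tuple[str, Backend]]
) -> Tuple[str, bytes]:
    """Try each backend in order; return the name of the one that worked."""
    errors = []
    for name, backend in backends:
        try:
            return name, backend(text)
        except SynthesisUnavailable as exc:
            errors.append(f"{name}: {exc}")
    raise SynthesisUnavailable("; ".join(errors))


# Hypothetical stand-ins for the real local and cloud synthesis calls.
def local_gpu(text: str) -> bytes:
    raise SynthesisUnavailable("GPU offline for maintenance")


def cloud_api(text: str) -> bytes:
    return f"<audio for {len(text)} chars>".encode()


used, audio = synthesize_with_fallback(
    "Traffic update", [("local", local_gpu), ("cloud", cloud_api)]
)
print(used, audio)
```

Keeping the backend order in a plain list is one way to make the switch between cloud-first and local-first a configuration change rather than a code change.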
Cross-Business Reuse: the Argument That Changes the Calculus
The cost analysis above treats the GPU as dedicated broadcast synthesis infrastructure. That framing undervalues the hardware.
A broadcast facility that installs a GPU for AI synthesis has hardware capable of running inference workloads for multiple departments simultaneously. In practice, our customers who have deployed local GPU infrastructure use it for:
Television production. The same OmniVoice synthesis pipeline that produces radio news audio produces narration for documentary and news video packages. The same ASR pipeline that does broadcast content verification does post-production transcription and subtitle generation. These workloads were previously outsourced to cloud services or done manually; they run on the same hardware as the broadcast synthesis.
Digital media and social channels. A provincial broadcasting group operates broadcast channels alongside WeChat, Weibo, and Douyin channels that require their own content production cadence. Short-form audio and video content for social channels uses the same synthesis and mixing tools as broadcast content. The AI production utilities in KAVANA are explicitly designed to produce content for multiple distribution formats from a single production workflow.
Newsroom AI tools. Journalists at stations with local GPU infrastructure use the AI tooling for draft generation, background research assistance, and multilingual translation of source material. These workloads are relatively lightweight in GPU terms but they are continuous throughout the working day and they produce value that is separate from broadcast synthesis.
Archive processing. Several of our customers have used local GPU inference to run retrospective processing on years of archived broadcast audio — transcription, content indexing, voice identification. This is a batch workload that runs during off-hours and uses GPU capacity that would otherwise sit idle.
When the hardware cost is amortized across these multiple use cases, the per-workload cost comes down significantly. A GPU that would take four years to pay off on broadcast synthesis alone pays off in two years when it is also doing television narration, social content production, and newsroom AI assistance. The question for a broadcasting organization evaluating local GPU investment should not be "can broadcast synthesis alone justify this hardware?" It should be "what is the full set of AI workloads across our organization, and what does it cost to handle all of them at scale through cloud APIs versus running them locally?"
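The amortization argument reduces to dividing one hardware cost by the sum of the cloud spend it displaces across departments. In the sketch below, the per-workload savings are invented placeholder values, chosen only so that the output reproduces the four-year and two-year example above. They are not measurements from any customer.

```python
from typing import Dict


def payback_years(hardware_rmb: float, annual_savings_rmb: Dict[str, float]) -> float:
    """Years to recover the hardware cost from all displaced cloud spend."""
    total = sum(annual_savings_rmb.values())
    if total <= 0:
        return float("inf")
    return hardware_rmb / total


HARDWARE_RMB = 28_000

# Placeholder annual savings per workload (illustrative, not measured).
synthesis_only = {"broadcast_synthesis": 7_000}
shared = {
    "broadcast_synthesis": 7_000,
    "tv_narration": 3_000,
    "social_content": 2_500,
    "newsroom_tools": 1_500,
}

alone = payback_years(HARDWARE_RMB, synthesis_only)
combined = payback_years(HARDWARE_RMB, shared)
print(f"synthesis alone:  {alone:.1f} years")
print(f"shared workloads: {combined:.1f} years")
```

An organization running this for itself would replace the placeholder dictionary with its actual current spend on each outsourced or cloud-hosted workload.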
The API Comparison: What We Are Actually Competing With
We are sometimes asked to compare our local GPU economics directly against OpenAI API, Azure AI, and Alibaba Cloud inference costs for equivalent workloads.
For synthesis specifically, the comparison is straightforward. OpenAI TTS and Azure Speech synthesis are more expensive than Alibaba Cloud CosyVoice for high-volume Chinese broadcast synthesis, and neither is optimized for broadcast prosody. We use Alibaba Cloud as our cloud pipeline precisely because the synthesis quality and price point are better for our use case.
The more interesting comparison is for the language model workloads — script generation, news summarization, content analysis — that run alongside synthesis. At the volume a medium-sized broadcasting group generates, calling GPT-4 or Claude for every script draft is expensive. Running a local Qwen model on the same GPU hardware that does synthesis is not. The cross-workload reuse argument applies here as well: a GPU that runs synthesis can also run a capable local language model for the text generation workloads, at zero marginal API cost.
The honest competitive position is this: for a station evaluating cloud APIs only for synthesis, at low to medium volumes, the cloud APIs win on cost and operational simplicity. For a station evaluating the full AI workload across synthesis, script generation, content analysis, and newsroom tools, at the volumes that a functional broadcast operation generates, local GPU infrastructure has a strong total-cost-of-ownership case and significant operational advantages that the per-character API pricing does not capture.
The KAVANA AI infrastructure page covers the system architecture and the AI utilities documentation covers the production workloads in more detail. We are also happy to run the cost model for a specific station profile — email international@kavanafm.com with your station's format, channel count, and approximate daily synthesis volume and we will give you the honest numbers, including the scenarios where the analysis does not favor local deployment.
The Practical Decision Framework
After 500 station analyses, the pattern is clear enough to summarize in terms that do not require running the full model.
Local GPU infrastructure is likely the right choice if: the station's sustained AI synthesis volume is well above the county-level baseline (at the rates used above, synthesis-only break-even is roughly 75,000 characters per day over three years and roughly 46,000 over five); the station has meaningful internet connectivity concerns; the station operates under data residency or regulatory requirements that are cleaner to satisfy with local processing; or the organization has multiple AI workloads across departments that can share the infrastructure.
Cloud API synthesis is likely the right choice if: the station is piloting AI synthesis and not ready to commit to infrastructure; the synthesis volume is low and the cloud cost is manageable; the station has reliable internet connectivity and no data residency requirements; or the organization's IT structure makes capital infrastructure purchases difficult but ongoing SaaS costs easier.
The cases where the answer is genuinely ambiguous — moderate volume, reliable internet, no specific compliance drivers — benefit from modeling the specific situation. The five-year total cost of ownership comparison usually produces a clear recommendation; the three-year comparison sometimes does not.
The technical documentation for the local GPU deployment path is at github.com/kavanafm/kavana-docs-en. The documentation covers hardware configuration, software deployment, network requirements, and integration with the broadcast automation system. We publish it openly because we think stations making infrastructure decisions deserve access to the technical details before they commit, not after.
KAVANA is developed by Hunan ShengGuang Technology Co., Ltd. (湖南声广科技有限公司), incorporated 2012, team active since 2005. We hold a broadcast production and distribution license (湘字第00565号) and operate under Chinese cybersecurity Level 3 certification. Technical documentation and open specifications: github.com/kavanafm.