Success story
Google Cloud and Red Hat help digital media provider reduce AI costs with hardware flexibility
Industry:
Media and technology
Region:
Global/Multi-region
Headquarters:
Mountain View, California, USA
Size:
180K+
Overview
Google Cloud offers a fully integrated and optimized AI platform at scale, including custom-built chips, generative AI models, a development platform, and AI-powered applications. Google Cloud was recognized as the AI Visionary Partner of the Year in the 2026 Red Hat® Ecosystem Innovation Awards.
When a global digital media technology platform needed to increase efficiency for its trust and safety workloads, it turned to Google Cloud and Red Hat Professional Services. The team established a solution that provides the flexibility to switch between graphics processing units (GPUs) and Google Cloud’s tensor processing units (TPUs), achieving faster performance with TPUs. Using TPUs also lowers costs, with financial savings of 92% for running safety workloads and 62% for running gen AI workloads. These cost and efficiency benefits help the customer protect users and maintain trust while delivering faster response times that enhance the user experience.
Challenge
Running trust and safety systems more efficiently and at lower cost
Trust and safety systems are an essential requirement for today’s digital platforms, where every user interaction must be evaluated in real time to prevent harm, safeguard compliance, and maintain user trust.
As a global digital media and technology platform provider, the customer needed a scalable inference solution to power AI-driven content analysis and support its trust and safety protocols. To ensure an almost-instant response, the company’s safety systems must scan global user queries within a strict latency service-level objective (SLO) of less than 50 milliseconds. Under pressure to launch faster and more cheaply worldwide, the customer was keen to mitigate the risks associated with graphics processing unit (GPU) shortages and reduce operational costs. It needed a solution that would reduce its reliance on specific hardware while maintaining high performance for large language models (LLMs).
Solution
Optimizing AI workloads across hardware
The customer worked with Google Cloud and Red Hat to establish a solution using the virtual large language model (vLLM) inference engine on the latest Google Cloud TPUs. Designed by Google specifically for neural network machine learning, TPUs provide a faster, more efficient alternative to GPUs. At the same time, vLLM provides the high-throughput inference serving engine the team needed to meet the customer’s strict latency SLOs. The solution uses vLLM with Ray, an open source distributed computing framework, as the orchestration layer to support scalable online serving and batch inference.
The team decided to work with Red Hat because it is a major contributor to the open source vLLM project and has integrated it into its product portfolio. The adoption strategy included benchmarking TPU performance against existing GPU setups. The team optimized low-level system code, which resulted in 400% faster performance for small inputs. The exercise also showed that moving from GPUs to TPUs was straightforward with Google Kubernetes Engine: the team simply had to update configuration settings and use a vLLM TPU image.
Software & services used:
Google Cloud
Red Hat Professional Services
Business outcome
Reducing costs while increasing AI performance
Thanks to the project with Google Cloud and Red Hat, the customer can now run safety and trust workloads within its strict latency SLOs. “Faster performance means better user experiences,” said Brittany Rockwell, Senior Product Manager, Google Cloud. “We demonstrated for the customer that using TPUs for its trust and safety workloads not only increases speed, but also significantly reduces costs.”
For safety workloads that mainly process incoming queries, the solution reduces costs by 92% using TPUs compared to GPU hardware, while also running 400% faster. For latency-sensitive gen AI features, the solution reduces costs by 62% compared to using GPUs. The system is both fast and cost-efficient at processing large-scale data inputs, with batch processing for entity mapping achieving a cost of only US$0.48 per 1 million tokens at a throughput of 14,000 tokens per second. The customer plans to provision TPUs within its existing clusters over the next six months and is continuing to optimize performance for typical workloads.
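To put those batch-processing figures in perspective, a quick back-of-envelope calculation, using only the numbers quoted above, shows what sustained throughput at that price implies per hour of processing:

```python
# Back-of-envelope check of the reported batch-processing economics.
# Figures quoted in the story: US$0.48 per 1 million tokens
# at a sustained throughput of 14,000 tokens per second.
COST_PER_MILLION_TOKENS = 0.48   # USD
THROUGHPUT_TOKENS_PER_SEC = 14_000

# Tokens processed in one hour of sustained throughput.
tokens_per_hour = THROUGHPUT_TOKENS_PER_SEC * 3_600   # 50,400,000 tokens

# Corresponding hourly cost at the quoted per-million-token rate.
cost_per_hour = tokens_per_hour / 1_000_000 * COST_PER_MILLION_TOKENS

print(f"{tokens_per_hour:,} tokens per hour")   # 50,400,000 tokens per hour
print(f"${cost_per_hour:.2f} per hour")         # $24.19 per hour
```

In other words, at the reported rate an hour of continuous batch inference covers roughly 50 million tokens for about $24, which is the kind of unit economics behind the 92% and 62% savings cited above.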
Related resources
Microsoft Azure Red Hat® OpenShift® powers scalable generative AI at Banco Bradesco
Capgemini helps banks modernize faster with a blueprint based on Red Hat OpenShift
One Technology maximizes government efficiency through strategic IT automation
Everpure helps manufacturer deliver apps 3x faster with unified platform for VMs and containers
Logicalis Spain helps Piñero safeguard customer experiences with Red Hat Cloud Services
Open source fuels innovation. This fact is exemplified best by Red Hat’s customers, who are using open source technologies to change the game. We’re proud to call them "innovators in the open" and share their stories.