We’ve heard all the love for the vLLM meetups and we’re excited to announce that the next one is happening in New York City on May 7th! We’ll be hosting it at IBM One Madison Avenue, and we can’t wait to see you there. This is your heads-up to mark your calendars! As a subscriber to our newsletter, you’ll get first access to the registration page before it goes public. Keep an eye out for another email next week with all the details and the signup link.
We’re also planning to bring vLLM meetups to more cities and we want your input. Where should we go next? Let us know!
Bi-weekly vLLM office hours
Upcoming
vLLM Office Hours #23: Deep Dive Into the LLM Compressor
April 10, 2025 - 2:00PM ET / 11:00AM PT
vLLM Office Hours #24: Performance Optimization of vLLM on Google TPUs
April 14, 2025 - 2:00PM ET / 11:00AM PT
Recordings you don’t want to miss
Introduction to vLLM V1 | Video | Slides | Blog
DeepSeek and vLLM | Video | Slides | Blog
Multimodal LLMs With vLLM V1 | Video | Slides | Blog
Blog highlights
Meet vLLM: For faster, more efficient LLM inference and serving
Have you ever wondered how AI-powered applications like chatbots, code assistants and more respond so quickly? Or perhaps you’ve experienced the frustration of waiting for a large language model (LLM) to generate a response, wondering what’s taking so long. Well, behind the scenes, there’s an open source project aimed at making inference, or responses from models, more efficient.
3.5X Faster vision-language models with quantization
Vision-Language Models (VLMs), such as the Pixtral and Qwen-VL series, are trained to generate text from image and text inputs. With the expanded input types and the performance of large language models, they enable accurate and promising new use cases such as content moderation, image captioning and tagging, visual question answering, and document extraction/analysis, among others.
vLLM V1: Accelerating multimodal inference for large language models
In this article, we dive into the innovations behind vLLM V1 (V1 Alpha), which addresses the challenges of multimodal inference encountered in V0. We’ll explore the design decisions that enhance performance, from encoder caching to optimized data processing, and share benchmark results that highlight the improvements. Finally, we’ll outline our vision for future work to further push the boundaries of efficient, scalable AI.
How we optimized vLLM for DeepSeek-R1
DeepSeek and vLLM optimizations have been a top priority for our team and the vLLM community as a whole, and we are excited to share a deep dive into our work. In this article, we will cover the key inference improvements we have made, detail the integration of DeepSeek’s latest advancements into vLLM, and discuss how we are scaling DeepSeek-R1 for real-world deployment. Additionally, we will review the various open source contributions from DeepSeek and outline our roadmap for integrating them into vLLM.
Multimodal model quantization support through LLM Compressor
LLM Compressor is a unified library for optimizing models for deployment with vLLM. As of its 0.4.0 release, LLM Compressor now supports multimodal model quantization, enabling efficient compression of vision-language and audio models with the most popular quantization formats.
Unleash the full potential of LLMs: Optimize for performance with vLLM
Large language models (LLMs) are transforming industries, from customer service to cutting-edge applications, unlocking vast opportunities for innovation. Yet, their potential comes with a catch: high computational costs and complexity.
Research from our labs
We recently launched AI Research Hub, a destination for all research from Red Hat and Neural Magic labs. We plan to post all our research papers, research blogs, and accompanying code to this new location, so please bookmark it! Here are three papers we are currently featuring on the new page:
- A probabilistic inference approach to inference-time scaling of LLMs using particle-based Monte Carlo methods
  arXiv | Code
- GPTQ: Accurate post-training quantization for generative pre-trained transformers
  arXiv | Code
- Unveiling the secret recipe: A guide for supervised fine-tuning small LLMs
  arXiv
Want to join our Friday discussions on cutting-edge AI research? Reply to this email and let us know!
Stay engaged with the vLLM community
vLLM is nearing 44,000 stars! Be sure to add your star and join the community. Thank you for your support.
About the author
Saša Zelenović is a Principal Product Marketing Manager at Red Hat. He joined in 2025 through the Neural Magic acquisition, where he was Head of Marketing. With a passion for developer-focused marketing, Saša drives efforts to help developers compress models for inference and deploy them with vLLM. He co-hosts the bi-weekly vLLM Office Hours, a go-to spot for insights and community around all things vLLM.