vLLM roundup March 2025

April 11, 2025 | Saša Zelenović | 3-minute read

We’ve heard all the love for the vLLM meetups and we’re excited to announce that the next one is happening in New York City on May 7th! We’ll be hosting it at IBM One Madison Avenue, and we can’t wait to see you there. This is your heads-up to mark your calendars! As a subscriber to our newsletter, you’ll get first access to the registration page before it goes public. Keep an eye out for another email next week with all the details and the signup link.

We’re also planning to bring vLLM meetups to more cities and we want your input. Where should we go next? Let us know!

Submit a city

Bi-weekly vLLM office hours

Upcoming

vLLM Office Hours #23: Deep Dive Into the LLM Compressor

April 10, 2025 - 2:00PM ET / 11:00AM PT

vLLM Office Hours #24: Performance Optimization of vLLM on Google TPUs

April 14, 2025 - 2:00PM ET / 11:00AM PT

Recordings you don’t want to miss

Introduction to vLLM V1 | Video | Slides | Blog

View all recordings

Blog highlights

Meet vLLM: For faster, more efficient LLM inference and serving

Have you ever wondered how AI-powered applications like chatbots, code assistants and more respond so quickly? Or perhaps you’ve experienced the frustration of waiting for a large language model (LLM) to generate a response, wondering what’s taking so long. Well, behind the scenes, there’s an open source project aimed at making inference, or responses from models, more efficient.

Keep reading

3.5X Faster vision-language models with quantization

Vision-Language Models (VLMs), such as the Pixtral and Qwen-VL series, are trained to generate text from image and text inputs. With the expanded input types and the performance of large language models, they enable accurate and promising new use cases such as content moderation, image captioning and tagging, visual question answering, and document extraction/analysis, among others.

Keep reading

vLLM V1: Accelerating multimodal inference for large language models

In this article, we dive into the innovations behind vLLM V1 (V1 Alpha), which addresses the challenges of multimodal inference encountered in V0. We’ll explore the design decisions that enhance performance, from encoder caching to optimized data processing, and share benchmark results that highlight the improvements. Finally, we’ll outline our vision for future work to further push the boundaries of efficient, scalable AI.

Keep reading

How we optimized vLLM for DeepSeek-R1

DeepSeek and vLLM optimizations have been a top priority for our team and the vLLM community as a whole, and we are excited to share a deep dive into our work. In this article, we will cover the key inference improvements we have made, detail the integration of DeepSeek’s latest advancements into vLLM, and discuss how we are scaling DeepSeek-R1 for real-world deployment. Additionally, we will review the various open source contributions from DeepSeek and outline our roadmap for integrating them into vLLM.

Keep reading

Multimodal model quantization support through LLM Compressor

LLM Compressor is a unified library for optimizing models for deployment with vLLM. As of its 0.4.0 release, LLM Compressor now supports multimodal model quantization, enabling efficient compression of vision-language and audio models with the most popular quantization formats.

Keep reading

Unleash the full potential of LLMs: Optimize for performance with vLLM

Large language models (LLMs) are transforming industries, from customer service to cutting-edge applications, unlocking vast opportunities for innovation. Yet, their potential comes with a catch: high computational costs and complexity.

Keep reading

Research from our labs

We recently launched AI Research Hub, a destination for all research from Red Hat and Neural Magic labs. We plan to post all our research papers, research blogs, and accompanying code to this new location, so please bookmark it! Here are three papers we are currently featuring on the new page:

A probabilistic inference approach to inference-time scaling of LLMs using particle-based Monte Carlo methods
arXiv | Code
GPTQ: Accurate post-training quantization for generative pre-trained transformers
arXiv | Code
Unveiling the secret recipe: A guide for supervised fine-tuning small LLMs
arXiv

Want to join our Friday discussions on cutting-edge AI research? Reply to this email and let us know!

Stay engaged with the vLLM community

vLLM is nearing 44,000 stars! Be sure to add your star and join the community. Thank you for your support.

About the author

Saša Zelenović

Principal Product Marketing Manager

Saša Zelenović is a Principal Product Marketing Manager at Red Hat. He joined in 2025 through the Neural Magic acquisition, where he was Head of Marketing. With a passion for developer-focused marketing, Saša drives efforts to help developers compress models for inference and deploy them with vLLM. He co-hosts the bi-weekly vLLM Office Hours, a go-to spot for insights and community around all things vLLM.