Artificial Intelligence

Performance research papers

Featured research

A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods

Large language models (LLMs) have improved through larger models and data, but gains are slowing. Recent methods scale computation at inference using reward models, treating it as a search problem prone to reward hacking. This paper reframes it as probabilistic inference, using sampling to explore state distributions in a state-space model with approximate likelihood. It introduces a new approach adapting particle-based Monte Carlo methods, achieving 4-16x better scaling than deterministic search on math reasoning tasks. Qwen2.5-Math-1.5B-Instruct beats GPT-4o in 4 rollouts, and Qwen2.5-Math-7B-Instruct reaches o1 accuracy in 32 rollouts. This links probabilistic inference to LLM scaling.
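
To make the particle-based idea concrete, here is a minimal, hypothetical sketch of a particle filter over partial generations. The callables `extend` (samples one more reasoning step from the LLM) and `score_step` (an approximate likelihood from a process reward model) are placeholders, not the paper's actual API.

```python
import random

def particle_filter_decode(prompt, extend, score_step, n_particles=8, n_steps=6):
    """Particle-filtering sketch for inference-time scaling.

    `extend(state)` samples one more reasoning step from the LLM and
    `score_step(state)` returns an approximate likelihood from a reward
    model; both are hypothetical placeholders, not the paper's API.
    """
    particles = [prompt] * n_particles
    for _ in range(n_steps):
        # Propagate: sample the next reasoning step for every particle.
        particles = [extend(p) for p in particles]
        # Weight: score each partial solution with the reward model.
        weights = [score_step(p) for p in particles]
        total = sum(weights) or 1.0
        # Resample: keep promising particles, drop weak ones.
        particles = random.choices(particles,
                                   weights=[w / total for w in weights],
                                   k=n_particles)
    return max(particles, key=score_step)
```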


Github 


All research papers


"Give Me BF16 or Give Me Death?" Accuracy-Performance Trade-Offs in LLM Quantization

Despite the popularity of large language model (LLM) quantization for inference acceleration, significant uncertainty remains regarding the accuracy-performance trade-offs associated with various quantization formats. We present a comprehensive empirical study of quantized accuracy, evaluating popular quantization formats (FP8, INT8, INT4) across academic benchmarks and real-world tasks, on the entire Llama-3.1 model family. Additionally, our study examines the difference in text generated by quantized models versus their uncompressed counterparts. Beyond benchmarks, we also present a couple of quantization improvements which allowed us to obtain state-of-the-art accuracy recovery results.
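
As a rough illustration of the kind of weight format under study, the sketch below applies symmetric absmax INT8 quantization to a weight tensor. Real FP8/INT8/INT4 deployments use per-channel or per-group scales, calibration data, and activation/KV-cache quantization, none of which are shown here.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor INT8 quantization with absmax scaling.

    Simplified illustration only; production formats use finer-grained
    scales and calibration.
    """
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 8).astype(np.float32)
q, s = quantize_int8(w)
print("max abs error:", np.abs(w - dequantize(q, s)).max())
```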

On the Complexity of Neural Computation in Superposition

This paper explores the theoretical foundations of computing in superposition within neural networks, focusing on explicit, provably correct algorithms and their efficiency. Our results demonstrate that for a broad class of problems, including permutations and pairwise logical operations, a neural network computing in superposition requires a significant number of parameters and neurons. We establish that any sparse sub-network must have a considerable number of parameters, irrespective of the original dense network size. We present an upper bound showing that pairwise logical operations, such as AND, can be computed with a relatively small number of neurons and parameters.

Sparse Finetuning for Inference Acceleration of Large Language Models

We consider the problem of accurate sparse finetuning of large language models (LLMs), that is, finetuning pretrained LLMs on specialized tasks, while inducing sparsity in their weights. On the accuracy side, we observe that standard loss-based finetuning may fail to recover accuracy, especially at high sparsities. To address this, we perform a detailed study of distillation-type losses, determining an L2-based distillation approach we term SquareHead which enables accurate recovery even at higher sparsities, across all model types.
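
A minimal PyTorch sketch of an L2-based intermediate-representation distillation loss in the spirit of SquareHead is shown below; the per-layer normalization and the choice of layers are assumptions made for illustration, not the paper's exact formulation.

```python
import torch

def squarehead_style_loss(student_feats, teacher_feats, eps=1e-6):
    """Per-layer MSE between student and teacher hidden states,
    normalized by the teacher's magnitude (a sketch, not the exact loss)."""
    loss = 0.0
    for hs, ht in zip(student_feats, teacher_feats):
        loss = loss + torch.mean((hs - ht) ** 2) / (torch.mean(ht ** 2) + eps)
    return loss / len(student_feats)

# Example with dummy hidden states from a 4-layer model.
teacher = [torch.randn(2, 16, 64) for _ in range(4)]
student = [t + 0.1 * torch.randn_like(t) for t in teacher]
print(squarehead_style_loss(student, teacher))
```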

Github

SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot

We show for the first time that large-scale generative pretrained transformer (GPT) family models can be pruned to at least 50% sparsity in one-shot, without any retraining, at minimal loss of accuracy. This is achieved via a new pruning method called SparseGPT, specifically designed to work efficiently and accurately on massive GPT-family models. When executing SparseGPT on the largest available open-source models, OPT-175B and BLOOM-176B, we can reach 60% sparsity with negligible increase in perplexity: remarkably, more than 100 billion weights from these models can be ignored at inference time. SparseGPT generalizes to semi-structured (2:4 and 4:8) patterns, and is compatible with weight quantization approaches.
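
For intuition, the sketch below performs one-shot unstructured and 2:4 semi-structured pruning with a simple magnitude criterion; SparseGPT itself selects and updates weights using approximate second-order information, which is not reproduced here.

```python
import numpy as np

def prune_unstructured(w: np.ndarray, sparsity: float = 0.5) -> np.ndarray:
    """One-shot unstructured pruning: zero out the smallest-magnitude weights."""
    k = int(w.size * sparsity)
    thresh = np.sort(np.abs(w), axis=None)[k]
    return np.where(np.abs(w) >= thresh, w, 0.0)

def prune_2_4(w: np.ndarray) -> np.ndarray:
    """Semi-structured 2:4 pruning: keep the 2 largest of every 4 weights
    (assumes the weight matrix size is a multiple of 4)."""
    groups = w.copy().reshape(-1, 4)
    smallest = np.argsort(np.abs(groups), axis=1)[:, :2]  # two smallest per group
    np.put_along_axis(groups, smallest, 0.0, axis=1)
    return groups.reshape(w.shape)

w = np.random.randn(64, 64)
print((prune_unstructured(w) == 0).mean(), (prune_2_4(w) == 0).mean())  # ~0.5, 0.5
```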

Github

Sparse Expansion and Neuronal Disentanglement

We show how to improve the inference efficiency of an LLM by expanding it into a mixture of sparse experts, where each expert is a copy of the original weights, one-shot pruned for a specific cluster of input values. We call this approach Sparse Expansion. We show that, for models such as Llama 2 70B, as we increase the number of sparse experts, Sparse Expansion outperforms all other one-shot sparsification approaches for the same inference FLOP budget per token, and that this gap grows as sparsity increases, leading to inference speedups. But why? To answer this, we provide strong evidence that the mixture of sparse experts is effectively disentangling the input-output relationship of every individual neuron across clusters of inputs.

Github

Sparse*BERT: Sparse Models are Robust

This paper studies how models pruned using Gradual Unstructured Magnitude Pruning can transfer between domains and tasks. Our experimentation shows that models that are pruned during pretraining using general domain masked language models can transfer to novel domains and tasks without extensive hyperparameter exploration or specialized approaches. We demonstrate that our general sparse model Sparse*BERT can become SparseBioBERT simply by pretraining the compressed architecture on unstructured biomedical text. Moreover, we show that SparseBioBERT can match the quality of BioBERT with only 10% of the parameters.

The Optimal BERT Surgeon: Scalable and Accurate Second-Order Pruning for Large Language Models

We show how you can compound multiple sparsification techniques to compress transformer-based NLP models for better inference performance. Results: 10x model-size compression with < 1% relative accuracy drop versus dense BERT-base, 10x end-to-end CPU-inference speedup with < 2% relative accuracy drop, and 29x inference speedup with < 7.5% relative accuracy drop.

 

How Well Do Sparse Imagenet Models Transfer?

In a nutshell, our study shows that sparse models can match or even outperform the transfer performance of dense models, even at high sparsities, and, while doing so, can lead to significant inference and even training speedups.

M-FAC: Efficient Matrix-Free Approximations of Second-Order Information

We propose two new algorithms as part of a framework called M-FAC. These two algorithms yield state-of-the-art results for network pruning and optimization with lower computational overhead relative to existing second-order methods.

Github

Asynchronous Decentralized SGD with Quantized and Local Updates

We show that a variant of SGD called SwarmSGD still converges in this setting, even if non-blocking communication, quantization, and local steps are all applied in conjunction, and even if the node data distributions and underlying graph topology are both heterogeneous. We implement this algorithm and deploy it in a super-computing environment, showing that it can outperform previous decentralized methods in terms of end-to-end training time, and that it can even rival carefully-tuned large-batch SGD for certain tasks.

 

AC/DC: Alternating Compressed / DeCompressed Training of Deep Neural Networks

Existing sparse training methods are mainly empirical and often have lower accuracy relative to the dense baseline. We present a general approach called Alternating Compressed/DeCompressed (AC/DC) training of DNNs, demonstrate convergence for a variant of the algorithm, and show that AC/DC outperforms existing sparse training methods in accuracy at similar computational budgets; at high sparsity levels, AC/DC even outperforms existing methods that rely on accurate pre-trained dense models.
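
A heavily simplified sketch of the alternating idea follows: training switches between compressed phases, where a top-magnitude mask is enforced, and decompressed phases, where all weights train densely. The phase lengths, warm-up, and final compressed phase of the actual AC/DC schedule are omitted; `data_iter` is assumed to be an iterator over training batches.

```python
import itertools
import torch

def topk_mask(w, sparsity):
    """Boolean mask keeping the largest-magnitude (1 - sparsity) fraction."""
    k = max(1, int(w.numel() * (1 - sparsity)))
    thresh = torch.topk(w.abs().flatten(), k).values.min()
    return w.abs() >= thresh

def acdc_style_training(model, optimizer, loss_fn, data_iter,
                        sparsity=0.9, steps_per_phase=100, num_phases=4):
    """Alternate dense (decompressed) and masked (compressed) phases."""
    for phase in range(num_phases):
        compressed = phase % 2 == 1                  # start dense, then alternate
        masks = ({n: topk_mask(p, sparsity) for n, p in model.named_parameters()}
                 if compressed else None)
        for x, y in itertools.islice(data_iter, steps_per_phase):
            if compressed:                           # keep pruned weights at zero
                with torch.no_grad():
                    for n, p in model.named_parameters():
                        p.masked_fill_(~masks[n], 0.0)
            optimizer.zero_grad()
            loss_fn(model(x), y).backward()
            optimizer.step()
```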

Github

Towards Tight Communication Lower Bounds for Distributed Optimization

We consider a standard distributed optimisation setting where N machines, each holding a d-dimensional function f_i, aim to jointly minimise the sum of the functions, ∑_{i=1}^{N} f_i(x). This problem arises naturally in large-scale distributed optimisation, where a standard solution is to apply variants of (stochastic) gradient descent. Our main result provides the first fully unconditional bounds on the total number of bits which need to be sent and received by the N machines to solve this problem under point-to-point communication, within a given error tolerance. Our results bring over tools from communication complexity to distributed optimisation, which has potential for further applications.

Sparsity in Deep Learning: Pruning and Growth for Efficient Inference and Training in Neural Networks

The future of deep learning is sparse! See our overview of the field and upcoming opportunities for how to gain 10-100x performance to fuel the next AI revolution. HPC techniques will be key, as large-scale training is fundamentally a supercomputing problem.

On the Predictability of Pruning Across Scales

We show that the error of magnitude-pruned networks follows a scaling law, and that this law is of a fundamentally different nature than that of unpruned networks.

 

WoodFisher: Efficient Second-Order Approximation for Neural Network Compression

Learn about WoodFisher, an efficient second-order approximation method for neural network compression.

Github

Relaxed Scheduling for Scalable Belief Propagation

Learn about efficient parallel algorithms for the key machine learning task of inference on graphical models, in particular on the fundamental belief propagation algorithm.

 

Adaptive Gradient Quantization for Data-Parallel SGD

In this paper, we introduce two adaptive quantization schemes, ALQ and AMQ. In both schemes, processors update their compression schemes in parallel by efficiently computing sufficient statistics of a parametric distribution. We improve the validation accuracy by almost 2% on CIFAR-10 and 1% on ImageNet in challenging low-cost communication setups.

Github

Inducing and Exploiting Activation Sparsity for Fast Neural Network Inference

Learn how to gain significant performance by inducing and exploiting activation sparsity for fast neural network inference.

MARLIN: Mixed-Precision Auto-Regressive Parallel Inference on Large Language Models

Weight quantization in Large Language Models (LLMs) reduces model size and speeds up single-user inference on GPUs with minimal accuracy loss. However, its effectiveness in batched settings with multiple clients was uncertain. This paper introduces MARLIN, Mixed-precision Auto-Regressive LINear kernels, which achieve near-maximum quantization speedups (up to 4×) for batch sizes of 16-32, and significant speedups for batch sizes up to 64-128, using techniques like asynchronous memory access and complex scheduling. Integrated with vLLM, MARLIN offers up to 2.8× end-to-end LLM inference speedups and supports further compression like NVIDIA 2:4 sparsity.

Github

 

Accurate Compression of Text-to-Image Diffusion Models via Vector Quantization

Text-to-image diffusion models, vital for generating high-quality images from text, have grown to billions of parameters, making them less accessible in resource-limited settings. Post-training quantization (PTQ) compresses pretrained weights to lower bits. While uniform scalar quantization achieves decent results at 4 bits, this study explores vector quantization (VQ) for greater compression. Tailoring VQ-based PTQ to billion-scale models like SDXL and SDXL-Turbo, we compress 2B+ parameter models to ~3 bits, maintaining image quality and textual alignment comparable to 4-bit methods.

Mathador-LM: A Dynamic Benchmark for Mathematical Reasoning on Large Language Models

Mathador-LM is a new benchmark assessing mathematical reasoning in large language models (LLMs) via ruleset interpretation, planning, and problem-solving, based on the Mathador game. Players use arithmetic to hit a target number from base numbers under simple rules. Dynamically generated instances ensure stable difficulty and prevent test-set leakage into training data, a common benchmark flaw. Evaluating top open and closed-source LLMs, we find they underperform compared to 3rd graders, despite excelling on other math benchmarks.
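
The game underlying the benchmark is easy to illustrate: combine a handful of base numbers with +, -, *, / to hit a target. The brute-force helper below (an illustration, not the benchmark's generator or scoring code) finds one such combination.

```python
from itertools import permutations

OPS = {'+': lambda a, b: a + b,
       '-': lambda a, b: a - b,
       '*': lambda a, b: a * b,
       '/': lambda a, b: a // b if b != 0 and a % b == 0 else None}

def solve(numbers, target, exprs=None):
    """Combine base numbers with +, -, *, / (integers only, no negatives)
    until the target is reached; returns one valid expression or None."""
    if exprs is None:
        exprs = [str(n) for n in numbers]
    for (i, a), (j, b) in permutations(list(enumerate(numbers)), 2):
        for sym, op in OPS.items():
            val = op(a, b)
            if val is None or val < 0:
                continue
            combined = f"({exprs[i]} {sym} {exprs[j]})"
            if val == target:
                return combined
            rest = [n for k, n in enumerate(numbers) if k not in (i, j)]
            rest_exprs = [e for k, e in enumerate(exprs) if k not in (i, j)]
            found = solve(rest + [val], target, rest_exprs + [combined])
            if found:
                return found
    return None

print(solve([2, 3, 5, 7, 11], 37))   # prints one valid expression, e.g. (2 + (5 * 7))
```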

Github

Wasserstein Distances, Neuronal Entanglement, and Sparsity

This study explores disentangling polysemantic neurons in large language models (LLMs) to interpret performance under weight sparsity, a key optimization method. We introduce a new metric based on the Wasserstein distance to gauge neuronal entanglement, comparing a neuron’s output distribution to a Gaussian. We identify "Wasserstein Neurons" with non-Gaussian outputs that strongly affect accuracy and map similar inputs to dissimilar outputs. Our novel framework splits layer inputs, creating a mixture of experts with less entangled neurons that maintain accuracy when sparsified, effectively disentangling complex input-output relationships.
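
A small sketch of the metric's spirit: standardize a neuron's observed outputs and measure their Wasserstein-1 distance to samples from a standard Gaussian. The standardization and the Monte Carlo Gaussian reference are assumptions made for illustration.

```python
import numpy as np
from scipy.stats import wasserstein_distance

def entanglement_score(neuron_outputs: np.ndarray, n_ref: int = 100_000,
                       seed: int = 0) -> float:
    """Wasserstein-1 distance between a neuron's standardized output
    distribution and a standard Gaussian (larger = less Gaussian)."""
    z = (neuron_outputs - neuron_outputs.mean()) / (neuron_outputs.std() + 1e-8)
    reference = np.random.default_rng(seed).standard_normal(n_ref)
    return wasserstein_distance(z, reference)

rng = np.random.default_rng(1)
print(entanglement_score(rng.standard_normal(10_000)))        # close to 0
print(entanglement_score(rng.standard_t(df=2, size=10_000)))  # typically much larger
```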

Github

 

PV-Tuning: Beyond Straight-Through Estimation for Extreme LLM Compression

This paper critiques "extreme" LLM compression (1-2 bits/parameter) using straight-through estimators (STE) in post-training quantization, noting diminishing accuracy returns. Existing methods like QuIP# and AQLM use limited fine-tuning with STE, but its efficacy is unclear. We introduce PV-Tuning, a versatile framework enhancing quantization-aware fine-tuning with convergence guarantees. Outperforming prior techniques, PV-Tuning achieves Pareto-optimal 2-bit quantization for Llama 2 and boosts accuracy in 1-2 bit vector quantization for models like Llama and Mistral.
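
For context, the snippet below shows the standard straight-through-estimator (STE) trick that the paper argues has limits at extreme bit-widths: the forward pass uses quantized weights while gradients bypass the rounding. The uniform grid and clamping range are illustrative choices.

```python
import torch

def ste_quantize(w: torch.Tensor, n_bits: int = 2) -> torch.Tensor:
    """Uniform quantization with a straight-through estimator: the forward
    pass uses the quantized weights, the backward pass treats rounding as
    the identity so gradients reach the latent full-precision weights."""
    n_levels = 2 ** n_bits
    scale = w.abs().max() / (n_levels / 2) + 1e-12
    q = torch.round(w / scale).clamp(-n_levels // 2, n_levels // 2 - 1) * scale
    return w + (q - w).detach()        # value of q, gradient of w

w = torch.randn(8, requires_grad=True)
ste_quantize(w).pow(2).sum().backward()
print(w.grad)   # gradients flow despite the non-differentiable rounding
```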

Github

Enabling High-Sparsity Foundational Llama Models with Efficient Pretraining and Deployment

We present a method to create sparse, accurate versions of large language models (LLMs) like LLaMA-2 7B, achieving full accuracy recovery for fine-tuning at 70% sparsity. Using SparseGPT pruning and sparse pretraining on SlimPajama and The Stack datasets, we accelerate training on Cerebras CS-3 chips and inference up to 3x on CPUs and 1.7x on GPUs (vLLM). Sparsity alone drives these gains, with quantization boosting CPU speedups to 8.6x. Tested across tasks like chat, coding, and reasoning, this approach enables faster, smaller LLMs without accuracy loss.

QuaRot: Outlier-Free 4-Bit Inference in Rotated LLMs

QuaRot, a novel quantization method based on rotations, quantizes large language models (LLMs) end-to-end—including weights, activations, and KV cache—to 4 bits. By rotating the LLM to eliminate outliers in the hidden state (residual), feed-forward activations, attention mechanisms, and KV cache, QuaRot simplifies quantization without altering outputs. Applied to LLaMa2-70B, it achieves a 4-bit model with minimal perplexity loss (0.47 on WikiText-2) and 99% zero-shot performance retention. QuaRot also enables lossless 6- and 8-bit models without calibration. Code is available online.
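
The computational-invariance idea behind rotation-based schemes can be demonstrated in a few lines of NumPy: rotating activations by an orthogonal Q and weights by Q^T leaves the output unchanged while spreading outlier channels out. A random orthogonal matrix stands in here for QuaRot's structured Hadamard rotations, which are fused into the model weights.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64))
x[:, 7] *= 50.0                                     # inject an outlier channel
W = rng.standard_normal((64, 64))
Q, _ = np.linalg.qr(rng.standard_normal((64, 64)))  # random orthogonal matrix

# (x @ Q) @ (Q.T @ W) == x @ W, but the rotated activations are easier to quantize.
print(np.allclose(x @ W, (x @ Q) @ (Q.T @ W)))      # True: output unchanged
print(np.abs(x).max(), np.abs(x @ Q).max())         # outlier magnitude shrinks
```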

Github

 

Extreme Compression of Large Language Models via Additive Quantization

This paper revisits "extreme" compression of large language models (LLMs) to 2-3 bits per parameter using Multi-Codebook Quantization (MCQ). Our algorithm, AQLM, enhances Additive Quantization (AQ) with two innovations: learned, input-adaptive quantization of weight matrices and joint codebook optimization across transformer blocks. AQLM achieves Pareto optimality in accuracy vs. size below 3 bits, excelling in 2-bit compression. It also offers practical GPU/CPU implementations for fast token generation, matching or beating FP16 speeds with a smaller memory footprint.

Github

How to Prune Your Language Model: Recovering Accuracy on the "Sparsity May Cry" Benchmark

This paper reexamines pruning BERT-family large language models (LLMs) during fine-tuning, addressing challenges in the "Sparsity May Cry" (SMC) benchmark where existing methods struggle. We propose guidelines for effective pruning: analyzing costs vs. benefits of pruning components like embeddings and classification heads, scaling training and sparsity schedules based on target sparsity, and optimizing Knowledge Distillation parameterization. Our approach, using classic gradual magnitude pruning (GMP), achieves state-of-the-art results on both traditional BERT-pruning and SMC benchmarks.
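
Gradual magnitude pruning is typically driven by a sparsity schedule; the helper below implements the commonly used cubic ramp (a generic sketch, not necessarily the exact schedule tuned in the paper).

```python
def gmp_sparsity(step, start_step, end_step, final_sparsity, initial_sparsity=0.0):
    """Cubic sparsity ramp commonly used with gradual magnitude pruning:
    sparsity rises quickly at first, then flattens toward the target."""
    if step <= start_step:
        return initial_sparsity
    if step >= end_step:
        return final_sparsity
    progress = (step - start_step) / (end_step - start_step)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1 - progress) ** 3

for step in range(0, 10_001, 2_500):   # ramp toward 90% sparsity over 10k steps
    print(step, round(gmp_sparsity(step, 0, 10_000, 0.9), 3))
```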

Scaling Laws for Sparsely-Connected Foundation Models

This study investigates how parameter sparsity affects Transformer scaling in vision and language foundation models trained on massive datasets (e.g., ViT/JFT-4B, T5/C4). We establish the first scaling law linking weight sparsity, non-zero parameters, and training data volume, validated across scales. We define "optimal sparsity"—the level yielding peak performance for a given size and budget—finding it rises with more training data. We also explore sparsity structures (e.g., n:m patterns) and strategies (e.g., starting dense), revealing sparsity’s potential and limits for efficiency.

SpQR: A Sparse-Quantized Representation for Near-Lossless LLM Weight Compression

This paper introduces Sparse-Quantized Representation (SpQR), a novel compression format and quantization technique for large language models (LLMs), enabling near-lossless quantization to 3-4 bits per parameter across scales. By isolating outlier weights for higher precision and compressing others, SpQR achieves <1% perplexity loss for LLaMA and Falcon LLMs. It allows a 33B parameter LLM to run on a 24 GB GPU with 15% speedup and no performance drop. SpQR includes efficient encoding/decoding algorithms, offering >4x memory compression and faster GPU inference than 16-bit baselines.

Github

Bias in Pruned Vision Models: In-Depth Analysis and Countermeasures

This study examines how pruning—zeroing out many neural network parameters—affects bias in Convolutional Neural Networks (CNNs) for vision tasks. While pruning can compress models effectively, it may worsen output bias. We demonstrate that CNNs pruned to <10% weights can maintain accuracy and limit bias increase compared to dense models. However, at higher sparsity, pruned models show greater output uncertainty and correlations, linked to increased bias. We offer simple criteria using only the original model to predict bias shifts and identify samples prone to bias after pruning.

SparseProp: Efficient Sparse Backpropagation for Faster Training of Neural Networks

We provide a new efficient version of the backpropagation algorithm, specialized to the case where the weights of the neural network being trained are sparse. Our algorithm is general, as it applies to arbitrary (unstructured) sparsity and common layer types (e.g., convolutional or linear). We provide a fast vectorized implementation on commodity CPUs, and show that it can yield speedups in end-to-end runtime experiments, both in transfer learning using already-sparsified networks, and in training sparse networks from scratch. Thus, our results provide the first support for sparse training on commodity hardware.

ZipLM: Inference-Aware Structured Pruning of Language Models

ZipLM, a new structured compression method for large language models (LLMs), balances accuracy and speedup for specified runtime targets in any inference environment. Unlike prior methods limited to post-training, gradual compression, or specific models (e.g., BERT, GPT), ZipLM iteratively prunes components with the worst loss-runtime trade-offs, excelling across settings. It outperforms distillation/pruning techniques like CoFi and DistilGPT2, matching MobileBERT’s performance by pruning BERT-large, and offers a 60% smaller, 30% faster GPT2 alternative.

Github

Quantized Distributed Training of Large Models with Convergence Guarantees

QSDP enhances fully-sharded data parallel (FSDP) training for large language models like GPT by introducing gradient and weight quantization, addressing scalability bottlenecks. Unlike direct compression in FSDP, which affects convergence, QSDP modifies SGD to maintain accuracy with quantized weights in a non-convex domain, supported by theoretical guarantees. Simple to implement with minimal overhead, QSDP was validated on GPT models up to 1.3B parameters, achieving up to 2.2x end-to-end speedups while preserving accuracy on multi-node clusters.

Github

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

GPTQ, a new one-shot weight quantization method using approximate second-order information, compresses large GPT models (e.g., 175B parameters) to 3-4 bits per weight in ~4 GPU hours with minimal accuracy loss. Outperforming prior methods, it doubles compression gains, enabling a 175B model to run on a single GPU. GPTQ also supports extreme 2-bit or ternary quantization with reasonable accuracy, yielding inference speedups of 3.25x on NVIDIA A100 and 4.5x on A6000 over FP16 baselines.

Github

Activation-Informed Merging of Large Language Models

Model merging, a method that combines the parameters and embeddings of multiple fine-tuned large language models (LLMs), offers a promising approach to enhance model performance across various tasks while maintaining computational efficiency. This paper introduces Activation-Informed Merging (AIM), a technique that integrates the information from the activation space of LLMs into the merging process to improve performance and robustness. AIM is designed as a flexible, complementary solution that is applicable to any existing merging method. It aims to preserve critical weights from the base model, drawing on principles from continual learning (CL) and model compression.

Github

A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods

Large language models (LLMs) have improved through larger models and data, but gains are slowing. Recent methods scale computation at inference using reward models, treating it as a search problem prone to reward hacking. This paper reframes it as probabilistic inference, using sampling to explore state distributions in a state-space model with approximate likelihood. It introduces a new approach adapting particle-based Monte Carlo methods, achieving 4-16x better scaling than deterministic search on math reasoning tasks. Qwen2.5-Math-1.5B-Instruct beats GPT-4o in 4 rollouts, and Qwen2.5-Math-7B-Instruct reaches o1 accuracy in 32 rollouts. This links probabilistic inference to LLM scaling.

Github

Unveiling the Secret Recipe: A Guide For Supervised Fine-Tuning Small LLMs

LLM development favors well-resourced labs, leaving small developers at a disadvantage. This study bridges the gap by exploring supervised fine-tuning of small LLMs (3B-7B parameters) using diverse instruction-tuning datasets. Testing four open-source models, it challenges common practices like TULU’s hyperparameters and Orca’s phased training. Key findings: (i) larger batch sizes with lower learning rates boost performance on MMLU, MTBench, and Open LLM Leaderboard; (ii) early training signals (low gradient norms, high loss) predict success, saving compute; (iii) simplified hyperparameter tweaks maintain performance; (iv) stacked training matches phased training but is simpler and more efficient. This guide empowers small-scale LLM fine-tuning.

Dr. SoW: Density Ratio of Strong-over-weak LLMs for Reducing the Cost of Human Annotation in Preference Tuning

Preference tuning typically needs costly human data. This paper introduces Dr. SoW (Density Ratio of Strong over Weak), a cost-effective method using off-the-shelf LLMs for annotation, avoiding human input. Dr. SoW uses the log-density ratio between better- and less-aligned LLMs as a reward signal. Testing 221 LLM pairs, it shows a strong link between model performance gaps and reward quality. An end-to-end pipeline tailors rewards to user domains, boosting accuracy without fine-tuning. With Mistral-7B, Dr. SoW scores 82.6 on RewardBench, beating top trained rewards, and excels in Safety (91.0) and Reasoning (88.0). Tuning Llama-3-8B with Dr. SoW-annotated data lifts win rates to 37.4% (+15.1%) on ArenaHard and 40.7% (+17.8%) on AlpacaEval 2.0.
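
The reward itself is simple to express: the log-probability a stronger (better-aligned) model assigns to a response minus the log-probability a weaker model assigns to it. The sketch below uses generic Hugging Face-style causal LMs as placeholders and assumes both models share a tokenizer that adds no special tokens mid-sequence.

```python
import torch
import torch.nn.functional as F

def sequence_logprob(model, tokenizer, prompt: str, response: str) -> float:
    """Sum of token log-probabilities the model assigns to `response`
    given `prompt` (standard causal-LM scoring)."""
    full = tokenizer(prompt + response, return_tensors="pt")
    prompt_len = tokenizer(prompt, return_tensors="pt")["input_ids"].shape[1]
    with torch.no_grad():
        logits = model(**full).logits                      # (1, T, vocab)
    logprobs = F.log_softmax(logits[:, :-1], dim=-1)       # predicts tokens 1..T-1
    targets = full["input_ids"][:, 1:]
    token_lp = logprobs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    return token_lp[:, prompt_len - 1:].sum().item()       # response tokens only

def density_ratio_reward(strong_model, weak_model, tokenizer, prompt, response):
    """Dr. SoW-style reward sketch: how much more likely the stronger,
    better-aligned model finds the response than the weaker one.
    `strong_model` and `weak_model` are placeholder causal LMs."""
    return (sequence_logprob(strong_model, tokenizer, prompt, response)
            - sequence_logprob(weak_model, tokenizer, prompt, response))
```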

DICE: Discrete Inversion Enabling Controllable Editing for Multinomial Diffusion and Masked Generative Models

Discrete diffusion models excel in image generation and masked language tasks but struggle with controlled editing. DICE (Discrete Inversion for Controllable Editing) is the first method to enable precise inversion for these models, like multinomial diffusion and masked generative models. By tracking noise sequences and masking patterns in reverse diffusion, DICE reconstructs and edits discrete data accurately without predefined masks or attention tweaks. Tested on VQ-Diffusion, Paella, and RoBERTa, DICE maintains high fidelity while boosting editing flexibility in image and text domains. It opens new possibilities for fine-grained content control in discrete spaces. 

Spectrum-Aware Parameter Efficient Fine-Tuning for Diffusion Models

Fine-tuning large pre-trained generative models efficiently is increasingly popular. Traditional low-rank adaptation limits parameters but may lack capacity for complex tasks. This paper presents a spectrum-aware adaptation framework, adjusting singular values and basis vectors of pre-trained weights. Using the Kronecker product and Stiefel optimizers, it adapts orthogonal matrices efficiently. The proposed Spectral Orthogonal Decomposition Adaptation (SODA) balances efficiency and capacity. Tested on text-to-image diffusion models, SODA proves effective, providing a spectrum-aware alternative to existing fine-tuning approaches.
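
A heavily simplified sketch of the spectrum-aware idea: factor a pre-trained weight with an SVD, freeze the singular bases, and fine-tune only the singular values. SODA itself also adapts the orthogonal bases via Kronecker-structured updates and Stiefel optimizers, which are not shown here.

```python
import torch
import torch.nn as nn

class SpectralAdapter(nn.Module):
    """Wrap a frozen linear weight and fine-tune only its singular values."""
    def __init__(self, weight: torch.Tensor):
        super().__init__()
        U, S, Vh = torch.linalg.svd(weight, full_matrices=False)
        self.register_buffer("U", U)                   # frozen singular bases
        self.register_buffer("Vh", Vh)
        self.log_s = nn.Parameter(S.clamp_min(1e-8).log())  # trainable spectrum

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        W = self.U @ torch.diag(self.log_s.exp()) @ self.Vh
        return x @ W.T

layer = SpectralAdapter(torch.randn(32, 64))
print(layer(torch.randn(4, 64)).shape)   # torch.Size([4, 32])
```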

Github

Differentially Private Synthetic Data Generation for Relational Databases

Current differentially private (DP) synthetic data methods focus on single tables, but real data often spans multiple related tables. This paper presents a pioneering algorithm that enhances existing DP methods to create synthetic relational databases. It refines inter-table relationships iteratively, minimizing errors in low-order marginals while ensuring referential integrity. This approach avoids flattening tables (saving space), runs efficiently (saving time), and scales to high dimensions. It offers DP and utility guarantees, with experiments on real datasets showing strong fidelity to the original data.

Github

Value Augmented Sampling for Language Model Alignment and Personalization

Aligning Large Language Models (LLMs) to human preferences, new skills, and safer behavior is key. Search-based methods like Best-of-N excel but are costly, while Reinforcement Learning (RL) is efficient but less effective due to optimization issues. This paper introduces Value Augmented Sampling (VAS), a reward optimization framework using only initial, frozen LLM data. VAS avoids co-training policy and value functions, outperforming PPO and DPO on benchmarks and matching Best-of-128 with lower cost. It adapts LLMs like ChatGPT without weight access and enables composing multiple rewards for personalized alignment.

Github

LInK: Learning Joint Representations of Design and Performance Spaces through Contrastive Learning for Mechanism Synthesis

This paper introduces LInK, a framework blending contrastive learning and optimization to tackle complex inverse problems in engineering design, focusing on path synthesis for planar linkage mechanisms. Using multimodal, transformation-invariant contrastive learning, LInK learns joint representations of physics and design from over 10 million mechanisms, enabling fast retrieval. Paired with a hierarchical nonlinear optimization algorithm, it cuts error by 28x and time by 20x compared to state-of-the-art methods on existing benchmarks. LInK also tackles a tougher new benchmark, LINK ABC, tracing English alphabet trajectories. It advances mechanism design and extends contrastive learning to engineering.

Github

LAB: Large-Scale Alignment for ChatBots

This paper presents LAB (Large-scale Alignment for chatBots), a new method to improve scalability in instruction-tuning large language models (LLMs). Using taxonomy-guided synthetic data and a multi-phase tuning approach, LAB cuts reliance on costly human annotations and models like GPT-4. It achieves competitive benchmark results against traditionally trained models, offering a cost-effective, scalable way to boost LLM performance and instruction-following without catastrophic forgetting. This advances efficient LLM training for diverse applications.

Curiosity-driven Red-teaming for Large Language Models

Large language models (LLMs) can produce unwanted content, prompting red team human testers to craft prompts that reveal flaws—an expensive, slow process. Recent automation uses reinforcement learning (RL) to train a red team LLM, but it generates few effective test cases, limiting coverage. This paper links broader test case coverage to curiosity-driven exploration, introducing Curiosity-driven Red Teaming (CRT). CRT boosts coverage and effectiveness over existing methods, successfully eliciting toxic responses from the heavily fine-tuned LLaMA2.

Github

Constraining Generative Models for Engineering Design with Negative Data

Generative models excel but often fail to produce realistic outputs, especially in engineering where strict standards apply. This paper introduces a training method using "negative data"—examples to avoid—to guide models toward constraint-satisfying outputs. The Negative-Data Generative Model (NDGM) outperforms classics, cutting constraint violations to 1/6 with 1/8 the data in some cases. It beats baselines in 12 of 14 tests, balancing constraint adherence and distributional accuracy. Tested on synthetic and real engineering tasks like ship hull and vehicle design, NDGM shines. 

Github

Analyzing Generalization of Neural Networks through Loss Path Kernels

Deep neural networks are vital in real-world use, requiring adaptation to new data. This paper explores their generalization under (stochastic) gradient flow, linking loss dynamics to kernel machines via a new "loss path kernel." This kernel assesses data similarity using loss gradient agreement along gradient flow paths. It yields a tight generalization bound for various network architectures, closely tied to true error. Applied to neural architecture search (NAS), it outperforms top NAS algorithms in experiments, enhancing design guidance.

Beyond Uniform Sampling: Offline Reinforcement Learning with Imbalanced Datasets

Offline policy learning uses existing trajectory datasets to train decision-making policies without new data collection. Unlike supervised learning, reinforcement learning (RL) aims to exceed the dataset’s average return. However, when datasets are mostly suboptimal, top offline RL methods fail to improve much, constrained by an assumption to stick close to dataset trajectories. This paper proposes a sampling strategy focusing only on "good data," avoiding uniform mimicry of suboptimal actions. It offers a plug-and-play algorithm enhancing standard offline RL, showing big gains in 72 imbalanced datasets, D4RL, and three RL methods.

Github

Compositional Foundation Models for Hierarchical Planning

Effective decision-making in new environments with long-term goals requires hierarchical reasoning across space and time—planning subgoals, visualizing plans, and executing actions via visual-motor control. This paper introduces Compositional Foundation Models for Hierarchical Planning (HiP), integrating expert models trained on language, vision, and action data. A large language model creates symbolic plans, grounded by a video diffusion model, then linked to actions via an inverse dynamics model. Consistency is ensured through iterative refinement. HiP’s effectiveness is shown in three long-horizon table-top manipulation tasks.

Github

Identifiability Guarantees for Causal Disentanglement from Soft Interventions

Causal disentanglement seeks a data representation with latent variables linked by a causal model, identifiable if the model is unique. This paper addresses cases with unpaired observational and interventional data, where interventions alter latent variable mechanisms. While fully observed causal variables allow identification under faithfulness, this work proves identifiability with unobserved variables using a broader faithfulness concept. It ensures recovery of the latent causal model up to an equivalence class and prediction of unseen intervention effects with infinite data. An autoencoding variational Bayes algorithm is developed and applied to predict combinatorial genomic perturbation effects.

Github

Aligning Optimization Trajectories with Diffusion Models for Constrained Design Generation

Generative models excel in vision and language, inspiring their use in science and engineering to speed up design and cut iterative optimization. Physics-based methods outshine them in constrained, data-scarce settings needing precision. We introduce Diffusion Optimization Models (DOM) and Trajectory Alignment (TA), aligning diffusion model sampling with physics-based optimization trajectories to ensure physical grounding. Requiring no costly preprocessing or extra data, it generates high-performance designs in two steps. Applied to structural topology optimization, TA beats top generative models in-distribution, halves inference costs, and boosts manufacturability out-of-distribution with minimal optimization. 

Github

Post-processing Private Synthetic Data for Improving Utility on Selected Measures

Current private synthetic data generation ignores downstream tasks, risking low utility if user needs aren’t met. This paper presents a post-processing method to boost synthetic data utility for user-specified measures while keeping privacy and quality intact. It resamples data to exclude low-utility samples, using an efficient stochastic first-order algorithm for optimal weights. Tests across benchmark datasets and top generation algorithms show consistent utility gains.

Beyond Statistical Similarity: Rethinking Metrics for Deep Generative Models in Engineering Design

Deep generative models (VAEs, GANs, Diffusion Models, Transformers) excel in applications like image synthesis and drug discovery, but evaluating them for engineering design is tricky. Traditional likelihood-based metrics often miss design-specific needs. This paper reviews classic metrics, explains their limitations in design via case studies, and curates design-focused metrics—constraint satisfaction, performance, novelty, conditioning—for better evaluation. Using 2D examples and real cases (bicycle frame, topology generation), it applies these metrics to assess four models, highlighting target achievement and geometric constraints.

Github

Private Synthetic Data Meets Ensemble Learning

Machine learning models trained on synthetic data often falter on real data due to distribution shifts. This paper proposes an ensemble strategy to boost downstream model performance on real data. Multiple synthetic datasets are created using differential privacy (DP) in parallel, then used to train and ensemble downstream models. Though each dataset may stray further from real data, their diversity strengthens robustness. Tests show no gain with marginal- or workload-based DP, but GAN-based DP improves accuracy and calibration in ensembled models.

Multi-Symmetry Ensembles: Improving Diversity and Generalization via Opposing Symmetries

Deep ensembles (DE) boost performance by leveraging random initialization’s stochasticity for diversity. Recent efforts enhance this via hyperparameters or loss regularization, yet remain stochastic. This paper introduces Multi-Symmetry Ensembles (MSE), a framework capturing diverse hypotheses along symmetry axes, expanding beyond weight and hyperparameter tweaks. Using contrastive learning, MSE creates models for invariant and equivariant hypotheses, efficiently ensembling them for tasks. On ImageNet, MSE’s inherent diversity enhances classification, uncertainty quantification, and generalization in transfer tasks.

Github

A Probabilistic Framework for Modular Continual Learning

Modular continual learning (CL) uses unique module compositions per problem, but searching vast composition spaces is tough due to training costs. This paper introduces PICLE, a framework using a probabilistic model to efficiently evaluate compositions, enabling perceptual, few-shot, and latent transfer. Combining prior knowledge with dataset specifics, PICLE excels in two CL benchmark suites. It outperforms prior modular CL methods, scaling well to large search spaces and long problem sequences.

Github

Improving Tuning-Free Real Image Editing with Proximal Guidance

DDIM inversion excels in real image editing via diffusion methods, but struggles with larger classifier-free guidance (CFG) scales. Null-text inversion (NTI) adjusts null embeddings for better alignment at high CFG, enabling cross-attention control. Negative-prompt inversion (NPI) offers a training-free NTI solution but can introduce artifacts and relies on DDIM quality. This paper enhances NPI with proximal guidance, adding regularization and reconstruction guidance to cut artifacts while keeping it training-free. It also extends to mutual self-attention control for geometry/layout edits, offering an efficient, low-overhead editing approach.

Github

Estimating the Density Ratio between Distributions with High Discrepancy using Multinomial Logistic Regression

Density ratio functions (p/q) are key in machine learning to measure distribution differences, with binary classification estimators excelling in high dimensions. However, they falter when densities are well-separated due to distribution shifts between training and evaluation. This paper reveals poor performance in such cases and proposes a multi-class classification method using auxiliary densities m_1, ..., m_K. By training a multinomial logistic regression to separate samples from p, q, and the auxiliary densities into K+2 classes, it ensures overlap and eliminates shift issues. Tests on synthetic and real data show it outperforms state-of-the-art in density ratio estimation, mutual information, and representation learning.
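
The mechanics can be sketched with scikit-learn: train one multinomial classifier over samples from p, q, and auxiliary "bridge" densities, then read the ratio p(x)/q(x) off the ratio of the two corresponding class probabilities. The Gaussian densities below are illustrative stand-ins, not the paper's construction of the auxiliary densities.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
p = rng.normal(-4.0, 1.0, size=(2000, 1))          # class 0: samples from p
q = rng.normal(+4.0, 1.0, size=(2000, 1))          # class 1: samples from q
bridges = [rng.normal(mu, 1.0, size=(2000, 1)) for mu in (-1.5, 1.5)]  # classes 2, 3

X = np.vstack([p, q] + bridges)
y = np.concatenate([np.full(2000, k) for k in range(2 + len(bridges))])

clf = LogisticRegression(max_iter=2000).fit(X, y)

x_test = np.array([[-2.0], [0.0], [2.0]])
proba = clf.predict_proba(x_test)
log_ratio = np.log(proba[:, 0]) - np.log(proba[:, 1])   # estimate of log p(x)/q(x)
print(log_ratio)   # decreasing: p dominates on the left, q on the right
```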

Mitigating Confirmation Bias in Semi-supervised Learning via Efficient Bayesian Model Averaging

State-of-the-art semi-supervised learning (SSL) excels with labeled and unlabeled data via self-training or pseudo-labeling, but risks confirmation bias as models reinforce errors. This paper shows SOTA SSL suffers from this due to poorly calibrated classifiers in pseudo-labeling. It introduces BaM-SSL, a Bayesian Model averaging method enhancing uncertainty quantification with low overhead. BaM-SSL reduces bias, boosting test accuracy by up to 16% on CIFAR-100 (400 labels) across vision benchmarks like CIFAR-10 and CIFAR-100. It also shines in class-imbalanced datasets and photonics science challenges.

Github

Constructive Assimilation: Boosting Contrastive Learning Performance through View Generation Strategies

Expert transformations like random-resized-crop and color-jitter are key to contrastive learning success (e.g., SimCLR). Efforts to replace these with learned view-generation have underperformed for imagery. This paper asks if generated views can enhance, rather than replace, expert transformations. It proposes a view generation and assimilation method, boosting state-of-the-art performance by up to 3.6% across three datasets. A thorough study analyzes view generation and assimilation, offering insights into learned views’ role in contrastive learning.

On the Importance of Calibration in Semi-supervised Learning

State-of-the-art semi-supervised learning (SSL) excels with labeled and unlabeled data using consistency regularization and pseudo-labeling. Pseudo-labeling relies on model predictions, making calibration key to avoid confirmation bias. Yet, SOTA methods prioritize performance over calibration. This paper shows calibration strongly ties to performance and proposes enhancing it with Bayesian techniques. A new SSL model family optimizing calibration boosts test accuracy by up to 15.9% on CIFAR-10, CIFAR-100, and ImageNet, and proves effective in class-imbalanced and photonics science challenges.

A Bayesian-Symbolic Approach to Reasoning and Learning in Intuitive Physics

Humans excel at intuitive physical reasoning with minimal data, a key aspect of common sense. This paper suggests humans learn approximate physics laws swiftly. It introduces a Bayesian-symbolic framework (BSP) for sample-efficient, human-like physical reasoning. BSP uses a generative model of interacting entities with unknown force laws, treating entities as random variables for Bayesian inference of properties. It employs symbolic regression with Newtonian grammar in a bilevel optimization to learn forces, iterating via expectation-maximization. BSP outperforms neural methods on synthetic datasets, handles real-world scenes, and excels in human physical reasoning tasks.

Equivariant Contrastive Learning

State-of-the-art self-supervised learning (SSL) pre-training creates semantically rich representations by enforcing invariance to human-defined transformations. This paper argues that equivariance—a broader concept where representations transform with inputs—can enhance this. It introduces Equivariant Self-Supervised Learning (E-SSL), extending SSL by adding a pre-training goal to predict input transformations, balancing equivariance and invariance. E-SSL boosts SimCLR to 72.5% ImageNet linear probe accuracy and proves effective in vision benchmarks and photonics regression tasks.

Github

Improving the Reconstruction of Disentangled Representation Learners via Multi-Stage Modeling

Current autoencoder-based disentangled representation learning sacrifices reconstruction quality for independence by penalizing the posterior, limiting capacity for correlated latent variables key to image details. This paper proposes a multi-stage approach: first, a penalty-based method learns disentangled factors; then, a deep generative model adds correlated variables for detail, conditioned on the disentangled factors. Unified by D-separation, this model spans VAEs, GANs, and normalizing flows. It beats state-of-the-art in reconstruction quality across benchmarks while matching disentanglement, excels in synthetic tabular data generation, and reveals interpretable features.

not-so-BigGAN: Generating High-Fidelity Images on Small Compute with Wavelet-based Super-Resolution

High-resolution image generation models like BigGAN and VQVAE-2 demand vast compute resources (512 TPU-v3 cores), limiting access. Conversely, GAN-based super-resolution models like ESRGAN upscale efficiently. This paper introduces not-so-big-GAN (nsb-GAN), a cost-effective two-step framework for deep generative models. It first generates low-frequency images in the wavelet domain, then super-resolves them to pixel-space with a novel wavelet decoder. Wavelet down-sampling retains more structure, enhancing quality at lower resolutions (e.g., 64x64). With parallel training and reduced dimensions, nsb-GAN cuts costs, achieving an FID of 10.59 on ImageNet 512x512—outperforming BigGAN with half the compute (256 TPU-v3 cores).

Generative Ratio Matching Networks

Deep generative models excel at creating realistic images, often via adversarial methods requiring tricky saddlepoint optimization to balance generator and critic networks. Maximum mean discrepancy networks (MMD-nets) sidestep this using a fixed kernel adversary but lag in quality. This paper advances this idea with Generative Ratio Matching (GRAM), a new method avoiding saddlepoint issues. In GRAM, generator and critic networks compete against a fixed kernel, not each other, ensuring stability like MMD-nets while rivaling or surpassing adversarial models in generative quality.

Sequential Transfer Machine Learning in Networks: Measuring the Impact of Data and Neural Net Similarity on Transferability

Transfer machine learning aids neural net reuse across independent entities with similar tasks using distributed data, preserving privacy. As datasets in business networks increase and transfers vary in success, assessing transferability is key. This study uses sales data from six restaurants to train and transfer neural nets, measuring transferability. It tests indicators—data divergences, projections, and a new neural net similarity metric—finding strong negative correlations with transferability. These insights guide transfer path selection, boosting performance with fewer transfers.

SimVAE: Simulator-Assisted Training for Interpretable Generative Models

This paper introduces SimVAE, a simulator-assisted training method for variational autoencoders (VAEs) that yields a disentangled, interpretable latent space. SimVAE trains in two steps: first, a deep generator (decoder) approximates a simulator, using it as a data source or teacher; then, an inference network (encoder) inverts the decoder, effectively approximating an inverted simulator. By separating encoder and decoder training, SimVAE avoids challenges common in VAEs and GANs. Its applications span circuit design, graphics de-rendering, and natural science problems requiring simulation-based inference.

BreGMN: Scaled-Bregman Generative Modeling Networks

F-divergences, widely used in generative modeling, require full overlap between data and model distributions, failing when supports mismatch during gradient-based training. Recent solutions shift to integral probability measures (IPMs) or variational lower bounds. This paper argues against changing the objective entirely, proposing instead to augment the base measure of f-divergences. It introduces Scaled Bregman Divergences, merging f-divergences and Bregman divergences, which, with a suitable base measure, address support mismatch and add geometric insights. Tests on MNIST, CelebA, and CIFAR-10 show strong results.

Variational Russian Roulette for Deep Bayesian Nonparametrics

Bayesian nonparametric models adjust complexity to data size but are computationally tough. Amortized variational methods are efficient but use fixed truncations, causing issues like over-pruning. This paper proposes a new variational approach using Russian roulette sampling from statistical physics. It adapts complexity during inference without fixed truncation, maintaining unbiased gradient estimates. Applied to infinite variational auto-encoders with a Beta-Bernoulli (Indian buffet process) prior, it offers a flexible, effective solution.

Github

EvolveGCN: Evolving Graph Convolutional Networks for Dynamic Graphs

Graph representation learning gains traction, adapting deep learning from Euclidean to non-Euclidean graph data via graph neural networks (GNNs). While effective in static settings, real-world graphs evolve dynamically. Traditional methods use node embeddings and recurrent neural networks (RNNs) to track temporal changes, but struggle with varying node sets across time. This paper introduces EvolveGCN, which evolves graph convolutional network (GCN) parameters temporally using an RNN, bypassing node embeddings. Two evolution architectures are explored. Tests on link prediction, edge, and node classification show EvolveGCN outperforms related methods.
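
Below is a toy sketch of the core mechanism, in which a GRU cell evolves a GCN layer's weight matrix from one graph snapshot to the next; the actual EvolveGCN-H and EvolveGCN-O architectures differ in how the recurrent update is driven and parameterized.

```python
import torch
import torch.nn as nn

class EvolvingGCNLayer(nn.Module):
    """Toy sketch: a GRU evolves the GCN weight matrix over time instead of
    learning per-node embeddings."""
    def __init__(self, dim: int):
        super().__init__()
        self.gru = nn.GRUCell(dim, dim)      # updates one weight column per "batch" row
        self.w0 = nn.Parameter(torch.randn(dim, dim) / dim ** 0.5)

    def forward(self, snapshots):
        """`snapshots` is a list of (A_hat, X) pairs, one per time step, with
        A_hat the normalized adjacency and X the node features."""
        W, outputs = self.w0, []
        for a_hat, x in snapshots:
            # Treat each column of W as a GRU hidden state and evolve it.
            W = self.gru(W.t(), W.t()).t()
            outputs.append(torch.relu(a_hat @ x @ W))
        return outputs

dim = 16
layer = EvolvingGCNLayer(dim)
snaps = [(torch.eye(10), torch.randn(10, dim)) for _ in range(3)]
print([h.shape for h in layer(snaps)])   # three (10, 16) snapshots
```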

Github

Scalable Graph Learning for Anti-Money Laundering: A First Look

Organized crime and human trafficking thrive on complex money laundering. Despite heavy anti-money laundering (AML) efforts, little illicit activity is stopped. This paper outlines the technical challenges and reviews AML methods, introducing scalable graph convolutional neural networks for analyzing vast, dynamic financial data. Using AMLSim, a simulator generating a synthetic graph (1M nodes, 9M edges), initial results show promise. It explores computational efficiency and graph compression, suggesting deep learning could significantly aid AML efforts.

Logical Rule Induction and Theory Learning Using Neural Theorem Proving

Human cognition excels at forming predictive theories from observations. This paper introduces a neuro-symbolic mechanism for logical theory acquisition, learning rules and core facts from observed data. Rules use vector-represented predicates, applied via soft unification to infer facts from core facts. After k inference steps, results are compared to observations, refining rules and facts to match. Built on a novel differentiable rule induction network, it features interpretable, compositional rules. Tests on ILP and domain theory datasets show its effectiveness.

BRAIn: Bayesian Reward-conditioned Amortized Inference for natural language generation from feedback

Distribution matching methods like GDC and DPG for language model alignment lag behind contrastive RLHF methods (e.g., SLiC, DPO) due to high gradient variance. This paper proposes a self-normalized baseline to cut variance and generalizes target distributions in DPG, GDC, and DPO using Bayes’ rule for a reward-conditioned posterior. The new approach, BRAIn (Bayesian Reward-conditioned Amortized Inference), links distribution matching and DPO, outperforming prior methods in summarization and Anthropic HH tasks.

Grafting Vision Transformers

Vision Transformers (ViTs) outshine CNNs in vision tasks by enabling global information sharing in shallow layers, a trait diluted in efficient pyramid designs like Swin Transformer. This paper introduces GrafT, a simple add-on enhancing any network by maintaining global dependencies and multi-scale info across all feature levels. Flexible in depth and sharing backbone resources, GrafT boosts performance across diverse Transformer models. It notably uplifts mobile-size models, adding +3.9%, +1.4%, and +1.9% top-1 accuracy to DeiT-T, Swin-T, and MobileViT-XXS on ImageNet-1k. 

The ThreeDWorld Transport Challenge: A Visually Guided Task-and-Motion Planning Benchmark for Physically Realistic Embodied AI

This paper presents the ThreeDWorld Transport Challenge, a benchmark for visually-guided, physics-driven task-and-motion planning. An agent with two 9-DOF arms navigates a simulated home to locate, pick up, and transport objects to a target spot, using containers as tools. Built on the ThreeDWorld platform with physics-responsive objects and a fully physics-driven API, the task demands planning amid real constraints. Tests show pure RL struggles, while hierarchical planning agents manage partial success but fall short of mastery. This benchmark aims to advance physics-driven robotic intelligence.

Are Fairy Tales Fair? Analyzing Gender Bias in Temporal Narrative Event Chains of Children's Fairy Tales

Social biases in stories, like those in children’s tales, are well-documented in humanities research but often studied manually on a small scale. This paper enhances such efforts with a natural language processing pipeline that extracts temporal verb-based event chains and character attributes (e.g., gender) from narratives. It introduces a verb-event annotation scheme targeting bias-related categories, like stereotypes. A case study on fairy tales shows the framework uncovers gender bias in both individual events and their narrative sequence for female and male characters.

AGENT: A Benchmark for Core Psychological Reasoning

Machine agents need intuitive psychology—reasoning about hidden mental states driving actions—to interact with humans effectively. Humans grasp this early, distinguishing agents from objects. This paper introduces AGENT, a benchmark with 3D animations testing core psychology principles (goals, efficiency, constraints, trade-offs) via four scenarios. Validated with human ratings, AGENT emphasizes generalization in evaluation. Compared against Bayesian inverse planning and a Theory of Mind network, results show that human-level performance requires models to represent agent planning, integrating utility, object knowledge, and physics.

Protecting Users From Themselves: Safeguarding Contextual Privacy in Interactions with Conversational Agents

Conversational agents, like LLMs, are integral to personal life, yet users often overlook privacy risks when sharing information. This paper introduces "contextual privacy," aiming to limit disclosures to only relevant, necessary info for user goals, reducing risks with untrusted LLMs. A user study reveals even privacy-aware individuals leak sensitive data indirectly. The authors propose a local framework that sits between users and LLMs, detecting and reframing out-of-context info in prompts. Evaluated with ShareGPT data, lightweight models boost contextual privacy while maintaining user intent across classification methods.

Privacy without Noisy Gradients: Slicing Mechanism for Generative Model Training

Training generative models with differential privacy (DP) often involves noisy gradients or altered discriminator training, hindering tuning and convergence. This paper uses the slicing privacy mechanism, adding noise to low-dimensional data projections with strong privacy guarantees, for training. It introduces smoothed-sliced f-divergence, proven statistically consistent, and a kernel-based estimator avoiding adversarial training. Experiments show superior synthetic data quality over baselines. By avoiding noisy gradients, it allows flexible generator adjustments, unlimited epochs, and restarts without extra privacy costs.
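
A rough sketch of the slicing idea: project the private data onto random low-dimensional directions and perturb only those projections with Gaussian noise, which the generator is then trained to match. The noise scale below is a placeholder; the actual mechanism calibrates it to a formal (epsilon, delta) differential-privacy guarantee and uses the smoothed-sliced f-divergence for training.

```python
import numpy as np

def noisy_random_projections(data: np.ndarray, n_slices: int = 64,
                             noise_scale: float = 1.0, seed: int = 0):
    """Project data onto random unit directions and add Gaussian noise.

    `noise_scale` is a placeholder, not a DP-calibrated value.
    """
    rng = np.random.default_rng(seed)
    d = data.shape[1]
    directions = rng.standard_normal((d, n_slices))
    directions /= np.linalg.norm(directions, axis=0, keepdims=True)
    projections = data @ directions                     # (n_samples, n_slices)
    noisy = projections + noise_scale * rng.standard_normal(projections.shape)
    return directions, noisy

X = np.random.default_rng(1).standard_normal((500, 10))
dirs, noisy_proj = noisy_random_projections(X)
print(noisy_proj.shape)   # (500, 64)
```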

Aleatoric and Epistemic Discrimination: Fundamental Limits of Fairness Interventions

Machine learning models may discriminate due to data biases or development choices. This paper splits discrimination into aleatoric (data-inherent) and epistemic (model-induced) types. Aleatoric discrimination is quantified by assessing model performance limits under fairness constraints with perfect data knowledge, using Blackwell’s statistical experiment comparison. Epistemic discrimination is the gap between this limit and actual model accuracy under fairness rules. Applied to fairness interventions and missing-value data, results show current methods eliminate epistemic bias on standard datasets but struggle with aleatoric bias in incomplete data.

Adapting Fairness Interventions to Missing Values

Missing values in data challenge algorithmic fairness, disproportionately impacting demographic groups. The common "impute-then-classify" approach—imputing missing data then classifying—can worsen discrimination. This paper shows that classifiers trained on imputed data lose missing pattern info, degrading group fairness and accuracy. It introduces scalable, adaptive algorithms for fair classification that preserve missing pattern information and work with existing fairness methods. Tests with top fairness interventions across datasets show these algorithms outperform impute-then-classify in fairness and accuracy.

Quantifying Representation Reliability in Self-Supervised Learning Models

Self-supervised learning creates versatile data representations, but their reliability for downstream tasks is critical. This paper defines representation reliability as the ability of downstream models to consistently predict accurately using a test point’s representation. Since downstream data is often unavailable due to privacy, the authors propose an ensemble method to estimate reliability without prior task knowledge. It leverages neighborhood consistency across pre-trained representation spaces, aligning them with shared neighbor anchors. Extensive tests show this method strongly correlates with reliability, outperforming baselines.


Github

Generalization Bounds for Noisy Iterative Algorithms Using Properties of Additive Noise Channels

Machine learning models trained by different optimization algorithms under different data distributions can exhibit distinct generalization behaviors. We analyze the generalization of models trained by noisy iterative algorithms. We derive distribution-dependent generalization bounds by connecting noisy iterative algorithms to additive noise channels found in communication and information theory. Our generalization bounds shed light on several applications, including differentially private stochastic gradient descent (DP-SGD), federated learning, and stochastic gradient Langevin dynamics (SGLD). We show that they can help understand recent empirical observations of the generalization phenomena of neural networks.

Beyond Adult and COMPAS: Fair Multi-Class Prediction via Information Projection

This paper tackles fair multi-class classification by "projecting" a pre-trained, potentially unfair classifier onto a fair model set meeting group-fairness goals. The fair model adjusts the pre-trained outputs with a multiplicative factor. It offers a parallelizable, iterative algorithm with sample complexity and convergence guarantees. Tests against top benchmarks show it balances accuracy and fairness well, with fast runtime on big datasets. It scales effectively, proven on a 1M+ sample dataset with multiple classes and intersectional groups.

Fairness without Imputation: A Decision Tree Approach for Fair Prediction with Missing Values

This paper examines fairness issues in machine learning when training on datasets with missing values, where missing patterns may tie to group attributes (e.g., gender, race). Most fairness methods assume complete data, but imputing missing values can skew fairness. The authors analyze discrimination risks theoretically and propose a decision tree-based method, Missing Incorporated as Attribute (MIA), that skips separate imputation. It optimizes a fairness-regularized function directly. Tests on real datasets show it beats fairness interventions on imputed data.

NetGSR: Towards Efficient and Reliable Network Monitoring with Generative Super Resolution

Network monitoring systems aggregate data from network elements to a central collector for visibility, balancing efficiency (low overhead) and high-fidelity (accurate status). Dynamic networks challenge this balance, with prior methods sacrificing one for the other. This paper introduces NetGSR, a deep learning solution using a tailored conditional generative model (DistilGAN) and a feedback mechanism (Xaminer) to reconstruct fine-grained network status from low-resolution data. Xaminer adjusts sampling rates based on uncertainty and denoising. Tested on real-world datasets, NetGSR achieves 25x better efficiency and fast inference, maintaining fidelity.

PH-Dropout: Practical Epistemic Uncertainty Quantification for View Synthesis

Neural Radiance Fields (NeRF) and Gaussian Splatting (GS) excel in view synthesis, but lack efficient epistemic Uncertainty Quantification (UQ). Current NeRF UQ methods add heavy computational costs (e.g., 10x training time), while GS has no systematic UQ approach. This gap hinders robustness and scalability. This paper reexamines NeRF and GS as function approximation, revealing key 3D representation insights. It proposes PH-Dropout, the first real-time, accurate UQ method for pre-trained NeRF and GS models. Extensive tests confirm its theoretical basis and effectiveness.

Github

Graphical vs. Deep Generative Models: Measuring the Impact of Differentially Private Mechanisms and Budgets on Utility

Generative models with Differential Privacy (DP) create synthetic tabular data while lowering privacy risks, but their privacy-utility tradeoffs complicate model selection. This paper analyzes how DP models allocate privacy budgets across rows and columns, a key utility factor. It compares graphical and deep models, examining modeling techniques, DP mechanisms, and data dimensionality. Findings show graphical models spread budgets horizontally, struggling with wide datasets, while deep models spend per iteration, adapting better to varying dimensions. Low privacy (ϵ≥100) can enhance generalization. This guides DP model choice for datasets, privacy needs, and tasks.

Practical Hamiltonian Monte Carlo on Riemannian Manifolds via Relativity Theory

Hamiltonian Monte Carlo (HMC) samples unnormalized densities via Hamiltonian dynamics. Girolami & Calderhead (2011) extended HMC to Riemannian manifolds, but instability persists. Past efforts improved stability with robust metric tensors. This paper enhances stability by designing dynamics, building on Lu et al. (2017)’s momentum distribution to cap particle speed. Generalized to Riemannian manifolds, it introduces position-dependent velocity bounds, curbing step sizes in high-curvature areas to cut numerical errors. It also offers a practical algorithm for sampling relativistic momentum without mean-field reliance.

Shrinking VOD Traffic via Rényi-Entropic Optimal Transport

As Internet Video on Demand (VOD) traffic surges, this paper shifts focus from infrastructure optimization to shaping user demand for cache efficiency. It proposes a mechanism to adjust request distributions to be cache-friendly while staying close to user preferences cost-wise. Using Rényi entropy as a novel proxy for cache footprint—measuring video richness and access evenness—it formulates an optimal transport problem to reduce this metric. A key theorem links entropy minimization to maximizing soft cache hit ratio (SCHR), allowing video substitutions. Tests on a city-scale dataset cut cache size by 83% and boost SCHR near 100%.
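
The Rényi-entropy proxy itself is straightforward to compute from a request distribution, as the sketch below shows: skewing demand toward fewer videos lowers the entropy, which the paper ties to a smaller cache footprint and a higher soft cache hit ratio. The example distributions are illustrative only.

```python
import numpy as np

def renyi_entropy(p: np.ndarray, alpha: float) -> float:
    """Rényi entropy H_alpha(p) = log(sum_i p_i**alpha) / (1 - alpha)."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0] / p.sum()
    if np.isclose(alpha, 1.0):                      # Shannon limit
        return float(-(p * np.log(p)).sum())
    return float(np.log((p ** alpha).sum()) / (1.0 - alpha))

uniform = np.full(100, 1 / 100)                     # cache-unfriendly demand
skewed = np.array([0.5] + [0.5 / 99] * 99)          # cache-friendly demand
print(renyi_entropy(uniform, 2.0), renyi_entropy(skewed, 2.0))  # skewed is lower
```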