Pruning and Distilling LLMs Using NVIDIA TensorRT Model Optimizer
SOURCE: DEVELOPER.NVIDIA.COM
OCT 07, 2025
By Max Xu, Keval Morabia, Asha Anoosheh, and Jamie Li
Large language models (LLMs) have set a high bar in natural language processing (NLP) tasks such as coding, reasoning, and math. However, their deployment remains resource-intensive, motivating a growing interest in small language models (SLMs) that offer strong performance at a fraction of the cost.
NVIDIA researchers and engineers have demonstrated a method that combines structured weight pruning with knowledge distillation, a powerful strategy for compressing large models into smaller, efficient variants without significant loss in quality. For more details, see Compact Language Models via Pruning and Knowledge Distillation.
This post explains model pruning and knowledge distillation, how they work, and how you can easily apply them to your own models to achieve optimal performance using NVIDIA TensorRT Model Optimizer.
Pruning is a model optimization technique that exploits the over-parameterization common in neural networks, which are typically trained with more capacity than they ultimately need so that they can learn complex features and converge smoothly. Pruning systematically identifies and removes unimportant parameters, such as weights, neurons, or even entire layers, from a trained model.
This process can often eliminate large amounts of a model’s weights with minimal impact on accuracy, directly translating to a more compact model with accelerated inference speeds and lower computational cost. Similar to how an arborist trims a tree to improve its health and growth, model pruning makes a model smaller and more efficient.
Depth pruning and width pruning are the two main approaches.
Depth pruning removes entire layers from the neural network, reducing the overall depth and complexity (Figure 1).

Figure 1. Depth pruning a neural network reduces overall depth and complexity
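As a rough illustration of depth pruning, the sketch below treats a model as an ordered list of layers, scores each layer by how much the output changes on a few calibration inputs when that layer is skipped, and keeps the highest-scoring layers in their original order. The scoring rule and the helper names (layer_importance, depth_prune) are assumptions made for illustration; they are not the criterion that TensorRT Model Optimizer uses.

```python
from typing import Callable, List, Sequence, Tuple

Layer = Callable[[float], float]


def run(layers: Sequence[Layer], x: float) -> float:
    """Apply the layers in order to a scalar input (toy stand-in for a forward pass)."""
    for layer in layers:
        x = layer(x)
    return x


def layer_importance(layers: Sequence[Layer], calib: Sequence[float]) -> List[float]:
    """Hypothetical score: mean absolute output change when one layer is skipped."""
    baseline = [run(layers, x) for x in calib]
    scores = []
    for i in range(len(layers)):
        without = list(layers[:i]) + list(layers[i + 1:])
        diff = sum(abs(run(without, x) - b) for x, b in zip(calib, baseline))
        scores.append(diff / len(calib))
    return scores


def depth_prune(
    layers: Sequence[Layer], calib: Sequence[float], keep: int
) -> Tuple[List[Layer], List[int]]:
    """Keep the `keep` most important layers, preserving their original order."""
    scores = layer_importance(layers, calib)
    top = sorted(range(len(layers)), key=lambda i: scores[i], reverse=True)[:keep]
    kept = sorted(top)
    return [layers[i] for i in kept], kept


# Toy model: two layers matter, two barely change the output.
toy_layers: List[Layer] = [
    lambda x: x + 1.0,
    lambda x: x + 0.001,
    lambda x: x * 2.0,
    lambda x: x + 0.002,
]
calibration = [1.0, 2.0, 3.0]
pruned_layers, kept_indices = depth_prune(toy_layers, calibration, keep=2)
```

In this toy setup the two near-identity layers receive the lowest scores and are dropped, so the pruned model is half as deep while its output stays close to the original.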
Width pruning eliminates internal structures such as individual neurons, attention heads, or embedding channels, slimming down the model’s width (Figure 2).

Figure 2. Reducing layer width by pruning an unimportant neuron
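Width pruning can be sketched in the same spirit. The example below scores each hidden neuron of a tiny two-layer MLP by its mean absolute activation on calibration inputs, then removes the lowest-scoring neurons by deleting the matching rows of the first weight matrix and columns of the second. The activation-based score and the function names are illustrative assumptions, not the Model Optimizer implementation.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def matvec(weights: Matrix, x: Sequence[float]) -> List[float]:
    """Multiply a row-major matrix by a vector."""
    return [sum(w * a for w, a in zip(row, x)) for row in weights]


def relu(v: Sequence[float]) -> List[float]:
    return [max(0.0, a) for a in v]


def mlp(w1: Matrix, w2: Matrix, x: Sequence[float]) -> List[float]:
    """Two-layer MLP: w1 maps input to hidden, w2 maps hidden to output."""
    return matvec(w2, relu(matvec(w1, x)))


def neuron_importance(w1: Matrix, calib: Sequence[Sequence[float]]) -> List[float]:
    """Hypothetical score: mean absolute hidden activation over calibration inputs."""
    totals = [0.0] * len(w1)
    for x in calib:
        for j, a in enumerate(relu(matvec(w1, x))):
            totals[j] += abs(a)
    return [t / len(calib) for t in totals]


def width_prune(
    w1: Matrix, w2: Matrix, calib: Sequence[Sequence[float]], keep: int
) -> Tuple[Matrix, Matrix, List[int]]:
    """Keep the `keep` highest-scoring hidden neurons."""
    scores = neuron_importance(w1, calib)
    top = sorted(range(len(w1)), key=lambda j: scores[j], reverse=True)[:keep]
    kept = sorted(top)
    new_w1 = [w1[j] for j in kept]                    # drop rows (neuron inputs)
    new_w2 = [[row[j] for j in kept] for row in w2]   # drop columns (neuron outputs)
    return new_w1, new_w2, kept


# Three hidden neurons; the third never activates on the calibration data.
W1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
W2 = [[1.0, 2.0, 3.0]]
calibration = [[1.0, 2.0], [3.0, 1.0]]
W1_pruned, W2_pruned, kept_neurons = width_prune(W1, W2, calibration, keep=2)
```

Because the removed neuron is inactive on the calibration inputs, the narrower MLP reproduces the original outputs on them. Real models rarely offer such clean cases, which is why pruning is usually followed by retraining.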
The core idea is to identify and remove the parts of the LLM that contribute least to its overall performance. Different methods can be used to assess the importance of each component.
Research shows that width pruning typically achieves better accuracy than depth pruning, though depth pruning often reduces inference latency more at the same number of parameters. The choice between depth pruning, width pruning, or a combination of both should depend on the desired balance between accuracy and latency. For more information, see LLM Pruning and Distillation in Practice: The Minitron Approach.
Knowledge distillation is a model compression technique that transfers knowledge from a larger “teacher” model to a smaller and more efficient “student” model (Figure 3). The goal is to create a compact model that retains the high performance of the larger model, making it suitable for deployment at a lower resource cost.

Figure 3. Knowledge distillation trained student and teacher model outputs
Knowledge distillation trains a compact student model to emulate a larger teacher, not by relying solely on hard labels, but by learning from the teacher’s guidance. This transfers rich, generalizable behavior so the student approaches the teacher’s accuracy while running far more efficiently.
Two common distillation styles, response-based and feature-based, differ in how each passes knowledge from teacher to student.
Response-based knowledge distillation transfers a teacher model’s knowledge to a student by training the student to match the teacher’s soft output probabilities rather than only hard labels. These soft targets convey inter-class similarities, for example that “cat” is closer to “tiger” than to “car,” and the student is optimized to align with them using KL divergence.
The approach is simple to implement, requires no access to the teacher’s internal features, and is highly effective for classification tasks. In practice, it’s common to combine the distillation loss with standard cross-entropy on ground-truth labels and tune the loss weights to balance stability and fidelity, yielding compact models that preserve much of the teacher’s accuracy.

Figure 4. Student learning from a teacher’s soft targets through output comparison
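The combined objective described above can be written down in a few lines. The following is a minimal sketch, assuming a temperature-softened softmax, a KL term scaled by the squared temperature (a common convention), and a single weight alpha that balances the distillation and cross-entropy terms. The function names, temperature, and alpha values are illustrative choices, not settings taken from Model Optimizer.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with optional temperature softening."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) for two discrete distributions over the same classes."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def cross_entropy(logits: Sequence[float], label: int) -> float:
    """Standard cross-entropy against a hard ground-truth label."""
    return -math.log(softmax(logits)[label])


def distillation_loss(
    student_logits: Sequence[float],
    teacher_logits: Sequence[float],
    label: int,
    temperature: float = 2.0,
    alpha: float = 0.5,
) -> float:
    """alpha * soft-target KL term + (1 - alpha) * hard-label cross-entropy."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kd_term = kl_divergence(p_teacher, p_student) * temperature ** 2
    ce_term = cross_entropy(student_logits, label)
    return alpha * kd_term + (1.0 - alpha) * ce_term


# Classes: ["cat", "tiger", "car"]. The teacher considers "tiger" close to "cat".
teacher = [4.0, 3.0, -2.0]
student_similar = [4.0, 3.0, -2.0]   # agrees with the teacher's ranking
student_confused = [4.0, -2.0, 3.0]  # thinks "car" is close to "cat"
loss_similar = distillation_loss(student_similar, teacher, label=0)
loss_confused = distillation_loss(student_confused, teacher, label=0)
```

Both students predict "cat" with the same confidence, so their cross-entropy terms are equal. Only the soft-target term separates them, which is exactly the inter-class information that hard labels cannot provide.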
Feature-based knowledge distillation transfers a teacher’s intermediate representations (hidden activations or feature maps) to guide a student toward learning similar internal structure, not just similar outputs. During training, selected teacher and student layers are paired and aligned; projection layers are often used when their dimensions differ.
This deeper, layer-level supervision provides richer signals than response-based KD and has proven effective across vision (CNN feature maps, for example) and NLP (Transformer hidden states and attentions, for example). Because it relies on internal activations, this technique requires access to the teacher’s intermediate layers and careful layer selection and weighting alongside the standard task loss to balance stability and accuracy.

Figure 5. Student learning from a teacher’s hidden layer’s feature map comparison
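The layer pairing and projection step can be illustrated with a small sketch. It assumes a mean-squared-error alignment loss, one linear projection per student layer that maps the student width to the teacher width, and per-pair weights. All names, shapes, and values are hypothetical; in a real setup the projections are trained jointly with the student and this loss is added to the standard task loss.

```python
from typing import Dict, List, Sequence, Tuple

Matrix = List[List[float]]


def matvec(weights: Matrix, x: Sequence[float]) -> List[float]:
    """Multiply a row-major matrix by a vector."""
    return [sum(w * a for w, a in zip(row, x)) for row in weights]


def mse(a: Sequence[float], b: Sequence[float]) -> float:
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def feature_distillation_loss(
    student_hiddens: Sequence[Sequence[float]],
    teacher_hiddens: Sequence[Sequence[float]],
    projections: Dict[int, Matrix],
    layer_pairs: Sequence[Tuple[int, int]],
    pair_weights: Sequence[float],
) -> float:
    """Weighted sum of MSE between projected student and teacher hidden states."""
    total = 0.0
    for (s_idx, t_idx), weight in zip(layer_pairs, pair_weights):
        projected = matvec(projections[s_idx], student_hiddens[s_idx])
        total += weight * mse(projected, teacher_hiddens[t_idx])
    return total


# Student: 2 layers of width 2. Teacher: 4 layers of width 3.
student_hiddens = [[1.0, 2.0], [0.5, -1.0]]
teacher_hiddens = [
    [0.0, 0.0, 0.0],
    [1.0, 2.0, 3.0],
    [0.0, 0.0, 0.0],
    [0.5, -1.0, 0.5],
]
# Projection from width 2 to width 3 (fixed here; learned in practice).
projection = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
projections = {0: projection, 1: projection}
# Pair student layer 0 with teacher layer 1, and student layer 1 with teacher layer 3.
layer_pairs = [(0, 1), (1, 3)]
pair_weights = [1.0, 0.5]

feature_loss = feature_distillation_loss(
    student_hiddens, teacher_hiddens, projections, layer_pairs, pair_weights
)
```

Here the first pair already matches after projection, so the whole loss comes from the second pair. The choice of which layers to pair and how to weight them is the "careful layer selection and weighting" that the text refers to.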
Pruning and distillation form a powerful pipeline for model compression, enabling the creation of SLMs that are well-suited for deployment in production environments and edge applications. TensorRT Model Optimizer streamlines applying these techniques at scale, turning state-of-the-art LLMs into deployable, cost-effective solutions.
This section walks you through how to build a pipeline using TensorRT Model Optimizer. It includes dataset preparation, fine-tuning a teacher model on the WikiText dataset, and applying pruning and distillation techniques to produce a 6B-parameter model from Qwen3-8B. For more information, see the Qwen3-8B Pruning and Distillation with NeMo 2.0 Framework notebook.
Prior to pruning and distillation, it is necessary to convert Hugging Face models to the NVIDIA NeMo checkpoint format and preprocess the dataset. For detailed instructions, refer to the model conversion and data preparation step.
Here, we will demonstrate how to prune using both the depth pruning and width pruning approaches. The scripts provided can be run inside the NVIDIA NeMo framework container nvcr.io/nvidia/nemo:25.09.
The first approach trims the Qwen3-8B model from 36 to 24 layers (about 6B parameters), automatically selecting the best 24 layers to keep using a small calibration dataset of 1,024 samples.
The script for this process is provided below, showing how to prune using a two-GPU pipeline parallel setup.
torchrun --nproc_per_node 2 /opt/NeMo/scripts/llm/gpt_prune.py \
    --devices 2 \
    --pp_size 2 \
    --restore_path Qwen3-8B-nemo \
    --legacy_ckpt \
    --save_path Qwen3-8B-nemo-depth-pruned \
    --seq_length 4096 \
    --num_train_samples 1024 \
    --mbs 4 \
    --data_paths wikitext-data/wikitext-train_text_document \
    --target_num_layers 24
The second approach reduces model size through width pruning. It shrinks key architectural components: the MLP intermediate size (ffn_hidden_size) is reduced from 12,288 to 9,216, and the embedding dimension (hidden_size) from 4,096 to 3,584, which also results in a 6B model.
Further reductions in the number of attention heads (num_attention_heads) and GQA query groups (num_query_groups) can be implemented as needed. The layer count (num_layers) may also be adjusted to achieve the desired model size.
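The "about 6B" figure quoted for both pruning routes can be sanity-checked with back-of-the-envelope arithmetic. The sketch below counts only the large weight matrices of a decoder-only Transformer with grouped-query attention and a gated MLP. The values for vocabulary size, attention heads, query groups, head dimension, and untied embeddings are assumptions based on the publicly released Qwen3-8B configuration, not numbers stated in this post, and small terms such as normalization weights are ignored.

```python
def transformer_params(
    num_layers: int,
    hidden_size: int,
    ffn_hidden_size: int,
    num_attention_heads: int = 32,   # assumption: Qwen3-8B public config
    num_query_groups: int = 8,       # assumption: GQA key/value groups
    head_dim: int = 128,             # assumption
    vocab_size: int = 151_936,       # assumption
    tied_embeddings: bool = False,   # assumption: separate input/output embeddings
) -> int:
    """Approximate parameter count from the main weight matrices only."""
    q_dim = num_attention_heads * head_dim
    kv_dim = num_query_groups * head_dim
    # Q projection, K and V projections, and output projection.
    attention = hidden_size * q_dim + 2 * hidden_size * kv_dim + q_dim * hidden_size
    # Gated MLP: gate, up, and down projections.
    mlp = 3 * hidden_size * ffn_hidden_size
    embeddings = vocab_size * hidden_size * (1 if tied_embeddings else 2)
    return num_layers * (attention + mlp) + embeddings


baseline = transformer_params(num_layers=36, hidden_size=4096, ffn_hidden_size=12288)
depth_pruned = transformer_params(num_layers=24, hidden_size=4096, ffn_hidden_size=12288)
width_pruned = transformer_params(num_layers=36, hidden_size=3584, ffn_hidden_size=9216)

for name, count in [
    ("baseline", baseline),
    ("depth-pruned", depth_pruned),
    ("width-pruned", width_pruned),
]:
    print(f"{name}: {count / 1e9:.2f}B parameters")
```

Under these assumptions the estimate comes out at roughly 8.2B for the original model, 5.9B for the 24-layer variant, and 6.0B for the width-pruned variant, which is consistent with the sizes quoted in the text.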
The script for this process is provided below, showing how to prune using a two-GPU pipeline parallel setup.
torchrun --nproc_per_node 2 /opt/NeMo/scripts/llm/gpt_prune.py \
    --devices 2 \
    --pp_size 2 \
    --restore_path Qwen3-8B-nemo \
    --legacy_ckpt \
    --save_path Qwen3-8B-nemo-width-pruned \
    --seq_length 4096 \
    --num_train_samples 1024 \
    --mbs 4 \
    --data_paths wikitext-data/wikitext-train_text_document \
    --target_ffn_hidden_size 9216 \
    --target_hidden_size 3584
By trimming redundant or low-importance weights, pruning not only shrinks the model’s memory footprint but can also speed up inference. However, this process is typically followed by fine-tuning or retraining to recover any accuracy lost during the pruning phase and to ensure the pruned model maintains high performance on target tasks. This is where distillation comes in.
This example distills the Qwen3 depth- and width-pruned models using knowledge distillation with Model Optimizer and the NeMo 2.0 Framework.
When distilling knowledge from the teacher model to a depth-pruned model, the path of the student model will be Qwen3-8B-nemo-depth-pruned. This path corresponds to the output of the depth-pruning step, as detailed in the NeMo distillation notebook.
The script for this process is provided below, showing how to distill using a single-node, eight-GPU tensor parallel setup. In practice, we recommend multinode training to shorten training time.
torchrun --nproc_per_node 8 /opt/NeMo/scripts/llm/gpt_train.py \
    --name Qwen3-8B-nemo-depth-pruned-distill \
    --devices 8 \
    --num_nodes 1 \
    --tp_size 8 \
    --model_path Qwen3-8B-nemo-depth-pruned \
    --teacher_path Qwen3-8B-nemo \
    --legacy_ckpt \
    --max_steps 40 \
    --warmup_steps 1 \
    --gbs 768 \
    --mbs 8 \
    --lr 1e-4 \
    --min_lr 1e-5 \
    --seq_length 4096 \
    --log_dir . \
    --log_interval 5 \
    --val_check_interval 5 \
    --limit_val_batches 2 \
    --data_paths wikitext-data/wikitext-train_text_document
When distilling knowledge from the teacher to the width-pruned model, the student_model_path value is Qwen3-8B-nemo-width-pruned, as produced by the width-pruning step in the NeMo pruning notebook. Further details can be found in the NeMo distillation notebook.
The script for this process is provided below, showing how to distill using a single-node, eight-GPU tensor parallel setup. In practice, we recommend multinode training to shorten training time.
torchrun --nproc_per_node 8 /opt/NeMo/scripts/llm/gpt_train.py \
    --name Qwen3-8B-nemo-width-pruned-distill \
    --devices 8 \
    --num_nodes 1 \
    --tp_size 8 \
    --model_path Qwen3-8B-nemo-width-pruned \
    --teacher_path Qwen3-8B-nemo \
    --legacy_ckpt \
    --max_steps 40 \
    --warmup_steps 1 \
    --gbs 768 \
    --mbs 8 \
    --lr 1e-4 \
    --min_lr 1e-5 \
    --seq_length 4096 \
    --log_dir . \
    --log_interval 5 \
    --val_check_interval 5 \
    --limit_val_batches 2 \
    --data_paths wikitext-data/wikitext-train_text_document
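The flags in the two distillation commands determine how much data each run actually processes. The sketch below works through that arithmetic, assuming the usual Megatron-style relationship in which the global batch size equals micro batch size times data-parallel size times gradient-accumulation steps. The helper name distillation_budget is hypothetical and is not part of NeMo.

```python
from typing import Dict


def distillation_budget(
    gbs: int,
    mbs: int,
    seq_length: int,
    max_steps: int,
    devices: int,
    tp_size: int,
    pp_size: int = 1,
) -> Dict[str, int]:
    """Derive parallelism and token counts from the training flags."""
    model_parallel = tp_size * pp_size
    if devices % model_parallel != 0:
        raise ValueError("devices must be divisible by tp_size * pp_size")
    dp_size = devices // model_parallel
    if gbs % (mbs * dp_size) != 0:
        raise ValueError("gbs must be divisible by mbs * data-parallel size")
    return {
        "data_parallel_size": dp_size,
        "grad_accumulation_steps": gbs // (mbs * dp_size),
        "tokens_per_step": gbs * seq_length,
        "total_tokens": gbs * seq_length * max_steps,
    }


# Values taken from the distillation commands above.
budget = distillation_budget(
    gbs=768, mbs=8, seq_length=4096, max_steps=40, devices=8, tp_size=8
)
for key, value in budget.items():
    print(f"{key}: {value:,}")
```

With tensor parallelism spanning all eight GPUs there is a single data-parallel replica, so each optimizer step accumulates 96 micro batches. Forty steps cover roughly 126M tokens, which suggests these commands are a short demonstration run rather than a full distillation.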
For more comprehensive information, see the NeMo Framework distillation documentation. These resources will help you easily enable and integrate distillation into your workflow.
Experimental results for pruning and distilling Qwen3-8B with Model Optimizer show that the Qwen3 Depth Pruned 6B model is 30% faster than the Qwen3 4B model and also scores higher on the MMLU (Massive Multitask Language Understanding) benchmark. Depth pruning reduced the model from 36 to 24 layers, resulting in a 6B model, and was run on one NVIDIA H100 80 GB HBM3 GPU.
The pruned model is distilled from Qwen3-8B using the OptimalScale/ClimbMix data, which is processed from the nvidia/ClimbMix pretraining dataset. The experiment uses 25% of the data, approximately 90B tokens. Distillation takes 8 hours on 96 nodes, each with eight NVIDIA H100 GPUs (about 6K GPU hours).
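The compute figures quoted above can be cross-checked with simple arithmetic. The snippet below uses only the numbers stated in the text; the derived tokens-per-GPU-hour rate and full-dataset size are rough estimates, not reported measurements.

```python
# Figures stated in the text.
nodes = 96
gpus_per_node = 8
wall_clock_hours = 8
tokens_used = 90e9        # "approximately 90B tokens"
fraction_of_data = 0.25   # "25% of the data"

total_gpus = nodes * gpus_per_node
gpu_hours = total_gpus * wall_clock_hours

# Derived estimates (approximate, since the token count itself is approximate).
tokens_per_gpu_hour = tokens_used / gpu_hours
estimated_full_dataset_tokens = tokens_used / fraction_of_data

print(f"GPUs: {total_gpus}")
print(f"GPU hours: {gpu_hours}")
print(f"Tokens per GPU hour: {tokens_per_gpu_hour / 1e6:.1f}M")
print(f"Estimated full dataset: {estimated_full_dataset_tokens / 1e9:.0f}B tokens")
```

The product of nodes, GPUs per node, and hours gives 6,144 GPU hours, which matches the rounded 6K figure.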

Figure 6. The Qwen3 Depth Pruned 6B model outperforms 4B on both speed and accuracy and approaches 8B accuracy while running much faster
The 6B pruned model runs 30% faster than its 4B counterpart. For the throughput comparison, all models are quantized to FP8 precision using Model Optimizer and run with TensorRT-LLM.
Beyond its speed advantage, the 6B pruned model also exhibits superior accuracy, as evidenced by its higher score on the MMLU benchmark. With a score of 72.5, it surpasses the 4B model’s score of 70.0, indicating a better understanding and capability across a broad range of language-related tasks.
This dual improvement in both speed and accuracy positions the 6B pruned model as a more robust and effective solution for applications requiring both rapid processing and high-quality results.
The pruned models were distilled on a pretraining dataset, so the resulting models are base variants. For that reason, all models were compared only on base-model benchmarks such as MMLU. Using these models for reasoning tasks in practice would also require post-training.
Pruning and knowledge distillation are highly cost-effective methods to progressively shrink LLMs while matching or exceeding baseline accuracy across domains, and they’re typically more data-efficient than either synthetic-data fine-tuning or full pretraining.
Ready to get started? Check out the Qwen3 8B Pruning and Distillation with NeMo 2.0 Framework notebook. Visit the NVIDIA/TensorRT-Model-Optimizer GitHub repo to learn more about pruning and distillation. For more information about model optimization techniques using TensorRT Model Optimizer, see related posts on post-training quantization, quantization-aware training, and speculative decoding.

About Max Xu
Max Xu is a senior technical lead at NVIDIA specializing in AI training and inference at scale, performance engineering, and end-to-end application deployment. He brings full-stack GPU expertise spanning from chip design, CUDA and kernel-level development to server and cloud for model training and inference, translating innovations into real-world impact. Before NVIDIA, Max worked in engineering roles across major CSP and semiconductor companies.

About Keval Morabia
Keval Morabia is a senior deep learning engineer on the NVIDIA TensorRT Model Optimizer team where he focuses on algorithms for optimizing LLMs. More specifically, Keval works on optimization techniques like Pruning, Neural Architecture Search, and Knowledge Distillation that have demonstrated significant speedups for the MLPerf Inference submissions in the past. Keval joined NVIDIA through the acquisition of OmniML Inc., where he was an early ML engineer. Keval received his master's degree in Computer Science from the University of Illinois at Urbana-Champaign and bachelor's degree in Computer Science from BITS Pilani, India.

About Asha Anoosheh
Asha Anoosheh is a deep learning algorithms engineer at NVIDIA working on the TensorRT Model Optimizer library. He holds an M.Sc. in robotics from ETH Zürich, with a focus on computer vision.

About Jamie Li
Jamie Li is a senior technical marketing engineer at NVIDIA focused on wrangling the latest technologies in AI inference. He brings a deep background in both AI software engineering and customer management, translating innovations into practical customer outcomes. Before NVIDIA, he held roles developing, breaking, and fixing AI solutions in the enterprise tech sector. He also did research in medical imaging and holds a master’s degree in Computer Science with an AI focus.