Microsoft Releases VibeVoice-ASR: A Unified Speech-to-Text Model Designed to Handle 60-Minute Long-Form Audio in a Single Pass
SOURCE: MARKTECHPOST.COM
JAN 22, 2026
Microsoft has released VibeVoice-ASR as part of the VibeVoice family of open-source frontier voice AI models. VibeVoice-ASR is described as a unified speech-to-text model that can handle 60-minute long-form audio in a single pass and output structured transcriptions that encode Who, When, and What, with support for Customized Hotwords.
The VibeVoice family lives in a single repository that hosts text-to-speech, real-time TTS, and automatic speech recognition models under an MIT license. VibeVoice uses continuous speech tokenizers that run at 7.5 Hz and a next-token diffusion framework in which a large language model reasons over text and dialogue while a diffusion head generates acoustic detail. This framework is documented mainly for TTS, but it defines the overall design context in which VibeVoice-ASR lives.

https://huggingface.co/microsoft/VibeVoice-ASR
Unlike conventional ASR (automatic speech recognition) systems that first cut audio into short segments and then run diarization and alignment as separate components, VibeVoice-ASR is designed to accept up to 60 minutes of continuous audio within a 64K-token length budget. The model keeps one global representation of the full session, so it can maintain speaker identity and topic context across the entire hour instead of resetting every few seconds.
This long-form, single-pass processing is the first key feature. Segment-based pipelines can lose global context at chunk boundaries, while the 64K-token window lets the model keep consistent speaker tracking and semantic context across the entire recording.
This is important for tasks like meeting transcription, lectures, and long support calls. A single pass over the complete sequence simplifies the pipeline. There is no need to implement custom logic to merge partial hypotheses or repair speaker labels at boundaries between audio chunks.
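The headline numbers are easy to sanity-check. A rough back-of-envelope calculation, assuming the 7.5 Hz tokenizer rate quoted above maps directly to acoustic tokens (the model's exact accounting may differ), shows why an hour of audio fits inside a 64K context:

```python
# Back-of-envelope check (illustrative, not the model's exact accounting):
# at 7.5 acoustic tokens per second, one hour of audio needs far fewer
# tokens than the 64K context budget, leaving headroom for text output.
TOKENS_PER_SECOND = 7.5          # continuous speech tokenizer rate
AUDIO_MINUTES = 60               # maximum single-pass input length
CONTEXT_BUDGET = 64 * 1024       # 64K token window

audio_tokens = int(TOKENS_PER_SECOND * AUDIO_MINUTES * 60)
print(f"acoustic tokens for one hour: {audio_tokens}")
print(f"budget left for text output:  {CONTEXT_BUDGET - audio_tokens}")
```

On these assumptions, the full hour consumes 27,000 tokens, leaving well over half the window for the structured transcript.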
Customized Hotwords are the second key feature. Users can provide hotwords such as product names, organization names, technical terms, or background context. The model uses these hotwords to guide the recognition process.
This allows you to bias decoding toward the correct spelling of domain-specific terms without retraining the model. For example, a developer can pass internal project names or customer-specific terms at inference time. This is useful when deploying the same base model across several products that share similar acoustic conditions but very different vocabularies.
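To make the idea concrete, here is a small stdlib-only illustration of hotword biasing. VibeVoice-ASR applies the bias during decoding; this sketch instead snaps near-miss tokens in a finished transcript to a known domain vocabulary, which captures the same intent. The hotword list and transcript are made up for the example:

```python
import difflib

# Illustration only: VibeVoice-ASR biases during decoding, but the effect
# resembles snapping near-miss tokens to a known domain vocabulary.
HOTWORDS = ["VibeVoice", "Kubernetes", "ContosoPay"]  # hypothetical list

def apply_hotwords(transcript: str, hotwords: list, cutoff: float = 0.8) -> str:
    """Replace tokens that closely match a hotword with the canonical form."""
    by_lower = {h.lower(): h for h in hotwords}
    out = []
    for token in transcript.split():
        match = difflib.get_close_matches(token.lower(), list(by_lower),
                                          n=1, cutoff=cutoff)
        out.append(by_lower[match[0]] if match else token)
    return " ".join(out)

print(apply_hotwords("deploy vibevoce on kubernets", HOTWORDS))
```

A real deployment would pass the hotword list to the model at inference time rather than post-processing, but the matching threshold plays a similar role to the model's biasing strength.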
Microsoft also ships a finetuning-asr directory with LoRA-based fine-tuning scripts for VibeVoice-ASR. Together, hotwords and LoRA fine-tuning offer a path for both lightweight adaptation and deeper domain specialization.
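The reason LoRA counts as lightweight adaptation is arithmetic: instead of updating a full d_out x d_in weight matrix, LoRA trains two low-rank factors B (d_out x r) and A (r x d_in). The layer size and rank below are illustrative placeholders, not VibeVoice-ASR's actual shapes:

```python
# LoRA parameter-count comparison for one linear layer.
# Dimensions are hypothetical, chosen only to show the scale of savings.
d_in = d_out = 4096   # hypothetical projection size
rank = 16             # a typical LoRA rank

full_params = d_out * d_in            # full fine-tune: update W directly
lora_params = rank * (d_in + d_out)   # LoRA: train B (d_out x r) and A (r x d_in)

print(f"full fine-tune:  {full_params:,} trainable weights")
print(f"LoRA (r={rank}):   {lora_params:,} trainable weights")
print(f"reduction:       {full_params // lora_params}x")
```

At these (assumed) dimensions the trainable-parameter count per layer drops by a factor of 128, which is what makes adapting a large ASR model to a new domain tractable on modest hardware.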
The third feature is Rich Transcription with Who, When, and What. The model jointly performs ASR, diarization, and timestamping, and returns a structured output that indicates who said what and when.
The model card reports three evaluation metrics: DER (diarization error rate), cpWER (concatenated minimum-permutation word error rate), and tcpWER (time-constrained cpWER). The accompanying figures summarize how well the model performs on multi-speaker long-form data, which is the primary target setting for this ASR system.
The structured output format is well suited for downstream processing such as speaker-specific summarization, action-item extraction, or analytics dashboards. Since segments, speakers, and timestamps already come from a single model, downstream code can treat the transcript as a time-aligned event log.
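A minimal sketch of such a downstream consumer, assuming a segment schema of speaker, start, end, and text (the keys and sample data are hypothetical; adapt them to the model's actual structured output):

```python
from collections import defaultdict

# Hypothetical structured output: one entry per (speaker, time span, text).
segments = [
    {"speaker": "SPK_0", "start": 0.0, "end": 4.2,  "text": "Let's review the launch plan."},
    {"speaker": "SPK_1", "start": 4.2, "end": 9.8,  "text": "QA signs off on Friday."},
    {"speaker": "SPK_0", "start": 9.8, "end": 12.5, "text": "Then we ship Monday."},
]

# Treat the transcript as a time-aligned event log: accumulate per-speaker
# talk time and group utterances for speaker-specific summarization.
talk_time = defaultdict(float)
lines_by_speaker = defaultdict(list)
for seg in segments:
    talk_time[seg["speaker"]] += seg["end"] - seg["start"]
    lines_by_speaker[seg["speaker"]].append(seg["text"])

for spk in sorted(talk_time):
    print(f"{spk}: {talk_time[spk]:.1f}s  {' '.join(lines_by_speaker[spk])}")
```

Because diarization and timestamps come from the same pass as the text, this kind of aggregation needs no label-repair logic at chunk boundaries.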