CMU Researchers Introduce the Open Whisper-Style Speech Model: Advancing Open-Source Solutions for Efficient and Transparent Speech Recognition Training


SOURCE: HTTPS://WWW.MARKTECHPOST.COM/
OCT 03, 2023

Large-scale Transformers have attracted considerable attention in natural language processing (NLP). Trained on massive datasets, these models have demonstrated remarkable emergent abilities across various downstream applications. Notably, comparable pre-training methods have been applied successfully to speech processing. Large-scale supervised learning is a promising path toward universal speech models that can handle many speech tasks within a single model. OpenAI Whisper, a collection of multilingual, multitask models, was trained on 680k hours of labeled speech data carefully curated from a variety of online sources.
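
For context, the released Whisper checkpoints can already be run locally through OpenAI's open-source `openai-whisper` Python package. The short sketch below shows a typical transcription call; the checkpoint size and audio file name are placeholders for illustration, not details from the article.

```python
# A minimal sketch of transcribing audio with a released Whisper checkpoint
# via the openai-whisper package (pip install -U openai-whisper; needs ffmpeg).
import whisper

# Load one of the pre-trained multilingual checkpoints ("tiny" up to "large").
model = whisper.load_model("small")

# A single call performs language identification and transcription;
# passing task="translate" instead yields any-to-English translation.
result = model.transcribe("speech_sample.wav")  # placeholder audio file

print(result["language"])  # detected language code, e.g. "en"
print(result["text"])      # transcribed text
```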

Although pre-trained Whisper models and inference code have been released, the complete model-building process (from data preparation to training) remains unavailable to the general public, a situation that is common for large language models (LLMs). This restriction raises several issues:

  1. Because users do not know the actual training data, applying the pre-trained models to new benchmarks carries the risk of data leakage.
  2. Users lack access to the training dynamics, which makes it difficult for researchers to understand the underlying mechanisms and to identify strategies for improving the model’s performance.
  3. Without access to the entire model development pipeline, it is far more difficult to deal with problems of robustness, fairness, bias, and toxicity, all of which typically arise from the data and the training process.

There has recently been a determined push toward open science in LLM research, calling for the publication of complete training pipelines. This inspired the research team from Carnegie Mellon University, Shanghai Jiao Tong University, and Honda Research Institute to create the Open Whisper-style Speech Model (OWSM), which uses an open-source toolkit and publicly available data to reproduce Whisper-style training. OWSM adopts the Whisper framework to handle crucial tasks, including multilingual automatic speech recognition (ASR), language identification (LID), and utterance-level segmentation.

Notably, OWSM also introduces several technical innovations. It supports any-to-any speech translation rather than only any-to-English translation, and it uses a variety of strategies to improve training efficiency. Reproducible recipes cover the entire pipeline, from data preparation and training to inference and scoring. The team also plans to release pre-trained models and training logs, letting researchers dig into the mechanics of the training procedure and draw insights for their own research.
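
The open-source toolkit behind OWSM is ESPnet, and the released checkpoints are intended to be usable through its Python inference interface. The sketch below illustrates, under stated assumptions, what loading a checkpoint for multilingual ASR might look like: the `Speech2Text` class in `espnet2.bin.s2t_inference`, the `espnet/owsm_v1` model identifier, the `lang_sym`/`task_sym` arguments, and the audio file name are assumptions for illustration rather than details confirmed by the article.

```python
# A rough sketch (not verbatim from the OWSM release) of running inference
# with a released OWSM checkpoint through ESPnet's Python interface.
# Assumed for illustration: the espnet2.bin.s2t_inference.Speech2Text class,
# the "espnet/owsm_v1" Hugging Face identifier, and the lang_sym/task_sym
# arguments; exact names may differ across OWSM and ESPnet versions.
import soundfile as sf
from espnet2.bin.s2t_inference import Speech2Text

# Download an OWSM checkpoint and configure the single multitask model for
# English transcription; a different task token (e.g. a speech-translation
# symbol) would switch the same model to another task.
model = Speech2Text.from_pretrained(
    "espnet/owsm_v1",   # assumed model identifier
    lang_sym="<eng>",   # language token for the input audio
    task_sym="<asr>",   # task token: transcription
)

speech, sample_rate = sf.read("speech_sample.wav")  # placeholder audio file
results = model(speech)

# The n-best list places the decoded text first in each hypothesis tuple;
# the exact tuple layout can vary between ESPnet versions.
print(results[0][0])
```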

While OWSM performs comparably to Whisper, or even better on some metrics, its goal is not to engage in a protracted arms race with Whisper. The team’s largest training set amounts to only about 25% of the 680k hours used by Whisper, and resource constraints prevent them from executing numerous trial runs.

In the future, the team plans to explore the following directions:

  1. The current OWSM still lags behind Whisper on many benchmarks. The researchers believe that using more sophisticated encoder and decoder architectures, gathering more varied ASR and speech translation (ST) data from open sources, and incorporating self-supervised speech representations, as in Google USM, can help close the gap.
  2. They also intend to add other speech-processing tasks to the multitask framework, such as spoken language understanding and speech synthesis based on discrete representations, to create “universal speech models.”