Meta gives away a free video dataset of 846 hours


SOURCE: ANALYTICSINDIAMAG.COM
FEB 04, 2022

On February 1, Meta announced a new resource to advance fairness in speech recognition: the company’s AI team released a research paper on a project called ‘Casual Conversations’, an exhaustive dataset with manual transcriptions to help researchers evaluate the accuracy of audio models.

Machine learning models are only as good as their data. When a model recognises the voice patterns of white speakers but neglects those of a particular community, race or gender, it points to a gap in the training data that produces unfair outcomes. In the context of ML, fairness refers to the attempts made to correct these biases in the underlying data. Ideally, data should be collected sensitively and be equally representative of communities regardless of disability, ethnicity and gender.

Research on fairness in Automatic Speech Recognition (ASR) systems pales in comparison to the studies in the area of facial recognition.

Past studies

According to a 2020 Stanford study, the speech recognition systems of the biggest tech companies, including Amazon, Apple, Google, Microsoft and IBM, misidentified 19% of words when the user was white and 35% when the user was Black. However, only two companies responded to the study. Amazon said it was constantly improving its speech recognition service, while Google acknowledged the shortcomings and said it had been taking a long, hard look at the model’s flaws.

In 2014, Google researchers published a paper detailing the reasons behind these biases. Titled ‘Discriminative Pronunciation Modelling for Dialectical Speech Recognition’, the paper described how African American Vernacular English (AAVE), a dialect mostly used by African Americans in casual speech, differs from Standard American English (SAE) in pronunciation and vocabulary. The accuracy of an ASR system drops for a specific dialect when that dialect is under-represented in the training data.

Diverse dataset

The Casual Conversations dataset comprises 846 hours of footage across 45,000 videos, each around a minute long on average. It captures more than 3,000 participants of different ages, ethnicities and genders speaking on random subjects. In addition, the researchers categorised the collected speech by the participants’ skin tones. While skin tone matters more as a variable in computer vision, a participant’s skin tone could be interrelated with variables in speech.

The researchers evaluated several speech recognition models, including a LibriSpeech model, a supervised video model, a semi-supervised video model and a semi-supervised teacher video model. The results showed big accuracy gaps in terms of gender but not across age groups. As it turned out, skin tone was an important factor driving the differences in performance among subgroups. The study concluded that the larger and more varied the dataset, the lower the comparative error rates of the ASR model: a dataset must represent a diverse range of attributes across subgroups to achieve more evenly distributed accuracy.
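To illustrate the kind of evaluation described above, here is a minimal sketch of how word error rate (WER) could be compared across subgroups. The sample data, the "subgroup" label and the helper functions are hypothetical and are not taken from Meta's actual pipeline; they only show the general technique of grouping transcription errors by a demographic attribute.

```python
# Minimal sketch: comparing ASR word error rate (WER) across subgroups.
# All data and label names below are hypothetical, for illustration only.

from collections import defaultdict

def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word sequences."""
    m, n = len(ref_words), len(hyp_words)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]

def wer_by_subgroup(samples):
    """samples: dicts with 'reference', 'hypothesis' and 'subgroup' keys."""
    errors, totals = defaultdict(int), defaultdict(int)
    for s in samples:
        ref = s["reference"].lower().split()
        hyp = s["hypothesis"].lower().split()
        errors[s["subgroup"]] += word_edit_distance(ref, hyp)
        totals[s["subgroup"]] += len(ref)
    return {group: errors[group] / totals[group] for group in totals}

# Hypothetical toy data: one transcription error in the first sample.
samples = [
    {"reference": "turn the lights off", "hypothesis": "turn the light off",
     "subgroup": "darker skin tone"},
    {"reference": "play some jazz music", "hypothesis": "play some jazz music",
     "subgroup": "lighter skin tone"},
]
print(wer_by_subgroup(samples))  # per-subgroup WER, e.g. {'darker skin tone': 0.25, ...}
```

A gap between the per-subgroup WER values is the kind of disparity the Casual Conversations paper measures; a fair model would keep those numbers close together.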

Prospects

Last October, Speechmatics, a UK-based speech recognition company, said its speech recognition system had an accuracy of 83% for African American users. Speechmatics beat Microsoft (73%), Amazon and Google (69% each), IBM (62%) and Apple (55%) hands down on accuracy. In other words, the company’s model failed to recognise 17% of the words spoken by Black voices, compared with Amazon and Google’s 31%.

Speechmatics said it had trained its ML models on reams of unlabelled data from podcasts and social media to expose the software to different accents, styles and grammar. “It would be good if people were open-sourcing test sets that let you evaluate how well you’re doing on this front,” Will Williams, the company’s vice-president of ML, said.
