10–09–2023: Hugging Face betting on Speech-to-Text? Where can it be leveraged in Enterprise?
Hugging Face released its Speech-to-Text leaderboard last week, where open-source models are ranked and assessed, mainly on English datasets for now. The ranking takes two metrics into account: WER (Word Error Rate) and RTF (Real-Time Factor), which is more or less the speed of transcription.
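To make the two metrics concrete, here is a minimal sketch of how they are typically computed. This is illustrative only, not the leaderboard's actual implementation: WER is the word-level edit distance divided by the reference length, and RTF is processing time divided by audio duration.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word Error Rate: word-level edit distance over reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming (Levenshtein) edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)


def rtf(processing_seconds: float, audio_seconds: float) -> float:
    """Real-Time Factor: transcription time over audio duration.
    Below 1.0 means the model transcribes faster than real time."""
    return processing_seconds / audio_seconds


print(wer("the cat sat on the mat", "the cat sat on mat"))  # one deletion over six words
print(rtf(12.0, 60.0))  # 12 s to transcribe a 60 s clip
```

In practice the leaderboard normalizes text (casing, punctuation) before scoring, which these few lines skip.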
Early attempts at speech-to-text relied on the voice signal and pattern matching: think of the voice as an analog signal whose patterns are analyzed. This discipline, called signal processing, was taught in many bachelor's and master's courses in computer science and mathematics.
By the 2000s, the accuracy of speech-to-text was allegedly already beyond 80%, yet no real application existed besides the introduction of Google Voice Search and Siri (2011). However, those who remember Siri from 2011 will acknowledge that performance in practice was questionable. Models of this era were mostly complex closed-box solutions that tended to fail when speaking conditions were not optimal. Finally, what to do with the data once transcribed was also an open question.
Like many fields that required heavy data preprocessing, and where even the tone of a voice can affect the signal, speech-to-text was revolutionized by AI, and deep learning in particular. More importantly, it shifted the field away from language experts and specialized computer scientists toward brute-force computational power.
By 2016, systems such as Baidu's Deep Speech 2, built on RNNs, had already pushed WER below 5% on clean datasets such as LibriSpeech.
IBM, Google, and smaller private companies were competing for the business. Then came OpenAI (the one from ChatGPT).
In 2022, OpenAI open-sourced a range of (fairly lightweight) models called Whisper, whose performance was simply better. OpenAI is expert at building Transformer models; it leveraged this architecture for Whisper and outperformed most other expensive models, for free.