10-09-2023: Is Hugging Face betting on Speech-to-Text? Where can it be leveraged in the enterprise?

Adrien
4 min read · Sep 10, 2023

Hugging Face released its speech-to-text leaderboard last week. It ranks open-source models, which are assessed mainly on English datasets for now. The ranking takes two metrics into account: the WER (Word Error Rate), which is the share of words the model gets wrong compared with a reference transcript, and the RTF (real-time factor), which is more or less the speed of transcription.
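To make the two metrics concrete, here is a minimal sketch in Python. It assumes the textbook definitions: WER as the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words, and RTF as processing time divided by audio duration. The function names are made up for illustration, and the leaderboard's own text normalization and timing setup may well differ.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Naive normalization (an assumption): lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Time spent transcribing divided by the length of the audio."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


if __name__ == "__main__":
    # One substitution ("sat" -> "sit") and one deletion ("the"):
    # 2 errors over 6 reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
    # 3 seconds of compute for 30 seconds of audio.
    print(real_time_factor(3.0, 30.0))
```

With these definitions, lower is better for both: a WER of 0 means a perfect transcript, and an RTF below 1 means the model transcribes faster than the audio plays.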

Some History

Early attempts at speech-to-text relied on voice and pattern: think of the voice as an analog signal and assess the patterns in it. This discipline was called signal processing, and it was taught in many bachelor's and master's courses in computer science and mathematics.

By the 2000s, the accuracy of speech-to-text was allegedly already beyond 80%, yet no real application existed besides the introduction of Google Voice Search and Siri (2011). However, those who remember Siri from 2011 will acknowledge that its performance in practice was questionable. Models from this time were mainly complex closed-box solutions that tended to fail when speaking conditions were not optimal. Finally, the question of what to do with the data once transcribed was also an issue.

Like many fields that needed a lot of data preprocessing and where even the tone of the voice can impact the signal, it was revolutionized by AI and…

Adrien

Strategy/Data/Leadership head of DS at OCBC ~~ exTwitter ~~ ex-gojek