WhisperX on OctoAI: Delivering the latest innovations and highest quality in speech-to-text at 6x lower costs

Blog Author - Michal Piszczek

Jul 13, 2023

Chart: price per hour of audio transcription for Whisper models, with OctoAI at the lowest cost.

Here at OctoAI, we are laser-focused on making the best, most performant generative AI models available to developers. Part of how we do this is by staying engaged with leading-edge research in the field and keeping a pulse on the latest model innovations in the open source software (OSS) community. The team continuously reviews and evaluates models for production use cases, and incorporates the best ones into the OctoAI library. One of the models we have been focused on is Whisper for audio transcription. Today, we’re excited to share that we have added WhisperX to OctoAI. It brings multiple functionality and performance enhancements over Whisper large and Whisper large v2, and runs at 6x lower cost than Whisper large v2 on OpenAI.

From Audrey to Whisper - 70 years of speech recognition evolution

Audio transcription and Automatic Speech Recognition (ASR) have been topics of interest for applied automation since the early days of computers. ASR has come a long way from its start as the 6-foot “Audrey” Automatic Digit Recognizer from Bell Labs in 1952. We’ve seen IBM’s Shoebox, the growth in popularity of Hidden Markov models (HMMs) for word extraction, Dragon Systems’ “NaturallySpeaking”, and of course, the launch of Google Voice Search, Apple Siri and Amazon Alexa. Most recently, OpenAI released Whisper in 2022. Speech recognition continues to be one of the most tangible and immediately applicable areas where AI can reduce manual effort, increase quality and coverage, and free up resources for other activities. It is actively applied today in customer service, call center productivity, video captioning, medical transcription and many other use cases, and builders in these areas are exploring ways to improve the quality and performance of their transcription.

Whisper and the resulting OSS innovation and expansion

Whisper, published by OpenAI in September 2022, is today one of the most powerful and broadly applicable foundation models for ASR. Trained on 680,000 hours of labeled English and non-English speech data, Whisper provides a robust foundation for a broad range of speech-to-text use cases. While broadly applicable and robust, Whisper has several gaps that matter for common use cases: it is limited to short-form audio under 30 seconds, it lacks speaker diarization to identify specific speakers, and it does not provide word-level timestamps. These gaps have triggered rapid innovation and iteration from the broader OSS community around Whisper, adding several new options to the five sizes available at release (tiny, base, small, medium and large), which range from 39M to 1.5B parameters with GPU memory footprints from 1 to 10 GB. Enhancements to Whisper in the last few months include manual multilingual fine-tuning and diarization guides from Hugging Face; open source projects such as Zero-shot Audio Classification using Whisper and Buzz, which supports real-time and longer-form audio transcription using Whisper and chunking; and Hugging Face’s Transformers AutomaticSpeechRecognitionPipeline, which supports chunking and long-form audio.
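To make the chunking idea concrete, here is a minimal sketch of how a long recording could be split into overlapping windows of at most 30 seconds before each window is sent to a Whisper-style model. The function name `chunk_windows`, the 5-second overlap, and the return format are illustrative assumptions; this is not the logic used by Buzz, Hugging Face Transformers, WhisperX, or OctoAI, which each handle window boundaries in their own way.

```python
from typing import List, Tuple


def chunk_windows(
    total_s: float,
    chunk_s: float = 30.0,   # window length; 30 s matches Whisper's short-form limit
    overlap_s: float = 5.0,  # hypothetical overlap so words at a boundary appear in two windows
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds that together cover [0, total_s]."""
    if chunk_s <= 0:
        raise ValueError("chunk_s must be positive")
    if not 0 <= overlap_s < chunk_s:
        raise ValueError("overlap_s must be in [0, chunk_s)")

    total_s = float(total_s)
    step = chunk_s - overlap_s  # how far each new window advances
    windows: List[Tuple[float, float]] = []
    start = 0.0
    while start < total_s:
        end = min(start + chunk_s, total_s)
        windows.append((start, end))
        if end >= total_s:  # the last window reached the end of the audio
            break
        start += step
    return windows


if __name__ == "__main__":
    # A 70-second clip is covered by three overlapping windows.
    for start, end in chunk_windows(70):
        print(f"{start:5.1f}s -> {end:5.1f}s")
```

Each window would then be transcribed independently and the overlapping regions reconciled when the transcripts are stitched back together. That merge step is where implementations differ most, and it is left out of this sketch.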

Curating the best option for production use cases - WhisperX on OctoAI

A challenge we hear from developers building on these innovations is that more options don't immediately translate into more productivity or value. The teams and developers building audio transcription applications are focused on solving customer problems. They typically lack the time and resources to thoroughly evaluate the quality and speed of every OSS model that is released. This is where OctoAI’s active research engagement in the space and evaluation of the latest OSS models is valuable.

OctoAI’s Whisper research team shortlisted several options, including OpenAI’s Whisper large-v2 and WhisperX, based on internally assessed scores for quality, infrastructure needs, API surface and capabilities, and developer adoption. WhisperX was published by a group of researchers from the University of Oxford in March 2023, and attempts to address a number of Whisper’s limitations, including better support for long audio transcription and the addition of word-level timestamps. In internal evaluations, WhisperX on OctoAI performed better than the alternatives along multiple dimensions, including ease of direct application for a broad range of use cases, transcription accuracy, and transcription speed. It delivers a 5x cost savings over the equivalent from Deepgram, and over 6x cost savings over OpenAI.

Run WhisperX on OctoAI today

Get started with WhisperX today with a free trial on OctoAI. You’re also welcome to join us on our Discord server to engage with the team and our community.

We look forward to hearing from you!

