Audio to Text

WhisperX API Endpoint

Whisper is a popular open-source speech-to-text model from OpenAI that approaches human-level accuracy on English speech recognition. Transcription and translation are supported in 99 languages. Beyond speech recognition, it can be used for a range of speech processing tasks such as speaker recognition, speech separation, and keyword spotting.


Advice from ML Experts

The hosted WhisperX model is a multi-task version that can not only transcribe audio in its original language but also translate it into a target language (in this case, English) in the same pass. The timestamps it generates can be used to build closed captions for your audio. In general, it is more efficient and accurate to send a long audio clip in a single request than to chop it into pieces and process them separately. For real-time transcription use cases, however, you will receive more immediate feedback by sending audio in 30-second clips.
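For the real-time case, one way to cut a recording into 30-second clips is sketched below using only the Python standard library. It assumes the input is an uncompressed PCM WAV file; the helper name `split_wav` is hypothetical and not part of any OctoAI SDK.

```python
import io
import wave


def split_wav(wav_bytes: bytes, clip_seconds: float = 30.0) -> list[bytes]:
    """Split a PCM WAV file (as bytes) into clips of at most clip_seconds."""
    clips = []
    with wave.open(io.BytesIO(wav_bytes), "rb") as src:
        params = src.getparams()
        frames_per_clip = int(src.getframerate() * clip_seconds)
        while True:
            frames = src.readframes(frames_per_clip)
            if not frames:
                break
            buf = io.BytesIO()
            with wave.open(buf, "wb") as dst:
                # Copy channels, sample width, and rate from the source;
                # the frame count in the header is corrected on close.
                dst.setparams(params)
                dst.writeframes(frames)
            clips.append(buf.getvalue())
    return clips
```

Each returned clip is a complete WAV file, so it can be base64-encoded and sent as the "audio" input in the same way as a whole file.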

Supported Model Variants:


Your fine tuned Whisper variants


MIT License

What is WhisperX?

WhisperX is an improved version of Whisper that provides time-accurate speech recognition with word-level timestamps. It demonstrates state-of-the-art performance on long-form transcription and word segmentation. WhisperX adds features to the base Whisper model such as reduced hallucinations, better transcription timestamps, no-speech detection, speaker diarization, and batched inference with up to 70x speed improvements. Speaker diarization partitions an audio stream containing human speech into homogeneous segments according to the identity of each speaker.
from octoai.client import Client
import base64

whisper_url = "https://your-whisper-demo.octoai.run/predict"
whisper_health_check = "https://your-whisper-demo.octoai.run/healthcheck"

# Read the audio file and base64-encode it for the request payload.
file_path = "your-sample-audio.wav"
with open(file_path, "rb") as f:
    encoded_audio = base64.b64encode(f.read())
    base64_string = encoded_audio.decode("utf-8")

inputs = {
    "language": "en",
    "task": "transcribe",
    "audio": base64_string,
}

OCTOAI_TOKEN = "API Token goes here from guide on creating OctoAI API token"

client = Client(token=OCTOAI_TOKEN)
# Only run inference once the endpoint reports healthy.
if client.health_check(whisper_health_check) == 200:
    outputs = client.infer(endpoint_url=whisper_url, inputs=inputs)
    transcription = outputs["transcription"]
    assert "She sells seashells by the seashore" in transcription
    assert (
        "She sells seashells by the seashore"
        in outputs["response"]["segments"][0]["text"]
    )
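Since the timestamps can be used for closed captions, the sketch below turns a list of segments into SubRip (SRT) text. It assumes each segment is a dict with "start" and "end" in seconds alongside the "text" key used above; those two timing keys are an assumption about the response shape, and both function names are hypothetical.

```python
def format_timestamp(seconds: float) -> str:
    """Render seconds as an SRT timestamp, HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{ms:03d}"


def segments_to_srt(segments: list[dict]) -> str:
    """Build SRT caption text from segments with start, end, and text.

    The "start" and "end" keys (in seconds) are assumed, not confirmed
    by the example response above.
    """
    blocks = []
    for index, seg in enumerate(segments, start=1):
        start = format_timestamp(seg["start"])
        end = format_timestamp(seg["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{seg['text'].strip()}\n")
    # Caption blocks are separated by a blank line.
    return "\n".join(blocks)
```

With a successful response, something like `segments_to_srt(outputs["response"]["segments"])` would produce the caption file contents, provided the segments carry those timing keys.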

Fine tuning supported

Customize Whisper with your own audio data to create specialized speech models, such as medical or legal conversation and dictation use cases. We are currently working with design partners for this feature. Contact us to get early access.

Fine tuning
  • Recognize new terms

  • Learn new dialects and accents

  • Make legal or healthcare specific models

Human to AI conversation

Speech recognition at your need for speed

With WhisperX on OctoAI, you can recognize speech at the pace of fast human conversation on faster GPU hardware, or run slower batch processing of historical audio at 6x lower cost.

  • Times are cumulative, so diarization includes transcription and alignment

  • Audio used for timing
  • Benchmark run on an A10G

WhisperX transcription on A10G results

transcribe: 1 seconds in 0.37253355979919434
transcribe: 20 seconds in 1.2478430271148682
transcribe: 60 seconds in 1.8664119243621826
transcribe: 300 seconds in 4.814593553543091
transcribe: 1200 seconds in 34.68618607521057
transcribe: 3600 seconds in 42.72543740272522
transcribe: 7200 seconds in 87.29967927932739
transcribe: 14400 seconds in 178.04800415039062
transcribe: 28800 seconds in 344.45165491104126
transcribe: 57600 seconds in 484.6209282875061
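
To read these results as a speed-up over real time, divide the audio duration by the processing time. The sketch below parses the log lines above and prints that ratio. It assumes the second number on each line is wall-clock processing time in seconds, which the log itself does not state.

```python
import re

LOG = """\
transcribe: 1 seconds in 0.37253355979919434
transcribe: 20 seconds in 1.2478430271148682
transcribe: 60 seconds in 1.8664119243621826
transcribe: 300 seconds in 4.814593553543091
transcribe: 1200 seconds in 34.68618607521057
transcribe: 3600 seconds in 42.72543740272522
transcribe: 7200 seconds in 87.29967927932739
transcribe: 14400 seconds in 178.04800415039062
transcribe: 28800 seconds in 344.45165491104126
transcribe: 57600 seconds in 484.6209282875061
"""

LINE = re.compile(r"transcribe: (\d+) seconds in ([\d.]+)")


def speedups(log: str) -> list[tuple[int, float, float]]:
    """Return (audio_seconds, processing_seconds, real_time_factor) rows."""
    rows = []
    for audio_s, wall_s in LINE.findall(log):
        audio, wall = int(audio_s), float(wall_s)
        rows.append((audio, wall, audio / wall))
    return rows


for audio, wall, factor in speedups(LOG):
    print(f"{audio:>6} s audio in {wall:8.2f} s -> {factor:6.1f}x real time")
```

Short clips are dominated by fixed overhead, so the ratio grows with clip length, consistent with the advice to send long audio in one request when latency is not a concern.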

Stage               Time (seconds)   Acceleration

Audio File

Transcription (1)                    61x faster

Alignment (2)                        36x faster

Diarization (3)                      18x faster

  1. Transcription converts speech to text.

  2. Alignment generates the start and stop time for each word in the transcript. This time includes both transcription and alignment.

  3. Diarization identifies the speaker. This time includes transcription, alignment, and diarization.
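
The relationship between the three stages can be illustrated by merging word-level output into speaker turns. This is a minimal sketch, assuming each word arrives as a dict with "word", "start", "end", and "speaker" keys; that shape and the function name `speaker_turns` are illustrative assumptions, not a documented response format.

```python
def speaker_turns(words: list[dict]) -> list[dict]:
    """Merge consecutive words from the same speaker into one turn.

    Assumes each word dict has "word", "start", "end", and "speaker" keys.
    """
    turns = []
    for w in words:
        if turns and turns[-1]["speaker"] == w["speaker"]:
            # Same speaker keeps talking: extend the current turn.
            turns[-1]["end"] = w["end"]
            turns[-1]["text"] += " " + w["word"]
        else:
            # Speaker changed (or first word): start a new turn.
            turns.append(
                {
                    "speaker": w["speaker"],
                    "start": w["start"],
                    "end": w["end"],
                    "text": w["word"],
                }
            )
    return turns
```

Each turn keeps the start time of its first word and the end time of its last, which is why diarized output depends on both transcription and alignment.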

Deploy with ease for web or mobile apps

We built a demo app in React to show how easy it is to use Whisper on OctoAI in your web app. Check out the app. Note that you might experience a cold start, in which case transcription may take a bit longer.

Whisper on OctoAI features


High quality speech detection with reduced hallucinations: Yes, via WhisperX

Word-level timestamp accuracy for utterance-level detection: Yes, via WhisperX

Trade speed for cost with fast detection for real-time applications or slow detection for batch processing: Yes, via WhisperX acceleration services

Whisper model history


The original Whisper models (sizes tiny to large-v1) were released by OpenAI on Sept. 21, 2022. They were trained to approach human-level accuracy on English speech recognition.


Whisper-large-v2 was released by OpenAI on Dec. 8, 2022, and was trained for 2.5x more epochs.


WhisperX was released on Dec. 14, 2022 by Max Bain. It provides fast (70x realtime with large-v2) automatic speech recognition with word-level timestamps and speaker diarization.