For examples of the types of outputs to expect, you can visit the Whisper Demo at OctoAI.

This guide will cover the basics of running an inference to convert speech to text using the TypeScript SDK, including using our healthCheck method to ensure the endpoint is healthy before sending it any requests.

As a next step, Whisper works very well with Asynchronous Inference Using the TypeScript SDK.

Requirements

Whisper: Speech Recognition

Whisper is a natural language processing model that converts audio to text. Like with Stable Diffusion, we’ll use base64 encoding an mp3 or a wav file into a base64 string. For more information on all the ways you can customize your Whisper inferences, please see the Whisper OpenAPI documentation.

TypeScript
import { Client } from "@octoai/client";
import { readFileSync} from "fs";

const whisperEndpoint = "https://whisper-demo-kk0powt97tmb.octoai.run/predict";
const whisperHealthCheck = "https://whisper-demo-kk0powt97tmb.octoai.run/healthcheck";

// This instantiation approach takes your OCTOAI_TOKEN as an environment variable
// If you have not set it as an envvar, you can use the below instead
// const OCTOAI_TOKEN = "API token here from following the token creation guide";
// const client = new Client(OCTOAI_TOKEN);
const client = new Client();

// First, we need to convert an audio file to base64.
const audio = readFileSync("./octo_poem.wav", {
    encoding: "base64",
});

// These are the inputs we will send to the endpoint, including the audio base64 string.
const inputs = {
    language: "en",
    task: "transcribe",
    audio: audio,
};

async function wavToText() {
    if (await client.healthCheck(whisperHealthCheck) === 200) {
        const outputs: any = await client.infer(whisperEndpoint, inputs);
        console.log(outputs.transcription);
    }
}

wavToText().then();

With this particular test file, we will have the below printed in the console:

Once upon a time, an AI system was asked to come up with a poem for an octopus. 
It said something along the lines of, Octopus, you are very wise.

Whisper Outputs

The above outputs variable returns JSON in something like the following format.

JSON
{
  prediction_time_ms: 626.42526,
  response: {
    segments: [ [Object] ],
    word_segments: [
      [Object]
    ]
  },
  transcription: ' Once upon a time, an AI system was asked to come up with a poem for an octopus. It said something along the lines of, Octopus, you are very wise.'
}

Each segment is an object that looks something like:

JSON
{
  start: 5.553,
  end: 8.66,
  text: 'It said something along the lines of, Octopus, you are very wise.',
  words: [
    {
      word: 'It',
      start: 5.553,
      end: 5.633,
      score: 0.945,
      speaker: null
    },
    {
      word: 'said',
      start: 5.653,
      end: 5.814,
      score: 0.328,
      speaker: null
    },
    // etc...
  ],
  speaker: null
}

Each word_segment is an object that looks something like:

JSON
{ word: 'Once', start: 0.783, end: 0.903, score: 0.883, speaker: null }