Streaming Speech Recognition with Python Using a WebSocket

We'll use the websockets Python package to write our client. You can install it using:

pip install websockets

The streaming speech recognition API expects 16-bit linear PCM audio. When using REST or WebSocket, the audio must be base64 encoded. Ideally the audio is sampled at 16 kHz, but this is not a requirement: the service resamples incoming audio if necessary.
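Before streaming a file, it can help to confirm that it really contains 16-bit samples and to see what base64-encoded audio looks like. The following is a minimal sketch using only the standard library; the helper names describe_wav and encode_chunk and the in-memory test file are illustrative assumptions, not part of the service's API.

```python
import base64
import io
import wave


def describe_wav(wav_file):
    """Return (sample_width_bytes, channels, sample_rate_hz) for a WAV file.

    wav_file may be a path or a binary file-like object.
    """
    with wave.open(wav_file, "rb") as wav:
        return wav.getsampwidth(), wav.getnchannels(), wav.getframerate()


def encode_chunk(pcm_bytes):
    """Base64-encode raw PCM bytes into an ASCII string that fits in JSON."""
    return base64.b64encode(pcm_bytes).decode("ascii")


# Build a tiny in-memory WAV (hypothetical test data): 160 frames of
# 16-bit mono silence at 16 kHz, i.e. 10 ms of audio.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)  # 2 bytes per sample = 16 bit
    out.setframerate(16000)
    out.writeframes(b"\x00\x00" * 160)
buffer.seek(0)

sample_width, channels, sample_rate = describe_wav(buffer)
is_16_bit = sample_width == 2  # "16 bit" means 2 bytes per sample

# Four silent 16-bit samples (8 bytes) become a 12-character string.
encoded = encode_chunk(b"\x00\x00" * 4)
print(sample_width, channels, sample_rate, encoded)
```

A sample width other than 2 bytes means the file is not 16-bit PCM and would need converting before it is sent.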

Let's say we have a 16-bit WAV file sampled at 16 kHz. We can stream chunks of its audio while simultaneously receiving partial transcriptions. The first message in a stream is a config message, which sets the encoding, sample rate and language of the incoming audio. Every message after that carries a chunk of audio.

We first define a request generator that yields the config message followed by the chunks of audio to be recognized:

import base64
import wave


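def generate_requests(wav_path, chunk_width=1024):
    """Yield the config message, then base64-encoded audio chunks.

    chunk_width is the number of frames (not bytes) read per message.
    """
    with wave.open(wav_path, 'rb') as wav:
        # Report the file's actual sample rate to the service.
        sample_rate = wav.getframerate()

        # The first message in the stream must be the config message.
        yield {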
            "streamingConfig": {
                "config": {
                    "encoding": "LINEAR16",
                    "sampleRateHertz": sample_rate,
                    "enableWordTimeOffsets": True,
                    "languageCode": "is-IS-x-exp",
                },
                "interimResults": True,
            }
        }

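        while True:
            # readframes() returns an empty bytes object at end of file.
            chunk = wav.readframes(chunk_width)
            if not chunk:
                return
            yield {
                # Raw PCM bytes are base64 encoded so they fit in JSON.
                "audioContent": base64.b64encode(chunk).decode('utf-8')
            }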
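To get a feel for what chunk_width means in practice, the arithmetic can be worked out directly: wave.readframes counts frames, so the chunk duration depends on the sample rate, and the message size depends on the sample width. This is a small sketch; chunk_stats is a hypothetical helper, and mono audio (one channel) is an assumption made here for the example.

```python
import math


def chunk_stats(chunk_width, sample_rate, sample_width=2, channels=1):
    """Return (duration_ms, raw_bytes, base64_chars) for one audio chunk.

    sample_width is in bytes (2 for 16-bit audio); channels=1 assumes mono.
    """
    duration_ms = 1000 * chunk_width / sample_rate
    raw_bytes = chunk_width * sample_width * channels
    # Base64 emits 4 characters for every 3 input bytes, padded up.
    base64_chars = 4 * math.ceil(raw_bytes / 3)
    return duration_ms, raw_bytes, base64_chars


duration_ms, raw_bytes, base64_chars = chunk_stats(1024, 16000)
print(f"{duration_ms:.0f} ms, {raw_bytes} bytes, {base64_chars} base64 characters")
# prints "64 ms, 2048 bytes, 2732 base64 characters"
```

With the default of 1024 frames at 16 kHz, each message therefore carries 64 ms of audio.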

Let's now write a client using the asyncio interface of websockets. It sends audio and reads transcriptions concurrently over the same connection:

import asyncio
import json
import os
import sys

import websockets


async def main():
    uri = "wss://speech.talgreinir.is/v2beta1/speech:streamingrecognize?token=" + os.environ["TIRO_SPEECH_KEY"]
    async with websockets.connect(uri, ssl=True) as sock:
        async def read():
            try:
                out_transcript = ""
                async for m in sock:
                    try:
                        response = json.loads(m)
                        transcript = response["result"]["results"][0]["alternatives"][
                            0
                        ]["transcript"]
                        is_final = response["result"]["results"][0].get(
                            "isFinal", False
                        )

                        current_output = (
                            " ".join((out_transcript, transcript))
                            if out_transcript
                            else transcript
                        )

                        if is_final:
                            # Keep finalized text; later results are appended to it.
                            out_transcript = current_output

                        print(
                            current_output,
                            end="\r",
                            flush=True,
                        )

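                    except (KeyError, IndexError):
                        # Skip messages that carry no transcript,
                        # including ones with an empty results list.
                        pass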
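            except websockets.ConnectionClosed:
                # Raised only on an abnormal close; a clean close ends the loop.
                pass
            finally:
                # Finish the line that was being overwritten with "\r".
                print()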

        async def send():
            for m in generate_requests(sys.argv[1]):
                out = json.dumps(m)
                await sock.send(out)

        await asyncio.gather(send(), read())


if __name__ == "__main__":
    asyncio.run(main())
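The response handling inside read() can also be pulled out into a small function, which makes it easier to test without a live connection. The sketch below assumes only the response fields the client above already reads (result, results, alternatives, transcript and isFinal); parse_response is a hypothetical helper name and the sample messages are made up for illustration.

```python
import json


def parse_response(message):
    """Return (transcript, is_final), or None if the message has no transcript."""
    try:
        response = json.loads(message)
        result = response["result"]["results"][0]
        transcript = result["alternatives"][0]["transcript"]
        return transcript, result.get("isFinal", False)
    except (KeyError, IndexError, TypeError, json.JSONDecodeError):
        # Not a transcript message (or not valid JSON); let the caller skip it.
        return None


# Hypothetical messages, shaped like the fields the client reads.
interim = json.dumps(
    {"result": {"results": [{"alternatives": [{"transcript": "halló"}]}]}}
)
final = json.dumps(
    {
        "result": {
            "results": [
                {"alternatives": [{"transcript": "halló heimur"}], "isFinal": True}
            ]
        }
    }
)

print(parse_response(interim))
print(parse_response(final))
print(parse_response("{}"))
```

Returning None for anything that does not contain a transcript keeps the reading loop free of nested exception handling.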