Audio Transcriber

Using the Google Cloud Speech-to-Text API in Python

Are you struggling with writing subtitles for your videos? It is a very common problem. In this article, we will discuss how to create a subtitle file from a video or audio file. Before starting, there are a few things that we need to set up.

  • Converting video to audio: we will use the moviepy library to extract the audio from the video.
  • Audio segmentation: we will use FFmpeg to split the audio into chunks.
  • Speech to text: we will use the Google Cloud Speech-to-Text API to transcribe the audio to text.

Setting up

Once the setup is done, we can go through each step one by one.

Video to audio

The goal of this step is to generate an audio file that we can use for transcription. If you already have an audio file, you can skip this step.

We will use the moviepy library. Let's look at the code below.

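# moviepy 1.x import path; in moviepy 2.x use `from moviepy import VideoFileClip`
from moviepy.editor import VideoFileClip

mp4_file = r'videosample\video.mp4'
mp3_file = r'videosample\video.mp3'

videoclip = VideoFileClip(mp4_file)

# Take the audio track of the video and write it to the mp3 path
audioclip = videoclip.audio
audioclip.write_audiofile(mp3_file)

# Release the file handles held by the clips
audioclip.close()
videoclip.close()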


As you can see, the code is straightforward. It takes two paths as input, loads the video with VideoFileClip, takes the audio track from the resulting clip, and writes it to the audio file path.

MP3 to WAV

Converting the video produces an audio file in MP3 format. Now we need to convert it to WAV format. You may be wondering why we are re-formatting it. To learn more, please read the Mp3 to Wav File Conversion using Python article, which covers the conversion as well as why it is needed. If you already have a WAV file, you can skip the conversion.
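For readers who prefer to stay in one script, the conversion can also be delegated to FFmpeg, which this article already uses for segmentation. The sketch below is not taken from the linked article: the helper names mp3_to_wav_command and mp3_to_wav are hypothetical, and the 16 kHz mono settings are assumptions. The pcm_s16le codec is chosen because the LINEAR16 encoding used later in this article expects uncompressed 16-bit PCM samples.

```python
import subprocess


def mp3_to_wav_command(mp3_path, wav_path, sample_rate=16000):
    """Build an ffmpeg command that converts an MP3 to a 16-bit PCM mono WAV.

    Hypothetical helper for illustration; the sample rate is an assumption.
    """
    return [
        "ffmpeg",
        "-y",                     # overwrite the output file if it exists
        "-i", str(mp3_path),      # input file
        "-ac", "1",               # downmix to a single (mono) channel
        "-ar", str(sample_rate),  # resample to the requested rate
        "-c:a", "pcm_s16le",      # 16-bit signed little-endian PCM
        str(wav_path),
    ]


def mp3_to_wav(mp3_path, wav_path):
    """Run the conversion; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(mp3_to_wav_command(mp3_path, wav_path), check=True)
```

With FFmpeg on the PATH, calling mp3_to_wav(r'videosample\video.mp3', r'videosample\video.wav') would write the WAV file next to the MP3 from the previous step.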

Audio Segmentation

We will split the audio file into 15-second chunks and save them as separate files. To split the file, we will use the ffmpeg command. For example:

ffmpeg -i audiosample\file1.wav -f segment -segment_time 15 -ac 1 -c copy audiosample\parts\ut%09d.wav

We can also do this from Python, and for multiple files at once. Please check the code below.

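import os
import subprocess
from pathlib import Path

directory = r"path\to\audiofile\folder"

def segment(file, folder, org_path):
    # Create the output folder for the chunks if it does not exist yet
    os.makedirs(folder, exist_ok=True)
    # Output pattern for the chunks, same as in the command shown above
    parts = os.path.join(str(folder), "ut%09d.wav")
    subprocess.run(["ffmpeg", "-i", str(file), "-f", "segment", "-segment_time", "15",
                    "-ac", "1", "-c", "copy", parts])
    print(str(file) + " segmentation Done...")

def recur(folder_path):
    # Reconstructed walk. It assumes one .wav file per folder, with its
    # chunks written to a "parts" subfolder next to it.
    dirs = sorted(Path(folder_path).iterdir())
    for folder in dirs:
        if folder.is_dir():
            if folder.name != "parts":  # never re-segment existing chunks
                print("Processing Folder- " + str(folder))
                recur(folder)
        elif folder.suffix.lower() == ".wav":
            print("Segmenting File- " + str(folder))
            segment(folder, folder.parent / "parts", folder_path)

if __name__ == "__main__":
    recur(directory)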

It will split each audio file into multiple 15-second chunk files.
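To make the naming of the chunks concrete, here is a small sketch of how the ut%09d.wav pattern expands. FFmpeg's segment muxer numbers its output files from 0 by default, and Python's % operator formats the same printf-style pattern. The helper names chunk_names and chunk_start_seconds are hypothetical, and the start time is only nominal, since it assumes every chunk is exactly 15 seconds long.

```python
def chunk_names(count, pattern="ut%09d.wav"):
    """Return the file names the segment pattern expands to for `count` chunks."""
    return [pattern % index for index in range(count)]


def chunk_start_seconds(name, chunk_seconds=15, prefix="ut", suffix=".wav"):
    """Recover a chunk's nominal start time from its zero-padded index."""
    index = int(name[len(prefix):-len(suffix)])
    return index * chunk_seconds


names = chunk_names(12)
print(names[0], names[-1])  # ut000000000.wav ut000000011.wav

# Zero padding makes alphabetical order equal to chronological order,
# which matters when the chunks are later read back with sorted().
print(sorted(names) == names)  # True
```

The zero padding is the important design choice here: without it, a plain alphabetical sort would place chunk 10 before chunk 2 and the subtitles would come out in the wrong order.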


Audio Transcription

Now we have to transcribe all the chunks using the Google Cloud Speech-to-Text API and create a .srt file from the responses that we get from the API.

Let's import all the libraries that we will use.

import os
from pathlib import Path
from google.cloud import speech_v1 as speech

Let's create a function to iterate through all the files in a directory and transcribe them all one by one.

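def recur(folder_path):  # body reconstructed from the description of this function
    for folder in sorted(p for p in Path(folder_path).iterdir() if p.is_dir()):
        if folder.name != "parts":
            recur(folder)  # not a chunk folder yet, keep descending
        else:
            convert(folder, folder.parent, folder.parent.name)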

In the above function, we walk through each folder in the input directory. When we encounter a directory called parts, we call the convert function with the folder path, the parent directory path, and the parent directory name.

Now let's create a convert function to transcribe and save it in the .srt file.

def convert(file,folder,pack):

    files = sorted(os.listdir(str(file)+'/'))
    all_text = []
    for f in files:
        name = str(file)+'/' + f
        print("Transcribing File- "+str(name))
        try:
            # Read the raw bytes of the chunk
            with open(name, "rb") as audio_file:
                content = audio_file.read()
            config = speech.RecognitionConfig(encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,language_code="en-US")
            audio = speech.RecognitionAudio(content=content)
            text = speech_to_text(config, audio)
        except Exception as e:
            # Keep a placeholder so the subtitle numbering stays aligned with the chunks
            print("Transcription failed- "+str(e))
            text = "No Audio"
        all_text.append(text)
    transcript = ""
    for i, t in enumerate(all_text):
        # Chunk i nominally covers seconds [15*i, 15*i + 15)
        total_seconds = i * 15
        m, s = divmod(total_seconds, 60)
        h, m = divmod(m, 60)

        total_seconds_n = total_seconds + 15
        m_n, s_n = divmod(total_seconds_n, 60)
        h_n, m_n = divmod(m_n, 60)

        transcript = transcript + "{}\n{:0>2d}:{:0>2d}:{:0>2d},000 --> {:0>2d}:{:0>2d}:{:0>2d},000\n {}\n\n".format(i+1,h, m, s,h_n, m_n, s_n, t)
    print("Transcript completed- "+str(transcript))
    # Assumed output location: <parent folder>/<parent folder name>.srt
    transcript_file = os.path.join(str(folder), str(pack)+".srt")
    with open(transcript_file, "w", encoding="utf-8") as f:
        f.write(transcript)

In the above script, we set up the configuration and audio content required for transcription, collected the transcript of each chunk, and saved the result in a file.
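The timestamp arithmetic inside convert can be isolated into small helpers, which makes the .srt layout easier to see and to test without calling the API. This is only a sketch under the same assumption as the script (every chunk lasts exactly 15 seconds); the names srt_timestamp and build_srt and the sample sentences are hypothetical.

```python
def srt_timestamp(total_seconds):
    """Format whole seconds as an SRT timestamp, HH:MM:SS,mmm."""
    minutes, seconds = divmod(total_seconds, 60)
    hours, minutes = divmod(minutes, 60)
    return "{:02d}:{:02d}:{:02d},000".format(hours, minutes, seconds)


def build_srt(texts, chunk_seconds=15):
    """Turn one transcript string per chunk into SRT text."""
    entries = []
    for index, text in enumerate(texts):
        start = index * chunk_seconds
        end = start + chunk_seconds
        # An SRT entry is: counter, time range, text, then a blank line.
        entries.append(
            "{}\n{} --> {}\n{}\n".format(
                index + 1, srt_timestamp(start), srt_timestamp(end), text
            )
        )
    return "\n".join(entries)


# Made-up sample text, one string per 15-second chunk.
print(build_srt(["Hello and welcome.", "Today we talk about subtitles."]))
```

For the two sample strings, this prints entry 1 with the range 00:00:00,000 --> 00:00:15,000 and entry 2 with the range 00:00:15,000 --> 00:00:30,000, separated by a blank line.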

Important lines of code

  • Setting up the API key: you can export the JSON file of your service account while adding a key, and the client library reads its path from the GOOGLE_APPLICATION_CREDENTIALS environment variable. Please check this link
  • These lines set up the configuration and call the API through the speech_to_text function.
    config = speech.RecognitionConfig(encoding=speech.RecognitionConfig.AudioEncoding.LINEAR16,language_code="en-US")
    audio = speech.RecognitionAudio(content=content)
    text = speech_to_text(config, audio)

Now we will create the speech_to_text function, which calls the API.

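def speech_to_text(config, audio):
    client = speech.SpeechClient()
    # Start an asynchronous recognition job and wait up to 90 seconds for it
    operation = client.long_running_recognize(config=config, audio=audio)
    response = operation.result(timeout=90)
    # Parse the response into plain transcript text
    text_res = print_sentences(response)
    return text_res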

Here we create a SpeechClient object and call its long_running_recognize method with the config and audio parameters, which calls the Google Cloud Speech-to-Text API. We then pass the response to our own print_sentences function, and once it returns the transcript, we return that text to the convert function.

Now let's see what this print_sentences function does.

def print_sentences(response):
    for result in response.results:
        best_alternative = result.alternatives[0]
        transcript = best_alternative.transcript
        confidence = best_alternative.confidence
        print("-" * 80)
        print(f"Transcript: {transcript}")
        print(f"Confidence: {confidence:.0%}")

        return transcript

Here we parse the response, print the transcript and its confidence score, and then return the transcript.
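Note that print_sentences returns from inside the loop, so only the transcript of the first result is passed back. A single chunk can produce more than one result, so a variant that joins all of them may be preferable. The sketch below is an assumption about how that could look, not the author's code. The name join_transcripts is hypothetical, and the response is imitated with SimpleNamespace stand-ins that only mirror the results, alternatives, and transcript fields used above, so it runs without calling the API.

```python
from types import SimpleNamespace


def join_transcripts(response):
    """Concatenate the top alternative of every result in a response."""
    pieces = []
    for result in response.results:
        if not result.alternatives:
            continue  # nothing was recognized for this portion of the audio
        pieces.append(result.alternatives[0].transcript.strip())
    return " ".join(pieces)


# Stand-in objects with the same attribute names as the real response;
# the text and confidence values are made up for the demonstration.
fake_response = SimpleNamespace(
    results=[
        SimpleNamespace(alternatives=[
            SimpleNamespace(transcript="hello there", confidence=0.93)]),
        SimpleNamespace(alternatives=[
            SimpleNamespace(transcript=" how are you", confidence=0.88)]),
    ]
)

print(join_transcripts(fake_response))  # hello there how are you
```

An empty response yields an empty string instead of None, which also keeps the subtitle text free of a literal "None" when a chunk contains no speech.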

Finally, let's call the recur function, which is the entry point of our program, from the main module.

if __name__ == "__main__":
    directory = r"path\to\chunks\directory"
    # Walk the directory and transcribe every "parts" folder found
    recur(directory)

If we execute our program, it writes a .srt file containing one numbered, timestamped entry per 15-second chunk.

The file is now ready to use as subtitles for your video.

Bonus Points

  • You may get errors related to the API key while running the program. Make sure the JSON file that you downloaded contains the correct key, and that the GOOGLE_APPLICATION_CREDENTIALS environment variable is set properly.
  • You may also get errors related to billing. Make sure your project has a billing account and that it is active in the Google Cloud console.
  • Divide the audio into multiple chunks, because there is a limit of 60 seconds when you transcribe a local file. Please check the documentation for more details.
  • You can transcribe in multiple languages. Please check the config parameters for the API.
  • You can also download or clone the source code of this article from my GitHub repository.
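Two of the points above, the credentials and the 60-second limit, can be checked locally before any request is sent. The following is a minimal pre-flight sketch using only the standard library. The function names are hypothetical, the list of key fields is an assumption about what a service-account key file contains, and the 60-second threshold simply restates the limit mentioned above.

```python
import io
import json
import os
import wave


def check_credentials(env=os.environ):
    """Return a list of problems found with the service-account key setup."""
    path = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if not path:
        return ["GOOGLE_APPLICATION_CREDENTIALS is not set"]
    if not os.path.isfile(path):
        return ["key file not found: " + path]
    try:
        with open(path, encoding="utf-8") as handle:
            key = json.load(handle)
    except (OSError, ValueError) as error:
        return ["key file is not readable JSON: " + str(error)]
    if not isinstance(key, dict):
        return ["key file does not contain a JSON object"]
    # Fields assumed to be present in a service-account key file.
    expected = ("type", "project_id", "private_key", "client_email")
    return ["key file is missing field: " + f for f in expected if f not in key]


def wav_duration_seconds(source):
    """Return the duration of a WAV file (path string or binary file object)."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / float(wav.getframerate())


def silent_wav(seconds, rate=16000):
    """Build an in-memory mono 16-bit WAV of silence, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(rate * seconds))
    buffer.seek(0)
    return buffer


for problem in check_credentials():
    print("Credentials problem:", problem)

duration = wav_duration_seconds(silent_wav(15))
print("chunk is", duration, "seconds; under the limit:", duration < 60)
```

In a real run, wav_duration_seconds would be called with the path of each chunk file instead of the synthetic silence generated here.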

Subscribe to my newsletter for more articles, and don't forget to share your opinion in the comments or give this article a thumbs up. You can also support me by buying me a coffee.


Did you find this article valuable?

Support Rahul Dubey by becoming a sponsor. Any amount is appreciated!