Transcribing Audio Files Has Never Been Easier

Nov 16, 2022

Today’s post is brought to you by Dr. Mehdi Allahyari. You can find the original post here.

Have you ever wondered how to create a summary of an audio file? Examples of audio files include a recording of a meeting or a podcast episode, to name a few. There are two major steps involved, illustrated in the diagram below:

Automatic Speech Recognition: This step creates the transcript or text version of the audio file.
Summarization Model: It summarizes the transcript using a Machine Learning summarization model.

Summarizing an audio file pipeline
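
To make the two steps concrete, here is a minimal sketch of the pipeline as plain function composition. The names summarize_audio, fake_transcribe and fake_summarize are hypothetical placeholders invented for this illustration; in a real setup the two callables would wrap an ASR model and a summarization model.

```python
from typing import Callable


def summarize_audio(
    audio_path: str,
    transcribe: Callable[[str], str],
    summarize: Callable[[str], str],
) -> str:
    """Run the two-step pipeline: speech recognition, then summarization."""
    transcript = transcribe(audio_path)  # Step 1: audio -> text
    return summarize(transcript)         # Step 2: text -> summary


# Placeholder stand-ins (assumptions) so the sketch runs without any model.
def fake_transcribe(audio_path: str) -> str:
    return f"transcript of {audio_path}"


def fake_summarize(transcript: str) -> str:
    return transcript.upper()


print(summarize_audio("meeting.wav", fake_transcribe, fake_summarize))
```

Keeping the two steps behind simple callables means either model can be swapped out later without touching the rest of the pipeline.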

In this post, we will focus on the first step and see how to create a transcript of an audio file. We will use an open-source library for this task: NeMo.

Automatic Speech Recognition using NeMo

The first step is to install the NeMo package:

apt-get update && apt-get install -y libsndfile1 ffmpeg
pip install Cython
pip install nemo_toolkit['all']

We then need to instantiate the models we would like to use. You can see all the available models here. We will load Audio_file and convert it to text (a.k.a. transcribe it) with the QuartzNet ASR model. Note: audio files must be in “.wav” format.

import nemo.collections.asr as nemo_asr
import nemo.collections.nlp as nemo_nlp

Audio_file = "path/to/your/file.wav"

# .cuda() moves each model to the GPU; drop it to run on CPU
# Speech Recognition model - QuartzNet
quartznet = nemo_asr.models.EncDecCTCModel.from_pretrained(model_name="stt_en_quartznet15x5").cuda()

# Punctuation and capitalization model
punctuation = nemo_nlp.models.PunctuationCapitalizationModel.from_pretrained(model_name='punctuation_en_distilbert').cuda()

The from_pretrained(…) API downloads and initializes the model directly from the cloud.
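
Since the audio must be in “.wav” format, it can help to check a file before transcribing it, and to convert anything else with the ffmpeg tool installed earlier. The sketch below uses only the Python standard library. The helper names wav_info and ffmpeg_to_wav_command are hypothetical, and the 16 kHz mono target is an assumption about what the QuartzNet checkpoint expects, so check the model card before relying on it.

```python
import wave
from typing import List, Tuple


def wav_info(path: str) -> Tuple[int, int]:
    """Return (sample_rate_hz, channels) of a PCM .wav file."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate(), wav_file.getnchannels()


def ffmpeg_to_wav_command(src: str, dst: str, sample_rate: int = 16000) -> List[str]:
    """Build an ffmpeg command that converts src to a mono .wav file.

    The 16000 Hz default is an assumption, not something the post states.
    """
    return ["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", str(sample_rate), dst]


# Only builds the command; nothing is executed here.
print(ffmpeg_to_wav_command("episode.mp3", "episode.wav"))
```

To actually run the conversion, pass the returned list to subprocess.run(..., check=True), then point Audio_file at the resulting “.wav” file.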

The next step is to use the model:

# Convert our audio sample to text
files = [Audio_file]
raw_text = ''
text = ''
for fname, transcription in zip(files, quartznet.transcribe(paths2audio_files=files)):
  raw_text = transcription

# Add capitalization and punctuation
res = punctuation.add_punctuation_capitalization(queries=[raw_text])
text = res[0]
print(f'\nRaw recognized text: {raw_text}. \nText with capitalization and punctuation: {text}')

That’s it! The “text” variable contains the transcript of your audio file. You can find several other tutorials on the capabilities of NeMo.

Next steps

Now that we have the transcript, we can create a summary from it. Hint: You can find lots of choices and examples here. It would also be nice to develop a demo application that provides an API for these tasks, which would give us a full application!
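
One practical detail when moving on to summarization: many summarization models accept only a limited amount of text at once, so a long transcript may need to be split first. Here is a minimal sketch that groups whole sentences into chunks, relying on the punctuation added in the previous step. The function name chunk_transcript and the 200-word default are assumptions for illustration, not part of NeMo or of any particular summarizer.

```python
import re
from typing import List


def chunk_transcript(text: str, max_words: int = 200) -> List[str]:
    """Group whole sentences into chunks of at most max_words words.

    A single sentence longer than max_words becomes its own chunk.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks: List[str] = []
    current: List[str] = []
    count = 0
    for sentence in sentences:
        n = len(sentence.split())
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and count + n > max_words:
            chunks.append(" ".join(current))
            current, count = [], 0
        current.append(sentence)
        count += n
    if current:
        chunks.append(" ".join(current))
    return chunks


print(chunk_transcript("Hello there. How are you? I am fine.", max_words=5))
```

Each chunk can then be summarized separately and the partial summaries joined, which is a simple approach rather than the only one.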

Happy practicing!

Thanks for reading my newsletter. You can follow me on LinkedIn or Twitter @Angelina_Magr!

Written by Angelina Yang
