Exploring Python Packages — speech_recognition and spleeter

3 min read · Aug 14, 2020

Welcome to another post where we will explore interesting and exciting packages related to speech recognition! These packages help convert speech to text, which is so cool 🤯. Another magical tool we will look at is Spleeter by Deezer. This tool is mind-blowing, and it is available for free in Python! Without further ado, let's dive in!

Speech Recognition

There are many Python packages that provide this functionality. The main one is speech_recognition, while others like pydub and moviepy help with the pre-processing. The first step is to get these packages installed. I am using Anaconda, so I have given the conda commands:

conda install -c conda-forge speechrecognition
conda install -c conda-forge pydub
conda install -c conda-forge spleeter

For simple voice-to-text conversion we need only speech_recognition and pydub. First, make a simple recording of a sentence using your phone or PC. I made my recording on my phone, so the file was in .m4a format. An important thing to note is that speech_recognition reads only .wav, .aiff or .flac audio files! This means the first step is to convert the .m4a file to .wav, which can be done with the pydub package.

Once the audio sample is converted to .wav format, we can pass it to speech_recognition to get the text. Simple and easy! The sample code is shown below (disclaimer: I put this code together using lots of internet resources like Stack Overflow, git etc.).

# import the necessary packages
import speech_recognition as sr
from pydub import AudioSegment

# pre-conversion of the audio sample from .m4a to .wav
def preConvert():
    initialSound = 'Your_audio_sample_path.m4a'
    track = AudioSegment.from_file(initialSound, format='mp4')
    track.export('Your_audio_sample_path.wav', format='wav')
    newSound = 'Your_audio_sample_path.wav'
    return newSound

sound = preConvert()

# getting the speech recognizer object
r = sr.Recognizer()

# this indicates the length of quietness
# this is used when the speaker is slow
# and there are a lot of pauses in the audio sample
# the value is in seconds and it means that
# if the audio sample is quiet for more than 3 seconds
# it will stop recognizing
r.pause_threshold = 3

# for the sound
with sr.AudioFile(sound) as source:
    # calibrate to background noise (this reads about the first second of the file)
    r.adjust_for_ambient_noise(source)
    print('Converting audio to text....')
    audio = r.listen(source)

    try:
        # printing the words spoken
        print(r.recognize_google(audio))
    except Exception as e:
        print("Error: {}".format(e))

As usual, this is not foolproof. The speech recognition is quite accurate for English that is spoken slowly. For fast speech or other languages it does not work as well 🤷‍♀️. It also struggles with very long audio samples 😥. A long sample can be split into smaller parts, either by length (for example, cutting it into 10-second pieces) or by silence (breaking the sample wherever nobody is talking). Both can be done using pydub.
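Here is a rough sketch of both ways of splitting, assuming the sample has already been converted to .wav. The function names, the 10-second chunk size and the silence settings (700 ms, 16 dB below the clip's average loudness) are my own illustrative choices, not values from the original code, so tune them for your recording. The `language` argument of `recognize_google` is one thing to try for non-English speech.

```python
import os
import tempfile


def chunk_bounds(total_ms, chunk_ms=10_000):
    """Return (start, end) millisecond pairs that cover total_ms in fixed steps."""
    if chunk_ms <= 0:
        raise ValueError("chunk_ms must be positive")
    return [(start, min(start + chunk_ms, total_ms))
            for start in range(0, total_ms, chunk_ms)]


def split_fixed(wav_path, chunk_ms=10_000):
    """Cut a .wav file into fixed-length pydub segments."""
    from pydub import AudioSegment  # imported here so chunk_bounds works without pydub
    audio = AudioSegment.from_wav(wav_path)
    # pydub slices are indexed in milliseconds, and len(audio) is also in milliseconds
    return [audio[start:end] for start, end in chunk_bounds(len(audio), chunk_ms)]


def split_by_silence(wav_path, min_silence_ms=700, db_below_average=16):
    """Cut a .wav file wherever the speaker pauses."""
    from pydub import AudioSegment
    from pydub.silence import split_on_silence
    audio = AudioSegment.from_wav(wav_path)
    return split_on_silence(
        audio,
        min_silence_len=min_silence_ms,
        # anything this far below the clip's average loudness counts as silence
        silence_thresh=audio.dBFS - db_below_average,
        keep_silence=300,  # keep a little padding so words are not clipped
    )


def transcribe_chunks(chunks, language="en-US"):
    """Recognize each chunk separately and join the recognized text."""
    import speech_recognition as sr
    r = sr.Recognizer()
    texts = []
    with tempfile.TemporaryDirectory() as tmp:
        for i, chunk in enumerate(chunks):
            path = os.path.join(tmp, "chunk_{}.wav".format(i))
            chunk.export(path, format="wav")
            with sr.AudioFile(path) as source:
                audio = r.record(source)  # read the whole chunk
            try:
                texts.append(r.recognize_google(audio, language=language))
            except sr.UnknownValueError:
                pass  # nothing intelligible in this chunk, skip it
    return " ".join(texts)
```

A call such as `transcribe_chunks(split_by_silence('Your_audio_sample_path.wav'))` would then return the joined text. Splitting on silence usually avoids cutting a word in half, while fixed-length chunks are simpler and more predictable.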

A tip for extracting sound from video: Python has moviepy, which can be used to do this. It's really amazing! Just import it with import moviepy.editor as mp and use the two lines below to convert a video to a .wav file:

clip = mp.VideoFileClip(videoPath)
clip.audio.write_audiofile("Video_Audio.wav")

Spleeter

Spleeter is a really amazing tool! This tool by Deezer can split songs into voice and music! Not only that, it can even extract drums, piano etc. It can extract up to 5 individual elements from a song!

This works really well! Though it's not perfect, it is nearly flawless. Using it, I was able to separate the voice from the music, and the isolated voice gave better results when converting to text.

It has a simple command which needs to be run in the Anaconda prompt:

spleeter separate -i audio_example_path.mp3 -p spleeter:2stems -o output

This takes the audio file as input, creates a folder called output in the current directory, and writes two tracks into a subfolder named after the input file: accompaniment.wav and vocals.wav. Give this tool a try and see the results for yourself!
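The same separation can also be run from Python. The sketch below uses Spleeter's `Separator` class with the 2stems model. The two helper functions and their names are hypothetical, and the returned paths assume Spleeter's default output naming (one subfolder per input file).

```python
import os


def expected_stem_paths(audio_path, output_dir="output"):
    """Paths where a 2stems run puts its tracks under the default naming."""
    name = os.path.splitext(os.path.basename(audio_path))[0]
    return {stem: os.path.join(output_dir, name, stem + ".wav")
            for stem in ("vocals", "accompaniment")}


def separate_two_stems(audio_path, output_dir="output"):
    """Split a song into vocals and accompaniment from Python."""
    from spleeter.separator import Separator  # imported here: it is a heavy dependency
    separator = Separator("spleeter:2stems")
    separator.separate_to_file(audio_path, output_dir)
    return expected_stem_paths(audio_path, output_dir)
```

The `vocals` path that comes back can be passed straight to `sr.AudioFile` in the speech recognition code, which is how the separated voice gets converted to text.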

I hope you were able to learn a little more about the ability of Python by using one of the endless packages available. Let me know in the comments!

Originally published at http://evrythngunder3d.wordpress.com on August 14, 2020.

Written by Shrinand Kadekodi
