
Automating Speech-to-Text: How to Transcribe Audio & Video with Azure Speech Services 

Introduction 

Businesses and content creators rely on speech-to-text technology to transcribe audio and video files. Whether you’re a developer, researcher, or media professional, automating transcription saves time and removes repetitive manual work.

This guide walks you through building an automated transcription tool with the Azure Speech-to-Text API (part of Azure Cognitive Services) on Ubuntu Linux. By the end of this article, you’ll be able to:

  • Convert video files to audio for transcription 
  • Normalize audio formats for better accuracy 
  • Leverage Azure Speech-to-Text API for precise transcriptions 
  • Automate the transcription process using Python on Ubuntu 
  • Optionally, run this workflow on an Azure Virtual Machine (VM) 

Why Automate Speech-to-Text Transcription? 

Manual transcription is time-consuming and prone to errors. Automating it lets you process multimedia content in batches with consistent results. Azure Speech Services provides AI-powered transcription through a Python SDK, which makes it a practical choice for businesses, podcasters, and professionals.

Prerequisites 

Before setting up the transcription tool, ensure you have: 

  • A Microsoft Azure account with Speech Services enabled 
  • Python 3 installed on Ubuntu 
  • FFmpeg for media file conversion 
  • Required Python library: azure-cognitiveservices-speech (argparse ships with Python 3 and needs no install; moviepy is not needed for the code shown here, which calls FFmpeg directly)

Run the following commands to install dependencies:

sudo apt update && sudo apt install ffmpeg -y
pip install azure-cognitiveservices-speech
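
Before running the script, it can help to confirm that FFmpeg is actually reachable on the PATH. The sketch below uses Python's standard shutil.which; the helper name missing_tools is an illustrative assumption and not part of the original script.

```python
import shutil

def missing_tools(tools):
    """Return the names from `tools` that cannot be found on PATH."""
    return [name for name in tools if shutil.which(name) is None]

# Report anything the transcription script would fail to launch.
missing = missing_tools(["ffmpeg"])
if missing:
    print("Missing tools:", ", ".join(missing))
else:
    print("All required tools found.")
```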

Step 1: Setting Up Azure Speech Services 

  1. Create an Azure Account: Sign up at Azure Portal if you don’t have an account. 
  2. Set Up Speech Services: Navigate to Azure Speech Services, create a resource, select a pricing tier, and copy the API Key and Region from the Keys and Endpoint tab. 
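  3. Configure the Speech SDK in Python: create a speechsdk.SpeechConfig from the key and region. The transcription function in Step 2 receives it as speech_config.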
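
One way to do the configuration step is sketched below. It assumes the key and region are stored in environment variables named SPEECH_KEY and SPEECH_REGION, and that the audio is in US English; the variable names, the language, and the helper name load_speech_config are all assumptions rather than part of the original script.

```python
import os

def load_speech_config():
    """Build a SpeechConfig from environment variables (variable names are assumptions)."""
    key = os.environ.get("SPEECH_KEY")
    region = os.environ.get("SPEECH_REGION")
    if not key or not region:
        raise RuntimeError("Set SPEECH_KEY and SPEECH_REGION before running the script.")
    # Imported here so the check above works even before the SDK is installed.
    import azure.cognitiveservices.speech as speechsdk
    config = speechsdk.SpeechConfig(subscription=key, region=region)
    # Recognition language is an assumption; change it to match your audio.
    config.speech_recognition_language = "en-US"
    return config
```

Reading the key from the environment keeps it out of the source file, which matters if the script is later committed to a repository.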

Step 2: Writing the Python Script 

Handling Command-Line Arguments 

import argparse
parser = argparse.ArgumentParser(description="Transcribe speech from video and audio files.")
parser.add_argument("media_files", nargs="+", help="Paths to video/audio files")
args = parser.parse_args()
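
If you want the script to reject missing files before any FFmpeg or Azure call is made, argparse can validate each path through its type callback. This is a minimal sketch of that idea; existing_file and build_parser are hypothetical helper names, not part of the original script.

```python
import argparse
import os

def existing_file(path):
    """argparse type callback: accept the path only if it points to a file."""
    if not os.path.isfile(path):
        raise argparse.ArgumentTypeError(f"file not found: {path}")
    return path

def build_parser():
    parser = argparse.ArgumentParser(description="Transcribe speech from video and audio files.")
    # nargs="+" still requires at least one path; each one is checked by existing_file.
    parser.add_argument("media_files", nargs="+", type=existing_file,
                        help="Paths to video/audio files")
    return parser
```

Calling build_parser().parse_args() in place of the parser above makes argparse exit with a usage error when a path does not exist.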

Extract Audio from Video Files 

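import subprocess
def extract_audio(video_file):
    # Write the audio track next to the source file as <name>_audio.wav
    audio_file = f"{video_file.rsplit('.', 1)[0]}_audio.wav"
    # Global options such as -y (overwrite) belong before the output path,
    # because trailing options may be ignored; -vn drops the video stream.
    # Output format: 16-bit PCM, 16 kHz, mono.
    subprocess.run([
        "ffmpeg", "-y", "-i", video_file, "-vn", "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1", audio_file
    ], check=True)
    return audio_file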

Convert Audio to the Required Format 

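def convert_audio_to_wav(input_audio):
    # Re-encode any audio file to 16-bit PCM, 16 kHz, mono as <name>_fixed.wav
    output_wav = input_audio.rsplit('.', 1)[0] + "_fixed.wav"
    # -y (overwrite) is placed before the output path, because trailing options may be ignored
    subprocess.run([
        "ffmpeg", "-y", "-i", input_audio, "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1", output_wav
    ], check=True)
    return output_wav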
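
To check that a converted file really has the format requested from FFmpeg above (16 kHz, mono, 16-bit PCM), the standard wave module can read the header back. The function name is_speech_ready is an illustrative assumption, not part of the original script.

```python
import wave

def is_speech_ready(wav_path, rate=16000, channels=1, sample_width=2):
    """Return True if the WAV header reports 16 kHz, mono, 16-bit PCM by default."""
    with wave.open(wav_path, "rb") as wav:
        return (
            wav.getframerate() == rate
            and wav.getnchannels() == channels
            and wav.getsampwidth() == sample_width  # width in bytes: 2 bytes = 16 bits
        )
```

A check like this can run right after convert_audio_to_wav, so that a conversion problem is reported before any audio is sent to Azure.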

Transcribe Audio Using Azure Speech-to-Text 

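import azure.cognitiveservices.speech as speechsdk
def transcribe_audio(audio_file, speech_config):
    audio_config = speechsdk.audio.AudioConfig(filename=audio_file)
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)
    # recognize_once() returns after a single utterance (the first pause or a short
    # time limit), so longer recordings need continuous recognition instead.
    result = speech_recognizer.recognize_once()
    return result.text if result.reason == speechsdk.ResultReason.RecognizedSpeech else None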
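
Because recognize_once() stops after one utterance, files longer than a sentence or two are usually handled with the SDK's continuous recognition mode. The following is a sketch of that alternative, not the original author's code: collect_transcript and transcribe_long_audio are hypothetical names, and phrases are simply joined with spaces.

```python
import threading

def collect_transcript(recognizer):
    """Run continuous recognition to the end of the input and join the recognized phrases."""
    segments = []
    done = threading.Event()

    def on_recognized(evt):
        # Final result for one phrase; the text is empty when nothing was matched.
        if evt.result.text:
            segments.append(evt.result.text)

    recognizer.recognized.connect(on_recognized)
    # Both events signal that no more results will arrive (end of file or an error).
    recognizer.session_stopped.connect(lambda evt: done.set())
    recognizer.canceled.connect(lambda evt: done.set())

    recognizer.start_continuous_recognition()
    done.wait()
    recognizer.stop_continuous_recognition()
    return " ".join(segments)

def transcribe_long_audio(audio_file, speech_config):
    # Imported lazily so collect_transcript can be exercised without the SDK installed.
    import azure.cognitiveservices.speech as speechsdk
    audio_config = speechsdk.audio.AudioConfig(filename=audio_file)
    recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)
    # Return None when nothing was recognized, matching transcribe_audio above.
    return collect_transcript(recognizer) or None
```

transcribe_long_audio takes the same arguments as transcribe_audio, so it can be swapped in without changing the rest of the workflow.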

Save the Transcription 

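import os
def save_transcription(text, filename):
    # Create transcriptions/ if needed; filename is the base name without extension
    os.makedirs("transcriptions", exist_ok=True)
    # Write UTF-8 explicitly so non-ASCII transcripts are saved correctly on any locale
    with open(f"transcriptions/{filename}_transcription.txt", "w", encoding="utf-8") as file:
        file.write(text)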

Step 3: Running the Script 

To transcribe one or more audio or video files, run:

python3 transcribe.py video1.mp4 audio1.wav

This script will: 

  1. Extract audio from video (if applicable) 
  2. Convert the audio to the required format 
  3. Send it to Azure’s Speech-to-Text API 
  4. Save the transcribed text in the transcriptions/ folder 
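
The four steps above can be tied together roughly as follows. This is a sketch under stated assumptions: the list of video extensions is illustrative, is_video and process_file are hypothetical names, and the helper functions from Step 2 are passed in as arguments so that the example stays self-contained.

```python
from pathlib import Path

# Assumed set of container formats treated as video; extend as needed.
VIDEO_EXTENSIONS = {".mp4", ".mov", ".mkv", ".avi", ".webm"}

def is_video(path):
    """Decide by file extension whether audio must be extracted first."""
    return Path(path).suffix.lower() in VIDEO_EXTENSIONS

def process_file(path, extract_audio, convert_audio_to_wav, transcribe, save):
    """Run one media file through the four steps; return the transcript or None."""
    # Step 1: video files go through audio extraction, audio files skip it.
    audio = extract_audio(path) if is_video(path) else path
    # Step 2: normalize to the WAV format expected by the recognizer.
    wav = convert_audio_to_wav(audio)
    # Step 3: send the audio to the Speech-to-Text API.
    text = transcribe(wav)
    # Step 4: save only when something was recognized.
    if text:
        save(text, Path(path).stem)
    return text
```

In the script itself, this would be called once per entry in args.media_files, for example with transcribe set to lambda wav: transcribe_audio(wav, speech_config) and save set to save_transcription.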

Advanced Features & Future Enhancements 

This workflow can be expanded to support: 

  • Live speech transcription for real-time applications 
  • Multi-speaker recognition for differentiating voices 
  • Automatic translation for multilingual content 

Conclusion

Combining FFmpeg for media conversion with the Azure Speech-to-Text API gives you a scriptable transcription tool for audio and video files. Whether you’re handling podcasts, interviews, or business meetings, the same script processes each file without manual steps.

For complete source code, visit: GitHub Repository 

Dheeraj Kumar

Technical Project Manager

Tech Lead with 8+ years of experience in Software development, project management, and UI/UX design, specialising in building scalable mobile applications, leading cross-functional teams, and delivering user-centric solutions with a strong focus on performance, quality, and innovation.