Introduction
Businesses and content creators increasingly rely on speech-to-text technology to transcribe audio and video files. Whether you’re a developer, researcher, or media professional, automating transcription saves time that would otherwise go into typing out recordings by hand.
This guide walks you through building an automated transcription tool with the Azure Cognitive Services Speech-to-Text API on Ubuntu Linux. By the end of this article, you’ll be able to:
- Convert video files to audio for transcription
- Normalize audio formats for better accuracy
- Leverage Azure Speech-to-Text API for precise transcriptions
- Automate the transcription process using Python on Ubuntu
- Optionally, run this workflow on an Azure Virtual Machine (VM)
Why Automate Speech-to-Text Transcription?
Manual transcription is time-consuming and prone to errors. Automating it removes most of the repetitive work: media files go in, text files come out, and a person only needs to review the result. Azure Speech Services provides the speech recognition behind this workflow, which makes it a common choice for businesses, podcasters, and other professionals.
Prerequisites
Before setting up the transcription tool, ensure you have:
- A Microsoft Azure account with Speech Services enabled
- Python 3 installed on Ubuntu
- FFmpeg for media file conversion
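- Required Python libraries: azure-cognitiveservices-speech and, optionally, moviepy (the script in this guide calls FFmpeg directly). argparse, subprocess, and os ship with Python’s standard library and need no installation.
Run the following commands to install dependencies:
sudo apt update && sudo apt install ffmpeg -y
pip install azure-cognitiveservices-speech moviepy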
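Before moving on, it can help to confirm that FFmpeg is actually on the PATH, since the script shells out to it. The following is a minimal sketch using only the standard library; the helper name missing_tools is illustrative and not part of the original script.

```python
import shutil

def missing_tools(tools=("ffmpeg",)):
    """Return the command-line tools from `tools` that are not found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]

if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing tools:", ", ".join(missing))
    else:
        print("All required tools found.")
```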
Step 1: Setting Up Azure Speech Services
- Create an Azure Account: Sign up at Azure Portal if you don’t have an account.
- Set Up Speech Services: Navigate to Azure Speech Services, create a resource, select a pricing tier, and copy the API Key and Region from the Keys and Endpoint tab.
- Configure the Speech SDK in Python:
import azure.cognitiveservices.speech as speechsdk
speech_config = speechsdk.SpeechConfig(
    subscription="YOUR_AZURE_SPEECH_KEY",
    region="YOUR_AZURE_REGION"
)
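Hard-coding the key in the script makes it easy to leak through version control. One common alternative is to read it from environment variables. The sketch below assumes the variable names AZURE_SPEECH_KEY and AZURE_SPEECH_REGION, which are illustrative choices rather than names the SDK requires; load_speech_settings and build_speech_config are likewise hypothetical helpers.

```python
import os

def load_speech_settings(env=None):
    """Read the Speech key and region from environment variables.

    AZURE_SPEECH_KEY and AZURE_SPEECH_REGION are assumed names, not SDK requirements.
    """
    env = os.environ if env is None else env
    key = env.get("AZURE_SPEECH_KEY")
    region = env.get("AZURE_SPEECH_REGION")
    if not key or not region:
        raise RuntimeError("Set AZURE_SPEECH_KEY and AZURE_SPEECH_REGION first.")
    return key, region

def build_speech_config(env=None):
    # Imported here so load_speech_settings works without the SDK installed.
    import azure.cognitiveservices.speech as speechsdk
    key, region = load_speech_settings(env)
    return speechsdk.SpeechConfig(subscription=key, region=region)
```

With the two variables exported in the shell, build_speech_config() can stand in for the hard-coded SpeechConfig call.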
Step 2: Writing the Python Script
Handling Command-Line Arguments
import argparse
parser = argparse.ArgumentParser(description="Transcribe speech from video and audio files.")
parser.add_argument("media_files", nargs="+", help="Paths to video/audio files")
args = parser.parse_args()
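nargs="+" guarantees at least one path but does not check that the paths exist. A small sketch of one way to filter them before any FFmpeg or Azure call is made; build_parser and split_existing are hypothetical helper names, and the explicit argument list is only there to make the example easy to try.

```python
import argparse
import os

def build_parser():
    parser = argparse.ArgumentParser(description="Transcribe speech from video and audio files.")
    parser.add_argument("media_files", nargs="+", help="Paths to video/audio files")
    return parser

def split_existing(paths):
    """Return (existing, missing) lists, preserving the input order."""
    existing = [p for p in paths if os.path.isfile(p)]
    missing = [p for p in paths if not os.path.isfile(p)]
    return existing, missing

# Passing an explicit list instead of reading sys.argv keeps the demo self-contained.
args = build_parser().parse_args(["video1.mp4", "audio1.wav"])
existing, missing = split_existing(args.media_files)
for path in missing:
    print(f"Skipping {path}: file not found")
```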
Extract Audio from Video Files
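import subprocess

def extract_audio(video_file):
    # Write 16 kHz, mono, 16-bit PCM WAV next to the source file.
    audio_file = f"{video_file.rsplit('.', 1)[0]}_audio.wav"
    # -y goes before the inputs (trailing options may be ignored); -vn drops the video stream.
    subprocess.run([
        "ffmpeg", "-y", "-i", video_file, "-vn",
        "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1", audio_file
    ], check=True)
    return audio_file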
Convert Audio to the Required Format
def convert_audio_to_wav(input_audio):
    # Re-encode any audio file to 16 kHz, mono, 16-bit PCM WAV.
    output_wav = input_audio.rsplit('.', 1)[0] + "_fixed.wav"
    subprocess.run([
        "ffmpeg", "-y", "-i", input_audio,
        "-acodec", "pcm_s16le", "-ar", "16000", "-ac", "1", output_wav
    ], check=True)
    return output_wav
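Both functions above pass the same FFmpeg settings: 16 kHz sample rate, one channel, 16-bit PCM. One way to avoid the duplication is to build the command in a single place. The sketch below is an illustration, not the original script's code; normalized_path and build_ffmpeg_command are hypothetical names. It uses pathlib for the output name, because rsplit('.', 1) can cut at a dot in a directory name when the file itself has no extension.

```python
from pathlib import Path

def normalized_path(source, suffix="_fixed"):
    """Return '<stem><suffix>.wav' in the same directory as the source file."""
    src = Path(source)
    return str(src.with_name(f"{src.stem}{suffix}.wav"))

def build_ffmpeg_command(source, target, sample_rate=16000, channels=1):
    """Build the FFmpeg argument list for 16-bit PCM WAV output."""
    return [
        "ffmpeg", "-y",          # overwrite the target without prompting
        "-i", str(source),
        "-vn",                   # ignore any video stream
        "-acodec", "pcm_s16le",  # 16-bit little-endian PCM
        "-ar", str(sample_rate),
        "-ac", str(channels),
        str(target),
    ]
```

The resulting list can be passed straight to subprocess.run(..., check=True), as in the functions above.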
Transcribe Audio Using Azure Speech-to-Text
def transcribe_audio(audio_file, speech_config):
    audio_config = speechsdk.audio.AudioConfig(filename=audio_file)
    speech_recognizer = speechsdk.SpeechRecognizer(speech_config=speech_config, audio_config=audio_config)
    # recognize_once() returns after a single utterance, so it suits short clips;
    # longer recordings need continuous recognition.
    result = speech_recognizer.recognize_once()
    return result.text if result.reason == speechsdk.ResultReason.RecognizedSpeech else None
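Because recognize_once() returns after a single utterance, recordings longer than a sentence or two are usually better served by the SDK's continuous recognition mode. The following is a minimal sketch under that assumption, using a speech_config like the one from Step 1. The names transcribe_file and join_segments are hypothetical, and detailed handling of cancellation errors is left out.

```python
import threading

def join_segments(segments):
    """Join recognized phrases into one transcript, skipping empty ones."""
    return " ".join(s.strip() for s in segments if s and s.strip())

def transcribe_file(audio_file, speech_config):
    # Imported here so join_segments can be used without the SDK installed.
    import azure.cognitiveservices.speech as speechsdk

    audio_config = speechsdk.audio.AudioConfig(filename=audio_file)
    recognizer = speechsdk.SpeechRecognizer(
        speech_config=speech_config, audio_config=audio_config
    )

    segments = []
    done = threading.Event()

    def on_recognized(evt):
        # Fired once per finalized phrase.
        if evt.result.reason == speechsdk.ResultReason.RecognizedSpeech:
            segments.append(evt.result.text)

    def on_finished(evt):
        # Fired on session_stopped or canceled (errors and end of stream).
        done.set()

    recognizer.recognized.connect(on_recognized)
    recognizer.session_stopped.connect(on_finished)
    recognizer.canceled.connect(on_finished)

    recognizer.start_continuous_recognition()
    done.wait()  # block until the whole file has been processed
    recognizer.stop_continuous_recognition()

    return join_segments(segments) or None
```

Like transcribe_audio, this returns None when nothing was recognized, so it can be swapped in without changing the calling code.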
Save the Transcription
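import os

def save_transcription(text, filename):
    os.makedirs("transcriptions", exist_ok=True)
    # Explicit UTF-8 keeps non-ASCII characters intact regardless of the system locale.
    with open(f"transcriptions/{filename}_transcription.txt", "w", encoding="utf-8") as file:
        file.write(text)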
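transcribe_audio returns None when nothing is recognized, and file.write(None) raises a TypeError. A slightly more defensive variant might look like the sketch below; save_transcript and its out_dir parameter are hypothetical names, and deriving the output name from the source file's stem is an assumption about the intended naming.

```python
from pathlib import Path

def save_transcript(text, source_path, out_dir="transcriptions"):
    """Write text to '<out_dir>/<source stem>_transcription.txt'.

    Returns the path written, or None if there was no text to save.
    """
    if not text:
        return None  # nothing recognized, so no file is written
    target_dir = Path(out_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / f"{Path(source_path).stem}_transcription.txt"
    target.write_text(text, encoding="utf-8")
    return target
```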
Step 3: Running the Script
To transcribe one or more audio or video files, run:
python3 transcribe.py video1.mp4 audio1.wav
This script will:
- Extract audio from video (if applicable)
- Convert the audio to the required format
- Send it to Azure’s Speech-to-Text API
- Save the transcribed text in the transcriptions/ folder
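The steps listed above can be tied together in a small driver. The guide defines the individual functions but not the loop that calls them, so the following is only a sketch of one possible wiring. VIDEO_EXTENSIONS, is_video, and process_file are hypothetical names, the extension list is an assumption, and the step functions are passed in as parameters so the flow can be tried without FFmpeg or Azure.

```python
from pathlib import Path

# Assumed set of extensions treated as video; adjust as needed.
VIDEO_EXTENSIONS = {".mp4", ".mkv", ".mov", ".avi", ".webm"}

def is_video(path):
    """Guess from the file extension whether the input is a video."""
    return Path(path).suffix.lower() in VIDEO_EXTENSIONS

def process_file(path, extract_audio, convert_audio, transcribe, save):
    """Run one file through the pipeline; return True if a transcript was saved."""
    # Video goes through extraction, which already emits 16 kHz mono WAV;
    # audio-only input is normalized instead.
    wav_path = extract_audio(path) if is_video(path) else convert_audio(path)
    text = transcribe(wav_path)
    if not text:
        return False  # nothing recognized, so nothing to save
    save(text, Path(path).stem)
    return True
```

With the functions from Step 2, each path in args.media_files could be handled as process_file(path, extract_audio, convert_audio_to_wav, lambda wav: transcribe_audio(wav, speech_config), save_transcription).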
Advanced Features & Future Enhancements
This workflow can be expanded to support:
- Live speech transcription for real-time applications
- Multi-speaker recognition for differentiating voices
- Automatic translation for multilingual content
Conclusion
With Azure Cognitive Services handling the speech recognition, this tool automates the transcription of audio and video files from the command line. Whether you’re handling podcasts, interviews, or business meetings, the approach saves the time that manual transcription would otherwise take.
For complete source code, visit: GitHub Repository