Got an hour-long video and not really into creating subtitles by hand? No plans to put it on YouTube for its automated transcription service? Then try Google Cloud Speech-to-Text! In this post I’ll share some scripts for automating the process and creating an .srt file to go along with your video for displaying the subtitles.
Google’s Speech-to-Text (https://cloud.google.com/speech-to-text) is available through the gcloud command line tool, which makes it very easy to script and automate. There are just a few steps and considerations for using it, for example extracting FLAC audio from your video and splitting it up into small chunks.
Prerequisites
- Google Cloud enabled for your Google account (you need a credit card, I believe, but you get $300 in credits for using it the first time)
- A live Google Cloud project
- gcloud installed (https://cloud.google.com/sdk/docs/install) and configured for your account
- A Google Cloud Storage bucket (https://console.cloud.google.com/storage)
- FFMPEG installed (https://ffmpeg.org/download.html)
- A long video to transcribe (in whatever format, but FFMPEG should be able to read it)
Extract 5-minute FLAC Audio Chunks
Google Speech worked best for me when I limited each transcription to 5 minutes of audio. With anything longer I got timeouts and errors, so I recommend splitting the audio into 5-minute chunks like so:
$ ffmpeg -i video.mp4 -vn -ar 44100 -c:a flac -sample_fmt s16 -ac 1 -y -map 0 -segment_time 00:05:00 -f segment audio%02d.flac
This creates (depending on the length of your video) one or more audioNN.flac files. The encoding is 16-bit, 44.1 kHz, mono to keep the files as small as possible (~11 MB). I had a pretty slow internet connection while doing this, so uploading 24-bit, 48 kHz stereo files was a pain. Google recommends a sample rate of 16 kHz or higher for optimal results anyway. You can tweak the parameters, but I doubt it will give a better transcription. Upload all the files to your Google Cloud Storage bucket.
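To see why the encoding choice matters for upload size, here is a rough back-of-the-envelope sketch in Python. It only computes the uncompressed PCM size of one 5-minute chunk. FLAC then shrinks this losslessly by an amount that depends on the audio, so the numbers are upper bounds rather than the final file sizes. The function name pcm_bytes is just for illustration.

```python
def pcm_bytes(sample_rate_hz, bits_per_sample, channels, seconds):
    """Size in bytes of uncompressed PCM audio with the given parameters."""
    return sample_rate_hz * (bits_per_sample // 8) * channels * seconds


CHUNK_SECONDS = 5 * 60  # one 5-minute chunk, as in the ffmpeg command above

# The settings used in this post: 16-bit, 44.1 kHz, mono.
small = pcm_bytes(44_100, 16, 1, CHUNK_SECONDS)
# The alternative mentioned above: 24-bit, 48 kHz, stereo.
large = pcm_bytes(48_000, 24, 2, CHUNK_SECONDS)

print(f"16-bit 44.1 kHz mono:  {small / 1e6:.1f} MB before FLAC compression")
print(f"24-bit 48 kHz stereo:  {large / 1e6:.1f} MB before FLAC compression")
print(f"ratio: {large / small:.2f}x")
```

Before compression, the stereo 24-bit chunk is a bit over three times the size of the mono 16-bit one, which adds up quickly on a slow connection.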
Run the Google Speech Command
Queue up all the files for transcription:
$ for i in {0..[[NN]]}; do gcloud ml speech recognize-long-running "gs://[[mybucket]]/audio$(printf '%02d' $i).flac" --encoding='flac' --sample-rate=44100 --language-code='en-US' --include-word-time-offsets --async; done
Change [[NN]] and [[mybucket]] to match your setup (e.g. the number of audioNN.flac files you have).
While doing this I also experimented with providing a list of “boost words” (https://cloud.google.com/speech-to-text/docs/speech-adaptation). My “special” words were in a text file, words.txt, one word per line, so I could run:
$ for i in {0..18}; do gcloud ml speech recognize-long-running "gs://speechrec/audio$(printf '%02d' $i).flac" --encoding='flac' --sample-rate=44100 --language-code='en-US' --include-word-time-offsets --async --hints "$(awk '{ printf "%s%s", (NR==1?"[":", "), $0 } END{ print "]" }' words.txt)"; done
This passes the words on the command line through the --hints argument.
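If the awk one-liner is hard to read, the same string can be built in a few lines of Python. This is only an equivalent sketch of the string formatting, not a different way of calling the service. Unlike the awk version it also skips blank lines and trims whitespace, and the sample words below are made up for illustration.

```python
def build_hints(lines):
    """Join non-empty lines into a '[a, b, c]' string, like the awk one-liner."""
    words = [line.strip() for line in lines if line.strip()]
    return "[" + ", ".join(words) + "]"


# Hypothetical contents of words.txt, one word per line.
sample_lines = ["Kubernetes\n", "gRPC\n", "\n", "FFmpeg\n"]
print(build_hints(sample_lines))
```

With a real file you would pass the open file object instead of the list, e.g. build_hints(open("words.txt")).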
Note that this runs the recognize-long-running command with --async, which returns immediately and lets you query the service for the results later.
This command returns a bunch of operations/NNNNNNNNNNNNNNNNNN keys that are used for retrieving the results, e.g.
$ gcloud ml speech operations describe operations/NNNNNNNNNNNNNNNNNN
This returns JSON with the results, so I redirect the output of describe into .json files:
$ gcloud ml speech operations describe operations/NNNNNNNNNNNNNNNNNN > audio00.json
Keep the .json file names aligned with the audio .flac files.
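With many chunks, fetching the results one by one gets tedious. The sketch below wraps the same describe command in Python and writes one audioNN.json per operation. It assumes you have collected the operation keys yourself, in the same order as the audio chunks, and that the operations have finished. The function names are illustrative, and --format=json is gcloud's general output-format flag, added here only to make the format explicit.

```python
import subprocess
from pathlib import Path


def describe_command(operation_id):
    """Build the gcloud call used above for a single operation key."""
    return ["gcloud", "ml", "speech", "operations", "describe",
            operation_id, "--format=json"]


def output_name(chunk_index):
    """Name the result file after the matching audioNN.flac chunk."""
    return f"audio{chunk_index:02d}.json"


def fetch_results(operation_ids, out_dir=".", run=subprocess.run):
    """Describe each operation and save its JSON next to the audio chunks.

    operation_ids must be in chunk order: the first key belongs to
    audio00.flac, the second to audio01.flac, and so on.
    """
    written = []
    for chunk_index, operation_id in enumerate(operation_ids):
        result = run(describe_command(operation_id),
                     capture_output=True, text=True, check=True)
        path = Path(out_dir) / output_name(chunk_index)
        path.write_text(result.stdout, encoding="utf-8")
        written.append(path)
    return written
```

Calling fetch_results with your list of operation keys then leaves audio00.json, audio01.json and so on in the current directory.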
Create a .srt File from the JSON Output
This involves a Python script that I wrote to take all the JSON files and combine them, converting the .json format into .srt format while keeping the timestamps valid and in sync with the video.
Using the script:
$ google_cloud_speech_json_to_srt.py --concat --fix-timestamps audio*.json
Note that the script looks for the matching .flac files to work out the starting timestamp of each audio chunk, so that it can keep the subtitle timestamps correctly synced with the video.
The script uses some heuristics to decide how many words to put in a single subtitle line, and makes sure each line stays on screen for a sensible duration.
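To give a feel for what such a conversion involves, here is a minimal sketch of the core idea. It is not the script above, and it makes several assumptions: that each saved JSON has the words under response → results → alternatives, with startTime and endTime given as strings like "1.400s"; that a fixed number of words per subtitle is good enough (the real script uses smarter heuristics); and that the caller supplies each chunk's start offset in seconds, e.g. 300 × the chunk index, instead of reading it from the .flac files.

```python
def parse_seconds(value):
    """Convert a duration string such as '12.300s' to float seconds."""
    return float(value.rstrip("s"))


def srt_timestamp(seconds):
    """Format seconds as an SRT timestamp: HH:MM:SS,mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def words_from_operation(operation):
    """Collect the word entries of the top alternative of every result.

    The JSON layout used here is an assumption about the saved output.
    """
    words = []
    for result in operation.get("response", {}).get("results", []):
        alternatives = result.get("alternatives", [])
        if alternatives:
            words.extend(alternatives[0].get("words", []))
    return words


def words_to_cues(words, offset=0.0, max_words=8):
    """Group words into (start, end, text) cues, shifted by the chunk offset."""
    cues = []
    for i in range(0, len(words), max_words):
        group = words[i:i + max_words]
        start = parse_seconds(group[0]["startTime"]) + offset
        end = parse_seconds(group[-1]["endTime"]) + offset
        cues.append((start, end, " ".join(w["word"] for w in group)))
    return cues


def cues_to_srt(cues):
    """Render cues as numbered SRT blocks separated by blank lines."""
    blocks = [
        f"{n}\n{srt_timestamp(start)} --> {srt_timestamp(end)}\n{text}\n"
        for n, (start, end, text) in enumerate(cues, 1)
    ]
    return "\n".join(blocks)
```

To combine several chunks, you would load each file with json.load, build the cues with that chunk's offset, concatenate the cue lists in order, and render them once with cues_to_srt so the numbering runs across the whole video.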
Bask in Your Glory
The .srt file should be ready now. Rename it to match the video file (e.g. video.srt for video.mp4) and play the video in VLC; the subtitles should come up automatically, in sync with the video.
Have fun subtitling!
Roy.