Automatically Generate Captions for My Videos With Whisper

I hate writing captions for my videos, because it’s such a time-consuming process. So when I discovered OpenAI’s Whisper program (and its models), along with Georgi Gerganov’s C++ rewrite, whisper.cpp, I knew I had to try it out to save myself hours of transcribing pain.

After generating an MP3 file in FCPX, I used ffmpeg to convert it to the 16-bit, 16 kHz mono WAV format that Whisper.cpp expects:

ffmpeg -i input.mp3 -acodec pcm_s16le -ac 1 -ar 16000 output.wav

Then I installed Whisper.cpp:

git clone https://github.com/ggerganov/whisper.cpp
cd whisper.cpp

# Download a model. You probably don't need large, but that's what I used
bash ./models/download-ggml-model.sh large

# Build the main binary
make

# Adjust threads to your computer's CPU
./main -t 10 --output-srt --language en --model ./models/ggml-large.bin --file ~/Downloads/output.wav

The audio file I gave it was nearly 12 minutes in length, and Whisper.cpp took about 5 minutes in total to process it.
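For context, that works out to roughly 2.4× real time on my machine. A quick back-of-the-envelope check:

```shell
# 12 minutes of audio processed in about 5 minutes of wall-clock time
awk 'BEGIN { printf "%.1fx real time\n", 12 / 5 }'
# prints "2.4x real time"
```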

After it spat out an SRT file, I imported it into FCPX and had a look. While there were some typos I had to correct, it was nearly 90% accurate at deciphering my terrible, terrible voice. My only minor gripe was that each line started with a space, which I suspect is a bug in Whisper.cpp’s conversion of Whisper’s VTT output into SRT, but other than that it was perfect. I no longer have to manually type out my own voice.
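If those leading spaces bother you, a one-line sed pass can strip them before importing the file (the filenames here are just examples):

```shell
# Strip a single leading space from the start of every line in the SRT
sed 's/^ //' captions.srt > captions-clean.srt
```

This only removes one space at the start of each line, so deliberate indentation elsewhere (if any) is left alone.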