I never thought I’d manage a Halloween-appropriate post, even as I sat down to write this one, and even after I was inspired by a blog post called “Free software scares normal people”. (TL;DR: very often Open Source software is written by people who want lots of complicated things. The author wrote a front end to Handbrake, a video tool, so that it would convert whatever you dropped on it into an MP4 file, which is all that anyone really wants anyway.) So my Halloween-themed blog post was scaring me in the face.
This is a tale about how a one-off tool for transcribing data became something that a Normal Person can use. If you need data transcribed and value money and privacy, please check out Mercuryscribe. Everything you need to know to use it is there.
Read on to find out how it came to be.
I’ve written before about little software projects that I write for the academic who lives in my house, like this piece about automating downloading lots of articles. In another project, I found some code that can do transcription and managed to make it work. I polished it up enough that it ran reliably, and I even added a way for the researcher to update the speakers’ names. It even worked on her Mac! When it was needed again for another project, I found that it still worked. Since, as best I can tell, there’s nothing else available that’s completely free and doesn’t require giving your data to someone else, I thought I’d polish it up so that someone else could use it. I’d spent something like 5-10 hours writing the script that transcribed all the data for a couple of projects. Maybe someone else could make use of it!
It totally works! Just open a Terminal and pip install transcribe-with-whisper
If you know what that means, it might actually be true.
Normal People don’t open terminals. They don’t use pip. They certainly don’t think that transcribe-with-whisper my-video-file.mp4 is easy, even if it is free and private.
No problem, I thought. I’ll just vibe code up a web front end. Before I got too frustrated, my command line tool was accessible from a web browser! You can upload a file and have it transcribed into a web page that shows the transcription and plays the original file. You can even click on a word and have the audio jump to that same place.
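For the curious, here’s roughly how click-to-seek can work. This is a minimal sketch, not Mercuryscribe’s actual code: it assumes the transcriber hands you a list of (start time in seconds, word) pairs, and the function name `words_to_html` is made up for illustration. Each word goes in a span that remembers its start time, and a few lines of JavaScript set the player’s `currentTime` when you click one.

```python
import html


def words_to_html(words, media_src):
    """Build a minimal page where every word is a clickable span.

    `words` is a list of (start_seconds, text) pairs. This is a
    hypothetical shape; the real tool's data may be organized differently.
    """
    spans = []
    for start, text in words:
        # Store each word's start time in a data attribute so the
        # click handler knows where to seek.
        spans.append(
            f'<span class="word" data-start="{start:.2f}">{html.escape(text)}</span>'
        )
    body = " ".join(spans)
    # Doubled braces are literal braces in the JavaScript below.
    return f"""<audio id="player" controls src="{html.escape(media_src, quote=True)}"></audio>
<p id="transcript">{body}</p>
<script>
  const player = document.getElementById("player");
  document.getElementById("transcript").addEventListener("click", (event) => {{
    // Only word spans carry a data-start attribute.
    const start = event.target.dataset.start;
    if (start !== undefined) {{
      player.currentTime = parseFloat(start);
      player.play();
    }}
  }});
</script>"""
```

For a video file you’d swap the `<audio>` element for `<video>`; `currentTime` works the same way on both.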
But wait! I added a rudimentary editor that lets you edit the transcription. When you save, it updates the VTT files (which are used to build the HTML and are sometimes useful on their own) and regenerates the HTML with the updated text.
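Saving edits means turning the transcript back into WebVTT, which is a plain-text format: a `WEBVTT` header, then cues made of a `start --> end` timing line, the text, and a blank line. Here’s a sketch of that step, assuming (hypothetically) that the edited transcript is kept as (start, end, text) tuples in seconds; the function names are mine, not the tool’s.

```python
def format_timestamp(seconds: float) -> str:
    """Format a time in seconds as a WebVTT timestamp, HH:MM:SS.mmm."""
    total_ms = round(seconds * 1000)
    hours, rem = divmod(total_ms, 3_600_000)
    minutes, rem = divmod(rem, 60_000)
    secs, ms = divmod(rem, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{ms:03d}"


def cues_to_vtt(cues) -> str:
    """Serialize (start, end, text) cues into a WebVTT document."""
    lines = ["WEBVTT", ""]
    for start, end, text in cues:
        lines.append(f"{format_timestamp(start)} --> {format_timestamp(end)}")
        # Collapse all whitespace, including blank lines an editor might
        # introduce, because a blank line would end the cue early.
        lines.append(" ".join(text.split()))
        lines.append("")  # a blank line terminates each cue
    return "\n".join(lines)
```

A real editor would also need to keep people from typing `-->` into a cue, since that string is reserved for the timing line.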
But wait! “Speaker 1” and “Speaker 2” aren’t very descriptive. Now you can update the speaker names. You can sort-of play with it here.
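WebVTT has a built-in way to say who is talking: a voice span like `<v Speaker 1>`. Assuming the speaker labels are stored that way (I’m not claiming that’s how Mercuryscribe does it), renaming speakers is a find-and-replace over those tags. `rename_speakers` and the shape of the name mapping are hypothetical.

```python
import re

# Matches a WebVTT voice span such as "<v Speaker 1>" and captures the label.
VOICE_TAG = re.compile(r"<v ([^>]+)>")


def rename_speakers(vtt_text: str, names: dict) -> str:
    """Replace generic speaker labels in voice spans with real names.

    `names` maps old labels to new ones, e.g. {"Speaker 1": "Alice"}.
    """
    def replace(match):
        label = match.group(1).strip()
        # A ">" in a name would close the tag early, so drop it.
        new_name = names.get(label, label).replace(">", "").strip()
        # Fall back to the old label if the new name ends up empty.
        return f"<v {new_name or label}>"

    return VOICE_TAG.sub(replace, vtt_text)
```

Labels that aren’t in the mapping are left alone, so you can rename one speaker at a time.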
Mercuryscribe.com is born
I bought a domain name. (It has a terrible AI-generated logo now, but I am hiring a live human to generate a Real Logo for https://mercuryscribe.com/.)
But that’s when it stopped being easy. Managing the versions of everything with Python is not for the faint of heart. I thought that I could ask people to install Homebrew and then Python, and then. . . Well, that seemed like a non-starter for almost everyone I know.
Enter Docker: the cross-platform miracle solution
So I made a Docker container! All you have to do is install Docker Desktop and then paste in a single command, and you’re off to the races! Sure, building separate Docker images for both Intel Macs and ARM (Apple Silicon, like the M4) Macs took a couple of weeks (maybe it was a month), but it worked.
So if you have Docker installed, you can “just” do this! (But you might want to check out the Getting Started page for some details.)
docker run --rm -p 5001:5001 -v "$(pwd)/mercuryscribe:/app/mercuryscribe" ghcr.io/literatecomputing/transcribe-with-whisper-web:latest
I tested this with a power user who isn’t comfortable with a shell and he was able to make it work pretty quickly. Except for . . .
Tokens are not just for video games any more
The next issue was Hugging Face tokens. “Hugging Face, Inc. is an American company based in New York City that develops computation tools for building applications using machine learning.” (Thanks, Wikipedia!) The free code that I started with uses some AI models that are made available through Hugging Face. To run the code, you need a token that gives you permission to download the models from Hugging Face. So you need to create an account and visit some pages to accept their terms of use. The transcription is all done locally on your computer without sharing your data with anyone, but to do that, the code must first download the models that make the transcription possible.
The original version of the code assumed that you could go get the token and pass it to the script. I provided “simple” instructions for doing that, but it was still a huge barrier to entry for Normal People. To fix that, I added a friendly web page with links and instructions for getting the token. It checks that the right terms have been accepted for that token and, if not, links to the places you need to go to accept them. Finally, it saves the token safely away in a place where the app will find it the next time.
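Here’s a sketch of the “saves the token safely away” part. It’s an illustration under assumptions, not the app’s code: the `~/.mercuryscribe/hf_token` location and both function names are made up, the `hf_` prefix check is only a sanity check (current Hugging Face access tokens start with `hf_`), and honoring the `HF_TOKEN` environment variable, which the `huggingface_hub` library also reads, is my addition. Checking whether the model terms have been accepted means asking Hugging Face over the network, and that isn’t shown here.

```python
import os
import stat
from pathlib import Path
from typing import Optional

# Hypothetical location; the real app may keep the token somewhere else.
DEFAULT_TOKEN_FILE = Path.home() / ".mercuryscribe" / "hf_token"

OWNER_ONLY = stat.S_IRUSR | stat.S_IWUSR  # 0o600: owner read/write only


def save_token(token: str, path: Path = DEFAULT_TOKEN_FILE) -> Path:
    """Store a Hugging Face token in a file only the owner can read."""
    token = token.strip()
    # Sanity check only: passing it does not prove the token is valid.
    if not token.startswith("hf_"):
        raise ValueError("That doesn't look like a Hugging Face access token.")
    path.parent.mkdir(parents=True, exist_ok=True)
    # Create the file with owner-only permissions from the start...
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, OWNER_ONLY)
    with os.fdopen(fd, "w") as f:
        f.write(token + "\n")
    # ...and tighten them in case the file already existed.
    os.chmod(path, OWNER_ONLY)
    return path


def load_token(path: Path = DEFAULT_TOKEN_FILE) -> Optional[str]:
    """Return the saved token, preferring the HF_TOKEN environment variable."""
    env_token = os.environ.get("HF_TOKEN", "").strip()
    if env_token:
        return env_token
    if path.is_file():
        return path.read_text().strip() or None
    return None
```

Creating the file with mode 0600 means only your user account can read it on a Mac or Linux; on Windows, `os.chmod` can only toggle the read-only flag, so this protects less there.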
So now, I’ve got the easy-to-use Docker container, I’ve saved you having to get a token into a mysterious file on your computer, I’ve provided sort-of helpful instructions on getting the token. It works on both Intel and ARM Macs, which are more different than I’d have liked them to be.
But does it do Windows?
Next, I contacted an academic who started at the University of Tennessee the same year that I did. She’s widely respected for her expertise in using software for qualitative analysis. She uses Windows. No problem. Docker Desktop works on Windows. It was easy to install on the Mac. Well, it was a problem, so much of a problem that I decided that I needed to make a Windows version.
A week later (and I did little else that week!), I had created a Windows version using PyInstaller, which bundles a Python program together with the Python interpreter and everything else it needs. I packaged the result in a single Zip file that includes a .bat file that a Normal Person can click on to make it go.
Finally, she was ready to try it. She managed to go to https://mercuryscribe.com/, click on the “Get Started Free” link, and click on the “Windows Package” link. It took several emails to get her through the Hugging Face gauntlet (I’ve since updated those instructions, so maybe it’ll be easier for you). Then she hit an error that I’d never seen. I was crestfallen. I’d already spent half a day installing Windows on a spare laptop (the last version of Windows that I used regularly was Windows 98). Was I now going to have to do some debugging across Windows installs or something?
Thankfully, it was simple. The problem was that when you click on the .bat file, it cranks up an ugly terminal window. Once things were working in the web browser, she closed the terminal. What wasn’t clear was that the ugly terminal was what made the whole thing work. I added a warning to the Get Started page.
She clicked on the .bat file again and it started up, without whining about the Hugging Face token, and lo! She was able to transcribe some audio into text.
Will you be next?
If you or someone you know needs to transcribe audio or video files, doesn’t want to share their data with anyone, and doesn’t want to spend money, Mercuryscribe might be the solution. And, I hope, it won’t be scary.
If you’re an academic and could share, or otherwise help people find out about this, I’d really appreciate it.
And another thing!
I didn’t have room here, but if you use a Mac and wish there were an easy way to get your PDFs that have DOIs named consistently, check out NameMyPdf.
