Transcription or Dictation: Will HAL Open the Pod Bay Doors?
“Is there some way to automatically transcribe a recording?” That’s a question I recently received from this site. Automatically? What does that mean? In my mind’s eye, I see that this automatic transcription software should closely resemble the HAL 9000 computer from 2001: A Space Odyssey, a computer that talks and can understand human speech. It’s a high ideal, but there are still technicalities involved. My conclusion, a while back, was, “I’m sorry, Dave, I just can’t do that.”
Is Automatic Speech Transcription HAL getting any closer to opening the Pod Bay Doors?
I conducted some tests using speech-to-text tech I have on hand to see how it stacks up against standard transcription. In this post: the test results, lessons learned, and best practices for each technique.
There are many devices, services and software that act like HAL: Siri on iOS, the Android Google Voice, or any number of corporate voice address systems that say “speak your request and I’ll get you to the right department.”
With my 3rd generation iPad (March 2012 Retina Display, running the iOS version 5.x), I use the Dictation feature to talk to my iPad and see the words appear, as text. (Apple on iOS Dictation: iPhone 4S, iPad 3rd generation) Yay! Talking to computers is now doable. Since I’m a fairly fast typist, I have never seen the need to buy dictation software. I’m happy to have something on hand to act as a test.
Note: The Dictation feature I describe on my iPad is available on the latest MacOS operating system, too. With iOS 6, the Dictate mode has been taken over by Siri. (I’m still waiting for Google to come out with Maps for the iPad before I update, though.)
Testing 1 - 2 - 3
I performed three different tests for transcription: two automatic using the iPad’s Dictate feature, and one manual using ExpressScribe for the Mac. The two variations of “automatic” speech-to-text methods were pitted against each other, then I tested the automatic victor against the manual transcription method.
I’ll describe each test and the results, and then afterwards discuss the setup for each method in more detail.
Test 1: Directly transcribe audio of an interview
The result? Pure comedy.
Here is the text transcribed using that method.
is this diversity is here is the war is so business. When she is a share it was. It is in Alabama. Going to the railroad telegraph later so BlackBerry Parmore that was a Burgessville still one were wanting to go to the limit is only one of boys that have a useful capabilities so he told us you telegraph Morse code and also the radio Morse code
Ooookay. We happily found a new parlor game. It’s better than Mad Libs, not as physically challenging as Twister. But oh my, is it ever entertaining! Useful? Sorry, HAL, you’d better start singing Daisy, Daisy. (in super slow mo)
Test 2: Play audio, listen, then repeat it in Dictate Mode
We can improve on our very funny Test 1 method, as long as a thinking human helps interpret the interview audio and speaks it to Dictate mode slowly, comma, clearly, comma, and includes the punctuation.
The method is similar. Play a bit of audio interview, stop, put the iPad in dictation mode. Repeat the words in the audio, stop dictation. Keep repeating until you’ve re-dictated all the audio recording.
The results of my test:
first the telegraph Morse code when I was 10 years old. My father wanted his sons to have some useful skill and so he was — he’d grown up in Alabama, and he had gone into the railroad as a telegraph operator. And so back during that time, why, that was a useful skill where one could earn a living. And so he wanted his boys have a useful capability so he taught us the telegraph Morse code and also the radio Morse code.
Oh! So that is what he was saying!! Well, bust my buttons! Why didn’t you say so in the first place?*
The result of Tests 1 and 2: The Listen, Repeat it in Dictate mode method of transcription won. That’s the way to do it.
This listen-and-then-parrot technique is what Nuance (maker of Dragon NaturallySpeaking and Dragon Dictate) recommends for transcribing.
I agree with them. Because BlackBerry Parmore that was a Burgessville said so. Really.
Test 3: ExpressScribe, cross-platform software for transcribing digital audio.
Now that we’ve got a working method to use iPad HAL to take dictation, how does the semi-automatic Listen, Repeat it in Dictate Mode method stack up against standard transcription by typing?
Here is how I conducted the test. I divided 20 minutes of audio into two 10-minute clips. I transcribed one using the semi-automatic iPad method: Listen, Repeat it in Dictate Mode. How long did it take to transcribe 10 minutes’ worth of audio? 44 minutes.
I transcribed the other 10-minute clip using ExpressScribe. (ExpressScribe is a cross-platform software application that comes with a free and paid version. The free version does all you need to transcribe audio. The Pro Version supports many more audio formats and offers support for video formats, too. The Pro Version is geared to accommodate the needs of a business or institutional workflow where multiple people send audio documents to a transcription pool of workers. You don’t need to go there.)
In ExpressScribe, I loaded up the 10-minute audio clip. I set up the software to make for good transcribing conditions. (See the ExpressScribe setup section later in this post for the particulars.) I opened up a word processing application and created a new document. I pressed the handy-dandy hot key that would begin a special form of playback (play a bit, pause, then resume playback from near the end of the previous playback snippet), and began typing. How long did it take me to transcribe 10 minutes’ worth of audio? 27 minutes.
(Note: I’m a fairly fast touch-typist. Using the 1-minute test at typingtest.com, I type 83 words per minute. If you type by hunting-and-pecking, or your touch-typing speed is slow to medium, your results will vary from mine.)
| Method | Audio duration | Time to transcribe |
| --- | --- | --- |
| Listen, Repeat it in Dictate Mode | 10 minutes | 44 minutes |
| ExpressScribe Play (with Pausing), manual typing (80+ wpm) | 10 minutes | 27 minutes |
(By the way, it always takes longer to transcribe a portion of audio than it took to record that audio in the first place. Professional transcriptionists say that for each hour of audio, it will take 4-6 times as long to transcribe. The time mentioned here does not include going back and re-listening to the recording to catch errors.)
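To compare my timings with that 4-6 times rule of thumb, here is a small sketch of the arithmetic. The function name and the dictionary layout are my own choices for illustration; the numbers are the ones from the table above.

```python
def transcription_ratio(audio_minutes: float, work_minutes: float) -> float:
    """Return how many minutes of work each minute of audio required."""
    if audio_minutes <= 0:
        raise ValueError("audio_minutes must be positive")
    return work_minutes / audio_minutes


# (audio minutes, minutes spent transcribing), as reported in the table.
results = {
    "Listen, Repeat it in Dictate Mode": (10, 44),
    "ExpressScribe Play (with Pausing)": (10, 27),
}

ratios = {
    name: transcription_ratio(audio, work)
    for name, (audio, work) in results.items()
}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.1f}x the length of the audio")
```

This works out to 4.4x for the dictation method and 2.7x for typing in ExpressScribe. Keep in mind that neither figure includes the second pass of re-listening to catch errors, so a finished transcript will take longer than these ratios suggest.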
Now that I’ve tested this method, whenever I need to transcribe something, I will definitely use ExpressScribe.
That doesn’t mean I’ll never use the iPad’s Listen, Repeat it in Dictate mode. Dictate mode is a good fallback when I’m away from my computer but near an internet connection (Dictate mode requires a net connection). It works best when transcribing audio that’s on a device other than the iPad itself (such as a portable recorder or my LiveScribe pen). Switching back and forth between two iPad apps (for playback and dictation) would drive me bonkers.
How to do: Listen, Repeat it in Dictate Mode
This technique is described for an iOS device that has the Dictation feature. (Never set up the iPad or iPhone’s Dictate App? Go to Settings > General > Keyboard (scroll down for it), then switch on Dictation.)
(If you’re on a Mac running Mac OS X Mountain Lion (10.8), you also have Dictate capability on your computer.)
You need to have some sort of text-typing app. There’s the default Notes app. In my case, however, I use the Plaintext app.
- Set up an audio file for playback. I did this on my Mac in the QuickTime Player.
- Position the iPad close by so its microphone (built-in or external) is close to you.
- On the iPad, launch the Plaintext app, create a new text file, name it, and press Return.
- Tap the microphone key on the keyboard. An indicator pops up over the microphone key and displays the sound level as you speak, or as the iPad receives any audio input.
Next, you’ll do the Dictation Mimicry process. It requires a bit of a juggle between your computer and your iOS device:
- On the computer, play back a small bit of audio (a short excerpt that you can remember and repeat), then pause it again. (Note: tapping the space bar will play and pause the audio in QuickTime Player.)
- On your iPad (or iPhone), tap the microphone key to begin dictation.
- Repeat the words from the audio excerpt you just heard. Include punctuation, if necessary.
- Tap the microphone key to make the dictate function update the text.
- Continue the process until you have transcribed the interview.
When you dictate, speak punctuation out loud (after all, when you type, you type punctuation).
Make friends with period, comma, new line, new paragraph, colon, em-dash (that’s a long dash—often used to punctuate spoken word), dot-dot-dot, open parens, close parens, open quote, close quote. Here’s a complete list of punctuation shortcuts.
An example: What I said while dictating:
She said comma open quote Be glad period close quote
The transcribed result:
She said, “Be glad.”
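If it helps to see the idea spelled out, here is a toy sketch of how spoken punctuation words could be turned into symbols. This is not how Apple’s Dictation works internally (that happens on their servers and I have no view into it); the `COMMANDS` table and the `render_dictation` function are hypothetical, and only cover a few of the commands mentioned above.

```python
# Hypothetical table: spoken command -> (symbol, joins_left, joins_right).
# joins_left: no space before the symbol; joins_right: no space after it.
COMMANDS = {
    ("period",): (".", True, False),
    ("comma",): (",", True, False),
    ("colon",): (":", True, False),
    ("open", "quote"): ("\u201c", False, True),
    ("close", "quote"): ("\u201d", True, False),
    ("open", "parens"): ("(", False, True),
    ("close", "parens"): (")", True, False),
}


def render_dictation(spoken: str) -> str:
    """Replace spoken punctuation commands with symbols and fix spacing."""
    words = spoken.split()
    pieces = []
    glue_next = False  # True when the next piece must follow without a space
    i = 0
    while i < len(words):
        text, joins_left, joins_right, step = words[i], False, False, 1
        # Try two-word commands ("open quote") before one-word ones ("comma").
        for size in (2, 1):
            key = tuple(w.lower() for w in words[i:i + size])
            if len(key) == size and key in COMMANDS:
                text, joins_left, joins_right = COMMANDS[key]
                step = size
                break
        if pieces and not glue_next and not joins_left:
            pieces.append(" ")
        pieces.append(text)
        glue_next = joins_right
        i += step
    return "".join(pieces)


print(render_dictation("She said comma open quote Be glad period close quote"))
```

Running it on the example prints She said, “Be glad.” A toy like this will also convert the word “period” when you mean it as an ordinary word, which is one reason to proofread any dictated text.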
How I set up ExpressScribe
ExpressScribe allows you to play back audio and type what you hear. The two most important settings I used were adjustment of the playback speed to about 80% (because it’s easier to type all the words when the words are played slower), and a magic play audio command, Play (with Pausing).
ExpressScribe’s Play (with Pausing) command, located in the Control menu, plays about 5 seconds of audio, pauses, then loops back and catches a bit of what went before. It’s perfect for transcription.
This conceptual illustration shows what that style of playback is like. (I’m using screenshots from another audio software application, Audacity, to illustrate it. ExpressScribe does not look like what you see right here.)
When I type along, I generally have no problem with the first portion of audio, but if I miss anything, it’s at the end of that snippet. When playback pauses, I can catch up on my typing, and when it loops back a bit, I catch what I just missed. It’s very excellent; it just works.
Still, I had to get it set up for the right amount of skip-back time. In the Preferences (Mac) or Options (Windows), I went to the Playback pane and set the Auto Backstep on Stop (ms) to 500 ms, that is, 500 milliseconds. 1000 ms is a full second; 500 ms is a half-second. 500 milliseconds works for me; start there and adjust it higher or lower if you need to.
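For the curious, here is a rough model of that play, pause, and back-step pattern. It is only a sketch of the behavior as I describe it (about 5 seconds of playback, then a resume 500 ms earlier), not ExpressScribe’s actual algorithm; `playback_segments` and its parameter names are made up for illustration.

```python
def playback_segments(duration_s, play_s=5.0, backstep_s=0.5):
    """Yield (start, end) positions in seconds for each snippet played.

    Each snippet after the first starts backstep_s before the point
    where the previous snippet stopped, so the tail is heard twice.
    """
    if backstep_s < 0 or play_s <= backstep_s:
        raise ValueError("play_s must be longer than a non-negative backstep_s")
    start = 0.0
    while start < duration_s:
        end = min(start + play_s, duration_s)
        yield (start, end)
        if end >= duration_s:
            break
        start = end - backstep_s


# A 12-second clip with the settings described above.
for start, end in playback_segments(12.0):
    print(f"play {start:.1f}s to {end:.1f}s, then pause")
```

With a 12-second clip, the snippets run 0.0 to 5.0, 4.5 to 9.5, and 9.0 to 12.0 seconds. The half-second overlap is the part that lets you catch the words you missed at the end of the previous snippet.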
I also made a hot key that I could tap to invoke playback using the Play (With Pausing) command.
I adjusted the playback speed to approximately 80%. (I have noticed that I can vary the speed faster or slower by a few percentage points depending on the speed of the speaker. For instance, when transcribing my mother, I play back at 80%; my father’s speech cadence is a tad slower, so I play back at 83%.)
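Slowing the playback means the audio takes longer to get through, even before any pausing. A quick sketch of that arithmetic (the function name is mine, and it ignores pauses and back-steps):

```python
def wall_clock_minutes(audio_minutes: float, speed: float) -> float:
    """Minutes needed to play audio straight through at a given speed.

    speed is a fraction of normal speed, e.g. 0.80 for 80%.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    return audio_minutes / speed


for speed in (0.80, 0.83):
    minutes = wall_clock_minutes(10, speed)
    print(f"10 minutes of audio at {speed:.0%} speed plays in {minutes:.1f} minutes")
```

At 80%, a 10-minute clip takes 12.5 minutes just to play, so part of my 27-minute result is simply the slower playback.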
The free version of ExpressScribe supports a limited number of audio file formats—WAV, AIFF, MP3, DCT (a dictation format), and WMA. Be sure that your audio file is saved in one of those formats. (I usually work with AIFF or WAV files.)
Once you have ExpressScribe set up just so, you will be able to type the words and not reach for hot keys to manually skip back or pause or skip forward.
Other Speech-to-Text software
I described the tech and software I had on hand to get speech into text form. I know that the iPad’s Dictate (or Siri) is not the only automatic option for dictation.
I have not yet tested the Android speech-to-text functions that rely on Google Voice. Yet.
In the desktop computer world, the big player for software dictation is Nuance, with its Dragon suite of software products. The newest Mac version of Dragon Dictate works only on the latest MacOS; Nuance does not sell it on their site. You can get the current and older versions at Amazon and other retailers. (affiliate links)
- Dragon NaturallySpeaking Home (Windows 7, XP)
- Dragon Dictate, Version 3.0 (Mac OS X Mountain Lion 10.8)
- Dragon Dictate, Version 2, Mac [Works on MacOS 10.6, 10.7]
What about you? Have you used speech-to-text tech?
I know that there are many ways to get speech to text. I’ve described my experience. Do you use a different technique? Different software? Different devices? I’m all ears! Please describe it in the comments.
This is wonderful insight. I keep having people ask me about doing this. Thanks for doing the legwork on it.
While I think you type faster than I, I still like the idea of typing myself.
How well does the NCH software work without the footpedal? I had always thought that a footpedal was almost mandatory for this type of work and I’m pretty certain the free version of the NCH software does not support a pedal.
Hope all is well with you. We are doing well here.
Greetings, Richard… Glad to hear from you!
The method I describe on the free version works fine without a foot pedal. (have a post in semi-composed mode that goes deeper into context on this—this post was getting a little too long to go into wild detail)
Then again, I’ve been doing more transcribing yesterday. In cases where there’s a lot of conversational back and forth, playback pace needs to be slower, and there’s more of a need to rewind a lot to capture the overlapping sounds of the two speakers. (I took the liberty of saying, Aw, ferget it. I’ll play forward and listen for the particularly interesting bits, since the back and forth was the sound of both my mother and me talking over names listed on the 1940 census. booooring!) But for straightforward ask a question and listen to the answer style audio, the Play With Pausing sans footpedal works.
I conducted a training last year at the Torrance Library for an oral history project they’re doing. I covered the tech aspect of it. (Operating the Zoom Handy H1 recorder; how to use ExpressScribe to do transcription.) The ExpressScribe info in the post above was drawn from the research and writing I did for that session. The next post on ExpressScribe set up will go into a little more gory detail for working from start to finish. Am contemplating doing a screen capture movie, too.
Of course, tho I show screenshots of Windows (oops, next post will; they aren’t in the post above), I do all my work on the Mac. I’ll give it a whirl, and if you want to tell me what things I miss by virtue of being on the fruity platform, I’d appreciate it.
But seriously. Play (with Pausing) is the cat’s meow. Sliced bread. Awesomesauce. Epic Win. All the rage. (you get the idea.)