Why Long Form Transcription With AI Is Still a Far Cry

Beth Worthy

5/30/2017

The Remington 700 typewriter. VHS cassettes. Phone booths.

These are all things that have been replaced by modern technology. And if history tells us anything, it's that more tools, devices and even some jobs will soon become a thing of the past, in large part due to advances in artificial intelligence (AI).

Long-form transcription isn't likely to be one of them.

While technology has certainly improved the process of converting spoken words into written words, it still isn't quite smart enough to do it with everyday precision.

Sure, phones are smart enough to decipher simple instructions and answer a couple of questions. And there is some software available that does an OK job with voice dictation for documents.

But the task of accurately transcribing long strings of spoken words associated with actual human conversations is still left in the capable hands (ears, minds and eyes) of living, breathing human beings.

Here's why:

Technological advances take time

While much progress has been made in the area of improving and expanding the capabilities of artificial intelligence, there's still a long way to go.

And contrary to what you might believe, technological advances take time.

Experts say that the error rate of human transcription of conversational speech is about 4 percent. It's not by any means perfect, but it's not too bad, either. On the other hand, experts surmise that the error rate of all of the best AI systems (Google, IBM and Microsoft combined) trying to transcribe conversational speech would be about 8 percent.

And the best commercially available systems will deliver an error rate closer to 12 percent.

These error rates are better than they were five years ago, but even at their best they are still at least double that of humans. It will be a while before AI is ready to tackle transcription services with absolute precision.
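The figures above don't say how the error rate is measured. One common yardstick in speech recognition is word error rate (WER): the number of word substitutions, deletions and insertions needed to turn the machine's transcript into the correct one, divided by the number of words in the correct one. Here is a minimal Python sketch, assuming that is the metric in play. The function name and the sample sentences are made up for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one homophone mix-up in an eight-word sentence.
correct = "she threw the ball through the open window"
machine = "she through the ball through the open window"
print(word_error_rate(correct, machine))  # 0.125
```

If the quoted rates are per word, a 1,000-word conversation would come back with roughly 40 mistakes from a human, 80 from the combined best systems and 120 from a commercial product.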

Context is complicated

One of the problems with relying on AI for transcription services is that context matters--and it matters a lot.

A professional and highly trained human being is always going to be better than a computer at understanding the context in which something is being said--the difference between whey and way, deer and dear, through and threw, for example.

It's hard to keep conversations clean

No, this isn't about vulgar or profane language (although that can present its own problems). This is about background noise, wind, static and music, all of which create complications for AI-based transcription programs.

At least for now, people are better at focusing on what really matters. And, when they're not sure what was actually being said, they can figure it out by considering other factors, including context.

That's something AI simply isn't able to do ... yet.

People talk fast

Computers might be able to calculate the world's most complicated math problem in a millionth of a second, but for some reason they can't seem to catch up with the rate at which people speak.

Many people talk too fast for computers to keep up with their every word. This means that AI won't work for delivering clear, clean and consistent transcriptions of quick-moving conversations.

And the problem gets even worse when more than one speaker is involved in the conversation.

People have impediments

The Linguistic Society of America says that there are more than 6,900 languages currently spoken by people around the world--and even people who speak the same language don't pronounce words the same way.

Some people can't pronounce words properly because they have speech impediments. Some people choose not to pronounce words correctly because they are making a cultural statement, their pronunciations are influenced by family or friends, or they are taking artistic license.

The reason why people mispronounce words doesn't matter as much as the fact that they do it--and AI is not yet capable of recognizing the pattern or working out the intended word.

Conversations can be complicated

Period or exclamation point? Comma or semicolon? Who is actually speaking, anyway?

For computers, these relatively simple questions can be quite vexing--because conversations can be complicated.

Who is talking, what they are really trying to say and how excited they are: these are things computers can't quite keep up with in their current state. But humans can. Human transcribers understand lively discourse. They can determine who among many participants in the conversation is speaking.

And they can capture the proper emotion of the moment.

Computers can't do this ... yet.

Stick with real intelligence

A lot of progress has been made in the area of artificial intelligence and voice recognition--but it is probably too soon to rely on automated transcription.

Real intelligence is still the best way to ensure that you are insured against easily avoidable mistakes. Right? Right!

Beth Worthy is the Cofounder & President of GMR Transcription Services, Inc., a California-based company that has been providing accurate and fast transcription services since 2004. She has enjoyed nearly ten years of success at GMR, playing a pivotal role in the company's growth. Under Beth's leadership, GMR Transcription doubled its sales within two years, earning recognition as one of the OC Business Journal's fastest-growing private companies. Outside of work, she enjoys spending time with her husband and two kids.