Microsoft's voice recognition tech now better than teams of human transcribers

Matching human accuracy in speech transcription has been a research goal for the last 25 years

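The human benchmark was recently revised to a 5.1 per cent word error rate after other researchers measured it using a more involved process with multiple human transcribers.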

Phil Woodland, an information engineer at Cambridge uni who specialises in speech recognition and has worked on the same dataset before, told The Reg that "the error rates have come down significantly" since this problem was tackled in the early 2000s (using one 2004 telephone conversation dataset called RT-04, IBM researchers achieved an error rate of 15.2 per cent).

"After our transcription system reached the 5.9 percent word error rate that we had measured for humans, other researchers conducted their own study, employing a more involved multi-transcriber process, which yielded a 5.1 human parity word error rate", Microsoft said. As recognition accuracy approaches that of human transcribers, users should run into fewer transcription errors.
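For readers unfamiliar with the metric, word error rate is conventionally the number of word substitutions, deletions and insertions needed to turn the system's transcript into the reference transcript, divided by the number of words in the reference. The short Python sketch below illustrates that standard definition; it is not Microsoft's scoring code, and the function name `word_error_rate` and the plain whitespace tokenisation are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative only: real benchmark scoring also normalises text
    (case, punctuation, hesitations) before comparing transcripts.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra hypothesis word
                prev[j - 1] + cost,  # substitution, or match if cost is 0
            ))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    rate = word_error_rate(
        "we talked about sports and politics",
        "we talked about sport and politics",
    )
    print(f"{rate:.1%}")
```

In the usage example, one word out of six is wrong, so the rate is roughly 16.7 per cent. A 5.1 per cent rate corresponds to about one error in every 20 words.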

"While achieving a 5.1 percent word error rate on the Switchboard speech recognition task is a significant achievement, the speech research community still has many challenges to address, such as achieving human levels of recognition in noisy environments with distant microphones, in recognizing accented speech, or speaking styles and languages for which only limited training data is available", said Huang.

The test involves transcribing conversations between people discussing a range of topics, from sports to politics, though the conversations are relatively formal in nature. The achievement was made possible with a convolutional neural network combined with a bidirectional long short-term memory (LSTM) network. "Additionally, our approach to combine predictions from multiple acoustic models now does so at both the frame/senone and word levels", said Xuedong Huang, a technical fellow at Microsoft. Microsoft used deep learning software and cloud compute infrastructure to improve the system model.
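Huang's remark about combining predictions at the frame/senone level can be pictured with a small sketch. Microsoft has not said in this article how the combination is weighted, so the Python below simply averages, frame by frame, the senone probabilities produced by several hypothetical acoustic models. The function name `average_frame_posteriors` and the equal weighting are assumptions for illustration, not the company's method.

```python
def average_frame_posteriors(model_outputs):
    """Average per-frame senone probabilities across acoustic models.

    model_outputs: one entry per model; each entry is a list of frames,
    and each frame is a list of probabilities (one per senone).
    Equal weighting is an assumption made for this sketch.
    """
    if not model_outputs:
        raise ValueError("need at least one model output")
    n_models = len(model_outputs)
    n_frames = len(model_outputs[0])
    if any(len(frames) != n_frames for frames in model_outputs):
        raise ValueError("all models must score the same number of frames")

    combined = []
    for t in range(n_frames):
        # Collect every model's distribution for frame t.
        frame_from_each_model = [frames[t] for frames in model_outputs]
        # zip(*...) groups the probabilities for the same senone together.
        combined.append(
            [sum(probs) / n_models for probs in zip(*frame_from_each_model)]
        )
    return combined


if __name__ == "__main__":
    model_a = [[0.8, 0.2], [0.1, 0.9]]  # two frames, two senones
    model_b = [[0.4, 0.6], [0.3, 0.7]]
    for frame in average_frame_posteriors([model_a, model_b]):
        print([round(p, 3) for p in frame])
```

The idea is that models which make different mistakes can correct one another when their per-frame scores are pooled before the decoder picks words. Word-level combination, which Huang also mentions, happens later, on the candidate transcripts themselves.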

The company hopes that its computer system can move beyond just transcribing and learn to understand the meaning and intent behind speech.
