Electronic news service of the Ministry of Communications and Information Technologies

Microsoft's Speech Recognition AI has reached a new milestone


Computers are getting smarter and smarter, but we have never really seen them interpret and understand speech as well as we do. Can they really become that smart? Microsoft thinks so.
 
When humans transcribe a spoken conversation in a single pass, they miss an average of 5.9 percent of the words they hear, a measure known as the word error rate (WER).
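
For illustration, WER is conventionally the number of word substitutions, deletions and insertions needed to turn the recognized transcript into the reference transcript, divided by the number of reference words. The short sketch below (plain Python, with invented example sentences) shows one common way to compute it using a standard edit-distance table; it is an illustrative sketch, not Microsoft's evaluation code.

# Minimal WER sketch: word-level edit distance between a reference and a hypothesis.
# The example sentences are made up for illustration only.
def word_error_rate(reference, hypothesis):
    ref = reference.split()
    hyp = hypothesis.split()
    # d[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / len(ref)

print(word_error_rate("the cat sat on the mat", "the cat sat on a mat"))  # ~0.167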
 
According to a paper published on October 17th, 2016, Microsoft's team of engineers in the Artificial Intelligence and Research division reported that their system had reached the same WER. In other words, when asked to transcribe the same conversation, the computer is now roughly as accurate as a human.
 
"We've reached human parity," said Microsoft's chief speech scientist Xuedong Huang. "This is a historic achievement."
 
That feat is the fruit of decades of research. The previous milestone came in September, when the system scored a 6.3 percent WER. The improvement could give Microsoft an edge in digital assistants, with Windows, Cortana and Xbox all standing to benefit.
 
"Even five years ago, I wouldn't have thought we could have achieved this. I just wouldn't have thought it would be possible," said Harry Shum, the executive VP who heads the Microsoft Artificial Intelligence and Research group.
 
The team trained the AI using deep neural networks fed with huge amounts of data. These collections, called training sets, let the researchers teach the machine to recognize patterns in human input. The training material was not limited to plain text: sounds and images were also used so that the system could make more efficient use of what it had learned.
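
The article does not describe Microsoft's actual architecture, so the following is only a minimal, hypothetical sketch of the general idea: a small feed-forward network (written here with PyTorch, assuming it is installed, and trained on randomly generated stand-in "acoustic feature" vectors) learning to map input patterns to labels, which is the basic mechanism a training set drives.

# Hypothetical sketch only: a tiny classifier trained on random stand-in data,
# illustrating how a training set teaches a network to map inputs to labels.
# It is not Microsoft's system or architecture.
import torch
import torch.nn as nn

num_features, num_classes, num_examples = 40, 10, 1000

# Stand-in "training set": random feature vectors with random labels.
inputs = torch.randn(num_examples, num_features)
labels = torch.randint(0, num_classes, (num_examples,))

# A small feed-forward network.
model = nn.Sequential(
    nn.Linear(num_features, 128),
    nn.ReLU(),
    nn.Linear(128, num_classes),
)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(20):
    optimizer.zero_grad()
    loss = loss_fn(model(inputs), labels)  # how wrong the current predictions are
    loss.backward()                        # gradients of the loss w.r.t. the weights
    optimizer.step()                       # adjust the weights to reduce the loss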
 
The result is clear: the team has built an AI that understands human speech as well as we do. Still, the researchers see the system as far from perfect. It is as good as humans, and humans are far from flawless.
 
With another smart computer in the world, the possibilities for AI have just grown wider. Moving forward, the team at Microsoft hopes to achieve even better scores and to ensure that speech recognition holds up in real-world situations.
 
Computers equipped with even basic sound input hardware can interpret human speech. The problem lies in the conditions of everyday life, such as a crowded street, a noisy restaurant or harsh weather with strong winds.
 
Furthermore, the team wants the technology to assign names to individual speakers when multiple people are talking, and to make sure it works well with a wide variety of voices, regardless of age, accent or ability.
 
Humans have a long way to go before creating a computer that truly mimics a human. But with deep neural networks that let computers process information somewhat like a human brain, we are opening the possibility that the computers of the future could stand in for us. We are still a long way off, but we can glimpse a world where humans no longer have to understand computers in order to make computers understand us.
 
The research milestone comes after decades of work in speech recognition. It all started in the early 1970s with DARPA, the U.S. agency tasked with making technology breakthroughs in the interest of national security. Over the decades, many tech companies and research organizations have joined the effort.





19/10/16