
Historic Achievement: Microsoft researchers reach human parity in conversational speech recognition

GenesisNemesis

Say goodbye to a bunch of jobs.

http://blogs.microsoft.com/next/201...recognition/#sm.0000001afb2827d06sxul74kr3xwe

Microsoft has made a major breakthrough in speech recognition, creating a technology that recognizes the words in a conversation as well as a person does.

In a paper published Monday, a team of researchers and engineers in Microsoft Artificial Intelligence and Research reported a speech recognition system that makes the same or fewer errors than professional transcriptionists. The researchers reported a word error rate (WER) of 5.9 percent, down from the 6.3 percent WER the team reported just last month.

The 5.9 percent error rate is about equal to that of people who were asked to transcribe the same conversation, and it’s the lowest ever recorded against the industry standard Switchboard speech recognition task.

“We’ve reached human parity,” said Xuedong Huang, the company’s chief speech scientist. “This is an historic achievement.”
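For anyone unfamiliar with the metric, word error rate is usually computed as the word-level edit distance between the reference transcript and the system output (substitutions + deletions + insertions), divided by the number of reference words. The sketch below follows that common definition. The lowercasing and whitespace tokenization are simplifying assumptions, the function name is made up for illustration, and the exact scoring rules used for the Switchboard result are not given in the article.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: lowercase + whitespace split stands in for real text normalization.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (one fewer than the current row) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words against an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1] / len(ref)


# Made-up sentence pair: one substitution ("the" heard as "a") in six words.
rate = word_error_rate("the cat sat on the mat", "the cat sat on a mat")
```

Because insertions count as errors, this ratio can exceed 1.0 when the hypothesis is much longer than the reference.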
 
Say goodbye to a bunch of jobs.

Not to hijack your thread... but on the topic of job preservation in the face of technological advances, I say, "farewell and good riddance to obsolete jobs that have lost their relevance in society"

Medical advances have reduced the death rate to a point where undertaker and grave digger jobs have been radically lost around the modern world. Good. Find a job in construction.

The automobile has put countless stagecoach drivers out of business. Get your taxi license and move on

The autonomous vehicle threatens the jobs of over a million truck drivers across the country. Guess what group is lobbying the hardest against legislation around this?

If job preservation was a legitimate driving force against advancement, we would all be toiling in manual labor (on a farm, probably) to the ripe old age of 45.
 
6.3% is an awfully high error rate for humans. I suspect they are cheating with test samples to reach this "record".
Computers tend to do relatively well on hard samples where people make a lot of errors. The problem is that the error rate does not go down on samples that are easy for humans. It's the same thing with face recognition: good on hard samples and not very good on easy ones. So the overall error rate could be the same while the error distributions are very different.
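To make the point about distributions concrete, here is a toy sketch. All the counts are invented purely for illustration (they are not from the Microsoft paper or any real test set), and the "easy"/"hard" buckets are a hypothetical split. It only shows that two transcribers can share one overall error rate while erring in very different places.

```python
# Hypothetical (errors, words) counts per difficulty bucket, invented for illustration.
human = {"easy": (10, 800), "hard": (53, 200)}
machine = {"easy": (40, 800), "hard": (23, 200)}


def overall_rate(buckets):
    """Total errors divided by total words across all buckets."""
    errors = sum(e for e, _ in buckets.values())
    words = sum(w for _, w in buckets.values())
    return errors / words


def per_bucket_rate(buckets):
    """Error rate computed separately for each bucket."""
    return {name: e / w for name, (e, w) in buckets.items()}


# Both overall rates come out to 63 errors in 1000 words,
# yet the per-bucket breakdowns differ sharply.
human_overall, machine_overall = overall_rate(human), overall_rate(machine)
human_split, machine_split = per_bucket_rate(human), per_bucket_rate(machine)
```

With these made-up numbers the machine is worse than the human on easy samples and better on hard ones, even though the headline figure is identical.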
 
6.3% is an awfully high error rate for humans. I suspect they are cheating with test samples to reach this "record".
Computers tend to do relatively well on hard samples where people make a lot of errors. The problem is that the error rate does not go down on samples that are easy for humans. It's the same thing with face recognition: good on hard samples and not very good on easy ones. So the overall error rate could be the same while the error distributions are very different.

“The next frontier is to move from recognition to understanding,” Zweig said.
...
“It will be much longer, much further down the road until computers can understand the real meaning of what’s being said or shown,” Shum said.

The 5.9 percent error rate is about equal to that of people who were asked to transcribe the same conversation ...
it means that the error rate – or the rate at which the computer misheard a word like “have” for “is” or “a” for “the” – is the same as you’d expect from a person hearing the same conversation. ...

I wonder what kinds of words the errors occurred in. The computer is probably very good at shorter words that make up most of the structure of a sentence. Whereas people tend to fill these in and remember the longer and less commonly used words that carry the meaning. So when asked to transcribe spoken verse verbatim the error count might favor the savant-like computer. If so then there certainly is a long way to go until computers understand meaning.
 
I don't know what "transcribing" entails in this case. If it is done sentence by sentence, then it requires good short-term memory from people, which makes this whole exercise a clear case of cheating. I am not a native speaker, but I can "transcribe" a whole hour of a late show without a single mistake.
OK, with Kimmel it is a bit harder to reach 100% because he mumbles a lot, but Conan, Colbert and other speakers of regular American English are just fine. So this 6.3% looks very suspicious to me.
I think what they have here is a large collection of different people speaking with different accents, including rare and hard ones. And the human 6.3% comes from these weird accents that are unfamiliar to the transcribers. A computer, on the other hand, can easily be trained on different accents and therefore performs much better on the hard ones. Speaking of weird/hard accents and humans: I don't have a heavy accent and most Americans would understand me fine; one guy even thought I was from America, which was pretty ridiculous. But there were a few cases where Americans had real trouble with my English. I think this tells you that people vary greatly in their abilities, and the 6.3% error rate includes all that variation.

In general, a computer's error rate is distributed more uniformly: it more often recognizes a word that a human would have trouble with, but it also makes a lot of dumb mistakes where a human would never make one. That tells you that a computer usually has very low confidence (compared to humans) in "its" recognition. It masks this low confidence with other tricks, like using context. Try making a computer recognize "Deal Clinton" or "Nichael Jackson". Humans use context too, but they have much more confidence in their raw, context-free recognition than computers do.

In short, I think that despite proclamations of exceeding humans, they have not even reached the Uncanny Valley.
 
Say goodbye to a bunch of jobs.

Not to hijack your thread... but on the topic of job preservation in the face of technological advances, I say, "farewell and good riddance to obsolete jobs that have lost their relevance in society"

Medical advances have reduced the death rate to a point where undertaker and grave digger jobs have been radically lost around the modern world. Good. Find a job in construction.

The automobile has put countless stagecoach drivers out of business. Get your taxi license and move on

The autonomous vehicle threatens the jobs of over a million truck drivers across the country. Guess what group is lobbying the hardest against legislation around this?

If job preservation was a legitimate driving force against advancement, we would all be toiling in manual labor (on a farm, probably) to the ripe old age of 45.
I generally agree. But in many cases these obsolete jobs are "replaced" with ridiculous ones and I am not sure which is worse.
 
... I am not a native speaker, but I can "transcribe" a whole hour of a late show without a single mistake. ...

I'm amazed and jealous, and I am a native New Englander. I tried to take notes on what the candidates said during the first debate and had a real hard time even keeping up. And afterwards I needed a cryptologist to decipher my handwriting. :o
 
... I am not a native speaker, but I can "transcribe" a whole hour of a late show without a single mistake. ...

I'm amazed and jealous, and I am a native New Englander. I tried to take notes on what the candidates said during the first debate and had a real hard time even keeping up. And afterwards I needed a cryptologist to decipher my handwriting. :o
I did not mean I could do it in real time. Using recorded video and taking a lot of time I can "transcribe" it without any errors.
 
Say goodbye to a bunch of jobs.

Not to hijack your thread... but on the topic of job preservation in the face of technological advances, I say, "farewell and good riddance to obsolete jobs that have lost their relevance in society"

Medical advances have reduced the death rate to a point where undertaker and grave digger jobs have been radically lost around the modern world. Good. Find a job in construction.

The automobile has put countless stagecoach drivers out of business. Get your taxi license and move on

The autonomous vehicle threatens the jobs of over a million truck drivers across the country. Guess what group is lobbying the hardest against legislation around this?

If job preservation was a legitimate driving force against advancement, we would all be toiling in manual labor (on a farm, probably) to the ripe old age of 45.

Human advancement is not some magic from the gods.

There is not just one way to advance.

There can be the advancement dictated by tyrants.

What we have today.

Or there can be sane advancement that does not destroy the species in a few generations.
 
Say goodbye to a bunch of jobs.

http://blogs.microsoft.com/next/201...recognition/#sm.0000001afb2827d06sxul74kr3xwe

Microsoft has made a major breakthrough in speech recognition, creating a technology that recognizes the words in a conversation as well as a person does.

In a paper published Monday, a team of researchers and engineers in Microsoft Artificial Intelligence and Research reported a speech recognition system that makes the same or fewer errors than professional transcriptionists. The researchers reported a word error rate (WER) of 5.9 percent, down from the 6.3 percent WER the team reported just last month.

The 5.9 percent error rate is about equal to that of people who were asked to transcribe the same conversation, and it’s the lowest ever recorded against the industry standard Switchboard speech recognition task.

“We’ve reached human parity,” said Xuedong Huang, the company’s chief speech scientist. “This is an historic achievement.”
No, it's a mistake. The guy said "59 percent error rate" but the speech recognition system transcribed it as "5.9". An honest mistake.

There you have it, in a nutshell.
EB
 
I'm amazed and jealous, and I am a native New Englander. I tried to take notes on what the candidates said during the first debate and had a real hard time even keeping up. And afterwards I needed a cryptologist to decipher my handwriting. :o
I did not mean I could do it in real time. Using recorded video and taking a lot of time I can "transcribe" it without any errors.

I'm betting, considering the reasons for building speech recognition tools like this, they're talking about real-time transcription.

If it takes 15 minutes and 43 tries to understand that I asked for "tea, Earl Grey, hot", is that a meaningful '0%' error rate?
 
I did not mean I could do it in real time. Using recorded video and taking a lot of time I can "transcribe" it without any errors.

I'm betting, considering the reasons for building speech recognition tools like this, they're talking about real-time transcription.

If it takes 15 minutes and 43 tries to understand that I asked for "tea, Earl Grey, hot", is that a meaningful '0%' error rate?
I understand, but real-time human transcription mistakes are not due to recognition itself; they are due to the fact that it has to be done in real time. An ordinary person cannot transcribe in real time.
 
6.3% is an awfully high error rate for humans. I suspect they are cheating with test samples to reach this "record".
Computers tend to do relatively well on hard samples where people make a lot of errors. The problem is that the error rate does not go down on samples that are easy for humans. It's the same thing with face recognition: good on hard samples and not very good on easy ones. So the overall error rate could be the same while the error distributions are very different.

I don't think so. Humans suck. We often think we hear and understand correctly when we don't. How often is there a conflict between two people disagreeing on if someone misspoke or misheard?
 
6.3% is an awfully high error rate for humans. I suspect they are cheating with test samples to reach this "record".
Computers tend to do relatively well on hard samples where people make a lot of errors. The problem is that the error rate does not go down on samples that are easy for humans. It's the same thing with face recognition: good on hard samples and not very good on easy ones. So the overall error rate could be the same while the error distributions are very different.

I don't think so. Humans suck. We often think we hear and understand correctly when we don't. How often is there a conflict between two people disagreeing on if someone misspoke or misheard?
Not very often, certainly not 6.3% of the time.
 
I don't think so. Humans suck. We often think we hear and understand correctly when we don't. How often is there a conflict between two people disagreeing on if someone misspoke or misheard?
Not very often, certainly not 6.3% of the time.

Lol. Well... it is my job to facilitate communication in large teams of people. It's what I do for a living all day. There are studies on this. 55% of all defects found in software are due to faulty (most often vague) requirements, i.e. a misunderstanding of how something should be interpreted. And these are intelligent, trained people who are aware of communication problems in teams. They still fuck it up distressingly often.

I suggest you go and do a little research of your own on this topic. Humans are awful at this.
 
Not very often, certainly not 6.3% of the time.

Lol. Well... it is my job to facilitate communication in large teams of people. It's what I do for a living all day. There are studies on this. 55% of all defects found in software are due to faulty (most often vague) requirements, i.e. a misunderstanding of how something should be interpreted. And these are intelligent, trained people who are aware of communication problems in teams. They still fuck it up distressingly often.

I suggest you go and do a little research of your own on this topic. Humans are awful at this.

But those are errors in interpretation, which go beyond just getting the spoken words right. Even if a person accurately hears every word, they often make interpretation errors, and a single misheard word out of 100 (a 1% error rate) can mean massive interpretation errors. A 6.3% error rate means that the person would be mishearing about 1 word in every sentence, not even considering whether they interpret the set of words correctly. I doubt there is any research showing it's that high.
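The "about 1 word in every sentence" figure can be checked with a quick back-of-the-envelope sketch. The 16-word sentence length is an assumption picked for illustration, and treating the error rate as an independent per-word probability is a simplification (a real word error rate also counts inserted words). Both function names are hypothetical.

```python
def expected_errors(words_in_sentence: int, error_rate: float) -> float:
    """Expected number of misheard words in one sentence."""
    return words_in_sentence * error_rate


def prob_at_least_one_error(words_in_sentence: int, error_rate: float) -> float:
    """Chance a sentence contains at least one error.

    Assumes each word is misheard independently with probability `error_rate`.
    """
    return 1.0 - (1.0 - error_rate) ** words_in_sentence


SENTENCE_LENGTH = 16  # assumed sentence length, not taken from any study
per_sentence = expected_errors(SENTENCE_LENGTH, 0.063)
any_error = prob_at_least_one_error(SENTENCE_LENGTH, 0.063)
```

Under these assumptions the expected count is roughly one error per sentence, and somewhat under two thirds of sentences would contain at least one error.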

Another issue is that humans are less likely to make errors on the most important words. Humans focus their attentional resources on the words that are most important for the meaning (e.g., subject, verb, object). Their errors are likely to be disproportionately on the minor words and articles (the, a, an, etc.). Also, humans use the contextual meaning to "guess" when a word wasn't accurately heard, so their errors while not the exact word will be "close" in terms of deeper meaning. In contrast, computers will make random errors, which means more errors than humans on key words, and the computer "guesses" will be way off in meaning since they "guess" based on semantic overlap rather than contextual meaning.
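One way to test the claim about which words the errors land on would be to split substitution errors by word type. The sketch below is only an illustration: the function-word list is a tiny hand-picked assumption, the helper name is hypothetical, and the third example pair is invented. The first two pairs are the "have"/"is" and "a"/"the" confusions mentioned in the quoted article.

```python
# Assumption: a small hand-picked list; a real analysis would use a fuller lexicon.
FUNCTION_WORDS = {"a", "an", "the", "is", "are", "have", "has",
                  "of", "to", "in", "on", "and"}


def split_errors(substitutions):
    """Count substitution errors by whether the reference word is a function word.

    `substitutions` is an iterable of (reference_word, recognized_word) pairs.
    """
    counts = {"function": 0, "content": 0}
    for ref_word, _recognized in substitutions:
        kind = "function" if ref_word.lower() in FUNCTION_WORDS else "content"
        counts[kind] += 1
    return counts


# First two pairs echo the article's examples; the last one is made up.
example = [("have", "is"), ("a", "the"), ("Clinton", "clinic")]
breakdown = split_errors(example)
```

Comparing this breakdown for human and machine transcripts of the same audio would show whether the two really err on different kinds of words, which a single overall error rate cannot.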
 