
Historic Achievement: Microsoft researchers reach human parity in conversational speech recognition

Lol. Well... it is my job to facilitate communication in large teams of people. It's what I do for a living all day. There are studies on this: 55% of all defects found in software are due to faulty (most often vague) requirements, i.e. a misunderstanding of how something should be interpreted. And these are intelligent, trained people who are aware of communication problems in teams. They still fuck it up distressingly often.

I suggest you go and do a little research of your own on this topic. Humans are awful at this.

But those are errors in interpretation, which go beyond just getting the spoken words correct. Even if a person accurately hears every word, they often make interpretation errors, and a single misheard word out of 100 (a 1% error rate) can mean massive interpretation errors. A 6.3% error rate means the person would be mishearing about one word in every sentence, before even considering whether they interpret the set of words correctly. I doubt there is any research showing it's that high.
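
To make that arithmetic concrete, here's a rough back-of-the-envelope sketch in Python (the ~16-word average sentence length is my own assumption for illustration, not a figure from the study):

```python
# Rough expected number of misheard words per sentence at a given word error rate.
# The 16-word average sentence length is an assumption for illustration only.
def expected_misses_per_sentence(word_error_rate, words_per_sentence=16):
    return word_error_rate * words_per_sentence

for wer in (0.01, 0.063):
    print(f"WER {wer:.1%}: ~{expected_misses_per_sentence(wer):.2f} misheard words per sentence")
# WER 1.0%: ~0.16 misheard words per sentence
# WER 6.3%: ~1.01 misheard words per sentence
```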

Another issue is that humans are less likely to make errors on the most important words. Humans focus their attentional resources on the words that matter most for the meaning (e.g., subject, verb, object), so their errors are likely to fall disproportionately on minor words and articles (the, a, an, etc.). Also, humans use contextual meaning to "guess" when a word wasn't accurately heard, so their errors, while not the exact word, will be "close" in terms of deeper meaning. In contrast, computers make more random errors, which means more errors than humans on key words, and the computer's "guesses" will be way off in meaning, since they "guess" based on semantic overlap rather than contextual meaning.

I think we use context to help us. I lead teams across cultures. When people from different cultures aren't used to each other, they can barely speak a sentence without needing clarification. And this is when both are fluent and both work in the same specialisation in a specialised industry.

Take sight. I think we assume the hell out of what people are seeing. And that's how our brains work. When I studied human-computer interaction I looked at pictures showing representations of the amount of information that the eye actually picks up. Most of what we think we see we don't actually see; it's the brain filling in the gaps. Hearing works the same way.

Humans have a huge head start on computers because we are humans. We share the same kinds of limitations as whoever is speaking, so our powers of assumption will be pretty good. But a lot of that is just guesswork, and I think we should be humble about it. If we were in conversation with a more intelligent non-human being, I highly doubt we would be able to tell that they are more intelligent than us, simply because we'd be searching for specifically human cues. Computers would need to simulate that, which is hard when we still don't really know how the brain works.
 
I agree with pretty much all of that, but that relates to my second point, which is that the human use of context helps us fill in gaps with reasonable guesses that get us close to the intended meaning. Computers don't do this well, so their "guesses" are more likely to be way off or just incoherent nonsense. So it isn't just the statistical error rate that matters, but the qualitative nature of the type of errors being made and how much they alter the intended meaning.
 
not very often, certainly not 6.3% of time.

Lol. Well... it is my job to facilitate communication in large teams of people. It's what I do for a living all day. There are studies on this: 55% of all defects found in software are due to faulty (most often vague) requirements, i.e. a misunderstanding of how something should be interpreted. And these are intelligent, trained people who are aware of communication problems in teams. They still fuck it up distressingly often.

I suggest you go and do a little research of your own on this topic. Humans are awful at this.
I did, and found that under normal circumstances I make no mistakes in voice recognition, hence my problem with the 6.3% number.
 
That seems to be a completely different animal. There is a huge difference between correctly identifying each word in a string of words and understanding the intent of the person speaking them.

Example:
In the sentence "I want you to make the selection process fair," the words are quite clear; however, that does not help in understanding what is meant by "fair".
 

In an ironic twist, DrZoidberg's post not only expresses his point but illustrates it too :)
You have to give him credit for that.
 
 
That seems to be a completely different animal. There is a huge difference between correctly identifying each word in a string of words and understanding the intent of the person speaking them.

Example:
In the sentence "I want you to make the selection process fair," the words are quite clear; however, that does not help in understanding what is meant by "fair".

I'd argue that humans don't understand that either. It's 100% pure projection. We love using these kinds of empty words, which are all about emotional bonding but contain no information. Especially pertinent in election times. Words like justice, freedom, making better, helping, peace and, in Trump's case, consensual pussy grabbing.

What humans do in discussions like this is nod vigorously and smile. What they should be saying is "error, error, does not compute, please rephrase the statement".
 

:D

It seems that we have another example here. I assume that I correctly read the word "argue" here. Did you mean "agree"? :)

Transcription is only about correctly identifying individual words, not meaning. This is what the "breakthrough" program is doing.

The real work is still ahead: interpreting the meaning the speaker intended. Humans frequently have difficulty doing this, so it is going to be a major challenge to develop a program that lets a computer do what humans themselves struggle with.
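
For what it's worth, the error rate being discussed is, as far as I know, word error rate: word-level substitutions, insertions and deletions against a reference transcript, divided by the number of reference words. A minimal sketch of how such a score would be computed (the example sentences are made up, reusing the "selection process fair" line from above):

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between first i reference words and first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution or match
    return dp[-1][-1] / len(ref)

# Hypothetical example: one substitution in a nine-word reference -> ~11% WER
print(word_error_rate("i want you to make the selection process fair",
                      "i want you to make the election process fair"))
```

Note that a score like this says nothing about whether the substituted word wrecks the meaning or not, which is the point being made above.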
 
Human hit rates are terrible too, as any misheard lyrics database would attest.

I'd be interested in a study asking humans who are hearing Bohemian Rhapsody for the first time to transcribe the lyrics. I'd expect it to be better than Google, but not by much...

And I'd really like to see the Microsoft tech at work on the same task.
 
At least humans get better after repeated listening. Computers will not get better; in fact, they already repeat the process many times during recognition.

My understanding is that this Microsoft "achievement", even if we set aside the comparison methodology, is very incremental with respect to the previous one. I suspect that they are in fact comparing against humans transcribing in real time, which is a totally unfair comparison.
 
At least humans get better after repeated listening. Computers will not get better; in fact, they already repeat the process many times during recognition.

That's not true - machine learning by definition can improve over time. I know that the OCR program I built back during my university days definitely improved after multiple trials. There's a reason why Captchas have moved from text recognition of warped characters with squiggly lines to semantic tests like 'check all of the street signs' or 'pictures with lakes'.
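
As a toy illustration of "improves over trials" (nothing to do with Microsoft's actual system - just a bare-bones perceptron on made-up data whose accuracy typically climbs as it sees more labelled examples):

```python
import random

random.seed(0)

# Made-up, linearly separable toy data: label is 1 when x + y > 1.0, else 0.
points = [(random.random(), random.random()) for _ in range(200)]
data = [((x, y), 1 if x + y > 1.0 else 0) for x, y in points]

w = [0.0, 0.0]   # weights
b = 0.0          # bias
lr = 0.1         # learning rate

def predict(x, y):
    return 1 if w[0] * x + w[1] * y + b > 0 else 0

for epoch in range(1, 6):
    for (x, y), label in data:
        error = label - predict(x, y)   # -1, 0, or +1
        w[0] += lr * error * x
        w[1] += lr * error * y
        b += lr * error
    accuracy = sum(predict(x, y) == label for (x, y), label in data) / len(data)
    print(f"epoch {epoch}: accuracy {accuracy:.0%}")
```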

My understanding is that this Microsoft "achievement", even if we set aside the comparison methodology, is very incremental with respect to the previous one.

Not sure why this is necessarily something to dismiss. I know many people who fail to display such incremental improvements, let alone yuge ones...
 
That's not true - machine learning by definition can improve over time. I know that the OCR program I built back during my university days definitely improved after multiple trials. There's a reason why Captchas have moved from text recognition of warped characters with squiggly lines to semantic tests like 'check all of the street signs' or 'pictures with lakes'.
Speech recognition does not learn that way. I mean the algorithm does not change over time. And humans don't learn that way either when they listen to the same audio a few more times. It takes much more than that to learn something. When humans listen to the audio repeatedly they simply get more time to process it.
My understanding is that this Microsoft "achievement", even if we set aside the comparison methodology, is very incremental with respect to the previous one.

Not sure why this is necessarily something to dismiss. I know many people who fail to display such incremental improvements, let alone yuge ones...
I explained why I dismiss it. They created a false impression, and did so intentionally, and their comparison is totally unfair because they gave humans a huge handicap and forgot to mention that fact.
 

What do you mean speech recognition doesn't learn that way? It learns however the system is designed to learn. To wit: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tasl-deng-2244083-x_2.pdf

Now you could say, as the whitepaper mentions, that current commercialized solutions like Siri don't work that way - but that's not applicable to all machine learning approaches to speech recognition as a class. And when humans listen to audio repeatedly they don't get more time to process it, as long as the playback speed hasn't been modified. They get more trials in which to refine their initial recognition. This is why people who have listened to Blinded by the Light for 40 years still think he's saying 'wrapped up like a douche' even though there's no lack of hearing the song.
 
If computers use statistics for this task, and I know they do, then it would be a simple programming task to analyze the rate of misses and hits by cue (break, up-sweep, diminish, etc.) and use a simple Bayesian expert to increase the probability of a hit. It's called a tuning algorithm. Other tactics include memory comparisons and sorts, etc. Computers do these well.
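
If I'm reading the "simple Bayesian expert" idea right, it amounts to keeping running hit/miss counts per cue and updating the hit probability as evidence comes in. A minimal sketch of that, with invented cue names and observations:

```python
# Minimal Beta-Bernoulli "tuning" sketch: per-cue hit probabilities updated from
# observed hits and misses. Cue names and observations are invented examples.
class CueTuner:
    def __init__(self, prior_hits=1, prior_misses=1):
        # Beta(1, 1) prior == uniform; counts accumulate as evidence arrives.
        self.prior = (prior_hits, prior_misses)
        self.counts = {}

    def observe(self, cue, hit):
        hits, misses = self.counts.get(cue, (0, 0))
        self.counts[cue] = (hits + (1 if hit else 0), misses + (0 if hit else 1))

    def hit_probability(self, cue):
        hits, misses = self.counts.get(cue, (0, 0))
        a, b = self.prior
        return (hits + a) / (hits + misses + a + b)   # posterior mean of the Beta

tuner = CueTuner()
for cue, hit in [("break", True), ("break", True), ("break", False),
                 ("up_sweep", False), ("up_sweep", True), ("diminish", True)]:
    tuner.observe(cue, hit)

for cue in ("break", "up_sweep", "diminish"):
    print(cue, round(tuner.hit_probability(cue), 2))
```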

As for humans, well. They have about a dozen processors going at once, depending on the content and context (usually both), with noise and approximation processors running, and they have capabilities for comparison that are beyond those of computers right now, since they actually work like a selectively wet sandbox with electrodes being stimulated. As an analogue, I've observed the behavior of pilots learning complex operational tasks. The so-called golden arms process such tasks at rates beyond the calculated ability, given the equipment and the elements of the tasks, to do that work. They get there by a combination of practice, pressure and natural attributes. It's a bit like a biker learning to handle a bike on a new trail. The starting point here is that about 80% of people can't do it.

To put something like language into an algorithm box is a bit like betting against the house. You might succeed once in a while, but ...
 
At least humans get better after repeated listening. Computers will not get better; in fact, they already repeat the process many times during recognition.

That's not true - machine learning by definition can improve over time. I know that the OCR program I built back during my university days definitely improved after multiple trials. There's a reason why Captchas have moved from text recognition of warped characters with squiggly lines to semantic tests like 'check all of the street signs' or 'pictures with lakes'.
...

I think "get better after repeat listening" means something completely different than "learning over time". The latter case involves looking at many different samples rather than the same sample over and over again.
 

"Gets better with repeated listening" maps to my allusion about listening to the same song, where it's one person recording in a studio. It's empirically verified that humans can listen repeatedly and not improve (unless your semantic reference pool contains things called Deuce Coupes which are capable of being revved).

"Learning over time" would map to listening to English speakers with accents over time, which, based on the Microsoft whitepaper, is better than the currently employed HMM methods for speech recognition.

It's strange to me to refer to each of them the other way around, but if you'd prefer it that's fine. How is my point weaker if you swap the referent terms?
 

What do you mean speech recognition doesn't learn that way? It learns however the system is designed to learn. To wit: https://www.microsoft.com/en-us/research/wp-content/uploads/2016/02/tasl-deng-2244083-x_2.pdf

Now you could say, as the whitepaper mentions, that current commercialized solutions like Siri don't work that way - but that's not applicable to all machine learning approaches to speech recognition as a class. And when humans listen to audio repeatedly they don't get more time to process it, as long as the playback speed hasn't been modified.
You expect me to read that paper? And no, people don't learn anything when they listen to a song a few more times. They simply give themselves more time to process the parts they did not get in previous runs.
They get more trials in which to refine their initial recognition. This is why people who have listened to Blinded by the Light for 40 years still think he's saying 'wrapped up like a douche' even though there's no lack of hearing the song.
 
I think "get better after repeat listening" means something completely different than "learning over time". The latter case involves looking at many different samples rather than the same sample over and over again.
Exactly!
 