ronburgundy
Contributor
I heard about an interesting follow-up to last year's paper in Science showing the replication problems within psychological research.
You might recall about a year ago when an article in Science reported a huge replication project where a group of psychologists took over 100 published results and tried to replicate them. In Social Psychology (which studies how people impact each other's thought processes and how people think about other people) the replication rate was only 25%. In Cognitive Psychology (which studies human thought, reasoning, and problem solving more generally) the replication rate was about 50%.
That got all the headlines, but lost in the hubbub was that the picture was less gloomy when they combined the data from the original and the new replication studies. Then, the same significant results as in the original study alone emerged about 70% of the time. IOW, even though most replications didn't get a significant result on their own, they got patterns of results that were mostly consistent with the original study, so the significant result remained most of the time when you combined the two data sets.
This suggested that the lack of replication was not due to fraud or wrong theories so much as weak methodologies and sample sizes that were too small to detect real effects even when present.
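You can see the sample-size problem in a quick simulation. This is just a sketch with made-up illustrative numbers (a true effect of 1/4 of a standard deviation, 30 subjects per group), not anything from the actual paper:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def run_study(true_d, n_per_group):
    """Simulate one two-group study; return its p-value."""
    control = rng.normal(0.0, 1.0, n_per_group)
    treatment = rng.normal(true_d, 1.0, n_per_group)
    _, p = stats.ttest_ind(treatment, control)
    return p

# A real but small effect (d = 0.25) with 30 subjects per group:
# how often does the study actually reach p < .05?
n_sims = 5000
hits = sum(run_study(0.25, 30) < 0.05 for _ in range(n_sims))
power = hits / n_sims
print(f"power with n=30 per group and true d=0.25: {power:.2f}")
```

The simulated power comes out well under 50%, i.e., most of these studies "fail to replicate" a perfectly real effect simply because the sample is too small to detect it reliably.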
The talk I just heard follows up on this idea and uses some complex data simulations to show that part of the problem is the strong bias in psychology to only publish papers with significant effects and to reject studies where the null hypothesis is retained. The rationale behind this has always been that a null result cannot be interpreted as good evidence against a theory because there are too many possible errors in sampling or measurement that might lead to a null result.
But a side-product of this is that the effects that do get published will, just by chance, tend to be inflated relative to the true size of the effect (for example, the experimental condition might come out 1 standard deviation above the control group when the true effect is only 1/4 of a standard deviation). Since most studies use small samples due to practical constraints, only when the effect in a given study happened to be inflated for some reason will it reach statistical significance and then get published. When that replication project designed their studies, they used the size of the effects in the original studies to estimate how many subjects they would need to be sure they would have the statistical power to detect the effect if it occurred in their study. But since those effect sizes were inflated, this led to an underestimate of the number of subjects they needed in the replication studies.
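The same kind of simulation shows the inflation directly. Here's a sketch (again with made-up numbers, not the talk's actual simulations) where only the significant studies get "published," and we then look at the average published effect:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_d, n = 0.25, 30  # true effect is 1/4 of a standard deviation

published = []
for _ in range(5000):
    control = rng.normal(0.0, 1.0, n)
    treatment = rng.normal(true_d, 1.0, n)
    _, p = stats.ttest_ind(treatment, control)
    if p < 0.05:  # the publication filter: only significant results survive
        # population sd is 1, so the mean difference approximates d
        published.append(treatment.mean() - control.mean())

avg_published = float(np.mean(published))
print(f"true effect: {true_d}")
print(f"mean published effect: {avg_published:.2f}")
```

The published studies are not a random sample of all studies: with n = 30 per group, a result can only clear the significance bar if the observed difference happens to be roughly twice the true effect or more, so the published literature systematically overstates the effect, and anyone who uses those published effect sizes for a power analysis will underestimate the sample they need.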
So, it's a good news / bad news thing. On the bright side, it means that many of the studies that did not replicate are still likely reporting real effects, not fraud, and would replicate with enough subjects. On the negative side, it means that psychologists are using samples that are too small and only yield significant effects when the effect size is inflated, giving a misleading picture of how impactful the causal variables are.
Keep an eye out for this new paper in the next few months, possibly in Science again.
Oh, and it helps explain why the replication rate in Social Psychology was only half that of Cognitive Psychology.
The more random error in the measurement tools, the more variability there will be within each experimental condition, and thus the more subjects and statistical power you will need to detect significant differences between the conditions. Social Psychology tends to measure its variables very indirectly, with methods that leave lots of room for error.
For example, studies often rely upon subjects giving self-reports about their beliefs, emotions, or opinions. Not only might people lie, but we are not highly accurate judges of ourselves even when we are trying to be. Also, subjects usually have to translate their subjective level of confidence or certainty about something into a numerical scale on a questionnaire, and different people will interpret those numbers somewhat differently. The variation in their true beliefs will correlate with the variation in the numbers they circle on the scale, but far from perfectly. Cognitive psychology measures things that can be more reliably quantified, like how many numbers in a series you recalled or how many seconds it took you to solve the problem. What those numbers mean in terms of more abstract theoretical concepts is another matter, but replication is about reliability, not theoretical validation.
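One more sketch makes the measurement-error point concrete. Here the true group difference is held fixed and we just add extra random noise to each score, as a stand-in for a noisy self-report measure (the noise levels and sample size are illustrative assumptions, not real data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def estimate_power(noise_sd, true_diff=0.5, n=40, sims=3000):
    """Fraction of simulated studies reaching p < .05 when each score
    carries extra measurement noise of the given standard deviation."""
    hits = 0
    for _ in range(sims):
        control = rng.normal(0.0, 1.0, n) + rng.normal(0.0, noise_sd, n)
        treatment = rng.normal(true_diff, 1.0, n) + rng.normal(0.0, noise_sd, n)
        if stats.ttest_ind(treatment, control)[1] < 0.05:
            hits += 1
    return hits / sims

clean_power = estimate_power(noise_sd=0.0)  # a precise, reliable measure
noisy_power = estimate_power(noise_sd=1.0)  # a noisy, indirect measure
print(f"clean measure: power = {clean_power:.2f}")
print(f"noisy measure: power = {noisy_power:.2f}")
```

The true effect is identical in both cases; only the measurement changed. The noisy measure widens the spread within each condition, which shrinks the standardized effect and the power, so a field relying on noisier measures will fail to replicate more often at the same sample size.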