ronburgundy
Contributor
I heard about an interesting follow-up to last year's paper in Science showing the replication problems within psychological research.
You might recall about a year ago when an article in Science reported a huge replication project where a group of psychologists took over 100 published results and tried to replicate them. In Social Psychology (which studies how people impact each other's thought processes and how people think about other people) the replication rate was only 25%. In Cognitive Psychology (which studies human thought, reasoning, and problem solving more generally) the replication rate was about 50%.
That got all the headlines, but lost in the hubbub was that the picture was less gloomy when they combined the data from the original and the new replication studies. Then, the same significant results as in the original study alone emerged about 70% of the time. IOW, even though most replications didn't get a significant result on their own, they got patterns of results that were mostly consistent with the original study, so the significant result remained most of the time when you combined the two data sets.
This suggested that the lack of replication was not due to fraud or wrong theories so much as weak methodologies and sample sizes that were too small to detect real effects even when present.
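You can see the sample-size problem in a quick simulation. This is just a sketch with made-up illustrative numbers (a true effect of 1/4 of a standard deviation, 30 subjects per group), not anything from the actual paper:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def run_study(true_d, n_per_group):
    """Simulate one two-group study; return its p-value."""
    control = rng.normal(0.0, 1.0, n_per_group)
    treatment = rng.normal(true_d, 1.0, n_per_group)
    _, p = stats.ttest_ind(treatment, control)
    return p

# A real but small effect (d = 0.25) with 30 subjects per group:
# how often does the study actually reach p < .05?
n_sims = 5000
hits = sum(run_study(0.25, 30) < 0.05 for _ in range(n_sims))
power = hits / n_sims
print(f"power with n=30 per group and true d=0.25: {power:.2f}")
```

The simulated power comes out well under 50%, i.e., most of these studies "fail to replicate" a perfectly real effect simply because the sample is too small to detect it reliably.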
The talk I just heard follows up on this idea and uses some complex data simulations to show that part of the problem is the strong bias in psychology to only publish papers with significant effects and to reject studies where the null hypothesis is retained. The rationale behind this has always been that a null result cannot be interpreted as good evidence against a theory because there are too many possible errors in sampling or measurement that might lead to a null result.
But a side-product of this is that the effects that do get published will, just by chance, tend to be inflated relative to the true size of the effect (for example, the experimental condition might come out 1 standard deviation above the control group when the true effect is only 1/4 of a standard deviation). Since most studies use small samples due to practical constraints, only when the effect in a given study happened to be inflated for some reason will it reach statistical significance and then get published. When that replication project designed their studies, they used the size of the effects in the original studies to estimate how many subjects they would need to be sure they would have the statistical power to detect the effect if it occurred in their study. But since those effect sizes were inflated, this led to an underestimate of the number of subjects they needed in the replication studies.
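The same kind of simulation shows the inflation directly. Here's a sketch (again with made-up numbers, not the talk's actual simulations) where only the significant studies get "published," and we then look at the average published effect:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
true_d, n = 0.25, 30  # true effect is 1/4 of a standard deviation

published = []
for _ in range(5000):
    control = rng.normal(0.0, 1.0, n)
    treatment = rng.normal(true_d, 1.0, n)
    _, p = stats.ttest_ind(treatment, control)
    if p < 0.05:  # the publication filter: only significant results survive
        # population sd is 1, so the mean difference approximates d
        published.append(treatment.mean() - control.mean())

avg_published = float(np.mean(published))
print(f"true effect: {true_d}")
print(f"mean published effect: {avg_published:.2f}")
```

The published studies are not a random sample of all studies: with n = 30 per group, a result can only clear the significance bar if the observed difference happens to be roughly twice the true effect or more, so the published literature systematically overstates the effect, and anyone who uses those published effect sizes for a power analysis will underestimate the sample they need.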
So, it's a good news / bad news thing. On the bright side, it means that many of the studies that did not replicate are still likely reporting real effects, not fraud, and would replicate with enough subjects. On the negative side, it means that psychologists are using samples that are too small and only yield significant effects when the effect size is inflated, giving a misleading picture of how impactful the causal variables are.
Keep an eye out for this new paper in the next few months, possibly in Science again.
Oh, and it helps explain why the replication rate in Social Psychology was only half that of Cognitive Psychology.
The more random error in the measurement tools, the more variability there will be within each experimental condition, and thus the more subjects and statistical power you will need to detect significant differences between the conditions. Social Psychology tends to measure its variables very indirectly, with methods that leave lots of room for error.
For example, studies often rely upon subjects giving self-reports about their beliefs, emotions, or opinions. Not only might people lie, but we are not highly accurate judges of ourselves even when we are trying to be. Also, subjects usually have to translate their subjective level of confidence or certainty about something into a numerical scale on a questionnaire, and different people will interpret those numbers somewhat differently. The variation in their true beliefs will correlate with the variation in the numbers they circle on the scale, but far from perfectly. Cognitive psychology measures things that can be more reliably quantified, like how many numbers in a series you recalled or how many seconds it took you to solve the problem. What those numbers mean in terms of more abstract theoretical concepts is another matter, but replication is about reliability, not theoretical validation.
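One more sketch makes the measurement-error point concrete. Here the true group difference is held fixed and we just add extra random noise to each score, as a stand-in for a noisy self-report measure (the noise levels and sample size are illustrative assumptions, not real data):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def estimate_power(noise_sd, true_diff=0.5, n=40, sims=3000):
    """Fraction of simulated studies reaching p < .05 when each score
    carries extra measurement noise of the given standard deviation."""
    hits = 0
    for _ in range(sims):
        control = rng.normal(0.0, 1.0, n) + rng.normal(0.0, noise_sd, n)
        treatment = rng.normal(true_diff, 1.0, n) + rng.normal(0.0, noise_sd, n)
        if stats.ttest_ind(treatment, control)[1] < 0.05:
            hits += 1
    return hits / sims

clean_power = estimate_power(noise_sd=0.0)  # a precise, reliable measure
noisy_power = estimate_power(noise_sd=1.0)  # a noisy, indirect measure
print(f"clean measure: power = {clean_power:.2f}")
print(f"noisy measure: power = {noisy_power:.2f}")
```

The true effect is identical in both cases; only the measurement changed. The noisy measure widens the spread within each condition, which shrinks the standardized effect and the power, so a field relying on noisier measures will fail to replicate more often at the same sample size.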