As I alluded to in my previous post, questionable research practices can be used to maximize the chance of a result being “statistically significant” (where p < .05).
There are a couple of apps that demonstrate how researcher choices (sample size, collecting extra participants, looking at multiple outcomes, etc.) can lead to erroneous conclusions.
Felix Schönbrodt has a nice walk-through of the interactive p-hacker app, which lets you make choices about the initial study: the number of participants in each group, whether there really is a difference between the two groups of interest, and the number of outcomes (“DVs” or “dependent variables”). The app then generates data based on these specifications, as though you had run your study in a population with these characteristics. (Of course, in the real world, we can’t know these population characteristics for sure, but simulations allow us to understand how our choices, assumptions, and practices give us results that do or do not reflect the “true state of things”.)
Even when there is no real effect in the population (the effect in the app is set to zero), some of these choices can result in a high probability of obtaining a “statistically significant” result. For example, I ran five “studies,” each time specifying no effect in the population, with 20 participants in each group, and looking at 10 outcomes (DVs). In two of these studies, I obtained 2-3 “marginally” significant results (out of a possible 10 in each study) with p-values of .057 to .089. Results like these are often interpreted as meaningful in some way.
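To see why 10 outcomes make a “hit” so likely, here is a minimal sketch of the arithmetic. It assumes the outcomes are independent of one another, which may not match how the app generates its DVs (correlated outcomes would give a somewhat lower rate), and the function name is my own. Under that assumption, the chance of at least one p-value falling below a cutoff across k tests is 1 - (1 - cutoff)^k.

```python
def familywise_error_rate(alpha: float, n_tests: int) -> float:
    """Chance of at least one p-value below alpha across n_tests
    independent tests when there is no real effect anywhere.
    (Independence is an assumption made for this illustration.)"""
    return 1.0 - (1.0 - alpha) ** n_tests


# .05 is the usual cutoff; .10 is roughly where "marginal" results get counted.
for alpha in (0.05, 0.10):
    rate = familywise_error_rate(alpha, n_tests=10)
    print(f"cutoff {alpha:.2f}, 10 outcomes: {rate:.0%} chance of at least one hit")
```

With 10 independent outcomes, that works out to about 40% at the .05 cutoff and about 65% if “marginal” results up to .10 are counted too.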
In the two studies with these “marginal” results, I then did some “p-hacking”: the app let me make additional decisions after seeing the data, and those decisions changed my results. In Study 1, the removal of an outlier gave me one significant result (p=.025). There were no outliers in Study 2.
In Study 2, I added 5 new participants per group (10 total). Voila! I now had FOUR significant results (ps ranging from .016 to .043).
So, I conducted five “studies,” removed one outlier, and added 10 participants to one study, and now I have five “statistically significant” results to talk about. But I want to remind you! THERE IS NO REAL EFFECT. Recall that we specified the “population” from which these data were drawn, and there is no real effect in it. If this were real life instead of a simulation, we might publish the two studies with significant effects and this would become evidence that there really is a difference between the groups.
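The “add a few more participants and test again” move can be simulated directly. The sketch below is an illustration under stated assumptions, not a reproduction of the app: it uses a single outcome, draws both groups from the same normal population (so there is no real effect), and uses a z-test with a known standard deviation of 1 instead of a t-test so that it runs on the Python standard library alone. All function and parameter names are my own.

```python
import random
from statistics import NormalDist, fmean


def z_test_p(group_a, group_b, sigma=1.0):
    """Two-sided p-value for a difference in means, assuming known sigma."""
    se = sigma * (1.0 / len(group_a) + 1.0 / len(group_b)) ** 0.5
    z = (fmean(group_a) - fmean(group_b)) / se
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


def false_positive_rate(n_sims=10_000, n_start=20, n_extra=5, alpha=0.05, seed=1):
    """Share of no-effect studies that end up 'significant' when we test once
    and, if that fails, add n_extra participants per group and test again."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        a = [rng.gauss(0, 1) for _ in range(n_start)]
        b = [rng.gauss(0, 1) for _ in range(n_start)]
        p = z_test_p(a, b)
        if p >= alpha:  # not significant yet, so collect more data and retest
            a += [rng.gauss(0, 1) for _ in range(n_extra)]
            b += [rng.gauss(0, 1) for _ in range(n_extra)]
            p = z_test_p(a, b)
        hits += p < alpha
    return hits / n_sims


print("test once:          ", false_positive_rate(n_extra=0))
print("test, add 5, retest:", false_positive_rate(n_extra=5))
```

Testing once should land near the nominal 5%, while the second look pushes the false positive rate above it. That is with a single extra look at a single outcome; combining it with 10 outcomes and outlier removal compounds the problem.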
I want to end by pointing out that most of these issues are not because p-values or traditional statistics are inherently flawed. Rather, researchers make choices that capitalize on chance and then don’t appropriately report or correct for those choices. For example, we could have chosen to look only at two or three outcomes (DVs) in a single study and then used a correction such that we only paid attention to p-values closer to .01. When I ran a single “study” using these specifications, I didn’t have p-values anywhere close to the point where I would conclude there was a difference between groups.
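As a rough sketch of what such a correction looks like, the simplest option is a Bonferroni adjustment: divide the cutoff by the number of outcomes tested. I am assuming Bonferroni here for illustration; other corrections exist and are less conservative. The p-values in the example are the ones reported above (.025 from Study 1, and the .016 and .043 endpoints from Study 2), and the function names are my own.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test cutoff that keeps the overall false positive rate at or below alpha."""
    return alpha / n_tests


def survives_correction(p_values, n_tests, alpha=0.05):
    """Return the p-values that are still significant after a Bonferroni adjustment."""
    cutoff = bonferroni_threshold(alpha, n_tests)
    return [p for p in p_values if p <= cutoff]


# Looking at 3 outcomes puts the cutoff near .017; looking at 10 puts it at .005.
print(f"3 outcomes:  p <= {bonferroni_threshold(0.05, 3):.4f}")
print(f"10 outcomes: p <= {bonferroni_threshold(0.05, 10):.4f}")

# p-values reported in the p-hacked studies above, each drawn from 10 outcomes.
reported = [0.016, 0.025, 0.043]
print("still significant:", survives_correction(reported, n_tests=10))
```

With 10 outcomes per study, none of those “significant” p-values would survive the adjusted cutoff of .005.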
Pre-registering studies (making your decisions public BEFORE you collect and analyze the data), using corrections for the number of comparisons, and avoiding questionable research practices can help minimize false positives (that is, saying there is a difference when there isn’t one). As consumers of research, it’s important to be aware of these issues and to interpret differences more cautiously when authors don’t demonstrate that they are taking steps to avoid p-hacking-like practices.