Let’s start with a comic, shall we? Read the comic about “significant discovery” below. This is going to be our fun topic for today!
Before we understand what’s going on and why it’s bad – is it? – let’s recap a bit about hypothesis testing.
Hypotheses are “guesses” about model and data structure that we want to test using sample data. Hypothesis testing is the statistical term for doing so in a mathematically meaningful way, but we do it all the time in daily life without realizing it. If your TV isn’t working and you restart the set-top box, you have just tested a hypothesis. The hypothesis was that the cause of the problem lies in the set-top box, and restarting it is a way to test whether that is correct. If your observations (the result after restarting) align with your hypothesis (the TV starts working or gets better), then you have more confidence in the hypothesis. Then you may test another. Problem solving is essentially serial hypothesis testing. As we discussed in the previous post on “11 facts of data science”, the alternative to hypothesis testing is trial and error, where you just tinker with everything and hope that something works.
In statistical terms, hypothesis testing refers to holding a prior belief (called the Null Hypothesis), observing data and doing certain calculations on it, and seeking strong enough evidence to falsify the prior belief (reject the Null Hypothesis). Recall that in practice there is never certain, or 100%, evidence for anything, not even that the Sun will rise tomorrow, but only strong enough evidence, say, 99.9…9%. There is a slight risk inherent in rejecting the null hypothesis without certainty: the probability of rejecting it when it is actually true is called the significance level α (alpha), and 1-α is the corresponding confidence level. There is, naturally, the opposite risk of being too stubborn and demanding so much evidence that we stick to the prior belief even in the presence of extraordinary evidence against it. That risk is less talked about and is represented as β (beta); its complement, 1-β, is called the power of the test. In Fig. 1 below, the blue curve represents the distribution of some observation computed from sample data under the prior belief (because a sample is not the complete data, each different sample will give a slightly different value of the observed metric), and the red curve represents the same under the alternative belief (called the Alternate Hypothesis), often defined as anything other than the prior belief. The black vertical line portrays the boundary of our desired confidence, and the tiny bit of the blue curve to the right of the line indicates that there is a small chance of rejecting the prior belief even if it is true.
How small a chance you can afford depends entirely on the practical risk of making the wrong decision. What if you claim someone has cancer when he doesn’t? What about claiming he doesn’t when he does? Which is riskier? What about declining a valid credit card transaction thinking it is fraud? What about not declining and actually risking fraud? Since α and β play against each other for a given sample size, there is a trade-off between the risk of a false positive and the risk of a false negative. Depending on your application, you may accept anywhere from 20% to 0.001% risk of rejecting the null hypothesis falsely. If your application isn’t specific, or both errors are equally bad, the statistical rule of thumb that has emerged is α=5%. That means that, when the null hypothesis is true, about 1 in 20 times you will reject it falsely just because the sample happens to fall to the right of the black line.
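To see the α versus β trade-off in numbers, here is a minimal sketch. It assumes a one-sided z-test in which the observation follows a standard normal curve under the null hypothesis and the same curve shifted by `effect_size` under the alternative. Both the test and the shift of 2.0 are illustrative assumptions, not something taken from the figure.

```python
from statistics import NormalDist


def type_two_error(alpha: float, effect_size: float) -> float:
    """Return beta for a one-sided z-test.

    Assumed setup: null ~ N(0, 1), alternative ~ N(effect_size, 1).
    """
    # The "black line": a fraction alpha of the null curve lies to its right.
    critical_value = NormalDist().inv_cdf(1.0 - alpha)
    # Beta: chance that a sample from the alternative lands left of the line,
    # so we fail to reject the null even though the alternative is true.
    return NormalDist(mu=effect_size, sigma=1.0).cdf(critical_value)


if __name__ == "__main__":
    for alpha in (0.20, 0.05, 0.01, 0.001):
        beta = type_two_error(alpha, effect_size=2.0)
        print(f"alpha={alpha:<6} beta={beta:.3f} power={1 - beta:.3f}")
```

Each time α is tightened, the line moves right and β grows, which is the trade-off described above.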
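The p-Value (probability value) in hypothesis testing is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one actually observed. The lower this value is, the stronger the evidence against the null hypothesis, and we reject the null hypothesis when the p-Value falls below α. Since α is generally 5%, a p-Value lower than 5% is conventionally considered statistically significant.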
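As a small illustration, the p-Value is the area of the null curve at or beyond the observed value. The sketch below assumes a one-sided test whose statistic is standard normal under the null hypothesis. The function name and the sample z-scores are made up for the example.

```python
from statistics import NormalDist


def one_sided_p_value(z_score: float) -> float:
    """P(Z >= z_score) for a standard normal Z, i.e. assuming the null is true."""
    return 1.0 - NormalDist().cdf(z_score)


if __name__ == "__main__":
    alpha = 0.05
    for z in (0.5, 1.645, 2.5):  # illustrative z-scores, not real data
        p = one_sided_p_value(z)
        print(f"z={z}: p-Value={p:.4f}, reject null at 5%: {p < alpha}")
```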
Now, let’s get back to our comic. The 5% rule of thumb is great when you are testing one observation on one data set, but not when you repeatedly test multiple observations on the same data. Since the scientists in the comic tested twenty colours of beans as potential causes of acne (Null hypothesis: Beans don’t cause acne), it’s not surprising that one colour (green, in this case) happens to test positive at the 5% significance level. Of course, for comic fun, note that 1 out of 20 is 5%, so it fits exactly the warning we cautioned against. p-Hacking refers to testing multiple observations on the same data until one of them is statistically significant, and then publishing or presenting those results as a mathematically valid conclusion.
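How likely is at least one false alarm in twenty tries? If we assume the twenty tests are independent and every null hypothesis is true, the chance is 1-(1-α)^20, which is roughly 64% at α=5%. The sketch below computes that and checks it with a quick simulation, using the fact that a p-Value is uniformly distributed between 0 and 1 when the null hypothesis holds. The independence assumption and the function names are mine, for illustration only.

```python
import random


def family_wise_error_rate(alpha: float, num_tests: int) -> float:
    """Chance of at least one false rejection across independent tests."""
    return 1.0 - (1.0 - alpha) ** num_tests


def simulate_any_false_positive(alpha: float, num_tests: int,
                                num_experiments: int, seed: int = 0) -> float:
    """Fraction of simulated experiments with at least one p-Value below alpha."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_experiments):
        # Under a true null hypothesis, each p-Value is uniform on [0, 1).
        if any(rng.random() < alpha for _ in range(num_tests)):
            hits += 1
    return hits / num_experiments


if __name__ == "__main__":
    print(f"analytic : {family_wise_error_rate(0.05, 20):.3f}")
    print(f"simulated: {simulate_any_false_positive(0.05, 20, 10_000):.3f}")
```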
The solution to avoid this junk analysis is to demand reproducibility of results and to use a lower α.
One rule of thumb is to reduce α according to the Bonferroni method, where α is divided by the number of repeated tests on the same data. Hence, if there are 20 tests on the same data, as in our comic example, we should be seeking evidence at the 5%/20=0.25% significance level. This is a hard and stringent requirement: it pushes up β and reduces the power of the tests. A less stringent option, if your risk of false positives allows it, is to put a threshold, much less than 1, on the ratio of the expected number of falsely rejected hypotheses to the number of rejected hypotheses (roughly, the False Discovery Rate). For the beans-cause-acne example, there is one hypothesis rejected (green beans cause acne!) and the expected number of false rejections is 20*α=20*5%=1, giving a ratio of 1, which is far too high.
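Here is a minimal sketch of both corrections. The Bonferroni part follows the description above. For the False Discovery Rate part I use the Benjamini-Hochberg procedure, which is one common way to control it; the post does not name a specific procedure, so treat that choice as my assumption. The list of p-Values is made up purely for illustration.

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject only p-Values at or below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p <= threshold for p in p_values]


def benjamini_hochberg_reject(p_values, fdr=0.05):
    """Benjamini-Hochberg: find the largest rank k with p_(k) <= (k / m) * fdr,
    then reject the k smallest p-Values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    largest_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            largest_rank = rank
    reject = [False] * m
    for idx in order[:largest_rank]:
        reject[idx] = True
    return reject


# Hypothetical p-Values from ten tests on the same data (not real results).
made_up_p_values = [0.001, 0.008, 0.039, 0.041, 0.042,
                    0.060, 0.074, 0.205, 0.212, 0.216]

if __name__ == "__main__":
    print("Bonferroni rejections:", sum(bonferroni_reject(made_up_p_values)))
    print("Benjamini-Hochberg rejections:",
          sum(benjamini_hochberg_reject(made_up_p_values)))
```

On this made-up list, five p-Values are below the naive 5% cut-off, but far fewer survive either correction, and Bonferroni is the stricter of the two.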
And p-Hacking is not a theoretical problem. Read this fascinating story of how one nutritional research paper hacked the p-Value to truly, yet falsely, claim that chocolate helps reduce weight. Since Machine Learning and Analytics are applied statistics, hopefully you will now keep your eyes open for potential p-Hacking!