Stats Training Materials – Statistical Inference & Hypothesis Testing

Mention P-values and most people will probably shudder at some memory of an incomprehensible lecture or lesson on statistical tests. Words like null hypotheses, t-tests and statistical significance might pop into your mind with little understanding of what they are about. What you may know is that scientists have to report a p-value for any experiment they do. Or do they?

The area of Statistical Inference is a core area of study for any statistician. Put simply, Inference means to infer from the observations you’ve made about your data and to draw conclusions about what might be happening in real life. There are two parts to Inference.

Exploratory Analysis – where you explore your data through charts, tables and other statistics and end up with one or more hypotheses about what might be going on.
Confirmatory Analysis – where you seek to confirm your hypothesis, often through the use of statistical tests, though it should not be confirmed exclusively through such tests.

I am a fan of using the criminal justice system as an analogy to explain this. When a crime occurs, the police investigate and collect evidence, i.e. they undertake an exploratory analysis of the data. The outcome of this is a hypothesis that a person is guilty of the crime. That person is then tried in a court where the null hypothesis is that the person is innocent. The evidence is then examined via a statistical test and the outcome is a p-value that the jury uses to come to a verdict. Either the verdict is to reject the null hypothesis of innocence and therefore find the person guilty, or the verdict is that the null hypothesis cannot be rejected and therefore the verdict is not guilty. At no point does a court conclude that the person is innocent. That is not the outcome of a statistical test.

Below is a list of various materials that you can use to learn more about hypothesis testing.

A. Experimental Design

Classically, a hypothesis should be specified before any data is collected. This leads you into the area of Experimental Design (or DOE) which is a vast area of statistics. If you do this, then conclusions drawn once the data has been analysed are usually sounder than conclusions drawn from data collected by other means.

More commonly, a hypothesis is generated after some data has been collected and analysed. The problem with this approach is that the way the data was collected may not be sufficient for you to draw firm conclusions. In reality, any conclusions should be treated as hypotheses for a proposed experiment.

Two blog posts of mine explain more.

Find out the difference between experiments and observations in my Evidence Hierarchy.
See an example of an experiment and how it can be improved in “Who reads fake news?”
What is the gold standard for an experiment? The answer is GRRaCE (Generalisable, Reproducible, RAndomised, Controlled Experiment) which I will expand upon in a post soon.

B. Statistical Tests

Hypothesis testing causes a lot of confusion and is often explained badly. I intend to add more links to articles that do a good job on this.

Why do women like my logo? To be published soon as an example of doing a 2-way Chi-Squared test in Microsoft Excel.
Is the Conservative Party intersectional for ethnicity & gender? This blog looks at the changing gender & ethnic diversity of Conservative MPs since 2001 and one section uses a 2-way Chi-Squared Test to examine the interaction between gender & ethnicity.
Do opinion polls tend to underestimate the gap between the Conservative & Labour parties? I use a simple t-test to examine this hypothesis towards the end of this blog though I don’t explain the method.
Has there been a step change in UK annual temperatures? I use a 2-sample t-test to see if temperatures in the 21st century are different from the 20th century. This article also introduces the basic principles of SPC (Statistical Process Control) which uses confidence intervals covered in section C below.
Did the Mayor of Paris discriminate in favour of women? In 2019, the Mayor of Paris was fined for having too many women in her leadership team. I show in this tweet how stupid this was as a simple Binomial Test demonstrates that what happened was completely consistent with a null hypothesis of no discrimination.
You the Jury! Is this case a prosecutor’s fallacy or not? In 2021, I gave evidence at a Medical Practitioners Tribunal where a doctor had been charged with cheating in an exam. The case came to light when the examination board noted an unusual degree of similarity between her answers and those of another candidate. I was required to test whether this similarity was unusual, which I did using a Chi-Squared test for a Null Hypothesis which assumed the answers of the 2 candidates were independent. The link takes you to a YouTube recording of an event hosted by the Statistics & Law section of the Royal Statistical Society where I show how I performed this test and the conclusions. I then went on to list the other evidence for and against the defendant before revealing the verdict. The recording contains a QR code which opens this form allowing you to vote on the probability of guilt as the evidence is revealed.
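
To give a feel for what a simple Binomial Test involves, here is a minimal Python sketch. It is not the calculation from any of the links above. The counts (11 women among 16 appointments) and the function name binomial_two_sided_p are assumptions made purely for illustration, and the null hypothesis is taken to be that each appointment is equally likely to be a man or a woman.

```python
from math import comb


def binomial_two_sided_p(successes: int, trials: int, p_null: float = 0.5) -> float:
    """Exact two-sided binomial p-value.

    Adds up the probability of every outcome that is no more likely
    under the null hypothesis than the outcome actually observed.
    """
    def pmf(k: int) -> float:
        return comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)

    observed = pmf(successes)
    # Small relative tolerance so outcomes tied with the observed one are counted.
    tail = sum(pmf(k) for k in range(trials + 1) if pmf(k) <= observed * (1 + 1e-9))
    return min(1.0, tail)


# Hypothetical counts for illustration only: 11 women among 16 appointments.
p_value = binomial_two_sided_p(successes=11, trials=16, p_null=0.5)
print(f"p-value = {p_value:.3f}")  # about 0.210
```

With these made-up counts the p-value is well above 0.05, so the null hypothesis of no discrimination cannot be rejected. As with the court analogy, that is not the same as proving the null hypothesis is true.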

C. Confidence Intervals

Confidence intervals are often recommended as an alternative to using P-values when assessing statistical significance. Together they are like two sides of the same coin and a case can be made that communication of results is easier with confidence intervals rather than p-values.
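
The "two sides of the same coin" idea can be sketched in a few lines of Python. This is a simplified illustration that assumes a z-test with a known standard deviation. The function name z_test_and_ci and all of the numbers are hypothetical. It shows that the p-value falls below 0.05 exactly when the 95% confidence interval excludes the value stated in the null hypothesis.

```python
from math import sqrt
from statistics import NormalDist


def z_test_and_ci(sample_mean, null_mean, sd, n, alpha=0.05):
    """Return the two-sided z-test p-value and the (1 - alpha) confidence interval.

    Assumes the standard deviation `sd` is known, which is a simplification.
    """
    se = sd / sqrt(n)  # standard error of the mean
    z = (sample_mean - null_mean) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (sample_mean - z_crit * se, sample_mean + z_crit * se)
    return p_value, ci


# Hypothetical data: null hypothesis mean of 10, known sd of 2, sample of 100.
for mean in (10.5, 10.3):
    p, (low, high) = z_test_and_ci(mean, null_mean=10.0, sd=2.0, n=100)
    inside = low <= 10.0 <= high
    print(f"mean={mean}: p={p:.4f}, 95% CI=({low:.2f}, {high:.2f}), null inside CI: {inside}")
```

The interval carries the same decision as the p-value but also shows the size of the effect and its uncertainty in the original units, which is one reason it can be easier to communicate.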

Here are some examples of confidence intervals in action.

When is a gender pay gap statistically significant?
Is the published gender pay gap data for an employer correct? I show how SPC can be used to conclude whether the year on year change is plausible or not.
Is the land safe for human activities? I was the lead author of a professional guidance document for the contaminated land industry which explains how statistics (specifically confidence intervals) can be used to make decisions on whether land is safe or not.

D. P-Values

The heart of traditional hypothesis testing is the calculation and the interpretation of P-Values. Many scientists and researchers in many fields have learned that this is how you decide if your research is statistically significant.

Unfortunately, the use of p-values has not conformed to good statistical practice and a number of issues have emerged. As a result the American Statistical Association (ASA) undertook a widespread consultation to see if these issues could be addressed. The outcome of the consultation has been a series of guidance documents which are listed below.

In March 2016, the ASA Statement on P-Values was published which explained how P-values can be misused. The full statement can be downloaded here. This was widely discussed throughout the world of research.
In September 2016, the statement was a keynote session at the Royal Statistical Society’s (RSS) conference in Manchester. I am in the front row of the YouTube clip taking many notes!
In March 2019, the ASA published new guidance “Moving to a world beyond p<0.05”. This note is intended to be guidance on what alternatives there are to using p-values to undertake statistical inference.
In August 2019, Significance magazine published this article by William Cready which explored some of the difficulties in implementing the ASA guidance.
In February 2020, I coined the hashtag #Pexit as a shorthand for “Exit from P-Values” to describe the 2019 ASA statement. In this Twitter thread, I pointed out that Brexit and Pexit have a lot in common!

The conclusion from the 1st link really resonates with me and is the basis of how I teach hypothesis testing in my courses.

“Good statistical practice, as an essential component of good scientific practice, emphasizes principles of good study design and conduct, a variety of numerical and graphical summaries of data, understanding of the phenomenon under study, interpretation of results in context, complete reporting and proper logical and quantitative understanding of what data summaries mean. No single index should substitute for scientific reasoning.”

For an interesting discussion on the meaning of the word “significant”, have a read of this article by Neil Sheldon where he recommends substituting the word “outlier”.

E. Inference

Inference comes from the verb “to infer” and is about the drawing of conclusions (both strong and weak) from data. Hypothesis Testing & Confidence Intervals are the main statistical methods by which we do this but they are not the only methods. Forecasting and Risk Modelling are two other options available among many.

Here is a list of blog posts where I draw conclusions from the available data.

I was asked to provide an expert opinion on the claim made by Bath & North East Somerset Council about their proposed Clean Air Zone plan. Did I change their plans?
If you were to join an organisation where everyone was white, could you conclude that this might be due to racial discrimination? I give an answer by introducing the idea of Bayesian Inference.
Has the gap between top and bottom in the English Premier League widened? I look at trends in the league placings since 1993.
Are stronger teams doing better than expected in the 2019 Rugby World Cup? I used World Rugby’s rankings to predict matches for the 2019 World Cup and ahead of the final, I evaluate the model’s performance.
Understanding the use of statistical evidence in courts & tribunals – This is a joint publication by the Royal Statistical Society and the Inns of Court College of Advocacy in 2017. It is intended for legal professionals when confronted with statistical evidence. I refer to page 24 of this publication in the YouTube recording in link B6 above when I discuss whether this case was a prosecutor’s fallacy or not.
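
Bayesian Inference and the prosecutor’s fallacy both come down to Bayes’ theorem, which can be sketched in Python as below. This is a generic illustration and not the analysis from any of the posts or cases above. The function name posterior_probability and every number are hypothetical. The point is that the probability of the evidence given innocence is not the same thing as the probability of innocence given the evidence.

```python
def posterior_probability(prior: float, p_evidence_if_true: float, p_evidence_if_false: float) -> float:
    """Bayes' theorem: probability the hypothesis is true after seeing the evidence."""
    numerator = prior * p_evidence_if_true
    denominator = numerator + (1 - prior) * p_evidence_if_false
    return numerator / denominator


# Hypothetical numbers for illustration only.
prior_guilt = 0.001            # 1 in 1,000 chance of guilt before the evidence
p_evidence_if_guilty = 1.0     # the evidence is certain to appear if guilty
p_evidence_if_innocent = 0.01  # 1% chance of the same evidence if innocent

posterior = posterior_probability(prior_guilt, p_evidence_if_guilty, p_evidence_if_innocent)
print(f"Probability of guilt given the evidence: {posterior:.3f}")  # about 0.091
```

With these made-up numbers, the fallacy would be to read the 1% as a 99% probability of guilt, when the posterior probability is only around 9% because guilt was so unlikely to begin with.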

If you would like to book a training course in Statistical Inference, then please contact me.

For more information about my other training courses in statistics, please visit my Statistical Training homepage.