April Fools' Day 2020 saw the hive mind of social media asking what the sample size should be to measure the extent of the Coronavirus in the UK. I could see that many people responding were reaching for standard methodologies, which are usually based on specifying a desired confidence interval. In doing so, they were overlooking a much more effective and relevant alternative based on the methodology of **Acceptance Sampling**, first developed by the US Military in World War 2.

**Data, Evidence, Decisions**

This is the strapline of the Royal Statistical Society of which I am a member. It represents the order in which a lot of statistical analysis is done. First, collect the **Data**, then analyse it and extract the **Evidence** (or **Insights** which is my preferred word) and then, based on the evidence, make **Decisions**.

It also works in reverse. You have a **Decision** to make. From that you work back to identify the **Evidence** you need to make the decision. From the required evidence, you work back to identify the **Data** that is capable of giving you the evidence.

If Boris Johnson wants to commission a survey to understand what people think of his handling of the Coronavirus pandemic, the flow **Data > Evidence > Decisions** is the right one to use to estimate the sample size of the survey. You start by asking “*How accurately do we wish to measure the answers to the questions?*” and the answer activates the **Data** step. Once the data is collected, it can be turned into **Insights** and then **Decisions**.

As of today, the UK population is largely locked down, nearly 2,500 people have died of (or with) COVID19 and the economy is in recession. The questions Boris wants answered above all else are “*Can I lift all or part of the restrictions today?*” and “*If not today, then when?*“. Both require the reverse flow **Decisions > Evidence > Data**. Given the various **Decisions** that Boris can take, his advisors can identify the **Evidence** needed to allow those decisions to be made and that evidence will determine what **Data** is needed.

**Boris in the cockpit**

The first part of any flight is Take Off. After receiving clearance to take off, the pilot opens the throttles and the plane starts to accelerate down the runway. During Take Off, the pilot has 3 key decision points to pass through: **V1 – Rotate – V2**. Each denotes an aircraft speed threshold above which the pilot is committed to certain courses of action and other courses of action are closed off. Specifically,

- **V1** – above this, the aircraft is committed to take off since it will not have enough runway to stop should take off be aborted.
- **Rotate** – the speed at which the pilot pulls back on the stick to lift the aircraft into the air.
- **V2** – above this, the aircraft is going fast enough to climb away, raise its undercarriage and flaps and fly on to its destination.

Prior to V1, the pilot can abort take off and start all over again. After V2, the aircraft is flying safely and can settle down to its normal routine. In between, if a problem develops, it may not always be possible to resolve it safely. The tragic crash of the Air France Concorde in 2000 was an example. The aircraft had already passed V1 when its fuel tank was ruptured by debris on the runway and it caught fire. The aircraft had to take off but was unable to reach V2 and fly normally and crashed before it could make an emergency landing.

Boris is our pilot today who needs to get the aircraft UK Life and Economy flying again at some point. Before he can open the throttles and start take off, he needs to know his **Decision** points V1 & V2. His advisors can analyse the medical and economic data and tell him what **Evidence** constitutes V1 & V2. Now all he needs is the **Data** to tell him (and us) whether the UK has reached V1 and can lift some restrictions, or whether we have reached V2 and can lift all restrictions. What **Sample Size** will give us the necessary data?

**Calculating the Sample Size**

Twitter wisdom at the moment is telling us to do mass testing of the population to find out if we are at V1 or V2 or even if we can start our take off run. Is this how pilots do it? The pilot of an Airbus A380 will have over 500 passengers on board. Surely it would be a good idea to get every passenger to use a stopwatch during the take off run, estimate when V1 has been reached and tell the pilot accordingly? After all, 500 is a good sample size!

I sincerely hope no-one has gone through this experience. Instead the measurement is taken by a sample size of one (a single instrument, though I assume there are backups) and is noted by a sample size of two, the two pilots, who will translate that into a decision. This points to a fundamental tension in trying to measure something accurately. The overall error is a combination of **Sampling Error** (a function of sample size and design) and **Measurement Error** (a function of accuracy, precision & speed).

When it comes to decision making, especially decisions that have to be made under extreme stress of time or consequences, targeted & rapid sampling that answers a precise and demonstrably relevant question is far superior to large & slow samples that simply provide data for a set of vague questions. That is what **Acceptance Sampling** does.

**A COVID19 example of Acceptance Sampling**

Let’s use the following example. Antibody tests are in the news and have been described as a “*game changer*” by the UK Chief Medical Officer. The reason is that if we can determine if an individual is immune to the disease using an antibody test, they can be released from restrictions and help get the UK flying again. If a certain percentage of the population is immune, then the population is said to have herd immunity.

Let’s use **M** to denote the % of the population that is immune to COVID19. How high does **M** have to be for ALL restrictions to be lifted? This is a question that Boris could plausibly ask of his experts. **M** will have to be at a level higher than the suspected herd immunity level (currently thought to be between 50-60%) both as a margin of safety but also to give an extra buffer for when international travel restarts and we start mixing again with populations with lower levels of immunity. I am going to set our target value for M to be 80% just to keep calculations simple. This makes 80% our V2 value, above which the UK is free and clear to fly and navigate. For V1, I am going to set this to be 50% (the minimum level for herd immunity) above which we can start to relax some restrictions but not all.

In the Acceptance Sampling world, which is formalised in the UK by British Standards BS6000 (Numerical data) & BS6001 (Sampling by Attributes), V2 is called the **Acceptable Quality Level (AQL)** or Acceptable Outcome Level (AOL). V1 does not have a standardised name but I like to use the terms **Unacceptable Quality Level (UQL)** or Unacceptable Outcome Level (UOL). Also, acceptance sampling uses the language of defects or failures for each individual sample, since it was initially developed by the US military as a way of inspecting ammunition batches to see if they were safe to dispatch to the troops. So the AQL for our COVID19 example is in fact 20%, i.e. if less than 20% of the population is “defective”, as in not immune, then that is an acceptable outcome. Similarly the UQL for COVID19 immunity is 50%, i.e. if more than 50% of the population is not immune (“defective”) then that is an unacceptable outcome. Because of this switch from immune people to non-immune people, I will now refer to **P** instead of **M**, where **P** = 100% – **M**.

An acceptance sampling plan consists of two elements denoted by **R** & **N**:

- **N** – the number of items/people to be sampled and measured
- **R** – the minimum number of items/people who have to be deemed “defective” for the batch (or population) to be REJECTED

R & N create a decision rule which can be written as:

- *“Accept the batch (i.e. take off and fly) if less than R samples are defective out of a total sample size of N”*
- *“Reject the batch (i.e. abort take off) if R or more samples are defective out of a total sample size of N”*
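This decision rule is simple enough to express directly in code. Here is a minimal sketch in Python (the function name and structure are mine, purely for illustration; they come from no standard or library):

```python
def accept_batch(defectives: int, r: int) -> bool:
    """Acceptance sampling decision rule: accept the batch
    (take off and fly) if fewer than r of the sampled items
    are defective, otherwise reject (abort take off)."""
    return defectives < r

# Using the plan derived later in the post (N=29, R=10):
print(accept_batch(9, 10))   # True  -> accept: fewer than 10 defectives
print(accept_batch(10, 10))  # False -> reject: 10 or more defectives
```

The entire decision hinges on a single count, which is why the rule can be applied instantly once the N test results are in.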

R & N are derived from the following specifications:

- **AQL** – already covered
- **UQL** – already covered
- **Alpha risk** – also known as the false positive risk or Type 1 Error. Strictly speaking the three are not exactly the same but I’m ignoring this.
- **Beta risk** – also known as the false negative risk or Type 2 Error. Strictly speaking the three are not exactly the same but I’m ignoring this.

**VERY IMPORTANT! Every Acceptance Sampling plan makes the following list of assumptions. If any one of these assumptions is not true, a different sampling plan must be used which often results in larger sample sizes.**

- **The population to be sampled is clearly specified** – For COVID19 we will start with the entire UK population as of today
- **Each sampling unit of the population has an equal chance of being selected for the sample** – This requires us to have a full and complete list of everyone living in the UK as of today
- **Each sampling unit has an equal chance of being defective** – Every person in the UK is equally likely to have immunity.
- **The sample is selected at random from the specified population** – a computer is used to select names from the population database at random
- **The measurement method is perfect with no false positives or false negatives** – the antibody test will never tell someone who is immune that they are not immune (false negative) or someone who is not immune that they are immune (false positive).

I am going to assume that all of these assumptions are valid for the purposes of this blog post. In reality this is not the case, especially for assumption 3, so the actual sampling plan would have to be revised to take these violations into account, which can be done.

**Using OC Curves to work out R & N**

R & N are calculated using what are called **Operating Characteristic (OC) Curves**. They are produced using the **Binomial Distribution** to calculate the following conditional probability –

- *“What is the probability of having less than R defects in a random sample of N items when the probability of a defective is P, and therefore we can ACCEPT the batch and release it for its intended use?”*
- COVID19 translation – *“What is the probability of having less than R people without immunity in a random sample of N UK residents when the probability of not being immune is P, and therefore we can lift at least some restrictions?”*

The OC curve is therefore the probability of **accepting** a batch for intended use hence the name Acceptance Sampling.
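Under the assumptions listed earlier, this probability is just the binomial CDF evaluated at R–1. A minimal Python sketch using only the standard library (the function name is mine; the author does this with a single Excel formula):

```python
from math import comb

def prob_accept(n: int, r: int, p: float) -> float:
    """OC curve value: P(X < r) = P(X <= r-1) where X ~ Binomial(n, p)
    and p is the probability that any one sampled item is defective."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(r))

# Trace out the OC curve for the plan quoted below (N=29, R=10)
# across a range of possible true defective rates P:
for p in (0.1, 0.2, 0.3, 0.4, 0.5):
    print(f"P = {p:.0%}: probability of acceptance = {prob_accept(29, 10, p):.1%}")
```

Sweeping P from 0 to 1 in this way produces exactly the blue curve shown in the chart below: high acceptance probability at low P, falling steeply between the AQL and UQL.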

For my COVID19 example with AQL 20% and UQL 50%, the answer is N=29 and R=10. I bet you didn’t think it would be that low, but it is a straightforward application of the binomial distribution which can be done with one formula in Microsoft Excel. If you book a place on my training course “**Make Better Decisions with Statistical Sampling**“, I will teach you how to do this calculation and produce the chart below.

The chart tells you the following –

- The horizontal axis is the full range of possible values of P. We do not know what the true value of P is hence why we are taking samples.
- The vertical sides of the orange box mark the AQL and UQL on the horizontal scale. If P is less than the AQL of 20% defective, it’s V2 and we can lift all restrictions and take off. If P is greater than the UQL of 50% defective, we’re under V1 and can’t take off.
- The vertical scale is the probability of accepting the batch using the specified decision rule i.e. if the number of defectives is less than R out of a sample of N we can declare success and start to lift some restrictions as we pass V1. However, we are not out of the woods yet as I explain later.
- The horizontal sides of the orange box are predefined risks we are prepared to take when making our decisions.
- The top side (alpha risk) is the probability of incorrectly rejecting a sample where P for the population really is 20% i.e. the AQL criterion has been met but we made the wrong decision to say “no we can’t lift any restrictions yet”. This is sometimes called the false positive risk or producer risk. I have chosen 5% alpha risk here which means that the probability of correctly accepting a sample where P is equal to the AQL is 95% (=100% – 5%).
- The bottom side (beta risk) is the probability of incorrectly accepting a sample where P for the population really is 50% i.e. the UQL criterion has been met but we made the wrong decision to say “yes we can lift some restrictions”. This is sometimes called false negative risk or consumer risk. I have chosen 5% beta risk here which means that the probability of incorrectly accepting a sample where P is equal to the UQL is 5%.

Does the blue line representing the OC curve for the displayed decision satisfy the requirements of the orange box? Yes it does –

- When P is 20% (at the AQL), the probability of accepting the sample using the rule “*Reject if R>=10 when N=29*” is 95.1% as calculated by the binomial distribution. Therefore we have a 4.9% chance of not accepting the sample, which would be incorrect i.e. a Type 1 Error. We specified we wanted this to be less than 5% and it is. If P is less than the AQL, the Type 1 Error is even lower, which is even better.
- When P is 50% (at the UQL), the probability of accepting the sample using the rule “*Reject if R>=10 when N=29*” is 3.1% as calculated by the binomial distribution. This would be an incorrect decision i.e. a Type 2 Error. We specified we wanted this to be less than 5% and it is. If P is greater than the UQL, the Type 2 Error is even lower, which is even better.
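Both risk figures can be reproduced with a few lines of standard-library Python (a self-contained sketch; the function name is mine):

```python
from math import comb

def prob_accept(n: int, r: int, p: float) -> float:
    """Probability of fewer than r defectives in n samples when each
    item is independently defective with probability p (binomial CDF at r-1)."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(r))

N, R = 29, 10
alpha = 1 - prob_accept(N, R, 0.20)  # risk of wrongly rejecting when P is at the AQL
beta = prob_accept(N, R, 0.50)       # risk of wrongly accepting when P is at the UQL
print(f"alpha = {alpha:.1%}, beta = {beta:.1%}")  # alpha = 4.9%, beta = 3.1%
```

Both values sit just inside the 5% specifications, which is why the blue OC curve threads through the orange box.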

What if P is in between 20% & 50%? Then we have an in-between probability of accepting or rejecting the sample. This is why I said earlier that if we accepted the sample, we cannot know for sure that we are below the AQL of 20%. We can conclude we are below the UQL, since the risk of an incorrect acceptance is very low as shown by the blue OC curve, but in order to lift all restrictions, we would want to be sure we are below the AQL. That is why this sampling plan is good enough to make a decision to start lifting restrictions (V1) but it is not good enough for decisions on lifting all restrictions (V2). So what sample size would we need to lift all restrictions?

**A Smaller Sample Size!**

Have a look at the second red OC curve I’ve added to the plot below.

I’ve reduced the sample to 15 people and the value of R to 1. I bet that blew your mind!

This decision rule means that we can only lift all restrictions if everyone in a random sample of 15 people is immune. I call this kind of sampling **Presence Sampling**, i.e. if defective items are **present** in the sample we reject the batch. The OC curve shows that the probability of accepting the sample (i.e. all 15 people are immune) when P is 20% (the level at which we said Boris could lift all restrictions) is 3.5%. That easily meets our beta risk specification of 5% (note that for the red line, 20% is actually the UQL and the AQL is 0% – think about why we make that change!).
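The 3.5% figure is just the chance that a random sample of 15 contains no “defectives” at all when 20% of the population is not immune. A quick check in Python (assuming independent random sampling, per the assumptions listed earlier):

```python
# Presence sampling: accept only if zero defectives appear in the sample.
# P(accept) = (1 - P)^N where P is the true non-immune proportion.
p_not_immune = 0.20
n = 15
p_accept = (1 - p_not_immune) ** n  # = 0.8 ** 15
print(f"Probability all {n} are immune when P = {p_not_immune:.0%}: {p_accept:.1%}")  # 3.5%
```

No binomial summation is needed here: with R = 1, the OC curve collapses to the single term for zero defectives.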

However, you should have spotted that the red curve has a high alpha risk, i.e. acceptable situations where P is less than 20% are incorrectly rejected. That is not a surprise, since the only acceptable outcome is that everyone in the sample is immune. However, there is nothing to stop us sampling in stages. Take 15 people now and if all 15 are immune, lift all restrictions. If 1 or more are not immune, take another 15 samples and re-evaluate (using a different OC curve for your combined sample size of 30). If that is not conclusive, take another 15 and repeat.

You might ask, why not select a larger sample to begin with? The answer is speed of decision making, which has to be a priority in these times. If your first 15 samples are good, why waste time doing more? This kind of rapid testing and decision making can work very well for sub-populations, e.g. doctors and nurses at a single hospital. Can we relax restrictions for these essential workers in this specific location? Acceptance sampling is the way to do it.

The approach I’ve described here is not the only way to do the calculations. Other more advanced methods exist using likelihood ratios and Bayesian inference. But what they all have in common is that we are not interested in knowing the precise value of P. All we want to do is prove (according to some specification of risks) that the aircraft is travelling faster than V1 or V2 and that we can get the UK flying again. Those calling for mass testing of the entire population are missing the point of why we need to test.