Thanks for the detailed response, and sorry for the long delay. Had to spend some time on other things...
This is the core of the disagreement. I'm not saying that I'm right and generations of statisticians are wrong. The issue is not the statistical procedure, but the assumptions you are making in applying them: mainly that the phenomena you are trying to analyse can be usefully described by a normal distribution... a.k.a. an average and a variance.
But it is-- R is an average value. So if you're saying that the vaccination strategy depends on the value of R, if that is the phenomenon you are trying to analyze, then you're saying it depends on an average value, and estimates of average values are normally distributed by the Central Limit Theorem.
Aside: It doesn't actually matter whether that data is the whole population or a sample of it: as long as that sample is >> 100 you'll get approximately the same output. You'll just get less accurate measurements of the population mean and SD. And, yes, of course those means (and the deviations) will be normally distributed - and the SD of the distribution of those averages is not the population SD (although I'm guessing it's related somehow). The CLT will give you nice, accurate measures of the mean and SD, but, as soon as you try and interpret those, you're assuming that orange curve and ignoring the blue one.
The sample mean is an estimate of the population mean. Because it's an estimate, it's not exact. Depending on the particular random sample you pull from the population, you'll get slightly different values. If you run that experiment a few times, and look at the different values you get, you'll find that your slightly inaccurate estimates of the population mean are normally distributed. The more samples you take, the more precise your estimate of the population mean. If your population mean is µ, your population standard deviation is σ, and you pull N samples-- then the expected value of your sample mean is also µ, and the standard deviation of the sample mean (the inaccuracy of your estimate of the population mean) is σ/√N, so it is indeed related to the population SD.
You probably don't need to pull >>100 samples; it depends on what you need your precision to be. I see the number 30 thrown around a lot but, as with most rules of thumb, I wouldn't just take it at face value. With 100 samples in your example, the expected value of your sample mean is still 2.2, and about 68% of your trials will fall between 2.2 ± 0.087.
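As a quick check of the σ/√N arithmetic (a sketch using your two-point example, with probabilities 0.95 and 0.05 on infectivities 2 and 6):

```python
import numpy as np

# Two-point "super spreader" population: 95% infect 2 people, 5% infect 6
values = np.array([2.0, 6.0])
probs = np.array([0.95, 0.05])

mu = np.sum(values * probs)                        # population mean: 2.2
sigma = np.sqrt(np.sum(probs * (values - mu)**2))  # population SD: ~0.87

N = 100
se = sigma / np.sqrt(N)  # SD of the sample mean: ~0.087
```

That last number is where the 2.2 ± 0.087 above comes from.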
Unfortunately, with covid, we have an enormous population to sample from, so the sample size isn't really a limitation.
So, for starters, if you disregard the real distribution and just look at the orange model, you'll deduce that the chances of anybody infecting 6 others are precisely zip - it's way outside the 99% confidence range. Yet there's actually a 5% chance of that happening.
Remember, the tails of a normal distribution go to infinity, so you'll never deduce the chances to be zip. Truth be told, you'd deduce that there is a finite probability of infecting 10 people (which the underlying model doesn't support) and a finite chance of infecting -2 people (which is just crazy talk).
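To put rough numbers on that (a sketch with scipy, using the same 2.2 mean and ~0.87 SD as your example):

```python
from scipy import stats

# Discrete super-spreader distribution vs. a normal with matching moments
ss = stats.rv_discrete(name="SuperSpreader", values=((2, 6), (.95, .05)))
nrm = stats.norm(ss.mean(), ss.std())

p6_actual = ss.pmf(6)       # true chance of infecting 6 others: 0.05
p6_normal = nrm.sf(5.5)     # normal model: "about 6" is tiny but nonzero
p_negative = nrm.cdf(-0.5)  # normal model also allows negative infectivity
```

The normal model puts the chance of "about 6" well under 0.1% (versus the actual 5%), while assigning a small but nonzero probability to infecting a negative number of people.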
I wouldn't (and didn't) recommend substituting one distribution for another.
I think maybe you're confusing the population distribution with the distribution of R itself. In your example for a super spreader, the population is an extreme bimodal distribution. R, by definition, is the average of that population. If you sample the population to determine R, you will find your estimates of R to be normally distributed, with the expected value of that estimate (the sample mean) also being the expected value of the population-- which is different from saying the distribution of the population, or even the population model, is normally distributed around that sample mean.
Your orange line is not the distribution of R, and it is not a model of the blue population; it is another distribution entirely. The distribution of R would share a mode value with (peak at the same place as) the orange line, but it would peak higher and be more tightly concentrated than your orange line, depending on the number of samples used in your estimate. I'd plot it to show you, but I think it might lead to confusion since the distribution of R and the population don't actually belong on the same axes; they're different things-- one is describing the population, the other is describing an estimate of the population mean.
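If it helps, here's a sketch of that point (assuming 100 samples per estimate): repeat the "estimate R from a sample" experiment many times on the bimodal population, and the estimates cluster tightly around 2.2, even though no individual in the population ever infects 2.2 people.

```python
import numpy as np

rng = np.random.default_rng(42)
# Bimodal population: 95% infect 2 people, 5% infect 6.
# Each row is one experiment: a sample of 100 individuals.
samples = rng.choice([2.0, 6.0], p=[0.95, 0.05], size=(10_000, 100))

# Estimate R in each experiment as the sample mean
r_estimates = samples.mean(axis=1)

# The estimates are approximately normal around 2.2 with SD near 0.087,
# while the population itself only ever takes the values 2 and 6
```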
This is why your two pseudo-code examples don't work as you think they do.
Your super spreader model doesn't say that R can vary between 2 and 6. It says that R is exactly 2.2. Likewise, your normal distribution doesn't say R varies from 0.46-3.95 with 95% confidence-- it says that R is exactly 2.2.
R is the same for both distributions because R is precisely the population mean.
If you are sampling to determine R, then your estimate of R has some spread depending on the sample size, but for reasonable sample sizes that spread is essentially the same for both distributions. If you are using 100 samples to estimate R, then the standard deviation of your estimate is 0.087. But that's not how R is likely measured, as I'll illustrate below.
Ok, so what happens if we do make the mistake of modeling the population with the wrong distribution-- say the normal distribution you show with your orange line, with the same mean and variance as the population, but a different shape (a continuous density rather than the two-point mass function)?
How would that lead to different expectations of spread? Different policies for conventions such as MWC?
I created your two distributions based on your example:
Python:
import matplotlib.pyplot as plt
from scipy import stats

# Discrete super-spreader distribution: 95% infect 2 others, 5% infect 6
ss = stats.rv_discrete(name="SuperSpreader", values=((2, 6), (.95, .05)))
# Normal distribution with the same mean (2.2) and SD (~0.87)
nrm = stats.norm(ss.mean(), ss.std())

plt.hist(ss.rvs(size=10000), bins=51, range=(0, 10), label="Super Spreader")
plt.hist(nrm.rvs(size=10000), bins=51, range=(0, 10), label="Normal")
plt.title("Two Infectivity Distributions"); plt.legend()
plt.xlabel("Infectivity"); plt.yticks([])
I selected a person from each population at random to get sick and defined that as the starting infected population. Based on the infectivity of each person in the infected population (2 or 6 in the super spreader model, or a normally distributed value rounded to the nearest integer in the normal model), I generate a newly infected population for the next epoch, with each new person's infectivity drawn from the distribution for that scenario. I run this simulation for 26 epochs, which would be roughly 1 year if the infectious period is 2 weeks.
Python:
def runExperiment(start, dist, iterations=50):
    infectedPop = start
    cntInfected = [len(start)]
    for i in range(iterations):
        # Each infected person infects dist.rvs() others (rounded to the
        # nearest integer); nextGen holds the infectivity of each new case
        nextGen = list((dist.rvs(size=sum(infectedPop)) + 0.5).astype('int'))
        cntInfected.append(len(nextGen))
        infectedPop = nextGen
    return cntInfected

ssExp = runExperiment([ss.rvs()], ss, iterations=26)
nrmExp = runExperiment([int(nrm.rvs() + 0.5)], nrm, iterations=26)

plt.figure(figsize=(12, 5))
plt.plot(ssExp, label="Super Spreader Dist.")
plt.plot(nrmExp, label="Normal Dist")
plt.title("25 Epochs of Strawman-21 Infection")
plt.xlabel("Epoch"); plt.ylabel("New Infections")
plt.legend(); plt.grid(True, which='both')
Note: vertical scale is in the billions.
In this case, the normally distributed population wound up with the larger numbers, but that's mostly because of the first few people randomly infected. Sometimes the super spreader distribution will end with the larger numbers if you randomly infect a 6 in the first few epochs.
After the first few epochs though, the virus is spreading at the same rate through both populations, regardless of their underlying distributions. This is the same data plotted using a log scale on the vertical to better indicate the exponential rates of growth:
Python:
plt.figure(figsize=(12, 5))
plt.semilogy(ssExp, label="Super Spreader Dist.")
plt.semilogy(nrmExp, label="Normal Dist")
plt.title("25 Epochs of Strawman-21 Infection")
plt.xlabel("Epoch"); plt.ylabel("New Infections (log scale)")
plt.legend(); plt.grid(True, which='both')
You can see the small variations in the lower left that lead the normal distribution to have higher total values at the end of this simulation, but the parallel lines from that point indicate that the virus is spreading at equal rates in both populations.
we don't measure R by going out and counting how many people each individual infected. We take the case data over time and fit a model to it. R would be one of the fitted parameters that we'd read out.
That plot shows what I mean. The data in those curves fit a model of
EndInfections = StartInfections * N^epochs
If I go into the data and measure N, the rate of growth, I get this:
Python:
(ssExp[25]/ssExp[10])**(1/15)
which returns 2.2017547797517736 for my dataset.
and
Python:
(nrmExp[25]/nrmExp[10])**(1/15)
which returns 2.201650944579746
In other words, the rate of growth is the same for both populations and that rate is R, the population mean, regardless of the overall distribution. The distribution doesn't matter.
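That "fit a model to the case data" step could also be done as a proper least-squares fit on log counts instead of the two-point estimate above. A sketch with hypothetical synthetic data (np.polyfit is just one way to do the fit):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical case counts growing at R = 2.2 per epoch, with reporting noise
epochs = np.arange(27)
cases = 2.2 ** epochs * rng.normal(1.0, 0.02, size=epochs.size)

# log(cases) = epochs * log(R) + log(start); the fitted slope recovers R
slope, intercept = np.polyfit(epochs, np.log(cases), 1)
R_est = np.exp(slope)  # close to 2.2 regardless of the underlying distribution
```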
E.g. City A, population 9,500,000, might have an R of 2 while City B, pop 500,000 might have an R of 6 - maybe due to socio-economic factors, or perhaps a different strain is prevalent there...
That's true, but that's a different problem. R is a population mean, and what you're describing is two different populations. If you want to know their local R values, you need to sample them individually; you can't learn much about either from their combined numbers. As I said in the beginning:
Most governments make policy for fairly large populations. At the national or provincial level, the numbers are big enough that the statistics are clearer. The data within those populations are still lumpy: individual cities might have reasonable variances, but their values will differ from each other and from the national numbers. Neighborhoods and households are probably small enough that their data are pretty meaningless in isolation, which is why they tend to be aggregated in other ways (neighborhoods of color, income brackets, senior living developments, etc.).