
theluggage

macrumors 604
Jul 29, 2011
7,588
7,686
I showed you a simple simulation of a super-spreader scenario above and the averages were normally distributed as expected by CLT.
Edit: Sorry, these things always come across more... argumentative than intended. Substantially truncated the original post while I go away and come up with an example that actually shows the difference.

Your measurements of the average were normally distributed. That doesn't mean that the actual thing you're trying to measure (the average of) follows a normal distribution. In your super-spreader example all you were effectively doing was taking 10,000 samples of your Bernoulli distribution, taking the average, then rinsing and repeating and plotting the distribution of those averages.

The population distribution is still the Bernoulli distribution of the original data.

The average you measure is still the average - the question is how much any confidence intervals you place depend on the actual population distribution vs. the estimated population distribution you get from the average and variance of your measurements.
 

theluggage

macrumors 604
Jul 29, 2011
7,588
7,686
Usual disclaimer - this is two people having an internet argument about stats (which we could both be getting wrong) and a hypothetical Strawman-21 virus with made-up properties - no Covid 19 advice here, apart from "go get vaccinated"!

I'm not sure I follow your meaning here. Are you saying that I'm making the mistake of following standard statistical procedure by sampling a large population to determine the statistics of that population?

This is the core of the disagreement. I'm not saying that I'm right and generations of statisticians are wrong. The issue is not the statistical procedure, but the assumptions you are making in applying them: mainly that the phenomena you are trying to analyse can be usefully described by a normal distribution... a.k.a. an average and a variance.

(NB: When I talk about the "population mean" etc. I'm referring to the mean that you'd calculate if you had accurate data for the entire population - if I'm misusing the term, please correct me. It's not (just) about whether you divide by N or N-1 in the variance formula...)

So, let's take your super-spreader model. I'm going to tweak it a bit to actually get an epidemic with R>1. So, let's have 95% of the infected passing it on to 2 others, and 5% passing it on to 6.

Figure_1.png

The blue histogram shows the actual distribution of the data. If you ignore that distribution, and just take the mean and standard deviation of the data, what you are doing is modelling that distribution with the orange curve (a normal distribution with mean=2.2 and std. dev. 0.87).

Aside: It doesn't actually matter whether that data is the whole population or a sample of it: as long as that sample is >> 100 you'll get approximately the same output, just less accurate measurements of the population mean and SD. And, yes, of course those means (and the deviations) will be normally distributed - and the SD of the distribution of those averages is not the population SD (although I'm guessing it's related somehow). The CLT will give you nice, accurate measures of the mean and SD, but, as soon as you try to interpret those, you're assuming that orange curve and ignoring the blue one.

So, for starters, if you disregard the real distribution and just look at the orange model, you'll deduce that the chances of anybody infecting 6 others are precisely zip - it's way outside the 99% confidence range. Yet there's actually a 5% chance of that happening.
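For the record, here's a minimal sketch (assuming scipy is available, using the rounded mean/SD from above) of what the orange model says about that tail versus what the bimodal model actually builds in:

Python:
# Sketch (assumes scipy): tail probability of "infects 6 or more" under the
# fitted normal model vs. the 5% built into the bimodal model.
from scipy import stats

mean, sd = 2.2, 0.87                 # measured from the bimodal data
normal_model = stats.norm(mean, sd)

p_normal = normal_model.sf(6)        # P(X >= 6) under the orange curve
p_actual = 0.05                      # by construction of the bimodal model

print(f"Normal model:  P(infect >= 6) ~ {p_normal:.1e}")   # roughly 6e-6
print(f"Bimodal model: P(infect >= 6) = {p_actual:.2f}")   # 0.05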

Here's what happens if you use the "normal model" to measure R and plug it into the herd immunity formula p = 1-1/R (from the Lancet article).

Code:
Mean R = 2.20 SD=0.87
Just using the mean R=2.20: vaccination rate needed: 55%
or:
R = 0.46-3.95 (+/-2 sigmas - 95% confidence)
Vaccination rate needed: -120% - 75%

So worst case, you'd need to vaccinate 75% of the population. Best case - no vaccination needed (ignore -120% since the formula is obviously invalid for R<1)

Now look at what's happening in our (hypothetical) real world:

Code:
From actual distribution
R = 2.00-6.00 (by observation)
Vaccination rate needed: 50% - 83%

So, first, that's eliminated the possibility of no vaccine, second, that's a big increase in the level of vaccinations needed.
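A quick sketch (same rounded mean/SD as above) that reproduces both sets of numbers from p = 1 - 1/R:

Python:
# Sketch: herd-immunity threshold p = 1 - 1/R, first from the "normal model"
# confidence range, then from the range actually present in the bimodal data.
mean, sd = 2.2, 0.87

def vacc_rate(r):
    return 1 - 1/r

# Normal model: mean and mean +/- 2 sigmas (~95% confidence)
for r in (mean - 2*sd, mean, mean + 2*sd):
    print(f"R = {r:.2f} -> vaccination rate {vacc_rate(r):.0%}")
# lower bound gives a meaningless negative rate (formula invalid for R < 1),
# the mean gives ~55%, the upper bound ~75%

# Actual bimodal distribution: R observed between 2 and 6
for r in (2, 6):
    print(f"R = {r} -> vaccination rate {vacc_rate(r):.0%}")   # 50% and 83%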

Or, to put it another way, the normal model presumes that the "super spreaders" are randomly distributed through the population. The fact that the actual distribution isn't normal doesn't disprove that, but it is a massive hint that it might not be so.

E.g. City A, population 9,500,000, might have an R of 2 while City B, pop 500,000 might have an R of 6 - maybe due to socio-economic factors, or perhaps a different strain is prevalent there... Using the average - even plugging in confidence intervals derived from the normal model - will only solve the problem for City A. That's an extreme and unlikely scenario - based on an extreme and unlikely model - but it isn't impossible and it's very hard to rule out more subtle scenarios - where is the dividing line between that and a scenario for which that normal model is reliable?

As I said, "standard statistical processes" aren't wrong, but they make huge assumptions about what you are trying to measure and which could affect any projections you make from your results, and those do not go away "because of central limit theorem" - which is about the distribution of your measurements not the phenomenon you are analysing. That includes the assumption that you are either measuring a well-defined value or something that can be adequately modelled by a normal distribution. The only way to know those assumptions are valid is to test them - and looking at the actual distribution before leaping in and taking aggregates etc. will help with that.

what I meant is that we don't measure R by going out and counting how many people each individual infected. We take the case data over time and fit a model to it. R would be one of the fitted parameters that we'd read out.
...but that case data comes from (broadly speaking) somebody going out and counting infected people, and is subject to all sorts of errors. Someone has to try and deduce a likely value for the number of unreported infections. The model you fit almost inevitably assumes that R follows some sort of well-behaved distribution and/or that the uncertainty is small enough for the usual techniques for propagating errors to apply...

It is all built on a foundation of assumptions - it has to be in the absence of perfect knowledge - and however precise and rigorous the mathematics you are using is, it can only ever be as good as those assumptions.

...and while I'm not suggesting this applies to you, applying something like a statistical test and assuming that this confirms the validity of your experiment is a very common mistake.
 

Analog Kid

macrumors G3
Mar 4, 2003
9,017
11,788
Thanks for the detailed response, and sorry for the long delay. Had to spend some time on other things...

This is the core of the disagreement. I'm not saying that I'm right and generations of statisticians are wrong. The issue is not the statistical procedure, but the assumptions you are making in applying them: mainly that the phenomena you are trying to analyse can be usefully described by a normal distribution... a.k.a. an average and a variance.

But it is-- R is an average value. So if you're saying that the vaccination strategy depends on the value of R, if that is the phenomenon you are trying to analyze, then you're saying it depends on an average value, and estimates of average values are normally distributed by the Central Limit Theorem.

Aside: It doesn't actually matter whether that data is the whole population or a sample of it: as long as that sample is >> 100 you'll get approximately the same output, just less accurate measurements of the population mean and SD. And, yes, of course those means (and the deviations) will be normally distributed - and the SD of the distribution of those averages is not the population SD (although I'm guessing it's related somehow). The CLT will give you nice, accurate measures of the mean and SD, but, as soon as you try to interpret those, you're assuming that orange curve and ignoring the blue one.

The sample mean is an estimate of the population mean. Because it's an estimate, it's not exact. Depending on the particular random sample you pull from the population, you'll get slightly different values. If you run that experiment a few times and look at the different values you get, you'll find that your slightly inaccurate estimates of the population mean are normally distributed. The more samples you take, the more precise your estimate of the population mean. If your population mean is µ, your population standard deviation is σ, and you pull N samples, then the expected value of your sample mean is also µ, and the standard deviation of the sample mean (the inaccuracy of your estimate of the population mean) is σ/√N, so it is related to the population standard deviation.

You probably don't need to pull >>100 samples, depending on what you need your precision to be. I see the number 30 thrown around a lot but, as with most rules of thumb, I wouldn't just take it at face value. With 100 samples, for your example, the expected value of your sample mean is still 2.2, and 68% of your trials will fall within 2.2 ± 0.087.
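If it helps to see it concretely, here's a quick sketch (assuming scipy and numpy, and your 95%/5% model) that pulls repeated 100-sample batches and checks the spread of the sample means against σ/√N:

Python:
# Sketch: sample means of the bimodal distribution are ~normally distributed
# with standard deviation sigma/sqrt(N), per the CLT.
import numpy as np
from scipy import stats

ss = stats.rv_discrete(name="SuperSpreader", values=((2, 6), (.95, .05)))
N, trials = 100, 10000

sample_means = ss.rvs(size=(trials, N)).mean(axis=1)

print(f"population mean {ss.mean():.3f}, population SD {ss.std():.3f}")  # 2.200, 0.872
print(f"mean of the sample means: {sample_means.mean():.3f}")            # ~2.200
print(f"SD of the sample means:   {sample_means.std():.4f}")             # ~0.087
print(f"sigma/sqrt(N):            {ss.std()/np.sqrt(N):.4f}")            # 0.0872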

Unfortunately, with covid, we have an enormous population to sample from, so the sample size isn't really a limitation.

So, for starters, if you disregard the real distribution and just look at the orange model, you'll deduce that the chances of anybody infecting 6 others are precisely zip - it's way outside the 99% confidence range. Yet there's actually a 5% chance of that happening.

Remember, the tails of a normal distribution go to infinity, so you'll never deduce the chances to be zip. Truth be told, you'd deduce that there is a finite probability of infecting 10 people (which the underlying model doesn't support) and a finite chance of infecting -2 people (which is just crazy talk).

I wouldn't (and haven't) recommend substituting one distribution for another.

I think maybe you're confusing the population distribution with the distribution of R itself. In your example for a super spreader, the population is an extreme bimodal distribution. R, by definition, is the average of that population. If you sample the population to determine R, you will find your estimates of R to be normally distributed with the expected value for that estimate of R (the sample mean) also being the expected value of the population-- which is different than saying the distribution of the population, or even the population model, is normally distributed around that sample mean.

Your orange line is not the distribution of R, and it is not a model of the blue population, it is another distribution entirely. The distribution of R would share a mode value with (peak at the same place as) the orange line, but the distribution of R would peak higher and be more tightly distributed than your orange line depending on the number of samples used in your estimate. I'd plot it to show you, but I think it might lead to confusion since the distribution of R and the population don't actually belong on the same axes, they're different things-- one is describing the population, the other is describing an estimate of the population mean.


This is why your two pseudo-code examples don't work as you think they do. Your super spreader model doesn't say that R can vary between 2 and 6. It says that R is exactly 2.2. Likewise, your normal distribution doesn't say R varies from 0.46-3.95 with 95% confidence-- it says that R is exactly 2.2.

R is the same for both distributions because R is precisely the population mean.

If you are sampling to determine R, then R has a range depending on the sample size but for reasonable sample sizes that range is essentially the same for both distributions. If you are using 100 samples to estimate R, then your standard deviation of R is 0.087. But that's not how R is likely measured, as I'll illustrate better below.
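To make that narrower point concrete before moving on, here's a sketch under the same scipy assumptions: with matched mean and SD, 100-sample estimates of R have essentially the same spread whether the draws come from the bimodal distribution or from the normal:

Python:
# Sketch: 100-sample estimates of R spread the same way for the bimodal
# distribution and for a normal with matched mean and SD.
import numpy as np
from scipy import stats

ss = stats.rv_discrete(name="SuperSpreader", values=((2, 6), (.95, .05)))
nrm = stats.norm(ss.mean(), ss.std())

N, trials = 100, 10000
r_bimodal = ss.rvs(size=(trials, N)).mean(axis=1)
r_normal = nrm.rvs(size=(trials, N)).mean(axis=1)

print(f"bimodal: R = {r_bimodal.mean():.2f} +/- {r_bimodal.std():.3f}")  # ~2.20 +/- 0.087
print(f"normal:  R = {r_normal.mean():.2f} +/- {r_normal.std():.3f}")   # ~2.20 +/- 0.087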


Ok, so what happens if we do make the mistake of modeling the population with the wrong distribution-- say the normal distribution you show with your orange line, with the same mean and variance as the population, but a completely different shape.

How would that lead to different expectations of spread? Different policies for conventions such as MWC?

I created your two distributions based on your example:

Python:
import matplotlib.pyplot as plt
from scipy import stats

# Two-point "super spreader" distribution: 95% infect 2 others, 5% infect 6
ss=stats.rv_discrete(name="SuperSpreader",values=((2,6),(.95,.05)))
# Normal distribution with the same mean and standard deviation
nrm=stats.norm(ss.mean(),ss.std())

plt.hist(ss.rvs(size=10000),bins=51,range=(0,10),label="Super Spreader");
plt.hist(nrm.rvs(size=10000),bins=51,range=(0,10),label="Normal");
plt.title("Two Infectivity Distributions");plt.legend();
plt.xlabel("Infectivity");plt.yticks([]);

1616554511663.png


I selected a person from each population at random to get sick and defined that as the starting infected population. Based on the infectivity of each person in the infected population (infectivity being 2 or 6 in the super spreader model, or a normally distributed integer in the normal distribution) I generate a newly infected population for the next epoch, with each person selected from the distribution for that scenario. I run this simulation for 26 epochs, which would be roughly 1 year if the infectious period is 2 weeks.

Python:
def runExperiment(start,dist,iterations=50):
    # start: list of infectivity values for the initially infected people
    # dist: distribution each new infection's infectivity is drawn from
    infectedPop=start
    cntInfected=[len(start)]
    for i in range(iterations):
        # new infections this epoch = sum of the infectivities of everyone
        # currently infected; +0.5 and astype('int') round the normal draws
        nextGen=list((dist.rvs(size=sum(infectedPop))+0.5).astype('int'))
        cntInfected.append(len(nextGen))
        infectedPop=nextGen
    return cntInfected

ssExp=runExperiment([ss.rvs()],ss,iterations=26)
nrmExp=runExperiment([int(nrm.rvs()+0.5)],nrm,iterations=26)

plt.figure(figsize=(12,5))
plt.plot(ssExp,label="Super Spreader Dist.")
plt.plot(nrmExp,label="Normal Dist")
plt.title("25 Epochs of Strawman-21 Infection")
plt.xlabel("Epoch");plt.ylabel("New Infections")
plt.legend();plt.grid(True,which='both');

1616554696432.png


Note: vertical scale is in the billions.

In this case, the normally distributed population wound up with the larger numbers, but that's mostly because of the first few people randomly infected. Sometimes the super spreader distribution will end with the larger numbers if you randomly infect a 6 in the first few epochs.

After the first few epochs though, the virus is spreading at the same rate through both populations, regardless of their underlying distributions. This is the same data plotted using a log scale on the vertical to better indicate the exponential rates of growth:

Python:
plt.figure(figsize=(12,5))
plt.semilogy(ssExp,label="Super Spreader Dist.")
plt.semilogy(nrmExp,label="Normal Dist")
plt.title("25 Epochs of Strawman-21 Infection")
plt.xlabel("Epoch");plt.ylabel("New Infections (log scale)")
plt.legend();plt.grid(True,which='both');

1616556235585.png


You can see the small variations in the lower left that lead the normal distribution to have higher total values at the end of this simulation, but the parallel lines from that point indicate that the virus is spreading at equal rates in both populations.

we don't measure R by going out and counting how many people each individual infected. We take the case data over time and fit a model to it. R would be one of the fitted parameters that we'd read out.
That plot shows what I mean. The data in those curves fit a model of

EndInfections=StartInfections*N^epochs

If I go into the data and measure N, the rate of growth, I get this:

Python:
(ssExp[25]/ssExp[10])**(1/15)
which returns 2.2017547797517736 for my dataset.

and
Python:
(nrmExp[25]/nrmExp[10])**(1/15)
which returns 2.201650944579746

In other words, the rate of growth is the same for both populations and that rate is R, the population mean, regardless of the overall distribution. The distribution doesn't matter.

E.g. City A, population 9,500,000, might have an R of 2 while City B, pop 500,000 might have an R of 6 - maybe due to socio-economic factors, or perhaps a different strain is prevalent there...

That's true, but that's a different problem. R is a population mean. What you're saying is that City A and City B are different populations. If you want to know their local R value, you need to sample them individually. You can't find anything about them from their joint population. As I said in the beginning:

Most governments make policy for fairly large populations. At the national or provincial level, the numbers are big enough that the statistics are more clear— the data within those populations are lumpy; cities might have reasonable variances but their values will differ relative to each other and to the national numbers, for example. Neighborhoods and households probably start to get small enough that the data is pretty meaningless in isolation, which is why they tend to aggregate it in different ways (neighborhoods of color, income brackets, senior living developments, etc).
 

theluggage

macrumors 604
Jul 29, 2011
7,588
7,686
But it is-- R is an average value. So if you're saying that the vaccination strategy depends on the value of R,
No - it depends on the likely range of R or (if you want to be pedantic) the likely range of the number of people subsequently infected by each infected person- hence in the Lancet article they're not just plugging R=3 into the formula, they're using R=2.5 and R=3.5 to get the likely lower/upper bounds on the required vaccination rate. (Heck, even the BBC website quotes R as a range and the media are usually clueless about error margins.)

You're putting a lot of emphasis on R being defined as the average, which is pedantically correct but practically wrong: you never, ever use a single value like that for any serious purpose - even a credible "estimate" - without paying attention to the error margin: and that has to include (an estimate of) any baked-in variance in the real population as well as the "sampling error".

Assumptions like "95% of the population lie within 2 standard deviations of the mean" all assume normal distribution. If the actual distribution of the real thing you are trying to measure is the sort of extreme bimodal distribution in the example, those assumptions just won't be true.

Remember, the tails of a normal distribution go to infinity, so you'll never deduce the chances to be zip.
In a normal distribution, 99.7% of the population lies within 3 SDs of the mean. That is generally taken as "zip" for most scientific purposes - the classic "three sigmas" standard.

In the example distribution, by definition, 5% of the population - not 0.3% - lies at 6, more than 3 SDs from the mean.

Your orange line is not the distribution of R, and it is not a model of the blue population

The blue population was something like a million data points simulated using the imaginary bimodal model. Then the mean and standard deviation of the number of infections passed on by each member of that population were calculated. That mean and standard deviation are fixed by the underlying bimodal model - once the simulated population is >>100 they converge on those values and won't change if you simulate a population of 1 billion. We're not "sampling" anything at this stage or invoking the central limit theorem.

The mean is the value of "R" for that population - but that value alone is useless for any practical purpose without some measure of the uncertainty. The usual way of doing that is to take the standard deviation and assume something like the 95% rule - but at that point you are assuming standard deviation and modelling the population as the orange line.

That's true, but that's a different problem. R is a population mean. What you're saying is that City A and City B are different populations.
We defined two different "populations" when we came up with the extreme bimodal distribution. Two cities is just one way of describing it (which would be trivially easy to untangle) but the "populations" could just as easily be different virus variants, different age groups, different socio/economic groups (we know all those things are significant with covid) that would be harder to untangle and target.

Ok, so what happens if we do make the mistake of modeling the population with the wrong distribution
Except you've taken data with a distribution that strongly suggests that there are two populations, that have persisted even with a large number of infections (so it's not the "first six" problem) and applied that distribution to a very simple model that assumes that there is only one population (anybody can infect anybody in the whole population). What you've come up with is literally just an obfuscated way of calculating the mean of the bimodal distribution.

If there are two populations in there - whether regional, different variants, or socio-economic groups - then the people infected by super-spreaders are likely to be predominantly super-spreaders themselves (live in the same city, work in similar environments, catch the same variant, have an international tech conference habit...). So your model would be invalid. If you changed your model so that anybody infected by a super-spreader had a much larger chance than 5% of being a super-spreader themselves, then the value of R you calculated would increase - hitting 6 in the extreme case where super-spreaders only infected other super-spreaders (e.g. other people in the same city/socio-economic group/whatever and, obviously, if they were passing on a more-virulent strain).
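Just to sketch that idea (my own toy extension of the model, not anything from the Lancet article): let q be the probability that somebody infected by a super-spreader is a super-spreader themselves, and look at the long-run growth factor, i.e. the dominant eigenvalue of the next-generation matrix:

Python:
# Toy sketch (my own extension, not from the article): q = probability that
# someone infected *by a super-spreader* is themselves a super-spreader.
# Normal spreaders infect 2, super-spreaders infect 6; people infected by a
# normal spreader are super-spreaders with the original 5% probability.
import numpy as np

def effective_R(q):
    # Next-generation matrix: columns = parent type (normal, super),
    # rows = offspring type (normal, super).
    M = np.array([[2*0.95, 6*(1 - q)],
                  [2*0.05, 6*q]])
    return max(abs(np.linalg.eigvals(M)))   # dominant eigenvalue = long-run growth

for q in (0.05, 0.25, 0.5, 1.0):
    print(f"q = {q:.2f} -> effective R = {effective_R(q):.2f}")
# q = 0.05 reproduces R = 2.2; q = 1.0 pushes it all the way to 6.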

We know that covid spread is strongly correlated with where you live, socio/economics, variants etc. so the reality is going to be more complicated than your model.

The point is not the math that you're doing, but the undeclared assumptions you are making before even getting started. That's why you should look carefully at your data - its distribution, any correlations, possible systemic biases etc. - before throwing it into the statistics sausage machine - because that will turn anything into a sausage with 95% confidence.
 

Analog Kid

macrumors G3
Mar 4, 2003
9,017
11,788
No - it depends on the likely range of R or (if you want to be pedantic) the likely range of the number of people subsequently infected by each infected person- hence in the Lancet article they're not just plugging R=3 into the formula, they're using R=2.5 and R=3.5 to get the likely lower/upper bounds on the required vaccination rate. (Heck, even the BBC website quotes R as a range and the media are usually clueless about error margins.)

I keep agreeing with this point going all the way back to your first mention of R ranges:
I don’t disagree with any of this in principle, though the numbers are a bit fast and loose.

Go back through and look at every time you talk about R ranges followed by my saying "I agree"-- you're just having a hard time taking "yes" for an answer. R varies based on a number of factors and that range is important to account for to the extent that we're able to.

Now, as then, my points of discussion are around how you're arriving at your numbers and, to some extent, our ability to address differences as a matter of policy around things like conventions. I'm trying to make some very narrow points about how these numbers are arrived at that you seem to be interpreting as a larger argument against the fact that different populations have different propensities to infect others.

Keep that in mind as you review my following responses:

You're putting a lot of emphasis on R being defined as the average, which is pedantically correct but practically wrong: you never, ever use a single value like that for any serious purpose - even a credible "estimate" - without paying attention to the error margin: and that has to include (an estimate of) any baked-in variance in the real population as well as the "sampling error".

Again, and I'll keep repeating this in the hopes that it helps, I am not arguing for using a single R value.

R is the average for the given population, full stop. It's a definition. The sampling error due to the distribution of the mean, for populations as large as we're likely to be talking about, is minuscule; that's not where the range comes from. A far bigger reason for looking at ranges is to account for different results for different populations and to account for varying assumptions around how to measure the infected population.

Very specifically, the range of R is not set by the two modes in a bimodal distribution. A bimodal distribution measures a single population and does not give evidence of more than one R value.

In a normal distribution, 99.7% of the population lies within 3 SDs of the mean. That is generally taken as "zip" for most scientific purposes - the classic "three sigmas" standard.

In the example distribution, by definition, 5% of the population - not 0.3% - lies at 6, more than 3 SDs from the mean.

Rules of thumb are dangerous things if you're not thinking about where they come from and what their consequences are. No good scientist should apply a 3 sigma test to data if the tails represent "low probability, high impact events", such as a super spreader of a virulent disease. Or failures of passenger aircraft.

That said, I think we're in violent agreement here. You'll notice my very next statement is that I'm not advocating using a normal distribution to model a non-normally distributed population (and never have). My comments about the tails are arguments for why it would be a bad idea to do so.

The blue population was something like a million data points simulated using the imaginary bimodal model. Then the mean and standard deviation of the number of infections passed on by each member of that population were calculated. That mean and standard deviation are fixed by the underlying bimodal model - once the simulated population is >>100 they converge on those values and won't change if you simulate a population of 1 billion. We're not "sampling" anything at this stage or invoking the central limit theorem.

You are invoking the central limit theorem when you say you are converging on the mean after >>100 samples.

The mean is the value of "R" for that population - but that value alone is useless for any practical purpose without some measure of the uncertainty. The usual way of doing that is to take the standard deviation and assume something like the 95% rule - but at that point you are assuming standard deviation and modelling the population as the orange line.

This is fundamentally conflating two very distinct things: the normal distribution of estimates of the sample mean about the population mean (the distribution of R) and the distribution of the population itself. Those are very different things, and the orange line is neither of those things, it is a third thing.

We defined two different "populations" when we came up with the extreme bimodal distribution. Two cities is just one way of describing it (which would be trivially easy to untangle) but the "populations" could just as easily be different virus variants, different age groups, different socio/economic groups (we know all those things are significant with covid) that would be harder to untangle and target.

Untangle. That was the whole point of my statement that the histogram is not the data. One histogram is one population. If you retained the data, you could slice it in different ways looking for ways to untangle the population. You could look at zip code, gender, occupation, genetic markers, age, initial viral load, etc, etc. If you find a parameter that allows you to sort one large population into two separate populations, you can look at R for the sub populations.

By presenting only the histogram of the population, however, you can't make any assumptions at all. You've destroyed the information you'd need to do that. Are we seeing the difference in social behaviors between San Francisco and Sturgis, or are we seeing the hand of god casting a plague on individual sinners? Can't say. You can't look at the distribution of one population and say "there are two distinct R values". One population has one R value.
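As a purely hypothetical illustration (made-up field names, made-up data): if the per-case records had been kept instead of only the histogram, the untangling might look something like this:

Python:
# Hypothetical sketch: per-case records (invented "city" field) let you slice
# one pooled population into sub-populations and compute R for each.
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
cases = pd.DataFrame({"city": rng.choice(["A", "B"], size=10000, p=[0.95, 0.05])})
# Pretend City B is where the super-spreading happens
cases["onward_infections"] = np.where(cases["city"] == "B", 6, 2)

print("R for the pooled population:", cases["onward_infections"].mean())   # ~2.2
print(cases.groupby("city")["onward_infections"].mean())                   # A: 2, B: 6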

If you want to understand different behaviors in sub populations, you have to aggregate them differently. At the risk of over-repeating myself, this is what I've been saying:
At the national or provincial level, the numbers are big enough that the statistics are more clear— the data within those populations are lumpy; cities might have reasonable variances but their values will differ relative to each other and to the national numbers, for example. Neighborhoods and households probably start to get small enough that the data is pretty meaningless in isolation, which is why they tend to aggregate it in different ways (neighborhoods of color, income brackets, senior living developments, etc).

Except you've taken data with a distribution that strongly suggests that there are two populations

How do you figure? There are two clusters in the data, but by definition one histogram describes one population, and the idea that this represents two separable populations is based solely on an unsubstantiated model in your head.

that have persisted even with a large number of infections (so it's not the "first six" problem)
If the first person infects 6 others, you will have 3 times as many cases as you would if the first person infects 2 people. The numbers are a product of all the earlier epochs.

applied that distribution to a very simple model that assumes that there is only one population (anybody can infect anybody in the whole population).

Yes, that's the only population you gave me.
What you've come up with is literally just an obfuscated way of calculating the mean of the bimodal distribution.
Yes, you can find R by fitting the data to the model of exponential growth. For the record though, I didn't come up with an obfuscated way of calculating the mean-- after several posts saying "typically you'd fit the data to a model and read out the value of R from the model", I showed you how it would be done.

We know that covid spread is strongly correlated with where you live, socio/economics, variants etc. so the reality is going to be more complicated than your model.
Wait. It was your model...

The point is not the math that you're doing, but the undeclared assumptions you are making before even getting started. That's why you should look carefully at your data - its distribution, any correlations, possible systemic biases etc. - before throwing it into the statistics sausage machine - because that will turn anything into a sausage with 95% confidence.

I feel like there's a weird role reversal going on. You presented a model, then started calling it my model. Then you started making the unsupported assumption that your histogram can be neatly partitioned into two groups of people. I'm simply pointing out that assumption has no basis in the data. The point I'm making isn't about your assumptions being wrong, but the math you're doing.

 