Can someone check my Stats?

Grendel · Mar 30, 2018

I've forgotten all my stat-math.

Here's my problem:

In a population of 20 million people I know that 53 people have the surname 'X'. (I need to know random values, ignoring family and tribal connections)

If I reach into the people-bucket and pluck out one person, then I calculate the chances of that person being X are (20,000,000 / 53) = 370,000:1 against. To be statistically confident of plucking out X I would have to scoop out 370,000 people at once.

But if I can only scoop out 20,000 per dip, then what are the odds of 'X' being in that 20K scoop? I work this out at

A: (370,000 / 20,000) = 18.5

Now what are the chances of a random 20K scoop containing two 'X's?

B: (18.5 x 18.5) = 340:1 or 6.8 million people.

Now what are the chances of them both having the same initial?

C: (340 x 26 x 26) = 230,000:1 or 4.6 billion people.

I would have to scoop the bucket 230,000 times (with 20K scoops) or I would need a population of 4.6 billion people to be confident that two 'X's have the same initial?

Is this correct?

Cheers Greg

Swammerdami · Mar 30, 2018

As a general principle, remember to distinguish "A to 1 against" odds from "B for 1 against." B = A+1

Another useful idea is to be precise: 20000000/53 is over 377,000, not 370,000. Approximations are fine for many purposes, but why impose unnecessary burdens when seeking help?

Grendel said:
If I reach into the people-bucket and pluck out one person, then I calculate the chances of that person being X are (20,000,000 / 53) = 370,000:1 against.

[X]: To be statistically confident of plucking out X I would have to scoop out 370,000 people at once.

But if I can only scoop out 20,000 per dip, then what are the odds of 'X' being in that 20K scoop? I work this out at

A: (370,000 / 20,000) = 18.5

Now what are the chances of a random 20K scoop containing two 'X's?

B: (18.5 x 18.5) = 340:1 or 6.8 million people.

[C]: Now what are the chances of them both having the same initial?...

[X] Don't be too confident! You'll fail to find X in that sample 36.8% of the time. (That's the reciprocal of Euler's Number.)
Intuition: The expected (average) number of X you'll find is 1.0 exactly. But sometimes you'll find more than 1; therefore sometimes you'll find less.
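
The 36.8% figure can be checked numerically. This is a minimal sketch that assumes independent draws (sampling with replacement) and uses the unrounded sample size 20,000,000/53; the variable names are illustrative only.

```python
import math

POPULATION = 20_000_000
X_COUNT = 53

# Sample size at which the expected number of X's is exactly 1.
sample_size = POPULATION / X_COUNT  # about 377,358

# Chance that one randomly drawn person is not an X.
p_not_x = 1 - X_COUNT / POPULATION

# Chance that every person in the sample is a non-X,
# assuming independent draws (with replacement).
p_miss = p_not_x ** sample_size

print(f"P(no X in sample) = {p_miss:.4f}")
print(f"1/e               = {1 / math.e:.4f}")
```

The two printed values should agree closely, which is why the miss rate is described as the reciprocal of Euler's number.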

[A] Your calculation yields 18.87 for 1 (or 17.87 to 1), not 18.5 to 1. But the calculation would be incorrect anyway, for reasons related to (x).
A better approximation(*) is that success occurs with probability 1 - exp(20000* ln(1 - 53/20000000)) = 5.16%
By coincidence this is expressed as 18.4 to 1 — almost the figure you give. Sometimes two wrongs do make a right!

(* - this formula works for sampling with replacement. It's good enough here because X's density, starting at 53 units, gets up only to 53.053 units after 19,999 non-X withdrawals.)
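
The with-replacement formula above can be evaluated directly. The sketch below assumes nothing beyond the numbers in the thread; the function name `p_at_least_one` is made up for illustration.

```python
import math

POPULATION = 20_000_000
X_COUNT = 53
SCOOP = 20_000

def p_at_least_one(population, marked, draws):
    """P(at least one marked item) when drawing with replacement."""
    p_single = marked / population
    # log1p keeps precision when p_single is tiny.
    return 1 - math.exp(draws * math.log1p(-p_single))

p_hit = p_at_least_one(POPULATION, X_COUNT, SCOOP)
odds_against = (1 - p_hit) / p_hit

print(f"P(at least one X) = {p_hit:.4%}")
print(f"odds against      = {odds_against:.2f} to 1")
```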

[B] Repeating this calculation your way, but with the corrected numbers just shown, produces 1 chance in 375.3, written 374.3 : 1.
But your approach is incorrect; more accurate odds would be 737 : 1.
Why? Your multiplication assumes that the chance an unknown group will have an X is identical to the chance that a group known to have at least one X (though which element is X is unknown) will have a second X. But in fact these X-nesses are not independent.
I don't think it's a coincidence that the correct odds are almost exactly half what your incorrect calculation gives. (The correct odds of THREE or more X's would be 1/6 of what your calculation produces. 6 = 3! I'll leave a demonstration of this as an exercise! ... An exercise for myself; I've forgotten much of what I once knew. )
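
To see the gap between the squaring shortcut and the chance of two or more X's, here is a rough sketch using the with-replacement (binomial) model. That model is an approximation, as the footnote above says, and the variable names are illustrative.

```python
POPULATION = 20_000_000
X_COUNT = 53
SCOOP = 20_000

p = X_COUNT / POPULATION

# Binomial probabilities for a scoop drawn with replacement.
p_zero = (1 - p) ** SCOOP
p_one = SCOOP * p * (1 - p) ** (SCOOP - 1)
p_two_or_more = 1 - p_zero - p_one

# The flawed shortcut: squaring the chance of at least one X.
p_squared = (1 - p_zero) ** 2

print(f"P(two or more X)  = {p_two_or_more:.6f}")
print(f"  as '1 chance in' = {1 / p_two_or_more:.1f}")
print(f"squared shortcut  = {p_squared:.6f}")
print(f"shortcut / actual = {p_squared / p_two_or_more:.3f}")
```

The last line prints how many times larger the shortcut is than the binomial answer.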

[C] Do you think the 26 letters are equally likely as middle initials? I don't ... and suggest you first refine your approaches for (A) and (B) anyway.

beero1000 · Mar 30, 2018

Swammerdami said:
[A] Your calculation yields 18.87 for 1 (or 17.87 to 1), not 18.5 to 1. But the calculation would be incorrect anyway, for reasons related to (x).
A better approximation(*) is that success occurs with probability 1 - exp(20000* ln(1 - 53/20000000)) = 5.16%
By coincidence this is expressed as 18.4 to 1 — almost the figure you give. Sometimes two wrongs do make a right!
(* - this formula works for sampling with replacement. It's good enough here because X's density, starting at 53 units, gets up only to 53.053 units after 19,999 non-X withdrawals.)

The calculation for sampling without replacement is not too difficult, and it's good to see how much that changes the answer. I get 1 - C(19999947,20000)/C(20000000,20000) = 0.0516452 for the probability without replacement and 1 - (19999947/20000000)^20000 = 0.0516201 for the probability with replacement. That is the difference between odds of 18.36:1 and 18.37:1.
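
One way to reproduce both figures without evaluating the huge binomial coefficients is sketched below. It relies on the ratio C(s-53, w)/C(s, w) cancelling down to a product of 53 fractions. This is not necessarily how the numbers above were obtained, and the function names are invented for the example.

```python
POPULATION = 20_000_000
X_COUNT = 53
SCOOP = 20_000

def p_no_hits_without_replacement(population, marked, draws):
    """C(population - marked, draws) / C(population, draws) as a short product."""
    prob = 1.0
    for i in range(marked):
        prob *= (population - draws - i) / (population - i)
    return prob

def p_no_hits_with_replacement(population, marked, draws):
    """Every one of the independent draws misses the marked items."""
    return (1 - marked / population) ** draws

p_without = 1 - p_no_hits_without_replacement(POPULATION, X_COUNT, SCOOP)
p_with = 1 - p_no_hits_with_replacement(POPULATION, X_COUNT, SCOOP)

print(f"without replacement: {p_without:.7f}")
print(f"with replacement:    {p_with:.7f}")
```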

Repeating this calculation your way, but with the corrected numbers just shown, produces 1 chance in 375.3, written 374.3 : 1.
But your approach is incorrect; more accurate odds would be 737 : 1.

Why? Your multiplication assumes that the chance an unknown group will have an X is identical to the chance that a group known to have at least one X (though which element is X is unknown) will have a second X. But in fact these X-nesses are not independent.

For the sample without replacement, the probability is 1 - (C(19999947,20000) + C(53,1)C(19999947,19999))/C(20000000,20000) = 0.00133195, giving odds of around 750:1.
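
The same trick extends to "two or more" hits. The sketch below builds the hypergeometric probabilities from P(0 hits) using the ratio of consecutive terms, so no large factorials are needed; `hypergeom_pmf_upto` is a hypothetical helper name, not a library function.

```python
POPULATION = 20_000_000
X_COUNT = 53
SCOOP = 20_000

def hypergeom_pmf_upto(population, marked, draws, k_max):
    """Return [P(exactly k marked) for k = 0..k_max], sampling without replacement."""
    # P(0 hits) = C(population - marked, draws) / C(population, draws)
    p = 1.0
    for i in range(marked):
        p *= (population - draws - i) / (population - i)
    pmf = [p]
    # Ratio of consecutive hypergeometric terms gives each next probability.
    for k in range(k_max):
        p *= (marked - k) * (draws - k) / ((k + 1) * (population - marked - draws + k + 1))
        pmf.append(p)
    return pmf

p0, p1 = hypergeom_pmf_upto(POPULATION, X_COUNT, SCOOP, 1)
p_two_or_more = 1 - p0 - p1
odds_against = (1 - p_two_or_more) / p_two_or_more

print(f"P(two or more X) = {p_two_or_more:.8f}")
print(f"odds against     = {odds_against:.0f} to 1")
```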

I don't think it's a coincidence that the correct odds are almost exactly half what your incorrect calculation gives. (The correct odds of THREE or more X's would be 1/6 of what your calculation produces. 6 = 3! I'll leave a demonstration of this as an exercise! ... An exercise for myself; I've forgotten much of what I once knew. )

I think it might actually just be a coincidence. Try doing the calculations with different numbers.

Swammerdami · Mar 30, 2018

beero1000 said:
The calculation for sampling without replacement is not too difficult ...

I don't think it's a coincidence that the correct odds are almost exactly half what your incorrect calculation gives. (The correct odds of THREE or more X's would be 1/6 of what your calculation produces. 6 = 3! I'll leave a demonstration of this as an exercise! ... An exercise for myself; I've forgotten much of what I once knew. )

I think it might actually just be a coincidence. Try doing the calculations with different numbers.

The formula isn't difficult. I just didn't know a fast way to calculate large c(,) or factorial on my machine and was too lazy/groggy to try to apply Stirling's approximation.
I see that Wolfram Alpha will calculate those large c(,), though not very quickly. How do you do it?

Yes, I could have divided all the large numbers by 10 — then my machine would handle them, though still slowly. But as I said, and you agreed, the with-replacement approximation was good enough here.

As for p2 ~= .5 * (1-p0)^2 when 20,000,000 >> 20,000 >> 53, where ">>" denotes "MUCH greater than" and pk is probability of exactly k hits, I did, just now, succeed in proving this ... though with great effort(*). Effort so great that I won't attempt to prove the conjecture pk ~= (1/k!) * (1-p0)^k

I suspect that if/when my brain unfogs I'll stumble on a familiar asymptotic formula with these approximations readily derived.

(* - Not "effort" in the sense of a mathematical challenge. Just effort in doing routine but tedious algebraic manipulations. I used to be better than this ... really!)

Swammerdami · Mar 30, 2018

Swammerdami said:
... when 20,000,000 >> 20,000 >> 53 >> 1, where ">>" denotes "MUCH greater than" and pk is probability of exactly k hits, ... I won't attempt to prove the conjecture pk ~= (1/k!) * p0 * (1-p0)^k
[Note corrections in red]

And as soon as I stepped away from keyboard, proof became trivial!

Let's substitute s = 20 million; w = 20 thousand. We'll leave 53 alone.
Given s >> w >> 53 >> 1, k, we seek to prove that

pk = C(53,k) C(s-53,w-k) / C(s,w)

is approximated with

pk ~= p0 (1 - p0) ^ k / k!

Change the C(.) to factorials:

pk = 53! (s-53)! w! (s-w)! / [ (w-k)! (s-53-w+k)! s! k! (53-k)! ]

Change a! / (a-b)! to the approximation a^b whenever a >> b; and rearrange a bit to get

k! pk ~= 53^k w^k (s-w)^53 / [ s^53 (s-w)^k ]

Setting k = 0 gives p0 ~= (s-w)^53 / s^53; recall that, when s >> w >> 53, p0 ~= 1 - 53w/s, or

1 - p0 ~= 53w/s

Observing that (s/(s-w))^k ~= 1, substitutions now produce

k! pk ~= p0 (1-p0)^k

Q.E.D.
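
As a rough numerical check of how tight the approximation is for the thread's numbers, the sketch below compares it with exact hypergeometric values for small k. The helper `hypergeom_pmf_upto` is a made-up name, and the exact values come from the ratio of consecutive terms rather than from large factorials.

```python
import math

S = 20_000_000   # population (s)
W = 20_000       # scoop size (w)
MARKED = 53      # people with surname X

def hypergeom_pmf_upto(population, marked, draws, k_max):
    """Return [P(exactly k marked) for k = 0..k_max], sampling without replacement."""
    p = 1.0
    for i in range(marked):
        p *= (population - draws - i) / (population - i)
    pmf = [p]
    for k in range(k_max):
        p *= (marked - k) * (draws - k) / ((k + 1) * (population - marked - draws + k + 1))
        pmf.append(p)
    return pmf

exact = hypergeom_pmf_upto(S, MARKED, W, 3)
p0 = exact[0]

# Approximation from the derivation: pk ~= p0 * (1 - p0)^k / k!
approx = [p0 * (1 - p0) ** k / math.factorial(k) for k in range(4)]
rel_errors = [abs(a - e) / e for a, e in zip(approx, exact)]

for k in range(4):
    print(f"k={k}: exact={exact[k]:.3e}  approx={approx[k]:.3e}  rel.err={rel_errors[k]:.1%}")
```

Because 53 is only moderately larger than 1, the approximation is close but not exact: for these inputs the relative error is a few percent.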

Grendel · Mar 30, 2018

ummmmm ....

So, what are the odds of two people, surname X, initial A,
occurring in a random 20k sample,
when the surname X accounts for only 53 names in 20 million

?

Greg

Jokodo · Mar 31, 2018

Grendel said:
ummmmm ....

So, what are the odds of two people, surname X, initial A,
occurring in a random 20k sample,
when the surname X accounts for only 53 names in 20 million

?

Greg

Without knowing the frequency of initial A.?

Grendel · Mar 31, 2018

1 in 26 letters of the alphabet?

Lion IRC · Mar 31, 2018

Most popular boy/girl baby names in 1950:

1. James / Linda
2. Robert / Mary
3. John / Patricia
4. Michael / Barbara
5. David / Susan
6. William / Nancy
7. Richard / Deborah
8. Thomas / Sandra
9. Charles / Carol
10. Gary / Kathleen

Here's the list from last year:

1. Jacob / Emily
2. Michael / Isabella
3. Ethan / Emma
4. Joshua / Ava
5. Daniel / Madison
6. Christopher / Sophia
7. Anthony / Olivia
8. William / Abigail
9. Matthew / Hannah
10. Andrew / Elizabeth

Wait - I forgot Ahmed, Abdel, Ali, Ashraqat, Aya...

Grendel · Mar 31, 2018

Lion IRC said:
Most popular boy/girl baby names in 1950:

Wait - I forgot Ahmed, Abdel, Ali, Ashraqat, Aya...

Doesn't ass start with A?

Grendel · Mar 31, 2018

So, what are the odds of two people, surname X, initial A
occurring in a random 20k sample,
when the surname X accounts for only 53 names in 20 million and there are only 26 possible initials

?

Greg

Jokodo · Apr 1, 2018

Grendel said:
1 in 26 letters of the alphabet?

All with equal frequency as first initial? Highly doubtful.

In Lion IRC's list (for what it's worth, being as it is an unsourced piece of copy-pasta), there are four names with A. (Ava, Anthony, Abigail, Andrew), one with C. (Christopher), one with D. (Daniel), four with E. (Emily, Ethan, Emma, Elizabeth), one with H. (Hannah), one with I. (Isabella), two with J. (Jacob, Joshua), three with M. (Michael, Madison, Matthew), and Olivia, Sophia and William with O., S., W. All other letters don't appear.

Jokodo · Apr 1, 2018

Grendel said:
So, what are the odds of two people, surname X, initial A
occurring in a random 20k sample,
when the surname X accounts for only 53 names in 20 million and there are only 26 possible initials

?

Greg

That depends on the frequencies of the different initials. If all initials have the exact same frequency, the chance that two random people have the same initial is 26 * (1/26)², which can be simplified to 1/26 precisely. You can think of it as drawing the two people successively. On the first draw, the chance that the person will have one of the 26 letters is 1, and since it doesn't affect later calculations which one it is, we can run with that figure. On the second draw, the chance that the person has the exact same letter, whichever it was, will be 1/26, or 3.846%

In a more skewed scenario, say where the three most frequent initials each have a share of 15% and the remaining 23 letters are evenly distributed, the chance more than doubles to 8.065%. The simplification from above is not applicable here, so the probability of two people having the same initial can only be expressed as the sum of the squares of each initial's frequency, so in this scenario 3*0.15² + 23*0.023913². Bottom line, it really does depend on the distribution of initials; just knowing how many letters there are isn't enough information.
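
The sum-of-squares rule is easy to try on any distribution of initials. Below is a minimal sketch; the function name and the two example distributions (uniform, and the skewed 3 x 15% scenario described above) are set up purely for illustration.

```python
def p_same_initial(shares):
    """Chance two independent people share an initial, given each initial's share."""
    if abs(sum(shares) - 1.0) > 1e-9:
        raise ValueError("shares must sum to 1")
    return sum(share ** 2 for share in shares)

# All 26 initials equally likely.
uniform = [1 / 26] * 26

# Three initials at 15% each, the other 55% split evenly over 23 letters.
skewed = [0.15] * 3 + [0.55 / 23] * 23

print(f"uniform: {p_same_initial(uniform):.3%}")
print(f"skewed:  {p_same_initial(skewed):.3%}")
```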

In a first approximation, you can multiply this probability for two equal initials in a set of two with the probability you obtained for getting two people with the same surname, though the result will be a slight underestimation of the value you're looking for: Some (though few) of your "successful" samples (i.e., where there are at least two people with that surname) will contain 3 or more people with that surname, which significantly increases the chance that two of them will have the same initial.
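
The size of that underestimation can be gauged with a short calculation. The sketch below assumes 26 equally likely, independently assigned initials (an idealisation, as discussed above) and weights a birthday-problem term by the chance of finding exactly k people with the surname; the helper names are hypothetical.

```python
POPULATION, X_COUNT, SCOOP, INITIALS = 20_000_000, 53, 20_000, 26

def hypergeom_pmf_upto(population, marked, draws, k_max):
    """Return [P(exactly k marked) for k = 0..k_max], sampling without replacement."""
    p = 1.0
    for i in range(marked):
        p *= (population - draws - i) / (population - i)
    pmf = [p]
    for k in range(k_max):
        p *= (marked - k) * (draws - k) / ((k + 1) * (population - marked - draws + k + 1))
        pmf.append(p)
    return pmf

def p_shared_initial(k, initials):
    """P(at least two of k people share an initial), uniform independent initials."""
    p_all_distinct = 1.0
    for j in range(k):
        p_all_distinct *= max(initials - j, 0) / initials
    return 1 - p_all_distinct

pmf = hypergeom_pmf_upto(POPULATION, X_COUNT, SCOOP, X_COUNT)

# First approximation: P(two or more X) times P(two people share an initial).
naive = (1 - pmf[0] - pmf[1]) / INITIALS

# Refinement: account for scoops that contain three or more X's.
refined = sum(p_k * p_shared_initial(k, INITIALS) for k, p_k in enumerate(pmf))

print(f"first approximation: {naive:.3e}")
print(f"refined:             {refined:.3e}")
```

Under these assumptions the refined figure comes out only slightly above the first approximation, consistent with the "slight underestimation" described above.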

Jokodo · Apr 1, 2018

Using the figures in this post and keeping in mind that they only add up to just under 90% (since this guy's source list only included the most frequent names), the actual figure seems to be close to 7% for the US, still almost twice what we would get with an unbiased distribution of initials (6.815% is what I get, but the number is approximate anyway since not all names are included in the raw data). Calculated with the following formula, where the division by ~0.9 ensures that the figures add up to 100%, and initials_all is the parsed frequencies from the link: sum([(entry/100/0.89994) ** 2 for entry in initials_all]).
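
The one-line formula can be wrapped in a small function. The sketch below assumes the 0.89994 divisor is simply the listed percentages' total divided by 100, so it rescales by the sum instead of a hard-coded constant. The input shown is made-up placeholder data, not the figures from the linked post.

```python
def p_same_initial_from_percentages(percentages):
    """Sum of squared shares after rescaling the listed percentages to total 100%."""
    coverage = sum(percentages)  # e.g. just under 90 for a truncated list
    return sum((value / coverage) ** 2 for value in percentages)

# Made-up placeholder figures, NOT the data from the linked post.
toy_percentages = [50.0, 25.0, 15.0]
toy_result = p_same_initial_from_percentages(toy_percentages)

print(f"toy example: {toy_result:.4f}")
```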

What you did in your OP (besides pretending that all initials have the same frequency) is using the formula for the chance that both have a particular initial when the question should have been the chance that both have the same, whatever it may be. Correcting this alone means that you have to replace <chance_of_same_surname>*<chance_of_same_initial>² with <chance_of_same_surname>*<chance_of_same_initial>, increasing the chance of success by a factor of 26, and more if taking into account a possibly biased distribution of initials (factor of 46 with the data I found above).
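
The two factors mentioned here follow from dividing the same-initial probability by the (1/26)² used in the opening post. A small sketch of that arithmetic, with an illustrative function name:

```python
INITIALS = 26

def correction_factor(p_same_initial, initials=INITIALS):
    """How much larger p_same_initial is than the (1/initials)**2 from the opening post."""
    return p_same_initial / (1 / initials) ** 2

factor_uniform = correction_factor(1 / 26)   # equally likely initials
factor_us = correction_factor(0.06815)       # approximate figure quoted above

print(f"uniform initials: x{factor_uniform:.0f}")
print(f"quoted US figure: x{factor_us:.0f}")
```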

Similarly, if you're actually not looking for a particular surname but for the probability of two people with any rare surname in such a sample, your chance again grows by quite a large factor.

Grendel · Apr 1, 2018

Dear Peoples,

Let's leave all anthropology, all tribal naming conventions, aside.

All I need is the probability. That is, the odds.

In a bucket of 20,000,000 marbles, 53 are red, all the rest are blue. Red may come in 26 different shades, some, not necessarily all, are present in the 53. (All shades of red are equally weighted and have an equal probability of occurring)

What are the odds that a random scoop of 20,000 marbles will contain 2 red balls, both of the same shade.

Greg ??

Lion IRC · Apr 1, 2018

Don't all 26 shades of red have to necessarily be present among the 53 in order to have an
"equal probability of occurring"?

Swammerdami · Apr 1, 2018

Grendel said:
In a bucket of 20,000,000 marbles, 53 are red, all the rest are blue. Red may come in 26 different shades, some, not necessarily all, are present in the 53. (All shades of red are equally weighted and have an equal probability of occurring)

What are the odds that a random scoop of 20,000 marbles will contain 2 red balls, both of the same shade.

I'm going to approximate your 53 with 52, which happens to be a multiple of 26.

Now. "all shades of red ... have an equal probability of occurring" can be interpreted in at least two ways.
(a) Each shade of red occurs exactly twice — the probabilities 2/20,000,000 are all equal to each other.
(b) A random process assigned the colors independently, with equal probabilities.
In case (a) multiply the answer previously given by 1/51. In case (b) multiply by 1/26.
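
Both multipliers can be confirmed with exact fractions. This sketch follows the two interpretations as stated (52 red marbles, 26 shades); the variable names are illustrative.

```python
from fractions import Fraction
from math import comb

REDS = 52
SHADES = 26

# Case (a): each shade appears exactly twice among the 52 reds.
# Count the pairs of reds that match, out of all possible pairs of reds.
per_shade = REDS // SHADES
same_shade_pairs = SHADES * comb(per_shade, 2)
case_a = Fraction(same_shade_pairs, comb(REDS, 2))

# Case (b): each red marble's shade was assigned independently and uniformly.
case_b = SHADES * Fraction(1, SHADES) ** 2

print(f"case (a): {case_a}")
print(f"case (b): {case_b}")
```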

Grendel said:
Dear Peoples,

Let's leave all anthropology, all tribal naming conventions, aside.

All I need is the probability. That is, the odds.

Real probabilities arise in the real world. Real statisticians address real problems. You seem to have some abstract notion of probability that doesn't mesh with the real world.

If I tell you I saw an animal just now, and that it was either a dog, a dolphin, or a unicorn; a real statistician might ask how close I am to the ocean and guesstimate a probability mass function like (99%, 1%, 0). It sounds like you'd opt for (1/3, 1/3, 1/3) .

Grendel · Apr 1, 2018

Lion IRC said:
Don't all 26 shades of red have to necessarily be present among the 53 in order to have an "equal probability of occurring"?

No

Grendel · Apr 1, 2018

O God, let me try again.

In a bucket of 20 million balls 52 are red, and the rest are blue.

The chance of reaching into the bucket and picking out a red ball are:

(20,000,000 / 52) = 384,615

Meaning that the probability of drawing out a red ball is 1 in 384,615 chances.

Meaning that I would have to reach into the bucket 384,615 times in order to be confident of drawing out a red ball. There are 384,614 chances that a random draw will result in a blue ball and only one that it will result in a red ball.

Can we agree on that?

OK?

Grendel · Apr 1, 2018

OK ... if we agree on the above, then the chances of drawing out a red ball in a scoop of 20K balls is:

(384,615 / 20,000) = 19.23

Meaning that if I reach into the bucket and in a single scoop remove 20,000 balls then there are 18.23 chances all the balls will be blue and 1 chance one ball will be red in the 20,000.

can we agree on that?

If not, then what is the chance of drawing a red ball in a random scoop of 20,000.

If we can just settle that for the moment then we can look at shades of red later. OK?

cheers ... Greg
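
For the 52-red-ball version of the question, the methods given earlier in the thread can be applied directly. The sketch below computes the chance of at least one red ball in a single 20,000 scoop without replacement, alongside the expected number of red balls per scoop. The 19.23 above is the reciprocal of that expected count, not the chance itself.

```python
POPULATION = 20_000_000
RED = 52
SCOOP = 20_000

# P(all blue) = C(POPULATION - RED, SCOOP) / C(POPULATION, SCOOP),
# written as a product of RED fractions.
p_all_blue = 1.0
for i in range(RED):
    p_all_blue *= (POPULATION - SCOOP - i) / (POPULATION - i)

p_red = 1 - p_all_blue
expected_reds = RED * SCOOP / POPULATION

print(f"P(at least one red)   = {p_red:.4%}")
print(f"  as '1 chance in'    = {1 / p_red:.2f}")
print(f"expected reds / scoop = {expected_reds:.3f}")
print(f"1 / expected reds     = {1 / expected_reds:.2f}")
```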
