Intro To Statistics

steve_bank · 2024-07-17T16:07:00-0700

A simple statistical simulation.

There are two general forms of random sampling: sampling with replacement and sampling without replacement.

Put 80 red and 20 blue balls in a box. Shake and pick one, put it back in, shake, and repeat. This is sampling with replacement: the probability of picking red or blue does not change from trial to trial.

If the ball is not returned to the box, the probability of blue or red changes on the next trial. This is sampling without replacement.

A shipment of 100,000 widgets is delivered to your factory. The colors are nominally 80% red and 20% blue. You need to estimate the distribution of colors before accepting the shipment.
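As a rough illustration of the difference, here is a minimal sketch using the 80 red / 20 blue box. The helper name p_red_next and the assumed sequence of draws (two reds in a row) are made up for the example; random.choices and random.sample are the standard library calls for drawing with and without replacement.

```python
import random
from fractions import Fraction

def p_red_next(red, blue):
    """Probability that the next ball drawn is red."""
    return Fraction(red, red + blue)

red, blue = 80, 20

# With replacement: the box is restored after each draw, so p never changes.
p_with = [p_red_next(red, blue) for _ in range(3)]

# Without replacement: suppose (hypothetically) each earlier draw was red,
# so there is one fewer red ball in the box every time.
p_without = [p_red_next(red - k, blue) for k in range(3)]

print("with replacement:   ", [str(p) for p in p_with])
print("without replacement:", [str(p) for p in p_without])

# The standard library covers both schemes on an explicit box of balls.
box = ["red"] * red + ["blue"] * blue
rng = random.Random(1)                  # fixed seed, only for repeatability
with_repl = rng.choices(box, k=10)      # the same ball may be picked again
without_repl = rng.sample(box, k=10)    # each ball is picked at most once
```

With 100,000 widgets and a sample of a few hundred, the two schemes give nearly the same probabilities, which is presumably why sampling with replacement is good enough for the simulation.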

If you run the code, the cumulative distribution plots of the sample means show a normal distribution. This is a consequence of the Central Limit Theorem (CLT). It allows the use of normal statistics to estimate the true value of a population's parameters from a sample, even if the underlying distribution is not normal. In this case the population is not normal. There are exceptions to the CLT (distributions without a finite variance, for example), but I never saw one at work.
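A quick way to check the normal behavior numerically, without looking at a plot, is to count how many sample means land within 1.96 standard errors of the true value; for a normal distribution that should be roughly 95%. This is only a sketch: the seed, the number of trials, and the variable names are arbitrary choices, and each ball is simply drawn as red with probability 0.8.

```python
import math
import random
import statistics

rng = random.Random(12345)   # fixed seed, only so reruns match
p_red = 0.80                 # nominal fraction of red
n = 200                      # balls per sample
trials = 2000                # number of repeated samples (arbitrary)

means = []
for _trial in range(trials):
    reds = sum(1 for _ball in range(n) if rng.random() < p_red)
    means.append(reds / n)   # fraction red = mean of 0/1 outcomes

# Theoretical standard error of the fraction red for samples of size n.
se = math.sqrt(p_red * (1 - p_red) / n)

# Share of sample means within 1.96 standard errors of the true value.
inside = sum(1 for m in means if abs(m - p_red) <= 1.96 * se)
frac_inside = inside / trials

print("mean of sample means:", statistics.mean(means))
print("std of sample means: ", statistics.stdev(means), " theory:", se)
print("fraction within 1.96 standard errors:", frac_inside)
```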

The cumulative distribution is the plot of the data in ascending order versus the percentage of points at or below each value. Data can have a raggedy histogram but a clear CDF. The CDF is the integral of the PDF, and the integration acts as a smoothing function.
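The construction described above can be sketched in a few lines. The function name empirical_cdf and the eight red counts are made up for illustration; the percent formula is the same 100*(i+1)/n used in the code further down.

```python
def empirical_cdf(data):
    """Return (sorted values, cumulative percent) for an empirical CDF plot."""
    xs = sorted(data)
    n = len(xs)
    # Point i (0-based) has i+1 values at or below it.
    pct = [100.0 * (i + 1) / n for i in range(n)]
    return xs, pct

# Hypothetical red counts from eight samples of 200 balls.
counts = [158, 163, 160, 155, 161, 166, 160, 157]
xs, pct = empirical_cdf(counts)
for x, p in zip(xs, pct):
    print(x, p)
```

Plotting xs against pct gives the CDF curve; the last point is always at 100%.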

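Intuitively we average repeated measurements to estimate the true value. A distribution has an expected value; for a normal distribution it is also the most probable value. If I remember right from maximum likelihood estimators, you write the likelihood of the observed data as a function of the unknown parameter, take the derivative with respect to that parameter, set it to zero, and solve for the parameter value that maximizes the likelihood.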

For a normal distribution the maximum likelihood estimator of the mean is the arithmetic mean.

Run the code for increasing sample sizes and the estimate converges on 80% for red.
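For reference, the textbook derivation goes roughly like this, assuming n independent measurements x_1 ... x_n from a normal distribution with unknown mean \mu and known variance \sigma^2. The second part applies the same recipe to the red/blue case, assuming k red balls in n independent draws with unknown red probability p.

```latex
% Normal case: log-likelihood of the mean
\ln L(\mu) = -\frac{n}{2}\ln\!\left(2\pi\sigma^2\right)
             - \frac{1}{2\sigma^2}\sum_{i=1}^{n}\left(x_i-\mu\right)^2

% Differentiate with respect to \mu and set to zero
\frac{d\,\ln L}{d\mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}\left(x_i-\mu\right) = 0
\quad\Longrightarrow\quad
\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i

% Red/blue case: k red balls in n independent draws
\ln L(p) = k\ln p + (n-k)\ln(1-p) + \text{const}

\frac{d\,\ln L}{dp} = \frac{k}{p} - \frac{n-k}{1-p} = 0
\quad\Longrightarrow\quad
\hat{p} = \frac{k}{n}
```

In both cases the estimator is a plain average: the arithmetic mean of the measurements, or the fraction of red balls in the sample.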

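For random sampling with replacement, the spread of the estimate depends on the sample size, not on the population size; for a given proportion it shrinks as one over the square root of the sample size. A confidence interval is a range around the estimate, computed so that it covers the true value of the parameter at a stated confidence level, for example 95%.

The sample size needed for a given confidence interval can be calculated; there are online calculators.

There is plenty of information and there are examples on the net.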
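To put numbers on that, here is a minimal sketch of the usual normal-approximation interval for a proportion. The function name proportion_ci is made up, z = 1.96 corresponds to roughly 95% confidence, and the counts assume a sample that happens to come out at exactly 80% red.

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Normal-approximation interval for a proportion; z=1.96 is about 95%."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)   # standard error of p_hat
    return p_hat - z * se, p_hat + z * se

# Hypothetical samples that come out at exactly 80% red, at growing sizes.
for n in (50, 200, 800, 3200):
    lo, hi = proportion_ci(round(0.8 * n), n)
    print(n, round(lo, 4), round(hi, 4), "width", round(hi - lo, 4))
```

Each fourfold increase in sample size cuts the interval width in half, which is the square root behavior at work.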
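The calculation behind most of those calculators is, as far as I know, the normal-approximation formula n = z^2 p (1 - p) / E^2, where E is the half-width (margin of error) you want. A minimal sketch, with a made-up function name and assuming a rough guess at p is available:

```python
import math

def sample_size_for_proportion(p_guess, margin, z=1.96):
    """Smallest n giving the requested margin of error (normal approximation).

    p_guess: rough prior guess at the proportion (use 0.5 if unknown,
             which gives the largest, most conservative n)
    margin:  desired half-width of the interval, e.g. 0.05 for +/- 5 points
    z:       1.96 for about 95% confidence
    """
    n = (z ** 2) * p_guess * (1 - p_guess) / margin ** 2
    return math.ceil(n)

print(sample_size_for_proportion(0.8, 0.05))   # nominal 80% red, +/- 5 points
print(sample_size_for_proportion(0.8, 0.02))   # tighter margin needs more balls
print(sample_size_for_proportion(0.5, 0.05))   # worst case when p is unknown
```

Note that the population size does not appear in the formula; it only matters when the sample is a sizable share of the population.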

Confidence interval - Wikipedia (en.wikipedia.org)

The code takes a while to run, so be patient.

Code:

import math as ma
import array as ar
import statistics as st
import random as rn
import matplotlib.pyplot as plt

# Population values are integers 1-100:
# 1-80 are red balls, 81-100 are blue balls

npop = 100000  # population size
nsamp = 200  # sample size
niter = ma.floor(npop/nsamp)  # number of samples drawn
population = ar.array("i",npop*[0])  # population
red = ar.array("i",niter*[0])  # red count per sample
blue = ar.array("i",niter*[0])  # blue count per sample
pcred = ar.array("d",niter*[0])  # fraction red per sample
cd = ar.array("d",niter*[0])  # cumulative distribution, percent
# Each ball is red with probability 0.8, so the population is only nominally 80% red
for i in range(npop):
    population[i] = rn.randint(1,100)

print("Iterations  ",niter)
print("nsamp   ",nsamp)

for i in range(niter):
    bluecnt = 0
    redcnt = 0
    rn.shuffle(population)  # shake the box; this is the slow step
    for j in range(nsamp):
        s = rn.randint(0,npop-1)  # pick a ball and put it back
        if population[s] > 80:  # 81-100 is blue
            bluecnt = bluecnt + 1
        else:
            redcnt = redcnt + 1
    red[i] = redcnt
    blue[i] = bluecnt
    pcred[i] = redcnt/nsamp  # fraction of red in this sample

minblue = min(blue)
maxblue = max(blue)
minred = min(red)
maxred = max(red)
meanred = st.mean(pcred)
stdred = st.stdev(pcred)

print("counts min red   %d   max red   %d" %(minred,maxred))
print("counts min blue  %d  max blue  %d" %(minblue,maxblue))
print("mean fraction red  %.5f  standard deviation  %.5f" %(meanred,stdred))


red = sorted(red)
blue = sorted(blue)
pcred = sorted(pcred)
# cumulative distribution: percent of samples at or below each sorted value
for i in range(niter):
    cd[i] = 100*(i+1)/niter

##fname  = "c:\\python\\data\\samp.txt"
##f  = open(fname,"w");
##for i in range(niter):
##        s = ""
##        s = s + repr(cd[i])+"\t"+repr(red[i])+"\t"+repr(blue[i])+"\n"
##        f.write(s)
##f.close()
#for i in range(niter):print("%f\t  %d\t  %d\t  %d" %(cd[i],red[i],blue[i],red[i]+blue[i]))

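# CDF of the red count per sample
plt.grid(which='major', color='k',linestyle='-', linewidth=0.8)
plt.plot(red,cd,linewidth=2.0,color="k")
plt.xlabel("red count per sample")
plt.ylabel("cumulative percent")
plt.show()

# CDF of the blue count per sample
plt.grid(which='major', color='k',linestyle='-', linewidth=0.8)
plt.plot(blue,cd,linewidth=2.0,color="k")
plt.xlabel("blue count per sample")
plt.ylabel("cumulative percent")
plt.show()

# CDF of the fraction red per sample
plt.grid(which='major', color='k',linestyle='-', linewidth=0.8)
plt.plot(pcred,cd,linewidth=2.0,color="k")
plt.xlabel("fraction red per sample")
plt.ylabel("cumulative percent")
plt.show()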
