
Statistical Sampling

steve_bank

Discrete sampling is easy to emulate.

You receive a shipment of widgets with an expected distribution of three colors. You take a small sample to check the distribution of the discrete colors.

You can vary the population size and sample size and see what happens.

You can download Scilab or port the code. It would have been hard to generate the distributions in a macro.

Code:
clear
clc
mprintf("RUNNING\n\n")

function [s] = rand_seq(nr,ns)
// random sequence 1 - nr with no duplicates
// length ns
s = zeros(ns,1)
for i = 1:ns
   cnt = nr
   while(cnt > 0)
       flag = 1
       x = grand(1,1,"uin",1,nr)
       for j = 1:ns
              if(x ==  s(j))then flag = 0;end;             
       end //j
       if(flag == 1)then break;end;
       cnt = cnt - 1
   end //while   
   s(i) = x
end //i
endfunction


function [y] = rand_samp(samples,population)
    //pick random samples
    n = length(samples)
    for i = 1:n  y(i) = population(samples(i));end;
endfunction


npop = 1000
nsamp = 50
n1 = 50
n2 = 200
n3 = 750
for i = 1:npop
    if(i<=n1) then y(i) = 1;end;
    if(i>n1&&i<=n1+n2) then y(i) = 2;end;
    if(i>n1+n2) then y(i) = 3;end;
end

//randomize population
nshuf = length(y)
rx = rand_seq(nshuf,nshuf)
for i = 1:nshuf
    r(i) = y(rx(i))
end

//take samples
rs = rand_seq(npop,nsamp)
s = rand_samp(rs,r)

cnt = zeros(3,1)
for i = 1:nsamp
    if(s(i) == 1)then cnt(1) = cnt(1) + 1;end;
    if(s(i) == 2)then cnt(2) = cnt(2) + 1;end;
    if(s(i) == 3)then cnt(3) = cnt(3) + 1;end;
end

//sample percentages
pc1 =100. * cnt(1)/nsamp
pc2 =100. * cnt(2)/nsamp
pc3 =100. * cnt(3)/nsamp
 
//population percentages
pcn1 =100. * n1/npop
pcn2 =100. * n2/npop
pcn3 =100. * n3/npop

mprintf(" %5.1f   %5.1f   %5.1f\n",pcn1,pcn2,pcn3)
mprintf(" %5.1f   %5.1f   %5.1f\n",pc1,pc2,pc3)
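For readers without Scilab, here is a minimal Python sketch of the same check, standard library only. The three color counts (50, 200, 750 out of 1000) follow the n1/n2/n3 values in the script above; the labels 1, 2, 3 stand for the colors.

```python
import random
from collections import Counter

random.seed(1)  # fixed seed so the run is repeatable

# Population of 1000 widgets in three colors: counts 50, 200, 750
population = [1] * 50 + [2] * 200 + [3] * 750
random.shuffle(population)

# Draw 50 widgets without replacement and count each color
sample = random.sample(population, 50)
counts = Counter(sample)

for color in (1, 2, 3):
    pop_pct = 100.0 * population.count(color) / len(population)
    samp_pct = 100.0 * counts[color] / len(sample)
    print(f"color {color}: population {pop_pct:5.1f}%  sample {samp_pct:5.1f}%")
```

Re-running with a different seed, or a different sample size, shows how the sample percentages scatter around the population percentages.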
 
Sampling from a continuous distribution.

The Central Limit Theorem says that the means of random samples from any distribution will be approximately normally distributed. This is called the sampling distribution of the means, and it is the basis of inference by sampling: it allows normal statistics to predict the true mean of a population from samples.




In probability theory, the central limit theorem (CLT) establishes that, in many situations, for identically distributed independent samples, the standardized sample mean tends towards the standard normal distribution even if the original variables themselves are not normally distributed.

The theorem is a key concept in probability theory because it implies that probabilistic and statistical methods that work for normal distributions can be applicable to many problems involving other types of distributions.

This theorem has seen many changes during the formal development of probability theory. Previous versions of the theorem date back to 1811, but in its modern general form, this fundamental result in probability theory was precisely stated as late as 1920,[1] thereby serving as a bridge between classical and modern probability theory.

If X_1, X_2, …, X_n, … are random samples drawn from a population with overall mean μ and finite variance σ², and if X̄_n is the sample mean of the first n samples, then the limiting form of the distribution Z = lim_{n→∞} (X̄_n − μ)/σ_X̄, with σ_X̄ = σ/√n, is the standard normal distribution.[2]

For example, suppose that a sample is obtained containing many observations, each observation being randomly generated in a way that does not depend on the values of the other observations, and that the arithmetic mean of the observed values is computed. If this procedure is performed many times, the central limit theorem says that the probability distribution of the average will closely approximate a normal distribution.

The central limit theorem has several variants. In its common form, the random variables must be independent and identically distributed (i.i.d.). In variants, convergence of the mean to the normal distribution also occurs for non-identical distributions or for non-independent observations, if they comply with certain conditions.

The earliest version of this theorem, that the normal distribution may be used as an approximation to the binomial distribution, is the de Moivre–Laplace theorem.
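The convergence is easy to see numerically. A minimal Python sketch (not from the original post; the exponential mean of 100 and the sample size of 50 are illustrative values echoing the Scilab script): the exponential distribution is strongly skewed, yet the spread of its sample means matches the σ/√n the theorem predicts.

```python
import random
import statistics

random.seed(2)  # fixed seed so the run is repeatable

MEAN = 100.0   # exponential population mean (illustrative value)
N = 50         # observations per sample
TRIALS = 200   # number of repeated samples

# Mean of each of 200 samples of 50 exponential draws
means = [statistics.fmean(random.expovariate(1.0 / MEAN) for _ in range(N))
         for _ in range(TRIALS)]

print(f"mean of sample means: {statistics.fmean(means):.2f}")
print(f"std of sample means:  {statistics.stdev(means):.2f}")
# CLT prediction: std of the means ≈ sigma/sqrt(N) = 100/sqrt(50) ≈ 14.1
```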



A population is constructed as a combination of a normal and an exponential distribution with different means. For repeated samples it can be seen that the means are normally distributed. Histograms are not the best way to assess a distribution: the cumulative distribution, as an integral, smooths out the curve, and the recognizable normal CDF can be seen for the sample means and standard deviations.

The important parameter is the standard deviation of the sample means. It varies with sample size.

The question becomes: given a sample mean, how close is it to the true mean? Or, given a required certainty, how big a sample is required?
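Both questions come down to the standard error σ/√n. As a hedged sketch (the σ = 14 value and the 1.96 multiplier for 95% confidence are illustrative assumptions, ahead of the promised confidence-interval thread):

```python
import math

sigma = 14.0   # assumed population standard deviation (illustrative)
z = 1.96       # normal quantile for 95% confidence (assumed level)

# Forward: margin of error for a given sample size
n = 50
margin = z * sigma / math.sqrt(n)
print(f"n = {n}: sample mean within about ±{margin:.2f} of the true mean")

# Inverse: sample size needed for a required margin of error
target = 2.0
n_needed = math.ceil((z * sigma / target) ** 2)
print(f"margin ±{target}: need n = {n_needed}")
```

The inverse form shows why precision is expensive: halving the margin of error quadruples the required sample size.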

Coming soon to a thread near you, confidence intervals.


Code:
clear
clc
mprintf("RUNNING\n")


function [s] = rand_seq(nr,ns)
// random sequence 1 - nr with no duplicates
// length ns
s = zeros(ns,1)
for i = 1:ns
   cnt = nr
   while(cnt > 0)
       flag = 1
       x = grand(1,1,"uin",1,nr)
       for j = 1:ns
              if(x ==  s(j))then flag = 0;end;             
       end //j
       if(flag == 1)then break;end;
       cnt = cnt - 1
   end //while   
   s(i) = x
end //i
endfunction


function [y] = rand_samp(samples,population)
    //pick random samples
    n = length(samples)
    for i = 1:n  y(i) = population(samples(i));end;
endfunction

function [y,x] = cum_dist(population)
    //cumulative distribution
    x = gsort(population,"g","i") //increasing sort
    n = length(population)
    pop_sum = 0
    cd_sum = 0
    for i = 1:n pop_sum = pop_sum + population(i);end
    for i = 1:n
         cd_sum = cd_sum + x(i)
        y(i) = 100. * cd_sum/pop_sum //cum distribution     
    end
endfunction

nsample = 10
npop = 1000
nmean= 200
emean = 100
nstd = 1
popnorm = grand(npop,1,"nor",nmean,nstd) //random normal population
popexp = grand(npop,1,"exp",emean) //random exponential population
rand_samples = grand(nsample,1,"uin",1,npop)//sample picks
nr = 10000
ns = 50
ntrials = 200
// combine the 2 distributions
k = 1
for i = 1: npop
   rcomb(k) = popnorm(i)
    rcomb(k+1) = popexp(i)
    k = k + 2
end

//shuffle the combined population (currently disabled)
//nshuf = length(rcomb)
//rx = rand_seq(nshuf,nshuf)
//for i = 1:nshuf
//    r(i) = rcomb(rx(i))
//end

r = rcomb

for i = 1:ntrials
    rs = rand_seq(npop,ns)
    spop(i,1:ns) = rand_samp(rs,r)
    a(i) = mean(spop(i,1:ns))
    st(i) = stdev(spop(i,1:ns))
end
pavg = mean(r)
pstd = stdev(r)
mprintf("Population mean %.4f   std  %.4f\n",pavg,pstd)

savg = mean(a)
avstd = mean(st) // average std of the samples
sstd = stdev(a)  //std of sample means
smin = min(a)
smax = max(a)
mprintf("Samples Averaged mean %.4f a STD   %.4f ss std  %.4f\n",savg,avstd,sstd)
mprintf("Samples Min %.4f Max  %.4f\n",smin,smax)

avesort = gsort(a,"g","i")
stdsort = gsort(st,"g","i")
cdave = cum_dist(avesort)
cdstd = cum_dist(stdsort)


w1 = scf(1)
clf(w1)
subplot(2,2,1)
histplot(15,avesort)
xgrid
title("SAMPLE MEAN")
subplot(2,2,2)
plot2d(avesort,cdave)
xgrid
subplot(2,2,3)
histplot(15,stdsort)
xgrid
title("SAMPLE STD")
subplot(2,2,4)
plot2d(stdsort,cdstd)
xgrid
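As a footnote on assessing normality without histograms: a rank-based empirical CDF can be compared directly against a normal CDF fitted to the sample means, and the largest gap between the two curves gives a single number for the fit. A Python sketch of that comparison (the exponential population, the sample size of 50, and the 200 trials are illustrative values echoing the script above):

```python
import math
import random
import statistics

random.seed(3)  # fixed seed so the run is repeatable

# 200 sample means, each from 50 draws of an exponential population (mean 100)
means = sorted(statistics.fmean(random.expovariate(0.01) for _ in range(50))
               for _ in range(200))

mu = statistics.fmean(means)
sd = statistics.stdev(means)

def normal_cdf(x, mu, sd):
    # Normal CDF expressed through the error function
    return 0.5 * (1.0 + math.erf((x - mu) / (sd * math.sqrt(2.0))))

# Largest gap between the empirical CDF (rank/n) and the fitted normal CDF
gaps = [abs((i + 1) / len(means) - normal_cdf(x, mu, sd))
        for i, x in enumerate(means)]
print(f"max ECDF-vs-normal gap: {max(gaps):.3f}")
```

A small maximum gap indicates the sample means track the normal curve closely, which is just the Central Limit Theorem seen through the CDF instead of a histogram.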
 