• Welcome to the new Internet Infidels Discussion Board, formerly Talk Freethought.

Probability Distributions

steve_bank

Diabetic retinopathy and poor eyesight. Typos ...
Joined
Nov 9, 2017
Messages
13,952
Location
seattle
Basic Beliefs
secular-skeptic
For a distribution such as normal the practical problem is generating random samples for arbitrary means and standard deviations.

That is the quantile function. For a normal distribution the inverse error function is part of the normal quantile. Random normally distrusted numbers have a range of uses for example Monte Carlo methods. Same with the exponential distribution.

Using a factorial table improves speed.

Part of my small math library I have been working on.



I tried numeral integration for the erf() and it was not as acurate as the math tools. The approximation closely matches libraries.

Code:
class RAND_DIST{
    private:

    //URAND random number algorithm
    //URAND A Universal Random Number Generator  by Malcom and Moler
    unsigned long long randll = 1;  //random number
    unsigned long long m = 0x7fffffff;
    //unsigned long long k1 = 2*floor(m*(.5 - sqrt(3)/6.)) + 1;
    //unsigned long long k2 = 8*ceil(m*atan(1.)/8.) + 5;
    unsigned long long k1 = 2.*(m*(.5 - sqrt(3)/6.)) + 1;
    unsigned long long k2 = 8.*(m*(atan(1.)/8.) + 5.);

    public:
    void init(unsigned long long seed);
    int rand_exp(long long n,double *y,double u);
    int rand_nor(long long n,double *y,double u,double sigma);
    int uni_float(int n,double *y,int lo,int hi,int ndp);
    int uni_int(int n,int *y,int lo,int hi);
    double erf(double x);
    double ierf(double x);
};//RAND_DIST

void  RAND_DIST::init(unsigned long long seed){
        randll = seed;
    }//init()

double RAND_DIST::erf(double x){
        double factorial[23] = {
            1.,
            1.,
            2.,
            6.,
            24.,
            120.,
            720.,
            5040.,
            40320.,
            362880.,
            3628800.,
            39916800.,
            479001600.,
            6227020800.,
            87178291200.,
            1307674368000.,
            20922789888000.,
            355687428096000.,
            6402373705728000.,
            121645100408832000.,
            2432902008176640000.,
            51090942171709440000.,
            1124000727777607680000.};

        int i,n = 20;
        double s = 0,k;
        for(i=0;i<23;i++){
            k =   (1./(2*i+1))*pow(-1,i)/factorial[i];
            s += k * pow(x,2*i+1);
            }
        return s*2./sqrt(_PI);
}//erf()

double RAND_DIST::ierf(double x){
    //inverse error function
    if(x==0)return 0;
   //if(x <= -1 || x >= 1.)return 999;
    int k,m,nc = 1000;
    double einv = 0.,q = sqrt(_PI)*x/2.;
    #include "coef.txt"  //c[]
    //double c[nc];
    //for(k=0;k<nc;k++)c[k] = 0;
    //c[0] = 1;
    //for(k=0;k<nc;k++){
        //for(m=0;m<k;m++)
           // c[k]+= (c[m]*c[k-1-m])/((m+1)*(2*m + 1));
      //  }
   for(k=0;k<nc;k++)einv += ((c[k])/(2*k+1))*pow(q,(2*k+1));
    return einv;
}//ierf()

int RAND_DIST::rand_exp(long long n,double *y,double u){
    // array of random numbers from an exonentual distribution
    // p represents random probabilities >0 <1
    if(u <= 0 || n <= 0)return -1;
    double p;
    long long i;
    for(i=0;i<n;i++){
        randll = randll*k2 + k1;
        p = (double)(randll%m+1)/(double)(m+2);
        // quantile - probability to  variable for a given mean
        y[i] = -1.* log(1. - p) * u;
    }
    return 0;
}//rand_exp()


int RAND_DIST::rand_nor(long long n,double *y,double u,double sigma){
    // array of random numbers from a normal distribution
    // URAND random number algorithm
    // p random probabilities >0 <1
    if(n <= 0)return - 1;
    double q,p,k1;
    int i,k,nc = 0;
    q = sqrt(2)*sigma;
    for(i=0;i<n;i++){

        randll = randll*k2 + k1;
        // p >0  <1
        //p = (double(rand())+.01)/double(RAND_MAX + 1);
        p = (double)(randll%m+1e-6)/(double)(m+1e-3);
        p = 2.*p-1.;
        y[i] = u + ierf(p)*q;  //normal qantile
    }
    return 0;
}//rand_nor()

int RAND_DIST::uni_float(int n,double *y,int lo,int hi,int ndp){
    //uniform distribution floats > lo< hi + 1
    if(lo >= hi || n <= 0)return -1;
    int xint,i,nnums = (hi-lo),dp = pow(10,ndp);
    double xdec;
    for(i=0;i<n;i++){
        randll = randll*k2 + k1;
        xdec = (double)(randll%m)/(double)(m+1);
        if(ndp>0){
            xdec = xdec*dp;
            xdec = int(xdec);
            xdec = xdec/dp;
        }
        randll = randll*k2 + k1;
        xint = (randll%m)%nnums + lo;
        y[i] = xint + xdec;
    }
    return 0;
}//uni_float()

int RAND_DIST::uni_int(int n,int *y,int lo,int hi){
    //uniform distribution ints lo -> hi-1
    if(lo >= hi || n <= 0)return -1;
    int i,nnums = hi-lo;
    for(i=0;i<n;i++){
        randll = randll*k2 + k1;
        if(hi < 2)y[i] = (randll%m)%nnums;
        y[i] = (randll%m)%nnums + lo;
    }
    return 0;
}//uni_int(
 
We've done this at least twice before. I still like my version: It needs only 1½ transcendental functions and 1 uniform variate per normal variate. It's terse. Studying it gives insights about the uniform distribution.

Code:
double unif_random(void); // return a uniform variate on (0, 1)

double expon_random(void) // return an exponential variate with mean 1 and std-dev 1
{
         return -log(unif_random());
}

// return a normal variate with mean 0 and std-dev 1
// (mean + std_dev * norm_random()) for a  non-standard variate
double norm_random(void) 
{
         static double r, theta;
         static int iset = 1;
         if (iset ^= 1)
                 return r * sin(theta);
         r = sqrt(2 * expon_random());
         theta = unif_random() * 2 * 3.14159265358979323844;
         return r * cos(theta);
}
 
The normal distribution I coded is 6 to 7 sigma depending on the number of samples. To get that I used URAND with a larger max rand number, IOW finer resolution of the probabilities. Using C rand() resulted in +- 2.5 sigma, not good enough for simulations.

The 6 sigma tails are important when doing Monte Carlo simulations.
 

What kind of input values?
  • Discrete
  • Continuous
What is the range of input values?
  • Finite
  • Semi-infinite (usually 0 to +infinity)
  • Complete infinite (-infinity to +infinity)

A sampling of them:

Binomial
Parameters: total number n, individual probability p
Discrete, finite
\( \displaystyle{ P(k) = \frac{n!}{k! (n-k)!} (1-p)^{n-k} p^k } \)

Poisson
Parameters: mean m
Discrete, semi-infinite
\( \displaystyle{ P(k) = \frac{m^k}{k!} e^{-m} } \)

Poisson: limit of binomial for n -> oo with m = n*p

Chi-squared
Parameters: number of degrees of freedom n
Continuous, semi-infinite
\( \displaystyle{ P(x) = 2 \frac{(x/2)^{n-1} e^{-x/2}}{\Gamma(n/2)} } \)

Normal
Parameters: mean m and standard deviation s
Continuous, infinite
\( \displaystyle{ P(x) = \frac{1}{s\sqrt{2\pi}} e^{(x-m)^2/(2s^2)} } \)

 Central limit theorem - the normal distribution is a limiting case of many distributions
 
Testing the normal and exponential distributions.

I compared the cumulative distributions to the Scilab functions and they are close. They will never exactly match because the numbers are random.

Being symmetrical the median and the median of the normal distribution are the same.

For the exponential distribution the mean and median are not the same. The median = log(2)*mean.
The mean and starboard deviation may appear correct but for the exponential distribution the median = log(2)*mean. t as to be checked.

The methods are not 'mine' or the way I do it. I went through a learning curve, the methods are standard math theory.


Gnuplot script to plot the histogram an cumulative distribution for the normal and exponential,


norm.plt
set term windows background rgb "white" title "DFT" fontscale 1
clear
reset
set key off
set multiplot layout 2,1 columnsfirst

set border 3
# Add a vertical dotted line at x=0 to show centre (mean) of distribution.
set yzeroaxis
# Each bar is4half the (visual) width of its x-range.
set boxwidth 4 absolute
set style fill solid 1.0 noborder
bin_width = 5;
bin_number(x) = floor(x/bin_width)
rounded(x) = bin_width * ( bin_number(x) + 0.5 )
set xrange[*:*]
plot 'norm.dat' using (rounded($1)):(1) smooth frequency with boxes


set grid lt 1 lw 1 lc rgb "black" dashtype solid
# signal
set autoscale x
show autoscale
set zeroaxis
set yrange[*:*]
plot 'norm.dat' using 1:2 with lines ls 4 lt -1 lw 3
show grid
unset autoscale y

exp.plt
set term windows background rgb "white" title "DFT" fontscale 1
clear
reset
set key off
set multiplot layout 2,1 columnsfirst

set border 3
# Add a vertical dotted line at x=0 to show centre (mean) of distribution.
set yzeroaxis
# Each bar is4half the (visual) width of its x-range.
set boxwidth 4 absolute
set style fill solid 1.0 noborder
bin_width = 5;
bin_number(x) = floor(x/bin_width)
rounded(x) = bin_width * ( bin_number(x) + 0.5 )
set xrange[0:*]
plot 'exp.dat' using (rounded($1)):(1) smooth frequency with boxes


set grid lt 1 lw 1 lc rgb "black" dashtype solid
# signal
set autoscale x
show autoscale
set zeroaxis
set yrange[*:*]
plot 'exp.dat' using 1:2 with lines ls 4 lt -1 lw 3
show grid
unset autoscale y

show grid
unset multiplot


Code:
oid save(int n,string fname, double *x1,double *x2){
    int i;
    FILE *p = fopen(fname.c_str(),"w");
    for(i=0;i<n;i++)fprintf(p,"%.10f\t %.10f\t \n",x1[i],x2[i]);
    fclose(p);
}//save()

int comp_doub_ascend(const void * a, const void * b) {
  if (*(double*)a > *(double*)b) return 1;
  else if (*(double*)a < *(double*)b) return -1;
  else return 0;
}
double cum_distf(int n,double *y,double *cd){
    int i,flag = 0;
    double m=0;
    qsort(y,n,sizeof(double),comp_doub_ascend);
    for(i = 0;i < n;i++){
        cd[i] = 100.* double(i)/double(n-1);
        if(!flag && cd[i] > 50.){
            flag = 1;
            m = y[i];
            }
    }//for
    return m;
}//cum_distf()

double medianf(int n,double *y){
    //y is sorted
    int i ;
    double c;
    for(i=0;i<n;i++){
        c = double(i)/double(n-1);
        if(c >= 0.5)break;
        }
    return y[i];
}//medianf()


void avestd(int n,double *y,double *av,double *std){
    int i;

    double asum = 0, ssum = 0;
    for(i=0;i<n;i++)asum += y[i];
    *av = asum/double(n);
    for(i=0;i<n;i++)ssum += pow(*av-y[i],2);
    *std = sqrt(ssum/double(n));
}//ave()

void minmaxi(int n,int *y,int *min,int *max){
    *max = y[0],*min = y[0];
    //printf("%d   %d\n\n",y[0],y[n-1]);
    int i;
    for(i=0;i<n;i++){
        if(y[i] < *min)*min = y[i];
        if(y[i] > *max)*max = y[i];
    }
}

void rand_nor_test(void){
    cout<<"norm test"<<endl;
    RAND_DIST rd;
    unsigned long long seed = unsigned (time(NULL));
    rd.init(seed);
    double m,u = 100,sigma = 20;
    double ave,std,minf,maxf,sx;
    int n = 1000;
    double y[n],cd[n];
    rd.rand_nor(n,y,u,sigma);
    cum_distf(n,y,cd);
    save(n,"norm.dat",y,cd);
    m = medianf(n,y);
    avestd(n,y,&ave,&std);
    printf("av  %.5f   st  %.5f\n",ave,std);
    minmaxf(n,y,&minf,&maxf);
    sx = (maxf-minf)/std; //data spread in standard deviations
    printf("min %.5f  max %.5f\n",minf,maxf);
    printf("median %.5f\n",m);
    printf("sx  %f\n",sx);
}//rand_nor_test()

void rand_exp_test(void){
    cout<<"exp test"<<endl;
    RAND_DIST rd;
    unsigned long long seed = unsigned (time(NULL));
    rd.init(seed);
    double m,u = 100,sigma = 20;
    double ave,std,minf,maxf,sx;
    int n = 1000;
    double y[n],cd[n];
   rd.rand_exp(n,y,u);
    cum_distf(n,y,cd);
    save(n,"exp.dat",y,cd);
    m = medianf(n,y);
    avestd(n,y,&ave,&std);
    printf("av  %.5f   st  %.5f\n",ave,std);
    minmaxf(n,y,&minf,&maxf);
    sx = (maxf-minf)/std;
    printf("min %.5f  max %.5f\n",minf,maxf);
    printf("expected median %.5f  actual median %.5f\n",log(2)*u,m);
    printf("sx  %f\n",sx);
}//rand_exp_test()

int main() {
    RAND_DIST rd;
    unsigned long long seed = unsigned (time(NULL));
    rd.init(seed);
   rand_nor_test();
   rand_exp_test();
   system("norm.plt");
   system("exp.plt");
    return 0;
}//main()
 
When the topic of probability distributions comes up I cannot resist mentioning the Cauchy distribution, named after its discoverer Augustin-Louis Cauchy. There are comments about it in a thread from 2½ years ago on "The Central Limit Theorem" -- a Theorem the Cauchy distribution does NOT obey.

Here's an image comparing the Cauchy and Gaussian bell-shapes. Does the "bell" of this weird distribution seem innocuous enough?

cauchy-png.35565
 
The Central Limit Theorem is important in applied samplng statistics. It says the distribution of sample means regardless of the underlying distributed will be normally distributed, or close enough to use normal statistics.


I know it holds true if the sample size is big enough.

This leads to the Chi^2 distribution another important applied statistic.and confidence intervals.

It is said quitting a statistic without a confidence interval is meaningless.

There used to be plastic circular calculators to do Chi^2 and confidence intervals.Dermig sample size for a given confidence interval.


A walk down memory lane. At my first job in a small company I was the manufacturing, test, reliability, and quality engineer. At my second job I was a full time reliability engineer.
 
We've done this at least twice before. I still like my version: It needs only 1½ transcendental functions and 1 uniform variate per normal variate. It's terse. Studying it gives insights about the uniform distribution.

Code:
double unif_random(void); // return a uniform variate on (0, 1)

double expon_random(void) // return an exponential variate with mean 1 and std-dev 1
{
         return -log(unif_random());
}

// return a normal variate with mean 0 and std-dev 1
// (mean + std_dev * norm_random()) for a  non-standard variate
double norm_random(void)
{
         static double r, theta;
         static int iset = 1;
         if (iset ^= 1)
                 return r * sin(theta);
         r = sqrt(2 * expon_random());
         theta = unif_random() * 2 * 3.14159265358979323844;
         return r * cos(theta);
}
Gone around twice?

I have not seen you post these before.

Your exponential function uses the exponential distribution quintuple function and does create a distribution with mean 1 as written. The same as the function I posted so there is no 'going around it again';


There is one issue, your function is technically incorrect. It is -log(1-p) not -log(p).

For p = ,2, -log(p) = 1.6904 -log(1-p) = .2231 different points on the distribution. For generating a sequence of random variables it does not matter, but if you are looking for a specific point on the distribution it is incorrect.

unsigned int seed = time(NULL);
srand(seed);
//random number >0,<1
double unif_rand(void){return double(rand()+1)/(RAND_MAX+2);}

double expon_random(double u) {return -log(1.-unif_rand())*u;}

Make u 1 to get a mean of 1.

The n0rmal function works but it is incomprehensible to me, can you elaborate on how it works or provide a link? Yes, you can shift the normalized distribution to any mean and standard deviation, that is basic textbook theory. Much easier to use the normal quantile.

Your function requires an exponential and uniform distribution, the more standard way requires the inverse error function and a uniform distribution.

You used the exponential quartile -log(1-p)*um why not use the nornal quartile?

You would have to add more code to shift the mean and standard deviation to make it a usable function.

If I combine the inverse error function and the normal quartile using rand(). No transcendental functions.

Derivation of norrnal quartile.



Code:
double rand_norm(double mean,double sigma){
    #include "coef.txt"  //c[]
    double y,p,ierf,q ;
    int k,m,nc = 1000;
    //p>0,p < 1
    p = (double)(rand() +1)/(RAND_MAX+2);
    p = 2.*p-1.;
    q =  sqrt(_PI)*p/2.;
   for(k=0;k<nc;k++)ierf += ((c[k])/(2*k+1))*pow(q,(2*k+1));
   y = mean + ierf*sqrt(2)*sigma;  //normal quantile
   return y;
}
 
...
// return a normal variate with mean 0 and std-dev 1
// (mean + std_dev * norm_random()) for a non-standard variate
Gone around twice?

I have not seen you post these before.

I did. I may have changed variable names or such. Some of my libraries are available on-line under my real name; I make minor changes to avoid Google hits and thus keep that "firewall" intact!

There is one issue, your [exponential] function is technically incorrect. It is -log(1-p) not -log(p).

:confused2: You seem to admit that the function returns variates correctly, what difference do the internals matter? When p is uniform on (0, 1), it and (1-p) provide exactly equivalent distributions.
The n0rmal function works but it is incomprehensible to me, can you elaborate on how it works or provide a link?

We went through this before also. Did you notice that the code produces (r, theta) on HALF the calls and returns (r * sin(theta)) trivially on the other half of its calls? It is actually producing a variate on a TWO-D gaussian distribution, and generating two ONE-D variates from that. That that works at all -- and is easier than producing a ONE-D variate -- seems very interesting to me, and reminds us of mathematical remarks about the Gaussian.
You would have to add more code to shift the mean and standard deviation to make it a usable function.

You had the same objection last time. I added a comment (in RED above) showing how trivial that conversion is.

There's a simple reason why I like my approach, avoiding unnecessary function arguments. Anyone using the function will know how to convert a standard normal variate into one with specified mean and standard deviation. Not only is it unhelpful for the library routine to do this, but it's much easier to memorize the argument order when there are only zero arguments instead of two!
 
Back
Top Bottom