• Welcome to the new Internet Infidels Discussion Board, formerly Talk Freethought.

The Programming Thread

There are at least two "cryptographically secure" PRNGs with source code in the public domain. The most famous of these might be the Mersenne Twister (MT19937), with a cycle of 2^19937 − 1. We might assume that "security" is irrelevant, but these generators produce numbers which are VERY . . . random! Why not use them?

One possible reason to avoid them would be the computational cost of computing the next pseudo-random number in sequence. HOWEVER it is possible to minimize the consumption of the random bits.

Obviously, a binary decision with probabilities (0.5, 0.5) can be achieved with a SINGLE bit from the random-number generator. In other words we need only call MT19937 once to get 32 such random decisions.

Less obvious is that a binary decision like (0.8173, 0.1827) can be achieved with just 0.687 random bits on average.

Similarly, a uniform integer on N_16 can be achieved with 4 random bits, and a uniform integer on N_17 can be achieved with just 4.09 bits on average.

Is this hint enough to see how it's done?
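For the curious, the bit counts quoted above are just the Shannon entropies of the distributions. A quick sketch to verify (my own illustration, not anyone's library code):

```python
from math import log2

def entropy_bits(probs):
    # Average number of random bits inherently needed per decision:
    # the Shannon entropy  -sum p*log2(p)  of the distribution.
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy_bits([0.5, 0.5]))          # 1 bit for a fair coin
print(entropy_bits([0.8173, 0.1827]))    # about 0.69 bits
print(entropy_bits([1/16] * 16))         # 4 bits for a uniform pick from 16
print(entropy_bits([1/17] * 17))         # about 4.09 bits for 17
```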
 
The 32-bit Mersenne Twister is the default C++ library random number generator. There are also a 64-bit Mersenne Twister and a linear congruential generator.

I downloaded the Mersenne Twister code.

I have been using 10^10 iterations for testing, comparing URAND to the C++ Mersenne Twister and C rand(). The C++ Mersenne Twister executes much slower than the URAND I posted. That is to be expected: the URAND code has one multiplication and one addition per iteration.


In terms of frequency testing, URAND looks like the C++ Mersenne Twister in terms of randomness. They are equally close to a true uniform distribution in the number of occurrences of each number. Both are an order of magnitude better than C rand() as a uniform approximation.

Plotting URAND and Mersenne Twister, both look like random electrical noise. I took the Fourier transform of both, and they both have the flat transform of random noise, as I would see with electrical noise.

That leaves the period and the randomness of intervals between occurrences of a number: the gap test.

I have been coding a gap test.
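For reference, the gap test (one of the empirical tests in Knuth's Seminumerical Algorithms) looks at the lengths of the runs between successive occurrences of a value; for a good uniform generator on d values those lengths should be geometrically distributed. A rough sketch of the idea, using Python's own generator rather than URAND:

```python
import random

def gap_lengths(seq, target):
    """Number of draws between successive occurrences of `target`."""
    gaps, gap = [], 0
    for x in seq:
        if x == target:
            gaps.append(gap)
            gap = 0
        else:
            gap += 1
    return gaps

# For a uniform generator on d values, P(gap = g) = (1/d)(1 - 1/d)**g,
# so the mean gap length should be close to d - 1.
rng = random.Random(1)
d = 100
seq = [rng.randrange(d) for _ in range(1_000_000)]
gaps = gap_lengths(seq, target=7)
print(sum(gaps) / len(gaps))   # should be near d - 1 = 99
```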

Which to use? It always comes down to the application. The linear congruential algorithm is simple, compact, and fast, with reasonable characteristics. Suitable for the kind of simulations I did on systems.
 
Execution times C++ Mersenne Twister vs LG.

for 10^10 iterations C++ took 68 seconds, URAND LG took .016 seconds.

Made a dumb mistake at first. Make sure the compiler is in release mode, not debug; it runs slower in debug.

MS documentation says that, other than hardware devices, Mersenne Twister has the highest performance and LG the lowest.


Code:
#include <random>
#include <chrono>
#include <cmath>
using namespace std;

int main(){
    long long i,n = (long long)pow(10,10);
    int x;
//  Mersenne Twister 32 bit
//  unsigned seed = chrono::system_clock::now().time_since_epoch().count();
//  default_random_engine generator (seed);
//  uniform_int_distribution<int> dist (0,100); // random ints 0-100
//  for(i=0;i<n;i++)x = dist(generator);

    //URAND LG 32 bit
    int sf = 100,os = 0;
    unsigned m = 0x7fffffff;
    unsigned k1 = 907633385,k2 = 1686629725;
    unsigned rand_num = 1;
    for(i=0;i<n;i++){
        rand_num = rand_num*k2 + k1;
        x = (rand_num%m)%sf + os; // random ints 0-100
        }
    // NOTE: x is never read after the loop, so an optimizing compiler
    // may remove the loop entirely -- beware when timing this.
    return 0;
}
 
My little library has several PRNGs to choose from, though I've found no reason to use something faster than Mersenne Twister. When I time these various PRNGs, the times to call the PRNG 5 billion times range from 55 seconds (Bob Jenkins's Burtle PRNG) to 119 seconds (Marsaglia's KISS PRNG). GNU's compound Tausworthe (random(), 69 seconds) and the Mersenne Twister (83 seconds) are about midway between Burtle (which emphasizes speed) and Marsaglia (which emphasizes ridiculously long period).

These numbers "compare apples with apples." They were measured with the same compiler and overhead on my VERY old laptop running cygwin. Much of the time was spent on loop-handling etc. OUTSIDE the actual PRNG. Looking at the source code for the PRNGs we see that, Yes, the time closely parallels the number of operations in the critical code.

As you can see, ALL these generators are fast enough. In a real application the choice of PRNG will have almost no effect on net speed.

As I've mentioned, applications which consume 32 random bits for a simple switch can often make do with a single bit or even less. . . .
Obviously, a binary decision with probabilities (0.5, 0.5) can be achieved with a SINGLE bit from the random-number generator. In other words we need only call MT19937 once to get 32 such random decisions.

Less obvious is that a binary decision like (0.8173, 0.1827) can be achieved with just 0.687 random bits on average.

Similarly, a uniform integer on N_16 can be achieved with 4 random bits, and a uniform integer on N_17 can be achieved with just 4.09 bits on average.

Is this hint enough to see how it's done?

However this is just a fun exercise! The PRNGs are all so fast that saving those random bits is pointless!

Execution times C++ Mersenne Twister vs LG.

for 10^10 iterations C++ took 68 seconds, URAND LG took .016 seconds.
Obviously something went badly wrong in your experiment! I'm sure you'll agree that you didn't do 10^10 iterations of anything in 16 milliseconds!
 
Don't have a clue what you are 'hinting' at. I scanned through the Mersenne algorithm, and there is plenty of code on the net. No mystery or puzzle.

Feel free to post complete working debugged code that others can run.
 
steve said:
Don't have a clue what you are 'hinting' at. I scanned through the Mersenne algorithm, and there is plenty of code on the net. No mystery or puzzle.

Most of my comments had nothing whatsoever to do with Mersenne Twister.
My question was about your "URAND LG". Perhaps you should show us that code.

Execution times C++ Mersenne Twister vs LG.

for 10^10 iterations C++ took 68 seconds, URAND LG took .016 seconds.
Obviously something went badly wrong in your experiment! I'm sure you'll agree that you didn't do 10^10 iterations of anything in 16 milliseconds!

No comment about this? :confused2:

- - - - - - - - - - - - - - - - - -

I will post two miscellaneous comments about my random-number routines.

(1) When choosing a uniform real variate over (0, 1), one starts with a specified number of bits. By default my library routines use 31 bits. The variate is then chosen to be distributed uniformly over
{1, 3, 5, 7, 9, ..., 4294967291, 4294967293, 2^32 − 1} ÷ 2^32
Note that this chooses a number from (0, 1) -- not [0, 1) as many or most implementations do. Note that the mean will be 0.5 exactly. Perfectionistic? Sure, but why not? Steve, modify your code to use, say, 8 bits instead of 32 and see what mean value you get.
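To make point (1) concrete: the numerators are the odd integers, which pair off symmetrically, so the mean is exactly 1/2 by construction. A sketch with exact rational arithmetic (my illustration; `uniform_support` is a made-up name):

```python
from fractions import Fraction

def uniform_support(bits):
    """All values the variate can take: odd numerators over 2**bits."""
    denom = 2 ** bits
    return [Fraction(2 * k + 1, denom) for k in range(denom // 2)]

# With 8 bits the support is 1/256, 3/256, ..., 255/256 -- all inside (0, 1).
vals = uniform_support(8)
mean = sum(vals) / len(vals)
print(mean)    # exactly 1/2
```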

(2) I'll present pseudo-code to illustrate how to minimize the number of random bits consumed by smallish selections and binary decisions.

Let N_k denote {0, 1, 2, 3, ..., k−1}.
Internally, the code maintains a pair of integers (r, m); r is a uniformly distributed random variable over N_m.
I will describe two routines, Augment and Decide.

Augment(k)   // used to increase ("augment") the precision of (r, m)
    Retrieve k bits from the PRNG; i.e. retrieve a random x uniform over N_(2^k).
    Replace (r, m) as follows (obviously shifts will be used instead of multiplies):
        r ← r * 2^k + x
        m ← m * 2^k

Decide(p)   // return 1 with probability p, 0 with probability 1-p
    If m is "too small", call Augment(k) for some appropriate k (e.g. make m a 31-bit number).
    Approximate p as s/m for some integer s.
    If r < s Then
        m ← s
        Return 1
    Else (r ≥ s)
        m ← m - s
        r ← r - s
        Return 0

The basic idea is that there is UNUSED randomness available for free after a Decide. Keep that randomness, rather than discarding it and "wasting" a call to the 32-bit PRNG every time. (As mentioned, even very good PRNGs are so good that the technique, however "nifty", has no importance unless you're working with a very expensive physical RNG.)

Anyone who takes the time to study this "Decide (p)" routine -- perhaps working simple examples -- and then understands how it works, should give themselves a very big pat on the back! I've chatted with enough expert programmers to realize that this method -- however simple the above pseudo-code may appear -- is likely to baffle. Kudos to anyone who can pursue it all the way to an "Aha!" moment.
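Here is one way the pseudo-code above might look in executable form. This is my reading of the Augment/Decide scheme, not the author's library; the class name and the 31-bit refill threshold follow the description above.

```python
import random

class BitPool:
    """Maintains (r, m): r uniform on [0, m). Leftover randomness
    after each Decide is kept instead of being thrown away."""
    def __init__(self, seed=12345):
        self.rng = random.Random(seed)
        self.r, self.m = 0, 1
        self.bits_drawn = 0            # bookkeeping: PRNG bits consumed

    def augment(self, k):
        """Append k fresh random bits to (r, m)."""
        self.r = (self.r << k) | self.rng.getrandbits(k)
        self.m <<= k
        self.bits_drawn += k

    def decide(self, p):
        """Return 1 with probability ~p, 0 otherwise."""
        if self.m < (1 << 31):               # make m a 31-bit number
            self.augment(32 - self.m.bit_length())
        s = int(p * self.m)                  # approximate p as s/m
        if self.r < s:
            self.m = s                       # r is still uniform on [0, s)
            return 1
        self.r -= s                          # r - s is uniform on [0, m - s)
        self.m -= s
        return 0

pool = BitPool()
N = 100_000
ones = sum(pool.decide(0.8173) for _ in range(N))
print(ones / N, pool.bits_drawn / N)   # near 0.8173, well under 1 bit/decision
```

The bits-per-decision figure printed at the end lands near the 0.687-bit entropy quoted earlier, which is the point of the exercise.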
 
And to close out the topic: comparing rand(), URAND, and the C++ implementation of mt19937.

The first test is the cumulative distribution. For all three generators the plot is a straight line, as expected. For
random ints 0-1000 and 2^20 iterations:

URAND median = 500
rand() median = 496
mt19937 median = 500

The second test is a frequency test, IOW a histogram.

For 10^10 iterations and random ints 0-999.

1. The occurrences of each number from each generator are counted.
2. The percent deviation from a theoretically perfect uniform distribution is calculated, which can be positive or negative.
3. The average of the absolute values of the deviations is calculated.

Typical average deviations from theoretical, they move around a little on each run.

URAND 0.00752%
mt19937 0.0251%
rand() 1.09125%

The counts and deviations for each number generated by rand() show a clear bias. Lower numbers are all on the high side, higher numbers are all on the low side. The deviations are systematic, not random.

There are comments on the net about avoiding using rand(), and I see why.

URAND looks OK; it has a smaller period than mt19937. That could be addressed by changing m to a 64-bit int.

An older, out-of-print reference text is Knuth's Seminumerical Algorithms. There are PDFs; it has a lengthy chapter on random generators and generator testing.

https://seriouscomputerist.atariverse.com/media/pdf/book/Art%20of%20Computer%20Programming%20-%20Volume%202%20(Seminumerical%20Algorithms).pdf


Code:
#include <chrono>
#include <random>
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <ctime>
#include <iostream>
#include <string>
using namespace std;

typedef mt19937 random_engine;
const unsigned seed = chrono::system_clock::now().time_since_epoch().count();
random_engine generator (seed);



double mers_doub(int lo,int hi){
    uniform_real_distribution<double> dist (lo,hi);
   return dist(generator);
}//mers_doub()

int mers_int(int lo,int hi){
    uniform_int_distribution<int> dist (lo,hi);
   return dist(generator);
}

int urand_int(int lo,int hi){
        static int init = 1;
        unsigned m = 0x7fffffff;
        unsigned k1 = 907633385,k2 = 1686629725;
        static unsigned rand_num = 1;
        if(init){rand_num = time(NULL),init = 0;}
        rand_num = rand_num*k2 + k1;
        int sf = 1 + hi - lo;
        return (rand_num%m)%sf + lo;
}//urand_int()

int comp_int_ascend(const void * a, const void * b) {
  if (*(int*)a > *(int*)b) return 1;
  else if (*(int*)a < *(int*)b) return -1;
  else return 0;
}

void cum_dist_test_int(string t){
   long long i,n = pow(2,20);
    int lo,hi,min,max,median;
    double *cd = new double[n];
    long long s = 0;
    int   *y = new int[n];
    lo = 0;hi = 1000;
    min = hi + 1;max = lo - 11;
    srand(time(NULL));
    for(i=0;i<n;i++)cd[i] = 100*double(i)/double(n);
    if(t == "u")for(i=0;i<n;i++)y[i] = urand_int(lo,hi);
    if(t == "m")for(i=0;i<n;i++)y[i] = mers_int(lo,hi);
    if(t == "r")for(i=0;i<n;i++)y[i] = (rand()%RAND_MAX)%hi;
    qsort(y,n,sizeof(int),comp_int_ascend);
    for(i=0;i<n;i++){
        if(cd[i] >= 50.){
            median = y[i];
            break;
            }
        }
    for(i=0;i<n;i++){
         s += y[i];
         if(y[i]<min)min = y[i];
         if(y[i]>max)max = y[i];
     }
     printf("Min  %d  Max  %d Median  %d\n",min,max,median);
    FILE *p = fopen("urand.dat","w");
    for(i=0;i<n;i++)fprintf(p,"%d \t %.15f\n",y[i],cd[i]);
    fclose(p);
    delete [] y;
    delete [] cd;
 }//cum_dist_tes_int()

void rand_freq_test(void){
    cout<<"rand_freq_test"<<endl;
   //-------------------------------------------
   //C++
    typedef mt19937 random_engine;
    unsigned seed = chrono::system_clock::now().time_since_epoch().count();
    random_engine generator (seed);
    //-----------------------------------------
    //URAND
    unsigned m = 0x7fffffff;
    unsigned k1 = 907633385,k2 = 1686629725;
    unsigned rand_num = time(NULL);
    //------------------------------------------
    long long i,n = pow(10,10);
    int xu,xm,xr;
    const int hi = 1000;
    long long countu[hi],countm[hi],countr[hi];
    double su = 0,sm = 0,sr=0,pcu[hi],pcm[hi],pcr[hi];
    for(i=0;i<hi;i++){countu[i] = 0;countm[i] = 0,countr[i] = 0;}
    uniform_int_distribution<int> dist (0,hi-1);
    srand(time(NULL));
    for(i=0;i<n;i++){
        rand_num = rand_num*k2 + k1;
        xu =(rand_num%m)%hi;
        xm = dist(generator);
        xr = (rand()%RAND_MAX)%hi;
        countu[xu]++;
        countm[xm]++;
        countr[xr]++;
        }
    int k = n/hi;  // k,n powers of 10
    for(i=0;i<hi;i++){
        pcu[i] = 100*(double(countu[i]) - k)/k;
        su += abs(pcu[i]);
        pcm[i] = 100*(double(countm[i]) - k)/k;
        sm += abs(pcm[i]);
        pcr[i] = 100*(double(countr[i]) - k)/k;
        sr += abs(pcr[i]);
        }
    su = su/hi;
    sm = sm/hi;
    sr = sr/hi;
    printf("Average Deviation  URAND %.5f mt19937  %.5f  rand()  %.5f\n",su,sm,sr);
    FILE *p = fopen("freqs.dat","w");
    for(i=0;i< hi;i++)fprintf(p,"%lld  %lldl   %+.5f   %lld  %+.5f   %lld  %+.5f\n",
        i,countr[i],pcr[i],countu[i],pcu[i],countm[i],pcm[i]);
    fclose(p);
}//rand_freq_test()

int main(void){
    //cum_dist_test_int("m");
   rand_freq_test();
   return 0;
}
 
A section of the frequency output file.

The first column is the number; the next two columns are the count and percent deviation from theoretical for rand(), followed by the same pair for URAND and mt19937.

Code:
num  rand() count  %dev       URAND count  %dev      mt19937 count  %dev
761  100902  +0.90200         100011  +0.01100       100582  +0.58200
762  100472  +0.47200          99783  -0.21700        99770  -0.23000
763  101136  +1.13600         100047  +0.04700        99717  -0.28300
764  100661  +0.66100         100220  +0.22000       100507  +0.50700
765  100704  +0.70400          99802  -0.19800       100327  +0.32700
766  100687  +0.68700         100131  +0.13100       100041  +0.04100
767   97911  -2.08900         100712  +0.71200       100540  +0.54000
768   97429  -2.57100          99495  -0.50500        99999  -0.00100
769   97485  -2.51500          99550  -0.45000       100360  +0.36000
770   97276  -2.72400          99697  -0.30300        99926  -0.07400
771   97414  -2.58600          99761  -0.23900        99748  -0.25200
772   97407  -2.59300         100659  +0.65900       100400  +0.40000
773   97982  -2.01800          99857  -0.14300       100028  +0.02800
774   97181  -2.81900          99664  -0.33600       100125  +0.12500
 
This showed up in my YouTube shorts:

It talks about C being an unsafe language...



C:
#include <stdio.h>

int main() {
  int array[4096] = {0};
  array[7000] = 0;              /* out-of-bounds write: C does not stop you */
  printf("%d\n", array[7000]);  /* out-of-bounds read */
}
 
Yes, lack of bounds checking for arrays. If one is not careful in accessing them, one will go off the edge.

It's possible to avoid doing that by having bounds checking, but that requires additional calculation.

The C++ Standard Template Library has several container classes, including "array" and "vector" with indexed access. Those classes have two member functions for doing indexed access, operator[] and at(). The first one is not bounds checked, while the second one is. It throws an exception for an out-of-bounds input.

C also has difficulties with allocated memory. One can forget to deallocate it, making a memory leak. One can use a pointer to deallocated memory: a dangling pointer.

C++ STL container objects allocate the memory that they use internally, behind the scenes. When they go out of scope, they deallocate this memory.

Similarly, C++ supports "smart pointers" that deallocate their allocated memory when they go out of scope. C++ smart pointers can be reference counted, with counters that are incremented when a smart pointer is copied and decremented when a copy goes out of scope.
 
Slot Machine

Code:
import numpy as np

symbols = ["cat     ","dog     ","trump    ","biden   ","jesus   ",
                    "cherry  ","apple   ","chevy   ","cow     ","peach   ",
                   "car     ","bike    ","scooter ","crow    ","gold    ",
                   "sun     ","moon    ","donkey  ","pot     ","jackpot "]
while(1):
    w1 = np.random.randint(0,20)
    w2 = np.random.randint(0,20)           
    w3 = np.random.randint(0,20)
    print("%s  %s  %s" %(symbols[w1],symbols[w2],symbols[w3]))
    print("p to play, x to exit")
    x = input()
    if x == 'x':break
 
If you want a C math library there is the GNU Scientific Library.


I downloaded the source code: 1400 functions from which to find code to use, and it comes with test code.


The GSL project was initiated in 1996 by physicists Mark Galassi and James Theiler of Los Alamos National Laboratory.[3] They aimed at writing a modern replacement for widely used but somewhat outdated Fortran libraries such as Netlib.[4] They carried out the overall design and wrote early modules; with that ready they recruited other scientists to contribute.

The "overall development of the library and the design and implementation of the major modules" was carried out by Brian Gough and Gerard Jungman.[3] Other major contributors were Jim Davies, Reid Priedhorsky, M. Booth, and F. Rossi.[3]

Version 1.0 was released in 2001. In the following years, the library expanded only slowly; as the documentation stated, the maintainers were more interested in stability than in additional functionality. Major version 1 ended with release 1.16 of July 2013; this was the only public activity in the three years 2012–2014.

Vigorous development resumed with publication of version 2.0 in October 2015. The latest version 2.7 was released in June 2021.
 
We've looked at generating random numbers from various continuous pdfs (probability density functions).
How about DISCRETE distributions (i.e. "probability mass functions")? In practice these can arise variously, e.g. in heuristic optimizations, generating test sequences, etc.

For a toy example here, we will select people from one of 50 states, weighted by the state's population. For example, Montana has 0.338926% of the 50 states' total population (as of 2023) and we want to output "MT" with probability 0.338926%. (We are "drawing from an urn with replacement.")

Think about how you would write a program to support that! Assume you seek BOTH accuracy and speed. A typical solution will search through a table. That would be OK here where there are only 50 items to choose from, but what if we had 50,000 items? We'd spend 1000 times as much time compared with 50 items with linear search, and about three times as long with binary search. The speed of the code here does NOT vary with table size.

Here is working C code which can be compiled and run for the toy distribution. I've enclosed it in Spoiler to avoid cluttering this post, but also to encourage you to try to "reinvent a wheel!" The algorithm used is quite nifty.

This algorithm is NOT my invention. In fact I think it's already given in Donald Knuth's The Art of Computer Programming -- though that does NOT imply that the algorithm is well known.

Code:
#include        <stdlib.h>
#include        <stdio.h>

// State will be selected with probability proportional to its population
// Following table must be generated separately.
struct wtent {
        struct {
                char x[3];
        } t[2];
        double  prob;
} WTable[] = {
        "WY", "CA", 0.085625,
        "VT", "CA", 1.094920,
        "AK", "CA", 2.107520,
        "ND", "CA", 3.114926,
        "SD", "CA", 4.134775,
        "DE", "CA", 5.151278,
        "RI", "TX", 6.160671,
        "MT", "TX", 7.166074,
        "ME", "TX", 8.204617,
        "NH", "TX", 9.205545,
        "HI", "TX", 10.210396,
        "WV", "FL", 11.259498,
        "ID", "FL", 12.288035,
        "NE", "FL", 13.290036,
        "NM", "FL", 14.309973,
        "CA", "NY", 15.401459,
        "TX", "NY", 16.419179,
        "MS", "NY", 17.430967,
        "KS", "NY", 18.431093,
        "AR", "NY", 19.449739,
        "NY", "PA", 20.001637,
        "FL", "PA", 21.462343,
        "PA", "IL", 22.364202,
        "NV", "IL", 23.468276,
        "IA", "IL", 24.470157,
        "IL", "OH", 25.142457,
        "UT", "OH", 26.501050,
        "OH", "GA", 27.371361,
        "CT", "GA", 28.530289,
        "OK", "GA", 29.594303,
        "GA", "NC", 30.112871,
        "OR", "NC", 31.620623,
        "NC", "MI", 32.322010,
        "KY", "MI", 33.663548,
        "LA", "MI", 34.670525,
        "MI", "NJ", 35.127576,
        "AL", "NJ", 36.748917,
        "SC", "NJ", 37.787779,
        "NJ", "VA", 38.026338,
        "MN", "VA", 39.841196,
        "CO", "VA", 40.861675,
        "VA", "WA", 41.006957,
        "WI", "WA", 42.866564,
        "WA", "AZ", 43.018913,
        "MD", "AZ", 44.906044,
        "AZ", "TN", 45.014414,
        "TN", "IN", 46.059179,
        "MO", "MA", 47.908375,
        "IN", "MA", 48.065198,
};
#define SIZ     (sizeof WTable / sizeof *WTable)

// Generate a standard uniform variate on (0,1)
#define UNIFSTD ((1 | random() & (1 << 30) - 1) / (double)(1 << 30))

int main(int argc, char **argv)
{
        // default cnt is 0.1% total pop of 50 states
        int     cnt = argv[1] ? atoi(argv[1]) : 334236;
        while (cnt--) {
                double k = SIZ * UNIFSTD;
                struct wtent *w = WTable + (int)k;
                printf("%s\n", w->t[w->prob < k].x);
        }
        exit(0);
}

Comments:
  • The hardest part of the algorithm is NOT shown. That is the one-time code which generates the special table from a histogram. (If you really want or need this code, PM me.)
  • Although the algorithm is NOT mine, the coding details are. In fact I "showed off" a little: the standard approach ends up with a table of size 50 (or the number of distinct choices). I (pointlessly!) make do with one less -- 49 here.
  • The code is very brief and terse: most of it is just the 49-item table. But some constructions may seem "fancy", almost obfuscated. I think C programmers should be VERY comfortable with C. My code may contribute to that goal. 8-)
  • Someone had conniption fits last time I showed UNIFSTD -- how to generate a standard uniform variate. If the complaint is again that the variate is too "coarse," note that it is much finer than the probabilities in the table.
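The withheld table-building step is, if I'm reading the table right, Walker's alias method (with Vose's O(n) construction). A Python sketch of the textbook algorithm -- my illustration, not the author's one-time code:

```python
import random

def build_alias(weights):
    """Walker/Vose alias method: O(n) build, O(1) per sample.
    Slot i keeps a threshold prob[i] and an alias symbol alias[i]."""
    n = len(weights)
    total = float(sum(weights))
    scaled = [w * n / total for w in weights]     # rescale so the mean is 1
    small = [i for i, s in enumerate(scaled) if s < 1.0]
    large = [i for i, s in enumerate(scaled) if s >= 1.0]
    prob, alias = [1.0] * n, list(range(n))       # leftovers default to prob 1
    while small and large:
        s, l = small.pop(), large.pop()
        prob[s], alias[s] = scaled[s], l          # s keeps its own share;
        scaled[l] -= 1.0 - scaled[s]              # l donates the remainder
        (small if scaled[l] < 1.0 else large).append(l)
    return prob, alias

def draw(prob, alias, rng):
    i = rng.randrange(len(prob))                       # pick a column...
    return i if rng.random() < prob[i] else alias[i]   # ...then its 2-way split

rng = random.Random(0)
prob, alias = build_alias([1, 2, 3, 4])
counts = [0] * 4
for _ in range(100_000):
    counts[draw(prob, alias, rng)] += 1
print([c / 100_000 for c in counts])   # near [0.1, 0.2, 0.3, 0.4]
```

Each draw costs one uniform index plus one comparison, regardless of table size, which is the property claimed above.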
 
When data sets are fixed, the most direct way is to assign a unique integer to each text tag. The integer is an index for direct indexing into tables without searching.

It doesn't work for searching for random text strings.

In a galaxy far far away there is an empire that has dominion over 6 planets. Periodically they randomly select a citizen from each planet for an all expense paid vacation at the imperial palace.

Each adult citizen on each planet has a number from 1 to the population size.

A table of planet populations is created with random numbers.

The last part would be creating a citizen info database.

It executes 1 million times with little delay.


Code:
import math as ma
import array as ar
import random as rn


def select_citizen(planet,data_base):
        return  rn.randint(1,data_base[planet])

#def citizen_data(planet,data_base):
 #               a = [name,address]
#                return a

          
nplanets = 6
planets = ["Moe","Larry","Curley","Groucho","Abbot","Costello"]
population_data_base = ar.array("i",nplanets*[0]) #population table
for i in range(nplanets):population_data_base[i] = rn.randint(1,100000)

for planet in range(nplanets):
        x = select_citizen(planet,population_data_base)
        print("Planet %s\t  Citizen Number  %d" %(planets[planet],x))
       # info = citezen_data(planet,x,citizen_data_base)


# or this 
moe = 0
larry = 1
curly = 2
groucho = 3
abbot = 4
costello = 5

x = select_citizen(curly,population_data_base)
print("\n\nPlanet Curley  Citizen Number   ",x)
 
Hash sorting. Probably trivial for data science practitioners.

The hash keys in the example are 'a' and 'c'. The hash function is the switch code that converts keys to indices.

From scanning through Knuth's Sorting and Searching: as keys get complex there is no way to avoid collision or overlap of keys. A hash function can be ambiguous.

Data is sorted into buckets based on keys.

Searching an abbreviated list of states for information.

Code:
#include <iostream>
#include <string>
using namespace std;

string states[7] = {
//0-4
"alabama",
"alaska",
"arizona",
"arkansas",
//4-6
"california",
"colorado",
"connecticut"};

int hash_table[2][2] = { {0,4} , {4,7} };

struct Data{
    int population;
    int area;
};

void create_data_base(struct Data db[]){
    db[0].population = 1000;
    db[0].area = 10000;
    db[1].population = 2000;
    db[1].area = 20000;
    db[2].population = 3000;
    db[2].area = 30000;
    db[3].population = 4000;
    db[3].area = 40000;
    db[4].population = 6000;
    db[4].area = 50000;
    db[5].population = 6000;
    db[5].area = 60000;
    db[6].population = 7000;
    db[6].area = 70000;
}


int search_data_base(string state,struct Data data_base[]){
    int i,lo,hi;
    switch (state[0]){
        case 'a': lo = hash_table[0][0];
                       hi = hash_table[0][1];
                       break;
        case 'c':  lo = hash_table[1][0];
                       hi = hash_table[1][1];
                       break;
        default: cout<<"bad input"<<endl;return -1;
        }

        for(i = lo;i<hi;i++){
            if(state == states[i]){
                cout<<states[i]
                <<"  population  "<<data_base[i].population
                <<"  area  "<<data_base[i].area
                <<endl;
                break;
                }
        }

    return 0;
}


int main(){
    Data data_base[7];
    create_data_base(data_base);
    string s;
    while(1){
        cout<<"enter state  ";cin>>s;
        if(s=="x")break;
        search_data_base(s,data_base);
    }
}
 
Hash sorting. Probably trivial for data science practitioners.

The hash keys in the example are 'a' and 'c'. The hash function is the switch code that converts keys to indices.

From scanning through Knuth's Sorting and Searching: as keys get complex there is no way to avoid collision or overlap of keys. A hash function can be ambiguous.
Very useful when you need to work with data bigger than memory.
 
Using OOP encapsulation.

When dealing with multiple empirical calibration tables, I used class objects. They protected the tables from roaming pointers, provided error trapping to a log file, and interpolated. Trapping run-time errors can be difficult.

It could be done with functions and not a class object but it would have been messy and prone to errors.

If, say, the interpolation method were changed, it would be a simple matter of changing the class internals, rebuilding the object, and recompiling the system. No risk of upsetting the rest of the system as long as the public interface does not change. That is maintainability: test the object without having to retest the entire system.

Code:
#include <ctime>
#include <cstdio>
#include <cstdlib>
#include <iostream>
#include <string>
using namespace std;

class CalData{
    public:
    double xmin,xmax,ymin,ymax;
    int n = 0,error = 0;
    char date[20],revision[20];
    int load(string file_name);
    double read(double val);
    string errors(void);
    private:
    double *xr,*yr;
};

double  CalData::read(double val) {
    int i;
    //linear interpolation
    if(val < xmin || val > xmax){
        error = 2;
        time_t now = time(0);
        char* dt = ctime(&now);
        FILE *p = fopen("log.txt","a");
        fprintf(p,"%s  %s\n",dt,"Range Error");
        fclose(p);
        return 0;
        }
    if(val == xr[n-1])return yr[n-1];//check last value
    for(i=0;i<n-1;i++){
        if(xr[i] == val)return yr[i];
        if(val > xr[i] && val < xr[i+1])
            return yr[i] + (val-xr[i])*(yr[i+1]-yr[i])/(xr[i+1]-xr[i]);
        }
   return 0;
}

string CalData::errors(void){
    switch(error){
        case 0:return "No Errors";
        case 1:return "File Error";
        case 2:return "Range Error";
    }
    return "Unknown Error";
}


int CalData::load(string file_name) {
    int i;
    FILE *data = fopen(file_name.c_str(),"r");
    if(!data){error = 1;cout<<"error"<<endl;return 1;};
   char s1[30],s2[30];
    fscanf(data,"%s\n",date);
    fscanf(data,"%s\n",revision);
    fscanf(data,"%s\n",s1);
    n = atoi(s1);
    double *x = new double[n];
    double *y = new double[n];
    for(i=0;i<n;i++){
        fscanf(data,"%s  %s",s1,s2);
        x[i] = atof(s1);
        y[i] = atof(s2);
        }
   fclose(data);
   xmin = x[0];xmax = x[n-1];
   ymin = y[0];ymax = y[n-1];
    yr = y;
    xr = x;
    return 0;
}

int main(void){
    double x;
    CalData data[10];
    data[0].load("data.txt");
    x = data[0].read(2.6);
    cout<<"File Date  "<<data[0].date<<endl;
    cout<<"Revision  "<<data[0].revision<<endl;
    cout<<"Error   "<<data[0].error<<endl;
    cout<<data[0].errors()<<endl;
    cout<<"Interpolated Value   "<<x<<endl;

    return 0;
}

data.txt
07.23.24
Rev-1.0
4
1 10
2 20
3 30
4 40
 
Hash tables lead to a variety of interesting issues and techniques. When are they even used to start with? I'll guess that many broad applications very seldom have a need for hash tables, while they become ubiquitous in some other broad applications. We have many programmers here; tell us whether you have often, seldom or never used a hash table.

Hash tables are usually divided into two types: "open address" and "chained address." But in the 21st century, a new type was invented:  Cuckoo hashing. (Wikipedia classifies it as "open address" but I think it should get its own, new type.)

In its simplest form, cuckoo hash uses 2 tables of roughly equal size, each with its own hash function. To insert a new element you check the lone hashed-to location in each of the two tables, and insert the new element into whichever one is empty. If neither is empty, you insert it anyway! (Much as a cuckoo bird will insert its egg into an already occupied nest.) The removed entry is then re-inserted recursively. I think Cuckoo tables might have become popular had they been invented sooner, but by the 21st century a lot of code was fairly settled.
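A toy version of the insert loop, to make the egg-pushing concrete. This is my sketch; the two "hash functions" here are deliberately trivial stand-ins (key mod m and key div m), and a real implementation would rehash when max_kicks is exhausted:

```python
class CuckooHash:
    """Two tables, each with its own hash; lookup probes at most 2 slots."""
    def __init__(self, size=11):
        self.size = size
        self.t = [[None] * size, [None] * size]

    def _h(self, which, key):
        # Stand-in hash functions for the demo -- NOT good hashes.
        return key % self.size if which == 0 else (key // self.size) % self.size

    def lookup(self, key):
        return any(self.t[w][self._h(w, key)] == key for w in (0, 1))

    def insert(self, key, max_kicks=100):
        if self.lookup(key):
            return True
        w = 0
        for _ in range(max_kicks):
            i = self._h(w, key)
            if self.t[w][i] is None:
                self.t[w][i] = key
                return True
            key, self.t[w][i] = self.t[w][i], key   # evict the resident (the cuckoo step)
            w = 1 - w                               # re-insert it in the other table
        return False                                # give up: time to rehash

c = CuckooHash()
for k in list(range(10)) + [22]:
    c.insert(k)
print(c.lookup(22), c.lookup(99))   # True False
```

Worst-case lookup is two probes, which is the main selling point over ordinary open addressing.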

One advantage Cuckoo tables have over Open-address is that they facilitate the space-saving technique called "quotienting." I think "quotienting" is little known though, again, IIRC it was introduced by Donald Knuth in his famous book.
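My understanding of quotienting, for the curious: with 2^b buckets, a key's bucket index already encodes the low b bits of its hash, so each slot need only store the high-order "quotient" bits. A toy sketch, ignoring the displacement bookkeeping a real open-address table needs:

```python
B = 8                          # table has 2**B buckets
MASK = (1 << B) - 1

def store(table, h):
    # Home bucket = remainder; it is implied by the position, not stored.
    i = h & MASK
    table[i] = h >> B          # only the quotient goes into the slot
    return i

def recover(table, i):
    # Full hash = stored quotient recombined with the bucket index.
    return (table[i] << B) | i

table = [None] * (1 << B)
i = store(table, 0xDEADBEEF)
print(hex(recover(table, i)))   # 0xdeadbeef
```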
 
Binary Search

Works as advertised.

Binary search is faster than linear search except for small arrays. However, the array must be sorted first to be able to apply binary search. There are specialized data structures designed for fast searching, such as hash tables, that can be searched more efficiently than binary search.


Code:
#include <cstdio>
#include <iostream>
using namespace std;

int binary_search(int n,int *d,int x){
    int lo,hi,lastlo,lasthi,mid,niterations = 0;
    static int flag = 0;
    lastlo = 0;
    lasthi = n;
    hi = lasthi;
    lo = lastlo;
     while(1){
        niterations++;
        mid = lo + (hi-lo)/2;
       //cout<<lo<< "  "<<hi<<"  "<<mid<<" "<<d[mid]<<endl;
        if(x == d[mid])break;
        if(x<d[mid]) {lo = lastlo;hi = mid; }
        if(x>d[mid]) {lo = mid; hi = lasthi; }
        lastlo = lo;
        lasthi = hi;
        if(niterations == n+1){cout<<"Runaway Loop"<<endl;break;}
    }
    //restart output file
    if(flag == 0){
        FILE *p1 = fopen("bin.test","w");
        fclose(p1);
        flag = 1;
        }
    FILE *p = fopen("bin.test","a");
    fprintf(p,"%d\t  %d\t  %d\n",d[mid],x,niterations);
    fclose(p);
    return mid;
}

void binary_test(void){
    int i,x;
    const int n = 2001;
    int d[n];
    for(i=0;i<n;i++)d[i] = i;
    for(i=0;i<n;i++){
        x = binary_search(n,d,d[i]);
        //cout<<d[i]<<"  "<<x<<endl;
        }
}

int main(){
    binary_test();
    return 0;
}

File output
1989 1989 9
1990 1990 11
1991 1991 10
1992 1992 11
1993 1993 8
1994 1994 11
1995 1995 10
1996 1996 11
1997 1997 9
1998 1998 11
1999 1999 10
2000 2000 11
 