Language as a Clue to Prehistory

More Mother Tongue 23. Robert Lindsay has a review of Lyle Campbell's and Mario Mixco's "A Glossary of Historical Linguistics"
As will be shown, Campbell and Mixco repeatedly seriously distort the state of consensus regarding many language families, particularly long-range ones. They usually favor a more negative and conservative view, saying that a family has little support when it has significant support and saying it is controversial when the consensus in the field is that the family is real. Campbell and Mixco engage in serious distortions of fact all through the book.
RL wants the opinions of those who have researched some langs and lang families. "I don’t see why the opinions of non-specialists are so important. Go study the languages and get back to us!"

"I tried to be as fair as possible. As a long-ranger, clearly I am biased towards long-range proposals and against conservative views. However, you will see below a number of cases where I acknowledge that consensus either rejects or even strongly rejects the proposal."

Then what are the more cautious linguists arguing? Separate origin? I agree that that is unlikely. In fairness to such people, they seem to be arguing that language change makes it too hard to detect relationships from the early Holocene and before, that is, before the mid-Holocene time of the likes of Indo-European and Austronesian.

RL then assesses LC & MM's assessments in gory detail.
 
LC & MM start with Afroasiatic: "Enjoys wide support among linguists but it is not uncontroversial". RL responds by saying that it is uncontroversial and that all its subfamilies are correct.

This is even though we cannot reconstruct the numerals for it. However, low-tech people generally don't have words for many numbers, often only up to 2 or 3 or 4.

"I am not aware of any serious proposals to see Cushitic and Omotic as an Altaic-like Sprachbund of mass borrowings."

"As Campbell notes here, there is consensus that at least some form of Greenberg’s Niger-Congo is a valid grouping." All but Mande and Ijo, and maybe also Dogon and most Ubangian langs.

Then Nilo-Saharan, which is very controversial, with only some of it generally accepted: Eastern Sudanic, Central Sudanic, and a few others. Likewise, Khoisan is often considered very doubtful. Referring to an article about that family, RL says "Most of the proposals listed in this article are more deserving of moderate skepticism than decisive denial if one is inclined to reject them. If more conservatives talked like this, the debate would become less political and perhaps we could get back to science again."
 
Not surprisingly, LC & MM are very skeptical about Nostratic. RL: "It is correct that consensus among specialists is to reject Nostratic, but serious papers taking apart the proposal seem to be lacking."
Also, since Nostraticists are using the comparative method of reconstruction of proto-families and finding sound correspondences, they win the approval of even critics like Campbell, who at least appreciate that derided methods like mass comparison are not being used (Campbell 1998).
RL then notes the m-t pronoun pattern in the Eurasiatic part of Nostratic and also in Kartvelian, and he is sure that it has to be inherited, dismissing alternate hypotheses as absurd. "That said, it’s not completely apparent to me that Dravidian and Afroasiatic match well with the rest of the somewhat expanded Eurasiatic. I will side with conservatives with that theory and say it’s not yet proven."

Then Greenberg's Eurasiatic. Like Nostratic, it is not generally supported, but critiques of it have been uncommon. It has some support from computerized mass comparison. "Eurasiatic may be a more solid entity than Nostratic. Correlatively, the parts of Nostratic that overlap with Eurasiatic may be the best supported."

Indo-Anatolian (or Indo-Hittite) - that's splitting Indo-European into Anatolian and the rest of the family.

LC & MM: "Indo-Uralic: The hypothesis that the Indo-European and Uralic language families are genetically related to one another. While there is some suggestive evidence for the hypothesis, it has not yet been possible to confirm the proposed relationship."

RL: "This summary seems too negative. Indo-Uralic is probably one of the most promising long-range proposals out there."

LC & MM: "Yukaghir: A small language family of Siberia, composed of Tundra (Northern) Yukaghir and Kolyma (or Southern) Yukaghir. It is often thought possibly to be related to Uralic, though the evidence has not yet been sufficient to confirm this proposal."

RL rebutted this assertion, finding evidence like lots of matches of verbs, a kind of word less borrowed than nouns. The pronoun paradigms are also very close.
 
LC & MM dismiss Altaic, even Narrow Altaic: Turkic, Mongolian, Tungusic
The most serious problems for the Altaic proposal are the extensive lexical borrowing across Inner Asia and among the ‘Altaic’ languages, lack of significant numbers of convincing cognates, extensive areal diffusion, and typologically commonplace traits presented as evidence of relationship.

The shared ‘Altaic’ traits typically cited include vowel harmony, relatively simple phonemic inventories, agglutination, their exclusively suffixing nature, (S)OV ([Subject]-Object-Verb) word order, and the fact that their non-main clauses are mostly non-finite (participial) constructions.

These shared features are not only commonplace typological traits that occur with frequency in unrelated languages of the world and therefore could easily have developed independently, but they are also areal traits shared by a number of languages in surrounding regions, the structural properties of which were not well-known when the hypothesis was first framed.
I found the cat who is scratching the furniture.
would become
I the furniture-scratching cat found.
RL:
It is true that Altaic is still up in the air, and anti-Altaicists are not rare, even currently (Nichols 2012; Vajda 2020), but Campbell and Mixco are not correct when they say that the idea has been abandoned. Most linguists in the West regard it as a laughingstock, and if you say you believe in it, you will experience intense bullying and taunting from them. Oddly enough, in Russia, Altaic is regarded as obviously valid. The anti-Altaicists are frankly belligerent, hysterical, and typically violate the rules of academic decorum (see the Amerind debate for a similar situation). A recent overview (Blažek 2019) looked at the history of pro- and anti-Altaic views over time and showed that neo-pro-Altaicists are quite common.
"The problem is that most of the linguists who will laugh in your face and call you an idiot if you believe in Altaic are not specialists in these languages." - while people who have done research on them usually conclude that at least some of them are related, like Turkic and Mongolian, or Mongolian and Tungusic.

The usual counterargument is enormous amounts of borrowing, including of 1st and 2nd person pronoun paradigms.
Furthermore, the borrowings must have occurred at an early stage. Gerhard Doerfer, the famous Turkologist opponent of Altaic, had to keep pushing his massive borrowings of core vocabulary further and further back until he eventually had the scenario taking place at the Proto-Turkic, Proto-Tungusic, and Proto-Mongolic levels.

In my opinion, massive borrowing of core vocabulary at the proto-language level is simply another word for genetics. This argument – mass borrowing of core vocabulary at the proto-language level – has also been used by conservatives in the cases of Uralo-Yukaghir (Häkkinen 2012) and Quechumaran (Adelaar 1992) to explain away genetic explanations for those families.
 
Then talking about Japanese-Korean. Also controversial, with many Japanese linguists rejecting it outright, even if some researchers into Altaic accept it. Then Yeniseian, with LC & MM neglecting the Dene-Yeniseian proposal, then Nivkh, then "Paleosiberian" langs.

LC & MM then head southward and say "Austroasiatic: A proposed genetic relationship between Mon-Khmer and Munda, accepted as valid by many scholars but not by all." Another uncontroversial group.

LC & MM: "Miao-Yao (also called Hmong-Mien): A language family spoken by the Miao and Yao peoples of southern China and Southeast Asia. Some proposals would classify Miao-Yao with Sino-Tibetan, others with Tai or Austronesian; none of these has much support." RL: this ignores Austroasiatic.

LC & MM: "Austro-Tai: A mostly discounted hypothesis of distant genetic relationship proposed by Paul Benedict that would group together the Austronesian, Tai-Kadai, and Miao-Yao." -- except that it groups only Austronesian and Tai-Kadai. RL: "In fact, evidence is building towards acceptance of Austro-Tai after papers by Weera Ostapirat (2005, 2013) and Laurent Sagart (2005, 2019) proved the case using the comparative method."

"Even the larger version of Austric, including all of Benedict’s families plus Ainu and the South Indian isolate Nihali, has some supporters and there is some suggestive evidence that it may be correct (Bengtson 2006)."

LC & MM say about Tai-Kadai that it is "generally but not universally accepted", but it's an uncontroversial family.

Then Na-Dene. There is a lot of controversy over whether Haida is a member, but it's generally agreed that Tlingit is, with the core of Na-Dene being Eyak-Athabaskan.
 
LC & MM dismiss Greenberg's Amerind, saying "There is an excessive number of errors in Greenberg’s data." and "In various instances, Greenberg compared arbitrary segments of words, equated words with very different meanings (for example, ‘excrement/night/grass’), misidentified many languages, failed to analyze the morphology of some words and falsely analyzed that of others, neglected regular sound correspondences, failed to eliminate loanwords, and misinterpreted well-established findings."

RL then mentions n-m pronouns and also t’Vna -- t'ana = child, t'una = girl, woman, female relative, t'ina = boy, man, male relative. For my part, I think that n-m looks strong, but I'm not sure about t'Vna. Given the wide semantics posited for this root, and the large number of langs that one can look in, I'm not very impressed. At least with n-m one has well-defined semantics.
With this review, you will see that in the thirty-four years since LIA, Americanists have slowly come to the conclusion that, the more people looked into it, more and more of Greenberg’s basic Amerind classification is correct.

Most of the larger nodes are still rejected, but a shocking number of the smaller nodes and the families contained within them have been slowly shown to be correct. Yes, he was wrong at times. In the South American section, you will see that he sometimes got whole language groups in the wrong families. But still, more of his views have been accepted than rejected.
Those successes are where JG's method can work relatively well, it must be noted. To go much further, one will need to find protolanguages -- they will encapsulate a lot of comparative work.

RL went into some of the history of Amerindian macrolinguistics, noting the lumping of the first half of the 20th century, then the splitting that followed it. We are now doing lumping again, but usually much more carefully.
In this sense, the label conservative to apply to Campbell, etc. may be misleading. In the sense that they are trying to stop or slow down what they see as a radical, excessively encompassing, poorly-thought out tendency of the long-rangers, their argument for sober and deep reflection before arriving at major conclusions is indeed conservative, possible in the positive sense of the word: “Hey, slow down! Not so fast now! Let’s not go overboard. Let’s think this out before we jump to conclusions.”

In addition, by requiring long-rangers to dot every I and cross every t and make sure the work is at a very high standard, the conservatives are forcing the long-rangers to work harder and harder and accumulate more and more evidence for their proposals. In this sense of holding the long-rangers’ feet to the fire, the conservatives are actually doing the long-rangers a favor.
By being devil's advocates.
 
After a long discussion of Amerindian langs, Robert Lindsay gets to Australia. LC & MM say about it "Pama-Nyungan is accepted by most Australianists as a legitimate language family, but not uncritically and not universally. It is rejected by Dixon; it is held by others to be plausible but inconclusive based on current evidence."

Aside from RMW Dixon, RL states "Instead, Pama-Nyungan is about as uncontroversial as Macro-Gê, Afroasiatic, or Austroasiatic."

About half of the Papuan langs are in Trans-New-Guinea. LC & MM call it "promising but not yet confirmed", but RL considers it uncontroversial.

RL then grumbles that historical-linguistics conservatives dominate academia, to the point that it's hard to get a position if one wants to work on long-range comparisons.
The problem is that when a paradigm is in effect, all scholars are supposed to publish within the paradigm. Publishing outside the paradigm is regarded as evidence that one is a kook, a crank, is practicing pseudoscience, or that one is crazy or a fool. It is instructive in this debate to note that most of the prominent long-rangers are independent scholars operating outside of academia.
This seems like what many advocates of crackpottery like to say, that mainstream scientists are closed-minded orthodox oxen. Also, some long-range proposals seem like crackpottery, like Proto-World. If it's hard to go farther back than the mid-Holocene, then going back to when our ancestors started to split up seems like an impossibility.
I have had long-rangers tell me that the only reason they can take the long-ranger position is because they are independent and don’t have a university job so there are no repercussions if they are wrong. They told me that if they had a professorship, they would not be able to do this work. They have also told me that they know for a fact that certain conservatives might jeopardize their jobs, careers, and especially their funding if they took a long-ranger position. This was given as one of the reasons for their dogmatic conservatism. As an example, prominent long-ranger Merritt Ruhlen was never able to advance beyond Lecturer at Stanford University although he was an excellent scholar (Bengtson 2021).
 
Then Gregory Haynes with an article on "Resonant Variations on Immortality"

*n-mer-(t)- "undying, immortal" ~ *n-mel-(t)- "immortal, without harm, unfaded, undamaged, unwounded"
with proposed roots
*n-men-(t)- "immortal, unharmed" ~ *n-me-(t)- "undying, immortal, undestroyed, unhurt"

Consonantal Alternations in Indo-European Roots: Diatopic and/or Diachronic Variants or Functional Mechanism? 1 | Aldo L Bizzocchi - Academia.edu
Diatopic: about changes over space, diachronic: about changes over time

Lists several, like *kap- ~ *ghabh- (Latin capere "to take", English "to have" ~ Latin habêre "to have", English "to give")

Also some l ~ n alternations.

Mother Tongue 4 has articles by Roger Wescott and John Bengtson, with more such alternations, and I'd commented on them.

Back to MT 23. It has a repeat of Vaclav Blazhek's article on Indo-European connections of China's Ymir-like primordial giant Pangu with a little added at the end.
 
Mother Tongue 23 ends with John Saul's "The Ends of the Earth", a lot of farfetched mythological analysis.
Early groups of our ancestors may have left Africa and migrated north in an effort to learn the “immortalizing secret” of the storks or cranes who would annually leave Africa and then return, seemingly born again, as I have proposed elsewhere. On reaching southern Europe or the Middle East, our ancestors would have discovered the disorienting ornithological truth. At that point, our ur-religion, if we may call it such, may have undergone a schism, as have all religions since, with some of our ancestors following the stars to the west and others heading east toward the birthplace of the life-giving Sun. Driven by hope and ideas, and moving through lands lacking other sapiens, they could have reached the “ends of the Earth” within a very few years (as did Marco Polo), and there left their mark.
There are more pragmatic reasons to move elsewhere, like a nicer environment and being away from other people. This brings to mind a speculation by science-fiction writer L. Sprague de Camp. After noting science-fiction fans doing some nasty squabbling, and noticing lots of nasty squabbling more generally, he proposed a theory that that's a way of helping our ancestors to avoid overexploiting the land where they live. If people multiply and continue to live near each other, they go over the carrying capacity of the land where they live. But if they start squabbling, then some of them may be motivated to move elsewhere, thus spreading out the population.

He earlier contributed to MT: “Was the First Language Purposefully Invented?” in MT 7 -- then talking about notions of death and afterlife, something I found *very* disappointing. I would have been more interested in what syntactic complexity our predecessor species might have been capable of.
 
Anatolian *meyu- ‘4, four’ and its cognates by Alexei Kassian

That's a departure from the rest of Indo-European: *kwetwores

AK starts out by noting the number systems used by people with very low levels of technology:
  • this one, >1/several/many
  • 1, 2, >2/several/many
  • 1, 2, 3, >3/several/many
  • 1, 2, 3, 4, >4/several/many
This system can be extended in several ways:
  • Body parts: eyes = 2, hand = 5, fingers, ...
  • With a sibling/companion = even numbers, without a sibling/companion = odd numbers
  • Arithmetic: 3 = 2+1, 4 = 2+2 = 2*2, 6 = 5+1 = 2*3, 7 = 5+2, 8 = 5+3 = 2*4 = 10-2, 9 = 5+4 = 10-1, 10 = 2*5, ...
  • Borrowings from other languages
For all Nostratic: 1, 2. For Uralic, Altaic, Dravidian: 1, 2, 3, 4 -- are 3, 4 unstable?

For IE: 2 to 4 are underived, 5 = hand, 10 = right hand or two hands, 7 = borrowing from Semitic or from some shared source, 9 = new?

IE *oktô -- dual ending -- 2*4 -- Proto-Kartvelian *otxo- = 4 -- possible form with that meaning in IE
Iranian *ašti- ‘(breadth of) four fingers’ (measure of length), retained in Avestan compounds
 
Joseph Greenberg: "Generalizations about numeral systems" has this universal, #6 in his list: “The largest value of L in system with only simple lexical representation is 5 and the smallest is 2” -- L is “the next largest natural number after the largest expressible in the system”.

1, many -- 1, 2, many -- 1, 2, 3, many -- 1, 2, 3, 4, many

AK proposes Proto-Altaic *moyu "all, whole", and he claims that a lot of Nostratic words survived in Anatolian but not in other branches of IE.
 
Numeral Systems of the Languages of California on JSTOR
Author(s): Roland B. Dixon and A. L. Kroeber
Source: American Anthropologist, New Series, Vol. 9, No. 4 (Oct. - Dec., 1907), pp. 663-690

A *lot* of variety. Mentions some Australians and South Americans who count 1, 2, 2+1, 2+2, 2+2+1 and 1, 2, 3, 2+2, 3+2 -- hard to get very far with that.
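
To make the "hard to get very far" point concrete, here is a minimal Python sketch that spells out numbers in the first scheme (1, 2, 2+1, 2+2, 2+2+1). The function name `pair_count` is something I made up for illustration; it is not from Dixon and Kroeber, and real languages of this type need not extend the pattern mechanically.

```python
def pair_count(n: int) -> str:
    """Express n using only the words for 1 and 2, e.g. 5 -> '2+2+1'."""
    if n < 1:
        raise ValueError("these systems start at 1")
    # As many 2s as fit, then a single 1 if n is odd.
    twos, ones = divmod(n, 2)
    return "+".join(["2"] * twos + ["1"] * ones)


for n in range(1, 11):
    print(n, "=", pair_count(n))
```

By 10 the expression already has five terms, which is presumably why systems like this stop early or switch to body parts or a higher base.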

For the Californians, base-5 counting with a subsequent base of either 10 or 20 or 10 then 20.

Body-part names: in Yurok, 7 = pointer (index finger), 8 = long (middle finger)

The word "index" itself is from Latin, where it means "pointer, indicator" and related meanings.

Some of the Californians use base 4.

Also mentions multiplication: 4 = 2*2, 6 = 2*3, 8 = 2*4, 12 = 3*4, 15 = 3*5.

Addition is the most common compounding process, with subtraction sometimes used: 9 = 10-1, 14 = 15-1, 19 = 20-1.

Also mentions phonetic analogies, making words for successive numbers sound similar.

For numerals above ten, on the other hand, the decimal system, generally pretty pure, occurs in the enormous majority of cases, covering the entire continent with the exception of parts of California and Mexico, the Eskimo area, and the sections occupied by the various members of the Caddoan stock. Only in these few areas does no trace of the decimal system exist above ten. At a number of points on the Northwest coast a quinary system somewhat mixed with decimal occurs.

Mexico is noteworthy for practically not possessing a single native language showing the decimal system either below or
Mayan speakers used base 20 even for the very large numbers that they sometimes named. Also, Uto-Aztecan and Numbers in Nahuatl -- Nahuatl uses base-20, while the rest of Uto-Aztecan uses base 10. So base 20 is likely a Central American areal feature.
Consistent or thorough decimal systems, where all the numerals, both below and above ten, are on this basis, cover very large areas, including the regions occupied by the large and important Siouan, Athabascan, Shoshonean, Iroquoian, and Salish stocks. This area is in the main that of the central portion of the continent, and it extends to the Pacific coast in only one or two places.

As contrasted with the wide extension of thorough decimal systems, consistent quinary-vigesimal systems occur but rarely. Outside of Mexico, they are to be found only among the Caddoan tribes, the Eskimo, and in parts of California.
I checked on Algonquian langs, and from Numbers - Learning Ojibwe that lang uses a decimal system.
 
Mark Pagel, a coauthor of Ultraconserved words point to deep language ancestry across Eurasia | PNAS has coauthored some other interesting papers.

Dominant words rise to the top by positive frequency-dependent selection | PNAS - from wanting to use words that others use.

Frequency of word-use predicts rates of lexical evolution throughout Indo-European history | Nature - looking at English, Spanish, Russian, and Modern Greek, using text corpora with 20 to 100 million words each.

Their written histories:
  • English: 1,300 years (Old English)
  • Spanish: 2,300 years (Old Latin)
  • Russian: 900 years (Old East Slavic)
  • Greek: 3,400 years (Mycenaean Greek), 2,800 years (Homeric Greek)
Their ancestors diverged with the spread of the Corded Ware culture in Northern Europe, around 5,000 years ago.

Noting that IE langs' words for "tail" are very variable - English "tail", German Schwanz, French queue, Greek oura - while their words for "two" all come from a single protoform, *dwô.
Figure 1a shows the inferred distribution of rate estimates, where we observe a roughly 100-fold variation in rates of lexical evolution among the meanings. At the slow end of the distribution, the rates predict zero to one cognate replacements per 10,000 years for words such as ‘two’, ‘who’, ‘tongue’, ‘night’, ‘one’ and ‘to die’. By comparison, for the faster evolving words such as ‘dirty’, ‘to turn’, ‘to stab’ and ‘guts’, we predict up to nine cognate replacements in the same time period.
Thus confirming Aharon Dolgopolsky's results.

Figure 2 shows that the distribution of word-use frequencies in each language is highly skewed, such that most words are used relatively infrequently (fewer than 100 times per million words), with a small number of frequently used words (as often as 35,000 times per million words) accounting for most speech. Word-use frequencies are highly correlated among the four languages (0.78 < r < 0.89, mean r = 0.84; Supplementary Fig. 2), showing that words used at a high frequency in one language tend to be used at a high frequency in the other languages.
The authors suggest that these frequencies have been very stable over the history of Indo-European. But there is a confounding factor. All four corpora are likely of present-day and relatively recent texts, and living in present-day society may account for this similarity.
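
A correlation of this kind is easy to compute once one has per-million frequencies for the same meanings in two corpora. Here is a minimal Python sketch. The frequencies below are invented for illustration and are not from the paper, and correlating log-frequencies is my own assumption about a sensible way to handle the skewed distribution, not necessarily what the authors did.

```python
import math


def pearson(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical uses per million words for the same meanings in two languages.
freq_a = {"two": 1500.0, "who": 2200.0, "night": 400.0, "dirty": 30.0, "to stab": 5.0}
freq_b = {"two": 1300.0, "who": 1800.0, "night": 350.0, "dirty": 45.0, "to stab": 8.0}

meanings = sorted(freq_a)
r = pearson(
    [math.log10(freq_a[m]) for m in meanings],
    [math.log10(freq_b[m]) for m in meanings],
)
print(round(r, 3))
```

The same function could be run on frequencies from an older corpus against a modern one, which is the kind of comparison that would address the present-day-society confound.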

There is a good check: find these results for literature from the past. One won't get as good a sample size, but one should be able to get enough to find overall trends. However, a problem may be limited subject matter. The oldest continually-transmitted work of literature is the Rig Veda, and that is a big collection of hymns. The oldest Greek texts are Linear B tablets, and those are all bookkeeping records. Etc.

The authors found a negative correlation between rate of use and rate of change. Since some parts of speech are used more than others, they split the words up by part of speech. "To examine this effect, we categorized meanings as either nouns, adjectives, verbs, pronouns, numbers, conjunctions, prepositions or special adverbs (‘what’, ‘when’, ‘where’, ‘how’, ‘here’, ‘there’ and ‘not’)"

What they found: "For a given frequency of meaning-use, prepositions and conjunctions evolve most quickly, followed by progressively slower evolution for adjectives, verbs, nouns, special adverbs, pronouns and finally numbers."

Also, "The generality of this influence is suggested in the finding that estimates of the rate of lexical replacement in Indo-European languages are correlated with rate estimates in Bantu, Cushitic and Malayo-Polynesian."

Then on language change in general. "One is that we expect languages to diverge in the least frequently used parts of their vocabularies. This may mean that languages retain mutual intelligibility far longer than expected from simple uniform rates models of linguistic divergence."

One can check such estimates with almost mutually intelligible languages like Spanish and Italian, Russian and Ukrainian, etc.
 
How do we use language? Shared patterns in the frequency of word use across 17 world languages - PMC
original at
How do we use language? Shared patterns in the frequency of word use across 17 world languages | Philosophical Transactions of the Royal Society B: Biological Sciences
For example, among a sample of 87 Indo-European languages, all speakers use a related group of sounds or words to describe ‘two’ (we use the symbol <'> to denote a given meaning, or concept, and the symbol <”> to refer to a word form) objects but use 45 or more different and unrelated words to describe something as ‘dirty’.
The authors used 17 corpora representing 16 languages:
  • Indo-European:
    • Germanic: English
    • Italic: Romance: Spanish, Portuguese, Italian
    • Slavic: Russian, Polish, Czech
    • Hellenic: Greek
  • Uralic: Finnic: Finnish, Estonian
  • Altaic: Turkic: Turkish
  • Sino-Tibetan: Chinese
  • (isolate): Basque
  • Austronesian: Maori
  • Niger-Congo: Swahili
  • Creole: Tok Pisin (English-based, New Guinea)
Macrofamily membership:
  • Borean
    • Eurasiatic: Indo-European, Uralic, Altaic
    • Dene-Caucasian: Chinese, Basque
    • Austric: Austronesian
  • Congo-Saharan: Niger-Congo
The authors then used Morris Swadesh's 200-word list. "Despite being separated by thousands of years of linguistic evolution, the average inter-correlation among the four languages in the frequency with which they used these common words was 0.85." But the present-day-society problem is still in evidence, and the authors could have used corpora for past centuries.
We find it remarkable that frequency of use and a word's part of speech can together account for close to half of the variation in rates of lexical replacement. The results using Starostin's rank-order list are encouraging that this might be a very general effect, and we look forward to testing whether our results predicting rates of lexical replacement hold in new samples using rates derived from other language families.
In order, and with clustering:
((numeral, pronoun), adverb), (verb, noun), adjective

Their earlier results:
numeral, (pronoun, adverb, noun, (verb, adjective)), (preposition, conjunction)

So when they are stable, numerals and pronouns are superstable.
 
The deep history of the number words | Philosophical Transactions of the Royal Society B: Biological Sciences - comparing Indo-European, Bantu, and Austronesian

They found the words for low numbers, for 1 to 5, have some of the lowest replacement rates of Swadesh-list words.

Replacement rate per thousand years:
What       IE               Bt               AN
Words      200              102              154
Overall    0.20 +- 0.11     0.23 +- 0.09     0.35 +- 0.12
Range      0.0047 - 0.61    0.026 - 0.45     0.065 - 0.65
1 to 5     0.01 +- 0.04     0.11 +- 0.09     0.16 +- 0.05
2 to 5     0.06 +- 0.03; 0.10 +- 0.05 (two values only)
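
To get a feel for what these rates mean, one can turn them into half-lives. The sketch below assumes a constant replacement rate, so that the chance a word survives t thousand years is exp(-rate * t). That simple exponential model is my assumption for illustration, and the function names are made up. I take "Bt" to be Bantu and use only the overall figures quoted above.

```python
import math

# Overall mean replacement rates per thousand years, as quoted above.
overall_rate = {"IE": 0.20, "Bantu": 0.23, "Austronesian": 0.35}


def half_life_kyr(rate_per_kyr: float) -> float:
    """Thousands of years until a word has a 50% chance of having been replaced."""
    return math.log(2) / rate_per_kyr


def retention(rate_per_kyr: float, kyr: float) -> float:
    """Probability that a word is still unreplaced after `kyr` thousand years."""
    return math.exp(-rate_per_kyr * kyr)


for family, rate in overall_rate.items():
    print(family, round(half_life_kyr(rate), 2), "kyr half-life;",
          round(retention(rate, 5.0), 2), "retained after 5 kyr")
```

Under this model the slowest IE rate in the range row, 0.0047, would correspond to a half-life well over 100,000 years, which shows how extreme the slow end of the distribution is.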
For the Pama-Nyungan langs of Australia, 1, 2, 3 have the lowest mean rank order of all the 17 categories of words, categories like kinship and the environment.

The slowest-replaced words in each family:
  • IE: 2, 3, 5, who, 4, I/me, 1, we, when, tongue, name
  • Bantu: eat, tooth, 3, eye, 5, hunger, elephant, 4, person, child, 2
  • AN: child, 2, to pound/beat, 3, to die, eye, 4, 10, 5, tongue, 8

The authors speculate on possible causes:

(a) Evolutionarily conserved brain regions associated with numerosity (somehow) influence the learning and use of linguistic-symbolic number words.
Data from a study of the age of acquisition for 30,000 English words might be relevant to this idea. Children learn words earlier the more frequently those words are used in common everyday speech. But using the Kuperman et al. data, we find that all 10 number words from one to ten have earlier ages of acquisition than is predicted from their frequency of use (binomial test, p < 0.002, two-tailed; figure 4).
noting
Age-of-acquisition ratings for 30,000 English words | Behavior Research Methods
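
The quoted p < 0.002 is what one gets if each of the 10 number words is treated as a fair coin flip between "earlier than predicted" and "later than predicted". That null model is my assumption about how the test was set up. A short Python check of the arithmetic, with a made-up function name:

```python
from math import comb


def two_sided_binomial_p(successes: int, n: int) -> float:
    """Exact two-sided binomial p-value under a fair-coin null (p = 0.5).

    With p = 0.5 the distribution is symmetric, so the two-sided value is
    twice the tail at least as extreme as the observed count, capped at 1.
    """
    k = max(successes, n - successes)
    tail = sum(comb(n, i) for i in range(k, n + 1)) * 0.5 ** n
    return min(1.0, 2 * tail)


p = two_sided_binomial_p(10, 10)
print(p)  # 0.001953125, i.e. 2 / 1024
```

So 10 out of 10 is the smallest sample for which "all in the same direction" gets under 0.002 on a two-tailed test.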

(b) Number words are unambiguous in their meanings and therefore less likely to admit alternatives.
Large-scale surveys that record the words people use in conversation reveal that for some common objects and actions a variety of different words might be used, whereas for others most respondents use the same word: days of the week, months of the year and the number words fall into this latter category.

(c) Number words occupy a region of the phonetic space that is relatively full.

Also,
Some words we might expect to be highly conserved are not. Names of body parts, and relational words for mother, father, husband and wife, or he and she, or perhaps words for fire or spears might all be expected to play central roles in everyday speech and especially so in ancient societies, and therefore be conserved. But with the exception of child, eye and tongue none of these words made it into the slowest-evolving set of words of any of the language families. Indeed, in contrast to the extreme conservation of the number words, there are 43 different cognate forms of the words for husband in the IE languages, and 37 of the words for wife.
 
Reconstruction: Proto-Indo-European/h₁óynos - Wiktionary, the free dictionary - many langs' number-word pages have boxes where one can step through the words.

All but 1 are well-preserved in all of the IE langs but Anatolian - Albanian, Armenian, Balto-Slavic, Celtic, Germanic, Hellenic, Indo-Iranian, Italic, Tocharian.

*treyes 3 -- protruding (middle) finger?
*kwetwores 4 -- too many consonants for a typical IE root: kw-tw-r

1 has variations:
  • *oynos - Albanian, Baltic, Celtic, Germanic, Italic
  • *oykos - Indic
  • *oywos - Iranian
  • *edhinos (?) > *edinu - Slavic
  • *sem - Armenian, Hellenic, Tocharian
As I'd mentioned earlier, Anatolian is an oddity: Anatolian *meyu- ‘4, four’ and its cognates -- the Hittite word for 1, sia-, matches IE demonstrative *so (> English the, that, etc.). But 2, 3, 7, 9 seem to agree with the rest of IE.

The typology and diachrony of higher numerals in Indo-European: a phylogenetic comparative study | Journal of Language Evolution | Oxford Academic - going into a lot of detail. Teens have the most variety, with later numbers' ones values being added as compounds. Multiples of 10 usually originated as compounds, but there are exceptions, like East Slavic 40 = sorok. Multiples of 100 and 1000 are always compounds: some number of 100 and 1000.
 
Doing that for the Semitic langs, I find
  • 1: East: *\aSt-, West */aHad-
  • 2: Ethiosemitic *kal'-, other *Tin-
  • 3 - 10 -- well-preserved: 3 *TalâT-, 4 */arba\-, 5 *HamS-, 6 *SidT-, 7 *Sab-, 8 *Tamâniy-, 9 *tiS\-, 10 *\asar-

I looked in Turkic and Dravidian, using The Numbers List, and I found them mostly well-preserved over 1 - 10. Exceptions: Chuvash (Turkic) has 1 nep, South Dravidian langs. have 9 = 10 - 1. Austronesian was more difficult, however.

WALS Online - Feature 131A: Numeral Bases and WALS Online - Chapter Numeral Bases - most of them are base-10, though there are some base-20, and some mixed base-10-20 systems: base-20 up to 100, then base-10. Some of them use other bases, some use extended body-part systems, and some of them are restricted, not counting very far.

Ekari (Trans New Guinea) - base 60: 10 - 6, like Sumerian
Supyire (Gur, Niger-Congo) - base 80: 5 - 2 - 2 - 4
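
A small sketch of the arithmetic behind those two entries, assuming that "10 - 6" and "5 - 2 - 2 - 4" list successive multipliers, so that the running products are the counting units (10, 60 and 5, 10, 20, 80). That reading is my assumption, and `place_values` and `decompose` are hypothetical helpers; the greedy split only illustrates the arithmetic, not how speakers actually build their number words.

```python
from itertools import accumulate
from operator import mul

def place_values(multipliers):
    """Running products of the step multipliers, e.g. [5, 2, 2, 4] -> [5, 10, 20, 80]."""
    return list(accumulate(multipliers, mul))

def decompose(n, multipliers):
    """Greedily split n into (unit, count) pairs, largest unit first, ending with ones."""
    parts = []
    for value in sorted(place_values(multipliers), reverse=True):
        count, n = divmod(n, value)
        parts.append((value, count))
    parts.append((1, n))  # whatever is left over is counted in ones
    return parts

if __name__ == "__main__":
    print(place_values([10, 6]))        # sub-bases read off the Ekari / Sumerian entry
    print(place_values([5, 2, 2, 4]))   # sub-bases read off the Supyire entry
    print(decompose(137, [5, 2, 2, 4]))
```

The last running product is the overall base, which is why 10 - 6 gives base 60 and 5 - 2 - 2 - 4 gives base 80.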

The WALS contributors unfortunately did not add any past languages or reconstructed protoforms. But as far as I can tell, base 10 is by far the most common among independently-elaborated systems, with base 20 next, and other bases following. It would be a big task to construct a comprehensive list, I must concede.

The Numbers List, Numeral Systems of the World, and Numeral (linguistics) are all about number words.

 Proto-Indo-European numerals
 
The field of macro-linguistics attracts a lot of amateurishness and crackpottery and just plain taurine excrement.

The Arabic Origins of Numeral Words in English and European Languages | Jassem | International Journal of Linguistics

With very contrived phonetic changes, like

khams > khamf > famf > fanf (German fünf) > faf (English five)
khams > kams > kamk > kank (Latin quînque) > sank (French cinq)
khamsat > kham(th/f)at > famfat > fampat > pampat > panat (Greek pente)

In conclusion, one can say with confidence that Arabic is the origin of all European numerals. This has huge theoretical implications as to the validity and practicality of postulating a proto-Indo-European origin for European languages, which is a mirage with no existence in reality.
Because there is no surviving documentation of any Proto-Indo-European dialect? Does this author believe that the world around him disappears when he closes his eyes?

Rather curiously, his paper makes no reference to other Semitic languages, like Hebrew or Aramaic or Akkadian or Amharic. That paper also doesn't have a lineup of words for numbers in various langs. If one makes such a lineup, it's obvious which ones most closely resemble which other ones.
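
Here is a rough sketch of such a lineup for 1 - 10 in English, German, Arabic and Hebrew. The word lists are my own simplified ASCII transliterations (diacritics and glottal/pharyngeal marks dropped), and the comparison is a plain Levenshtein edit distance on spelling, scaled by word length. That is a crude stand-in for eyeballing the lineup and has nothing to do with the comparative method proper; `mean_distance` and `closest` are hypothetical helper names.

```python
def edit_distance(a, b):
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]

# Rough ASCII transliterations of the numerals 1 - 10 (an assumption of this sketch).
NUMERALS = {
    "English": ["one", "two", "three", "four", "five",
                "six", "seven", "eight", "nine", "ten"],
    "German":  ["eins", "zwei", "drei", "vier", "funf",
                "sechs", "sieben", "acht", "neun", "zehn"],
    "Arabic":  ["wahid", "ithnan", "thalatha", "arbaa", "khamsa",
                "sitta", "saba", "thamaniya", "tisa", "ashara"],
    "Hebrew":  ["ehad", "shnayim", "shalosh", "arba", "hamesh",
                "shesh", "sheva", "shmone", "tesha", "eser"],
}

def mean_distance(lang_a, lang_b):
    """Average length-normalised edit distance over the ten numeral pairs."""
    pairs = list(zip(NUMERALS[lang_a], NUMERALS[lang_b]))
    return sum(edit_distance(a, b) / max(len(a), len(b)) for a, b in pairs) / len(pairs)

def closest(lang):
    """The other language whose numerals are nearest by mean_distance."""
    others = [other for other in NUMERALS if other != lang]
    return min(others, key=lambda other: mean_distance(lang, other))

if __name__ == "__main__":
    for lang in NUMERALS:
        print(lang, "->", closest(lang))
```

Even with a measure this blunt, English pairs off with German and Arabic with Hebrew, rather than English with Arabic.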
 
Biblical Criticism & History Forum - earlywritings.com - Index page - in the thread "Zeus is the Jewish God" I'd responded at length to someone who advocated the theory in this book:

Hebrew is Greek : Free Download, Borrow, and Streaming : Internet Archive by Joseph Yahuda

Amazon.com: Customer reviews: Hebrew is Greek
has positive reviews saying what great work JY has done, and negative reviews saying that his book is crackpottery:
Like similar efforts to link Hebrew and native American languages done by the LDS Church, this work depends almost entirely on "assonances," the similar sound of two unrelated words.
The words are then connected with some farfetched etymology, like Goropius's Latin quercus "oak" < Brabantic werd-cou "protector from cold".

Joseph Ezekiel Yahuda: The Man, the Myth, the Legend : AncientGreek

JY: "Looked at from any and every aspect, it should be manifest that Hebrew is Greek by another name" and "Hebrew is Greek, albeit somewhat altered Greek—Asiatic or Continental Greek, as distinct from European Greek"

Reviewer u/Kowber: "He puts great weight on the alphabets, seeming to imagine that he is among the first to note commonalities" -- as if writing systems demonstrate language relationships.

u/K: "Yahuda's ability to connect any Hebrew word to a Greek one rests not so much on his lack of interest in systematic correspondences of sound (which a linguist would demand) but his 'method' of turning one Hebrew sound into a variety of others." - using several dialects of Hebrew and Greek as sources, even though that makes it more likely to find close resemblances.
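
A back-of-the-envelope sketch of why a larger pool of sources makes resemblances easier to find. Assume, purely for illustration, that any one candidate form has a small chance p of resembling the target word by accident, and that the candidates are independent; then the chance of at least one accidental hit among n candidates is 1 - (1 - p)^n. The value p = 0.02 below is a made-up number, not a measured rate, and `p_any_match` is a hypothetical helper.

```python
def p_any_match(p_single, n_candidates):
    """Chance of at least one accidental resemblance among n independent candidates."""
    return 1.0 - (1.0 - p_single) ** n_candidates

if __name__ == "__main__":
    p = 0.02  # illustrative per-candidate chance, NOT an empirical figure
    for n in (1, 5, 20, 100):
        print(n, "candidates:", round(p_any_match(p, n), 3))
```

Every extra dialect, or every extra sound that one Hebrew sound is allowed to turn into, adds to n, so the chance of finding some "match" climbs toward certainty.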

JY tried to argue that Greek and Hebrew have the same grammar, and he says that "the ancestors of the Jews must have been among the noblest and/or the most ancient of the Hellenes, and that they spoke a language far more ancient than classical Greek".

JY also says:
I repeatedly tried to get others to join me in the venture, without success. Thus early on, at the end of a two-hour session with one of the prospective collaborators, he exclaimed: "All this is rubbish, and we've wasted each other's time." My response was: "You, as well as I, will be judged by these words which I shall quote whenever I discuss my work again."

Shortly after the aforesaid encounter, I quoted the disparaging remark uttered at its conclusion to the late Christodoulos Hourmouzios, a graduate of the University of Athens and an expert on Homer, who said to me: "But I think you are one of the greatest glossologists I know."
 
The Database of Cross-Linguistic Colexifications, reproducible analysis of cross-linguistic polysemies | Scientific Data

Many words have more than one related meaning - polysemy - and one can use such multiple meanings to create a network of related meanings. Example cited: Russian and Nahuatl words for "tree" also meaning "wood". Using Wiktionary's translations, I found several other examples.

For instance, arm and hand are closely related, since many langs have the same word for both. Here are some related meanings:
  • arm, hand - wrist
  • arm, wrist - elbow - knee - kneel
  • arm - fathom
  • hand - five
  • hand - palm of hand
One might also make a database of semantic shifts, like "hand" > "five".
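
A minimal sketch of such a network, using only the arm/hand links listed above. I am reading "arm, hand - wrist" as two links (arm - wrist and hand - wrist), and likewise for the other grouped entries; that reading and the helper names are my assumptions, and this is not the database's own format or code.

```python
from collections import deque

# Undirected links between meanings, taken from the list above.
LINKS = [
    ("arm", "hand"),
    ("arm", "wrist"), ("hand", "wrist"),
    ("arm", "elbow"), ("wrist", "elbow"),
    ("elbow", "knee"), ("knee", "kneel"),
    ("arm", "fathom"),
    ("hand", "five"),
    ("hand", "palm of hand"),
]

def build_graph(links):
    """Adjacency sets for an undirected graph of meanings."""
    graph = {}
    for a, b in links:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph

def shortest_path(graph, start, goal):
    """Breadth-first search; returns one shortest chain of meanings, or None."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in sorted(graph.get(node, ())):  # sorted for a repeatable result
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

if __name__ == "__main__":
    graph = build_graph(LINKS)
    print(" - ".join(shortest_path(graph, "five", "kneel")))
```

The chain from "five" to "kneel" shows how far apart two meanings can sit while still being joined by a series of individually plausible links.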

Sound symbolism is not very common in natural language, but it does exist.

Resolving the bouba-kiki effect enigma by rooting iconic sound symbolism in physical properties of round and spiky objects | Scientific Reports
The “bouba-kiki effect”, where “bouba” is perceived round and “kiki” spiky, remains a puzzling enigma. We solve it by combining mathematical findings largely unknown in the field, with computational models and novel experimental evidence. We reveal that this effect relies on two acoustic cues: spectral balance and temporal continuity. We demonstrate that it is not speech-specific but rather rooted in physical properties of objects, creating audiovisual regularities in the environment. Round items are mathematically bound to produce, when hitting or rolling on a surface, lower-frequency spectra and more continuous sounds than same-size spiky objects. Finally, we show that adults are sensitive to such regularities. Hence, intuitive physics impacts language perception and possibly language acquisition and evolution too.
Also called the “maluma-takete effect”.
 