
Language as a Clue to Prehistory

Proto-Amerind Numerals by Merritt Ruhlen
The Amerind language family includes all the aboriginal languages of North and South America, except for those belonging to the Eskimo-Aleut and Na-Dene families. Comparative linguistic evidence from extant (or attested) Amerind languages indicates that Proto-Amerind - the language from which all Amerind languages derive - used a system of counting in which an obligatory numeral prefix, *ne-, preceded the numeral root. The first three numerals in Proto-Amerind seem to have been *ne-kwe '1,' *ne-pale '2,' and *ne-kwatlas '3.' A fourth numeral, Proto-Amerind *ta-pale '4,' combined a reflexive prefix with the Proto-Amerind root for '2' in order to express the number '4.'
MR could find a sizable amount of evidence for his reconstruction of "1" and "2", but "3" mostly in Almosan (N America W Coast) and Andean (S America W Coast). For "4", he found 2+2 and (reflexive)-2 ("2 with itself"). He was unable to find any separate word for "5", noting that it is often derived from "hand", something common cross-linguistically, like "finger" > "1" and "human being" > "20".
 
TYPOLOGY OF NUMERAL SYSTEMS by Bernard Comrie

"Restricted systems, with little or no internal structure"
Pirahã has none
Most of them go up to 3 or 4 or at most 5 (often < "hand").

Then mentioning "subitizing", the ability to estimate a number of things at a glance without counting them. We typically go up to 4.

"Simple systems with addition only"
Often addition of 2's and 1's.

"More complex systems using multiplication and addition applied to a base"
Listing bases 6, 8, 10, 12, 20, 32, 60

All of them:
2, 3, 4, 5, 6, 8, 9, 10, 12, 20, 32, 40, 60, 80

Somatic origins:
  • 10 - fingers
  • 20 - fingers and toes; each finger twice (two phalanges/knuckles)
  • 8 - spaces between fingers (attested for some California languages)
  • 12 - phalanges or knuckles of fingers (excluding thumbs)
Phalange - finger segment
Knuckle - finger joint

If the lowest multiplicative base is higher than about 12, and the system is not an extended body-part counting system, then the higher numbers below the lowest multiple base make use of an additive base, ... some languages use this even for a smaller lowest multiplicative base, ...
like 5 then 10 and 10 then 20.
 
"Idiosyncrasies relating to bases" - lots of them

"Exponentiation and other higher bases" - Indo-European has powers 1 to 3 of 10, Greek adds 4: murias "myriad, 10,000", and Sanskrit and Chinese have even higher powers.

Classical Nahuatl: 20 ~ "to count", 400 = 20^2 ~ "hair", 8000 = 20^3 ~ "bag, sack"

English has short and long counts of high powers: million: 10^6, billion: 10^9 or 10^12, trillion: 10^12 or 10^18, ... replacing the m- with a numeral prefix.
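
The two counts follow simple formulas, and here is a minimal Python sketch of them. The function names are my own, and n is the number behind the Latin prefix (1 for million, 2 for billion, 3 for trillion): the short count gives 10^(3n+3) and the long count gives 10^(6n).

```python
def short_scale_exponent(n: int) -> int:
    """Power of 10 for the n-th '-illion' in the short count."""
    return 3 * n + 3


def long_scale_exponent(n: int) -> int:
    """Power of 10 for the n-th '-illion' in the long count."""
    return 6 * n


# Print the exponents for the first three '-illions' in both counts.
for n, name in enumerate(["million", "billion", "trillion"], start=1):
    print(f"{name}: 10^{short_scale_exponent(n)} (short), "
          f"10^{long_scale_exponent(n)} (long)")
```

The two counts agree only for "million" and drift apart after that.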

Sometimes powers are from "big", like the origin of "million": Italian -one (augmentative, "big" something): mille "1000" milione "million" < "big 1000"

That likely happened in PIE to form *kmtom 100 from *dekm 10 and *tuHsont- 1000 from *tewH- "to swell" and *kmtom
 
Languages | Free Full-Text | Ancient Connections of Sinitic
Considering various hypotheses, then deciding on a "linkage" between Sino-Tibetan and Dene-Yeniseian.

Sino-Tibetan has two main subfamilies: Sinitic (Chinese) and Tibeto-Burman, though some phylogenies put Sinitic inside of "Tibeto-Burman" as an early brancher.
  1. Sino-Tibetan includes Kra-Dai and Miao-Yao as branches close to Sinitic
  2. Austric: Austronesian and Austroasiatic
  3. Austro-Tai: AN and KD
  4. East Asian / Trans-Himalayan: ST, (Yangzian: MY, AA), AT
  5. Sino-Austronesian or STAN: ST, AT
  6. ST related to DY, more broadly, Sino-Caucasian or Dene-Caucasian

About (1),
However, nearly all of the lexical evidence for this connection is loanwords from Sinitic into Tai-Kadai and Miao-Yao, as Li and others have shown. Shafer and others suggest that Sinitic, Tai-Kadai, and Miao-Yao form a subgroup within ST, but actually, Tai-Kadai and Miao-Yao each have distinct basic vocabulary.
Chinese cultural prominence? Like for Korea and Japan and Vietnam, all three of which have numerous words with Chinese origin, like Chinese number words alongside their native ones. That went further in Proto-Tai, its speakers dropping their inherited number words for Chinese ones.

About (2), "Diffloth (1994) shows that the Austric linkage hypothesis is unsustainable."

About (3), "Much more solidly, Benedict (1942) showed (3) a very close relationship with a large number of cognates between Tai-Kadai and Austronesian in a family which he called Austro-Thai; this connection is fairly widely accepted, and Tai-Kadai is subsumed within (4) Starosta’s East Asian and (5) Sagart’s STAN as part of Austronesian."

Unfortunately, he does not mention Weera Ostapirat's work on Austro-Tai.

Author David Bradley rejects (4) and (5) and then considers (6).
As we have seen, there is substantial evidence of syntactic, morphological, phonological, and lexical similarities between ST, Yeniseian, and ND. These are particularly strong in stable areas such as basic structural features, including negation, prohibition, valency increase, and so on; also in SAP pronouns, lower numerals, basic kinship terms, and so on.
Valency increase: causatives and the like, which add another verb argument, as it might be called.

SAP = Speech Act Participant: first and second person pronouns.

The linkage (5) of Proto-ST with Yeniseian and ND, which cannot be attributed to contact, is supported by various evidence briefly summarised above. The lexical evidence suggests a pre-Neolithic linkage, sharing only the domestic dog. The Yeniseian languages are to the northwest of Proto-ST in central Siberia; the ND groups later migrated from northeast Siberia into northwest North America. The linkage is also supported by genomic evidence presented in Bradley (2023, forthcoming). The shared linguistic retentions of ST and Dene-Yeniseian languages have persisted over great geographical distances, despite many millennia of lack of contact.
I concede that this paper is somewhat disappointing in having no statistical analyses of highly-conserved vocabulary, Kassian-Starostin and Ostapirat sorts of analyses.
The generally-agreed location for the origins of Sinitic is the upper Yellow River valley. In the early Neolithic period corresponding to Proto-ST, cultivation of Setaria and Panicum millets, Glycine (soybean), and the domestic pig started in this area and later diffused more widely. Etyma for these crops and this animal are reconstructed for Proto-ST and attested in Sinitic and nearly every branch of TB across East, Southeast, and South Asia (Bradley 2011, 2016, 2022). The chronology of the subsequent dispersal of Sinitic and the TB languages across this wide area can be traced through regular sound and morphosyntactic changes, as well as lexical innovation, including new vocabulary for new crops and new domestic animals over the period from 5.6K YBP to the present. For more discussion of the phylogeny and spread of Proto-ST, see Bradley (2022, 2023, forthcoming); Bradley et al. (forthcoming) and many other sources.

This chronology, along with archaeological and genomic findings summarised in Bradley et al. (forthcoming) and the early cognate etyma within Proto-ST, suggest that Proto-ST was possibly spoken during the Peiligang Culture and certainly during early to mid-Yangshao Culture in the upper Yellow River valley and that Sinitic was spoken during the late Yangshao and Longshan cultures, spreading downriver into northeast China, where Sinitic speakers took up the cultivation of rice and developed a high culture which they later spread and diffused across the rest of China (Bradley et al., forthcoming).
 Peiligang culture - 7000 - 5000 BCE - Neolithic - Yi-Luo river (Henan Province)
 Cishan culture - 6500 - 5000 BCE - Neolithic - E foothills of Taihang mountains
 Yangshao culture - 5000 - 3000 BCE - Neolithic - middle Yellow River
 Longshan culture - 3000 - 1900 BCE - Neolithic - lower and middle Yellow River
Longshan = "Dragon Mountain" in Chinese

Dated language phylogenies shed light on the ancestry of Sino-Tibetan | PNAS - "Our findings point to Sino-Tibetan originating with north Chinese millet farmers around 7200 B.P. and suggest a link to the late Cishan and the early Yangshao cultures." - 5200 BCE
Dated phylogeny suggests early Neolithic origin of Sino-Tibetan languages | Scientific Reports - "But we find that the initial divergence of this group occurred earlier than previously suggested, at approximately 8000 years before the present, coinciding with the onset of millet-based agriculture and significant environmental changes in the Yellow River region." - 6000 BCE
 
Prehistory is history before writing or before some starting point. So I'll consider the history of writing.

Before full-scale writing, writing that can fully represent spoken language, there was proto-writing, possibly extending well into the Pleistocene. Writing was independently invented only a few times, with most other forms of writing being descended from these inventions or else inspired by learning about writing: stimulus diffusion.

The four independent inventions and their descendants:

Egyptian hieroglyphics
  • Hieratic -> Demotic -> Meroitic
  • Proto-Sinaitic
    • Ugaritic
    • South Arabian -> Ge'ez
Phoenician had oodles of descendants, and this is not a complete list:
  • Paleo-Hebrew
  • Aramaic
    • Brahmi -> numerous South Asian writing systems like Devanagari and Tibetan
    • Square Hebrew
    • Nabataean -> Arabic
    • Syriac -> Central Asian writing systems like Mongolian
    • Mandaic
  • Greek
    • Etruscan -> Roman
    • Coptic
    • Gothic
    • Armenian
    • Georgian
    • Glagolitic
    • Cyrillic

Cuneiform writing

Chinese writing -> Japanese hiragana, katakana

Central American writing
 
Now for kinds of writing.
  • Pictographic writing - of pictures
  • Ideographic writing - for concepts
  • Logographic writing - for words or word parts (morphemes)
  • Syllabary - for each syllable
  • Alphabet - for each speech sound
  • Abjad - for consonants, with vowels usually omitted
  • Abugida - for consonants, with vowels indicated if other than some default one

I remember an article about road signs from my childhood saying that they seem to have regressed by showing pictures instead of text, like a picture of a leaping buck rather than the words "DEER XING". But it's easier to recognize at a glance.

In road signs, possible or permitted directions are presented with arrows that may be straight or curved or multiple. That seems to me to be only quasi-pictographic, because it is an abstraction, thus making it ideographic.

A recent form of pictographic and quasi-pictographic ideographic writing is variously called smilies or emoticons or emojis. Though some people hate them, I like them.

Another form of ideographic writing is representation of mathematics, including numbers. Writing numbers ideographically is very old and very common, with the Egyptian and Mesopotamian numerals almost as old as their writing systems.

Writing numbers ideographically rather than as words sometimes causes trouble in historical-linguistics research: Hittite Grammar - Numbers - "The pronunciation of most numbers is unknown since numbers are generally written with cuneiform ideograms."
 
I remember an article about road signs from my childhood saying that they seem to have regressed by showing pictures instead of text, like a picture of a leaping buck rather than the words "DEER XING". But it's easier to recognize at a glance.
And is effective even for motorists who aren't fluent in the local language.

Using internationally recognised standard road signs with as little text as possible is a major benefit to road safety.

When the FIFA Women's World Cup was being hosted here in Australia, we had a match in Brisbane between Germany and South Korea. A lot of German tourists took the opportunity to visit our city, and a significant proportion did not comprehend the signs that read "Authorised Buses Only" and ended up driving hire cars on the busway (occasionally on the wrong side of the road).
 
Some ancient texts did not have much ideographic writing of numbers. Consider the Bible.

2 Chronicles 7:5 - interlinear with the original Hebrew

John 21:11 - interlinear with the original Greek

The originals are written out, and the King James Version also uses written-out forms:

2 Chr 7:5 - And king Solomon offered a sacrifice of twenty and two thousand oxen, and an hundred and twenty thousand sheep: so the king and all the people dedicated the house of God

John 21:11 - Simon Peter went up, and drew the net to land full of great fishes, an hundred and fifty and three: and for all there were so many, yet was not the net broken.

However, some modern-English translations use ideographic writing of numbers, like the New English Translation:

2 Chr 7:5 - King Solomon sacrificed 22,000 cattle and 120,000 sheep. Then the king and all the people dedicated God’s temple.

John 21:11 - So Simon Peter went aboard and pulled the net to shore. It was full of large fish, 153, but although there were so many, the net was not torn.

Yes, (Hindu-)Arabic numerals are ideographic.
 
The first writing systems were pictographic, made logographic by using pictures for similar-sounding words in the fashion of a rebus puzzle. That name is short for Latin "non verbis sed rebus": "not with words but with things", about making pictures to represent names in heraldry.

I say logographic and not ideographic because each symbol represents a word or a word part (morpheme), and not necessarily a concept.

Trying to invent logograms for everything is very difficult, and in most systems, some logograms became used for their syllable sounds - syllabograms - making a logosyllabic system, a logographic system with a syllabary. Egyptian went the same way, but only for initial consonants, making a logoconsonantal system, a logographic system with an abjad.

Chinese speakers went the farthest in trying to make a logogram for every word, and they invented many compound characters, often with one part specifying some meaning and the other part specifying some sound. For instance, the Chinese character for mother is the characters for woman and horse, a word for a kind of woman, a word which sounds like the word for horse. But much of this analysis was done for Chinese pronunciation of some 2,000 - 3,000 years ago, and such details are often obscured.

The result is a very difficult writing system -- one has to learn a character for every one-syllable word or word part.

Even so, some Chinese characters are used as a syllabary for foreign names and the like.

Japanese continue to use about 2,000 - 3,000 Chinese characters, kanji, alongside two syllabaries, kana, both derived from Chinese characters: hiragana and katakana. Hiragana is used for native words and grammatical parts, and katakana for non-Chinese borrowings and names.

Koreans, however, use Chinese characters, hanja, much less than in the past, mostly using their alphabet, hangul. The letters of each syllable are grouped into a square block, giving hangul a vaguely Chinese-like appearance.

All the other logographic systems fell out of use long ago - Egyptian, Mesopotamian, Anatolian, Cretan, Central American - all replaced by alphabets, abjads, or abugidas ultimately derived from Proto-Sinaitic, a descendant of Egyptian hieroglyphics. Like what you are reading right now.
 
 Numeral (linguistics)
Base 80: octogesimal
Supyire:
Bases: 1, 5, 10, 20, 80, 400
Ratios: 5, 2, 2, 4, 5

Some other mixed ones:
Sumerian sexagesimal: 1, 10, 60 -- 10, 6
1, 5, 10 -- 5, 2
1, 5, 20 -- 5, 4
1, 10, 20 -- 10, 2
1, 5, 10, 20 -- 5, 2, 2
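
The ratios in these lists are just each base divided by the one before it. A small Python sketch, assuming (as in all the examples here) that each base divides evenly into the next; the function name is my own:

```python
def base_ratios(bases):
    """Return the ratio of each base to the base before it.

    Assumes every base is an exact multiple of the previous one,
    so integer division loses nothing.
    """
    return [hi // lo for lo, hi in zip(bases, bases[1:])]


systems = {
    "Supyire": [1, 5, 10, 20, 80, 400],
    "Sumerian": [1, 10, 60],
}
for name, bases in systems.items():
    print(name, bases, "--", base_ratios(bases))
```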

For writing numbers, Western (Hindu-)Arabic numerals, as they might be called, have almost completely taken over, with alternatives mainly persisting for numbers in sequence and other special uses.

Among the systems displaced were Roman numerals:
Usage varied greatly in ancient Rome and became thoroughly chaotic in medieval times. Even the more recent restoration of a largely "classical" notation has failed to produce total consistency: variant forms are even defended by some modern writers as offering improved "flexibility".
1 I, 2 II, 3 III, 4 IV, 5 V, 6 VI, 7 VII, 8 VIII, 9 IX, 10 X, 11 XI, 12 XII, 13 XIII, 14 XIV, 15 XV, 16 XVI, 17 XVII, 18 XVIII, 19 XIX, 20 XX, 30 XXX, 40 XL, 50 L, 60 LX, 70 LXX, 80 LXXX, 90 XC, 100 C, 500 D, 1000 M
with alternatives like IIII for IV
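
For the largely "classical" subtractive notation in that list, the conversion can be sketched in Python as a greedy walk down a table of values. This is only an illustration: the table and the function name are my own, it stops at 3999, and it does not produce variants like IIII.

```python
# Values in descending order, including the subtractive pairs (IV, IX, XL, ...).
ROMAN = [
    (1000, "M"), (900, "CM"), (500, "D"), (400, "CD"),
    (100, "C"), (90, "XC"), (50, "L"), (40, "XL"),
    (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I"),
]


def to_roman(n: int) -> str:
    """Convert 1..3999 to subtractive-notation Roman numerals."""
    if not 1 <= n <= 3999:
        raise ValueError("only 1..3999 are supported in this sketch")
    parts = []
    for value, symbol in ROMAN:
        count, n = divmod(n, value)  # how many times this symbol fits
        parts.append(symbol * count)
    return "".join(parts)


print(to_roman(14), to_roman(90), to_roman(1987))
```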

Another system displaced was Greek numerals. It uses letters of the Greek alphabet, the first nine for 1 to 9, the second nine for 10, 20, ..., 90, and the third nine for 100, 200, ..., 900. Since the Greek alphabet has only 24 letters, disused letters were used for the remaining three values. This is an alphabetic numeral system, and a Roman-alphabet version would be

1 to 9: A B C . D E F . G H I
10 to 90: J K L . M N O . P Q R
100 to 900: S T U . V W X . Y Z &

Isaac Asimov mentioned & as the extra letter in one of his essay books, but I can't find a source for that.
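
Here is a short Python sketch of that hypothetical Roman-alphabet version, using the table above with & as the 27th sign. It is my own illustration of how an alphabetic numeral system works, not an attested system: one letter per nonzero digit, for 1 to 999.

```python
# 27 signs: A-I for 1-9, J-R for 10-90, S-& for 100-900.
LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ&"


def to_alphabetic(n: int) -> str:
    """Write 1..999 with one letter per nonzero decimal digit."""
    if not 1 <= n <= 999:
        raise ValueError("only 1..999 are supported in this sketch")
    hundreds, rest = divmod(n, 100)
    tens, units = divmod(rest, 10)
    out = ""
    if hundreds:
        out += LETTERS[18 + hundreds - 1]  # third group of nine
    if tens:
        out += LETTERS[9 + tens - 1]       # second group of nine
    if units:
        out += LETTERS[units - 1]          # first group of nine
    return out


print(to_alphabetic(153), to_alphabetic(900))
```

Note that there is no zero and no place value: 900 is a single sign, and 153 is three signs whose order does not affect the total.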

In antiquity, Isopsephy (Greek) and Gematria (Hebrew) were the practice of adding up the numerical values of a word's letters and then trying to interpret the meaning of that number.

That's what's behind 666 being the Number of the Beast in the Book of Revelation. It's often interpreted as a gematria version of "Neron Caesar".
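
The sum can be checked with a few lines of Python. I am assuming the commonly cited Hebrew spelling nun-resh-waw-nun qoph-samekh-resh, written here in my own Latin transliteration (N R W N Q S R) to keep the code simple; the letter values are the standard Hebrew ones.

```python
# Standard numerical values of the Hebrew letters involved,
# keyed by a Latin transliteration (an assumption for readability).
VALUES = {
    "N": 50,   # nun
    "R": 200,  # resh
    "W": 6,    # waw
    "Q": 100,  # qoph
    "S": 60,   # samekh
}


def gematria(word: str) -> int:
    """Add up the letter values, ignoring spaces."""
    return sum(VALUES[ch] for ch in word if ch != " ")


print(gematria("NRWN QSR"))
```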
 
Turning to other systems, the Egyptian one had a symbol for each power of 10, repeating each symbol as needed. Thus, 42 is 10 10 10 10 1 1
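
As a sketch of that additive scheme in Python, with each symbol stood in for by its numeric value (the function name is my own):

```python
def egyptian_symbols(n: int) -> list[int]:
    """List the power-of-10 symbols for n, largest first, each repeated as needed."""
    # Find the largest power of 10 that does not exceed n.
    power = 1
    while power * 10 <= n:
        power *= 10
    symbols = []
    while power >= 1:
        count, n = divmod(n, power)  # how many copies of this symbol
        symbols.extend([power] * count)
        power //= 10
    return symbols


print(egyptian_symbols(42))
```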

The Babylonian one had a symbol for 1 and a symbol for 10 repeated as needed for each base-60 digit. It also had a place system, with zero indicated as a space, and later as a symbol.
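
A rough Python sketch of that layout: split the number into base-60 digits, then split each digit into a count of 10-symbols and a count of 1-symbols. The function names are my own, and a zero digit comes out as (0, 0), standing in for the blank space or the later zero sign.

```python
def base60_digits(n: int) -> list[int]:
    """Base-60 digits of a non-negative integer, most significant first."""
    digits = []
    while True:
        n, d = divmod(n, 60)
        digits.append(d)
        if n == 0:
            break
    return digits[::-1]


def babylonian(n: int) -> list[tuple[int, int]]:
    """For each base-60 digit, (number of 10-symbols, number of 1-symbols)."""
    return [divmod(d, 10) for d in base60_digits(n)]


print(base60_digits(12345), babylonian(12345))
```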

It's rather obvious that these systems for writing numbers are all ideographic, corresponding to the mathematics rather than to their users' words.

Consider 12,345.

Written out in English, it is "twelve thousand three hundred forty-five" - somewhat irregular.

A more regular example is Chinese: 一万二千三百四十五
yī wàn èr qiān sān bǎi sì shí wǔ

Word for word, it is
one - ten thousand - two - thousand - three - hundred - four - ten - five
or
1 - 10,000 - 2 - 1,000 - 3 - 100 - 4 - 10 - 5
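
That digit-then-power pattern is regular enough to generate mechanically. Here is a simplified Python sketch for 1 to 99,999; it is my own illustration and ignores the finer points of real usage, such as writing 零 for internal gaps or shortening 一十 to 十.

```python
DIGITS = "零一二三四五六七八九"
UNITS = [(10000, "万"), (1000, "千"), (100, "百"), (10, "十"), (1, "")]


def chinese_numeral(n: int) -> str:
    """Digit + power word for each nonzero decimal digit (simplified)."""
    if not 1 <= n <= 99999:
        raise ValueError("only 1..99999 are supported in this sketch")
    parts = []
    for value, unit in UNITS:
        digit, n = divmod(n, value)
        if digit:  # zero digits are simply skipped in this sketch
            parts.append(DIGITS[digit] + unit)
    return "".join(parts)


print(chinese_numeral(12345))
```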

But Chinese, like English, has separate words for 1, 10, 100, 1,000, and Chinese also has 10,000 "myriad".

Though English, like many other Indo-European languages, is somewhat irregular from 11 to 99, Hindustani numerals are a champion, with just about every one in that range being irregular. But I must note that much of this irregularity is due to various contractions.
 
I've been discussing cardinal numbers, the numbers for counting members of a set, but there are several other types of number words. For instance, ordinal numbers are numbers in sequence: first, second, third, fourth, ...

To keep the clutter down for other kinds of numbers, I will only do 4.
  • Cardinal number: four
  • Ordinal number: fourth, quaternary
  • Adverbial number: four times
  • Multiplier: fourfold, quadruple
  • Collective: set of four, foursome, quadruplet, tetrad, quadri-, tetra-
  • Distributive: four at a time, four of each, in fours, in groups of four, four of something in a group, quadruply
  • Fractional: quarter, fourth
Nearly all of these various kinds of numbers have words that are derived from the corresponding cardinal numbers, whether native or Latin or Greek.

English "second" is borrowed from Old French, in turn descended from Latin secundus, literally "following". Also used in Latin was alter "other". Romance languages have descendants of secundus, though many French speakers nowadays use deuxième, the regularly formed ordinal: deux 2 -ième.

English "first" and most other Indo-European langs' words contain PIE *per- "before, in front, first".

Elsewhere, Hebrew rishon "first" < rosh "head".

However, the Turkic langs are completely regular, with "first" and "second" formed from 1 and 2 with the Turkic ordinal suffix. Thus, Turkish birinci < bir -inci and ikinci < iki -inci.

Korean also has an ordinal suffix for all numbers, 째 -jjae. Though 첫째 cheotjjae "first" has that suffix, it is otherwise irregular.

Chinese has an ordinal prefix for all numbers, 第 dì-, as do Japanese (第 dai-), Thai (ที่ tîi-), and Vietnamese (thứ).
 
(PDF) Seven Dene-Caucasian Etymologies by John Bengtson
Seems rather paltry, but I decided to read it anyway.
Therefore it is not imagined that the discussion of these seven etymologies is sufficient to “prove” the Dene-Caucasian macro-family.
Just to illustrate it, I think.

Then noting that Basque has a causative prefix -ra-, West Caucasian causative -r-, Tibetan valency increaser r-, Na-Dene *tl-, also Tibetan s-, Haida s-, and the Burushaski transitivizer -s-.

Then addressing the counterargument of the extent of DC, from SW Europe to W North America. That's not really a problem, because of its time depth and how far people have traveled over similar time depths. Imagine a group of people who travel a day's walk each generation. That's roughly 1 kilometer per year, and that is fast enough for dispersal over most of the time that our species has existed. Speeding up to a day's walk each year, some 30 kilometers, one gets much faster dispersion. The total extent of DC is roughly 20,000 km, and one could travel that entire distance in only 700 years.
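
The arithmetic can be laid out in a few lines of Python. The 30 km day's walk and the 20,000 km extent are the figures above; the 30-year generation is my own assumption, chosen because it gives the "roughly 1 kilometer per year" rate.

```python
def years_to_cross(distance_km: float, km_per_year: float) -> float:
    """Years needed to cover a distance at a steady yearly rate."""
    return distance_km / km_per_year


EXTENT_KM = 20_000     # rough total extent of Dene-Caucasian
DAY_WALK_KM = 30       # one day's walk
GENERATION_YEARS = 30  # assumed length of a generation

# One day's walk per generation: about 1 km per year.
slow = years_to_cross(EXTENT_KM, DAY_WALK_KM / GENERATION_YEARS)
# One day's walk per year.
fast = years_to_cross(EXTENT_KM, DAY_WALK_KM)

print(f"slow: {slow:.0f} years, fast: {fast:.0f} years")
```

Even the slow rate crosses the whole extent in about 20,000 years.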

Furthermore, it seems to me that wide dispersion strengthens the case for shared ancestry rather than for borrowing. Ancestry vs. borrowing has been a big problem for Altaic, because the homelands of the three Core Altaic subfamilies are rather close, in or near present-day Mongolia.

Then getting to the question of Haida. Is it a member of Na-Dene or is it an isolate? JB argues that one can ignore that question, since one can work from Tlingit, Eyak, and Athabaskan.

His list:
  1. stomach, vomit -- Bsq, NC, Bur, ST, ND: A E H
  2. fire, smoke -- Bsq, NC, Bur, Yen, ST, ND: A E
  3. gum, wax -- Bsq, NC, Bur, Yen, ND: A T
  4. limb, bone -- Bsq, NC, Bur, ST, ND: A E T H
  5. liver -- Bsq, NC, Bur, Yen, ST, ND: A E
  6. finger, thumb -- Bsq, NC, ND: A E T H
  7. water -- Bsq, NC, Bur, Yen, ST, ND: A T H
Some of the semantics are a bit stretchy, it seems to me.

The article ended with some sound correspondences.
 
[2404.00284] A Likelihood Ratio Test of Genetic Relationship among Languages, with HTML-rendered full text

Statistical testing of long-distance relationships - nice idea. But there are some big problems with the authors' methodology. They try to use methods from working with gene and protein sequences. Bioinformatics is well-developed enough to have a name, and enormous amounts of gene-sequence and protein-sequence data have been collected. Genetic material, DNA and RNA, has four building blocks or "letters", their nucleotides: A, T (in DNA) or U (in RNA), G, C. Proteins have twenty building blocks or "letters", their amino acids. A "codon" of three nucleotides maps onto one amino acid.

They used Dolgopolsky-like simplified phonology, using several sets.

They use a substitution model that involves constant rates of all substitutions, a reasonable starting point for point mutations but not for language features. Sound changes are usually very regular, something associated with the Neogrammarian historical linguists of late 19th-century Germany (Junggrammatiker, lit. "young grammarians"). But different languages have different sound changes.

They then get into phylogeny testing, something that I did not follow very well.

They used Afrasian, Dravidian, Indo-European, Kartvelian, Lolo-Burmese, Mayan, Mixe-Zoque, Mon-Khmer, Munda, and Uto-Aztecan, and varying numbers of family members, concepts, and words. One of their sources was Wiktionary, a crowdsourced Wikipedia-like online dictionary, but they should have used its references.

"In the Nostratic grouping, we considered the languages that are surviving or have surviving descendants and were attested by the 10th century CE. The motivation behind this choice is that older languages should be closer to the ancestral language and each other if at all there is any relationship."

:rolleyes:

They should have used protolanguages as far as they can be reliably reconstructed. Here are earliest attestations and protolanguage estimations for Nostratic.
  • Indo-European: 1725 BCE (Hittite), 4000 BCE
  • Uralic: 1055 CE (Hungarian), 8000 - 2000 BCE
  • Turkic: 750 CE (Old Turkic), 3000 - 500 BCE
  • Mongolic: 1225 CE (Middle Mongol), 1200 CE
  • Tungusic: 1119 CE (Jurchen), 500 BCE - 500 CE
  • Chukotko-Kamchatkan: (last few centuries), 2000 BCE
  • Nivkh/Gilyak: (last few centuries)
  • Eskaleut: (last few centuries), 4000 - 2000 BCE
  • Korean: 1287 CE (Old Korean), (Old Korean)
  • Japonic: 712 CE (Old Japanese), 700 - 300 BCE
  • Kartvelian: 430 CE (Old Georgian), 2000 BCE
  • Dravidian: 200 BCE (Old Tamil), 3000 BCE
By using protolanguages, one goes back 2,000 to 6,000 years.

They used Afrasian only a little bit, along with Lolo-Burmese (Sino-Tibetan), as a pair without recognizable relationship.

So I find this paper to be very disappointing.
 
Fortunately, that paper had lots of interesting references, and those papers in turn pointed me to some similar ones. Here goes:

Global-scale phylogenetic linguistic inference from lexical resources | Scientific Data by Gerhard Jäger

He earlier published Support for linguistic macrofamilies from weighted sequence alignment | PNAS

One of its results was likely sound shifts, by comparing phonemes in known cognates. It used the Automated Similarity Judgment Program simplified phonology, less simplified than Aharon Dolgopolsky's and similar phonologies.

Clusters in this data have characteristic points of articulation:
  • Tip of tongue: dental, alveolar
  • Lips: labial
  • Back of tongue, throat, voice box: velar, uvular, pharyngeal, glottal
  • Vowels
GJ then constructed a cognacy-judgment classifier that uses this table, and then trained it on known cognates and non-cognates. It seems to be at least somewhat successful.
 
The evolutionary dynamics of how languages signal who does what to whom | Scientific Reports
Languages vary in how they signal “who does what to whom”. Three main strategies to indicate the participant roles of “who” and “whom” are case, verbal indexing, and rigid word order. Languages that disambiguate these roles with case tend to have either verb-final or flexible word order.
Cases: noun cases to indicate agent and patient/target/undergoer. "Verbal indexing"? Indicating what is A and what is P on the verb? Like passive voice?
Our results corroborate the claims that verb-final word order generally gives rise to case and, strikingly, establish that case tends to lead to the development of flexible word order. The combination of novel statistical methods and the Grambank database provides a model for the rigorous testing of causal claims about the factors that shape patterns of linguistic diversity.
Something like Joseph Greenberg's work on linguistic universals.
In utterances with an Agent (A) and a Patient (P), for example, Henry kissed Mark, languages need to signal which argument maps onto which role. To highlight the distinct roles of these two arguments, languages choose between grammatically marking verbs or nouns, enforcing rigid word order, or relying on semantic cues, such as animacy.
Like if the more animate one is the agent, with some marking to indicate if it is instead the patient.
(1) verb-final languages mark case on nominal words; (2) verb-initial languages index both A and P arguments on verbs; and (3) verb-medial languages make use of rigid word order more than the two mentioned word order type languages
For (2), is that like Austronesian voice? That has the role of the "subject" marked on the verb. Like passive voice, but broader.
 
Another explanation stems from the noisy-channel hypothesis(11,20): when A and P arguments are adjacent as in verb-final languages, the chances for misinterpreting the roles of the two are higher than when the arguments are separated by a verb. Specifically, verb-medial languages are more robust to noise disrupting the linguistic signal because when one of the arguments cannot be recovered, the position of the available argument on either side of the verb will inform its role (A or P), which is not the case in verb-final languages with both arguments preceding the verb. Thus, explicit disambiguation between two arguments in the form of case marking occurs more frequently in verb-final languages that are less robust to noise than verb-medial ones(7,9,10).

The co-occurrence of flexible word order and case is explained by the trade-off hypothesis(14), which claims that languages should generally aim for a balance between clarity and ease of production. Redundancy would manifest itself in using several means for the same purpose simultaneously. This could be advantageous for the recipient as it would provide a more robust signal if there is noise or misunderstandings but can be costly for the sender. The economy principle on the other hand dictates that the sender should use as few resources as possible to get the message across, avoiding redundancy and unnecessary content.
They find support for (verb-final order) -> (noun cases) -> (flexible word order)
 
The authors used a database of 195 grammatical features, Grambank, including several for word order, whether there are indefinite and/or definite articles, whether there is a dual (a plural for two), etc.

Just the thing for Greenbergian research.

It is a part of Glottobank "an international research consortium established to document and understand the world’s linguistic diversity."

They also have Lexibank - "A framework to curate lexical data." Its pages on concepts link to entries in CLLD Concepticon 3.2.0 - "A Resource for the Linking of Concept Lists" Its list of concepts include several versions of the Swadesh list.

They have some others that they don't seem to have released.

Parabank - "a large database of selected paradigmatic structures found in the world’s languages." like which are common kinds of noun-case syncretism, different cases looking alike. One of them is the Indo-European Neuter Law, which states that the neuter or inanimate gender has same-looking nominative and accusative cases.

Phonobank - "a cross-linguistic comparative database of sound patterns, sound correspondences, and sound shifts." Which would be very valuable for historical linguistics.

Numeralbank - "public database and repository on numeral systems in the world’s languages."

This is the most that I could find: Numeralbank "(under-construction)" - at the site Numeral Systems of the World - like The Numbers List but giving full sets, though lacking reconstructed protoforms.
 
Grammars Across Time Analyzed (GATA): a dataset of 52 languages | Scientific Data - the time depth is very low: a century plus or minus half a century.

Not enough to see the bigger changes in grammar that we have had over recorded history.

Expansion by migration and diffusion by contact is a source to the global diversity of linguistic nominal categorization systems | Humanities and Social Sciences Communications

"Languages of diverse structures and different families tend to share common patterns if they are spoken in geographic proximity." - are they borrowed or inherited?

Both lexicon and grammar vary with respect to their inherent stability (Haspelmath and Tadmor, 2009; Dediu and Cysouw, 2013) but in general, more grammaticalized features of grammar have higher stability rates than more lexical features, and more frequent grammatical and lexical features have higher stability rates than less frequent features (Thomason and Kaufman, 1988; Wilkins, 1996; Matras, 2009). Even though lexical morphemes can be borrowed at varying degrees, grammatical morphemes are very seldom borrowed (Matras and Sakel, 2007). The most frequent lexical items of basic vocabulary have high stability rates and are usually not borrowed (Greenhill et al., 2017), but a majority of the lexicon has lower stability rates and is subject to borrowing at varying degrees (Haspelmath, 2009; Carling et al., 2019). Grammaticality can be viewed as a continuum, ranging from the most grammatical items of grammar (frequent function words of low transparency) to the least grammatical items of the lexicon (cultural and non-frequent content words of high transparency) (Matras and Sakel, 2007).
Then noting three main kinds of classification: grammatical gender, like in Indo-European and Semitic, noun classes like in Bantu languages, and classifiers, like in many eastern Asian languages.
On a grammaticality continuum, gender and noun class markers are thus typical examples of ‘grammatical items’, while classifiers are relatively closer to ‘lexical items’, or ‘content words’.
Grammatical gender may be interpreted as a kind of noun class with a small number of classes.
 
Based on this premise, the existing literature suggests that, on the one hand, classifiers are more easily diffused (horizontally) across language families than gender and noun class (Nichols, 1992: 32, 2003; Wichmann and Holman, 2009: 54–55; Seifart, 2010; Greenhill et al., 2017). On the other hand, in terms of vertical inheritability within languages of the same family, grammatical gender and noun class systems are much more stable than classifiers (Nichols, 2003; Greenhill et al., 2017; Allassonnière-Tang and Dunn, 2020). Studies indicate that grammatical gender hardly ever arises in the course of language contact (Stolz and Levkovych, 2021).
To test this hypothesis, the authors looked through descriptions of 3077 languages for references to gender, noun classes, and classifiers. This far exceeds what is in the World Atlas of Language Structures: WALS Online - Home - something much like Grambank.
The data were compiled by automatic data extraction and checked manually according to precisely defined linguistic criteria for identifying the presence/absence of different nominal classification systems. Data were first extracted from language grammars and grammar sketches using a lightweight keyword-extraction technique (Supplementary material 1.3). Thereupon, manual checking was performed for each individual language and feature, using the Gramfinder tool as an aid for navigating through grammars more effectively (Supplementary material 1.4).
They used AI to search through the literature, then checked what the AI found.
Within the data, 26.5% (814/3077) of the languages have classifiers, while 20.1% (634/3077) have gender and 10.3% (317/3077) have noun classes. We can also see that 46.6% (1434/3077) of the languages do not have any of the three systems.
The geography also agreed with previous surveys.
 