
there appeared to be two or more semantically similar or even identical words under the same gloss and we had not elicited both words in all locations (a description of our wordlist elicitation procedures can be found in chapter 1, section 1.4.2). For example, there are two words in Sui for ‘wind’: kʰaːŋ⁵ and lum¹. Some locations appear to use only one word or the other, whereas other locations seem to distinguish between the two, the former being a general word for ‘wind’ or ‘to blow a wind’ and the latter a word for a ‘twisting wind’ or ‘tornado’. We did not discover the difference until halfway through the survey and thus did not consistently probe for both words in all locations. We therefore omitted both words completely from our lexical comparison.

Occasionally we excluded words from individual pairwise comparisons because we suspected that we might have elicited the wrong word in a certain location. For example, there is one commonly used word in Sui for ‘to lose or misplace’: tʰa¹. However, in Tangzhou (TZ, Western Sui) we only elicited the word tok⁷ for this gloss. The word tok⁷ is used in all Sui dialects to mean ‘to drop onto the ground’ and therefore by extension ‘to lose’, in a different sense from tʰa¹, which means ‘unable to find something because you forgot where you put it’. Tangzhou was a relatively early data point, and at that stage we were unaware of the fine semantic distinction between these two words, so we failed to probe for the word tʰa¹. We therefore excluded this particular word from our lexical comparisons between Tangzhou and other lects.

In total, our lexical comparison among Sui dialects included 594 lexical items. The smallest number of items included in a single pairwise comparison was 565, in the comparison between Jiarong (JR, Southern Sui) and Banliang (BL, Yang’an). In general, there were slightly fewer words in our JR comparisons because JR was our very first data point and we had not completely finalised our wordlist at that time.
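The two kinds of exclusion described above (glosses omitted from every comparison, and entries excluded only from the comparisons involving one location) can be sketched as follows. This is a minimal illustration rather than the procedure actually used in the survey: the function name `comparable_items`, the data layout and the toy gloss sets are all assumptions made for the example, and only the ‘wind’ and Tangzhou ‘lose’ exclusions are taken from the text.

```python
from itertools import combinations


def comparable_items(wordlists, omitted_glosses, suspect_entries):
    """Return the set of glosses that enter each pairwise comparison.

    wordlists:       {location: set of glosses elicited there}
    omitted_glosses: glosses dropped from every comparison
    suspect_entries: (location, gloss) pairs dropped only from the
                     comparisons that involve that location
    """
    pairs = {}
    for loc_a, loc_b in combinations(sorted(wordlists), 2):
        # Only glosses elicited in both locations can be compared.
        shared = wordlists[loc_a] & wordlists[loc_b]
        # Remove glosses omitted from the whole comparison.
        shared = shared - omitted_glosses
        # Remove glosses whose entry is suspect in either location.
        shared = {
            gloss for gloss in shared
            if (loc_a, gloss) not in suspect_entries
            and (loc_b, gloss) not in suspect_entries
        }
        pairs[(loc_a, loc_b)] = shared
    return pairs


# Toy data: four glosses in three locations (hypothetical, not survey data).
wordlists = {
    "TZ": {"wind", "lose", "mouth", "rope"},
    "JR": {"wind", "lose", "mouth", "rope"},
    "BL": {"wind", "lose", "mouth", "rope"},
}
omitted_glosses = {"wind"}            # omitted everywhere, as for 'wind'
suspect_entries = {("TZ", "lose")}    # excluded only in TZ comparisons

pairs = comparable_items(wordlists, omitted_glosses, suspect_entries)
for pair, glosses in pairs.items():
    print(pair, len(glosses), sorted(glosses))
```

In this toy case ‘lose’ still counts in the BL and JR comparison but not in either comparison involving TZ, which is why the number of items can differ from one pairwise comparison to another.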

6.2.2 Method of determining lexical similarity

We use the term “lexical similarity” advisedly; our approach bears no relation to the largely discredited (see Campbell 2004:200) method of “glottochronology” (not synonymous with “lexicostatistics”, incidentally) originally propounded by Swadesh (1952, 1955). Rather, by means of counting relative numbers of likely historical cognates, we hoped to gain some measure of divergence between the dialects. Heggarty (2010:307ff.) defends the validity of such an approach, which is based on the suppositions that: (a) the varieties in question are all descended from one proto-language; and (b) the more two varieties have diverged, the more cognates inherited from their common ancestor language will have been lost. Lexical divergence could occur either due to the replacement of older words by loanwords, or to the meanings of cognate words diverging semantically to such an extent that they are no longer elicited for the same meaning slots.

In calculating lexical similarity, we employed a method significantly different from the “lexicostatistics” method described by Blair (1990) and Rensch (1992). This is because our primary aim was not to find out the likelihood of inherent intelligibility between the dialects.¹ Rather, we hoped to determine the relative number of historical cognates shared by the different dialects. Historical sound changes were thus a key factor in our computations. The resulting percentages were significantly higher than they would have been if we had calculated them using a purely lexicostatistic approach.

¹ “Inherent intelligibility” refers to comprehension between speakers of two different dialects due to linguistic similarity, rather than comprehension due to frequent contact between dialects, which is known as “acquired intelligibility”. Blair (1990) and Grimes (1995) show a link between lexical similarity percentages and inherent intelligibility.

In general, if two words could be shown to be historical cognates, in that any pronunciation differences could be explained by regular diachronic sound changes as elucidated in chapters 4 and 5, we counted them as “similar”. For example, the word for ‘hot’ is tu³ in Southern Sui (JQ) and saːu³ in Southern Kam. The differences in pronunciation of the onset and rime can both be explained by regular sound changes which we describe in chapter 5. Therefore we counted these two words as “similar”. ‘Dew’ is pronounced ȵi² in Central Sui (SD) and mɛ¹ in Yang’an (TN, BL). Again, we counted these as similar because they are clearly historical cognates.

If two words were pronounced differently and the difference in pronunciation could not be explained by regular diachronic sound changes, we counted the words as “dissimilar”. For example, in most dialects the word for ‘mouth’ is paːk⁷. However, in Yang’an it is mup⁷. We counted these two words as dissimilar because there are no regular sound change rules which can account for the different onset and rime. Similarly, the word for ‘rope’ is laːk⁷ in most Sui dialects, whereas it is lɛ¹ in Yang’an. Although these two words look similar (both have a lateral onset followed by a front non-high vowel) and may be cognate, we found no regular sound changes which could explain the different onset and tone. Therefore we counted these two forms as “dissimilar”.

It would be inaccurate to describe our lexical similarity counts as historical cognate counts, since we only compared words which were semantically equivalent. Simple cognate counts would result in much higher percentages because language varieties often share historical cognates even though their meaning or usage is different. For example, Kam tends to use the word kaːu³ for ‘head’ whereas Sui (particularly Central, Eastern and Southern Sui) uses the word qam⁴. These words are not historical cognates and we therefore counted them as dissimilar. However, all varieties of Sui do use a word cognate with Kam kaːu³, usually pronounced ku³, in a slightly different sense, viz. the “head” or “end” of a bridge, road, etc.

We used WordSurv 7 (White and Colgan 2012), computer software developed by Taylor University, to conduct lexical comparisons and calculate lexical similarity percentages. WordSurv calculates tallies of cognates as designated by the user and then displays either the cognate count, the percentage of identical cognate forms, or a difference ratio, between every possible pair of villages. For our 594-word multiple Sui dialect comparison, we imported the lexical similarity percentages calculated by WordSurv into Gabmap, an on-line dialectology software package (Nerbonne et al. 2011; see chapter 8 for more information), in order to perform cluster analysis. This is described in more detail in section 6.3.2.
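The arithmetic behind a pairwise similarity percentage can be illustrated with a short sketch. It is not WordSurv's implementation or interface; the function name `similarity_percentages` and the data layout are hypothetical. Each gloss maps locations to a group label assigned by the analyst, and two forms count as “similar” when they carry the same label. The toy judgements loosely follow the ‘mouth’, ‘rope’ and ‘dew’ examples above; assigning the majority forms for ‘mouth’ and ‘rope’ to SD is an assumption made purely for illustration.

```python
from itertools import combinations


def similarity_percentages(judgements):
    """Compute the percentage of similar items for every pair of locations.

    judgements: {gloss: {location: group_label}}, where forms judged
    "similar" for a gloss share the same group label. A gloss missing
    for either location is left out of that pair's comparison.
    """
    locations = sorted(
        {loc for groups in judgements.values() for loc in groups}
    )
    result = {}
    for loc_a, loc_b in combinations(locations, 2):
        compared = 0
        similar = 0
        for groups in judgements.values():
            if loc_a in groups and loc_b in groups:
                compared += 1
                if groups[loc_a] == groups[loc_b]:
                    similar += 1
        if compared:  # skip pairs with no comparable items
            result[(loc_a, loc_b)] = 100.0 * similar / compared
    return result


# Toy judgements for three glosses (labels are arbitrary integers).
judgements = {
    "mouth": {"SD": 1, "BL": 2, "TN": 2},  # judged dissimilar to Yang'an
    "rope":  {"SD": 1, "BL": 2, "TN": 2},  # judged dissimilar to Yang'an
    "dew":   {"SD": 1, "BL": 1, "TN": 1},  # judged similar throughout
}

percentages = similarity_percentages(judgements)
for pair, pct in percentages.items():
    print(pair, round(pct, 1))
```

Because the denominator is the number of items actually compared for each pair, the sketch accommodates pairwise comparisons of different sizes, as with the 565 to 594 items reported above.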

6.3 Lexical similarity counts