Selection of Sui and Kam data for comparison

7.3.2 Selection of Sui and Kam data for comparison

Dialectologists have adopted different approaches to selecting data for comparison using LD. When applied to phonetic transcriptions, LD is usually described as a measure of “phonetic distance”. Strictly speaking, then, the algorithm should only be applied to pairs of root morphemes which are true cognates; otherwise the resulting “distances” reflect not only phonetic distance but also morphological and lexical distance. As we discussed in section 7.2, however, dialectology is not only concerned with the phonetic features which characterise dialects; phonological and morphological systems, lexicon, syntax and even discourse structures can all vary from dialect to dialect and must all be considered. Some dialectologists have therefore included non-cognates in their LD comparisons, arguing that the resulting scores give a better picture of overall linguistic distance between dialects (Jackson et al., 2012). Indeed, several studies have shown that LD comparisons involving both cognate and non-cognate words produce results just as meaningful as comparisons involving only cognates (for example, Kessler 1995 and Yang 2009).
However, if non-cognate words are included in LD comparisons, it is necessarily unclear what the final distance figures are actually measuring. It is impossible to know the exact weight given to lexical differences in the overall score, since the non-cognate items in the data set could, purely by chance, be mostly phonetically similar or mostly phonetically dissimilar. Part of the purpose of this work is to examine and compare the contributions of various aspects of language (genetic relatedness, lexicon, phonetic similarity and intelligibility) to the overall dialect landscape. We therefore restricted our LD comparisons to historical cognates. Only in this way are we able to meaningfully compare the relative contributions of, for example, lexicon and phonetic distance to measured intelligibility. Other dialectologists have adopted the “cognate-only” approach for the same reasons (for example, Gooskens 2006 and Beijering et al., 2008).
In our Sui data we found a total of 822 lexical items for which we elicited a cognate in at least two locations. Using all 822 items in our LD comparison would give between 526 items (for PD) and 616 items (for SD) that could be compared at each location. We decided that using all of these data might distort the results, since no two pairwise comparisons would be based on exactly the same set of cognate pairs. We therefore decided to include only words for which we elicited cognates at all sixteen data points. Only by doing so could we ensure that the resulting phonetic distances were directly comparable. We found a total of 319 lexical items for which we had cognates in every location. We restricted our LD comparison to these items, and the results presented in section 7.4 are based on these data. However, we also conducted LD comparisons using the full data set of 822 lexical items and found that, for both the narrow transcriptions and the phonemicised transcriptions, the resulting MDS plots and clusterings were practically identical to those produced using the reduced set of 319 items, although validation showed that the clusters were less “certain” when using the complete set.
For the Sui-Kam LD comparison, we supplemented our own Sui data with Kam data provided by Shi and Strange (2004). To keep things simple, we chose to compare the same seven locations (four Sui and three Kam) as for our Kam-Sui lexical comparison (see chapter 6, sections 6.2.1 and 6.3.2). Restricting the comparison to cognates which occurred in all seven data sets gave us a list of just 222 lexical items, considerably fewer than the 319 items which we used in our Sui-only comparison. In the hope of making the two sets of results more comparable, we decided to use an expanded list of 276 lexical items, all of which appeared in at least six out of the seven wordlists compared. Of these 276 words, 225 overlapped with the 319 words in the Sui-only comparison.
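The two selection criteria described above (items attested at every location, and items attested at a minimum number of locations) amount to a simple coverage filter. The following Python sketch illustrates the idea only; it is not the software used in this study. The function name select_items, the location labels and the placeholder forms are all hypothetical, and the toy data are invented.

```python
from typing import Dict, List, Optional

# Invented toy data: gloss -> {location: elicited cognate form}.
# A location is absent from the inner dict when no cognate was elicited there.
# The labels "loc1".."loc3" and the forms are placeholders, not real Sui or Kam data.
wordlists: Dict[str, Dict[str, str]] = {
    "gloss_a": {"loc1": "form1", "loc2": "form2", "loc3": "form3"},
    "gloss_b": {"loc1": "form4", "loc2": "form5"},
    "gloss_c": {"loc1": "form6", "loc2": "form7", "loc3": "form8"},
    "gloss_d": {"loc3": "form9"},
}


def select_items(
    wordlists: Dict[str, Dict[str, str]],
    locations: List[str],
    min_locations: Optional[int] = None,
) -> List[str]:
    """Return the glosses attested at >= min_locations of the given locations.

    With min_locations=None, an item must be attested at every location.
    """
    if min_locations is None:
        min_locations = len(locations)
    selected = []
    for gloss, forms in wordlists.items():
        # Count how many of the compared locations have a cognate for this gloss.
        coverage = sum(1 for loc in locations if loc in forms)
        if coverage >= min_locations:
            selected.append(gloss)
    return sorted(selected)


locations = ["loc1", "loc2", "loc3"]

# Strict criterion: a cognate at every location (as in the Sui-only comparison).
strict = select_items(wordlists, locations)

# Relaxed criterion: a cognate at all but one location
# (analogous to "at least six out of seven" in the Sui-Kam comparison).
relaxed = select_items(wordlists, locations, min_locations=len(locations) - 1)

# Items shared by the two selections.
overlap = sorted(set(strict) & set(relaxed))

print(strict)
print(relaxed)
print(overlap)
```

In this sketch the strict criterion guarantees that every pairwise comparison rests on the same item set, whereas the relaxed criterion admits more items at the cost of some pairs being compared on slightly different sets.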

7.3.3 Pre-processing of Sui and Kam data