Pre-processing of Sui and Kam data

7.3.2 Selection of Sui and Kam data for comparison

Dialectologists have adopted different approaches to selecting data for comparison using LD. When applied to phonetic transcriptions, LD is usually described as a measure of “phonetic distance”. Strictly speaking, then, the algorithm should only be applied to pairs of root morphemes which are true cognates; otherwise the resulting “distances” reflect not only phonetic distance but also morphological and lexical distance. As we discussed in section 7.2, however, dialectology is not only concerned with the phonetic features which characterise dialects; phonological and morphological systems, lexicon, syntax and even discourse structures can all vary from dialect to dialect and must all be considered. Some dialectologists have therefore included non-cognates in their LD comparisons, arguing that the resulting scores give a better picture of overall linguistic distance between dialects (Jackson et al., 2012). Indeed, several studies have shown that LD comparisons involving both cognate and non-cognate words produce results just as meaningful as comparisons involving only cognates (for example, Kessler, 1995, and Yang, 2009).

However, if non-cognate words are included in LD comparisons, it is necessarily unclear what the final distance figures are actually measuring. It is impossible to know the exact weight given to lexical differences in the overall score, since the non-cognate items in the data set could, purely by chance, be mostly phonetically similar or mostly phonetically dissimilar. Part of the purpose of this work is to examine and compare the contributions of various aspects of language (genetic relatedness, lexicon, phonetic similarity and intelligibility) to the overall dialect landscape. We thus restricted our LD comparisons to historical cognates. Only in this way are we able to meaningfully compare the relative contributions of, for example, lexicon and phonetic distance to measured intelligibility. Other dialectologists have adopted the “cognate-only” approach for the same reasons (for example, Gooskens, 2006, and Beijering et al., 2008).

In our Sui data we found a total of 822 lexical items for which we elicited a cognate in at least two locations. Using all 822 cognates in our LD comparison would give between 526 items (for PD) and 616 items (for SD) that could be compared for each location. We decided that using all of these data might significantly distort the results, since comparisons between different pairs of dialects would not be based on the same set of cognate pairs. We therefore decided to include only words for which we elicited cognates at all sixteen data points. Only by doing so could we ensure that the resulting phonetic distances were directly comparable. We found a total of 319 lexical items for which we had cognates in every location. We restricted our LD comparison to these items, and the results presented in section 7.4 are based on these data. However, we also conducted LD comparisons using the full data set of 822 lexical items and found that, for both the narrow and the phonemicised transcriptions, the resulting MDS plots and clusterings were practically identical to those produced using the reduced set of 319 items, although validation showed that the clusters were less “certain” when using the complete set.

For the Sui-Kam LD comparison, we supplemented our own Sui data with Kam data provided by Shi and Strange (2004). To keep things simple, we chose to compare the same seven locations (four Sui and three Kam) as for our Kam-Sui lexical comparison (see chapter 6, sections 6.2.1 and 6.3.2). Restricting the comparison to cognates which occurred in all seven data sets gave us a list of just 222 lexical items, significantly fewer than the 319 items which we used in our Sui-only comparison. In the hope of making the two sets of results more comparable, we decided to use an expanded list of 276 lexical items, all of which appeared in at least six out of the seven wordlists compared. Of these 276 words, 225 overlapped with the 319 words in the Sui-only comparison.
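
The selection criteria described in this section can be illustrated with a short Python sketch. It assumes, hypothetically, that the elicited data are held as a mapping from each lexical item (gloss) to the set of locations at which a cognate was recorded. The function name `select_items` and the toy glosses are invented for illustration; only the location codes are taken from the text, and the sketch does not reflect how the data were actually stored or processed.

```python
from typing import Dict, List, Set


def select_items(cognates: Dict[str, Set[str]], locations: List[str],
                 min_locations: int) -> List[str]:
    """Return glosses with a cognate recorded in at least `min_locations`
    of the given locations, in sorted order."""
    wanted = set(locations)
    return sorted(
        gloss for gloss, attested in cognates.items()
        if len(attested & wanted) >= min_locations
    )


# Toy data (invented): gloss -> locations where a cognate was elicited.
toy = {
    "water": {"PD", "SD", "TP", "JL"},
    "fire": {"PD", "SD", "TP"},
    "stone": {"SD", "JL"},
    "cloud": {"TP"},
}
locs = ["PD", "SD", "TP", "JL"]

# Analogue of the full set: a cognate in at least two locations.
at_least_two = select_items(toy, locs, 2)
# Analogue of the reduced Sui-only set: a cognate at every location.
everywhere = select_items(toy, locs, len(locs))
# Analogue of the expanded Sui-Kam set: a cognate in all but at most one location.
all_but_one = select_items(toy, locs, len(locs) - 1)

# Under the looser criterion, the items usable for one location differ
# from location to location, which is the comparability problem noted above.
comparable_for_pd = [g for g in at_least_two if "PD" in toy[g]]
print(at_least_two, everywhere, all_but_one, comparable_for_pd)
```

Requiring a cognate at every location guarantees that each pairwise comparison is computed over the same list of items, at the cost of a smaller list.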

7.3.3 Pre-processing of Sui and Kam data

We conducted some pre-processing of the data before importing it into Gabmap. Firstly, in our narrow phonetic transcriptions, there were a small number of phonetic features which we had strong reason to believe were idiolectal rather than dialectal. We standardised the transcriptions of these features to ensure that the LD analysis was calculating dialectal phonetic distance rather than idiolectal phonetic distance. These features are given in table 7.1.

Table 7.1. Substitutions of IPA letters in phonetic transcriptions in preparation for LD analysis

| Original transcription | Replaced by | Reason for substitution |
|---|---|---|
| tɕ- | ȶ- | Pronunciation virtually identical; we suspect that exact pronunciation varies by idiolect and does not characterise a particular dialect. |
| ʑ- | j- | [ʑ] was an alternative realisation of j in onset position and we suspect that exact pronunciation varies by idiolect. |
| ɔu- | ɐu- | We were not confident of our own transcriptions of this diphthong. However, both pronunciations are realisations of the phoneme au. |
| ɴɢ-, ŋɡ-, ᵐb-, ⁿd- | ɢ-, ɡ-, b-, d- (respectively) | We standardised the transcription of prenasalised onsets by removing the nasal symbols because no Sui dialect distinguishes between plain voiced stops and prenasalised voiced stops. We judged that the presence or absence of prenasalisation varies by idiolect rather than by dialect. |
| -ʔ | -k (TP only) | Only TP and JL had final glottal stops. In JL they were used consistently by all three informants. In TP they were only used by one of our three informants and they were used sporadically in place of final -k, thus we judged them not to be characteristic of TP speech. |

Secondly, we decided to represent tonemes phonemically as single superscript numbers for both sets of transcriptions (phonetic and phonemicised), with the exception of Tone 6. Yang and Castro (2010) have shown that the best representation of contour tone in LD for correlation with mutual intelligibility is a two-letter system denoting onset and contour. For example, a high level (55) tone would be represented as HL (High Level), and a tone starting at mid pitch and rising to high (35) would be represented as MR (Mid Rising). In Sui, however, phonetic differences between most of the tonemes are extremely slight (see chapter 4, section 4.6, and Stanford, 2008a) and cannot be adequately captured using a simple two-letter representation. Moreover, the differences are so subtle that we doubt they would have any influence on mutual intelligibility except in cases where a cognate word bears an entirely different toneme. The only toneme which shows salient variation is Tone 6, which is realised as a high, level tone (55) in some dialects and a mid, rising tone (24) in others.⁵ In order to capture this significant difference, we decided to represent a high-level Tone 6 as a superscript 6 and a mid-rising Tone 6 as a superscript 9 in our phonetic transcriptions. Thus the phonetic distance between any two different tonemes was always 1 (a simple substitution), and the phonetic distance between a mid-rising Tone 6 in one dialect and a high-level Tone 6 in another was also 1.

⁵ The dialects which pronounce Tone 6 as a mid, rising tone also have a high level tone, sometimes written as 6, used in some Chinese loanwords.

Thirdly, we removed all prefix and suffix syllables such as classifiers and modifiers from the data, with the exception of cognate prefixes or suffixes used in all sixteen locations. This is because we only wanted to compare the pronunciation of cognate morphemes and did not want the phonetic distance calculations to be influenced by morphological disparities between dialects. Due to the heavily monosyllabic nature of Sui, this was a relatively straightforward task. Our final list of 319 words for comparison contained only seven words comprising more than one syllable.

Finally, we adjusted the Kam transcriptions to agree with the transcription conventions of our Sui data. Since the Kam data was already in partially phonemicised form, we could only use our phonemicised Sui transcriptions for the Sui-Kam comparison. We were already extremely familiar with the Kam data through our historical comparative work (chapters 3 to 5), so adjusting the transcriptions to fit the Sui was not difficult. All changes that we made are given in table 7.2. We retained the secondary tone split markings in the Kam data (see chapter 4, table 4.1) because these extra tones, where they occur, have very different phonetic realisations to their regular non-aspirated counterpart tones.

Table 7.2. Adjustment of Kam transcriptions in preparation for Sui-Kam LD analysis

| Original transcriptions | Replaced by |
|---|---|
| ɑ | a |
| aC | aːC |
| aV | aːV |
| ɐC | aC |
| əu, ɐu | au |
| əi, ɐi | ai |
| ɯi | ui |
| ʊ | u |
| j ɪ | j ə |
| ɛ | e |
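
As a rough illustration of the pre-processing steps in this section and of their effect on the distance calculation, the following Python sketch applies some of the onset substitutions from table 7.1 and then computes a plain Levenshtein distance with unit costs. It is not the Gabmap implementation, which may weight or normalise distances differently. The function names, the representation of each word as a list of segments with the tone as the final element, and the example forms are all assumptions made for illustration.

```python
from typing import Dict, List

# Onset substitutions from table 7.1 (a subset; the diphthong and the
# TP-only coda substitutions are omitted to keep the sketch short).
ONSET_SUBSTITUTIONS: Dict[str, str] = {
    "tɕ": "ȶ",
    "ʑ": "j",
    "ɴɢ": "ɢ",
    "ŋɡ": "ɡ",
    "ᵐb": "b",
    "ⁿd": "d",
}


def standardise(segments: List[str]) -> List[str]:
    """Replace the onset (first segment) if a substitution is listed for it."""
    if not segments:
        return []
    return [ONSET_SUBSTITUTIONS.get(segments[0], segments[0])] + segments[1:]


def levenshtein(a: List[str], b: List[str]) -> int:
    """Unit-cost Levenshtein distance between two segment sequences."""
    previous = list(range(len(b) + 1))
    for i, seg_a in enumerate(a, start=1):
        current = [i]
        for j, seg_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                     # deletion
                current[j - 1] + 1,                  # insertion
                previous[j - 1] + (seg_a != seg_b),  # substitution or match
            ))
        previous = current
    return previous[-1]


# Invented forms, each written as [onset, nucleus, tone].
# "6" stands for a high-level Tone 6 and "9" for a mid-rising Tone 6.
form_a = standardise(["ᵐb", "a", "6"])
form_b = standardise(["b", "a", "9"])
form_c = standardise(["tɕ", "a", "6"])

tone6_distance = levenshtein(form_a, form_b)  # forms differ only in Tone 6 realisation
onset_distance = levenshtein(form_a, form_c)  # forms differ only in onset
print(tone6_distance, onset_distance)
```

Because each tone is a single symbol, a difference in Tone 6 realisation costs one substitution, the same as a difference between any two distinct tonemes. Prenasalisation contributes nothing to the distance once the onsets have been standardised.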

7.4 Results of Sui dialect comparison