
system of another, in the hope that such a measure would be a good predictor of inherent intelligibility. This has been employed by Wu Wenyi, Snyder and Liang (2007) for Bouyei dialects and by Hansen and Castro (2010) for Zhuang dialects, although in the latter case it was only applied to tonal systems. Finally, some linguists have used computational methods to calculate the overall “phonetic distances” between dialects. In recent years, use of the Levenshtein algorithm (Levenshtein 1965) for calculating string edit distance has become particularly popular. It was first used by Kessler (1995) for Irish Gaelic and has since been applied to Dutch (Heeringa 2004), Norwegian (Heeringa 2004; Gooskens and Heeringa 2004) and various other European languages. Yang (2009) was the first to apply it to languages in the Sinosphere, initially for Nisu, then also for Bai and Zhuang (Yang and Castro 2010) and Lalo (Yang 2012).

The popularity of Levenshtein distance among dialectologists is largely due to its relative simplicity and ease of application. A set of computer programs developed by Peter Kleiweg, known as “Gabmap”, makes LD calculation and subsequent data manipulation, including clustering and mapping, a very straightforward process (Nerbonne et al. 2011). Dialect areas as revealed by LD have been shown to: (a) largely correspond to dialect areas established by more traditional approaches (Kessler 1995; Heeringa 2004; Valls et al. 2010); and (b) correlate strongly with measured levels of mutual intelligibility (Gooskens 2006; Beijering 2007; Beijering et al. 2008; Yang 2009; Yang and Castro 2010). The Gabmap software also provides a way of visualising the correlation between geographical distance and phonetic distance. By applying this to Dutch dialects, Heeringa and Nerbonne (2001) have shown that the traditional “dialect area” view and the more recent “dialect continuum” view are both useful for understanding dialect situations. They conclude that “the dialect landscape may be described as … a continuum with unsharp borders between dialect areas” (Heeringa and Nerbonne 2001:19). Our study in this chapter supports this view.

7.3 Methodology

7.3.1 Calculating Levenshtein distance

LD can be defined as “the cost of the least expensive set of insertions, deletions, or substitutions that would be needed to transform one string into [another]” (Sankoff and Kruskal 1983, in Kessler 1995:62). Thus the LD between the words “fat” and “fit” is one, since one substitution ([a] → [i]) is required to convert the former word into the latter. The LD between the words “fat” and “thin” is four, since three substitutions ([f] → [t], [a] → [i], [t] → [n]) and one insertion (Ø → [h]) are required to convert the former into the latter.

LD has been applied in different ways in dialectology. Some linguists have attempted to apply it directly to acoustic data: recordings of a word spoken by speakers of dialects A and B are transformed into two strings of numbers, typically by taking cross-sections of formant tracks, and LD is applied to the resulting two strings (Heeringa et al. 2009). To obtain reliable results, much pre-processing of the acoustic data, such as normalisation of pitch, monotonisation and normalisation of tempo, is required. Optimal pre-processing of acoustic data for tonal languages has yet to be researched and evaluated.

Most successful applications of LD in dialectology have been to phonetic data transcribed using the International Phonetic Alphabet (IPA). Since there is a finite and relatively small number of letters and symbols in the IPA, it is simple to calculate the number of substitutions, insertions and deletions of phonetic symbols required to turn the transcription of a word in dialect A into the transcription of the same word in dialect B. Intuition would suggest that the exchange of two phonetic symbols representing sounds which are extremely close in phonetic space (e.g., [i] for [ɪ]) should bear less weight than the exchange of two symbols representing sounds at opposite ends of the phonetic spectrum (e.g., [a] for [u]). However, various studies have shown that a crude, unweighted algorithm produces just as good results for uncovering established dialect areas as a complex algorithm weighted according to supposed “phonetic distance” between the IPA symbols (Kessler 1995; Heeringa 2004:185). Therefore, in this study, we use an unweighted algorithm.²

² Gabmap does force consonants to be compared with consonants and vowels with vowels, but otherwise its comparisons are unweighted (Nerbonne et al. 2011).

Some studies have shown that “normalised” LD calculations, in which each word in the compared set of data is given equal weight regardless of word length, are better predictors of intelligibility than LD calculations which are not normalised (Beijering et al. 2008; Yang 2009). The Gabmap software normalises its LD calculations by default (Nerbonne et al. 2011), so we also employ normalisation in our calculations. In the case of Sui, though, we doubt that normalisation significantly affects the results, because Sui is a largely monosyllabic language (only seven of the 319 items used in our LD calculations were not monosyllabic) and Sui syllable structure is extremely restrictive (see chapter 3, section 3.2.1).

As far as the authors are aware, previous dialectometric studies applying LD have not paid much attention to the type of IPA transcriptions used, whether narrowly phonetic or broadly phonemic. Presumably, if all of the data is transcribed by the same phonetician, whether narrowly or broadly, the relative distances would be the same. Having said that, if extremely accurate (i.e., narrow) IPA transcriptions are used, the results could be distorted by the pronunciation idiosyncrasies of the particular speakers who were transcribed.
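To make the procedure concrete, the unweighted algorithm can be sketched in a few lines of Python. This is simply the standard dynamic-programming formulation of LD, not Gabmap’s own implementation, and for simplicity it omits Gabmap’s constraint that consonants align with consonants and vowels with vowels (see footnote 2); the function name levenshtein is ours.

    def levenshtein(a, b):
        """Unweighted Levenshtein distance: the minimum number of
        insertions, deletions and substitutions (each costing 1)
        needed to transform string a into string b."""
        # dist[i][j] holds the distance between a[:i] and b[:j]
        dist = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
        for i in range(len(a) + 1):
            dist[i][0] = i                     # delete everything in a[:i]
        for j in range(len(b) + 1):
            dist[0][j] = j                     # insert everything in b[:j]
        for i in range(1, len(a) + 1):
            for j in range(1, len(b) + 1):
                sub = 0 if a[i - 1] == b[j - 1] else 1
                dist[i][j] = min(dist[i - 1][j] + 1,        # deletion
                                 dist[i][j - 1] + 1,        # insertion
                                 dist[i - 1][j - 1] + sub)  # substitution
        return dist[len(a)][len(b)]

    print(levenshtein("fat", "fit"))   # 1: one substitution, [a] -> [i]
    print(levenshtein("fat", "thin"))  # 4: three substitutions and one insertion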
In his seminal work Language, Bloomfield (1933:45) noted that: “If we observed closely enough, we should find that no two persons—or rather, perhaps, no one person at different times—spoke exactly alike”.³ On the other hand, very broad or phonemicised transcriptions, of the type often seen for dialect data published in China, may obscure some of the phonetic peculiarities of certain speech varieties. In order to investigate this a little further, we chose to conduct two sets of LD calculations: one based on narrow transcriptions, the other on broader, phonemicised transcriptions. In particular, we wanted to ascertain whether phonetic distances calculated by applying the LD algorithm to phonemicised transcriptions would indicate the same dialect areas as those indicated by narrow transcriptions. If this were the case, it might be possible to include many different sets of dialect data in future LD comparisons even if the data were transcribed by different phoneticians, because phonemicising the data should erase most, if not all, of the inter-transcriber differences. The potential application of the LD algorithm would thus be broadened.

³ Kuhl (2003) provides a helpful summary of how the problem of the “idiolect” is dealt with by competing schools of thought in linguistics, in particular Chomsky and the generativists on the one hand and Labov and the sociolinguists on the other.

We now give some examples of how Gabmap applied the LD algorithm to our data. Figure 7.1 shows an example of the alignment of the word ‘to say’ in TN and RL. In both places it is pronounced identically, including the tone (the superscript number).⁴ Therefore the total cost of transforming one string into the other is zero, and the “phonetic distance” between the two is zero.

⁴ See section 3.3 below for discussion regarding the representation of tone in LD.

    TN    f   a   n   ²
    RL    f   a   n   ²
    Cost                  Total: 0

    Figure 7.1. TN and RL ‘to say’.

However, the same word for ‘to say’ compared in TN and PD shows a cost of 2, as seen in figures 7.2 and 7.3. In this case there are two ways of aligning the phones; in such a situation, Gabmap calculates the cost of both alignments and averages them. The insertion of [h] (figure 7.2) or [w] (figure 7.3) incurs a cost of one, and the substitution of [f] for [w] (figure 7.2) or [f] for [h] (figure 7.3) incurs a further cost of one. In this instance, the choice of alignment does not affect the overall cost.

    TN    Ø   f   a   n   ²
    PD    h   w   a   n   ²
    Cost  1   1               Total: 2

    Figure 7.2. TN and PD ‘to say’.

    TN    f   Ø   a   n   ²
    PD    h   w   a   n   ²
    Cost  1   1               Total: 2

    Figure 7.3. TN and PD ‘to say’.

The normalised phonetic distance between two words is calculated by dividing the overall cost by the number of phones in the longer of the two words compared, in this case 5 for PD hwan². Thus the normalised phonetic distance between these two words is 2 ÷ 5 = 0.4. The maximum normalised phonetic distance between any two words is 1, i.e., when all of the phones must be substituted, inserted or deleted in order to transform one word into the other.

The substitution, insertion or deletion of a diacritic or suprasegmental symbol, in cases where the primary phone is the same, is counted as a cost of 0.5. Thus the phonetic distance between the word for ‘rat’ in SD and BL equals 0.167 (total cost 0.5 divided by total length 3), as seen in figure 7.4.

    SD    n̥   ɔ   ³
    BL    n̥   ɔ̃   ³
    Cost      0.5         Total: 0.5

    Figure 7.4. SD and BL ‘rat’.
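These conventions are straightforward to emulate. The Python fragment below is again our own illustration rather than Gabmap’s code: it treats each word as a list of segments (phones plus the tone symbol), charges 0.5 for a substitution in which only a diacritic differs, and divides the total cost by the length of the longer word; the names base, sub_cost and normalised_ld are ours. Because it simply takes the cheapest alignment, it agrees with Gabmap’s averaging whenever the competing alignments tie, as they do in figures 7.2 and 7.3.

    import unicodedata

    def base(seg):
        """Strip combining diacritics, leaving the primary phone."""
        return "".join(c for c in unicodedata.normalize("NFD", seg)
                       if not unicodedata.combining(c))

    def sub_cost(x, y):
        if x == y:
            return 0.0
        if base(x) == base(y):
            return 0.5            # same primary phone, diacritic differs
        return 1.0                # different primary phones

    def normalised_ld(a, b):
        """a, b: lists of segments. Insertions and deletions cost 1;
        the total cost is divided by the length of the longer word."""
        m, n = len(a), len(b)
        d = [[0.0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            d[i][0] = float(i)
        for j in range(n + 1):
            d[0][j] = float(j)
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                d[i][j] = min(d[i - 1][j] + 1,
                              d[i][j - 1] + 1,
                              d[i - 1][j - 1] + sub_cost(a[i - 1], b[j - 1]))
        return d[m][n] / max(m, n)

    print(normalised_ld(list("fan²"), list("hwan²")))         # 0.4    (figures 7.2-7.3)
    print(normalised_ld(["n̥", "ɔ", "³"], ["n̥", "ɔ̃", "³"]))    # ≈ 0.167 (figure 7.4)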
Finally, we show how a phonemicised transcription will generally give a lower LD than a narrow phonetic transcription. Figures 7.5 and 7.6 show the cost of transforming the word for ‘female’ in SD into the word for ‘female’ in TZ. When a narrow phonetic transcription is used, the cost is 3 and the resulting normalised phonetic distance is 0.5; when a phonemicised transcription is used, the cost is 2 and the normalised phonetic distance is 0.333.

    SD    ʔ   b   j   aː   k   ⁷
    TZ    ʔ   m   i   e    k   ⁷
    Cost      1   1   1             Total: 3

    Figure 7.5. SD and TZ ‘female’ (phonetic transcription).

    SD    ʔ   b   j   aː   k   ⁷
    TZ    ʔ   m   j   e    k   ⁷
    Cost      1       1             Total: 2

    Figure 7.6. SD and TZ ‘female’ (phonemicised transcription).

Both the narrow, phonetic transcriptions and the broad, phonemicised transcriptions which we used in our LD comparisons are given in appendix H. After calculating the phonetic distance between each pair of words, Gabmap calculated the overall phonetic distance between each pair of speech varieties by averaging the distances over all of the individual pairwise word comparisons.
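This aggregation step is a simple mean, as the short sketch below illustrates. It reuses the normalised_ld function from the previous sketch; the name variety_distance is ours, and the two word lists are assumed to be aligned item by item. Applied to the single word ‘female’, it reproduces the narrow and phonemicised distances of 0.5 and 0.333 from figures 7.5 and 7.6.

    def variety_distance(words_a, words_b):
        """Overall phonetic distance between two speech varieties:
        the mean of the normalised distances of the word pairs."""
        pairs = list(zip(words_a, words_b))
        return sum(normalised_ld(a, b) for a, b in pairs) / len(pairs)

    sd_narrow = [["ʔ", "b", "j", "aː", "k", "⁷"]]
    tz_narrow = [["ʔ", "m", "i", "e", "k", "⁷"]]
    tz_broad  = [["ʔ", "m", "j", "e", "k", "⁷"]]

    print(variety_distance(sd_narrow, tz_narrow))  # 0.5     (figure 7.5)
    print(variety_distance(sd_narrow, tz_broad))   # ≈ 0.333 (figure 7.6)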

7.3.2 Selection of Sui and Kam data for comparison