7 Phonetic Distance

Melissa Partida, Andy Castro

7.1 Introduction

In this chapter we compute relative phonetic distances between the Sui dialects by means of the Levenshtein distance (LD) algorithm (Levenshtein 1965, Heeringa 2004). First, we show that clustering based on phonetic distance reveals the same four broad dialect divisions that are indicated by more traditional methods (SDB 1958, Zhang Junru 1980, Castro 2011, Stanford 2011), viz. Pandong dialect, Yang’an dialect, Sandong Central-Western-Eastern dialect and Sandong Southern dialect. Secondly, we show that, for our Sui data, LD calculations based on broad, phonemic transcriptions indicate the same basic dialect areas as those based on narrow, phonetic transcriptions. This suggests that applying LD to the large body of previously published dialect data in China, most of which is transcribed in a broad, phonemic manner, would be a meaningful and valuable exercise. Mean LDs also indicate that the Central Sui varieties SD and ZH are the most “central” varieties in terms of their pronunciation. Finally, by applying LD to a subset of both Sui and Kam data, we show that phonetic distances do not necessarily reflect genetic relatedness. Regardless of the clustering algorithm used, Yang’an always falls into the Sui cluster rather than the Kam cluster. This suggests that while LD may be useful for showing synchronic perceived dialect groupings or even for predicting intelligibility (see chapter 8), it should not be used for “classifying” languages and dialects.¹ Despite this, LD applied to phonemicised transcriptions is useful to the historical linguist in providing a starting point for historical analysis and may be more revealing than lexical similarity counts.

7.2 Background: Dialectometry and Levenshtein distance (LD)

Dialectologists have long been concerned with identifying dialect areas and dialect boundaries. Various methods have been used to do this. Traditionally, the most widely used has been the isogloss approach, by which lines known as “isoglosses” are drawn on a map indicating the geographical boundaries either of regional pronunciations of the same word (“phonetic isoglosses”) or of regional lexical variants of a single concept (“lexical isoglosses”). Dialect boundaries are then drawn where several isoglosses bundle together. This approach has not proved completely satisfactory, particularly since isoglosses for different words often fall in different places. Different “isogloss bundlings” emerge depending on which particular lexical items the dialectologist chooses. Thus the resulting dialect boundaries are to some extent defined on the whim of the linguist.

Furthermore, dialectologists have come to realise that dialect boundaries are usually not sharply defined. In fact, there is often a “dialect continuum” in which changes in pronunciation and lexicon are cumulative across geographical and, we would argue, cultural distance. The further apart two dialects are, the greater the linguistic differences between them (Chambers and Trudgill 1998).

Thus several other methods for quantifying linguistic differences between dialects have been developed. Some dialectologists have attempted to quantify the differences between the overall phonological or morphological systems of different dialects, for example Moulton (1960) for Swiss German, Cheng (1997) for Chinese and Viaplana (1999) for Catalan. Others have tried to calculate lexical similarity, for example Rensch (1992) for languages in India, Castro et al. (2012) for Hmong, and our own study of Sui in chapter 6. Milliken and Milliken (1996) proposed a “systems relations” method for computing the degree to which the sounds of one dialect “map onto” the phonemic system of another, in the hope that such a measure would be a good predictor of inherent intelligibility. This has been employed by Wu Wenyi, Snyder and Liang (2007) for Bouyei dialects and by Hansen and Castro (2010) for Zhuang dialects, although in the latter case it was only applied to tonal systems.

Finally, some linguists have used computational methods to calculate the overall “phonetic distances” between dialects. In recent years, use of the Levenshtein algorithm (Levenshtein 1965) for calculating string edit distance has become particularly popular. It was first used by Kessler (1995) for Irish Gaelic and has since been applied to Dutch (Heeringa 2004), Norwegian (Heeringa 2004, Gooskens and Heeringa 2004) and various other European languages. Yang (2009) was the first to apply it to languages in the Sinosphere, initially for Nisu, then also for Bai and Zhuang (Yang and Castro 2010) and Lalo (Yang 2012).

The popularity of Levenshtein distance among dialectologists is largely due to its relative simplicity and ease of application. A set of computer programs developed by Peter Kleiweg, known as “Gabmap”, makes LD calculation and subsequent data manipulation, including clustering and mapping, a very straightforward process (Nerbonne et al., 2011). Dialect areas as revealed by LD have been shown to (a) largely correspond to dialect areas established by more traditional approaches (Kessler 1995, Heeringa 2004, Valls et al., 2010) and (b) correlate strongly with measured levels of mutual intelligibility (Gooskens 2006, Beijering 2007, Beijering et al., 2008, Yang 2009, Yang and Castro 2010). The Gabmap software also provides a way of visualising the correlation between geographical distance and phonetic distance. By applying this to Dutch dialects, Heeringa and Nerbonne (2001) have shown that the traditional “dialect area” view and the more recent “dialect continuum” view are both useful for understanding dialect situations. They conclude that “the dialect landscape may be described as … a continuum with unsharp borders between dialect areas” (Heeringa and Nerbonne 2001:19). Our study in this chapter supports this view.

¹ Yang (2010) discovered the same phenomenon when subgrouping Lalo languages in Yunnan.
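For readers unfamiliar with string edit distance, the basic calculation behind LD can be sketched in a few lines of Python. This is only an illustration of the textbook algorithm with unit costs for insertion, deletion and substitution; it is not the Gabmap implementation. The function names, the segment-list input format and the normalisation by the longer sequence are all assumptions made for this sketch, and dialectometric software may use graded segment costs or a different normalisation.

```python
from typing import Sequence


def levenshtein(a: Sequence[str], b: Sequence[str]) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b.

    Each item of a and b is treated as one phonetic segment (unit costs assumed).
    """
    # prev[j] = distance between the segments of a processed so far and b[:j].
    prev = list(range(len(b) + 1))
    for i, seg_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty sequence takes i deletions
        for j, seg_b in enumerate(b, start=1):
            cost = 0 if seg_a == seg_b else 1
            curr.append(min(
                prev[j] + 1,         # delete seg_a
                curr[j - 1] + 1,     # insert seg_b
                prev[j - 1] + cost,  # substitute seg_a by seg_b, or match
            ))
        prev = curr
    return prev[-1]


def normalised_ld(a: Sequence[str], b: Sequence[str]) -> float:
    """LD divided by the length of the longer sequence (an assumed, simplified normalisation)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0
```

With this sketch, `levenshtein("kitten", "sitting")` returns 3, the standard textbook example. For two invented three-segment forms (not taken from the Sui data) such as `["m", "a", "t"]` and `["m", "a", "k"]`, the distance is 1 and the normalised distance is one third. A distance between two varieties would then be obtained by averaging such values over a shared word list.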

7.3 Methodology