The Englishes of English tests: bias revisited

LIZ HAMP-LYONS and ALAN DAVIES

Liz Hamp-Lyons: Faculty of Education, University of Hong Kong, HOC 323A, Pokfulam, HKSAR. E-mail: lizhl@hkucc.hku.hk
Alan Davies: Department of Linguistics and English Language, University of Edinburgh, Adam Ferguson Building, 40 George Square, Edinburgh EH8 9LL, Scotland, U.K. E-mail: a1adavie@staffmail.ed.ac.uk

  

ABSTRACT: The two authors conducted a small empirical study to attempt to find support for – or evidence against – the view that international tests of English language proficiency are unfair to speakers of non-standard forms of English, since these tests privilege standard forms. We explore the question of whose norms should be imposed in these tests, and what the consequences for test-takers are if the norm imposed by the test is not the “normal” variety accepted in their own society. Data used for the study are written texts by English learners from six language backgrounds, scored by raters from their own language backgrounds as well as by native American English raters. Interesting patterns emerge, but we conclude that the complexity of the variables involved, the small n-size, and the inherent unreliability of scoring productive samples prevent any definitive claims being made.[1]

  

INTRODUCTION

The increasing use of international tests of English proficiency (e.g. TOEFL, TOEIC, IELTS, MELAB), indicative of the continuing worldwide spread of the English language, has been condemned on the grounds that such tests are biased or unfair (unfair in the sense that a test favouring boys could be said to be unfair to girls). Commentators on the global spread of English disagree as to which norms should be employed, whether they should be exonormative or endonormative. An International English (IE) view recognises only one norm, that of the educated native speaker of English (acknowledging, of course, that there are somewhat distinct norms for British, American, Australian, Irish, Canadian, New Zealand, and South African varieties). A strong World Englishes (WEs) view maintains that to impose IE on users of WEs may be discriminatory against non-native English speakers (NNES), arguing that local standards are already in place (e.g. in India and Singapore). At the end of an earlier study (Davies, Hamp-Lyons, and Kemp, 2003) it was concluded that the study had raised three basic questions:

1. How possible is it to distinguish between an error and a token of a new type?

2. If we could establish bias, how much would it really matter?

3. Does an international English test privilege those with a metropolitan Anglophone education?

International English (IE) stands for the universalist view that there is an Inner Circle (see Kachru, 1985) native English competence. As the term is used in this paper, International English does not mean that there is one common or shared international norm. WEs, on the other hand, stands for the view that English has split postcolonially into a plurality of lects: the English of Singapore, of India, of Malaysia, of Nigeria, and so on. Each polarity has its supporters and its opponents. The universalist view is reckoned by its supporters to be enabling, while its opponents bitterly object to its hegemonizing grip on the modern westward-leaning world (de Beaugrande, 1999).

The WEs view (Kachru, 1986; 1992) is that without attention to the local norms, all international uses of English are necessarily biased in favour of those for whom the metropolitan forms (those of the Inner Circle) are native, on the grounds that IE does in practice equal Standard British English, Standard American English, and so on. This applies equally to postcolonial WEs (Pakir, 1993), and to EFL learners and users whose primary need for English is in-country (e.g. in China), not international.

THE ISSUES

  For holders of both views, IE and WEs, what is at issue is not the existence of variation but the role and status of language norms (Bartsch, 1988; Davies, 1999). The IE view is strengthened both by the strict view of norm acquisition – viz. that it needs a large enough body of native speakers to take on its responsibility (Davies, 2003; but see Graddol, 1999) − and by the need of many EFL learners for a test to provide international recognition of their English proficiency, for example for certification, university entry, employment, or immigration.

  The concept of “world Englishes” (WEs) refers to a belief in and a respect for multiple varieties of English. Each of these varieties takes and adapts some parent form of English into a stable dialect which is not only “correct” in its home milieu but may for many of its users be the only form of the spoken language they hear (Kachru, 1986). The core value of WEs is that “the English language now belongs to all those who use it” (Brown, 2000). In the WEs view, there is no “right” English, no such thing as the “native speaker norm”. The WEs view is strengthened by the empirical study of language acquisition which shows that normal language development is “haphazard and largely below the consciousness of speakers” (Hudson, 1996: 32).

The WEs position springs from a world-view which asserts that native speakers of a highly dominant language, such as English, have the responsibility to give serious attention to the harm – linguistic, cultural, social, and economic – that may be caused by its spread. An example of the issues raised by WEs advocates is that they increasingly question the IE assumption that only the “Standard” (i.e. the “native”) model should be used for the assessment of English language proficiency. However, this strong postcolonial view of the role and standards of English raises technical difficulties for the assessment of English language proficiency in the Expanding Circle of Englishes.

  Because the authors of this paper themselves hold different positions on the IE/WEs issue, we have agreed not to engage in debate about the use of standard or other varieties of English in language tests and assessments used within a community that uses English for certain purposes. Instead we are confining this paper and this research to English proficiency testing in high-stakes contexts, such as TOEFL, TOEIC, and IELTS (Criper and Davies, 1988; Spolsky, 1993; Clapham, 1996). There are two important questions here:

  1. Whose norms are to be imposed in the test materials?

  2. What are the consequences for test-takers if the norm imposed by the test is not the “normal” variety accepted in their own society?

  Concerned about these issues, Lowenberg (1993) carried out an analysis of the TOEIC test. He concludes (p. 104):


  the brief analysis presented in this paper is sufficient to call into question the validity of certain features of English posited as being globally normative in tests of English as an international language, such as TOEIC, and even more in the preparation of materials that have developed around these tests. Granted, only a relatively small proportion of the questions on the actual tests deal with these nativized features: most test items reflect the “common core” of norms which comprise Standard English in all non-native and native speaker varieties. But given the importance usually attributed to numerical scores in the assessment of language proficiency, only two or three items of questionable validity on a form could jeopardize the ranking of candidates in a competitive test administration.

  

Jenkins (2006) considers it to be well accepted that English now has a growing number of standard varieties and not just two “globally useful or appropriate versions” (p. 42). She argues that more and more sociolinguists are now willing and able to distinguish between “NNS varieties” and “interlanguage”, and therefore that “there seems to be no good reason for speakers from the Outer or Expanding circles to continue to defer to NSs of the Inner Circle, to bow to exo-normative standards or conform to norms that represent other people’s identities” (p. 43). She argues that it seems unreasonable to expect a more consistently accurate and standard form of English than is demanded in practice of NSs: however, she points out that candidates taking English language tests are almost universally expected to demonstrate formal and informal command of “correct” lexico-grammar and syntax of some “standard” English. Deviations from these norms (she cites I’ve got less cars in my picture as such a deviation) are judged to be errors and penalized on tests, while they generally pass unnoted in informal NS speech. Jenkins cites our paper (Davies et al., 2003) as supporting her position; but in fact in our conclusion to that paper we made it clear that from our point of view the jury is still out. Lowenberg (1993) challenges “the assumption held by many who design English proficiency tests . . . that native speakers still should determine the norms for Standard English around the world” (p. 104). Lowenberg has followed up his earlier work with an analysis of newspaper style sheets, government documents, and ESL textbooks in Malaysia, Singapore, Brunei, and the Philippines, and has found that these diverge from native speaker varieties at all levels, from the morphosyntactic and lexical to pragmatic and discoursal conventions (Lowenberg, 2002).

Responding to Jenkins (2006), Taylor (2006) argues that many learners of English prefer to study a “native-like” variety of English for pragmatic and instrumental reasons such as the desire to study in an academic environment where standard varieties are still privileged: “we must avoid acting as ‘liberators’ only to impose a new bondage” (Taylor, 2006: 52). Taylor also points out that these days language tests increasingly enact communicative language use principles such as “communicative effectiveness” (writing) or “ability to achieve meaningful communication” (paired speaking tests), while maintaining an essential attention to form and accuracy in order to ensure that comprehensibility is the dominant factor. Further, Taylor argues, “For testers, especially providers of large-scale, high-stakes English proficiency tests, issues of quality and fairness must be paramount” (p. 56). She believes that fair tests depend on well-developed models of language use, which currently means principally American and British Standard English; however, she suggests that there may be a case for tests of English as an International Language (EIL), since this is becoming globally recognised and its descriptive codification is proceeding fast.

One way to avoid using the global norms to which Jenkins, Lowenberg, and others object is to investigate to what extent local norms are appropriate both locally and beyond the local, and to use this information in test development. Such an investigation is reported by Hill (1996) and Brown and Lumley (1998), both referring to the development of an English proficiency test for Indonesian teachers of English. Hill comments (1996: 323):

  the majority of English learners will use English to communicate with other non-native speakers within South-East Asia. For this reason it was decided the test should emphasize the ability to communicate effectively in English as it is used in the region, rather than relate proficiency to the norms of America, Britain or Australia . . . this approach also aims to recognize the Indonesian variety of English both as an appropriate model to be provided by teachers and as a valid target for learners.

  Brown and Lumley claim to have had several aims in view in their Indonesian test development, all of which were fulfilled. These were:

• the judicious selection of tasks relevant to teachers of English in Indonesia;
• the selection of culturally appropriate content;
• an emphasis on assessing test takers in relation to local norms;
• the use of local raters, that is non-native speakers of English (whose proficiency was nevertheless of a high standard). (Brown and Lumley 1998: 94)

THE BEGINNINGS

We began our research in this area in 2002 when we set up a research project in Hong Kong, where we were both based at the time. We started from the hypothesis that international English tests are biased: by “bias” we meant that they systematically misrepresent the “true scores” of candidates by requiring facility in a variety of English to which whole groups of candidates have not been exposed. (The “true” score is a hypothetical score reflecting a candidate’s true ability, once an adjustment has been made for measurement – in this case sampling – error.) This definition should make clear that “bias” is not about difference as such but about unfair difference. The argument about bias on international English tests is that these tests represent the old colonial Standard English of the UK, USA, etc. – a kind of English that is not known or only partly known by many of those who have learnt English as an additional language, in particular those living in one of the so-called New English societies, such as Singapore, Malaysia, and India, which have adopted a local or locally emerging variety of English. We had intended to undertake an empirical study, but in the event this was not possible and in its place we held an invited seminar in Hong Kong, described in Davies et al. (2003), with representatives from Singapore, China, India, and Malaysia. The purpose of the seminar was to compare local tests of English used in those four countries with international tests of English. We concluded that what is at issue in comparing international and local tests of English proficiency is which standard is under test. The question then becomes: does a WEs variety “own” (in the sense of accept) a standard of its own which it appeals to in a local test? If not, then the assumption is that in the testing situation speakers of this WE variety will be required to operate in the IE standard. That is precisely the point made strongly in the HK seminar by Lukmani (India). But are such WEs speakers being discriminated against in being required to do this? That is an empirical question, and formed part of the study reported here.

  The Hong Kong study concluded with three questions:

  1. How possible is it to distinguish between an error and a token of a new type?


  2. If we could establish bias, how much would it really matter?

3. Does an international English test privilege those with a metropolitan Anglophone education?

Following the Hong Kong study, we continued to debate these issues back and forth between us, and agreed that inherent to the whole debate are questions of beliefs and judgements; therefore, it would be appropriate to collect judgemental data that might illuminate bias if it exists, and perhaps indicate where bias might, or might not, lie.

THE EMPIRICAL STUDY

With funding from a Spaan Fellowship (University of Michigan English Language Institute Testing and Certification Division) we conducted an empirical investigation examining a range of scripts written by university student writers from six different language backgrounds, drawn from the database of the MELAB (University of Michigan English Language Battery). As the raw data for the study, we obtained (a) ratings of these scripts by two native speakers of the writer’s first language; (b) ratings by two pairs of raters from NNS backgrounds other than the writer’s first language; and (c) the original score data from the two MELAB raters. By examining writing tests, specifically the judgements made of a range of writers’ performances by different categories of raters, this work follows the lead of Hamp-Lyons (1986) and Hamp-Lyons and Zhang (2001), but looks at quantitative data rather than qualitative judgements about text characteristics.

  The structure of the data

  The data comprise 10 essays by writers from each of the following language backgrounds: Arabic, Bahasa Indonesia/Malaysia, Japanese, Mandarin Chinese, Tamil, and Yoruba. All essays were written on the same or a very similar topic from the MELAB pool: each essay received the same or closely similar scores from two official MELAB raters who were native speakers of Standard American English. We used the average of two raters’ scores as the dependent variable in this dataset, and within each language set we obtained a range of MELAB score levels, not including the very lowest levels. In the official MELAB scoring, raters use a 10-point scale with the following score labels: 53 (Low), 57, 63, 67, 73, 77, 83, 87, 93, 97 (High). When both readers give the same score, that is the score the essay gets. When the readers are one point away from each other (e.g. 57, 63), the essay is given the average of these two scores (i.e. 60). For that reason, the basic 10-point scale is expanded to 19 through use of these between-point average scores. In our dataset, all essays either received identical scores or were one point away from each other, so none was read more than twice. (The MELAB Scale – see Appendix – is a multi-band system. As in all scales, the numbers assigned to the bands are arbitrary and while those numbers may give a false sense of interval-level precision of measurement, such arbitrariness is unavoidable. The precision of a scale is conventionally determined and derives its validity from that shared convention.)
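To make the score-combination rule just described concrete, a minimal sketch follows (in Python). The scale points and the adjacent-score averaging come from the description above; the function name and the treatment of larger disagreements (which, as noted, did not occur in this dataset) are illustrative assumptions only.

# Illustrative sketch of the two-rater MELAB score combination described above.
# The scale points and the adjacent-score averaging rule come from the text;
# the function name and the error handling are our own illustrative choices.

MELAB_POINTS = [53, 57, 63, 67, 73, 77, 83, 87, 93, 97]  # 53 = Low ... 97 = High

def combined_score(rater_a: int, rater_b: int) -> float:
    """Combine two raters' scores on the 10-point MELAB scale.

    Identical scores stand; scores one scale point apart are averaged
    (e.g. 57 and 63 -> 60), which yields the expanded 19-point scale.
    Larger disagreements, which the text implies would trigger a further
    reading, did not occur in the dataset described in the paper.
    """
    ia, ib = MELAB_POINTS.index(rater_a), MELAB_POINTS.index(rater_b)
    if abs(ia - ib) <= 1:
        return (rater_a + rater_b) / 2
    raise ValueError("Raters more than one scale point apart: a further reading would be needed.")

print(combined_score(57, 63))  # 60.0
print(combined_score(83, 83))  # 83.0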

The essay sets were rated by pairs of raters: each pair shared their language background with one of the six sets of candidates: thus there were two native-speaking Japanese raters, two native-speaking Bahasa Indonesia/Malaysia raters, etc. We hypothesized that if there were bias, it would be reflected in a significant difference[2] between our native-speaking raters’ ratings and the official MELAB scores. We calculated correlations between the MELAB ratings for the Japanese writers’ essays and the Japanese raters’ ratings of their native group’s essays, and repeated this analysis for the Bahasa set, etc. We were also interested in a further comparison: whether or not there are any measurable differences across raters from different (IE and WEs) backgrounds. We agreed that if we found any such patterns on statistical analyses, we would look closely at the essays to try to identify consistent reasons (such as background-related reasons or rater bias factors).

Our reasoning for selecting these language backgrounds was more cultural and social than linguistic. We considered establishing a language distance scale (Davies and Elder, 1998), the assumption being that languages closer linguistically to English (e.g. German would be very close) were more likely to accept IE norms, while those closer culturally and socially because of a shared imperial and colonial past (e.g. India) were more likely to reject those norms. However, apart from the difficulty of establishing a language distance scale given the complexity of the variables involved and their interaction, we considered that language distance was less likely to be a critical factor than cultural proximity. By “cultural proximity”, we allude to a hypothesis that the more influenced culturally, socially, and politically by IE societies members of an English variety-using society are, the more influence we would see on that society’s members’ use of English. Even at the time of formulation it was evident to us that this is by no means a clear hypothesis. Nor is it by any means simple to operate, as we found when we began to discuss where we would place each of our groups on a “cultural proximity scale”. The decisions are very obviously based on our own judgements, our own experiences, knowledge of users of those languages and “representatives” of those cultures – they are wholly subjective. Furthermore, the intervening variables are many and not easy to control. We introduce this point here both in order to show the difficulties we faced in attempting to establish a sufficiently clear research design for the use of quantitative analyses, and in the hope that readers will respond to us with fruitful suggestions for further research methodology. We used the design we had set up for ourselves, and present the results here, with all these caveats in mind.

  

Table 1. Hypothesized scale of language/cultural “distance”/proximity

+ English          No clear basis          − English
Tamil              Arabic                  Chinese
Yoruba             Bahasa                  Japanese

The tentative “cultural proximity” scale for +/− English as we used it in this study is shown in Table 1. Our reasoning for placing the groups as we have done was as follows. While neither Tamil nor Yoruba would be close to English on a language distance scale, Tamil speakers (India and Sri Lanka) and Yoruba speakers (Nigeria), particularly those from the educated classes to whom overseas education is accessible, are close to (British) English culturally and socially because of their long former colonial status. We have therefore “scaled” Tamil and Yoruba as “+English” on our hypothetical cultural proximity continuum. On the opposite end of the continuum, “−English”, we have placed Japanese and Chinese. Japanese is not only far from English in language distance; Japan has never been colonized, and until the last 60 years there was very little external influence on its language or culture. While more recently many curious Japanese–English lexical blends have been created and there is much American influence on Japanese popular culture, considerable research in contrastive rhetoric (e.g. Hinds, 1983; 1987) leads us to suppose for the purposes of this study that Japanese remains socially and culturally, as well as linguistically, distant from English. Finally, China outside Hong Kong (in this study, mainland Chinese resident speakers of Putonghua) is the furthest away traditionally from English influence, not colonized and until recently not directly connected. In the middle ground we have tentatively put Bahasa and Arabic. As a language, Bahasa spans Malaysia and Indonesia: Malaysia is a former British colony while Indonesia has never been colonized by the British, and our raters were Indonesian. However, the nationalities (as opposed to the L1) of the Bahasa writers are unknown, and therefore there is some confound in our Bahasa data. As a language, Arabic is fairly distant from English, and Arabic users might generally be assumed to be culturally distant from English; but while the writers came from several of the more traditionally culturally distant countries (e.g. Saudi Arabia), our raters are users of Egyptian Arabic, and Egypt has a mixed history of accommodation with the Anglophone west. Therefore, there is some uncertainty in the Arabic data also.

  Ratings

As explained above, MELAB scores are described on a scale labeled from 53 to 97 – an unfamiliar and unusually long scale, which when scores are averaged would be 19 points long. For simplicity, rather than attempting to train all these raters in different countries to use a new scale of an unfamiliar length, we asked them to use the Test of Written English rating scale – a 6-point scale which when averaged has 9 points between 2 and 6. We therefore needed to make a judgemental adjustment between the scales in order to establish the cut-points for “equivalence”. In retrospect, we would have had our set of essays re-scored by MELAB raters but using the same scale as our other raters. This would have given us some semblance of data overlap, which would have improved our ability to interpret the data patterns.
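The actual cut-points established are not reported here; purely as an illustration of the kind of scale “equivalence” involved, a hypothetical linear mapping between the averaged 6-point scale and the MELAB band labels might look like the following sketch. The interpolation rule, the function name, and the example value are all assumptions for illustration, not the judgemental adjustment the study made.

# Purely illustrative: the paper does not give its actual cut-points, so this
# linear interpolation between the two scales is a hypothetical sketch of the
# kind of "equivalence" mapping described as being made judgementally.

MELAB_POINTS = [53, 57, 63, 67, 73, 77, 83, 87, 93, 97]   # the official MELAB band labels

def twe_to_melab(twe_score: float, lo: float = 2.0, hi: float = 6.0) -> int:
    """Map a (possibly averaged) score on the 6-point scale onto the nearest
    MELAB band label by simple linear interpolation; a hypothetical stand-in
    for the judgemental adjustment described in the text. The 2-6 range
    follows the averaged range mentioned above."""
    frac = (twe_score - lo) / (hi - lo)                  # position within the 2-6 range
    idx = round(frac * (len(MELAB_POINTS) - 1))          # nearest of the 10 MELAB bands
    return MELAB_POINTS[max(0, min(idx, len(MELAB_POINTS) - 1))]

print(twe_to_melab(4.5))   # -> 83 under this illustrative mapping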

  

FINDINGS

  We had four sets of hypotheses, the first two of which can be seen as an opposing pair:

  1. An IE hypothesis: that the mean scores for all subsets of essays would correlate with the MELAB raters’ mean scores at the same or very similar levels (i.e. there would be no significant differences between any of the pairs of raters and the MELAB scores).

  2. A WEs hypothesis: that the mean scores of some subsets of essays would be less closely correlated with MELAB scores than others (i.e. there would be significant differences between some pairs of raters and the MELAB scores).

  3. A very strong WEs hypothesis: that the scores of groups at the +E end of the scale would be least likely to agree with the MELAB scores. The ground for this hypothesis would be that those most dominated culturally and economically would be most likely to reject exonormative norms. It follows that, if this were true, the opposite would apply at the −E end of the scale.

4. A relativist WEs hypothesis: that there would be greater differences between the scores of pairs of raters on their own language background scripts and any other pair of raters of the same essays than between the same rater-pair and the MELAB ratings of those essays (i.e. that the cultural/linguistic distance is greater between two non-standard varieties than between any one of them and a standard variety).

  

Table 2. Inter-rater reliability

Bahasa      .446∗∗
Chinese     .733∗∗∗
Japanese    .747∗∗∗
Tamil       .270 ns
Yoruba      .498∗∗

Note on Arabic: one rater did not return scored essays and this group had to be discarded.
∗ = p<.05; ∗∗ = p<.01; ∗∗∗ = p<.005

Before we could use these data in attempting to answer any of our questions, we needed to know how reliable the averaged scores from each pair of raters were (the inter-rater r of the MELAB averaged scores was predetermined by the Michigan programme as .81[3]).

It can be seen that the scores of our two Tamil raters were so divergent that there is no consistent relationship; and therefore any statements we might make based on them would be dubious at best. This leaves us with only one subgroup in each of the “+English” and “neutral” conditions: Japanese and Chinese are the most stable, and so we can be most confident about our findings on hypotheses 2 and 3 for these languages; however, with so few data points we proceed with caution. We should also note that none of the rater pairs showed itself to be as inter-reliable as two MELAB raters; this is not surprising given that not only are MELAB raters trained, they also work closely together, and many of them have done so for many years. We also have very small datasets, which explains why such apparently low reliabilities are found to meet our probability conditions.
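For readers who wish to reproduce this kind of check, a minimal sketch follows of how a pair’s inter-rater reliability can be computed as a Pearson correlation stepped up by the Spearman-Brown prophecy formula, the procedure note 3 reports for the MELAB figure of .81. The scipy dependency and the score vectors below are assumptions and invented placeholders, not the study’s data.

# Minimal sketch: inter-rater reliability for one rater pair, computed as the
# Pearson correlation between the two raters' scores and then stepped up with
# the Spearman-Brown prophecy formula to estimate the reliability of the
# averaged (two-rater) score, as note 3 describes for the MELAB figure.
# The score lists below are invented placeholders, not the study's data.

from scipy.stats import pearsonr

rater_1 = [73, 77, 63, 83, 67, 73, 57, 77, 87, 63]   # hypothetical scores, essay by essay
rater_2 = [77, 73, 67, 83, 63, 77, 63, 73, 83, 67]

r_single, p_value = pearsonr(rater_1, rater_2)        # single-rater consistency and its p-value
r_averaged = (2 * r_single) / (1 + r_single)          # Spearman-Brown: reliability of the mean of two raters

print(f"Pearson r = {r_single:.3f} (p = {p_value:.3f}); "
      f"Spearman-Brown estimate for averaged scores = {r_averaged:.3f}")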

  Hypothesis 1 was an IE hypothesis: that there would be no significant difference between any of the pairs of raters rating their in-country scripts and those of the MELAB raters. What do we find when we compare in-country raters rating their own L1 candidates with the MELAB scores for the same candidates? Simple correlations are shown in Table 3. As we can see from this table, three of the four viable data sets correlate at p<.05 or better. This may be some evidence toward supporting the IE hypothesis; however, as Yoruba does not perform similarly, the strong IE hypothesis must be rejected.

  

Table 3. Matched student/rater in-country data vs MELAB ratings

+ E                        No clear basis              − E
(Tamil)                    MELAB–Bahasa .771∗∗         MELAB–Chinese .682∗
MELAB–Yoruba .589          (Arabic)                    MELAB–Japanese .784∗∗

∗ p<.05; ∗∗ p<.01

Hypothesis 2 was a WEs hypothesis: that the mean scores of some subsets of essays would be less closely correlated with MELAB scores than others, and that this difference would not be explainable by inter-rater unreliability. The MELAB–Tamil correlation was .554 (ns), which, even though the data are rejected for the other hypotheses due to inter-rater unreliability, is quite similar to the picture for MELAB–Yoruba. We see that the correlations of the data sets with MELAB scores are not all equivalent: there is variation. The weak WEs hypothesis would appear to be tentatively upheld, but we must remind ourselves again of the weakness of our data structure: we cannot be satisfied that these results spring from true patterns rather than from random error in the data.


  It may be that Hypothesis 3, the strong WEs hypothesis, can provide some insight: here we posited that the scores of groups at the +E end of the scale would be least likely to agree with the MELAB scores.

We can immediately see that the Yoruba scores correlate least with their matched MELAB scores, and therefore the strong WEs hypothesis would appear to be upheld. However, only at the −E end of the scale do we have a complete and stable data set: here we see that both the Chinese and Japanese data sets correlate significantly with MELAB scores; however, the Japanese data set correlates more strongly than does the Chinese. The picture is by no means clear, because we can see that the Bahasa and Japanese score sets are noticeably more closely correlated than are any other score sets, which was not predicted. We are hard pressed to construct an explanation why this should be, on the basis of the cultural proximity hypothesis, whether we use an IE position or a WEs position. There are not enough significant rs to enable a conclusion to be reached – how could there be in so small a sample? To investigate these issues closely enough to arrive at any supportable explanations, we need larger data sets and much stronger hypotheses about cultural distance or some alternative theoretical position.

In the final set of analyses, we explored Hypothesis 4, the relativist WEs hypothesis: that there would be greater differences between the scores of pairs of raters on their own language background scripts and any other pair of raters of the same essays than between the same rater-pair and the MELAB ratings of those essays. Here, we compared each language set with the language it is closest to and with one other.
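Each of Tables 4–7 below reports, for one essay set, the correlations between the own-language rater pair’s averaged scores and those of each other rater source over the same essays. A minimal sketch of how such a row can be computed is given below; scipy is assumed, and all score vectors are invented placeholders rather than the study’s data.

# Sketch of a Table 4-7 style comparison for one essay set: correlate the
# "own-language" rater pair's averaged scores with each other rater source
# over the same ten essays. All score vectors here are invented placeholders.

from scipy.stats import pearsonr

scores = {
    "MELAB raters":  [73, 77, 63, 83, 67, 73, 57, 77, 87, 63],
    "Tamil raters":  [3.5, 4.0, 3.0, 4.5, 3.0, 4.0, 2.5, 3.5, 5.0, 3.0],
    "Yoruba raters": [4.0, 4.0, 3.0, 5.0, 3.5, 4.0, 3.0, 4.0, 4.5, 3.5],
}
own_pair = [4.0, 4.5, 3.0, 5.0, 3.0, 4.0, 2.5, 4.0, 5.0, 3.0]   # e.g. Bahasa raters of Bahasa essays

for source, other in scores.items():
    r, p = pearsonr(own_pair, other)
    flag = "significant" if p < .05 else "ns"       # the pre-set criterion of note 2
    print(f"{source:14s} r = {r:5.3f}  p = {p:.3f}  ({flag})")

Since the Pearson correlation is unaffected by linear differences of scale, scores recorded on the MELAB bands and on the 6-point scale can be compared in this way without further conversion.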

Language-by-language discussion

Bahasa (see Table 4). Bahasa raters (although from a different geopolitical region than the writers) agree very closely with the Yoruba raters (in fact this is the closest agreement in the entire study), and strongly with the MELAB raters, but not with the Tamil raters. Since we have already noted the unreliability of the Tamil ratings, we should disregard this result, and the other two results are too similar for the relativist WEs hypothesis to be upheld in this case.

  Chinese (see Table 5). All the interactions are significant for the Chinese essays: the relativist WEs hypothesis cannot be upheld in this case.

  

Table 4. Bahasa essays

                                   MELAB raters    Tamil raters    Yoruba raters
Bahasa raters of Bahasa essays         0.771           0.362           0.901
Significance                           P<.009          P<.304          P<.000

  

Table 5. Chinese essays

                                   MELAB raters    Japanese raters    Yoruba raters
Chinese raters of Chinese essays       0.682           0.692              0.643
Significance                           P<.030          P<.027             P<.045

  

Japanese (see Table 6). Excluding the Tamil scores for their unreliability, we see that although the correlation between MELAB and the L1 Japanese raters was significant, that between Japanese and the “third language” (Chinese) pair was not. The relativist WEs hypothesis can be upheld in this case. This finding is particularly interesting because these two pairs of raters – Japanese and Chinese – were found to be the two most inter-rater reliable pairs.

  

Table 6. Japanese essays

                                     MELAB raters    Chinese raters    Tamil raters
Japanese raters of Japanese essays       0.784           0.180             0.130
Significance                             P<0.007         P<0.619           P<0.720

Yoruba (see Table 7). There are no significant interactions in this set, although Yoruba–MELAB comes close; we would be stretching things too far to claim evidence here for the relativist WEs hypothesis.

  We are left with only one data point that appears to support the relativist WEs hypothesis; however, we found this where we also found the most reliable pairs of ratings (the MELAB ratings and the Japanese ratings), and we must consider whether we would have had clearer results if we had used trained rater-pairs.

  

Table 7. Yoruba essays

                                   MELAB raters    Japanese raters    Tamil raters
Yoruba raters of Yoruba essays         0.589           0.060              0.207
Significance                           P<0.073         P<0.870            P<0.567

The rater conundrum

The problem with training raters, we had decided, was that by doing so we might undermine the very distinctiveness of their culture-specific responses to the scripts. However, with hindsight we see several things we might have done to ensure we started with raters who would fit our needs as well as possible. To begin with, we should have been far more careful about the backgrounds of raters. Hamp-Lyons (1986) had noted the influence that raters’ teaching experiences had on the ways they viewed NNS essays. Since then several studies, notably Lumley (2005), have found that raters’ knowledge, experience, and beliefs do affect the way they rate. Our two Japanese raters were both attached to one of the prestigious Tokyo universities and may well have received part of their education in the USA. Our Arabic raters both came from Egypt, where the Arabic spoken is somewhat different from that spoken by most of these essay writers, and where again American influence is likely to occur. Our “Yoruba” raters were first language Ibo speakers whose Yoruba was near-native but not native. None of these rater sets, we concluded, was free of potentially confounding influence.

A further conundrum, of course, is the fact that the MELAB raters are all NSs of Standard American English, which is not the “parent” lect of all the student writers. If raters’ lect of English had been matched to writers’ dominant English cultural and linguistic influences, results might have been different.

  

CONCLUSION

  In addition to the many rater variables and the inherent difficulties in reliably scoring performance data, as we pointed out in our introduction, our data set is very small and the intervening variables are many and incestuous: the uncertainty about candidates’ L1 (not all Yoruba were Yoruba); the shifting nature of cultural “membership” (has the writer − or rater − lived/studied in an English-speaking country? are Indonesia and Malaysia “the same” culturally as well as linguistically?); the lack of fit between raters and how far they shared the L1 of their “compatriots”; the lack of training of raters and the worry that, if they were trained, raters would become ciphers of the IE we want them to problematise; our failure to use just one rating scale. And so on.

However, this pilot (or even pre-pilot) is, we suggest, worth extending. What we hope to do is to limit our sets of student texts to 4 L1s, selecting two thought to be culturally close and two thought to be culturally distant. Each set should have an N = 50+. Each set should have 2 NS raters and 4 NNS raters, making 4×2 + 4×4 = 24 raters in total. All raters should use the same scale. Half in each set should be trained. We hope to link the continuation of this project with a study to be undertaken by Author 2, who has been awarded a Leverhulme Emeritus Fellowship for research in this area, entitled “Native speakers and native users”.

  Envoi

  Both issues we have confronted, WEs and bias, are fugitive. Nevertheless, their pursuit through analysis of test instruments does afford the possibility of coming nearer to our quarries. Bias on the basis of our study may be “not proven”, but it cannot be dismissed.

  As for the three questions we posed at the outset, we are no closer to any answers, but we are becoming clearer as to how further research can help us understand whether these are the right questions to ask and to select the right strategies with which to pursue them.

  

NOTES

1. We are grateful for helpful comments from two reviewers. What faults remain, we take full responsibility for.

  2. Our definition of a significant difference was pre-set at p<.05.

  

3. Pearson with Spearman-Brown prophecy formula: personal communication, Jeff Johnson, Michigan ELI Testing and Certification Division. Reliability is shown in Table 2.

APPENDIX 1: MELAB WRITTEN COMPOSITION RATING SCALE

97 Topic is richly and fully developed. Flexible use of a wide range of syntactic (sentence level) structures, accurate morphological (word forms) control. Organization is appropriate and effective, and there is excellent control of connection. There is a wide range of appropriately used vocabulary. Spelling and punctuation appear error-free.

93 Topic is fully and complexly developed. Flexible use of a wide range of syntactic structures. Morphological control is nearly always accurate. Organization is well controlled and appropriate to the material, and the writing is well connected. Vocabulary is broad and appropriately used. Spelling and punctuation errors are not distracting.

87 Topic is well developed, with acknowledgement of its complexity. Varied syntactic structures are used with some flexibility, and there is good morphological control. Organization is controlled and generally appropriate to the material, and there are few problems with connection. Vocabulary is broad and usually used appropriately. Spelling and punctuation errors are not distracting.

83 Topic is generally clearly and completely developed, with at least some acknowledgement of its complexity. Both simple and complex syntactic structures are generally adequately used; there is adequate morphological control. Organization is controlled and shows some appropriacy to the material, and connection is usually adequate. Vocabulary use shows some flexibility, and is usually appropriate. Spelling and punctuation errors are sometimes distracting.

77 Topic is developed clearly but not completely and without acknowledging its complexity. Both simple and complex syntactic structures are present; in some “77” essays these are cautiously and accurately used while in others there is more fluency and less accuracy. Morphological control is inconsistent. Organization is generally controlled, while connection is sometimes absent or unsuccessful. Vocabulary is adequate, but may sometimes be inappropriately used. Spelling and punctuation errors are sometimes distracting.

73 Topic development is present, although limited by incompleteness, lack of clarity, or lack of focus. The topic may be treated as though it has only one dimension, or only one point of view is possible. In some “73” essays both simple and complex syntactic structures are present, but with many errors; others have accurate syntax but are very restricted in the range of language attempted. Morphological control is inconsistent. Organization is partially controlled, while connection is often absent or unsuccessful. Vocabulary is sometimes inadequate, and sometimes inappropriately used. Spelling and punctuation errors are sometimes distracting.

  67 Topic development is present but restricted, and often incomplete or unclear. Simple syntactic structures dominate, with many errors; complex syntactic structures, if present, are not controlled. Lacks morphological control. Organization, when apparent, is poorly controlled, and little or no connection is apparent. Narrow and simple vocabulary usually approximates meaning but is often inappropriately used. Spelling and punctuation errors are often distracting.

  63 Contains little sign of topic development. Simple syntactic structures are present, but with many errors; lacks morphological control. There is little or no organization, and no connection apparent. Narrow and simple vocabulary inhibits communication, and spelling and punctuation errors often cause serious interference.

  57 Often extremely short; contains only fragmentary communication about the topic. There is little syntactic or morphological control, and no organization or connection are apparent. Vocabulary is highly restricted and inaccurately used. Spelling is often indecipherable and punctuation is missing or appears random.

53 Extremely short, usually about 40 words or less; communicates nothing, and is often copied directly from the prompt. There is little sign of syntactic or morphological control, and no apparent organization or connection. Vocabulary is extremely restricted and repetitively used. Spelling is often indecipherable and punctuation is missing or appears random.

N.O.T. (Not On Topic) Indicates a composition written on a topic completely different from any of those assigned; it does not indicate that a writer has merely digressed from or misinterpreted a topic. N.O.T. compositions often appear prepared and memorized. They are not assigned scores or codes.

  (http://www.lsa.umich.edu/eli/composition)

  

REFERENCES

Bartsch, Renate (1988) Norms of Language: Theoretical and Practical Aspects. London: Longman.
Brown, Annie (2000) Tongue slips and Singaporean English pronunciation. English Today, 16(3), 31–6.
Brown, Annie, and Lumley, Tom (1998) Linguistic and cultural norms in language testing: a case study. Melbourne Papers in Language Testing, 7(1), 80–96.
Clapham, Caroline (1996) The Development of IELTS: A Study of the Effect of Background Knowledge on Reading Comprehension. Studies in Language Testing No. 4. Cambridge: Cambridge ESOL and Cambridge University Press.
Criper, Clive, and Davies, Alan (1988) ELTS Validation Project Report. Research Report 1/1. London and Cambridge: The British Council and Cambridge University Press.
Davidson, Fred (1993) Testing English across countries and cultures: summary and comments. World Englishes, 12(1), 113–25.
Davies, Alan (1999) An Introduction to Applied Linguistics: From Practice to Theory. Edinburgh: Edinburgh University Press.
Davies, Alan (2003) The Native Speaker: Myth and Reality. Clevedon, UK: Multilingual Matters.
Davies, Alan, and Elder, Catherine (1998) Performance on ESL examinations: is there a language distance effect? Language and Education, 11, 1–17.
Davies, Alan, Hamp-Lyons, Liz, and Kemp, Charlotte (2003) Whose norms? International proficiency tests in English. World Englishes, 22(4), 571–84.
de Beaugrande, Robert (1999) Theory and practice in the discourse of language planning. World Englishes, 18(2), 107–21.
Graddol, David (1997) The Future of English. London: British Council.
Graddol, David (1999) The decline of the native speaker. In English in a Changing World. Edited by David Graddol and Ulrike H. Meinhof. AILA: the AILA Review, 13, 57–68.
Hamp-Lyons, Liz (1986) Writing in a foreign language and rhetorical transfer: influences on evaluators’ ratings. In British Series in Applied Linguistics 1: Selected Papers from the 1985 Annual Meeting. Edited by Paul Meara. London: CILT, pp. 72–84.
Hamp-Lyons, Liz, and Zhang, Wen xia (2001) World Englishes: issues in and from academic writing assessment. In English for Academic Purposes: Research Perspectives. Edited by John Flowerdew and Matthew Peacock. Cambridge: Cambridge University Press, pp. 101–16.