themselves entirely to direct testing and will always include an indirect element in their tests. Of course, to obtain diagnostic information on underlying abilities,
such as control of particular grammatical structures, indirect testing is called for.
3.3 Validity
Any discussion of validity must begin with content validity. A test is said to have content validity if its content
constitutes a representative sample of the language skills, structures, etc. with which it is meant to be concerned. It is obvious that a grammar test, for instance,
must be made up of items testing knowledge or control of grammar. But this in itself does not ensure content validity. The test would have content validity only if
it included a proper sample of the relevant structures. Just what are the relevant structures will depend, of course, upon the purpose of the test. We would not
expect an achievement test for intermediate learners to contain just the same set of structures as one for advanced learners. In order to judge whether or not a test has
content validity, we need a specification of the skills or structures etc. that it is meant to cover. Such a specification should be made at a very early stage in test
construction. It is not to be expected that everything in the specification will always appear in the test; there may simply be too many things for all of them to
appear in a single test. But it will provide the test constructor with the basis for making a principled selection of elements for inclusion in the test. A comparison
of test specification and test content is the basis for judgments as to content validity. Ideally these judgments should be made by people who are familiar with
language teaching and testing but who are not directly concerned with the production of the test in question.
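The comparison of specification and content described above can be made concrete as a simple coverage check. The following is a minimal illustrative sketch; the grammatical structures and the sets themselves are invented for the example, not taken from any real test.

```python
# Hypothetical content-validity check: compare the structures named in a
# test specification with the structures actually sampled by the test items.
# All structure names here are invented for illustration.

specification = {
    "present simple", "past simple", "present perfect",
    "conditionals", "passive voice", "reported speech",
}

# Structures the drafted test items actually cover.
test_content = {"present simple", "past simple", "passive voice"}

covered = specification & test_content
missing = specification - test_content

coverage = len(covered) / len(specification)
print(f"Coverage of specification: {coverage:.0%}")
print(f"Under-represented structures: {sorted(missing)}")
```

A low coverage figure, or a long list of missing structures, signals exactly the situation warned against below: major areas of the specification under-represented or absent.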
Content validity matters for two reasons. First, the greater a test’s content validity, the more likely it is to be an accurate measure of what it is
supposed to measure. A test in which major areas identified in the specification are under-represented—or not represented at all—is unlikely to be accurate.
Second, such a test is likely to have a harmful backwash effect. Areas which are not tested are likely to become areas ignored in teaching and learning. Too often
the content of tests is determined by what is easy to test rather than what is important to test. The best safeguard against this is to write full test specifications
and to ensure that the test content is a fair reflection of these. Criterion-related validity. Another approach to test validity is
to see how far results on the test agree with those provided by some independent and highly dependable assessment of the candidate’s ability. This independent
assessment is thus the criterion measure against which the test is validated. There are essentially two kinds of criterion-related validity: concurrent
validity and predictive validity. Concurrent validity is established when the test and the criterion are administered at about the same time. To exemplify this kind
of validation in achievement testing, let us consider a situation where course objectives call for an oral component as part of the final achievement test. The
objectives may list a large number of ‘functions’ which students are expected to perform orally, to test all of which might take 45 minutes for each student. This
could well be impractical. Perhaps it is felt that only ten minutes can be devoted to
each student for the oral component. The question then arises: can such a ten-minute session give a sufficiently accurate estimate of the student’s ability with
respect to the functions specified in the course objectives? Is it, in other words, a valid measure?
From the point of view of content validity, this will depend on how many of the functions are tested in the component, and how representative they are of
the complete set of functions included in the objectives. Every effort should be made when designing the oral component to give it content validity. Once this has been
done, however, we can go further. We can attempt to establish the concurrent validity of the component.
To do this, we should choose at random a sample of all the students taking the test. These students would then be subjected to the full 45-minute oral
component necessary for coverage of all the functions, using perhaps four scorers to ensure reliable scoring. This would be the criterion test against which the
shorter test would be judged. The students’ scores on the full test would be compared with the ones they obtained on the ten-minute session, which would
have been conducted and scored in the usual way, without knowledge of their performance on the longer version. If the comparison between the two sets of
scores reveals a high level of agreement, then the shorter version of the oral component may be considered valid, inasmuch as it gives results similar to those
obtained with the longer version. If, on the other hand, the two sets of scores show little agreement, the shorter version cannot be considered valid; it cannot be used
as a dependable measure of achievement with respect to the functions specified in
the objectives. Of course, if ten minutes really is all that can be spared for each student, then the oral component may be included for the contribution that it
makes to the assessment of students’ overall achievement and for its backwash effect. But it cannot be regarded as an accurate measure in itself.
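The "level of agreement" between the two sets of scores is conventionally expressed as a validity coefficient, a correlation between scores on the short test and scores on the criterion test. The sketch below illustrates the calculation with a Pearson correlation; all score data are invented for the example.

```python
# Illustrative concurrent-validity check: correlate scores from the
# ten-minute oral test with scores from the full 45-minute criterion
# test for the same (hypothetical) sample of students.
import math

def pearson(xs, ys):
    """Pearson product-moment correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)

# Invented scores for eight sampled students.
short_test = [12, 15, 9, 18, 14, 11, 16, 13]   # ten-minute session
full_test  = [48, 60, 35, 72, 55, 44, 65, 50]  # 45-minute criterion test

r = pearson(short_test, full_test)
print(f"Validity coefficient r = {r:.2f}")
```

A coefficient near 1 would support treating the ten-minute session as a valid substitute for the full test; a coefficient near 0 would indicate the little agreement described above.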
A test is said to have face validity if it looks as if it measures what it is supposed to measure. For example, a test which claimed to measure
pronunciation ability but which did not require the candidate to speak (and there have been some) might be thought to lack face validity. This would be true even if
the test’s construct and criterion-related validity could be demonstrated. Face validity is hardly a scientific concept, yet it is very important. A test which does
not have face validity may not be accepted by candidates, teachers, education authorities or employers. It may simply not be used; and if it is used, the
candidates’ reaction to it may mean that they do not perform on it in a way that truly reflects their ability. Novel techniques, particularly those which provide
indirect measures, have to be introduced slowly, with care, and with convincing explanations.
What use is the reader to make of the notion of validity? First, every effort should be made in constructing tests to ensure content validity. Where
possible, the tests should be validated empirically against some criterion. Particularly where it is intended to use indirect testing, reference should be made
to the research literature to confirm that measurement of the relevant underlying constructs has been demonstrated using the testing techniques that are to be used.
This may often result in disappointment, which is another reason for favoring direct
testing. Any published test should supply details of its validation; in the absence of such details, the test cannot be assumed to be valid.
3.4 Reliability