selecting among alternatives. Good essay items challenge students to organize,
analyze, integrate, and synthesize information. Essay items can be classified according to their educational purpose or focus (e.g., evaluating content, style, and
grammar), the complexity of the task presented (e.g., knowledge, comprehension, application, analysis, synthesis, and evaluation), and how much structure they
provide (restricted or extended response). Below is an example of an essay item:
1. List the types of muscle tissue and state the function of each.
Figure 2.8. Example of essay item (Reynolds, 2009: 228)
Short-answer items
Short-answer items (Reynolds, 2009: 237) require students to supply a word, phrase, number, or symbol in response to a direct question. They can also be
written in an incomplete-sentence format instead of a direct-question format, which is sometimes referred to as a completion item. Here are examples:
Direct-Question Format
1. What is the membrane surrounding the nucleus called? _______________
Incomplete-Sentence Format
1. The membrane surrounding the nucleus is called the ________________
Figure 2.9. Example of short-answer item (Reynolds, 2009: 237)
4. The Meaning of Test Results
Test results or scores reflect the performance or ratings of individuals in completing a test (Reynolds, 2009: 62). Test scores are the key to interpreting and
understanding examinees' performance. Therefore their meaning and interpretation
are extremely important topics and deserve careful attention. Many score
formats are available for our use, and each format has its own unique characteristics. Reynolds (2009: 63) says that the simplest type of score is a raw
score. A raw score is simply the number of items scored or coded in a specific manner, such as correct/incorrect, true/false, and so on. Yet a raw score tends to
offer very little useful information because in most situations it has little interpretative meaning. We need to transform or convert it into another format to
facilitate interpretation and give it meaning. These transformed scores are typically referred to as derived scores or standard scores, and they are
crucial in helping us interpret the test results. Derived scores can be classified as either norm-referenced or criterion-referenced (Reynolds, 2009: 63).
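The conversion from raw scores to derived scores can be sketched in Python. The sample scores below, and the choice of z-scores and T-scores (mean 50, standard deviation 10) as the derived format, are illustrative assumptions rather than part of Reynolds' discussion:

```python
import statistics

def to_standard_scores(raw_scores):
    """Convert raw scores into z-scores and T-scores (mean 50, SD 10)."""
    mean = statistics.mean(raw_scores)
    sd = statistics.pstdev(raw_scores)  # population standard deviation
    z = [(x - mean) / sd for x in raw_scores]
    t = [50 + 10 * zi for zi in z]
    return z, t

# Hypothetical raw scores from a classroom test
raw = [12, 15, 18, 20, 25]
z_scores, t_scores = to_standard_scores(raw)
print([round(zi, 2) for zi in z_scores])
print([round(ti, 1) for ti in t_scores])
```

Unlike the raw scores, the derived scores immediately show each examinee's position relative to the group mean, which is what gives them interpretative meaning.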
Norm-referenced score interpretations compare the examinee's performance to the performance of other people (a reference group),
for example, scores on tests of intelligence. If a student, after taking the test, has an IQ of 100, this indicates that he or she scored higher than 50% of the people in the
standardization sample. Standardization samples should be representative of the types of individuals who are expected to take the tests.
Criterion-referenced score interpretations compare the examinee's
performance to a specified level of performance (Reynolds, 2009: 63). The emphasis is on what the examinees know or what they can do, not "their
standing relative" (Reynolds, 2009: 63) to other test takers. Take a classroom examination, for example: if an examinee correctly answered 85% of the items on a
classroom test, the performance is not compared to that of other examinees but to a perfect performance on the test.
Norm-referenced interpretations are relative (i.e., relative to the performance
of other examinees), while criterion-referenced interpretations are absolute (i.e., compared to an absolute standard).
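The two interpretations can be contrasted in a short sketch; the reference-group scores and the 100-item test length below are hypothetical:

```python
def percentile_rank(score, reference_group):
    """Norm-referenced: percent of the reference group scoring below this score."""
    below = sum(1 for s in reference_group if s < score)
    return 100 * below / len(reference_group)

def percent_correct(score, n_items):
    """Criterion-referenced: performance measured against the test itself."""
    return 100 * score / n_items

reference = [55, 60, 62, 70, 71, 75, 80, 85, 88, 90]  # hypothetical norm group
score = 85

print(percentile_rank(score, reference))  # relative standing in the group
print(percent_correct(score, 100))        # absolute standing on the test
```

The same raw score of 85 yields two different statements: a relative one (the examinee outperformed 70% of this norm group) and an absolute one (the examinee answered 85% of the items correctly).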
Along with transforming the test results, we usually should provide qualitative descriptions of the scores produced by our tests (Reynolds, 2009: 85).
This helps professionals communicate results in written reports and other formats. For example, the Stanford-Binet Intelligence Scales' qualitative classifications are shown as follows:
IQ Classification
145 and above Very Gifted or Highly Advanced
130-144 Gifted or Very Advanced
120-129 Superior
110-119 High Average
90-109 Average
80-89 Low Average
70-79 Borderline Impaired or Delayed
55-69 Mildly Impaired or Delayed
40-54 Moderately Impaired or Delayed
Figure 2.10. Stanford-Binet Intelligence Scales (Reynolds, 2009: 85)
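The bands in Figure 2.10 can be expressed as a simple lookup. The function below is an illustrative sketch of the table, not part of the instrument itself:

```python
def classify_iq(iq):
    """Map an IQ score to a Stanford-Binet qualitative label (cf. Figure 2.10)."""
    bands = [
        (145, "Very Gifted or Highly Advanced"),
        (130, "Gifted or Very Advanced"),
        (120, "Superior"),
        (110, "High Average"),
        (90,  "Average"),
        (80,  "Low Average"),
        (70,  "Borderline Impaired or Delayed"),
        (55,  "Mildly Impaired or Delayed"),
        (40,  "Moderately Impaired or Delayed"),
    ]
    for lower, label in bands:   # bands are ordered from highest to lowest
        if iq >= lower:
            return label
    return "Below the table's range"

print(classify_iq(100))
```

For instance, the student with an IQ of 100 mentioned earlier falls in the "Average" band.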
Another example is the Behavior Assessment System for Children, which provides qualitative descriptions of the clinical scales, such as the depression or
anxiety scales, as follows:
T-Score Range Classification
70 and above Clinically significant
60-69 At-Risk
41-59 Average
31-40 Low
30 and below Very Low
Figure 2.11. Behavior Assessment System for Children (Reynolds, 2009: 85)
5. Criteria
Designing a good test is not easy. One should notice the specific criteria that determine the overall usefulness of the instrument, such as reliability,
validity, practicality, and authenticity.
Reliability is the consistency of test scores for the same individuals. A test is
reliable if it yields the same score for a given individual on two separate occasions. Genesee (2007: 245) considers four kinds of reliability: first, test-retest reliability,
the degree of consistency of scores for the same test given to the same individuals on different occasions; second, alternate-forms reliability, that is, the consistency
of scores for the same individuals on different but comparable forms of the test; third, internal consistency, that is, the degree of consistency of test scores
with regard to the content of a single test; and fourth, scorer reliability, that is, the degree of consistency of scores from different scorers for the same individuals on
the same test, or from the same scorer for the same individuals on the same test but on different occasions.
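Of these four kinds, internal consistency is commonly estimated with coefficient alpha (Cronbach's alpha). The formula and the sample item-score matrix below are illustrative additions, not drawn from Genesee:

```python
import statistics

def cronbach_alpha(item_scores):
    """Coefficient alpha from a matrix of examinees' item scores (rows = examinees)."""
    k = len(item_scores[0])                  # number of items
    totals = [sum(row) for row in item_scores]
    item_vars = [statistics.pvariance([row[i] for row in item_scores])
                 for i in range(k)]
    total_var = statistics.pvariance(totals)
    # alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)

# Hypothetical right/wrong (1/0) scores: four examinees, three items
scores = [
    [1, 1, 1],
    [1, 1, 0],
    [0, 1, 0],
    [0, 0, 0],
]
print(round(cronbach_alpha(scores), 2))  # → 0.75
```

Higher values indicate that the items behave consistently as measures of the same underlying content, which is what internal consistency captures.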
Validity is the appropriateness of a given test or any of its component parts
as a measure of what it is intended to measure. A valid test is "one that measures what it is supposed to measure – no more, no less" (Genesee, 2007: 245). Validity
can be classified into three categories: first, construct validity, which refers to the extent to which we can interpret a given test score as an indicator of the ability, or
construct, we want to measure; second, content validity, which depends on a logical analysis of the test's content to see whether the test contains a
representative sample of relevant language skills (Alderson, Clapham, and Wall, 1995: 171); and third, criterion-related validity, which refers to "studies
comparing students' test scores with measures of their ability gleaned from
outside the test" (Alderson, Clapham, and Wall, 1995: 171).
Practicality refers to five aspects: the fairness issue, that is, the degree to
which a test treats every student the same or the degree to which it is impartial; the cost issue, which is related to the time and funds that teachers need in
conducting an objective test; ease of test construction, which is related to the number of test questions; ease of test administration, that is, the degree to which a test is
easy to administer; and ease of test scoring, that is, the degree to which a test is easy to score.
The last is authenticity. Authenticity is considered an important feature
of a language test, but frequently the notion is related only to the use of authentic material. Actually, the concept of authenticity is far more comprehensive (Eder,
2010: 1). As Charles Alderson (2000: 138) defines it, the goal of all reading assessment "is typically to know how well readers read the real world";
authenticity therefore becomes an important aspect of testing, since it describes the relationship between the test and the real world (Eder, 2010: 1).
3. Written Communicative English Competence