
selecting among alternatives. Good essay items challenge students to organize, analyze, integrate, and synthesize information. Essay items can be classified according to their educational purpose or focus (e.g., evaluating content, style, and grammar), the complexity of the task presented (e.g., knowledge, comprehension, application, analysis, synthesis, and evaluation), and how much structure they provide (restricted or extended response). Below is an example of an essay item:

1. List the types of muscle tissue and state the function of each.

Figure 2.8. Example of an essay item (Reynolds, 2009: 228)

Short-answer items

Short-answer items (Reynolds, 2009: 237) require students to supply a word, phrase, number, or symbol in response to a direct question. They can also be written in an incomplete-sentence format instead of a direct-question format, which is sometimes referred to as a completion item. Here are the examples:

Direct-Question Format
1. What is the membrane surrounding the nucleus called? _______________

Incomplete-Sentence Format
1. The membrane surrounding the nucleus is called the ________________.

Figure 2.9. Example of a short-answer item (Reynolds, 2009: 237)

4. The Meaning of Test Results

Test results or scores reflect the performance or ratings of individuals in completing a test (Reynolds, 2009: 62). Test scores are the keys to interpreting and understanding examinees' performance. Therefore, their meaning and interpretation are extremely important topics and deserve careful attention. Many score formats are available for our use, and each format has its own unique characteristics. Reynolds (2009: 63) says that the simplest type of score is a raw score. A raw score is simply the number of items scored or coded in a specific manner, such as correct/incorrect, true/false, and so on. Yet a raw score tends to offer very little useful information, because in most situations it has little interpretative meaning. We need to transform or convert it into another format to facilitate interpretation and give it meaning. These transformed scores, typically referred to as derived scores or standard scores, are crucial in helping us interpret test results. Derived scores can be classified as either norm-referenced or criterion-referenced (Reynolds, 2009: 63). In a norm-referenced score interpretation, the examinee's performance is compared to the performance of other people (a reference group), as on tests of intelligence. If a student obtains an IQ of 100 on such a test, this indicates that he or she scored higher than 50% of the people in the standardization sample. Standardization samples should be representative of the types of individuals who are expected to take the test. In a criterion-referenced score interpretation, the examinee's performance is compared to a specified level of performance (Reynolds, 2009: 63). The emphasis is on what examinees know or what they can do, not "their standing relative" (Reynolds, 2009: 63) to other test takers. A classroom examination is a typical example.
If an examinee correctly answered 85% of the items on a classroom test, the performance is not compared to that of other examinees but to perfect performance on the test. Norm-referenced interpretations are relative (i.e., relative to the performance of other examinees), while criterion-referenced interpretations are absolute (i.e., compared to an absolute standard). Along with transforming the test results, test developers should usually provide qualitative descriptions of the scores produced by their tests (Reynolds, 2009: 85). This helps professionals communicate results in written reports and other formats. For example, the Stanford-Binet Intelligence Scales' classifications are as follows:

IQ Range        Classification
145 and above   Very Gifted or Highly Advanced
130-144         Gifted or Very Advanced
120-129         Superior
110-119         High Average
90-109          Average
80-89           Low Average
70-79           Borderline Impaired or Delayed
55-69           Mildly Impaired or Delayed
40-54           Moderately Impaired or Delayed

Figure 2.10. Stanford-Binet Intelligence Scales (Reynolds, 2009: 85)

Another example is the Behavior Assessment System for Children, which provides qualitative descriptions of the clinical scales (such as the depression or anxiety scales) as follows:

T-Score Range   Classification
70 and above    Clinically Significant
60-69           At-Risk
41-59           Average
31-40           Low
30 and below    Very Low

Figure 2.11. Behavior Assessment System for Children (Reynolds, 2009: 85)

5. Criteria

Designing a good test is not easy. One should attend to the specific criteria that determine the overall usefulness of the instrument: reliability, validity, practicality, and authenticity. Reliability is the consistency of test scores for the same individuals. A test is reliable if it yields the same score for a given individual on two separate occasions. Genesee (2007: 245) considers four kinds of reliability: first, test-retest reliability, the degree of consistency of scores for the same test given to the same individuals on different occasions.
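To make the norm- versus criterion-referenced distinction and the qualitative classification bands above concrete, here is a brief illustrative sketch in Python. This is our own illustration, not a procedure from Reynolds (2009): the reference-group scores and the 80% passing standard are hypothetical, while the IQ bands follow Figure 2.10.

```python
# Illustrative only: reference group and passing standard are hypothetical;
# the IQ bands follow Figure 2.10 (Reynolds, 2009: 85).

def norm_referenced_percentile(raw_score, reference_scores):
    """Relative interpretation: percentage of the reference group
    scoring below this examinee."""
    below = sum(1 for s in reference_scores if s < raw_score)
    return 100.0 * below / len(reference_scores)

def criterion_referenced_percent(raw_score, total_items, passing_percent=80):
    """Absolute interpretation: performance against a fixed standard,
    not against other examinees."""
    percent_correct = 100.0 * raw_score / total_items
    return percent_correct, percent_correct >= passing_percent

# Qualitative description bands from Figure 2.10 as (lower bound, label),
# ordered from highest to lowest.
STANFORD_BINET_BANDS = [
    (145, "Very Gifted or Highly Advanced"),
    (130, "Gifted or Very Advanced"),
    (120, "Superior"),
    (110, "High Average"),
    (90, "Average"),
    (80, "Low Average"),
    (70, "Borderline Impaired or Delayed"),
    (55, "Mildly Impaired or Delayed"),
    (40, "Moderately Impaired or Delayed"),
]

def classify_iq(iq):
    """Map an IQ score to its qualitative classification."""
    for lower_bound, label in STANFORD_BINET_BANDS:
        if iq >= lower_bound:
            return label
    return "Below the lowest classification in Figure 2.10"

reference_group = [22, 25, 28, 30, 31, 33, 35, 36, 38, 39]  # hypothetical
print(norm_referenced_percentile(34, reference_group))  # → 60.0
print(criterion_referenced_percent(34, 40))             # → (85.0, True)
print(classify_iq(100))                                 # → Average
```

The same raw score of 34 thus reads differently under each interpretation: relative standing (higher than 60% of this hypothetical reference group) versus absolute mastery (85% of the items, above the 80% standard).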
Second, alternate-forms reliability, that is, the consistency of scores for the same individuals on different but comparable forms of the test. Third, internal consistency, that is, the degree of consistency of test scores with regard to the content of a single test. Fourth, scorer reliability, that is, the degree of consistency of scores from different scorers for the same individuals on the same test, or from the same scorer for the same individuals on the same test but on different occasions. Validity is the appropriateness of a given test or any of its component parts as a measure of what it is intended to measure. A valid test is "one that measures what it is supposed to measure – no more, no less" (Genesee, 2007: 245). Validity can be classified into three categories: first, construct validity, which refers to the extent to which we can interpret a given test score as an indicator of the ability, or construct, we want to measure. Second, content validity, which depends on a logical analysis of the test's content to see whether the test contains a representative sample of relevant language skills (Alderson, Clapham, and Wall, 1995: 171). And third, criterion-related validity, which refers to "studies comparing students' test scores with measures of their ability gleaned from outside the test" (Alderson, Clapham, and Wall, 1995: 171). Practicality refers to five aspects: the fairness issue, that is, the degree to which a test treats every student the same, or the degree to which it is impartial; the cost issue, which is related to the time and funds that teachers need to conduct an objective test; ease of test construction, which is related to the number of test questions; ease of test administration, that is, the degree to which a test is easy to administer; and ease of test scoring, that is, the degree to which a test is easy to score. The last criterion is authenticity.
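Test-retest and alternate-forms reliability are in practice commonly quantified as the correlation between the two sets of scores for the same individuals. Below is a minimal sketch of this idea (our illustration, not a procedure from Genesee, 2007; all scores are hypothetical):

```python
# Pearson correlation between two administrations of a test to the same
# individuals; a value near 1 indicates consistent (reliable) scores.
# All scores below are hypothetical.

def pearson_r(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

first_sitting = [55, 62, 70, 74, 81, 90]   # same six students,
second_sitting = [58, 60, 72, 75, 79, 92]  # two occasions
print(pearson_r(first_sitting, second_sitting))  # close to 1.0
```

The same calculation applies to alternate-forms reliability (scores on two comparable forms) and to scorer reliability (scores from two scorers), with only the meaning of the two lists changing.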
Authenticity is considered an important feature of language tests, but frequently the notion is related only to the use of authentic material; in fact, the concept of authenticity is far more comprehensive (Eder, 2010: 1). As Charles Alderson (2000: 138) puts it, the goal of all reading assessment "is typically to know how well readers read the real world"; authenticity thus becomes an important aspect of testing, since it describes the relationship between the test and the real world (Eder, 2010: 1).

3. Written Communicative English Competence