
applied here. Below is an explanation of the three types of triangulation that this research employed.

a. Time triangulation. It was employed by collecting the data over a certain period of time. Using three data collection instruments, this study collected data in the planning, action, and observation stages of the research.

b. Investigator triangulation. It was achieved by involving more than one observer in the study in order to avoid biased observation. In this study, the researcher did not observe the conduct of the research on her own; the English teacher serving as the research collaborator was also engaged.

c. Theoretical triangulation. It was applied by having the data collected during the research analyzed from more than one theoretical perspective.

2. Validity and Reliability of the Quantitative Data

Regarding the quantitative data, the only instruments employed to collect the data were tests. As with the qualitative data, this research also ensured the validity and reliability of the quantitative data by ensuring those of the instruments employed to collect them. Figure 7 below illustrates the stages that the research followed to develop the tests.

[Figure 7 presents a flowchart of nine stages: (1) Test-Prototype Development; (2) Validity of Test Prototypes: Logical Validity through Content Validity; (3) Expert Judgment for Content Validity; (4) Test Revision I; (5) Test Try-Outs; (6) Data Analysis using ITEMAN 3.00; (7) Reliability of Test Prototypes: Cronbach's Alpha; (8) Item Analysis using Three Item Indices; (9) Test Revision II: Final Draft of the Instrument.]

Figure 7: Stages of Test Development

The validity of the quantitative data was established through logical validity, specifically content validity. According to Brown (2004: 22), content validity is defined as “the extent to which the assessment requires students to perform tasks that were included in the previous classroom lessons and that directly represent the objectives of the unit on which the assessment is based”. Thus, the reading materials covered in the test prototypes were taken from the Standard of Competence (SK) and the Basic Competence (KD) of the School-Based Curriculum (KTSP), which regulates English instruction at schools in Indonesia, for Grade VIII students in the first semester, focusing on the reading skill. After that, the test prototypes were submitted for expert judgment; in this case, they were consulted with the researcher’s thesis supervisor and the English teacher with whom the researcher conducted the research. Revisions after the expert-judgment consultation included (1) consistency between the forms of A, B, C, and D written in the instructions and those written in the alternatives of the items in each test prototype; and (2) corrections of spelling errors as well as errors in sentence structure and grammar in the test prototypes.

Then, the test prototypes were tried out on other students having the same characteristics as the students serving as the research subjects. The results of the test try-outs were analyzed in terms of the item indices and the reliability of the test prototypes using ITEMAN 3.00. According to Brown (2004), there are three item indices that should be taken into account before accepting, discarding, or revising items, namely item facility, item discrimination, and distractor efficiency. The three item indices are further explained as follows.

a. Item Facility (IF). Information about the IF of a test item in the analysis result using ITEMAN 3.00 is indicated in Prop. Correct of the Item Statistics. IF is the extent to which a test item is easy or difficult for the intended group of test-takers, reflected in the proportion of students answering the item correctly. According to Henning (in Fulcher and Davidson, 2007), an ideal facility value ranges from 0.3 to 0.7.

b. Item Discrimination (ID). There are two types of ID, namely the ID of a test item and the ID of a test item’s alternatives. For a test item, information about ID in the analysis result using ITEMAN 3.00 is indicated in Point Biserial and Biserial of the Item Statistics. Likewise, the ID of each item’s alternatives is given in Point Biserial and Biserial of the Alternative Statistics. However, Fulcher and Davidson (2007: 103) state that ‘The most commonly used method of calculating item discrimination is the point biserial correlation’. Therefore, this study referred to the point biserial correlation for information about the ID of each test item. ID refers to the extent to which a test item differentiates between test-takers who do well on the test and those who do not. A positive value indicates that the students with higher test scores answer the item correctly, whereas a negative value indicates that it is the students with lower scores who answer the item correctly. A test item with good discriminating power therefore garners correct responses from most of the high-ability group, and its value is positive. According to Henning (in Fulcher and Davidson, 2007), items with an r_pbi of ≥ 0.25 are considered acceptable, while those with a lower value would be rewritten or excluded from the test. Regarding the alternatives, a positive value is preferred for the key, while negative values are preferred for all distractors. In addition, the positive ID value of the key must be higher than the positive ID value of any distractor.

c. Distractor Efficiency (DE). “In multiple choice testing, the intended correct option is called the key and each incorrect option is called a distractor” (Fulcher and Davidson, 2007: 107). Information about DE in the analysis result using ITEMAN 3.00 is indicated in Prop. Endorsing of the Alternative Statistics. DE refers to the extent to which (a) the distractors of a test item lure a sufficient number of test-takers, with more lower-ability test-takers answering the item incorrectly than higher-ability ones, and (b) those responses are fairly evenly distributed across all distractors (Brown, 2004). A distractor is considered good when it is chosen by at least 5% of the total test-takers (BNSP, 2010). Since the numbers of test-takers were 28, 25, and 26 for the pre-test, post-test I, and post-test II respectively, each distractor of those tests should be chosen by at least 2 students.
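For reference, the indices reported by ITEMAN 3.00 correspond to the standard formulas sketched below. The symbols (R, N, M_p, M_q, p, q, s_t, k, σ) are introduced here purely for illustration and do not appear in the ITEMAN output itself, and the program's internal computation may differ in minor details (for instance, in whether the item itself is excluded from the total score when the point biserial is calculated).

\[
IF = \frac{R}{N}, \qquad
r_{pbi} = \frac{M_{p} - M_{q}}{s_{t}}\sqrt{p\,q}, \qquad
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k}\sigma_{i}^{2}}{\sigma_{t}^{2}}\right)
\]

Here R is the number of test-takers answering the item correctly, N is the total number of test-takers, M_p and M_q are the mean total scores of those answering the item correctly and incorrectly, p = R/N (with q = 1 − p), s_t is the standard deviation of the total scores, k is the number of items, σ_i² is the variance of item i, and σ_t² is the variance of the total scores; α is the Cronbach's Alpha reliability coefficient referred to in stage 7 of Figure 7. Under the 5% rule above, a distractor needs at least ⌈0.05 × N⌉ endorsements, which gives ⌈0.05 × 28⌉ = ⌈1.4⌉ = 2 students for the pre-test and likewise 2 students each for post-test I and post-test II.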
The analysis results using the three above-mentioned item indices for the items in the pre-test, post-test I, and post-test II are presented in Table 4 below.

Table 4: Analysis Results using Three Item Indices through ITEMAN 3.00 for Items in the Pre-Test, Post-Test I and Post-Test II