ANALYZING THE QUALITY OF THE FINAL SEMESTER TEST USING ITEMAN SOFTWARE PROGRAM AT THE SECOND YEAR OF SMA NEGERI 1 PURBOLINGGO IN 2013/2014 ACADEMIC YEAR


ANALYZING THE QUALITY OF THE FINAL SEMESTER TEST USING ITEMAN SOFTWARE PROGRAM AT THE SECOND YEAR OF SMA NEGERI 1 PURBOLINGGO IN 2013/2014 ACADEMIC YEAR

By Bagus Alghani

The objectives of this research are to identify the validity, the reliability, the level of difficulty, the discriminating power, and the quality of the alternatives of the final semester test at the second year of SMA Negeri 1 Purbolinggo in 2013/2014 academic year. The data from the final semester test, consisting of 35 questions, were analyzed by using ITEMAN in terms of reliability, level of difficulty, discriminating power, and quality of the options. Before the data were analyzed by using the software, the researcher used a qualitative approach to identify the construct validity, content validity, and face validity.

The results of the research show that: 1) By relying on the traits of reading testing, KTSP (School-Based Curriculum), and the Guidelines for Constructing Multiple Choice Test, it was found that the construct validity is valid and the content validity is valid, but the face validity is not valid. 2) The reliability of the final semester test at the second year of SMA Negeri 1 Purbolinggo in 2013/2014 academic year is categorized as good, with an alpha of 0.448. 3) The level of difficulty of the final semester test at the second year of SMA Negeri 1 Purbolinggo in 2013/2014 academic year can be classified into four categories: good items (30%), very difficult (10%), very easy (20%), and too difficult (40%). 4) The discriminating power of the final semester test at the second year of SMA Negeri 1 Purbolinggo in 2013/2014 academic year is classified into five categories: high discriminating power (25.7%), average/without revising (5.7%), low/need revising (0%), very low/need dropping (51.5%), and negative discrimination (17.1%). 5) The quality of the alternatives of the final semester test at the second year of SMA Negeri 1 Purbolinggo in 2013/2014 academic year is classified into three categories: need revising (60.5%), good enough (24.5%), and

very good (15%). This shows that the quality of the final semester test at the second year of SMA Negeri 1 Purbolinggo in 2013/2014 academic year is moderate.


ANALYZING THE QUALITY OF THE FINAL SEMESTER TEST USING ITEMAN SOFTWARE PROGRAM AT THE SECOND YEAR OF SMA NEGERI 1 PURBOLINGGO IN 2013/2014 ACADEMIC YEAR

By

Bagus Alghani

A Script

Submitted in Partial Fulfillment of the Requirements for the S-1 Degree

in

The Language and Arts Education Department of the Faculty of Teacher Training and Education

ENGLISH EDUCATION STUDY PROGRAM LANGUAGE AND ARTS EDUCATION DEPARTMENT FACULTY OF TEACHER TRAINING AND EDUCATION

UNIVERSITY OF LAMPUNG 2015



CURRICULUM VITAE

Bagus Alghani was born in Raman Utara on March 13, 1994. He is the oldest son of a lovely couple, Drs. Bambang Udara and Dra. Dewi Asiah. He has one sister and one brother, Linda Mayasari and Dimas Trio Saputra.

He was enrolled at TK Aisiyah Taman Fajar, Purbolinggo, in 1997. Then he was registered at the elementary school SDN 3 Tanjung Inten in 1999 and graduated in 2005. After that, he continued his study at SMPN 1 Purbolinggo and graduated in 2008. Next, he continued his study at SMAN 1 Purbolinggo and graduated in 2011. In the same year, he was admitted as a student of the S1 English Department of the Teacher Training and Education Faculty (FKIP) at the University of Lampung.

From July to September 2014, he did his Field Practice Program (PPL) at SMP PGRI 1 Wonosobo, Tanggamus, for three months.


MOTTO

I am excited when someone beats me.


ACKNOWLEDGEMENTS

All praises are only to Alloh SWT, the Almighty God, for the abundant mercy and blessing that enabled the writer to finish his script. This script, entitled “Analyzing the Quality of the Final Semester Test Using ITEMAN Software Program at the Second Year of SMA Negeri 1 Purbolinggo in 2013/2014 Academic Year”, is submitted as a compulsory fulfillment of the requirement for the S-1 Degree at the Language and Arts Education Department of the Teacher Training and Education Faculty of the University of Lampung.

Gratitude and honor are addressed to all the people who have helped the writer to complete this research. Since this research would never have come into existence without the support, encouragement, and assistance of several outstanding people and institutions, the writer would like to express his sincere gratitude and respect to:

1. Ujang Suparman, M.A., Ph.D, as the first supervisor who has contributed and given his invaluable evaluations, comments, and suggestions during the completion of this script.

2. Dr. Ari Nurweni, M.A., as the second advisor, for her assistance, ideas, guidance

and carefulness in correcting the writer’s script, and as the Chief of English

Education Study Program and all lecturers of English Education Study Program who have contributed their guidance during the completion process until accomplishing this script.

3. Dr. Flora, M.Pd., as the examiner, for her support, encouragement, ideas, and suggestions.

4. Dr. Mulyanto Widodo, M.Pd., as the chairperson of the Language and Arts Education Department, for his contribution and attention.

5. Drs. Sutrisno, M.Si., as the Headmaster of SMAN 1 Purbolinggo, for giving the writer permission to conduct the research.

6. The writer’s parents (Drs. Bambang Udara and Dra. Dewi Asiah), my sister (Linda Mayasari), and my brother (Dimas Trio Saputra) for their love, support, motivation, and prayer.

7. The writer’s foster parents (Trimo and Susi Angraini), my sisters (Naila Trisa

Sa’adah and Kaila Putri Triani) for their love, support, motivation, and prayer.

8. My friends, Elisabeth Gracia S., Luh Ayu M., Aria Nugraha, Chairul Ichwan, Ferdian Muhammad, Muhammad Haris, Devrian Mustafa, and all of the members of ED 2011 and all my friends that I cannot mention one by one.

Finally, the writer believes that this script might still be far from perfect. There may be weaknesses in this research. Thus, comments and suggestions are always welcome for better research. Somehow, the writer hopes that this research can give a positive


Bandar Lampung, May 2015

The writer,



LIST OF CONTENTS

Page

ABSTRACT .......................................................... i

CURRICULUM VITAE .................................................. iv

DEDICATION ........................................................ v

MOTTO ............................................................. vi

ACKNOWLEDGEMENTS .................................................. vii

LIST OF CONTENTS .................................................. x

LIST OF TABLES .................................................... xii

LIST OF FIGURES ................................................... xiii

LIST OF APPENDICES ................................................ xiv

I. INTRODUCTION... 1

1.1. Background of the Problems... 1

1.2. Identification of the Problems... 5

1.3. Limitation of the Problems... 6

1.4. The Formulation of the Research Questions... 6

1.5. The Objectives... 7

1.6. The Significance of the Research... 8

II. THEORETICAL BACKGROUND... 9

2.1. Review of Previous Research... 9

2.2. Review of Related Literature... 11

2.2.1. Quality of Test………. 12

2.2.2. Final Semester Test... 20

2.2.3. Multiple Choice Tests... 21

2.2.4. Guidelines for Constructing Multiple Choice Items………… 22

2.2.5. ITEMAN Software Program... 23

2.2.6. Assessing Multiple Choice Tests Using ITEMAN Program... 30


III. RESEARCH METHOD... 32

3.1. Setting of the Research... 32

3.2. Research Design... 32

3.3. Population and Sample... 34

3.4. Data Collecting Technique... 34

3.5. Research Procedures... 35

3.6. Data Analysis... 36

3.7. Hypothesis Testing... 43

IV RESULTS AND DISCUSSION……... 44

4.1 The Results of Final Semester Test………. 44

4.1.1 Validity………. 45

4.1.2 Reliability………. 66

4.1.3 Level of Difficulty………... 66

4.1.4 Discriminating Power………... 68

4.1.5 The Quality of the Alternatives……… 70

4.2 Discussion………... 73

4.2.1 Validity……… 73

4.2.2 Reliability………. 79

4.2.3 Level of Difficulty………... 80

4.2.4 Discriminating Power………... 82

4.2.5 Quality of the Alternatives……… 87

V CONCLUSIONS AND SUGGESTIONS……….. 89

5.1 Conclusions………. 89

5.2 Suggestions………. 92

References... 94


LIST OF TABLES

Page

1. Table 2.1. Criteria of Reliability (Alpha) ..................................... 20
2. Table 2.2. Criteria of Proportion Correct (p) .................................. 25
3. Table 2.3. Criteria of Discrimination (D) ...................................... 26
4. Table 4.1. The Classification of the Final Semester Test in Reading Comprehension ... 46
5. Table 4.2. The Classification of the Final Semester Test (Face Validity) ...... 48
6. Table 4.3. The Classification of Quality of the Alternatives in


LIST OF APPENDICES

Appendices                                                                    Page

Appendix I    The Final Semester Test at the Third Semester in 2013/2014 Academic Year ......... 97
Appendix II   Table of Analysis of Content Validity of the Final Semester Test at the Second Year of SMAN 1 Purbolinggo in 2013/2014 Academic Year ......... 101
Appendix III  Program Semester Semester Ganjil Kelas XI ......... 102
Appendix IV   The Guidelines for Constructing Multiple Choice Items ......... 108
Appendix V    The Output Data of ITEMAN Analysis ......... 111
Appendix VI   Permission Letter from the Dean of FKIP Unila to Conduct the Research at SMAN 1 Purbolinggo ......... 122
Appendix VII  Confirmation Letter Stating the Accomplishment of the Research from the Headmaster of SMAN 1 Purbolinggo ......... 123


LIST OF FIGURES

Page

1. Figure 3.1 An Example of a Data File Using Notepad on Windows ......... 38
2. Figure 3.2 The Output Data from the Final Examination of SMK YADIKA NATAR 2013/2014 ......... 41
3. Figure 4.1 The Classification of the Level of Difficulty of the Final Semester Test in 2013/2014 Academic Year ......... 68
4. Figure 4.2 The Classification of the Discriminating Power of the Final Semester Test in 2013/2014 Academic Year ......... 70
5. Figure 4.3 The Classification of the Quality of the Alternatives of the Final Semester Test in 2013/2014 Academic Year ......... 72
6. Figure 4.4 The Classification of the Level of Difficulty of the Final Semester Test in 2013/2014 Academic Year ......... 82
7. Figure 4.5 The Classification of the Discriminating Power of the Final Semester Test in 2013/2014 Academic Year ......... 86
8. Figure 4.6 The Classification of the Quality of the Alternatives of the


CHAPTER 1 INTRODUCTION

This chapter concerns several subchapters, namely: 1) the background of the problems, 2) the identification of the problems, 3) the limitation of the problems, 4) the formulation of the research questions, 5) the objectives, and 6) the significance of the research, as elaborated in the following sections.

1.1. Background of the Problems

Multiple choice testing is an efficient and effective way to measure students’ ability. Multiple choice tests are published for use in many different schools. The consideration behind this statement is that multiple choice comes as the most common format in standardized tests, including school and national examinations, and most of these tests are mainly made up of multiple choice items. Besides, multiple choice tests may give a more accurate picture of how well students have met the standard.

Even though multiple choice tests have drawbacks, such as that students can guess the answers, the test does not measure deep thinking skills, writing successful multiple choice questions is difficult, and the students cannot organize and express their ideas, this kind of test is still popular because it is truly reliable and objective. This standardized test is also practical. It means that the test is easy to


Based on the researcher’s pre-observation, it was found that several difficulties were encountered by the teachers in SMA Negeri 1 Purbolinggo. In assessing multiple choice tests, there were many crucial things that the teachers had to master, but they faced some difficulties in determining the quality of multiple choice tests.

Based on the pre-interview with some teachers in SMA Negeri 1 Purbolinggo,

they only believed in the test which was made by MGMP (English Teacher

Organization). Because of that, the teachers merely took and administered the tests to the students without prior analysis of their quality. The students were forced to answer all the questions even though the teachers did not know what the validity, the reliability, the level of difficulty, the discriminating power, and the quality of the alternatives were; in general, these characteristics are very important in determining the quality of the test. Most of the teachers in SMA Negeri 1

Purbolinggo never assessed the multiple choice tests given by MGMP after or before

the tests. So, the researcher was interested in analyzing the multiple choice test items

created by MGMP in SMAN 1 Purbolinggo.

Most of the teachers in SMAN 1 Purbolinggo are the members of MGMP.

Sequentially, the teachers have their turns to construct multiple choice tests for the

final semester test, including the final semester test in 2013/2014 academic year.

Basically, multiple choice tests might be somewhat beneficial, if the purpose is to

check on the knowledge of the subject taught before. But, some of the teachers

reported that some multiple choice tests did not restrain the knowledge of the subject



students without using the syllabus in curriculum as guidance. They convey the

subject when they believe that it is from English books, with no prior analysis. Thus,

the content of the test is sometimes not correlated with the materials taught before.

As a matter of fact, the teachers sometimes added some additional competencies to make it easier for the students to understand the subject. It means that the multiple choice tests might have a bad effect on the overall curriculum and instruction. They stated that the multiple choice test was sometimes too easy and sometimes too difficult for the students. Because of that, what the students knew in a subject was cut off from what the multiple choice test measured. Moreover, the distracters in the multiple choice test might not be heterogeneous, which made the test weak. From the cases

mentioned above, it can be considered that the multiple choice test might not

discriminate the more knowledgeable students from the less knowledgeable students.

The teachers were of the opinion that the multiple choice tests created by other teachers might not be well written. Not only that, they also remarked that the multiple choice tests were sometimes not related to the curriculum. It was also uttered by the teachers that the multiple choice tests might not measure deep thinking skills. Because multiple choice items did not allow for creative responses, the students only had a choice of responding (A or B or C or D or E). So, if there was anything more they would like to add or show what they knew beyond what was presented, they could not put forward their ideas. It is obviously true that the statements above indicate that the multiple choice test made by the members of MGMP is considered



The teachers also added that multiple choice tests operated on the assumption that there was only one correct alternative, so the students concentrated on picking out the correct alternative. Occasionally, the test maker believed that the correct alternative was true, but the brightest students might indeed find something different about every alternative, including the correct one, which was then considered wrong. So, the multiple choice tests promoted confusion among the students. In that case, the distracters appeared as plausible solutions to the problem for those students who had not achieved the learning objective. In such a manner, the multiple choice tests promoted guessing among them.

Considering the facts above, the researcher decided to help the teachers

determine the quality of multiple choice tests by using ITEMAN program. ITEMAN

is very important for the teachers taking charge of administering tests in order to be

sure about the quality of the test they use. Consequently, understanding how to

interpret and use information based on student test scores is as important as knowing

how to construct a well-designed test.

ITEMAN is software used to analyze test items and determine which test items are good and which are not, based on the criteria of reliability, discriminating power, level of difficulty, and the quality of the alternatives. The data are analyzed automatically by the software; therefore, the analysis becomes easier and faster for the teachers. The teachers do not need to perform complicated mathematical calculations, since the steps are very simple to follow. Not only that, the software program can be used to analyze an almost unlimited number of testees in a relatively very short time.

As ITEMAN is considered useful, the teachers are more expected to have an



order to utilize the program, the ITEMAN software program should be installed first.

Another fact that motivated the writer to conduct this research was his own experience, which proved that assessing multiple choice tests could be done easily by using the program. Because of that, the writer put an effort into finding some ways to utilize the program as a treatment to promote the assessment of multiple choice tests. Thus, this research was regarded as a facilitative way for the teachers to analyze the final semester test.

The researcher used ITEMAN software program which helped the teachers

determine the quality of the final semester test and prove whether the test had

fulfilled the criteria of a good test or not. Therefore, this research utilized the tool

used to analyze the final semester test at SMAN 1 Purbolinggo in 2013/2014

academic year.

1.2. Identification of the Problems

According to the background of the problems, the researcher found several problems that can be identified, as follows:

a. The multiple choice tests might:

1. Not check on the knowledge of the subject taught before.

2. Have bad effect on overall curriculum and instruction.

3. Be too easy for the students.

4. Be too difficult for the students.

5. Not discriminate the more knowledgeable students from the less knowledgeable students.



6. Not be related to the curriculum.

7. Not measure deep thinking skills.

8. Promote confusion to the students.

9. Promote guessing to the students.

10. Not have heterogeneous distracters.

b. The teachers had difficulties in assessing the multiple choice tests and did not know how to determine the quality of the tests (validity, reliability, the level of difficulty, the discriminating power, and the quality of the alternatives).

1.3. Limitation of the Problems

Considering the identification of the problems, this research is limited to the possibilities that the multiple choice tests might:

a. Not be related to the curriculum.

b. Be too easy for the students.

c. Be too difficult for the students.

d. Not discriminate the more knowledgeable students from the less

knowledgeable students.

e. Not have heterogeneous distracters.

1.4. The Formulation of the Research Questions

In line with the limitation of the problems, the following research questions are formulated:



How is the quality of the final semester test at the second year of SMAN 1

Purbolinggo in 2013/2014 academic year? In relation to:

a. How is the validity of the final semester test at the second year of SMAN 1

Purbolinggo in 2013/2014 academic year? (Validity is not analyzed by

ITEMAN)

b. How is the reliability of the final semester test at the second year of SMAN 1

Purbolinggo in 2013/2014 academic year?

c. How is the level of difficulty of the final semester test at the second year of

SMAN 1 Purbolinggo in 2013/2014 academic year?

d. How is the discriminating power of the final semester test at the second year

of SMAN 1 Purbolinggo in 2013/2014 academic year?

e. How is the quality of the alternatives of the final semester test at the second

year of SMAN 1 Purbolinggo in 2013/2014 academic year?

1.5. The Objectives

In line with the formulation of the research questions, the objectives of this

research are:

To find out how the quality of the final semester test at the second year of SMAN 1 Purbolinggo in 2013/2014 academic year is, specifically to:

a. Find out how the validity of the final semester test at the second year of

SMAN 1 Purbolinggo in 2013/2014 academic year is (Validity is not analyzed

by ITEMAN).

b. Find out how the reliability of the final semester test at the second year of

SMAN 1 Purbolinggo in 2013/2014 academic year is.

c. Find out how the level of difficulty of the final semester test at the second

year of SMAN 1 Purbolinggo in 2013/2014 academic year is.

d. Find out how the discriminating power of the final semester test at the second

year of SMAN 1 Purbolinggo in 2013/2014 academic year is.

e. Find out how the quality of the alternatives of the final semester test at the

second year of SMAN 1 Purbolinggo in 2013/2014 academic year is.

1.6. The Significance of the Research

The findings of the research are expected to be beneficial both theoretically

and practically.

1. Theoretically, as a verification of the previous theories of the quality of

assessment.

2. Practically, this research may be used to help teachers assess the quality of



CHAPTER 2

THEORETICAL BACKGROUND

This chapter provides two major points, namely a review of previous research and a review of related literature. They are elaborated in the following sections.

2.1. Review of Previous Research

In relation to this research, there are several previous studies which have been conducted by other researchers, namely Ariyana (2011), Fitriana (2013), and Ratnaningsih (2009).

Ariyana (2011) conducted research on students of SMP in Grobogan district,

Semarang. She investigated the final semester test at the third semester in a science class.

The purpose of the research was to find out the validity, level of difficulty,

discriminating power, the effectiveness of alternatives, and the reliability. The

method of collecting the data was recording. The quantitative approach was done by

using ITEMAN program. The result of the research showed that the multiple choice

test was 2% very difficult; 20% difficult; 70% average; 4% easy and 4% very easy.

The discriminating power of the test was that 26% was average, 62% was high, 10%

was low, and 2% was very low. The effectiveness of the alternatives was 82%. It



meant that the test had high reliability. Based on the result of the research, the validity

was reasonable but needed revising. Therefore, the multiple choice test had average

level of difficulty, high discriminating power, functional alternatives, and high

reliability.

Fitriana (2013) conducted research on students of MI Sultan Agung at grade

five, Sleman district, Yogyakarta. She investigated the quality of the final semester test.

She used ITEMAN as the tool to determine the quality of the test. The result of the

research revealed that the multiple choice test made by official government

(Dikpora), Sleman district, had high validity. There were 27 questions or 67.5 % of

the test which were valid and 13 questions or 32.5% were not valid. There were 67

out of 120 alternatives that were functional. In addition to the validity, the alpha of

the test was 0.780 meaning that it had high reliability. From the level of difficulty, it

showed that there were 25 questions which were easy. The discriminating power

which was accepted was 22 questions, because only 37.5 % of the multiple choice

tests had good discriminating power.

Ratnaningsih (2009) conducted research on students of UT Pondok Cabe,

Pamulang, South Tangerang town. The paper aimed to analyze multiple choice items

of the End Semester Examination of UT using the program ITEMAN. The data used

were the answer sheets of students taking eight courses in the first and second

semester of 2009. The results showed that the test items used had a pretty good

quality. Average test item difficulties were fair. This was indicated by the mean value

of P which ranged from 0.328 to 0.461. Discrimination index for both semester tests



0.451 for the first semester of 2009 tests and 0.343 to 0.382 for the second semester

of 2009 tests. Meanwhile, the reliability of the test items could be considered good

with values ranging from 0.520 to 0.771. The effectiveness of the alternatives was 62%

- 94%. It meant that the alternatives were functional.

A lot of research has been conducted by using the ITEMAN program. The related studies above mention that the researchers used ITEMAN as a tool to analyze multiple choice tests, with elementary school, junior high school, and university students as the populations and samples of the research. As is known, the validity in ITEMAN is concluded by covering the level of difficulty, the discriminating power, and the proportion of the distracters. This research discerns the validity from the content validity, construct validity, and face validity. Because of that, the researcher analyzed those aspects and investigated a population with differently knowledgeable students and different multiple choice items as the focus of this research.

2.2. Review of Related Literature

For the specific explanation about the analysis of final semester test using

ITEMAN software program, the researcher explains some related literature about

quality of a test, final semester test, multiple choice tests, guidelines for constructing

multiple choice items, ITEMAN software program, and assessment of multiple choice tests using the ITEMAN program.



2.2.1. Quality of Test

One commonly used tool in assessment is a test. That is to assess the outcome

of the learning process. To determine the quality of the test, it is necessary to analyze

the test before the test is given to the participants of the test. According to Arikunto

(2006:205), item analysis is a systematic procedure, which will provide information

that is very specific to the test items arranged. Nunnally (1978:301) states that item

analysis is extremely useful. This furnishes a variety of statistical data regarding how

subjects responded to each item and how each item relates to overall performance.

From the two definitions above, it can be concluded that the analysis of the test is a

systematic activity that involves the collection and processing of data in the form of a

test that is done in order to obtain information to determine a conclusion about the

quality of the test.

There are two approaches that can be used to determine the quality of a test,

namely qualitative and quantitative approaches (Osterlind, 1998:84). A qualitative

approach is done by reviewing items and should be done before the test is tested. The

thing which is emphasized is the assessment from the aspects of material,

construction, and language. While the quantitative approach is a method of test item

review based on empirical data obtained through participant responses. Item

characteristics are a quantitative parameter. In determining the characteristics of the

item, there are generally three things which should be considered, namely: (1) level of

difficulty, (2) discriminating power, and (3) effectiveness of distracters. These three

characteristics of the item jointly determine the quality of the item. Linn & Gronlund



reliability, and usability. Validity means the accuracy of the interpretation of the

results of the measurement. Reliability means the consistency of the result

measurement, and usability means the procedure is practical.

a. Validity

If the result of a test is not considered valid, then the test is meaningless. If it

does not measure what it is supposed to measure, the result cannot be used to answer the

research question, which is the main aim of the research. Validity is the extent to

which an instrument measures what it is supposed to measure (Carmines and Zeller,

1979:17). According to Lynne (2004:31), validity, reliability’s partner term, refers to the ability of the test to measure what it is required to measure, its suitability to the

task at hand. Besides, according to Wiggins & McTighe (2005:194), validity refers to

the meaning the raters can and cannot properly make of specific evidence, including

traditional test-related evidence. For the criteria of validity, in a very general sense, a

test is valid for anything with which it correlates (O’Neill, 2009:23). Therefore, validity almost seems like an afterthought, in some ways drawing upon the overall

history of validity in which the test authors are the supreme authority about the

validity of their tests.

In ITEMAN software program, the measurement of validity is not covered

explicitly. To know the validity of a test using ITEMAN, the value covers the level of

difficulty, discriminating power, and proportion of the alternatives (Salirawati,

2011:28). Then, the conclusion from the three aspects gives a decision whether the



There are three types of validity used in this research: construct validity,

content validity, and face validity. This research uses these types of validity due to

the fact that in ITEMAN, the validity is not statistically computed. Consequently,

construct validity, content validity, and face validity help the researcher determine the

validity more accurately.

1) Construct Validity

This validity is concerned with the underlying theoretical construct in a test. The

term “construct validity” refers to the overall construct or trait being measured (O’Neill, 2009:26). If a test is supposed to be testing the construct of speaking, it should indeed be testing speaking, rather than listening, reading, writing, vocabulary,

and grammar. Therefore, the term construct validity has been used both for

correspondence at the element level and at the relation level (Brinberg & McGrath,

1985:115).

a. Traits of Listening

Listening is one of the most fundamental skills in learning a language, because communication will not run well if this basic skill is not mastered, especially for ESL learners. Listening is an activity of paying attention and trying to get

meaning from something through ears. In listening comprehension, the forms of

the test that are given to the testees are short utterances, dialogues, talks, and

lectures (Heaton, 1975:8). It indicates that the listener must digest the message of



comprehension, he defines that an effective way of developing the listening skill is

through provision of carefully selected practice material. Such material is in many

ways similar to that used for testing listening comprehension. He considers that it

is possible to develop listening ability if the practice material is not dependent on

spoken responses and written exercises.

Based on the statements above, listening is an activity conducted by the listener of actively paying attention to and understanding the meaning of the words the speaker says.

b. Traits of Speaking

Speaking is an action of conveying ideas and thoughts. It takes the part of

pronunciation, vocabulary, grammar, fluency and comprehension altogether

(Haris, 1974:84). According to Heaton (1975:8), to test speaking ability, the test is

usually in the form of an interview, a picture description, role play, and a

problem-solving task involving pair work and group work. Therefore, a speaking test can take place if the speaker uses verbal symbols like words and non-verbal symbols like gestures and body language to convey the intention.

c. Traits of Reading

Reading deals with how the readers receive the meaning through the written

symbols and process them into their mind. Reading is one of the important skills

which are needed by students from elementary school to university. Heaton



sounds with responding graphic symbols. He defines reading comprehension as

the questions which are set to test the students’ ability to understand the gist of a text and to extract key information on specific points in the text. It indicates that

comprehending the reading text involves connecting information from the written

message to arrive at the meaning of the text.

Comprehension is very prominent in this case. Because of that, traits of

comprehending texts which are evaluated indirectly put a heavier burden on the

testing procedures which the tester decides to use and may have an effect on the

score of the test taker (Shohamy, 1985:103).

To find the construct validity of the reading test, the final semester test was

formulated by the concept of reading comprehension. According to Davenport (2007:

61), common types of questions found in reading comprehension are included as

follows:

1. Identifying main idea, main point, author purpose or an alternate title for

the passage.

2. Recognizing the tone of the passage or identify the style.

3. Comprehending information directly stated in the passage (finding

supporting detail).

4. Answering relational questions about the author’s opinion or idea, even if not stated directly.

5. Recognizing the structural methodology employed to develop the passage,



6. Extending limited information given by the author to a logical conclusion

using inference (inference meaning).

This research is focused on main idea, supporting detail, inference meaning,

vocabulary, and reference.

d. Traits of Writing

Writing is a productive skill in the written form. Writing is one of the

language skills that are used for indirect communication such as, letter, note, short

message, and invitation. Through writing, students can express their understanding of

problems or ideas. Writing is considered the most difficult skill to master (Shohamy,

1985:188). Moreover, Heaton (1975:135) says that this skill needs not only

grammatical and rhetorical devices, but also conceptual and judgmental elements.

According to Heaton (1974:135),

five components that are necessary for testing the writing skills are:

1. Language use: the ability to write correct and appropriate sentences.

2. Mechanical skills: the ability to use correctly those conventions peculiar to the

written language – e.g. punctuation, spelling.

3. Treatment of content: the ability to think creatively and develop thoughts,

excluding all irrelevant information.

4. Stylistic skills: the ability to manipulate sentences and paragraphs, and use



5. Judgment skills: the ability to write in an appropriate manner for a particular

purpose with a particular audience in mind, together with an ability to select,

organize and order relevant information.

e. Traits of Grammar

Grammar is one of the language components. In testing grammar, multiple

choice test is one of the most common types. To test awareness of the grammatical

features of the language using the objective test (multiple choice test), the test

evaluates the ability to recognize or produce correct forms of language rather than

the ability to use language to express meaning, attitude, emotion, etc (Heaton,

1975:34). It refers to pattern of form and arrangement by which the words are put together, because, according to DeCapua (2008:1), grammar is a set of rules. One

must also know how the words work together in English sentences, not only

knowing English words and their meanings (Allen, 1983:2). Therefore, someone

using language has to know the grammatical pattern of the language.

f. Traits of Vocabulary

If students cannot master vocabulary, they will fail to use the language in both oral and written form. Therefore, in order to be able to master the language, the students must learn vocabulary well. They need to know not only a certain number of words but all the vocabulary needed to master the language and to use the words properly in vocabulary testing. Wallace (1986:1) states that vocabulary is



are designed so that they test knowledge of words which, though frequently found in

many English textbooks, are rarely used in ordinary speech. Subsequently, a

careful selection, or sampling, of lexical items for inclusion in vocabulary test is

the most crucial task.

2) Content Validity

Content validity represents the correlation between the test and exact

materials, in terms of construction. Whereas content validity is concerned with

identifying the relationship between test tasks and specific learned content, construct

validity attempts to make the connection between test tasks and theoretical constructs

of language proficiency regardless of learned materials (Azwar, 2000:45). In the case

of semester test, of course, there are no test specifications, and the teachers may

simply need to check the teaching syllabus or the course textbook to see whether each

item is appropriate for that examination.

3) Face Validity

Although this validity is considered as a weak measure, its importance cannot

be underestimated. Face validity is very important for holistic scores. Holistic tests

that measure writing look at actual pieces of writing to do so (Lynne, 2004:35).

According to O’Neill (2009:26), face validity is whether a test looks like it would measure the desired ability or trait. So, if the test lacks face validity, it may not work as it



b. Reliability

If the results of a test are replicated consistently, they are reliable. In

psychometrics, reliability is a technical measure of consistency (Lynne, 2004:31).

Reliability is the degree to which a test consistently measures whatever it measures

(Crocker & Algina, 1986:105). Therefore, any random influence which tends to make

measurements different from occasion to occasion or circumstance to circumstance is

a source of measurement error (Nunnally, 1978:248). In ITEMAN software program,

Alpha is the measurement of reliability of a test.

There are three index ranges that can be followed to determine whether the reliability of a test is very bad, sufficient, or very good, as follows:

Table 2.1 Criteria of Reliability (Alpha)

Criteria              Index           Classification   Decision
Reliability (Alpha)   0.000 - 0.400   Low              Very bad
                      0.401 - 0.700   Average          Sufficient
                      0.701 - 1.000   High             Very good

Source: Ngadimun (2004:8)
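To make the computation behind Table 2.1 concrete, the following is a minimal Python sketch (not the ITEMAN implementation) of Cronbach's alpha for dichotomously scored items, classified with the criteria above. The 0/1 response matrix is invented purely for illustration.

# A minimal sketch of Cronbach's alpha with the classification of Table 2.1.
# This is not ITEMAN's code; the 0/1 response matrix below is invented.
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """scores: examinees x items matrix of item scores (0/1 for multiple choice)."""
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of the total scores
    return (k / (k - 1)) * (1.0 - item_vars.sum() / total_var)

def classify_alpha(alpha: float) -> str:
    if alpha <= 0.400:
        return "Low / very bad"
    if alpha <= 0.700:
        return "Average / sufficient"
    return "High / very good"

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    responses = rng.integers(0, 2, size=(30, 35))   # 30 examinees, 35 items (invented)
    alpha = cronbach_alpha(responses)
    print(f"Alpha = {alpha:.3f} -> {classify_alpha(alpha)}")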

c. Usability

A test is said to have a high usability when the test is practical. That is, the

test is easy to implement, easy to score, easy to administer, and also provided with clear and complete instructions so that it may be given by others.

2.2.2. Final Semester Test

Final semester test is an activity that is carried out by educators to measure



all indicators that represent all of the standard competence in the semester

(Permendiknas No. 20, 2007 on the Standard Assessment). Based on the article, it

asserts that the final semester test given by educators is under the coordination of the

educational unit. Because of that, the educators or teachers have to conduct an

assessment of their students under the coordination of the school as an educational

unit. The provisions indicate that the teachers have an important role to determine the

progress of the students through the final semester test. This is relevant to the characteristics of educational evaluation, in which the most ideal evaluator of education is the teacher as an educator.

In fact, traditional assessment is still implemented and used in final semester

test. Multiple choice tests are the test which is still counted on by MGMP. This type

of assessment is not the only way or the best way to evaluate students, but is the most

common way used to measure the student learning process.

2.2.3. Multiple Choice Tests

This kind of test requires the students to pick out the correct answer from

several alternatives provided by the test maker. Over the last decade, large student

numbers, reduced resources and increasing use of new technologies have led to the

increased use of multiple choice questions as a method of assessment in higher

education courses (Nicol, 2007:53). According to Wiggins & McTighe (2005:338),

multiple choice tests are indirect measures of performance. A standard multiple

choice test item consists of two basic parts: a problem (stem) and a list of suggested solutions (alternatives). The stem may be in the form of a direct question or an incomplete statement, and the list of alternatives contains one correct or best

alternative (answer) and a number of incorrect or inferior alternatives (distracters)

(Crocker & Algina, 1986:76). For those students who have not achieved the

objectives, the distracters appear as plausible solutions. On the contrary, for those students who have achieved the objectives, only the answer should appear plausible and the distracters must emerge as implausible solutions.

The alternatives may be complete sentences, sentence fragments, or even

single words. In fact, the multiple choice items can assume a variety of types,

including absolutely correct, best answer, and those with complex alternatives

(Osterlind, 1998:20).

2.2.4. Guidelines for Constructing Multiple Choice Items

When test writers refer to style, they usually mean the expression of ideas in a

smooth, orderly, pleasing manner. Each test writer develops an individual style of

expression that allows for a personal presentation of his or her own thoughts and

emotions. For the analyst, however, style connotes something different. Editorial style

refers to the consistent use of a set of rules and guidelines. The rules and guidelines

prescribe a consistent use of punctuation, abbreviations, and citations, a uniform and

attractive format for tables, graphs, and charts, and a correct form for the many other

elements that constitute written communication (Osterlind, 1998:161).

There was one study by Haladyna and Downing (1989a, 1989b), cited in

Haladyna (2004:98) involving an analysis of 46 textbooks and other sources on how



Author consensus existed for many of these guidelines. But for other guidelines, a

lack of a consensus was evident. The next study by them involved an analysis of

more than 90 research studies on the validity of the item-writing guidelines. Only a

few guidelines received extensive study. Nearly half of the 43 guidelines received no

study at all. Since the appearance of these two studies and the 43 guidelines,

Haladyna repeated this study. They examined 27 new textbooks and more than 27

new studies of the guidelines. From this review, the guidelines were reduced to 31 guidelines, which were used in this research. In such a manner, he stated that there are two categories of items depending on whether the item conforms to the guidelines or not, that is,

flawed and non-flawed items. Because of that, these guidelines help the researcher

determine the validity of the final semester test, especially in terms of face validity.

This research has a set of multiple choice item-writing guidelines that apply to

all multiple choice formats taken from Haladyna’s item-writing guidelines. So, the researcher implements the guidelines judiciously but not rigidly in determining how

the face validity of the final semester test is.

2.2.5. ITEMAN Software Program

The use of ITEMAN remains widespread, although some consider it an outdated system. ITEMAN is an accurate software program with beginnings dating back to the 1960s (Nelson, 2012). For quite a few years, it was designed to be utilized

for traditional item and test analysis. As a complete and reliable workhorse, it has had



The ITEMAN software program is publicized as a Classical Item Analysis program. It can not only estimate and record test scores, but also examine multiple choice questions. The version of the program is 3.50, available on the internet at www.assess.com. There are four statistical measures offered in the program (ASC,

1989-2006:13): Proportion Correct, Discrimination Index, Biserial and Point Biserial

Correlation Coefficients.

Here are brief descriptions of the research’s commonly used terms, to allow for better understanding when they appear in the remainder of the paper. None of these formulas is applied manually in practice, because ITEMAN computes them automatically, except for validity.

Proportion Correct

Probably the most popular item-difficulty index for dichotomously scored test

or multipoint items is the p-value (Osterlind, 1998:266). It is simply the proportion

(or percentage) of students taking the test who answered the item correctly

(Haladyna, 2004:207). This value is generally reported as a proportion (rather than

percentage), ranging from 0.0 to 1.0. A value of 0.0 would indicate that no one

answered the item correctly. A value of 1.0 would indicate that everyone answered

the item correctly.

There are five index ranges that can be followed to determine whether a test item is rejected, revised, or good, as follows:

Table 2.2 Criteria of Proportion Correct (p)

Criteria                 Index           Classification   Decision
Proportion Correct (p)   0.000 - 0.099   Very difficult   Rejected/total revising
                         0.100 - 0.299   Difficult        Revised
                         0.300 - 0.700   Average          Good
                         0.701 - 0.900   Easy             Revised
                         0.901 - 1.000   Very easy        Rejected/total revising

Source: Ngadimun (2004:8)
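As an illustration of how the proportion correct can be computed from an answer sheet and classified with Table 2.2, here is a minimal Python sketch; it is not ITEMAN itself, and the answer key and responses below are invented.

# A minimal sketch of the proportion-correct (p) statistic and the
# classification of Table 2.2. The key and responses below are invented.
import numpy as np

def proportion_correct(responses: np.ndarray, key) -> np.ndarray:
    """responses: examinees x items array of chosen letters; key: correct letter per item."""
    return (responses == np.asarray(key)).mean(axis=0)   # one p-value per item

def classify_p(p: float) -> str:
    if p < 0.100:
        return "Very difficult - rejected/total revising"
    if p < 0.300:
        return "Difficult - revised"
    if p <= 0.700:
        return "Average - good"
    if p <= 0.900:
        return "Easy - revised"
    return "Very easy - rejected/total revising"

if __name__ == "__main__":
    key = ["A", "C", "B", "E"]
    answers = np.array([["A", "C", "D", "E"],
                        ["A", "B", "B", "E"],
                        ["C", "C", "B", "A"],
                        ["A", "C", "B", "E"]])
    for i, p in enumerate(proportion_correct(answers, key), start=1):
        print(f"item {i}: p = {p:.2f} ({classify_p(p)})")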

Discrimination Index

The size of the discrimination index is informative about the relation of the

item to the total domain of knowledge or ability, as represented by the total test score

(Haladyna, 2004:211). This is also known as Differentiation Index. This statistic is a

measure of each test question’s ability to differentiate between high scoring and low scoring students. This is computed as: the number of people with highest test scores

(top 27%) answering the item correctly minus the number of people with lowest

scores (bottom 27%) answering the item correctly, divided by the number of people

in the larger of the two groups.

Disc. Index = PHigh – PLow

Where PHigh is the proportion of examinees in the upper 27% of the score

distribution answering the item with the correct/keyed answer and PLow is the same

proportion in the lower 27% group.

The higher the number, the more the question is able to discriminate the

higher scoring people from the lower scoring people. Possible values range from -1.0 to 1.0. A score of -1.0 would indicate that the lower 27% of the group all answered the question correctly, and the upper 27% of the group all answered the question

incorrectly. A score of 1.0 indicates that the upper 27% of the group all answered the

question correctly and the lowest 27% of the group answered the question incorrectly.

Negative discrimination would signal a possible key error (Haladyna, 2004:228).

There are four indexes that can be followed to determine whether a test item is

rejected, revised, or accepted, as follows:

Table 2.3 Criteria of Discrimination (D)

Criteria             Index           Classification   Decision
Discrimination (D)   D ≤ 0.199       Very low         Rejected/total revising
                     0.200 - 0.299   Low              Revised
                     0.300 - 0.399   Average          Accepted
                     D ≥ 0.400       High             Accepted

Source: Ngadimun (2004:8)
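The sketch below illustrates the upper/lower 27% computation of the discrimination index described above, together with the classification of Table 2.3; it is not ITEMAN's code, and the scored data are invented.

# A minimal sketch of the discrimination index D = P_high - P_low using
# upper and lower 27% groups, with the classification of Table 2.3.
# This is not ITEMAN's code; the 0/1 scores are invented.
import numpy as np

def discrimination_index(correct: np.ndarray) -> np.ndarray:
    """correct: examinees x items matrix of 0/1 item scores."""
    totals = correct.sum(axis=1)
    order = np.argsort(totals)
    n_group = max(1, int(round(0.27 * len(totals))))   # size of each 27% group
    low, high = order[:n_group], order[-n_group:]
    p_high = correct[high].mean(axis=0)                # proportion correct in the upper group
    p_low = correct[low].mean(axis=0)                  # proportion correct in the lower group
    return p_high - p_low

def classify_d(d: float) -> str:
    if d <= 0.199:
        return "Very low - rejected/total revising"
    if d <= 0.299:
        return "Low - revised"
    if d <= 0.399:
        return "Average - accepted"
    return "High - accepted"

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    scores = rng.integers(0, 2, size=(30, 35))         # 30 examinees, 35 items (invented)
    for i, d in enumerate(discrimination_index(scores), start=1):
        print(f"item {i}: D = {d:+.2f} ({classify_d(d)})")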

Item-Total Correlation

This is reported as two correlation coefficients, which are also known as Discrimination Coefficients (ASC, 1989-2006:13).

1. Biserial Correlation Coefficient

It is closely related to the point biserial correlation, with an

important difference. The distinction between these two measures exists

in the assumptions. Whereas the point-biserial statistic presumes that one

of the two variables being correlated is a true dichotomy, the biserial

correlation coefficient assumes that both variables are inherently continuous and that the underlying distribution for both variables is normal (Osterlind, 1998:282). For computational

purposes, however, one of the variables has been arbitrarily divided into

two groups, one low and the other high. In item analysis, the two groups

are examinees who responded correctly to a given item and those who did

not. In other words, it is a measurement of how getting a particular

question correct correlates to a high score (or passing grade) on the test.

Possible values range from -1.0 to 1.0. A score of -1.0 would indicate that

all those who answered the question correctly scored poorly on (or failed)

the test. A score of 1.0 would indicate that those who answered the

question correctly scored well on (or passed) the test.

2. Point Biserial Correlation Coefficient

One index of discrimination is the point-biserial correlation

coefficient. As a measure of correlation, the point-biserial coefficient estimates the degree of association between two variables: a single test

item and a total test score (Haladyna, 2004:211). This statistic is a

measure of the capacity of a test item (question) to discriminate between

high and low scores. In other words, it is how much predictive power an

item has on overall test performance. Possible values range from -1.0 to

1.0 (the maximum value can never reach 1.0, and the minimum can never

reach -1.0). A value of 0.6 would indicate the question has a good

predictive power, i.e., those who answered the item correctly received a higher average grade compared to those who answered the item incorrectly. A value of -0.6 would indicate the question has a poor

predictive power, i.e., those who answered the item incorrectly received a

higher average grade compared to those who answered the item correctly.
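The following minimal Python sketch shows how the point-biserial correlation between a single dichotomous item and the total test score can be computed with the standard formula; it only illustrates the statistic described above, is not ITEMAN's implementation, and uses invented data.

# A minimal sketch of the point-biserial correlation between one 0/1 item
# and the total test score. Not ITEMAN's code; the response data are invented.
import numpy as np

def point_biserial(item: np.ndarray, total: np.ndarray) -> float:
    """item: 0/1 vector for one item; total: total score per examinee."""
    p = item.mean()                          # proportion answering the item correctly
    q = 1.0 - p
    mean_correct = total[item == 1].mean()   # mean total score of those who got it right
    mean_all = total.mean()
    sd = total.std(ddof=0)                   # standard deviation of the total scores
    return (mean_correct - mean_all) / sd * np.sqrt(p / q)

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    responses = rng.integers(0, 2, size=(30, 35))   # 30 examinees, 35 items (invented)
    totals = responses.sum(axis=1)
    for i in range(responses.shape[1]):
        r = point_biserial(responses[:, i], totals)
        print(f"item {i + 1}: point-biserial = {r:+.3f}")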

The following statistics are provided by ITEMAN for each scale (subtest)

analyzed (ASC, 1989-2006:16-18):

1. N of Items. The number of items in the scale that are included in the analysis.

2. N of Examinees. The number of examinees that are included in the analysis

for the scale.

3. Mean. The average number of items on each scale that were answered

correctly.

4. Variance. The variance of the distribution of examinee scores on each scale.

5. Std. Dev. The standard deviation of the distribution of examinee scores for

each scale.

6. Skew. The skewness of the distribution of examinee scores for each scale.

The skewness gives an indication of the shape of the score distribution. A

negative skewness indicates that there is a relative abundance of scores at the

high end of the scale distribution. A positive skewness means that there is a

relative abundance of scores at the low end of the distribution. A skewness of

zero means that the scores are symmetrically distributed about the mean.

7. Kurtosis. The kurtosis of the distribution of examinee scores for each scale. It gives an indication of the peakedness of the score distribution relative to that of a normal distribution. A positive value indicates a more peaked

distribution; a negative value indicates a flatter distribution. The kurtosis of a

normal distribution is zero.

8. Minimum. The lowest score on each scale for any examinee.

9. Maximum. The highest score on each scale for any examinee.

10. Median. The examinee score at the fiftieth percentile for each scale. It is thus

the score that half of the examinees scored at or below.

There were 32 examinees in the data file.

Scale Statistics
----------------

Scale:            0
---------------------------
N of Items              30
N of Examinees          32
Mean                21.906
Variance             6.085
Std. Dev.            2.467
Skew                -1.504
Kurtosis             3.420
Minimum             13.000
Maximum             25.000
Median              22.000
Alpha                0.476
SEM                  1.786
Mean P               0.730
Mean Item-Tot.       0.294
Mean Biserial        0.445
Max Score (Low)         21
N (Low Group)           12
Min Score (High)        24
N (High Group)          10

11. Alpha. It is an index of the homogeneity of each scale. It can range in value from 0.0 to 1.0, and it will be high to the extent that the items are homogeneous, that is, when the scale is designed to measure a single trait. The alpha value is usually considered to be

a lower-bound estimate of the reliability of a scale.

12. SEM. The standard error of measurement for each scale. It is an estimate of

the standard deviation of the errors of measurement in the scale scores.

13. Mean P. The average proportion correct across all items on the scale for

scales composed of dichotomously scored items.

14. Mean Item-Tot. The average point-biserial correlation across all the items in

the scale.

15. Mean Biserial. The average biserial correlation across all of the items on the

scale.

The statistical measurements listed above are the most widely used terms to

assess multiple choice questions. The purpose of these reports is to help evaluate the

quality of test items, and tests as a whole, by examining their psychometric

characteristics.
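To illustrate how the scale statistics listed above relate to the raw score distribution, here is a minimal Python sketch that reproduces the descriptive part of such a report (mean, variance, standard deviation, skewness, excess kurtosis, minimum, maximum, and median of the total scores); it is not ITEMAN itself, and the response matrix is invented.

# A minimal sketch of the descriptive scale statistics reported by ITEMAN.
# Not ITEMAN's code; the 0/1 response matrix below is invented.
import numpy as np

def scale_statistics(correct: np.ndarray) -> dict:
    """correct: examinees x items matrix of 0/1 item scores."""
    totals = correct.sum(axis=1).astype(float)
    mean, sd = totals.mean(), totals.std(ddof=0)
    z = (totals - mean) / sd
    return {
        "N of Items": correct.shape[1],
        "N of Examinees": correct.shape[0],
        "Mean": mean,
        "Variance": totals.var(ddof=0),
        "Std. Dev.": sd,
        "Skew": (z ** 3).mean(),
        "Kurtosis": (z ** 4).mean() - 3.0,   # excess kurtosis: 0 for a normal distribution
        "Minimum": totals.min(),
        "Maximum": totals.max(),
        "Median": float(np.median(totals)),
    }

if __name__ == "__main__":
    rng = np.random.default_rng(3)
    responses = rng.integers(0, 2, size=(32, 30))    # 32 examinees, 30 items (invented)
    for name, value in scale_statistics(responses).items():
        print(f"{name:<16}{value:>10.3f}" if isinstance(value, float) else f"{name:<16}{value:>10}")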

2.2.6. Assessing Multiple Choice Tests Using ITEMAN Program

When the test analyzed by ITEMAN is composed of multiple scales, the items

are assigned to the scales using the inclusion codes. This means that a statistical analysis of the test is provided in the output data of ITEMAN. Particularly, the exemption

of file capability in ITEMAN gives an opportunity to the examinees to re-analyze

data of the multiple choice tests if students find that they want to take into account of



circumstances for giving credit to more than one alternative include poorly phrased

questions, conflicting source information, or an indication of additional problems

from a previous analysis. No single response is considered correct and the item has

no influence on the total score (ASC, 1989-2006:3). According to Surapranata

(2006), an alternative is considered functional if it is chosen by at least 5% of the examinees.

ITEMAN analyzes scales containing either dichotomously scored or

multipoint items. The program can work only with multiple choice items. It is

relatively easy to analyze test items using the ITEMAN program.
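As a small illustration of the 5% rule quoted from Surapranata (2006), the Python sketch below counts how often each alternative was chosen and flags it as functional or not; it is not ITEMAN's code, and the answer letters are invented.

# A minimal sketch of the distractor-functionality check: an alternative is
# treated as functional if at least 5% of examinees choose it (Surapranata, 2006).
# Not ITEMAN's code; the chosen letters below are invented.
import numpy as np

OPTIONS = ("A", "B", "C", "D", "E")

def alternative_proportions(responses: np.ndarray):
    """responses: examinees x items array of chosen letters."""
    n = responses.shape[0]
    for item in range(responses.shape[1]):
        props = {opt: (responses[:, item] == opt).sum() / n for opt in OPTIONS}
        yield item + 1, props

if __name__ == "__main__":
    answers = np.array([["A", "B"], ["A", "C"], ["B", "B"], ["A", "B"],
                        ["E", "D"], ["A", "B"], ["C", "B"], ["A", "B"]])
    for item, props in alternative_proportions(answers):
        for opt, p in props.items():
            status = "functional" if p >= 0.05 else "non-functional"
            print(f"item {item} option {opt}: {p:.2f} ({status})")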

2.2.7. The Hypotheses

Based on the theories, the researcher formulated the hypotheses as follows:

H0 : The final semester test has not fulfilled the criteria of a good test, that is,

has bad validity, low reliability, very easy or difficult level of difficulty,

very low discriminating power, and non-functional alternatives.

H1 : The final semester test has fulfilled the criteria of a good test, that is, has

good validity, high reliability, average level of difficulty, high discriminating power, and functional alternatives.



CHAPTER 3 RESEARCH METHOD

In this chapter, the method of the research is discussed. The parts of

methodology such as: setting of the research, research design, population and sample,

data collecting techniques, research procedure, data analysis, and hypothesis testing

are explained further.

3.1. Setting of the Research

The researcher chose SMA Negeri 1 Purbolinggo as the research place

because this was one of the developing schools in East Lampung district that could be

reached easily by the researcher. He analyzed the final semester test which had been taken by the second grade students in the third semester. As for the time of the research, he prepared the proposal, determined the object of the research, determined the subject, approached the school and the teachers, and sought permission from the headmaster and the English teachers to carry out the research in that school at one specific time.

3.2. Research Design

This research used quantitative and qualitative approaches. It is obviously true



information from computer records about the frequency of ‘hits’ in the use of web-based course materials (Robinson, Spratt & Walker, 2004:6). Meanwhile, qualitative

research focuses on a process of research that involves questions and procedures, data typically collected in the participant’s setting, data analysis inductively building from particulars to general themes, and the researcher making interpretations of the meaning

of the data (Creswell, 2009:70). The writer chose this research design because he

tried to investigate whether the final semester test had fulfilled the criteria of a good

test or not. So, every question in the multiple choice tests was evaluated

quantitatively and qualitatively.

From the observation that was done among the items, the use of ITEMAN

program for assessing the final semester test would be elicited. The design of this

research was descriptive and evaluative. This means that the researcher described the

result of an evaluation on an object which was based on the standard criteria.

From the explanations above, this study was planned and done as quantitative

and qualitative study that aimed at finding the use of ITEMAN software program and

the assessment of the final semester test.

The researcher used one class. The class comprised students with different abilities in English. The students had already been given the test by the school. The results of the test were simply taken from the school, which means that the researcher already had the data. Subsequently, after the results of the test had been obtained, the researcher



3.3. Population and Sample

Population is all cases, situations or individuals who share one or more

characteristics (Nunan, 1992:231). The population of this research was the second

grade students of SMA Negeri 1 Purbolinggo. There were 252 students for the second

grade in SMA Negeri 1 Purbolinggo.

After determining the population, the researcher had to pick out the sample of

this research. The sample is XI IPA 2. There were 30 students in XI IPA 2. This class

was taken by using purposive sampling. Purposive sampling, also known as judgmental, selective, or subjective sampling, is a type of non-probability sampling

technique. Non-probability sampling focuses on sampling techniques where the units

that are investigated are based on the judgment of the researcher (Patton, 2002:230).

The researcher needed to get a group of students who had the lowest score among

others as the sample. The purpose was to determine the quality of the final semester

test more accurately. Since the final semester test made by MGMP has been used for

years by the school, the test has been considered good. The researcher needed to know whether the low scores of the weakest group were caused by the test itself, by the students' ability, or by the learning process. Consequently, the researcher chose XI IPA 2

as the sample of this research.

3.4. Data Collecting Technique

The data were collected from the final semester test created by MGMP in

2013/2014 academic year. The researcher took the students' answers and the test from the school. Then, the test was analyzed using the ITEMAN software program.

3.5. Research Procedures

The researcher checked the quality of the final semester test after the students’ answers and question sheets had been obtained. The instrument was the final semester

test; each item had five options: A, B, C, D, and E. Then, the researcher analyzed

the test.

There were several procedures to make the research run well. The procedure

of this research was as follows:

1. Determining the problems

The problems were formulated to be a foundation of this research.

2. Determining and selecting the population and the sample

The population of this research was the second grade of SMA Negeri 1

Purbolinggo. The researcher took one class that contained 30 students.

The sample of this research was XI Science 2 class at the third

semester.

3. Determining the test

The test was taken from the final examination of semester three made by MGMP-SMA LAMPUNG TIMUR for the 2013/2014 academic year.

4. Assessing the test

Before the final semester test was examined quantitatively, the test

was analyzed by using qualitative approach to find out the construct

validity, content validity, and face validity. Then, the test was examined quantitatively by using the ITEMAN software program. The test consisted of 35 multiple choice items, and the

students were given 60 minutes to answer.

3.6. Data Analysis

Data analysis is the process of organizing the data in order to meet the needs of the research. The purpose of the data analysis is to determine the quality of the final semester test at the second year of SMAN 1 Purbolinggo in 2013/2014 academic year. The data of the research were examined by using quantitative and

qualitative approaches.

In order to know the quality of the final semester test at the second year of

SMAN 1 Purbolinggo in 2013/2014 academic year, the researcher analyzed the test

using traits of language skills and aspects of language, KTSP (School-Based

Curriculum), Guidelines for Constructing Multiple Choice Test, and ITEMAN

software program.

Before the test was analyzed by using the ITEMAN software program, the researcher evaluated it by utilizing the traits of language skills and aspects of language, KTSP, and the Guidelines for Constructing Multiple Choice Test.

Traits of language skills and aspects of language were applied to find out the

construct validity of the test. This concerned whether the test was a true reflection of the theory of the trait, in this case language, which was being measured. For content

validity, the final semester test was analyzed against KTSP, so the relationship between the test items and the curriculum could be examined. For analyzing face validity, the Guidelines for Constructing Multiple Choice Test helped the

researcher determine the validity which was very important for holistic scores.

After analyzing the final semester test using the three instruments above, the researcher

conducted the analysis of the data by using the steps of ITEMAN program. The

following are the steps to enter the data using a new file (Suparman, 2011):

1. Click Start

2. Select Program

3. Select Accessories

4. Choose and click Notepad

5. Save/click File

6. Select and click Save as, then name the data file, for example: MIDTEST (make sure the file name does not exceed 8 letters/numbers)

7. Start data entry

8. The data appear as shown in the example of the input data (Figure 3.1).

ITEMAN requires that the input data file be formatted in ASCII (text-only)

files. Most data files produced by optical scanning devices are very close to the

format that ITEMAN requires, with the exception of the 5 lines that must be added at

the beginning. These lines (Figure 3.1) contain the control line, the key, the number of alternatives, and the other item information required by the program.

Figure 3.1. An example of a data file using Notepad on Windows (Source: Ngadimun, 2004)
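To illustrate how such a data file could be prepared outside Notepad, the following short Python sketch assembles the examinee records (one line per testee) as plain ASCII text. It is only an illustration: the file name, student IDs, and answer strings are hypothetical, and the five header lines described above are not generated here; they must be prepended exactly as shown in Figure 3.1.

# A minimal sketch (hypothetical file name, IDs, and answers) of assembling
# ITEMAN-style examinee records as plain ASCII text. The five header lines
# (control line, key, number of alternatives, etc.) are NOT written here;
# they must be added at the top of the file following Figure 3.1.

# Hypothetical student answers for a 35-item test with options A-E.
students = {
    "S01": "CBADEABCDECBADEABCDECBADEABCDECBADE",
    "S02": "CAADEABCDACBADEABCDECBBDEABCDECBACE",
}

records = []
for student_id, answers in students.items():
    # One record per examinee: an identifier followed by the answer string.
    records.append(f"{student_id} {answers}")

# ITEMAN requires an ASCII (text-only) file; the name is kept to 8 characters.
with open("MIDTEST.txt", "w", encoding="ascii") as output_file:
    output_file.write("\n".join(records) + "\n")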

After all the data have been put in Notepad and saved, the data are analyzed

using ITEMAN program. The following are the steps of utilizing the program

(Suparman, 2011):

1. Open ITEMAN Program, by clicking Start

2. Select Program/click ITEMAN, and the program shows the following appearance.

[Figure: an annotated example of the input data, labelling the number of items, the number of digits and empty spaces before the students' answers, the answer key, the number of answer options, the number of testees, the students' answers, and a reminder not to press Enter after writing the last letter.]



3. Type the name of your data file (input) on Enter the name of the input file. For example, F:\MIDTEST.txt, then Enter

4. Enter the name of the output file on Enter the name of the output file. For example, F:\MIDTEST.output, then click Enter

5. A question appears Do you want the scores written to a file? (Y/N), then type

Y and click Enter.

6. Enter the name of your score file on Enter the name of the score file: For example, F:\MIDTEST.scr, then click Enter, Finish.

Then, there are some steps to open the results of the item analysis in the Microsoft Word program (Suparman, 2011):

1. Click Start

2. Select Program/click Microsoft Word

3. Click File/click Open, then look for the results on, for example, Drive F

(depends on which one you choose)



ITEMAN produces an output file, score file (if desired) and statistics file (if

desired). The output file contains the statistical measures, and displays them not only

for each question, but for each alternative as well. Here is a sample from the output

file:

MicroCAT (tm) Testing System                                            Page 2
Copyright (c) 1982, 1984, 1986, 1988, 1993 by Assessment Systems Corporation
Item and Test Analysis Program -- ITEMAN (tm) Version 3.50

Item analysis for data from file D:\ITEMAN\UAS.TXT
Date: 05-19-14   Time: 7:51 pm

            Item Statistics                   Alternative Statistics
      ---------------------------   ----------------------------------------
Seq.  Scale   Prop.   Disc.  Point           Prop. Endorsing          Point
No.   -Item  Correct  Index  Biser.  Alt.   Total   Low    High       Biser.  Key
----  -----  -------  -----  ------  -----  -----  -----  -----       ------  ---
 1    0-1      .40     .39    .28     A      .10    .17    .09         -.15
                                      B      .40    .50    .18         -.24
                                      C      .40    .25    .64          .28    *
                                      D      .10    .08    .09          .08
                                      E      .00    .00    .00
                                    Other    .00    .00    .00

 2    0-2      .47     .39    .37     A      .23    .33    .09         -.33
                                      B      .47    .25    .64          .37    *
                                      C      .03    .00    .09          .13
                                      D      .03    .08    .00         -.13
                                      E      .23    .33    .18         -.11
                                    Other    .00    .00    .00

And so on.

Scale Statistics
----------------
Scale:              0
N of Items         30
N of Examinees     32
Mean           21.906
Variance        6.085
Std. Dev.       2.467
Skew           -1.504
Kurtosis        3.420
Minimum        13.000
Maximum        25.000
Median         22.000
Alpha           0.476
SEM             1.786
Mean P          0.730
Mean Item-Tot.  0.294
Mean Biserial   0.445
Max Score (Low)    21
N (Low Group)      12
Min Score (High)   24
N (High Group)     10

Figure 3.2. The Output Data from Final Examination of SMK YADIKA NATAR 2013/2014

This output lists, for each alternative, the proportions of 1) the total group selecting it, 2) the bottom 27% of the group selecting it, and 3) the top 27% of the group selecting it (ASC, 1989-2006). The output also lists the Biserial Coefficients for each alternative. The asterisk denotes which alternative is the correct answer. This format allows the user to examine each alternative by comparing the high scoring and the low scoring students; a distracter that is attracting too many high scoring students indicates that the alternative may need revision.
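As an illustration of how these alternative statistics are obtained, the sketch below (Python, with hypothetical answer data and a hypothetical function name) computes the proportion of the whole group, of the bottom 27%, and of the top 27% endorsing each alternative of one item. It is a simplified reconstruction of the quantities shown in the ITEMAN output, not the program's own code.

# A minimal sketch, assuming hypothetical answer data; it mirrors the
# "Prop. Endorsing" columns (Total, Low, High) of the ITEMAN output.
from collections import Counter

def endorsing_proportions(item_answers, total_scores, alternatives="ABCDE"):
    """item_answers: options chosen for one item, one per student.
    total_scores: total test scores, in the same student order."""
    n = len(item_answers)
    k = max(1, round(0.27 * n))                  # size of the 27% groups
    order = sorted(range(n), key=lambda i: total_scores[i])
    low_group = order[:k]                        # bottom 27% by total score
    high_group = order[-k:]                      # top 27% by total score

    def proportions(indices):
        counts = Counter(item_answers[i] for i in indices)
        return {alt: counts.get(alt, 0) / len(indices) for alt in alternatives}

    return {
        "Total": proportions(range(n)),
        "Low": proportions(low_group),
        "High": proportions(high_group),
    }

# Hypothetical data for one item and ten students.
answers = ["C", "B", "C", "A", "C", "B", "C", "D", "C", "B"]
scores  = [28, 15, 30, 12, 27, 18, 33, 10, 25, 20]
print(endorsing_proportions(answers, scores))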

The statistics in the output data of ITEMAN can also be used when the rater identifies a mastery criterion group within the group of students being tested. The upper scoring group is usually the group that passes the test, whereas the lower scoring group is usually the group that fails it. The statistic contrasting the two groups is calculated by taking the number of master students answering the item correctly, subtracting the number of non-master students answering the item correctly, and then dividing by the total number of students (Crocker & Algina, 1986). The proportion correct, or difficulty index, is calculated by the following formula (Backhoff, Larrazolo, & Rosas, 2000):

pi = Ai / Ni

where:

pi = Difficulty index of item i

Ai = Number of correct answers to item i

Ni = Number of correct answers plus number of incorrect answers to item i
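A short worked example may make the formula concrete. In the sketch below (Python, hypothetical answer data and a hypothetical helper name), the difficulty index pi is simply the number of correct answers to an item divided by the number of students who answered it, exactly as defined above.

# A minimal sketch of the difficulty index pi = Ai / Ni, using hypothetical data.
def difficulty_index(item_answers, key):
    """item_answers: options chosen by the students for one item.
    key: the correct option for that item."""
    correct = sum(1 for answer in item_answers if answer == key)   # Ai
    total = len(item_answers)                                      # Ni
    return correct / total

# Hypothetical item: the key is "C"; 12 of 30 students answer correctly.
answers = ["C"] * 12 + ["A"] * 8 + ["B"] * 6 + ["D"] * 4
print(round(difficulty_index(answers, "C"), 3))   # 12/30 = 0.400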

The program can process up to a 750-item test with unlimited number of

students (ASC, 1989-2006). The user can also manually create a data file using the

edit menu in ITEMAN, which is similar to Windows Notepad program. ITEMAN’s controls are few in number and very simple to use. The program offers five pull down


(57)

the file and select the options desired for analysis. The user then selects the analyze

menu or button. The user can view or print the output file by clicking on the view

button or print button. These buttons appear after the analysis is complete.

3.7. Hypothesis Testing

The research was intended to find out whether the final semester test in

2013/2014 academic year had fulfilled the criteria of a good test or not. The

researcher used the final semester test created by MGMP because the test had been

distributed to the students in SMAN 1 Purbolinggo for years. It means that the test

was always relied on by the teachers to evaluate the students. So, the probability is

that the final semester test has fulfilled the criteria of a good test, that is, has good

validity, high reliability, average level of difficulty, high discriminating power, and good quality of the alternatives.

CHAPTER 5

CONCLUSIONS AND SUGGESTIONS

This chapter deals with the conclusions and the suggestions based on the

results and the discussions of this research.

5.1. Conclusions

The findings of the research specify that not all items in the final semester test

have good validity, in relation to construct validity, content validity, and face validity.

The construct validity and the content validity of the final semester test are valid, but the face validity is not.

The construct validity is valid. The final semester test was made for testing listening and reading, but, due to a technical problem, the listening section was not administered to the students. To find the construct validity of the test, the test was analyzed against the concept of reading comprehension. Based on the

classification of the final semester test, all reading items show a link to the traits of

the reading test. This is the same as the content validity of the final semester test. The

content validity of the final semester test is valid because all items in the reading

comprehension are relevant to the syllabus in KTSP.

For face validity, the test was evaluated by using the Guidelines for Constructing Multiple Choice Test; items that do not follow the guidelines are considered not valid and may have to be redesigned. The results show that most of the items are not good and need to be revised.

In the output data of ITEMAN, the result shows that the reliability coefficient (alpha) is 0.448. Based on the criteria of the reliability of test items, it is categorized as average/sufficient, that is, the category for test items whose alpha ranges from 0.401 to 0.700. It means that, in general, if the test items are administered repeatedly under the same conditions, they are likely to produce similar outcomes.
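For readers who want to see how such an alpha value is obtained, the sketch below (Python, hypothetical 0/1 item scores and a hypothetical function name) applies the standard coefficient alpha formula for dichotomously scored items; it is an illustration of the statistic ITEMAN reports, not ITEMAN's own implementation.

# A minimal sketch of coefficient alpha for dichotomously scored items,
# using hypothetical 0/1 data (rows = students, columns = items).
def cronbach_alpha(score_matrix):
    n_items = len(score_matrix[0])

    def variance(values):
        mean = sum(values) / len(values)
        return sum((v - mean) ** 2 for v in values) / len(values)

    item_variances = [variance([row[i] for row in score_matrix])
                      for i in range(n_items)]
    total_variance = variance([sum(row) for row in score_matrix])
    return (n_items / (n_items - 1)) * (1 - sum(item_variances) / total_variance)

# Hypothetical scores of five students on four items.
scores = [
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 1],
]
print(round(cronbach_alpha(scores), 3))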

The test items are good if they are neither too easy nor too difficult, that is, at an average level. So, if a test is at the average level of difficulty, the test is good for the students. Related to the result of the level of difficulty in the output data of ITEMAN, some of the items fulfill the quality of a good item, but some do not.

Regarding the item analysis using ITEMAN, it was found that the level of difficulty can be classified into four categories, that is, good or directly usable, very difficult or needs revising, very easy or needs revising, and too difficult or needs dropping or total revision. Items whose level of difficulty ranges from 0.300 to 0.700 are categorized as good or directly usable. This class consists of 11 items (30%), namely items 17, 18, 19, 21, 24, 29, 35, 38, 40, 41, and 47. These items are recommended to be used directly without any prior revision. For the category very difficult or needs revising, the items have a level of difficulty ranging from 0.100 to 0.299. This class consists of 4 items (10%), namely items 16, 26, 31, and 49. These items need to be revised. As to the category very easy or needs revising, the items have a level of difficulty ranging from 0.701 to 0.900. This class consists of 6 items (20%).



There are six items that are very easy, namely items 20, 21, 23, 25, 36, and 45. These items also need to be revised. With reference to the items whose level of difficulty ranges from 0.000 to 0.099, the items are categorized as too difficult or needs dropping or total revision. This class consists of 14 items (40%), namely items 27, 28, 30, 32, 33, 34, 37, 39, 42, 43, 44, 46, 48, and 50; therefore, they need dropping.
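The four categories above follow fixed cut-off points, so they can be expressed as a simple classification rule. The sketch below (Python) encodes the ranges used in this study; the function name is the writer's illustration, not part of ITEMAN, and values above 0.900 fall outside the stated ranges.

# A minimal sketch of the level-of-difficulty categories used in this study.
def classify_difficulty(prop_correct):
    """prop_correct: the Prop. Correct (difficulty index) of one item."""
    if 0.300 <= prop_correct <= 0.700:
        return "good / directly usable"
    if 0.100 <= prop_correct <= 0.299:
        return "very difficult / needs revising"
    if 0.701 <= prop_correct <= 0.900:
        return "very easy / needs revising"
    if 0.000 <= prop_correct <= 0.099:
        return "too difficult / needs dropping or total revision"
    return "outside the stated ranges (above 0.900)"

print(classify_difficulty(0.40))   # item 1 in Figure 3.2 -> good / directly usable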

There are 6 items (17.1%) in the final semester test which have negative discrimination values, namely items 17, 19, 30, 31, 33, and 38. It means that the answer keys of these items should be checked. Related to the item analysis using ITEMAN, it was found that test items whose discriminating power is ≥ 0.400 are classified as high. There are 9 items (25.7%) in this class, namely items 23, 24, 25, 29, 35, 40, 41, 47, and 49. These test items are recommended to be used as they can discriminate the more knowledgeable from the less knowledgeable students. The category average/without revising covers items whose discriminating power ranges from 0.300 to 0.399. There are 2 items (5.7%) in this class, namely items 16 and 21, which do not need revising. Concerning the category low/needs revising, which covers items whose discriminating power ranges from 0.200 to 0.299, it was found that there are no items (0%) with low discriminating power that need to be revised. Test items whose discriminating power ranges from 0.000 to 0.199 are categorized as very low/needs dropping. There are 18 items (51.5%) in this category, namely items 18, 20, 22, 26, 27, 28, 32, 34, 36, 37, 39, 42, 43, 44, 45, 46, 48, and 50.

Based on the results of the data analysis using ITEMAN, it was found that the alternatives of the 35 items, consisting of A, B, C, D, and E with a total of 175 alternatives, can be classified into three categories, that is, very good, good enough or sufficient, and least/dropped or needs revising. Alternatives whose Prop. Endorsing (proportion of the answers) ranges from 0.051 to 1.000 are categorized as very good. This class consists of 26 options (15%). These alternatives are recommended to be used without any prior revision. Alternatives whose Prop. Endorsing ranges from 0.011 to 0.050 are categorized as good enough or sufficient. This class consists of 43 options (24.5%). These alternatives can still be used directly because they are chosen by a small proportion of the testees. Alternatives whose Prop. Endorsing ranges from 0.000 to 0.010 are categorized as least/dropped or needs revising. This class consists of 106 options (60.5%). These alternatives should be revised before being tested again.

5.2. Suggestions

In line with the conclusions above, some suggestions are proposed as follows:

1. Suggestions to the teachers

a. According to the data gained, the teachers should be familiar with construct validity, content validity, and face validity in order that they can assess the quality of the test.

b. The teachers should be skilled in assessing the test from the aspects of material, construction, and language in order to improve the quality of the test.

c. The teachers should be familiar with the ITEMAN software program in order that they can assess the students' ability faster.



d. The teachers should be trained to use ITEMAN software program in order to improve the quality of the test.

e. The teachers should be familiar with all the terms related to the quality of the test items, such as validity, reliability, Prop. Correct (level of difficulty), Point Biserial (discriminating power), Prop. Endorsing (options), distracters, answer keys, alpha, and standard deviation.

2. Suggestions to other researchers

a. It is suggested that the role of ITEMAN in determining the quality of multiple choice items be investigated further. It would also be interesting to collect a larger or smaller data set to investigate whether other tendencies appear in determining the quality of the items.

b. Other researchers should replicate the current study by analyzing the quality of other test items, such as the Mid Semester Test, the Final School Test (UAS), and the National Examination (UN).



REFERENCES

Allen, Virginia F. (1983). Techniques in Teaching Vocabulary. New York: Oxford University Press.

Arikunto, Suharsimi. (2006). Prosedur Penelitian Suatu Pendekatan Praktik. Jakarta: Rineka Cipta.

Ariyana, Lilis T. (2011). Analisis Butir Soal Ulangan Akhir Semester Gasal IPA Kelas IX SMP di Kabupaten Grobogan. (Skripsi). Universitas Negeri Semarang, Semarang.

Azwar, Saifuddin. (2000). Reliabilitas dan Validitas. Yogyakarta: Pustaka Pelajar.

Assessment Systems Corporation. (1989-2006). User’s Manual for the ITEMAN™ Conventional Item Analysis Program. 2233 University Avenue, Suite 200.

Backhoff, E., Larrazolo, N., & Rosas, M. (2000). The level of difficulty and discrimination power of the Basic Knowledge and Skills Examination (EXHCOBA). Revista Electrónica de Investigación Educativa, 2.

Brinberg, David & McGrath, Joseph E. (1985). Validity and the Research Process. Beverly Hills: Sage Publications.

Carmines, Edward G., & Richard A. Zeller. (1979). Reliability and Validity Assessment. Beverly Hills, CA: Sage.

Creswell, John W. (2009). Research design: Qualitative, Quantitative, and Mixed Methods Approaches. Los Angeles: Sage.

Crocker, L., & Algina, J. (1986). Introduction to Classical and Modern Test Theory. New York: Holt, Rinehart and Winston.

Davenport, R. A. (2007). Mastering the SAT Critical Reading Test. Canada: Wiley Publishing, Inc.

DeCapua, Andrea. (2008). Grammar for Teachers: A Guide to American English for Native and Non-Native Speaker. New York, N.Y.: Springer

Fitriana, Novaria. (2013). Analisis Kualitas Butir Soal Ulangan Akhir Semester Gasal Mata Pelajaran IPA Kelas V Mi Sultan Agung Tahun Pelajaran 2012/2013. (Skripsi). Universitas Islam Negeri Sunan Kalijaga, Yogyakarta.

Haladyna, T. M. (2004). Developing and Validating Multiple-Choice Test Items-3rd ed. New Jersey: Lawrence Erlbaum Associates.

Haris, David. (1974). English as Second Language. New York: McGraw-Hill Inc.

Heaton, J. B. (1975). Writing English Language Test. London: Longman Group.

Linn, R.L. & Gronlund, N.E. (1995). Measurement and Assessment in Teaching (Seventh Edition). Ohio: Prentice-Hall, Inc.

Lynne, Patricia. (2004). Coming to Terms: Theorizing Writing Assessment in Composition Studies. Logan: Utah State University Press.



Ngadimun, Hd. (2004). Pengantar Authentic Assessment (Penilaian Otentik). Makalah. University of Lampung, Bandar Lampung.

Nelson, Larry. (2012). ITEMAN 3 and Lertap 5. Curtin University of Technology.

Nicol, David. (2007). E-assessment by Design: Using Multiple-Choice Tests to Good Effect. Journal of Further and Higher Education, Vol. 31, No. 1, February 2007, 53–64.

Nunan, David. (1992). Research Methods in Language Learning. Cambridge University Press.

Nunnally, J. (1978). Psychometric Theory. New York: McGraw-Hill

O’Neill, Peggy. (2009). A Guide to College Writing Assessment. Logan: Utah State University Press.

Osterlind, Steven J. (1998). Constructing Test Items: Multiple-Choice, Constructed-Response, Performance and Other Formats. University of Missouri-Columbia.

Patton, M. Q. (2002). Qualitative Research and Evaluation Methods (3rd ed.). Thousand Oaks, CA: Sage.

Peraturan Menteri Pendidikan Nasional Nomor 20 Tahun 2007 tentang Standar Penilaian Pendidikan.

Ratnaningsih, Dewi J. (2009). Analisis Butir Soal Pilihan Ganda Ujian Akhir Semester Mahasiswa Di Universitas Terbuka Dengan Pendekatan Teori Tes Klasik. (Skripsi). Universitas Terbuka, Tangerang.

Robinson, B., Spratt, C. & Walker, R. (2004). Practitioner Research and Evaluation Skills Training (PREST) in Open and Distance Learning: handbook A5: Mixed research methods. Common wealth of Learning.

Salirawati, Das. (2011). Analisis Butir Soal Dengan Program Iteman. staff.uny.ac.id. Last retrieved: December 22, 2014.

Shohamy, Elana. (1985). A Practical Handbook in Language Testing for the Second Language Teacher. Tel-Aviv University.

Surapranata, Sumarna. (2006). Analisis Validitas, Reabilitas dan Interpretasi Hasil Tes Implementasi Kurikulum 2004. Bandung: PT Remaja Rosdakarya.

Suparman, U. (2011). The Implementation of Iteman to Improve the Quality of English Test Items As A Foreign Language: An Assessment Analysis. AKSARA - Jurnal Bahasa, Seni, dan Pengajarannya, Vol. XII, No. 1, pp. 86-96.

Wallace, M.J. (1998). Teaching Vocabulary. New York: Haineman Educational Book.

Wiggins, G. & McTighe, J. (2005). Understanding by Design (Expanded 2nd Ed. USA). Alexandria, Va.: Association for Supervision and Curriculum Development.

