Research Question 1: Data Acquisition
up the tasks, while the other acts as an assessor and does not join the conversation.
The overall test format was organized into three parts. The first two parts were developed from the responsive speaking task suggested by Brown (2004). In the first task, each student interacted with the interlocutor, who asked the student some guided questions. The other student, who was not answering the questions, was asked to pay attention to what the performing student was saying. Then, for the second task, each student was asked to paraphrase the information given by the other student, who had performed first. Meanwhile, the third task was developed using the interactive speaking task (Brown, 2004). In the third task, the students were to interact with each other. The interlocutor set up the activity using a standardized rubric developed from particular topics taken from the course material. During and at the end of the test, each of the examiners gave marks. Finally, the scores given by the two examiners were compared and averaged.
The reliability of this test was very important; according to Gall et al. (2003, p. 196), the reliability of a test refers to the measurement error present in the scores yielded by the test. Since two examiners gave marks for every student in each test, it was necessary to make sure that the testing instruments and procedures were valid and reliable. There was a possibility that the two examiners would give inconsistent marks, that is, measurement error, even for the same student. In this case, a reliable test should yield stable and consistent scores whenever it is administered (Creswell, 2011) and whoever the test examiners are.
In order to demonstrate the test's reliability, therefore, an inter-rater reliability test was conducted to negate any bias that an individual rater might bring to the scoring. This inter-rater reliability test should be conducted by "having several testers administer the test to a sample of individuals and then correlating their obtained scores with each other" (Gall et al., 2003, p. 198); this was done using SPSS.
The inter-rater reliability test was conducted by examining the intraclass correlation coefficient between the scores that teacher 1 and teacher 2 gave to the participants during the pre-test. The results of this statistical examination are presented below.
Table 3.2 Intraclass Correlation Coefficient

                     Intraclass        95% Confidence Interval      F Test with True Value 0
                     Correlation^b     Lower Bound   Upper Bound    Value      df1   df2   Sig.
Single Measures      .761^a            .088          .945           12.537     8     8     .001
Average Measures     .864^c            .162          .972           12.537     8     8     .001
Based on the table, the scores given by teacher 1 and teacher 2 correlated at 0.86 (average measures), which was greater than 0.70, signifying that they were highly correlated. In addition, the significance level was 0.001 (p < 0.05), which indicated that the inter-rater reliability was high. On this basis, the test scales were shown to be highly reliable, so the test could be utilized throughout the research.
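The analysis itself was run in SPSS, but the same statistics can be reproduced outside it. The Python sketch below is a minimal illustration, assuming a two-way ANOVA model with absolute agreement, the ICC variant whose single- and average-measures pattern matches the output in Table 3.2; the function name icc_absolute and the marks in the example are hypothetical and are not the study's actual data.

import numpy as np

def icc_absolute(scores):
    """Intraclass correlation, two-way ANOVA model, absolute agreement.

    scores: (n_students, k_raters) array of marks, one row per student.
    Returns (icc_single, icc_average, F, df1, df2), mirroring the
    "Single Measures" and "Average Measures" rows of the SPSS output.
    """
    x = np.asarray(scores, dtype=float)
    n, k = x.shape
    grand = x.mean()

    # Sums of squares from the two-way ANOVA decomposition.
    ss_rows = k * np.sum((x.mean(axis=1) - grand) ** 2)    # between students
    ss_cols = n * np.sum((x.mean(axis=0) - grand) ** 2)    # between raters
    ss_err = np.sum((x - grand) ** 2) - ss_rows - ss_cols  # residual

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_err = ss_err / ((n - 1) * (k - 1))

    # McGraw & Wong (1996): ICC(A,1) for single measures,
    # ICC(A,k) for average measures.
    single = (ms_rows - ms_err) / (
        ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n)
    average = (ms_rows - ms_err) / (ms_rows + (ms_cols - ms_err) / n)

    return single, average, ms_rows / ms_err, n - 1, (n - 1) * (k - 1)

# Hypothetical pre-test marks for nine students from the two teachers
# (illustrative values only, not the study's actual data).
marks = np.array([
    [70, 68], [75, 80], [62, 60], [88, 85], [71, 74],
    [66, 70], [79, 77], [84, 90], [58, 63],
])
single, average, F, df1, df2 = icc_absolute(marks)
print(f"Single Measures ICC  = {single:.3f}")
print(f"Average Measures ICC = {average:.3f}")
print(f"F({df1}, {df2}) = {F:.3f}")

With nine students and two raters, df1 = n - 1 = 8 and df2 = (n - 1)(k - 1) = 8, which matches the degrees of freedom reported in Table 3.2.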