The inter-rater reliability of two alternative analytic grading scales for the evaluation of oral interviews at Anadolu University School of Foreign Languages
Karslı, Ece Selva
Klinghammer, Sarah J.
Of all language exams, the accurate testing of speaking is regarded as the most challenging to prepare, administer and score because it takes considerable time and effort to obtain reliable results (Madsen, 1983; O’Malley & Pierce, 1996). Because subjective types of tests (e.g. interview ratings) depend on the judgment of the raters, inconsistencies in judgment may occur and adversely affect rater reliability. This research study investigated the inter-rater reliability of two alternative speaking assessment criteria designed for Anadolu University School of Foreign Languages. The perspectives of the participants on the scales were also analyzed with the help of the interview records. Two types of data were used in this study: raters’ scores using both of the scales and raters’ opinions of the rating scales. The participants in the study were five English instructors currently employed at Anadolu University School of Foreign Languages. The teachers attended the training and norming sessions for the four-band scale and then graded 36 elementary level students’ oral performance using the scale. The teachers were then interviewed as a group and asked to express their opinions about the scale. Six weeks later, the same procedure was followed for the five-band scale. The training and norming sessions for both of the scales were held by the researcher. Then inter-class correlation for both of the scales was calculated using the scores assigned to the 36 elementary level students. The results of the statistical analysis revealed that the four-band scale is more reliable than the five-band scale. The results of the interviews indicated that the raters had common problems in assigning scores to students’ oral performances while using both of the scales. The problem that the raters faced in the scoring procedure while using the five-band scale was that two terms used in the descriptors were not clear. 
The common problems faced by the raters while using the four-band scale were as follows: 1) one term used in the descriptors is not clear, 2) students’ performance may not fit into the bands, 3) the number of bands in each category is not enough, and the highest band in vocabulary needs to be more detailed, 4) the lowest band is unnecessary, and 5) there is a big difference among the bands in terms of the value assigned to each band. After an analysis of the two speaking assessment scales, the four-band scale is recommended for assessing the oral performance of elementary level students at Anadolu University School of Foreign Languages. Since nearly all participants reported problems concerning the descriptors in both of the scales, the descriptors need to be reconsidered and given more attention during training and norming sessions. In addition, the scale is open to revision in terms of weighting because the participants had problems with it. Finally, it is recommended that teachers who are going to take part in the assessment of learners’ oral performances attend training and norming sessions before they take part in the actual scoring procedure.