|dc.description.abstract||Of all language exams, the accurate testing of speaking is regarded as the
most challenging to prepare, administer and score because it takes considerable time
and effort to obtain reliable results (Madsen, 1983; O’Malley & Pierce, 1996).
Since subjective types of tests (e.g. interview ratings) require the judgment of the
raters, inconsistency in judgments may occur, which may affect rater reliability adversely.
This research study investigated the inter-rater reliability of two alternative
speaking assessment criteria designed for Anadolu University, School of Foreign
Languages. The perspectives of the participants on the scales were also analyzed
with the help of the interview records.
Two types of data were used in this study: raters’ scores using both of the
scales and raters’ opinions of the rating scales. The participants in the study were five
English instructors currently employed at Anadolu University School of Foreign
Languages. The teachers attended the training and norming sessions for the four-band
scale and then graded 36 elementary level students’ oral performance using the scale.
Then the teachers were interviewed as a group. They were asked to express their
opinions about the scale. Six weeks later, the same procedure was followed for the five-band
scale. The training and norming sessions for both of the scales were held by the
Then inter-class correlation for both of the scales was calculated using the
scores assigned to 36 elementary level students. The result of the statistical analysis
revealed that the four-band scale is more reliable than the five-band scale.
The results of the interviews indicated that the raters have common problems
in assigning the scores to students’ oral performances while using both of the scales.
The problem that the raters faced in the scoring procedure while they were using the
five-band scale is that two terms used in the descriptors are not clear. The common
problems faced by the raters while they were using the four-band scale are as
follows: 1) one term used in the descriptors is not clear, 2) students’ performance
may not fit into the bands, 3) the number of bands in each category is not enough,
and the highest band in vocabulary needs to be more detailed, 4) the lowest band is
unnecessary, and 5) there is a big difference among the bands in terms of the value
assigned to each band.
After an analysis of the two speaking assessment scales, the four-band scale is
recommended for assessing the oral performances of elementary level students at Anadolu
University School of Foreign Languages. Since nearly all participants stated
problems concerning the descriptors in both of the scales, the descriptors need to be
reconsidered and given more attention during training and norming sessions. In addition, the scale is open to revision in terms of weighting because the participants
had problems with it. Finally, it is recommended that teachers who are going to take
part in the assessment of learners’ oral performances attend training and
norming sessions before they take part in the actual scoring procedure.||en_US