ABSTRACT

When researchers make use of observer ratings (where observers may be acquaintances such as peers, family members, and teachers, or trained observers not previously acquainted with the research participants), they should provide evidence of the dependability or replicability of those ratings by reporting coefficients of reliability (for continuous scores) or agreement (for categorical ratings). Many methods have been recommended for quantifying the dependability of ratings, and investigators (for whom this task is often only a peripheral concern) may not be aware of well-documented limitations of some of these approaches. Interrater reliability (for continuous rating scales) is best quantified as an intraclass correlation coefficient (ICC); Shrout and Fleiss (1979) provided a primer on the different types of ICCs and how to choose among them. For interrater agreement (for nominal scales), Cohen’s (1960) kappa coefficient is recommended when there are exactly two raters, and Fleiss’s (1971) extension when there are three or more. Tinsley and Weiss (1975) offered a helpful introduction to reliability and agreement, including critiques of inferior approaches to estimation. Hoyt and Melby (1999; see also Lakes & Hoyt, 2009) noted that multiple sources of error (e.g., instability of scores over time, internal inconsistency of rating scales, and rater variance) contribute to unreliability of ratings, and researchers may find it useful to report generalizability coefficients (Brennan, 2001; Shavelson & Webb, 1991), which quantify dependability with respect to multiple sources of error simultaneously. Schmidt and Hunter (1996; see also Schmidt, Le, & Ilies, 2003) offered a helpful discussion of the impact of measurement error on study findings and of the importance of reporting coefficients that reflect the relevant sources of error in scores. Hoyt (2000; see also Hoyt & Kerns, 1999) discussed issues in interpreting findings in the presence of rater errors, and Feldt and Brennan (1989) provided a technical treatment of the relation between reliability and generalizability coefficients.
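
As a concrete illustration of the ICC recommendation above, the sketch below computes Shrout and Fleiss's (1979) ICC(2,1) (two-way random effects, absolute agreement, single rater) from the mean squares of a targets-by-raters data matrix. This is a minimal sketch, not a full ANOVA routine: the function name and the ratings matrix are hypothetical, and numpy is assumed to be available.

    import numpy as np

    def icc_2_1(x):
        """ICC(2,1) of Shrout & Fleiss (1979): two-way random effects,
        absolute agreement, single rater.
        x is an (n targets x k raters) array of continuous ratings."""
        x = np.asarray(x, dtype=float)
        n, k = x.shape
        grand = x.mean()
        # Two-way ANOVA decomposition (one observation per cell).
        ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()
        ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()
        ss_total = ((x - grand) ** 2).sum()
        ss_err = ss_total - ss_rows - ss_cols
        ms_r = ss_rows / (n - 1)                # between-targets mean square
        ms_c = ss_cols / (k - 1)                # between-raters mean square
        ms_e = ss_err / ((n - 1) * (k - 1))     # residual mean square
        # Shrout & Fleiss (1979) formula for ICC(2,1).
        return (ms_r - ms_e) / (ms_r + (k - 1) * ms_e + k * (ms_c - ms_e) / n)

    # Hypothetical example: 6 targets rated by 3 raters on a continuous scale.
    ratings = np.array([[9, 2, 5],
                        [6, 1, 3],
                        [8, 4, 6],
                        [7, 1, 2],
                        [10, 5, 6],
                        [6, 2, 4]])
    print(round(icc_2_1(ratings), 3))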
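
Cohen's (1960) kappa, recommended above for two raters, corrects the observed proportion of agreement, p_o, for the agreement expected by chance from the raters' marginal distributions, p_e, as kappa = (p_o - p_e) / (1 - p_e). A minimal sketch follows; the function name and example data are hypothetical.

    def cohens_kappa(rater1, rater2):
        """Cohen's (1960) kappa for two raters assigning nominal categories.
        rater1 and rater2 are equal-length sequences of category labels."""
        n = len(rater1)
        categories = set(rater1) | set(rater2)
        # Observed proportion of agreement.
        p_o = sum(a == b for a, b in zip(rater1, rater2)) / n
        # Chance agreement from the two raters' marginal proportions.
        p_e = sum((list(rater1).count(c) / n) * (list(rater2).count(c) / n)
                  for c in categories)
        return (p_o - p_e) / (1 - p_e)

    # Hypothetical example: two raters classify 10 cases into three categories.
    r1 = ["A", "A", "B", "B", "C", "A", "B", "C", "C", "A"]
    r2 = ["A", "B", "B", "B", "C", "A", "B", "C", "A", "A"]
    print(round(cohens_kappa(r1, r2), 3))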
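
To make the generalizability-coefficient idea concrete, consider an illustrative fully crossed persons x raters x occasions (p x r x o) random design; the coefficient for relative decisions combines the person-by-facet error variance components in a single index (notation follows standard G-theory conventions, e.g., Shavelson & Webb, 1991):

    E\rho^2 = \frac{\sigma_p^2}{\sigma_p^2 + \dfrac{\sigma_{pr}^2}{n_r} + \dfrac{\sigma_{po}^2}{n_o} + \dfrac{\sigma_{pro,e}^2}{n_r n_o}}

Here \sigma_p^2 is the universe-score (person) variance, the remaining components are person-by-rater, person-by-occasion, and residual interaction variances, and n_r and n_o are the numbers of raters and occasions over which scores are averaged. Increasing n_r or n_o shrinks the corresponding error terms, which formalizes the point that averaging over more raters or occasions yields more dependable scores.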