What is an Intraclass Correlation Coefficient (ICC)?
In biostatistics, the Intraclass Correlation Coefficient (ICC) measures the reliability or consistency of measurements made by different observers or instruments for the same subject. It is commonly used in contexts like medical studies, psychology, and quality control to assess the agreement between raters or repeated measurements.
One of the most widely used frameworks for deciding which type of ICC to compute is Shrout and Fleiss's classification, introduced in their 1979 paper. It provides a systematic way to choose the appropriate ICC for a given study design.
Shrout and Fleiss’s Classification for ICC
Shrout and Fleiss proposed three models of ICC, each suited for different study designs. These models are based on whether raters or subjects are treated as random or fixed effects and whether you are interested in measuring absolute agreement or consistency between raters.
- ICC Model 1: One-way Random Effects Model (ICC(1))
  - Use Case: This model is appropriate when each subject is rated by a different set of raters, drawn at random from a larger pool. Because the same raters do not rate every subject, the design is not fully crossed, and rater effects cannot be separated from residual error.
  - Reliability Assessed: Measures the reliability of ratings when raters are treated as random effects, using a one-way decomposition into between-subjects and within-subjects variance.
  - Example: In a medical trial, different sets of doctors rate different patients, and there is no overlap in which doctors rate the same patients.
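As a sketch of the computation behind ICC(1), the following numpy snippet derives both the single-rater and average-rater variants from the one-way ANOVA mean squares (the function name and sanity-check data are my own; the formulas follow Shrout and Fleiss):

```python
import numpy as np

def icc1(ratings):
    """ICC(1,1) and ICC(1,k) under the one-way random effects model.

    ratings: (n_subjects, k) array; row i holds the k ratings of
    subject i (each subject may have its own set of k raters).
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    row_means = x.mean(axis=1)
    # One-way ANOVA mean squares: between subjects vs. within subjects
    bms = k * np.sum((row_means - x.mean()) ** 2) / (n - 1)
    wms = np.sum((x - row_means[:, None]) ** 2) / (n * (k - 1))
    single = (bms - wms) / (bms + (k - 1) * wms)   # ICC(1,1)
    average = (bms - wms) / bms                    # ICC(1,k)
    return float(single), float(average)

# Sanity check: identical ratings for every subject give perfect reliability
print(icc1([[1, 1], [2, 2], [3, 3]]))  # (1.0, 1.0)
```

Note that with this one-way decomposition, any systematic rater differences are absorbed into the within-subjects term, which is exactly why ICC(1) tends to be lower than the two-way variants on the same data.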
- ICC Model 2: Two-way Random Effects Model (ICC(2))
  - Use Case: This model applies when the same raters rate every subject and those raters are considered a random sample from a larger population. It estimates how well the ratings generalize to other raters from that population.
  - Reliability Assessed: Measures absolute agreement, either for a single rater's ratings (ICC(2,1)) or for the average of k raters' ratings (ICC(2,k)). Both subject and rater variance are treated as random, which allows the results to generalize to other raters.
  - Example: A group of patients is rated by a team of doctors, and these doctors are randomly selected from a larger pool.
  - Variants:
    - ICC(2,1): Reliability (absolute agreement) of a single rater's ratings.
    - ICC(2,k): Reliability (absolute agreement) of the average of k raters' ratings.
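Because the same raters rate every subject, a two-way ANOVA decomposition is possible, and the rater variance enters the denominator. A minimal numpy sketch (function name mine; mean-square formulas follow Shrout and Fleiss):

```python
import numpy as np

def icc2(ratings):
    """ICC(2,1) and ICC(2,k) under the two-way random effects model.

    ratings: (n_subjects, k) array; the *same* k raters rate every
    subject, so rater effects can be separated from residual error.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    subj = x.mean(axis=1)    # row (subject) means
    rater = x.mean(axis=0)   # column (rater) means
    bms = k * np.sum((subj - grand) ** 2) / (n - 1)    # between subjects
    jms = n * np.sum((rater - grand) ** 2) / (k - 1)   # between raters
    resid = x - subj[:, None] - rater[None, :] + grand
    ems = np.sum(resid ** 2) / ((n - 1) * (k - 1))     # residual
    single = (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)
    average = (bms - ems) / (bms + (jms - ems) / n)
    return float(single), float(average)
```

Applied to the six-subject, four-judge example in the 1979 paper, this reproduces the published values of roughly ICC(2,1) = 0.29 and ICC(2,4) = 0.62.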
- ICC Model 3: Two-way Mixed Effects Model (ICC(3))
  - Use Case: This model is used when the same raters rate every subject but the raters are treated as fixed effects: they are the only raters of interest, and generalization to other raters is not needed.
  - Reliability Assessed: Measures consistency, either for a single rater's ratings (ICC(3,1)) or for the average of k raters' ratings (ICC(3,k)). Unlike ICC(2), rater variance is excluded from the denominator because the raters are fixed.
  - Example: A study in which a specific panel of doctors rates all patients, and the results are only relevant to these doctors, not a broader population of raters.
  - Variants:
    - ICC(3,1): Consistency of a single rater's ratings.
    - ICC(3,k): Consistency of the average of k raters' ratings.
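The computation uses the same two-way ANOVA mean squares as ICC(2); the only change is that the between-raters term drops out of the denominator. A sketch (function name mine, formulas per Shrout and Fleiss):

```python
import numpy as np

def icc3(ratings):
    """ICC(3,1) and ICC(3,k) under the two-way mixed effects model.

    Same layout as ICC(2): the same k raters rate every subject,
    but rater variance is dropped from the denominator because
    the raters are fixed.
    """
    x = np.asarray(ratings, dtype=float)
    n, k = x.shape
    grand = x.mean()
    subj = x.mean(axis=1)
    rater = x.mean(axis=0)
    bms = k * np.sum((subj - grand) ** 2) / (n - 1)    # between subjects
    resid = x - subj[:, None] - rater[None, :] + grand
    ems = np.sum(resid ** 2) / ((n - 1) * (k - 1))     # residual
    single = (bms - ems) / (bms + (k - 1) * ems)   # ICC(3,1)
    average = (bms - ems) / bms                    # ICC(3,k)
    return float(single), float(average)
```

On the same data, ICC(3) is never smaller than ICC(2), since systematic rater disagreement no longer counts against reliability.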
Choosing the Right Model
- One-way random effects (ICC(1)): Use when a different set of raters rates each subject, and those raters are a random sample from a larger pool.
- Two-way random effects (ICC(2)): Use when the same raters rate all subjects, and raters are drawn randomly from a larger population. This model allows for generalization of the results to other raters.
- Two-way mixed effects (ICC(3)): Use when the same raters rate all subjects, and you are only interested in the reliability of those specific raters (raters are fixed and not generalizable).
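The three rules above reduce to two questions about the design: do the same raters rate every subject, and are those raters a random sample from a larger population? A hypothetical helper (names mine) that encodes the decision:

```python
def choose_icc_model(same_raters: bool, raters_random: bool) -> str:
    """Map the two study-design questions to a Shrout-Fleiss model."""
    if not same_raters:
        # Different raters per subject: rater effects are not separable
        return "ICC(1): one-way random effects"
    if raters_random:
        return "ICC(2): two-way random effects"
    return "ICC(3): two-way mixed effects"

print(choose_icc_model(same_raters=True, raters_random=False))
# ICC(3): two-way mixed effects
```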
Summary
Shrout and Fleiss’s classification of ICC provides a structured approach for assessing the reliability of ratings or measurements. It includes three models:
- ICC(1): One-way random effects, for designs where each subject is rated by a different set of raters,
- ICC(2): Two-way random effects, for the same raters across all subjects, with results generalizable to other raters,
- ICC(3): Two-way mixed effects, for the same fixed raters without generalization.
The choice of model and whether to assess absolute agreement or consistency depends on the study design and whether raters are random or fixed. This aspect of the models will be discussed in more detail in my next few posts.
References
- Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420-428.
- Koo, T. K., & Li, M. Y. (2016). A guideline of selecting and reporting intraclass correlation coefficients for reliability research. Journal of Chiropractic Medicine, 15(2), 155-163.
- McGraw, K. O., & Wong, S. P. (1996). Forming inferences about some intraclass correlation coefficients. Psychological Methods, 1(1), 30-46.