Welcome to the Kappa Zoo
1 Welcome
The goal of this project is to unify and extend research on classification, in particular the statistics used to assess rater agreement or classification accuracy. For example, a panel of wine judges independently rates a series of vintages. This type of data is widespread in consumer research (product ratings), medicine (diagnostics, trials), education (evaluation), social science (surveys), and machine learning (text annotation). What these have in common is a set of raters (evaluators, annotators) who independently assign categories to subjects, usually from a small number of choices.
The most basic question about such data sets is “is it just random numbers?” This can be answered by comparing the variation of responses within each subject to the variation between subjects. Multiple research methods can be applied, such as ANOVA, Item Response Theory, and other continuous-scale approaches, but the focus here is on methods that directly or indirectly use exact rater agreement as the basis for analysis. There are at least two distinct research traditions in rater agreement. One, from the social sciences and medicine, has produced a proliferation of rater agreement statistics, each with different assumptions. These “kappas” prompted the title of the website, since there is by now a “zoo” of these statistics to choose from. On a parallel track, nowadays more associated with machine learning, model-based classification statistics (e.g. MACE) have been developed. These embody a tree-like set of conditional probabilities and often generate individual parameters for the subjects being classified and for the raters who assign the classes.
The body of work here includes statistical derivations, an R library for implementing them, and examples of use. It’s intended to be particularly useful for those who are still scratching their heads about which kappa statistic to use. The answer is: don’t. There’s a better way to approach the problem.
1.1 Getting Started
If you are new to t-a-p models, I recommend starting with Chapter 1, which introduces rater agreement as a classification problem and lays out the philosophical and statistical assumptions. Chapter 2 provides the basic statistical derivations needed to work with t-a-p models. Chapter 3 derives the relationship to some existing rater agreement statistics, for example showing how the Fleiss kappa is a special case of a t-a-p model. Chapter 4 expands the number of parameters and shows how this relates to the “kappa paradox.” Chapter 5 develops a multi-parameter model that allows each rater and subject to have parameters for accuracy and truth, respectively. This model is equivalent to a binary version of MACE, from the machine learning literature (Hovy et al., 2013). Chapter 6 shows how we can assess model fit and the independence assumption. Chapter 7 introduces the tapModel R package, which allows for easy estimation of the model parameters and includes a number of utilities. Chapter 8 introduces the t-a-p app, an interactive way to use real or simulated data to estimate the model parameters without coding. The app is currently under construction, since I’ve made major changes to the tapModel package. An appendix provides more detailed statistical derivations and proofs for some results. There is a growing list of examples that use data sets to illustrate the concepts.
1.2 Table of Contents
Examples
2 Overview
2.1 Unification
It turns out that the kappa-type statistics can be derived as special cases of latent-variable generative models for binary classification. We can see this common link by starting with a simple probability tree for binary classification with three parameters: \(t\) is the prevalence of Class 1 (as opposed to Class 0), \(a\) is the rater accuracy, and \(p\) is the probability that an inaccurate assignment is Class 1. This generative approach is inspired by the Justified True Belief model of knowledge, and it naturally incorporates cases where raters arrive at correct classifications for incorrect reasons, analogous to Gettier problems in epistemology.
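The probability tree can be read as a recipe for simulating ratings. The following is a minimal sketch rather than the tapModel implementation: it assumes that an inaccurate rating is assigned Class 1 with probability \(p\), that all raters share the same \(a\) and \(p\), and that ratings are independent given the true class. The function name `simulate_tap` and its arguments are hypothetical.

```python
import random

def simulate_tap(n_subjects, n_raters, t, a, p, seed=0):
    """Simulate binary ratings from a t-a-p style probability tree.

    Returns (truth, ratings): truth[i] is the latent class of subject i,
    and ratings[i][j] is the class assigned to subject i by rater j.
    """
    rng = random.Random(seed)
    truth, ratings = [], []
    for _ in range(n_subjects):
        # First branch: the subject is truly Class 1 with probability t.
        true_class = 1 if rng.random() < t else 0
        row = []
        for _ in range(n_raters):
            if rng.random() < a:
                # Accurate branch: the rater reports the true class.
                row.append(true_class)
            else:
                # Inaccurate branch: Class 1 with probability p, whatever
                # the truth is. This can still match the true class, which
                # is a correct rating reached for the wrong reason.
                row.append(1 if rng.random() < p else 0)
        truth.append(true_class)
        ratings.append(row)
    return truth, ratings
```

Under these assumptions the overall share of Class 1 ratings should be close to \(ta + (1-a)p\), which gives a quick sanity check on simulated data.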
Applying this t-a-p model to the Fleiss kappa, one of the best-known rater agreement statistics, shows that the kappa is a special case with the assumption that \(t = p\). I call this the unbiased case: when raters make mistakes and assign classes randomly, those classifications still preserve the marginal class proportions.
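To see what the \(t = p\) assumption does, the large-sample value of the Fleiss kappa implied by the tree can be computed directly. This is a sketch under stated assumptions: inaccurate ratings are Class 1 with probability \(p\), raters are exchangeable, and sample proportions are replaced by their expected values. The function name is hypothetical.

```python
def population_fleiss_kappa(t, a, p):
    """Large-sample Fleiss kappa implied by a three-parameter t-a-p tree."""
    q1 = a + (1 - a) * p           # P(rating = 1 | true class 1)
    q0 = (1 - a) * p               # P(rating = 1 | true class 0)
    c = t * q1 + (1 - t) * q0      # marginal P(rating = 1)
    # Chance that two independent ratings of the same subject agree.
    p_obs = t * (q1**2 + (1 - q1)**2) + (1 - t) * (q0**2 + (1 - q0)**2)
    # Chance agreement computed from the marginal class proportions.
    p_exp = c**2 + (1 - c)**2
    return (p_obs - p_exp) / (1 - p_exp)

# Compare the unbiased case (p equal to t) with two biased cases.
t, a = 0.3, 0.6
for p in (0.3, 0.5, 0.7):
    print(f"p={p}: kappa={population_fleiss_kappa(t, a, p):.4f}, a^2={a**2:.4f}")
```

In the unbiased case the computed kappa coincides with \(a^2\); as \(p\) moves away from \(t\), the two values separate.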
An issue with sensitivity/specificity parameterizations of discrete classification is a symmetry in which prevalence and classification tendencies can flip together while producing equivalent observed predictions. This creates a label-switching symmetry and can lead to non-identifiable parameterizations. The t-a-p parameterization makes these symmetries more explicit and easier to constrain, leading to identifiable solutions under natural assumptions.
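The symmetry can be made concrete with a small calculation. The sketch below assumes exchangeable raters and conditional independence, so that only the count of Class 1 ratings per subject matters; the function names and the example values are hypothetical. It shows that swapping the class labels (prevalence, sensitivity, and specificity all flip together) leaves the distribution of observed counts unchanged, while the accuracy implied by the tree, \(a = \text{sensitivity} + \text{specificity} - 1\), changes sign. Requiring \(a \ge 0\) is then one natural way to rule out the mirrored solution.

```python
from math import comb

def count_pmf(n_raters, prevalence, sensitivity, specificity):
    """P(k of n_raters assign Class 1), for k = 0..n_raters."""
    q1 = sensitivity           # P(rating = 1 | true class 1)
    q0 = 1 - specificity       # P(rating = 1 | true class 0)
    return [
        comb(n_raters, k) * (
            prevalence * q1**k * (1 - q1)**(n_raters - k)
            + (1 - prevalence) * q0**k * (1 - q0)**(n_raters - k)
        )
        for k in range(n_raters + 1)
    ]

def flip(prevalence, sensitivity, specificity):
    """Swap the roles of the two latent classes."""
    return 1 - prevalence, 1 - specificity, 1 - sensitivity

def implied_accuracy(sensitivity, specificity):
    """Accuracy a implied by the tree: P(1 | class 1) - P(1 | class 0)."""
    return sensitivity + specificity - 1

original = (0.3, 0.82, 0.88)   # prevalence, sensitivity, specificity
mirrored = flip(*original)

# The two parameter sets predict the same observable distribution.
for k, (x, y) in enumerate(zip(count_pmf(5, *original), count_pmf(5, *mirrored))):
    print(k, round(x, 6), round(y, 6))
print(implied_accuracy(*original[1:]), implied_accuracy(*mirrored[1:]))
```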
The models can be estimated with direct optimization, Expectation-Maximization (EM) algorithms, or Bayesian MCMC methods. EM methods search for maximum-likelihood estimates, while Bayesian methods characterize posterior uncertainty through sampling. The log likelihood, taken in base 2 and divided by the number of ratings, has a natural interpretation as bits of entropy per rating, which is a convenient index for assessing model fit.
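As an illustration of the EM route, here is a minimal sketch for the three-parameter model. It is not the tapModel code and makes several assumptions: inaccurate ratings are Class 1 with probability \(p\), all raters share the same \(a\) and \(p\), ratings are independent given the true class, and the latent variables are the true class of each subject and whether each rating was accurate. The function name, starting values, and stopping rule are all hypothetical choices.

```python
import math
from collections import Counter

def fit_tap_em(ratings, n_iter=1000, tol=1e-10):
    """EM sketch for t, a, p. ratings is a list of per-subject 0/1 lists.

    Returns (t, a, p, bits_per_rating).
    """
    # Subjects with the same (Class 1 count, rating count) share a posterior.
    patterns = Counter((sum(row), len(row)) for row in ratings)
    n_subjects = sum(patterns.values())
    n_ratings = sum(n * m for (_, n), m in patterns.items())
    n_ones = sum(k * m for (k, _), m in patterns.items())
    eps = 1e-9

    def clip(x):
        return min(max(x, eps), 1 - eps)

    def log_parts(t, a, p, k, n):
        q1 = clip(a + (1 - a) * p)    # P(rating = 1 | true class 1)
        q0 = clip((1 - a) * p)        # P(rating = 1 | true class 0)
        log1 = math.log(clip(t)) + k * math.log(q1) + (n - k) * math.log(1 - q1)
        log0 = math.log(1 - clip(t)) + k * math.log(q0) + (n - k) * math.log(1 - q0)
        return log1, log0, q1, q0

    t, a, p = 0.5, 0.5, 0.5           # arbitrary starting values
    for _ in range(n_iter):
        sum_w = accurate = accurate_ones = 0.0
        for (k, n), m in patterns.items():
            log1, log0, q1, q0 = log_parts(t, a, p, k, n)
            # E-step: posterior probability that the subject is Class 1.
            w = 1 / (1 + math.exp(min(log0 - log1, 700)))
            # E-step: expected accurate ratings among the 1s and the 0s.
            acc1 = k * w * a / q1
            acc0 = (n - k) * (1 - w) * a / (1 - q0)
            sum_w += m * w
            accurate += m * (acc1 + acc0)
            accurate_ones += m * acc1
        # M-step: replace each parameter by its expected proportion.
        new_t = sum_w / n_subjects
        new_a = accurate / n_ratings
        new_p = (n_ones - accurate_ones) / max(n_ratings - accurate, eps)
        change = max(abs(new_t - t), abs(new_a - a), abs(new_p - p))
        t, a, p = new_t, new_a, new_p
        if change < tol:
            break

    # Log likelihood of the ordered ratings, reported as bits per rating.
    log_lik = 0.0
    for (k, n), m in patterns.items():
        log1, log0, _, _ = log_parts(t, a, p, k, n)
        hi = max(log1, log0)
        log_lik += m * (hi + math.log(math.exp(log1 - hi) + math.exp(log0 - hi)))
    return t, a, p, -log_lik / (n_ratings * math.log(2))
```

A model with \(a = 0\) and \(p = 0.5\) costs exactly one bit per rating, so values below one indicate that the fitted model is finding structure in the ratings. The starting point keeps \(a\) inside \([0, 1]\), which in this sketch also avoids the label-switched solution.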
2.2 Surprises
The first aha moment was finding that if the t-a-p model’s \(t=p\) condition is met, the Fleiss kappa is just the square of accuracy. Another unexpected result was that when rater accuracy approaches zero, expected estimates of rater accuracy increase. I called this the Dunning-Kruger effect, as a tongue-in-cheek reference to the social science research on meta-ignorance. At first I thought this was probably just estimation error, since lower accuracy means noisier data. However, I was able to find an exact closed formula for accuracy when \(t=p\), and show that it’s a real effect: when accuracy is low enough, the chances of an overestimate increase.
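The first of these results can be checked by hand. The following is a sketch, assuming that inaccurate ratings are assigned Class 1 with probability \(p\), that raters are exchangeable, and that the sample is large enough to replace proportions with probabilities. The symbols \(q_1\), \(q_0\), \(c\), \(P_o\), and \(P_e\) are introduced here for illustration.

\[
q_1 = a + (1-a)p, \qquad q_0 = (1-a)p, \qquad c = t q_1 + (1-t) q_0 = ta + (1-a)p
\]

Here \(q_1\) and \(q_0\) are the probabilities of a Class 1 rating for true Class 1 and true Class 0 subjects, and \(c\) is the overall share of Class 1 ratings. The probability that two ratings of the same subject agree, and the chance agreement based on the marginal proportions, are

\[
P_o = 1 - 2\left[ t\, q_1 (1-q_1) + (1-t)\, q_0 (1-q_0) \right], \qquad P_e = 1 - 2c(1-c).
\]

Their difference is twice the variance of the per-subject rating probability, and \(q_1 - q_0 = a\), so

\[
P_o - P_e = 2t(1-t)(q_1 - q_0)^2 = 2t(1-t)a^2, \qquad
\kappa = \frac{P_o - P_e}{1 - P_e} = a^2 \, \frac{t(1-t)}{c(1-c)}.
\]

Setting \(t = p\) gives \(c = ta + (1-a)t = t\), so the ratio equals one and \(\kappa = a^2\).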
The PTSD data set made me wonder how much worse classification becomes when we downgrade from 10 raters (impractical in real situations) to two raters with a third as tie-breaker when needed. I thereby rediscovered Condorcet’s Jury Theorem, which deserves to be more widely known.
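The comparison can be sketched with a few lines of arithmetic. This simplified version assumes every rating is independently correct with the same probability \(q\), which is not the same thing as the t-a-p accuracy \(a\), and it breaks ties in even-sized panels with a coin flip. The function names are hypothetical.

```python
from math import comb

def majority_correct(n_raters, q):
    """P(majority vote is correct) when each rating is correct w.p. q."""
    total = 0.0
    for k in range(n_raters + 1):
        prob_k = comb(n_raters, k) * q**k * (1 - q)**(n_raters - k)
        if 2 * k > n_raters:
            total += prob_k           # clear majority for the correct class
        elif 2 * k == n_raters:
            total += 0.5 * prob_k     # tie, broken by a coin flip
    return total

def two_plus_tiebreaker(q):
    """Two raters decide; a third is consulted only if they disagree."""
    both_correct = q * q
    disagree = 2 * q * (1 - q)
    return both_correct + disagree * q

q = 0.8
for n in (1, 3, 5, 10):
    print(n, round(majority_correct(n, q), 4))
print("two plus tie-breaker:", round(two_plus_tiebreaker(q), 4))
```

Two raters with a tie-breaker give the same result as a majority vote of three, since the third rating only matters when the first two disagree. For \(q > 0.5\), adding raters pushes the majority toward the correct class, which is the content of the jury theorem.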
It was surprising to me that not only can the three-parameter t-a-p model be estimated with a simple Expectation-Maximization algorithm, but that the version with individual rater and subject parameters can be as well. It was also surprising that the ad hoc procedure I put together for doing this was, in fact, a provable EM algorithm.
A recurrent theme in the literature was the suggestion that we needed more coefficients to handle the possibilities in a 2x2 confusion matrix (assigned class versus true class). It sounds odd to say that raters are more accurate when classifying Class 1 cases than Class 0 cases, but it makes sense when we take into account the pattern of false positives and false negatives. This approach, with separate accuracies for each class, gives a satisfying explanation of the so-called kappa paradox concerning unbalanced data.
2.3 Todo
This is an ongoing project. I came to it from the social science angle, from the kappa statistics, and have backed into the Bayesian methods. That means I’m not an expert, and some of my terminology is probably weird or wrong. I’ll be working to fix the semantics, along with associated substantive issues, and I welcome input from those with more expertise in these areas.
I’ve substantially revised the goodness-of-fit chapter, but it’s not quite ready to be posted. I find it conceptually difficult to work through the information-theory aspects in conjunction with calibration curves and MCMC posterior distributions. That means the examples all need to be updated with the new metrics.
There are more references to add, including new-to-me articles that extend the MACE-type methods. Semantic information theory seems relevant. Bernard Williams’ Truth and Truthfulness seems to include a description of a t-a-p like logic for knowledge.
The k-class algorithm breaks with hard data sets. I think it may be an easy fix, but there are a lot of things that can go wrong. Extending the categorical case to ordinal data is hard, because ordinal models really want to have a continuous latent variable. That’s an awkward fit with the generative classification models studied here. Signal detection theory is a variation I came across recently, but like Item Response Theory it doesn’t seem to include the possibility of correct classifications for incorrect reasons, which is a key feature of the t-a-p model.
I’d like to add examples of embedding a logistic-type model for the parameters, so that we can see how accuracy changes as a function of subject characteristics, or how subjects vary in their difficulty to rate. Bob Carpenter has a paper that does this.
A future chapter will be devoted to the connection between these rater statistics and causality measures, which is how this project actually started (Eubanks, 2014). The larger issue is to understand the strengths and weaknesses of MCMC estimation for the multi-parameter models, e.g. how pooling should work. As I go along, I’ll work through the collection of data samples I’ve gathered over the years to create a catalog of cases. If you’d like to contribute to any of this, email me.