Welcome to the Kappa Zoo
1 Welcome
The goal of this project is to unify and extend research on classification, in particular the statistics used to assess rater agreement or classification accuracy. For example, a panel of wine judges independently rates a series of vintages. This type of data is widespread in consumer research (product ratings), medicine (diagnostics, trials), education (evaluation), social science (surveys), and machine learning (text annotation). What these have in common is a set of raters (evaluators, annotators) who independently assign categories to subjects, usually from a small number of choices.
The most basic question about such a data set is “is it just random numbers?” This can be answered by comparing the variation of responses within each subject to the variation between subjects. Several research methods can be applied here, such as ANOVA, Item Response Theory, and other continuous-scale approaches, but the focus of this project is methods that directly or indirectly use exact rater agreement as the basis for analysis. There are two distinct research traditions in rater agreement. One, from the social sciences and medicine, has proliferated a number of rater agreement statistics, each with different assumptions. These “kappas” prompted the title of the website, since there is by now a “zoo” of these statistics to choose from. On a parallel track, more associated nowadays with machine learning, model-based classification statistics (e.g. MACE) have been developed; these embody a tree-like set of conditional probabilities that generate (often hierarchical) parameters for the subjects being classified and for the raters who assign the classes.
The body of work here includes statistical derivations, an R package for implementing them, and examples of use. It’s intended to be particularly useful for those who are still scratching their heads about which kappa statistic to use. The answer is: don’t. There’s a better way to approach the problem.
1.1 Getting Started
If you are new to t-a-p models, I recommend starting with Chapter 1, which introduces rater agreement as a classification problem and lays out the philosophical and statistical assumptions. Chapter 2 provides the basic statistical derivations needed to work with t-a-p models. Chapter 3 derives the relationship to some existing rater agreement statistics, for example showing how the Fleiss kappa is a special case of a t-a-p model. Chapter 4 expands the number of parameters and shows how this relates to the “kappa paradox.” Chapter 5 develops a hierarchical model that allows each rater and subject to have individual parameters for accuracy and truth. This model is equivalent to a binary version of MACE, from the machine learning literature (Hovy et al., 2013). Chapter 6 shows how we can assess model fit and the independence assumption. Chapter 7 introduces the tapModel R package, which provides easy estimation of the model parameters along with a number of utilities. Chapter 8 introduces the t-a-p app, an interactive way to use real or simulated data to estimate the model parameters without coding. It is currently under construction, since I’ve made major changes to the tapModel package. An appendix provides more detailed statistical derivations and proofs for some results. There is a growing list of examples that use data sets to illustrate the concepts.
2 Overview
2.1 Unification
It turns out that the legacy kappa statistics are special cases of the probability-tree statistics. We can see this common link by starting with a simple probability tree for binary classification with three parameters: \(t\) is the prevalence of Class 1 (as opposed to Class 0), \(a\) is the rater accuracy, and \(p\) is the probability that an inaccurate rater nevertheless assigns Class 1. This tree approach mirrors the Justified True Belief model of knowledge, and naturally incorporates Gettier problems (getting the right answer for the wrong reason).
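Written out, with \(p\) read as the chance that an inaccurate rater lands on Class 1 (my paraphrase of the tree, not a formal definition), the branches combine as
\[
\Pr(\text{rating} = 1 \mid \text{true class } 1) = a + (1-a)p, \qquad
\Pr(\text{rating} = 1 \mid \text{true class } 0) = (1-a)p,
\]
so the marginal rate of Class 1 ratings is \(ta + (1-a)p\).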
Applying this t-a-p model to the Fleiss kappa, one of the best-known rater agreement statistics, shows that the kappa is a special case with the assumption that \(t = p\), meaning that when raters make mistakes, they at least retain the distribution of the correct class proportions. I call this the unbiased case.
2.2 Surprises
The first aha moment was finding that if the t-a-p model’s \(t=p\) condition is met, the Fleiss kappa is just the square of accuracy. Another unexpected result was that as rater accuracy approaches zero, the expected estimate of rater accuracy increases. I called this the Dunning-Kruger effect, as a tongue-in-cheek reference to the social science research on meta-ignorance. At first I thought this was probably just estimation error, since lower accuracy means noisier data. However, I was able to find an exact closed-form expression for accuracy when \(t=p\), and show that it’s a real effect: when accuracy is low enough, the chances of an overestimate increase. Weird, right?
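As a quick sanity check of the kappa-equals-\(a^2\) claim, here is a small simulation sketch in R. The parameter values, the simulation setup, and the `fleiss_kappa` helper are my own illustration, not part of the tapModel package.

```r
set.seed(1)
N <- 5000          # subjects
m <- 5             # raters per subject
t_true <- 0.3      # prevalence of Class 1
a_true <- 0.7      # rater accuracy
p_true <- t_true   # unbiased case: t = p

true_class <- rbinom(N, 1, t_true)
accurate   <- matrix(rbinom(N * m, 1, a_true), N, m)   # 1 = rater is accurate
guesses    <- matrix(rbinom(N * m, 1, p_true), N, m)   # class picked when inaccurate
ratings    <- accurate * true_class + (1 - accurate) * guesses

# Fleiss kappa from the standard formula, using per-subject category counts
fleiss_kappa <- function(ratings) {
  m <- ncol(ratings)
  counts <- cbind(rowSums(ratings == 0), rowSums(ratings == 1))
  P_i <- (rowSums(counts^2) - m) / (m * (m - 1))   # per-subject pairwise agreement
  p_j <- colSums(counts) / sum(counts)             # marginal category proportions
  (mean(P_i) - sum(p_j^2)) / (1 - sum(p_j^2))
}

fleiss_kappa(ratings)   # close to a_true^2 = 0.49
```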
It was surprising to me that not only can the three parameter t-a-p model be estimated with a simple Expectation-Maximization algorithm, but that the hierarchical version can be as well. It was also surprising that the ad hoc procedure I put together for doing this was, in fact, a provable E-M algorithm.
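To give a flavor of what such an algorithm can look like, here is a minimal E-M sketch for the three-parameter model that treats each subject’s true class as the latent variable. This is my own illustration under the parameter reading above, not the tapModel implementation.

```r
# Minimal E-M sketch for the binary t-a-p model. k = number of Class 1 ratings
# per subject, n = number of ratings per subject. Illustrative only; no
# safeguards for label switching (a_hat < 0) or degenerate inputs.
tap_em <- function(k, n, iters = 200) {
  t_hat <- 0.5; a_hat <- 0.5; p_hat <- 0.5            # crude starting values
  for (i in seq_len(iters)) {
    theta1 <- a_hat + (1 - a_hat) * p_hat             # P(rating = 1 | true class 1)
    theta0 <- (1 - a_hat) * p_hat                     # P(rating = 1 | true class 0)
    # E step: posterior probability that each subject is truly Class 1
    l1 <- t_hat * dbinom(k, n, theta1)
    l0 <- (1 - t_hat) * dbinom(k, n, theta0)
    r  <- l1 / (l1 + l0)
    # M step: update the mixture weight and the two branch probabilities,
    # then convert back to the (a, p) parameterization
    t_hat  <- mean(r)
    theta1 <- sum(r * k) / sum(r * n)
    theta0 <- sum((1 - r) * k) / sum((1 - r) * n)
    a_hat  <- theta1 - theta0
    p_hat  <- theta0 / (1 - a_hat)
  }
  c(t = t_hat, a = a_hat, p = p_hat)
}

# Continuing the simulation above, this recovers values near (0.3, 0.7, 0.3):
tap_em(rowSums(ratings), rep(ncol(ratings), nrow(ratings)))
```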
A recurrent theme in the literature was the suggestion that we need more coefficients to handle the possibilities in a 2x2 confusion matrix (assigned class versus true class). It sounds odd to say that raters are more accurate when classifying Class 1 cases than Class 0 cases, but it makes sense once we take into account the pattern of false positives and false negatives. This approach, with separate accuracies for the two classes, gives a satisfying explanation of the so-called “kappa paradox” concerning unbalanced data.
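In the notation above, one way to write that expansion (my sketch of the idea, not necessarily the exact parameterization used in later chapters) is to give each true class its own accuracy:
\[
\Pr(\text{rating} = 1 \mid \text{true class } 1) = a_1 + (1-a_1)p, \qquad
\Pr(\text{rating} = 1 \mid \text{true class } 0) = (1-a_0)p,
\]
so that false negatives are governed by \(a_1\) and false positives by \(a_0\).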
The models can be fit with E-M algorithms or with Bayesian MCMC methods, both of which are built on the likelihood. It turns out that the average negative log likelihood has a natural interpretation as bits of entropy per rating, which is a nice index for assessing model fit.
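In symbols (my rendering of that index): if \(\ell\) is the maximized log likelihood in natural-log units and \(R\) is the total number of ratings, then
\[
\text{bits per rating} = \frac{-\ell}{R \ln 2},
\]
which for binary ratings can be compared against the one bit per rating of a fair coin.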
2.3 Todo
This is an ongoing project. A future chapter will be devoted to the connection between these rater statistics and causality measures, which is how this project actually started (Eubanks, 2014). The larger issue is to understand the strengths and weaknesses of MCMC estimation for the hierarchical models, e.g. how pooling should work. When all that’s done, I’ll start working through the collection of data samples I’ve gathered over the years to create a catalog of cases. If you’d like to contribute to any of this, email me.