This chapter examines the assumptions and implications of the t-a-p model in detail to build a theoretical foundation for estimation and inference. This work also paves the way to connect the three parameter model to widely-used rater agreement statistics (kappas) and machine learning algorithms.
1 Introduction
Suppose that we have \(N\) subjects to be classified by at least two raters each, and each subject belongs (in truth) to one of two categories, Class 1 or Class 0, abbreviated \(C_1, C_0\). For example, if faculty members review portfolios of student work, the two classes could be “pass” and “fail.” We often use ordered classifications like “A, B, C, D, F” or “Excellent, Good, Fair, Poor,” but these can be converted to binary classifications by grouping the classes. For example, “A, B, C” could be compared to “D, F,” or “Excellent, Good” could be compared to “Fair, Poor.” There are also extensions of the t-a-p model that work directly on ordinal or non-binary categorical scales. The exposition is easiest to understand for the binary case, so we will start there.
We assume that each subject is independently assigned to one of the two classes by each of \(R\) observers (raters). For now, think of \(R\) as fixed, so that there are \(RN\) total ratings, but that condition is relaxed later on, so that the number of ratings per subject can vary, which is common in real data.
Rater-assigned categories are distinguished from true classes in notation by the use of hats to suggest an estimated value. Ratings of Class 1 are denoted \(\widehat{C_1}\), but some of them may be \(C_0\) in truth. This distinction between rated (estimated) values and true values leads to the idea of true/false positives/negatives and the confusion matrix described in the introductory chapter, reproduced here.
Table 1: The t-a-p model’s correspondence to the confusion matrix. Terms in parentheses are inaccurate ratings.

|               | True C1               | True C0                                |
|---------------|-----------------------|----------------------------------------|
| Classified C1 | \(ta + (t\bar{a}p)\)  | \((\bar{t}\bar{a}p)\)                  |
| Classified C0 | \((t\bar{a}\bar{p})\) | \(\bar{t}a + (\bar{t}\bar{a}\bar{p})\) |
The true positive rate of classification in Table 1 can be separated into ratings that are true for a good reason and those that are accidentally true. This chapter expands on that idea to reveal other properties of the t-a-p model and how some existing rater agreement statistics are special cases of it.
We will assume that an assigned rating corresponds to the belief of a rater. This rules out raters who are intentionally being deceptive, for example. Then we will say that a rater makes an accurate assignment of \(\widehat{C_1}\) or \(\widehat{C_0}\) for a subject if
the rater-assigned class is the true class, and
the rater has justification for the choice.
These requirements operationalize the Justified True Belief (JTB) definition of knowledge used by philosophers who study epistemology. Inaccurate ratings are those where one of the two listed conditions fails: either the rater’s belief is false and they choose the wrong category, or they choose the correct category but with an invalid justification. The latter case corresponds loosely to Gettier-type problems, where the chain of reasoning reaches a correct conclusion, but because of flaws in perception the logic isn’t sound. A clear case of failing the justification requirement is raters flipping coins to choose categories. Coin flipping is entirely random, but even good raters have some randomness inherent to their classifications. That randomness is the usual starting point for chance-corrected agreement statistics.
The rating process just described lends itself to a tree diagram that illustrates three variables as conditional probabilities: (1) the true Class 1 rate \(t\), (2) rater accuracy \(a\), and (3) the probability \(p\) of choosing \(\widehat{C_1}\) when rating inaccurately. On the diagram, it’s convenient to use a bar over a variable to denote its complementary probability, e.g. \(\bar{a} = 1 - a\).
Each rating is conditional on a subject’s true classification (Class 1 or Class 0), which will often be unknown, so that we can only observe the rater-assigned categories \(\widehat{C_1}\) and \(\widehat{C_0}\). The rating data will be denoted by \(c_{ij}\), where \(i = 1 \dots N\) indexes the subjects and \(j = 1 \dots R\) indexes the raters, who independently classify subjects as 1 or 0 according to the \(\widehat{C_1}\) or \(\widehat{C_0}\) assignment, respectively. This is convenient, since \(k_i := \sum_j c_{ij}\) is the number of \(\widehat{C_1}\) ratings for subject \(i\), and the average of the classifications made by raters is \(\text{E}[\widehat{C_1}] = \sum c_{ij}/(NR)\).
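As a concrete illustration of the notation, the following sketch builds a small hypothetical ratings matrix (the values are invented for illustration) and computes \(k_i\) and the average rating from it.

```r
# Hypothetical ratings matrix: rows are subjects i, columns are raters j,
# and c_ij = 1 means rater j assigned subject i to Class 1 (a C1-hat rating).
c_ij <- matrix(c(1, 1, 0,
                 0, 0, 0,
                 1, 1, 1,
                 0, 1, 0),
               nrow = 4, byrow = TRUE)

k_i <- rowSums(c_ij)   # number of C1-hat ratings per subject: 2 0 3 1
mean(c_ij)             # average rating sum(c_ij)/(N*R) = 6/12 = 0.5
```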
Example: A wine judge independently and honestly assesses a vintage for excellence. The two categories are \(C_1\) = “excellent” and \(C_0\) = “not excellent.” After judging five wines, the situation might be as shown in the table below.
Sample wine ratings showing normally unknown truth values.

| Wine | Truth         | Accurate? | Classification |
|------|---------------|-----------|----------------|
| 1    | Excellent     | Yes       | Excellent      |
| 2    | Excellent     | No        | Not Excellent  |
| 3    | Not Excellent | No        | Not Excellent  |
| 4    | Not Excellent | Yes       | Not Excellent  |
| 5    | Not Excellent | Yes       | Not Excellent  |
If the judge makes an accurate assessment, the classification recorded matches the true value. But for the third wine, the judge got the correct answer even though the process was flawed and somewhat random (an inaccurate rating). For the second wine, the inaccuracy resulted in the wrong classification being assigned. Those two inaccurate cases are illustrated in the process diagram in the middle, marked Random.
2 Conditional Probabilities
The tree diagram in Figure 1 models the assignments of ratings \(c_{ij}\) over subject \(i\) and rater \(j\), and can be read by multiplying the conditional probabilities on the edges from the top down to find the probability of a given classification being \(c_{ij} = \widehat{C_1}\).
If a subject is not classified accurately, the classification for that rater is assumed to be made at random, with probability \(p\) of choosing \(\widehat{C_1}\) regardless of the true class. So the conditional probability of a \(\widehat{C_1}\) classification when the subject really is \(C_1\) is \(\text{Pr}(\widehat{C_1} | C_1) = a + \bar{a}p\). Similarly, \(\text{Pr}(\widehat{C_1} | C_0) = \bar{a}p\). This model assumes that guess rates for the two classes are the same independent of the true classification. More complex models are introduced later.
The binary (Bernoulli) probability that for a given subject with a given true classification, a rater will assign a \(\widehat{C_1}\) rating is shown in the table below. This comes from reading the tree diagram from the top down, multiplying the branches.
Conditional probabilities of a single rater assigning a \(\widehat{C_1}\) rating to a subject.

|               | True C1           | True C0       |
|---------------|-------------------|---------------|
| Classified C1 | \(a + \bar{a}p\)  | \(\bar{a}p\)  |
We can use these probabilities to find the expected number of ratings of Class 1.
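For instance, under the illustrative parameter values used throughout this chapter (\(t = .3\), \(a = .7\), \(p = .2\), which are assumed values, not estimates), the marginal probability of a single \(\widehat{C_1}\) rating and the expected number of \(\widehat{C_1}\) ratings in a data set can be computed directly, as in the sketch below.

```r
# Marginal probability of a single C1-hat rating under the t-a-p model,
# weighting the conditional probabilities by the true class proportions.
t <- .3; a <- .7; p <- .2               # illustrative parameter values
pr_c1hat <- t * (a + (1 - a) * p) + (1 - t) * (1 - a) * p
pr_c1hat                                 # simplifies to t*a + (1-a)*p = 0.27

# Expected number of C1-hat ratings with N subjects and R raters each
N <- 100; R <- 5
N * R * pr_c1hat                         # 135 expected Class 1 ratings out of 500
```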
3 Binomial Mixtures
The rating process posed in the t-a-p model is illustrated with the tree diagram and table of example ratings above. But both of those entail the use of the hidden true classification value for each subject. There are cases where that can be known, but in general it is not. What we usually have to work with is a table of ratings, from which we must infer the hidden variables like rater accuracy. The collection of ratings under the t-a-p assumptions follows a well-studied probability distribution called a binomial mixture.
```r
# set the parameters
N_r = 5    # number of raters for each subject
N_s = 100  # number of subjects (not used here)
t = .3     # fraction of subjects that are in fact class 1
a = .7     # probability of a rater rating a subject accurately
p = .2     # probability of a rater rating a subject as class 1 when rating inaccurately

# find the conditional probabilities for each class
# probabilities of a class 1 rating for a class 0 subject
c0_probs = dbinom(0:N_r, N_r, prob = (1-a)*p)
# probabilities of a class 1 rating for a class 1 subject
c1_probs = dbinom(0:N_r, N_r, prob = a + (1-a)*p)

# create the mixture with t as the mixing parameter
mixture = c0_probs * (1-t) + c1_probs * t

# Plot the conditional probabilities
plot(0:N_r, c0_probs, type="b", col="blue", pch=16,
     ylim=c(0, max(c(c0_probs, c1_probs, mixture))),
     ylab="Probability", xlab="Number of class 1 ratings for a given subject",
     main="Binomial Mixture Plot")

# Add the second component (c1_probs)
lines(0:N_r, c1_probs, type="b", col="red", pch=16)

# Add the mixture as a black dashed line
lines(0:N_r, mixture, type="b", col="black", lty=2, pch=16)

# Add a legend
legend("topright", legend=c("Class 0 Probabilities", "Class 1 Probabilities", "Mixture"),
       col=c("blue", "red", "black"), lty=c(1, 1, 2), pch=16)
```
Figure 2 shows an example of how probabilities combine to create the mixture. Given the three t-a-p parameters plus the number of raters per subject, we apply the binomial probability density function using the probabilities found at the end of the previous section. The mixture is created by weighting these two distributions by their frequency in the sample space. In this case, the parameters are \(t = .3\), \(a = .7\), and \(p = .2\).
For true Class 1 cases, the probability of a rater assigning a \(\widehat{C_1}\) rating is \(a + \bar{a}p\), and for true Class 0 cases it is \(\bar{a}p\) (see the table at the end of the last section). Those are probabilities for a single rating. If we have \(R\) raters, then anywhere between zero and \(R\) of them could assign a \(\widehat{C_1}\) rating to a given subject. The binomial distribution gives the probability for each of those possible count outcomes. For true Class 1 subjects the probability of \(k\) raters assigning a \(\widehat{C_1}\) rating is
\[
\begin{aligned}
Pr(k | C_1) &= \binom{R}{k} (a + \bar{a}p)^k (1 - a - \bar{a}p)^{R - k} \\
&= \binom{R}{k} (a + \bar{a}p)^k (\bar{a} - \bar{a}p)^{R - k} \\
&= \binom{R}{k} (a + \bar{a}p)^k (\bar{a} \bar{p})^{R - k}
\end{aligned}
\] That distribution is represented by the red line in Figure 2. Notice that the most likely outcome is that four of the five raters assign a \(\widehat{C_1}\) rating. The reason that’s not five is that the parameter \(p = .2\) means that for inaccurate ratings, the “guess” is much more likely to assign \(\widehat{C_0}\) than \(\widehat{C_1}\). The effect is to deflate the number of \(\widehat{C_1}\) ratings for true Class 1 subjects.
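A quick way to confirm that claim is to compute the component distribution with dbinom, as in the sketch below (the parameter values are the illustrative ones from the figure).

```r
# The per-rating probability of a C1-hat given true Class 1 is a + (1-a)p = 0.76.
R <- 5; a <- .7; p <- .2
pr_k_given_c1 <- dbinom(0:R, R, prob = a + (1 - a) * p)
round(pr_k_given_c1, 3)
which.max(pr_k_given_c1) - 1   # modal count is 4 (subtract 1 because k starts at 0)
```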
For true Class 0 cases (the blue line in the plot), it is

\[
Pr(k | C_0) = \binom{R}{k} (\bar{a}p)^k (1 - \bar{a}p)^{R - k}.
\]
These two probability distributions are shown in Figure 2, with the given parameters applied. The code is included so that you can try variations on your own.
But there are not necessarily the same number of true Class 1 and Class 0 cases. The fraction of Class 1 cases is assumed to be \(t\). The mixture of the two is

\[
Pr(k) = t \, Pr(k | C_1) + \bar{t} \, Pr(k | C_0) = t \binom{R}{k} (a + \bar{a}p)^k (\bar{a}\bar{p})^{R - k} + \bar{t} \binom{R}{k} (\bar{a}p)^k (1 - \bar{a}p)^{R - k}.
\]
This is the mixture distribution that is assumed to represent the count data of Class 1 ratings per subject. To proceed with an investigation of the t-a-p model, we first count up the number of \(\widehat{C_1}\) ratings for each subject. If there are the same number of raters for each subject, these counts will form a histogram that corresponds to the black dashed line in the figure, for some set of t-a-p parameters. The job then is to find out what those parameters are.
For a real data set, the red and blue plots in Figure 2 are assumed to exist, but are not directly accessible to us. Instead we can see something like the mixture (black dashed line). But even that isn’t exact, because it is subject to sampling error: the histogram of counts won’t exactly correspond to the ideal probabilities. The larger the number of subjects rated, the more closely the empirical proportions will, in theory, match the ideal mixture distribution.
For general information on binomial mixtures of discrete data see Agresti (2003) chapter 14.
4 Simulation
The interactive app provided with the tapModel R package allows you to specify t-a-p parameters and generate a data set from them. Once the package is installed, you can run the app with tapModel::launchApp(). Navigate to the Simulate Data tab.
Figure 3 shows two count histograms with identical parameters except for the number of subjects (20 versus 1000). The histograms are the result of applying the t-a-p binomial mixture distribution with five raters on each subject and parameter set \((t = .3, a = .7, p = .2)\). Notice that the smaller sample size on the left (plot A) doesn’t match the distribution line as well as the one on the right. This is the effect of random sampling. The smaller the sample, the more likely it is that the empirical counts don’t look like the distribution.
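If you prefer to work in R directly rather than in the app, the following is a minimal sketch of the same generative process, assuming the tree-diagram model and the illustrative parameter values above (the variable names are mine, not the package’s).

```r
# Simulate t-a-p ratings directly and compare empirical counts to the mixture.
set.seed(42)
N <- 1000; R <- 5                  # subjects and raters per subject
t <- .3; a <- .7; p <- .2          # assumed parameter values

true_class <- rbinom(N, 1, t)      # hidden true classes
ratings <- sapply(1:R, function(j) {
  accurate <- rbinom(N, 1, a)      # is rater j accurate for each subject?
  guess    <- rbinom(N, 1, p)      # the random choice used when inaccurate
  ifelse(accurate == 1, true_class, guess)
})

k_i <- rowSums(ratings)            # C1-hat counts per subject

# Theoretical mixture for comparison
mixture <- t * dbinom(0:R, R, a + (1 - a) * p) +
  (1 - t) * dbinom(0:R, R, (1 - a) * p)
rbind(empirical   = as.numeric(table(factor(k_i, levels = 0:R))) / N,
      theoretical = round(mixture, 3))
```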
This leaves us with two problems: how do we estimate the parameters, and how much should we trust the results? These are classical problems from statistics.
The accuracy rate \(a\) will affect the subject distributions. If \(a = 0\) the ratings will be distributed as \(\text{Bernoulli}(p)\), independently of the subjects being rated. If \(a=1\), then all raters agree on the true value for each subject. Therefore the way we can reconstruct \(a\) from data is through the distribution of the within-subject ratings. The method used here can be seen as a latent class model with binomial mixture distributions. For a nice discussion of these ideas in practice see Grilli et al. (2015), which helpfully notes that binomial mixtures are statistically identifiable if the number of cases exceeds a low threshold (McLachlan & Peel, 2000).
We would like to know the true proportion \(t\) of the subjects belonging to \(C_1\) regardless of how they were rated, rater accuracy \(a\), and the proportion \(p\) of inaccurate assignments that are \(\widehat{C_1}\). That goal describes the general model illustrated in the following section.
5 Fitting the Model
The first question about the model illustrated in Figure 1 is whether it is computationally useful. Using known parameter values for \(t, a, p\) to generate simulated ratings, can we then recover the parameters from the data? The answer is yes, with some provisos. Given a data set \(c_{ij}\), we can fit a general model to estimate the three parameters \(t\), \(a\), and \(p\) using maximum likelihood to fit a binomial mixture model. The log likelihood function for the binomial mixture described by Figure 1 with \(N\) subjects and \(R_i\) raters for subject \(i\) is

\[
\begin{aligned}
\log L(t, a, p) &= \sum_{i=1}^{N} \log \left[ t \binom{R_i}{k_i} (a + \bar{a}p)^{k_i} (\bar{a}\bar{p})^{R_i - k_i} + \bar{t} \binom{R_i}{k_i} (\bar{a}p)^{k_i} (1 - \bar{a}p)^{R_i - k_i} \right] \\
&= \sum_{i=1}^{N} \log \left[ t \, \text{Bin}(k_i; R_i, a + \bar{a}p) + \bar{t} \, \text{Bin}(k_i; R_i, \bar{a}p) \right] \\
&= \sum_{u} n_u \log \left[ t \, \text{Bin}(k_u; R_u, a + \bar{a}p) + \bar{t} \, \text{Bin}(k_u; R_u, \bar{a}p) \right]
\end{aligned}
\tag{2}\]
where \(k_i=\sum_{j}c_{ij}\) is the number of Class 1 ratings for subject \(i\) and \(\text{Bin}(k; R, \theta)\) denotes the binomial probability of \(k\) successes in \(R\) trials with success probability \(\theta\). The sum over the logs is justified by the assumption that ratings are independent (i.e. multiplying probabilities). The \(t\) and \(\bar{t}\) terms at the top of Equation 2 are the mixing proportions for the two classes.
The second equation in Equation 2 just rewrites the log-likelihood function more compactly to show the binomial mixture structure. The last equation is a more efficient way to calculate the likelihood, based on the observation that there are a limited number of pairs \((R_i,k_i)\) that can occur in the data. For example, if there are four raters for each subject, then the possible pairs are \((4,0), (4,1), \dots, (4,4)\), five unique combinations regardless of the number of subjects \(N\). The likelihood is a sum over the unique pairs, multiplying each log-likelihood for that pair by the number of times it occurs in the data, denoted here by \(n_u\). In other words, the rating combination \((R_u, k_u)\) occurs \(n_u\) times. Using that formulation speeds up estimation algorithms because it reduces the number of calculations needed to fit the model.
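As an illustration (not the tapModel package’s implementation), the sketch below simulates counts from the mixture with assumed parameters and maximizes the unique-pair form of the log likelihood with R’s optim(); the function and variable names are mine.

```r
# Simulate counts from the binomial mixture with assumed parameters (t=.3, a=.7, p=.2)
set.seed(3)
N <- 300; R <- 5
true_class <- rbinom(N, 1, .3)
theta_i <- ifelse(true_class == 1, .7 + .3 * .2, .3 * .2)   # Pr(C1-hat) per subject
k_i <- rbinom(N, R, theta_i)

# Negative log likelihood in the unique-pair form (R fixed for all subjects here)
tap_negloglik <- function(par, k, R) {
  t <- par[1]; a <- par[2]; p <- par[3]
  theta1 <- a + (1 - a) * p            # Pr(C1-hat | C1)
  theta0 <- (1 - a) * p                # Pr(C1-hat | C0)
  n_u <- table(k)                      # counts of each unique k value
  k_u <- as.numeric(names(n_u))
  -sum(n_u * log(t * dbinom(k_u, R, theta1) +
                 (1 - t) * dbinom(k_u, R, theta0)))
}

fit <- optim(c(.5, .5, .5), tap_negloglik, k = k_i, R = R,
             method = "L-BFGS-B", lower = 1e-4, upper = 1 - 1e-4)
round(fit$par, 2)                      # estimates of (t, a, p)
```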
It is straightforward to implement the likelihood in the Bayesian programming language Stan (Carpenter et al., 2017), using uniform \((0,1)\) priors for the three parameters (see the discussion section at the end of that paper to access the source code).
To test the computational feasibility of this method, ratings were simulated using a range of values of \(t\), \(a\), and \(p\). The 729 trials each simulated 300 subjects with five raters each, using all combinations of values ranging from .1 to .9 in increments of .1 for each of \(t\), \(a\), and \(p\). The Stan engine uses a Markov chain Monte Carlo (MCMC) algorithm to gather representative samples from the joint probability density of the three parameters. Each run used 1,000 iterations (800 after warm-up) with four chains each.
The accuracy measure \(a\) and the Class 1 “guess rate” \(p\) are stable across scenarios in Figure 4, but the estimated true fraction of Class 1 cases \(t\) is sensitive to values of \(a\) near zero. To see this, substitute \(a = 0\) into the likelihood function to get

\[
\log L(t, 0, p) = \sum_{i=1}^{N} \log \left[ t \binom{R_i}{k_i} p^{k_i} \bar{p}^{R_i - k_i} + \bar{t} \binom{R_i}{k_i} p^{k_i} \bar{p}^{R_i - k_i} \right] = \sum_{i=1}^{N} \log \left[ \binom{R_i}{k_i} p^{k_i} \bar{p}^{R_i - k_i} \right].
\]
Since the two terms inside the first brackets are identical and \(t + \bar{t} = 1\), the \(t\) drops out of the formula. So \(t\) is under-determined when \(a = 0\), and we should expect poor behavior as \(a\) nears zero. This is intuitive: if the raters are only guessing, they give us no information about the true Class 1 rate. If the data in Figure 4 are filtered to \(a > .2\), the estimates of \(t\) greatly improve. Aside from extreme values of \(a\) affecting the estimation of \(t\), a visual inspection of the scatter plots of the parameter estimates shows no correlations.
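The flatness is easy to verify numerically; the sketch below evaluates the log likelihood at several values of \(t\) with \(a = 0\) and arbitrary made-up counts.

```r
# With a = 0 both mixture components are Binomial(R, p), so the likelihood
# is constant in t. Check with arbitrary counts:
R <- 5; p <- .2
k <- c(0, 1, 1, 2, 3)                        # made-up counts for illustration
loglik_at_t <- sapply(seq(.1, .9, by = .2), function(t) {
  sum(log(t * dbinom(k, R, p) + (1 - t) * dbinom(k, R, p)))
})
round(loglik_at_t, 6)                        # the same value for every t
```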
The estimates in Figure 4 are taken from averages of the posterior parameter distributions, which is convenient, but sometimes hides the uncertainty in the estimates because of the limited range of possible values on [0,1]. When making inferences from a real data set, it’s useful to look at the full posterior distributions of the parameters to see if they are multi-modal or have other complications.
6 Identifiability
Binomial mixtures are not always identifiable, meaning that the data don’t contain enough information to uniquely determine the parameters. If we fit two binomial distributions to the data with complete flexibility, each will have a “success rate” for the underlying Bernoulli trial, which determines the distribution’s mean and variance. There is also a weight parameter that defines the mixture; that determines the heights of the respective histograms of the two binomials. A simple data set would comprise the counts of Class 1 ratings for each subject. If the raters have reasonably high accuracy, a histogram of ratings will have a peak on the right, where a large fraction of raters agree on the true Class 1 subjects. For example, an average of 75 out of 100 raters might agree that a true Class 1 subject is Class 1, while true Class 0 subjects might average 25 Class 1 ratings out of 100 (erroneous ratings). The chapter on the interactive application shows how to simulate data like this.
If we ask a generic optimization algorithm for the statistics that describe the two binomials, we can get two valid answers, e.g. distribution means of \(\mu_0 = 25\) and \(\mu_1= 75\) or \(\mu_0 = 75\) and \(\mu_1 = 25\). This is called “label-swapping,” because the solution doesn’t care which of the classes is which. One ad hoc solution is to include a restriction that the success rate for one of the distributions is greater than the other, but this is computationally awkward. The t-a-p model avoids this problem because the accuracy \(a\) is the non-negative difference between the success rates of the two binomials. In the example above, \(a = .5 = (75-25)/100\). Since \(a\) is bounded by 0 and 1, the model is generally identifiable.
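To make the arithmetic concrete, here is the 75/25 example worked in R; it simply applies the relationships \(\theta_1 - \theta_0 = a\) and \(\theta_0 = \bar{a}p\), where \(\theta_1\) and \(\theta_0\) denote the two per-rating success rates.

```r
# Recover a and p from the two component success rates in the 100-rater example.
theta1 <- 75 / 100            # C1-hat rate for true Class 1 subjects
theta0 <- 25 / 100            # C1-hat rate for true Class 0 subjects
a <- abs(theta1 - theta0)     # a = .5, the same regardless of label order
p <- theta0 / (1 - a)         # theta0 = (1 - a) * p, so p = .5 here
c(a = a, p = p)
```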
Non-identifiable cases do occur when one or more of the t-a-p parameters is zero or one. If \(a = 1\), it doesn’t matter what \(p\) is, for example. This can also cause problems when a parameter is close to one or zero. For an example of that, see Chapter 4: The Kappa Paradox. These issues can be investigated using Bayesian MCMC estimation, which provides a full posterior distribution for the parameters. The distribution may be multi-modal if there are two competing solutions. This gives us a way to detect degenerate solutions and look for a different model if desired. These degenerate cases aside, the t-a-p parameterization of the binomial mixture has an advantage over generic optimization.
We can add more parameters to the basic t-a-p model. For example, we might split accuracy into Class 1 accuracy and Class 0 accuracy, as is illustrated in Chapter 4. These additions can lead to multiple solutions, which can often be detected from the MCMC distributions of parameters. This topic needs more development.
7 Overdispersion
The basic t-a-p model assumes fixed averages of the three parameters over raters and subjects. The most sensible of these assumptions is that there is a single value for \(t\) that represents the fraction of Class 1 cases. That leaves two parameters that are certainly oversimplified in the t-a-p model, so that counts of Class 1 ratings per subject are likely to have more variance than a binomial model would predict. This is due to anticipated variance in rater ability and in the difficulty of rating subjects, resulting in overdispersion. A general approach to this problem is to allow each rater to have a different accuracy rate \(a_j\) and each subject to have a different guessing rate \(p_i\). This is a hierarchical model, which is described in Chapter 5: Hierarchical Models. Other approaches include using a beta-binomial prior for parameters (Williams, 1975); also see Ascari & Migliorati (2021). See Chapter 5 for more on these topics, particularly the discussion under the heading “Truth Parameters.”
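As a rough illustration of the effect (a sketch with arbitrary assumed values, not the hierarchical model from Chapter 5), the code below compares within-class count variance under a single fixed accuracy to the variance when each subject’s effective accuracy is drawn from a beta distribution with the same mean.

```r
# Overdispersion sketch, restricted to true Class 1 subjects for simplicity.
set.seed(1)
N <- 5000; R <- 5; p <- .2

counts_for <- function(a_i) {            # a_i: per-subject accuracy
  theta_i <- a_i + (1 - a_i) * p         # per-rating Pr(C1-hat | C1) for each subject
  rbinom(N, R, theta_i)
}

k_fixed  <- counts_for(rep(.7, N))       # single fixed accuracy a = .7
k_varied <- counts_for(rbeta(N, 7, 3))   # subject-level accuracies, mean .7
c(var_fixed = var(k_fixed), var_varied = var(k_varied))  # the second is larger
```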
8 Other Properties
As is shown in Appendix A, rater accuracy \(a\) is proportional to the correlation between the true and assigned classifications. If \(T\) is the true classification and \(C\) is the assigned classification, then
\[ \text{Cor}(T, C) = a\frac{\sigma_T}{\sigma_C} \]
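A quick simulation (with assumed parameter values) confirms the relationship numerically; see Appendix A for the derivation.

```r
# Simulate true and assigned classifications and compare the correlation
# to a * sigma_T / sigma_C.
set.seed(7)
n <- 1e5; t <- .3; a <- .7; p <- .2
T_true   <- rbinom(n, 1, t)                                  # true classifications
accurate <- rbinom(n, 1, a)
C_rated  <- ifelse(accurate == 1, T_true, rbinom(n, 1, p))   # assigned classifications

cor(T_true, C_rated)              # simulated correlation
a * sd(T_true) / sd(C_rated)      # the formula's prediction
```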
Related properties can be found in the appendix.
References
Agresti, A. (2003). Categorical data analysis (Vol. 482). John Wiley & Sons.
Ascari, R., & Migliorati, S. (2021). A new regression model for overdispersed binomial data accounting for outliers and an excess of zeros. Statistics in Medicine, 40(17), 3895–3914.
Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., & Riddell, A. (2017). Stan: A probabilistic programming language. Journal of Statistical Software, 76(1).
Grilli, L., Rampichini, C., & Varriale, R. (2015). Binomial mixture modeling of university credits. Communications in Statistics - Theory and Methods, 44(22), 4866–4879. https://doi.org/10.1080/03610926.2013.804565
McLachlan, G., & Peel, D. (2000). Finite mixture models. John Wiley & Sons.
Williams, D. (1975). 394: The analysis of binary responses from toxicological experiments involving reproduction and teratogenicity. Biometrics, 31(4), 949–952.