Chapter 9: Multinomial Kappa
1 Introduction
So far I’ve considered binary ratings, with the exception of ML algorithms like MACE, which are designed for \(K\) categories. I downgraded those to binary during that discussion for simplicity. Let’s now revisit the multi-category case, starting with the Fleiss kappa. You’ll recall that with binary ratings, for unbiased raters (\(t=p\)), we have \(\kappa = a^2\). The Fleiss kappa is defined for multinomial ratings, so does the same relation hold there?
To get there, we have to reimagine the t-a-p model for \(K\) categories with \(K>2\). Instead of a binary latent truth \(T_i\), we can allow each subject’s \(T_i\) to take on values \(1 \dots K\) to indicate the true class. The corresponding probability \(t\) now becomes a probability distribution over the \(K\) values, so I’ll switch to boldface to indicate a vector. Accuracy need not be vectorized in the simplest case, but the random assignment rate \(p\) also needs to become a distribution.
With vectorized \(\boldsymbol{t}\) and \(\boldsymbol{p}\) the convention \(\bar{p} = 1-p\) no longer makes sense. Instead, I’ll use matrix multiplication, assuming column vectors. So the probability of two random classifications matching is \(\boldsymbol{p}^t\boldsymbol{p} = \sum_k{p_k^2}\) where \(p_k\) is the \(k\)th element of the vector \(\boldsymbol{p}\), and the \(t\) superscript means transpose from column to row vector to make a dot product.
Recall that kappas are based on match rates using a formula that’s agnostic to \(K\). For two ratings to match, two raters \(j_1,j_2\) of the same subject \(i\) must assign the same class. In other words, the rating random variables must agree: \(C_{ij_1} = C_{ij_2}\), where each rating may now be an integer from 1 to \(K\) rather than a binary 0 or 1. A generic formula that includes the most common kappas is
\[ \kappa = \frac{m_o - m_c}{1 - m_c}, \] where \(m_o\) is the observed proportion of agreements and \(m_c\) is the expected proportion of agreements under chance. The assumption about \(m_c\) is a defining feature of the various kappa statistics.
Consider two raters classifying an observation into \(K\) categories. In the t-a-p model we can express the expected value of the observed match rate \(m_o\) as the sum of three kinds of agreement: (1) \(m_a\) is when both raters are accurate (and hence agree), (2) \(m_i\) is when both raters are inaccurate (guessing) and agree, and (3) \(m_x\) is the mixed case when one rater is accurate and the other is inaccurate but they agree. The last two of these have expressions that include the random match rate \(m_r = \boldsymbol{p}^t\boldsymbol{p}\). Following that thinking, we have the following expected rates:
\[ \begin{aligned} m_a &= a^2 & \text{(both accurate)}\\ m_r &= \boldsymbol{p}^t\boldsymbol{p} & \text{(random ratings)}\\ m_i &= \bar{a}^2m_r = a^2m_r - 2am_r + m_r &\text{(both inaccurate)}\\ m_x &= 2a\bar{a}\boldsymbol{t}^t\boldsymbol{p} &\text{(mixed accurate and inaccurate)}\\ m_o &= m_a + m_i + m_x &\text{(observed match rate)}\\ &= a^2+a^2m_r + m_r - 2am_r + 2a\bar{a}\boldsymbol{t}^t\boldsymbol{p}\\ \end{aligned} \tag{1}\]
For \(m_a\), both ratings must be accurate, in which case they automatically agree. For \(m_i\), both must be inaccurate (probability \(\bar{a}^2\)) and then match randomly (probability \(m_r\)). For \(m_x\), one rater must be accurate and the other inaccurate, in which case they agree only if the inaccurate rater happens to guess the category that the accurate rater correctly reports, which occurs with probability \(\boldsymbol{t}^t\boldsymbol{p}\).
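To make the decomposition concrete, here is a minimal numeric check of Equation 1 in Python. The values of \(a\), \(\boldsymbol{t}\), and \(\boldsymbol{p}\) are illustrative (not from the text); the closed-form \(m_o\) is compared against a Monte Carlo simulation of rater pairs under the multinomial t-a-p model.

```python
# Minimal check of Equation (1), using illustrative parameter values.
import numpy as np

rng = np.random.default_rng(0)

a = 0.7                              # rater accuracy
t = np.array([0.5, 0.3, 0.2])        # true class distribution
p = np.array([0.4, 0.4, 0.2])        # random-guess distribution

# Closed-form components from Equation (1)
m_r = p @ p                          # chance that two random guesses match
m_a = a**2                           # both raters accurate
m_i = (1 - a)**2 * m_r               # both inaccurate, matching by chance
m_x = 2 * a * (1 - a) * (t @ p)      # one accurate, one guessing the true class
m_o = m_a + m_i + m_x

# Monte Carlo check: simulate pairs of independent raters
n = 200_000
truth = rng.choice(3, size=n, p=t)

def rate(truth):
    accurate = rng.random(n) < a
    guesses = rng.choice(3, size=n, p=p)
    return np.where(accurate, truth, guesses)

matches = (rate(truth) == rate(truth)).mean()
print(f"closed form m_o = {m_o:.4f}, simulated = {matches:.4f}")
```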
2 Unbiased Raters: the Fleiss Kappa
Recall that in terms of the t-a-p model, the Fleiss kappa assumes that the by-chance match rate in the kappa formula comes from the observed rate of each category. Random matches are assumed to have the same distribution as the overall ratings distribution. In the vectorized t-a-p model, this amounts to finding \(\boldsymbol{c} = \boldsymbol{t}a + \bar{a}\boldsymbol{p}\), and the by-chance match rate in the kappa formula is \(m_c = \boldsymbol{c}^t\boldsymbol{c}\). Under the unbiased rater assumption, where \(\boldsymbol{p} = \boldsymbol{t}\), a bit of algebra gives \(m_c = \boldsymbol{p}^t \boldsymbol{p} = m_r\).
With that result we can tackle the kappa,
\[ \begin{aligned}\kappa_f &= \frac{m_o - m_c}{1 - m_c} \\ &= \frac{a^2+a^2m_r + m_r - 2am_r + 2a\bar{a}m_r - m_r}{1 - m_r} \\ &= \frac{a^2(1 - m_r)}{1 - m_r} \\ &= a^2. \end{aligned} \]
With vectorized truth and randomness, the math looks just like the binary case, and we get the same result: when the \(\boldsymbol{p} = \boldsymbol{t}\) condition is met, kappa is the square of accuracy. This is a nice result, because it gives us a direct way to check multinomial solvers that use unbiased simulated ratings with multiple categories.
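As one such check, the sketch below simulates unbiased raters (\(\boldsymbol{p} = \boldsymbol{t}\)) under the multinomial t-a-p model and computes the Fleiss kappa directly from the per-subject category counts. The parameter values and sample sizes are illustrative.

```python
# Simulation check that Fleiss kappa is approximately a^2 when p = t.
import numpy as np

rng = np.random.default_rng(1)

a = 0.6
t = np.array([0.5, 0.3, 0.2])   # with unbiased raters, p = t
K, n_subjects, n_raters = len(t), 2000, 5

# Simulate ratings under the multinomial t-a-p model with p = t
truth = rng.choice(K, size=n_subjects, p=t)
accurate = rng.random((n_subjects, n_raters)) < a
guesses = rng.choice(K, size=(n_subjects, n_raters), p=t)
ratings = np.where(accurate, truth[:, None], guesses)

# Fleiss kappa from the category counts per subject
counts = np.stack([(ratings == k).sum(axis=1) for k in range(K)], axis=1)
P_i = ((counts**2).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
P_bar = P_i.mean()                                  # observed agreement
p_k = counts.sum(axis=0) / (n_subjects * n_raters)  # marginal category rates
P_e = (p_k**2).sum()                                # chance agreement
kappa = (P_bar - P_e) / (1 - P_e)

print(f"Fleiss kappa = {kappa:.3f}, a^2 = {a**2:.3f}")
```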
3 Likelihood
The likelihood function for the multinomial case is similar to the binary case. The difference is in the sum, which is over each class. We have
\[ \begin{aligned} \text{Pr(ratings; parameters)} &= \prod_{i=1}^{N_s} \sum_{k_T = 1}^Kt_{k_T}^{(i)}\prod_{j}\pi_{k_Tk_{ij}}^{(j)} \, , \end{aligned} \]
where the leftmost product is over subjects and the sum is over the \(K\) possible true classes, pulling subject \(i\)’s truth estimates \(\boldsymbol{t}^{(i)}\) and extracting the elements one by one. The innermost product is over the subject’s ratings, indexed by the rater \(j\), with \(k_{ij}\) the assigned rating and \(k_T\) the candidate true class. The rater likelihood is
\[ \begin{aligned} \pi_{k_Tk_{ij}}^{(j)} &= \text{Pr(assigned rating = }k_{ij}\text{; true class = }k_T) \\ &= \bar{a}_jp_{k_{ij}}^{(j)} + a_jI(k_{ij} = k_T). \end{aligned} \] Here, \(I\) is an indicator function that is one when the assigned rating equals the candidate true class, and zero otherwise.
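A minimal sketch of this rater likelihood, assuming each rater \(j\) has a scalar accuracy \(a_j\) and a guessing distribution \(\boldsymbol{p}^{(j)}\) over the \(K\) classes; the function name and example values are mine, for illustration only.

```python
# Sketch of the rater likelihood pi for the multinomial t-a-p model.
import numpy as np

def rater_likelihood(k_assigned, k_true, a_j, p_j):
    """Pr(assigned rating = k_assigned | true class = k_true) for rater j."""
    return a_j * (k_assigned == k_true) + (1 - a_j) * p_j[k_assigned]

# Example: K = 3, a fairly accurate rater with a mild preference for class 0
p_j = np.array([0.5, 0.3, 0.2])
print(rater_likelihood(1, 1, 0.8, p_j))   # accurate or a lucky guess: 0.8 + 0.2*0.3
print(rater_likelihood(2, 1, 0.8, p_j))   # must be an inaccurate guess: 0.2*0.2
```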
4 E-M Algorithm
The E-step of the multinomial case looks like the binary case, just with more classes. The likelihood contribution for a single subject \(i\) is
\[ \begin{aligned} \text{Pr(ratings; parameters, subject }i) &= \sum_{k_T = 1}^Kt_{k_T}^{(i)}\prod_{j}\pi_{k_Tk_{ij}}^{(j)} \, , \end{aligned} \] from which we extract new estimates of class probabilities via
\[ \begin{aligned} t_{k_T}^* &\approx \text{Pr(True class = }k_T\text{; parameters, ratings)} \\ &=\frac{t_{k_T}\prod_{j}\pi_{{k_T}k_{ij}}^{(j)}}{ \sum_{k' = 1}^Kt_{k'}\prod_{j}\pi_{{k'}k_{ij}}^{(j)} } , \end{aligned} \] where it is understood that we’re referring to subject \(i\), so I have removed those annotations. The \(t_{k_T}^*\) result on the left is the updated probability that subject \(i\) is class \(k_T \in \{1 \dots K\}\).
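For concreteness, here is a sketch of this E-step update for a single subject. It assumes, for simplicity, one shared \(a\) and \(\boldsymbol{p}\) across raters (per-rater \(a_j\) and \(\boldsymbol{p}^{(j)}\) would slot in the same way); the function name and example values are illustrative.

```python
# Sketch of the E-step posterior over a subject's true class.
import numpy as np

def e_step_subject(ratings_i, t, a, p):
    """Posterior over the true class of subject i given its ratings."""
    K = len(t)
    # pi[k_T, k] = Pr(assigned rating = k | true class = k_T)
    pi = (1 - a) * np.tile(p, (K, 1)) + a * np.eye(K)
    # Numerator: t_{k_T} * prod_j pi[k_T, k_ij] for each candidate true class k_T
    numer = t * pi[:, ratings_i].prod(axis=1)
    return numer / numer.sum()

# Example: three raters say (0, 0, 2); the posterior should favor class 0
t = np.array([0.5, 0.3, 0.2])
p = np.array([0.4, 0.4, 0.2])
print(e_step_subject(np.array([0, 0, 2]), t, a=0.7, p=p))
```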
To derive the M-step of the algorithm we need to expand the \(2\times 2\) confusion matrix to \(K\times K\). The matrix consists of probabilities, where the \(i,j\) entry contains the joint probability \(\text{Pr(True class = }i\text{ and Rating = }j)\). Confusion matrices are sometimes row-normalized, but this one is not; it’s the intersection of true class values (rows) with rated classes (columns). The empirical matrix comprises the proportions of ratings that fall within each of these groups, which enables us to compare a model with a data set. I’ll illustrate with the \(K=3\) case, where the model’s expected value of these proportions is
\[ E[C] = \begin{bmatrix} t_1(a + \bar{a}p_1) & t_1\bar{a}p_2 & t_1\bar{a}p_3 \\ t_2\bar{a}p_1 & t_2(a + \bar{a}p_2) & t_2\bar{a}p_3 \\ t_3\bar{a}p_1 & t_3\bar{a}p_2 & t_3(a + \bar{a}p_3) \end{bmatrix} . \] Here, \((t_1, t_2, t_3)\) are the class probabilities, which we can more compactly express as a vector \(\boldsymbol{t}\), and similarly with the \(p\) parameters. We obtain the empirical \(\hat{C}\) from the ratings combined with the class probabilities \(\boldsymbol{\hat{t}}\) found in the E-step.
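One way to assemble \(\hat{C}\) is sketched below: each rating of class \(k\) contributes its subject’s E-step posterior \(\boldsymbol{\hat{t}}^{(i)}\) to column \(k\), and the result is normalized to proportions. This is an illustration under my reading of the construction, with made-up example values, not code from the text.

```python
# Sketch: empirical K x K matrix C-hat from ratings and E-step class probabilities.
import numpy as np

def empirical_confusion(ratings, t_hat):
    """ratings: (n_subjects, n_raters) ints; t_hat: (n_subjects, K) posteriors."""
    n_subjects, n_raters = ratings.shape
    K = t_hat.shape[1]
    C_hat = np.zeros((K, K))
    for k in range(K):
        # Each rating of class k contributes its subject's posterior as the row weights
        C_hat[:, k] = (t_hat * (ratings == k).sum(axis=1, keepdims=True)).sum(axis=0)
    return C_hat / (n_subjects * n_raters)   # proportions, so entries sum to 1

# Example with two subjects, three raters, K = 3
ratings = np.array([[0, 0, 1], [2, 2, 2]])
t_hat = np.array([[0.9, 0.05, 0.05], [0.1, 0.1, 0.8]])
print(empirical_confusion(ratings, t_hat))
```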
To solve for the parameters, we can start with the sum of the diagonals (the trace of the matrix), to get
\[ \begin{aligned} \text{trace}(E[C]) &= t_1(a + \bar{a}p_1) + t_2(a + \bar{a}p_2) + t_3(a + \bar{a}p_3) \\ &= a(t_1 + t_2 + t_3) + \bar{a}(t_1p_1+ t_2p_2+t_3p_3) \\ &= a + \bar{a}\boldsymbol{t}^t\boldsymbol{p} \\ & \approx \text{trace}(\hat{C}) , \end{aligned} \] which gives us one equation linking the model to the data. We can sum the off-diagonal entries in each column to get similar equations relating to the components of \(\boldsymbol{p}\). For the first column, we have
\[ \begin{aligned} E[c_{21} + c_{31}] &= t_2\bar{a}p_1 + t_3\bar{a}p_1 \\ &= \bar{t}_1\bar{a}p_1 \\ & \approx \hat{c}_{21} + \hat{c}_{31} \\ &:= \hat{C}_{o1}, \end{aligned} \] where, in the definition on the last line, the subscript \(o\) means “off-diagonal” and the 1 means sum over column 1. Recall that we already have estimates for \(\boldsymbol{t}\) at this point, and wish to solve for \(a\) and \(\boldsymbol{p}\). If we sum the three off-diagonal column sums we obtain \(\bar{a}-\bar{a}\boldsymbol{t}^t\boldsymbol{p} = 1 -\text{trace}(E[C])\), since all the elements of \(C\) sum to 1.
Solving each of the off-diagonal sums for \(p_i\) gives us a way to combine with the trace equation and solve for \(a\). Start with \(p_1 \approx \hat{C}_{o1}/(\bar{t}_1\bar{a})\) and substitute to obtain
\[ \begin{aligned} \text{trace}(\hat{C}) & \approx a + \bar{a}(t_1p_1+ t_2p_2+t_3p_3) \\ &= a + \bar{a} \left(\frac{t_1\hat{C}_{o1}}{\bar{t}_1\bar{a}} \right) + \bar{a} \left(\frac{t_2\hat{C}_{o2}}{\bar{t}_2\bar{a}} \right) + \bar{a} \left(\frac{t_3\hat{C}_{o3}}{\bar{t}_3\bar{a}} \right) \\ &= a + \frac{t_1}{\bar{t}_1}\hat{C}_{o1} + \frac{t_2}{\bar{t}_2}\hat{C}_{o2} + \frac{t_3}{\bar{t}_3}\hat{C}_{o3} \\ \end{aligned} \]
This gives us a simple calculation for the accuracy coefficient, from which we can derive an approximation for \(\boldsymbol{p}\) as well:
\[ \begin{aligned} a & \approx \text{trace}(\hat{C}) - \left(\frac{t_1}{\bar{t}_1}\hat{C}_{o1} + \frac{t_2}{\bar{t}_2}\hat{C}_{o2} + \frac{t_3}{\bar{t}_3}\hat{C}_{o3} \right) \\ p_i & \approx \frac{\hat{C}_{oi}}{\bar{t}_i\bar{a}}. \end{aligned} \] The formula for \(a\) gives us some insight into the t-a-p model. When \(a=1\) the ratings will match true values, so the matrix will be diagonal with values \(\boldsymbol{t}\). When \(a = 0\) we get the outer product \(C = \boldsymbol{t}\boldsymbol{p}^t\).
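Putting the M-step formulas together, here is a sketch that recovers \(a\) and \(\boldsymbol{p}\) from an empirical matrix. It assumes \(\boldsymbol{t}\) is the current class-probability estimate (for example, the average of the per-subject E-step posteriors); the quick check at the end feeds in the model’s exact \(E[C]\), so the formulas should recover the generating parameters.

```python
# Sketch of the M-step estimates for a and p from the confusion matrix.
import numpy as np

def m_step(C_hat, t):
    t_bar = 1 - t
    # Off-diagonal column sums C_hat_{o,k}
    C_o = C_hat.sum(axis=0) - np.diag(C_hat)
    a = np.trace(C_hat) - (t / t_bar * C_o).sum()
    p = C_o / (t_bar * (1 - a))
    return a, p / p.sum()   # renormalize p against rounding error

# Check against the model's expected matrix: with a = 1 the matrix is diag(t);
# with a = 0 it is the outer product t p^T, as noted above.
t = np.array([0.5, 0.3, 0.2]); p = np.array([0.4, 0.4, 0.2]); a = 0.7
E_C = (1 - a) * np.outer(t, p) + a * np.diag(t)
print(m_step(E_C, t))   # approximately (0.7, [0.4, 0.4, 0.2])
```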