References
Agresti, A. (2003). Categorical data analysis (Vol. 482). John Wiley & Sons.
Aickin, M. (1990). Maximum likelihood estimation of agreement in the constant predictive probability model, and its relation to Cohen’s kappa. Biometrics, 46(2), 293–302.
Ascari, R., & Migliorati, S. (2021). A new regression model for overdispersed binomial data accounting for outliers and an excess of zeros. Statistics in Medicine, 40(17), 3895–3914.
Bassett, R., & Deride, J. (2019). Maximum a posteriori estimators as a limit of Bayes estimators. Mathematical Programming, 174, 129–144.
Bennett, E. M., Alpert, R., & Goldstein, A. (1954). Communications through limited-response questioning. Public Opinion Quarterly, 18(3), 303–308.
Bonett, D. G. (2022). Statistical inference for G-indices of agreement. Journal of Educational and Behavioral Statistics, 47(4), 438–458.
Brennan, R. L., & National Council on Measurement in Education. (2006). Educational measurement. Praeger Publishers.
Byrt, T., Bishop, J., & Carlin, J. B. (1993). Bias, prevalence and kappa. Journal of Clinical Epidemiology, 46(5), 423–429.
Carpenter, B. (2008). Multilevel Bayesian models of categorical data annotation. Unpublished manuscript.
Carpenter, B., Gelman, A., Hoffman, M. D., Lee, D., Goodrich, B., Betancourt, M., Brubaker, M., Guo, J., Li, P., & Riddell, A. (2017). Stan: A probabilistic programming language. Journal of Statistical Software, 76(1).
Chaturvedi, S., & Shweta, R. (2015). Evaluation of inter-rater agreement and inter-rater reliability for observational data: An overview of concepts and methods. Journal of the Indian Academy of Applied Psychology, 41(3), 20–27.
Cicchetti, D. V., & Feinstein, A. R. (1990). High agreement but low kappa: II. Resolving the paradoxes. Journal of Clinical Epidemiology, 43(6), 551–558.
Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.
Cronbach, L. J., Rajaratnam, N., & Gleser, G. C. (1963). Theory of generalizability: A liberalization of reliability theory. British Journal of Statistical Psychology, 16(2), 137–163.
Davani, A. M., Díaz, M., & Prabhakaran, V. (2022). Dealing with disagreements: Looking beyond the majority vote in subjective annotations. Transactions of the Association for Computational Linguistics, 10, 92–110.
Dawid, A. P., & Skene, A. M. (1979). Maximum likelihood estimation of observer error-rates using the EM algorithm. Journal of the Royal Statistical Society: Series C (Applied Statistics), 28(1), 20–28.
Delgado, R., & Tibau, X.-A. (2019). Why Cohen’s kappa should be avoided as performance measure in classification. PLOS ONE, 14(9), e0222916.
Engelhard, G. (2012). Examining rating quality in writing assessment: Rater agreement, error, and accuracy. Journal of Applied Measurement, 13, 321–335.
Engelhard, G., Jr. (1996). Evaluating rater accuracy in performance assessments. Journal of Educational Measurement, 33(1), 56–70.
Eubanks, D. (2017). (Re)visualizing rater agreement: Beyond single-parameter measures. Journal of Writing Analytics, 1.
Eubanks, D. A. (2014). Causal interfaces. arXiv preprint. http://arxiv.org/abs/1404.4884v1
Fleiss, J. L. (1971). Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5), 378–382.
Fleiss, J. L., Levin, B., & Paik, M. C. (2013). Statistical methods for rates and proportions. John Wiley & Sons.
Gelman, A., & Hill, J. (2006). Data analysis using regression and multilevel/hierarchical models. Cambridge University Press.
Gelman, A., Vehtari, A., Simpson, D., Margossian, C. C., Carpenter, B., Yao, Y., Kennedy, L., Gabry, J., Bürkner, P.-C., & Modrák, M. (2020). Bayesian workflow. arXiv preprint arXiv:2011.01808.
Gettier, E. L. (1963). Is justified true belief knowledge? Analysis, 23(6), 121–123.
Grilli, L., Rampichini, C., & Varriale, R. (2015). Binomial mixture modeling of university credits. Communications in Statistics - Theory and Methods, 44(22), 4866–4879. https://doi.org/10.1080/03610926.2013.804565
Gwet, K. L. (2008). Computing inter-rater reliability and its variance in the presence of high agreement. British Journal of Mathematical and Statistical Psychology, 61(1), 29–48.
Hodgson, R. T. (2008). An examination of judge reliability at a major US wine competition. Journal of Wine Economics, 3(2), 105–113.
Holley, J. W., & Guilford, J. P. (1964). A note on the G index of agreement. Educational and Psychological Measurement, 24(4), 749–753.
Hovy, D., Berg-Kirkpatrick, T., Vaswani, A., & Hovy, E. (2013). Learning whom to trust with MACE. Proceedings of the 2013 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 1120–1130.
Krippendorff, K. (2013). Commentary: A dissenting view on so-called paradoxes of reliability coefficients. Annals of the International Communication Association, 36(1), 481–499.
Krippendorff, K. (2018). Content analysis: An introduction to its methodology. Sage Publications.
Krippendorff, K., & Fleiss, J. L. (1978). Reliability of binary attribute data. Biometrics, 34(1), 142–144.
Kumar, S., Hooi, B., Makhija, D., Kumar, M., Faloutsos, C., & Subrahmanian, V. (2018). REV2: Fraudulent user prediction in rating platforms. Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining, 333–341.
Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174.
McLachlan, G., & Peel, D. (2000). Finite mixture models. Wiley Series in Probability and Statistics. John Wiley & Sons.
Paun, S., Carpenter, B., Chamberlain, J., Hovy, D., Kruschwitz, U., & Poesio, M. (2018). Comparing Bayesian models of annotation. Transactions of the Association for Computational Linguistics, 6, 571–585.
Rasch, G. (1977). On specific objectivity: An attempt at formalizing the request for generality and validity of scientific statements. In Symposium on Scientific Objectivity, Vedbaek, May 14–16, 1976. Danish Yearbook of Philosophy, 14, 58–94.
Rasch, G. (1993). Probabilistic models for some intelligence and attainment tests. MESA Press.
Ross, V., & LeGrand, R. (2017). Assessing writing constructs: Toward an expanded view of inter-reader reliability. Journal of Writing Analytics, 1.
Scott, W. A. (1955). Reliability of content analysis: The case of nominal scale coding. Public Opinion Quarterly, 19(3), 321–325.
Shabankhani, B., Charati, J. Y., Shabankhani, K., & Cherati, S. K. (2020). Survey of agreement between raters for nominal data using Krippendorff’s alpha. Archives of Pharmacy Practice, 10(S1), 160–164.
Shrout, P. E., & Fleiss, J. L. (1979). Intraclass correlations: Uses in assessing rater reliability. Psychological Bulletin, 86(2), 420–428.
Stan Development Team. (2022). Stan user’s guide (version 2.34).
Vach, W., & Gerke, O. (2023). Gwet’s AC1 is not a substitute for Cohen’s kappa – a comparison of basic properties. MethodsX, 10, 102212.
Williams, D. A. (1975). The analysis of binary responses from toxicological experiments involving reproduction and teratogenicity. Biometrics, 31(4), 949–952.