Classification performance metrics and indices
Luciana Nieto & Adrian Correndo
2024-06-30
Source:vignettes/available_metrics_classification.Rmd
available_metrics_classification.Rmd
Description
The metrica package compiles +80 functions to assess
regression (continuous) and classification (categorical) prediction
performance from multiple perspectives.
For classification (binomial and multinomial) tasks, it includes a
function to visualize the confusion matrix using ggplot2, and 27
functions of prediction scores including: accuracy, error rate,
precision, recall, specificity, balanced accuracy (balacc), F-score
(fscore), adjusted F-score (agf), G-mean (gmean), Bookmaker Informedness
(bmi, a.k.a. Youden’s J-index), Markedness (deltaP), Matthews
Correlation Coefficient (mcc), Cohen’s Kappa (khat), negative predictive
value (npv), positive and negative likelihood ratios (posLr, negLr),
diagnostic odds ratio (dor), prevalence (preval), prevalence threshold
(preval_t), critical success index (csi, a.k.a. threat score), false
positive rate (FPR), false negative rate (FNR), false detection rate
(FDR), false omission rate (FOR), area under the ROC curve (AUC_roc),
and the P4-metric (p4).
For supervised models, always keep in mind the concept of
“cross-validation” since predicted values should ideally come from
out-of-bag samples (unseen by training sets) to avoid overestimation of
the prediction performance.
Using the functions
There are two basic arguments common to all metrica
functions: (i) obs
(Oi; observed, a.k.a. actual, measured,
truth, target, label), and (ii) pred
(Pi; predicted, a.k.a.
simulated, fitted, modeled, estimate) values.
Optional arguments include data
that allows to call an
existing data frame containing both observed and predicted vectors, and
tidy
, which controls the type of output as a list (tidy =
FALSE) or as a data.frame (tidy = TRUE).
For binary classification (two classes), functions also require to
check the pos_level
arg., which indicates the alphanumeric
order of the “positive level”. Normally, the most common binary
denominations are c(0,1), c(“Negative”, “Positive”), c(“FALSE”, “TRUE”),
so the default pos_level = 2 (1, “Positive”, “TRUE”). However, other
cases are also possible, such as c(“Crop”, “NoCrop”) for which the user
needs to specify pos_level = 1.
For multiclass classification tasks, some functions present the
atom
arg. (logical TRUE / FALSE), which controls the output
to be an overall average estimate across all classes, or a class-wise
estimate. For example, user might be interested in obtaining estimates
of precision and recall for each possible class of the prediction.
List of classification metrics* (categorical variables)
Note: All classification functions automatically recognize the
number of classes and adjust estimations for binary or multiclass cases.
However, for binary classification tasks, the user would need to check
the alphanumeric order of the level considered as positive. By default
“pos_level = 2” based on the most common denominations being c(0,1),
c(“Negative”,“Positive”), c(“TRUE”, “FALSE”).
# | Metric | Definition | Details | Formula |
---|---|---|---|---|
1 | accuracy |
Accuracy | It is the most commonly used metric to evaluate classification quality. It represents the number of corrected classified cases with respect to all cases. However, be aware that this metric does not cover all aspects about classification quality. When classes are uneven in number, it may not be a reliable metric. | \(accuracy = \frac{TP+TN}{TP+FP+TN+FN}\) |
2 | error_rate |
Error Rate | It represents the complement of accuracy. It could vary between 0 and 1. Being 0 the best and 1 the worst | \(error~rate = \frac{FP+FN}{TP+FP+TN+FN}\) |
3 |
precision , ppv
|
Precision | Also known as positive predictive value (ppv), it represents the proportion of well classified cases with respect to the total of cases predicted with a given class (multinomial) or the true class (binomial) | \(precision = \frac{TP}{TP + FP}\) |
4 |
recall , sensitivity ,
TPR , hitrate
|
Recall | Also known as sensitivity, hit rate, or true positive rate (TPR) for binary cases. It represents the proportion of well predicted cases with respect to the total number of observed cases for a given class (multinomial) or the positive class (binomial) | \(recall = \frac{TP}{P} = 1 - FNR\) |
5 |
specificity , selectivity ,
TNR
|
Specificity | Also known as selectivity or true negative rate (TNR). It represents the proportion of well classified negative values with respect to the total number of actual negatives | \(specificity = \frac{TN}{N} = 1 - FPR\) |
6 | balacc |
Balanced Accuracy | This metric is especially useful when the number of observations across classes is imbalanced | \(b.accuracy = \frac{recall + specificity}{2}\) |
7 | fscore |
F-score | F1-score, F-measure | \(fscore = \frac{(1 + B ^ 2) * precision * recall}{(B ^ 2 * precision) + recall)}\) |
8 | agf |
Adjusted F-score | The agf adjusts the fscore for datasets with imbalanced classes | \(agf = \sqrt{F_2 * invF_{0.5}}\), where \(F_2 = 5 * \frac{precision~*~recall}{(4*precision)~+~recall}\), and \(invF_{0.5} = (\frac{5}{4}) * \frac{npv~*~specificity}{(0.5^2 ~*~ npv)~+~specificity}\) |
9 | gmean |
G-mean | The Geometric Mean (gmean) is a measure that considers a balance between the performance of both majority and minority classes. The higher the value the lower the risk of over-fitting of negative and under-fitting of positive classes | \(gmean = \sqrt{recall~*~specificity}\) |
10 | khat |
K-hat or Cohen’s Kappa Coefficient | The khat is considered a more robust metric than the
classic accuracy . It normalizes the accuracy by the
possibility of agreement by chance. It is positively bounded to 1, but
it is not negatively bounded. The closer to 1, the better the
classification quality |
\(khat = \frac{2 * (TP * TN - FN * FP)}{(TP+FP) * (FP+TN) + (TP+FN) * (FN + TN)}\) |
11 |
mcc , phi_coef
|
Matthews Correlation Coefficient | Also known as phi-coefficient. It is particularly useful when the number of observations belonging to each class is uneven. It varies between 0-1, being 0 the worst and 1 the best. Currently, the mcc estimation is only available for binary cases (two classes) | \(mcc = \frac{TP * TN - FP * FN}{\sqrt{(TP+FP) * (TP+FN) * (TN+FP) * (TN+FN)}}\) |
12 | fmi |
Fowlkes-Mallows Index | The fmi is a metric that measures the similarity between two clusters (predicted and observed). It is equivalent to the square root of the product between precision (PPV) and recall (TPR). It varies between 0-1, being 0 the worst and 1 the best. | \(fmi = \sqrt{precision * recall} = \sqrt{PPV * TPR}\) |
13 |
bmi , jindex
|
Informedness | Also known as the Bookmaker Informedness, or as the Youden’s J-index. It is a suitable metric when the number of cases for each class is uneven. It varies between | \(bmi = recall + specificity -1 = TPR + TNR - 1 = \frac{FP+FN}{TP+FP+TN+FN}\) |
14 | posLr |
Positive Likelihood Ratio | The posLr, also known as LR(+) represents the odds of obtaining a positive prediction for actual positives. | \(posLr = \frac{recall}{1+specificity} = \frac{TPR}{FPR}\) |
15 | negLr |
Negative Likelihood Ratio | The negLr, also known as LR(-) indicates the odds of obtaining a negative prediction for actual positives (or non-negatives in multiclass) relative to the probability of actual negatives of obtaining a negative prediction | \(negLr = \frac{1-recall}{specificity} = \frac{FNR}{TNR}\) |
16 | dor |
Diagnostic Odds Ratio | The dor is a metric summarizing the effectiveness of classification. It represents the odds of a positive case obtaining a positive prediction result with respect to the odds of actual negatives obtaining a positive result | \(dor = \frac{posLr}{negLr}\) |
17 | npv |
Negative predictive Value | It represents the complement of accuracy. It could vary between 0 and 1. Being 0 the best and 1 the worst | \(npv = \frac{TP}{PP} = \frac{TP}{TP + FP}\) |
18 | FPR |
False Positive Rate | It represents the complement of
specificity . It could vary between 0 and 1. The lower the
better. |
\(FPR = 1 - specificity = 1 - TNR = \frac{FP}{N}\) |
19 | FNR |
False Negative Rate | It represents the complement of recall . It
could vary between 0 and 1. The lower the better. |
\(FNR = 1 - recall = 1 - TPR = \frac{FN}{P}\) |
20 | FDR |
False Detection Rate | It represents the complement of precision
(or positive predictive value -ppv -). It could vary between
0 and 1, being 0 the best and 1 the worst |
\(FDR = 1 - precision = \frac{FP}{PP} = \frac{FP}{TP + FP}\) |
21 | FOR |
False Omission Rate | It represents the complement of the npv .
It could vary between 0 and 1, being 0 the best and 1 the worst |
\(FOR = 1 - npv = \frac{FN}{PN} = \frac{FN}{TN + FN}\) |
22 | preval |
Error Rate | It represents the complement of accuracy. It could vary between 0 and 1. Being 0 the best and 1 the worst | \(error~rate = \frac{FP+FN}{TP+FP+TN+FN}\) |
23 | preval_t |
Error Rate | It represents the complement of accuracy. It could vary between 0 and 1. Being 0 the best and 1 the worst | \(error~rate = \frac{FP+FN}{TP+FP+TN+FN}\) |
24 |
csi , jaccardindex
|
Critical Success Index | The csi is also known as the threat score
(TS) or Jaccard’s Index. It could vary between 0 and 1, being 0 the
worst and 1 the best |
\(csi = \frac{TP}{TP+FP+TN}\) |
25 |
deltap , mk
|
Markedness or deltap | The deltap (a.k.a. Markedness
-mk -) is a metric that quantifies the probability that a
condition is marked by the predictor with respect to a random
chance |
\(deltap = precision+npv-1 = PPV + NPV -1\) |
26 | AUC_roc |
Area Under the Curve | The AUC_roc estimates the area under the
receiving operator characteristic curve following the trapezoid
approach. It is bounded between 0 and 1. The closet to 1 the better.
AUC_roc = 0.5 means the models predictions are the same than a random
classifier. |
\(AUC_{roc} = precision+npv-1 = PPV + NPV -1\) |
27 | p4 |
P4-metric | The p4 estimates the P-4 following Sitarz
(2023) as an extension of the F-score . It is bounded
between 0 and 1. The closet to 1 the better. It integrates four metrics:
precision , recall , specificity ,
and npv . |
\(p4 = \frac{4} {\frac{1}{precision} + \frac{1}{recall} + \frac{1}{specificity} + \frac{1}{npv} }\) |
List of additional abbreviations:
P = positive (true + false)
N = negative (true + false)
TP = true positive
TN = true negative
FP = false positive
FN = false negative
TPR = true positive rate
TNR = true negative rate
FPR = false positive rate
FNR = false negative rate
ppv = positive predictive value
npv = negative predictive value
B = coefficient B (a.k.a. beta) indicating the weight to be applied
to the estimation of fscore
(as \(B^2\)).
References:
Ting K.M. (2017). Confusion Matrix. In: Sammut C., Webb G.I. (eds) Encyclopedia of Machine Learning and Data Mining. Springer, Boston, MA.
Accuracy. (2017). In: Sammut C., Webb G.I. (eds) Encyclopedia of Machine Learning and Data Mining . Springer, Boston, MA.
García, V., Mollineda, R.A., Sánchez, J.S. (2009). Index of Balanced Accuracy: A Performance Measure for Skewed Class Distributions. In: Araujo, H., Mendonça, A.M., Pinho, A.J., Torres, M.I. (eds) Pattern Recognition and Image Analysis. IbPRIA 2009. Lecture Notes in Computer Science, vol 5524. Springer-Verlag Berlin Heidelberg.
Ting K.M. (2017). Precision and Recall. In: Sammut C., Webb G.I. (eds) Encyclopedia of Machine Learning and Data Mining. Springer, Boston, MA.
Sensitivity. (2017). In: Sammut C., Webb G.I. (eds) Encyclopedia of Machine Learning and Data Mining. Springer, Boston, MA.
Ting K.M. (2017). Sensitivity and Specificity. In: Sammut C., Webb G.I. (eds) Encyclopedia of Machine Learning and Data Mining. Springer, Boston, MA.
Trevethan, R. (2017). Sensitivity, Specificity, and Predictive Values: Foundations, Pliabilities, and Pitfalls in Research and Practice. Front. Public Health 5:307
Goutte, C., Gaussier, E. (2005). A Probabilistic Interpretation of Precision, Recall and F-Score, with Implication for Evaluation. In: D.E. Losada and J.M. Fernandez-Luna (Eds.): ECIR 2005. Advances in Information Retrieval LNCS 3408, pp. 345–359, 2. Springer-Verlag Berlin Heidelberg.
Maratea, A., Petrosino, A., Manzo, M. (2014). Adjusted-F measure and kernel scaling for imbalanced data learning. Inf. Sci. 257: 331-341.
De Diego, I.M., Redondo, A.R., Fernández, R.R., Navarro, J., Moguerza, J.M. (2022). General Performance Score for classification problems. Appl. Intell. (2022).
Fowlkes, Edward B; Mallows, Colin L (1983). A method for comparing two hierarchical clusterings. Journal of the American Statistical Association. 78 (383): 553–569.
Chicco, D., Jurman, G. (2020). The advantages of the Matthews correlation coefficient (MCC) over F1 score and accuracy in binary classification evaluation. BMC Genomics 21, 6 (2020).
Youden, W.J. (1950). Index for rating diagnostic tests. Cancer 3: 32-35.
Powers, D.M.W. (2011). Evaluation: From Precision, Recall and F-Measure to ROC, Informedness, Markedness & Correlation. Journal of Machine Learning Technologies 2(1): 37–63.
Chicco, D., Tötsch, N., Jurman, G. (2021). The Matthews correlation coefficient (MCC) is more reliable than balanced accuracy, bookmaker informedness, and markedness in two-class confusion matrix evaluation. BioData Min 14(1): 13.
GlasaJeroen, A.S., Lijmer, G., Prins, M.H., Bonsel, G.J., Bossuyta, P.M.M. (2009). The diagnostic odds ratio: a single indicator of test performance. Journal of Clinical Epidemiology 56(11): 1129-1135.
Wang H., Zheng H. (2013). Negative Predictive Value. In: Dubitzky W., Wolkenhauer O., Cho KH., Yokota H. (eds) Encyclopedia of Systems Biology. Springer, New York, NY.
Freeman, E.A., Moisen, G.G. (2008). A comparison of the performance of threshold criteria for binary classification in terms of predicted prevalence and kappa. Ecol. Modell. 217(1-2): 45-58.
Balayla, J. (2020). Prevalence threshold (φe) and the geometry of screening curves. Plos one, 15(10):e0240215.
Schaefer, J.T. (1990). The critical success index as an indicator of warning skill. Weather and Forecasting 5(4): 570-575.
Hanley, J.A., McNeil, J.A. (2017). The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology 143(1): 29-36
Hand, D.J., Till, R.J. (2001). A simple generalisation of the area under the ROC curve for multiple class classification problems. Machine Learning 45: 171-186
Mandrekar, J.N. (2010). Receiver operating characteristic curve in diagnostic test assessment. J. Thoracic Oncology 5(9): 1315-1316
Sitarz, M. (2023). Extending F1 metric, probabilistic approach. Adv. Artif. Intell. Mach. Learn., 3 (2):1025-1038.