Precision and recall are two indicators used to measure the performance of a binary classification model, for instance: is the patient sick or not?
Usually the output is noted 1 or 0.
Let's imagine we have a population of 1s and 0s. We then want to build, from observed data, a model able to predict whether someone is a 1 or a 0. The model output
has 4 possibilities:
- It predicts 1 and it's really a 1. The prediction is a True Positive (TP)
- It predicts 1 and it was supposed to be a 0. The prediction is a False Positive (FP)
- It predicts 0 and it's really a 0. The prediction is a True Negative (TN)
- It predicts 0 and it was supposed to be a 1. The prediction is a False Negative (FN)
Usually this is summarised in a table as follows:
|           | Prediction 1 | Prediction 0 |
|-----------|--------------|--------------|
| Reality 1 | TP = ?       | FN = ?       |
| Reality 0 | FP = ?       | TN = ?       |
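To make the four cases concrete, here is a minimal sketch that fills in such a table by counting. The function name `confusion_counts` and the toy labels are hypothetical, made up for illustration only.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, TN, FN for labels encoded as 1 and 0."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1  # predicted 1, really a 1
        elif pred == 1 and truth == 0:
            fp += 1  # predicted 1, supposed to be a 0
        elif pred == 0 and truth == 0:
            tn += 1  # predicted 0, really a 0
        else:
            fn += 1  # predicted 0, supposed to be a 1
    return tp, fp, tn, fn


# Hypothetical toy data: 3 real 1s and 5 real 0s.
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
print(tp, fp, tn, fn)  # prints: 2 1 4 1
```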
Furthermore, let's note P (resp. N) the total number of 1s (resp. 0s) in the original dataset, and P̂ (resp. N̂) the total number of predicted 1s (resp. 0s).
So P = TP+FN and N = FP+TN.
The precision is the number of true positives divided by the number of predicted 1s: precision = TP/P̂. Thus the precision is, among all the times I said it's a 1,
how many times I was right. Here P̂ = TP+FP and likewise (to complete the table) N̂ = FN+TN.
The recall is the number of true positives divided by the number of 1s in the original dataset: recall = TP/P. Thus the recall is, among all the times it was supposed
to be a 1, how many times I found it.
The concentration of 1s is P/(P+N).
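The three quantities above can be computed directly from the four counts. This is only a sketch: the function name and the counts are hypothetical, and returning NaN when a denominator is zero is my own choice.

```python
def precision_recall_concentration(tp, fp, tn, fn):
    """Return (precision, recall, concentration) from the four counts."""
    p_hat = tp + fp          # number of predicted 1s
    p = tp + fn              # number of real 1s
    n = fp + tn              # number of real 0s
    precision = tp / p_hat if p_hat else float("nan")
    recall = tp / p if p else float("nan")
    concentration = p / (p + n)
    return precision, recall, concentration


# Hypothetical counts for a population of 100.
precision, recall, concentration = precision_recall_concentration(
    tp=30, fp=20, tn=40, fn=10
)
print(f"precision={precision:.2f} recall={recall:.2f} "
      f"concentration={concentration:.2f}")
# prints: precision=0.60 recall=0.75 concentration=0.40
```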
Is it better to have a high precision or a high recall? Many false positives or many false negatives?
It depends on the objective. If the objective is to give a particular treatment to the people who are a 1, and giving the treatment to the wrong person puts that person at risk of death or serious harm,
you'd better accept a low recall (miss many of the 1s) in exchange for a high precision (rarely produce a false positive). But if every person who is a 1 has to be put in quarantine for a month, or else a
deadly disease will spread all over the world, you'd better have a high recall (find everybody) even with a low precision (some people are put in quarantine who are not supposed to be;
it's OK, they will get out in one month!).
The performance of the model has to be compared to the concentration. Indeed, if the concentration is 0.1%, a precision of 50% is fabulous. But if
the concentration is 48%, a precision of 50% is not so good.
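One simple way to make this comparison is to divide the precision by the concentration. The source does not name this ratio, so the helper below (`ratio_to_concentration`) is purely illustrative; it only reuses the two cases quoted above.

```python
def ratio_to_concentration(precision, concentration):
    """How many times higher the precision is than the concentration."""
    return precision / concentration


# The two cases from the text: same precision, very different concentration.
for concentration in (0.001, 0.48):
    ratio = ratio_to_concentration(0.5, concentration)
    print(f"concentration={concentration:.1%}  precision=50.0%  "
          f"ratio={ratio:.2f}")
```

A ratio around 500 means the model finds 1s about 500 times more often than a blind pick would, while a ratio barely above 1 means it hardly beats picking at random.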
Let's consider the dataset shown alongside. 25% of the population is a 1. So now we have to train a model that will tell us how likely an entry is to be a 1.
The model gives as output a score between 0 and 1. By placing a threshold at different levels between 0 and 1, we can modify the precision and the recall in order
to be more or less conservative.
By lowering the threshold, the orange area increases since we accept more and more people as being a 1.
Thus the recall increases since we find more 1s, but the precision decreases because we have more false positives.
The better the model, the smaller the decrease in precision as the recall increases. Visually, it means that when we decrease the threshold, the orange area above the blue area increases less
than above the green area.
By putting the threshold at 0, the orange area is the whole space, so the recall is 100% and the precision is equal to the concentration.
By putting the threshold at 1, the orange area is of size 1 (only 1 point). If this point is a true positive, the recall is 1/P and the precision is 100%; if
this point is a false positive, the recall is 0% and the precision is 0%.
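The effect of moving the threshold can be checked on a toy example. Everything here is hypothetical: the 12 scores are invented (3 real 1s, so a concentration of 25%), and I assume an entry is predicted 1 when its score is greater than or equal to the threshold.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when predicting 1 for score >= threshold."""
    predicted = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for t, p in zip(y_true, predicted) if t == 1 and p == 1)
    p_hat = sum(predicted)   # number of predicted 1s
    p = sum(y_true)          # number of real 1s
    precision = tp / p_hat if p_hat else float("nan")
    recall = tp / p
    return precision, recall


# Hypothetical scores, sorted from highest to lowest for readability.
y_true = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45,
          0.35, 0.30, 0.25, 0.20, 0.15, 0.10]

for threshold in (0.9, 0.8, 0.5, 0.0):
    precision, recall = precision_recall_at(y_true, scores, threshold)
    print(f"threshold={threshold:.1f}  precision={precision:.2f}  "
          f"recall={recall:.2f}")
```

At threshold 0 every entry is predicted 1, so the recall is 100% and the precision falls back to the concentration (25% here).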
Let's fix a threshold and visualize the precision and the recall.
The precision is the red area over the total orange area (visible on last graph).
The recall is the red area over the green area (visible on the first graph).
We can then plot a precision-recall curve, where each point gives the precision and the recall for a specific threshold.
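The points of such a curve can be sketched without any plotting library: sort the entries by decreasing score, and use each score in turn as the threshold. The data is hypothetical, the function name `pr_curve_points` is mine, and the sketch assumes all scores are distinct (ties would need to be grouped).

```python
def pr_curve_points(y_true, scores):
    """Return a list of (threshold, precision, recall), one per entry.

    Assumes distinct scores: the k-th point predicts 1 for the k
    highest-scored entries.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    total_positives = sum(y_true)
    tp = 0
    points = []
    for rank, i in enumerate(order, start=1):
        tp += y_true[i]  # one more true positive if this entry is a real 1
        points.append((scores[i], tp / rank, tp / total_positives))
    return points


# Hypothetical data: 12 entries, 3 real 1s (concentration of 25%).
y_true = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]
scores = [0.95, 0.85, 0.75, 0.65, 0.55, 0.45,
          0.35, 0.30, 0.25, 0.20, 0.15, 0.10]

points = pr_curve_points(y_true, scores)
for threshold, precision, recall in points:
    print(f"threshold={threshold:.2f}  precision={precision:.2f}  "
          f"recall={recall:.2f}")
```

Plotting recall on the x axis and precision on the y axis from these triples gives the curve; the last point always has a recall of 100% and a precision equal to the concentration.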
Let's now consider a random choice. We have n slips of paper in a bag, each marked 1 or 0. We draw m of them at random and predict that they are all 1s. On average,
the precision will be the concentration, since the probability of picking a 1 is equal to the concentration. For instance, the oxygen concentration in the
air is around 20%; if you fill a balloon with ambient air, the concentration of oxygen in the balloon will still be 20%. What changes is the recall: if
you take a small sample m, you will get few of the 1s. If you pick half the bag (playing heads or tails), you will pick half of the 1s, so get a recall of 50%;
if you pick the entire bag, the recall is 100%. In the end, the precision-recall curve of the random strategy will be the blue horizontal line equal to the
concentration on the last plot.
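The bag experiment is easy to simulate. This is a sketch under my own assumptions: a bag of 1000 slips with 250 ones (the 25% concentration used earlier), 2000 repetitions, and a fixed seed so the run is reproducible.

```python
import random


def random_strategy(n, positives, m, trials=2000, seed=0):
    """Average precision and recall when m slips are drawn at random
    from a bag of n slips containing `positives` ones, all predicted 1."""
    rng = random.Random(seed)
    bag = [1] * positives + [0] * (n - positives)
    precision_sum = 0.0
    recall_sum = 0.0
    for _ in range(trials):
        drawn = rng.sample(bag, m)  # draw without replacement
        tp = sum(drawn)             # every drawn slip is predicted 1
        precision_sum += tp / m
        recall_sum += tp / positives
    return precision_sum / trials, recall_sum / trials


n, positives = 1000, 250
for m in (100, 500, 1000):
    precision, recall = random_strategy(n, positives, m)
    print(f"m={m:4d}  mean precision={precision:.3f}  "
          f"mean recall={recall:.3f}")
```

The mean precision should stay close to 0.25 whatever m is, while the mean recall should stay close to m/n.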
We see that, given the quite low concentration, a model with such a red curve would be an excellent model. For some models, the precision can start
from 0%, then quickly go above the concentration line and decrease to meet the concentration line at a recall of 100%.