When training and deploying machine learning (ML) models, it’s often impossible to measure model performance without making model monitoring part of your MLOps lifecycle. Understanding model evaluation metrics, from classification and regression performance metrics to the F1 score, is critical for anyone deploying ML models.
There are many metrics and model monitoring tools for evaluating the performance of an ML model, and choosing the right ones is crucial. After all, evaluating your ML algorithm or model is an essential part of any successful project. Metrics for measuring machine learning performance fall into at least seven categories, including:
1. Classification Metrics
2. Regression Metrics
3. Ranking Metrics
4. Statistical Metrics
5. Computer Vision Metrics
6. Natural Language Processing Metrics
7. Deep Learning Related Metrics
Clearly, there are many different metrics and many variables you could measure or assess for a machine learning model. We could try to tell you which ones are better, or dive into the technical details of each one. Instead, we are going to take a closer look at a couple of the more popular ones, starting with the F1 score.
In technical terms, the F1 score is defined as the harmonic mean of precision and recall. While certain applications might call for more emphasis on either precision or recall, if you want a single number that captures both metrics at once, the F1 score is exactly what you are looking for.
The F1 score is calculated as follows, where TP denotes true positives, FP denotes false positives, and FN denotes false negatives:
$$F_1 = 2 \cdot \frac{\text{precision} \cdot \text{recall}}{\text{precision} + \text{recall}}$$
where:
$$\text{precision} = \frac{TP}{TP + FP}$$
$$\text{recall} = \frac{TP}{TP + FN}$$
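To make these formulas concrete, here is a minimal Python sketch that computes precision, recall, and F1 from confusion-matrix counts. The counts are invented purely for illustration; in practice you would typically compute the score from label arrays with a library routine such as scikit-learn's `f1_score`.

```python
# Minimal sketch of the F1 computation above, using made-up
# confusion-matrix counts for a single positive class.
tp, fp, fn = 90, 10, 30  # true positives, false positives, false negatives

precision = tp / (tp + fp)                          # 90 / 100 = 0.90
recall = tp / (tp + fn)                             # 90 / 120 = 0.75
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean ~ 0.818

print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.3f}")
```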
It is worth noting that there is usually a tradeoff between precision and recall: tuning a model to push one of them higher tends to drag the other down. With that in mind, let’s talk about what constitutes a good F1 score.
F1 scores range from 0 to 1, with 1 being the best possible score. The higher the F1 score, the better; because the harmonic mean is dominated by the smaller of its two inputs, a low F1 score means that precision, recall, or both are low. As precision and recall increase, the F1 score increases as well. So if you find your F1 score is low, your precision and recall metrics are a good place to start looking for the cause.
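To see the precision/recall tradeoff mentioned above in action, here is a small, self-contained Python sketch that sweeps a decision threshold over a handful of invented predicted probabilities. The labels and scores are made up for illustration: as the threshold rises, precision climbs while recall falls, and the F1 score summarizes both at each cut-off.

```python
# Hedged sketch: how the decision threshold trades precision against recall.
# The labels and predicted probabilities below are invented for illustration.
import numpy as np

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0, 1, 0])                          # ground-truth labels
y_prob = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.65, 0.9, 0.55, 0.45, 0.3])  # model scores

for threshold in (0.3, 0.5, 0.7):
    y_pred = (y_prob >= threshold).astype(int)
    tp = int(np.sum((y_pred == 1) & (y_true == 1)))
    fp = int(np.sum((y_pred == 1) & (y_true == 0)))
    fn = int(np.sum((y_pred == 0) & (y_true == 1)))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    print(f"threshold={threshold:.1f}  precision={precision:.2f}  "
          f"recall={recall:.2f}  f1={f1:.2f}")
```

With these made-up scores, raising the threshold from 0.3 to 0.7 pushes precision from roughly 0.63 up to 1.0 while recall drops from 1.0 to 0.4, which is exactly the tension the F1 score is designed to balance.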
Another simple yet important metric for ML models is classification accuracy, often referred to simply as accuracy. Accuracy is the ratio of correct predictions to the total number of predictions. In other words:
$$Accuracy=\frac{\text{Number of correct predictions}}{\text{Total number of predictions}}$$
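As a quick illustration, the following sketch computes accuracy from a pair of made-up label lists; on real predictions, a library routine such as scikit-learn's `accuracy_score` would give the same result.

```python
# Minimal sketch of classification accuracy on made-up labels.
y_true = [1, 0, 1, 1, 0, 1, 0, 0, 1, 0]  # ground truth
y_pred = [1, 0, 0, 1, 0, 1, 0, 1, 1, 0]  # model predictions

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)  # 8 correct out of 10 -> 0.80
print(f"accuracy={accuracy:.2f}")
```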
This might have you wondering: what is good accuracy for machine learning? While the goal is always to get as close as possible to 1 (if you stick with the ratio) or 100%, the reality is that perfect accuracy is hard to achieve. Accuracy of 90% (0.9) or higher is generally considered good, but the bar shifts depending on the industry and the specific model. For example, if the model is being used to diagnose a deadly disease, a good accuracy rate might be closer to 95% or even 99%. Conversely, a lower-stakes task such as identifying whether pictures contain a dog might be perfectly well served by 90% accuracy.
In many regards, artificial intelligence and machine learning embody the best of promise and progress for society. But just as all humans have innate and implicit biases and blind spots, AI is also imperfect. The problem with machine learning is that much of what a model decides, how it decides, and why it decides that way happens inside a black box, which makes it difficult to detect model bias or other flaws in the process. That can be a huge issue.
At Fiddler, we help MLOps and Data Science teams develop responsible AI by providing explainable AI. Once you understand why your ML models are making certain decisions, you can improve their overall performance.
Try Fiddler to get started on your path to building trust into AI.