Bleckwen

Challenges in machine learning models 2/2




May 6, 2022



by Isabelle Robin


Continuing our previous article on the specifics of building a machine learning model for fraud detection, we will discuss the creation, metrics, and explainability of artificial intelligence models.

Stable models from development to production

In supervised learning, stability between the results obtained on validation datasets during development and those obtained on live data in production relies on a strong hypothesis: data from both phases show similar feature distributions. If the patterns of genuine or fraudulent cases differ markedly between the two phases, the model won't perform as well as it did during development. This happens when practices change, for example during an economic crisis, or when data is not taken from the same sources. However, this phenomenon can be kept under control with a rigorous training methodology, which helps estimate the model's performance correctly.

As fraudulent patterns can vary over time, a classical random split between training and validation datasets can be insufficient. Validating performance on models trained only on earlier records ensures the time component of fraud is taken into account. Our models are also validated over different periods to ensure our training pipelines produce stable models.
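A time-based split can be sketched as follows. This is only an illustration, not our production pipeline: the record layout, the `date` and `is_fraud` field names, and the helper names `time_split` and `rolling_splits` are assumptions made for the example.

```python
from datetime import date


def time_split(records, cutoff, date_key="date"):
    """Train on records strictly before `cutoff`, validate on the rest."""
    train = [r for r in records if r[date_key] < cutoff]
    valid = [r for r in records if r[date_key] >= cutoff]
    return train, valid


def rolling_splits(records, cutoffs, date_key="date"):
    """Yield one (cutoff, train, valid) triple per cutoff date.

    Repeating the split over several cutoffs is one simple way to check
    that a training pipeline behaves consistently across periods.
    """
    for cutoff in cutoffs:
        train, valid = time_split(records, cutoff, date_key)
        yield cutoff, train, valid


# Hypothetical toy records, for illustration only.
records = [
    {"date": date(2021, 1, 15), "is_fraud": 0},
    {"date": date(2021, 2, 10), "is_fraud": 1},
    {"date": date(2021, 3, 5), "is_fraud": 0},
    {"date": date(2021, 4, 20), "is_fraud": 0},
]

splits = list(rolling_splits(records, [date(2021, 3, 1), date(2021, 4, 1)]))
for cutoff, train, valid in splits:
    print(cutoff, len(train), "training records,", len(valid), "validation records")
```

Unlike a random split, no validation record is ever older than a training record, so the estimated performance is closer to what the model will face on future data.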

The right metrics

Using the right metrics during training and evaluation is also key to building an efficient model. Whereas some ML use cases are driven by the accuracy of predictions, detecting fraud implies different success metrics. As fraudulent cases are rare, a model that never raises an alert would still reach a high accuracy, so the metrics should instead focus on the fraud detection rate (also called recall) and on the relevancy of the alerts raised.
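The gap between accuracy and fraud-oriented metrics can be illustrated with a few lines of code. In this sketch, relevancy is assumed to mean the share of alerts that are actual frauds (usually called precision); the labels and alerts are made-up toy data.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels where 1 means fraud/alert."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return tp, fp, fn, tn


def accuracy(y_true, y_pred):
    tp, fp, fn, tn = confusion_counts(y_true, y_pred)
    return (tp + tn) / len(y_true)


def recall(y_true, y_pred):
    """Fraud detection rate: share of actual frauds that raised an alert."""
    tp, _, fn, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(y_true, y_pred):
    """Share of alerts that are actual frauds (0.0 by convention if no alert)."""
    tp, fp, _, _ = confusion_counts(y_true, y_pred)
    return tp / (tp + fp) if (tp + fp) else 0.0


# Toy data: 2 frauds among 100 records.
y_true = [0] * 98 + [1] * 2
never_alert = [0] * 100                 # a "model" that never raises an alert
some_alerts = [0] * 96 + [1, 1, 1, 0]   # 2 false alerts, 1 fraud caught, 1 missed

print(accuracy(y_true, never_alert), recall(y_true, never_alert))
print(accuracy(y_true, some_alerts), recall(y_true, some_alerts),
      precision(y_true, some_alerts))
```

The model that never alerts has the higher accuracy of the two, yet it detects no fraud at all, which is why accuracy alone is a poor guide here.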

This leads us to the tricky question of the threshold. The model gives a score for every record, but we need to decide the threshold above which a record is considered an alert. Usually, this threshold is computed on the validation dataset to optimize the desired metrics. Most of the time, it is based on the alert handling capacity of the fraud analysts' team. If this capacity is flexible, the threshold can instead be computed to optimize relevancy or to reach a minimum detection rate.
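The two ways of choosing a threshold described above can be sketched as follows. The function names, the rule that a record is an alert when its score is greater than or equal to the threshold, and the toy scores are assumptions made for this example.

```python
import math


def threshold_for_capacity(scores, capacity):
    """Threshold that turns the `capacity` highest scores into alerts.

    A record is an alert when score >= threshold. If several records tie
    at the threshold, the alert count can exceed `capacity`.
    """
    if capacity <= 0:
        raise ValueError("capacity must be positive")
    ranked = sorted(scores, reverse=True)
    if capacity >= len(ranked):
        return ranked[-1]
    return ranked[capacity - 1]


def threshold_for_recall(scores, labels, min_recall):
    """Highest threshold whose recall on this dataset is at least `min_recall`."""
    fraud_scores = sorted(
        (s for s, y in zip(scores, labels) if y == 1), reverse=True
    )
    if not fraud_scores:
        raise ValueError("no fraud case in the validation data")
    # Number of frauds that must be caught (small epsilon guards float noise).
    needed = max(1, math.ceil(min_recall * len(fraud_scores) - 1e-9))
    return fraud_scores[needed - 1]


# Toy validation scores and labels (1 = fraud), for illustration only.
scores = [0.95, 0.80, 0.60, 0.40, 0.30, 0.10]
labels = [1, 0, 1, 0, 1, 0]

capacity_threshold = threshold_for_capacity(scores, capacity=2)
recall_threshold = threshold_for_recall(scores, labels, min_recall=0.6)
print(capacity_threshold, recall_threshold)
```

With these toy values, a capacity of two alerts gives a stricter threshold than the one needed to catch at least 60% of frauds, which shows how the two objectives can pull in different directions.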

Explainable AI

Our fraud detection score is meant to be used by people, so technical performance is not the only criterion. If the generated alerts are not human-understandable, they are likely to be misinterpreted or ignored. Explainability is a crucial driver of ML applications' performance in production. It can be achieved with the help of interpretability algorithms, e.g., SHAP.

For each record, SHAP gives the weight and direction of each feature: in other words, how much each feature is "pushing" towards fraud or genuine. However, these weights can be hard for users to interpret. We add a layer of "human interpretability" on top to convert them into a sentence in which values and features are spelled out.
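The conversion from feature weights to a sentence could look like the sketch below. It does not call SHAP itself: the signed weights are supplied by hand as if they came from an interpretability algorithm, and the feature names, labels, wording, and the `explain` helper are all hypothetical.

```python
def explain(contributions, values, labels, top_n=2):
    """Turn per-feature signed weights into a short readable sentence.

    contributions: feature name -> signed weight (positive pushes towards fraud)
    values:        feature name -> raw value for this record
    labels:        feature name -> human-readable label
    """
    # Keep the features with the largest absolute weight.
    ranked = sorted(
        contributions.items(), key=lambda kv: abs(kv[1]), reverse=True
    )[:top_n]
    parts = []
    for feature, weight in ranked:
        direction = "raises" if weight > 0 else "lowers"
        label = labels.get(feature, feature)
        parts.append(f"{label} ({values[feature]}) {direction} the fraud score")
    return "; ".join(parts) + "."


# Hypothetical weights and values for one record.
contributions = {
    "amount": 0.42,
    "account_age_days": -0.15,
    "n_recent_applications": 0.30,
}
values = {"amount": 12000, "account_age_days": 900, "n_recent_applications": 4}
labels = {
    "amount": "requested amount",
    "account_age_days": "account age in days",
    "n_recent_applications": "number of recent applications",
}

sentence = explain(contributions, values, labels)
print(sentence)
```

Limiting the sentence to the few strongest features keeps the explanation short enough for an analyst to read at a glance.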


Birth and death of a model

Now the model is in production, and its scores are used by analysts. Does the journey end here? No! As data and fraudulent patterns change, a model's performance can drift over time. Because retraining a model can take a long time, we need to monitor both the model score and the distributions of all its features. Retraining can then be launched only when one of those changes.
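One common way to quantify such a change is the population stability index (PSI), which compares the binned distribution of a score or feature between a reference period and a recent one. The sketch below is an assumption about how this monitoring could be done, not a description of our dashboards; the `psi` helper, the equal-width binning, and the 0.2 alert limit (a frequently quoted rule of thumb) are illustrative choices.

```python
import math


def psi(reference, current, n_bins=10, eps=1e-6):
    """Population stability index between two samples of one variable.

    Bins are equal-width over the range of `reference`; values outside
    that range fall into the first or last bin.
    """
    lo, hi = min(reference), max(reference)
    width = (hi - lo) / n_bins or 1.0  # avoid a zero width for constant data

    def shares(sample):
        counts = [0] * n_bins
        for x in sample:
            idx = int((x - lo) / width)
            counts[min(max(idx, 0), n_bins - 1)] += 1
        # Floor each share at eps so the logarithm stays defined.
        return [max(c / len(sample), eps) for c in counts]

    ref, cur = shares(reference), shares(current)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))


def needs_review(psi_value, limit=0.2):
    """Flag a variable whose drift exceeds the (assumed) limit."""
    return psi_value > limit


# Toy data: scores at training time versus scores shifted upwards.
reference = [i / 100 for i in range(100)]
shifted = [min(x + 0.3, 0.99) for x in reference]

print(psi(reference, reference), needs_review(psi(reference, reference)))
print(psi(reference, shifted), needs_review(psi(reference, shifted)))
```

The same function can be applied to the model score and to each feature in turn, so that retraining is considered only for variables whose index crosses the chosen limit.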

Monitoring dashboard: score drift surveillance


Detecting fraud with Machine Learning is an endless lifecycle rather than a one-shot project! Simple retraining won't be enough when the data or patterns change too much. A complete study must be performed again to find the right features and parameters.


Monitoring dashboard: features surveillance

With fraud on the rise, credit providers know they can no longer rely solely on past methods. They are increasingly investing in AI and ML to minimize risk and maximize ROI with minimal effort. The future of fraud prevention lies in expert AI and ML solutions equipped with deep knowledge of banking and of fraudsters' strategies. These solutions are designed to complement and enhance existing expertise in the industry. The difficulty lies mainly in production: explainability for analysts, consistency between the historical data used for training and the data the model scores, and the monitoring of the model's output throughout its life cycle.


“Depending on the reasons given, the score instruction can quickly drop from 30 minutes, in case of additional checks, to 5 minutes. The ROI is impressive, and we are delighted, especially since fraud is now revealed much earlier in our process and any additional checks are done much more quickly.”

Karim Tinouiline - Fraud Manager, Carrefour Banque & Assurance


Did you know 87% of machine learning projects never go into production?

Building artificial intelligence from scratch is a considerable risk;
It is complex and requires a lot of work and operational costs;
You must invest in solutions that balance ownership, long-term strategy, immediate ROI, and efficiency.
