Thinking Beyond OSS Tools for Model Monitoring

As more machine learning (ML) models are deployed each day, ML teams increasingly need to monitor model performance, and they have a variety of tools at their disposal. Models in production work on data they’ve never seen before, so their performance decays over time, and they must be retrained to stay effective and to counter issues such as data drift or model bias.

Typical ML applications are run either in real-time or batch modes, and, in either case, monitoring model predictions is key to closing the iterative feedback loop of model development. But how can ML teams accomplish all this?

Why OSS DevOps tools may not be sufficient

OSS visualization and querying tools like Kibana (part of the ELK stack) and Grafana (commonly paired with Prometheus) provide charting tools to build and edit dashboards, along with flexible querying for DevOps monitoring. Grafana was built with time-series charts in mind, typically plotting metrics (counters, gauges, histograms) such as disk usage, CPU usage, and the number of requests. Prometheus is typically used in conjunction with Grafana, and it scrapes and stores time-series data. Kibana integrates tightly with Elasticsearch, which ingests, indexes, and queries log data, and it was built primarily for log monitoring.

Before your team decides to build its own model monitoring solution on top of any of these existing tools, here is how Fiddler provides off-the-shelf visibility into your models’ performance.

Model centric APIs

ML revolves around models, and stakeholders on different teams must collaborate to successfully develop and deploy models that deliver business value. Model monitoring requires models to be treated as first-class objects. Comparing the performance of multiple models, of challenger and champion models, and of various model versions, while supporting complex model stacks, requires careful engineering to provide a solid, durable, enterprise-grade offering.

Fiddler achieves this by building model-aware dashboards and visualizations that can help propel your MLOps teams into a production-ready state right from deployment. Fiddler also organizes the model metrics at different levels of the information hierarchy, allowing users to go from high-level cockpit views to the lower-level model and dataset views, all the way into individual sample views.

Data security and access control

The feature inputs and outputs of ML models may contain sensitive data that is core to the business application and requires fine-grained access control. Fiddler’s model-centric approach allows you to protect data and views at the right level of abstraction. OSS DevOps tools fall well short of offering these kinds of protections, which are a priority for enterprise buyers.

Large-scale computing 

Due to the high volume and complexity of ML data, aggregation is essential for immediate access to monitoring metrics. Fiddler’s Aggregation Service consumes each incoming event and updates all affected metrics as the event arrives. It keeps track of “running” aggregates, which are backed by a datastore. The solution scales horizontally, since it distributes the precomputation across multiple containers. When users want to see feature drift over an entire quarter, for example, there is no need to fetch all the events to compute the drift; instead, the precomputed aggregates serve the request with very low latency.
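The idea of running aggregates can be illustrated with a minimal sketch. It is not Fiddler’s implementation: the RunningHistogram class, the daily bucketing, the bin edges, and the in-memory dictionary standing in for the datastore are all assumptions made for illustration. Each event updates one small counter, and a quarter-long view is answered by merging daily counters rather than rescanning raw events.

```python
from bisect import bisect_right
from collections import Counter, defaultdict
from datetime import date, datetime


class RunningHistogram:
    """Hypothetical per-day bin counts for one numeric feature."""

    def __init__(self, bin_edges):
        # Ascending inner edges; n edges define n + 1 bins.
        self.bin_edges = sorted(bin_edges)
        # In-memory stand-in for a datastore: day -> {bin index -> count}.
        self.daily_counts = defaultdict(Counter)

    def update(self, timestamp, value):
        """Fold a single event into the aggregate for its day."""
        bin_index = bisect_right(self.bin_edges, value)
        self.daily_counts[timestamp.date()][bin_index] += 1

    def window_counts(self, start_day, end_day):
        """Merge daily aggregates for an inclusive date range."""
        total = Counter()
        for day, counts in self.daily_counts.items():
            if start_day <= day <= end_day:
                total.update(counts)
        return [total.get(i, 0) for i in range(len(self.bin_edges) + 1)]


# Usage: four events arrive over time; only three fall inside the first quarter.
hist = RunningHistogram(bin_edges=[10, 20])
hist.update(datetime(2024, 1, 5, 9, 30), 4.0)
hist.update(datetime(2024, 2, 11, 14, 0), 12.5)
hist.update(datetime(2024, 3, 30, 23, 59), 31.0)
hist.update(datetime(2024, 4, 2, 8, 0), 7.0)

q1_counts = hist.window_counts(date(2024, 1, 1), date(2024, 3, 31))
print(q1_counts)
```

The quarter-long query touches at most one small counter per day, regardless of how many raw events were ingested.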

Figure 1 shows a high-level view of the Fiddler Aggregation System (FAS), which forms the foundation of Fiddler’s monitoring solution. All incoming inference (prediction) events are buffered in a RabbitMQ message queue and then distributed across aggregation containers running on Kubernetes.

Flexible metric computation and visualization

ML metric computation requires sophisticated techniques to be applied at scale for drift computations, data integrity checks, outlier checks, and anomaly detection, each of which may also need to be computed over different slices of the data. Layering these on traditional DevOps tools would result in fragile point solutions that significantly underserve the needs of ML engineers and data scientists.

A few different metrics include:

  • Model Performance - AUC, Precision, Recall, F1 for classification models
  • Feature and Prediction Drift - Metrics computed based on KL divergence, JS divergence, Feature Importances, etc.
  • Model Fairness - Disparate Impact, Demographic Parity, Equal Opportunity

Fiddler’s aggregation service can compute these metrics, as well as custom metrics, in an extensible manner, which makes it suitable for a wide range of model monitoring use cases.
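To make two of the listed metrics concrete, here is a minimal sketch of Jensen-Shannon (JS) divergence computed from binned counts, and of disparate impact computed as a ratio of favorable-outcome rates. The function names, the choice of base-2 logarithms, and the sample data are assumptions for illustration and do not describe how Fiddler computes these metrics.

```python
import math


def _to_probabilities(counts):
    total = sum(counts)
    if total == 0:
        raise ValueError("histogram has no observations")
    return [c / total for c in counts]


def _kl_divergence(p, q):
    # Terms with p_i == 0 contribute nothing to the sum.
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(baseline_counts, production_counts):
    """JS divergence (base 2, so in [0, 1]) between two binned distributions."""
    p = _to_probabilities(baseline_counts)
    q = _to_probabilities(production_counts)
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]  # mixture distribution
    return 0.5 * _kl_divergence(p, m) + 0.5 * _kl_divergence(q, m)


def disparate_impact(outcomes, groups, protected_group, reference_group):
    """Favorable-outcome rate of the protected group divided by the reference group's."""

    def favorable_rate(group):
        selected = [o for o, g in zip(outcomes, groups) if g == group]
        if not selected:
            raise ValueError(f"no samples for group {group!r}")
        return sum(selected) / len(selected)

    reference_rate = favorable_rate(reference_group)
    if reference_rate == 0:
        raise ValueError("reference group has no favorable outcomes")
    return favorable_rate(protected_group) / reference_rate


# Usage with made-up data: binned feature counts and binary model decisions.
drift = js_divergence([50, 30, 20], [20, 30, 50])
impact = disparate_impact(
    outcomes=[1, 0, 1, 1],
    groups=["a", "a", "b", "b"],
    protected_group="a",
    reference_group="b",
)
print(drift, impact)
```

Because JS divergence works on binned counts, it can be computed directly from precomputed histograms without access to the raw events.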

Complex data schemas 

DevOps tools like Grafana and Prometheus work on metrics like counters, gauges, and histograms, which are collected at the source. These tools also provide storage, querying, and time-series views over these data types. In other words, the inputs to these tools are well-formatted at the source and require minimal transformation beyond rolling up along dimensions such as time and labels to produce views of the data at different granularities.

Model monitoring, on the other hand, requires you to work with complex data types, including nested vectors, tensors, word embeddings, and one-hot and multi-hot encoded feature vectors, to support a wide range of ML applications. Fiddler offers a robust platform to support such specialized computations at scale without your having to write a lot of bespoke code.
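As a small example of the extra transformation such data types need, the sketch below turns one-hot or multi-hot encoded rows into per-category counts, which could then feed a histogram-based drift metric. The function name and the sample categories are hypothetical and chosen only for illustration.

```python
def category_counts(encoded_rows, categories):
    """Count how often each category is active across one-hot or multi-hot rows."""
    counts = {category: 0 for category in categories}
    for row in encoded_rows:
        if len(row) != len(categories):
            raise ValueError("row length does not match the number of categories")
        for category, flag in zip(categories, row):
            if flag:
                counts[category] += 1
    return counts


# Usage: the third row is multi-hot (two categories active at once).
color_counts = category_counts(
    encoded_rows=[[1, 0, 0], [0, 1, 0], [1, 1, 0]],
    categories=["red", "green", "blue"],
)
print(color_counts)
```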

Time-delayed labels

DevOps monitoring stacks typically do not need to treat months-old events as still relevant, but with ML models this happens frequently. In many industries, ground truth labels might not be available until days, weeks, or even months after the model makes its predictions. Performance and accuracy calculations require joining the ground truth labels with the corresponding model outputs and aggregating them across many events to get a complete picture of the model’s performance.
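A minimal sketch of this join, assuming each prediction carries an event ID that the later label also references. The DelayedLabelJoiner class and its in-memory dictionaries are hypothetical stand-ins for a durable store, not a description of Fiddler’s internals.

```python
class DelayedLabelJoiner:
    """Hypothetical join of predictions with ground truth labels that arrive later."""

    def __init__(self):
        self.predictions = {}  # event_id -> predicted label
        self.labels = {}       # event_id -> ground truth label

    def record_prediction(self, event_id, predicted):
        self.predictions[event_id] = predicted

    def record_label(self, event_id, actual):
        # Labels may arrive days or months after the prediction was logged.
        self.labels[event_id] = actual

    def pending(self):
        """Number of predictions still waiting for a label."""
        return sum(1 for event_id in self.predictions if event_id not in self.labels)

    def accuracy(self):
        """Accuracy over labeled predictions only; None until a label arrives."""
        matched = [e for e in self.labels if e in self.predictions]
        if not matched:
            return None
        correct = sum(self.predictions[e] == self.labels[e] for e in matched)
        return correct / len(matched)


# Usage: three predictions are logged; labels for two of them arrive later.
joiner = DelayedLabelJoiner()
joiner.record_prediction("e1", 1)
joiner.record_prediction("e2", 0)
joiner.record_prediction("e3", 1)
accuracy_before = joiner.accuracy()

joiner.record_label("e1", 1)
joiner.record_label("e2", 1)
accuracy_after = joiner.accuracy()
still_pending = joiner.pending()
print(accuracy_before, accuracy_after, still_pending)
```

The accuracy figure changes as labels trickle in, which is why old prediction events have to stay addressable long after they were first recorded.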

Unified view – monitoring, explainability, fairness

Fiddler ties model monitoring to other pillars of ML observability, e.g. explainability and model fairness. The unified approach offers a comprehensive toolkit for ML engineers and Data Scientists to embed into their ML workflows, and provides a single pane of glass across model validation and model monitoring use cases. Building this visibility would require highly interactive visualizations, which are beyond the reach of tools like Grafana and Kibana.

Integrations

Enterprise products are incomplete without solid integrations with the other components in the ecosystem. Fiddler today integrates with several different tools for different purposes.

  • ML serving systems - Amazon SageMaker, Databricks, Spell, Seldon
  • Data Warehouses - Snowflake, BigQuery, Redshift, SingleStore
  • Storage systems - S3, GCS
  • Workflow engines/Pipelining tools - Airflow, Flyte, etc.
  • Notification services - Email, PagerDuty, Slack

How to get started

Fiddler offers a comprehensive suite of ML model monitoring capabilities that would otherwise typically require assembling several OSS solutions. Fiddler not only shortens the time it takes to start monitoring models but also provides a safer and faster way for an organization to scale its ML models in production while maintaining visibility. Contact us to learn more.