This post covers the complete Enterprise Monitoring landscape including the newest category of artificial intelligence and machine learning monitoring.
With the advent and adoption of web products and services over the past two decades, an entire category of systems dedicated to managing the related infrastructure has developed. Software monitoring, one of the core operational needs, has itself grown into a multibillion-dollar market, accelerated in recent years by the migration of enterprises to the cloud and the adoption of new technology standards including microservice architectures and containerization. The 2017 acquisition of AppDynamics and the 2019 IPOs of Datadog and PagerDuty are evidence that there is still room for growth.
This post will examine the types of companies in the monitoring landscape and then focus on a new entry in this space: AI/ML Monitoring, the monitoring of artificial intelligence and machine learning models in production.
Current enterprise monitoring landscape
Enterprise software monitoring can historically be grouped into two primary categories:
(1) Business monitoring with product analytics
(2) Infrastructure and application performance monitoring
Business Monitoring enables enterprises to monitor product usage to understand leading indicators of business health and opportunities for growth, including how products are used, what drives user behavior or churn, and how workflows or channels convert. To do this, products in this category offer easy ways to compute and visualize simple metrics like pageviews as well as complex insights like workflow, funnel, and cohort analysis. Data is generally assimilated across multiple sources and databases, which can result in delayed metrics. Typical users are business owners like Product Managers. Business Intelligence, served by tools such as Power BI and Tableau, is a related but distinct category focused on analyzing data that has already been collected.
Popular product analytics solutions include incumbents like Adobe Analytics and Google Analytics and new entrants like Heap, Pendo, Amplitude, and Mixpanel, which are innovating with rich, intuitive interfaces that enable quick insights and help democratize product analytics across the company.
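As a rough illustration of the funnel analysis mentioned above, the sketch below counts how many users reach each step of an ordered funnel. The event format (user ID and event name, already sorted by time), the step names, and the function name funnel_counts are assumptions made for this example, not the data model of any product named in this post.

```python
from typing import Dict, Iterable, List, Sequence, Tuple


def funnel_counts(events: Iterable[Tuple[str, str]], steps: Sequence[str]) -> List[int]:
    """Count users reaching each funnel step, in order.

    `events` is assumed to be (user_id, event_name) pairs in chronological order.
    A user only advances when the event matches the next step they have not yet reached.
    """
    progress: Dict[str, int] = {}  # user_id -> number of steps completed so far
    for user, name in events:
        reached = progress.get(user, 0)
        if reached < len(steps) and name == steps[reached]:
            progress[user] = reached + 1
    return [sum(1 for r in progress.values() if r > i) for i in range(len(steps))]


# Hypothetical event log for four users.
events = [
    ("a", "view"), ("b", "view"), ("c", "view"),
    ("d", "add_to_cart"),            # skipped the first step, so not counted
    ("a", "add_to_cart"), ("b", "add_to_cart"),
    ("a", "purchase"),
]
steps = ["view", "add_to_cart", "purchase"]

counts = funnel_counts(events, steps)
for step, count in zip(steps, counts):
    print(f"{step}: {count} users ({count / counts[0]:.0%} of entrants)")
```

Commercial tools layer time windows, segmentation, and cohorts on top of this basic counting, which is part of why their metrics can lag behind real time.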
The second category, infrastructure and application monitoring, enables enterprises to monitor the activity in the underlying application, components, and infrastructure to understand the health of the deployed software. To do this, products in this category offer easy ways to integrate into any running services or systems via software agents or plugins that continuously gather real-time metrics. Metrics can be as simple as CPU usage, latency, and uptime. The shift to the cloud in the 2000s created opportunities for new cloud-first monitoring solutions. Many products now offer easy ways to connect metrics across multiple services or even the entire application stack for complete operational visibility. Open source solutions like Grafana and Prometheus are particularly popular because they are simple to set up for gathering basic standalone metrics quickly. New entrants like Datadog and AppDynamics have grown with easy-to-use and powerful managed observability solutions. Typical users are DevOps teams, ITOps teams, and engineers.
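To make the idea of agent-style metric gathering concrete, here is a minimal sketch of an in-process latency recorder. It is an illustration only: the LatencyRecorder class, its window size, and the handle_request function are hypothetical names invented for this example, and a real agent would ship these samples to a monitoring backend rather than summarize them locally.

```python
import time
from collections import deque
from functools import wraps
from statistics import quantiles


class LatencyRecorder:
    """Keeps the most recent call latencies (in seconds) for wrapped functions."""

    def __init__(self, window: int = 1000):
        self.samples = deque(maxlen=window)  # old samples fall off automatically

    def track(self, func):
        """Decorator that times every call to `func`, including calls that raise."""
        @wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return func(*args, **kwargs)
            finally:
                self.samples.append(time.perf_counter() - start)
        return wrapper

    def summary(self) -> dict:
        if len(self.samples) < 2:
            raise ValueError("need at least two samples to compute percentiles")
        # n=100 yields 99 cut points; index 49 is the median, index 94 the 95th percentile.
        cuts = quantiles(self.samples, n=100, method="inclusive")
        return {
            "count": len(self.samples),
            "p50": cuts[49],
            "p95": cuts[94],
            "max": max(self.samples),
        }


recorder = LatencyRecorder()


@recorder.track
def handle_request(n: int) -> int:
    return sum(range(n))  # stand-in for real work


for _ in range(50):
    handle_request(10_000)

print(recorder.summary())
```

Percentiles are used here instead of an average because a handful of slow requests can hide behind a healthy-looking mean.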
Introducing ML Monitoring
AI’s increased adoption has created a new entrant into the monitoring landscape: ML Monitoring. Businesses are expected to more than double their spending on AI systems, from a projected $35.8 billion in 2019 to $79.2 billion by 2022. But ML is not the easiest technology to deploy: a considerable majority of ML models never make it to production.
Unlike traditional code, ML models are software entities trained on historical data to perform well on specific tasks. Their performance can therefore degrade gradually and subtly when the data fed into the model after deployment differs from the data it was trained on. Successful AI deployments thus require continuous ML Monitoring to keep an eye on their business impact on an ongoing basis.
A complete ML monitoring solution should address all the operational issues faced by ML models: drift detection, outlier identification, real-time performance, data integrity alerts, and bias checks. Because these nuanced issues are specific to machine learning, traditional business and infrastructure monitoring products were not designed to take on this challenge.
The black-box nature of machine learning models makes them especially difficult for data scientists and other ML practitioners to understand and debug. Explainable AI, a recent research advancement, extends traditional monitoring to provide deep model insights with actionable steps. With AI explainability, users can understand the problem drivers, identify the root causes of issues, and analyze the model to prevent a repeat, saving time over more manual investigation methods.
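One common way to quantify this kind of input drift is the Population Stability Index (PSI), which compares how a feature is distributed in production against a training-time baseline. The sketch below is a minimal, standard-library version under stated assumptions: equal-width bins taken from the baseline range, a small eps to avoid taking the log of zero, and made-up sample data. It is not the method of any product discussed in this post.

```python
import math
from typing import List, Sequence


def population_stability_index(
    baseline: Sequence[float],
    production: Sequence[float],
    n_bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """PSI for one numeric feature: 0 means identical binned distributions."""
    if not baseline or not production:
        raise ValueError("both samples must be non-empty")
    lo, hi = min(baseline), max(baseline)
    if hi == lo:
        raise ValueError("baseline has no spread; PSI is undefined")
    width = (hi - lo) / n_bins

    def bin_fractions(values: Sequence[float]) -> List[float]:
        counts = [0] * n_bins
        for v in values:
            # Clamp so values outside the baseline range fall into the edge bins.
            idx = max(0, min(n_bins - 1, int((v - lo) / width)))
            counts[idx] += 1
        # Floor each fraction at eps so empty bins do not produce log(0).
        return [max(c / len(values), eps) for c in counts]

    expected = bin_fractions(baseline)
    actual = bin_fractions(production)
    return sum((a - e) * math.log(a / e) for a, e in zip(actual, expected))


# Made-up data: the production sample is squeezed into the upper half of the range.
training_sample = [i / 100 for i in range(100)]
shifted_sample = [0.5 + i / 200 for i in range(100)]

print(population_stability_index(training_sample, training_sample))
print(population_stability_index(training_sample, shifted_sample))
```

Every term in the sum is non-negative, so a larger PSI always means a larger distribution shift. The threshold at which to raise an alert is a judgment call that depends on the feature and the model.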
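Of the issues listed above, data integrity is the simplest to sketch. The example below validates each incoming record against a hand-written schema before it reaches the model. The FeatureSpec class, the check_record function, and the feature names and ranges are all hypothetical and chosen only for illustration. A production system would typically derive the expected types and ranges from the training data.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class FeatureSpec:
    types: Tuple[type, ...]            # accepted Python types for the feature
    min_value: Optional[float] = None  # inclusive lower bound, if any
    max_value: Optional[float] = None  # inclusive upper bound, if any
    required: bool = True


def check_record(record: Dict[str, Any], schema: Dict[str, FeatureSpec]) -> List[str]:
    """Return a list of human-readable violations; an empty list means the record is clean."""
    violations: List[str] = []
    for name, spec in schema.items():
        value = record.get(name)
        if value is None:
            if spec.required:
                violations.append(f"{name}: missing value")
            continue
        if not isinstance(value, spec.types):
            expected = "/".join(t.__name__ for t in spec.types)
            violations.append(f"{name}: expected {expected}, got {type(value).__name__}")
            continue  # range checks are meaningless on the wrong type
        if spec.min_value is not None and value < spec.min_value:
            violations.append(f"{name}: {value} is below minimum {spec.min_value}")
        if spec.max_value is not None and value > spec.max_value:
            violations.append(f"{name}: {value} is above maximum {spec.max_value}")
    for name in record:
        if name not in schema:
            violations.append(f"{name}: unexpected feature")
    return violations


# Hypothetical schema for a three-feature model.
SCHEMA = {
    "age": FeatureSpec(types=(int,), min_value=18, max_value=120),
    "income": FeatureSpec(types=(int, float), min_value=0),
    "state": FeatureSpec(types=(str,)),
}

good = {"age": 42, "income": 58000.0, "state": "CA"}
bad = {"age": -3, "income": "58k", "zip": "94301"}

print(check_record(good, SCHEMA))
for problem in check_record(bad, SCHEMA):
    print(problem)
```

Checks like these catch broken upstream pipelines (a renamed column, a unit change, a field that starts arriving as a string) that an infrastructure monitor would report as perfectly healthy traffic.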
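A very simple form of prediction attribution can be sketched with feature ablation: replace one feature at a time with a baseline value and record how much the prediction moves. This is a deliberately basic stand-in, not the method used by any vendor mentioned in this post (commercial tools often rely on more principled approaches such as Shapley values). The toy scoring model, the feature names, and the baseline values are assumptions made for this example.

```python
from typing import Callable, Dict

Features = Dict[str, float]


def ablation_attribution(
    predict: Callable[[Features], float],
    instance: Features,
    baseline: Features,
) -> Dict[str, float]:
    """Attribute a prediction to features by swapping each one for its baseline value."""
    full_prediction = predict(instance)
    attributions: Dict[str, float] = {}
    for name in instance:
        perturbed = dict(instance)
        perturbed[name] = baseline[name]
        # Positive value: this feature pushed the prediction up relative to the baseline.
        attributions[name] = full_prediction - predict(perturbed)
    return attributions


def toy_credit_score(x: Features) -> float:
    """Hypothetical black-box model; here it is secretly linear for easy checking."""
    return 0.5 * x["income_k"] - 2.0 * x["late_payments"]


applicant = {"income_k": 80.0, "late_payments": 3.0}
typical = {"income_k": 60.0, "late_payments": 1.0}  # e.g. training-set averages

for feature, effect in ablation_attribution(toy_credit_score, applicant, typical).items():
    print(f"{feature}: {effect:+.1f}")
```

For a linear model like this toy one, the attributions add up to the gap between the applicant's score and the baseline score. For models with feature interactions they generally do not, which is one reason more sophisticated attribution methods exist.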
ML monitoring can be offered standalone or as part of a broader ML platform. Companies with basic ML monitoring needs, like viewing model inputs and outputs over time, can repurpose open source tools like Grafana as an alerting base on which to build and maintain their desired operational monitoring capabilities. Teams looking to maintain high performance in ML deployments can instead select an end-to-end ML platform for ease of procurement, or integrate individual standalone ‘best of breed’ solutions into their own systems to gain a competitive advantage. Since ML monitoring is a new product area, many companies are still building out their solutions. Azure and Dataiku, for example, offer only one capability, drift detection (available in preview), while SageMaker offers data integrity and drift checks and IBM OpenScale offers bias detection. Explainability, a growing need, is available in select products like Fiddler and IBM OpenScale. These solutions include additional tools, such as prediction-attribution methods and other model analytics, that isolate problem drivers for fast resolution.
Just as the explosion of cloud created demand for a new generation of traditional monitoring solutions, we expect the adoption of AI across businesses to create a major need and market for ML monitoring solutions.