As enterprises around the world deploy machine learning and AI in actual production, it’s becoming increasingly critical that AI can be trusted to produce not just accurate, but also fair and ethical results. An interesting market opportunity has opened up to equip enterprises with the tools to address those issues.
At our most recent Data Driven NYC, we had a great chat with Krishna Gade, co-founder and CEO of Fiddler, a platform to “monitor, observe, analyze and explain your machine learning models in production with an overall mission to make AI trustworthy for all enterprises”. Fiddler has raised $45 million in venture capital to date, most recently a $32 million Series B just last year in 2021.
We got a chance to cover some great topics, including:
- What does “explainability” mean, in the context of ML/AI? What is “bias detection”?
- What are some examples of business impact of “models gone bad”?
- A dive into the Fiddler product and how it addresses the above
- Where are we in the cycle of actually deploying ML/AI in the enterprise? What’s the actual state of the market?
Below is the video and full transcript. As always, please subscribe to our YouTube channel to be notified when new videos are released, and give your favorite videos a “like”!
(Data Driven NYC is a team effort – many thanks to my FirstMark colleagues Jack Cohen, Karissa Domondon and Diego Guttierez)
TRANSCRIPT [edited for clarity and brevity]:
[Matt Turck] You’ve had a very impressive career as a data engineering leader. You worked at Microsoft and Twitter, then Pinterest and Facebook. And you could have tackled pretty much any problem in this broad data space which keeps exploding and getting more interesting. Why did you choose that specific problem of building trust in AI?
[Krishna Gade] I spent 15 years of my career focusing on infrastructure projects, whether that’s search infrastructure or data infra or machine learning infrastructure at Facebook. When I was working at Facebook, we ran into this very interesting problem: we had a lot of machine learning models powering core products, like newsfeed and ads. And they became very complex over time.
And simple questions like, “Hey, why am I seeing this story in my newsfeed?” were very difficult to answer. The answer used to be, “I don’t know. It’s just the model,” right? And those answers were no longer acceptable by internal executives, product managers, developers. In those days, “explainability” was not even a coined term. It was just plain, simple debugging. So we were debugging how the models work and understanding which model versions were running for which experiments, what features were actually playing a prominent role and whether there was an issue with the model or the feature data that was being supplied to the models.
It helped us address feed quality issues. It helped us answer questions that we’d get across the company. And eventually, that effort that started with one developer then became a full-fledged team where we had essentially established a feed quality program and built out this tool called Why Am I Seeing This, which was embedded into the Facebook app and showed these explanations to employees and eventually end users.
That experience really triggered this idea. By then, I had been working on machine learning for a long time. I’d spent some time working on search quality at Bing, and in those days, I’m talking mid-2000s, we were actually productizing neural networks for search ranking, two-layer networks. The thing was, I saw that machine learning was actually going beyond just FAANG companies or companies that were trying to sell advertisements. It was actually entering the enterprise in a big way. And by that time, we had seen the emergence of tools: SageMaker had launched and there was already DataRobot.
A lot of those tools were focusing on helping developers build models faster in an automated fashion and whatnot. But I felt like without actually having visibility into how the model is working and understanding how the model was built, it’s going to be very difficult to make sure that you’re deploying the AI in the right way. And part of my experience being at Facebook also helped me understand that part and how important it is to do it right.
We saw this space where eventually the hypothesis was that the machine learning workflow will become the software developer lifecycle where the developers will choose the best-in-class tools to put together their ML workflow. We saw an opportunity to build a monitoring, analysis and explainability tool in that workflow that can connect all of your models and give you these insights continuously. That was the hypothesis. This was a new category that we wanted to create. Fortunately, here we are three and a half years later. This category is now thriving and there’s a lot of interest from a lot of customers and active deployments as well today.
Let’s go through a quick round of definitions just to help anchor the conversation. What does “explainability” mean, in the context of machine learning?
There are essentially two things that are very unique about a machine learning model.
At the end of the day, a machine learning model is a software artifact, right? It is trained using a historical dataset. So it’s essentially recognizing patterns in a dataset and encoding them in some sort of structure. It could be a decision tree. It could be a neural network or whatever structure that is.
And it then can be applied to infer new predictions on new data, right? That’s basically what machine learning is at the end of the day.
Now, the structures that machine learning models learn are not human-interpretable. If you want to understand how a deep neural network is detecting a particular image to be a cat versus a dog, or how a model is classifying a transaction as fraudulent or non-fraudulent, or why a model being used to set credit limits for a customer at a credit card company is doing what it’s doing, that’s the black box.
It’s not like traditional software, where if I’ve written a traditional piece of software and encoded all these instructions in the form of code, I can actually look into the code line by line, and a developer can actually understand how it works and debug it. For a machine learning model, it’s not possible to do that. So that’s number one.
Number two is these models are not static entities. Unlike traditional software, the quality of the model is highly dependent on the data it was trained with. And so if that data changes over time or shifts over time, then your model quality can deteriorate over time.
For example, let’s say I have trained a loan credit risk model on a certain population. Now suddenly, say, a pandemic happened. People lost jobs. Businesses foreclosed. And a whole lot of societal disturbances happened. Now the kind of applicants that are coming to me to apply for loans are very different from the type of applicants that I used to train the model.
This is called data drift in the ML world. And this is the second big problem: you have a model that you built, and you might be flying blind without knowing when it is actually making the right predictions and when it is making inaccurate predictions. Those are the two problems where you need transparency or explainability or visibility into how the model is working.
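To make the data drift idea concrete, here is a minimal sketch of one common way to quantify it: comparing the binned distribution of a feature at training time against production traffic with a Population Stability Index (PSI). The bin count, thresholds, and synthetic income data are all illustrative assumptions, not Fiddler’s actual implementation.

```python
import numpy as np

def psi(train_values, prod_values, bins=10):
    """Population Stability Index between a baseline and a production sample."""
    # Bin edges are fixed from the training (baseline) distribution.
    edges = np.histogram_bin_edges(train_values, bins=bins)
    train_pct = np.histogram(train_values, bins=edges)[0] / len(train_values)
    prod_pct = np.histogram(prod_values, bins=edges)[0] / len(prod_values)
    # Clip to avoid log(0) in sparsely populated bins.
    train_pct = np.clip(train_pct, 1e-6, None)
    prod_pct = np.clip(prod_pct, 1e-6, None)
    return float(np.sum((prod_pct - train_pct) * np.log(prod_pct / train_pct)))

rng = np.random.default_rng(0)
baseline = rng.normal(50_000, 10_000, 5_000)  # incomes the model was trained on
shifted = rng.normal(35_000, 12_000, 5_000)   # post-shock applicant pool

print(psi(baseline, baseline[:2_500]))  # near zero: same population, no drift
print(psi(baseline, shifted))           # large: significant drift, time to investigate
```

A monitoring pipeline would run a check like this per feature on a schedule and fire an alert when the metric crosses a threshold.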
What is “bias detection”?
It’s part of the same problem. Now, for example, let’s say I trained a face recognition model. We’ve all been aware of all the problems of face recognition AI systems, right? Based on the population that you’ve trained the AI system, it can be very good at recognizing certain kinds of people. So let’s say maybe it’s not trained on Asian people or African-Americans. It may not be able to do well. And we have seen several incidents like this, right?
The most popular one in our recent history was the Apple Card gender bias issue where when Apple rolled out their credit card, a lot of customers complained that, “Hey, I’m getting very different credit limits between myself and my spouse even though we seem to have the same salary and similar FICO score and whatnot.” And almost 10 times the difference in credit limits, right? And how is it happening? It could be possible that when you build these models, you may not have the training data in a balanced manner. You may not have all the populations represented across positive and negative labels.
You may have proxy bias entering into your model. For example, let’s say if you use zip code as a feature in your model to determine credit risk. We all know zip code has a high proxy, a high correlation with race and ethnicity of people. So now you can actually introduce a proxy bias into the model by using features like that. And so this is another reason why you need to know how the model is working so that you can actually make sure that you’re not producing bias in decisions using machine learning models for your customers.
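The proxy-bias point above can be illustrated with a few lines of synthetic data: even when the protected attribute is never given to the model, a correlated feature like zip code can reproduce the disparity. The data, the 80% correlation, and the demographic-parity metric are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
# Protected attribute: NOT a model input.
group = rng.integers(0, 2, n)
# Proxy feature: aligned with the protected attribute 80% of the time.
zipcode = np.where(rng.random(n) < 0.8, group, 1 - group)
# A "model" that decides purely on the proxy feature.
approved = zipcode == 1

def demographic_parity_gap(decision, attribute):
    """Absolute difference in positive-decision rate between the two groups."""
    rates = [decision[attribute == g].mean() for g in (0, 1)]
    return abs(rates[0] - rates[1])

# Large gap between groups despite the model never seeing `group` directly.
print(demographic_parity_gap(approved, group))
```

Here the gap comes out near 0.6, which is exactly the kind of disparity a bias-detection check is meant to surface.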
What’s another example of “models gone bad” in terms of how it impacts the bottom line?
We hear this from our customers all the time. In fact, there was a recent LinkedIn post by an ML engineer, I think from a fintech company. It’s a very interesting example. So this person trained a machine learning model. One of the features was an amount, I think it was income or loan amount. It was basically being supplied by an external entity, like a credit bureau, and the input was coming in the form of JSON. It was coming in basically as “2,000” when the real value was $20. So it was basically $20 versus $2,000, right?
So the data engineers knew this business logic, and they would actually divide the 2,000 by 100 before storing it into the data warehouse. But the ML engineer did not know about it. So when he trained the model, he was actually training it the right way, using the $20 value. But when the production data was sent to the model, it was actually sending 2,000. So now you have a massive difference in terms of the input values, right?
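A basic training/serving skew check would have caught this 100x unit mismatch before it denied a single loan. Here is a minimal sketch, assuming a simple median-ratio heuristic; the feature names, values, and threshold are illustrative, not from the anecdote or from Fiddler.

```python
import numpy as np

def scale_skew_alerts(train, live, ratio_threshold=5.0):
    """Flag features whose live median differs from the training median by a large factor."""
    alerts = []
    for name in train:
        train_med = np.median(train[name])
        live_med = np.median(live[name])
        # Ratio of the larger median to the smaller one, guarding against zero.
        ratio = max(train_med, live_med) / max(min(train_med, live_med), 1e-9)
        if ratio > ratio_threshold:
            alerts.append((name, round(ratio, 1)))
    return alerts

train = {"loan_amount": np.array([18.0, 20.0, 25.0, 30.0])}         # post-division dollars
live = {"loan_amount": np.array([1800.0, 2000.0, 2500.0, 3000.0])}  # raw upstream values

print(scale_skew_alerts(train, live))  # [('loan_amount', 100.0)]
```

Running this on every batch of inference traffic, and alerting on the result, is the kind of automated check that turns a silent 24-hour outage into a page within minutes.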
So as a result, they were denying pretty much every loan request that they were getting for 24 hours. They had an angry business manager coming and talking to them. And they had to go and troubleshoot this thing and fix it. These are similar issues that we see amongst our customers. One of our customers mentioned that when they deployed a pretty important business critical model for their application, that started drifting over the weekend. And they lost up to about half a million dollars in terms of potential revenue, right?
The most recent one that we all have been aware of, and which we don’t really know the complete details of, is the Zillow incident, where they’re supposed to have used machine learning to do price prediction. We don’t know what went wrong there. But we all know the outcome, and the business lost a lot of money. So this is why it’s very important: not just for reputation and trust reasons, from a branding perspective, to make sure that you’re making responsible and fair decisions for your customers, which is also important, but for your core business. If you’re using machine learning, you need to know how it’s working.
What’s your sense of the level of awareness of those problems?
There are obviously two types of companies in the world, companies who have invested a lot of energy and money and people and data and the mature data infrastructure and are now leveraging the benefits of both machine learning and AI, right? We work with a lot of companies in that side of the world where they are basically trying to productize machine learning models. And they’re looking for this monitoring.
Most of these customers, when we spoke to them, were using or trying to retrofit existing DevOps monitoring tools. Say one of the customers was using Splunk with SageMaker. They would train their models, deploy their models. And they would try to retrofit Splunk, which is a great tool for DevOps monitoring but retrofitted for model monitoring. Same thing with a lot of customers would use Tableau or Datadog or homegrown, open source tools, like RAVENNA.
They had to do a whole bunch of work up front: creating custom pipelines that calculate drift, custom pipelines that calculate accuracy, explainability algorithms and whatnot. So after a point, the effort they were putting in was not giving them any business ROI. Fiddler packages all of this functionality in an automated way, so that you can point it at the log data coming out of your models and quickly get those insights.
So in a sense, we uncovered this category. We started working with customers that were already doing it themselves because there was nothing else at the time. When we started working with them, we uncovered that post-production model monitoring was something completely unaddressed. And so we started building the product.
Let’s get into the product. Do you have different modules for explainability, for drift, for model management? How is the product structured?
It’s like a layered cake. So essentially, the base layer is model monitoring: a lot of our customers use Fiddler for model monitoring. But we have a lot of other customers, especially in regulated industries, that use it for both pre-production model validation and post-production model monitoring. Model validation is quite important in a fintech or a bank setting because you have to understand how your models are working and actually get buy-in from other stakeholders in your company, whether compliance stakeholders or business stakeholders, before you push the model to production, unlike, say, a consumer internet company. You can’t really afford to do online experiments with freshly created models, right? So model validation is a big use case for us.
And then we are now seeing that model audits where a lot of companies, especially again in regulated or semi-regulated sectors, they’re spending a lot of people and money and time to create reports around how their models work for third-party auditing companies. This is where we are finding an opportunity to help them. This is where they’re trying to figure out, “Is my model fair? How is my model working across these different segments and whatnot?” And so that’s the third use case that’s actually emerging for us.
Great. Let’s jump into a demo.
Yeah. Absolutely. I can show the product demo now. So here is a simple model. It’s a random forest model. It’s predicting the probability of churn. So I’m going to start with how… this is basically the details of the model. It’s a binary classification model.
What happened before that, you imported the model in this?
Yeah. Essentially, the expectation is that the customer has already trained the model. They’ve integrated the model artifacts, and they’ve also integrated their training datasets and the ground truth data they trained with into Fiddler.
Do you support any kind of model?
Right. Fiddler is a pluggable service. So we spent a lot of time making sure it works right across a variety of formats. Today we support scikit-learn, XGBoost, TensorFlow, ONNX, MLflow, Spark, most of the popular model formats that people use in production today.
So in this case, this is actually a random forest, a sklearn model. It’s a very simple model. And these are the nine simple features it was trained with. Most of them are just discrete or continuous features.
And now you can see when I’m monitoring it. So we provide a client SDK where the customer can send continuous data when they’re monitoring the models. So essentially, we have integrations with Airflow, Kafka and a few other data infrastructure tools that can pipe the prediction logs to Fiddler in a continuous manner.
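To make the integration step concrete, here is a rough sketch of the general shape a prediction-log event might take as it is piped to a monitoring service over Kafka or a similar transport. The field names, the `make_prediction_event` helper, and the `publish` stand-in are all hypothetical; this is not Fiddler’s actual client SDK.

```python
import json
import time
import uuid

def make_prediction_event(model_id, features, prediction):
    """Package one inference as a self-describing, serializable log record."""
    return {
        "event_id": str(uuid.uuid4()),
        "model_id": model_id,
        "timestamp": time.time(),
        "features": features,
        "prediction": prediction,
    }

def publish(event, sink):
    """Stand-in for a Kafka producer or HTTP client: append serialized JSON."""
    sink.append(json.dumps(event))

sink = []
event = make_prediction_event(
    "churn_rf_v3",
    {"num_products": 2, "geography": "HI", "balance": 1200.0},
    {"churn_probability": 0.83},
)
publish(event, sink)
print(json.loads(sink[0])["prediction"]["churn_probability"])  # 0.83
```

The key design point is that each event carries enough context (model version, features, prediction) for the monitoring side to recompute distributions and drift without touching the serving path again.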
So in this case, you can see that I’m tracking two things here for this probability of churn. One is just the average value of predictions over time just to see how my predictions are doing. But the blue line is the more interesting part which is essentially tracking the drift. This is basically one line that tells you, “Is my model drifting or not?”
And so for a long time, this model drift is quite low. It’s close to zero on this axis. So that’s good because drift being at zero means that the model is more or less behaving the same way that it was trained. But then after a point, it starts drifting quite a bit. And this is where an alert could fire if you configure an alert. And then what Fiddler provides is it provides these diagnostics that really help you figure out what’s going on.
So an alert can fire. An ML engineer or a data scientist can come to Fiddler and see, “Okay. The model started drifting. Why? What’s going on? Why is that happening?” And so this drift analytics table really helps them pinpoint which features are actually having the highest impact on the drift. So in this case, the feature called number of products seems to be having the most impact, 68% impact. And you can see, drill down further. And you can see why that is happening.
You can see that when the model was trained, the baseline data, the training dataset had a feature distribution where most customers were using one or two products when the model was trained. But when the model was in production on this day, you can see that the distribution has shifted. You’ve seen customers using three products or four products now coming into your system.
And you can actually go and verify this. You can go back in time and see that those bars align here, like a few days ago. Whereas, when the model started drifting, you see that there is a discrepancy. Now, this is the point where you start debugging even further. And this is one of the key use cases of Fiddler: this is where we combine explainability with monitoring to give you a very deep level of insight. So this is essentially our model analytics suite, which is the first of its kind. It uses SQL to help you slice and dice your model prediction data and analyze the model in conjunction with the data.
So, for example, here, what I can do is I can actually look at a whole bunch of different statistics on how the model is doing, including, for example, how is the model performance on that given day? What’s the precision recall accuracy of the model, confusion matrices, precision recall curves, ROC curves, calibration plots and all of that? And you can do that with different time segments. You can go and adjust these queries.
So, for example, let’s say if we want to look at all the possible columns, I can just go and simply run my SQL query here. And now you’re essentially getting into this world where I’m slicing the query on one side and then explaining how the model is doing on the other side. So this paradigm is very inspired from MapReduce. So we call it slice and explain. So you’re slicing on one side.
So now what I can do is I can actually look at the feature importance. Is the feature importance shifting? Because this is one of the most important things data scientists care about, right? When the model was trained, what was the relationship between the feature and the target? And now is that relationship changing as the model went into production? Because if it is changing, then it can be a cause of concern. You may have to retrain the model, right?
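The check Krishna describes, comparing feature importance at training time against production, can be sketched with standard tools. Here is a hedged illustration using scikit-learn’s permutation importance on synthetic data, where the feature/target relationship deliberately changes in production; this is not Fiddler’s actual computation.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(2)
n = 2_000

# Training data: the target is driven entirely by feature 0.
X_train = rng.normal(size=(n, 2))
y_train = (X_train[:, 0] > 0).astype(int)
model = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Production data: the relationship has shifted to feature 1.
X_prod = rng.normal(size=(n, 2))
y_prod = (X_prod[:, 1] > 0).astype(int)

# Feature 0 dominates on training data but contributes nothing in production,
# which is exactly the kind of shift that signals the model needs retraining.
for name, X, y in [("train", X_train, y_train), ("production", X_prod, y_prod)]:
    imp = permutation_importance(model, X, y, n_repeats=5, random_state=0)
    print(name, np.round(imp.importances_mean, 2))
```

When the importance ranking on live data diverges from the training-time ranking like this, the relationship the model learned no longer holds, and retraining is the usual remedy.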
So in this case, there’s some slight change happening; you can see that the feature importance of the number of products seems to have changed. And now you can dig into this further. Let’s say I wanted to look at the correlation between number of products and, let’s say, geography. And you can understand how… let’s see. I think I have to put this the other way around. So if I look at the number of products and geography, I can quickly see that across all the states, Hawaii seems to have some weird wonkiness here. You can see that the number of products in Hawaii seems to be much higher than in the other states. So I can go and quickly debug into that.
So I can go and set up, say, another filter. Let’s say I want to look at the Hawaii slice. I can run that query. And I can go back to the feature impact to see the feature importance. You can see that the wonkiness is actually much clearer: the number of products seems to be much wonkier here. I can confirm it by looking at the slice evaluation.
I can see that the accuracy of the Hawaii slice is much lower. Just for comparison, I can go and look at the non-Hawaii slices. You see that the non-Hawaii slices’ accuracy is much higher. So now we have found a problematic segment; it seems to be the Hawaii slice. And you can see that the feature importance in the non-Hawaii slices is actually much more stable. It much more closely resembles the training data.
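In spirit, the “slice and explain” step shown in the demo amounts to filtering predictions to a segment and comparing that slice’s accuracy against its complement. Here is a minimal sketch with pandas standing in for the SQL interface; the toy data, column names, and the all-wrong Hawaii rows are illustrative assumptions.

```python
import pandas as pd

# Toy prediction log: the model happens to be wrong on every Hawaii row.
df = pd.DataFrame({
    "geography": ["HI", "HI", "HI", "CA", "CA", "NY", "NY", "TX"],
    "churned":   [1, 0, 1, 0, 1, 0, 0, 1],
    "predicted": [0, 1, 0, 0, 1, 0, 0, 1],
})

def slice_accuracy(frame, mask):
    """Accuracy of the model on the rows selected by a boolean mask."""
    sl = frame[mask]
    return (sl["churned"] == sl["predicted"]).mean()

hi = df["geography"] == "HI"
print(slice_accuracy(df, hi))    # 0.0  -> the problematic segment
print(slice_accuracy(df, ~hi))   # 1.0  -> the rest of the traffic
```

Sweeping a comparison like this across candidate segments is what lets you localize a global accuracy dip to one geography instead of retraining blindly.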
So now we have found a slice in your data, coming from this geography of Hawaii, where the distribution of this particular feature, the number of products, is different. You can see it’s much more skewed towards people using three or four products. I can now go confirm with my business team: is this a data pipeline issue, or is it actually a real business change? If it’s indeed a business change, now I know that I have to retrain my model so that it can accommodate this particular distribution shift. Any questions here?
Where do you fit in the broad MLOps category? It sounds like you were carving out a category as part of that called model performance management. In general, you guys have some very good category names. There was X… what was it, XAI? Explainable AI.
Yeah. We started with Explainable AI, which is obviously the model explainability stuff we started with. And then we expanded it to model performance management, which covers model monitoring and bias detection. It’s inspired by application performance management, which has been really successful in the DevOps world. And we are trying to bring that into the MLOps world with MPM. We want MPM to be the category which represents the set of tools that you need to continuously monitor and observe your machine learning models at scale.
Great. So in that ML Ops life cycle, what part do you cover? What part do you not cover? And what else should people be thinking about to have a full ML ops solution?
Essentially, we come into the picture when you’re deploying models to production. We work with data science teams that have even just a handful of models, right? So today a lot of teams start with five or six models running in production. And they quickly see that, “Hey, by having Fiddler, I can increase model velocity. I can go from 5 to 50 very quickly because I have standardized model monitoring for my team.”
Everyone knows what needs to be checked and how models are working. And there’s alerting. And I have basically made sure that we are de-risking a lot of our models. So that’s one of the biggest values that we provide for customers that we can increase their model velocity. And at the same time, we help C-level execs make sure they have peace of mind that models are being monitored, that people on the ground are actually receiving alerts. They can actually go get shared reports and dashboards on how the models are working and go in and can ask questions.
As I said, there are two value props that we provide essentially; pre-production model validation where before you deploy the model, how is the model working? And post-production model monitoring. So in some ways, we fit nicely with the ML ecosystem working with an ML platform, say a SageMaker or H2O or any of these ML platforms out there that are helping customers train models or have an open source model framework.
So we can be a really nice plugin into those services. And you can actually use, say, a Fiddler plus SageMaker or a Fiddler plus Databricks. A lot of our customers use that combination to train and deploy models in SageMaker and then monitor and analyze them in Fiddler.
Who’s a good customer for you? Which type of companies? Which industries? Any names or case studies you can briefly talk about.
We have a lot of customers that are on our website in terms of logos. And we have worked with a lot of financial services companies that are deploying machine learning models. The reasons they’re interesting to us are, first, there is a lot of appetite to move from quantitative models to machine learning models. They’re seeing a huge ROI. They have been building models for a long, long time.
If you look at banks, hedge funds, fintechs, investment companies, they see they’re having access to these unstructured data and these ML frameworks. And so they’re able to move from quant models to machine learning models with high ROIs. But they’re also in a regulated environment, right? So they have to make sure that they have explainability around models, monitoring around models.
And so this is a sweet spot for us as we work with companies. But Fiddler is available for customers in agtech, eCommerce, SaaS companies trying to build models for AI-based products for their enterprise customers. But, yeah. Financial services is basically our major customer segment today.
Based on your experience on the ground, where are we in the overall cycle of actually deploying AI in the enterprise? One thing you hear from time to time is that the more advanced companies have deployed ML and AI, but basically, when you dig, it’s really just one model in actual production. It’s not like 20. Is that what you’re seeing as well?
It’s still in the first innings. A lot of the customers that talk to us have less than 10 models or maybe tens of models. But the growth that they’re projecting is to hundreds of models or, for a large company, thousands of models. One of the things that you’re seeing is a lot of new data scientists coming out of grad schools and a lot of new programs.
In fact, I was talking to a cousin of mine who is applying for undergrad courses. The top program for undergrads is not a bachelor’s in computer science anymore. It’s actually a bachelor’s in data science. So you see the shift is actually… there are a lot more ML engineers and data scientists coming out, people reskilling themselves, new people coming out of schools. So we see a secular trend where all these people will go into these companies and build models. But in terms of AI’s evolution life cycle, it’s still in the first innings of a game. But we see the growth happening much, much faster.
Great. Well, that bodes incredibly well for the future of Fiddler. It sounds like your timing on the market is perfect. So thanks for coming by, showing us the product, and telling us about Fiddler. Hopefully, people have learned a bunch. I’ve certainly enjoyed the conversation.