So you think your Enterprise AI model is working amazingly well!

Navneet Singh

Aug 5, 2021 · 5 min read

The challenge of AI model evaluation

The other day I got a call from a friend who had built a predictive maintenance model for drilling machines in her factory. The hope was that the model would accurately predict which machines were about to fail. When her team tested the model on past data, it was extremely accurate at predicting upcoming failures. But now, with the model in production, she was baffled that its accuracy had apparently dropped so much, and she wanted to know what had happened.


What happens in the lab stays in the lab

Here are two of the most glaring problems that illustrate the challenges of evaluating AI models as we move them from the lab to the real world:

The real world has moved on. Your AI model found a certain relationship between the input data and results (e.g. between hours worked and employee attrition) and then scored well on a blind evaluation. But that was all historical data. Do those relationships continue to be as strong in the real world of today?
The difficulty of measuring averted outcomes. Any model that predicts an undesirable outcome, such as whether a manufacturing machine will fail, may appear inaccurate if people act to avoid the predicted outcome. Is it fair to blame the model when no machines fail after the preventive maintenance it recommended?

Suppose you’ve built a predictive model for the churn and renewal risk of your SaaS customers (we, at StepFunction, have). How do you know it’s working, and working as well as it should be? It’s one thing to evaluate the model on historical data using efficacy metrics such as Precision and Recall. But once you put your model into action, you can no longer assess it based on whether “acted-upon” customers churn or not.
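To make the averted-outcome problem concrete, here is a minimal sketch in Python. All of the numbers are made up for illustration: it assumes the model flags 10 customers, that 8 of them would churn if left alone, and that the CSM team saves 6 of those 8. The `precision` helper is a hypothetical name, not part of any library.

```python
def precision(flagged, churned):
    """Share of flagged customers who actually churned."""
    flagged = set(flagged)
    if not flagged:
        return 0.0
    return len(flagged & set(churned)) / len(flagged)


# Hypothetical data: the model flags 10 customers.
flagged = [f"c{i}" for i in range(10)]

# Assumption: 8 of the 10 would churn if nobody intervened.
would_churn = set(flagged[:8])

# Assumption: the CSM team acts on the list and saves 6 of those 8.
saved = set(flagged[:6])
observed_churn = would_churn - saved

# What a historical (no-intervention) evaluation would report.
precision_without_action = precision(flagged, would_churn)

# What you would naively measure in production after the team acts.
precision_after_action = precision(flagged, observed_churn)

print(precision_without_action, precision_after_action)
```

Under these assumptions the measured precision falls from 0.8 to 0.2 even though the model's predictions did not change at all; the only thing that changed is that people acted on them.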


A New Evaluation Mindset

So what’s the right way to measure the impact of your model? Start by thinking hard about your objective for the project: is it to minimize churn rate, to maximize NRR (Net Revenue Retained), or to keep your Customer Success (CSM) costs to a minimum?


Monitoring trends is a starting point

Once you’ve decided on the business metrics you want to improve, you can monitor them over time. You’ll still have the problem of confounding variables. For example, suppose you deploy the churn-reducing AI model in Jan 2020, and six months later the churn rate is clearly down. Should you be overjoyed and declare the project a victory? No, because you cannot say for sure what caused the dip. In fact, many businesses saw reduced churn in the first months of the COVID-19 quarantine as customers avoided drastic action.

Nonetheless, you should certainly track the most important business metrics, e.g. Net Revenue Retained, over time. A significant uptick in NRR, sufficiently long after your AI model goes into production, everything else being equal, can and should give you a good feeling about the project. And you won’t fall into the trap of false excitement if you’re aware of most, if not all, factors that could be changing your key metrics.


Clearest way to evaluate impact — If you can afford it

The gold standard for testing the impact of a new business practice is an A/B experiment. If the churn AI model identifies 1000 at-risk customers in a certain cycle, randomly put X% in set A (the set to be acted upon and given to the CSM team) and the remainder in set B (the set withheld from the CSM team). The obvious problem here is that most business execs won’t be willing to stay silent on an identified set of customers and risk losing them to churn. Research in medical experiments gives us a ray of hope here, and in subsequent posts, we’ll explore techniques for conducting such experiments without going silent on identified at-risk customers.
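The random split described above might be sketched as follows. This is only an illustration under stated assumptions: the customer IDs are invented, the 80% treatment fraction stands in for the unspecified X%, and `split_ab` and `churn_rate` are hypothetical helper names.

```python
import random


def split_ab(customer_ids, treat_fraction, seed=0):
    """Randomly assign a fraction of customers to set A (acted upon);
    everyone else goes to set B (held out)."""
    ids = list(customer_ids)
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(ids)
    cutoff = round(len(ids) * treat_fraction)
    return ids[:cutoff], ids[cutoff:]


def churn_rate(group, churned):
    """Fraction of a group that churned; 0.0 for an empty group."""
    if not group:
        return 0.0
    churned = set(churned)
    return sum(1 for c in group if c in churned) / len(group)


# Hypothetical cycle: the model identifies 1000 at-risk customers.
at_risk = [f"cust_{i}" for i in range(1000)]

# Assumption: X = 80, so 80% go to the CSM team and 20% are held out.
set_a, set_b = split_ab(at_risk, treat_fraction=0.8)
print(len(set_a), len(set_b))
```

Once the cycle ends and you know which customers churned, the difference `churn_rate(set_b, churned) - churn_rate(set_a, churned)` is a rough estimate of how much churn the intervention averted. With a small holdout set that estimate will be noisy, so treat it as directional rather than precise.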


Relative impact — Comparison with old methods

Now suppose your CSM team already had a methodology in place for identifying customers at risk of churn (a lot of SaaS businesses do). The team used triggers, e.g. weekly product usage dropping by more than 50%, to identify at-risk customers; call it the Trigger model, or Model T. Now, starting in Jan 2021, they’ve put the new AI-based predictive model in place; call it Model P. Your objective might now be refined slightly: you might want to know whether Model P is doing a better job than Model T of helping you reduce your churn rate. Or, even if it’s not, is it additive to your existing model? A fair comparison between Model T and Model P is still challenging, e.g. if there’s a big overlap between the sets of customers they identify or if the sizes of the sets are very different. But the comparison is doable, and having a baseline model to compare to is much better than not having one.
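A reasonable first step in such a comparison is simply to quantify the overlap and the size difference between the two sets. The sketch below assumes each model produces a list of flagged customer IDs; the function name `compare_flagged_sets` and the sample IDs are hypothetical, and the Jaccard index is one common overlap measure chosen here for illustration, not something prescribed by either model.

```python
def compare_flagged_sets(model_t, model_p):
    """Summarize how two sets of flagged customers overlap and differ."""
    t, p = set(model_t), set(model_p)
    union = t | p
    return {
        "size_t": len(t),       # customers flagged by Model T
        "size_p": len(p),       # customers flagged by Model P
        "both": len(t & p),     # flagged by both models
        "only_t": len(t - p),   # flagged by Model T alone
        "only_p": len(p - t),   # flagged by Model P alone
        # Jaccard index: shared customers as a fraction of all flagged.
        "jaccard": len(t & p) / len(union) if union else 0.0,
    }


# Hypothetical output of each model for one cycle.
flagged_by_t = ["a", "b", "c", "d"]
flagged_by_p = ["c", "d", "e", "f", "g", "h"]

summary = compare_flagged_sets(flagged_by_t, flagged_by_p)
print(summary)
```

The "only_p" customers are the interesting group when asking whether Model P is additive: they are the ones the trigger-based approach would have missed entirely.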


Conclusion

When you see good results over a short period of time, it’s easy to get excited and believe you’ve invested in something magical, but clear proof is hard to come by. Here are 3 takeaways that can help your ROI efforts:

Don’t rely on traditional measures (used to test AI models on historical data) to assess your AI model’s impact in the real world. If you think that your model is giving you a lot of false positives, i.e. identifying many at-risk customers who end up staying, pause and think: “Could it be that the model is correctly identifying at-risk customers, and my CSM team is saving them?”
Think hard about your business objectives for the project and carefully think through what might be impacting them: is it your AI model, or some other change in the business environment?
Compare the AI project and its results to the absence of this project. We discussed a couple of ways of doing such a comparison: between identified customers who were acted upon and those who weren’t, or between the recommendations of your traditional method and those of the AI model.
