At our most recent Data Driven NYC, we had the great pleasure of hosting Bindu Reddy, CEO and co-founder of Abacus.AI, formerly GM and creator of AI verticals at AWS, and an ex-Googler. Bindu also has a very witty and entertaining Twitter account (@bindureddy), where she talks about all things machine learning and AI.
This was a very educational and approachable conversation, where we covered:
- Some key definitions: neural networks, weights and biases, supervised vs. unsupervised learning, feature stores
- Applying neural networks to structured, tabular data
- Abacus’ vision around “autonomous AI”
- Why companies wait too long to start experimenting with ML/AI
Below is the video and below that, the transcript.
(As always, Data Driven NYC is a team effort – many thanks to my FirstMark colleagues Jack Cohen and Katie Chiou for co-organizing, Diego Guttierez for the video work and to Karissa Domondon for the transcript!)
[Matt Turck] Bindu, welcome to Data Driven. You are the CEO of Abacus.ai, formerly known as RealityEngines, a startup that you founded in 2019, and you’re also a wonderful person to follow on Twitter at @BinduReddy, super entertaining and interesting.
Tell us about your background – you’ve done some really incredible things. I’d love for you to share the story.
I grew up in India. As you probably know, India is a very, I would say, academic culture, especially in the upper middle class. I had a great education, and I ended up going to an engineering school in Bombay. Then from there on, I came to graduate school at Dartmouth. After I graduated, I was really attracted to California because of the warm weather. India is a very warm place and Dartmouth is a very cold place. I was about to leave. I was like, this is not working for me. Luckily, a friend of mine invited me to Caltech. I hung out in California, and I was like, this is where I want to move.
I ended up moving to California and started working for a biotech firm as a computational biologist. I was always excited about neural nets and genetic algorithms and so on, but I was excited about them in the context of biology, something which is actually super hot right now. But I was a little early for my time, I think. There wasn’t enough data in the bio regime for us to develop really good ML models or simulations and things like that. I ended up getting excited about the Internet, started seeing it take off in the late ’90s and early 2000s, and I jumped from bio into technology.
I worked at a startup called Upwork, then went to Google. Learned a lot there. Ended up being Head of Product for the G Suite, where I product-managed Docs, Spreadsheets, and Slides. For some time at Google I actually started becoming a skeptic when it came to machine learning, because there were so many black-boxy models that were treated as magical. Then I went and did my own startup called Post Intelligence, turned around 180 degrees, and fell in love with machine learning again after seeing its power when it comes to advertising. I finally ended up at Amazon Web Services, which is most related to my current situation.
I used to be a B2C gal working on all sorts of consumer apps, and then I became a B2B person. Seeing the power of the cloud just enabling millions of businesses was super exciting for me.
Then I quit Amazon a couple of years ago and started Abacus. I’m super excited to be in this space. I feel like after 20 years of exploration, I found the right job for myself. It took me 20 years to grow up.
As a level set, can we start with an approachable definition of neural networks and what those actually do, what they’re good at and not so good at?
Okay. At its very, very basic level, think of a neural network as literally a bunch of different nodes. Think of a graph: there are inputs into the graph, there are outputs, and you’ll imagine that there is a bunch of data.
Take a very simple use case. The best example use case that I like giving people is the use case of cat versus no cat. Imagine you had 1,000 images, and in those 1,000 images there were 500 which had a cat and 500 which didn’t. What you’re going to do is train a neural network to detect whether there’s a cat or not. The way you would do that is you would input the image, and from that input you want to be able to get an output, and the output is a simple one: cat versus no cat.
The input can be the image in a mathematical form. What you’re doing is at all times you’re taking this mathematical vector and you’re transforming it, and you’re transforming it in different ways.
There are two ways you can transform it: with something called a weight or something called a bias. Just think of those as a multiplicative factor or an additive factor applied to that image, and using those transformations, you get to an output which is cat or no cat.
You have this graph network and you have these weights and biases. Think of them as parameters in a function, and all you’re doing is fitting that function. That’s a neural net. Think of it as a very generic universal function that you’re fitting with the weights and biases of each of those nodes in the graph. Once you fit that, you now have a model which, given a new image, the 1,001st image, will take the vector of that image and try to predict whether it has a cat or not. What it’s doing essentially is learning a mathematical representation of that image, and of the cat specifically, if that makes sense at a very abstract level.
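A minimal sketch of the idea above, shrunk down to a single node: each “image” here is just a made-up feature vector (not real pixels), the labels are synthetic, and fitting the weights and bias by gradient descent is the “function fitting” Bindu describes.

```python
import numpy as np

# Toy illustration (not a real cat classifier): 1,000 fake "image" vectors,
# each labeled 1 (cat) or 0 (no cat), and a single-node "network" that
# multiplies by weights (multiplicative factors) and adds a bias (additive).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))              # 1,000 synthetic image vectors
true_w = rng.normal(size=8)
y = (X @ true_w > 0).astype(float)          # synthetic cat / no-cat labels

w = np.zeros(8)                             # weights to be fitted
b = 0.0                                     # bias to be fitted

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for _ in range(500):                        # "fitting the function"
    p = sigmoid(X @ w + b)                  # transform input into an output
    grad_w = X.T @ (p - y) / len(y)         # gradient of the log loss
    grad_b = np.mean(p - y)
    w -= 0.5 * grad_w
    b -= 0.5 * grad_b

# Given a new, 1,001st "image", the fitted model predicts cat or no cat.
x_new = rng.normal(size=8)
print("cat" if sigmoid(x_new @ w + b) > 0.5 else "no cat")
```

A real image network stacks many such nodes in layers, but the moving parts are the same: weights, biases, and a loss being minimized.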
Could you talk about the different types of learning, supervised versus unsupervised?
Neural nets have become really popular over the last few years, and the reason is that they are very effective at solving pretty much all problems. Take a place like Pinterest (NOTE: Dave Burgess, Head of Data Engineering at Pinterest, spoke right before Bindu), which is looking at 80 different machine-learning problems, and a lot of those problems have to do with images.
Before 2012, these problems were very hard to solve. What happened in 2012 is a bunch of people came up with a neural net called AlexNet, which was really good at processing images and solving different types of image classification problems, like determining whether there are cats or other objects in an image.
Basically, once they figured that out, they started realizing that this could be a whole class of problems which neural nets could solve, and at a very global level think of it as three types: supervised and unsupervised and, let’s just say, semi-supervised learning.
Supervised learning is what you see every day. It’s what you see used at Pinterest, what you’ll see used at Google, and basically it’s nothing more than getting labeled data and using that labeled data to train a neural net. You’re telling the neural net, this is an image which has cats.
You could also tell the neural net, this is a video which has a person moving. That’s what happens at an autonomous vehicle (AV) company. They train with loads and loads of data, which they get by recording as they drive and then labeling each and every single frame of that video with pedestrians and so on, right. That’s labeled data, and training on that data is supervised learning.
The other extreme of this, and the magical extreme of this is what we call unsupervised learning where you take a bunch of data, just give it to a model, and that model tries to learn patterns in that data. At its very simplest, think of taking a bunch of data and clustering data points which are together, and that’s unsupervised learning because you’re not teaching it anything.
Everything in between, where you have some labeled data you can learn from and then use that partially labeled data to learn from data which is not labeled, is what you would call semi-supervised learning. If you think about those two extremes, the tons of models in production today are mostly supervised learning.
Unsupervised learning is kind of coming of age; things like anomaly detection are unsupervised learning, if that makes sense.
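The “clustering data points which are together” example of unsupervised learning can be sketched in a few lines. This is a bare-bones k-means on made-up 2-D points, purely illustrative, not a production algorithm: no labels are given, and the groups emerge from the data alone.

```python
import numpy as np

# Made-up, unlabeled 2-D points forming two natural groups.
rng = np.random.default_rng(1)
data = np.vstack([rng.normal(loc=-3.0, size=(100, 2)),
                  rng.normal(loc=+3.0, size=(100, 2))])

centroids = np.array([data[0], data[-1]])   # start from two actual points
for _ in range(20):
    # Assign each point to its nearest centroid...
    dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
    labels = dists.argmin(axis=1)
    # ...then move each centroid to the mean of its assigned points.
    centroids = np.array([data[labels == k].mean(axis=0) for k in range(2)])

print(np.round(centroids))   # the two centers land near (-3, -3) and (3, 3)
```

Nothing told the algorithm what the groups mean; it only learned which points sit together, which is exactly the “not teaching it anything” point above.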
You mentioned examples with videos, which, along with documents and sounds, qualify as unstructured data. But deep learning can be used on tabular data as well, right?
What are some examples and some use cases for tabular data?
Yeah. That’s my favorite topic. If you look at Abacus, you will see that our first focus was all on tabular data. One reason we focused on tabular data, and why deep learning is actually pretty effective on it, is that every company has multiple machine-learning models they want to develop using tabular data. At its most basic level, we have something called predictive modeling. Predictive modeling is essentially finding what we call a dependent variable based on other independent variables.
Let me give you a concrete use case. If you go and look for a home on Zillow or Redfin, you have something called the house price estimate, the Zestimate; I’m sure everyone’s seen this. The Zestimate comes from a machine-learning model, a predictive model, which uses all kinds of properties like, say, bedrooms, bathrooms, square footage, neighborhood, location, and whatnot, and tries to determine the price of the house based on recent homes with those attributes which have sold. It’s learning constantly from recent sales and then, based on the attributes of your home, trying to predict what the price would be, right. That’s a predictive model.
There are very many methods you can use to build these predictive models. Conventionally, people use things like gradient boosted trees or other tree-based, classical algorithms. Of late, there has been a big rise in using neural nets. If you look at a company like Google or Facebook, almost all of their models, 95% of them, are neural-net based. Why? Because neural nets actually do really well with high-dimensional tabular data. In the case of these houses, they do really well if you have 50-plus columns, and the more columns you have, the better they do.
The second aspect is that neural nets will actually look at the unstructured pieces of the data. Let’s take the example of the houses. What if I could also put in the description of the house to enhance the price prediction? Clearly, I could say, hey, I have a huge pool house at my house, right. That may be just in the description, and the neural net can pick up that every time there’s a pool house, there’s maybe an extra $50k that you price the house at. A classical machine-learning model won’t do that for you. Here the neural net is looking at the language data, understanding it, and helping you parse that data.
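The pool-house point can be made concrete with hypothetical numbers. In this sketch the text signal is surfaced as an explicit hand-built indicator feature so a linear model can see it; the deep-learning advantage described above is that a neural net would learn such signals from the raw description on its own. All prices and listings below are invented for illustration.

```python
import numpy as np

# Hypothetical listings: square footage, a free-text description, sale price.
rows = [
    (1800, "charming starter home",    520_000),
    (1850, "huge pool house in back",  575_000),
    (2400, "needs TLC",                640_000),
    (2450, "pool house and big yard",  700_000),
    (1700, "cozy, close to transit",   500_000),
]

# A purely tabular model would only see sqft; here we hand-derive a text
# feature that flags descriptions mentioning a pool house.
X = np.array([[sqft, ("pool house" in desc)] for sqft, desc, _ in rows],
             dtype=float)
y = np.array([price for *_, price in rows], dtype=float)

# Least squares with an intercept; the second coefficient estimates the
# premium attached to the "pool house" mention.
coef, *_ = np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)
print(f"estimated pool-house premium: ${coef[1]:,.0f}")
```

On this toy data the fitted premium comes out in the tens of thousands of dollars, which is the kind of pattern a neural net can extract directly from language.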
Take one more example: a sequence of events, where a particular event is followed by another event. Say you’re doing a lead-scoring model. You’re scoring all your incoming marketing leads, and you want to score the leads which are going to become customers as high and the leads which are not as low. Neural nets will do a much better job than a classical model because they look at the sequence of what the customer did. For example, if someone just attended a webinar, they’re much more likely to convert than if they attended a webinar five months ago, then maybe clicked on an ad, and then dropped off, right. Those are the two big reasons why neural nets are super useful on tabular data as well.
Let’s talk about Abacus. The big part of the vision and the idea as I understand it is really to democratize access to what you just described. Tell us about the platform and what the company does.
Yeah. We believe we’re unique in the sense that we want to do what is called autonomous AI. What do I mean by that? We want to ideally take your data as is. Let’s assume you’re using Redshift or Snowflake or what have you. There’s a bunch of data sitting there, coming in on a regular basis. We want to take that data and be able to convert it into intelligence at enterprise scale. What that means is: take a use case, say lead scoring, and assume that your data is in four or five different places. We want to be able to connect to those data sources, set up your data pipelines, and do automated data transformations, data wrangling, and data processing. Once your data is ready for consumption by a model, we want to find the right neural net based on your use case and your data set, create the best neural net possible with your data, and then help you deploy that neural net into production.
Then once it’s in production, you want to ideally be able to retrain that model, keep it running, and keep it in production over a long time. That’s what we’re calling autonomous AI: that end-to-end of reading your data, processing it, managing those data transformations, creating the model, putting it in production, and continuously retraining it so that it becomes smarter and smarter over time. That’s what Abacus does.
This is not to be confused with AutoML, which has been around for a while. I think Dave actually mentioned this too; it’s been a hot topic for a long time, right: how do I build machine-learning systems easily, with less code or no code, and put them into production? AutoML is one piece of this, which is finding the right model based on your data. We actually automate the whole end-to-end, and the biggest chunk of the work we do is in the data processing, the data wrangling and management, which, if any of you in the audience are data scientists, you know is a pain and the most problematic thing. It’s called Data Driven NYC for a reason, I guess.
How does it work behind the scenes? Presumably you’re using different open-source frameworks and you build the glue around them to have them all work together. Is that the right way to think of it?
Yeah. That’s absolutely right. When I saw Dave explain all of those pieces in the Pinterest ML system, we literally have equivalent pieces in our service, right, for all of those pieces, all the way from workflow orchestration to data transformations and so on. Yeah, we are similar. We use some version of Kafka internally, Kubernetes, Spark, Redis, pretty much everything that I’m talking about here is an open-source wonder. Yeah, it is a bunch of these different tools pulled together. TensorFlow, of course.
We also have written a bunch of stuff ourselves. Most of our engineering team … actually, one of our co-founders is the founder of BigQuery. There is a very sophisticated data engine that has been written which can process data at scale: merging streaming and batch data sets together, transforming that data, and making it available in real time. These are some hard problems which we solve.
What are you building next? What is the roadmap?
When I say all this, a lot of people are incredulous, to be honest; they’re like, is this really possible to do? In terms of where we are in our current product evolution, on a zero-to-10 scale, we’re around 7 or 7.5, meaning we’re still adding a lot of stuff because we learn as we go, as we get new customers, and there’s a lot of demand for our service right now. As we get new customers, we want to make sure each customer is actually getting what we’re talking about, which is this end-to-end autonomous AI. A large part of our work is focused on making that happen, things like making sure that data transformations are robust and so on.
Having said that, the other key focus for this year is releasing a whole bunch of use cases beyond structured data and beyond the unsupervised learning we just released late last year, which was around anomaly detection. This next quarter we’re going to be releasing a whole bunch of language use cases, support for language, and in Q3 and Q4, support for vision and image use cases.
Who’s an ideal customer for the platform? Is that a Fortune 1000 company that doesn’t have engineering resources? Is that a startup? Who’s the ideal customer?
We have a bunch of, I would say, startups and medium-sized companies. We have two classes of ideal customers. One is the startup or medium-sized company that basically wants to go, go, go, right. They have seven different ML use cases. They may have anywhere between zero and five data scientists, and they want to get things in production. I’ll give you a classic example of this.
This is a medium-sized company, not really a startup. It’s a company called USIC LLC. I’m picking this one because they’re the 811 company. You call them every time you want to dig somewhere, because they have to come and mark exactly where a particular utility is, so that you don’t dig into and damage it. They’re a pretty large company; I think they’re about 500-plus employees. They have a ton of data. They want to build models for things like the probability of a particular site getting damaged, or the time to go and make an inspection. They have multiple different machine-learning models and a lot of scale, and they would rather use Abacus than not. That would be, I would say, our sweet-spot customer on the small- to mid-sized side.
Then we look at the Fortune 500s, and a few of them are actually our customers today. A lot of them will come with a use case which they haven’t really worked on yet, and they want to do it end to end using Abacus. Then we also have something called Abacus AI Deconstructed, for the people who have that 500-person machine-learning team. Take Pinterest. Pinterest, of course, only uses AWS, as Dave mentioned, so we haven’t really pitched to them. But if you have a 500-person data science team, what we can do is give you a module of our service as a standalone component.
Let’s say, you’ve had problems with data wrangling and data cleaning. We have something called a machine-learning feature store that you can use and that will keep your data easy to manage while you consider-
What does that mean? A feature store is one of the hot topics in machine learning. Do you want to define what that means?
Of course. When you’re building machine-learning models, you typically have to build what are called features. The inputs that you send into your machine-learning model are features, and they usually come from your raw data, right. Let’s say you have a bunch of data, just raw records. You want to be able to take that data and transform it in a way that you can feed to your machine-learning model.
The usual problem is, let’s say you’re someone like Pinterest or even Netflix. They have millions of queries a second, right. Now we’re talking about petabytes of data, and imagine transforming all that data just to compute, say, an average across three different points. Let’s just assume that’s a feature. A very simple feature could be the average length of a user session over the last three days. Let’s say Matt used Netflix, and we want to know how long he spends on Netflix, on average, over the last three days. That’s it; now do that for the billions of users they have. That’s a feature.
Now if you want to build this feature, if you want to go build this model, a data scientist would have to compute this feature on all of Netflix’s customers almost in real time, and that is super hard to do. The feature store helps you solve this problem. How? You basically specify your feature in the feature store. The feature store will then go to the data sources that you’re streaming, start computing this for you, and make it available for training or for real-time predictions, right. That makes building your model super simple.
Now you can go literally build your model using a laptop if you want, and then you can go to the feature store, use an API, and get the data transformed. Now you don’t have to deal with the data wrangling, the data cleaning, the data processing. You just use this feature store and you can get going. The other big advantage is, let’s say, I am a data scientist and I built this feature using this feature store’s interface, right. Now, let’s say, Matt comes along and says, I want to use that same feature, the average session time of a Netflix user in the last three days in my model.
Now he doesn’t have to go build that feature all over again. He can just tag that feature, share it with me, and we’re off to the races, right. You’re not wasting time data wrangling and you’re not copying, making tons of copies of data.
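As a rough sketch of the feature itself: the helper below computes the “average session length over the last three days” feature from raw session records. All names and data are made up for illustration; a real feature store computes this continuously over streams and serves it through an API, rather than in a one-off batch function like this.

```python
from collections import defaultdict
from datetime import date, timedelta

# Hypothetical raw session records: (user_id, day, minutes watched).
sessions = [
    ("matt",  date(2021, 4, 1), 30),
    ("matt",  date(2021, 4, 2), 50),
    ("matt",  date(2021, 4, 3), 40),
    ("matt",  date(2021, 3, 20), 500),   # outside the 3-day window, ignored
    ("bindu", date(2021, 4, 3), 10),
]

def avg_session_length(events, as_of, days=3):
    """Average session minutes per user over the trailing `days` days."""
    window_start = as_of - timedelta(days=days - 1)
    totals, counts = defaultdict(float), defaultdict(int)
    for user, day, minutes in events:
        if window_start <= day <= as_of:
            totals[user] += minutes
            counts[user] += 1
    return {user: totals[user] / counts[user] for user in totals}

features = avg_session_length(sessions, as_of=date(2021, 4, 3))
print(features)   # {'matt': 40.0, 'bindu': 10.0}
```

Once a definition like this lives in the store, any model, and any colleague, can reuse the same feature without recomputing or copying the underlying data.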
I was going to ask you some questions about some of your best tweets. Here we are. You had one that I thought was very interesting, saying that startups often wait too long to train and deploy their first machine-learning model: “Your data is your leverage and your best moat against competition. Don’t wait to hire a data science team. Use a low-code ML platform and get started ASAP.” Have you seen people do this successfully? How does one get started, I guess? What are some examples of low-code ML platforms, and how do people go about doing that?
Well, it’s a self-promotional tweet. An example of a low-code ML platform would be Abacus.
I’m sure there are others as well, but I’ll give you an example. One of our customers is a startup called DailyLook. They’re kind of like Stitch Fix, except their business model is slightly different in the sense that, I think, you get to choose certain things you want in your box, but presumably every month you get a box of clothing that you might like. You keep some, you return some.
Even when they were pretty small, with a few thousand customers and no machine-learning team, they started using us. They’ve been using us for a year; they’re one of our oldest customers, and now they’re up to model three. It’s worked really well for them because they knew what the problem was. They wanted to optimize the number of items you keep every month, right. You have to pick the right style items to send on a monthly basis; clearly, that’s a machine-learning problem. You don’t have to wait until you have millions of data points; even hundreds of thousands is more than you need.
If you have 10,000 data points, that’s my rule of thumb: you’re ready to build a model. The minute you build a model, you have what I call a virtuous feedback loop. Now they have customers, they’re using this machine-learning model, and the model decides which items to send and which not to send each month based on customer behavior. Now they spend less money, or they retain more of their customers. Their monthly active customers keep going up, because the machine-learning model is helping them find the right set of stuff to send, so churn is very low, right.
That’s what startups don’t do. They start by saying, hey, I don’t have enough data for machine learning, I’m going to do a heuristic model. But a heuristic model may not be perfect, and it may not learn as the startup evolves. If you start with machine learning from day one, you start banking data and becoming more and more competition-resistant, to some extent.
Among your many other good tweets, another one I liked that was very encouraging and inspiring I guess. “Becoming an expert at deep learning is largely about experimentation. Play with open source machine-learning libraries. Start with some simple CNNs on image data. Try transformers and develop an intuition for them. You don’t need to be a genius, just a tinkerer.” And then you had a separate tweet that said, “machine-learning is like sex, the more you experiment with it, the better it gets.”
Yes. Yes. Well, yeah.
Are you saying, basically, does one need to go to school and be trained, or is that something … basically, what you’re saying is you can train yourself to become a great machine-learning engineer just by experimenting?
The truth of the matter is, if you think about it, when did deep learning start being taught in school? Less than 10 years ago, honestly; I don’t even know if it was 10, maybe less than six years ago. At least I graduated more than 10 years ago, and I think a lot of the people in the audience did too. Some of the best deep-learning engineers are actually people who graduated 10 or 15 years ago, and the reason is that they have experimented and taught themselves. The idea is that machine learning actually rewards the person who is really curious.
Curiosity, data exploration, and experimentation are the three keys: from curiosity you want to explore, and from exploration you want to experiment. If you have that mindset, you succeed, in deep learning especially. In fact, you can even write a paper, I think, and I’ve seen examples of this: you can be first or second author on a deep-learning paper and get that paper accepted at NeurIPS if you have that mindset. There is also intuition that you build the more you experiment. There are so many courses and so many resources right now that it’s actually not that difficult to get in and start if you’re interested.
All right. I’d love to switch to some questions from the group. We have some good ones. Let’s see, one from Gadalia, who I think emailed us the question before this as well: what is the state of the art around labeling data? We really struggle with obtaining enough labeled training data.
Oh, it’s a very good question. Labeled data is really difficult. There’s a whole company called Scale AI, and a bunch of companies around Scale AI, which help you with labeling data. The good news is there are more and more models coming up which don’t require too much labeling, and if you wait two years, there will be even more. Take the most talked-about one in the world, GPT-3. The whole reason GPT-3 is the most popular, or most talked-about, language model in the world today is that you need very little labeled data. As an example, you can give GPT-3 five examples and it will start learning very quickly. The state of the art is moving very rapidly toward teaching models with less and less data, and this will be the case for both images and language very soon.
Question from Mark. What are some of the highest-quality DNN models customers use, for example BERT I guess?
Yeah, BERT is a good one. We specialize in structured-data models. What we like a lot are our deep-learning-based personalization models. They’re super effective. They’re also money-making; customers love making money, obviously, and they also like increasing their revenue. If you look at personalization, that’s the single biggest revenue generator there is. Basically, by personalizing the experience of your customers, you can actually increase lift and engagement and conversion. These models could be simple RNN/LSTM models, but of late we’ve had a lot of success with transformers for personalization as well.
I took a question that is going to enable you to dish out on competitors, what are some of your competitive advantages or even drawbacks compared to H2O.ai and DataRobot?
Yeah, somebody actually knows our space. Maybe H2O and DataRobot consider themselves our competitors; we like to think of them as non-competitors. The reason is that their real focus, as you probably know, is on the AutoML, or automated machine-learning, part of it. Our focus really is to do the whole thing end to end. We’ve spent a ton of time on the data wrangling and data management, and doing that across multiple sources. Literally, you can connect your data from Snowflake, from Redshift, from your data lake, and from Salesforce, and bring it all together. I think our biggest competitive advantage is that data wrangling and the real-time feature store which is part of our service.
There was a question about how you take outliers into consideration in a DNN.
Yeah, that’s a good question. You treat outliers differently based on the use case. We’re use-case based: if you go to our site, you will see that we have these solutions, personalization versus forecasting. With personalization, we just throw out the outliers. With forecasting, we keep the outliers and fill in missing values. We apply different data-cleaning techniques based on what the use case is, and that’s why we need to know your use case at some level.
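A toy sketch of those two treatments, with made-up numbers. The threshold and the median-based cleaning here are illustrative choices, not Abacus’ actual pipeline.

```python
import numpy as np

# Made-up series with one missing value and one outlier.
values = np.array([12.0, 14.0, 13.0, np.nan, 15.0, 240.0])

med = np.nanmedian(values)                 # robust center, ignoring the NaN
mad = np.nanmedian(np.abs(values - med))   # robust spread

# Personalization-style cleaning: throw out points far from the center
# (the NaN fails the comparison too, so it is dropped here as well).
kept = values[np.abs(values - med) <= 10 * mad]

# Forecasting-style cleaning: keep the outlier, fill the missing value.
filled = np.where(np.isnan(values), med, values)

print(kept)     # the 240 outlier (and the NaN) are gone
print(filled)   # the NaN is replaced by the median; 240 is kept
```

The point is that “clean the data” means different operations per use case, which is why the platform needs to know what the model is for.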
Maybe just to close, some quick rapid-fire bonus questions. What’s your favorite data tool that you’ve been using over the years, the thing you cannot live without?
I sound so self-promotional right now. Okay, actually, the thing I can’t live without is Excel. It’s a data tool, so I’m going to go with Excel. Beyond that, just yesterday I trained nine models using Abacus, and I liked it from the model-training perspective. The other thing, from a data science perspective, which has really, really stood the test of time is Spark.
What’s your favorite data book, newsletter, or podcast outside of Abacus and the content that you guys create?
Ah, podcast, newsletter. I love the Freakonomics podcast. It’s broader, and I don’t know if it’s data driven as much, but it’s something I really enjoy, especially when it comes to the data trends they talk about.
Last question. Outside of the immediate world of Abacus, what new data trend or product in the overall data ecosystem are you most excited about or most intrigued about? Any new interesting thing that’s come up on your radar in the last year?
I’ve heard of a company called Rockset; actually, maybe Matt, you know more about this. The thing that I’m most excited about is a replacement for Spark which is easier, because Spark is very hard for live queries. We’ve looked at different things; there’s a tool called Backstory you can look at, and there’s Rockset. But anything and everything which helps you do data transformations on large amounts of data in real time is going to be the big thing, and sooner or later we’re going to figure something out there.
Wonderful. Well, that’s a great place to end this conversation. Thank you so much.
Great. Thank you. Thanks for having me.