At our most recent Data Driven, we had the great pleasure of hosting Chip Huyen, a writer and computer scientist who also teaches machine learning design at Stanford, for a fascinating and fun conversation.
We covered a range of topics, including:
- What is machine learning design?
- The MLOps landscape, and how it’s both overdeveloped and underdeveloped
- What is online machine learning?
- The divergence between East and West for machine learning and data infrastructure
- A couple of book recommendations
Below is the video and below that, the transcript.
(As always, Data Driven NYC is a team effort – many thanks to my FirstMark colleagues Jack Cohen and Katie Chiou for co-organizing, Diego Guttierez for the video work and to Karissa Domondon for the transcript!)
VIDEO
TRANSCRIPT
[Matt Turck] Chip, welcome, you are a writer and computer scientist, and you teach machine learning design at Stanford. We appreciate you being here today. You have so many interesting things, your background and your story, your resume. Can you start with the story of your life?
[Chip Huyen] Thank you so much for those kind words. I actually have this inferiority complex going right after Jack. He’s the VP of Data at Reddit. I don’t even have enough karma to create my own subreddit. Thank you for squeezing me in with amazing speakers.
My background is a little bit nonlinear. I come from a very non-tech background. I grew up in Vietnam, chasing grasshoppers, if you want to be romantic about it. Then, I decided not to go to college and travel the world for a few years. Then, I wrote a couple of books about it. Then, I just didn’t know what to do with my life after that, so I went back to college. I never thought I would become an engineer because at that time I thought computer science was something really boring. Who would want to stay in a basement laboring away, typing on a keyboard, with no friends and never going outside? But then, I took a CS course and it was really fun. Before I knew it, I spent the rest of college in one basement at Stanford doing a lot of very miserable CS assignments. Now, I’m an engineer.
As part of your many talents, you also published four books, right?
Yeah, I published two books before college and two books while I was in college. They are not technical books and they are all in Vietnamese. I’m very lucky: I don’t think anyone in the audience here has read my books. I feel very grateful for that.
You also have a fantastic blog. There are plenty of really interesting things there for anyone, and since a lot of people in the audience love data, love machine learning, and love learning new things about that space, it’s definitely highly, highly recommended. You also have a Discord group, right?
Yeah, I’m teaching machine learning system design at Stanford. It’s mostly about machine learning in production. One thing I have noticed is that a lot of the material about machine learning online takes a tutorial approach, which is great. You see things like, “Here’s a tool. You just follow these steps, follow this notebook, and you can get up and running,” which I think is great for people getting into machine learning. I benefited a lot from it when I started out.
However, as I started building more and more systems at scale, and wanted them to be more reliable and more maintainable, I realized we need a more systematic approach. We can’t just take a bunch of tutorials, put them together, and hope that things work. My course is about having a more systematic approach to machine learning in production. You don’t just grab a bunch of tools without knowing anything about the machine learning system. What do you want to use this for? How do you want your future to be? What are some of the design challenges you have to face? What happens when you just follow a design decision?
What does that actually mean, machine learning design?
It’s an interesting question. I think design is, quite honestly, a fancy word. I think of it as a course on how to bring machine learning to production. Design covers the more architectural choices. First of all, you want to know whether a system is going to make predictions online or in batch, or whether you want a request-driven, request-response API or a streaming, event-driven API. A lot of stuff like that.
We also talk about data engineering and different data formats. I was really enjoying the discussion with Jack on data engineering at Reddit. We did have a couple of lectures on data engineering, just to help students understand different data formats and their trade-offs: for example, column-based versus row-based formats, text versus binary, et cetera.
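To make the row-based versus column-based trade-off concrete, here is a minimal Python sketch. It was not part of the conversation, and the records and field names are hypothetical. It only illustrates the general idea that a row layout keeps each record together, while a column layout keeps each field together.

```python
# Hypothetical records; the field names are illustrative only.
rows = [
    {"user_id": 1, "country": "VN", "watch_seconds": 120},
    {"user_id": 2, "country": "US", "watch_seconds": 45},
    {"user_id": 3, "country": "US", "watch_seconds": 300},
]


def to_columns(records):
    """Turn a list of row dicts into a dict of column lists."""
    columns = {}
    for record in records:
        for key, value in record.items():
            columns.setdefault(key, []).append(value)
    return columns


columns = to_columns(rows)

# Row layout: fetching one whole record is a single lookup.
second_record = rows[1]

# Column layout: aggregating one field touches only that field's values.
total_watch_seconds = sum(columns["watch_seconds"])
```

Roughly speaking, row layouts tend to suit workloads that read or write whole records, and column layouts tend to suit analytical scans over a few fields.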
Is the overarching idea of design to be very opinionated about a certain way of doing things?
So the main idea is you go backward from the problems. I think a lot of people approach machine learning by starting with the solution, and then they try to find problems where machine learning can be applied. It’s so tempting to be like, “Oh, hey, here’s this fancy model coming out,” and then they try to run it and see whether it helps. I think that’s very interesting in R&D, but I think it’s the wrong approach when you try to solve actual problems. So I was trying to encourage my students to look at the problems. What problem is this? What is the easiest, simplest solution? And it doesn’t have to be machine learning. It can be non-machine learning. Or if it’s machine learning, it can be very simple models, and then from there you can… So at least you have some baselines, and you can move on to more complex models for the solutions. It’s a lot about best practices.
Another part of your work has been to survey the landscape of machine learning tools, and you’ve written a lot about that. How many frameworks and tools have you found so far?
It’s actually not the core of my work. It was just out of natural interest. I like playing with new tools. So whenever something comes out, I just want to look at it on GitHub, try to clone it and see what I can do. I guess that’s my approach, just look at solutions and try to find a problem for them. I just want to see whether it would help solve my problem. And I was at Snorkel right before… I just left Snorkel very recently, which is very sad. I’m laughing. I’m actually not happy about it.
So at a startup, you just need to keep an eye out for what is out there. Snorkel is a great tool, and I think it’s one of the better ML startups out there. But when you’re doing a startup, you can’t just close your eyes. You always have to keep an eye out for what is out there. So I have just been keeping track of the different tools that I found, and I think the last version has about 284. It was supposed to be almost 300, and then a few of them died between December two years ago and December last year.
What have you learned about the evolution of the landscape over time?
It’s very interesting. So I was looking at it, and I was trying to look at the year when the tool made its first commit if it was open source. Or if it was a company, I was looking to see when it was incorporated. And I was trying to divide the tools into different categories based on the problem each was trying to tackle. I think there was a pre-deep-learning phase, before AlexNet in 2012. So there were a lot of traditional, older frameworks with a lot of cool things, like decision trees and stuff like that. And then after deep learning, there were these phases: an explosion of deep learning frameworks, and then from 2016, I think Google has… Okay, my phone is just like, “Wake up,” because when I say “Google” my phone is like, “Hey”.
So in 2016, I think Google had this article about how they use deep learning for Google Translate. And it was the first use case of deep learning in production. From then on, there was an explosion of companies trying to use deep learning in their products. And we have seen a lot more tools around bringing machine learning to production, and fewer frameworks. So I think by 2017, 2018, pretty much the competition is between PyTorch and TensorFlow. Before then there were just so, so many more. So from 2016, it’s a lot about serving, a lot about model evaluation, about the surroundings of machine learning and not just model building.
You were also saying somewhere in your writing that you felt that the landscape was still underdeveloped.
I think it’s very interesting. It is both overdeveloped and underdeveloped. It’s very crowded, but it’s still underdeveloped. My theory is that… and it can be totally wrong and I’d love to hear your thoughts on it. So my theory is that a lot of people try to pluck low-hanging fruit right now. So there are a lot of low-hanging fruit tools, and they are very similar. But there are a lot of big challenges.
What’s an example of a low hanging fruit?
Data labeling. How many data labeling tools are out there? There are a lot. It’s not a bad thing. It’s just a lot of low-hanging fruit, so there are so many different tools. The thing about machine learning in production is that data is still the biggest problem in machine learning right now. And data is… how should I let it in, how should I let it out – the in and out is a big problem and it can cause a lot of latency and people care about it.
Is that generating the data? Is that moving the data?
So I mean data management. If you have a large amount of data coming in, how can you process it very quickly, know where to send the data, and then know what to do with the data once it’s there? I think Jack was mentioning the scale at which Reddit gets its data, and it was a lot.
So in that part there’s a lot of low-hanging fruit. Where do you see the underdeveloped part of the landscape?
So there’s a lot of low-hanging fruit, and I think a lot of that low-hanging fruit comes from the actual frustrations people have seen when trying to build machine learning in their companies, which is great. But the problem is that so many people are so focused on what the problem is right now that they don’t see what machine learning can be. I’m not sure if that makes any sense.
So I think what could be better is more people trying to tackle the problem of making machine learning a lot more useful. I see a lot more potential in machine learning compared to how machine learning is being used right now.
I think one example I feel very excited about is… I wrote another post on how machine learning is going real time. So I am really excited about online machine learning.
There are two aspects. One is online prediction: a machine learning system that can make predictions in real time, so when users enter a query, it can just generate predictions. The other is online learning: systems that can learn in real time, like TikTok. If you open TikTok, in just a few minutes it will know exactly what you want, and it can give you suggestions about what you want to watch next. TikTok is incredibly good at it, and it’s so addictive. So that’s online learning.
So I think a lot of people are not using machine learning for that. A lot of people are still using offline machine learning. So you train a model and then you generate a lot of predictions, like Netflix. You open Netflix and you go to your account and you’ll see some recommendations for what you might watch. All these recommendations are actually generated offline. So say you go into Netflix and play around, and you’ve been watching a lot of horror movies. You might see a lot of recommendations for horror movies, but today you want to choose a comedy. So you might look for comedy movies, you might go into the comedy category. You feel like Netflix should be able to figure out that, “Hey, you want comedy today, so I’m going to give you more comedy recommendations.” But they can’t, because you need to wait until the next time they generate a batch of recommendations to get it. So I just see so many limitations in existing machine learning systems.
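The contrast between precomputed (batch) recommendations and predictions made at request time can be sketched in a few lines of Python. This is an illustrative toy that was not part of the conversation, and it does not describe any real company’s system: `slow_model`, the user IDs, and the genre strings are all hypothetical stand-ins.

```python
import time


def slow_model(user_id, recent_genre=None):
    """Stand-in for an expensive model; the sleep simulates inference latency."""
    time.sleep(0.01)
    # Hypothetical logic: with no fresh context, fall back to a default genre.
    genre = recent_genre or "horror"
    return [f"{genre}-movie-{i}" for i in range(3)]


# Batch prediction: run the model ahead of time and store the results.
precomputed = {user_id: slow_model(user_id) for user_id in (1, 2, 3)}


def serve_batch(user_id):
    """Serving is a dictionary lookup: fast, but blind to what the user just did."""
    return precomputed[user_id]


def serve_online(user_id, recent_genre):
    """Serving runs the model per request, so it can use fresh context."""
    return slow_model(user_id, recent_genre)


stale = serve_batch(1)             # still the default list computed earlier
fresh = serve_online(1, "comedy")  # reflects the genre browsed just now
```

The batch path stays fast however slow the model is, at the cost of freshness; the online path pays the model’s latency on every request.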
That’s fascinating. And is there a reason why TikTok is able to do that while Netflix is not able to? Obviously there’s the frequency, what you just mentioned, one of them being batch and the other one being almost real-time. But is that also a different algorithm, like a different stack? Is it a completely different set of tools to be able to do that, or does some of it come from the very nature of the media property?
So this is a good question. The main question is, “Why are companies not doing that?” I talk to a lot of people and ask them, “Why don’t you do it?” And they were like, “Why should we make predictions online? There’s just no point in doing it.” They’re doing batch predictions and it’s totally fine. So here’s one way of doing things and here’s another way of doing things. And the other way of doing things might change the way you set up the infrastructure. And you have never tried it before, so you don’t know what performance boost you can get, because you have nothing to compare with. So you might be very tempted to stay with the way you already have. So you’re asking, what makes it hard? There are several reasons that make online predictions hard, and I’m just talking about online predictions right now and not even online learning yet. Online prediction is still the easy part. Online learning is really hard. So there are several reasons.
Could you define again what online learning is for the audience?
So online learning is… So currently when you take a machine learning course, you do offline learning. You’re doing learning in batch. You collect a lot of data, then you train a model, and then you deploy the model. And the model is not updated until the next time you train the model offline. Online learning is when you have incoming data and you keep on updating the model, making it adapt to each incoming data sample.
Okay, great. Thank you.
Yeah, so I was talking about online predictions. So how is this hard? To do online predictions, you actually need two components. The first component is a model that can make predictions very fast: fast inference, low latency. You don’t want to open Netflix and have the webpage load for a minute to show you recommendations. For certain systems, inference can be very slow, and companies don’t want to make users wait. So they generate predictions offline, and whenever users have a query, they fetch the precomputed predictions, because the time it takes to fetch them is much, much shorter than the time it takes for the model to generate the predictions right on the spot. I’m not sure that makes sense.
It’s hard to explain it without a whiteboard or whatever. So currently machine learning models are getting bigger and bigger. In general… I’m not saying in all cases, but in general, when models get bigger, they get more parameters, and they take longer to make predictions, and people are getting very impatient. So a lot of technology is being developed to allow models to do inference a lot faster. You have model compression and distillation, you can have inference optimizations like TensorRT, and you can have much more powerful hardware. You can have bigger, better servers, like on AWS where you get bigger GPUs and stuff, or you can have more powerful edge devices or chips that are optimized for certain machine learning architectures. So that’s one thing. The other is that you want your infrastructure to be able to handle data coming in in real time. So you might have to change from processing data in batch to processing data in streams.
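The shift from batch processing to stream processing can be illustrated with a very small Python sketch. It was not part of the conversation; the `StreamingMean` class and the event values are hypothetical. The batch version needs the whole dataset before it can answer, while the streaming version keeps a small running state and updates it as each event arrives.

```python
def batch_mean(values):
    """Batch style: wait for the full dataset, then compute once."""
    return sum(values) / len(values)


class StreamingMean:
    """Stream style: keep a small state and update it per event."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0

    def update(self, value):
        # Incremental mean: shift the old mean toward the new value.
        self.count += 1
        self.mean += (value - self.mean) / self.count
        return self.mean


events = [4.0, 8.0, 6.0, 2.0]  # made-up event values

stream = StreamingMean()
running = [stream.update(value) for value in events]  # one answer per event
final = batch_mean(events)                            # one answer at the end
```

Both approaches end at the same value here, but only the streaming one has an up-to-date answer after every event.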
Great. And we have Materialize right after this for a minute. So we’ll talk about streaming and all those things.
I’m really excited for Arjun’s talk after this.
And Jack Hanlon in the comments, says, “Chip is so on point here. Online learning is much harder. Can confirm as we’re getting into more and more of it at Reddit.”
Online learning is interesting. Online learning is hard in terms of theoretical guarantees, because online learning is so different from what you think of as offline learning. I’m not sure whether you have taken any machine learning courses, the way they teach you how to do machine learning?
I have, but I’d be embarrassed to say…
So when you take a machine learning course, it’s like: you train a model for multiple epochs until convergence. That’s what you learn about machine learning, how you train a model. But the thing about online learning… Convergence means that you assume there’s some stationary distribution that the model converges to. But the point of online learning is to have the model adapt to the changing data, the changing distribution. So there is nothing stationary or static to converge to. So the concept of convergence is very weird in online learning. And second, we don’t have multiple epochs. You use each data sample at most once.
So it is weird. But I do believe that if you set up the infrastructure right, the difference between online learning and offline learning is primarily something you tune. In offline learning, you can say, “Here, train the model in steps of a thousand or ten thousand data samples.” In online learning, you can say, “Here, train the model on one sample or a hundred samples.” So it’s just a matter of the hyperparameter that you choose to set. But we don’t have tools for it yet, and I see very few tools focusing on it because most people are focusing on low-hanging fruit.
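The idea that the offline and online settings can share one training loop, with the number of samples per update as a hyperparameter, can be sketched as follows. This toy was not part of the conversation and is only one possible reading of it: `sgd_train`, the one-parameter linear model, the learning rate, and the synthetic data are all assumptions made for illustration.

```python
import random


def sgd_train(samples, batch_size, lr=0.1, w=0.0):
    """Fit y ~ w * x with gradient steps, one update per `batch_size` samples.

    Each sample is used at most once (a single pass over the data);
    a trailing incomplete batch is dropped.
    """
    batch = []
    for x, y in samples:
        batch.append((x, y))
        if len(batch) == batch_size:
            # Mean gradient of the squared error over the current batch.
            grad = sum(2 * (w * bx - by) * bx for bx, by in batch) / len(batch)
            w -= lr * grad
            batch = []
    return w


# Synthetic data from a hypothetical true weight, with a little noise.
rng = random.Random(0)
true_w = 3.0
data = []
for _ in range(100_000):
    x = rng.uniform(-1, 1)
    data.append((x, true_w * x + rng.gauss(0, 0.1)))

w_large_batch = sgd_train(data, batch_size=1000)  # offline-style: few, large updates
w_per_sample = sgd_train(data, batch_size=1)      # online-style: update on every sample
```

Only `batch_size` differs between the two calls. On this stationary toy data both land near the true weight; the sketch does not model a changing distribution, which is where online updates would matter most.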
Shifting gears a little bit, I’d love to talk about something else you’ve covered that I’ve found really interesting, which is how trends diverge between the East and the West, and specifically China, when it comes to machine learning and data infrastructure.
This is a really interesting question, and this is something that I have been trying to understand more recently. One thing is it’s hard because I don’t speak Mandarin or Cantonese, so that makes it a lot harder. But I have noticed something: when I was looking into online learning, I realized all the examples I found were from Chinese companies. I think I’ve heard of some American companies doing it, but they are doing it at a much smaller scale, with a lot less complex models than, say, companies like Alibaba or ByteDance. I’m not quite sure why. I was talking to a few engineers working for Chinese companies, and it seems they also have a different set of open source tools than what we’re used to in the US.
And do you know where that open source lives? Presumably not on GitHub?
I think it is on GitHub. First of all there’s actually a different version in China. My brain is not… It’s a bit clogged right now. I can share it with you when I look it up. I think one thing that happened is because of the language barrier. In our open source culture, people use English. When you use an open source project, they want you to contribute back. You have to submit the PR in English, and you might get feedback and review, and then you try to read again and you merge. And I think a lot of Chinese engineers find that a big barrier because it’s just like, “I’m not familiar with that.”
And another reason I see for the divergence is the maturity of adoption and also legacy systems. A lot of American internet companies are a lot older than the average new Chinese internet company. It means American internet companies have legacy systems from 20 years ago, and they just have to build on top of them. The Chinese companies are very young and can just build the system from scratch, so they’re more willing to jump straight to online learning instead of having to build on top of whatever batch learning system American internet companies have.
Fascinating. So as we get close to the end of allocated time here, a couple of questions from the group. Rachel asks, “Which is the better path towards online learning? Adapting infrastructure to train the DNN in an online way? Or different traditional algorithms like bandits?”
I think it’s a really interesting question. I think it really depends on the use case and on the data. I think a lot of companies that I see… When they do online learning, they try to start with a simple use case, but I’ve seen people who start with a simple use case still run into the same problems as people who start with a more complex use case. I think I’m way out of my depth here because I haven’t been able to build an online learning system myself.
Sounds good. One more. What are some applications that are leveraging online learning today, e.g., recommendation systems for e-commerce, and how is this implemented, on device or in the cloud?
I think it’s a great question, what applications leverage online learning. I think the easiest one to see is definitely recommendations. TikTok is one of the biggest examples of this. People’s attention online and what we’re interested in online change second to second. First you go online thinking you’ll watch a lecture on machine learning, and then you read some news about an octopus punching a fish and you’re like, “Okay, now I want to watch a video of an octopus punching a fish.” So our attention online changes very quickly. It happens to me all the time. So this is one of the use cases where you want to learn and adapt to user preferences very quickly and then make predictions about just what they want right now, so recommendation systems for content or e-commerce.
Another use case that I think is heavily underexplored, and that I don’t see a lot of companies doing yet except for a few big companies, is customer support. Right now people are trying to make customer support more effective. So when a user submits a customer support ticket, you want to classify what it is about, route it to the correct person, and solve the problem for them. But how often do you send a customer support message?
So speaking for myself, I use a lot of apps and websites and I get frustrated, but I hardly ever send a message, maybe one in 20 times. So I want to be able to follow my users, my customers, in the app, and just predict what they’re trying to do and help them before they get so frustrated that they either leave the app or send a message. I think that’s a use case that I don’t see many people working on yet, but it could be very interesting for online learning.
All right. So to end, one rapid fire question from me. What’s your favorite data book, newsletter or podcast that you would recommend to the audience?
So for books, I like this book by Martin, I think, Kleppmann: Designing Data-Intensive Applications. I think it’s really awesome. Another book I think is very interesting is very similar to the book that Jack recommended. It’s called Weapons of Math Destruction. It’s about how algorithms used at large scale can be biased against people at large scale. It’s very interesting. It was written by a mathematician. And I like the name. It’s a really great pun.
Cathy O’Neil.
Yeah. I don’t listen to any podcasts because I don’t drive and I feel like you have to have a car and drive to listen to podcasts. But I’m hoping to get into it one day.
Chip, thank you so much. This was fascinating and I learned a lot. So thank you very much for dropping by. We really appreciate it.
Yeah. Thank you for having me. I think I see some questions there and feel free to reach out to me. I’m on Twitter. I’m on email as well, my website, so if you have any other questions, feel free to reach out to me anytime. Thank you so much for having me.