Hosting Alok Gupta at our most recent Data Driven NYC was special for a couple of reasons.
First, because Alok is the very talented head of data science and machine learning at a company that has all sorts of really interesting use cases for AI and just had a phenomenal IPO, valuing it at $60B at the time of writing.
Second, because it was a homecoming of sorts for Alok, whose journey in the field of data science was inspired in part by Data Driven NYC – as he puts it:
This also feels like it nicely completes my journey starting 8 years ago when I was working on Wall Street in 2013 and started coming to your monthly evening talks at the Bloomberg building to learn more about ‘Data Science’. That was really a launching point for me to switch from trading to DS, and I’m grateful to be able to give back in a small way :).
One of those stories that brings joy to the heart of the organizers of this community!
Here is the video, as well as a full transcript for easy perusal:
VIDEO
TRANSCRIPT
[Matt Turck] By way of quick introduction, Alok is Head of Data Science and Machine Learning at DoorDash. Previously, he was Senior Director of Data Science at Lyft, and before that a Director of Data Science at Airbnb and an affiliated researcher at Stanford. (So, congratulations on your great choice of companies between Airbnb, Lyft, and DoorDash!) Before that, in a past life, he was a research fellow in mathematics at Oxford, and then a high-frequency trader on Wall Street.
First of all, congratulations on the remarkable success of the DoorDash IPO, now a $63 billion market cap company. And maybe that would be a great place to start. I think most people are familiar with DoorDash as a food delivery service (actually more than food, but food in particular). From a business standpoint, how should one think about it?
[Alok Gupta] I think DoorDash wants to essentially help empower local businesses; that is the core of the mission. The company began with restaurants because they were numerous, and it began with food delivery because that was one of the most pressing problems for them. But the mission really is to do whatever we can to empower local businesses and provide services. Last-mile logistics is something we’re becoming proficient in. So, that’s an area where we’re really making our mark.
Is it effectively a three-sided marketplace?
The food delivery is three-sided, you’re right. There’s the merchant, the consumer, and the Dasher. As we start to branch out into grocery delivery, convenience delivery, it becomes four-sided, where we introduce a picker as well. Yeah, that’s right.
What do machine learning and data scientists do in a three-sided or even four-sided marketplace? What are some of the problems that you guys tackle?
The really fun thing is that we’re able to work on both traditional consumer app problems (recommendation, ranking, search, pricing) and some of the newer logistics-type problems. When you think about companies like Lyft, who are matching a driver with a rider, we have those problems as well, matching the Dasher with the order for pickup. So we have all the pickup logistics, plus the consumer app machine learning problems. And then again, all your traditional e-commerce problems, whether it’s fraud detection, support optimization, et cetera.
That’s actually a really interesting way of thinking about the business. One thinks of DoorDash as a company that physically moves people and product to different places, but ultimately, would it be fair to say it’s a big software brain, a delivery system that, ultimately, is all software and allocation of resources against time and distance? Is that a good way to think about it?
I think we would like to think of ourselves as an infrastructure company, that’s right. We build products and services to help us in the things we’re trying to do, and we get them to a stage of maturity and robustness by experimenting on ourselves. We can then offer them as white-label services to other companies, in the same way Amazon built AWS, for example. That’s definitely a nice way of thinking about what we’re trying to build.
I’d love to dive into the machine learning infrastructure at DoorDash: the tools, the software, ML frameworks, ML model life cycle, features, etc.
Sure. I think the very start of a project is probably the hardest and most impactful part of the life cycle of the model (I prefer to call it data-driven software, because it could ultimately be something other than a model). That part is about actually understanding and identifying the right problem to work on, and framing the solution to actually solve that problem. And so a lot of work is done upfront to really understand: are we trying to identify these particular categories, or are we trying to optimize metric A or B or metric C with constraint metric D, and why?
And so, a lot of the time that work has to be done up front and then as well as that, what data do we have available? And maybe what data can we get? And what is the predictive power or optimization power of inserting this data into the workflow we have today to do that advanced measurement or optimization? So that’s right at the top of the stack, and oftentimes data scientists will work with their product managing partners or their operations partners to identify the right problem and solution. Then once we get on to the tech stack, yeah, it’ll be a lot of-
I read that your team considered a lot of ML models and then converged towards a smaller subset of models: XGBoost, LightGBM, and TensorFlow and PyTorch on the neural network side. Is that the core of it, or are you always experimenting, trying new things?
Yeah. The motivation at the start was: how do we get everybody at the company onto the same centralized machine learning platform stack? When I joined a year and a half ago, we already had models in search, fraud, logistics, et cetera, and they were all on their own stacks. And so the first thing was how do we bring everything onto one stack? So the first question was, “Okay, how do we build something quickly, which solves most people’s use cases?”
We landed on using a framework that enables tree-based models, and we picked LightGBM for that after trying a few different packages. For deep learning, we used PyTorch. We started with those two core libraries and we basically said to everyone, “If you want to use the platform, you have to build to this and over time we will add capabilities for other things.” It wasn’t so much a constraint or a restriction, it was more of a prioritization. How do we enable the most people to migrate to a platform as quickly as possible?
Great. I interrupted you where you were about to go into the infrastructure aspect of this.
Yeah. Once we’ve picked a problem and solution, it’s really then about grabbing our data. We use a Snowflake data lake. We use Python in Databricks and Spark to pull data in and build models. A lot of it will be off-the-shelf Python ML libraries, like scikit-learn or some of the Spark libraries. I think one nice thing we do is, instead of building a model and pickling it, or using whatever conversion package you want, we commit the actual training scripts to our model library. Each time the platform wants to run or create a new model, it runs the training script itself, which has configuration for bringing in data, creating features, and training the model. It then builds the model and can push it to production to start making predictions. That’s really good because it enables us to commit code rather than a model, and by committing code, we can at that point introduce the code review workflow in Git, or whatever repository we’re using.
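To make the "commit the training script, not the model" idea concrete, here is a minimal sketch in Python. It is only an illustration under assumptions, not DoorDash's platform code: `TrainingConfig`, `load_rows`, `train`, and `build_model` are hypothetical names, the data is hard-coded instead of coming from a warehouse, and a hand-written least-squares line stands in for a real LightGBM or PyTorch model.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass(frozen=True)
class TrainingConfig:
    """Everything the platform needs to rebuild the model from scratch."""
    feature_name: str
    target_name: str


def load_rows():
    # Hypothetical stand-in for a warehouse query; values are made up.
    return [
        {"distance_km": 1.0, "delivery_minutes": 12.0},
        {"distance_km": 2.0, "delivery_minutes": 17.0},
        {"distance_km": 3.0, "delivery_minutes": 22.0},
    ]


def train(config, rows):
    """Fit a one-feature least-squares line and return a predict function."""
    xs = [row[config.feature_name] for row in rows]
    ys = [row[config.target_name] for row in rows]
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (
        sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
        / sum((x - x_bar) ** 2 for x in xs)
    )
    intercept = y_bar - slope * x_bar

    def predict(row):
        return intercept + slope * row[config.feature_name]

    return predict


# The reviewed, version-controlled artifact is this script and its config.
CONFIG = TrainingConfig(feature_name="distance_km", target_name="delivery_minutes")


def build_model():
    """Entry point a platform could call whenever it wants a fresh model."""
    return train(CONFIG, load_rows())
```

Because the artifact is code, a change to the features or the configuration shows up as a reviewable diff, and retraining is just a matter of calling `build_model()` again on newer data.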
I also saw that in your overall architecture, there is a feature store. Can you talk about what that is? And maybe starting with what a feature is in machine learning for folks that may not know what it is.
I suppose in traditional machine learning a feature is a polished or aggregated combination of some raw input data. For example, if I’m making a prediction on someone’s health, I may take in their weight and height, but I might also transform the two into a body mass index (BMI), which is weight divided by height squared, and this transformed input data is called a feature. We then take a lot of these features and put them into a model, where they’re combined in some way to produce predictions or output. For us at DoorDash, we need a place to store these features not only to train models on so we can make future predictions, but also so that, in the moment of needing to make a prediction, we can call them from this feature store. It serves two purposes: model training and model prediction.
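As a small illustration of a derived feature, the sketch below turns raw weight and height into BMI using the standard definition (weight in kilograms divided by the square of height in meters). The function and key names (`bmi`, `build_features`, `weight_kg`, `height_m`) are assumptions made for this example.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms divided by height in meters squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


def build_features(raw):
    """Keep the raw inputs and add the derived BMI feature alongside them."""
    return {
        "weight_kg": raw["weight_kg"],
        "height_m": raw["height_m"],
        "bmi": bmi(raw["weight_kg"], raw["height_m"]),
    }
```

The same `build_features` function can be applied to historical rows when training and to a single incoming row when serving, which is the dual role described for the feature store.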
Having a feature store is important so that we are able to call these features very quickly, especially for a real-time application like search, where we need those features available very quickly. It’s also important so that people can share useful features across different models and different use cases. That’s why we have a feature store. It can get difficult when we’re creating features that are aggregated on historical data (for example, my number of purchases in the past week) alongside real-time features (for example, how many times I have searched in the last 10 seconds), because those create different technology requirements.
Maybe you could compare and contrast this overall infrastructure against what you saw at Airbnb and Lyft. Are you seeing a convergence, with the architectures being more or less the same, or do you see significant differences in terms of how people build those machine learning pipelines?
The themes are definitely similar. I would say at all three places, there was a desire for a centralized stack so that iterative improvements helped all models. For the feature store, for example, there’s a real tension between the data we use to train models offline and the actual data hitting the models when a prediction needs to be served, and there can be a divergence between those two things. You could imagine I am pulling data from a warehouse to create a feature that I train a model on, and then in real time, I’m trying to compute that same feature. They can actually diverge based on timestamps, rounding errors, precision, formats. A feature store that consolidates the training data with the prediction-serving-time data is a common sort of holy grail that all the companies I’ve been at have pursued.
Another thing is being able to house models that can make batch predictions. For example, if I want to predict life expectancy, I can send a hundred sets of features for a hundred people and make a hundred predictions in one go, versus sending one person at a time and making the predictions serially. Again, the holy grail is how you create an ML platform that can do both simultaneously without having to duplicate code or duplicate the model itself. There are some things like that which repeat.
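The difference between the two kinds of feature can be sketched as follows. This is an illustrative Python example, not a description of DoorDash's feature store: `count_in_window` and `SlidingWindowCounter` are hypothetical names, timestamps are plain seconds, and the streaming counter assumes events arrive in time order.

```python
from collections import deque

WEEK_SECONDS = 7 * 24 * 3600


def count_in_window(timestamps, now, window_seconds):
    """Batch-style aggregate: scan stored history, count events in (now - window, now]."""
    cutoff = now - window_seconds
    return sum(1 for ts in timestamps if cutoff < ts <= now)


class SlidingWindowCounter:
    """Streaming-style aggregate: keep only recent events in memory.

    Assumes record() is called with non-decreasing timestamps and that
    count() is called with `now` at or after the latest recorded event.
    """

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self._timestamps = deque()

    def record(self, ts):
        self._timestamps.append(ts)

    def count(self, now):
        cutoff = now - self.window_seconds
        # Drop events that have fallen out of the window.
        while self._timestamps and self._timestamps[0] <= cutoff:
            self._timestamps.popleft()
        return len(self._timestamps)
```

A weekly purchase count can be recomputed from stored history on a schedule with something like `count_in_window(purchases, now, WEEK_SECONDS)`, while a "searches in the last 10 seconds" count has to be updated as each event arrives, which is why the two tend to need different technology.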
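One simple guard against this kind of divergence is to compare the feature values logged at serving time with the values recomputed offline for the same request. The sketch below is an assumption-laden illustration (the function name `find_skewed_features`, the feature names, and the tolerance are all hypothetical); it uses only `math.isclose` from the Python standard library.

```python
import math


def find_skewed_features(offline, online, rel_tol=1e-6):
    """Return the sorted names of features whose offline and online values disagree.

    A feature present on only one side is also reported as skewed.
    """
    skewed = []
    for name in sorted(offline.keys() | online.keys()):
        if name not in offline or name not in online:
            skewed.append(name)
        elif not math.isclose(offline[name], online[name], rel_tol=rel_tol):
            skewed.append(name)
    return skewed


# Example: the warehouse rounded to two decimals, the online path did not.
offline_row = {"avg_order_value": 23.46, "orders_7d": 4.0}
online_row = {"avg_order_value": 23.4567, "orders_7d": 4.0}
skewed_names = find_skewed_features(offline_row, online_row)
```

Here `skewed_names` contains only `avg_order_value`, because the rounding difference exceeds the tolerance while `orders_7d` matches exactly.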
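One common way to avoid duplicating code between the two modes is to implement only the batch path and treat a single prediction as a batch of one. The sketch below illustrates that design choice with a toy linear model; the names `predict_batch` and `predict_one`, and the weights, are made up for the example and do not describe any particular platform.

```python
def predict_batch(rows, weights, bias):
    """Score many feature rows in one call with a toy linear model."""
    return [
        bias + sum(weight * row[name] for name, weight in weights.items())
        for row in rows
    ]


def predict_one(row, weights, bias):
    """Single prediction implemented as a batch of size one (no duplicated logic)."""
    return predict_batch([row], weights, bias)[0]


# Made-up model parameters and inputs, for illustration only.
WEIGHTS = {"age": -0.5}
BIAS = 100.0
people = [{"age": 40.0}, {"age": 60.0}]
batch_scores = predict_batch(people, WEIGHTS, BIAS)
serial_scores = [predict_one(person, WEIGHTS, BIAS) for person in people]
```

Both paths return the same numbers, since the serial path is just the batch path called once per person.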
Interesting. Let’s switch to the team, the machine learning and data science team at DoorDash. How big is it? How many people and what kind of people do you have on the team in terms of function?
A year and a half ago when I joined, we had five or six people on the team. We’re now almost 30 people a year and a half later. This year, we want to double that to get to 50 plus people. As for these data scientists (I wrote a blog post recently that went live, I think, last Monday), we’re looking for, of course, technical brilliance. We have people who typically have a master’s or PhD in quantitative subjects, plus some years of industry experience. That’s a given, and we have a very rigorous challenge before you come onsite, and then one-on-one interviews with us to test that. But what we really look for in the interviews as well is a real motivation to solve business problems rather than a desire to build cool models.
It’s okay if you want to build interesting models and write interesting algorithms, but that’s probably best saved for academia or some research labs. For us, we’re really interested in moving business metrics. That’s what we look for. We look for people who will do anything to move those business metrics, and over the course of the year, most of your time will be spent building interesting models to serve that purpose. We have found a lot of people that are very technically gifted, but don’t immediately gravitate to what is useful for the business. That’s something we try to tease out. It’s probably unique to our interview process compared to other places I’ve worked, where we really test people’s ability to navigate business problems.
How do you do that? How do you test the business part?
Really, we just borrowed interviews from other parts of the business. We’ve borrowed interviews from the analytics org and the operations org. Typical consulting, analytics type interviews, where we give a nebulous business problem and say, this is something we’re seeing. What do you think? What would you want to look at? What metrics would you care about? How would you frame a problem? What are some hypotheses? We go further back in the data science life cycle of a problem to the real beginning, almost like PM like questions, analytics type consulting questions. They’re very high signal. They’ve worked really well for us so far.
How is the team organized? Are they organized by project, by role, by function? Are they centralized, or are they all over the company? What does it look like?
The data science team that I lead reports centrally up to me, through some managers. But we distribute people to be embedded in cross-functional pods. A cross-functional pod at a tech company typically has a product manager who sort of sets the vision and strategy, and then there will be half a dozen engineers, an analytics person, a data scientist, maybe a marketer, maybe a designer, and together they co-design the roadmap for a given quarter, given a goal or metric set from above.
You had this interesting theme around the democratization of machine learning. I understand you created a machine learning council, I think you guys called it. Can you talk to that?
Yes. One of the tensions I’d seen at previous places, and had also heard about from others, is who should work on ML. Should it be data scientists or ML engineers or the AI lab? I thought I could make a lot of these discussions and conflicts unnecessary by essentially creating a set of principles for machine learning at DoorDash, whereby we say machine learning is a tool and we encourage everyone to be able to use it. We don’t try and put restrictions on what people can do. What we do is put hard lines, hard boundaries on what people are accountable for. It is perfectly acceptable for anyone at DoorDash to build a machine learning model. But if they want to put it into production, they need approval or review from someone like a data scientist or machine learning engineer.
We try to divorce what someone can work on from what they’re accountable for. The machine learning council was a way to bring different flavors of machine learning together to co-build and co-strategize. And again, to defuse some of the territorialness that can develop at larger companies. We have people from engineering, from the ML platform, from data science, and we’ll add other people in future. It’s a place for people to discuss ideas, discuss strategy, discuss hiring, discuss team composition, discuss roles and responsibilities. It gives everyone a seat at the table and we meet every two weeks. We review technologies and we review priorities.
One more question from me, and then I want to start opening up to some of the questions in the chat. When COVID hit, there was a lot of chatter around how it messed up machine learning models, which are ultimately all about prediction, but prediction based on a certain state of the world and in a world that was dramatically different because of COVID. What was the impact on your models? And how did you go about solving that problem?
We saw some of those challenges as well. The first thing we did was make sure we were well set up for any of the outlier problems for consumers, merchants, and Dashers, and that we put temporary fixes, caps, floors, or multipliers on the predictions that needed them for the sudden step change in the environment. Then what we did as a second stopgap was shorten the look-back window for the training data. Whereas in the past, for an assignment algorithm, we might have looked back four weeks, we changed it to one week, and we then slowly extended that back to two, three, four weeks. There was a lot of manual work to make sure predictions still made enough sense that the customer experience wouldn’t deteriorate terribly. We saw lots of feature drift, a lot of prediction drift, and it took a while to settle to a new equilibrium.
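The two stopgaps described here, putting temporary limits on predictions and shortening the training look-back, can be sketched in a few lines of Python. This is a hedged illustration only: `clamp_prediction`, `training_window`, the `day` key, and all the numbers are hypothetical and not taken from DoorDash's systems.

```python
from datetime import date, timedelta


def clamp_prediction(value, floor, cap):
    """Keep a prediction inside a temporary, manually chosen [floor, cap] range."""
    if floor > cap:
        raise ValueError("floor must not exceed cap")
    return max(floor, min(cap, value))


def training_window(rows, today, lookback_weeks):
    """Keep only rows from the last `lookback_weeks` weeks (today itself excluded)."""
    start = today - timedelta(weeks=lookback_weeks)
    return [row for row in rows if start <= row["day"] < today]


# Made-up history: one row from before the step change, one from after.
history = [
    {"day": date(2020, 3, 10), "delivery_minutes": 25.0},
    {"day": date(2020, 3, 30), "delivery_minutes": 41.0},
]
today = date(2020, 4, 1)
recent_only = training_window(history, today, lookback_weeks=1)
full_window = training_window(history, today, lookback_weeks=4)
```

With a one-week look-back only the most recent row is used for training; widening `lookback_weeks` step by step back to four brings the older data in again once the environment has settled.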
Great. All right. Let me as promised, start to look at some of those questions. Maybe switching back to the infrastructure side of things and in no particular order. Question from Glenn, is your feature store fully homegrown or using a vendor product?
It’s mostly homegrown. We’re using some technologies like Flink for some of our real-time feature aggregation and data aggregation, but mostly it’s homegrown. We try to avoid building things in house and leverage third-party software when we can, but given some of the needs and some of the scaling requirements, oftentimes we do end up building in house.
A question from Gedallia. Am I understanding correctly that instead of running Databricks on something like S3, you are running on Snowflake? Interested to hear more if that is so.
Apologies, sorry. We run Databricks on S3, but we pull data into Databricks from Snowflake. We have a connection between the two.
Interesting. And a question from me, do you have two parallel stacks? I mean, there’s presumably a whole stack around BI, which is more centered around Snowflake, presumably.
Yeah. Snowflake is our data warehouse and then many teams pull from that. BI will pull from that to create data pipelines and dashboards at the end of that. Then we can also pull from those same tables in that data warehouse to build features and train models.
So you have the data warehouse on one side and then in parallel you have S3 and Databricks and those feed into machine learning. Is that correct or no?
We don’t pull much data from S3. We pull the data from Snowflake and Databricks runs on AWS. Most of the data comes from Snowflake.
Okay. Interesting. Question from Sana. How do you ensure that your model can make the best predictions for new customers and use cases? Are real-time features given higher priority for training the model, in this case, since you have limited historical data on new users?
New users can be difficult for us. If there’s a strong enough use case for a different cohort or segment like new users, typically it’ll make more sense just to train a new model rather than trying to shoehorn it into a general model. That’s typically what we’ve found. And again, the first thing will be to see whether there is anything worth building a model around. If it’s a new user and we don’t have much information, it probably isn’t worth it for most use cases and we should just apply some sensible rules.
All right. Question from Xiun Yang: very interesting to commit code to manage models. How would this scale for cases where the training time is lengthy and compute intensive, e.g., deep learning?
Then we’d probably change the frequency of retraining. So rather than every day, maybe we’d retrain once a week or once every two weeks. It can still take however long, multiple hours, to train. Then once it is trained and ready, it automatically gets pushed to production where it can serve and make predictions.
Let’s see. A lot of good questions. Question from Wilson. Do you have entity resolution problems and how are you solving them, internally or via third party solution?
Entity resolution problems. By this I assume we’re talking about a sort of identity.
Presumably yes.
Identity. This comes up mostly so far on the fraud side, and we plug into a lot of typical vendors for getting signals on identity, credit cards, addresses, and other attributes of identity. We do use many of those.
Okay. Maybe a couple more questions this time more on the team and skills side. Could you speak about the skills you needed to adopt during your transfer from academia and sciences to presumably a practitioner?
I can speak to DoorDash and some of the previous roles. Based on my experience in academia, my motivation was to find something interesting so that I could publish; I remember that being my motivation. I think when joining a company, we’re signing a contract with the company to align our goals with the company’s goal, which is typically its mission, and to do whatever is necessary in the service of that mission.
And so, the first skill I would say is to create that mindset shift in what we’re pointing our skills towards, which is advancing the mission and the metrics of the company. As for other skills, I’d say the soft skills have a very steep learning curve. I found it very steep in switching from finance to tech: how I interact with my peers in product and engineering and operations and marketing. How to do simple things like assume positive intent, be transparent, communicate frequently, clearly, articulately. Be open-minded, be curious, be encouraging. A lot of these social skills are probably the ones people coming from academia need.
Great. One last question. What are the hot areas in the tech stack or in domain knowledge that a new grad should try to get exposure to, to land a job at DoorDash?
I would advise people to have proficiency in and awareness of a lot of the model types: linear models, tree-based models, deep learning. Then pick one area to go deep in, whether it’s tree-based models or NLP or vision or whatever it is. Have a broad awareness and then a real specialty in something, and that’ll serve you well in most companies you apply to.
Great, wonderful. Well, on this note, thank you so much, Alok, this was terrific. Really appreciate your dropping by Data Driven NYC and sharing a bit about your work and your infrastructure with us. Super interesting. So thanks again. Really appreciate it.
Thanks Matt. Thanks for having me.