If you follow the various talks at Data Driven NYC, and the data ecosystem in general, it’s plenty apparent that the overall tooling for data, data science and machine learning is still in its infancy, particularly compared to the software stack.
While this may feel ironic (yes, I really do think so) given the billions in venture capital money that have been poured into the space, it’s worth remembering that the data stack (at least in its “big data” phase) is relatively recent (10-15 years), while the software stack has had several decades of evolution.
In many organizations, the data science and machine learning stack looks like a collection of various tools, some open source, some proprietary, glued together with one-off scripts. Teams started experimenting with one tool, then another, then created ad hoc pathways to make it all work together over time, and before you knew it, you ended up with complex environments that are painful to manage.
In response to this situation, various machine learning frameworks have emerged to abstract away the complexity. Several of those frameworks were developed internally at large tech companies to solve their own problems, and then open sourced.
Kedro is one such example. It was developed and maintained by QuantumBlack, an analytics consultancy acquired by McKinsey in 2015. It’s McKinsey’s first open-source product.
Kedro is somewhat hard to categorize. If it had its own category, it might be considered a Machine Learning Engineering Framework. What React did for front-end engineering code is what Kedro does for machine learning code. It allows you to build “design systems” of reusable machine learning code.
At our most recent Data Driven NYC, we had the great pleasure of hosting Yetunde Dada, a Principal Product Manager at QuantumBlack, who has been the key driving force behind Kedro.
Below is the video and below that, the transcript.
(As always, Data Driven NYC is a team effort – many thanks to my FirstMark colleagues Jack Cohen and Katie Chiou for co-organizing, Diego Guttierez for the video work and to Karissa Domondon for the transcript!)
[Matt Turck] Very excited to welcome you today Yetunde – you are a principal product manager at QuantumBlack, which is a company that was acquired by McKinsey. When was it five years ago, six years ago?
[Yetunde Dada] Six years ago.
You are going to talk about Kedro, which is a very interesting workflow orchestrator tool that was open sourced by McKinsey. You are doing this from London – I promise this is not meant to be a form of torture for our speakers to make them speak so late at night, but really, really appreciate it.
I’m really glad I’m able to reach everyone across the world. This time is fine for me. So briefly, Matt did introduce QuantumBlack. Essentially, we were acquired by McKinsey. You can think of us, I like to joke, as the black ops team that goes forth and delivers advanced analytics projects with McKinsey. We know the issues related to how you scale analytics across an organization – you’ll see that we’ve grown so much over the last four or five years.
We’ve had our issues. We’ve got our scar tissue, and essentially we’ve been embedding those learnings in product. Kedro is one of many products within QuantumBlack that essentially helps us deliver analytics, and you need this background to understand why Kedro exists. So what I’m going to do is jump straight into a story. I’m going to set the scene as to why Kedro exists, go into what Kedro is, and then we’ll go through a demo of how it’s used in a very basic example.
The first place we’re going to start is the question of what you’re trying to build. I’m a product manager, so I really care deeply about the intent behind any data science, data engineering, or machine learning engineering code that you might create. And sometimes the intent is really to just test that a concept works in practice. The hope is that this code will never be used again after you’ve produced maybe an insights report that is a static document that goes to inform business decisions.
But sometimes people will come back to you and say, “You know what? That POC that you created is really cool. Can we actually put that thing in production? What is it going to take?” So what you try to do is you start to tack on or bolt on little parts and components of code to actually help structure it in a better way.
So the clever bedroom storage option shown here just indicates that there was poor planning. The clock is behind the pole, but you’ve made a plan to try and make it work anyway. And you eventually end up with even things like this improvised electrical hookup.
But really what you should have been thinking about was what we call a machine learning product, which is data science and data engineering code that needs to be rerun and maintained in the future because you’ve actually got a system that’s built on it. There are different building styles required.
When you’re focusing on POC code, quick and dirty is what you’d be aiming for. And when you’re thinking about a machine learning product, you’re really thinking about how do we scale this code? And when you want your organization to recognize and represent those different types of code assets, how do you do that?
So this is actually where we run into the many challenges that people talk about. Data scientists have to learn so many tools to create high quality code. Everyone works in different ways. People have different exposure levels to software engineering best practice, which would help you create that code. And you know your code’s not going to run on another person’s machine. We also hear things like ‘analytics just doesn’t scale in my organization’, or about rewriting entire code bases because you couldn’t reuse them.
And this is where we believe that machine learning is not really the hard part, but building and maintaining the data pipeline is, and this actually introduces what Kedro is. So Kedro, essentially, if we go back to that building picture, is the scaffolding that lets the skyscraper be built. It’s a tool that helps you work in a way that you can experiment quickly but still end up with production code at the end, if you so choose to actually scale and deploy your code. We open sourced it in June of 2019. One of the reasons was client need: clients wanted some way to access upgrades and support beyond our engagements with them, because Kedro was solving real problems for them.
So, really think of it as setting the foundation for repeatable machine learning code. And it does this by borrowing software engineering concepts and applying them to machine learning code. So it’s definitely a cornerstone in the MLOps space, where those principles are applied. And some of the ones that we lean on heavily are modularity, separation of concerns, and versioning. We continue to use it because it helps solve a lot of issues around glue code, and maybe the shortcomings of using Jupyter Notebooks in production. Sorry, I know Netflix is up next. Don’t kill me.
We also get efficiency as well because different team members can work on the same code base at the same time. And then, everyone just gets leveled up on software engineering best practice for creating really great code. Kedro, since it’s been launched, is used at startups, is used at major enterprises, not only because of our work as a consultancy, but also because other enterprises are just picking it up and writing about how they use it and deploy it. We’re also getting mentioned in all sorts of awesome things, so that’s also helped too.
In the demo, we’re going to be covering certain concepts that are in Kedro. We’re first going to talk about the project template, which is a series of files and folders that are generated whenever you create a Kedro project. We’ll talk about configuration, which is your ability to remove hard-coded variables from machine learning code so you can move your code between local, cloud, and production environments without changes – removing the need for you to specify file paths for loading and saving data, and pulling out your data science parameters. We’ll talk about the data catalog, which is an extensible collection of data, model, or image connectors that allows you to connect to any data source and bring it into your workflow. And we’ll talk about how we create pipelines.
Kedro also supports flexible deployment modes because, remember, the focus for Kedro is really how do we offer deployable code in the first place. But obviously, having the flexibility to deploy any way you so choose means that we do support quite a few tools for you to choose from, because this is just how our teams work.
If you look at where we sit in the ecosystem: we’re not quite an orchestrator. We think of ourselves as the scaffolding while you do all the different data science activities that surround ingesting raw data, cleaning and joining the data, engineering the features, and training and validating the model. As for questions like “what time will this pipeline run?” and “how will I know if it failed?” – we leave those to Dagster, Airflow, Prefect, Luigi, and many others to handle for you, because they do that really well. Our focus really is on standardized, modular, and maintainable code.
So that’s what we do. Kedro is actively maintained by QuantumBlack. This is my job. This is what I do. So you’ll find us lurking around on different platforms. Head over to our GitHub repository to check out what we do, and our Docs are quite extensive so you can have a look at that.
So let’s actually get into a demo, and I’ll start that now. What I’m going to be taking you through, in terms of the demo, is a scenario. You’re a data scientist in the year 2160, for whatever reason, and what you’re trying to do is predict the price of a flight to the moon and back. There are hundreds of companies flying people. We’ve got review data, we’ve got shuttles data, like what shuttles each company owns. And we’re going to be doing some sort of data processing to clean up some of the data sources. And then we’re going to be doing some price prediction to also see what’s going on.
You’re going to be exposed to a few Kedro concepts, the project template, configuration, the catalog, and then how we construct pipelines using nodes, our concept called nodes.
The first one is the project template. What I’m actually going to take you through here – let’s just close this terminal session and start a new one – is how I create a project template that looks like this. I’m just going to run kedro new. In terms of context, here is what is already set up: we’ve got a Python virtual environment, we’ve run pip install kedro, and now we’re ready to set up our spaceflights project template. You can think of a project template the same way as a WordPress blog template: it’s basically your starting point for whatever you need to do. And we do allow our users to create their own, based on whatever conventions exist in their organizations. So: kedro new --starter=spaceflights. I’m just going to call this one space flights for the sake of this example, and press enter to accept the default naming.
And we should find it created now. So if I just open it up, you’ll see it here, another magic project template. We’re using space flights specifically because it’s set up with the pipelines that we’re using for this example. We’re actually just going to use the Data Driven NYC one that is already created. So I’m going to just change to Data Driven NYC and I’m in the folder. Cool. So we’re in this guy.
In terms of the project template, you can think of it as a series of files and folders that are derived from Cookiecutter Data Science. They essentially guarantee workflow consistency, because every project looks the same. It means that it’s easy for people to find what they need in different projects. And as we mentioned, it’s modifiable: you can create your own project template based on the conventions that exist in your organization or the way that you need to work.
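For context, the generated template looks roughly like this (an approximate sketch – the exact folders vary by Kedro version, and the annotations here follow the talk’s description rather than any one release):

```
spaceflights/
├── conf/         # configuration: data catalog, parameters, logging
├── data/         # local data, staged by convention (01_raw, 02_intermediate, ...)
├── docs/         # generated documentation
├── logs/         # pipeline run logs
├── notebooks/    # exploratory Jupyter notebooks
└── src/          # pipeline code and tests
```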
You’ll see that we have a place for configuration, which I’ll be talking through now – so hang on for the next segment. We’ve got a place for data. This folder is normally not used in enterprise data applications, because your data is stored in the cloud or wherever you’re working from. But you can see that it’s structured by what we call our data engineering convention, which states how to go about processing data at different stages – raw data being immutable – and you can choose to go all the way to reporting if you so feel.
We’ve got a place for docs. We do have support for Sphinx in Kedro, so it means that if you’re writing well-documented code, you can automatically generate the documentation that matches the code. We use standard Python logging for logs. We’ve got places for notebooks. Our relationship with notebooks is that we believe in using notebooks for what they’re good for: exploratory data analysis, initial pipeline development, and reporting. But in terms of actually building your full pipeline in a notebook, that’s where things get a little bit complicated, because notebooks are challenging for reproducibility and what we call production-ready code.
And then we’ve got the source folder over here, the most important folder, because this is actually your pipeline code. There are places for your tests, and here is where we actually see our pipelines and how we construct them. We’ve got 2 pipelines, a data engineering one and a data science one, which I’ll be taking you through now.
But let’s actually start with configuration. We think of configuration as the settings for your machine learning code. It’s a way for you to define requirements for data, logging, and parameters in different environments. Here’s the logging one, here’s your parameters one.
I’ll specifically call attention to what we call our data catalog over here. It manages the loading and saving of your data. It’s available as both a code and a YAML API. There’s built-in versioning for the datasets as well. And really, just think of it as a series of data, model, or image connectors to wherever your data is stored. In this case, and only in this case, the data is on my local computer because of the example that I’m running. In most cases, your data catalog would look something like this, where your data is in S3 or Azure Blob Storage or Google Cloud Storage or a Hadoop file system somewhere, and you want to actually be able to call it from there.
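As a rough illustration, a catalog.yml in the spirit of what’s described might look like this. The dataset names follow the spaceflights example from the talk; the S3 path is hypothetical, and the entry shape follows Kedro’s YAML catalog convention as of roughly this era:

```yaml
# conf/base/catalog.yml -- illustrative entries only
companies:
  type: pandas.CSVDataSet
  filepath: data/01_raw/companies.csv
  layer: raw

reviews:
  type: pandas.CSVDataSet
  filepath: s3://my-bucket/01_raw/reviews.csv   # hypothetical cloud location
  layer: raw
```

Swapping a local filepath for a cloud one is the whole point: the pipeline code never changes, only this file does.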
So just for your purposes, you see that we’re using a pandas CSV dataset. The data catalog supports multiple integrations – pandas, Spark, Dask, SQLAlchemy, NetworkX, Matplotlib, even Pillow, and many more. It’s very extensible.
So you’ll see we essentially just specify the type of file that we want to interact with. We’ve also indicated a layer here, which is raw. I’ll show you what that looks like on Kedro-Viz when we get there.
We also support parameters. Your experiment parameters for your data science workflow should also live outside of your code base, because we really believe it helps you make code that is generalizable and reusable. By removing the code for loading and saving your data, and removing things like your parameters so they’re all in one place, you can easily tweak your workflow.
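A parameters file in the same spirit might look like this – the values here are illustrative, not taken from the demo:

```yaml
# conf/base/parameters.yml -- hypothetical experiment parameters
test_size: 0.2
random_state: 3
features:
  - engines
  - passenger_capacity
  - company_rating
```

Changing the train/test split or the feature list then means editing one YAML file, not hunting through pipeline code.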
The next part that I’d like to talk about is the source folder, and specifically looking at the pipelines. So this pipeline that we’re going to be talking about first is a data engineering one, which takes those 3 input datasets, does some cleaning to the companies table and the shuttles table, then combines them. So let’s actually have a look here, actually, let me show you Kedro-Viz first, because you might have a better sense of what it will look like in code once you’ve seen the visualization.
So it’s opened up; let me just put this here and I’ll put this here. Let’s look specifically at the data engineering pipeline. There are 3 input datasets: companies, reviews, and shuttles. Companies undergoes some pre-processing, and so does the shuttles table. They come out as pre-processed companies and pre-processed shuttles, so we’ve produced those outputs.
We create a master table: this function accepts the pre-processed companies table, the pre-processed shuttles table, and the reviews table, which didn’t need any processing. And then we create our master table here at the bottom. If that makes sense, let’s go back here. I’ll do a kedro run in the meanwhile – this is essentially a pipeline run, so you get to witness it. Let’s actually open up the data engineering pipeline folder now.
So a node in Kedro world is just a Python function. These ones are utility functions, so you can ignore them for the sake of this example. But here are the 3 functions I was talking about: pre-process companies, pre-process shuttles, and create master table. You see that we’ve included docstrings for the sake of creating well-documented code. Each one just returns the pre-processed companies table or the pre-processed shuttles table, and then we create the master table here. How we construct the pipeline in Kedro world is this.
This is our pipeline abstraction. We put in the function that we need. The input table is called from configuration – remember this guy over here, the catalog. This lives in pipeline.py. We output pre-processed companies, and you’ll also see that pre-processed companies is actually specified here too. Here it is; here’s the output. By putting it in the data catalog, we’re actually telling Kedro where to save it for us. But Kedro could actually run this pipeline without us telling it where it should be saved, because Kedro also allows for in-memory processing.
We call it pre-processed companies, which is what you would have seen on Kedro-Viz. We do the same with pre-processed shuttles: the input table is shuttles, the output is pre-processed shuttles. And create master table, the final function, accepts pre-processed shuttles, this guy over here, pre-processed companies, this guy over here, and reviews. The reason you write your pipeline like this is that Kedro will resolve the running order of your tasks for you, so you don’t need to remember the task running order.
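To make the “Kedro resolves running order” point concrete, here is a minimal pure-Python sketch of the same idea. This is not Kedro’s API or implementation – just an illustration of how declaring named inputs and outputs lets a framework work out execution order; the function and dataset names mirror the spaceflights example:

```python
def preprocess_companies(companies):
    """Illustrative cleaning step: keep only rows marked active."""
    return [c for c in companies if c.get("active")]

def preprocess_shuttles(shuttles):
    """Illustrative cleaning step: normalise the price field to a float."""
    return [{**s, "price": float(s["price"])} for s in shuttles]

def create_master_table(shuttles, companies, reviews):
    """Combine the three inputs into one 'master' structure."""
    return {"shuttles": shuttles, "companies": companies, "reviews": reviews}

# Each node declares its function, named inputs, and a named output,
# much like a Kedro pipeline definition. Note they are listed out of order.
nodes = [
    {"func": create_master_table,
     "inputs": ["preprocessed_shuttles", "preprocessed_companies", "reviews"],
     "output": "master_table"},
    {"func": preprocess_companies, "inputs": ["companies"],
     "output": "preprocessed_companies"},
    {"func": preprocess_shuttles, "inputs": ["shuttles"],
     "output": "preprocessed_shuttles"},
]

def run(nodes, catalog):
    """Run nodes in dependency order, resolved from the declared names."""
    pending = list(nodes)
    while pending:
        for n in pending:
            if all(i in catalog for i in n["inputs"]):  # all inputs ready?
                args = [catalog[i] for i in n["inputs"]]
                catalog[n["output"]] = n["func"](*args)  # in-memory "save"
                pending.remove(n)
                break
        else:
            raise ValueError(f"unsatisfiable inputs for: {pending}")
    return catalog

catalog = {  # stand-in for the raw datasets the catalog would load
    "companies": [{"id": 1, "active": True}, {"id": 2, "active": False}],
    "shuttles": [{"id": 7, "price": "1200.0"}],
    "reviews": [{"shuttle_id": 7, "score": 5}],
}
result = run(nodes, catalog)
```

Even though create_master_table is declared first, it runs last, because its inputs only exist once the two pre-processing nodes have produced them.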
So yeah, I did a kedro run while we were waiting, and this is essentially the Kedro pipeline running. These logs are saved to the logs folder, so you have them there. And that is it, essentially, so I guess we can handle the questions. That’s Kedro in a nutshell.
I did see one question that was asked around how this compares to AWS SageMaker. The AWS team actually did feature Kedro on their blog. We think of SageMaker as a deployment target for Kedro. So you still get that whole process of creating well-documented, well-structured, modular data science code, but you still have the freedom to deploy it on SageMaker – that is how we see those roles interacting. SageMaker doesn’t quite help you with those 2 aspects. So I’ll stop sharing here.
Great. Thank you very much. And what’s on the roadmap?
Sure. We’re going to be looking into a few things this year; it’s quite exciting. We’re going to be working on experiment tracking in Kedro. One of the recognized patterns of how Kedro is used is with MLflow, because while we do support data versioning, we don’t necessarily support collecting the other things that you might need to recreate your entire experiment. So we’re looking into the experiment tracking workflow. Another issue that always gets raised by open source users is change data capture. If I summarize that as a user problem, it’s: “Hey, Kedro team. I only changed one dataset or one function in my entire pipeline that has thousands of nodes. Why can’t Kedro work out where to restart my pipeline for me and only run the nodes downstream of that change?” So we’ll be working on those problems.
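The “only rerun what changed” request can be sketched as a simple walk over the declared dataset dependencies. Again, this is a hypothetical illustration rather than Kedro functionality; the pipeline shape and node names loosely follow the spaceflights example:

```python
from collections import defaultdict, deque

# node name -> (input datasets, output dataset), mirroring a declared pipeline
pipeline = {
    "preprocess_companies": (["companies"], "preprocessed_companies"),
    "preprocess_shuttles": (["shuttles"], "preprocessed_shuttles"),
    "create_master_table": (
        ["preprocessed_shuttles", "preprocessed_companies", "reviews"],
        "master_table",
    ),
    "train_model": (["master_table", "params"], "model"),
}

def downstream_nodes(pipeline, changed_dataset):
    """Find every node affected by a change to one dataset.

    Breadth-first walk: any node that reads the changed dataset is
    affected, and its output may in turn affect further nodes.
    """
    consumers = defaultdict(list)  # dataset -> nodes that read it
    for name, (inputs, _) in pipeline.items():
        for ds in inputs:
            consumers[ds].append(name)

    affected, queue = set(), deque([changed_dataset])
    while queue:
        ds = queue.popleft()
        for name in consumers[ds]:
            if name not in affected:
                affected.add(name)
                queue.append(pipeline[name][1])  # its output may cascade
    return affected
```

Changing the shuttles dataset, for instance, flags the pre-processing node plus everything downstream of it, while the companies-only path is left alone.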
Very good. Who’s behind the Kedro team? Is that like a handful of folks at QuantumBlack? Is that a broader group? What’s the status?
I can actually show you the contributor group as well. We are supported by an amazing team. We’re currently 12 people on the Kedro team, representing different skillsets: software engineering, front-end engineering for Kedro-Viz, technical writing, and we also have a visual designer who makes Kedro-Viz look amazing. Our contributor group is obviously growing. The core team: Lorena Balan; Andrii; Lim; myself; scrolling down, Stichbury, our amazing technical writer; Merel; Ivan Danov, our Kedro tech lead; Richard, front-end engineering. And I think Liam, because he’s also in front-end, might not appear – oh, here he is as bru5. We’re the team that actually maintains and runs Kedro. But you’ll see lots of contributions from the open source community – we call them Kedroids – joining in to help us make the product great.
I’m also curious how that works at QuantumBlack. McKinsey traditionally has been management consulting, very strategic focus. The reason for the acquisition of QuantumBlack is to bring tech and data expertise. But is the idea that you guys work with McKinsey customers as well, or like advise them, or how does that work?
So the QuantumBlack business model, when we were acquired, is such that we obviously brought the advanced analytics skillset to our engagements, and McKinsey really came with that whole thing of how you do proper analytics transformation and change management. Right now, that distinction is less so. QuantumBlack and McKinsey teams both roll out to clients and help them with their workflow.
How this tech focus and open source focus fits into the whole thing is that one of the QuantumBlack values is client independence: our clients should be able to maintain the data pipeline when we leave. That, for us, is what success looks like. So open source became a channel for us to provide support and upgrades to our clients past the first engagement, so they can continue to use Kedro. It really does continue to fit our business model, because we’ve built a lot of internal products on top of Kedro as well. So that really just helps with how we work on engagements.
Very good. All right. So we’ve got a couple more questions in the chat. You only answered the first one, right? I’m not crazy.
Yeah, no you’re right.
So second question, do you integrate with any commercial open source catalog products? Alation and others.
Not currently. What I would recommend, though – I did mention how extensible the data catalog is, so it’s quite easy to actually extend it to whatever you need. We typically will accept ad hoc requests based on whatever our users are using, and also contributions back into the data catalog, to support those different catalog products. So, yeah.
Okay, great. And then another question, Kedro is a very interesting tool, but it seems a lot like an ETL tool, like Talend, et cetera. How is Kedro using ML or deep learning logic?
Not Kedro itself. Think of Kedro more as the scaffolding for whatever pipeline you want to assemble, whether it is an ETL pipeline or a data science or machine learning pipeline. So when we talk about being able to use different machine learning applications in it, the functionality essentially supports that. It doesn’t just do ETL; there’s all sorts of functionality that is built on top of Kedro.
Great. And then maybe one last question from me. Are there other products at QuantumBlack that the team is working on that maybe will become open source projects, or maybe there are existing other open source projects?
So I can talk about maybe 2 of them. There are quite a few; you can have a look on the QuantumBlack website. But Performance AI, which was one of the tools that we used for the experiment tracking workflow, has been folded into Kedro. So when I say experiment tracking is on our roadmap, essentially we’re going to fully realize what that integration looks like. You should be able to see from Kedro-Viz what your experiment workflow is looking like.
And then, we did get to work with the Great Expectations team as well, because it’s this whole thing of, “I have this modular, maintainable data science code, but how do I make sure my data is validated while running through this pipeline?” And Great Expectations solves a lot of issues around that. So you might eventually see some projects with the Great Expectations team – Kedro Great Expectations – released as well.
Right. Great Expectations being an open source data quality project, with a company behind it. Very cool.
I see one more question from Rodrigo. It says, “Can we say Kedro equals Cookiecutter plus best practices plus workflow engine?” Yes, you could.
If Cookiecutter had thought about a data catalog, pipeline abstraction and pipeline visualization, and maybe additional things like data versioning and stuff like that, it might look a lot like Kedro. So, yeah.
All right. Very good. If we don’t have any more questions, I think we can call it a wrap. Thank you very much and thank you for doing this late at night in London with so much energy and this was terrific. Really enjoyed it. Thank you so much.
No worries. Thank you so much for having me.