For anyone in the data analysis community, Wes McKinney is a very well known figure. In addition to literally writing the book on the topic (“Python for Data Analysis”), he’s played a leading role in several key open source projects: he created pandas, he’s a PMC member for Apache Parquet, and he’s the co-creator of Apache Arrow, his current development focus.
He’s also a serial entrepreneur, having co-founded DataPad (acquired by Cloudera) and now Ursa.
So it was a real pleasure hosting Wes for a chat at our most recent Data Driven NYC. As always, we tried to position the conversation to be approachable by everyone (with high level definitions) while being interesting for technical folks and industry experts.
Watch the video below (or read the transcript copied below the video) to learn:
- What are pandas? What is a dataframe?
- What is Arrow? What is its history and why is it a big deal?
- What is Ursa Computing?
[Matt Turck] By way of quick intro, you are an open source software developer focusing on data analysis tools. For anybody that follows this community, you’re very well known, in particular for creating the Python pandas project and for being the co-creator of Apache Arrow, which we’re going to talk about a bunch today. You’ve also authored Python for Data Analysis, which is one of those Bible-type books in this space that every technical person has read or should read. You’re a member of the Apache Software Foundation, and you’ve worked at Two Sigma, Cloudera, and AQR Capital.
[Wes McKinney] I’ve been busy. He sounds busy, this Wes guy.
No slacking there. I’d love to start with pandas actually, which was sort of like your original baby. I think it’d be super interesting for the folks that are not familiar with pandas to hear about it. And so I guess, what is it?
Yeah, so pandas sort of started as a Swiss army knife for data access, data manipulation, analytics, and data visualization in Python. I often tell people it’s one of the primary on-ramps for data into the Python ecosystem. So, whether you’re reading data out of CSV files or other kinds of data files, or reading data out of databases, it’s hard to estimate, but I’d say more than 90% of the data that’s coming into structured, tabular data processing in the Python ecosystem is passing through pandas at some point in its lifetime. And so it’s become, after NumPy for array computing in Python, the most widely used Python package for data.
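That on-ramp role can be sketched in a few lines. This is a minimal illustration (the CSV contents here are made up), showing raw text becoming a queryable table in one call:

```python
import io
import pandas as pd

# A small inline CSV, standing in for a file or database export.
csv_text = """name,city,amount
Alice,NYC,120.5
Bob,Boston,75.0
Carol,NYC,210.0
"""

# pandas is the on-ramp: one call turns raw CSV text into a table.
df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)            # (3, 3)
print(df["amount"].sum())  # 405.5
```

From here the same object flows into plotting, aggregation, or a model, which is why so much tabular data passes through pandas at some point.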
Yeah, I was reading recently, you have literally millions of users, and it was like 500 contributors, or more than 500… sorry, thousands of contributors, right, for pandas?
Yeah, there’ve been well over 2,000 contributors at this point. I haven’t been actively involved in the development of the project for almost eight years now, and the project is almost 13 years old. I worked really actively on it for about five years, a couple of years in closed source and then a few years in open source, while also writing my book, which is largely about pandas. But it’s really flourished and become a model for a community-governed open source project. Pandas never really had a significant corporate sponsor who was putting in the majority of contributions. It really was a community project almost from the get-go.
Wonderful. And to make this even more approachable for folks that may not spend that much time in the industry: a lot of energy is spent talking about machine learning and AI and all the things, but ultimately what a lot of data analysts and data scientists do is spend a tremendous amount of their time getting the data ready for machine learning and AI. This is where pandas plays such an important role, in accelerating and facilitating that first step. Is that correct?
Yeah. So, people use it not only for loading the data into Python, but also for the data cleaning we call data preparation. They use it to do feature engineering, so extracting features from datasets, and all of that munging and data work that’s needed before you can feed the data into a machine learning model: TensorFlow, scikit-learn, what have you. Data can be pretty messy, so people report spending 80% or more of their time doing data cleaning, and comparatively less time actually doing the science part of data science.
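A tiny sketch of what that cleaning and feature-engineering step often looks like in pandas. The dataset and column names are hypothetical; the pattern of normalizing messy text, imputing missing values, and deriving a model-ready feature is the point:

```python
import numpy as np
import pandas as pd

# Hypothetical messy dataset: missing values, inconsistent text casing.
raw = pd.DataFrame({
    "user": ["a", "b", "c", "d"],
    "plan": ["Pro", "free", None, "FREE"],
    "spend": [10.0, np.nan, 3.5, 0.0],
})

clean = raw.assign(
    # Normalize casing and fill gaps with a sentinel category.
    plan=raw["plan"].str.lower().fillna("unknown"),
    # Impute missing spend with the column median.
    spend=raw["spend"].fillna(raw["spend"].median()),
)

# A simple derived feature for a downstream model.
clean["is_paying"] = (clean["spend"] > 0).astype(int)
print(clean)
```

Work like this, repeated across many columns and sources, is where that 80% of a data scientist’s time tends to go.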
Still in an effort to make this educational: do you want to talk about data structures in the context of pandas? In particular, there’s a term that comes back a lot, which is DataFrame. Do you want to explain what it is?
Yeah. DataFrame is a term that arose originally in the R programming language, which was based on the S and S-PLUS programming languages. A DataFrame basically means a table, and I guess we could call it a table and it would mean the same thing. But the term DataFrame, at least in popular understanding, has become this idea of a style of programming interface for working with tabular datasets that’s imperative. So you can write for loops; you can step outside the bounds of what you could do in the SQL query language. It tends to be interactive and very flexible, and it enables you to work with datasets from a general purpose programming language like Python.
And DataFrame from the context of pandas, how does pandas-
Well, DataFrame is the name of the main table object in pandas. It’s a collection of columns. Most of the time, people are working either with a collection of DataFrames or with columns from DataFrames, which are called Series. So when you read a CSV file, it comes into Python as a pandas DataFrame.
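The DataFrame/Series relationship described here can be shown concretely. A minimal sketch with made-up data: reading a CSV yields a DataFrame, and selecting one column yields a Series:

```python
import io
import pandas as pd

# Reading a CSV produces a DataFrame: a collection of named columns.
csv_text = "a,b\n1,x\n2,y\n"
df = pd.read_csv(io.StringIO(csv_text))

# Selecting a single column produces a Series.
col = df["a"]

print(type(df).__name__)   # DataFrame
print(type(col).__name__)  # Series
```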
And all of this is in the context of the Python world. When you started all this, it wasn’t super clear that Python was going to be the dominant language for data analysis and data science, but it seems to be very much the case now. Why do you think that happened? And do you think that’s a permanent state, or could other languages emerge?
Yeah, it certainly wasn’t clear, it wasn’t predetermined or inevitable that Python was going to become the number one data language that it is today. I think a part of that was a relatively small group of very passionate open source developers building the essential projects which enabled the people who are now called data scientists to do complete workflows in Python. And so we needed the array computing that you have in NumPy. We needed the tabular data manipulation, data cleaning, and data loading that you have in pandas. We needed to be able to do machine learning; that came from scikit-learn, which was developed around the same time. We needed a nice user interface, a programming environment; that came from IPython and the Jupyter Project. We needed to be able to visualize data. So it was kind of this perfect storm of these different tools coming together and enabling a productive workflow.
I think another thing that was a catalyst for Python was the fact that so many companies started to get value out of all of the new data that was being generated by smartphones and by mobile browsing. And so the time to market and the speed at which you could develop systems that created actionable insights on data and production systems that would deliver model predictions, different things, you need to be able to build those systems and get them to market really, really quickly. And so the fact that you could use Python as the language, not only of your research and your exploratory computing, but then you could also on the other hand build production systems and then put them out there in the cloud, that was very compelling. And you could hand somebody a book about Python and they could become productive and start writing useful code within hours or days.
So it was really about that rapid prototyping and the research to production life cycle that enabled Python to become what it is today. But certainly without that perfect storm of open source projects, it just wouldn’t have happened. I think without scikit-learn, without pandas, without NumPy, it just wouldn’t have… I think people would have had to go elsewhere. I don’t know what the world would look like, but it would be different.
Your focus for the last few years has been Arrow. When and how did that start?
I would describe it as this collective realization in the 2015 timeframe that the entire data ecosystem was facing this grand data interchange and data interoperability crisis. And that’s partly because there were all of these different independent groups of people that built different computing engines, data processing systems. There were a lot of different file formats. There was the cloud, so people were starting to build the first data lakes in the cloud. And so there was all this angst about how do we move data and transport data between different systems? How do we get access to all these different data formats efficiently? And all the while, the hardware underneath our feet, disk drives, networking, everything was getting a lot faster. And so we found ourselves really limited and held back by the speed of getting access to data and the costs associated with moving data around.
So I was at Cloudera at the time, and I was interested in this problem. I had experienced it from the perspective of pandas and the Python ecosystem, wanting to build bridges from Python into all of these other systems. So we started with folks from Cloudera, people from the Impala team and the Kudu team, who had worked with Julien Le Dem at Twitter on Parquet. Julien was now at Dremio. So we very quickly got all these companies together: Cloudera, Hortonworks, Dremio, Twitter folks, and a bunch of others. Eventually there were 20 or 25 of us in a room saying, “Hey, this is a really important problem. We need to create a cohesive solution to data interoperability and data interchange that we can all use, that will meet all of these diverse needs and make some of the pain go away.” It was definitely serendipitous, but it also was a lot of hard labor to get so many people to agree with each other to build one thing, as opposed to 14 different incompatible quote-unquote standards.
And to double click on the problem. Is it fair to say that on one side you have the world of databases and data warehouses, which is the world of SQL, and on the other side you have the world of machine learning and data analysis tools, like Spark and NumPy and so on and so forth, which is Python, and those two worlds did not communicate? Is that right?
Yeah. I mean, that was one of the things that was the most painful for me, the fact that you had all this really sophisticated, scalable data processing technology that was being created in the big data world or in the analytic database world. And I had come from the data science ecosystem from Python and so we were off on our own kind of completely working by ourselves, building everything from scratch. So there was very little cross-pollination between the database, scalable data processing world, and the data science world. And so for me, one of the primary motivators was to create a technology which could proverbially tie the room together and enable that cross pollination between database developers and data science developers that had just never existed because there was nothing productive to collaborate on until that point.
Why is it so hard to extract data from a database? Can’t you just use an ODBC or JDBC connector, get the data out, and be done?
Well, we could spend 20 minutes or an hour just talking about this, but there are many problems. Part of the problem is that the ways people consume the results of database queries are so diverse, and not everything is a DataFrame library or an analytical tool. Interfaces like ODBC and JDBC were never designed or intended for bulk data transfer on the order of gigabytes, for example. They’re best used for relatively small amounts of data that are the results of database queries: all the work happens in the database, the results are relatively small, and they come into your application through something like ODBC or JDBC. They’re also row-oriented interfaces, which makes them an awkward fit when you need to convert to a column-oriented tool like pandas.
And so there’s this impedance, this conversion penalty, going from things like ODBC into pandas. As a user of these tools, you kind of have this feeling of drinking a thick milkshake through a straw that’s too small. You try really hard to extract the data out of the database so that you can do work on it, but you end up bottlenecked on just the transfer, because databases aren’t designed for that. They want you to do all of the processing within their walls.
And so how does Arrow work and how does it solve the problem?
Well, we designed a language-independent, standardized data format that is column oriented, that can be used for bulk data transfer, but that is also an efficient data format for doing analytics. So if you get a bunch of Arrow data out of a processing system, out of the database, into memory, you can immediately go to work doing further data processing on it without any need to convert it into some other data format. Traditionally, data comes into your Python interpreter or your R interpreter, into your process, and you immediately have to convert it into some other format. Arrow removes the need for that conversion. So aside from transporting bytes over the network, the data is essentially ready for analysis as soon as it gets into your hands.
And again, to reiterate for folks, this is a very big deal because (and correct me if I’m wrong, Wes, I’m going out on a bit of a limb here) the world of data keeps talking about Snowflake and the data warehouses, with a reasonable amount of tweets and articles and all the things, and then also machine learning. But effectively doing machine learning on top of Snowflake is incredibly hard, not because the machine learning is complicated, or because Snowflake doesn’t work, but because the core problem is the movement of data: it’s incredibly complicated and expensive and time consuming. So Arrow has a very strategic place in the ecosystem, in that it precisely enables all those pieces to work together. To this point about data warehouses, what’s the status of the integration with Snowflake or BigQuery or others?
Yeah, that’s been one of the most successful use cases for the project: serving as a standard medium, a data format for bulk interchange with data warehouse systems. Snowflake supports exporting query results to Arrow format, and so does BigQuery. Microsoft has a number of internal projects in their cloud infrastructure for direct-to-Arrow export. Part of what motivated the data warehouses to support Arrow is that we built a really efficient bridge between Arrow and pandas. So if their goal is to get gigabytes of data out of their data warehouses into the hands of data scientists, rather than building a custom connector that they have to design, build, maintain, and optimize themselves, they can go through Arrow. They have one thing to think about, and they get access to this whole ecosystem of tools that now support Arrow after five years of development work.
That makes things much simpler for the data warehouse vendors: they have one thing to think about, and on the Python side we can deal with, “okay, I know how to optimize getting data into pandas.” We can maintain that for them, and their developers don’t have to solve that problem themselves. So I think that in the course of the next few years, pretty much every database system and every data warehouse is going to support Arrow-based import and export in some form, and that’s exactly what we want. That’s precisely the purpose of the project: to have this common, standardized, efficient interchange format. It’s a major development that is very difficult to achieve.
Great. Let’s talk about Ursa a little bit. You first started this as a nonprofit entity, which was an interesting model. As mentioned at the beginning, you’ve since switched to, or added, a commercial entity. Do you want to talk about those?
Yeah, the backstory is that Arrow started in 2016. I moved from Cloudera to Two Sigma. Two Sigma recognized the value of Arrow pretty much immediately, as they have petabytes and petabytes of data, and said, “Make Python work better with all of this big data; we’ll let you work on it here.” But then pretty quickly, companies all around the world began to realize how much of a big deal Arrow was, and they all wanted to fund the development. So I had all these companies coming to me and saying, “Hey Wes, can you come work here? You can work on Arrow, but we want you to solve our problems.” So initially Ursa Labs was created, effectively as an industry consortium, to enable many corporations to fund Arrow development work and have a seat at the table, with regular syncs with me and my team.
That way we could make sure that we were hearing their concerns and prioritizing the important parts of the development roadmap, but we weren’t beholden to any single company or any single company’s needs. RStudio was very gracious to provide a home for Ursa Labs and the majority, or at least a plurality, of the funding for the work. After a little more than two years of nonprofit Ursa Labs work, it became clear to me and to many people that having more of a commercial engine behind Arrow and the Arrow ecosystem was important for enabling the ecosystem to continue to grow, and for us to be able to pour a lot more resources into the open source project.
As of the summer, we were six full-time developers. Even with six full-time people working on the project, our hands were completely full with bug reports, trying to build features, and keeping the community growing and healthy. So the writing was on the wall that we needed to think about product development for Arrow, and for the data science ecosystem as it relates to Arrow, to be able to invest more in the maintenance and growth of the Apache Arrow project itself.
And Ursa Computing is going to be the commercial version of this?
Yeah. So we spun the Ursa Labs team out of RStudio and folded it into a new company, Ursa Computing. We continue to have a Labs team which focuses a hundred percent on open source development, and we’re hiring and expanding that Labs team. The new company, Ursa Computing, raised a venture round led by GV, and we’ll be developing some commercial offerings to accelerate data science use cases, all powered by the core computing technology that we’re creating in the Arrow project.
Okay, very good. Just a couple of questions from the audience and then we’ll switch to Julien and Datakin. Let’s see, a question from Ragu: how does Arrow compare with the Databricks product stack, Delta Format and Delta Engine?
It’s neither a competitor nor a replacement; it’s strictly a complementary technology. Spark supports Arrow as an interchange format, and it’s used heavily in the interface with Python and R, for example. I’m not up to date on what else Databricks has done to interoperate with Arrow, but I know that they’re an active user of Apache Arrow in the Delta Lake Python interface, in core Spark itself, and probably in other parts of Databricks that I’m not aware of. So Arrow essentially enhances interoperability with file formats like Parquet, which are an essential part of the Databricks platform.
And one last question from Joshua Bloom (if we’re talking about the same Joshua, a former speaker at these events), more of a broad industry question. After HPC/Hadoop/Spark/Ray, what’s the long-term future of parallel compute for data-intensive workflows? What are the key innovations needed: hardware and data center architecture, software platforms, algorithms? What’s your overall take on the future of parallel compute for data-intensive workflows?
Yeah. I think that hardware acceleration and leveraging advances in modern hardware is absolutely essential. The folks from NVIDIA have a large team building the RAPIDS project, which is CUDA-based computing against Arrow data. They’ve been able to basically smash large-scale big data benchmarks on big appliances full of NVIDIA GPUs, providing faster performance and lower TCO than alternative solutions that don’t leverage that kind of hardware acceleration. I think it’s important that we develop computing engines that very effectively scale up at the single node scale, and that we build really flexible and high quality distributed schedulers which enable us to take those efficient single node computing systems and distribute and scale them on large datasets.
And so you’re seeing that trend with projects like Ray and Dask, which provide general purpose distributed schedulers for single node computing systems. But one of the things you find with things like Spark is that they really fail to shrink down and effectively do computing at the single node scale. You can use Spark at the single node scale as an alternative to pandas through the Koalas interface, but you’ll find that for many workloads it’s simply slower than pandas, which is not super impressive. So we need to innovate on distributed computing, hardware acceleration, and single node computing. We have a lot of work to do, so I think the next decade is going to be really exciting as these pieces begin to come together.
Wonderful. Perfect way to end it. Really appreciate it. Wes, this was wonderful. Congratulations on all the success with all of this and really appreciate your stopping by at Data Driven NYC.
Thanks, Matt. Great to see you as always.