Real-time data streaming is an increasingly crucial part of the data ecosystem. While financial services (trading) initially represented the bulk of the demand for streaming, the emergence of more mature technology in the space has unlocked more use cases, which in turn created more demand for better technology.
At a recent Data Driven NYC, we had a very interesting conversation with Arjun Narayan, CEO of Materialize, “the only true SQL streaming database for building internal tools, interactive dashboards, and customer-facing experiences”. Materialize is headquartered in New York and has raised $40M in venture capital money (with a new round rumored to be announced soon, at the time of writing).
This was a very educational discussion, where we covered the following topics:
- What is streaming? What is Kafka?
- Why is there a need for a streaming database for analytics?
- Why is SQL underrated?
- What is Materialize?
- Partnering with DBT to make streaming ubiquitous
- Materialize’s roadmap
Below is the video and below that, the transcript.
(As always, Data Driven NYC is a team effort – many thanks to my FirstMark colleagues Jack Cohen and Katie Chiou for co-organizing, Diego Guttierez for the video work and to Karissa Domondon for the transcript!)
TRANSCRIPT (lightly edited for clarity)
[Matt Turck] What is streaming, and why does it matter?
[Arjun Narayan] I think the best way to wrap your head around streaming is to contrast it with batch. Data moves in either batch or streaming in the sense that you either move data a few times a day, having batched up the data, or you send data points across your various systems as events happen. So one could imagine, for instance, when doing quarterly reporting, you just collect up all the things that happened that quarter at the end of the quarter, then you do the processing.
The earliest adopters of streaming technologies have always been financial use cases where a new order comes in and you want to fulfill it immediately or do something at very low latencies. What we’ve been seeing is that streaming has started to become, over several decades and particularly in the past few years, much more broadly applicable beyond those small niche use cases, as more and more businesses and users benefit from lower latency answers to a variety of questions.
People have been talking about streaming for a while, but that seems to be accelerating now. What are the drivers? Why is that accelerating so much as a key industry trend?
I definitely think there’s a confluence of several factors. One is that the tools for moving the data around are starting to become more ubiquitous, with Apache Kafka being the most widely adopted technology for moving data from point A to point B at very low latencies.
For the less technical folks here, what’s Kafka, why is it a big deal?
Kafka is an open source technology which is a message broker. So if you have various systems that are producing data, you have various other systems that are consuming data, you can put Kafka in the middle and you can have the producers of data publish various feeds, and you can have consumers of data in a fairly decentralized fashion choose which feeds of data, or which topics as they call them in Kafka, to subscribe to.
What this means is that you can create this real-time clearing house infrastructure for data in your company. It connects various teams that may not even be coordinating with each other, teams that, in the absence of a system like Kafka, would have to essentially coordinate and talk to each other and figure out a way to move that data and closely integrate their systems.
They don’t actually have to do that coordination and they can just sort of subscribe to those data feeds that are available and potentially come up with use cases that make use of that data in real time. You can imagine how this would happen in batch, right? So everybody at the end of the day would put all their data in a data lake and then tomorrow anybody can pick up that data and do something useful with it. Kafka basically moves this into real time and allows you to within milliseconds get access to data that has just been created by upstream services in your company.
It has been, in my opinion, a key enabler of microservices. I think it’s pretty difficult to build and operate a decentralized set of microservices without first adopting something like Kafka in your organization to move the data between all of these various microservices.
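To make the publish/subscribe pattern Arjun describes concrete, here is a toy in-memory sketch. This is plain Python standing in for a real broker like Kafka; the `ToyBroker` class and the topic and event names are invented for illustration.

```python
from collections import defaultdict

class ToyBroker:
    """A minimal in-memory stand-in for a message broker like Kafka.

    Producers publish events to named topics; consumers subscribe to
    the topics they care about, with no direct coupling between them.
    """
    def __init__(self):
        self.topics = defaultdict(list)       # topic name -> retained log of events
        self.subscribers = defaultdict(list)  # topic name -> list of callbacks

    def subscribe(self, topic, callback):
        self.subscribers[topic].append(callback)

    def publish(self, topic, event):
        self.topics[topic].append(event)       # append to the topic's log
        for callback in self.subscribers[topic]:
            callback(event)                    # push to every subscriber

# The orders team publishes; analytics and fulfillment each consume the
# "orders" topic, without any of the three teams coordinating directly.
broker = ToyBroker()
seen_by_analytics, seen_by_fulfillment = [], []
broker.subscribe("orders", seen_by_analytics.append)
broker.subscribe("orders", seen_by_fulfillment.append)
broker.publish("orders", {"order_id": 1, "amount": 99.50})
```

Real Kafka adds the parts this sketch ignores — durable partitioned logs, consumer offsets, horizontal scaling — but the decoupling between producers and consumers is the same idea.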
Why is there a need for a streaming database?
The biggest reason I would say is that today’s analytics databases, while very sophisticated and very powerful, are fundamentally designed around the batch paradigm. The best way to contrast this is: a lot of batch workloads are very pull oriented. You show up, you say, I’ve got this database and I wish to recompute, or compute fresh, a bunch of queries. And these queries can be very complicated, they can join and merge many, many very large datasets and things like that.
This paradigm doesn’t really work in the streaming world where you may be getting in aggregate still very large data, very big data to use the phrase, but every few milliseconds, you get another row of data, right? So stopping everything and recomputing from scratch is not really a framework that scales to these lower and lower latencies. Which is why fundamentally a lot of analytics databases today, including some of the more famous ones, they would prefer it if you would just batch up all your data over a few minutes and then rerun the computation, right?
So they introduce latency because the paradigm in which they perform their computation is not suited for incremental recomputation. Another way we thought about describing Materialize was we’re an incremental database, right? So we are suited for updating arbitrary queries, arbitrary answers, when the underlying data changes even if that change is small or that change is happening at very, very high frequency. So every few milliseconds and in aggregate hundreds of thousands or millions of updates per second.
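A toy sketch of the incremental idea above: maintain an aggregate that is updated in constant time per event, instead of rescanning all history on every query. This is illustrative only (the `IncrementalSum` class is made up); an incremental database like Materialize maintains arbitrary SQL query results this way, not just a sum.

```python
class IncrementalSum:
    """Keeps a total up to date as individual rows arrive or are retracted."""
    def __init__(self):
        self.total = 0

    def apply(self, delta):
        # Each update touches only the change, not the full dataset.
        self.total += delta

def batch_sum(rows):
    # The batch paradigm: recompute from scratch over all the rows.
    return sum(rows)

rows = []
view = IncrementalSum()
for delta in [5, 3, -2, 10]:           # events arriving every few milliseconds
    rows.append(delta)
    view.apply(delta)

assert view.total == batch_sum(rows)   # same answer, without a full rescan
```

The batch version does work proportional to the whole history on every refresh; the incremental version does work proportional only to what changed, which is what lets latency stay low as update rates climb.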
And as you built Materialize you chose to go the SQL route. Can you walk us through the thinking that led to this?
Yeah. So I think SQL is incredibly underrated as a standard for describing computations and the queries that one executes over all these datasets. It’s incredibly long lived; it’s three, four decades old. And if you think of many, many mature organizations, the corpus of SQL queries and analytics workflows that they’ve defined over decades is very, very rich. I’m generally very skeptical of any pitch where you tell folks, you can have all these great new benefits of low latency or whatever it is, but you’ve got to start all over from scratch. You have to throw everything out and you’ve got to rebuild everything in some new language. I think those efforts are largely doomed.
SQL is also a great standard for interconnectivity. So you have all sorts of different tools. Any company that has any scale, or even no scale at all, is usually using dozens or hundreds of tools. These tools all talk to each other using standards like SQL. A good example of this is a BI tool like Looker issuing queries to a data warehouse like Snowflake, which is pulling data from a source-of-truth database like CockroachDB, right? So you take these three, they’re all able to speak to each other because they speak that common language of SQL. The user who’s moving the cursor and clicking some things in Looker may not even be writing a SQL query, they may not even know SQL, but every click that they do in Looker is issuing under the hood various SQL queries that are being parsed and executed by the data warehouse like Snowflake.
And so I think one of the most important things is to recognize that when selling or adopting new technology, you are working within a context of an ecosystem that is orders of magnitude larger than your tool. And SQL, I think is the best example of that in the data space.
When do you think SQL came back? Right? There was this whole evolution of the last few years from traditional SQL databases and then it was all NoSQL and then SQL came back… what was the journey then?
It’s a great case study for those of us veterans who’ve been around for a while. I mean, a couple of decades ago, SQL was done, right? There was this big movement of NoSQL and Hadoop. The fundamental message there was that the data sets were growing so large and your SQL databases were never going to scale. If you wanted to process these ever-growing data sets, you had to throw everything out and you had to start over by using horizontally scalable technologies, like NoSQL on the transactional side and Hadoop on the analytical side, right? This obviously had some uptake, this had some adoption, but I would classify that as adoption among folks who could build and maintain these systems and were starting in a fairly clean-slate manner. So the tech companies, the first and second waves of web companies, would adopt Hadoop, but by and large it didn’t really get to a lot of mainstream adoption.
It was only when you had this NewSQL wave of vendors and technologies that came about, which took those same ultimately correct ideas of horizontal scalability and cloud-native architectures but packaged them up such that the interface was still the very same SQL that people were used to, that you really saw mainstream adoption across not just Silicon Valley companies, but all across the world.
I think ultimately Snowflake is a fantastic example. Under the hood they have a very modern microservice architecture, but it’s all very neatly wrapped up with a bow on top such that it looks like a SQL database, and that’s much, much more attractive to folks who already have decades of SQL queries and analytics queries that fundamentally need to be ported over and run in this cloud-native fashion, without throwing it all out.
Let’s dive into Materialize a bit more. It’s based on an open source project called Timely Dataflow. What is that? What is the history there?
Materialize, despite being a relatively young company, we’re a little over two years old, is based on close to a decade of stream processing research primarily driven by my co-founder and Materialize’s Chief Scientist, Frank McSherry. Frank was an academic; he worked at Microsoft Research for a while, where he made several contributions to various parts of information theory, data privacy, and big data computing. He led a project to build a next-generation stream processor, which he developed as an open source project called Timely Dataflow, written in Rust, which back then was still a programming language under development in a sort of pre-1.0 state.
I was a close follower of this technology because I was a PhD student at Penn in distributed systems, distributed computing. And it was what I would describe as the very first stream processor that could do everything that batch processors could do. Before, in this pre-Timely Dataflow world, there existed stream processors like Apache Storm, but they fundamentally posed a trade-off. They said, “You can do some things in real time, you can do some things incrementally, but you can’t do everything that you can do in batch.” So you had to choose: do you want to do complex computation, or do you want low latency? And Timely Dataflow was the very first stream processing technology that said everything you can do in batch you can do in streaming, and you can do it at low latency. So it removed this trade-off and made it purely dominant as a technology.
Timely Dataflow is the core technical base layer of Materialize. The way I would phrase it, it’s a little bit like the engine, and Materialize is the car. And SQL is the same gear shift and accelerator and steering wheel that you’re used to. You can’t fundamentally go to market selling a very heavy, much, much more powerful engine on its own. People, at the end of the day, are trying to buy a car, they’re not trying to hot rod it. There are a few people who have adopted Timely Dataflow and are running it in production, the hot rod enthusiast community in this analogy, but Materialize wraps up that extremely scalable, extremely high performance streaming computation engine in a way that looks and feels very much like the SQL databases that people have been used to for 30, 40 years.
What are the pieces that you built on top then to help achieve that?
The Timely Dataflow project also includes this other part called Differential Dataflow, which builds reusable components that can be assembled into these dataflow graphs that are executed. Materialize builds all of the SQL planning, the optimization, the integrations to tools like Kafka, the integrations to pull batch data from S3. Oftentimes, even when you do have streaming data, you need to mix streaming and batch. That ends up being another challenge because, again, not everybody is throwing out all their batch data or their batch tools and technologies overnight and switching to streaming; oftentimes there’s a long phase where you’re dealing with both batch and streaming data in parallel. So there are the integrations, the SQL execution, and finally the tools that make it easy to operate and scale and deploy the system in the cloud.
You have a concept of materialized views, like update trigger and update granularity, do you want to dive into those and explain a little bit?
The reason we called the company Materialize is that what we really see happening is us delivering on a long-standing database feature, the materialized view, that hasn’t really lived up to its potential or its promise. The way to think about a materialized view is, let’s imagine you’re writing a query and you’re rerunning it over and over again. And the data may change, the data may not change, right? Sometimes you’re refreshing to see if that package is still in Memphis, right? That’s a great example of a query that we all keep rerunning all the time. A materialized view basically registers with the database: I’m interested in the results of this query, please keep recomputing it for me so that it’s fresh, right?
So when I come and ask for it, I want it to be top of mind for you. And if any of the underlying data has changed, please recompute the query at that point. You’re basically triggering the computation, the recomputation, when an actually relevant update happens, rather than when I ask for it. If the database only starts doing work when you ask the query, and it’s a very large data set or a very complicated query, you may have to wait minutes or hours before you get the answer. But if it has already recomputed that answer when the data changed, then by the time you come and ask for it, it’s ready for you in milliseconds.
Materialized views have been present in databases as old as Oracle or SQL Server, but they’ve been very limited in the kinds of queries you could write and the kinds of materialized views you could define. And we really think Materialize, the product, is the first database that gives you, I don’t want to say literally all the functionality, but morally speaking 99.9% of the functionality that you can write in a batch database.
And trigger and granularity, what does that mean?
Trigger means you are recomputing the result when the data changes. If the data has not changed, you don’t want to waste a lot of computation rerunning just to get the same answer, right? So the update trigger is important for both performance and efficiency. The update granularity is: you actually want the capability to recompute efficiently when a single data point changes, rather than being incentivized to accumulate a bunch of updates and batch them up together because it’s prohibitively expensive to update your answer on just a single row changing.
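A rough sketch of the update-trigger idea: refresh the cached answer when an underlying row changes (write time), not when the query is asked (read time). This is plain Python standing in for the database; the `ToyMaterializedView` class and the query are invented, and unlike a real incremental engine it naively reruns the whole query per change rather than applying just the delta.

```python
class ToyMaterializedView:
    """Caches a query result and refreshes it whenever the data changes,
    so reads are always answered from the precomputed result."""
    def __init__(self, query):
        self.rows = []
        self.query = query
        self.cached = query(self.rows)
        self.recomputations = 0

    def insert(self, row):
        # Update trigger: recompute when data changes, not when asked.
        # Update granularity: a single-row change triggers one refresh.
        self.rows.append(row)
        self.cached = self.query(self.rows)
        self.recomputations += 1

    def read(self):
        return self.cached  # no query work at read time

view = ToyMaterializedView(lambda rows: sum(r["amount"] for r in rows))
view.insert({"amount": 10})
view.insert({"amount": 5})
answer = view.read()   # answered instantly from the cache
view.read()
view.read()            # repeated reads trigger no recomputation
```

The point of the granularity discussion above is that `insert` should stay cheap even for a single row; a system like Materialize achieves that by updating the result incrementally instead of rerunning the query as this toy does.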
You just announced a partnership with DBT, and we had Tristan, the CEO of Fishtown Analytics, the company behind DBT, speak at this event two or three months ago. What’s the idea behind the partnership?
DBT is this amazing tool that is gaining a lot of adoption. DBT is a modeling layer. The way I like to think about it, and Tristan may not like this, but I think of it as like GitHub for all of your SQL. There are workflows, there’s continuous integration testing, there are processes that you want to have to stay sane when your company has thousands or tens of thousands of lines of SQL spread across the organization, with all of these interdependencies: this SQL query depends on this dataset, which is computed by that SQL query, which depends on that dataset, etc. That is where, in my view, most of the world is moving in terms of writing and defining that SQL. And the bulk of that today is of course done on batch systems.
And I think the thing that will truly unlock the potential of streaming is when a company can go from batch to streaming by taking that existing SQL that’s already been written and defined in these DBT models and then flip a switch and say, we’re now moving all of this recomputation into a streaming pipeline, without really having to rewrite that SQL, because DBT does a wonderful job of abstracting it away from the underlying execution engine in a way that the semantics are perfectly preserved. Until streaming gets to the same level of tooling as existing batch systems (and a large part of that tooling is DBT), it will remain a fairly niche technology.
Exciting. So what’s next? What’s on the roadmap for 2021?
Next month, we’re launching our fully hosted cloud product. I think this is definitely one of the top requested ways to use Materialize today. It’s a source-available downloadable product, but particularly with database infrastructure, I think more and more users and customers wish to use a fully hosted cloud product rather than having to own and operate infrastructure and hold a pager. In terms of the core features, we are working on tiered storage, which we will launch this year. Tiered storage is basically the ability to store large historical data sets on very cheap storage like Amazon S3, such that you don’t have to really think about separating out your historical data sets. That just happens behind the scenes. Live data is often stored on very expensive storage, which you want to minimize and compact away: keep just one day’s worth of data on the expensive storage, and take the history of years and years of data and put it on tiered storage.
That ends up creating a lot of organizational complexity and operational complexity, and building that natively into the product means that users will just define SQL, the data that’s relevant will stay on local disk, and the data that is historical will just magically move off to tiered storage.
Next year, we’re building replication, this is for the more business critical use cases where you want very, very high availability, very, very low downtime. You will want to run replicated instances of your database so that if one goes down the other one can seamlessly take over.
Wonderful. We’re pretty much out of time, so let’s call it a wrap. Thank you so much. Really appreciate it. That was fantastic and very, very educational. So thanks very much for coming by.
Thank you so much for having me.