There’s an inherent tension at the heart of modern data infrastructure. On the one hand, it’s becoming more mission-critical every day, as companies around the world rely on it to run their businesses. On the other hand, it’s more complex, and potentially more brittle, than ever: an “assembly chain” involving multiple tools and repositories.
This tension has led to the emergence of DataOps as a distinct and very active segment. One particularly important area is known as “data lineage”. The idea is to monitor data pipelines and understand the journey of data through its various transformations and usages. This makes it possible to fix issues that arise along the way, and to trace data quality, and potentially fairness, problems back to their root cause.
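To make the idea concrete, here is a minimal sketch of lineage as a graph walk. It is not how OpenLineage or Marquez represent lineage; the dataset names and the `UPSTREAM`, `trace_upstream` and `root_sources` helpers are all hypothetical, chosen only to illustrate tracing a broken dashboard back to the raw inputs it depends on.

```python
from collections import deque

# Hypothetical lineage graph (an assumption for illustration):
# each dataset maps to the datasets it is directly derived from.
UPSTREAM = {
    "revenue_dashboard": ["daily_revenue"],
    "daily_revenue": ["orders_clean", "fx_rates"],
    "orders_clean": ["orders_raw"],
    "orders_raw": [],
    "fx_rates": [],
}


def trace_upstream(dataset, upstream=UPSTREAM):
    """Return every dataset feeding into `dataset`, nearest first (breadth-first)."""
    seen = []
    queue = deque(upstream.get(dataset, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # already visited via another path
        seen.append(current)
        queue.extend(upstream.get(current, []))
    return seen


def root_sources(dataset, upstream=UPSTREAM):
    """Return the upstream datasets that have no inputs of their own."""
    return [d for d in trace_upstream(dataset, upstream) if not upstream.get(d)]


if __name__ == "__main__":
    print(trace_upstream("revenue_dashboard"))
    print(root_sources("revenue_dashboard"))
```

If the dashboard looks wrong, the walk gives the list of datasets worth checking, starting with the closest ones, and the root sources are where a data quality issue may have entered the pipeline.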
Because data lineage involves many different tools, platforms and companies, it makes sense for those different parts of the ecosystem to collaborate around standard definitions. This is the concept behind OpenLineage, a cross-industry effort involving creators and contributors from key data projects (dbt, Spark, Pandas, etc.), brought together at the initiative of the founders of Datakin, an SF startup behind the open source data lineage project Marquez (originally started at WeWork).
At our most recent Data Driven NYC, we had the pleasure of hosting Julien Le Dem, CTO of Datakin. His talk (video below) is very approachable and educational.