· Hakan Lofcali · Data Analytics  · 5 min read

Big Data: V is for Variety

Variety is the current challenge looking for solutions in the Big Data space.

My career in Big Data started ten-plus years ago, when I learned about the Big Vs: Volume, Velocity, and Variety. I had just finished a gig with Scala, which seemed to be winning in Data Infrastructure at the time. So my career basically created itself in Data Infrastructure and ML. Not complaining!

Volume

Our industry was in a state of MapReduce, RDDs, and SQL on top of flat files. Ten years on, we clearly have good tools for Volume, and the volume problem may no longer be as prevalent.

Growth in hardware power and network bandwidth has led to the software stack being overengineered for most queries, e.g. Amazon Redshift; see Why TPC Is Not Enough.

Along the way, our industry has crowned DataFrame APIs (mostly Python) and SQL (mostly in flavored dialects) as the winning interfaces. If you build a data processing solution, you must provide one, or even both, of these interfaces to succeed.

Velocity

Storage and compute were separated to tackle the volume problem. This increased the challenge of velocity, which can be measured by throughput and latency.

Throughput measures the amount of data processed over time. Latency measures the time from a record entering the system until its result is available. Both measures are important, but not equally: in most companies and use cases, throughput wins over latency.
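As a toy illustration of the distinction, both metrics can be derived from the same timed batch of records. This is a minimal sketch in plain Python with a made-up workload, not a benchmark of any real system:

```python
import time

def measure(records, process):
    """Run process() over records, tracking per-record latency and overall throughput."""
    latencies = []
    start = time.perf_counter()
    for rec in records:
        t0 = time.perf_counter()
        process(rec)
        latencies.append(time.perf_counter() - t0)  # time per record (latency)
    elapsed = time.perf_counter() - start
    throughput = len(records) / elapsed             # records per second
    avg_latency = sum(latencies) / len(latencies)   # seconds per record
    return throughput, avg_latency

# Hypothetical workload: "processing" is just a cheap transformation.
tp, lat = measure(range(10_000), lambda r: r * 2)
```

A batch system optimizes `tp` even if individual records wait in a queue; a low-latency system optimizes `lat` even at the cost of total throughput.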

One clear winner in data infrastructure regarding velocity is Apache Kafka. In particular, Apache Kafka's Consumer and Producer APIs have become the de-facto standard for exchanging and analyzing data flows within companies.

Velocity is still a challenge, depending on the use case. But our data centers and networks have become blazingly fast.

Variety

We believe that variety is the major challenge of the coming years in Data Processing. Variety describes the heterogeneity of data: from strictly or loosely typed tabular data to images, text, and videos. It has always been a challenge in analytics. Today, we will focus on tabular data and what we observe in the industry.

Tabular data comes in many flavors: fully structured, semi-structured, additive, or even with values containing queryable structures like JSON. The distribution of values in a given field, and the rules governing it, are as complex as the world itself, because they try to capture a small portion of the world from the perspective of an individual or company.
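The "queryable structures inside values" case can be seen with SQLite's built-in JSON functions, which ship with recent Python builds. Table and field names here are made up for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER, payload TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        (1, '{"user": "a", "amount": 10}'),
        (2, '{"user": "b", "amount": 25}'),
        (3, '{"user": "a", "amount": 5}'),
    ],
)
# json_extract reaches into the semi-structured payload column,
# so a plain SQL GROUP BY works on fields the schema never declared.
rows = conn.execute(
    """
    SELECT json_extract(payload, '$.user') AS user,
           SUM(json_extract(payload, '$.amount')) AS total
    FROM events
    GROUP BY user
    ORDER BY user
    """
).fetchall()
print(rows)  # [('a', 15), ('b', 25)]
```

The table is structured, yet part of each row is semi-structured; variety lives inside a single column.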

So, what are the up-and-coming data infrastructure technologies working towards tackling the variety of data shapes and quality?

Storage & Cataloging

Storing various kinds of data in binary formats for different processing purposes has created a plethora of tools for ingesting, managing, and discovering datasets. A shortlist of tools is below; these are undeniably the current state of the art or the upcoming winners.

Cataloging became a clear necessity when dealing with the various kinds of storage and file formats, and the various kinds of tabular data stored in them.

Apache Iceberg on Amazon S3

With the introduction of Amazon S3 Tables, the Amazon S3 team made it clear that they are competing to become the data lake for tackling variety in tabular data.

Apache Iceberg

Apache Iceberg emerged as the winner of the Catalog Wars by focusing solely on lookup performance and schema management, while providing the most common and always-winning interface: SQL. Apache Iceberg is becoming the de-facto interface for processing semi-structured data.

Delta Lake

Based on the Lakehouse paper, Delta Lake, created by Databricks and now hosted by the Linux Foundation, is gradually turning from a competitor into an integrator for Iceberg. It offers more integrations with various processing engines and nowadays integrates with a whole variety of catalogs. Additionally, it offers lineage on datasets and transactions.

Processing

DuckDB

The idea of an embeddable OLAP database turned out to be much more than just embeddable OLAP. Its embeddable nature lets DuckDB users pick and choose where, when, and how to analyze their data. One dimension of variety is the various platforms data resides on; having the same analytical database available everywhere allows users to analyze any kind of tabular data anywhere and increases the number of possible use cases.

With its extension ecosystem thriving ever more, one can also analyze a plethora of well-known data formats and shapes, like Google Sheets and spatial data; the release of the C API for extensions will drive this even further.

Naturally, the team is working on a Lakehouse integration.

Apache DataFusion

Apache DataFusion, a subproject of Apache Arrow, has a slightly different focus: it is an extensible query engine built on Arrow. Basically, it enables you to build your own database-like processing application. You desire SQL and DataFrame interfaces for your upcoming data infrastructure project? No problem, DataFusion has got your back!

We are seeing some emerging databases built on top of Apache DataFusion, or creating their processing capabilities with it, e.g. InfluxDB and GreptimeDB.

Conclusion

Variety has become the de-facto focus of the Big Data industry, be it processing, storage, or cataloging. We are moving step by step toward more and more tools that tackle the various data shapes and forms flooding our systems.

We are entering an era of single-node or single-domain optimized workloads in analytics. There are many more tools supporting this cause. We predict we will see more domain-specific analytical, and even transactional, databases to tackle the variety problem.
