Tuesday, September 24, 2024

Consider using Polars whenever you can

When working with large datasets, tools like Spark, Iceberg, and Trino have become popular in many data engineering workflows. They are robust, scalable, and capable of handling complex operations across distributed systems. However, there are cases where these heavyweight solutions are not the best fit, especially in smaller environments or standalone systems. Enter pola.rs, a powerful alternative that offers speed and efficiency, particularly when the data fits on a single machine.



Traditional architectures and their bottlenecks
Let’s consider an architecture like Spark running in standalone mode on a single machine. Suppose we have data in a data lake (perhaps a CSV or Parquet file produced by another system). We load it into Spark to clean and process the data before analysis. Spark, though extremely powerful, comes with trade-offs, especially in a standalone setup.
In that mode, Spark is limited to a single machine, no matter how powerful that machine is. This limits its scalability and overall efficiency when working with large datasets. Furthermore, the initial setup and resource overhead of Spark can be overkill for simple data transformations or cleaning tasks. That’s where pola.rs shines.

Why pola.rs is a Better Fit for this scenario
pola.rs, the Rust-based DataFrame library, is designed for speed and ease of use. Compared to Spark in standalone mode, pola.rs can outperform it significantly, particularly for tasks that don't require distributed processing. Imagine you're loading a large CSV or Parquet file with millions of rows from a data lake for cleaning.


Instead of starting a Spark cluster, which consumes a considerable amount of memory and time just to get going, you could use Polars to process the data much faster on a single machine.
Polars builds on Rust’s zero-cost abstractions (code that is both expressive and efficient) and the Apache Arrow columnar memory format, allowing it to process data faster with fewer resources. For example, loading and cleaning data from a CSV or Parquet file can be done in a fraction of the time Spark would take, especially for mid-sized datasets.
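
To get a feel for how little ceremony is involved, here is a minimal, hypothetical sketch of such a cleaning step in Polars (the file name and column names are made up for illustration): no session to configure, no cluster to start, just a lazy scan that Polars optimizes before executing.

```python
import polars as pl

# Lazily scan the file so Polars can optimize the whole query before running it
cleaned = (
    pl.scan_parquet("events.parquet")                                   # hypothetical input file
    .filter(pl.col("user_id").is_not_null())                            # drop rows without a user
    .with_columns(pl.col("event_time").str.to_datetime(strict=False))   # parse string timestamps
    .unique(subset=["event_id"])                                        # deduplicate on the event id
    .collect()                                                          # execute the optimized plan
)
```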

 

Benchmark performed against the open-source TPC-H queries [image source: polars.rs]

Streaming with Polars
Another key area where Polars shows its strengths is streaming data. If your workflow involves importing data into a system like Microsoft SQL Server, pola.rs offers efficient ways to handle this in a streaming fashion. By leveraging Polars’ streaming APIs, you can import large datasets in chunks, avoiding the need to load everything into memory at once (an additional tool such as the bcp bulk-copy utility still handles the final load into SQL Server, but it combines well with Polars), as in the sketch below.
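
A minimal sketch of that chunked pattern, with hypothetical file and column names: the batched reader keeps only a slice of the data in memory, and each cleaned chunk is written out as a staging file that bcp (or any other bulk loader) can pick up.

```python
import polars as pl

# Read the source in batches instead of loading everything into memory at once
reader = pl.read_csv_batched("big_export.csv", batch_size=500_000)  # hypothetical source file

chunk_id = 0
while (batches := reader.next_batches(10)) is not None:
    for batch in batches:
        # Light per-chunk cleanup, then hand the chunk over to bcp (or another bulk loader)
        batch = batch.filter(pl.col("amount") > 0)
        batch.write_csv(f"staging/chunk_{chunk_id:05d}.csv")
        chunk_id += 1
```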

In addition, Polars can be used for ETL workflows, transforming data efficiently as it streams from source to destination, as sketched below. Keep in mind, however, that it is not a streaming-first system like Kafka or Flink.
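
As an illustration of that ETL shape (source pattern, columns, and destination path are all hypothetical), a lazy Polars pipeline can scan the source, transform it, and sink the result without ever materializing the full dataset in memory:

```python
import polars as pl

# Lazy pipeline: nothing is read until sink_parquet runs, and the streaming engine
# processes the data in chunks rather than loading it all at once
(
    pl.scan_csv("raw/orders_*.csv")                        # hypothetical source files
    .filter(pl.col("status") == "completed")               # keep only completed orders
    .group_by("customer_id")
    .agg(pl.col("total").sum().alias("lifetime_value"))    # aggregate per customer
    .sink_parquet("curated/customer_totals.parquet")       # hypothetical destination
)
```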


When to use pola.rs?

When should you consider using pola.rs over heavier solutions like Spark?

  1. Standalone Environments: If you’re working with Spark in standalone mode, you might find that Polars is faster and more efficient for cleaning and preprocessing your data. In scenarios where the scale doesn’t justify the complexity of a Spark cluster, Polars is a great fit.
  2. Mid-sized Datasets: Polars excels with datasets that fit on a single machine but are too large to handle comfortably with Pandas.
  3. Streaming Workflows: If your data ingestion involves streaming into systems like Microsoft SQL Server or handling near-real-time data, Polars can manage this efficiently in chunks.


Downsides to Consider
While Polars is fantastic for many use cases, it’s not without some limitations:

  1. Lack of Distributed Processing: Unlike Spark, which can scale across many machines in a cluster, Polars is designed for single-machine workloads. If your data is truly massive and requires distributed processing, Spark or a similar framework may still be the better choice.
  2. Ecosystem: Polars, while powerful, does not yet have the rich ecosystem of plugins and integrations that Spark has. If your project relies on many external libraries or data sources, you may end up writing more custom code with Polars.
  3. Big data table formats: there is no native support for table formats like Apache Iceberg, Paimon, etc., so the data needs to be converted first. In practice this is often not a real downside: Polars can sit between the raw Parquet files and the Iceberg write, and the final step can be handled, in a streaming or batching fashion, by Spark's Iceberg integration or by PyIceberg, as in the sketch after this list.
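
As a rough illustration of that handoff (the catalog name, table identifier, and file path are made up, and this assumes a PyIceberg version with write support), Polars does the cleaning and PyIceberg performs the Iceberg write:

```python
import polars as pl
from pyiceberg.catalog import load_catalog

# Clean the raw file with Polars, then hand an Arrow table to PyIceberg for the write
df = (
    pl.scan_parquet("landing/orders.parquet")   # hypothetical raw file in the data lake
    .filter(pl.col("order_id").is_not_null())
    .collect()
)

catalog = load_catalog("my_catalog")            # hypothetical catalog configured for PyIceberg
table = catalog.load_table("sales.orders")      # hypothetical namespace.table
table.append(df.to_arrow())                     # append the Arrow data to the Iceberg table
```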


For many data engineering tasks, especially those involving standalone machines, Polars offers a faster, more lightweight alternative to Spark and other traditional big data frameworks. It reduces overhead, offers excellent performance, and is ideal for working with moderately sized datasets. If you haven't yet explored Polars, now might be the time to give it a try and see if it fits your next project better than the heavier options in your current stack.

Wednesday, September 4, 2024

How (NOT) to upgrade Spark

The upgrade to Spark 3.3 was finished a couple of months ago. Everyone was happy about the new features and performance improvements. But as time goes by, new vulnerabilities are raised by the scanner. Oh, wait a minute, right... those are Spark dependencies. Time to upgrade Spark. Soon.



There are a couple of options. There is Spark 3.4, stable, with features that some teams are waiting for. There is also Spark 3.5, still in an alpha version (for the time being). Easy choice. Going with version 3.4.

Is there a spark-iceberg-runtime ready yet? Phew, yes, there is, let's use it. A couple of adjustments here and there, upgrade a couple of other components, resolve a versioning conflict and "BAM!", we have it. At least I think we have. Let's run those e2e tests.



Oh no... a couple of errors. Not a big deal. Timestamps without time zones that used to be TimestampType are now TimestampNTZType, some fixtures changed, need to adjust. More or less. A couple of other adjustments, let's run again, green this time.
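
The kind of fixture change involved looks roughly like this (a hypothetical schema, shown only to illustrate the type swap):

```python
from pyspark.sql.types import StructType, StructField, StringType, TimestampNTZType

# The event_time column used to be declared as TimestampType in the test fixtures;
# with Spark 3.4+ the timezone-less values come through as TimestampNTZType instead
expected_schema = StructType([
    StructField("event_id", StringType(), nullable=False),
    StructField("event_time", TimestampNTZType(), nullable=True),
])
```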

Yes, we have it. Ready to deploy, ready to celebrate. But has anyone tested it on prod yet? OK, maybe just the main feature. All green...

...wait a sec... no it's not, there is an OOM, but how can that be, it was working with 3.3?

Since the upgrade started, a new Spark version has come out: 3.5.1. That should be stable enough... Trying that out. Maybe it will also bring some interesting new features. Again, adjusting the code, building, same problem. Only that one point in the code: maybe we can optimize it somehow so it doesn't throw OOM? Probably we can...


Many companies that implement their own data processing pipelines have to deal with this problem. It's a trade-off: new versions bring new features, but they can also be more fragile under a particular constellation of parameters that worked correctly in previous versions.

Another point is the proper use of Spark itself. Avoid sprinkling unnecessary checkpoint, cache, and persist calls here and there. Be aware of the limits on query plan length and, if possible, restructure the job, as in the sketch below.
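
A hypothetical sketch of that idea in PySpark (paths and column names are invented): let the plan grow where it must, but cut the lineage once, deliberately, instead of persisting at every step.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("lineage-demo").getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/checkpoints")      # hypothetical checkpoint location

df = spark.read.parquet("input/events")                      # hypothetical input

# Many chained transformations keep growing the logical plan
for i in range(50):
    df = df.withColumn(f"feature_{i}", F.col("value") * i)

# One deliberate checkpoint truncates the lineage instead of caching at every step
df = df.checkpoint()

df.write.mode("overwrite").parquet("output/features")        # hypothetical output
```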


Upgrading Spark is always a journey, full of both challenges and rewards. We've seen firsthand how new versions can bring exciting features and performance boosts, but also the occasional hurdle, like the unexpected OOM errors. However, with each upgrade, Spark continues to evolve, improving stability and expanding its capabilities.

Let’s approach this upgrade with the care it deserves—starting with thorough testing on smaller pipelines to ensure a smooth transition.

On August 10, 2024, a new version of Spark was released: 3.5.2. It brings improvements in Parquet timestamp handling, preventing OOM in some cases, along with many other fixes, making it worth considering. Spark is constantly improving.

By staying on the same Spark version we lose the edge; we do not evolve.

By staying current, we’re not only solving today’s problems but also positioning ourselves for success as Spark continues to grow. Spark 4 is just around the corner, and being ready for it means embracing these updates now, leveraging the collective experience of the community, and preparing our systems for the future.

Let's move forward with confidence, knowing that each step brings us closer to a more robust, efficient, and future-proof platform.