Wednesday, September 4, 2024

How (NOT) to upgrade Spark

The upgrade to Spark 3.3 finished a couple of months ago. Everyone was happy with the new features and performance improvements. But as time goes by, new vulnerabilities are flagged by the scanner. Oh, wait a minute, right... those are Spark dependencies. Time to upgrade Spark. Soon.



There are a couple of options. There is Spark 3.4, stable, with features that some teams are waiting for. There was also Spark 3.5, in alpha (for the time being). Easy choice. Going with version 3.4.

Is there an iceberg-spark-runtime ready yet? Uff, yes, there is, let's use it. A couple of adjustments here and there, upgrade a couple of other components, resolve a versioning conflict and "BAM!", we have it. At least I think we have it. Let's run those e2e tests.
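For context, wiring the Iceberg runtime into a Spark 3.4 session can look roughly like the sketch below. The artifact version and app name are assumptions, not what the team actually used; match the iceberg-spark-runtime build to your Spark and Scala versions.

```python
from pyspark.sql import SparkSession

# A rough sketch of attaching the Iceberg runtime to a Spark 3.4 session.
# The version (1.5.2) and app name are illustrative assumptions.
spark = (
    SparkSession.builder
    .appName("iceberg-upgrade-check")  # hypothetical app name
    .config(
        "spark.jars.packages",
        "org.apache.iceberg:iceberg-spark-runtime-3.4_2.12:1.5.2",
    )
    .config(
        "spark.sql.extensions",
        "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions",
    )
    .getOrCreate()
)
```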



Oh no... a couple of errors. Not a big deal. TimestampType without timezones is now TimestampNTZType, some fixtures changed, need to adjust. More or less. A couple of other adjustments, let's run again, green this time.
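A minimal sketch of the kind of fixture change involved: since Spark 3.4, a timestamp without a timezone gets its own type, TimestampNTZType, instead of being folded into TimestampType. The column name and sample value here are made up.

```python
from datetime import datetime

from pyspark.sql import SparkSession
from pyspark.sql.types import StructField, StructType, TimestampNTZType

spark = SparkSession.builder.getOrCreate()

# Since Spark 3.4, a timezone-less timestamp is its own type.
schema = StructType([StructField("event_time", TimestampNTZType(), True)])
df = spark.createDataFrame([(datetime(2024, 9, 4, 12, 0),)], schema)

df.printSchema()
# root
#  |-- event_time: timestamp_ntz (nullable = true)
```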

Yes, we have it. Ready to deploy, ready to celebrate. But did anyone test it on prod yet? Ok, maybe just the main feature. All green...

...wait a sec... no, it's not. There is an OOM, but how can that be, it was working with 3.3?

Since the upgrade started, a new Spark version came out: 3.5.1. That should be stable enough... Trying that out. Maybe it will also bring some interesting new features. Again, adjusting the code, building, same problem. It's only that one point in the code, maybe we can optimize it somehow so it doesn't throw an OOM? Probably we can... (one common mitigation is sketched below).
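The offending operation isn't named here, so this is only a hedged sketch of two frequent driver-side OOM fixes: disabling automatic broadcast joins and iterating over results instead of collecting them. The configuration key is a real Spark setting; the DataFrame is a placeholder.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A plan that broadcasts a now-larger table is a classic post-upgrade
# OOM source; -1 disables automatic broadcast joins entirely.
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

# Placeholder for the offending dataset; avoid collect(), which pulls
# everything onto the driver, and iterate lazily instead.
df = spark.range(10_000_000)
for row in df.toLocalIterator():
    pass  # process one row at a time without materializing the whole result
```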


Many companies that implement their own data processing pipelines have to deal with this problem. It's a trade-off. New versions bring new features, but they can be more fragile under a constellation of parameters that worked correctly in previous versions.

Another point is the proper use of Spark. Avoid sprinkling unnecessary checkpoint(), cache() and persist() calls here and there. Be aware of how long the query plan can grow, and do things another way if possible (see the sketch below).
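For example, stacking transformations in a loop grows the logical plan on every iteration; very long plans slow the optimizer down and can exhaust driver memory. A minimal sketch, assuming a hypothetical iterative job; the loop, column name, checkpoint directory, and cadence are all illustrative.

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
spark.sparkContext.setCheckpointDir("/tmp/spark-checkpoints")  # illustrative path

df = spark.range(1_000_000).withColumnRenamed("id", "value")

# Each iteration stacks another projection onto the logical plan;
# after enough iterations the plan itself becomes a bottleneck.
for i in range(50):
    df = df.withColumn("value", F.col("value") + 1)
    # Truncate lineage only every tenth iteration: sparing, deliberate
    # checkpoints, not the scattered ones the paragraph above warns against.
    if i % 10 == 9:
        df = df.checkpoint()
```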


Upgrading Spark is always a journey, full of both challenges and rewards. We've seen firsthand how new versions can bring exciting features and performance boosts, but also the occasional hurdle, like the unexpected OOM errors. However, with each upgrade, Spark continues to evolve, improving stability and expanding its capabilities.

Let’s approach this upgrade with the care it deserves—starting with thorough testing on smaller pipelines to ensure a smooth transition.

On Aug 10, 2024, a new version of Spark was released: 3.5.2. It brings improvements in Parquet timestamp handling, preventing OOM in some cases, as well as many other fixes, making it a worthy consideration. Spark is constantly improving.

By staying on the same Spark version we lose the edge; we do not evolve.

By staying current, we’re not only solving today’s problems but also positioning ourselves for success as Spark continues to grow. Spark 4 is just around the corner, and being ready for it means embracing these updates now, leveraging the collective experience of the community, and preparing our systems for the future.

Let's move forward with confidence, knowing that each step brings us closer to a more robust, efficient, and future-proof platform.


