Monday, May 4, 2026

Do You Really Need Horizontal Scaling?

Amid the hype around distributed systems, it's tempting to scale out right away. Spinning up Spark, Flink, or Dask as a processing engine is often straightforward.

But should we design for clusters from the beginning, or start simpler and evolve the system over time?

To scale effectively, we need to understand the data volume, because horizontal scaling can often be overkill.

Scaling is not free. The more distributed a system becomes, the higher its complexity. More components mean more failure points, and operational cost rises.

Vertical scaling is underrated in my opinion.

The newest machines have dozens or even hundreds of cores, and similarly generous amounts of RAM.

The “just add more nodes” approach is often the default assumption. But what does adding more nodes really mean? Network overhead, serialization, and harder debugging.

[Image: Spark data scaling, horizontal and vertical]

Take batch ETL processing as an example.


Spark runs in two main execution modes: 

  • local mode (driver and executors on a single machine)
  • cluster mode (executors distributed across nodes)
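
Here is a minimal PySpark sketch of such a batch job. It is an illustration only; the paths and column names (orders, order_id, amount, created_at) are hypothetical placeholders:

from pyspark.sql import SparkSession, functions as F

# Local mode: driver and executors share one machine ("local[*]" = all cores).
spark = (
    SparkSession.builder
    .appName("etl-batch")
    .master("local[*]")
    .getOrCreate()
)

orders = spark.read.parquet("data/orders")                     # extract
cleaned = (
    orders
    .dropDuplicates(["order_id"])                              # transform: dedupe
    .filter(F.col("amount") > 0)                               # drop invalid rows
    .withColumn("order_date", F.to_date("created_at"))
)
cleaned.write.mode("overwrite").parquet("data/orders_clean")   # load
spark.stop()

The same code runs unchanged in cluster mode: drop the .master("local[*]") line and submit the job via spark-submit, pointing --master at the cluster manager.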

 

Reality Check on Horizontal Scaling

In distributed systems, moving data between partitions introduces a cost known as shuffling.

Shuffling is required to redistribute data properly and balance the workload across partitions. It also occurs in single-node systems (Spark still operates with a logical distributed execution model, even when running locally). However, in cluster environments, shuffle becomes significantly more expensive due to network communication and coordination overhead.
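
As a hedged illustration (reusing the hypothetical orders DataFrame from the sketch above), a simple aggregation is enough to trigger a shuffle, and explain() makes it visible as an Exchange in the physical plan:

from pyspark.sql import functions as F

# groupBy repartitions rows by key across partitions: that is the shuffle.
totals = orders.groupBy("customer_id").agg(F.sum("amount").alias("total"))

# The physical plan contains an Exchange (shuffle) node, e.g.
# Exchange hashpartitioning(customer_id, 200)
totals.explain()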

Technical cost

  • network latency and data shuffling; in many cases, Spark performance is dominated more by shuffle behavior and data layout than by raw compute
  • serialization/deserialization (which can also happen on a single node, for example through spill to local disk)

Operational cost

  • Kubernetes, cluster management
  • monitoring, observability
  • deployments

Cognitive cost

  • debugging distributed systems
  • tracing
  • consistency issues

However, a single large machine is not a universal solution either. This brings us to the trade-offs.

With vertical scaling we get simplicity, lower latency, and easier debugging.

However, vertical scaling is ultimately constrained by hardware limits.

At a certain point, the data size or workload characteristics simply exceed what a single machine can handle. The question is: are you actually at that point yet?

As Microsoft researchers point out:

"... the majority of analytics jobs do not process huge data sets. For example, at least two analytics production clusters (at Microsoft and Yahoo) have median job input sizes under 14 GB"

and:

"... the majority of real-world analytic jobs process less than 100 GB of input, but popular infrastructures such as Hadoop/MapReduce were originally designed for petascale processing."

from: Scale-up vs Scale-out for Hadoop: Time to rethink?

That paper is from 2013, and we deal with far more data today, but not always and not everywhere: I still come across sub-100 GB workloads in current systems.


Now let's move to horizontal scaling: we gain scalability and fault tolerance (a failed task can be rescheduled on another executor).

For some systems it's the only option, and then we have to factor in the added complexity and operational overhead.

Making the decision:

Are we CPU-bound? -> scale vertically -> tune jobs -> then scale horizontally if that does not help

Tuning jobs: data organization (e.g., Iceberg partitioning, when using Iceberg) often has a higher impact than scaling compute; a sketch follows below.
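
For instance, a hedged sketch of such data organization, reusing the spark session from the sketch above; the table and column names are hypothetical, and the DDL assumes an Iceberg catalog is already configured in the Spark session:

# Partitioning by month of order_date lets queries that filter on the date
# prune whole files instead of scanning the entire table.
spark.sql("""
    CREATE TABLE catalog.db.orders (
        order_id    BIGINT,
        customer_id BIGINT,
        amount      DECIMAL(10, 2),
        order_date  DATE
    )
    USING iceberg
    PARTITIONED BY (months(order_date))
""")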

Are we really at petabyte scale? -> scale horizontally, as a single machine will probably not be effective

SLA (Service Level Agreement) requirements for low latency or high availability -> scale horizontally.

Does the operational cost align with business expectations? Maybe a longer-running job overnight is also tolerable. What is the cost per job?

I would rather not scale out in an early-stage system or with up to a couple of TB of data.
I would rather scale out for massive datasets, HA requirements, or when dealing with streaming or real-time workloads.

Even if we need to think about scaling from the beginning, it is still better to start simple and scale later. But always measure first.
It is useful to understand tools like autoscalers and Kubernetes taints & tolerations (in environments such as EC2 nodes on EKS), but it is equally important to know when not to use them.

As always, it is not only CPU and memory that matter, but also I/O bottlenecks. A larger VM does not always translate into linear performance gains. And scaling is a decision, not a default.


Tuesday, October 29, 2024

Evaluation of role playing in LLM systems based on ChatGPT


Many of us have already had, or will have, an interaction with some form of Large Language Model (LLM), whether we want it or not.

Direct or indirect. 

It could be a chatbot, an email we just received, a recipe, detailed guidance on a technical programming problem, or general life advice.

When we ask a question, ChatGPT appears to us as a "wise machine" which can, in theory, help with everything. Of course, the help (advice, summaries, explanations, guides, suggestions...) comes not in physical form but as text, images, or in some cases video, which does not make it any more or less important.

 

An AI system that generates dialogues, a chatbot. Is that what it is? Well, I think there is more.

 

It is not only a system prepared to give us answers; it was instructed to behave in one way or another, to provide only high-quality answers.

 

Have you noticed ChatGPT almost never answers in a rude way?

 

Even if I have a bad day and don't write "please" at the end :) And I've already experienced otherwise with other, less refined models. That suggests ChatGPT was instructed, by default, to give us nice, high-quality answers. To make it a better product. To act as an omniscient being that likes to share knowledge.

But that means it is already playing a role: the one just described. And that would mean we can ask ChatGPT to alter the role a little, or even preserve the "defaults" AND at the same time be a debate opponent, a negotiator, a character in a story, an investigator, or a diplomat.

So now ChatGPT, keeping all the skills from before, can also take on a specific stance or personality.

Examples? How will I use it? Coming right up. 

Inspired by various posts about what I can use ChatGPT for, I came up with an idea. Let’s play roles: ChatGPT will be the interviewer for my current position, and I need to pass the interview. I’ll formulate it like this:

“Act as an interviewer for a Senior Engineer role and ask me questions. Then tell me if I got the job or not.”

Tuesday, September 24, 2024

Consider using Polars whenever you can

When working with large datasets, tools like Spark, Iceberg, and Trino have become popular in many data engineering workflows. They are robust, scalable, and capable of handling complex operations across distributed systems. However, these heavyweight solutions are not always the best fit, especially in smaller environments or standalone systems. Enter pola.rs, a powerful alternative that offers speed and efficiency, particularly for single-machine workloads.



Traditional architectures and their bottlenecks
Let's consider an architecture like Spark working in standalone mode on a single machine. Suppose we have data in a data lake (perhaps a CSV or Parquet file from another system), and we load it into Spark to clean and process it before analysis. Spark, though extremely powerful, comes with trade-offs, especially in a standalone setup.
In that mode, Spark is limited to a single machine, no matter how powerful that machine is. This limits its scalability and overall efficiency when working with large datasets. Furthermore, the initial setup and resource overhead of Spark can be overkill for simple data transformation or cleaning tasks. That's where pola.rs shines.

Why pola.rs is a better fit for this scenario
pola.rs, the Rust-based DataFrame library, is designed for speed and ease of use. Compared to Spark, especially in standalone environments, pola.rs can outperform it significantly, particularly for tasks that don't require distributed processing. Imagine you're loading a large CSV or Parquet file with millions of rows from a data lake.


Instead of starting a Spark cluster, which consumes a considerable amount of memory and time just to get going, you could use Polars to process the data much faster on a single machine.
Polars uses Rust's zero-cost abstractions (code that is both expressive and efficient) and efficient memory management, allowing it to process data faster with fewer resources. For example, loading and cleaning data from a CSV or Parquet file can be done in a fraction of the time Spark would take, especially for mid-sized datasets.
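
A minimal Polars sketch of such a load-and-clean step; the path and column names are hypothetical, and created_at is assumed to be a datetime column:

import polars as pl

# Lazily scan the file: nothing is read until collect(), so filters and
# projections can be pushed down into the scan.
cleaned = (
    pl.scan_parquet("data/orders.parquet")
    .filter(pl.col("amount") > 0)                              # drop invalid rows
    .unique(subset=["order_id"])                               # dedupe
    .with_columns(pl.col("created_at").cast(pl.Date).alias("order_date"))
    .collect()
)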

 

[Image: Benchmark performed against TPC-H, open source; img source: polars.rs]

Streaming with Polars
Another key area where Polars shows its abilities is working with streaming data. If your workflow involves importing data into a system like Microsoft SQL Server, pola.rs offers efficient ways to handle this in a streaming fashion. By leveraging Polars' streaming APIs, you can import large datasets in chunks, avoiding the need to load everything into memory at once (of course, an additional tool like the efficient bcp utility will also be needed, but combining it with Polars is possible!).

In addition, Polars can also be used for ETL workflows, efficiently transforming data as it streams from source to destination. However, we have to keep in mind that it is not a streaming-first system like Kafka or Flink.
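
A hedged sketch of the chunked approach (the file names are hypothetical, and the exact streaming API varies between Polars versions):

import polars as pl

# sink_parquet executes the lazy query with the streaming engine,
# processing the input in chunks instead of loading it all into memory.
(
    pl.scan_csv("data/big_input.csv")
    .filter(pl.col("amount") > 0)
    .sink_parquet("data/big_output.parquet")
)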


When to use pola.rs?

When should you consider using pola.rs over heavier solutions like Spark?

  1. Standalone Environments: If you're working with Spark standalone, you might find that Polars is faster and more efficient for cleaning and preprocessing your data. In scenarios where the scale doesn't justify the complexity of a Spark cluster, Polars is a great fit.
  2. Mid-sized Datasets: Polars excels when working with datasets that fit comfortably on a single machine but are too large for in-memory tools like Pandas.
  3. Streaming Workflows: If your data ingestion involves streaming into systems like Microsoft SQL Server or handling real-time data, Polars can be an optimal choice for managing this efficiently.


Downsides to Consider
While Polars is fantastic for many use cases, it’s not without some limitations:

  1. Lack of Distributed Processing: Unlike Spark, which can scale across many machines in a cluster, Polars is designed for single-machine workloads. If your data is truly massive and requires distributed processing, Spark or a similar framework may still be the better choice.
  2. Ecosystem: Polars, while powerful, does not yet have the rich ecosystem of plugins and integrations that Spark has. If your project relies on many external libraries or data sources, you may end up writing more custom code with Polars.
  3. Big data table formats: There is no native support for formats like Apache Iceberg, Paimon, etc., so a conversion step is needed first. But since Polars can sit between a Parquet file and a write into an Iceberg table, this is often no real downside in those scenarios (we need to convert either way, and can use a streaming/batching approach with Spark's Iceberg support, or PyIceberg); see the sketch below.
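
A hedged sketch of that hand-off, assuming a recent PyIceberg version (which supports appending Arrow tables), a configured catalog named "default", and an existing target table; all names are hypothetical:

import polars as pl
from pyiceberg.catalog import load_catalog

# Clean with Polars, then hand the result to PyIceberg as an Arrow table.
df = pl.read_parquet("data/orders_clean.parquet")
catalog = load_catalog("default")           # assumes a configured catalog
table = catalog.load_table("db.orders")     # assumes the table exists
table.append(df.to_arrow())                 # appends the Arrow data to Iceberg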


For many data engineering tasks, especially those running on standalone machines, Polars offers a faster, more lightweight alternative to Spark and other traditional big data frameworks. It reduces overhead, delivers excellent performance, and is ideal for moderately sized datasets. If you haven't yet explored Polars, now might be the time to give it a try and see if it fits your next project better than the heavier options in your current stack.