Monday, May 4, 2026

Do You Really Need Horizontal Scaling?

Amid the hype around distributed systems, it's tempting to scale out right away. Spinning up Spark, Flink, or Dask as a processing engine is often straightforward.

But should we design for clusters from the beginning, or start simpler and evolve the system over time?

To scale effectively, we need to understand the data volume, because horizontal scaling can often be overkill.

Scaling is not free. The more distributed a system becomes, the higher its complexity. More components mean more failure points, and operational cost rises.

Vertical scaling is underrated in my opinion.

The newest machines have dozens or even hundreds of cores, and similarly generous amounts of RAM.

The “just add more nodes” approach is often the default assumption. But what does adding more nodes really mean? Network overhead, serialization, and harder debugging.

[Image: Spark data scaling, horizontal and vertical]

Take batch ETL processing as an example.


Spark runs in two main execution modes: 

  • local mode (driver and executors on a single machine)
  • cluster mode (executors distributed across nodes)
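
Here is a minimal PySpark sketch of such a batch job. It is an illustration only; the paths and column names (orders, order_id, amount, created_at) are hypothetical placeholders:

from pyspark.sql import SparkSession, functions as F

# Local mode: driver and executors share one machine ("local[*]" = all cores).
spark = (
    SparkSession.builder
    .appName("etl-batch")
    .master("local[*]")
    .getOrCreate()
)

orders = spark.read.parquet("data/orders")                     # extract
cleaned = (
    orders
    .dropDuplicates(["order_id"])                              # transform: dedupe
    .filter(F.col("amount") > 0)                               # drop invalid rows
    .withColumn("order_date", F.to_date("created_at"))
)
cleaned.write.mode("overwrite").parquet("data/orders_clean")   # load
spark.stop()

The same code runs unchanged in cluster mode: drop the .master("local[*]") line and submit the job via spark-submit, pointing --master at the cluster manager.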

 

Reality Check on Horizontal Scaling

In distributed systems, moving data between partitions introduces a cost known as shuffling.

Shuffling is required to redistribute data properly and balance the workload across partitions. It also occurs in single-node systems (Spark still operates with a logical distributed execution model, even when running locally). However, in cluster environments, shuffle becomes significantly more expensive due to network communication and coordination overhead.
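
As a hedged illustration (reusing the hypothetical orders DataFrame from the sketch above), a simple aggregation is enough to trigger a shuffle, and explain() makes it visible as an Exchange in the physical plan:

from pyspark.sql import functions as F

# groupBy repartitions rows by key across partitions: that is the shuffle.
totals = orders.groupBy("customer_id").agg(F.sum("amount").alias("total"))

# The physical plan contains an Exchange (shuffle) node, e.g.
# Exchange hashpartitioning(customer_id, 200)
totals.explain()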

Technical cost

  • network latency and data shuffling; in many cases, Spark performance is dominated more by shuffle behavior and data layout than by raw compute
  • serialization/deserialization (which can also happen on a single node, for example through spill to local disk)

Operational cost

  • Kubernetes, cluster management
  • monitoring, observability
  • deployments

Cognitive cost

  • debugging distributed systems
  • tracing
  • consistency issues

However, a single large machine is not a universal solution either. This brings us to the trade-offs.

With vertical scaling we get simplicity, lower latency, and easier debugging.

However, vertical scaling is ultimately constrained by hardware limits.

At a certain point, the data size or workload characteristics simply exceed what a single machine can handle. The question is: are you actually at that point yet?

As Microsoft researchers point out:

"... the majority of analytics jobs do not process huge data sets. For example, at least two analytics production clusters (at Microsoft and Yahoo) have median job input sizes under 14 GB"

and:

"... the majority of real-world analytic jobs process less than 100 GB of input, but popular infrastructures such as Hadoop/MapReduce were originally designed for petascale processing."

from: Scale-up vs Scale-out for Hadoop: Time to rethink?

That paper is from 2013, and we deal with far more data today, but not always and not everywhere: I still come across sub-100 GB workloads in current systems.


Now let's move to horizontal scaling: we gain scalability and fault tolerance (a failed task can be rescheduled on another executor).

For some systems it's the only option, and then we have to factor in the added complexity and operational overhead.

Making the decision:

Are we CPU-bound? -> scale vertically -> tune jobs -> then scale horizontally if that does not help

Tuning jobs: data organization (e.g., Iceberg partitioning, when using Iceberg) often has a higher impact than scaling compute; a sketch follows below.
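
For instance, a hedged sketch of such data organization, reusing the spark session from the sketch above; the table and column names are hypothetical, and the DDL assumes an Iceberg catalog is already configured in the Spark session:

# Partitioning by month of order_date lets queries that filter on the date
# prune whole files instead of scanning the entire table.
spark.sql("""
    CREATE TABLE catalog.db.orders (
        order_id    BIGINT,
        customer_id BIGINT,
        amount      DECIMAL(10, 2),
        order_date  DATE
    )
    USING iceberg
    PARTITIONED BY (months(order_date))
""")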

Are we really at petabyte scale? -> scale horizontally, as a single machine will probably not be effective

SLA (Service Level Agreement) requirements for low latency or high availability -> scale horizontally.

Does the operational cost align with business expectations? Maybe a longer-running job overnight is also tolerable. What is the cost per job?

I would rather not scale out in an early-stage system or with up to a couple of TB of data.
I would rather scale out for massive datasets, HA requirements, or when dealing with streaming or real-time workloads.

Even if we need to think about scaling from the beginning, it is still better to start simple and scale later. But always measure first.
It is useful to understand tools like autoscalers and Kubernetes taints & tolerations (in environments such as EC2 nodes on EKS), but it is equally important to know when not to use them.

As always, it is not only CPU and memory that matter, but also I/O bottlenecks. A larger VM does not always translate into linear performance gains. And scaling is a decision, not a default.


Tuesday, October 29, 2024

Evaluation of role playing in LLM systems based on ChatGPT


Many of us have already had, or will have, an interaction with some form of Large Language Model (LLM), whether we want it or not.

Direct or indirect. 

It could be a chatbot, an email we just received, a recipe, detailed guidance on a technical programming problem, or general life advice.

When we ask a question, ChatGPT appears to us as a "wise machine" which can, in theory, help with everything. Of course, the help (advice, summaries, explanations, guides, suggestions...) comes not in physical form but as text, images, or in some cases video, which does not make it any more or less important.

 

An AI system that generates dialogues, a chatbot. Is that what it is? Well, I think there is more.

 

It is not only a system prepared to give us answers; it was instructed to behave in one way or another, to provide only high-quality answers.

 

Have you noticed ChatGPT almost never answers in a rude way?

 

Even if I have a bad day and don't write "please" at the end :) And I've already experienced otherwise with other, less refined models. That suggests ChatGPT was instructed, by default, to give us nice, high-quality answers. To make it a better product. To act as an omniscient being that likes to share knowledge.

But that means it is already playing a role: the one just described. And that would mean we can ask ChatGPT to alter the role a little, or even preserve the "defaults" AND at the same time be a debate opponent, a negotiator, a character in a story, an investigator, or a diplomat.

So now ChatGPT, keeping all the skills from before, can also take on a specific stance or personality.

Examples? How will I use it? Coming right up. 

Inspired by various posts about what I can use ChatGPT for, I came up with an idea. Let’s play roles: ChatGPT will be the interviewer for my current position, and I need to pass the interview. I’ll formulate it like this:

“Act as an interviewer for a Senior Engineer role and ask me questions. Then tell me if I got the job or not.”

Tuesday, September 24, 2024

Consider using Polars whenever you can

When working with large datasets, tools like Spark, Iceberg, and Trino have become popular in many data engineering workflows. They are robust, scalable, and capable of handling complex operations across distributed systems. However, these heavyweight solutions are not always the best fit, especially in smaller environments or standalone systems. Enter pola.rs, a powerful alternative that offers speed and efficiency, particularly for single-machine workloads.



Traditional architectures and their bottlenecks
Let's consider an architecture like Spark working in standalone mode on a single machine. Suppose we have data in a data lake (perhaps a CSV or Parquet file from another system), and we load it into Spark to clean and process it before analysis. Spark, though extremely powerful, comes with trade-offs, especially in a standalone setup.
In that mode, Spark is limited to a single machine, no matter how powerful that machine is. This limits its scalability and overall efficiency when working with large datasets. Furthermore, the initial setup and resource overhead of Spark can be overkill for simple data transformation or cleaning tasks. That's where pola.rs shines.

Why pola.rs is a better fit for this scenario
pola.rs, the Rust-based DataFrame library, is designed for speed and ease of use. Compared to Spark, especially in standalone environments, pola.rs can outperform it significantly, particularly for tasks that don't require distributed processing. Imagine you're loading a large CSV or Parquet file with millions of rows from a data lake.


Instead of starting a Spark cluster, which consumes a considerable amount of memory and time just to get going, you could use Polars to process the data much faster on a single machine.
Polars uses Rust's zero-cost abstractions (code that is both expressive and efficient) and efficient memory management, allowing it to process data faster with fewer resources. For example, loading and cleaning data from a CSV or Parquet file can be done in a fraction of the time Spark would take, especially for mid-sized datasets.
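
A minimal Polars sketch of such a load-and-clean step; the path and column names are hypothetical, and created_at is assumed to be a datetime column:

import polars as pl

# Lazily scan the file: nothing is read until collect(), so filters and
# projections can be pushed down into the scan.
cleaned = (
    pl.scan_parquet("data/orders.parquet")
    .filter(pl.col("amount") > 0)                              # drop invalid rows
    .unique(subset=["order_id"])                               # dedupe
    .with_columns(pl.col("created_at").cast(pl.Date).alias("order_date"))
    .collect()
)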

 

[Image: Benchmark performed against TPC-H, open source; img source: polars.rs]

Streaming with Polars
Another key area where Polars shows its abilities is working with streaming data. If your workflow involves importing data into a system like Microsoft SQL Server, pola.rs offers efficient ways to handle this in a streaming fashion. By leveraging Polars' streaming APIs, you can import large datasets in chunks, avoiding the need to load everything into memory at once (of course, an additional tool like the efficient bcp utility will also be needed, but combining it with Polars is possible!).

In addition, Polars can also be used for ETL workflows, efficiently transforming data as it streams from source to destination. However, we have to keep in mind that it is not a streaming-first system like Kafka or Flink.
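
A hedged sketch of the chunked approach (the file names are hypothetical, and the exact streaming API varies between Polars versions):

import polars as pl

# sink_parquet executes the lazy query with the streaming engine,
# processing the input in chunks instead of loading it all into memory.
(
    pl.scan_csv("data/big_input.csv")
    .filter(pl.col("amount") > 0)
    .sink_parquet("data/big_output.parquet")
)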


When to use pola.rs?

When should you consider using pola.rs over heavier solutions like Spark?

  1. Standalone Environments: If you're working with Spark standalone, you might find that Polars is faster and more efficient for cleaning and preprocessing your data. In scenarios where the scale doesn't justify the complexity of a Spark cluster, Polars is a great fit.
  2. Mid-sized Datasets: Polars excels when working with datasets that fit comfortably on a single machine but are too large for in-memory tools like Pandas.
  3. Streaming Workflows: If your data ingestion involves streaming into systems like Microsoft SQL Server or handling real-time data, Polars can be an optimal choice for managing this efficiently.


Downsides to Consider
While Polars is fantastic for many use cases, it’s not without some limitations:

  1. Lack of Distributed Processing: Unlike Spark, which can scale across many machines in a cluster, Polars is designed for single-machine workloads. If your data is truly massive and requires distributed processing, Spark or a similar framework may still be the better choice.
  2. Ecosystem: Polars, while powerful, does not yet have the rich ecosystem of plugins and integrations that Spark has. If your project relies on many external libraries or data sources, you may end up writing more custom code with Polars.
  3. Big data table formats: There is no native support for formats like Apache Iceberg, Paimon, etc., so a conversion step is needed first. But since Polars can sit between a Parquet file and a write into an Iceberg table, this is often no real downside in those scenarios (we need to convert either way, and can use a streaming/batching approach with Spark's Iceberg support, or PyIceberg); see the sketch below.
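
A hedged sketch of that hand-off, assuming a recent PyIceberg version (which supports appending Arrow tables), a configured catalog named "default", and an existing target table; all names are hypothetical:

import polars as pl
from pyiceberg.catalog import load_catalog

# Clean with Polars, then hand the result to PyIceberg as an Arrow table.
df = pl.read_parquet("data/orders_clean.parquet")
catalog = load_catalog("default")           # assumes a configured catalog
table = catalog.load_table("db.orders")     # assumes the table exists
table.append(df.to_arrow())                 # appends the Arrow data to Iceberg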


For many data engineering tasks, especially those running on standalone machines, Polars offers a faster, more lightweight alternative to Spark and other traditional big data frameworks. It reduces overhead, delivers excellent performance, and is ideal for moderately sized datasets. If you haven't yet explored Polars, now might be the time to give it a try and see if it fits your next project better than the heavier options in your current stack.