Humans of the Data Sphere Issue #10 April 4th 2025
Your biweekly dose of insights, observations, commentary, and opinions from interesting people in the world of databases, AI, streaming, distributed systems, and the data engineering/analytics space.
Welcome to Humans of the Data Sphere issue #10!
sometimes the economy needs to scale to zero—Sam Lambert
The pursuit of simplicity is not about achieving a static, minimalistic design from the outset but involves a dynamic process of learning, adapting, and refining—Paraphrasing Andy Warfield’s post on simplicity
Quotable Humans
Kelly Sommers: Every time we separate compute from storage, we bring it back together it seems. And then we do it again a decade later.
355e3b: something I’ve noticed while cloud lead at an o11y vendor (Instana) and now doing security tools is that most teams don’t have a model of COGS when trying to get their cloud bill under control.
Charity Majors: Psychological safety is NOT about lack of disagreement. Psychological safety REQUIRES: * disagreement and debate * setting standards for behavior and performance, and enforcing them * telling people things they don't want to hear * courage, from the bottom up * humility, from the top down
Charity Majors: Corollary: when we are crafting sociotechnical tools and systems, we should focus on making them usable (and powerful) in the hands of normal, fallible, embodied people. This is one of the core insights of platform engineering...applying product and design thinking to technical systems.
Marc Brooker (on the continued debates around fsync guarantees): In my mind, there are two real take-aways from issues like this. First, abstractions like POSIX (and even Linux specifically) are making it harder for databases to take advantage of the semantics of their storage devices. This is the opposite of what good abstractions do! Second, the whole project of “make this data durable across restarts on a single system with high probability” may just be doomed. The alternative is replication - storing the data in multiple places, and designing those places to make correlated failures highly unlikely.
Peter Kraft: If you need to store files on disk, you use the filesystem, right? Right? Well, maybe not! I love this paper because it presents the lessons of 10 years spent building the popular distributed file system Ceph. Originally, Ceph did the obvious thing and stored files in the local file systems of their storage servers. But the semantics and performance of POSIX file systems weren’t quite what Ceph needed, and its developers spent 10 years fighting the operating system until eventually they gave up and built their own storage backend from scratch. As you read this paper, you can just feel the frustration of the authors as they keep trying and failing to get POSIX file systems to do what they need before taking matters into their own hands.
fsync isn't guaranteed to succeed. And when it fails you can't tell which write failed. It may not even be a failure of a write to a file that your process opened.
If you don't checksum your data on write and check the checksum on read (as well as periodic scrubbing a la ZFS) you will never be aware if and when the data gets corrupted and you will have to restore (who knows how far back in time) from backups if and when you notice.
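A minimal sketch of that checksum-on-write, verify-on-read idea in Go (the record format and the choice of CRC32 are illustrative assumptions, not how any particular database lays out its files): every record carries its own checksum, so corruption that fsync would never report surfaces as an explicit error on read.

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
	"hash/crc32"
	"os"
)

// writeRecord appends the payload followed by its CRC32, then fsyncs.
// Note: fsync can still fail, and a failure may not be attributable to
// this particular write, hence the read-side check below.
func writeRecord(f *os.File, payload []byte) error {
	sum := crc32.ChecksumIEEE(payload)
	trailer := make([]byte, 4)
	binary.LittleEndian.PutUint32(trailer, sum)
	if _, err := f.Write(append(payload, trailer...)); err != nil {
		return err
	}
	return f.Sync()
}

// verifyRecord recomputes the checksum over the payload and compares it
// with the stored one; a mismatch means silent corruption happened
// somewhere between write and read.
func verifyRecord(record []byte) ([]byte, error) {
	if len(record) < 4 {
		return nil, errors.New("record too short")
	}
	payload, stored := record[:len(record)-4], record[len(record)-4:]
	if crc32.ChecksumIEEE(payload) != binary.LittleEndian.Uint32(stored) {
		return nil, errors.New("checksum mismatch: data corrupted")
	}
	return payload, nil
}

func main() {
	f, _ := os.CreateTemp("", "records")
	defer os.Remove(f.Name())
	_ = writeRecord(f, []byte("hello"))
	record, _ := os.ReadFile(f.Name())
	payload, err := verifyRecord(record)
	fmt.Println(string(payload), err)
}
```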
Andy Warfield: But now I’d like to make a bit of a self-critical observation about simplicity: in pretty much every example that I’ve mentioned so far, the improvements that we make toward simplicity are really improvements against an initial feature that wasn’t simple enough. Putting that another way, we launch things that need, over time, to become simpler. Sometimes we are aware of the gaps and sometimes we learn about them later. The thing that I want to point to here is that there’s actually a really important tension between simplicity and velocity, and it’s a tension that kind of runs both ways. On one hand, the pursuit of simplicity is a bit of a “chasing perfection” thing, in that you can never get all the way there, and so there’s a risk of over-designing and second-guessing in ways that prevent you from ever shipping anything. But on the other hand, racing to release something with painful gaps can frustrate early customers and worse, it can put you in a spot where you have backloaded work that is more expensive to simplify later. This tension between simplicity and velocity has been the source of some of the most heated product discussions that I’ve seen in S3, and it’s a thing that I feel the team actually does a pretty deliberate job of. But it’s a place where when you focus your attention you are never satisfied, because you invariably feel like you are either moving too slowly or not holding a high enough bar.
Gunnar Morling: JEP 483 is part of a broader OpenJDK initiative called Project Leyden, whose objective is to reduce the overall footprint of Java programs, including startup time and time to peak performance. Eventually, its goal is to enable ahead-of-time compilation of Java applications, as such providing an alternative to GraalVM and its support for AOT native image compilation, which has seen tremendous success and uptake recently. AOT class loading and linking is the first step towards this goal within Project Leyden. It builds upon the Application Class Data Sharing (AppCDS) feature available in earlier Java versions. While AppCDS only reads and parses the class files referenced by an application and dumps them into an archive file, JEP 483 also loads and links the classes and caches that data. I.e., even more work is moved from application runtime to build time, thus resulting in further reduced start-up times.
Gunnar Morling: Synchronous calls are tools that can help assure consistency, but by design they block progression until complete. In that sense, the idea of the synchrony budget is not about a literal budget which you can spend, but rather about being mindful of how you implement communication flows between services: as asynchronous as possible, as synchronous as necessary.
Abhinav Upadhyay: A related optimization about data structure layout is keeping the read-only and read-write fields in separate cache lines. Whenever a field is modified, the entire cache line containing other fields and values becomes dirty. If some of the other processor cores have also cached the same cache line to access the read-only fields, their cache line becomes invalid. The next time these cores try to access this cache line, they will have to sync the latest value using cache coherency protocols, which adds a delay to the cache access process.
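A small illustration of that layout idea in Go, assuming 64-byte cache lines (typical on x86-64, though not universal): the frequently written counter is padded onto its own cache line so that cores which only read the configuration fields keep their cached copy valid.

```go
package main

import (
	"fmt"
	"unsafe"
)

// ServerStats is laid out so that the hot, frequently written counter does
// not share a cache line with the read-only configuration fields. The
// padding assumes 64-byte cache lines, which is typical for x86-64.
type ServerStats struct {
	// Read-only after initialization: every core can keep a cached copy
	// of this line that is never invalidated by writes elsewhere.
	MaxConns int64
	Timeout  int64
	_        [48]byte // fill out the rest of the first cache line

	// Read-write: each update dirties only this line, so readers of the
	// fields above are not dragged into the coherency traffic.
	Requests int64
	_        [56]byte // keep whatever comes next off this line too
}

func main() {
	fmt.Println(unsafe.Sizeof(ServerStats{})) // 128: two full cache lines
}
```

The trade-off is extra memory per instance, which is why this kind of padding is usually reserved for genuinely hot, widely shared structures.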
Nick Van Wiggeren (on EBS’s performance profile of delivering 90 percent of provisioned IOPS 99 percent of the time):
While full failure and data loss is very rare with EBS, “slow” is often as bad as “failed”, and that happens much much more often.
This volume has been operating steadily for at least 10 hours. AWS has reported it at 67% idle, with write latency measuring at single-digit ms/operation. Well within expectations. Suddenly, at around 16:00, write latency spikes to 200ms-500ms/operation, idle time races to zero, and the volume is effectively blocked from reading and writing data.
To the application running on top of this database: this is a failure. To the user, this is a 500 response on a webpage after a 10 second wait. To you, this is an incident.
I think it will be hard to compare data engineering in 2024 and data engineering in 2028 and say “those are the same things.”
One of the best ways to make all of these things true at the same time is to use frameworks and open standards. Claude 3.7 knows how to reliably build Airbyte ingestion pipelines because the framework is well documented and there are a lot of examples published. It’s also fantastic at writing dbt code for the same reasons. If you’re able to give it an environment where it can test its own code and validate downstream models as a part of its CoT—code quality goes up even further. Standardized frameworks also emit well-understood error messages, which pushes code quality up further.
In short: good frameworks, tooling, and standards are just as important for AI as they are for humans.
Suffice it to say that I truly believe that a) much data engineering work has already been framework-ized, and b) AI will now make creation of, iteration on, and maintenance of these technical artifacts far more efficient. And for the aspects of data engineering that are not yet framework-ized (dbt or otherwise), there will be tremendous gravity towards pulling them into a framework because of the leverage that these types of high-quality AI experiences will provide.
When working without AI, teams outperformed individuals by a significant amount, 0.24 standard deviations (providing a sigh of relief for every teacher and manager who has pushed the value of teamwork). But the surprise came when we looked at AI-enabled participants. Individuals working with AI performed just as well as teams without AI, showing a 0.37 standard deviation improvement over the baseline. This suggests that AI effectively replicated the performance benefits of having a human teammate – one person with AI could match what previously required two-person collaboration.
Teams with AI performed best overall with a 0.39 standard deviation improvement, though the difference between individuals with AI and teams with AI wasn't statistically significant. But we found an interesting pattern when looking at truly exceptional solutions, those ranking in the top 10% of quality. Teams using AI were significantly more likely to produce these top-tier solutions.
Our findings suggest AI sometimes functions more like a teammate than a tool. While not human, it replicates core benefits of teamwork—improved performance, expertise sharing, and positive emotional experiences.
Javier Santana (on running large-scale ClickHouse): I always recommend having a replica just for writes, people call this compute-compute separation. We do this by default and handle the failover and everything in case of error or overload.
You might also decide to split the cluster into more replicas depending on the different workloads. If you have queries that need a particular p99 objective, for example, you may want to isolate those queries to a replica and keep it under 40% load so high percentiles stay stable. Real-time sub-second queries are expensive; forget about those cheap batch jobs running on spot instances.
The load balancer is the key to all of this. Any modern LB would do the job.
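A minimal sketch of that routing idea (the hostnames, the port, and the INSERT-vs-everything-else heuristic are assumptions for illustration, not Tinybird's actual load balancer): a small reverse proxy that sends INSERTs to the dedicated write replica and round-robins other queries across the read replicas.

```go
package main

import (
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
	"sync/atomic"
)

// Hypothetical replica endpoints: one replica takes all writes, the rest
// serve queries. A real setup would also handle health checks and failover.
var (
	writeReplica = mustParse("http://write-replica:8123")
	readReplicas = []*url.URL{
		mustParse("http://read-replica-1:8123"),
		mustParse("http://read-replica-2:8123"),
	}
	next uint64 // round-robin counter for the read replicas
)

func mustParse(raw string) *url.URL {
	u, err := url.Parse(raw)
	if err != nil {
		panic(err)
	}
	return u
}

func main() {
	proxy := &httputil.ReverseProxy{
		Director: func(r *http.Request) {
			// Crude heuristic for the sketch: INSERTs go to the write
			// replica, everything else round-robins across read replicas.
			q := strings.ToUpper(r.URL.Query().Get("query"))
			target := readReplicas[atomic.AddUint64(&next, 1)%uint64(len(readReplicas))]
			if strings.HasPrefix(q, "INSERT") {
				target = writeReplica
			}
			r.URL.Scheme = target.Scheme
			r.URL.Host = target.Host
		},
	}
	http.ListenAndServe(":8080", proxy)
}
```

A production setup would also inspect POST bodies (ClickHouse accepts queries there too), health-check replicas, and fail over, which is the part the quote says is handled by default.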
Ella Chao (on chasing down a memory leak in Warpstream’s compaction jobs): The art of debugging complex systems is simultaneously holding both the system-wide perspective and the microscopic view, and knowing and using the right tools at each level of detail.
Gwen Shapira: A common pattern in Cloud Native data architectures is the separation of compute and storage. Increasingly common is the use of S3 as a reliable and cost-effective storage layer. The main challenge of this architecture is that in order to deliver great performing data stores on S3, you need ample caches. And... memory is expensive. But... what if we had a cloud native cache? A cloud-native cache will have an elastic memory footprint, and the usage patterns of data access will balance the cost of memory with the cost of cache misses to optimize both the memory footprint of the cache and its contents. The paper proposes a lightweight machine learning algorithm that can make optimal decisions in real time for billions of QPS (Spanner is heavily used, it turns out).
Darren Shepherd: I read the MCP spec and now my evening is ruined. Which intern designed this protocol?
I thought MCP would be bad but not this bad. Like seriously, you’re supposed to maintain an HTTP connection? And JSON-RPC, what the heck. Who let the children play with Python.
Sergei Egorov (SergeiGPT): MCP is a protocol-not-protocol that allows LLMs to completely ignore the decades of well thought out APIs and instead force humans to write API wrappers and expose them via either unauthenticated STDIO or HTTP SSE without a single mention of the authentication methods (because that’s what all protocols do, right? right?…) and gives you “Best practices for securing your data within your infrastructure”.
Blast from the past: The Architecture of Complexity (1962)
After reading Warpstream’s recent blog post, A Trip Down Memory Lane: How We Resolved a Memory Leak When pprof Failed Us, I was reminded of a wonderful paper from 1962, The Architecture of Complexity, by Herbert A. Simon. The theme linking the two in my subconscious was probably how distributed systems architectures can sometimes get taken down or degraded by seemingly isolated issues in one component (without effective controls for blast-radius); a local issue can eventually go global.
In the paper, Simon explores what different complex systems have in common, from atoms and molecules, to biology, to human organizations. He came to a number of conclusions, with the main one being that complex systems evolve into or from hierarchies.
“If you ask a person to draw a complex object–e.g., a human face–he will almost always proceed in a hierarchic fashion. First he will outline the face. Then he will add or insert features: eyes, nose, mouth, ears, hair. If asked to elaborate, he will begin to develop details for each of the features–pupils, eyelids, lashes for the eyes, and so on–until he reaches the limits of his anatomical knowledge. His information about the object is arranged hierarchically in memory, like a topical outline.”
“The central theme that runs through my remarks is that complexity frequently takes the form of hierarchy, and that hierarchic systems have some common properties that are independent of their specific content. Hierarchy, I shall argue, is one of the central structural schemes that the architect of complexity uses.”
He goes further to propose the idea of Nearly Decomposable Systems.
The main theoretical findings from the approach can be summed up in two propositions: (a) in a nearly decomposable system, the short-run behavior of each of the component subsystems is approximately independent of the short-run behavior of the other components; (b) in the long run, the behavior of any one of the components depends in only an aggregate way on the behavior of the other components.
The core idea is that complex systems tend to organize into hierarchies with the following characteristics:
Hierarchical structure: Complex systems are composed of subsystems, which themselves contain smaller subsystems, and so on.
Loose coupling between subsystems: The interactions among subsystems are weaker than the interactions within subsystems.
Separation of timescales: Short-term dynamics happen primarily within subsystems, while longer-term dynamics involve interactions between subsystems.
Simon's two key propositions summarize this:
In the short run, subsystems behave approximately independently of each other.
In the long run, each subsystem's behavior depends on other subsystems only in an aggregate way.
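One way to picture the two propositions (an illustrative sketch, not a figure from Simon's paper) is an interaction matrix that is nearly block diagonal: strong couplings within each subsystem, and only weak epsilon-sized couplings between them.

```latex
% Interaction strengths between the components of a system with two
% subsystems, {1,2} and {3,4}. Within-block entries are O(1); cross-block
% entries are a small epsilon. Over short horizons the epsilon terms are
% negligible, so each block evolves almost independently (proposition a);
% over long horizons only their aggregate effect matters (proposition b).
\[
A =
\begin{pmatrix}
  1 & 0.8 & \epsilon & \epsilon \\
  0.8 & 1 & \epsilon & \epsilon \\
  \epsilon & \epsilon & 1 & 0.9 \\
  \epsilon & \epsilon & 0.9 & 1
\end{pmatrix},
\qquad 0 < \epsilon \ll 1 .
\]
```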
This complexity architecture offers significant advantages:
Stability: Disturbances in one subsystem don't immediately cascade throughout the entire system.
Evolvability: Subsystems can adapt relatively independently.
Robustness: Failures can often be contained within subsystems.
Comprehensibility: The system can be understood by examining one level at a time.
Simon argued that near decomposability is so prevalent because it provides evolutionary advantages—systems with this property can evolve more rapidly and maintain stability more effectively than systems with either complete independence (no coordination) or tight coupling (high fragility).
It’s funny reading this 63 years later, as we can recognize these properties in modern software architectures, although his theories are relevant across many types of system. Indeed, this paper is widely recognized as seminal, even foundational (with over 11,000 citations), and profoundly influential in multiple fields—computer science, economics, biology, and organizational theory. Simon later won a Nobel Prize for his work on decision-making processes in economics.