Humans of the Data Sphere Issue #8 February 15th 2025
Your biweekly dose of insights, observations, commentary and opinions from interesting people in the world of databases, AI, streaming, distributed systems and the data engineering/analytics space.
Welcome to Humans of the Data Sphere issue #8!
"True stability results when presumed order and presumed disorder are balanced. A truly stable system expects the unexpected, is prepared to be disrupted, waits to be transformed." — Tom Robbins
Quotable Humans
Charity Majors: Honestly, I can’t think of anything less meritocratic than simply receiving and replicating all of society’s existing biases. Do you have any idea how much talent gets thrown away, in terms of unrealized potential?
Sam Lambert: we have single PlanetScale databases on our cloud that are powered by 20,000 cores. all through a single connection string. shared nothing architecture is insanely powerful.
Shaun Thomas: I won't go so far as to say SQL NULL is useless, but there's no direct analog in any language I know of, and you have to work around it literally everywhere using special syntax like IS NOT NULL or IS DISTINCT FROM. Even worse, SQL has no analog for 'empty' either.
Julien Verlaguet: Incremental computation will only become mainstream if the dev and ops time experience is simpler and easier than the more common request/response paradigm, not just faster & continuous.
Marc Brooker: One challenge of handling partial/gray failures in distributed systems is telling 'healthy' from 'unhealthy'. Even in terms of error rate, it can take a surprisingly high number of samples to differentiate between normal and abnormal hosts.
Mahesh Balakrishnan: Something that’s lost in “do we really need another new system?” debates is that the building muscle of an org atrophies if you stop rewriting systems from scratch. (and when you really need to build something new, you no longer have the culture / expertise for 0 to 1 efforts).
Yaroslav Tkachenko: In 2025, most blockchains are not treated as databases. They're primarily virtual machines (ethervm.io) that allow you to "program money".
Mahesh Balakrishnan: Sometime in 2004 in his office in Upson Hall, @KenBirman explained the elegance of virtual synchrony to me in very similar terms. When there are no failures, your group is humming along; then something fails and borks the group; and you hit the group with a hammer and seal / flush / fence it; and switch to an entirely different group. @dahlia_malkhi formalized this notion beautifully in the Vertical Paxos papers. Virtual Consensus simply extended the idea to shared logs. On the shoulders of giants, as they say!
Jaana Dogan: Decentralize everything. Open things where possible. Everyone wins.
Jaana Dogan: The biggest winner of the AI race will be distributed systems people. Everything is converging onto a distributed network of stuff and it is only accelerating in the last two years.
Wyatt Woodsen (commenting on a thread about em dashes and ChatGPT): I once had a data pipeline issue caused by em dashes appearing in roughly half of the output. Eventually traced it back to the keyboard configuration of an offshore dev team in one particular office. God was that a pain to troubleshoot and I swore from then ONLY HYPHENS
James Cowling: Senior engineers are good because they leverage conceptual building blocks that are extensible and composable over long time horizons. LLMs can't currently replicate the performance of a strategic senior engineer but will get there by leveraging great abstractions. We've been running a lot of head-to-head benchmarks at @convex_dev and our experience is that LLMs do *much* better at building real applications when working with higher level abstractions with strong guarantees, rather than reinventing an entire stack.
James Cowling: More developers need to write comments at the top of source files saying what the file is actually for. If you can't write a really short comment explaining it you probably haven't thought hard enough about the structure of your codebase.
Ethan Mollick: Pre-training really was hitting a wall of sorts: diminishing returns (which is what the “scaling law” predicts anyway). The fact that reasoners were developed at exactly that moment & are nowhere close to a wall is how Moore’s Law works: new techniques appear to maintain the trend
Tim Kellogg: Going forward, it’ll be nearly impossible to prevent distealing (unauthorized distilling). One thousand examples is definitely within the range of what a single person might do in normal usage, no less ten or a hundred people. I doubt that OpenAI has a realistic path to preventing or even detecting distealing outside of simply not releasing models.
Mahesh Balakrishnan: Run towards risk. In skunkworks mode, the goal is to reduce technical risk as quickly as possible. Accordingly, the team has to surge on areas where risk is high. Fight the temptation to make steady progress on well-understood, low-risk parts of the system.
Murat Demirbas (on intelligence everywhere): Purpose will be the driving force. Objects that serve a meaningful role will thrive, while those that drift into nihilism (like Marvin, the depressed robot from The Hitchhiker’s Guide to the Galaxy) will be phased out. Intelligence will seek to create value, not just exist for its own sake.
Gilad Kleinman discusses IVM and Epsio with Chris Riccomini:
Other than the fact most managed database offerings (RDS, Cloud SQL, and so on) don't allow users to install unauthorized extensions, adopting new database technologies is a pretty scary endeavor. We found that asking companies to install a new extension (that could potentially crash) on their production database was a pretty big ask to make. By sitting "behind" the existing database, reading CDC logs, and writing back results to the original database, users can integrate Epsio without worrying about affecting anything other than the results tables it needs to maintain. We even actively recommend not giving Epsio permissions to anything other than that.
In classic "streaming" use cases, the main benefit of IVM was the ease of writing SQL rather than writing custom code. In the use cases above, the benefits are more about query performance and cost—how easy it is to deliver performant, cost-effective queries. No matter how fast or efficient a traditional database is, if you are running a heavy query and most of the dataset hasn’t changed since the last run, there is a lot of wasted compute. This translates into either higher cost, higher latency, or both.
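To make the wasted-compute point concrete, here is a toy sketch of my own (not Epsio's implementation; all names are hypothetical): a grouped SUM is kept up to date by applying each change from a CDC-style feed to the materialized result, instead of rescanning the whole table on every query.

```python
from collections import defaultdict

# Materialized result of: SELECT account, SUM(amount) FROM payments GROUP BY account
sums = defaultdict(float)

def apply_change(op, account, amount):
    """Apply one change from the feed in O(1), rather than recomputing over all rows."""
    if op == "insert":
        sums[account] += amount
    elif op == "delete":
        sums[account] -= amount

# A change feed (roughly what you might read from CDC) touching a tiny fraction of the data:
changes = [("insert", "acme", 120.0), ("insert", "globex", 40.0), ("delete", "acme", 20.0)]
for change in changes:
    apply_change(*change)

print(dict(sums))  # {'acme': 100.0, 'globex': 40.0}
```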
Alex Petrov: In my experience, using LLMs for paper reading is only useful for querying and digging in, but never for summarization. Things the LLM (Claude in my case) would suggest are almost never something I would find useful or interesting, nor do they go far beyond what the authors have already put in the abstract.
Julien Le Dem on the human side of software architecture:
When decisions are made by the people who best understand the systems - and who also will be responsible for the consequences of those decisions, creating a more virtuous cycle of incentives - there are still drawbacks that result directly from the decentralization of decision making. If you just leave every team to their own devices to independently make decisions without coordination, they are unlikely to just naturally all reach the same conclusion on what problems we’re solving or who is solving what part. There is going to be a level of chaos that needs to be managed.
Since software architecture is very different from regular architecture, we don’t actually need a role that centralizes drawing exhaustive and precise plans to be followed closely. We do need people facilitating alignment amongst teams to manage and limit the increase of complexity caused by decentralized decision making. Whether you call these people software architects or some other senior engineering title doesn’t really matter.
Ethan Mollick on the current limitations of general-purpose AI agents: Then the troubles begin, and they're twofold: not only is Operator blocked by OpenAI's security restrictions on file downloads, but it also starts to struggle with the task itself. The agent methodically tries every conceivable workaround: copying to clipboard, generating direct links, even diving into the site's source code. Each attempt fails - some due to OpenAI's browser restrictions, others due to the agent's own confusion about how to actually accomplish the task. Watching this determined but ultimately failed problem-solving loop reveals both the current limitations of these systems and raises questions about how agents will eventually behave when they encounter barriers in the real world.
Yaroslav Tkachenko explores building a stream processing framework with DataFusion: DataFusion is designed as a pull-based engine. Conceptually, it means that each operator runs a tight loop that pulls data from the upstream sources. In practice, DataFusion uses Tokio Streams. I want to highlight two observations:
Tokio Stream (kinda like an iterator of Futures) is the primary abstraction, even when it comes to bounded sources (e.g. reading a bunch of Parquet files).
Pull-based execution doesn’t offer much control over backpressure. This makes it very different from Apache Flink, which can offer reliable backpressure, fine-grained flow control and adaptive buffers between operators. These things are not as important in the context of a query engine (whose goal is to read a bunch of files as fast as possible), but they do matter a lot for a streaming engine.
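For readers who haven't looked at DataFusion, here is a rough analogy in Python (not DataFusion's actual API, and Python generators rather than Tokio Streams): in a pull-based plan, the downstream consumer drives execution by repeatedly awaiting the next batch from its upstream, and the only flow control you get for free is that one pull yields one batch.

```python
import asyncio

async def scan(batches):
    """Source operator: yields record batches (here, plain lists of rows)."""
    for batch in batches:
        yield batch

async def filter_even(upstream):
    """Filter operator: a tight loop that pulls batches from upstream and yields filtered ones."""
    async for batch in upstream:
        yield [row for row in batch if row % 2 == 0]

async def main():
    plan = filter_even(scan([[1, 2, 3], [4, 5, 6]]))
    # The consumer drives the whole pipeline: nothing runs until we pull.
    async for batch in plan:
        print(batch)   # [2] then [4, 6]

asyncio.run(main())
```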
Alex Miller: The PolarDB-X paper makes a decent deal about using HLCs because having a timestamp service is a SPOF and perf bottleneck, but then the public docs for PolarDB-X exclusively describe the use of a TSO as the PolarDB-X architecture. So... not too much of a problem after all?
Ananth Packkildurai: My null hypothesis is that as the number of configurations increases, the reliability of the software decreases. I wonder if there are any papers/studies published on this?
Lorin Hochstein: The real challenge is preventing and quickly mitigating novel future incidents, which is the overwhelming majority of your incidents. And that brings us to near misses, those operational surprises that have no actual impact, but could have been a major incident if conditions were slightly different. Think of them as precursors to incidents. Or, if you are more poetically inclined, omens.
Cloudflare incident write-up. We’ve all had that sinking feeling of realizing we just dropped the production database. This is a horrifying example of a "wait... what did I just do?" moment at scale: During a routine abuse remediation, action was taken on a complaint that inadvertently disabled the R2 Gateway service instead of the specific endpoint/bucket associated with the report. This was a failure of multiple system level controls (first and foremost) and operator training.
Interesting topic #1 - Systems Correctness Practices at AWS
Marc Brooker and Ankush Desai wrote an article called Systems Correctness Practices at AWS. In it, they describe the various practices AWS employs to gain confidence in the correctness of its services and to find bugs.
The list includes:
Formal Verification. AWS started with TLA+ but has also made a big investment in the P programming language. P is a state-machine formal verification language that engineers typically find a lot easier to get started with than TLA+. Since 2019, P has been a strategic open-source project used in key AWS services like S3, EBS, DynamoDB, Aurora, and EC2 to ensure system correctness. One major success story was S3’s migration from eventual to strong read-after-write consistency, where P helped validate protocol changes and catch design-level bugs early.
Lightweight Formal Methods—Property-based testing and reference models. This combines “property-based testing with developer-provided correctness specifications”. The primary example of the technique is Amazon S3's ShardStore (a key-value storage node), where engineers developed an executable reference model as a specification and used property-based testing to validate the implementation against that model. This method successfully prevented issues such as subtle crash-consistency and concurrency problems from reaching production. (A toy sketch of the reference-model pattern follows this list.)
Lightweight Formal Methods—Deterministic simulation (DS). This is a testing technique in which a distributed system is validated using property-based testing while a simulated environment controls for non-determinism. In the real world, these systems experience a lot of non-determinism, so DS needs to control factors like timing, concurrency, and external inputs, and the DS framework and the code under test work together to do so. For randomness, the framework provides a fixed seed for random number generators. For concurrency, thread scheduling and interleaving are explicitly controlled. For time-dependent behavior, the framework replaces system clocks with mocked or logical clocks. External resources such as network and disk are simulated. All of this lets developers run randomized tests while still being able to reproduce bugs consistently, and because everything is simulated, it also allows for more precise fault injection. One key point Marc and Ankush make is that the value of this type of testing lies in the fast feedback it provides: “Deterministic simulation testing moves testing of system properties, like behavior under delay and failure, closer to build time instead of integration testing.” (A minimal simulation sketch also follows this list.)
Lightweight Formal Methods—Continuous fuzzing or random test-input generation. “First, by fuzzing SQL queries (and entire transactions), we validated that the logic partitioning SQL execution over shards is correct. Large volumes of random SQL schemas, datasets, and queries are synthesized and run through the engines under test, and the results compared with an oracle based on the nonsharded version of the engine (as well as other approaches to validation, like those pioneered by SQLancer).” (A toy version of this oracle-based check also follows this list.)
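To make the reference-model idea a bit more concrete, here is a toy sketch of my own (not ShardStore's actual harness; all names are hypothetical): a trivially correct in-memory model acts as the specification, and randomly generated operation histories check that an "optimized" implementation never diverges from it.

```python
import random

class ReferenceModel:
    """Executable specification: obviously correct, no optimizations."""
    def __init__(self):
        self.data = {}
    def put(self, key, value):
        self.data[key] = value
    def get(self, key):
        return self.data.get(key)

class LogStructuredStore:
    """Stand-in for the optimized implementation under test (an append-only log plus an index)."""
    def __init__(self):
        self.log = []     # append-only log of (key, value) records
        self.index = {}   # key -> offset of the latest record for that key
    def put(self, key, value):
        self.index[key] = len(self.log)
        self.log.append((key, value))
    def get(self, key):
        offset = self.index.get(key)
        return self.log[offset][1] if offset is not None else None

def check_random_history(seed, num_ops=500):
    rng = random.Random(seed)
    model, impl = ReferenceModel(), LogStructuredStore()
    for _ in range(num_ops):
        key = rng.choice("abcdef")
        if rng.random() < 0.7:
            value = rng.randint(0, 100)
            model.put(key, value)
            impl.put(key, value)
        else:
            # The property: every read from the implementation agrees with the model.
            assert impl.get(key) == model.get(key), f"divergence on key {key!r}"

for seed in range(100):
    check_random_history(seed)
print("implementation matched the reference model on 100 random histories")
```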
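And here is a minimal flavor of deterministic simulation, again my own toy rather than AWS's framework: the simulator owns the clock and all randomness, so any interleaving it produces can be replayed exactly from its seed.

```python
import heapq
import random

def simulate(seed):
    """Run a tiny simulated workload where event times come from a seeded RNG and a logical clock."""
    rng = random.Random(seed)   # all randomness is owned by the simulator
    clock = 0.0                 # logical clock instead of the wall clock
    events = []                 # priority queue of (time, description)
    trace = []

    # Schedule a few request/response pairs with random simulated network delays.
    for request_id in range(3):
        send_at = clock + rng.uniform(0, 5)
        heapq.heappush(events, (send_at, f"send request {request_id}"))
        heapq.heappush(events, (send_at + rng.uniform(0.1, 2.0), f"receive response {request_id}"))

    while events:
        clock, description = heapq.heappop(events)
        trace.append((round(clock, 3), description))
    return trace

# Same seed, same interleaving: a failure found at seed 42 can be replayed forever.
assert simulate(42) == simulate(42)
assert simulate(42) != simulate(7)
print(simulate(42))
```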
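Finally, a toy version of the fuzz-against-an-oracle idea (my sketch using SQLite, not Aurora's harness): random filters over a decomposable aggregate are run against a single non-sharded table (the oracle) and against the same rows hash-partitioned across shards, and the merged shard results must always agree with the oracle.

```python
import random
import sqlite3

rng = random.Random(0)
rows = [(i, rng.randint(0, 1000)) for i in range(1000)]   # (id, amount)
NUM_SHARDS = 4

# Oracle: a single non-sharded table.
oracle = sqlite3.connect(":memory:")
oracle.execute("CREATE TABLE t (id INTEGER, amount INTEGER)")
oracle.executemany("INSERT INTO t VALUES (?, ?)", rows)

# System under test: the same rows hash-partitioned across shards.
shards = [sqlite3.connect(":memory:") for _ in range(NUM_SHARDS)]
for shard in shards:
    shard.execute("CREATE TABLE t (id INTEGER, amount INTEGER)")
for row in rows:
    shards[row[0] % NUM_SHARDS].execute("INSERT INTO t VALUES (?, ?)", row)

for _ in range(100):
    # "Fuzz" a query: a random threshold over a decomposable aggregate.
    threshold = rng.randint(0, 1000)
    query = "SELECT COUNT(*), COALESCE(SUM(amount), 0) FROM t WHERE amount >= ?"
    expected = tuple(oracle.execute(query, (threshold,)).fetchone())
    partials = [shard.execute(query, (threshold,)).fetchone() for shard in shards]
    merged = (sum(p[0] for p in partials), sum(p[1] for p in partials))
    assert merged == expected, f"sharded execution diverged at threshold {threshold}"

print("sharded execution agreed with the non-sharded oracle on 100 random queries")
```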
The article also discusses Fault Injection as a Service and testing for metastable failures. The latter is of particular interest to me. A metastable failure is one where “some triggering event (like an overload or a cache emptying) causes a distributed system to enter a state where it doesn’t recover without intervention (such as reducing load below normal).”
Marc and Ankush note that “Traditional formal approaches to modeling distributed systems typically focus on safety (nothing bad happens) and liveness (something good eventually happens), but metastable failures remind us that systems have a variety of behaviors that cannot be neatly categorized this way. We have increasingly turned to discrete-event simulation to understand the emergent behavior of systems, investing both in custom-built systems simulations and tooling that allow the use of existing system models (built in languages like TLA+ and P) to simulate system behavior.”
It’s an area I have dabbled in using simple simulations, such as finding pathological workloads for a proposed distributed rate-limiting algorithm and problematic liveness properties of a cooperative resource allocation algorithm. I hope to see more findings published in the future about discrete-event simulation in the context of distributed systems engineering.
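As a flavor of the kind of emergent behavior such simulations can expose, here is a deliberately crude model of my own (not from the article): a service whose clients retry timed-out requests can stay saturated long after the triggering load spike has passed, which is exactly the "doesn't recover without intervention" shape of a metastable failure.

```python
CAPACITY = 100            # requests the service can complete per tick
RETRIES_PER_TIMEOUT = 2   # impatient clients: every timed-out request is retried twice

def simulate(spike_size, ticks=15, base_load=80):
    """Fixed-timestep toy model; returns the retry backlog at each tick."""
    backlog = 0
    history = []
    for tick in range(ticks):
        arrivals = base_load + (spike_size if tick == 5 else 0)
        offered = arrivals + backlog                  # fresh work plus retries of earlier timeouts
        timed_out = max(0, offered - CAPACITY)        # whatever we couldn't complete this tick
        backlog = timed_out * RETRIES_PER_TIMEOUT     # retries amplify next tick's load
        history.append(backlog)
    return history

print("no spike:   ", simulate(spike_size=0))    # healthy steady state: backlog stays at zero
print("brief spike:", simulate(spike_size=50))   # the trigger lasts one tick, but the retry
                                                 # storm keeps growing until load is shed
```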
Interesting topic #2 - Husky: Efficient compaction at Datadog scale
Datadog published a blog post, Husky: Efficient compaction at Datadog scale, detailing how Husky (their event store) performs compaction. The authors frame it as a Goldilocks problem: a set of competing concerns, which can be generalized as write optimization versus read optimization, must be kept in balance.
Ensuring that this compaction system gives us the performance we need is all about finding the right fragment size that maintains a good balance among the following concerns:
For object storage fetches and metadata storage, fewer fragments are better.
For compaction, less work is better, both for CPU used in compaction and the number of PUT/GET requests to object storage.
For queries, where each fragment relevant to the query is read concurrently by a pool of query workers, the trade-off is a somewhat complex one between having fewer, larger fragments while maintaining high parallelism. We’re balancing the speed at which a single worker can scan the rows in a fragment with the overhead of distributing a query to many workers, which can scan many fragments in parallel for larger queries. There is an optimal fragment size to balance between scan speed and distribution overhead.
Compaction can affect storage layout in a positive way, but at the cost of doing more work. Given a particular common query pattern, events can be laid out in the fragments, both in the time dimension and in spatial dimensions (i.e., by tags), so that events that would be relevant to a given query can be close together. Keeping similar events close together improves compression, but it is in tension with “less work is better” as compaction will work harder to achieve this layout, and some analysis of queries must be done to determine the layout for the best system-wide outcome.
In short, fragments that are “too small” are inefficient for queries because many small fetches pay an object storage latency penalty and aren’t as efficiently processed by the query workers, which implement vectorized execution to scan many rows quickly. Fragments that are “too large” drive down parallelism for larger queries, causing those queries to take longer. Compaction attempts to find a fragment size that is just right for typical query patterns to minimize query cost and latency, while at the same time keeping similar data together.
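As a back-of-the-envelope illustration of that Goldilocks trade-off, here is a toy cost model of my own (the constants are made up, not Datadog's): per-fragment query time is a fixed object-storage fetch latency plus scan time, fragments are scanned in parallel waves across a worker pool, and each fragment adds a small scheduling overhead. Sweeping the fragment size shows a sweet spot between "too many small fetches" and "too little parallelism".

```python
import math

TOTAL_ROWS = 50_000_000
FETCH_LATENCY_S = 0.05              # object storage first-byte latency paid per fragment
SCAN_RATE_ROWS_PER_S = 5_000_000    # vectorized scan speed of a single query worker
SCHEDULING_OVERHEAD_S = 0.002       # cost of dispatching one fragment to a worker
WORKERS = 64

def query_latency(rows_per_fragment):
    fragments = math.ceil(TOTAL_ROWS / rows_per_fragment)
    per_fragment = FETCH_LATENCY_S + rows_per_fragment / SCAN_RATE_ROWS_PER_S
    waves = math.ceil(fragments / WORKERS)   # fragments are scanned concurrently, wave by wave
    return waves * per_fragment + fragments * SCHEDULING_OVERHEAD_S

for rows in (10_000, 100_000, 1_000_000, 10_000_000):
    print(f"{rows:>10,} rows/fragment -> ~{query_latency(rows):5.2f} s")
# Too small (10k rows) and too large (10M rows) are both slower than the middle of the range.
```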
The post outlines a number of interesting aspects of compaction:
Row group size
Time bucketing
Size-tiered compaction
Sort schemas
Locality compaction
Pruning
Locality compaction and pruning in Husky are similar to the practice of partitioning and clustering in other analytics systems, such as the open table formats (Iceberg/Delta). The aim is for queries to prune data files based on metadata such as key ranges. By co-locating data of the same key range in the same file, files can be more aggressively skipped during query planning.
“As the levels increase exponentially in size, while the size of the fragments is held constant, at higher and higher levels, each individual fragment’s minimum and maximum row keys are “closer together” than those at lower levels. There is a higher chance we can prune these high level fragments as the lexical space each one covers is smaller relative to those at lower levels.”
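A rough sketch of what that buys at query time (hypothetical fragment metadata of my own, not Husky's actual format): if each fragment records the min and max row key it contains, the planner can skip every fragment whose key range doesn't overlap the query's.

```python
# Each fragment advertises the min/max row key it contains (hypothetical metadata).
fragments = [
    {"path": "frag-001", "min_key": "aardvark", "max_key": "badger"},
    {"path": "frag-002", "min_key": "cat",      "max_key": "ferret"},
    {"path": "frag-003", "min_key": "giraffe",  "max_key": "llama"},
    {"path": "frag-004", "min_key": "marmot",   "max_key": "zebra"},
]

def prune(fragments, query_min, query_max):
    """Keep only fragments whose [min_key, max_key] range overlaps the query's key range."""
    return [f for f in fragments
            if f["max_key"] >= query_min and f["min_key"] <= query_max]

# After locality compaction each fragment covers a narrow slice of the key space,
# so a selective query reads far fewer fragments.
print([f["path"] for f in prune(fragments, "dog", "hippo")])  # ['frag-002', 'frag-003']
```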
Interesting topic #3 - On solving for the distributed case
Recently, I’ve been exploring the different ways of disaggregating log replication protocols, using Raft as the most famous example of the converged protocol. I’ll soon be releasing a survey of log replication protocols and real-world systems through the lens of different types of disaggregation.
While I was compiling some of the quotes for this issue I stumbled on this post by Jaana Dogan.
Already thinking about separating protocols into different abstractions and components, I immediately saw the parallels to Paxos and Raft.
Quoting my own blog post on protocol disaggregation:
Paxos made a fundamental contribution to distributed consensus by formalizing the responsibilities of reaching consensus and acting on the agreed values into distinct roles. Paxos separates the consensus protocol into proposers who drive consensus by proposing values to acceptors, acceptors who form the quorum necessary for reaching agreement (consensus), and learners who need to know the decided values. This creates a clear framework that allows system designers to reason about each role's responsibilities independently while ensuring their interaction maintains safety and liveness properties. The formalization of these roles has influenced the design of practical systems and protocols for decades, even when they don't strictly adhere to the original Paxos model. This cannot be overstated.
Raft is a prescriptive, implementation-focused consensus algorithm (which is why it became popular initially). Paxos, on the other hand, was not prescriptive and focused more on identifying discrete roles and responsibilities. This has led to a plethora of variants and diverse implementations. We can see great examples of how this focus on abstractions and roles led to more creative and flexible implementations of consensus. One example of this that I like is Neon’s distributed write-ahead log.
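To make the role separation concrete, here is a deliberately minimal single-decree sketch of my own (not production Paxos: no failures, no persistence, no multi-decree log). The point is that the acceptor role exposes only prepare/accept, so proposers and learners can be packaged and deployed however you like.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Acceptor:
    """The acceptor role: the only state that consensus safety depends on."""
    promised: int = -1                          # highest ballot we promised not to undercut
    accepted: Optional[Tuple[int, str]] = None  # (ballot, value) we last accepted

    def prepare(self, ballot: int):
        """Phase 1: promise to ignore lower ballots; report anything already accepted."""
        if ballot > self.promised:
            self.promised = ballot
            return ("promise", self.accepted)
        return ("nack", None)

    def accept(self, ballot: int, value: str):
        """Phase 2: accept the value unless we promised a higher ballot."""
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return "accepted"
        return "nack"

def propose(acceptors, ballot, value):
    """A minimal proposer: needs a majority of promises and must adopt any already-accepted value."""
    promises = [a.prepare(ballot) for a in acceptors]
    granted = [acc for kind, acc in promises if kind == "promise"]
    if len(granted) <= len(acceptors) // 2:
        return None
    prior = [acc for acc in granted if acc is not None]
    if prior:
        value = max(prior)[1]                   # highest-ballot previously accepted value wins
    acks = [a.accept(ballot, value) for a in acceptors]
    if acks.count("accepted") > len(acceptors) // 2:
        return value                            # a learner that sees a majority learns this value
    return None

acceptors = [Acceptor() for _ in range(3)]
print(propose(acceptors, ballot=1, value="logs-go-in-s3"))   # 'logs-go-in-s3'
print(propose(acceptors, ballot=2, value="something-else"))  # still 'logs-go-in-s3': chosen once, chosen forever
```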
Going back to what Jaana said, the way I would riff on it would be:
When you have a highly ambiguous systems problem, try to solve it in terms of modular abstractions, roles and responsibilities first. Once you identify these abstractions and roles, you can choose whether to pack them together in a monolith or deploy them as a distributed set of components. Starting with a mixed-up monolith and later attempting to separate it into clean modular abstractions is almost always impossible.
I think it’s a philosophy worth keeping in mind always.