<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Humans of the Data Sphere]]></title><description><![CDATA[Compiling wisdom, opinions and content from humans across the data sphere.]]></description><link>https://www.hotds.dev</link><image><url>https://substackcdn.com/image/fetch/$s_!I1cj!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc50bf85b-a06e-4504-9d93-83d7c3c3419c_1024x1024.png</url><title>Humans of the Data Sphere</title><link>https://www.hotds.dev</link></image><generator>Substack</generator><lastBuildDate>Wed, 06 May 2026 05:20:24 GMT</lastBuildDate><atom:link href="https://www.hotds.dev/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Jack Vanlightly]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[humansofthedatasphere@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[humansofthedatasphere@substack.com]]></itunes:email><itunes:name><![CDATA[Jack Vanlightly]]></itunes:name></itunes:owner><itunes:author><![CDATA[Jack Vanlightly]]></itunes:author><googleplay:owner><![CDATA[humansofthedatasphere@substack.com]]></googleplay:owner><googleplay:email><![CDATA[humansofthedatasphere@substack.com]]></googleplay:email><googleplay:author><![CDATA[Jack Vanlightly]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Humans of the Data Sphere Issue #10 April 4th 2025]]></title><description><![CDATA[Your biweekly dose of insights, observations, commentary and opinions from interesting people from the world of databases, AI, streaming, distributed systems and the data engineering/analytics 
space.]]></description><link>https://www.hotds.dev/p/humans-of-the-data-sphere-issue-10</link><guid isPermaLink="false">https://www.hotds.dev/p/humans-of-the-data-sphere-issue-10</guid><dc:creator><![CDATA[Jack Vanlightly]]></dc:creator><pubDate>Fri, 04 Apr 2025 16:11:27 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc50bf85b-a06e-4504-9d93-83d7c3c3419c_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome to Humans of the Data Sphere issue #10!</p><div class="pullquote"><p>sometimes the economy needs to scale to zero&#8212;<a href="https://x.com/isamlambert/status/1907791926067343432">Sam Lambert</a></p></div><div class="pullquote"><p>The pursuit of simplicity is not about achieving a static, minimalistic design from the outset but involves a dynamic process of learning, adapting, and refining &#8212;Paraphrasing Andy Warfield&#8217;s <a href="https://www.allthingsdistributed.com/2025/03/in-s3-simplicity-is-table-stakes.html">post </a>on simplicity</p></div><h2>Quotable Humans</h2><ul><li><p><a href="https://x.com/jandrewrogers/status/1900773601097462043">Kelly Sommers</a>: Every time we separate compute from storage, we bring it back together it seems. And then we do it again a decade later.</p></li><li><p><a href="https://bsky.app/profile/355e3b.bsky.social/post/3ll5qxcilk22k">355e3b</a>: something I&#8217;ve noticed while cloud lead at an o11ly vendor (Instana) and now doing security tools is that most teams don&#8217;t have a model of COGS when trying to get their cloud bill under control.</p></li><li><p><a href="https://bsky.app/profile/charity.wtf/post/3lkc7kjqb6c2h">Charity Majors</a>: Psychological safety is NOT about lack of disagreement. 
Psychological safety REQUIRES: * disagreement and debate * setting standards for behavior and performance, and enforcing them * telling people things they don't want to hear * courage, from the bottom up * humility, from the top down</p></li><li><p><a href="https://bsky.app/profile/charity.wtf/post/3lkbspjeftc2r">Charity Majors</a>: Corollary: when we are crafting sociotechnical tools and systems, we should focus on making them usable (and powerful) in the hands of normal, fallible, embodied people. This is one of the core insights of platform engineering...applying product and design thinking to technical systems.</p></li><li><p><a href="https://x.com/MarcJBrooker/status/1900583708966756524">Marc Brooker</a> (on the continued debates around fsync guarantees): In my mind, there are two real take-aways from issues like this. First, abstractions like POSIX (and even Linux specifically) are making it harder for databases to take advantage of the semantics of their storage devices. This is the opposite of what good abstractions do! Second, the whole project of &#8220;make this data durable across restarts on a single system with high probability&#8221; may just be doomed. The alternative is replication - storing the data in multiple places, and designing those places to make correlated failures highly unlikely.</p></li><li><p><a href="https://x.com/petereliaskraft/status/1906420979896893823">Peter Kraft</a>: If you need to store files on disk, you use the filesystem, right? Right? Well, maybe not! I love this paper because it presents the lessons of 10 years spent building the popular distributed file system Ceph. Originally, Ceph did the obvious thing and stored files in the local file systems of their storage servers. But the semantics and performance of POSIX file systems weren&#8217;t quite what Ceph needed, and its developers spent 10 years fighting the operating system until eventually they gave up and built their own storage backend from scratch.
As you read this paper, you can just feel the frustration of the authors as they keep trying and failing to get POSIX file systems to do what they need before taking matters into their own hands.</p></li><li><p><a href="https://notes.eatonphil.com/2025-03-27-things-that-go-wrong-with-disk-io.html">Phil Eaton</a>: </p><ul><li><p>fsync isn't guaranteed to succeed. And when it fails you can't tell which write failed. It may not even be a failure of a write to a file that your process opened.</p></li><li><p>If you don't checksum your data on write and check the checksum on read (as well as periodic scrubbing a la ZFS) you will never be aware if and when the data gets corrupted and you will have to restore (who knows how far back in time) from backups if and when you notice.</p></li></ul></li><li><p><a href="https://www.allthingsdistributed.com/2025/03/in-s3-simplicity-is-table-stakes.html">Andy Warfield</a>: But now I&#8217;d like to make a bit of a self-critical observation about simplicity: in pretty much every example that I&#8217;ve mentioned so far, the improvements that we make toward simplicity are really improvements against an initial feature that wasn&#8217;t simple enough. Putting that another way, we launch things that need, over time, to become simpler. Sometimes we are aware of the gaps and sometimes we learn about them later. The thing that I want to point to here is that there&#8217;s actually a really important tension between simplicity and velocity, and it&#8217;s a tension that kind of runs both ways. On one hand, the pursuit of simplicity is a bit of a &#8220;chasing perfection&#8221; thing, in that you can never get all the way there, and so there&#8217;s a risk of over-designing and second-guessing in ways that prevent you from ever shipping anything. 
But on the other hand, racing to release something with painful gaps can frustrate early customers and worse, it can put you in a spot where you have backloaded work that is more expensive to simplify later. This tension between simplicity and velocity has been the source of some of the most heated product discussions that I&#8217;ve seen in S3, and it&#8217;s a thing that I feel the team actually does a pretty deliberate job of. But it&#8217;s a place where when you focus your attention you are never satisfied, because you invariably feel like you are either moving too slowly or not holding a high enough bar.</p></li><li><p><a href="https://www.morling.dev/blog/jep-483-aot-class-loading-linking/">Gunnar Morling</a>: JEP 483 is part of a broader OpenJDK initiative called <a href="https://openjdk.org/projects/leyden/">Project Leyden</a>, whose objective is to reduce the overall footprint of Java programs, including startup time and time to peak performance. Eventually, its goal is to enable <a href="https://openjdk.org/jeps/8335368">ahead-of-time compilation</a> of Java applications, as such providing an alternative to <a href="https://www.graalvm.org/">GraalVM</a> and its support for AOT native image compilation, which has seen tremendous success and uptake recently. AOT class loading and linking is the first step towards this goal within Project Leyden. It builds upon the <a href="https://www.morling.dev/blog/smaller-faster-starting-container-images-with-jlink-and-appcds/">Application Class Data Sharing</a> (AppCDS) feature available in earlier Java versions. While AppCDS only reads and parses the class files referenced by an application and dumps them into an archive file, JEP 483 also loads and links the classes and caches that data. I.e.
even more work is moved from application runtime to build time, thus resulting in further reduced start-up times.</p></li><li><p><a href="https://www.morling.dev/blog/the-synchrony-budget/">Gunnar Morling</a>: Synchronous calls are tools that can help assure consistency, but by design they block progression until complete. In that sense, the idea of the synchrony budget is not about a literal budget which you can spend, but rather about being mindful how you implement communication flows between services: as asynchronous as possible, as synchronous as necessary.</p></li><li><p><a href="https://blog.codingconfessions.com/p/hardware-aware-coding">Abhinav Upadhyay</a>: A related optimization about data structure layout is keeping the read-only and read-write fields in separate cache lines. Whenever a field is modified, the entire cache line containing other fields and values becomes dirty. If some of the other processor cores have also cached the same cache line to access the read-only fields, their cache line becomes invalid. The next time these cores try to access this cache line, they will have to sync the latest value using cache coherency protocols, which adds a delay to the cache access process.</p></li><li><p><a href="https://planetscale.com/blog/the-real-fail-rate-of-ebs">Nick Van Wiggeren</a> (on EBS&#8217;s performance profile of delivering 90 percent of provisioned IOPS, 99 percent of the time): </p><ul><li><p>While full failure and data loss is very rare with EBS, &#8220;slow&#8221; is often as bad as &#8220;failed&#8221;, and that happens much much more often.</p></li><li><p>This volume has been operating steadily for at least 10 hours. AWS has reported it at 67% idle, with write latency measuring at single-digit ms/operation. Well within expectations.
Suddenly, at around 16:00, write latency spikes to 200ms-500ms/operation, idle time races to zero, and the volume is effectively blocked from reading and writing data.</p><p>To the application running on top of this database: this is a failure. To the user, this is a 500 response on a webpage after a 10 second wait. To you, this is an incident.</p></li></ul></li><li><p><a href="https://www.getdbt.com/blog/how-ai-will-disrupt-data-engineering">Tristan Handy</a>: </p><ul><li><p>I think it will be hard to compare data engineering in 2024 and data engineering in 2028 and say &#8220;those are the same things.&#8221;</p></li><li><p>One of the best ways to make all of these things true at the same time is to use frameworks and open standards. Claude 3.7 knows how to build reliable Airbyte ingestion pipelines because the framework is well documented and there are a lot of examples published. It&#8217;s also fantastic at writing dbt code for the same reasons. If you&#8217;re able to give it an environment where it can test its own code and validate downstream models as a part of its CoT&#8212;code quality goes up even further.
Standardized frameworks also emit well-understood error messages, which pushes code quality up further.</p><p>In short: good frameworks, tooling, and standards are <em>just as important</em> for AI as they are for humans.</p></li><li><p>Suffice it to say that I truly believe that a) much data engineering work has already been framework-ized, and b) AI will now make creation of, iteration on, and maintenance of these technical artifacts <em>far more efficient.</em> And for the aspects of data engineering that are not yet framework-ized (dbt or otherwise), there will be tremendous gravity towards pulling them into a framework because of the leverage that these types of high-quality AI experiences will provide.</p></li></ul></li><li><p><a href="https://www.oneusefulthing.org/p/the-cybernetic-teammate">Ethan Mollick</a>: </p><ul><li><p>When working without AI, teams outperformed individuals by a significant amount, 0.24 standard deviations (providing a sigh of relief for every teacher and manager who has pushed the value of teamwork). But the surprise came when we looked at AI-enabled participants. Individuals working with AI performed just as well as teams without AI, showing a 0.37 standard deviation improvement over the baseline. This suggests that AI effectively replicated the performance benefits of having a human teammate &#8211; one person with AI could match what previously required two-person collaboration.</p></li><li><p>Teams with AI performed best overall with a 0.39 standard deviation improvement, though the difference between individuals with AI and teams with AI wasn't statistically significant. But we found an interesting pattern when looking at truly exceptional solutions, those ranking in the top 10% of quality. Teams using AI were significantly more likely to produce these top-tier solutions.</p></li><li><p>Our findings suggest AI sometimes functions more like a teammate than a tool. 
While not human, it replicates core benefits of teamwork&#8212;improved performance, expertise sharing, and positive emotional experiences.</p></li></ul></li><li><p><a href="https://www.tinybird.co/blog-posts/what-i-learned-operating-clickhouse">Javier Santana</a> (on running large-scale ClickHouse): I always recommend having a replica just for writes, people call this compute-compute separation. We do this by default and handle the failover and everything in case of error or overload.</p><p>You might also decide to split the cluster into more replicas depending on the different workloads. If you have queries that need a particular p99 objective, for example, you may want to isolate those queries to a replica and keep it under 40% load so high percentiles stay stable. Real-time sub-second queries are expensive; forget about those cheap batch jobs running on spot instances.</p><p>The load balancer is the key to all of this. Any modern LB would do the job.</p></li><li><p><a href="https://www.warpstream.com/blog/a-trip-down-memory-lane-how-we-resolved-a-memory-leak-when-pprof-failed-us">Ella Chao</a> (on chasing down a memory leak in Warpstream&#8217;s compaction jobs): The art of debugging complex systems is simultaneously holding both the system-wide perspective and the microscopic view, and knowing and using the right tools at each level of detail.</p></li><li><p><a href="https://x.com/gwenshap/status/1903127970635796876">Gwen Shapira</a>: A common pattern in Cloud Native data architectures is the separation of compute and storage. Increasingly common is the use of S3 as reliable and cost-effective storage layer. The main challenge of this architecture is that in order to deliver great performing data stores on S3, you need ample caches. And... memory is expensive. But... what if we had a cloud native cache? 
Cloud-native cache will have an elastic memory footprint, and the usage patterns of data access will balance the cost of memory with the cost of cache misses to optimize both the memory footprint of the cache and its contents. The paper proposes a lightweight machine learning algorithm that can make optimal decisions in real time for billions of QPS (Spanner is heavily used, it turns out).</p></li><li><p><a href="https://x.com/ibuildthecloud/status/1898585236981580158">Darren Shepherd</a>: I read the MCP spec and now my evening is ruined. Which intern designed this protocol?</p><ul><li><p>I thought MCP would be bad but not this bad. Like seriously, your supposed to maintain a HTTP connection? And JSON-RPC, what the heck. Who let the children play with python.</p></li></ul></li><li><p><a href="https://bsideup.github.io/posts/mcp_my_http-ass/">Sergei Egorov</a> (SergeiGPT): MCP is a protocol-not-protocol that allows LLMs to completely ignore the decades of well thought out APIs and instead force humans to write API wrappers and expose them via either unauthenticated STDIO or HTTP SSE without a single mention of the authentication methods (because that&#8217;s what all protocols do, right? 
right?&#8230;) and gives you &#8220;Best practices for securing your data within your infrastructure&#8221;.</p></li></ul><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.hotds.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.hotds.dev/subscribe?"><span>Subscribe now</span></a></p><h2>Blast from the past: The Architecture of Complexity (1962)</h2><p>After reading Warpstream&#8217;s recent blog post, <a href="https://www.warpstream.com/blog/a-trip-down-memory-lane-how-we-resolved-a-memory-leak-when-pprof-failed-us">A Trip Down Memory Lane: How We Resolved a Memory Leak When pprof Failed Us</a>, I was reminded of a wonderful paper from 1962, <a href="https://faculty.sites.iastate.edu/tesfatsi/archive/tesfatsi/ArchitectureOfComplexity.HSimon1962.pdf">The Architecture of Complexity</a>, by Herbert A. Simon. The theme linking the two in my subconscious was probably how distributed systems architectures can sometimes get taken down or degraded by seemingly isolated issues in one component (without effective controls for blast-radius); a local issue can eventually go global.</p><p>In the paper, Simon explores what different complex systems have in common, from atoms and molecules, to biology, to human organizations. He came to a number of conclusions, with the main one being that complex systems evolve into or from hierarchies.</p><blockquote><p><em>&#8220;If you ask a person to draw a complex object&#8211;e.g., a human face&#8211;he will almost always proceed in a hierarchic fashion. First he will outline the face. Then he will add or insert features: eyes, nose, mouth, ears, hair. If asked to elaborate, he will begin to develop details for each of the features&#8211;pupils, eyelids, lashes for the eyes, and so on&#8211;until he reaches the limits of his anatomical knowledge. 
His information about the object is arranged hierarchicly in memory, like a topical outline.&#8221;</em></p></blockquote><blockquote><p><em>&#8220;The central theme that runs through my remarks is that complexity frequently takes the form of hierarchy, and that hierarchic systems have some common properties that are independent of their specific content. Hierarchy, I shall argue, is one of the central structural schemes that the architect of complexity uses.&#8221;</em></p></blockquote><p>He goes further to propose the idea of Nearly Decomposable Systems.</p><blockquote><p><em>The main theoretical findings from the approach can be summed up in two propositions: (a) in a nearly decomposable system, the short-run behavior of each of the component subsystems is approximately independent of the short-run behavior of the other components; (b) in the long run, the behavior of any one of the components depends in only an aggregate way on the behavior of the other components.</em></p></blockquote><p>The core idea is that complex systems tend to organize into hierarchies with the following characteristics:</p><ol><li><p><strong>Hierarchical structure</strong>: Complex systems are composed of subsystems, which themselves contain smaller subsystems, and so on.</p></li><li><p><strong>Loose coupling between subsystems</strong>: The interactions among subsystems are weaker than the interactions within subsystems.</p></li><li><p><strong>Separation of timescales</strong>: Short-term dynamics happen primarily within subsystems, while longer-term dynamics involve interactions between subsystems.</p></li></ol><p>Simon's two key propositions summarize this:</p><ul><li><p>In the short run, subsystems behave approximately independently of each other.</p></li><li><p>In the long run, each subsystem's behavior depends on other subsystems only in an aggregate way.</p></li></ul><p>This complexity architecture offers significant advantages:</p><ul><li><p><strong>Stability</strong>: 
Disturbances in one subsystem don't immediately cascade throughout the entire system.</p></li><li><p><strong>Evolvability</strong>: Subsystems can adapt relatively independently.</p></li><li><p><strong>Robustness</strong>: Failures can often be contained within subsystems.</p></li><li><p><strong>Comprehensibility</strong>: The system can be understood by examining one level at a time.</p></li></ul><p>Simon argued that near decomposability is so prevalent because it provides evolutionary advantages&#8212;systems with this property can evolve more rapidly and maintain stability more effectively than systems with either complete independence (no coordination) or tight coupling (high fragility).</p><p>It&#8217;s funny reading this 63 years later, as we can recognize these properties in modern software architectures, although his theories are relevant across many types of system. Indeed, this paper is widely recognized as a seminal, even foundational paper (with over 11,000 citations), profoundly influential in multiple fields&#8212;computer science, economics, biology, and organizational theory.
Simon later won a Nobel Prize for his work in decision-making processes in economics.</p>]]></content:encoded></item><item><title><![CDATA[Humans of the Data Sphere Issue #9 March 11th 2025]]></title><description><![CDATA[Your biweekly dose of insights, observations, commentary and opinions from interesting people from the world of databases, AI, streaming, distributed systems and the data engineering/analytics space.]]></description><link>https://www.hotds.dev/p/humans-of-the-data-sphere-issue-9</link><guid isPermaLink="false">https://www.hotds.dev/p/humans-of-the-data-sphere-issue-9</guid><dc:creator><![CDATA[Jack Vanlightly]]></dc:creator><pubDate>Tue, 11 Mar 2025 15:33:11 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc50bf85b-a06e-4504-9d93-83d7c3c3419c_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome to Humans of the Data Sphere issue #9!</p><p>Apologies for being almost 2 weeks late! I got really busy and I don&#8217;t do task switching.</p><p>First, some lyrics by Alabama Shakes that hit me this week. Most powerfully sung in the intro of this <a href="https://youtu.be/aZFZwkJ62B0?si=0wNPQLpHG5NywZP7">remix</a>.</p><div class="pullquote"><p>My life, your life<br>Don't cross them lines<br>What you like, what I like<br>Why can't we both be right?<br>Attacking, defending<br>Until there's nothing left worth winning<br>Your pride and my pride<br>Don't waste my time&#8212;Alabama Shakes</p></div><p>Also, out of curiosity, are you a Jeep or a Ferrari (see below)? 
</p><div class="poll-embed" data-attrs="{&quot;id&quot;:285902}" data-component-name="PollToDOM"></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.hotds.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.hotds.dev/subscribe?"><span>Subscribe now</span></a></p><h2>Quotable Humans</h2><ul><li><p><a href="https://x.com/jorandirkgreef/status/1892109953608958252">Joran Dirk Greef</a>: Towards the D in ACID, how many DBMSs: - fsync() on commit - fsync() on opening the WAL - daisy chain checksums (cf. misdirected I/O) - open the WAL with O_DIRECT (cf. fsyncgate) - have 2 WALs (cf. Protocol-Aware Recovery) - don't trust the inode to get WAL size - test this?</p></li><li><p><a href="https://x.com/muratdemirbas/status/1892574274608603370">Murat Demirbas</a>: I am deeply skeptical of "hacks." There is no trick or unconventional approach that will quickly/magically improve your mastery. Nothing can replace the blood, sweat, and tears required to master any skill.</p></li><li><p><a href="https://x.com/rahulj51/status/1892813084898566195">Rahul Jain</a>: Apache Iceberg maintainers should probably read the 'Dont make me think' book.</p></li><li><p><a href="https://bsky.app/profile/charity.wtf/post/3lipk2jrjss2l">Charity Majors</a>: Tbh I'm not fond of the term "leader" bc it implies the existence of "followers". I prefer thinking about it in terms of agency and ownership, or autonomy and giving a shit. I've often jokingly said I owe my career to a lifelong overdeveloped sense of ownership.</p><ul><li><p>After I became a manager, I started to realize what a relief it is to have people on your team who feel personally accountable and on the hook for things. 
It's one less thing you have to worry about, when you can trust someone else is already worrying about it.</p></li></ul></li><li><p><a href="https://bsky.app/profile/dataders.bsky.social/post/3ljib6wuhks2t">Anders</a>: Arrow is the unsung hero of this project (and arguably all innovation in data ecosystem).</p></li><li><p><a href="https://bsky.app/profile/chris.blue/post/3linlbw2msk2r">Chris Riccomini</a>: We built data catalogs, service catalogs, schema registries, and information_schemas. They're all catalogs, and they're all converging.</p></li><li><p><a href="https://bsky.app/profile/chris.blue/post/3ljj3cnzzdc2z">Chris Riccomini</a>: I'm getting more and more excited about drop-in, frictionless infrastructure. Lots going on in observability, containers, cloud compute, etc. eBPF is part of it, so is monkey patching. Some interesting ones: </p><p>loopholelabs.io, polarsignals.com, junctionlabs.io &#65129;, subtrace.dev &#65129;</p></li><li><p><a href="https://bsky.app/profile/joygao.bsky.social/post/3ljc4c46qss26">Joy Gao</a>: For decades SQL Sages have been shaking heads at ORMs practitioners (impedance mismatch, N+1 queries, over-fetching of data, etc.). The SQL Sages also realize they can't change the devs' minds about ORMs (code has more programability after all), so they began to shift their focus on improving ORMs. The result is that ORMs kinda won in terms of interfaces. It became the magical layer that translates inefficient procedural code into db-optimized declarative SQL, performing fancy tasks like predicate pushdown, join optimizations, query deduping, caching, etc. Sounds familiar? oh yeah, because the database does all that too. So now we essentially depend on two abstraction layers to read/write data. 
It works most of the time, to the point we are willing to tolerate it when it blows up in our faces occasionally.</p></li><li><p><a href="https://x.com/muratdemirbas/status/1892937300050391414">Murat Demirbas</a>: A smart friend of mine put it perfectly. On his dream software team, he wants either Jeeps or Ferraris. Jeeps go anywhere. No roads, no directions&#8212;just point them at a problem, and they&#8217;ll tear through it. That&#8217;s effectual reasoning. Ferraris, on the other hand, need smooth roads and a clear destination. But once they have that, they move fast.</p></li><li><p><a href="https://muratbuffalo.blogspot.com/2025/02/my-time-at-mit.html">Murat Demirbas</a>: I was young, naive, and plagued by impostor syndrome. I held back instead of exploring more, engaging more deeply, and seeking out more challenges. I allowed myself to be carried along by the current, rather than actively charting my own course. Youth is wasted on the young. Why pretend to be smart and play it safe? True understanding is rare and hard-won, so why claim it before you are sure of it? Isn't it more advantageous to embrace your stupidity/ignorance and be underestimated? In research and academia, success often goes not to the one who understands first, but to the one who understands best. Even when speed matters, the real advantage comes from the deep, foundational insights that lead there.</p></li><li><p><a href="https://x.com/maheshb/status/1896989446249271721">Mahesh Balakrishnan</a>: System design has a core tenet that &#8220;complex things can be made simple with abstraction&#8221;. 
To a non-systems person, the complexity seems unavoidable, so often they just want you to get on with the job of shoveling the crap, instead of studying it contemplatively.</p></li><li><p><a href="https://x.com/geoffreylitt/status/1896682818362843159">Geoffrey Litt</a>: good designs expose systematic structure; they lean on their users' ability to understand this structure and apply it to new situations. we were born for this. bad designs paper over the structure with superficial labels that hide the underlying system, inhibiting their users' ability to actually build a clear model in their heads.</p></li><li><p><a href="https://x.com/emollick/status/1899192116443509139">Ethan Mollick</a>: A key to AI agents is an ability to self-correct. The world is full of odd errors and weirdness, and if AI can't recognize and problem-solve when it hits a wall, errors compound &amp; the agent is useless. &#8230; This is why I don't actually buy the "agents will always fail because AI is unreliable" argument. It is possible to imagine solutions where they are self-correcting (but we aren't there yet).</p></li><li><p><a href="https://x.com/emollick/status/1898930855319642300">Ethan Mollick</a>: Wrappers often amaze people because they don&#8217;t realize how much the current LLMs can do. That doesn&#8217;t mean the current LLMs are flawless, but they have more latent capabilities than people think.</p></li><li><p><a href="https://joereis.substack.com/p/the-tension-of-orthodoxy-and-speed">Joe Reis</a>: In an age of fast delivery, does orthodoxy matter anymore? I strongly believe it does. If you understand orthodoxy, you know why you&#8217;d want to follow the rules. You also understand what rules to break and why. There are lots of rules that are worth following. Orthodoxy exists for a reason. It&#8217;s well established because people have invested time (often decades) and money into it, honing and shaping its imperfections over time, which implies it works for many people. 
The relational and dimensional models have been implemented in production across countless companies of all sizes and industries. For the 99% of companies out there, these modeling approaches will work.</p></li><li><p><a href="https://blog.xiangpeng.systems/posts/system-researchers/">Xiangpeng Hao</a> <em>with a thought provoking post on the future of academic systems research</em>: System research is irrelevant. Industry has become the better place for meaningful systems work. Most impactful and innovative systems today come from companies, not universities. Industry has the money and patience to build complete systems. But most importantly, industry systems are accountable &#8211; systems that don&#8217;t deliver value get shut down quickly. This accountability creates a natural selection process. Industry systems must stay relevant or die. They evolve to meet real needs or disappear.</p></li><li><p><a href="https://medium.com/@kessler.viktor/supply-chains-already-solved-this-what-metadata-can-learn-f9c3b3525721">Viktor Kessler</a>: Just as a physical product needs packaging, tracking labels, storage, customs clearance, and optimized transportation routes before it reaches its destination, <strong>data relies on metadata</strong> to be discoverable, accessible, secure, and compliant. Without metadata, data would be like a package without a label &#8212; unidentifiable, misrouted, or stuck in transit.</p><p>In global supply chains, logistics ensure that raw materials turn into finished products and are delivered efficiently. 
Likewise, metadata governs how data is structured, stored, accessed, and moved through pipelines, ensuring it reaches the right consumer &#8212; whether that&#8217;s an analyst, an AI model, or an automated system &#8212; without delays or integrity issues.</p><p>Just as <strong>supply chains don&#8217;t function without logistics, data ecosystems don&#8217;t function without actionable metadata</strong>.</p></li><li><p><a href="https://blog.spiraldb.com/what-if-we-just-didnt-decompress-it/">Nicholas Gates, Joe Isaacs</a> <em>on <a href="https://github.com/spiraldb/vortex/">Vortex </a>file format capabilities</em>: <em>All</em> columnar file formats support the idea of <em>projection</em> push-down. That is, only reading columns that are requested by the reader or are required to evaluate any filter predicates.</p><p><em>Most</em> columnar file formats support the idea of <em>predicate</em> push-down. That is, using <a href="https://blog.spiraldb.com/zone-maps-or-queries-go-brrr/">Zone Maps</a> to prune large chunks of rows whose statistics provably fail the predicate. <a href="https://github.com/spiraldb/vortex/?ref=blog.spiraldb.com">Vortex</a> is unique in the way it evaluates filter and projection expressions by supporting full <em>compute</em> push-down, in many cases avoiding decompression entirely.</p></li><li><p><a href="https://arxiv.org/pdf/2502.03689">Stop treating &#8216;AGI&#8217; as the north-star goal of AI research</a>: <strong>Illusion of Consensus</strong>. Using shared term(s) in a way that gives a false impression of consensus about goals, despite goals being contested. The increasing popular use of the term &#8220;AGI&#8221; (Holland, 2025; Grossman, 2023; IBM, 2023) creates a sense of familiarity, giving the illusion that there is a shared understanding on what AGI is, and broad agreement on research goals in AGI development. 
However, there are vastly different opinions on what the term AGI refers to, what an AGI research agenda looks like, and what the goals in AGI development are. Left unchecked, this illusion obstructs explicit engagement on what the goals of AI research are and should be.</p></li><li><p><a href="https://bsky.app/profile/bidetofevil.wtf/post/3lji7skv2222h">Hanson Ho</a>: Write-time aggregation of metrics is a workaround you only do because the volume of data forces you to. If you have no such constraints, you're giving up all that delicious context for nothing. &#8230;The shift [<em>to wide events</em>] is fundamental bc it enables you to ask questions that you didn't anticipate when you instrumented, to figure out unknown unknowns. Picking dimensions to aggregate beforehand means you can monitor areas you already know is problematic - it doesn't allow the data to tell you things.</p></li><li><p><a href="https://www.linkedin.com/pulse/cutting-middle-management-costing-your-capacity-change-katie-leonard-pvfbc/">Katie Leonard</a>: Tom DeMarco&#8217;s book <em>Slack: Getting Past Burnout, Busywork, and the Myth of Total Efficiency</em> was published in 2001, during the aftermath of the dotcom bubble. In it, he reflects on a familiar corporate restructuring trend&#8212;the &#8220;Year of Efficiency&#8221; that was the decade of the 1990s. His thesis is simple but powerful:</p><p><em>"Slack is the degree of freedom required to effect change."</em></p><p>Companies that strip out their middle management layer in the name of efficiency are not just cutting costs&#8212;they&#8217;re cutting their ability to adapt. Without slack, organizations become rigid.
When everyone is constantly at capacity, no one has the bandwidth to innovate, reflect, or drive change.</p></li><li><p><a href="https://clickhouse.com/blog/hash-tables-in-clickhouse-and-zero-cost-abstractions">Maksim Kita</a> (<em>from 2023 but I stumbled on it this week and it&#8217;s a nice overview</em>): Hash tables require many design decisions at different levels and are subtly very complex data structures. Each of these design decisions has important implications for the hash table on its own but there are also ramifications from the way multiple design decisions play together. Mistakes during the design phase can make your implementation inefficient and appreciably slow down performance. A hash table consists of a hash function, a collision resolution method, a resizing policy, and various possibilities for arranging its cells in memory.</p></li><li><p><a href="https://timkellogg.me/blog/2025/03/06/pid-controllers">Tim Kellogg</a>: My hottest take is that multi-agents are a broken concept and should be avoided at all cost. My only caveat is PID controllers; A multi agent system that does a 3-step process that looks something like <strong>Plan, Act, Verify</strong> in a loop. 
That can work.</p></li></ul><h2>Interesting topic #1 - Applying the Hierarchy of Controls to software</h2><p>In his post, <a href="https://hillelwayne.com/post/hoc/">The Hierarchy of Hazard Control</a>, Hillel Wayne introduces the Hierarchy of Control concept, taken from the world of mechanical engineering.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0ZQU!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5da79f67-20ef-4351-82e8-dc99f99fe424_646x416.svg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!0ZQU!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5da79f67-20ef-4351-82e8-dc99f99fe424_646x416.svg 424w, https://substackcdn.com/image/fetch/$s_!0ZQU!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5da79f67-20ef-4351-82e8-dc99f99fe424_646x416.svg 848w, https://substackcdn.com/image/fetch/$s_!0ZQU!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5da79f67-20ef-4351-82e8-dc99f99fe424_646x416.svg 1272w, https://substackcdn.com/image/fetch/$s_!0ZQU!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5da79f67-20ef-4351-82e8-dc99f99fe424_646x416.svg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!0ZQU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5da79f67-20ef-4351-82e8-dc99f99fe424_646x416.svg" width="646" height="416" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5da79f67-20ef-4351-82e8-dc99f99fe424_646x416.svg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:416,&quot;width&quot;:646,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!0ZQU!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5da79f67-20ef-4351-82e8-dc99f99fe424_646x416.svg 424w, https://substackcdn.com/image/fetch/$s_!0ZQU!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5da79f67-20ef-4351-82e8-dc99f99fe424_646x416.svg 848w, https://substackcdn.com/image/fetch/$s_!0ZQU!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5da79f67-20ef-4351-82e8-dc99f99fe424_646x416.svg 1272w, https://substackcdn.com/image/fetch/$s_!0ZQU!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5da79f67-20ef-4351-82e8-dc99f99fe424_646x416.svg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Hillel asks &#8220;Can we use the Hierarchy of Controls in software engineering?&#8221; and goes on to recount his classic experience of being the developer that dropped the production database (many of us have been there).</p><blockquote><p>About ten years ago I was trying to debug an issue in production. I had an SSHed production shell and a local developer shell side by side, tabbed into the wrong one, and ran the wrong query.</p><p>That&#8217;s when I got a new lesson: how to restore a production database from backup.</p></blockquote><p>He then goes on to explain how using the Hierarchy of Controls (HoC), could have avoided this unfortunate incident. For example, for substitution, Hillel writes:</p><blockquote><p>For our problem I can see a couple of possible substitutions. We can substitute the production shell for a weaker shell. Consider if one &#8220;production&#8221; server could only see a read replica of the database. 
Delete queries would do nothing and even dropping the database wouldn&#8217;t lose data. Alternatively, we could use an immutable record system, like an <a href="https://martinfowler.com/eaaDev/EventSourcing.html">event source</a> model. Then &#8220;deleting data&#8221; takes the form of &#8220;adding deletion records to the database&#8221;. Accidental deletions are trivially reversible by adding more &#8220;undelete&#8221; records on top of them.</p></blockquote><p>I recommend the rest of the post, as it provides a new way of thinking about how we can put measures in place to avoid those nasty accidents in production. For my part, I dropped the production database on my first job, in a similar sequence of events. </p><h2>Interesting topic #2&#8212;Data Virtualization</h2><p>Virtualization is top of mind for me at the moment. I wrote <a href="https://jack-vanlightly.com/blog/2025/2/17/towards-composable-data-platforms">Towards Composable Data Platforms</a> which argues that the table formats enable platform composability due to how they enable table virtualization.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!NWdd!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa203d0fb-bd94-4398-b00e-fc8283d784f1_624x380.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!NWdd!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa203d0fb-bd94-4398-b00e-fc8283d784f1_624x380.png 424w, https://substackcdn.com/image/fetch/$s_!NWdd!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa203d0fb-bd94-4398-b00e-fc8283d784f1_624x380.png 848w, 
https://substackcdn.com/image/fetch/$s_!NWdd!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa203d0fb-bd94-4398-b00e-fc8283d784f1_624x380.png 1272w, https://substackcdn.com/image/fetch/$s_!NWdd!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa203d0fb-bd94-4398-b00e-fc8283d784f1_624x380.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!NWdd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa203d0fb-bd94-4398-b00e-fc8283d784f1_624x380.png" width="624" height="380" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a203d0fb-bd94-4398-b00e-fc8283d784f1_624x380.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:380,&quot;width&quot;:624,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:57723,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:&quot;https://www.hotds.dev/i/158754594?img=https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa203d0fb-bd94-4398-b00e-fc8283d784f1_624x380.png&quot;,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!NWdd!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa203d0fb-bd94-4398-b00e-fc8283d784f1_624x380.png 424w, https://substackcdn.com/image/fetch/$s_!NWdd!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa203d0fb-bd94-4398-b00e-fc8283d784f1_624x380.png 848w, 
https://substackcdn.com/image/fetch/$s_!NWdd!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa203d0fb-bd94-4398-b00e-fc8283d784f1_624x380.png 1272w, https://substackcdn.com/image/fetch/$s_!NWdd!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa203d0fb-bd94-4398-b00e-fc8283d784f1_624x380.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><blockquote><p>Virtualization in software refers to the creation of an abstraction layer that separates logical resources from their 
physical implementation. The abstraction may allow one physical resource to appear as multiple logical resources, or multiple physical resources to appear as a single logical resource.</p><p>At every layer of computing infrastructure - from storage arrays that pool physical disks into flexible logical volumes, to network overlays that create programmable topologies independent of physical switches, to device virtualization that allows hardware components to be shared and standardized - virtualization provides a powerful separation of concerns. This abstraction layer lets resources be dynamically allocated, shared, and managed with greater efficiency.</p></blockquote><p>At the heart of data virtualization is the separation of data storage from the metadata that describes it: how to access it, statistics about it, and so on. With this separation, we can use metadata to present virtualized forms of the same underlying data to different users, and even surface them on different platforms.</p><p>Virtualization is something we&#8217;ve been working on at Confluent. In fact, this issue of HOTDS is late because I&#8217;ve been head down writing both internal documentation for engineering teams and a public-facing deep dive on a new partition virtualization architecture we are slowly bringing to production to power our serverless workloads.
If you find the topic of data virtualization interesting, stay tuned: some Confluent engineers and I will be posting a deep dive about it soon.</p>]]></content:encoded></item><item><title><![CDATA[Humans of the Data Sphere Issue #8 February 15th 2025]]></title><description><![CDATA[Your biweekly dose of insights, observations, commentary and opinions from interesting people from the world of databases, AI, streaming, distributed systems and the data engineering/analytics space.]]></description><link>https://www.hotds.dev/p/humans-of-the-data-sphere-issue-8</link><guid isPermaLink="false">https://www.hotds.dev/p/humans-of-the-data-sphere-issue-8</guid><dc:creator><![CDATA[Jack Vanlightly]]></dc:creator><pubDate>Sat, 15 Feb 2025 09:45:19 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc50bf85b-a06e-4504-9d93-83d7c3c3419c_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome to Humans of the Data Sphere issue #8!</p><div class="pullquote"><blockquote><p><strong>"True stability results when presumed order and presumed disorder are balanced. A truly stable system expects the unexpected, is prepared to be disrupted, waits to be transformed."</strong> &#8212; <em>Tom Robbins</em></p></blockquote></div><h2>Quotable Humans</h2><ul><li><p><a href="https://charity.wtf/2025/02/10/corporate-dei-is-an-imperfect-vehicle-for-deeply-meaningful-ideals/">Charity Majors</a>: Honestly, I can&#8217;t think of anything less meritocratic than simply receiving and replicating all of society&#8217;s existing biases. Do you have any idea how much talent gets thrown away, in terms of unrealized potential?</p></li><li><p><a href="https://x.com/isamlambert/status/1885914329905832429">Sam Lambert</a>: we have single PlanetScale databases on our cloud that are powered by 20,000 cores. all through a single connection string.
shared nothing architecture is insanely powerful.</p></li><li><p><a href="https://x.com/BonesMoses/status/1889084003380023361">Shaun Thomas</a>: I won't go so far as to say SQL NULL is useless, but there's no direct analog in any language I know of, and you have to work around it literally everywhere using special syntax like IS NOT NULL or IS DISTINCT FROM. Even worse, SQL has no analog for 'empty' either.</p></li><li><p><a href="https://skiplabs.io/blog/why-skip">Julien Verlaguet</a>: Incremental computation will only become mainstream if the dev and ops time experience is simpler and easier than the more common request/response paradigm, not just faster &amp; continuous.</p></li><li><p><a href="https://bsky.app/profile/marcbrooker.bsky.social/post/3li3hmyfy422r">Marc Brooker</a>: One challenge of handling partial/gray failures in distributed systems is telling 'healthy' from 'unhealthy'. Even in terms of error rate, it can take a surprisingly high number of samples to differentiate between normal and abnormal hosts.</p></li><li><p><a href="https://x.com/maheshb/status/1889765418346086500">Mahesh Balakrishnan</a>: Something that&#8217;s lost in &#8220;do we really need another new system?&#8221; debates is that the building muscle of an org atrophies if you stop rewriting systems from scratch. (and when you really need to build something new, you no longer have the culture / expertise for 0 to 1 efforts).</p></li><li><p><a href="https://bsky.app/profile/sap1ens.com/post/3lhujrwtcrs2r">Yaroslav Tkachenko</a>: In 2025, most blockchains are not treated as databases. They're primarily virtual machines (<a href="https://ethervm.io/">ethervm.io</a>) that allow you to "program money".</p></li><li><p><a href="https://x.com/maheshb/status/1887714979937468889">Mahesh Balakrishnan</a>: Sometime in 2004 in his office in Upson Hall, <a href="https://x.com/KenBirman">@KenBirman</a> explained the elegance of virtual synchrony to me in very similar terms. 
When there are no failures, your group is humming along; then something fails and borks the group; and you hit the group with a hammer and seal / flush / fence it; and switch to an entirely different group. <a href="https://x.com/dahlia_malkhi">@dahlia_malkhi</a> formalized this notion beautifully in the Vertical Paxos papers. Virtual Consensus simply extended the idea to shared logs. On the shoulders of giants, as they say!</p></li><li><p><a href="https://bsky.app/profile/rakyll.org/post/3lh5cn6nrss2j">Jaana Dogan</a>: Decentralize everything. Open things where possible. Everyone wins.</p></li><li><p><a href="https://bsky.app/profile/rakyll.org/post/3lh5cliy22222">Jaana Dogan</a>: The biggest winner of the AI race will be distributed systems people. Everything is converging onto a distributed network of stuff and it is only accelerating in the last two years.</p></li><li><p><a href="https://bsky.app/profile/wyattwoodsen.bsky.social/post/3lhjkofo25c2t">Wyatt Woodsen</a> <em>(commenting on a thread about em dash and ChatGPT)</em>: I once had a data pipeline issue related to the inclusion of em dashes due to em dashes in roughly half of the output. Eventually traced it back to the keyboard configuration of an offshore dev team in one particular office. God was that a pain to trouble shoot and I swore from then ONLY HYPHENS</p></li><li><p><a href="https://x.com/jamesacowling/status/1886868553355427843">James Cowling</a>: Senior engineers are good because they leverage conceptual building blocks that are extensible and composable over long time horizons. LLMs can't currently replicate the performance of a strategic senior engineer but will get there by leveraging great abstractions. 
We've been running a lot of head-to-head benchmarks at <a href="https://x.com/convex_dev">@convex_dev</a> and our experience is that LLMs do *much* better at building real applications when working with higher level abstractions with strong guarantees, rather than reinventing an entire stack.</p></li><li><p><a href="https://x.com/jamesacowling/status/1884838667287662830">James Cowling</a>: More developers need to write comments at the top of source files saying what the file is actually for. If you can't write a really short comment explaining it you probably haven't thought hard enough about the structure of your codebase.</p></li><li><p><a href="https://x.com/emollick/status/1888302322964394020">Ethan Mollick</a>: Pre-training really was hitting a wall of sorts: diminishing returns (which is what the &#8220;scaling law&#8221;predicts anyway) The fact that reasoners were developed at exactly that moment &amp; are nowhere close to a wall is how Moore&#8217;s Law works: new technique appear to maintain the trend</p></li><li><p><a href="https://timkellogg.me/blog/2025/02/03/s1">Tim Kellogg</a>: Going forward, it&#8217;ll be nearly impossible to prevent distealing (unauthorized distilling). One thousand examples is definitely within the range of what a single person might do in normal usage, no less ten or a hundred people. I doubt that OpenAI has a realistic path to <strong>preventing</strong> or even <strong>detecting</strong> distealing outside of simply not releasing models.</p></li><li><p><a href="https://maheshba.bitbucket.io/blog/2025/02/09/2025-skunks.html">Mahesh Balakrishnan</a>: Run towards risk. In skunkworks mode, the goal is to reduce technical risk as quickly as possible. Accordingly, the team has to surge on areas where risk is high. 
Fight the temptation to make steady progress on well-understood, low-risk parts of the system.</p></li><li><p><a href="https://muratbuffalo.blogspot.com/2025/01/intelligence-wants-to-be-everywhere.html">Murat Demirbas</a> (<em>on intelligence everywhere</em>): Purpose will be the driving force. Objects that serve a meaningful role will thrive, while those that drift into nihilism (like Marvin, the depressed robot from The Hitchhiker&#8217;s Guide to the Galaxy) will be phased out. Intelligence will seek to create value, not just exist for its own sake.</p></li><li><p><a href="https://materializedview.io/p/epsio-ivms-differential-dataflow">Gilad Kleinman</a> discusses IVM and Epsio with Chris Riccomini: </p><ul><li><p>Other than the fact most managed database offerings (<a href="https://aws.amazon.com/rds/">RDS</a>, <a href="https://cloud.google.com/sql">Cloud SQL</a>, and so on) don't allow users to install unauthorized extensions, adopting new database technologies is a pretty scary endeavor. We found that asking companies to install a new extension (that could potentially crash) on their production database was a pretty big ask to make. By sitting "behind" the existing database, reading CDC logs, and writing back results to the original database, users can integrate Epsio without worrying about affecting anything other than the results tables it needs to maintain. We even actively recommend not giving Epsio permissions to anything other than that.</p></li><li><p>In classic "streaming" use cases, the main benefit of IVM was the ease of writing SQL rather than writing custom code. In the use cases above, the benefits are more about query performance and cost&#8212;how easy it is to deliver performant, cost-effective queries. No matter how fast or efficient a traditional database is, if you are running a heavy query and most of the dataset hasn&#8217;t changed since the last run, there is a lot of wasted compute. 
This translates into either higher cost, higher latency, or both.</p></li></ul></li><li><p><a href="https://x.com/ifesdjeen/status/1876213377585799432">Alex Petrov</a>: In my experience, using LLMs for paper reading is only useful for querying and digging in, but never for summarization. Things LLM (Claude in my case) would suggest are almost never something I would find useful or interesting, or going too far beyond what authors have already put in the abstract.</p></li><li><p><a href="https://sympathetic.ink/2025/02/13/Bottom-up-Architecture.html">Julien Le Dem</a> <em>on the human side of software architecture</em>:</p><ul><li><p>When decisions are made by the people who best understand the systems - and who also will be responsible for the consequences of those decisions, creating a more virtuous cycle of incentives - there are drawbacks that result directly from the decentralization of decision making. If you just leave every team to their own devices to independently make decisions without coordination, they are unlikely to just naturally all reach the same conclusion on what problems we&#8217;re solving or who is solving what part. There is going to be a level of chaos that needs to be managed.</p></li><li><p>Since software architecture is very different from regular architecture, we don&#8217;t actually need a role that centralizes drawing exhaustive and precise plans to be followed closely. We do need people facilitating alignment amongst teams to manage and limit the increase of complexity caused by decentralized decision making. 
Whether you call these people software architects or some other senior engineering title doesn&#8217;t really matter.</p></li></ul></li><li><p><a href="https://www.oneusefulthing.org/p/the-end-of-search-the-beginning-of">Ethan Mollick</a> <em>on the current limitations of general-purpose AI agents</em>: Then the troubles begin, and they're twofold: not only is Operator blocked by OpenAI's security restrictions on file downloads, but it also starts to struggle with the task itself. The agent methodically tries every conceivable workaround: copying to clipboard, generating direct links, even diving into the site's source code. Each attempt fails - some due to OpenAI's browser restrictions, others due to the agent's own confusion about how to actually accomplish the task. Watching this determined but ultimately failed problem-solving loop reveals both the current limitations of these systems and raises questions about how agents will eventually behave when they encounter barriers in the real world.</p></li><li><p><a href="https://www.streamingdata.tech/p/exploring-apache-datafusion-streaming-framework">Yaroslav Tkachenko</a> <em>explores building a stream processing framework with DataFusion</em>: DataFusion is designed as a pull-based engine. Conceptually, it means that each operator runs a tight loop that pulls data from the upstream sources. In practice, DataFusion uses <a href="https://tokio.rs/tokio/tutorial/streams">Tokio Streams</a>. I want to highlight two observations:</p><ul><li><p>Tokio Stream (kinda like an iterator of Futures) is the primary abstraction, even when it comes to bounded sources (e.g. reading a bunch of Parquet files).</p></li><li><p>Pull-based execution doesn&#8217;t offer much control over backpressure. This makes it very different from Apache Flink, which can offer reliable backpressure, fine-grained flow control and adaptive buffers between operators. 
These things are not as important in the context of a query engine (whose goal is to read a bunch of files as fast as possible), but they do matter a lot for a streaming engine.</p></li></ul></li><li><p><a href="https://bsky.app/profile/alexmillerdb.bsky.social/post/3lhu4j55p7u25">Alex Miller</a>: The PolarDB-X paper makes a decent deal about using HLCs because having a timestamp service is a SPOF and perf bottleneck, but then the public docs for PolarDB-X exclusively describe the use of a TSO as the PolarDB-X architecture. So... not too much of a problem after all?</p></li><li><p><a href="https://bsky.app/profile/ananthdurai.bsky.social/post/3lgr7drqcky2h">Ananth Packkildurai</a>: My null hypothesis is as the number of configurations increases, the reliability of the software decreases. I wonder if there are any papers/studies published on this?</p></li><li><p><a href="https://surfingcomplexity.blog/2025/02/01/youre-missing-your-near-misses/">Lorin Hochstein</a>: The real challenge is preventing and quickly mitigating <em>novel</em> future incidents, which is the overwhelming majority of your incidents. And that brings us to near misses, those operational surprises that have no actual impact, but could have been a major incident if conditions were slightly different. Think of them as precursors to incidents. Or, if you are more poetically inclined, <em>omens</em>.</p></li><li><p><a href="https://blog.cloudflare.com/cloudflare-incident-on-february-6-2025/">Cloudflare </a><em>incident write-up. We&#8217;ve all had that sinking feeling when you realize you just dropped the production database. This is a horrifying example of  a <strong>"wait... what did I just do?"</strong> moment at scale</em>: During a routine abuse remediation, action was taken on a complaint that inadvertently disabled the R2 Gateway service instead of the specific endpoint/bucket associated with the report. 
This was a failure of multiple system level controls (first and foremost) and operator training.</p></li></ul><h2>Interesting topic #1 - Systems Correctness Practices at AWS</h2><p>Marc Brooker and Ankush Desai wrote an article called <a href="https://dl.acm.org/doi/pdf/10.1145/3712057">Systems Correctness Practices at AWS</a>. In it, they describe the practices AWS employs to gain confidence in its services and to find bugs in them.</p><p>The list includes:</p><ul><li><p><strong>Formal Verification</strong>. AWS started with TLA+ but has also made a big investment in the P programming language. P is a state-machine formal verification language that engineers typically find a lot easier to get started with than TLA+. Since 2019, P has been a strategic open-source project used in key AWS services like S3, EBS, DynamoDB, Aurora, and EC2 to ensure system correctness. One major success story was S3&#8217;s migration from eventual to strong read-after-write consistency, where P helped validate protocol changes and catch design-level bugs early.</p></li><li><p><strong>Lightweight Formal Methods&#8212;Property-based testing and reference models</strong>.
This approach combines &#8220;<em>property-based testing with developer-provided correctness specifications</em>&#8221;. The primary example of the technique is Amazon S3's ShardStore (a key-value storage node), where engineers developed an executable reference model as a specification and used property-based testing to validate the implementation against the model. This method successfully prevented issues such as subtle crash consistency and concurrency problems from reaching production.</p></li><li><p><strong>Lightweight Formal Methods&#8212;Deterministic simulation (DS)</strong>. This is a software testing technique in which a distributed system is validated using property-based testing, while a simulated environment controls for non-determinism. In the real world, these systems experience a lot of non-determinism, so DS needs to control factors like timing, concurrency, and external inputs. The DS framework and the code under test work together to control these factors. For randomness, the framework provides a fixed seed for random number generators. For concurrency, thread scheduling and interleaving are explicitly controlled. For time-dependent behavior, the framework replaces system clocks with mocked or logical clocks. External resources such as network and disk are simulated. This is all so developers can run randomized tests while still being able to reproduce bugs consistently. Because everything is simulated, it also allows for more precise fault injection. One key aspect noted by Marc and Ankush is that the value of this type of testing lies in the fast feedback it provides: &#8220;<em>Deterministic simulation testing moves testing of system properties, like behavior under delay and failure, closer to build time instead of integration testing</em>&#8221;.</p></li><li><p><strong>Lightweight Formal Methods&#8212;Continuous fuzzing or random test-input generation</strong>.
&#8220;<em>First, by fuzzing SQL queries (and entire transactions), we validated that the logic partitioning SQL execution over shards is correct. Large volumes of random SQL schemas, datasets, and queries are synthesized and run through the engines under test, and the results compared with an oracle based on the nonsharded version of the engine (as well as other approaches to validation, like those pioneered by SQLancer)</em>&#8221;</p></li></ul><p>The article also discusses Fault Injection as a Service and testing for metastable failures. This latter point is a particular interest of mine. A metastable failure is one where &#8220;<em>some triggering event (like an overload or a cache emptying) causes a distributed system to enter a state where it doesn&#8217;t recover without intervention (such as reducing load below normal).</em>&#8221;</p><p>Marc and Ankush note that &#8220;<em>Traditional formal approaches to modeling distributed systems typically focus on safety (nothing bad happens) and liveness (something good eventually happens), but metastable failures remind us that systems have a variety of behaviors that cannot be neatly categorized this way. We have increasingly turned to discrete-event simulation to understand the emergent behavior of systems, investing both in custom-built systems simulations and tooling that allow the use of existing system models (built in languages like TLA+ and P) to simulate system behavior.</em>&#8221;</p><p>It&#8217;s an area I have dabbled in using simple simulations, such as finding pathological workloads for a proposed distributed rate-limiting algorithm and problematic liveness properties of a cooperative resource allocation algorithm.
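</p><p>To illustrate the core trick shared by deterministic simulation and discrete-event simulation, here is a toy simulator of my own (not AWS tooling; all names invented): a logical clock, an ordered event queue, and a single seeded random number generator. Every run with the same seed replays exactly the same interleaving, so any bug a randomized run uncovers can be replayed:</p>

```python
import heapq
import random

class Sim:
    """A tiny discrete-event simulator: a logical clock plus an event queue.
    All randomness flows through one seeded RNG, so runs are reproducible."""
    def __init__(self, seed):
        self.now = 0.0
        self.rng = random.Random(seed)
        self._queue = []
        self._seq = 0  # tie-breaker so equal timestamps pop deterministically

    def schedule(self, delay, fn):
        heapq.heappush(self._queue, (self.now + delay, self._seq, fn))
        self._seq += 1

    def run(self):
        while self._queue:
            self.now, _, fn = heapq.heappop(self._queue)
            fn()

def trace_of(seed):
    # Simulate three clients whose messages arrive after random network delay.
    sim = Sim(seed)
    trace = []
    for client in ("a", "b", "c"):
        delay = sim.rng.uniform(0.0, 10.0)  # simulated latency, not wall time
        sim.schedule(delay, lambda c=client: trace.append((round(sim.now, 3), c)))
    sim.run()
    return trace

# Same seed, same interleaving: the property that makes bugs replayable.
assert trace_of(42) == trace_of(42)
```

<p>Because time is a variable rather than a wall clock, the simulator can explore hours of simulated behavior, including delays and failures, in milliseconds at build time.</p><p>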
I hope to see more findings published in the future about discrete-event simulation in the context of distributed systems engineering.</p><h2>Interesting topic #2 - Husky: Efficient compaction at Datadog scale</h2><p>Datadog published a blog post, <a href="https://www.datadoghq.com/blog/engineering/husky-storage-compaction/">Husky: Efficient compaction at Datadog scale</a>, detailing how Husky (their event store) performs compaction. The authors frame the problem as a Goldilocks problem: a set of competing concerns, which can be broadly generalized as write optimization versus read optimization, must be balanced.</p><blockquote><p>Ensuring that this compaction system gives us the performance we need is all about finding the right fragment size that maintains a good balance among the following concerns:</p><ul><li><p>For object storage fetches and metadata storage, fewer fragments are better.</p></li><li><p>For compaction, less work is better, both for CPU used in compaction and the number of PUT/GET requests to object storage.</p></li><li><p>For queries, where each fragment relevant to the query is read concurrently by a pool of query workers, the trade-off is a somewhat complex one between having fewer, larger fragments while maintaining high parallelism. We&#8217;re balancing the speed at which a single worker can scan the rows in a fragment with the overhead of distributing a query to many workers, which can scan many fragments in parallel for larger queries. There is an optimal fragment size to balance between scan speed and distribution overhead.</p></li><li><p>Compaction can affect storage layout in a positive way, but at the cost of doing more work.
Given a particular common query pattern, events can be <a href="https://www.datadoghq.com/blog/engineering/husky-storage-compaction/#sort-schema">laid out in the fragments</a>, both in the time dimension and in spatial dimensions (i.e., by tags), so that events that would be relevant to a given query can be close together. Keeping similar events close together improves compression, but it is in tension with &#8220;less work is better&#8221; as compaction will work harder to achieve this layout, and some analysis of queries must be done to determine the layout for the best system-wide outcome.</p></li></ul><p>In short, fragments that are &#8220;too small&#8221; are inefficient for queries because many small fetches pay an object storage latency penalty and aren&#8217;t as efficiently processed by the query workers, which implement vectorized execution to scan many rows quickly. Fragments that are &#8220;too large&#8221; drive down parallelism for larger queries, causing those queries to take longer. Compaction attempts to find a fragment size that is just right for typical query patterns to minimize query cost and latency, while at the same time keeping similar data together.</p></blockquote><p>The post outlines a number of interesting aspects to compaction:</p><ul><li><p>Row group size</p></li><li><p>Time bucketing</p></li><li><p>Size-tiered compaction</p></li><li><p>Sort schemas</p></li><li><p>Locality compaction</p></li><li><p>Pruning</p></li></ul><p>Locality compaction and pruning in Husky are similar to the practice of partitioning and clustering in other analytics systems, such as the open table formats (Iceberg/Delta). The aim is for queries to prune data files based on metadata such as key ranges. 
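</p><p>A minimal sketch of the idea (illustrative only; the fragment names and metadata fields here are invented, not Husky&#8217;s actual schema): each fragment records its minimum and maximum row key, and the planner keeps only fragments whose key range overlaps the query&#8217;s range, skipping the rest without ever fetching them:</p>

```python
from typing import List, NamedTuple

class Fragment(NamedTuple):
    path: str
    min_key: str  # per-fragment metadata recorded at write/compaction time
    max_key: str

def prune(fragments: List[Fragment], lo: str, hi: str) -> List[Fragment]:
    """Keep only fragments whose [min_key, max_key] range overlaps the
    query range [lo, hi]; everything else is skipped at planning time."""
    return [f for f in fragments if f.max_key >= lo and f.min_key <= hi]

fragments = [
    Fragment("frag-0", "apple", "cherry"),
    Fragment("frag-1", "date", "fig"),
    Fragment("frag-2", "grape", "melon"),
]

# A query over keys in ["e", "h"] only needs the middle two fragments.
hit = prune(fragments, "e", "h")
```

<p>This is also why locality compaction pays off: the tighter each fragment&#8217;s key range, the more fragments the overlap test can rule out.</p><p>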
By co-locating data of the same key range in the same file, files can be more aggressively skipped during query planning.</p><blockquote><p>&#8220;As the levels increase exponentially in size, while the size of the fragments is held constant, at higher and higher levels, each individual fragment&#8217;s minimum and maximum row keys are &#8220;closer together&#8221; than those at lower levels. There is a higher chance we can prune these high level fragments as the lexical space each one covers is smaller relative to those at lower levels.&#8220;</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!JSsM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F660c69e2-05ee-4216-8cf6-9f4b750a0ae5_1694x1055.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!JSsM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F660c69e2-05ee-4216-8cf6-9f4b750a0ae5_1694x1055.png 424w, https://substackcdn.com/image/fetch/$s_!JSsM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F660c69e2-05ee-4216-8cf6-9f4b750a0ae5_1694x1055.png 848w, https://substackcdn.com/image/fetch/$s_!JSsM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F660c69e2-05ee-4216-8cf6-9f4b750a0ae5_1694x1055.png 1272w, https://substackcdn.com/image/fetch/$s_!JSsM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F660c69e2-05ee-4216-8cf6-9f4b750a0ae5_1694x1055.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!JSsM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F660c69e2-05ee-4216-8cf6-9f4b750a0ae5_1694x1055.png" width="1456" height="907" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/660c69e2-05ee-4216-8cf6-9f4b750a0ae5_1694x1055.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:907,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:317247,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!JSsM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F660c69e2-05ee-4216-8cf6-9f4b750a0ae5_1694x1055.png 424w, https://substackcdn.com/image/fetch/$s_!JSsM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F660c69e2-05ee-4216-8cf6-9f4b750a0ae5_1694x1055.png 848w, https://substackcdn.com/image/fetch/$s_!JSsM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F660c69e2-05ee-4216-8cf6-9f4b750a0ae5_1694x1055.png 1272w, https://substackcdn.com/image/fetch/$s_!JSsM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F660c69e2-05ee-4216-8cf6-9f4b750a0ae5_1694x1055.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h2>Interesting topic #3 - On solving for the distributed case</h2><p>Recently, I&#8217;ve been exploring the different ways of <a href="https://jack-vanlightly.com/blog/2025/2/10/how-to-disaggregate-a-log-replication-protocol">disaggregating log replication protocols</a>, using Raft as the most famous example of the converged protocol. I&#8217;ll soon be releasing a survey of log replication protocols and real-world systems through the lens of different types of disaggregation.</p><p>While I was compiling some of the quotes for this issue, I stumbled on this post by Jaana Dogan.
</p><div class="bluesky-wrap outer" style="height: auto; display: flex; margin-bottom: 24px;" data-attrs="{&quot;postId&quot;:&quot;3lfxnjfvoek27&quot;,&quot;authorDid&quot;:&quot;did:plc:cecwhurgvg6v5w26o4zjzhjx&quot;,&quot;authorName&quot;:&quot;Jaana Dogan &#12516;&#12490; &#12489;&#12460;&#12531;&quot;,&quot;authorHandle&quot;:&quot;rakyll.org&quot;,&quot;authorAvatarUrl&quot;:&quot;https://cdn.bsky.app/img/avatar/plain/did:plc:cecwhurgvg6v5w26o4zjzhjx/bafkreibcoaux7y7nkpnswbh6am2tlva75jycycyjw7z4j457x5p5r6hq3i@jpeg&quot;,&quot;text&quot;:&quot;When you have a highly ambiguous systems problem, try to solve it for the distributed case first. When you solve it for the distributed case, it's easy to pack things together in a monolith. The opposite is almost always impossible.&quot;,&quot;createdAt&quot;:&quot;2025-01-17T20:36:49.522Z&quot;,&quot;uri&quot;:&quot;at://did:plc:cecwhurgvg6v5w26o4zjzhjx/app.bsky.feed.post/3lfxnjfvoek27&quot;,&quot;imageUrls&quot;:[]}" data-component-name="BlueskyCreateBlueskyEmbed"><iframe id="bluesky-3lfxnjfvoek27" data-bluesky-id="4418320178132651" src="https://embed.bsky.app/embed/did:plc:cecwhurgvg6v5w26o4zjzhjx/app.bsky.feed.post/3lfxnjfvoek27?id=4418320178132651" width="100%" style="display: block; flex-grow: 1;" frameborder="0" scrolling="no"></iframe></div><p>Already thinking about separating protocols into different abstractions and components, I immediately saw the parallels to Paxos and Raft.</p><div class="bluesky-wrap outer" style="height: auto; display: flex; margin-bottom: 24px;" data-attrs="{&quot;postId&quot;:&quot;3li55tdxrys2j&quot;,&quot;authorDid&quot;:&quot;did:plc:e75f7xrf3rzpfhej3cdjwnbw&quot;,&quot;authorName&quot;:&quot;Jack Vanlightly&quot;,&quot;authorHandle&quot;:&quot;vanlightly.bsky.social&quot;,&quot;authorAvatarUrl&quot;:&quot;https://cdn.bsky.app/img/avatar/plain/did:plc:e75f7xrf3rzpfhej3cdjwnbw/bafkreig4ydkdtvgdgwanfdwix64uy57cvd6c7z2ezae2aywgj7fapfzq6i@jpeg&quot;,&quot;text&quot;:&quot;This reminds 
me of Paxos vs Raft. Paxos formalized the responsibilities of reaching consensus and acting on the agreed values into *distinct roles* (proposer, acceptor, learner). These roles can be put in a monolith or distributed. The basis for the Paxos family of protocols are these roles.&quot;,&quot;createdAt&quot;:&quot;2025-02-14T12:02:24.856Z&quot;,&quot;uri&quot;:&quot;at://did:plc:e75f7xrf3rzpfhej3cdjwnbw/app.bsky.feed.post/3li55tdxrys2j&quot;,&quot;imageUrls&quot;:[]}" data-component-name="BlueskyCreateBlueskyEmbed"><iframe id="bluesky-3li55tdxrys2j" data-bluesky-id="20459502700581678" src="https://embed.bsky.app/embed/did:plc:e75f7xrf3rzpfhej3cdjwnbw/app.bsky.feed.post/3li55tdxrys2j?id=20459502700581678" width="100%" style="display: block; flex-grow: 1;" frameborder="0" scrolling="no"></iframe></div><p>Quoting my own blog post on protocol disaggregation:</p><blockquote><p>Paxos made a fundamental contribution to distributed consensus by formalizing the responsibilities of reaching consensus and acting on the agreed values into distinct roles. Paxos separates the consensus protocol into <strong>proposers</strong> who drive consensus by proposing values to acceptors, <strong>acceptors</strong> who form the quorum necessary for reaching agreement (consensus), and <strong>learners</strong> who need to know the decided values. This creates a clear framework that allows system designers to reason about each role's responsibilities independently while ensuring their interaction maintains safety and liveness properties. The formalization of these roles has influenced the design of practical systems and protocols for decades, even when they don't strictly adhere to the original Paxos model. This cannot be understated.</p></blockquote><p>Raft is a prescriptive, implementation-focused consensus algorithm (which is why it became popular initially). Paxos on the other hand was not prescriptive and focused more on identifying discrete roles and responsibilities. 
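</p><p>To make the separation of roles concrete, here is a toy single-decree sketch (heavily simplified: synchronous method calls stand in for messages, learners and retries are omitted, and the class and method names are my own). The point is that proposer and acceptor are distinct components that can be co-located in one process, as here, or each placed behind an RPC boundary without changing the protocol:</p>

```python
class Acceptor:
    """Forms the quorum: remembers the highest ballot promised and
    whatever (ballot, value) pair it has accepted so far."""
    def __init__(self):
        self.promised = -1
        self.accepted = None  # (ballot, value) or None

    def prepare(self, ballot):
        if ballot > self.promised:
            self.promised = ballot
            return ("promise", self.accepted)
        return ("reject", None)

    def accept(self, ballot, value):
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return "accepted"
        return "rejected"

class Proposer:
    """Drives consensus: phase 1 gathers promises, phase 2 proposes a value."""
    def __init__(self, ballot, acceptors):
        self.ballot = ballot
        self.acceptors = acceptors

    def propose(self, value):
        quorum = len(self.acceptors) // 2 + 1
        replies = [a.prepare(self.ballot) for a in self.acceptors]
        grants = [acc for tag, acc in replies if tag == "promise"]
        if len(grants) < quorum:
            return None
        # Safety rule: adopt the highest previously accepted value, if any.
        prior = max((acc for acc in grants if acc), default=None)
        chosen = prior[1] if prior else value
        acks = [a.accept(self.ballot, chosen) for a in self.acceptors]
        return chosen if acks.count("accepted") >= quorum else None

acceptors = [Acceptor(), Acceptor(), Acceptor()]
decided = Proposer(ballot=1, acceptors=acceptors).propose("v1")
```

<p>A later proposer with a higher ballot is forced by the promises it gathers to re-propose the already-accepted value, which is the mechanism that keeps the decision stable whether the roles share a process or a network.</p><p>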
This has led to a plethora of variants and diverse implementations. We can see great examples of how this focus on abstractions and roles led to more creative and flexible implementations of consensus. One example of this that I like is <a href="https://jack-vanlightly.com/analyses/2023/11/15/neon-serverless-postgresql-asds-chapter-3">Neon&#8217;s distributed write-ahead-log</a>.</p><p>Going back to what Jaana said, the way I would riff on it would be:</p><p><em>When you have a highly ambiguous systems problem, try to solve it in terms of modular abstractions, roles and responsibilities first. Once you identify these abstractions and roles, you can choose whether to pack those together in a monolith or deploy them as a distributed set of components. Starting with the mixed-up monolith, and later attempting to separate it out into clean modular abstractions, is almost always impossible.</em></p><p>I think it&#8217;s a philosophy always worth keeping in mind.</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.hotds.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Humans of the Data Sphere!
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Humans of the Data Sphere Issue #7 January 29th 2025]]></title><description><![CDATA[Your biweekly dose of insights, observations, commentary and opinions from interesting people from the world of databases, AI, streaming, distributed systems and the data engineering/analytics space.]]></description><link>https://www.hotds.dev/p/humans-of-the-data-sphere-issue-7</link><guid isPermaLink="false">https://www.hotds.dev/p/humans-of-the-data-sphere-issue-7</guid><dc:creator><![CDATA[Jack Vanlightly]]></dc:creator><pubDate>Wed, 29 Jan 2025 13:35:28 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc50bf85b-a06e-4504-9d93-83d7c3c3419c_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome to Humans of the Data Sphere issue #7!</p><p>First, the meme of the issue:</p><div class="bluesky-wrap outer" style="height: auto; display: flex; margin-bottom: 24px;" data-attrs="{&quot;postId&quot;:&quot;3ldtlxbzbjs27&quot;,&quot;authorDid&quot;:&quot;did:plc:dmjdtxiwote53wmj4zaxazfo&quot;,&quot;authorName&quot;:&quot;Niki Tonsky&quot;,&quot;authorHandle&quot;:&quot;tonsky.me&quot;,&quot;authorAvatarUrl&quot;:&quot;https://cdn.bsky.app/img/avatar/plain/did:plc:dmjdtxiwote53wmj4zaxazfo/bafkreihqnp5wij34tn2vkfa7i4iz46zh57uot25iojvudpgeb4rfdunk2y@jpeg&quot;,&quot;text&quot;:&quot;I propose we replace semantic versioning with pride 
versioning&quot;,&quot;createdAt&quot;:&quot;2024-12-21T19:07:45.514Z&quot;,&quot;uri&quot;:&quot;at://did:plc:dmjdtxiwote53wmj4zaxazfo/app.bsky.feed.post/3ldtlxbzbjs27&quot;,&quot;imageUrls&quot;:[&quot;https://cdn.bsky.app/img/feed_thumbnail/plain/did:plc:dmjdtxiwote53wmj4zaxazfo/bafkreiggki426moi5usbdb4dzwx2a5f2l7akjaw77ctziwcugqbx5lrjzy@jpeg&quot;]}" data-component-name="BlueskyCreateBlueskyEmbed"><iframe id="bluesky-3ldtlxbzbjs27" data-bluesky-id="9522643793256425" src="https://embed.bsky.app/embed/did:plc:dmjdtxiwote53wmj4zaxazfo/app.bsky.feed.post/3ldtlxbzbjs27?id=9522643793256425" width="100%" style="display: block; flex-grow: 1;" frameborder="0" scrolling="no"></iframe></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.hotds.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.hotds.dev/subscribe?"><span>Subscribe now</span></a></p><h2>Quotable Humans</h2><p>First up. The world reacted to DeepSeek:</p><ul><li><p><a href="https://x.com/karpathy/status/1883941452738355376">Andrej Karpathy</a>: I will say that Deep Learning has a legendary ravenous appetite for compute, like no other algorithm that has ever been developed in AI.</p><p>&#8230;</p><p>Last thought. Not sure if this is obvious. There are two major types of learning, in both children and in deep learning. There is 1) imitation learning (watch and repeat, i.e. pretraining, supervised finetuning), and 2) trial-and-error learning (reinforcement learning). My favorite simple example is AlphaGo - 1) is learning by imitating expert players, 2) is reinforcement learning to win the game. Almost every single shocking result of deep learning, and the source of all *magic* is always 2. 2 is significantly significantly more powerful. 2 is what surprises you. 2 is when the paddle learns to hit the ball behind the blocks in Breakout. 
2 is when AlphaGo beats even Lee Sedol. And 2 is the "aha moment" when the DeepSeek (or o1 etc.) discovers that it works well to re-evaluate your assumptions, backtrack, try something else, etc. It's the solving strategies you see this model use in its chain of thought.</p></li><li><p><a href="https://x.com/DrJimFan/status/1883894546410639413">Jim Fan</a>: those who think RL use less compute don&#8217;t know RL at all </p></li><li><p><a href="https://x.com/jxmnop/status/1882849185319514295">Jack Morris</a>: i guess DeepSeek broke the proverbial four-minute-mile barrier. people used to think this was impossible. and suddenly, RL on language models just works and it reproduces on a small-enough scale that a PhD student can reimplement it in only a few days this year is going to be wild</p></li><li><p><a href="https://x.com/Dorialexander/status/1883831488669679709">Alexander Doria</a>: Starting to think DeepSeek is a blessing in disguise for the AI market: can be cleansed sufficiently early on and avoid a dramatic bubble popping. Investors will price in competition and commoditization accurately from now on.</p></li><li><p><a href="https://x.com/deanwball/status/1883142201414222113">Dean W. Ball</a>: Part of the reason DeepSeek looks so impressive (apart from just being impressive!) is that they are among the only truly cracked teams releasing detailed frontier AI research.</p></li><li><p><a href="https://x.com/DrJimFan/status/1881353126210687089">Jim Fan</a>: We are living in a timeline where a non-US company is keeping the original mission of OpenAI alive - truly open, frontier research that empowers all. It makes no sense. The most entertaining outcome is the most likely. DeepSeek-R1 not only open-sources a barrage of models but also spills all the training secrets. 
They are perhaps the first OSS project that shows major, sustained growth of an RL flywheel.</p></li><li><p><a href="https://x.com/emollick/status/1883937071393562960">Ethan Mollick</a>: Lesson here is that investors do not understand that the paradigm for AI has been undergoing a shift from one which was about models getting smarter due to more computing power being used for training to models getting smarter due to more computing power being used for inference.</p></li><li><p><a href="https://x.com/emollick/status/1883906575066337481">Ethan Mollick</a>: Also &#8220;Jevon&#8217;s Paradox&#8221; is just a variation of saying prices are elastic and use goes up when price goes down. Hard to imagine compute not being the constraint for the foreseeable future when reasoning models like DeepSeek or o1 depend on inference-time scaling.</p></li></ul><p>Quotes from other topics of the data and AI sphere:</p><ul><li><p><a href="https://surfingcomplexity.blog/2024/12/21/the-canva-outage-another-tale-of-saturation-and-resilience/">Lorin Hochstein</a> (December): Like <a href="https://surfingcomplexity.blog/2024/12/14/quick-takes-on-the-recent-openai-public-incident-write-up/">so</a> <a href="https://surfingcomplexity.blog/2024/11/28/quick-takes-on-the-latest-cloudflare-public-incident-write-up/">many</a> <a href="https://surfingcomplexity.blog/2024/07/06/quick-takes-on-rogers-network-outage-executive-summary/">other</a> <a href="https://surfingcomplexity.blog/2021/02/08/slacks-jan-2021-outage-a-tale-of-saturation/">incidents</a> that came before, this is another tale of <em>saturation</em>, where the failure mode involves overload.</p><p>&#8230;</p><p>The trigger for this incident was Canva deploying a new version of their editor page. 
It&#8217;s notable <em><strong>that there was nothing wrong with this new version.</strong></em> The incident wasn&#8217;t triggered by a bug in the code in the new version, or even by some unexpected emergent behavior in the code of this version. No, while the incident was triggered by a deploy, the changes from the previous version are immaterial to this outage. Rather, <em><strong>it was the system behavior that emerged from clients downloading the new version that led to the outage</strong></em>. Specifically, it was clients downloading the new javascript files from the CDN that set the ball in motion.</p></li><li><p><a href="https://bsky.app/profile/elenarunsnyc.bsky.social/post/3lg6ffg74u22q">Elena Dyachkova</a>: Amazing things unfolding on LinkedIn. Avinash Kaushik penned an &#8216;<a href="https://www.linkedin.com/feed/update/urn:li:activity:7286023213204008961/">A/B testing is dead</a>&#8217; essay with some valid points but a very narrow definition of A/B testing as CRO, Ronny <a href="https://www.linkedin.com/posts/ronnyk_stop-ab-testing-few-investments-today-activity-7286979831362138114-vZYK/">penned a debunk</a> amply peppered with &#8216;that&#8217;s not been my personal experience&#8217;&#8230;</p></li><li><p><a href="https://x.com/debasishg/status/1882130447955791960">Debasish Ghosh</a>: Columnar file formats offer compression that results in storage savings. Row skipping metadata also accelerate columnar scans. How do you optimise partition size to get the best of both ? This <a href="https://t.co/gClBGw6Vlm">paper </a>proposes a solution by decoupling the actual storage from the search acceleration axes.</p></li><li><p><a href="https://x.com/lauriewired/status/1884145157358211277">LaurieWired</a>: Most hashing algorithms are designed to avoid collisions. What if they weren&#8217;t? Locality-sensitive-hashing (LSH) is a way to group similar inputs into the same &#8220;buckets&#8221; with high probability. Collisions are maximized, not minimized. 
As a malware researcher, I&#8217;m quite experienced with fuzzy hashing. LSH algorithms are a bit different. LSH algos specifically reduce the dimensionality of data while preserving relative distance. Think spam filters, copyright media detection, even music recommendations.</p></li><li><p><a href="https://bsky.app/profile/ry.codes/post/3lgepmjigj22k">Ryanne Dolan</a>: The MVs-as-pipelines metaphor breaks down when you want multiple pipelines working together to materialize a single big view. We've introduced "partial views" to solve this problem. You can now create a bunch of MVs that write to the same place.</p></li><li><p><a href="https://x.com/eatonphil/status/1882819408663634237">Phil Eaton</a>: I think there definitely is a vibe shift among experienced programmers. Minimize dependencies. Invest more in comprehensive standard libraries. (JavaScript and Rust notably have the least comprehensive standard libraries.)</p></li><li><p><a href="https://x.com/davlindner/status/1882451562859254050">David Lindner</a>: New Google DeepMind safety paper! LLM agents are coming &#8211; how do we stop them finding complex plans to hack the reward? Our method, MONA, prevents many such hacks, *even if* humans are unable to detect them!</p></li><li><p><a href="https://youtubetranscriptoptimizer.com/blog/05_the_short_case_for_nvda">Jeffrey Emanuel</a> <em>wrote an incredible piece on The Short Case for Nvidia stock. A couple of quotes</em>: </p><ul><li><p>But if the next big scaling law that people are excited about is for inference level compute&#8212; and if the biggest drawback of COT models is the high latency introduced by having to generate all those intermediate logic tokens before they can respond&#8212; then even a company that only does inference compute, but which does it dramatically faster and more efficiently than Nvidia can&#8212; can introduce a serious competitive threat in the coming years. 
At the very least, Cerebras and Groq can chip away at the lofty expectations for Nvidia's revenue growth over the next 2-3 years that are embedded in the current equity valuation.</p></li><li><p>Now, it's no secret that there is a strong power law distribution of Nvidia's hyper-scaler customer base, with the top handful of customers representing the lion's share of high-margin revenue. How should one think about the future of this business when literally every single one of these VIP customers is building their own custom chips specifically for AI training and inference?</p></li></ul></li><li><p><a href="https://x.com/rahulj51/status/1881912252217061704">Rahul Jain</a>: Devin's Paradox: AI tools doing so good that junior developers are no longer hired or upskilled. Meanwhile the demand for senior devs keeps increasing but there are no senior devs available - thus leading to decreased productivity.</p></li><li><p><a href="https://x.com/debasishg/status/1710308840732807334">Debasish Ghosh</a>: The talk starts claiming that linked lists are an immoral data structure and if you are using them for anything for which you care about performance, you are committing sin. That's obviously because of the cache misses that a linked list will suffer. And yet Linux VMAs were based on linked lists only till 1995.</p><p><a href="https://bsky.app/profile/gunnarmorling.dev/post/3lgpnktnidc26">Gunnar Morling</a>: Kinda blowing my mind that we're still largely using text-based formats (JSON) for logging, rather than binary formats. Such as waste of compute resources.</p></li><li><p><a href="https://bsky.app/profile/flaneur2024.bsky.social/post/3lgjzietfmc2g">flaneur2024</a>: finally read the paper &lt;<a href="https://www.usenix.org/system/files/atc20-rebello.pdf">Can Application Recover from fsync Failures?</a>&gt;. 
surprised to see that `fsync()` will simply mark the dirty page as clean after you get an EIO error, so a successful retry of `fsync()` will not finally persist your data, but will fail to persist your data silently &#128558;</p></li><li><p><a href="https://x.com/lauriewired/status/1883599337642664035">LaurieWired</a>: We aren&#8217;t far off from the theoretical limits of CPU clockspeed! A soft limit at around ~10GHz where speed-of-light across the entire die in one cycle starts to become a major limitation!</p></li><li><p><a href="https://char.blog/generalist">Char </a><em>writes about &#8220;high-leverage generalists&#8221;</em>: </p><ul><li><p>The "T-shaped person" discourse is played out. The real move is being an emergent complexity monster. I've spent the last 6 years watching my career accelerate dramatically through what I call "high-leverage generalism." Not the wishy-washy "jack of all trades" kind, but a deliberate approach to building a unique combination of skills that compounds in value over time.</p></li><li><p>The traditional narrative around career development is broken. We're told to pick a lane early, specialise hard, and climb the ladder in our chosen field. This advice made sense in a world of stable, well-defined industries. But that world is dead. Today's most interesting opportunities exist at the intersections - where different domains collide and create new possibilities.</p></li></ul></li><li><p><a href="https://x.com/debasishg/status/1851978751585997145">Debasish Ghosh</a>: I have been posting a bit about writing cache aware code, being aligned with the modern CPU architectures and by a fortunate stroke of serendipity found this talk by Scott Meyers, one of the basics of data oriented design .. 
<a href="https://www.youtube.com/watch?v=WDIkqP4JbkE">CPU Caches and Why You Care</a></p></li><li><p><a href="https://www.hackintoshrao.com/iceberg-powered-unification-why-table-stream-duality-will-redefine-etl-in-2025-2/">Karthic Rao</a>: Looking ahead, we can anticipate <strong>bi-directional flows</strong> between tables and streams, letting data teams materialize Iceberg tables as real-time topics (and vice versa) with minimal friction. With no ETL duplication and no wasted data movement, the opportunity for <em>innovation</em> and <em>cost savings</em> is vast.</p></li><li><p><a href="https://dataproducts.substack.com/p/the-consumer-defined-data-contract">Chad Sanderson</a> <em>on consumer-driven data contracts</em>: </p><ul><li><p>An application developer will never comprehensively understand how data downstream is being used, nor will they fully understand the constraints on data that might be necessary for certain use cases and not others. They will have no concept of which SLAs are useful and which are meaningless. They will not understand how the data model needs to evolve (or how it should have been originally defined). They will have no grasp of how data is replicated across multiple services and where strong relationships must be built between owning producer teams. They won&#8217;t understand which data is necessary to be under contract and which isn&#8217;t.</p></li><li><p>This is the <strong>consumer-defined data contract</strong>. The consumer-defined contract is created by the owners of data applications, with requirements derived from their needs and use cases. 
While it contains the same information as the producer-defined contract, it is implemented primarily to draw awareness to the request and inform data producers when new expectations and dependencies exist on the data they maintain.</p></li></ul></li></ul><h2>Published Humans</h2><ul><li><p><a href="http://andrew.nerdnetworks.org/other/CIDR_2025_Cloud_5_Minute_Rule.pdf">The Five-Minute Rule for the Cloud: Caching in Analytics Systems</a>: A useful paper for system designers building analytics systems on object storage. The paper surveys a number of caching architectures in terms of various properties such as latency variability and implementation/operational complexity. It then introduces a comprehensive cost model that evaluates these caching architectures, taking into account factors such as data access patterns and latency-sensitive vs non-latency-sensitive workloads, to determine optimal caching policies. Using this model, they propose new rules of thumb for cloud-native databases that must balance cost/latency trade-offs in dynamic cloud environments. 
Basically, the need for adaptable caching mechanisms that can respond to fluctuating workloads and data access patterns (while maintaining desired cost/perf profile).</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!LxWI!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76320901-1dae-4037-9d87-b3b494a99bb4_1411x306.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!LxWI!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76320901-1dae-4037-9d87-b3b494a99bb4_1411x306.png 424w, https://substackcdn.com/image/fetch/$s_!LxWI!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76320901-1dae-4037-9d87-b3b494a99bb4_1411x306.png 848w, https://substackcdn.com/image/fetch/$s_!LxWI!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76320901-1dae-4037-9d87-b3b494a99bb4_1411x306.png 1272w, https://substackcdn.com/image/fetch/$s_!LxWI!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76320901-1dae-4037-9d87-b3b494a99bb4_1411x306.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!LxWI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76320901-1dae-4037-9d87-b3b494a99bb4_1411x306.png" width="1411" height="306" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/76320901-1dae-4037-9d87-b3b494a99bb4_1411x306.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:306,&quot;width&quot;:1411,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:64420,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!LxWI!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76320901-1dae-4037-9d87-b3b494a99bb4_1411x306.png 424w, https://substackcdn.com/image/fetch/$s_!LxWI!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76320901-1dae-4037-9d87-b3b494a99bb4_1411x306.png 848w, https://substackcdn.com/image/fetch/$s_!LxWI!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76320901-1dae-4037-9d87-b3b494a99bb4_1411x306.png 1272w, https://substackcdn.com/image/fetch/$s_!LxWI!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F76320901-1dae-4037-9d87-b3b494a99bb4_1411x306.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a></figure></div></li><li><p><a href="https://vldb.org/cidrdb/papers/2025/p5-alotaibi.pdf">Towards Query Optimizer as a Service (QOaaS) in a Unified LakeHouse Ecosystem: Can One QO Rule Them All?</a> This paper explores whether Query Optimization (QO), in the context of data lakehouses, can be decoupled from data systems and made into a separate service&#8212;QAaaS. 
The paper explains that QO implementations across most relational analytics data systems involve the common steps of parsing/algebrization, simplification/normalization, cost-based exploration/implementation, and post-optimization. The QOaaS model decouples the QO from individual engines, allowing it to function as an independent service that interacts with multiple engines via remote procedure calls. The theoretical benefits of QOaaS include accelerated innovation, enhanced engineering efficiency, reduced time-to-market for new engines, and the capability for cross-engine optimization and scalability. However, the implementation of QOaaS presents challenges such as defining a universal query plan format and accommodating diverse engine capabilities within a unified cost model. The authors themselves note that the work is ambitious and exploratory, inviting discussion from the community on its feasibility.</p></li></ul><h2>Interesting topic #1 - Well and ill-conditioned APIs</h2><p>Back in September 2023 Marc Brooker <a href="https://x.com/MarcJBrooker/status/1706316593980944443">observed </a>that:</p><blockquote><p>&#8220;<em>The declarative nature of SQL is a major strength, but also a common source of operational problems. This is because SQL obscures one of the most important practical questions about running a program: how much work are we asking the computer to do?</em>&#8221;</p></blockquote><p>Mahesh Balakrishnan <a href="https://x.com/maheshb/status/1706863830733848710">riffed </a>on that, remarking:</p><blockquote><p><em>To misuse some terminology from math, SQL is an &#8220;ill-conditioned&#8221; API: small changes in input can trigger very different amounts of work. The opposite would be block storage, which is &#8220;well-conditioned&#8221;. 
Another famous example of an ill-conditioned abstraction is IP Multicast.</em></p></blockquote><p>This week I&#8217;ve seen two people talk around this property of APIs and their &#8220;conditioning&#8221;:</p><ul><li><p><a href="https://x.com/jamesacowling/status/1883954318665212193">James Cowling</a>: A good design principle is that APIs that look cheap should be cheap and APIs that are expensive should look expensive. One mistake we made in an old version of the Dropbox filesystem was that directory moves looked cheap (drag a folder around) but were expensive on the server (update metadata for every file in the tree). A similar mistake we made with @convex_dev is that the .filter() operator looks convenient, even though it does a table scan, while the .withIndex() syntax for efficiently reading from an index looks a little cumbersome and expensive.</p></li><li><p><a href="https://youtubetranscriptoptimizer.com/blog/05_the_short_case_for_nvda">Jeffrey Emanuel</a> <em>discusses the impact that CoT-based reasoning has on inference scaling</em>: </p><ul><li><p>But essentially, until recently, inference compute was generally a lot less intensive than training compute, and scaled basically linearly with the number of requests you are handling&#8212; the more demand for text completions from ChatGPT, for instance, the more inference compute you used up.</p><p>With the advent of the revolutionary Chain-of-Thought ("COT") models introduced in the past year, most noticeably in OpenAI's flagship O1 model (but very recently in DeepSeek's new R1 model, which we will talk about later in much more detail), all that changed. </p></li><li><p>Some of the most exciting news in the AI world came out just a few weeks ago and concerned OpenAI's new unreleased O3 model, which was able to solve a large variety of tasks that were previously deemed to be out of reach of current AI approaches in the near term. 
And the way it was able to do these hardest problems (which include exceptionally tough "foundational" math problems that would be very hard for even highly skilled professional mathematicians to solve), is that OpenAI threw an insane amount of compute resources at the problems&#8212; in some cases, spending $3k+ worth of compute power to solve a single task (compare this to traditional inference costs for a single task, which would be unlikely to exceed a couple dollars using regular Transformer models without chain-of-thought).</p></li></ul></li></ul><p>On the subject of CoT reasoning: it seems likely that prompts on advanced reasoning models will consume wildly different amounts of compute resources, depending on the complexity of the question, any false steps during the CoT process that must be corrected, and the upper-bound cost the prompter is willing to pay for an answer.</p><h2>Interesting topic #2 - Contextual data quality challenges and AI</h2><p>This week I encountered this article on data quality, <a href="https://journalofbigdata.springeropen.com/articles/10.1186/s40537-021-00468-0">Big data quality framework: a holistic approach to continuous quality management</a> (2021). </p><p>One figure was particularly interesting and useful to me. 
It categorizes data quality into intrinsic, contextual, representational and accessibility dimensions.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!i8A1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41134dcd-329e-4ff5-acd9-375ac5fcaefb_685x396.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!i8A1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41134dcd-329e-4ff5-acd9-375ac5fcaefb_685x396.png 424w, https://substackcdn.com/image/fetch/$s_!i8A1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41134dcd-329e-4ff5-acd9-375ac5fcaefb_685x396.png 848w, https://substackcdn.com/image/fetch/$s_!i8A1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41134dcd-329e-4ff5-acd9-375ac5fcaefb_685x396.png 1272w, https://substackcdn.com/image/fetch/$s_!i8A1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41134dcd-329e-4ff5-acd9-375ac5fcaefb_685x396.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!i8A1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41134dcd-329e-4ff5-acd9-375ac5fcaefb_685x396.png" width="685" height="396" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/41134dcd-329e-4ff5-acd9-375ac5fcaefb_685x396.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:396,&quot;width&quot;:685,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;figure 2&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="figure 2" title="figure 2" srcset="https://substackcdn.com/image/fetch/$s_!i8A1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41134dcd-329e-4ff5-acd9-375ac5fcaefb_685x396.png 424w, https://substackcdn.com/image/fetch/$s_!i8A1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41134dcd-329e-4ff5-acd9-375ac5fcaefb_685x396.png 848w, https://substackcdn.com/image/fetch/$s_!i8A1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41134dcd-329e-4ff5-acd9-375ac5fcaefb_685x396.png 1272w, https://substackcdn.com/image/fetch/$s_!i8A1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F41134dcd-329e-4ff5-acd9-375ac5fcaefb_685x396.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption">Figure 2 from <a href="https://journalofbigdata.springeropen.com/articles/10.1186/s40537-021-00468-0">Big data quality framework: a holistic approach to continuous quality management</a> </figcaption></figure></div><p>It made me realize that the definition of data quality that I and many others working on data tooling are using is overly narrow, focused almost entirely in the <strong>Intrinsic</strong> and <strong>Accessibility</strong> categories. The reason? These two categories are the ones that are easy to build tooling for. We can design data systems to offer consistency, we can use schemas, types and validation rules to ensure data adheres to strict forms, we can measure if data arrives on-time or late. We can control access to the data, who can see what and when.</p><p>But what about the <strong>Contextual</strong> category? 
The Contextual category is also extremely important, but characteristics such as relevancy, trustworthiness and suitability are hard to quantify and evaluate through software. It is where the data engineer and data analyst use their human intellect and experience to pick good sources of data, judge whether data is trustworthy, and decide how to use and combine data effectively.</p><p>AI becomes relevant to this in two ways:</p><ol><li><p><strong>Data generation</strong>: AI is capable of the automated generation of data quality issues in all categories.</p></li><li><p><strong>Data control</strong>: AI can automate more advanced data quality controls.</p></li></ol><h4>From deterministic pipelines to stochastic pipelines</h4><p>The data that is being ingested into data warehouses and data lakehouses can vary from low to high quality. Typically, data engineers have to cleanse the data before it can be used to generate business insights. This is known ahead of time and therefore data is put through various stages first. Once data is in a high quality state, we can apply <strong>deterministic processing/analysis</strong> to produce high quality outputs. What data engineers and analysts must guard against are Garbage In Garbage Out (GIGO) and errors in their own processes that turn valid inputs into wrong outputs. But this can be done; we&#8217;re talking about deterministic processing, where age-old techniques of testing, validation and end-user feedback catch most problems.</p><p>Enter AI. The benefit of AI is automated intelligence (whether it&#8217;s simple models doing classification or sophisticated models doing advanced reasoning). The output of AI is generally more data. 
AI also requires valid inputs but, due to its stochastic nature, can produce bad data from good.</p><p>The data quality problems it can generate span all four categories from the article:</p><ul><li><p><strong>Intrinsic</strong>: Data may not be complete or conform to correct types, etc.</p></li><li><p><strong>Contextual</strong>: Data may not be relevant or suitable. Data may be structurally correct but factually wrong.</p></li><li><p><strong>Representational</strong>: AI systems that produce outputs in complex or inconsistent formats can hinder user understanding. For example, if an AI model presents data in a convoluted manner without clear explanations, it can lead to misinterpretations, affecting decision-making processes.</p></li><li><p><strong>Accessibility</strong>: If AI systems do not implement robust access controls, sensitive information could be exposed.</p></li></ul><p><strong>Intrinsic</strong> data quality issues may be detected and remediated automatically via software, using the same strategies we use today on ingested data. But what about <strong>Contextual</strong> or <strong>Accessibility</strong> data quality issues? How do we apply automated controls to ensure relevancy and suitability of data (currently the preserve of the data engineer and analyst)? How do we detect that sensitive data has leaked?</p><h4>Handling AI-generated contextual data quality issues</h4><p>There is little room for contextual data quality issues with simple classifiers, but LLMs generate open-ended text that presents a much larger data quality challenge. So the question is: <strong>How do we benefit from automated intelligence without it causing data quality incidents?</strong></p><p>It seems that the answer is also AI. 
</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!q5o1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F311495a2-509e-4a58-a7ba-04d7525c06e7_582x429.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!q5o1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F311495a2-509e-4a58-a7ba-04d7525c06e7_582x429.jpeg 424w, https://substackcdn.com/image/fetch/$s_!q5o1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F311495a2-509e-4a58-a7ba-04d7525c06e7_582x429.jpeg 848w, https://substackcdn.com/image/fetch/$s_!q5o1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F311495a2-509e-4a58-a7ba-04d7525c06e7_582x429.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!q5o1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F311495a2-509e-4a58-a7ba-04d7525c06e7_582x429.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!q5o1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F311495a2-509e-4a58-a7ba-04d7525c06e7_582x429.jpeg" width="582" height="429" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/311495a2-509e-4a58-a7ba-04d7525c06e7_582x429.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:429,&quot;width&quot;:582,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!q5o1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F311495a2-509e-4a58-a7ba-04d7525c06e7_582x429.jpeg 424w, https://substackcdn.com/image/fetch/$s_!q5o1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F311495a2-509e-4a58-a7ba-04d7525c06e7_582x429.jpeg 848w, https://substackcdn.com/image/fetch/$s_!q5o1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F311495a2-509e-4a58-a7ba-04d7525c06e7_582x429.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!q5o1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F311495a2-509e-4a58-a7ba-04d7525c06e7_582x429.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>AI solutions to AI problems:</p><ul><li><p><strong>Confidence scores</strong></p><ul><li><p>Self-Evaluation Scoring. The LLM itself can estimate confidence by generating multiple responses and ranking them.</p></li><li><p>Threshold-Based Filtering. If confidence is below a threshold (e.g., 60%), the model can flag the answer as uncertain or refuse to answer.</p></li></ul></li><li><p><strong>Verification</strong></p><ul><li><p>Self-Consistency Sampling. Instead of generating one response, the LLM generates multiple responses and picks the most consistent one.</p></li><li><p>AI Critique Models (Verifier LLMs). Another LLM can check the primary LLM&#8217;s responses, filtering out incorrect, irrelevant, or misleading text.</p></li><li><p>Fact-Checking with Retrieval-Augmented Generation (RAG).</p></li><li><p>Adversarial AI (GAN-Style Verifiers)</p></li></ul></li></ul><p>The verification side seems to be more domain specific. In order to catch contextual data quality issues, the verifier itself must understand that context. 
The challenge for data engineers and analysts wanting to adopt more sophisticated uses of LLMs will be in <strong>automated domain-specific AI-based verification</strong>.</p><h4>AI in 2025 and onwards</h4><p>Contextual data quality remains largely a human-driven responsibility today, relying on the expertise of data engineers and analysts to make informed decisions about data sources, transformations, and interpretations. Notably, humans assess contextual data quality during the data source selection, ingestion or cleansing process. However, AI can generate as-yet unverified contextual data quality issues from inside the &#8220;safe zone&#8221;, where ingested data has been cleansed. </p><p>While AI can generate data quality problems across all four dimensions, Contextual issues stand out to me as the most complex and hard to automate away. The challenge ahead for data professionals is to build domain-aware AI verification pipelines that understand context-specific relevancy, suitability and correctness. The future of data engineering isn&#8217;t just about generating data with AI but about curating and refining it, subjecting AI-produced outputs to the same rigorous scrutiny that human analysts apply. As AI increasingly permeates data pipelines, AI-powered data quality management will become ever more necessary to ensure that insights remain reliable and relevant. 
In other words, verification and correction will become as fundamental as ingestion and transformation.</p>]]></content:encoded></item><item><title><![CDATA[Humans of the Data Sphere Issue #6 January 14th 2025]]></title><description><![CDATA[Your biweekly dose of insights, observations, commentary and opinions from interesting people from the world of databases, AI, streaming, distributed systems and the data engineering/analytics space.]]></description><link>https://www.hotds.dev/p/humans-of-the-data-sphere-issue-6</link><guid isPermaLink="false">https://www.hotds.dev/p/humans-of-the-data-sphere-issue-6</guid><dc:creator><![CDATA[Jack Vanlightly]]></dc:creator><pubDate>Tue, 14 Jan 2025 18:42:44 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc50bf85b-a06e-4504-9d93-83d7c3c3419c_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome to Humans of the Data Sphere issue #6! 
</p><p>First, is this <a href="https://www.youtube.com/watch?v=x6C5tI208o0">our future of AI agents</a>?</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!4Sd-!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00009a30-ae93-4408-b377-2e47b34d3b5b_400x400.webp" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!4Sd-!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00009a30-ae93-4408-b377-2e47b34d3b5b_400x400.webp 424w, https://substackcdn.com/image/fetch/$s_!4Sd-!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00009a30-ae93-4408-b377-2e47b34d3b5b_400x400.webp 848w, https://substackcdn.com/image/fetch/$s_!4Sd-!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00009a30-ae93-4408-b377-2e47b34d3b5b_400x400.webp 1272w, https://substackcdn.com/image/fetch/$s_!4Sd-!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00009a30-ae93-4408-b377-2e47b34d3b5b_400x400.webp 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!4Sd-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00009a30-ae93-4408-b377-2e47b34d3b5b_400x400.webp" width="400" height="400" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/00009a30-ae93-4408-b377-2e47b34d3b5b_400x400.webp&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:400,&quot;width&quot;:400,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:21806,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/webp&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!4Sd-!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00009a30-ae93-4408-b377-2e47b34d3b5b_400x400.webp 424w, https://substackcdn.com/image/fetch/$s_!4Sd-!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00009a30-ae93-4408-b377-2e47b34d3b5b_400x400.webp 848w, https://substackcdn.com/image/fetch/$s_!4Sd-!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00009a30-ae93-4408-b377-2e47b34d3b5b_400x400.webp 1272w, https://substackcdn.com/image/fetch/$s_!4Sd-!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F00009a30-ae93-4408-b377-2e47b34d3b5b_400x400.webp 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption">From Idiocracy (see the <a href="https://www.youtube.com/watch?v=x6C5tI208o0">video </a>for the full clip)</figcaption></figure></div><p class="button-wrapper" data-attrs="{&quot;url&quot;:&quot;https://www.hotds.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe now&quot;,&quot;action&quot;:null,&quot;class&quot;:null}" data-component-name="ButtonCreateButton"><a class="button primary" href="https://www.hotds.dev/subscribe?"><span>Subscribe now</span></a></p><h2>Quotable Humans</h2><ul><li><p><a href="https://brooker.co.za/blog/2024/12/17/occ-and-isolation.html">Marc Brooker</a> <em>on why snapshot isolation is a sweet spot</em>: It&#8217;s a crucial difference because of one of the cool and powerful things that SQL databases make easy: <code>SELECT</code>s. You can grow a transaction&#8217;s write set with <code>UPDATE</code> and <code>INSERT</code> and friends, but most OLTP applications don&#8217;t tend to. 
You can grow a transaction&#8217;s read set with any <code>SELECT</code>, and many applications do that. If you don&#8217;t believe me, go look at the ratio between predicate (i.e. not exact PK equality) <code>SELECT</code>s in your code base versus predicate <code>UPDATE</code>s and <code>INSERT</code>s. If the ratios are even close, you&#8217;re a little unusual. &#8230;This is where we enter a world of trade-offs<a href="https://brooker.co.za/blog/2024/12/17/occ-and-isolation.html#foot3"><sup>3</sup></a>: avoiding SI&#8217;s write skew requires the database to abort (or, sometimes, just block) transactions based on what they <em>read</em>.</p></li><li><p><a href="https://x.com/charliermarsh/status/1878081371907694967">Charlie Marsh</a>: It was really valuable for me to stay at a single company long enough to live with the consequences of my own engineering decisions. To come face-to-face with my own technical debt.</p></li><li><p><a href="https://databasearchitects.blogspot.com/2024/12/what-are-important-data-systems.html">Viktor Leis</a> <em>shares the results of a panel discussion on under-researched database problems</em>: </p><ul><li><p>One significant yet understudied problem raised by multiple panellists is the handling of variable-length strings. &#8230; Dealing with strings presents two major challenges. First, query processing is often slow due to the variable size of strings and the (time and space) overhead of dynamic allocation. Second, surprisingly little research has been dedicated to efficient database-specific string compression. 
Given the importance of strings on real-world query performance and storage consumption, it is surprising how little research there is on the topic (there <a href="https://sigmodrecord.org/publications/sigmodRecord/2103/pdfs/16_och-gubner.pdf">are</a> <a href="https://15721.courses.cs.cmu.edu/spring2020/papers/09-compression/muller-edbt2014.pdf">some</a> <a href="https://www.vldb.org/pvldb/vol13/p2649-boncz.pdf">exceptions</a>).</p></li><li><p>While database researchers often focus on database engine architectures, Andy argued that surrounding topics, such as <a href="https://www.vldb.org/pvldb/vol16/p3335-butrovich.pdf">network connection handling</a> (e.g., database proxies), receive little attention despite their practical importance. Surprisingly, there is also limited research on scheduling database workloads and optimizing the network stack, even though <a href="https://www.cs.cit.tum.de/fileadmin/w00cfj/dis/papers/LookingGlass2_CIDR25.pdf">communication bottlenecks</a> frequently constrain efficient OLTP systems.</p></li></ul></li><li><p><a href="https://www.linkedin.com/pulse/principal-engineer-roles-framework-mai-lan-tomsen-bukovec-142df/">Mai-Lan Tomsen Bukovec</a> <em>shared the basics of the Principal Engineer Roles Framework</em>: if there is one thing that I have learned, it is that when you run complex systems at scale, you must think deeply about how teams work. It&#8217;s not enough to get into the details about <em>what </em>you build. You have to spend lots of time engineering, iterating, and improving <em>how</em> you and your team operate.</p></li><li><p><a href="https://www.cs.cmu.edu/~pavlo/blog/2025/01/2024-databases-retrospective.html">Andy Pavlo</a>: DuckDB has entered the zeitgeist as the default choice for someone wanting to run analytical queries on their data. Pandas previously held DuckDB's crowned position. 
Given DuckDB's insane portability, there are several efforts to stick it inside existing DBMSs that do not have great support for OLAP workloads. This year, we saw the release of four different extensions to stick DuckDB up inside Postgres.</p></li><li><p><a href="https://x.com/martin_casado/status/1872822491829420241">Martin Casado</a> <em>starts a fun thread of often-dangerous tech patterns (definitely ripe for some disagreement, get popcorn and remember he said *almost*)</em>: </p><ul><li><p>Systems ideas that sound good but almost never work: - DSLs - Live migrating process state - Anomaly detection - Control loops responding to system load - Multi-master writes - p2p cache sharing - Hybrid parallelism - Being clever vs over-provisioning .. What else?</p><ul><li><p>Nikita Shamgunov: Database land: * Unified OLTP and OLAP (guilty) * HTAP (feature not a market). Works as a feature though * API compat for databases (https://babelfishpg.org is the most recent flop) * Database migrations (from Oracle, Teradata, etc) * True geo distribution for a database (we will see about d-sql) * Shared nothing db architecture (sorry, I know you are an investor) P.S. Multi-master may actually work (<a href="https://t.co/4sx9ibsyfu">https://dl.acm.org/doi/10.1145/3626246.3653377&#8230;</a>). I'm looking into it.</p></li></ul></li></ul></li><li><p><a href="https://x.com/sunbains/status/1870556830168211535">Sunny Bains</a> <em>kicks off some discussion of SQL and scaling</em>: When someone tells you that SQL doesn&#8217;t scale, they are either selling you a KV store, they read it on the Internet somewhere or are totally clueless and want to sound smart. I&#8217;ve never understood what SQL has to do with scaling.</p><ul><li><p><a href="https://x.com/ovaistariq/status/1870568861860495492">Ovais Tariq</a>: Almost all high-end services built on SQL database such as Facebook, Uber, etc restrict the SQL language exposed to the users to the most basic set. 
Transactional databases scale if you reduce the surface area of features of the SQL language is what I am getting at.</p></li><li><p><a href="https://x.com/isamlambert/status/1870569262995435560">Sam Lambert</a>: it&#8217;s also why scaling while doing things like postgres extensions is a pipe dream. too much entropy. </p></li><li><p><a href="https://x.com/thrawn01/status/1870649339342885127">Derrick Wippler</a>: THIS! Also, So many fail to realize that for each JOIN you add there is a 10x or more performance penalty. You can't defy the laws of physics. You CAN use SQL to scale, you just can't abuse it as if these features don't have a cost.</p></li></ul></li><li><p><a href="https://x.com/jamesacowling/status/1869571885710946783">James Cowling</a>: I often see code that assumes database auto-increment ids are monotonically increasing. e.g., AUTO_INCREMENT in MySQL, SERIAL in Postgres. They are not. They may have gaps plus don't reflect commit order, so may show up out of order. It's easy to write bugs that walk over a table assuming this is an actual ordering.</p><ul><li><p><a href="https://x.com/sunbains/status/1869787187728003161">Sunny Bains</a>: The problem around auto-increment is due to historic reasons. Up to SQL-92 there was no mention of sequences/auto-increment. Every db vendor had their own syntax and implementation quirks. SQL 2003 was the first to introduce sequences officially and monotonicity was not actually specified (AFAIR, I remember discussing this when I rewrote InnoDB&#8217;s auto-increment handling). SQL 2008 addressed the monotonicity part but AFAIK doesn&#8217;t mandate it either, allows for non-monotonicity, values are allowed to be generated out of order when used concurrently. 
Relying on any specific behaviour is not based on any standard, it&#8217;s vendor specific only.</p></li><li><p><a href="https://x.com/pigol1/status/1869574453249892659">Piyush Goel</a>: This happens especially when you perform batched inserts in a concurrent manner. The engine creates blocks of ids that are spaced out to avoid conflicts. I made a terrible blunder in estimating the size of a critical table by looking at the auto-inc id. I only realized much later that it didn&#8217;t tally with the count(*) value and went on a deep-dive to understand the auto-inc behavior.</p></li></ul></li><li><p><a href="https://x.com/sunbains/status/1867614095928860731">Sunny Bains</a>: When writing multi-threaded programs remember to test it on HW with multiple sockets. Performance drop due to shared state across sockets is quite depressing to watch. Modern CPUs have made it even more interesting with all kinds of Ln cache sharing combinations. I&#8217;ve seen many examples of &#8220;it scaled really well on my Mac/PC/laptop&#8221;.</p><ul><li><p><a href="https://x.com/BonesMoses/status/1868013913981301002">Shaun Thomas</a>: The Numa effect is real. It's one reason I test with and without CPU pinning in virtual environments, because I want to see how bad the degradation is if the load migrates to a cold socket.</p></li></ul></li><li><p><a href="https://x.com/YingjunWu/status/1878258292415508920">Yingjun Wu</a>: While many database vendors are competing in the analytics space (trying to be the next Snowflake or Databricks!), not many are going after Redis. The truth is, a lot of companies complain that Redis is too expensive. 
I predict that, in 2025, some players will emerge to challenge Redis with an &#8216;S3 as the primary storage&#8217; architecture.</p></li><li><p><a href="https://x.com/YingjunWu/status/1877080471735251245">Yingjun Wu</a>: An interesting observation I've made in the database area is that the current trend in <a href="https://x.com/hashtag/AgenticAI?src=hashtag_click">#AgenticAI</a> seems to benefit row stores more than columnar stores. Over the past few years, the focus has been on OLAP, with all database vendors racing to build or enhance their columnar stores. However, what today&#8217;s agentic AI actually needs is a search index for a knowledge cache - sth that can be efficiently maintained in a key-value store, a search engine, or even a system like </p><p>@PostgreSQL. What goes around comes around... and then around&#8230;</p></li><li><p><a href="https://bsky.app/profile/marcbrooker.bsky.social/post/3ldjhmb7vqk2k">Marc Brooker</a>: To understand the value of backoff (e.g. exponential backoff), it's worth understanding the distinction between 'open' and 'closed' systems. From the classic paper "Open versus Closed: A Cautionary tale" <a href="https://www.usenix.org/conference/nsdi-06/open-versus-closed-cautionary-tale">https://www.usenix.org/conference/nsdi-06/open-versus-closed-cautionary-tale</a></p></li><li><p><a href="https://x.com/MarcJBrooker/status/1877006557692494284">Marc Brooker</a>: For immutable or versioned data? Erasure coding. Constant work, tunable cost/latency trade-off, operational benefits, and (relatively) simple client-side implementation. 
Treat slow responses as erasures.</p></li><li><p><a href="https://bsky.app/profile/jeremymorrell.dev/post/3ldol7ndpfc2g">Jeremy Morrell</a>: The "use tail sampling for your traces" advice should probably also come with a strong "you must be this tall to ride this ride" caveat Simple if you have a monolith, challenging if you have a cluster of microservices, jesus take the wheel if you have geo-distributed traces crossing regions.</p><ul><li><p><a href="https://bsky.app/profile/isburmistrov.bsky.social/post/3ldornnuboc2r">Ivan Burmistrov</a>: IMO sampling is one of the most undercovered topics in the whole o11y area. There are either claims "sampling is not needed" (confusing, untrue), or this one you mentioned. There are no proper guidances, examples of how to do it properly.</p></li></ul></li><li><p><a href="https://bsky.app/profile/ry.codes/post/3lfl3443nfs2p">Ryanne Dolan</a>: Meta and Salesforce both just announced they have halted hiring of SWEs cuz AI has replaced them. And you STILL think your job is safe??</p></li><li><p><a href="https://x.com/GergelyOrosz/status/1873617464594198830">Gergely Orosz</a>: Anyone saying that GenAI could replace software engineers don't understand how software is created (and operated.) Tool innovations have always made the process of building software faster, cheaper: GenAI and AI agents will also do this. These are tools and efficiency gains.</p></li><li><p><a href="https://bsky.app/profile/kelseyhightower.com/post/3lemjvte6oc2i">Kelsey Hightower</a>: DeepSeek, a LLM trained for a fraction of the cost of GPT-Xx models, in 2 months for 6 million, on limited GPUs due to export restrictions, and competing head to head. This is crazy. If these numbers hold up, consider the game changed.</p></li><li><p><a href="https://x.com/rakyll/status/1877508848174379340">Jaana Dogan</a>: Product flywheels are so much easier to create if you have decent infrastructure that allows you to recompose and create product experiences quickly. 
It's a lesson hard to learn if you never had a chance to work for a company with the right building blocks.</p></li><li><p><a href="https://bsky.app/profile/gwenshap.bsky.social/post/3lf73lzqwcc2t">Gwen Shapira</a>: Everyone: LLMs are so intelligent that they'll take our jobs! 2025 is the year of agents and robots! Roomba: inhales 2 usb cables and a cat toy. Proceeds to get stuck humping my Poang chair.</p></li><li><p><a href="https://www.oneusefulthing.org/p/what-just-happened">Ethan Mollick</a> <em>muses on the recent advancements in AI models</em>: </p><ul><li><p>As one fun example, I read <a href="https://nationalpost.com/news/canada/black-plastic">an article about a recent social media panic</a> - an academic paper suggested that black plastic utensils could poison you because they were partially made with recycled e-waste. A compound called BDE-209 could leach from these utensils at such a high rate, the paper suggested, that it would approach the safe levels of dosage established by the EPA. A lot of people threw away their spatulas, but McGill University&#8217;s Joe Schwarcz thought this didn&#8217;t make sense and identified a math error where the authors incorrectly multiplied the dosage of BDE-209 by a factor of 10 on the seventh page of the article - an error missed by the paper&#8217;s authors and peer reviewers. I was curious if o1 could spot this error. So, from my phone, I pasted in the <a href="https://www.sciencedirect.com/science/article/pii/S0045653524022173?via%3Dihub">text of the PDF</a> and typed: &#8220;carefully check the math in this paper.&#8221; That was it. o1 spotted the error immediately (other AI models did not).</p></li><li><p>In fact, even the earlier version of o1, the preview model, seems to represent a leap in scientific ability. 
<a href="https://arxiv.org/pdf/2412.10849">A bombshell of a medical working paper</a> from Harvard, Stanford, and other researchers concluded that &#8220;o1-preview demonstrates <em>superhuman performance</em> [emphasis mine] in differential diagnosis, diagnostic clinical reasoning, and management reasoning, superior in multiple domains compared to prior model generations and human physicians."</p></li><li><p>Potentially more significantly, I have increasingly been told by researchers that o1, and especially o1-pro, is generating novel ideas and solving unexpected problems in their field (<a href="https://x.com/DeryaTR_/status/1865111388374601806?t=vg9mDb5x6k9zfgD5pPW3wQ&amp;s=03">here is one case</a>).</p></li><li><p>&#8230;the lesson to take away from this is that, for better and for worse, we are far from seeing the end of AI advancement.</p></li></ul></li><li><p><a href="https://hackingsaas.thenile.dev/p/what-every-saas-developer-should-da9">Gwen Shapira</a>: I believe that this time next year, we&#8217;ll look at AI validation and observability the same way we look at unit tests today - an essential tool for continuous improvement process.</p></li><li><p><a href="https://www.honeycomb.io/blog/observability-age-of-ai">Charity Majors and Phillip Carter</a>: One disappointing aspect of the current boom is how many companies are being incredibly closed-lipped about the practical aspects of developing with LLMs. Most leading AI companies seem reluctant to show their work or talk about how they resolve the contradictions of applying software engineering best practices to nondeterministic systems, or how AI is changing the way they develop software and collaborate with each other. 
They act like this is part of their secret sauce, or their competitive advantage.</p></li><li><p><a href="https://aiguide.substack.com/p/did-openai-just-solve-abstract-reasoning">Melanie Mitchell</a> <em>muses on whether o3 has solved abstract reasoning or not</em>:</p><ul><li><p>The purpose of abstraction is to be able to quickly and flexibly recognize new situations using known concepts, and to act accordingly. That is, the purpose of abstraction&#8212;at least a major purpose&#8212;is to generalize.</p></li><li><p>The o1 and o3 systems are a bit different. They use a pre-trained LLM, but at inference time, when given a new problem, they do a lot of additional computation, namely, generating their chain-of-thought traces. </p></li><li><p>One set of researchers <a href="https://proceedings.mlr.press/v70/kansky17a/kansky17a.pdf">showed</a>, however, that changing the game just by moving the paddle up a few pixels resulted in the original trained system performing dramatically worse. It seems that the system had learned to play Breakout not by learning the concepts of &#8220;paddle&#8221;, &#8220;ball&#8221;, or &#8220;brick&#8221;, but by learning very specific mappings of pixel configurations into actions. That is, it didn&#8217;t learn the kinds of abstractions that would allow it to generalize.</p><p>I have similar questions about the abstractions discovered by o3 and the other winning ARC programs.</p></li></ul></li></ul><h2>Interesting topic: AI agents and the recent advancements in AI models</h2><p>Since issue <a href="https://www.hotds.dev/p/humans-of-the-data-sphere-issue-5">#5</a>, two interesting blog posts have been written about AI agents and many predict that 2025 will be the year of the AI agent. 
</p><ul><li><p>Anthropic wrote <a href="https://www.anthropic.com/research/building-effective-agents">Building Effective Agents</a>.</p></li><li><p>Chip Huyen wrote <a href="https://huyenchip.com//2025/01/07/agents.html">Agents</a>.</p></li></ul><p>Ethan Mollick has also recently published a number of excellent blog posts:</p><ul><li><p><a href="https://www.oneusefulthing.org/p/the-present-future-ais-impact-long">The Present Future: AI's Impact Long Before Superintelligence</a></p></li><li><p><a href="https://www.oneusefulthing.org/p/prophecies-of-the-flood">Prophecies of the Flood</a></p></li><li><p><a href="https://www.oneusefulthing.org/p/what-just-happened">What just happened</a></p></li></ul><p>In this section, we&#8217;ll explore what AI agents are as well as some of the challenges involved.</p><p>At the most abstract level, Chip Huyen defines an agent in her <a href="https://huyenchip.com//2025/01/07/agents.html">Agents blog post</a>: </p><blockquote><p><em>An agent is anything that can perceive its environment and act upon that environment. Artificial Intelligence: A Modern Approach (1995) defines an agent as anything that can be viewed as perceiving its environment through sensors and acting upon that environment through actuators. This means that an agent is characterized by the environment it operates in and the set of actions it can perform.</em></p></blockquote><p>Taking actions is a defining characteristic of an AI agent compared to just an LLM that provides textual or graphical responses to prompts. Ethan Mollick <a href="https://www.oneusefulthing.org/i/151140006/using-our-tools-and-rules">notes</a> that much of modern work is digital and something that an AI could plausibly do.</p><blockquote><p><em>The digital world in which most knowledge work is done involves using a computer&#8212;navigating websites, filling forms, and completing transactions. 
Modern AI systems can now perform these same tasks, effectively automating what was previously human-only work. This capability extends beyond simple automation to include qualitative assessment and problem identification.</em></p></blockquote><p>Anthropic discuss the definition of an agent in their <a href="https://www.anthropic.com/research/building-effective-agents">Building Effective Agents</a> blog post: </p><blockquote><p><em>"Agent" can be defined in several ways. Some customers define agents as fully autonomous systems that operate independently over extended periods, using various tools to accomplish complex tasks. Others use the term to describe more prescriptive implementations that follow predefined workflows. At Anthropic, we categorize all these variations as <strong>agentic systems</strong>, but draw an important architectural distinction between <strong>workflows </strong>and<strong> agents</strong>:</em></p><ul><li><p><em><strong>Workflows</strong> are systems where LLMs and tools are orchestrated through predefined code paths.</em></p></li><li><p><em><strong>Agents</strong>, on the other hand, are systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks.</em></p></li></ul></blockquote><p>This seems like an important distinction to make. A workflow is a kind of static flow chart of branches and actions that constrain what the AI can do. It&#8217;s prescriptive, more predictable but less flexible. A true AI agent on the other hand determines its own control flow, giving it the freedom to plan and execute flexibly, but comes with additional risk. 
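</p><p>The distinction is easy to see in code. Below is a minimal Python sketch (every name here, including the <code>call_llm</code> stand-in, is invented for illustration and does not come from Anthropic&#8217;s post): the workflow hard-codes its steps in advance, while the agent asks the model to choose the next action on every turn.</p>

```python
# Hedged sketch of "workflow" vs "agent"; call_llm is a hypothetical
# stand-in for a real model call and just returns a canned reply here.

def call_llm(prompt: str) -> str:
    """Stand-in for a real model call; canned reply for illustration."""
    return "done"

# Workflow: the code path is fixed in advance; the LLM only fills in steps.
def summarize_then_translate(text: str) -> str:
    summary = call_llm(f"Summarize: {text}")
    return call_llm(f"Translate to French: {summary}")

# Agent: the LLM decides the next action each turn, until it declares "done".
def run_agent(goal: str, tools: dict) -> str:
    history = [f"Goal: {goal}"]
    for _ in range(10):  # step cap so a looping agent still halts
        decision = call_llm("\n".join(history) + "\nNext action?")
        if decision == "done":
            break
        tool, _, arg = decision.partition(":")
        result = tools.get(tool, lambda a: "unknown tool")(arg)
        history.append(f"{decision} -> {result}")
    return "\n".join(history)
```

<p>Note the step cap in the agent loop: it is the crudest possible guard against an agent that loops without making progress, a failure mode that comes up again later in this section.</p>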
Anthropic note that you should choose the simplest option possible, reaching for a workflow or a full agent only when the task demands it:</p><blockquote><p><em>&#8230;workflows offer predictability and consistency for well-defined tasks, whereas agents are the better option when flexibility and model-driven decision-making are needed at scale.</em></p><p>&#8230;<em>agents can be used for open-ended problems where it&#8217;s difficult or impossible to predict the required number of steps, and where you can&#8217;t hardcode a fixed path.&#8221;</em></p></blockquote><p>In a practical sense, an AI agent is an LLM designed to satisfy specific goals, using a suite of tools to interact with the real world in pursuit of those goals. Chip Huyen classifies the tools into three categories: </p><blockquote><p><em>Depending on the agent&#8217;s environment, there are many possible tools. Here are three categories of tools that you might want to consider: knowledge augmentation (i.e., context construction), capability extension, and tools that let your agent act upon its environment.</em></p></blockquote><p>In order to satisfy a goal, an agent must use a combination of:</p><ul><li><p>Effective planning and reasoning.</p><ul><li><p>The agent makes a plan of steps it needs to perform in order to satisfy the goal.</p></li></ul></li><li><p>Accurate tool selection and execution.</p><ul><li><p>The LLM may need to make API calls for information retrieval and/or make changes or take actions in the real world.</p></li></ul></li><li><p>Self-reflection and evaluation.</p><ul><li><p>At every step, the agent should reflect on what it has planned and the results it has received to ensure it is still doing the right thing.</p></li></ul></li></ul><p>AI agents are stochastic systems, which adds a new flavor of risk to every step. Many agent systems require multiple steps to satisfy a goal and errors in each step can compound. 
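</p><p>The compounding is easy to quantify: if each step succeeds independently with probability p, the chance that the whole run succeeds is p raised to the number of steps. A short sketch (the 95% figure is illustrative):</p>

```python
# End-to-end success of a multi-step run, assuming independent step failures.

def chance_of_success(per_step: float, steps: int) -> float:
    """Probability that every one of `steps` steps succeeds."""
    return per_step ** steps

# A 95%-reliable step looks fine in isolation, but:
print(chance_of_success(0.95, 10))   # ~0.60 over 10 steps
print(chance_of_success(0.95, 100))  # ~0.006 over 100 steps
```

<p>Real agent steps are rarely independent, so treat this as a rough model rather than a law; the point is that per-step reliability sets a hard ceiling on run length.</p>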
This is one of the defining characteristics of AI agents that the agent designer must account for. In fact, accounting for all the failure modes of an agent is where the highest learning curve is found as well as most of the developmental cost.</p><p>As a distributed systems engineer, I&#8217;m probably more on the paranoid side of the risk-awareness spectrum. Through that lens I see all manner of challenges to overcome when building AI agents:</p><ul><li><p><strong>Effective Planning</strong>:</p><ul><li><p>AI agents must create plans that align with their goals while adapting to dynamic environments and incomplete information. Self-reflection and being able to change course may be necessary.</p></li><li><p>Plans may need to be evaluated to ensure that they are feasible, efficient, and contextually appropriate. Typically, we can view an agent as a multi-agent system where planning, execution and evaluation are carried out by separate agents that collaborate.</p></li><li><p>There are a number of things that can go wrong in the planning phase:</p><ul><li><p>The agent does not revise plans when new information contradicts initial assumptions.</p></li><li><p>The agent gets stuck in loops, revisiting the same steps repeatedly without making progress.</p></li><li><p>The agent sets inappropriate or harmful goals due to poorly defined prompts or objectives.</p></li></ul></li></ul></li><li><p><strong>Accurate Tool Selection and Usage</strong>:</p><ul><li><p>Agents need to identify the correct tools (e.g., APIs, models) for a task and invoke them properly.</p></li><li><p>Common issues include:</p><ul><li><p>Invoking the wrong tool. 
Or failing to consider multiple tools or approaches, leading to suboptimal performance.</p></li><li><p>Providing incorrect or incomplete inputs which can create wrong or suboptimal results.</p></li><li><p>Hallucinating non-existent tools.</p></li><li><p>Failing to recognize when tool outputs indicate anomalies, errors, or limitations (as in the humorous Idiocracy clip at the top of the issue).</p></li></ul></li></ul></li><li><p><strong>Reasoning and Decision-Making</strong>:</p><ul><li><p>Agents may struggle to interpret the results of their actions or external tool outputs.</p></li><li><p>Errors in reasoning can lead to invalid conclusions, impacting subsequent actions (the compounding of errors). Some reasoning errors may result from forgetting critical context or information needed to make accurate decisions.</p></li></ul></li><li><p><strong>Failure Modes in Execution</strong>:</p><ul><li><p>Agents can fail to execute actions correctly, leading to unintended consequences. The first challenge is detecting when actions are executed incorrectly and the second challenge is remediating such actions.</p></li><li><p>Difficulty handling edge cases or unexpected outcomes, or handling rare cases as general cases.</p></li><li><p>Monitoring and auditing agents may also be challenging. Not only detecting when things go wrong, but also detecting bias and justifying why certain actions were taken.</p></li></ul></li></ul><p>I could go on but you get the idea. I imagine that a lot of experimentation and iteration will go into AI agent development and getting that last 20% of completeness and polish could be really time consuming. Chip Huyen covers a lot of this in her framing of AI agent development. The Anthropic post also steers you the route of choosing the simplest agent, or no agent at all: </p><blockquote><p><em>When building applications with LLMs, we recommend finding the simplest solution possible, and only increasing complexity when needed. 
This might mean not building agentic systems at all. </em></p><p><em>&#8230;</em></p><p><em>Start with simple prompts, optimize them with comprehensive evaluation, and add multi-step agentic systems only when simpler solutions fall short. When implementing agents, we try to follow three core principles:</em></p><ul><li><p><em>Maintain <strong>simplicity</strong> in your agent's design.</em></p></li><li><p><em>Prioritize <strong>transparency</strong> by explicitly showing the agent&#8217;s planning steps.</em></p></li><li><p><em>Carefully craft your agent-computer interface (ACI) through thorough tool <strong>documentation and testing</strong>.&#8221;</em></p></li></ul></blockquote><p>AI agents and agentic systems are an emerging practice, and given the promise they hold and the rapid improvements in model capabilities, I have to agree that 2025 will indeed be the year of the AI agent.</p><p>That said, I do have some serious concerns; I believe two aspects of AI agents will constrain widespread adoption:</p><ul><li><p><strong>Reliability</strong>. There are so many failure modes, and even the mitigations are usually run by LLMs and therefore have their own failure modes. Errors compound, and detection and mitigation are themselves not fully reliable.</p></li><li><p><strong>Cost</strong>. Agents may require multiple reasoning steps using the more powerful models. All this pushes up the cost. With higher costs come higher demands for the value proposition. 
Of course with the arrival of DeepSeek v3, maybe 2025 will also be the year of the more efficient LLM.</p></li></ul><p>Chip Huyen noted in her post: </p><blockquote><p><em>Compared to non-agent use cases, agents typically require more powerful models for two reasons:</em></p><ul><li><p><em><strong>Compound mistakes</strong>: an agent often needs to perform multiple steps to accomplish a task, and the overall accuracy decreases as the number of steps increases. If the model&#8217;s accuracy is 95% per step, over 10 steps, the accuracy will drop to 60%, and over 100 steps, the accuracy will be only 0.6%.</em></p></li><li><p><em><strong>Higher stakes</strong>: with access to tools, agents are capable of performing more impactful tasks, but any failure could have more severe consequences.</em></p></li></ul></blockquote><p>As AI becomes more and more capable, but with non-trivial associated costs, we may enter an age where cost efficiency is the deciding factor in when to use AI, when to use a human, and when not to do the thing at all. If both AI and human workers can execute a digital task at similar levels of competence, then cost efficiency becomes the defining question. <a href="https://x.com/fchollet/status/1870175296537907588">Fran&#231;ois Chollet</a> made this point over the holiday period:</p><blockquote><p><em>One very important thing to understand about the future: the economics of AI are about to change completely. We'll soon be in a world where you can turn test-time compute into competence -- for the first time in the history of software, marginal cost will become critical. Cost-efficiency will be the overarching measure guiding deployment decisions. How much are you willing to pay to solve X?</em></p></blockquote><p>In this early phase, agents are likely best suited to narrow tasks that do not involve important actions such as bank transfers, costly purchases, or actions that cannot be undone without cost or negative impact. 
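</p><p>One way to keep an agent inside that safe envelope is to gate anything irreversible behind human approval. A hedged sketch (the action names and the reversible/irreversible split are invented for illustration; a real policy needs far more care):</p>

```python
# Gate irreversible actions behind a human-approval callback.
# LOW_STAKES and the action names are hypothetical examples.

LOW_STAKES = {"search_docs", "draft_email", "summarize"}

def execute(action: str, do_action, ask_human) -> str:
    if action in LOW_STAKES:
        return do_action(action)          # reversible: run autonomously
    if ask_human(f"Agent wants to run '{action}'. Allow?"):
        return do_action(action)          # irreversible: human approved
    return f"blocked: {action}"           # irreversible and not approved
```

<p>Passing <code>ask_human</code> in as a callback keeps the policy testable: in tests it can simply return <code>False</code> to confirm that nothing irreversible ever runs unattended.</p>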
Ethan Mollick <a href="https://www.oneusefulthing.org/p/the-present-future-ais-impact-long">noted that</a>:</p><blockquote><p><em>Narrow agents are now a real product, rather than a future possibility. There are already many coding agents, and you can use experimental open-source agents that do <a href="https://agentlaboratory.github.io/">scientific </a>and <a href="https://github.com/virattt/ai-hedge-fund">financial </a>research.</em></p><p><em>Narrow agents are specialized for a particular task, which means they are somewhat limited. That raises the question of whether we soon see generalist agents where you can just ask the AI anything and it will use a computer and the internet to do it. <a href="https://simonwillison.net/2025/Jan/10/ai-predictions/">Simon Willison thinks not</a> despite what Sam Altman has argued. We will learn more as the year progresses, but if general agentic systems work reliably and safely, that really will change things, as it allows smart AIs to take action in the world.</em></p></blockquote><p>It will be fascinating to watch how agentic systems evolve as a category. Adoption of agents may ebb and flow, but with a general trend towards greater use, as:</p><ul><li><p>Capabilities, in-production reliability, and costs become known.</p></li><li><p>Patterns and practices evolve.</p></li><li><p>And, of course, the models themselves progress in both capability and cost efficiency.</p></li></ul><p>I&#8217;ll finish with one more <a href="https://www.oneusefulthing.org/p/the-present-future-ais-impact-long">quote </a>from Ethan Mollick:</p><blockquote><p><em>Organizations need to move beyond viewing AI deployment as purely a technical challenge. Instead, they must consider the human impact of these technologies. 
Long before AIs achieve human-level performance, their impact on work and society will be profound and far-reaching.</em></p></blockquote>]]></content:encoded></item><item><title><![CDATA[Humans of the Data Sphere Issue #5 December 10th 2024]]></title><description><![CDATA[Your biweekly dose of insights, observations, commentary and opinions from interesting people from the world of databases, AI, streaming, distributed systems and the data engineering/analytics space.]]></description><link>https://www.hotds.dev/p/humans-of-the-data-sphere-issue-5</link><guid isPermaLink="false">https://www.hotds.dev/p/humans-of-the-data-sphere-issue-5</guid><dc:creator><![CDATA[Jack Vanlightly]]></dc:creator><pubDate>Tue, 10 Dec 2024 12:46:22 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc50bf85b-a06e-4504-9d93-83d7c3c3419c_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome to Humans of the Data Sphere issue #5!</p><p>First, a Haiku for Cloudflare:</p><div class="pullquote"><p>One small adjustment, </p><p>Tides of load come rushing in, </p><p>Servers gasp, then sink.</p></div><p>Best meme:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!s0x2!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe81bb606-ce13-4ca9-a6ef-2a7c1b1c785d_2000x2000.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!s0x2!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe81bb606-ce13-4ca9-a6ef-2a7c1b1c785d_2000x2000.jpeg 424w, https://substackcdn.com/image/fetch/$s_!s0x2!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe81bb606-ce13-4ca9-a6ef-2a7c1b1c785d_2000x2000.jpeg 848w, https://substackcdn.com/image/fetch/$s_!s0x2!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe81bb606-ce13-4ca9-a6ef-2a7c1b1c785d_2000x2000.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!s0x2!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe81bb606-ce13-4ca9-a6ef-2a7c1b1c785d_2000x2000.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!s0x2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe81bb606-ce13-4ca9-a6ef-2a7c1b1c785d_2000x2000.jpeg" width="494" height="494" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e81bb606-ce13-4ca9-a6ef-2a7c1b1c785d_2000x2000.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:1456,&quot;width&quot;:1456,&quot;resizeWidth&quot;:494,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!s0x2!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe81bb606-ce13-4ca9-a6ef-2a7c1b1c785d_2000x2000.jpeg 424w, https://substackcdn.com/image/fetch/$s_!s0x2!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe81bb606-ce13-4ca9-a6ef-2a7c1b1c785d_2000x2000.jpeg 848w, https://substackcdn.com/image/fetch/$s_!s0x2!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe81bb606-ce13-4ca9-a6ef-2a7c1b1c785d_2000x2000.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!s0x2!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe81bb606-ce13-4ca9-a6ef-2a7c1b1c785d_2000x2000.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft 
pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><a href="https://bsky.app/profile/forrestbrazeal.bsky.social/post/3lax76f4vks2x">by Forrest Brazeal</a></figcaption></figure></div><h2>Quotable Humans</h2><ul><li><p><a href="https://bsky.app/profile/craigkerstiens.com/post/3lcgeuwfvwk2m">Craig Kerstiens</a> (quoting a <a href="https://news.ycombinator.com/item?id=42309685">comment </a>on HN about Aurora DSQL <a href="https://docs.aws.amazon.com/aurora-dsql/latest/userguide/working-with-postgresql-compatibility-unsupported-features.html">limitations</a>): "Postgres compatible" - No views/triggers/sequences - No foreign key constraints - No extensions - No NOTIFY ("ERROR: Function pg_notify not supported") - No nested transactions - No JSONB. 
What, what IS it compatible with?</p></li><li><p><a href="https://bsky.app/profile/gunnarmorling.dev/post/3lch3lrmctc2f">Gunnar Morling</a>: I think it's about time we get a TCK which asserts what "Postgres compatible" means. Like, you need to have sequences, foreign key constraints, views, etc. in order to be able to claim that.</p><ul><li><p><a href="https://bsky.app/profile/pgdba.bsky.social/post/3lckd7r3l6c27">DBA</a>: Something like this? Postgres compatibility index "alpha version" <a href="https://github.com/secp256k1-sha256/postgres-compatibility-index/tree/main/postgres-compatibility-index">https://github.com/secp256k1-sha256/postgres-compatibility-index/tree/main/postgres-compatibility-index</a> </p></li></ul></li><li><p><a href="https://dataengineeringcentral.substack.com/p/aws-s3-tables-the-iceberg-cometh">Daniel Beach</a>: AWS S3 Tables?! The Iceberg Cometh. weep moan, wail all ye Databricks and Snowflake worshipers</p></li><li><p><a href="https://bsky.app/profile/chris.blue/post/3lcgf352mxs2k">Chris Riccomini</a>: Lot's of interesting stuff happening in filesystem-on-object storage these days. <a href="https://regattastorage.com/">regattastorage.com</a> had a splashy YC launch, we have <a href="https://juicefs.com/">juicefs.com</a> and <a href="https://alluxio.io/">alluxio.io</a>, and been hearing about national labs using <a href="https://gluster.org/">gluster.org</a> &amp; <a href="https://ceph.io/">ceph.io</a>. Seems like market is growing for these offerings.</p></li><li><p><a href="https://bsky.app/profile/sriramsub.bsky.social/post/3lcdl32aakc2n">Sriram Subramanian</a>: Vector database companies will become one of the three </p><ul><li><p>a. Search platforms (e.g., Elastic) </p></li><li><p>b. AI platform (inference/training/pipeline) </p></li><li><p>c. 
OLTP provider (e.g., MongoDB, Postgres)</p><p>Hard for vector databases to be a separate offering/company</p></li></ul></li><li><p><a href="https://bsky.app/profile/jdlong.cerebralmastication.com/post/3lbk4plamfc2e">JD Long</a>: I have a kick ass quality engineering lead. She taught me about including QE folks in design so things get built easier to test. And about getting QE involved in logging &amp; monitoring. We&#8217;re calling it &#8220;shift left and right&#8221; which we should probably turn into a dance at the firm Christmas party.</p></li><li><p><a href="https://transactional.blog/notes-on/disaggregated-oltp">Alex Miller</a> (made some notes on disaggregated OLTP Systems): </p><ul><li><p>Socrates feels like a very modern object storage-based database in the WarpStream or turbopuffer kind of way for it being a 2019 paper. This architecture is the closest to Neon&#8217;s as well.</p></li><li><p>Thus, much of the PolarDB Serverless paper is about leveraging a multi-tenant scale-out memory pool, built via RDMA. This makes them also a disaggregated memory database! As a direct consequence, memory and CPU can be scaled independently, and the evaluation shows elastically changing the amount of memory allocated to a PolarDB tenant. However, implementing a page cache over RDMA isn&#8217;t trivial, and a solid portion of the paper is spent talking about the exact details of managing latches on remote memory pages and navigating b-tree traversals.</p></li></ul></li><li><p><a href="https://x.com/kellabyte/status/1863691277386027268">Kelly Sommers</a>: One thing I really notice coming back to C#/.NET space is many have been taught to value style over sound engineering. Given a task to build a report over millions of rows &amp; apply a custom calc to each row they will do this all in C#. Resisting SQL &amp; sprocs 500K round trips. 
[<em>This post triggered quite a lively discussion</em>]</p><ul><li><p><em>Personal note (Jack)</em>: Back in the days of C# 3, I rewrote a C# batch job written in this style, replacing it with a stored procedure, reducing running time from 24 hours to 11 seconds. It was my first week on the job and the DBA was thankful but the architect was not. We made a truce and the SP was still there 5 years later when I moved on.</p></li></ul></li><li><p><a href="https://x.com/chessMan786/status/1863216586414518766">Mohit Mishra</a>: <strong>CPU vs FPGA</strong> - what an easy and well explanation in 25 seconds. [<em>It&#8217;s a cool visualization</em>]</p></li><li><p><a href="https://x.com/airkatakana/status/1865282095708541286">Air Katakana</a>: what looks like &#8220;ai stealing jobs&#8221; to most people should look like &#8220;ai giving massive amount of leverage to individuals with high agency&#8221; to you</p></li><li><p><a href="https://x.com/antgoldbloom/status/1863734742610481385">Anthony Goldbloom</a>: My Kaggle experience suggests<strong> more than 75% of the machine learning models in production or written up in academic papers are overfit. </strong>Kaggle has strong controls on overfitting: - limited number of daily submissions - models were retested on a second test dataset that participants never received feedback on Under these conditions, a very high fraction of first-time competition participants would overfit to the public leaderboard set. And their position would drop dramatically when we retested their model on a second test dataset. This was true for experienced machine learners in academia and industry (not just newbies). The Kaggle controls are not imposed on data scientists for internal company projects or academic research.</p></li><li><p><a href="https://x.com/GergelyOrosz/status/1863563931966406971">Gergely Orosz</a>: There's this evergreen joke on software development that goes something like this: "We're done with 90% of the project. 
Which means we only have the other 90% left to go." It's funny because it's true. It's also why experienced engineers are in-demand: they are the "finishers."</p></li><li><p><a href="https://surfingcomplexity.blog/2024/11/28/quick-takes-on-the-latest-cloudflare-public-incident-write-up/">Lorin Hochstein</a>: The resilience engineering research David Woods uses the term <em>saturation</em> in a more general sense, to refer to a system being in a state where it can no longer meet the demands put upon it. The challenge of managing the risk of saturation is a key part of his <a href="https://www.researchgate.net/publication/327427067_The_Theory_of_Graceful_Extensibility_Basic_rules_that_govern_adaptive_systems">theory of graceful extensibility</a>.</p><p>It&#8217;s genuinely surprising how many incidents involve saturation, and how difficult it can be to recover when the system saturates.</p></li><li><p><a href="https://bsky.app/profile/rbn.bsky.social/post/3lcsqaswgus23">Reuben Bond</a> (<em>interesting discussion thread</em>): CRDTs fit a system model that has hardly any overlap with datacenter-based applications. If I'm wrong, please point to datacenter-based apps which have benefited from the application of CRDTs.</p></li><li><p><a href="https://meltware.com/2024/12/04/s3-tables">Nikhil Benesch</a> (on S3 Iceberg tables): Compaction costs are more of a mixed bag. For analytic workloads that write infrequently, they also look to be immaterial. But for streaming workloads that write frequently (say, once per second, or once per every ten seconds), compaction costs may be prohibitive. The cost per object processed looks tolerable (writing an object per second results in only 2.5MM objects per month that need to be compacted), but write amplification will be severe, and the cost per GB processed is likely to add up. To really get a sense for compaction costs, someone will need to run some experiments. 
A lot depends on how often S3 <em>chooses</em> to compact data files for a given workload, which is not something that&#8217;s directly under the user&#8217;s control.</p></li><li><p><a href="https://buttondown.com/jaffray/archive/what-are-the-magical-clocks-for/">Justin Jaffray</a> (discusses extremely accurate data center clocks): To the best of my knowledge, despite being a concept in the realm of <em>consistency</em>, the "clock trick" is only really necessary in a distributed <em>transactional</em> database, which is why we've only really seen it in Spanner and now DSQL, and it has an important connection to multi-version concurrency control (MVCC). </p><p>&#8230;</p><p>Now, the magic clocks are not <em>perfect</em>, they're not accurate to like, Planck time, or something. But they come with a guarantee that no two of them disagree by more than some bound. That is, when a server observes the time to be <code>t</code>, it knows that no other server will observe the time to be earlier than, say, <code>t-100</code>. The clocks of all the participants in the system are racing along a timeline, but there's a fixed-size window that they all fall within. This guarantee can be used to prevent the situation we described above. The guarantee we want is that once an operation finishes, no other operation will be ordered before it ever again.</p></li><li><p><a href="https://muratbuffalo.blogspot.com/2024/12/exploring-naiadclock-tla-model-in-tla.html">Murat Demirbas</a>: I have been impressed by the usability of <a href="https://github.com/will62794/tla-web">TLA-Web from Will Schultz</a>. Recently I have been using it for my TLA+ modeling of MongoDB catalog protocols internally, and found it very useful to explore and understand behavior. 
This got me thinking that TLA-Web would be really useful when exploring and understanding an unfamiliar spec I picked up on the web.</p><p>To test my hunch, I browsed through <a href="https://github.com/tlaplus/Examples">the TLA+ spec examples here</a>, and I came across <a href="https://github.com/tlaplus/Examples/tree/master/specifications/naiad">this spec about the Naiad Clock</a>. Since I had <a href="https://muratbuffalo.blogspot.com/2024/11/dbsp-automatic-incremental-view.html">read DBSP paper recently</a>, this was all the more interesting to me. I had <a href="https://muratbuffalo.blogspot.com/2014/03/naiad-timely-dataflow-system.html">written about Naiad in 2014</a>, and <a href="https://muratbuffalo.blogspot.com/2017/11/on-dataflow-systems-naiad-and-tensorflow.html">about dataflow systems more broadly in 2017</a>.</p></li><li><p><a href="https://www.onehouse.ai/blog/open-table-formats-and-the-open-data-lakehouse-in-perspective">Dipankar Mazumdar</a>: Adopting these table formats has laid the groundwork for openness. Still, it is crucial to recognize that an open data architecture needs more than just open table formats&#8212;it requires comprehensive interoperability across formats, catalogs, and open compute services for essential table management services such as clustering, compaction, and cleaning to also be open in nature.</p></li><li><p><a href="https://www.decodable.co/blog/failover-replication-slots-with-postgres-17">Gunnar Morling</a> (on Postgres 17 failover slots): Prior to Postgres version 16, read replicas (or stand-by servers) couldn&#8217;t be used at all for logical replication. Logical replication is a method for replicating data from a Postgres publisher to subscribers. These subscribers can be other Postgres instances, as well as non-Postgres tools, such as Debezium, which use logical replication for <strong><a href="https://www.decodable.co/blog/why-do-i-need-cdc">change data capture (CDC)</a></strong>. 
Logical replication slots&#8212;which keep track of how far a specific subscriber has consumed the database&#8217;s change event stream&#8212;could only be created on the primary node of a Postgres cluster.</p><p>&#8230;</p><p>But the good news is, as of Postgres version 17, all this is not needed any longer, as it finally supports failover slots out of the box!</p></li><li><p><a href="https://charity.wtf/2023/03/09/architects-anti-patterns-and-organizational-fuckery/">Charity Majors</a> (on architects): </p><ul><li><p>I think that <em>a lot</em> of companies are using some of their best, most brilliant senior engineers as glorified project manager/politicians to paper over a huge amount of organizational dysfunction, while bribing them with money and prestige, and that honestly makes me pretty angry.</p></li><li><p>Most of the pathologies associated with architects seem to flow from one of two originating causes:</p><ol><li><p>unbundling decision-making authority from responsibility for results, and</p></li><li><p>design becoming too untethered from execution (the &#8220;Frank Gehry&#8221; syndrome)</p></li></ol><p>But it&#8217;s only when being an architect brings more money and prestige than engineering that these problems really tend to solidify and become entrenched.</p></li><li><p>This is also why I think calling the role &#8220;architect&#8221; instead of &#8220;staff engineer&#8221; or &#8220;principal engineer&#8221; may itself be kind of an anti-pattern.</p></li></ul></li><li><p><a href="https://brooker.co.za/blog/2024/12/03/aurora-dsql.html">Marc Brooker</a> (discussing how various technologies came together to make DSQL possible): </p><ul><li><p>The second was EC2 <a href="https://aws.amazon.com/blogs/compute/its-about-time-microsecond-accurate-clocks-on-amazon-ec2-instances/">time sync</a>, which brings microsecond-accurate time to EC2 instances around the globe. 
High-quality physical time is <a href="https://brooker.co.za/blog/2023/11/27/about-time.html">hugely useful for all kinds of distributed system problems</a>. Most interestingly, it unlocks ways to avoid coordination within distributed systems, offering better scalability and better performance. The new horizontal sharding capability for Aurora Postgres, <a href="https://aws.amazon.com/blogs/aws/amazon-aurora-postgresql-limitless-database-is-now-generally-available/">Aurora Limitless Database</a>, uses these clocks to make cross-shard transactions more efficient.</p></li><li><p>The third was Journal, the distributed transaction log we&#8217;d used to build critical parts of multiple AWS services (such as <a href="https://aws.amazon.com/memorydb/">MemoryDB</a>, the Valkey compatible durable in-memory database<a href="https://brooker.co.za/blog/2024/12/03/aurora-dsql.html#foot4"><sup>4</sup></a>). Having a reliable, proven, primitive that offers atomicity, durability, and replication between both availability zones and regions simplifies a lot of things about building a database system (after all, <em>A</em>tomicity and <em>D</em>urability are half of ACID).</p></li><li><p>The fourth was AWS&#8217;s strong <a href="https://aws.amazon.com/blogs/security/an-unexpected-discovery-automated-reasoning-often-makes-systems-more-efficient-and-easier-to-maintain/">formal methods and automated reasoning tool set</a>. Formal methods allow us to explore the space of design and implementation choices quickly, and also helps us build reliable and dependable distributed system implementations<a href="https://brooker.co.za/blog/2024/12/03/aurora-dsql.html#foot6"><sup>6</sup></a>. Distributed databases, and especially fast distributed transactions, are a famously hard design problem, with tons of interesting trade-offs, lots of subtle traps, and a need for a strong correctness argument. 
Formal methods allowed us to move faster and think bigger about what we wanted to build.</p></li></ul></li><li><p><a href="https://medium.com/@hugolu87/lets-never-use-the-phrase-data-observability-ever-again-2d331f389585">Hugo Lu</a>: Which leads to this &#8220;Data Observability Paradox&#8221; &#8212; although they claim to solve data quality they cannot. They can tell you where the problem is, but they do not solve the root cause. Indeed, the root cause is solved by having robust infrastructure (ingestion, orchestration, alerting etc) but most commonly, individuals not caring enough about the data. A cultural problem &#8212; not one that can be solved with Software.</p></li><li><p><a href="https://davidsj.substack.com/p/modellion">David Jayatillake</a>: Let&#8217;s start with a TLDR; Medallion Architecture is not a form of data modeling. It is associated with data modeling, and data modeling happens within it &#8230; I would prefer naming like <strong>staging</strong>, <strong>model</strong> and <strong>presentation</strong>. This is much closer to what people already know and expresses what actually happens in the layers. Bronze, silver and gold make it easier to explain to non-technical users. That&#8217;s the only reason why the product marketers have gone for Medallion architecture, although whether non-technical users really need to understand &#8220;how the sausage is made&#8221; is another question.</p></li><li><p><a href="https://www.oneusefulthing.org/p/15-times-to-use-ai-and-5-not-to">Ethan Mollick</a> (suggestions on when and when not to use AI, I took one of each): </p><ul><li><p><em>When</em>: Work where you need a first pass view at what a hostile, friendly, or naive recipient might think.</p></li><li><p><em>When not</em>: When the effort is the point. In many areas, people need to struggle with a topic to succeed - writers rewrite the same page, academics revisit a theory many times. 
By shortcutting that struggle, no matter how frustrating, you may lose the ability to reach the vital &#8220;aha&#8221; moment.</em></p></li></ul></li></ul><h2>Interesting topic #1: Another incident, another config change</h2><p>I loved Lorin Hochstein's &#8220;<a href="https://surfingcomplexity.blog/2024/11/28/quick-takes-on-the-latest-cloudflare-public-incident-write-up/">quick takes</a>&#8221; on <a href="https://blog.cloudflare.com/cloudflare-incident-on-november-14-2024-resulting-in-lost-logs/">Cloudflare's November 14, 2024, incident</a> (which led to log data loss). Beyond the fact that it was triggered by a config change, the post highlights several recurring patterns in system failures.</p><ol><li><p>Saturation and Overload</p></li><li><p>Safety Mechanisms Backfiring</p></li><li><p>Complex Interactions and Latent Bugs</p></li></ol><p>It&#8217;s a great read, with references to:</p><ul><li><p>Brendan Gregg&#8217;s <a href="https://www.brendangregg.com/usemethod.html">USE method</a>, which has shaped my own approach to running and monitoring distributed data systems. 
</p></li><li><p>David Woods' Theory of Graceful Extensibility [<a href="https://www.researchgate.net/publication/327427067_The_Theory_of_Graceful_Extensibility_Basic_rules_that_govern_adaptive_systems/link/5b92b4584585153a5303568c/download">download from here</a>] (which explores how complex systems adapt to increasing demands or pressures beyond their designed capacity, focusing on their ability to stretch and evolve rather than collapse under stress).</p></li><li><p>Lorin&#8217;s own work on safety mechanisms (such as his blog post <a href="https://surfingcomplexity.blog/2017/06/24/a-conjecture-on-why-reliable-systems-fail/">A Conjecture on Why Reliable Systems Fail</a>).</p></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!BpLG!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c33b041-643f-4a10-a8df-fa7fb2a780de_500x756.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!BpLG!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c33b041-643f-4a10-a8df-fa7fb2a780de_500x756.jpeg 424w, https://substackcdn.com/image/fetch/$s_!BpLG!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c33b041-643f-4a10-a8df-fa7fb2a780de_500x756.jpeg 848w, https://substackcdn.com/image/fetch/$s_!BpLG!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c33b041-643f-4a10-a8df-fa7fb2a780de_500x756.jpeg 1272w, 
https://substackcdn.com/image/fetch/$s_!BpLG!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c33b041-643f-4a10-a8df-fa7fb2a780de_500x756.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!BpLG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c33b041-643f-4a10-a8df-fa7fb2a780de_500x756.jpeg" width="300" height="453.6" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/4c33b041-643f-4a10-a8df-fa7fb2a780de_500x756.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:756,&quot;width&quot;:500,&quot;resizeWidth&quot;:300,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Two button meme.\nLeft button: automated recovery good.\nRight button: complexity bad.\nCaption: adding automation adds complexity.&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Two button meme.
Left button: automated recovery good.
Right button: complexity bad.
Caption: adding automation adds complexity." title="Two button meme.
Left button: automated recovery good.
Right button: complexity bad.
Caption: adding automation adds complexity." srcset="https://substackcdn.com/image/fetch/$s_!BpLG!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c33b041-643f-4a10-a8df-fa7fb2a780de_500x756.jpeg 424w, https://substackcdn.com/image/fetch/$s_!BpLG!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c33b041-643f-4a10-a8df-fa7fb2a780de_500x756.jpeg 848w, https://substackcdn.com/image/fetch/$s_!BpLG!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c33b041-643f-4a10-a8df-fa7fb2a780de_500x756.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!BpLG!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F4c33b041-643f-4a10-a8df-fa7fb2a780de_500x756.jpeg 1456w" sizes="100vw" loading="lazy"></picture></div></a><figcaption class="image-caption">From <a href="https://surfingcomplexity.blog/2024/11/28/quick-takes-on-the-latest-cloudflare-public-incident-write-up/">Quick takes on the latest Cloudflare public incident write-up</a></figcaption></figure></div><p>He also discusses the tension between automated recovery and the additional complexity that such systems introduce.</p><h2>Interesting topic #2: So many data modeling methodologies (for warehousing)</h2><p>TIL of Activity Schema, so I wondered what other (data warehousing) data modeling methodologies were out there, and how they differ.</p><p>Here is a summary (feel free to curse me for my summarization choices). Below are some links to bloggers and practitioners who explore these modeling methodologies in far more detail.</p><ul><li><p><strong>Dimensional modeling (Kimball)</strong></p><ul><li><p>Needs little introduction. Arranges data into fact tables (quantitative measures such as sales amounts or units sold) and dimension tables for descriptive attributes (customer, product, location, time). Forms a star or snowflake schema. Optimized for BI.</p></li><li><p><strong>Pros:</strong> Easy for users to understand and to write analytical queries against. The star schema reduces the need for complex joins, which can be expensive.</p></li><li><p><strong>Cons:</strong> Not super flexible. 
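To make the fact/dimension split concrete, here is a minimal star-schema sketch using Python and SQLite. All table and column names here are hypothetical, invented for illustration rather than taken from any of the linked posts:

```python
import sqlite3

# Minimal star schema (hypothetical names): one fact table plus two
# dimension tables, the classic Kimball arrangement.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, name TEXT, category TEXT);
CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, name TEXT, region TEXT);
CREATE TABLE fact_sales (
    sale_id INTEGER PRIMARY KEY,
    product_id INTEGER REFERENCES dim_product(product_id),
    customer_id INTEGER REFERENCES dim_customer(customer_id),
    sale_date TEXT, units_sold INTEGER, sale_amount REAL
);
INSERT INTO dim_product VALUES (1, 'Widget', 'Hardware'), (2, 'Gadget', 'Hardware');
INSERT INTO dim_customer VALUES (1, 'Acme', 'EU'), (2, 'Globex', 'US');
INSERT INTO fact_sales VALUES
    (1, 1, 1, '2025-04-01', 3, 30.0),
    (2, 2, 1, '2025-04-02', 1, 25.0),
    (3, 1, 2, '2025-04-02', 2, 20.0);
""")

# A typical BI question: revenue per region. Only one join per dimension
# is needed, which is what makes star schemas pleasant to query.
rows = con.execute("""
    SELECT c.region, SUM(f.sale_amount) AS revenue
    FROM fact_sales f
    JOIN dim_customer c ON c.customer_id = f.customer_id
    GROUP BY c.region ORDER BY c.region
""").fetchall()
print(rows)  # [('EU', 55.0), ('US', 20.0)]
```

A snowflake schema would further normalize the dimensions (e.g. splitting category out of dim_product), trading a little query simplicity for less redundancy.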
Not so great for fast-changing transactional data.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!8JhQ!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ccc5821-a8be-4bbc-b192-fa80ddb7d364_678x749.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!8JhQ!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ccc5821-a8be-4bbc-b192-fa80ddb7d364_678x749.png 424w, https://substackcdn.com/image/fetch/$s_!8JhQ!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ccc5821-a8be-4bbc-b192-fa80ddb7d364_678x749.png 848w, https://substackcdn.com/image/fetch/$s_!8JhQ!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ccc5821-a8be-4bbc-b192-fa80ddb7d364_678x749.png 1272w, https://substackcdn.com/image/fetch/$s_!8JhQ!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ccc5821-a8be-4bbc-b192-fa80ddb7d364_678x749.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!8JhQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ccc5821-a8be-4bbc-b192-fa80ddb7d364_678x749.png" width="309" height="341.35840707964604" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1ccc5821-a8be-4bbc-b192-fa80ddb7d364_678x749.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:749,&quot;width&quot;:678,&quot;resizeWidth&quot;:309,&quot;bytes&quot;:48380,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!8JhQ!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ccc5821-a8be-4bbc-b192-fa80ddb7d364_678x749.png 424w, https://substackcdn.com/image/fetch/$s_!8JhQ!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ccc5821-a8be-4bbc-b192-fa80ddb7d364_678x749.png 848w, https://substackcdn.com/image/fetch/$s_!8JhQ!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ccc5821-a8be-4bbc-b192-fa80ddb7d364_678x749.png 1272w, https://substackcdn.com/image/fetch/$s_!8JhQ!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ccc5821-a8be-4bbc-b192-fa80ddb7d364_678x749.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li></ul></li><li><p><strong>Normalized modeling (Inmon)</strong></p><ul><li><p>Focuses on creating a centralized, normalized enterprise data warehouse in Third Normal Form (3NF) to serve as a single source of truth for the organization, with a focus on consistency. Data from multiple source systems is integrated into a unified structure.</p></li><li><p><strong>Pros:</strong> Great for data quality as well as integration with other systems as the data model is optimized for consistency and logical organization (rather than optimized for a specific workload like dimensional modeling is).</p></li><li><p><strong>Cons</strong>: Relies on data marts as query performance is poor (think of a data mart as a business unit specific DW tailored to a more focused use case). 
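A rough sketch of the normalized-core-plus-data-mart idea, again with hypothetical tables in SQLite: the core stays in 3NF, and a mart-style view answers one focused business question on top of it:

```python
import sqlite3

# 3NF-style core (hypothetical names) with a "data mart" view layered on
# top for one specific reporting need.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE region (region_id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE customer (
    customer_id INTEGER PRIMARY KEY,
    name TEXT,
    region_id INTEGER REFERENCES region(region_id)
);
CREATE TABLE orders (
    order_id INTEGER PRIMARY KEY,
    customer_id INTEGER REFERENCES customer(customer_id),
    amount REAL
);
INSERT INTO region VALUES (1, 'EU'), (2, 'US');
INSERT INTO customer VALUES (1, 'Acme', 1), (2, 'Globex', 2);
INSERT INTO orders VALUES (1, 1, 30.0), (2, 1, 25.0), (3, 2, 20.0);
-- The mart hides the joins that make ad hoc querying of the core painful.
CREATE VIEW mart_revenue_by_region AS
    SELECT r.name AS region, SUM(o.amount) AS revenue
    FROM orders o
    JOIN customer c ON c.customer_id = o.customer_id
    JOIN region r ON r.region_id = c.region_id
    GROUP BY r.name;
""")
rows = con.execute("SELECT * FROM mart_revenue_by_region ORDER BY region").fetchall()
print(rows)  # [('EU', 55.0), ('US', 20.0)]
```

In a real Inmon-style warehouse the mart would typically be a separate, dimensionally modeled store rather than a view, but the division of labor is the same.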
Seems like a lot more work to maintain this organization-wide data model.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!n1xl!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F912ca3fc-b748-4888-8cdd-a20945faa2c8_1114x701.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!n1xl!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F912ca3fc-b748-4888-8cdd-a20945faa2c8_1114x701.png 424w, https://substackcdn.com/image/fetch/$s_!n1xl!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F912ca3fc-b748-4888-8cdd-a20945faa2c8_1114x701.png 848w, https://substackcdn.com/image/fetch/$s_!n1xl!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F912ca3fc-b748-4888-8cdd-a20945faa2c8_1114x701.png 1272w, https://substackcdn.com/image/fetch/$s_!n1xl!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F912ca3fc-b748-4888-8cdd-a20945faa2c8_1114x701.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!n1xl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F912ca3fc-b748-4888-8cdd-a20945faa2c8_1114x701.png" width="458" height="288.20287253141834" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/912ca3fc-b748-4888-8cdd-a20945faa2c8_1114x701.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:701,&quot;width&quot;:1114,&quot;resizeWidth&quot;:458,&quot;bytes&quot;:65042,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!n1xl!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F912ca3fc-b748-4888-8cdd-a20945faa2c8_1114x701.png 424w, https://substackcdn.com/image/fetch/$s_!n1xl!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F912ca3fc-b748-4888-8cdd-a20945faa2c8_1114x701.png 848w, https://substackcdn.com/image/fetch/$s_!n1xl!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F912ca3fc-b748-4888-8cdd-a20945faa2c8_1114x701.png 1272w, https://substackcdn.com/image/fetch/$s_!n1xl!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F912ca3fc-b748-4888-8cdd-a20945faa2c8_1114x701.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li></ul></li><li><p><strong>Event-centric modeling (Activity Schema)</strong></p><ul><li><p>Focuses on capturing and organizing data around events or activities (e.g., a purchase, login, or page view) and storing them in a single central activity table known as the <strong>activity stream</strong>. <a href="https://github.com/ActivitySchema/ActivitySchema/blob/main/2.0.md">v2</a> allows for one table per activity. Each activity has a number of attributes that it is authoritative on, such as the web page visited, or value of the purchase etc. Activities in this single table are differentiated by an activity type string column. 
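A minimal activity-stream sketch in SQLite. The column names loosely follow the spirit of the Activity Schema spec (a timestamp, an entity, an activity type discriminator, and a JSON bag of activity-specific attributes), but they are illustrative rather than spec-exact:

```python
import sqlite3

# One central activity table; rows are differentiated by the `activity`
# type column. Column names are hypothetical, loosely Activity-Schema-like.
con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE activity_stream (
    activity_id TEXT PRIMARY KEY,
    ts TEXT NOT NULL,          -- when the activity happened
    customer TEXT NOT NULL,    -- the entity performing the activity
    activity TEXT NOT NULL,    -- activity type discriminator
    feature_json TEXT          -- activity-specific attributes
)""")
con.executemany(
    "INSERT INTO activity_stream VALUES (?, ?, ?, ?, ?)",
    [
        ("a1", "2025-04-01T10:00", "u1", "visited_page", '{"page": "/pricing"}'),
        ("a2", "2025-04-01T10:05", "u1", "started_trial", "{}"),
        ("a3", "2025-04-02T09:00", "u2", "visited_page", '{"page": "/docs"}'),
    ],
)
# A behavioral question: which customers started a trial after visiting a page?
# Temporal self-joins like this are the bread and butter of event modeling.
converted = [r[0] for r in con.execute("""
    SELECT DISTINCT v.customer
    FROM activity_stream v
    JOIN activity_stream t
      ON t.customer = v.customer AND t.ts > v.ts
    WHERE v.activity = 'visited_page' AND t.activity = 'started_trial'
""")]
print(converted)  # ['u1']
```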
Each activity can optionally be linked to one or more entities (dimensions), such as the subject of the activity (the product purchased), the actor (the user who made the purchase), the location, and so on.</p></li><li><p><strong>Pros</strong>: With a shared event table, it is great for understanding actions over time, such as tracking user behavior, system logs, IoT data, or transactions. Captures raw, detailed data that supports deep and flexible analysis.</p></li><li><p><strong>Cons</strong>: Can be challenging computationally (and storage-wise) due to the high cardinality of raw event data. Analytical queries involving joins or aggregations on high-cardinality columns (e.g., aggregating data by user, session, or time) can be slow and resource-intensive. Maintaining indexes can be challenging. High cardinality is a natural consequence of storing detailed, event-level data. There are a number of optimization techniques to mitigate these high-cardinality issues (e.g. data partitioning, aggregation, moving some attributes into dimensions, sampling, multiple activity stream tables etc).</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!ml3G!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db8bdb5-eff4-48a4-a00d-ee6e7b16ed1e_756x704.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!ml3G!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db8bdb5-eff4-48a4-a00d-ee6e7b16ed1e_756x704.png 424w, https://substackcdn.com/image/fetch/$s_!ml3G!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db8bdb5-eff4-48a4-a00d-ee6e7b16ed1e_756x704.png 848w, 
https://substackcdn.com/image/fetch/$s_!ml3G!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db8bdb5-eff4-48a4-a00d-ee6e7b16ed1e_756x704.png 1272w, https://substackcdn.com/image/fetch/$s_!ml3G!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db8bdb5-eff4-48a4-a00d-ee6e7b16ed1e_756x704.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!ml3G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db8bdb5-eff4-48a4-a00d-ee6e7b16ed1e_756x704.png" width="357" height="332.44444444444446" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/5db8bdb5-eff4-48a4-a00d-ee6e7b16ed1e_756x704.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:704,&quot;width&quot;:756,&quot;resizeWidth&quot;:357,&quot;bytes&quot;:59892,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!ml3G!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db8bdb5-eff4-48a4-a00d-ee6e7b16ed1e_756x704.png 424w, https://substackcdn.com/image/fetch/$s_!ml3G!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db8bdb5-eff4-48a4-a00d-ee6e7b16ed1e_756x704.png 848w, 
https://substackcdn.com/image/fetch/$s_!ml3G!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db8bdb5-eff4-48a4-a00d-ee6e7b16ed1e_756x704.png 1272w, https://substackcdn.com/image/fetch/$s_!ml3G!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F5db8bdb5-eff4-48a4-a00d-ee6e7b16ed1e_756x704.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div></li></ul></li><li><p><strong>Data Vault </strong></p><ul><li><p>Organizes data into three core components: hubs, links, and satellites. 
<strong>Hubs</strong> store the unique business keys (e.g., customer IDs, product IDs) representing core business entities. <strong>Links</strong> capture the relationships between these hubs (e.g., which customer placed which order) and support many-to-many relationships. <strong>Satellites</strong> store the descriptive attributes of hubs and links. All three store metadata such as load timestamps and source system identifiers. Satellites enable an append-only approach: changes to attributes are appended rather than overwritten, enabling detailed historical tracking and auditability. No need for Slowly Changing Dimensions (SCD).</p></li><li><p><strong>Pros</strong>: The modular and append-only design makes it easier to maintain incrementally, even by different teams. It is highly adaptable to change and schema evolution. Its historical tracking is ideal for highly regulated environments.</p></li><li><p><strong>Cons</strong>: Queries are more complex (and expensive) than in, say, dimensional modeling. It is also a less intuitive data model for users and analysts, so there may be some reliance on data marts. 
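A tiny hub-plus-satellite sketch in SQLite (hypothetical names, and real Data Vault models add links, hash keys, and more metadata). The point it illustrates is the append-only satellite: a changed attribute becomes a new row, and "current state" is just the latest row per business key:

```python
import sqlite3

# Hub + satellite sketch (hypothetical names). Attribute changes are
# appended with a load timestamp rather than updated in place.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE hub_customer (customer_key TEXT PRIMARY KEY, load_ts TEXT, record_source TEXT);
CREATE TABLE sat_customer (
    customer_key TEXT REFERENCES hub_customer(customer_key),
    load_ts TEXT,
    email TEXT,
    PRIMARY KEY (customer_key, load_ts)
);
INSERT INTO hub_customer VALUES ('C42', '2025-01-01', 'crm');
-- two loads: the email changed, and both versions are retained
INSERT INTO sat_customer VALUES ('C42', '2025-01-01', 'old@example.com');
INSERT INTO sat_customer VALUES ('C42', '2025-03-01', 'new@example.com');
""")
# The "current" view picks the latest satellite row per business key;
# the full history stays queryable for auditing.
current = con.execute("""
    SELECT customer_key, email FROM sat_customer s
    WHERE load_ts = (SELECT MAX(load_ts) FROM sat_customer s2
                     WHERE s2.customer_key = s.customer_key)
""").fetchall()
print(current)  # [('C42', 'new@example.com')]
```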
Its quite different data model comes with a learning curve, though there is some Data Vault support in tooling to help.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Xja8!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0c7a913-6b1e-445e-9ae4-15797f9220fe_922x597.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Xja8!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0c7a913-6b1e-445e-9ae4-15797f9220fe_922x597.png 424w, https://substackcdn.com/image/fetch/$s_!Xja8!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0c7a913-6b1e-445e-9ae4-15797f9220fe_922x597.png 848w, https://substackcdn.com/image/fetch/$s_!Xja8!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0c7a913-6b1e-445e-9ae4-15797f9220fe_922x597.png 1272w, https://substackcdn.com/image/fetch/$s_!Xja8!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0c7a913-6b1e-445e-9ae4-15797f9220fe_922x597.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Xja8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0c7a913-6b1e-445e-9ae4-15797f9220fe_922x597.png" width="461" height="298.5" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/f0c7a913-6b1e-445e-9ae4-15797f9220fe_922x597.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:597,&quot;width&quot;:922,&quot;resizeWidth&quot;:461,&quot;bytes&quot;:55098,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Xja8!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0c7a913-6b1e-445e-9ae4-15797f9220fe_922x597.png 424w, https://substackcdn.com/image/fetch/$s_!Xja8!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0c7a913-6b1e-445e-9ae4-15797f9220fe_922x597.png 848w, https://substackcdn.com/image/fetch/$s_!Xja8!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0c7a913-6b1e-445e-9ae4-15797f9220fe_922x597.png 1272w, https://substackcdn.com/image/fetch/$s_!Xja8!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ff0c7a913-6b1e-445e-9ae4-15797f9220fe_922x597.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li></ul></li><li><p><strong>Anchor modeling</strong></p><ul><li><p>It has some similarities with Data Vault, as it focuses on separating core entities (<strong>anchors</strong>) from their <strong>attributes</strong> as well as relationships between entities (known as <strong>ties</strong>). However, anchor modeling is more highly normalized and is less focused on auditability than Data Vault, with a stronger focus on adaptability and schema evolution. One of the primary goals of Anchor Modeling is to support schema flexibility and evolution without disrupting existing structures. Therefore each attribute gets its own table, new attributes can be added without modifying existing tables, ensuring non-disruptive schema updates. This is a more extreme level of normalization than other methodologies. New attributes or relationships can be added without altering existing tables. Schema changes are handled as non-disruptive extensions. 
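A sketch of the one-table-per-attribute idea with validity-period columns, again in SQLite with hypothetical names. Each attribute row carries its own validity interval, which is what makes point-in-time queries natural:

```python
import sqlite3

# One anchor table plus one table per attribute (hypothetical names).
# Validity columns on the attribute table enable point-in-time lookups.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE product_anchor (product_id INTEGER PRIMARY KEY);
CREATE TABLE product_price (
    product_id INTEGER REFERENCES product_anchor(product_id),
    price REAL,
    valid_from TEXT,
    valid_to TEXT            -- NULL means "still current"
);
INSERT INTO product_anchor VALUES (1);
INSERT INTO product_price VALUES (1, 9.99, '2025-01-01', '2025-03-01');
INSERT INTO product_price VALUES (1, 12.99, '2025-03-01', NULL);
""")

def price_at(ts):
    # Point-in-time lookup: the row whose validity interval covers ts.
    row = con.execute("""
        SELECT price FROM product_price
        WHERE product_id = 1
          AND valid_from <= ?
          AND (valid_to IS NULL OR valid_to > ?)
    """, (ts, ts)).fetchone()
    return row[0]

print(price_at("2025-02-15"))  # 9.99
print(price_at("2025-04-01"))  # 12.99
```

Adding a new attribute (say, product weight) would mean creating a new table alongside product_price, leaving every existing table untouched.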
Each attribute and tie table tracks historical changes using validity period columns (ValidFrom, ValidTo), enabling point-in-time analysis.</p></li><li><p><strong>Pros</strong>: Very adaptable to change. Good support for historical analysis and point-in-time queries.</p></li><li><p><strong>Cons</strong>: Similar to Data Vault, except there is less tooling support. Due to the one-table-per-attribute design, queries can require more joins.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!FL0v!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42851f58-1fe9-485b-98c4-78093821450f_805x537.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!FL0v!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42851f58-1fe9-485b-98c4-78093821450f_805x537.png 424w, https://substackcdn.com/image/fetch/$s_!FL0v!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42851f58-1fe9-485b-98c4-78093821450f_805x537.png 848w, https://substackcdn.com/image/fetch/$s_!FL0v!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42851f58-1fe9-485b-98c4-78093821450f_805x537.png 1272w, https://substackcdn.com/image/fetch/$s_!FL0v!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42851f58-1fe9-485b-98c4-78093821450f_805x537.png 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!FL0v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42851f58-1fe9-485b-98c4-78093821450f_805x537.png" width="427" height="284.8434782608696" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/42851f58-1fe9-485b-98c4-78093821450f_805x537.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:537,&quot;width&quot;:805,&quot;resizeWidth&quot;:427,&quot;bytes&quot;:53585,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!FL0v!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42851f58-1fe9-485b-98c4-78093821450f_805x537.png 424w, https://substackcdn.com/image/fetch/$s_!FL0v!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42851f58-1fe9-485b-98c4-78093821450f_805x537.png 848w, https://substackcdn.com/image/fetch/$s_!FL0v!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42851f58-1fe9-485b-98c4-78093821450f_805x537.png 1272w, https://substackcdn.com/image/fetch/$s_!FL0v!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F42851f58-1fe9-485b-98c4-78093821450f_805x537.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div></li></ul></li><li><p><strong>Wide-table / One Big Table (OBT) modeling</strong></p><ul><li><p>Combines data from multiple sources into a single denormalized table.</p></li><li><p><strong>Pros</strong>: Great for read performance; no joins required. Practical, especially when you are using a columnar storage system such as ClickHouse, Snowflake, or BigQuery. Even better if it can be maintained via pre-aggregation.</p></li><li><p><strong>Cons</strong>: Not flexible. Harder to maintain as data shape changes. 
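A one-big-table sketch in SQLite (hypothetical names): the dimension attributes are denormalized straight into the fact rows, so analytical queries need no joins at all:

```python
import sqlite3

# One big denormalized table (hypothetical names): dimension attributes
# are folded into each row, trading redundancy for join-free queries.
con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE sales_obt (
    sale_date TEXT, product_name TEXT, product_category TEXT,
    customer_name TEXT, customer_region TEXT,
    units_sold INTEGER, sale_amount REAL
)""")
con.executemany("INSERT INTO sales_obt VALUES (?, ?, ?, ?, ?, ?, ?)", [
    ("2025-04-01", "Widget", "Hardware", "Acme", "EU", 3, 30.0),
    ("2025-04-02", "Gadget", "Hardware", "Acme", "EU", 1, 25.0),
    ("2025-04-02", "Widget", "Hardware", "Globex", "US", 2, 20.0),
])
# The same revenue-per-region question a star schema answers with joins,
# answered here with a plain scan-and-group.
rows = con.execute("""
    SELECT customer_region, SUM(sale_amount) FROM sales_obt
    GROUP BY customer_region ORDER BY customer_region
""").fetchall()
print(rows)  # [('EU', 55.0), ('US', 20.0)]
```

The downside is visible in the data itself: if Acme changes region, every one of its rows must be rewritten, which is the consistency cost noted above.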
Harder to keep consistent.</p></li></ul></li></ul><p>These methodologies also cover more than just data modeling, such as how data goes through stages of processing and how data is queried from BI tools.</p><p>There are a number of people who write about this stuff (not an exhaustive list):</p><ul><li><p>Joe Reis and his <a href="https://practicaldatamodeling.substack.com/">Practical Data Modeling substack</a></p></li><li><p>Simon Sp&#228;ti, who maintains his <a href="https://www.ssp.sh/brain/">Second Brain</a>, with plenty of content on <a href="https://www.ssp.sh/brain/data-modeling">data modeling</a>.</p></li><li><p>Kent Graziano <a href="https://kentgraziano.com/category/data-vault/">writes a lot about Data Vault</a>.</p></li><li><p>Hans Hultgren also <a href="https://hanshultgren.wordpress.com/">writes a lot about Data Vault</a>.</p></li><li><p>Dave Wells <a href="https://www.eckerson.com/blogs/data-management">writes about data modeling</a> and data management in general.</p></li></ul>
]]></content:encoded></item><item><title><![CDATA[Humans of the Data Sphere Issue #4 November 22nd 2024]]></title><description><![CDATA[Your biweekly dose of insights, observations, commentary and opinions from interesting people from the world of databases, AI, streaming, distributed systems and the data engineering/analytics space.]]></description><link>https://www.hotds.dev/p/humans-of-the-data-sphere-issue-4</link><guid isPermaLink="false">https://www.hotds.dev/p/humans-of-the-data-sphere-issue-4</guid><dc:creator><![CDATA[Jack Vanlightly]]></dc:creator><pubDate>Fri, 22 Nov 2024 14:41:11 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc50bf85b-a06e-4504-9d93-83d7c3c3419c_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome to Humans of the Data Sphere issue #4!</p><p>Best meme (referring to the idea of <a href="https://github.com/frectonz/pglite-fusion">storing SQLite databases in a Postgres database</a>):</p>
<div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!kMYj!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F966f6ea9-cf7e-408e-88fd-07d86474b6bb_622x401.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!kMYj!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F966f6ea9-cf7e-408e-88fd-07d86474b6bb_622x401.jpeg 424w, https://substackcdn.com/image/fetch/$s_!kMYj!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F966f6ea9-cf7e-408e-88fd-07d86474b6bb_622x401.jpeg 848w, https://substackcdn.com/image/fetch/$s_!kMYj!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F966f6ea9-cf7e-408e-88fd-07d86474b6bb_622x401.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!kMYj!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F966f6ea9-cf7e-408e-88fd-07d86474b6bb_622x401.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!kMYj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F966f6ea9-cf7e-408e-88fd-07d86474b6bb_622x401.jpeg" width="424"
height="273.3504823151125" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/966f6ea9-cf7e-408e-88fd-07d86474b6bb_622x401.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:401,&quot;width&quot;:622,&quot;resizeWidth&quot;:424,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Image&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Image" title="Image" srcset="https://substackcdn.com/image/fetch/$s_!kMYj!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F966f6ea9-cf7e-408e-88fd-07d86474b6bb_622x401.jpeg 424w, https://substackcdn.com/image/fetch/$s_!kMYj!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F966f6ea9-cf7e-408e-88fd-07d86474b6bb_622x401.jpeg 848w, https://substackcdn.com/image/fetch/$s_!kMYj!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F966f6ea9-cf7e-408e-88fd-07d86474b6bb_622x401.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!kMYj!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F966f6ea9-cf7e-408e-88fd-07d86474b6bb_622x401.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture>
</div></a><figcaption class="image-caption"><a href="https://x.com/BonesMoses/status/1858923178728308906">By Shaun Thomas</a></figcaption></figure></div><h2>Quotable Humans</h2><ul><li><p><a href="https://bsky.app/profile/allendowney.bsky.social/post/3l7xvzjyerj2g">Allen Downey</a>: In August I had the pleasure of presenting a talk at posit::conf, called A Future of Data Science, in which I assert that data science exists because statistics missed the boat on computation.</p></li><li><p><a href="https://bsky.app/profile/eugeneyan.bsky.social/post/3lafmvmz5762z">Eugene Yan</a> (and <a href="https://eugeneyan.com/writing/conf-lessons/">blog post</a>): Here are 39 lessons I took away from conferences this year. <em>[a selection follows]</em></p><ul><li><p>4. Machine learning involves trade-offs. Recall vs. precision. Explore vs. exploit. Relevance vs. diversity vs. serendipity. Accuracy vs. speed vs. cost.
The challenge is figuring out the right balance for your user experience.</p></li><li><p>5. Set realistic expectations. Most problems have a ceiling on what can be achieved, especially those that involve predicting the behavior of unpredictable humans (e.g., search, recommendations, fraud). It may not make sense to aim beyond the ceiling, unless you&#8217;re doing core research.</p></li><li><p>7. Evals are a differentiator and moat. Over the past two years, teams with solid evals have been able to continuously ship reliable, delightful experiences. No one regrets investing in a robust evaluation framework.</p></li><li><p>9. Brandolini&#8217;s law: The energy needed to refute bullshit is an order of magnitude larger than needed to produce it. The same applies to using LLMs. Generating ~~slop~~ content is easy relative to evaluating and guardrailing the defects. But the latter is how we earn&#8212;and keep&#8212;customer trust.</p></li><li><p>12. Build with an eye toward the future. Flexibility beats specialization in the long run. Remember The Bitter Lesson. An LLM that&#8217;s a bit worse now will likely outperform a custom finetune later, especially as LLMs get cheaper (two orders of magnitude in 18 months!), faster, and more capable.</p></li></ul></li><li><p><a href="https://bsky.app/profile/jennajordan.me/post/3lazg4aayns2x">Jenna Jordan</a>: Sometimes I feel like my superpower has become just knowing the right people to put in a room together and then getting them to talk together about stuff. I can debug things best by having a mental model of the socio-technical system - the code and the people who have tacit domain knowledge.</p></li><li><p><a href="https://bsky.app/profile/alexmillerdb.bsky.social/post/3lb6pdvdmnv2e">Alex Miller</a>: How did we all end up omitting the "Tree" and just saying "LSMs"? Without it, we're just saying "log-structured merges have great write amplification", as if nouns just don't matter anymore. 
No one calls a B-tree a B.</p><ul><li><p><a href="https://bsky.app/profile/shachaf.net/post/3lb6qblb22c24">shachaf</a>: I think this is the answer -- there's no actual tree anywhere, unless you consider all binary searches to be implicit trees. I don't think the level structure itself is very tree-like. (And the basic LSM idea, a log compacted into mergeable snapshots, works with even less tree-like structures.)</p></li></ul></li><li><p><a href="https://bsky.app/profile/rakyll.org/post/3lbgqmkejb22m">Jaana Dogan</a>: (A popular) unpopular opinion: When you design by committee, you are mostly focusing on designing pieces that are narrow and poorly put together to make the individuals in the committee happy, rather than building anything that is cohesive and representative of the broader problem space.</p><ul><li><p><a href="https://bsky.app/profile/sidd.io/post/3lbgvz7viss26">Siddharth Goel</a>: Best idea should win. I believe committees are important since you would want the feedback rather than building something in a silo. However, if it comes down to &#8220;happiness&#8221; then I sense politics rather than meritocracy.</p></li></ul></li><li><p><a href="https://bsky.app/profile/rakyll.org/post/3lbgppoonp22a">Jaana Dogan</a>: No one gets promoted for removing chaos or saying no to prematurely built artifacts that are guaranteed to drag the company down. Until this culture changes, I don't expect anything to change.</p></li><li><p><a href="https://bsky.app/profile/sungkim.bsky.social/post/3lbfszcjzqc2b">Sung Kim</a>: Oh wow!
A surprising result from Databricks when measuring embeddings and rerankers on internal evals.</p><p>1- Reranking few docs improves recall (expected).</p><p>2- Reranking many docs degrades quality (!).</p><p>3- Reranking too many documents is quite often worse than using embedding model alone (!!).</p><ul><li><p>Paper: Drowning in Documents: Consequences of Scaling Reranker Inference (<a href="http://arxiv.org/abs/2411.11767">arxiv.org/abs/2411.11767</a>)</p></li></ul></li><li><p><a href="https://bsky.app/profile/apurvamehta.com/post/3layudnadic2p">Apurva Mehta</a> <em>kicked off a long thread about what stream processing is, with this post</em>: I'm finding two divergent interpretations of 'stream processing'. For some, it's primarily 'better' data processing, e.g. streaming is better than batch.</p><p>For others, it's a way to build applications and experiences that wouldn't be possible in a non event-driven fashion. <em>[some interesting replies, see the thread for better context]</em></p><ul><li><p><a href="https://bsky.app/profile/micahw.com/post/3laz77efkdc2c">Micah Wylde</a>: I agree with this distinction&#8212;building real-time datasets vs. building event-driven applications. The former leans towards "stream processors" like Flink or Arroyo which are complete systems that host your code. The latter towards libraries like KStreams/Restate that are embedded in your code.</p></li><li><p><a href="https://bsky.app/profile/chris.blue/post/3laz5tmuib22t">Chris Riccomini</a>: Makes me think of push vs. pull. When something needs to happen based on an event, push (reactive) is required. OTOH, when you want to read stuff, pull (incremental materialized views) works fine.</p></li></ul></li><li><p><a href="https://bsky.app/profile/nicoritschel.com/post/3lbf7t7kbjk2k">Nico Ritschel</a>: The Iceberg on pg via duckdb space is heating up.
With write (&amp; compaction) support, too!</p></li><li><p><a href="https://bsky.app/profile/gunnarmorling.dev/post/3lbfgltf4nk25">Gunnar Morling</a>: It always makes me sad a little when seeing legacy Java API types like java.util.Date being used in newly written code. Is there any prescriptive build-time checker people are using to flag usage of not-quite-officially-deprecated-yet-non-recommended types like this?</p></li><li><p><a href="https://bsky.app/profile/ssp.sh/post/3lbf3xudbes2z">Simon Sp&#228;ti</a>: Data modeling is one of the most essential tasks when starting a data project. But why don't we take more care of it? Why did we write the same metrics differently across departments? Why do we keep reinventing data models with each new tool we adopt?</p></li><li><p><a href="https://bsky.app/profile/andypavlo.bsky.social/post/3lbet2p7yv22f">Andy Pavlo</a>: Correction: @glaredb.bsky.social is moving *away* from DataFusion! Their talk discusses the problems with building a DBMS using off-the-shelf parts. Like @duckdb.org, the new GlareDB rewrite borrows ideas from the Germans' HyPer system but it's written in Rust. YouTube <a href="https://www.youtube.com/watch?v=Sor3KZpmbHg">video</a>.</p></li><li><p><a href="https://bsky.app/profile/abbeyperini.dev/post/3lb5nc2ayc22g">Abbey Perini</a>: Another day, another developer is convinced they can find a tech solution to a human problem.</p></li><li><p><a href="https://bsky.app/profile/lalithsuresh.bsky.social/post/3lbaoqoyrjk2b">Lalith Suresh</a>: The single-most valuable tool when doing latency measurements is an ECDF (empirical cumulative distribution function). Collect every single latency sample, plot the ECDF to look at the *entire distribution*, and resist the temptation to compute summaries until you can explain the ECDF.</p><ul><li><p>Why? Summaries (avg, median, ranges etc) routinely mask pathological behaviors (tail latencies, skews, step behaviors etc).
All these problems show up visibly in an ECDF.</p></li></ul></li><li><p><a href="https://bsky.app/profile/chris.blue/post/3lbahryj54c2v">Chris Riccomini</a>: Maybe the driver for data lakehouse adoption is actually B2B SaaS integration (prequel.co). Even if you've got Snowflake/BQ, if all your SaaS vendors are exposing data through Iceberg/Parquet, you're gonna adopt that. Once adopted, might as well go all in.</p></li><li><p><a href="https://bsky.app/profile/pauldix.bsky.social/post/3lajybojgys2z">Paul Dix</a>: So I have this theory that DataFusion, despite being a SQL engine, will actually enable a new breed of data systems to create non-SQL languages for working with data.</p></li><li><p><a href="https://x.com/fchollet/status/1858186932041269604">Fran&#231;ois Chollet</a>: Asking lots of "dumb" questions isn't a sign of stupidity. If anything it is more likely to be the sign of a person who is very strict about always keeping a crystal clear mental model of the topic at hand -- the questions are just a symptom of their perpetual model refinement process. Mildly correlates with intelligence, and highly correlates with competence.</p></li><li><p><a href="https://x.com/aedmans/status/1858057629550916080">Alex Edmans</a>: I asked my guest speaker the one thing a young person starting a career can do to stand out. He said "inspire trust". It's not about showing how clever you are, but that you're reliable, motivated, truthful, and ethical. 
I love that advice, and it's relevant for all seniorities.</p></li><li><p><a href="https://bsky.app/profile/debasishg.bsky.social/post/3lb27dzfaes2w">Debasish Ghosh</a>: Good refactoring vs Bad refactoring - <a href="https://www.builder.io/blog/good-vs-bad-refactoring">https://www.builder.io/blog/good-vs-bad-refactoring</a></p><p>- Precondition for good refactoring: understand the context and the code</p><p>- Postcondition for good refactoring: maintain refactoring invariants for unit and integration tests (<a href="https://buttondown.com/hillelwayne/archive/refactoring-invariants">https://buttondown.com/hillelwayne/archive/refactoring-invariants</a>)</p></li><li><p><a href="https://bsky.app/profile/rahulj51.bsky.social/post/3lb6wjetem22x">Rahul Jain</a>: A good tip for data engineering architecture is to treat a "Table" as a logical entity and not tie it to the way it is materialized or mapped to other subsystems. A table can be anything - a set of files, a view, the result of a query. Its definition and physical manifestation should be decoupled.</p></li><li><p><a href="https://bsky.app/profile/chris.blue/post/3lbe6up3y2c2n">Chris Riccomini</a>: This does make me wonder: are we on the cusp of many, many IVM implementations? If so, gives me strong vector search vibes. Useful, but, "is this a feature or a product," uneasiness. <em>[editor&#8217;s note: IVM stands for Incremental View Maintenance]</em></p><ul><li><p><a href="https://bsky.app/profile/gwenshap.bsky.social/post/3lbeawumxmc2g">Gwen Shapira</a>: Vector search is a wonderful Postgres feature. </p><p>On the other hand, JSON is also a wonderful Postgres feature and MongoDB seems...
fine?</p></li><li><p><a href="https://bsky.app/profile/gunnarmorling.dev/post/3lbef2b5w3k2b">Gunnar Morling</a>: Yeah, I've always felt once pg_ivm is in a usable state, it will render many external solutions mostly unnecessary.</p></li></ul></li><li><p><a href="https://bsky.app/profile/alemartinello.bsky.social/post/3lbe4ghyx422d">Chris Riccomini</a>: Found this Benn Stancil post. It's wrestling with something that I've felt for a while. Creating an "analytics engineer" was a mistake. AE is too narrow a role. They're getting squeezed on budget 'cuz their cost &gt; value. AEs are flailing to find something&#8230; reverse ETL, data products, semantic layers.</p></li><li><p><a href="https://bsky.app/profile/alemartinello.bsky.social/post/3lbe4ghyx422d">Alessandro Martinello</a>: Because I thought my main product was data, and the slides were just a wrapper for that. Now, I believe that to make changes, the insights are my main product</p></li><li><p><a href="https://bsky.app/profile/imightbemary.bsky.social/post/3lbah6bvk2222">Katie Bauer</a>: Product managers and designers get credit for all kinds of things that they didn't literally do, because they rightfully recognize that they're part of the team that built something. This is no different from the role a data professional plays.</p></li><li><p><a href="https://bsky.app/profile/jdlong.cerebralmastication.com/post/3lb7y6u7pa22o">JD Long</a>: In discussions of data engineering (DE) there&#8217;s huge focus on DE feeding data science and analytics. But rarely is there discussion around DE feeding finops (accounting &amp; finance).</p></li><li><p><a href="https://bsky.app/profile/joereis.bsky.social/post/3lb4enrtiu22c">Joe Reis</a>: I see some people confusing the Medallion architecture (bronze, silver, gold) with data modeling. This is a workflow that might facilitate data modeling, but it&#8217;s NOT a data model.
</p></li><li><p>Via <a href="https://joereis.substack.com/p/the-quality-paradox">Joe Reis</a>:</p><ul><li><p>Peopleware: Quality, far beyond that required by the end user, is a means to higher productivity.</p></li><li><p>Peopleware: A policy of &#8220;Quality - If Time Permits&#8221; will assure that no quality at all sneaks into the product.</p></li></ul></li><li><p><a href="https://bsky.app/profile/emollick.bsky.social/post/3layiw5atyc2p">Ethan Mollick</a>: There is a lot of energy going into fine-tuning models, but specialized medical AI models lost to their general versions 38% of the time, only won 12%. Before spending millions on specialized training, might be worth exploring what base models can do with well-designed prompts.</p></li><li><p><a href="https://bsky.app/profile/emollick.bsky.social/post/3las7c2irv224">Ethan Mollick</a>: Paper shows that AI (in this case a diffusion model) accelerates innovation. Among key findings: 1) GenAI increases novel discoveries: a 39% increase in patent filings! 2) It boosts the best performers by acting as a co-intelligence 3) It takes away some of the fun parts of work</p></li><li><p><a href="https://x.com/lauriewired/status/1858975441723552006">LaurieWired</a>: CPU % usage is really complicated. On Apple Silicon, you could use as little as 27% of the CPU's maximum frequency, yet Activity Monitor will show 100% usage of the core. Why? It all has to do with active residency.</p><ul><li><p>One of the coolest options on linux is PSI, or "Pressure Stall Information". PSI focuses on task delays, instead of traditional CPU metrics. With PSI, you can pinpoint whether CPU, memory, or I/O pressure is causing application slowdowns. The docs on http://kernel.org have a great overview of how this measurement works: <a href="https://t.co/rJeoplE6xc">https://docs.kernel.org/accounting/psi.html</a>. In any case, it's important to keep in mind that different operating systems measure CPU usage differently. 
A process may be slowing down for reasons you may not expect!</p></li></ul></li><li><p><a href="https://muratbuffalo.blogspot.com/2024/11/dbsp-automatic-incremental-view.html">Murat Demirbas</a>: DBSP is a simplified version of differential dataflow. <a href="https://muratbuffalo.blogspot.com/2017/11/on-dataflow-systems-naiad-and-tensorflow.html">I talked about differential dataflow in 2017 here.</a> Differential dataflow represented time as a lattice, DBSP represents it as a single array. In other words, in DBSP time is consecutive and each state requires a unique predecessor. In differential dataflow, time is defined as a partial order to capture causality and concurrency. That is a better model for distributed systems, but that introduces a lot of complexity. DBSP makes a big simplification when it assumes linear synchronous time, and probably opens up issues related to managing out-of-order inputs, incorporating data from multiple streams, and tracking progress.&nbsp;</p><p>But the simplification buys us powerful properties. The chain rule below emerges as a particularly powerful composition technique for incremental computation. If you apply two queries in sequence and you want to incrementalize that composition, it's enough to incrementalize each query independently. This is big. This allows independent optimization of query components, enabling efficient and modular query processing.&nbsp;</p></li><li><p><a href="https://www.graphiumlabs.com/blog/the-real-ai-revolution">Nisan Haramati</a>: but at the very core, the real physical revolution is the ability to perform huge matrix computations at scales beyond the capacity of any single computer, and the mind boggling rate at which this capacity has been increasing over the last few years. What NVIDIA is selling is time compression. 
You can now execute 10,000 <em>linear compute years</em> of training (if they were run on one single-threaded CPU) in less than a day.&nbsp;&#8230; I think they (LLMs) are the first example of a generalizable application utilizing this technology we now have access to.&nbsp;</p></li><li><p><a href="https://www.graphiumlabs.com/blog/the-real-ai-revolution">Nisan Haramati</a>: In his <a href="http://www.perfdynamics.com/Manifesto/USLscalability.html">Universal Scalability Law (USL)</a>, Neil Gunther describes four phases, which can be labelled as:</p><ul><li><p>1. Roughly Linear - early scaling, no/low contention</p></li><li><p>2. Slowing down - increasing contention reduces efficiency</p></li><li><p>3. Plateau - the cost of contention rises to the point of "eating up" the value added by new resources</p></li><li><p>4. Negative returns - where the operational cost of new resources actually <em>diminishes</em> the overall system capacity</p></li></ul></li><li><p><a href="https://davidasboth.com/impostor-syndrome">David Asboth</a>: We shouldn't be afraid to question basic assumptions; in fact, progress often lies there. Think of it like a reverse impostor syndrome: by freeing yourself from the weight of knowing everything, you boldly go into a meeting as someone who is ignorant but eager to learn.</p></li><li><p><a href="https://transactional.blog/blog/2024-modern-database-hardware">Alex Miller</a>: <em>[wrote about hardware advances and database architectures, a couple of quotes follow, but really the whole thing is pure gold (if a little depressing considering the realities of the cloud discussed at the end)]</em></p><ul><li><p>Marc Richards put together a nice <a href="https://talawah.io/blog/linux-kernel-vs-dpdk-http-performance-showdown/">Linux Kernel vs DPDK benchmark</a> that ends with DPDK offering a 50% increase in throughput, followed by an enumeration of the slew of drawbacks one accepts to gain that 50%.
It seems to be a tradeoff most databases aren&#8217;t interested in, and even ScyllaDB has mostly dropped its investment into it.</p></li><li><p>Similar to SMR, this reduces the cost of a ZNS SSD as compared to a "regular" SSD, but there&#8217;s an additional focus on application-driven<sup>[4]</sup> garbage collection being more efficient, thus decreasing total write amplification and increasing drive lifetime. Consider LSMs on SSDs, which already operate via incremental appending and large erase blocks. Removing the FTL between an LSM and the SSD opens <a href="https://scholar.google.com/scholar?cluster=17379606248569225336">opportunity for optimizations</a>. More recently, Google and Meta have collaborated on a proposal for <a href="https://www.micron.com/about/blog/storage/innovations/eliminating-the-io-blender-promise-of-flexible-data-placement">Flexible Data Placement (FDP)</a>, which acts as more of a hint for grouping writes with related lifetimes than strictly enforcing the partitioning as ZNS does. The goal is to enable an easier upgrade path where an SSD could ignore the FDP part of the write request and still be semantically correct, just have worse performance or write amplification.</p></li></ul></li><li><p><a href="https://www.linkedin.com/blog/engineering/infrastructure/stateful-workload-operator-stateful-systems-on-kubernetes-at-linkedin">Michael Youssef, byzheyi Y and Daniel Cheng</a>: we present our Stateful Workload Operator, an alternative model to the traditional approach: all stateful applications now share a common operator with a single custom resource, while application-specific customizations are handled by pluggable external policy engines. 
At LinkedIn, we've inverted the traditional stateful application operator model, providing application owners with a generic building block and a centralized point to manage storage, external integrations, tooling, and other features.</p><ul><li><p>The benefits of using StatefulSet were minimal, and it would have only reduced a small portion of the overall code complexity versus developing our own pod management. By developing our own custom resources, we were able to overcome these challenges, achieve the flexibility we needed, and align more closely with our specific requirements.</p></li></ul></li><li><p><a href="https://modal.com/blog/the-future-of-ai-needs-more-flexible-gpu-capacity">Erik Bernhardsson</a>: <em>[shares his thoughts on serverless inference]</em></p><ul><li><p>So why is most GPU demand driven by training even though inference is where you make the money? &#8230;for the economics of this to make sense <em>eventually</em>, we need to see a much larger % of GPU spend going towards inference.</p></li><li><p>Pooling lots of users into the same underlying pool of compute can improve utilization drastically. It reduces amount of capacity that has to be reserved in aggregate. Instead of provisioning for the sum of the peaks, you can provision for the peak of the sum. This is a much smaller number!</p></li><li><p>Fast initialization of models is a hard problem. A typical workload needs to fire up a Python interpreter with a lot of modules, and load gigabytes of model weights onto the GPU. 
Doing this <em>fast</em> (as in, seconds or less) takes a lot of low-level work.</p></li></ul></li></ul><h2>Published Humans</h2><ul><li><p><a href="https://arxiv.org/pdf/2409.14160">Hype, Sustainability, and the Price of the Bigger-is-Better Paradigm in AI (paper)</a>: With the growing attention and investment in recent AI approaches such as large language models, the narrative that the larger the AI system the more valuable, powerful and interesting it is is increasingly seen as common sense. But what is this assumption based on, and how are we measuring value, power, and performance? And what are the collateral consequences of this race to ever-increasing scale? Here, we scrutinize the current scaling trends and trade-offs across multiple axes and refute two common assumptions underlying the &#8216;bigger-is-better&#8217; AI paradigm: </p><ul><li><p>1) that improved performance is a product of increased scale, and </p></li><li><p>2) that all interesting problems addressed by AI require large-scale models. Rather, we argue that this approach is not only fragile scientifically, but comes with undesirable consequences. </p><ul><li><p>First, it is not sustainable, as its compute demands increase faster than model performance, leading to unreasonable economic requirements and a disproportionate environmental footprint. </p></li><li><p>Second, it implies focusing on certain problems at the expense of others, leaving aside important applications, e.g. health, education, or the climate. 
</p></li><li><p>Finally, it exacerbates a concentration of power, which centralizes decision-making in the hands of a few actors while threatening to disempower others in the context of shaping both AI research and its applications throughout society</p></li></ul></li></ul></li></ul><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!qPaW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F842a5682-e8ac-4738-850d-714df65675f6_612x407.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!qPaW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F842a5682-e8ac-4738-850d-714df65675f6_612x407.png 424w, https://substackcdn.com/image/fetch/$s_!qPaW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F842a5682-e8ac-4738-850d-714df65675f6_612x407.png 848w, https://substackcdn.com/image/fetch/$s_!qPaW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F842a5682-e8ac-4738-850d-714df65675f6_612x407.png 1272w, https://substackcdn.com/image/fetch/$s_!qPaW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F842a5682-e8ac-4738-850d-714df65675f6_612x407.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!qPaW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F842a5682-e8ac-4738-850d-714df65675f6_612x407.png" width="436" height="289.9542483660131" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/842a5682-e8ac-4738-850d-714df65675f6_612x407.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:407,&quot;width&quot;:612,&quot;resizeWidth&quot;:436,&quot;bytes&quot;:74340,&quot;alt&quot;:&quot;&quot;,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" title="" srcset="https://substackcdn.com/image/fetch/$s_!qPaW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F842a5682-e8ac-4738-850d-714df65675f6_612x407.png 424w, https://substackcdn.com/image/fetch/$s_!qPaW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F842a5682-e8ac-4738-850d-714df65675f6_612x407.png 848w, https://substackcdn.com/image/fetch/$s_!qPaW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F842a5682-e8ac-4738-850d-714df65675f6_612x407.png 1272w, https://substackcdn.com/image/fetch/$s_!qPaW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F842a5682-e8ac-4738-850d-714df65675f6_612x407.png 1456w" sizes="100vw"></picture></div></a></figure></div><h2>Interesting topic #1: Using modeling and simulation in distributed systems</h2><p>Forgive me for including a topic I just wrote about, but literally on the same day that I blogged about using simulation to understand properties and characteristics of complex systems, Datadog engineers (Arun Parthiban, Sesh Nalla, Cecilia Wat-Kim) released <a href="https://www.datadoghq.com/blog/engineering/formal-modeling-and-simulation/">a post on the same topic</a>.</p><p>The post explores the complexities of building distributed systems and how traditional testing methods often fall short at the system level. It then describes how the team combined formal modeling, lightweight simulations, and chaos testing to analyze the design, expected performance, and real-world behavior of Courier, their massive-scale queueing system.
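</p><p>To give a flavor of what such a lightweight simulation can look like (my own toy sketch, not code from the Datadog post or from Courier), here is a minimal discrete-event simulation of an M/M/1 queue that estimates mean request latency and can be checked against the known closed-form result:</p>

```python
import random

def mm1_mean_latency(lam: float, mu: float, n: int, seed: int = 42) -> float:
    """Estimate mean time-in-system for an M/M/1 queue by simulation."""
    rng = random.Random(seed)
    arrival = 0.0      # arrival time of the current request
    server_free = 0.0  # time at which the server last became idle
    total = 0.0
    for _ in range(n):
        arrival += rng.expovariate(lam)            # Poisson arrivals
        start = max(arrival, server_free)          # queue if server is busy
        server_free = start + rng.expovariate(mu)  # exponential service time
        total += server_free - arrival             # waiting + service time
    return total / n

# Analytic M/M/1 mean time-in-system is 1 / (mu - lam); the estimate should agree.
estimate = mm1_mean_latency(lam=5.0, mu=10.0, n=200_000)
```

<p>Swap the exponential service time for an empirical latency distribution, or add retries and failures, and the same handful of lines becomes a cheap way to explore tail behavior before building anything.</p><p>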
It&#8217;s an interesting read: <a href="https://www.datadoghq.com/blog/engineering/formal-modeling-and-simulation/">How we use formal modeling, lightweight simulations, and chaos testing to design reliable distributed systems</a></p><p>For my part, I finally wrote about my own use of modeling and simulation over my career in distributed data systems in two blog posts. </p><ul><li><p><a href="https://jack-vanlightly.com/blog/2024/11/19/obtaining-statistical-properties-through-modeling-and-simulation">Obtaining statistical properties through modeling and simulation</a></p></li><li><p><a href="https://jack-vanlightly.com/blog/2024/11/21/the-law-of-large-numbers-a-foundation-for-statistical-modeling-in-distributed-systems">The Law of Large Numbers: A Foundation for Statistical Modeling in Distributed Systems</a></p></li></ul><p>I would also be remiss if I didn&#8217;t include blog posts by Marc Brooker (distinguished engineer at AWS). Marc uses simulation regularly in blog posts to explore algorithms, and this blog post directly advocates for use of simulation: <a href="https://brooker.co.za/blog/2022/04/11/simulation.html">Simple Simulations for System Builders</a>.</p><p>If anyone knows of other accessible material on real-world usage of simulation in engineering, then please let me know.</p><h2>Interesting topic #2: Incremental View Maintenance (IVM)</h2><p>A database view is basically a stored query that can be queried by name (typically read-only). A materialized view physically stores the result of a view, and is updated via the process of Incremental View Maintenance (IVM), and not directly by user queries. IVM is the process of updating materialized views efficiently by applying only the changes (inserts, updates, or deletes) made to the underlying data, rather than recomputing the entire view from scratch. 
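</p><p>To make the delta idea concrete, here is a toy sketch (mine, not taken from any of the systems mentioned below) that maintains the materialized result of <code>SELECT k, COUNT(*), SUM(v) FROM t GROUP BY k</code> by folding change-stream events into the view in O(1) each, instead of rescanning the table:</p>

```python
# view: key -> (row_count, running_sum), i.e. the materialized GROUP BY result
view: dict[str, tuple[int, int]] = {}

def apply_change(op: str, key: str, value: int) -> None:
    """Fold one change event (insert or delete of a row) into the view."""
    sign = 1 if op == "insert" else -1
    cnt, total = view.get(key, (0, 0))
    cnt, total = cnt + sign, total + sign * value
    if cnt == 0:
        view.pop(key, None)  # the group no longer has any rows
    else:
        view[key] = (cnt, total)

for event in [("insert", "a", 5), ("insert", "a", 3),
              ("insert", "b", 7), ("delete", "a", 5)]:
    apply_change(*event)
# view is now {"a": (1, 3), "b": (1, 7)}; no base-table rescan happened
```

<p>Counts and sums compose trivially like this; operators whose deltas are not simple per-key arithmetic are where the real engineering lives.</p><p>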
IVM poses a number of challenges such as efficiently (and consistently) handling complex queries with joins, aggregates, nested queries, correlated subqueries, and so on. </p><p>Materialized views are typically used to reduce the cost of reads that need to be run frequently (such as dashboards, user-facing analytics, etc.). We trade off some write overhead for read performance. The benefit to the reads should, on balance, make up for the additional write costs, just as is the case with database indexes.</p><p>With that preamble out of the way, I have a number of articles and papers that cover the topic of IVM:</p><ul><li><p>Differential dataflow (<a href="https://materialize.com/">Materialize</a> being the prominent vendor built on this approach)</p><ul><li><p><a href="https://www.cidrdb.org/cidr2013/Papers/CIDR13_Paper111.pdf">The paper</a></p></li><li><p><a href="https://muratbuffalo.blogspot.com/2017/11/on-dataflow-systems-naiad-and-tensorflow.html">Murat Demirbas&#8217; blog post that covers, in part, differential dataflow</a></p></li></ul></li><li><p>DBSP (<a href="https://www.feldera.com/">Feldera</a> being the prominent vendor built on this approach)</p><ul><li><p>The paper, <a href="https://muratbuffalo.blogspot.com/2024/11/dbsp-automatic-incremental-view.html">DBSP: Automatic Incremental View Maintenance for Rich Query Languages</a></p></li><li><p><a href="https://muratbuffalo.blogspot.com/2024/11/dbsp-automatic-incremental-view.html">Murat Demirbas&#8217; blog post on DBSP</a></p></li><li><p><a href="https://www.youtube.com/watch?v=J4uqlG1mtbU">YouTube video on DBSP</a></p></li></ul></li><li><p>pg_ivm (a Postgres extension for IVM)</p><ul><li><p><a href="https://github.com/sraoss/pg_ivm">The GitHub repo</a>.</p></li><li><p><a href="https://www.youtube.com/watch?v=xPpFEN0V2rY">Yugo Nagata - pg_ivm - Extensions for Rapid Materialized View Update (PGConf.EU 2024)</a>
[video]</p></li></ul></li><li><p>Snowflake </p><ul><li><p>Snowflake&#8217;s paper, <a href="https://dl.acm.org/doi/pdf/10.1145/3589776">What&#8217;s the Difference? Incremental Processing with Change Queries in Snowflake</a>, doesn&#8217;t actually discuss IVM itself, but does explore the topic of materializing changes to a table as a stream. I have also performed an analysis of <a href="https://jack-vanlightly.com/blog/2024/9/19/table-format-comparisons-change-queries-and-cdc">change streams in the open table formats</a>.</p></li></ul></li></ul>]]></content:encoded></item><item><title><![CDATA[Humans of the Data Sphere Issue #3 November 12th 2024]]></title><description><![CDATA[Your biweekly dose of insights, observations, commentary and opinions from interesting people from the world of databases, AI, streaming, distributed systems and the data engineering/analytics space.]]></description><link>https://www.hotds.dev/p/humans-of-the-data-sphere-issue-3</link><guid isPermaLink="false">https://www.hotds.dev/p/humans-of-the-data-sphere-issue-3</guid><dc:creator><![CDATA[Jack Vanlightly]]></dc:creator><pubDate>Tue, 12 Nov 2024 08:38:01 GMT</pubDate><enclosure
url="https://substack-post-media.s3.amazonaws.com/public/images/5bd09efc-35bb-447f-8191-225ff44ec366_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Welcome to Humans of the Data Sphere issue #3! Another whirlwind tour of what humans are saying across the data sphere, plus a quick-dive and survey of the interesting topic of compute-compute separation.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!zJzK!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29c03d38-7a7d-4474-9b82-c27e9ab8a248_1000x998.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!zJzK!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29c03d38-7a7d-4474-9b82-c27e9ab8a248_1000x998.jpeg 424w, https://substackcdn.com/image/fetch/$s_!zJzK!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29c03d38-7a7d-4474-9b82-c27e9ab8a248_1000x998.jpeg 848w, https://substackcdn.com/image/fetch/$s_!zJzK!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29c03d38-7a7d-4474-9b82-c27e9ab8a248_1000x998.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!zJzK!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29c03d38-7a7d-4474-9b82-c27e9ab8a248_1000x998.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!zJzK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29c03d38-7a7d-4474-9b82-c27e9ab8a248_1000x998.jpeg" width="362" 
height="361.276" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/29c03d38-7a7d-4474-9b82-c27e9ab8a248_1000x998.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:998,&quot;width&quot;:1000,&quot;resizeWidth&quot;:362,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!zJzK!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29c03d38-7a7d-4474-9b82-c27e9ab8a248_1000x998.jpeg 424w, https://substackcdn.com/image/fetch/$s_!zJzK!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29c03d38-7a7d-4474-9b82-c27e9ab8a248_1000x998.jpeg 848w, https://substackcdn.com/image/fetch/$s_!zJzK!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29c03d38-7a7d-4474-9b82-c27e9ab8a248_1000x998.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!zJzK!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F29c03d38-7a7d-4474-9b82-c27e9ab8a248_1000x998.jpeg 1456w" sizes="100vw" fetchpriority="high"></picture></div></a><figcaption class="image-caption"><a href="https://bsky.app/profile/jkxo.bsky.social/post/3lag6qu5x222a">https://bsky.app/profile/jkxo.bsky.social/post/3lag6qu5x222a</a></figcaption></figure></div><h2>Quotable Humans</h2><ul><li><p><a href="https://bsky.app/profile/jepsen.mastodon.jepsen.io.ap.brid.gy/post/3laelyijgxtp2">Jepsen</a>: Last year we reported that MySQL and MariaDB's "Repeatable Read" was badly broken. The MariaDB team has been hard at work, and they've added a new flag, `--innodb-snapshot-isolation=true`, which causes "Repeatable Read" to prevent Lost Update, Non-repeatable Read, and violations of Monotonic Atomic View. It looks like MariaDB might, with this setting, offer Snapshot Isolation at "RR". :-)</p></li><li><p><a href="https://bsky.app/profile/antonmry.bsky.social/post/3l7qq4b3fdf2f">Anton</a>: There's a clear trend to migrate from Avro to Protobuf for streaming pipelines. This presentation by Uber was a great example.
They showed several improvements they are contributing to Flink to make it possible: <a href="https://www.flink-forward.org/berlin-2024/agenda#streamlining-real-time-data-processing-at-uber-with-protobuf-integration">https://www.flink-forward.org/berlin-2024/agenda#streamlining-real-time-data-processing-at-uber-with-protobuf-integration</a> </p><ul><li><p>I think the main benefit is to use the same format everywhere. That was the reason for Uber, and it would be the same for New Relic (used in OpenTelemetry) if we fully migrate at some point.</p></li></ul></li><li><p><a href="https://materializedview.io/p/duckdb-is-not-a-data-warehouse">Chris Riccomini</a>: Even if a company wanted to use DuckDB as their data warehouse, they couldn&#8217;t. DuckDB can&#8217;t handle the largest queries an enterprise might wish to run. MotherDuck <a href="https://motherduck.com/blog/redshift-files-hunt-for-big-data/">has rightly pointed out</a> that <a href="https://motherduck.com/blog/big-data-is-dead/">most queries are small</a>. What they don&#8217;t say is that the most valuable queries in an organization <em>are</em> large: financial reconciliation, recommendation systems, advertising, and others. These are the revenue drivers. They might comprise a minority of all the queries an organization runs, but they make the money.</p></li><li><p><a href="https://motherduck.com/blog/pg-mooncake-columnstore/">Pranav Aurora</a>: pg_mooncake builds on this by introducing a native columnstore table to Postgres&#8211;supporting inserts, updates, joins, and soon, indexes. These tables are written as Iceberg or Delta tables (parquet files + metadata) in your object store. 
It leverages pg_duckdb for its vectorized execution engine and deploys it as a part of the extension.</p></li><li><p><a href="https://bsky.app/profile/jakthom.bsky.social/post/3la7l2vzx6f2t">Jake Thomas</a> (in the context of rising use of object storage): Once upon a time we pulled data out of prod databases and put it in larger/olap/dedicated stores/DW's for pretty much one reason: running analytical queries was too slow/impactful to prod infra. That reasoning is becoming less valid, which makes one rethink if the complexity/cost is worth it.</p></li><li><p>Jacob Matson: which means that basically every bit of compute &amp; storage could be intermediated by a catalog... in theory it also means that your catalog could shop queries coming from a generic endpoint to the cheapest engine for that specific query.</p></li><li><p><a href="https://bsky.app/profile/cpard.bsky.social/post/3la6dtcbjpr27">Kostas Pardalis</a>: There are two types of consumers of a catalog. Humans and machines, and they are very different.
There&#8217;s a lot of engineering going on in scaling catalogs for systems like Snowflake and a folder plus some lineage does not suffice.</p></li><li><p><a href="https://cloud.google.com/resources/devops/state-of-devops">DORA Report 2024</a>: </p><ul><li><p>Unstable organizational priorities lead to meaningful decreases in productivity and substantial increases in burnout, even when organizations have strong leaders, good internal documents, and a user-centric approach to software development.</p></li><li><p>Optimistically, and consistent with our findings that AI has positively impacted development professionals&#8217; performance, respondents reported that they expect the quality of their products to continue to improve as a result of AI over the next one, five, and 10 years. However, respondents also reported expectations that AI will have net-negative impacts on their careers, the environment, and society as a whole, and that these negative impacts will be fully realized in about five years&#8217; time.</p></li><li><p>Developers are more productive, less prone to experiencing burnout, and more likely to build high quality products when they build software with a user-centered mindset. &#8230; This approach gives developers confidence that the features they are working on have a reason for being. Suddenly, their work has meaning: to ensure people have a superb experience when using their products and services. There&#8217;s no longer a disconnect between the software that&#8217;s developed and the world in which it lives.</p></li></ul></li><li><p><a href="https://bsky.app/profile/jeremymorrell.bsky.social/post/3l7lprpvlty2c">Jeremy Morrell</a> (on wide events/o11y 2.0): Nope, these are all old ideas, but I&#8217;d argue that this is very unevenly understood / adopted across industry.
Responses to my post generally fall into &#8220;I&#8217;ve never heard of this kind of approach before&#8221; or &#8220;we&#8217;ve been doing this for a decade+&#8221; with very little in between</p></li><li><p><a href="https://bsky.app/profile/rbn.bsky.social/post/3l7z4eh6rof2x">Reuben Bond</a>: Raft was optimized for understandability over simplicity &#129335; Chris Jensen &amp; @heidihoward.bsky.social wrote about LogPaxos, which they say is simpler than Raft or MultiPaxos: https://decentralizedthoughts.github.io/2021-09-30-distributed-consensus-made-simple-for-real-this-time/. CASPaxos is simpler but likely less broadly applicable than log-based consensus</p></li><li><p><a href="https://bsky.app/profile/petercorless.bsky.social/post/3lagyu3gccl2n">Peter Corless</a>: There's still a lot of work to put into all of this. But the worlds of Observability and real-time analytics are on a collision course.</p></li><li><p><a href="https://emptysqua.re/blog/mongodb-predictive-scaling-experiment/">A. Jesse Jiryu Davis and Matthieu Humeau</a> (on predictive scaling): </p><ul><li><p>This means servers can be overloaded or underloaded for long periods! An underloaded server is a waste of money. An overloaded server is bad for performance, and if it&#8217;s really slammed it could interfere with the scaling operation itself.</p></li><li><p>The Forecasters and Estimator cooperate to predict each replica set&#8217;s future CPU on any instance size available. E.g., they might predict that 20 minutes in the future, some replica set will use 90% CPU if it&#8217;s on M40 servers, and 60% CPU if it&#8217;s on more powerful M50 servers.</p></li><li><p>Our goal is to forecast a customer&#8217;s CPU utilization, but we can&#8217;t just train a model based on recent fluctuations of CPU, because that would create a circular dependency: if we predict a CPU spike and scale accordingly, we eliminate the spike, invalidating the forecast. 
Instead we forecast metrics unaffected by scaling, which we call &#8220;customer-driven metrics&#8221;, e.g. queries per second, number of client connections, and database sizes. We assume these are <strong>independent</strong> of instance size or scaling actions. (Sometimes this is false; a saturated server exerts backpressure on the customer&#8217;s queries. But customer-driven metrics are normally exogenous.)</p></li></ul></li><li><p><a href="https://ludic.mataroa.blog/blog/get-me-out-of-data-hell/">Ludicity</a>: The word enterprise means that we do this in a way that makes people say "Dear God, why would anyone ever design it that way?", "But that doesn't even help with security" and "Everyone involved should be fired for the sake of all that is holy and pure."</p></li><li><p><a href="https://www.decodable.co/blog/revisiting-the-outbox-pattern">Gunnar Morling</a> (on the outbox pattern):</p><ul><li><p>If you want to achieve consistency in a distributed system, such as an ensemble of cooperating microservices, there is going to be a cost. This goes for the outbox pattern, as well as for the potential alternatives discussed in the next section. As such, there are valid criticisms of the outbox pattern, but in the end it&#8217;s all about trade-offs: does the outbox put an additional load onto your database? Yes, it does (though it usually is insignificant, in particular when using a log-based outbox relay implementation). Does it increase complexity? Potentially. But this will be a price well worth paying most of the time, in order to achieve consistency amongst distributed services.</p></li><li><p>While alternatives do exist, they each come with their own specific trade-offs around a number of aspects such as consistency, availability, queryability, developer experience, operational complexity, and more. The outbox pattern puts a strong focus on consistency and reliability (i.e.
eventual consistency across services is ensured also in case of failures), availability (a writing service only needs a single resource, its own database) and letting developers benefit from all the capabilities of their well-known datastore (instant read-your-own-writes, queries, etc.).</p></li></ul></li><li><p><a href="https://leehanchung.github.io/blogs/2024/10/08/reasoning-understanding-o1/">Han Lee</a>: In Thinking, Fast and Slow, Daniel Kahneman defined System 1 as the automatic, intuitive mode of thinking, and System 2 as the slower, more analytical mode. In the context of autoregressive language models, the usual inference process is akin to System 1 &#8212; models generate answers directly. Reasoning is System 2 thinking - models or systems takes time to deliberate to solve more complex problems.</p></li><li><p><a href="https://bsky.app/profile/spite.vc/post/3la56yjiqfc26">Josh Willis</a>: The fact that DuckDB isn't a data warehouse...is the whole point of DuckDB! Once you pull your head out of the snow or the bricks or wherever you spend most of your data engineering time, you will discover that there are data pipeline problems *everywhere* that benefit from data modeling and SQL!</p></li><li><p><a href="https://bsky.app/profile/spite.vc/post/3la5bo4y2rk25">Josh Willis</a>: too often, folks say "data quality problem" when they mean "we don't have a good way to collaborate with our upstream dependencies" or more directly "we hate the frontend and backend engineering teams, they don't care about data"</p></li><li><p><a href="https://bsky.app/profile/b0rk.jvns.ca/post/3la4ulel3es2v">Julia Evans</a>: I feel like half of programming is remembering how weird stuff works and the other half is setting things up so that you do not have to remember the weird stuff</p></li><li><p><a href="https://bsky.app/profile/lutzh.bsky.social/post/3l7s6xsqbb22x">&#8234;Lutz H&#252;hnken</a>: Did you ever realize that when data engineers look at events, they see them differently 
than Event-Driven Architecture / Domain-Driven Design folks?</p></li><li><p><a href="https://bsky.app/profile/jakthom.bsky.social/post/3l7t6aqrrsj2u">Jake Thomas</a>: all these people be talking about ai, zero disk architectures, and open table formats while I'm giddily stuffing DuckDB into Prometheus exporters &#129335;&#8205;&#9794;&#65039;&#129335;&#8205;&#9794;&#65039;&#129335;&#8205;&#9794;&#65039; https://github.com/jakthom/hercules</p></li><li><p><a href="https://bsky.app/profile/ryannedolan.bsky.social/post/3l7iozcfgf72y">Ryanne Dolan</a>: BigTech is moving to object storage too, but not cuz it's cheaper. The idea is you no longer need every system to be distributed and durable. Everything can be stateless and simple, cuz your storage is distributed and durable.</p></li><li><p><a href="https://x.com/kenneth0stanley/status/1854300636327952598">Kenneth Stanley</a>: Two things that both evolution and neural networks share in common is they both thrive on scale, and they were both once dismissed as obsolete to AI. The deep evolution moment is awaiting. </p></li><li><p><a href="https://x.com/fchollet/status/1854246566753890733">Fran&#231;ois Chollet</a>: Similarly, there are several research directions today seen as long-abandoned failures, that are only waiting for the right amount of attention and the right level of compute scale to shine. Genetic algorithms are one of them.</p></li><li><p><a href="https://x.com/emollick/status/1852404690857632090">Ethan Mollick</a>: I keep hearing from executives that they expect that a new generation of "AI natives" will show them how to use AI. I think this is a mistake: 1) Our research shows younger people do not really get AI or how to integrate into work 2) Experienced managers are often good prompters</p></li><li><p><a href="https://x.com/emollick/status/1854536506813173948">Ethan Mollick</a>: We are just not used to abundant "intelligence" (of a sort), which leads people to miss a huge value of AI. 
Don't ask for an idea, ask for 30. Don't ask for a suggestion on how to end a sentence, ask for 20 in different styles. Don't ask for advice, ask for many strategies. Pick</p></li><li><p><a href="https://x.com/sh_reya/status/1852385012328673367">Shreya Shankar</a>: Whenever I talk to ML/AI researchers, many are surprised that complex and long document processing with LLMs is so hard. I find it strange how little attention (comparatively) is given to this, given that data processing probably has the highest market cap of all applications</p></li><li><p><a href="https://www.linkedin.com/posts/shah-soumil_ive-been-running-some-tests-with-snowflakes-activity-7257014317663817728-wHem?utm_source=share&amp;utm_medium=member_desktop">Soumil S</a>: In my experiment, I set up two types of tables&#8212;managed Iceberg tables registered via Polaris within Snowflake, and unmanaged Iceberg tables also registered via Polaris. I attempted to join these tables in both Snowflake and Spark, with mixed results: while Snowflake handled the join seamlessly, Spark wasn&#8217;t able to join the internal and external catalogs. 
To me, this feels like a potential limitation, as Spark doesn&#8217;t support joining across internal and external catalogs registered through Polaris.</p><ul><li><p><a href="https://www.linkedin.com/feed/update/urn:li:activity:7257014317663817728?commentUrn=urn%3Ali%3Acomment%3A%28activity%3A7257014317663817728%2C7258011364412022784%29&amp;replyUrn=urn%3Ali%3Acomment%3A%28activity%3A7257014317663817728%2C7258069743511728128%29&amp;dashCommentUrn=urn%3Ali%3Afsd_comment%3A%287258011364412022784%2Curn%3Ali%3Aactivity%3A7257014317663817728%29&amp;dashReplyUrn=urn%3Ali%3Afsd_comment%3A%287258069743511728128%2Curn%3Ali%3Aactivity%3A7257014317663817728%29">Vikram Singh Chandel</a>: hmm interesting, so strategically we should not only work on Iceberg as a unified format but a unified catalog that does not limit vendor-specific features</p></li><li><p><a href="https://www.linkedin.com/feed/update/urn:li:activity:7257014317663817728?commentUrn=urn%3Ali%3Acomment%3A%28activity%3A7257014317663817728%2C7257030252915224577%29&amp;dashCommentUrn=urn%3Ali%3Afsd_comment%3A%287257030252915224577%2Curn%3Ali%3Aactivity%3A7257014317663817728%29">Roy Hasson</a>: That's expected behavior. If you make both tables external they can be joined by external engines. If you make them internal/managed then really only snowflake can join them. It's unfortunate, but that's the behavior and why internal/managed tables are a bit of a lock-in.</p></li></ul></li><li><p><a href="https://blog.lancedb.com/columnar-file-readers-in-depth-backpressure/">Weston Pace</a>: Whenever we talk about backpressure we also often talk about push vs. pull. In a <strong>push</strong>-based system a producer creates and launches a task as soon as data is available. This makes parallelism automatic but backpressure is tricky. In a <strong>pull</strong>-based system the consumer creates a task to pull a batch of data and process it. This makes backpressure automatic but parallelism is tricky. 
The reality is that every multi-threaded scanner / query engine is doing both push and pull <em>somewhere</em> if you look hard enough.</p></li><li><p><a href="https://bsky.app/profile/imightbemary.bsky.social/post/3laacus6u5b2l">Katie Bauer</a>: It's 2024. </p><p>LinkedIn overflows with thought leaders proclaiming AI will get better and cheaper, if we just wait 6 months.</p><p>Data quality is still a mess, despite insistence that AI will finally justify investments.</p><p>Sparkles are stamped on UIs everywhere, ruining a perfectly good emoji.</p></li><li><p><a href="https://sympathetic.ink/2024/11/07/The-Advent-Of-The-Open-Data-Lake.html">Julien Le Dem</a>: A query engine reduces the cost of scanning columnar data in a few ways:</p><ul><li><p>Projection push down: By reading only the columns it needs.</p></li><li><p>Predicate push down: * By skipping the rows that it doesn&#8217;t need to look at. This typically leverages embedded statistics. * By better skipping ahead while decoding values. Simply by leveraging understanding the underlying encodings. Skipping in Run Length Encoding or fixed width encodings is really cheap.</p></li><li><p>Vectorization: By using vectorized conversion from the on-disk Parquet columnar representation to the in-memory Arrow representation.</p></li></ul></li><li><p><a href="https://x.com/wightmanr/status/1714683949409247558">Ross Wightman</a>: Utilization is a poor metric by itself. You can easily hit 100% where the GPU is doing a lot of waiting. Power consumption is a better (but not perfect) measure. If you're burning watts it's usually doing something useful. High util, no watts is not good.</p></li><li><p><a href="https://transactional.blog/how-to-learn/disk-io">Alex Miller</a>: Regardless of using buffered or unbuffered IO, it&#8217;s wise to be mindful of the extra cost of appending to a file versus overwriting a pre-allocated block within a file. 
Appending to a file changes the file&#8217;s size, which thus incurs additional filesystem metadata operations. Instead, consider using fallocate(2) to extend the file in larger chunks.</p></li><li><p><a href="https://technicalwriting.dev/data/embeddings.html">Kayce Basques</a>: Here&#8217;s the mental leap. <em>Embeddings are similar to points on a map</em>. Each number in the embedding array is a <em>dimension</em>, similar to the X-Coordinates and Y-Coordinates from earlier. When an embedding model sends you back an array of 1000 numbers, it&#8217;s telling you the point where that text <em>semantically</em> lives in its 1000-dimension space, relative to all other texts. When we compare the distance between two embeddings in this 1000-dimension space, what we&#8217;re really doing is <strong>figuring out how semantically close or far apart those two texts are from each other</strong>.</p></li><li><p><a href="https://www.ciodive.com/news/gartner-symposium-keynote-AI/730486/">Hung LeHong</a>: The truth is that you&#8217;ve been in the mud for the last year, working hard to find all those benefits that were promised by AI; it&#8217;s not been easy.</p></li><li><p><a href="https://leehanchung.github.io/blogs/2024/09/20/ai-winter/">Han Lee</a>: This principle extends beyond corporate life and into the world of AI. Personally, I address the first group as <strong>producers</strong> and the second group as <strong>promoters</strong>. And in the current AI ecosystem, we&#8217;re seeing far more <strong>promoters</strong> than <strong>producers</strong> &#8212; sometimes promoters disguised as producers. This phenomenon starts at the source: academia.</p></li><li><p><a href="https://www.superdatascience.com/podcast/833">Martin Goodson</a> (Super Data Science 833, 23 min): I do think that a lot of ML research does come down to understanding data, understanding statistics, understanding the way that data can screw you over and confuse you. 
I've been confused so much by data, it's stayed with me to my core.</p><p>...</p><p>There's a lot of stuff that I learned in statistics that is useless now and is a waste of time. It's quite a traditional field and it hasn't moved very quickly. People are still being taught stuff that isn't really that useful; I think there should be much more emphasis on computational methods. There is a lot of stuff you can safely ignore in statistics, but you do need the approach, the attitude and skepticism about data, and to really understand bias.</p></li><li><p><a href="https://www.superdatascience.com/podcast/833">Martin Goodson</a> (SDS 833, on over-hyped academia and AI, 1h12min): You could become a self-proclaimed expert quite easily by reading stuff on arXiv &#8212; you can read loads of papers and get to become a self-proclaimed expert quite easily. There are many problems; I'll highlight one: the academics that are publishing this stuff are over-claiming, the titles of the papers are just wild, they don't have the evidence to claim what they're claiming. We do have people that come to the meetup and they give a talk, and they just make up stuff, the claims are just very over-hyped. Once you put them under scrutiny, it just falls apart. 
They don't have the evidence.</p></li></ul><h2>Helpful Humans</h2><ul><li><p><a href="https://avi.im/blag/2024/disaggregated-storage/">Avi explains what disaggregated storage is at a high level</a>.</p></li><li><p><a href="https://www.crunchydata.com/blog/4-ways-to-create-date-bins-in-postgres-interval-date_trunc-extract-and-to_char">Christopher Winslett provides examples of different date binning methods with Postgres</a>.</p></li><li><p><a href="https://www.decodable.co/blog/schema-evolution-in-change-data-capture-pipelines">Hans-Peter Grahsl writes about schema evolution in real-time data pipelines</a>.</p></li><li><p><a href="https://hbsp.harvard.edu/inspiring-minds/an-ai-prompting-template-for-teaching-tasks">Ethan Mollick and Lilach Mollick write about using AI to build sophisticated prompts for repeated tasks</a>.</p></li><li><p><a href="https://transactional.blog/how-to-learn/disk-io">Alex Miller goes through all the disk IO operations and their relationships with memory, page cache, disk cache and disk storage</a>.</p></li><li><p><a href="https://dataengineeringcentral.substack.com/p/duckdb-inside-postgres">Daniel Beach kicks the tires on DuckDB in Postgres and compares performance on a 50M record dataset</a>.</p></li></ul><h2>Interesting topic - compute-compute separation</h2><p>A week ago, Yaroslav Tkachenko expressed his <a href="https://www.linkedin.com/feed/update/urn:li:activity:7259619437132054528/">enthusiasm</a> for ClickHouse Cloud&#8217;s new compute-compute separation feature. This <a href="https://www.linkedin.com/posts/jack-vanlightly-1153b44a_clickhouse-cloud-live-update-november-2024-activity-7259825222466355201-L5_q/">led</a> to us discovering some of the other systems that had introduced this architecture in the past. 
Let&#8217;s look now at what compute-compute separation is, and some real-world examples!</p><p>By now we&#8217;ve all heard about the benefits of compute-storage separation. The idea is that large distributed data system architectures become both simpler and more elastic. The compute layer is a stateless layer that can be scaled elastically, moved around, killed and restarted with relative abandon. The storage layer remains a tough distributed systems problem, but these days, we let the hyperscaler handle that part for us in the form of object storage.</p><p>Two quotes are relevant here:</p><ul><li><p>One from Ryanne Dolan in this issue put it nicely: &#8220;<em>The idea is you no longer need every system to be distributed and durable. Everything can be stateless and simple, cuz your storage is distributed and durable.</em>&#8221;</p></li><li><p><a href="https://jack-vanlightly.com/blog/2024/3/19/tableflow-the-stream-table-kafka-iceberg-duality">One from myself</a>, I know, shameless self-plug: &#8220;<em>In data system design, cost is a kind of unstoppable force that rewrites architectures every decade or so, and that is happening right now. 
AWS, GCP and Azure have been consistent in driving down the costs and driving up the reliability of their object storage services over time. S3 et al are the only game in town and the cheapest form of storage; hence any system has to build on it and cannot really be cheaper or more reliable. You can't compete with S3 because AWS doesn't provide the primitives in the cloud to rebuild it. Your only other choice is to get some space in a co-lo and start racking your own hard drives and solid-state drives. S3 may or may not be the ideal storage interface for a diverse set of data systems from a design point of view, but in terms of economics, it is unbeatable and inevitable.</em>&#8221;</p></li></ul><p>Compute-compute separation is a logical consequence, or next step if you will, of compute-storage separation. Therefore, we can think of compute-compute separation as a shorthand for compute-compute-storage separation.</p><p>The basic idea of compute-compute separation is that we can isolate different workloads that operate over the same data, in different stateless compute pools. 
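</p><p>To make the idea concrete, here is a toy sketch in Python (all names and numbers are invented for illustration, and thread pools stand in for the separate clusters or node groups a real platform would use): two workloads operate over the same shared &#8220;storage&#8221;, but each runs in its own isolated compute pool.</p>

```python
from concurrent.futures import ThreadPoolExecutor

# Shared "storage layer": one copy of the data, visible to every pool.
# (A stand-in for object storage; illustrative only, not any vendor's API.)
ORDERS = [{"id": i, "amount": i * 10} for i in range(100)]

# Two isolated "compute pools" over the same data: one sized for
# latency-sensitive point reads, one sized for heavy scans.
serving_pool = ThreadPoolExecutor(max_workers=2)
analytics_pool = ThreadPoolExecutor(max_workers=8)

def lookup(order_id):
    # Point read: runs in the serving pool.
    return next(o for o in ORDERS if o["id"] == order_id)

def total_revenue():
    # Full scan: runs in the analytics pool, without taking
    # threads away from the serving pool.
    return sum(o["amount"] for o in ORDERS)

point = serving_pool.submit(lookup, 7)
scan = analytics_pool.submit(total_revenue)
print(point.result()["amount"])  # 70
print(scan.result())             # 49500
```

<p>The scan can saturate its own pool without stealing capacity from the point reads; the storage is the only shared resource. 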
As long as the underlying storage can handle the aggregate load, we get improved performance isolation, as the compute side of each workload runs on different compute resources.</p><div class="captioned-image-container"><figure><a class="image-link image2" target="_blank" href="https://substackcdn.com/image/fetch/$s_!3qYg!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3d4ad88-6719-455f-8921-4af1778225f3_854x450.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!3qYg!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3d4ad88-6719-455f-8921-4af1778225f3_854x450.png 424w, https://substackcdn.com/image/fetch/$s_!3qYg!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3d4ad88-6719-455f-8921-4af1778225f3_854x450.png 848w, https://substackcdn.com/image/fetch/$s_!3qYg!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3d4ad88-6719-455f-8921-4af1778225f3_854x450.png 1272w, https://substackcdn.com/image/fetch/$s_!3qYg!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3d4ad88-6719-455f-8921-4af1778225f3_854x450.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!3qYg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3d4ad88-6719-455f-8921-4af1778225f3_854x450.png" width="442" height="232.903981264637" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/b3d4ad88-6719-455f-8921-4af1778225f3_854x450.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:450,&quot;width&quot;:854,&quot;resizeWidth&quot;:442,&quot;bytes&quot;:26833,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!3qYg!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3d4ad88-6719-455f-8921-4af1778225f3_854x450.png 424w, https://substackcdn.com/image/fetch/$s_!3qYg!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3d4ad88-6719-455f-8921-4af1778225f3_854x450.png 848w, https://substackcdn.com/image/fetch/$s_!3qYg!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3d4ad88-6719-455f-8921-4af1778225f3_854x450.png 1272w, https://substackcdn.com/image/fetch/$s_!3qYg!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fb3d4ad88-6719-455f-8921-4af1778225f3_854x450.png 1456w" sizes="100vw" loading="lazy"></picture><div></div></div></a><figcaption class="image-caption">Compute-compute (and storage) separation)</figcaption></figure></div><p>Let&#8217;s use the following definition: <em>compute-compute(-storage) separation is the isolation of different workloads, that operate over the same data, in different stateless compute pools.</em></p><p>So who is doing this? 
I have five examples for you (though doubtless there are many more), in order of publication:</p><ol><li><p><strong>Snowflake</strong> were the first to implement compute-compute separation (that I know of) with the design of their Virtual Warehouses (VW) that operate over object storage. To my knowledge, Snowflake was also the first massive-scale data system built on object storage. Snowflake really has been a pioneer and it&#8217;s amazing how long it&#8217;s taken the rest of the industry to catch up. Read the Snowflake paper: <a href="https://event.cwi.nl/lsde/papers/p215-dageville-snowflake.pdf">The Snowflake Elastic Data Warehouse</a> (published 2016). Some choice quotes:</p><ol><li><p>VWs are pure compute resources. They can be created, destroyed, or resized at any point, on demand. Creating or destroying a VW has no effect on the state of the database. It is perfectly legal (and encouraged) that users shut down all their VWs when they have no queries. This elasticity allows users to dynamically match their compute resources to usage demands, independent of the data volume.</p></li><li><p>Each user may have multiple VWs running at any given time, and each VW in turn may be running multiple concurrent queries. Every VW has access to the same shared tables, without the need to physically copy data.</p></li></ol></li><li><p><strong>CockroachDB Serverless</strong> (2022) doesn&#8217;t use the term compute-compute separation but in fact that is exactly what it is &#8212; just applied to a serverless multitenant system. Unlike Snowflake, it does not use object storage for its storage layer, but a Raft-based replicated storage system that uses Pebble (a KV store similar to RocksDB). CockroachDB is a distributed OLTP SQL database that stores data on disks rather than an object store, for latency reasons. 
However, the storage layer is a shared multitenant storage layer; the difference between this and object storage is that it is a component of CockroachDB Serverless, not of the hyperscaler. The compute layer is also multitenant, with each tenant operating in different Kubernetes pods with controlled resource limits. This is the compute-compute separation part: each tenant gets a slice of compute and is prevented from consuming the compute resources of other tenants. Each per-tenant compute pool operates over a shared storage layer. </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!H2rc!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53ef9f63-b150-4a0a-b44b-3b39f4b5ae9c_3077x1602.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!H2rc!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53ef9f63-b150-4a0a-b44b-3b39f4b5ae9c_3077x1602.jpeg 424w, https://substackcdn.com/image/fetch/$s_!H2rc!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53ef9f63-b150-4a0a-b44b-3b39f4b5ae9c_3077x1602.jpeg 848w, https://substackcdn.com/image/fetch/$s_!H2rc!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53ef9f63-b150-4a0a-b44b-3b39f4b5ae9c_3077x1602.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!H2rc!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53ef9f63-b150-4a0a-b44b-3b39f4b5ae9c_3077x1602.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!H2rc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53ef9f63-b150-4a0a-b44b-3b39f4b5ae9c_3077x1602.jpeg" width="1456" height="758" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/53ef9f63-b150-4a0a-b44b-3b39f4b5ae9c_3077x1602.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:758,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:226133,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!H2rc!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53ef9f63-b150-4a0a-b44b-3b39f4b5ae9c_3077x1602.jpeg 424w, https://substackcdn.com/image/fetch/$s_!H2rc!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53ef9f63-b150-4a0a-b44b-3b39f4b5ae9c_3077x1602.jpeg 848w, https://substackcdn.com/image/fetch/$s_!H2rc!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53ef9f63-b150-4a0a-b44b-3b39f4b5ae9c_3077x1602.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!H2rc!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F53ef9f63-b150-4a0a-b44b-3b39f4b5ae9c_3077x1602.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><a href="https://www.cockroachlabs.com/blog/how-we-built-cockroachdb-serverless/">How we built a serverless SQL database</a></figcaption></figure></div><ol><li><p>I wrote an extensive <a href="https://jack-vanlightly.com/analyses/2023/11/21/serverless-cockroachdb-asds-chapter-4-part-1">write-up on CockroachDB Serverless</a> in 2023 as part of <a href="https://jack-vanlightly.com/blog/2023/11/14/the-architecture-of-serverless-data-systems">The Architecture of Serverless Data Systems</a>.</p></li><li><p>CockroachDB also have a <a href="https://assets.ctfassets.net/00voh0j35590/6FswAaw2GUVwnRWNHWmvHM/416089a7edf4a46c281d837a9042f33d/cockroach-labs-architecture-of-a-serverless-database.pdf">PDF whitepaper</a> and a <a 
href="https://www.cockroachlabs.com/blog/how-we-built-cockroachdb-serverless/">blog post</a>.</p></li><li><p>To my knowledge CRDB Serverless doesn&#8217;t provide separate pools for the same tenant, that is, multiple pools over the same data (only the same storage). According to my own definition, CRDB Serverless doesn&#8217;t quite qualify as compute-compute separation, but given that the storage layer is shared by all tenants, I think it&#8217;s close enough. It is also an interesting case because it shows there is some diversity in these architectures.</p></li></ol></li><li><p><strong>Rockset</strong> <a href="https://rockset.com/blog/introducing-compute-compute-separation/">blogged</a> about compute-compute separation in March 2023. Rockset uses RocksDB as its storage layer for hot data, making it the second of my examples not to use object storage as the primary storage layer (at least for hot data). </p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VsMv!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9413fb0a-bf29-43cf-abbe-0b348063f7af_3344x1874.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VsMv!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9413fb0a-bf29-43cf-abbe-0b348063f7af_3344x1874.png 424w, https://substackcdn.com/image/fetch/$s_!VsMv!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9413fb0a-bf29-43cf-abbe-0b348063f7af_3344x1874.png 848w, 
https://substackcdn.com/image/fetch/$s_!VsMv!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9413fb0a-bf29-43cf-abbe-0b348063f7af_3344x1874.png 1272w, https://substackcdn.com/image/fetch/$s_!VsMv!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9413fb0a-bf29-43cf-abbe-0b348063f7af_3344x1874.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VsMv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9413fb0a-bf29-43cf-abbe-0b348063f7af_3344x1874.png" width="1456" height="816" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/9413fb0a-bf29-43cf-abbe-0b348063f7af_3344x1874.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:816,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:&quot;Multiple applications on shared real-time data&quot;,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="Multiple applications on shared real-time data" title="Multiple applications on shared real-time data" srcset="https://substackcdn.com/image/fetch/$s_!VsMv!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9413fb0a-bf29-43cf-abbe-0b348063f7af_3344x1874.png 424w, https://substackcdn.com/image/fetch/$s_!VsMv!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9413fb0a-bf29-43cf-abbe-0b348063f7af_3344x1874.png 848w, 
https://substackcdn.com/image/fetch/$s_!VsMv!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9413fb0a-bf29-43cf-abbe-0b348063f7af_3344x1874.png 1272w, https://substackcdn.com/image/fetch/$s_!VsMv!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F9413fb0a-bf29-43cf-abbe-0b348063f7af_3344x1874.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><a href="https://rockset.com/blog/introducing-compute-compute-separation/">Multiple applications on shared real-time 
data</a></figcaption></figure></div></li><li><p><strong>Warpstream</strong> <a href="https://www.warpstream.com/blog/zero-disks-is-better-for-kafka">blogged </a>about Agent Groups this year, and how they allow for different consumer workloads to operate over the same Kafka topics without affecting each other. Warpstream has implemented the Kafka API over object storage.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Lzw1!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcf00ede-3870-4835-ac0a-44d78522f836_1239x994.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Lzw1!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcf00ede-3870-4835-ac0a-44d78522f836_1239x994.png 424w, https://substackcdn.com/image/fetch/$s_!Lzw1!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcf00ede-3870-4835-ac0a-44d78522f836_1239x994.png 848w, https://substackcdn.com/image/fetch/$s_!Lzw1!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcf00ede-3870-4835-ac0a-44d78522f836_1239x994.png 1272w, https://substackcdn.com/image/fetch/$s_!Lzw1!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcf00ede-3870-4835-ac0a-44d78522f836_1239x994.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Lzw1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcf00ede-3870-4835-ac0a-44d78522f836_1239x994.png" width="1239" height="994" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/bcf00ede-3870-4835-ac0a-44d78522f836_1239x994.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:994,&quot;width&quot;:1239,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Lzw1!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcf00ede-3870-4835-ac0a-44d78522f836_1239x994.png 424w, https://substackcdn.com/image/fetch/$s_!Lzw1!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcf00ede-3870-4835-ac0a-44d78522f836_1239x994.png 848w, https://substackcdn.com/image/fetch/$s_!Lzw1!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcf00ede-3870-4835-ac0a-44d78522f836_1239x994.png 1272w, https://substackcdn.com/image/fetch/$s_!Lzw1!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fbcf00ede-3870-4835-ac0a-44d78522f836_1239x994.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path 
d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div></li><li><p><strong>ClickHouse</strong> in July added <a href="https://clickhouse.com/docs/en/cloud/reference/compute-compute-separation">compute-compute separation to their docs</a>. 
In their docs it says &#8220;Compute-compute separation allows users to create multiple compute node groups, each with its own endpoint, that are using the same object storage folder, and thus, with the same tables, views, etc.&#8220;</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VOPF!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ce23d84-c370-410b-81f4-3bbc35aea180_787x595.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VOPF!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ce23d84-c370-410b-81f4-3bbc35aea180_787x595.png 424w, https://substackcdn.com/image/fetch/$s_!VOPF!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ce23d84-c370-410b-81f4-3bbc35aea180_787x595.png 848w, https://substackcdn.com/image/fetch/$s_!VOPF!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ce23d84-c370-410b-81f4-3bbc35aea180_787x595.png 1272w, https://substackcdn.com/image/fetch/$s_!VOPF!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ce23d84-c370-410b-81f4-3bbc35aea180_787x595.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VOPF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ce23d84-c370-410b-81f4-3bbc35aea180_787x595.png" width="787" height="595" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/6ce23d84-c370-410b-81f4-3bbc35aea180_787x595.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:595,&quot;width&quot;:787,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:45336,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VOPF!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ce23d84-c370-410b-81f4-3bbc35aea180_787x595.png 424w, https://substackcdn.com/image/fetch/$s_!VOPF!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ce23d84-c370-410b-81f4-3bbc35aea180_787x595.png 848w, https://substackcdn.com/image/fetch/$s_!VOPF!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ce23d84-c370-410b-81f4-3bbc35aea180_787x595.png 1272w, https://substackcdn.com/image/fetch/$s_!VOPF!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F6ce23d84-c370-410b-81f4-3bbc35aea180_787x595.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a><figcaption class="image-caption"><a href="https://clickhouse.com/docs/en/cloud/reference/compute-compute-separation">Warehouses, or Compute-Compute Separation (Private Preview)</a></figcaption></figure></div><p></p></li></ol><p>These are all examples of compute-compute separation within a single platform. But if we were to take it one level further, then we would be seeing platform-platform(-storage) separation. This is another trend that is brewing with the rise of both Iceberg et al and also DuckDB (that is being inserted into more and more places).</p><p>On this last topic, I&#8217;ll leave you with this quote: </p><blockquote><p>The not-so-subtle undertone of the "small data" and duckdb.org movement is decentralized analytics. Magic truly happens when compute/storage are separated but then also decentralized. 
&#8212; <a href="https://bsky.app/profile/jakthom.bsky.social/post/3l7qi7tg5vc2q">Jake Thomas</a></blockquote>]]></content:encoded></item><item><title><![CDATA[Humans of the Data Sphere Issue #2 October 28th 2024]]></title><description><![CDATA[Your biweekly dose of insights, observations, commentary and opinions from interesting people from the world of databases, AI, streaming, distributed systems and the data engineering/analytics space.]]></description><link>https://www.hotds.dev/p/humans-of-the-data-sphere-issue-2</link><guid isPermaLink="false">https://www.hotds.dev/p/humans-of-the-data-sphere-issue-2</guid><dc:creator><![CDATA[Jack Vanlightly]]></dc:creator><pubDate>Mon, 28 Oct 2024 11:07:13 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc50bf85b-a06e-4504-9d93-83d7c3c3419c_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p><em>&#8220;The hardest problem in computer science is neither naming things nor off by one errors.
It&#8217;s talking to other people.&#8221; &#8212; <a href="https://bsky.app/profile/julien.ledem.net/post/3kyerhoxuji2j">Julien Le Dem</a></em></p></blockquote><p>If you came to HOTDS for some Hot Data Science, then I&#8217;m afraid you&#8217;ll be disappointed, but don&#8217;t worry, we&#8217;ve got a wide variety of interesting people from across the data sphere and beyond &#128516;</p><p>PS: A huge number of us dist sys, infra, data infra, messaging/streaming, data eng/analytics, database peeps on Twitter have made the jump to Bluesky, it&#8217;s much better there, come and join us!</p><p>Here are a few Bluesky starter packs to get you going (they contain a list of people to follow):</p><ul><li><p><a href="https://bsky.app/starter-pack-short/U1hZhum">Distributed Systems</a></p></li><li><p><a href="https://bsky.app/starter-pack-short/8TdEfdK">Data People Starter Pack</a></p></li><li><p><a href="https://bsky.app/starter-pack-short/SCZe42X">Infrastructure Engineers</a></p></li><li><p><a href="https://bsky.app/starter-pack-short/Nd8hieE">Database</a></p></li><li><p><a href="https://bsky.app/starter-pack-short/7TBN5rX">Statistics</a></p></li><li><p><a 
href="https://bsky.app/starter-pack-short/Acp7hmk">Postgres people</a></p></li></ul><p>The very quotable Julien Le Dem <a href="https://bsky.app/profile/julien.ledem.net/post/3l76qb7hnpp2q">said </a>it: &#8220;<em>Data Twitter is dead, long live data Bluesky.</em> &#128202;&#8221;</p><h2>Quotable humans</h2><ul><li><p><a href="https://www.linkedin.com/posts/lhochstein_my-whole-shtick-is-that-i-believe-updating-activity-7245204907274690560-gYJ8?utm_source=share&amp;utm_medium=member_desktop">Lorin Hochstein</a>: My whole shtick is that I believe updating people's mental models will have a more significant positive impact on the system than discussing action items, but boy is that a tough sell.</p></li><li><p><a href="https://materializedview.io/p/small-batch-artisanal-etl-is-back">Chris Riccomini</a>: Pipeline ownership has shifted as well. WePay had a team of data engineers to manage our data pipelines. These days, we&#8217;re asking ML, AI, analytics, and product engineers to build data pipelines. [&#8230;] There are challenges to this approach. If everyone is managing their own data pipelines, there will be duplication. There is also likely to be inefficiency as engineers opt for simpler solutions rather than cheaper ones. Moreover, the pipelines could become a tangled mess with no clear ownership as employees come and go. [&#8230;] But I think dlt has recognized something: data consumers want control of their data pipelines.</p></li><li><p><a href="https://x.com/forrestbrazeal/status/1451189473383890946">Forrest Brazeal</a>: I used the word "footgun" among engineers this week and found out that several people didn't know what it meant!</p></li><li><p><a href="https://www.superdatascience.com/podcast/827">Ritchie Vink on Super Data Science 827, Polars, Past, Present and Future</a>: There&#8217;s also quite some misconception about what (Apache) Arrow is. That&#8217;s because PyArrow, the library, does not only do Arrow memory but also Arrow compute.
Sometimes people say &#8220;it&#8217;s fast because of Arrow&#8221; and that doesn&#8217;t make any sense. First, Arrow is a specification of how memory should be ordered for specific data types. That&#8217;s all Arrow is. That&#8217;s the same as JSON is a specification for how to store the bytes of a JSON serialized data structure. JSON is not a very optimal way to store data, so it&#8217;s not very good for performance; Arrow is much better, if you are doing columnar data processing.</p></li><li><p><a href="https://x.com/BenjDicken/status/1846586055048540648">Ben Dicken</a>: Did you know that a random SSD read is multiple orders of magnitude slower than a random memory read? I made a little visual that really drives the point home. This is why memory buffers and caches are so important, especially for I/O heavy workloads like databases.</p></li><li><p><a href="https://x.com/craigkerstiens/status/1849112719817122065">Craig Kerstiens</a>: I mentioned recently that anytime I encounter a database without constraints I run into data integrity issues. Could not agree more with taking full advantage of them.</p></li><li><p><a href="https://bsky.app/profile/alexmillerdb.bsky.social/post/3l7hbqvvpo22o">Alex Miller</a>: I have a personal fondness for papers/posts which present two very distinct and opposing designs as just two extremes of some spectrum of design trade-offs. LSMs vs B-Trees is a space in which I've seen a few rather different pitches of what that design spectrum could look like.</p></li><li><p><a href="https://x.com/JohnKutay/status/1848174965796884729">John Kutay</a>: TIL #duckdb uses Adaptive Radix Trees as one of its two key indexing methods&#8230; At their core, radix trees optimize for space by compressing paths where nodes only have a single child.
Instead of wasting memory with unnecessary branches, it collapses them, leading to reduced storage overhead compared to a traditional trie.</p></li><li><p><a href="https://materializedview.io/p/virtual-machines-are-getting-better">Chris Riccomini</a>: NVIDIA&#8217;s cuda-checkpoint is particularly important. CUDA offers GPU APIs meant for generic computation (not gaming). Such APIs are widely used in AI models and LLMs. As developers execute operations on a GPU, the GPU&#8217;s memory accumulates state. This GPU data is very difficult to read directly, which poses a problem if you wish to snapshot a machine&#8217;s state. Now cuda-checkpoint offers a simple, bare-bones, free tool to do GPU checkpoint and recovery.</p><ul><li><p>For serverless functions, unikernels such as Unikraft now boast single-digit boot times and fast snapshotting. This should enable faster cold starts and scale-to-zero, which will result in cost savings. Many unikernels tout increased security in multi-tenant environments, as well. As unikernels add Kubernetes support, I expect adoption to increase, so non-serverless workloads like microservices will benefit.</p></li></ul></li><li><p><a href="https://x.com/kellabyte/status/1847041418730106969">Kelly Sommers</a>: How do you decide when something that is essentially a data pipeline job belongs in a data orchestration tool instead of a bunch of micro services reading, transforming &amp; writing data using some scheduling lib?</p><p>Sometimes diff to decide when to disrupt a team&#8217;s comfort choices.</p></li><li><p><a href="https://x.com/JohnKutay/status/1846997358271164569">John Kutay</a>: I always recommend CDC for analytics on transactional data. OLTP systems rely heavily on efficient management of the buffer cache for high-speed data access.
Analytical queries can increase cache pressure by loading large amounts of data that may not be reused frequently, pushing out more critical OLTP-related data from the cache.</p></li><li><p><a href="https://bsky.app/profile/esammer.bsky.social/post/3l7dxi5dqnu26">Eric Sammer</a>: The current discourse on disaggregated compute and storage in data infra says local disk is bad. I think this lacks nuance. Local persistent disk (e.g. EBS) is bad. Ephemeral local disk should be embraced as a block cache for reads and staging of writes to durable storage. The cost/op is a win.</p></li><li><p><a href="https://sympathetic.ink/2024/10/10/The-Future-Of-Lineage.html">Julien Le Dem</a>: Modern data engineering is emancipating ourselves from an uncontrolled flow of upstream changes that hinders our ability to deliver quality data.</p></li><li><p><a href="https://sympathetic.ink/2024/10/10/The-Future-Of-Lineage.html">Julien Le Dem</a>: The final step of standardization is to move the responsibility of producing lineage metadata to the producer of data itself. As it emerges as a common need for all data practitioners, lineage becomes a requirement for all data tools, open source or proprietary.</p></li><li><p><a href="https://x.com/craigkerstiens/status/1848410109291938009">Craig Kerstiens</a>: If you're not running PgBouncer check your idle connection count:</p><p>SELECT count(*), state</p><p>FROM pg_stat_activity</p><p>GROUP BY state;</p></li><li><p><a href="https://x.com/mipsytipsy/status/1848164468481540284">Charity Majors</a>: This is a hard truth about data-driven products. 
You can have a world-class team, great ideas, and tons of funding, but your ability to build a great user experience will always be bounded by how bits get laid out on disk -- decisions made long before any of you were even there.</p></li><li><p><a href="https://x.com/mipsytipsy/status/1848160648624316695">Charity Majors</a>: Catching up and closing tabs on a month or so of missed blog posts, and here's another dandy: "The Observability CAP Theorem", by @_cartermp. Everybody wants:</p><ul><li><p>1. fast queries</p><p>2. long retention</p><p>3. rich context</p><p>4. low cost</p></li><li><p>You can pick two, *maybe* three.</p></li></ul></li><li><p><a href="https://x.com/gunnarmorling/status/1847024755263918577">Gunnar Morling</a>: Thoroughly thinking about and describing the possible failure modes of a proposed design will put you ahead of 90% of software engineers.</p></li><li><p><a href="https://x.com/SeattleDataGuy/status/1844908402490712506">SeattleDataGuy</a>: Let's just fix it in SQL - It can seem really easy to fix business logic in the SQL layer rather than from the source. However, this means, in the long-run, anytime that business logic changes, your team needs to update the SQL as well. It's far more impactful to get the source team to ensure the data is right where they create it rather than in the data warehouse.</p></li><li><p><a href="https://x.com/AlekseyCharapko/status/1844462275807293811">Aleksey Charapko</a>: My student Owen (@owenhilyard) shares his thoughts on DPDK at the DPDK Summit: "DPDK in Databases: Why Isn't It More Common?" TLDR: DPDK is hard, and there is an impedance mismatch between DPDK and DB engineers in creating good network abstractions.</p></li><li><p><a href="https://www.linkedin.com/posts/vinothchandar_are-the-table-format-wars-entering-the-final-activity-7248144974444175360-PdSK">Vinoth Chandar</a>: There is an unhealthy romanticism for war analogies in our industry.
After all, we are here to build software, not take sides in wars &#128578;. Thankfully, this does not matter since <a href="https://www.linkedin.com/feed/hashtag/?keywords=opensource&amp;highlightedUpdateUrns=urn%3Ali%3Aactivity%3A7248144974444175360">#opensource</a> does not need vendor blessings to thrive. It just needs a strong community intent on building software.</p></li><li><p><a href="https://x.com/MarcJBrooker/status/1848391994525016379">Marc Brooker</a>: "But what about time-to-market?" has been one of the objections to automated reasoning and formal methods forever, but in many domains they allow us to get to market faster. This is especially true in security, availability, and durability-critical domains.</p></li><li><p><a href="https://x.com/BonesMoses/status/1847743182240698716">Shaun Thomas</a>: BRIN indexes are criminally underused in general. I need to sit down and do a writeup for good use cases for BRIN and KNN indexes.</p></li><li><p><a href="https://x.com/fchollet/status/1847791200583241797">Fran&#231;ois Chollet</a>: If you take a problem that is known to be solvable by expert humans via pure pattern recognition (say, spotting the top move on a chess board) and that has been known to be solvable via convnets as far back as 2016, and you train a model on ~5B chess positions across ~10M games, and you find that the model can solve the problem at the level of a human expert, that isn't an example of out-of-distribution generalization. That is an example of local generalization -- precisely the thing you expect deep learning to be able to do.</p></li><li><p><a href="https://x.com/emollick/status/1848582933499810062">Ethan Mollick</a>: E-commerce companies are doing a good job in their A/B testing, avoiding p-hacking themselves. (This wasn't the case several years back when A/B testing was newer).
<em>Referring to</em>:</p><ul><li><p><a href="https://alexmiller.phd/research/p-hacking/ab-testing-p-hacking.pdf">https://alexmiller.phd/research/p-hacking/ab-testing-p-hacking.pdf</a></p></li><li><p>In recent years, randomized experiments (or &#8220;A/B tests&#8221;) have become commonplace in many industrial settings as managers increasingly seek the aid of scientific rigor in their decision-making. However, just as this practice has proliferated among firms, the problem of &#119901;-hacking&#8212;whereby experimenters adjust their sample size or try several statistical analyses until they find one that produces a statistically significant &#119901;-value&#8212;has emerged as a prevalent concern in the scientific community. Notably, many commentators have highlighted how A/B testing software enables and may even encourage &#119901;-hacking behavior.</p></li></ul></li><li><p><a href="https://x.com/emollick/status/1847412650839347619">Ethan Mollick</a>: It is morally wrong to use AI detectors when they produce false positives that smear students in ways that hurt them and where they can never prove their innocence. Do not use them.</p></li><li><p><a href="https://x.com/nisanharamati/status/1849146392813469896">Nisan Haramati</a>: Which is why it's really important to consider how cost behaves as a system scales. A system whose unit-cost increases with scale (particularly when it grows beyond its optimal operational size) does not actually satisfy "can always grow larger" [...] 
And it leads to the question that Graphium Labs is working on: can we design a system whose per-unit cost of operation does not overtake the value that the system produces, as the system grows well into the extreme scale range?</p></li><li><p><a href="https://www.p99conf.io/session/the-next-chapter-in-the-sordid-love-hate-relationship-between-dbs-and-oses/">Andy Pavlo</a> at <a href="https://www.p99conf.io/">P99 Conf</a>: So we&#8217;ve been working on a new key/value storage system called BPF-DB. The goal is to provide a transaction storage engine, with a simplistic API with sets and gets, wrapped up in transactions. The idea is to allow developers to build more full featured systems using BPF-DB as the backing storage. Think of something like RocksDB (for eBPF).</p></li><li><p><a href="https://x.com/mim_djo/status/1847556094136332722">Mim</a>: maybe the most important discussion that will determine the future of delta table and iceberg, I hope cool heads will prevail :) <a href="https://lists.apache.org/thread/wyon0kvroxsmkxh153444xzscwbb68o1">https://lists.apache.org/thread/wyon0kvroxsmkxh153444xzscwbb68o1</a></p></li><li><p><a href="https://news.ycombinator.com/item?id=41896893">CharlieDigital</a>: Didn't look too deeply, but one of the keys with Cypher (at least in the context of graph databases) is that it has a nice way of representing `JOIN` operations as graph traversals.</p><ul><li><p>MATCH (p:Person)-[r]-(c:Company) RETURN p.Name, c.Name</p></li><li><p>Where `r` can represent any relationship (AKA `JOIN`) between the two collections `Person` and `Company` such as `WORKS_AT`, `EMPLOYED_BY`, `CONTRACTOR_FOR`, etc.</p></li><li><p>So I'd say that linear queries are one of the things I like about Cypher, but the clean abstraction of complex `JOIN` operations is another huge one.</p></li></ul></li><li><p><a href="https://news.ycombinator.com/item?id=41896869">CharlieDigital</a>: If you haven't been following it, I recently found out that it is now
supported in a limited capacity by Google Spanner[0]. The openCypher initiative started a few years back and it looks like it's evolved into the (unfortunate moniker) GQL[1]. So it may be the case that we'll see more Cypher out in the wild.</p><ul><li><p>[0]<a href="https://cloud.google.com/spanner/docs/graph/opencypher-reference"> </a><a href="https://cloud.google.com/spanner/docs/graph/opencypher-refer%E2%80%A6">https://cloud.google.com/spanner/docs/graph/opencypher-refer&#8230;</a></p></li><li><p>[1]<a href="https://neo4j.com/blog/cypher-gql-world/"> https://neo4j.com/blog/cypher-gql-world/</a></p></li></ul></li><li><p><a href="https://jeremymorrell.dev/blog/a-practitioners-guide-to-wide-events/">Jeremy Morrell</a>: Adopting Wide Event-style instrumentation has been one of the highest-leverage changes I&#8217;ve made in my engineering career.</p></li><li><p><a href="https://dl.acm.org/doi/pdf/10.1145/3662010.3663442">Simple, Efficient, and Robust Hash Tables for Join Processing</a>:</p><ul><li><p>Joins are asymmetric. Most joins are asymmetric, with one side often much smaller than the other. Hash joins exploit this asymmetry by building a hash table on the smaller side and probing it using the larger side. As the sides are often orders of magnitude different in size, the probing phase must be kept extraordinarily efficient.</p></li><li><p>Joins are selective. The asymmetry of the sides also manifests in terms of join selectivity: Many probe-side tuples will not find matches. Therefore, it is crucial to efficiently eliminate tuples without a join partner from the large probe side. 
Bloom filters are ideal for this task.</p></li></ul></li><li><p><a href="https://alexmiller.phd/research/p-hacking/ab-testing-p-hacking.pdf">An investigation of &#119901;-hacking in e-commerce A/B testing</a>:&nbsp;</p><ul><li><p>As empirical science and statistics evolved and became institutionalized throughout the 20th century, this simple rule-of-thumb gradually solidified into a widespread convention for evaluating statistical evidence, with the &#119901;=0.05 threshold often being treated as a definitive cutoff.2 In recent decades, however, the scientific community has increasingly scrutinized the utility, applicability, and consequences of widespread reliance on significance testing that dichotomizes results based on &#119901;-values in this way.</p></li><li><p>Given that academic researchers, often with doctoral degrees and graduate training in statistics, have been known to engage in &#119901;-hacking behavior (Brodeur et al., 2020, 2023b, Szucs, 2016), it is reasonable to ask whether analysts in corporate environments using similar statistical techniques make similar methodological errors.</p></li></ul></li><li><p><a href="https://muratbuffalo.blogspot.com/2024/10/auto-wlm-machine-learning-enhanced.html">Murat Demirbas</a>: Finally, there are operational concerns! Reliably integrating ML into the critical path of any database management system is difficult. How do you ensure there are no surprises when faced with the long-tail of workloads? Whenever queries queue there is a risk of disproportional increase in the response time of short running queries compared to their execution time; a phenomenon often referred to as head of the line blocking (HLB). It seems like the query execution system can be a great breeding ground for metastable failures! 
How do you trust handing this over to a ML-based system?</p></li><li><p><a href="https://www.linkedin.com/posts/jack-vanlightly-1153b44a_evolution-of-flink-20-state-management-storage-computing-activity-7255488743011151872-T6Bx">Jack Vanlightly</a>: The original state management design in Flink wasn't optimal for the ever growing size of the state being managed by stream processing jobs. But we've seen steady evolution, from aligned to unaligned checkpointing, generic incremental checkpoints, and new state reconstruction algorithms. The next step in this evolution is cloud-native, disaggregated state storage.&nbsp;</p></li><li><p><a href="https://www.morling.dev/blog/cdc-is-a-feature-not-a-product/">Gunnar Morling</a>: In my opinion, CDC makes sense as part of a cohesive data platform which integrates all these things. These, and more: also data governance, schema management, observability, quality management, etc. Another angle for CDC productization could be to marry it closely with a database. Imagine Postgres provided out of the box a Kafka broker endpoint to which you can subscribe for getting Debezium-formatted data change events. How cool would that be? 
But again, that&#8217;s a feature, not a product.</p></li><li><p><a href="https://redmonk.com/jgovernor/2024/09/25/observability-as-a-day-zero-operation/">James Governor</a>: We&#8217;ve also had a long era of sprawl, where developers and organisations and engineering teams have had a lot of autonomy in the decisions they make, which has enabled productivity in some dimensions, but also major challenges in terms of complexity and lack of standardisation in databases, runtimes, monitoring tools, programming languages and so on.</p></li><li><p><a href="https://www.hackintoshrao.com/new-low-level-api-for-lakehouse-centric-compute-engines/">Karthik Rao</a>: I will argue that today&#8217;s query engines, such as Apache Spark, do not effectively interact with lake houses due to limitations in handling object storage semantics, lack of integration with catalog systems like Iceberg for governance and lineage, and inadequate support for lakehouse-specific features. I believe there is a need for a new class of lakehouse-native query engine APIs with explicit semantics for object storage and open table formats.</p></li><li><p><a href="https://www.craigkerstiens.com/2024/10/18/the-future-of-postgres/">Craig Kerstiens</a>: I&#8217;m often asked what do I think the future for Postgres holds, and my answer has been mostly the same for probably 8 years now, maybe even longer. You see for Postgres itself stability and reliability is core. So where does the new stuff come from if it&#8217;s not in the stable core&#8230; extensions.</p></li><li><p><a href="https://cloudedjudgement.substack.com/p/clouded-judgement-101824-from-systems">Jamin Ball</a>: As a result, we may see a fundamental change in how enterprise applications are built. The traditional model of a front-end application tied to a database like Oracle and a series of manual workflows might give way to AI-native applications built on AI native databases, where the database takes center stage. 
These AI applications will be designed to operate on top of centralized data repositories&#8212;like a data lake or lakehouse&#8212;where AI agents gather and process information from a wide array of unstructured sources.</p></li><li><p><a href="https://shopify.engineering/building-resilient-payment-systems">Bart de Water</a>: Understanding a bit of queueing theory goes a long way in being able to reason about how a system will behave under load.</p></li><li><p><a href="https://www.dataengineeringpodcast.com/episodepage/bring-vector-search-and-storage-to-the-data-lake-with-lance">Weston Pace on Data Engineering Podcast E442</a> (23 min): So I used to work with PyArrow and Parquet data sets and there would be a lot of those sorts of configuration things. One of our goals with Lance was really to get rid of those as much as possible. We found that with these wide data types, it either becomes hard to figure out what the right setting is or it&#8217;s downright impossible: there is no good setting. So with Lance, we got rid of row groups. I told my friend at the time, if the only good thing I do in my life is get rid of row groups then I will consider it a success. Hopefully that can eventually propagate back to Parquet and other file formats.</p></li><li><p><a href="https://www.dataengineeringpodcast.com/episodepage/bring-vector-search-and-storage-to-the-data-lake-with-lance">Weston Pace on Data Engineering Podcast E442</a> (40 min): Because what we&#8217;re seeing is that compression techniques, that again something that seemed relatively solved, well now people are talking about FastLanes and FSST and ALP, and all these interesting compression techniques.</p></li><li><p><a href="https://changelog.com/practicalai/292">Daniel Whitenack Practical AI E292</a>: I resonate so much with this, coming from a background as a data scientist, living through the years of being told to use Spark for this.
Basically my experience in this ecosystem is, I would try to write a query and it would get the right result, but to your point Till, I would just be waiting forever for a result. So I would send the query to some other guy, whose name was Eugene and Eugene was really smart, and he could figure out a way to make it go really fast, and I never became Eugene.</p></li></ul><h2>Humans with opinions</h2><h3><a href="https://x.com/SeattleDataGuy/status/1847392288353169632">Seattle Data Guy (aka Ben Rogojan) on the Snowflake vs Databricks battle</a></h3><p><em>[X post]</em></p><p>First off..."Spark-based SaaS"...they aren't Voldemort..you can say their name.</p><p>Second...the more you think about Snowflake vs. Databricks.</p><p>The happier Snowflake and Databricks must be.</p><p>Despite there being a dozen other options when it comes to building a data analytics platform.</p><p>They&#8217;ve slowly turned the narrative into Coke vs Pepsi.</p><p>Nike vs. Adidas.</p><p>Burger King vs McDonald&#8217;s.</p><p>Don't think about any other solutions.</p><p>I posted this a while back, but this seemed pertinent.</p><h3><a href="https://phillipcarter.dev/2024/09/14/the-observability-cap-theorem/">Phillip Carter on The Observability CAP Theorem</a></h3><p><em>[Blog snippet]</em></p><p>The theorem</p><p>Observability has a similar dynamic as the CAP Theorem. Generally speaking, sufficient observability for nontrivial, live applications/systems involves these properties:</p><ul><li><p>Fast queries on your data right now</p></li><li><p>Enough data to access per query (days, months, years, back)</p></li><li><p>Access to all the data you care about for a given context</p></li><li><p>A bill at the end of the month that doesn&#8217;t make you cry</p></li></ul><p>Of these, you can definitely get one, probably get two, maybe get three, and you absolutely, positively cannot get all four.
The only way that&#8217;s possible is if you&#8217;re just not generating much data to begin with (e.g., a small business like Basecamp).</p><p>Every year, legions of engineers are tasked by their management to get all four properties themselves after experiencing an eye-watering Datadog bill. After having some fun trying to build baby&#8217;s first large-scale data analysis system, they ultimately fail.</p><h3><a href="https://opensourcedatasummit.com/data-catalogs-panel/">Open Source Data Summit, Panel Discussion: The rise of open source data catalogs</a>&nbsp;</h3><p><em>[Video snippet]</em></p><p>What does your project and your community plan to do about unstructured data?</p><ul><li><p>Denny Lee (Polaris): This isn&#8217;t just for the catalog community, we&#8217;re all facing the same dilemma. We want to apply the same logic of governance, some form of lineage, that we have for structured data, and now we need to apply that to our semi-structured and unstructured data. GenAI is everywhere now, that&#8217;s great, but what were the parameters you used? What was the training data you used? What&#8217;s the lineage of all this? Can you, from a fiduciary or compliance perspective, explain where the source of this data is from?</p></li><li><p>Russell Spitzer (Iceberg): I wonder, and we&#8217;ve had discussions about this in the Iceberg community, if we need some way of tracking blobs, and associated metadata, as entities within the table. In Delta, Iceberg and Hudi alike, we have this ability to do this tracking, this lineage, and all we really need to do is extend it past the relational table into attached blobs that are connected to these tables. If we do that in one of these base table formats, then these catalogs get it for free. 
[&#8230;] We don&#8217;t need this to be a catalog-level concern if we can build this into the table formats themselves.</p></li></ul><h2>Helpful Humans</h2><ul><li><p>Bart de Water of Shopify wrote <a href="https://shopify.engineering/building-resilient-payment-systems">10 Tips for Building Resilient Payment Systems</a> (though I think it applies far more broadly)</p></li><li><p>Alex Merced wrote:</p><ul><li><p><a href="https://amdatalakehouse.substack.com/p/all-about-parquet-part-01-an-introduction">All About Parquet 01</a> (the first in a 10-part series on Apache Parquet).</p></li><li><p><a href="https://amdatalakehouse.substack.com/p/a-deep-dive-into-github-actions-from">A Deep Dive Into GitHub Actions From Software Development to Data Engineering</a></p></li></ul></li><li><p>Jeremy Morrell wrote <a href="https://jeremymorrell.dev/blog/a-practitioners-guide-to-wide-events/">A Practitioner's Guide to Wide Events</a></p></li><li><p>Farzad Nobar wrote <a href="https://towardsdatascience.com/time-series-from-analyzing-the-past-to-predicting-the-future-249ab99ec52d">Time Series &#8212; From Analyzing the Past to Predicting the Future</a> (an overview of statistical methods for making forecasts based on time-series data).</p></li><li><p>Brent Ozar wrote <a href="https://www.brentozar.com/archive/2024/10/how-many-indexes-is-too-many/">How Many Indexes Is Too Many?</a> (Focused on SQL Server but broadly applicable).</p></li></ul><h2>R&amp;D humans</h2><ul><li><p>A survey of different LLM fine-tuning techniques: <a href="https://arxiv.org/abs/2408.13296">https://arxiv.org/abs/2408.13296</a></p><ul><li><p><em>The analysis differentiates between various fine-tuning methodologies, including supervised, unsupervised, and instruction-based approaches, underscoring their respective implications for specific tasks.</em></p></li><li><p><em>A structured seven-stage pipeline for LLM fine-tuning is introduced, covering the complete lifecycle from data preparation to model 
deployment. Key considerations include data collection strategies, handling of imbalanced datasets, model initialisation, and optimisation techniques, with a particular focus on hyperparameter tuning.</em></p></li><li><p><em>The report also highlights parameter-efficient fine-tuning methods such as Low-Rank Adaptation (LoRA) and Half Fine-Tuning, which balance resource constraints with optimal model performance.</em></p></li><li><p><em>The exploration extends to advanced fine-tuning techniques and configurations like memory finetuning, Mixture of Experts (MoE) and Mixture of Agents (MoA), demonstrating how these methods harness specialised networks and multi-agent collaboration for improved outcomes. Proximal Policy Optimisation (PPO) and Direct Preference Optimisation (DPO) are discussed as innovative approaches to aligning models with human preferences, while the benefits of pruning and routing optimisations are examined for enhancing efficiency.</em></p></li><li><p><em>In the latter sections, the report delves into validation frameworks, post-deployment monitoring, and optimisation techniques for inference. It also addresses the deployment of LLMs on distributed and cloud-based platforms. 
Additionally, cutting-edge topics such as multimodal LLMs and fine-tuning for audio and speech processing are covered, alongside emerging challenges related to scalability, privacy, and accountability.</em></p></li><li><p><em>This report aims to serve as a comprehensive guide for researchers and practitioners, offering actionable insights into fine-tuning LLMs while navigating the challenges and opportunities inherent in this rapidly evolving field.</em></p></li></ul></li><li><p>Two papers discussing more efficient join implementation strategies:</p><ul><li><p><a href="https://dl.acm.org/doi/pdf/10.1145/3662010.3663442">Simple, Efficient, and Robust Hash Tables for Join Processing</a></p><ul><li><p><em>Hash joins play a critical role in relational data processing and their performance is crucial for the overall performance of a database system. Due to the hard to predict nature of intermediate results, an ideal hash join implementation has to be both fast for typical queries and robust against unusual data distributions. In this paper, we present our simple, yet effective unchained in-memory hash table design.</em></p></li></ul></li><li><p><a href="https://www.vldb.org/pvldb/vol17/p1350-justen.pdf">POLAR: Adaptive and Non-invasive Join Order Selection via Plans of Least Resistance</a></p><ul><li><p><em>Join ordering and query optimization are crucial for query performance but remain challenging due to unknown or changing characteristics of query intermediates, especially for complex queries with many joins. Over the past two decades, a spectrum of techniques for adaptive query processing (AQP)&#8212;including inter-/intra-operator adaptivity and tuple routing&#8212;have been proposed to address these challenges. 
However, commercial database systems in practice do not implement holistic AQP techniques because they increase the system complexity (e.g., intertwined planning and execution) and thus, complicate debugging and testing. Additionally, existing approaches may incur large overheads, leading to problematic performance regressions. In this paper, we introduce POLAR, a simple yet very effective technique for a self-regulating selection of alternative join orderings with bounded overhead.</em></p></li></ul></li></ul></li></ul><h2>Notable trend - Streaming state management</h2><p>Yuan Mei, Apache Flink PMC member and director of engineering at Alibaba Cloud, spoke last week at Flink Forward about work to implement cloud-native disaggregated state storage in Flink 2.0.</p><p>Streaming state storage is an interesting subject and the state of the art is moving to disaggregated storage (separation of compute and storage) but also to more bespoke state stores. It turns out that there are challenges to utilizing state stores such as RocksDB in streaming workloads. No data store can serve all workloads, and we see much diversity in stream processor designs when looking across both mature and new stream processors today. Due to this diversity, many stream processors are working on building or have built native state stores to work well with their specific designs.</p><p>A number of blog posts have been written this year on this subject. It makes interesting reading to learn how stream processing often requires different state management from an OLTP database (where RocksDB is traditionally used). 
However, as you will see, each highlights different problems overcome by choosing a more bespoke state store solution.</p><p>With that preamble, here are a number of blog posts on this general topic:</p><ul><li><p><a href="https://www.alibabacloud.com/blog/evolution-of-flink-2-0-state-management-storage-computing-separation-architecture_601133">Evolution of Flink 2.0 State Management Storage-computing Separation Architecture</a> (Alibaba Cloud)</p><ul><li><p><em>For this reason, we can consider doing this differently. We can use DFS as primary storage and local disk as an optional cache, and DFS data is the source of truth. In this method, state data is directly and continuously written to DFS, and memory and local disks are used as caches to serve operator state access at the upper layer. The benefits of this architecture are:</em></p><ul><li><p><em>When a checkpoint is triggered, most of the state files have already been in DFS. Checkpoints can be quickly executed.</em></p></li><li><p><em>State files can be shared among operators. In some scenarios, storage resources can be greatly reduced by sharing state computing.</em></p></li><li><p><em>State query can be implemented based on remote checkpoint files and file-level APIs can be provided.</em></p></li><li><p><em>Compaction and state cleanup are no longer bound to the task manager. We can use methods such as remote compaction and load balancing to balance the overall resource usage of the cluster.</em></p></li></ul></li></ul></li><li><p><a href="https://www.feldera.com/blog/rocksdb-not-a-good-choice-for-high-performance-streaming">RocksDB: Not A Good Choice for High-Performance Streaming</a> (Feldera project)</p><ul><li><p><em>One might argue that Rust has zero-copy deserialization libraries like rkyv that could be used together with RocksDB. Zero-copy deserialization allows directly casting the starting address of the bytes from RocksDB to an actual Rust type, eliminating CPU overhead. 
However, zero-copy deserialization requires proper alignment of the deserialized bytes for the target Rust type. RocksDB operates on generic byte slices (&amp;[u8]), which have an alignment requirement of 1 in Rust. Additionally, RocksDB is written in C++, which does not adhere to Rust's alignment rules. Attempting zero-copy deserialization with rkyv and RocksDB might occasionally work, but it often results in a panic due to misaligned slices</em></p></li><li><p><em>RocksDB offers an overwhelming number of configuration options, making it nearly impossible for non-experts to ensure optimal settings. The complexity is so significant that a <a href="https://dl.acm.org/doi/10.1145/3655038.3665954">HotStorage&#8217;24 paper</a> detailed training a large language model (LLM) to identify good configurations. It won the best paper award at the conference.</em></p></li></ul></li><li><p><a href="https://materialize.com/blog/why-not-rocksdb/">Why not RocksDB for streaming storage?</a> (Materialize, 2020)</p><ul><li><p><em>Each core in a Timely Dataflow cluster has a complete copy of the logical dataflow plan, and every operator is partitioned across every core. In this sharding scheme, operators on a worker are cooperatively scheduled, and have to be carefully designed to yield eagerly, or they will block other operators from executing, potentially stalling the entire dataflow graph.</em></p></li><li><p><em>And therein lies the problem with RocksDB: RocksDB is designed for liberally using background threads to perform the computation for physical compaction. In an OLTP setting, this is exactly what you want - lots of concurrent writers pushing to the higher levels of the LSM tree, lots of readers accessing immutable SSTables of the entire tree with snapshot isolation semantics, and compaction proceeding in the background with additional cores. 
In a streaming setting, however, those cores are better used as additional concurrency units for the dataflow execution, which means compaction happens in the foreground. And thus, scheduling when the compaction happens cannot be outsourced, and must be considered alongside all other operators.</em></p></li></ul></li><li><p><a href="https://www.responsive.dev/blog/stop-embedding-rocksdb-in-kafka-streams">Stop embedding RocksDB in your Stream Processor!</a> (Responsive)</p><ul><li><p><em>In many of these situations the RocksDB state has to be rebuilt from the underlying Kafka changelog topics. If that happens, the partitions in question will not be able to process any new events until the state is fully rebuilt, which could take a very long time for large state stores. This means that these partitions are offline for the duration of the state restoration, which could result in significant business impact.</em></p></li><li><p><em>Debugging incidents often requires you to look up state. For instance, if you can&#8217;t make sense of your join output, you&#8217;d typically want to look up your state in order to explain the output you see. Doing this with RocksDB is not straightforward: you need to know which partition the key is present on and then run an interactive query against the live application to get the state value for that key. By contrast, dedicated databases have full-fledged query layers. This means that you can just issue a lookup against the key to get its value.</em></p></li></ul></li></ul><p><a href="https://slatedb.io/">SlateDB</a> is an interesting project that is relevant to this subject of embedded storage engines (such as RocksDB) and storage disaggregation. It&#8217;s the first mover in this area though I can&#8217;t imagine it will remain the only embedded state store built for object storage for long. 
Flink 2.0 has forked RocksDB to create ForSt, a RocksDB that speaks object storage (though it is being developed specifically for Flink, and not designed to be a general-purpose storage engine like SlateDB as far as I know).</p><h2>My writing</h2><ul><li><p><a href="https://jack-vanlightly.com/blog/2024/10/21/the-curse-of-conway-and-the-data-space">The Curse of Conway and the Data Space</a>. Last week I wrote about Conway&#8217;s Law and the data space. Could it be that many modern trends, and even many categories within the Modern Data Stack, have just been reactions to the negative consequences of segregating data teams from software eng teams? When you look at the entire space more systemically, through the lens of Conway&#8217;s Law, it gives you a wholly different perspective.</p></li><li><p><a href="https://jack-vanlightly.com/blog/2024/10/24/the-teachers-nemesis">The Teacher&#8217;s Nemesis</a>: <em>The curse of knowledge is our own blindness as experts regarding the difficulty of learning something and the complexity of concepts we have internalized fully.</em></p></li></ul><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.hotds.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Humans of the Data Sphere! 
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Humans of the Data Sphere Issue #1 October 15th 2024]]></title><description><![CDATA[Your biweekly dose of insights, observations, commentary and opinions from interesting people from the world of databases, AI, distributed systems and the data engineering/analytics space.]]></description><link>https://www.hotds.dev/p/humans-of-the-data-sphere-issue-1</link><guid isPermaLink="false">https://www.hotds.dev/p/humans-of-the-data-sphere-issue-1</guid><dc:creator><![CDATA[Jack Vanlightly]]></dc:creator><pubDate>Tue, 15 Oct 2024 12:45:40 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc50bf85b-a06e-4504-9d93-83d7c3c3419c_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>This is issue #1 of Humans of the Data Sphere. </p><h2>Aims of the publication</h2><p>This publication aims to bring together the words of humans from across the data landscape. This is broader than just the &#8220;data space&#8221;, it encompasses all kinds of databases, messaging/streaming, OLAP, (cloud) data warehouses, lakehouses, distributed systems, and some AI and ML. My hope is that it is narrow enough to be cohesive, but broad enough for readers to be exposed to new ideas and new people that might sit just outside their normal career focus. 
</p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.hotds.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Humans of the Data Sphere! Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div><p>This format is heavily inspired by the <a href="https://highscalability.com/">highscalability.com</a>, &#8220;Stuff the internet says on scalability&#8220; posts. While I am using a similar format as a launchpad, it&#8217;s still too early to know what HOTDS will be. Right now it reflects the subjects and people that I find most interesting and that I think will be interesting to others.</p><p>Let me know what you think of issue #1 at feedback@hotds.dev.</p><h2>Quotable humans</h2><ul><li><p><a href="https://x.com/DolanRyanne/status/1843708736411673077">Ryanne Dolan:</a> The best thing about streaming data pipelines is NOT that they are "real-time". It's that they are tremendously easier to operate.</p></li><li><p><a href="https://www.reddit.com/r/dataengineering/comments/1fu1pth/comment/lpwri9j/">sahilthapar</a>: It took a long time at my previous company but we were finally able to convince the upstream sources to own the quality of the data they send out. 
I totally agree with this comment, it's a slippery slope the more you "handle" it, the less likely they are to ever fix the systems.</p></li><li><p><a href="https://x.com/soumithchintala/status/1841540529240088905">Soumith Chintala</a>: "How to train a model on 10k H100 GPUs?" has now been immortalized on my blog: <a href="https://soumith.ch/blog/2024-10-02-training-10k-scale.md.html">https://soumith.ch/blog/2024-10-02-training-10k-scale.md.html</a>&nbsp;</p></li><li><p><a href="https://muratbuffalo.blogspot.com/2024/10/srds-day-2.html">Murat Demirbas</a>: Complexity creeps into distributed systems through failures, asynchrony, and change. Mahesh also confessed that he didn't realize the extent to the importance of managing change until his days in industry. While other fields in computer science have successfully built robust abstractions (such as layered protocols in networking, and block storage, process, address space abstractions in operating systems), distributed systems have lagged behind in this aspect.</p></li><li><p><a href="https://www.dataengineeringpodcast.com/episodepage/build-your-data-transformations-faster-and-safer-with-sdf">Data Engineering Podcast E400</a> (39&#8217;): Why did Snowflake evolve its own grammar? DuckDB is also creating its own grammar. The reason for this is that there is no extensibility, so the only thing that&#8217;s left is to change the language.</p></li><li><p><a href="https://x.com/rakyll/status/1838718274256937266">Jaana Dogan</a>: When I retrospectively think about all the globally successful projects I worked on, the common denominator wasn't the buy-in from everyone. A few strongly opinionated people came together, identified a major problem, built a solution extensible enough. 
Growth was organic from that point on.</p></li><li><p><a href="https://www.linkedin.com/posts/lhochstein_reminder-about-the-importance-of-getting-activity-7244365859836592129-CSS6">Lorin Hochstein</a>: Reminder about the importance of getting good at recovering from incidents: 1. You can&#8217;t prevent all incidents from occurring. 2. You must recover from all of the incidents that occur.</p></li><li><p><a href="https://youtu.be/yNLVmhKp8wM?t=303">SuperDataScience E825</a> (7&#8217;): My definition of data quality is a bit different from other people&#8217;s. In the software world, people think about quality as, it&#8217;s very deterministic. So I am writing a feature, I am building an application, I have a set of requirements for that application and if the software no longer meets those requirements that is known as a bug, it&#8217;s a quality issue. But in the data space you might have a producer of data that is emitting data or collecting data in some way, that makes a change which is totally sensible for their use case. As an example, maybe I have a column called timestamp that is being recorded in local time, but I decide to change that to UTC format. Totally fine, makes complete sense, probably exactly what you should do. But if there&#8217;s someone downstream of me that&#8217;s expecting local time, they&#8217;re going to experience a data quality issue. So my perspective is that data quality is actually a result of mismanaged expectations between the data producers and data consumers, and that is the function of the data contract. It&#8217;s to help these two sides actually collaborate better with each other.</p></li><li><p><a href="https://muratbuffalo.blogspot.com/2024/10/ddia-chp-8-trouble-with-distributed.html">Murat Demirbas</a>: What can go wrong? Computers can crash. Unfortunately they don't fail cleanly. 
<a href="https://queue.acm.org/detail.cfm?id=3458812">Fail-fast is failing fast!</a> And again unfortunately, partial failures (limping computers) are very difficult to deal with. Even worse, with the transistor density so high, we now need to deal with silent failures. We have memory corruption and even silent faults from CPUs. The <a href="https://muratbuffalo.blogspot.com/2024/09/hpts24-day-1-part-2.html">HPTS'24 session on failures</a> was bleak indeed. That was only the beginning. Then there are network failures. And clock synchronization failures. Did we leave anything? There are <a href="https://muratbuffalo.blogspot.com/2023/09/metastable-failures-in-wild.html">metastable failures that are emergent distributed systems behavior</a>. We can then venture into more malicious cases, like Byzantine failures. If you take into account security/hacking as failures, there are so many more problems to list. There are hurricanes, natural disasters, and even cyberattacks during natural disasters. The failure modes are too many to enumerate. So maybe instead of enumerating and explaining all these in detail, it is better to give the fundamental impossibility results haunting distributed systems: Coordinated Attack Impossibility and the FLP (Fischer-Lynch-Paterson) impossibility.</p></li><li><p><a href="https://benn.substack.com/p/something-lost-something-found">Benn Stancil</a>: The mood in the room&#8212;the music in the air&#8212;however, was relief. Sure, Coalesce is less of the cheery circus it once was, but that was never sustainable anyway. 
Sure, the dbt product itself is slowly <a href="https://www.youtube.com/watch?v=MPklvytosy4">tilting the floor</a><a href="https://benn.substack.com/p/something-lost-something-found#footnote-4-150116950"><sup>4</sup></a> towards a line of cash registers, but dbt Labs was never going to survive by selling a <a href="https://benn.substack.com/p/how-dbt-fails?utm_source=publication-search#:~:text=At%20its%20core%2C%20dbt%20is%20a%20relatively%20thin%20piece%20of%20technology%20(and%20an%20open%2Dsource%20one%20at%20that).">thin web app</a> on top of an approachable Python package. But this version of dbt, one person told me, feels like it can last. The session that they said was most memorable was a panel of dbt execs, not for what they said, but for who they were: <a href="https://www.getdbt.com/about-us">People with long resumes</a> of successfully selling stuff.</p></li><li><p><a href="https://x.com/GergelyOrosz/status/1842137761194639751">Gergely Orosz</a>: Years back, my team was part of an outage where people couldn&#8217;t takes rides for some time (I think 30-60 minutes or so.) It was bad. In the postmortem, we always needed to list the business impact. An Eng1 with an interest in product filled it out: He put in a negative number!! <em>(commenting on)</em>:&nbsp;</p><ul><li><p><a href="https://x.com/kellabyte/status/1841903859465089218">Kelly Sommers</a>: You would be surprised how many orgs are unable to answer a simple question like &#8220;does this product make money?&#8221;. They don&#8217;t have correct things in place to separate out the costs.</p></li></ul></li><li><p><a href="https://x.com/criccomini/status/1631052369948733440">Chris Riccomini</a>: Please hook LLMs up to ops data. </p><ul><li><p>"Which team is costing me the most on my s3 bill?" 
</p></li><li><p>"Which service returned 500 errors between 2am and 2:15 am last night?"</p></li><li><p>"Which service is calling the profile service with missing fields?"</p></li><li><p><a href="https://x.com/themoah/status/1845878444372250941">Aviv Dozorets</a> replied: I want to see more actual data separated from the noise. &#8220;What metrics changed in upstream service before my service crashed?&#8221; &#8220;How crash of service X affected other systems (that I might be unaware of)&#8221;.</p></li></ul></li><li><p><a href="https://muratbuffalo.blogspot.com/2024/10/ddia-chp-7-transactions-part-1.html">Murat Demirbas</a>: As a user of a database you won't need to understand exactly how the database is implemented, but you need to have some mechanical sympathy, dammit.</p></li><li><p><a href="https://news.ycombinator.com/item?id=41765977">chasd00</a>: I toured a data center in Tornado alley back when leasing cages was pretty common. I asked them about disaster planning regarding getting completely wiped off the map and they sorta scoffed at me. Literally two weeks later a tornado missed them by about a 1/4 mile. Would have loved to be a fly on the wall after that.</p></li><li><p><a href="https://x.com/rakyll/status/1843122111289884752">Jaana Dogan</a>: One of the main reasons why there is so much burnout in this space. Everyone is copying everyone, pretending to be ticking all the boxes instead of trying to build a cohesive great product that solves some hard problems in ways others don&#8217;t solve. 
<em>(referring to):</em></p><ul><li><p><a href="https://x.com/_cartermp/status/1843030077719887946">Phillip Carter</a>: one of the things that sucks about the observability space is that it's really hard to properly evaluate all the tools/products, and everyone is incentivized to "observabilitywash" their value props to make it sound like they do everything well, when they absolutely do not</p></li></ul></li><li><p><a href="https://x.com/sspaeti/status/1842308983438152101">Simon Sp&#228;ti</a>: Why is reusability so hard in the data space? That is a fascinating question.</p></li><li><p><a href="https://x.com/fchollet/status/1844759564874387558">Fran&#231;ois Chollet</a>: One more piece of evidence to add to the pile. This was an extremely heretic viewpoint in early 2023, and now it is increasingly becoming self-evident conventional wisdom. <em>(referring to)</em>:</p><ul><li><p>&#8220;Furthermore, we investigate the fragility of mathematical reasoning in these models and demonstrate that their performance significantly deteriorates as the number of clauses in a question increases. 
We hypothesize that this decline is due to the fact that current LLMs are not capable of genuine logical reasoning; instead, they attempt to replicate the reasoning steps observed in their training data.&#8221; <a href="https://arxiv.org/pdf/2410.05229">GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models</a></p></li><li><p>Fran&#231;ois Chollet: My a-priori expectation of ChatGPT is that it will be able to solve a previously seen task but will not be able to adapt to any original task no matter how simple, because its ability to solve problems does not depend on task complexity but task familiarity.</p></li></ul></li><li><p><a href="https://x.com/sh_reya/status/1844120938436284857">Shreya Shankar</a>: the correct way of thinking about the RAG vs long context debate is that RAG is predicate pushdown and predicates should always be pushed down if possible</p></li><li><p><a href="https://x.com/AlexMillerDB/status/1845911766280380685">Alex Miller</a>: I had missed that there was *any* filesystem which supported multi-block atomic writes, but F2FS does, and there's sqlite support for leveraging it <a href="https://oslab.kaist.ac.kr/wp-content/uploads/esos_files/publication/conferences/international/HPCCT2016-108_final.pdf">https://oslab.kaist.ac.kr/wp-content/uploads/esos_files/publication/conferences/international/HPCCT2016-108_final.pdf</a>. See <a href="https://lore.kernel.org/lkml/1411707287-21760-2-git-send-email-jaegeuk@kernel.org/">https://lore.kernel.org/lkml/1411707287-21760-2-git-send-email-jaegeuk@kernel.org/</a> for API.</p></li><li><p><a href="https://x.com/fchollet/status/1842686818472616374">Fran&#231;ois Chollet</a>: It's surprisingly easy to do "hard" things -- for the most part, you need to get started and keep at it</p></li><li><p><a href="https://x.com/AlekseyCharapko/status/1842367018483261718">Aleksey Charapko</a>: It is time for a new list of papers! 
Papers #181-190 in our Distributed Systems Reading group! Follow the link for the schedule and instructions on how to join and participate. <a href="https://charap.co/fall-2024-reading-group-papers-papers-181-190/">https://charap.co/fall-2024-reading-group-papers-papers-181-190/</a></p></li><li><p><a href="https://x.com/YingjunWu/status/1843486482738360581">Yingjun Wu</a>: As a founder, I can confirm that the consensus within the data streaming community favors @ApacheIceberg</p></li><li><p><a href="https://x.com/BonesMoses/status/1845102690121986194">Shaun Thomas</a> (on code comments): "Self describing code" can't cite RFCs, explain an underlying algorithm, or justify project architecture decisions. Comment blocks must mean something more than pure description. If you think comments are unnecessary, you're using them wrong.</p></li><li><p><a href="https://x.com/fchollet/status/1841902521717293273">Fran&#231;ois Chollet</a>: Interesting work on reviving RNNs. <a href="https://arxiv.org/abs/2410.01201">https://arxiv.org/abs/2410.01201</a> -- in general the fact that there are many recent architectures coming from different directions that roughly match Transformers is proof that architectures aren't fundamentally important in the curve-fitting paradigm (aka deep learning). Curve-fitting is about embedding a dataset on a curve. The critical factor is the dataset, not the specific hard-coded bells and whistles that constrain the curve's shape. As long as your curve is sufficiently expressive all architectures will converge to the same performance in the large-data regime.</p></li><li><p><a href="https://x.com/gunnarmorling/status/1842246814252356091">Gunnar Morling</a>: Considering the fact that your application uses a database an implementation detail which could be replaced with something else any time, is inefficient at best, setting up yourself for failure at worst. 
Embrace and take advantage of the tools you're using.</p></li><li><p><a href="https://x.com/fchollet/status/1843057888375550370">Fran&#231;ois Chollet</a>: In order to get high device utilization when training, the most important best practice is to both do data prefetching (moving the next batch of data to GPU memory while the previous batch is being processed) and asynchronous logging (moving the metrics from the previous batch to host memory while the next batch is being processed).</p></li><li><p><a href="https://x.com/lauriewired/status/1840798755102380538">LaurieWired</a>: Your files are dying. That SSD you keep in the closet, the one from your old system "just in case". Yup, degrading as we speak. SSDs are *shockingly* bad at power-off retention, esp. if it's near its endurance rating.</p></li><li><p><a href="https://x.com/lauriewired">LaurieWired</a>: Improving productivity is a scam; our brains are *terrible* at determining what is useful or not. Instead, focus on eliminating small pieces of fake work. Fake work feels productive, has indicators of progress, and gives a similar dopamine hit, but without output.</p></li><li><p><a href="https://x.com/helloiamleonie/status/1841849988898275451">Leonie</a>: Don&#8217;t underestimate the impact of reranking in your RAG pipeline.</p></li><li><p><a href="https://x.com/mattturck/status/1841623384955732189">Matt Turck</a>: Today&#8217;s market, a summary: * $3B: valuation of a pre-IPO software company, because evaluated at public market multiples at 8x $375m revenue. * Also, $3B: valuation of an AI agent thing with barely any revenue because, vibes.</p></li><li><p><a href="https://x.com/sspaeti/status/1843692790011547667">Simon Sp&#228;ti</a>: Many asked how to so-called "break into data engineering".
To be honest, if you just read whitepapers, you could go far.</p></li><li><p><a href="https://news.ycombinator.com/item?id=41765799">numbsafari</a>: "Cloud 3-2-1" to the rescue&#8230; I take the typical formulation (e.g., [1]), and translate it into:</p><ul><li><p>Keep 3 copies of your data: production + 2.</p></li><li><p>Keep 2 snapshots, stored separately.</p></li><li><p>Keep 1 snapshot on-hand (literally in your possession), or with a different provider.</p></li></ul></li><li><p><a href="https://x.com/patio11/status/1843991848215576577">Noah Pepper</a>: Hilarious and poetic irony that regulators are looking at breaking up Google just as technology shifts are causing their search monopoly to melt away organically.</p></li><li><p><a href="https://x.com/emollick/status/1842951218421006340">Ethan Mollick</a>: Our research a year ago found that people stopped fact checking the AI when it got good enough, and just took what it said as right (even if it wasn&#8217;t). I think that line has firmly &amp; permanently been crossed for many text summarization applications.</p></li><li><p><a href="https://x.com/gunnarmorling/status/1846129837850652729">Gunnar Morling</a>: The biggest problem of #Java is poor perception. It's technically super-solid, but too often folks discard it based on misconceptions or information outdated years ago.</p></li><li><p><a href="https://x.com/sspaeti/status/1843690038762123272">Simon Sp&#228;ti</a>: The medallion architecture gets some hype here and there. Still, IMO, it's a revamped architecture from the classical data warehouse layering `stage -&gt; cleansing -&gt; core -&gt; mart` we have done since the inception of modeling data warehouses with simplified names and optimized for data lakes.</p></li><li><p><a href="https://x.com/emollick/status/1842186439380832581">Ethan Mollick</a>: Individuals are seeing big gains from AI.
Organizations less so.</p></li><li><p><a href="https://x.com/Scholars_Stage/status/1841619960910512138">T Greer</a>: the inability of gen z to read makes me think that if I don&#8217;t switch from writing essays to giving YouTube lectures and videos in the next decade I will have consigned myself to irrelevancy.</p></li><li><p><a href="https://x.com/spyced/status/1840936123322773956">Jonathan Ellis</a>: The current state of AI is frustrating but only because we keep getting glimpses of the magic that is possible but not yet reliable.</p></li><li><p><a href="https://x.com/debasishg/status/1842648904384364917">Debasish Ghosh</a>: TIL: Parquet uses Split Bloom Filters for predicate pushdown for high cardinality columns.</p></li><li><p><a href="https://x.com/rahulj51/status/1844234025524920348">Rahul Jain</a>: Interesting how Iceberg features you can use are the venn diagram intersection of: 1. Iceberg 2. choice of processing engine 3. choice of catalog.</p></li><li><p><a href="https://x.com/rahulj51/status/1841708196647325992">Rahul Jain</a>: The best thing about the data engineering subreddit is that it's more practitioners and fewer vendors. Much better place to talk DE than Twitter, imo. And the community is friendlier than&nbsp; /r/programming.</p></li><li><p><a href="https://x.com/neelesh_salian/status/1843580511949873288">Neelesh Salian</a>: At this point, let&#8217;s just accept that Iceberg is the standard format and build on top of it? I don&#8217;t see the value in having multiple formats in an organization unless you have teams who do their own thing and have settled into a format already.</p></li><li><p><a href="https://x.com/emollick/status/1842293103044276391">Ethan Mollick</a>: Among the key questions shaping the AI industry is how long Meta will keep releasing open weights models for. Gen3 (GPT-5 class)? Gen4 (GPT-6 class)?
At some point the logic they have been using might shift in the face of rising risks, costs &amp; opportunities for advantage.</p></li><li><p><a href="https://x.com/GergelyOrosz/status/1843268104341880896">Gergely Orosz:</a> Claiming zero errors and hallucinations for LLMs is the equivalent of claiming 100% uptime for services. Just marketing.</p></li><li><p><a href="https://x.com/CorvusCrypto/status/1844665780522106893">Birdy</a>: You'd think that the majority of data platform engineering is solving tech problems at large scale. Unfortunately it's once again the people problem that's all-consuming.</p></li><li><p><a href="https://x.com/shelajev/status/1843984901554794890">Oleg &#352;elajev</a>: Debezium is like the Observer pattern for your database! It taps into transaction logs and propagates changes to the outside world.</p></li><li><p><a href="https://x.com/gunnarmorling/status/1609958952040599552">Gunnar Morling</a> <em>(reposted from 2023)</em>: One data architecture I expect we'll see more in 2023 is #SQLite/#DuckDB deployed as caches at the edge, updated via change feeds from system-of-record: stellar read performance due to close local proximity to users and fully queryable data models tailored for specific use cases.</p></li><li><p><a href="https://x.com/SeattleDataGuy/status/1843744004724424917">Seattle Data Guy</a>: If you don't set a consistent coding/design standard, people will all create their own inconsistent coding/design standard.</p></li><li><p><a href="https://x.com/allballbearings/status/1843821901669675352">Jay Graves</a>: You can build the most beautiful data visualization layer in the history of history and most users will still ask for an Excel export.</p></li><li><p><a href="https://x.com/YingjunWu/status/1843169478873665572">Yingjun Wu</a>: Small data is the future. 
But that doesn&#8217;t mean big data is dead&#8212;it means the old "big data problems" can now be solved on smaller scales, often with just a few machines or even a single machine. There&#8217;s no longer a strong need to stress test systems at a thousand-node level because, for most real-world workloads, focusing on smaller-scale settings makes more sense. That said, single-node systems aren&#8217;t the universal solution. Systems still need to be distributed to overcome memory limitations, ensure fault tolerance, and maintain high availability. We&#8217;ll continue to see specialized systems emerge for different use cases&#8212;there&#8217;s still no "one size fits all."</p></li><li><p><a href="https://x.com/houlihan_rick/status/1844387127754383399">Rick Houlihan</a>: Every RDBMS backed service I have ever seen is denormalized in some way when it needs to scale out. No matter what database you use, when it comes to performance and efficiency, friends don't let friends join tables.</p></li><li><p><a href="https://x.com/j_foerst/status/1844023863476400440">Jakob Foerster</a>: When I discussed quitting Google to do a PhD, my manager, Steve Cheng, gave me the advice of "6 shots": Doing something meaningful usually takes about 5 years and we are productive for roughly 30 years. That gives you 6 attempts. So pick each one carefully and give it your best.</p></li><li><p><a href="https://x.com/petereliaskraft/status/1845525076801986732">Peter Kraft</a>: &#8220;Everybody knows that you deploy software on servers--physical machines with fixed pools of resources. But what if things were more flexible?&nbsp; What if CPUs and memory were disaggregated so you could allocate them in seconds from a network-attached pool?&#8221;
I really like this paper&#8230;</p><ul><li><p><a href="https://x.com/BonesMoses/status/1845649086776500555">Shaun Thomas</a>: It's an interesting concept, but it requires a complete overhaul of both hardware and OS design, and still must contend with PACELC tradeoffs. You can't defeat the speed of light.</p></li></ul></li><li><p><a href="https://x.com/ylecun/status/1845193021584728365">Yann LeCun</a>: Worth repeating: Do not confuse retrieval with reasoning. Do not confuse rote learning with understanding. Do not confuse accumulated knowledge with intelligence.</p></li><li><p><a href="https://muratbuffalo.blogspot.com/2024/10/srds-day-1.html">Murat Demirbas</a>: Cristina said that adversarial testing using byzantine adversary (for example, adversarial testing for congestion control) is better than straightforward application of fuzzing, which is most commonly used in security/networking conferences. She then introduced their approach to adversarial testing. They use an abstract model to generate abstract strategies, map abstract strategies to concrete strategies, and execute concrete strategies. Formal methods play a crucial role in this process. They help clarify system specifications, make implicit assumptions explicit, and identify flaws through counterexamples.</p></li><li><p><a href="https://www.scylladb.com/2024/10/08/scylladb-and-memcached/">ScyllaDB</a>: Second: performance converges over time. In-memory caches have been (for a long time) regarded as one of the fastest infrastructure components around. Yet, it&#8217;s been a few years now since caching solutions started to look into the realm of flash disks. 
These initiatives obviously pose an interesting question: If an in-memory cache can rely on flash storage, then why can&#8217;t a persistent database also work as a cache?</p></li><li><p><a href="https://towardsdatascience.com/data-architecture-lessons-learned-3589b152a8a6">Bernd Wessely</a>: The definition of the &#8216;data engineering lifecycle&#8217;, as helpful and organizing as it might be, is actually a direct consequence of silo specialization.</p><p>It made us believe that ingestion is the unavoidable first step of working with data, followed by transformation before the final step of data serving concludes the process. It almost seems like everyone accepted this pattern to represent what data engineering is all about.</p></li></ul><h2>Humans With Opinions</h2><p>It&#8217;s easy to talk about how things work, harder to take a reasoned position on a subject.</p><p>These are some opinions I found interesting.</p><h3><a href="https://roundup.getdbt.com/i/150235613/cross-platform-dbt-mesh">Tristan Handy (of dbt) on Cross-Platform dbt Mesh</a></h3><p><em>[snippet]</em></p><p>I fundamentally do not believe we are going to see one, or even two, winners in the data platform space. This is not Windows in the &#8216;90s, or even iOS and Android in 2012: the data platform ecosystem is not a monopoly or a duopoly; at best it is an oligopoly with 6-10 real players. But in reality I think it is better to just think about it as a competitive market.</p><p>This is good for users&#8212;no one needs the Oracle-vs-Microsoft dynamic that existed in 2003 at the start of my career. But it also creates complexity and bifurcation. Because today, different teams that use different data platforms inside the same company typically do not know about or have any access to the data assets that live inside the other platform.
This leads to duplication, inefficiency, and inaccuracy.</p><p>Under the hood, dbt&#8217;s new cross-platform ref capabilities are powered by its support for Iceberg. Iceberg without dbt can be a real pain to use, but I am a huge believer in its ability to move the market in practitioner-favorable ways. I&#8217;m delighted by our ability to abstract away the complexity behind a perfectly dbtonic interface.</p><h3><a href="https://x.com/sriramsubram/status/1840500889473442246">Sriram Subramanian on S3 and the hype train</a></h3><p><em>[whole X post]</em></p><p>There is a general trend created by developer marketing that influences how developers adopt a technology and also affects how others build new technologies.</p><p>A few examples:</p><ol><li><p>Big data is dead; long live small data</p></li><li><p>Any data system not built on S3 will become obsolete</p></li><li><p>All infra providers need to support all deployment options - SaaS, BYOC, embedded.</p></li></ol><p>This is far from reality and the truth is way more nuanced. The unfortunate downside to this is that new infrastructure companies will end up blindly following the trends vs understanding what is needed for their use case and users.</p><p>To make this point clear, consider the deployment case above (option 3). There are many things to consider to decide what deployment options you want to support for your customers -</p><ol><li><p>Can you devise an architecture that unifies all three deployment modes? How much more time will it take to do so well? Can the experience across all deployment modes be exactly the same?</p></li><li><p>What do your target users want? What is their level of trust? Are you really secure by choosing to support the BYOC model?</p></li><li><p>Can you build a data plane that is truly stateless for BYOA? If not, you need full access to your customer&#8217;s account. Can you build an embedded offering for your system? Can it reach feature parity? How do permissions work?
Does it solve any problems for your users? Can you instead dockerize your offering?</p></li><li><p>What is the impact on GTM? How many pricing options are you going to support? Can you enable your sales to explain all the different options clearly to end users? How does it affect the total number of GTM assets to be created? Does your margin profile fundamentally change between them?</p></li><li><p>How does product prioritization work? How does support work if the user needs help on your embedded or BYOC offering?</p></li><li><p>Can you capture a significant market with just one of the deployment options, have more focus, and better execution?</p></li></ol><p>You can possibly do the same for every major architecture decision. Each technology decision has many implications and will depend greatly on the use case and customers you are solving for.&nbsp;</p><p>Build from first principles, focus, strategize, listen to your customers, and execute.</p><h3><a href="https://www.linkedin.com/posts/royhasson_beyond-lakehouse-table-formats-activity-7245260660358492161-9Ikx?utm_source=share&amp;utm_medium=member_desktop">Roy Hasson opines on the planned convergence of Apache Iceberg and Delta Lake</a></h3><p><em>[whole LinkedIn post]</em></p><p>So how will the formats evolve into convergence?&nbsp;</p><p>The ideas they shared highlight the following:</p><p>1. Converging the data format to provide a single, consistent way to write and store physical data. Unifying how columns are represented, how deletes are encoded, data types, etc. This is a massive win!</p><p>2. Although both formats have a lot in common, maintaining separate metadata formats for the time being is preferred to give each community autonomy and ability to innovate for their users. If the data layer is consistent (per #1) both table formats can operate independently without requiring users to duplicate physical data.</p><p>What should users do if they want to store data today?
They recommend, not surprisingly...</p><p>3. Utilize the Unity Catalog as a way to enable readers to translate between formats. So if your engine can only read Iceberg and the data is using Delta format, the catalog can quickly generate the appropriate manifests to make the table readable via Iceberg. Databricks does this today with Unity Catalog and Uniform.</p><p>This is where things fall apart for me...</p><p>With regard to the catalog and format conversion, I agree with the premise; however, I'm hesitant because this feels like a Databricks lock-in to get users into Unity Catalog and Delta. Yes you could read Iceberg with this approach, but not write it. The interoperability is one way - Delta to Iceberg. If I'm not a Databricks customer and prefer Iceberg as my format of choice, this approach will not work for me.</p><p>My expectation is that if #1 becomes a reality (and it should, sooner rather than later), engines' ability to support both Iceberg and Delta equally becomes significantly simpler (popular engines already, for the most part, support both formats). So why do I need the potential lock-in of Uniform and Unity Catalog?</p><p>If you're in the Databricks ecosystem, then Unity Catalog is a great product with governance, security and lots of other goodies. But if you're not, then why force me into this setup?</p><p>Anyway, interesting conversation and insight into how Databricks is thinking about this table format interoperability challenge.</p><h3><a href="https://charity.wtf/2020/11/01/questionable-advice-the-trap-of-the-premature-senior/">Charity Majors on the danger of premature seniority</a></h3><p><em>[snippet]</em></p><p>What you are experiencing now is the alluring comfort of premature seniority.
You&#8217;re the smartest kid in the room, you know every corner of the system inside and out, you win every argument and anticipate every objection and you are part of every decision and you feel so deeply, pleasingly needed by the people around you.</p><p>It&#8217;s a trap.</p><p>Get the fuck out of there.</p><h3><a href="https://www.oneusefulthing.org/p/ai-in-organizations-some-tactics">Ethan Mollick on AI in organizations: Some tactics</a></h3><p><em>[snippet]</em></p><p>Over the past few months, we have gotten increasingly clear evidence of two key points about AI at work:</p><ol><li><p>A large percentage of people are using AI at work. We know this is happening in the EU, <a href="https://bfi.uchicago.edu/insights/the-adoption-of-chatgpt/">where a representative study of knowledge workers in Denmark</a> from January found that 65% of marketers, 64% of journalists, 30% of lawyers, among others, had used AI at work. We also know it from a<a href="https://static1.squarespace.com/static/60832ecef615231cedd30911/t/66f0c3fbabdc0a173e1e697e/1727054844024/BBD_GenAI_NBER_Sept2024.pdf"> new study of American workers</a> in August, where a third of workers had used Generative AI at work in the last week. (ChatGPT is by far the most used tool in that study, followed by Google&#8217;s Gemini)</p></li><li><p>We know that individuals are seeing productivity gains at work for some important tasks. You have almost certainly seen me reference <a href="https://www.oneusefulthing.org/p/centaurs-and-cyborgs-on-the-jagged">our work showing consultants completed 18 different tasks 25% more quickly</a> using GPT-4. But another <a href="https://papers.ssrn.com/sol3/papers.cfm?abstract_id=4945566">new study of actual deployments of the original GitHub Copilot for coding</a> found a 26% improvement in productivity (and this used the now-obsolete GPT-3.5 and is far less advanced than current coding tools). This aligns with self-reported data. 
For example, the Denmark study found that users thought that AI halved their working time for 41% of the tasks they do at work.</p></li></ol><p>Yet, when I talk to leaders and managers about AI use in their company, they often say they see little AI use and few productivity gains outside of narrow permitted use cases. So how do we reconcile these two experiences with the points above?</p><h3><a href="https://streamingdata.substack.com/i/149009237/shifting-left-to-make-it-right">Yaroslav Tkachenko&nbsp; on &#8220;Shifting left to make it right&#8221;</a></h3><p><em>[snippet]</em></p><p>&#8220;Shifting left&#8221; was mentioned during the keynotes five times (I counted). I also heard it in the hallways a lot. In case you don&#8217;t know, in the data platform context, shifting left means working more closely with operational / application development teams. For example, it means shared ownership over data products or data pipelines with the goal of stopping data artifacts from being treated as a second-class citizen. Data Mesh architecture is one of the ways to implement this principle.</p><p>It&#8217;s quite refreshing to hear this not just from consultants or vendors but large enterprises as well. I suspect that execs have finally started to understand the importance of high quality data. If you want to build actually useful user-facing &#8220;AI&#8221; products, you can&#8217;t do it without clean and fresh datasets. And yet, most of the enterprises still struggle with basic BI projects&#8230;</p><h3>Bernd Wessely on <a href="https://towardsdatascience.com/data-architecture-lessons-learned-3589b152a8a6">Data Architecture: Lessons Learned</a></h3><p>After we have built all too many brittle data pipelines, it&#8217;s time for data engineers to acknowledge that fundamental software engineering principles are just as crucial for data engineering. 
Since data engineering is essentially a form of software engineering, it makes sense that foundational practices such as CI/CD, agile development practices, clean coding using version control, Test-Driven Development (TDD), modularized architectures, and considering security aspects early in the development cycle should also be applied in data engineering.</p><p>But the narrow focus within an engineering discipline often leads to a kind of intellectual and organizational isolation, where the greater commonalities and interdisciplinary synergies are no longer recognized. This has led to the formation of the &#8216;data engineering silo&#8217; in which not only knowledge and resources, but also concepts and ways of thinking were isolated from the software engineering discipline. Collaboration and understanding between these disciplines became more difficult. I think this undesirable situation needs to be corrected as quickly as possible.</p><p>Unfortunately, the very same silo thinking seems to start with the hype around artificial intelligence (AI) and its sub-discipline machine learning (ML).
ML engineering is about to create the next big silo.</p><h2>Helpful Humans</h2><ul><li><p>Murat Demirbas explains <a href="https://muratbuffalo.blogspot.com/2024/10/ddia-chp-7-transactions-part-1.html">different transaction isolation levels</a> [<em>advanced level warning</em>].</p></li><li><p>Alex Merced</p><ul><li><p>On implementing <a href="https://amdatalakehouse.substack.com/p/change-data-capture-cdc-when-there">Change Data Capture (CDC) when there is no CDC</a></p></li><li><p>Publishes the <a href="https://amdatalakehouse.substack.com/p/ultimate-directory-of-apache-iceberg">Ultimate Directory of Apache Iceberg Resources</a></p></li></ul></li><li><p>Andy Grove on<a href="https://www.youtube.com/watch?v=o59s0d3HE1k"> accelerating Apache Spark using Apache DataFusion Comet</a> [video].</p></li><li><p>Medium Engineering<a href="https://medium.engineering/learnings-from-optimising-22-of-our-most-expensive-snowflake-pipelines-5ea6fcf57356"> shares some useful tips</a> on reducing the Snowflake bill</p></li><li><p>Alex Miller on <a href="https://transactional.blog/blog/2024-erasure-coding">Erasure Coding for Distributed Systems</a></p></li><li><p>Daniel Beach</p><ul><li><p>Gets <a href="https://dataengineeringcentral.substack.com/p/daft-vs-spark-databricks-for-delta">Daft running in Databricks</a>.</p></li><li><p>Discussed <a href="https://dataengineeringcentral.substack.com/p/should-you-use-duckdb-or-polars">Should you use DuckDB or Polars?</a></p></li></ul></li></ul><h2>Published Humans</h2><p>Two papers, two query engines, same subject: Making query optimizers that produce better query plans in the face of stale or incomplete data statistics.</p><ul><li><p><a href="https://www.vldb.org/pvldb/vol17/p4077-shankhdhar.pdf">Presto&#8217;s History-based Query Optimizer</a></p><ul><li><p>An important component of every query engine is its query optimizer. 
This is the part of the system responsible for taking the input query tree (typically an abstract query tree produced by the parser/analyzer) and converting it into an efficient execution plan. As the complexity of queries grows, so does the search space of possible plans, and having a good query optimizer becomes critical for navigating that search space and producing an efficient execution plan. Today, most enterprise-grade query optimizers are cost-based [7, 11, 15, 26, 28, 30, 31, 37], meaning they use a costing function to predict how computationally expensive a query plan is and select the one with the lowest cost estimate for execution. The costing module typically uses knowledge of data statistics and computation cost to compare different query plans and guide the optimizer into selecting the best query plan. This module often relies heavily on estimated data distribution and cardinalities.</p></li><li><p>First, it requires data to be analyzed before it can be queried. In addition, cardinality estimators make a number of simplifying assumptions such as data uniformity, independence of filters and columns, etc. They are often incapable of estimating selectivity of complex expressions, such as conditional expressions, function calls, and multi-key aggregations. There have been attempts to store more complex statistics such as multi-column and join histograms, but those require additional time and space to compute, and are often non-trivial to work with. As a result, it is not surprising that even industry-strength cardinality estimators routinely produce large errors in estimation.</p></li><li><p>To overcome the challenges presented above, in this paper we present Presto&#8217;s history-based query optimizer (HBO) that has been used in production for several years at several large data infrastructure groups including those of Meta and Uber.
In a nutshell, HBO tracks query execution statistics at the operator node, and uses those to predict future performance for similar queries.</p></li></ul></li><li><p>(Databricks&#8217; Photon engine) <a href="https://www.vldb.org/pvldb/vol17/p3947-bu.pdf">Adaptive and Robust Query Execution for Lakehouses at Scale</a></p><ul><li><p>Firstly, in large-scale, open Lakehouses with uncurated data, high ingestion rates, external tables, or deeply nested schemas, it is often costly or wasteful to maintain perfect and up-to-date table and column statistics. Secondly, inherently imperfect cardinality estimates with conjunctive predicates, joins and user-defined functions can lead to bad query plans. Thirdly, for the sheer magnitude of data involved, strictly relying on static query plan decisions can result in performance and stability issues such as excessive data movement, substantial disk spillage, or high memory pressure.</p></li><li><p>The paper discusses some of the reasons for poor statistics or no statistics at all (which necessitates a query optimizer not based on statistics alone):</p><ul><li><p>Supporting raw, uncurated data (lacking statistics).</p></li><li><p>Supporting external tables (lacking statistics).</p></li><li><p>Supporting deeply nested data (lacking statistics).</p></li><li><p>Supporting rapidly evolving data and workloads (stale statistics and volatile histories).</p></li><li><p>Supporting UDFs (lacking information for cardinality estimation).</p></li><li><p>Supporting diverse workloads (amplifying bad plans).</p></li></ul></li><li><p>To address these challenges, we built an adaptive query execution (AQE) framework.
The key idea is to collect statistics during query execution from task metrics of completed and ongoing query plan fragments, and subsequently re-optimize unfinished execution plan fragments into better ones based on these runtime statistics.</p></li></ul></li></ul><h2>My writing</h2><p>I&#8217;ve been writing about the table formats for months now. My latest post questions the need for table format interoperability in the long term.</p><h3>A snippet from <a href="https://jack-vanlightly.com/blog/2024/9/26/table-format-interoperability-future-or-fantasy">Table format interoperability, future or fantasy?</a></h3><p><em>[snippet]</em></p><p>The third alternative is to align the table formats at the data layer so that cross-publishing can utilize the vast majority of features, support merge-on-read without rewriting delete/DV files, and so on. If cross-publishing table formats ever really works well, it will be because the remaining table formats will have standardized some things, like partitioning, clustering, delete files and so on. There is also the potential for common standards for things like secondary indexing. This is similar to the standardized protocols that sit above TCP and UDP like DNS or BGP, supporting interoperability and core workflows, but currently there is no standardization mechanism like RFCs for the open table formats&#8217; data layer.</p><p>But if all that did happen, why have a bunch of competing formats at all?</p><p></p><div><hr></div><p>Let me know of interesting humans at feedback@hotds.dev </p><div class="subscription-widget-wrap-editor" data-attrs="{&quot;url&quot;:&quot;https://www.hotds.dev/subscribe?&quot;,&quot;text&quot;:&quot;Subscribe&quot;,&quot;language&quot;:&quot;en&quot;}" data-component-name="SubscribeWidgetToDOM"><div class="subscription-widget show-subscribe"><div class="preamble"><p class="cta-caption">Thanks for reading Humans of the Data Sphere!
Subscribe for free to receive new posts and support my work.</p></div><form class="subscription-widget-subscribe"><input type="email" class="email-input" name="email" placeholder="Type your email&#8230;" tabindex="-1"><input type="submit" class="button primary" value="Subscribe"><div class="fake-input-wrapper"><div class="fake-input"></div><div class="fake-button"></div></div></form></div></div>]]></content:encoded></item><item><title><![CDATA[Humans of the Data Sphere - Issue #0]]></title><description><![CDATA[Humans of the Data Sphere (HOTDS) is a publication that aims to bring together the words of humans from across the data landscape.]]></description><link>https://www.hotds.dev/p/humans-of-the-data-sphere-issue-0</link><guid isPermaLink="false">https://www.hotds.dev/p/humans-of-the-data-sphere-issue-0</guid><dc:creator><![CDATA[Jack Vanlightly]]></dc:creator><pubDate>Mon, 14 Oct 2024 11:04:12 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/a3b05ae2-a972-4e7d-9a3f-d85819aff96f_1024x1024.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Humans of the Data Sphere (HOTDS) is a publication that aims to bring together the words of humans from across the data landscape. This is broader than just the &#8220;data space&#8221;; it encompasses all kinds of databases, messaging/streaming, OLAP, (cloud) data warehouses, lakehouses, distributed systems, and to some extent AI and ML. It&#8217;s narrow enough to be cohesive, but broad enough for readers to be exposed to new ideas and new people that might sit just outside their normal career focus.</p><p>This format is heavily inspired by <a href="https://highscalability.com/">highscalability.com</a>&#8217;s &#8220;Stuff the internet says on scalability&#8221; posts.
While I am using a similar format as a launchpad, it&#8217;s still too early to know what HOTDS will be.</p><p>I&#8217;m happy to receive suggested quotes, articles and papers from others, provided you give me a quotable section to include in a HOTDS issue. Also feel free to suggest interesting people to follow.</p><p>If you&#8217;re wondering who I am, go over to <a href="https://jack-vanlightly.com">https://jack-vanlightly.com</a> to see my writings on distributed data systems, formal verification, as well as strategy and commentary posts.</p><p></p>]]></content:encoded></item></channel></rss>