Humans of the Data Sphere Issue #4 November 22nd 2024
Your biweekly dose of insights, observations, commentary and opinions from interesting people in the world of databases, AI, streaming, distributed systems and the data engineering/analytics space.
Welcome to Humans of the Data Sphere issue #4!
Best meme (referring to the idea of storing SQLite databases in a Postgres database).
Quotable Humans
Allen Downey: In August I had the pleasure of presenting a talk at posit::conf, called A Future of Data Science, in which I assert that data science exists because statistics missed the boat on computation.
Eugene Yan (and blog post): Here are 39 lessons I took away from conferences this year. [a selection follows]
4. Machine learning involves trade-offs. Recall vs. precision. Explore vs. exploit. Relevance vs. diversity vs. serendipity. Accuracy vs. speed vs. cost. The challenge is figuring out the right balance for your user experience.
5. Set realistic expectations. Most problems have a ceiling on what can be achieved, especially those that involve predicting the behavior of unpredictable humans (e.g., search, recommendations, fraud). It may not make sense to aim beyond the ceiling, unless you’re doing core research.
7. Evals are a differentiator and moat. Over the past two years, teams with solid evals have been able to continuously ship reliable, delightful experiences. No one regrets investing in a robust evaluation framework.
9. Brandolini’s law: The energy needed to refute bullshit is an order of magnitude larger than needed to produce it. The same applies to using LLMs. Generating ~~slop~~ content is easy relative to evaluating and guardrailing the defects. But the latter is how we earn—and keep—customer trust.
12. Build with an eye toward the future. Flexibility beats specialization in the long run. Remember The Bitter Lesson. An LLM that’s a bit worse now will likely outperform a custom finetune later, especially as LLMs get cheaper (two orders of magnitude in 18 months!), faster, and more capable.
Jenna Jordan: Sometimes I feel like my superpower has become just knowing the right people to put in a room together and then getting them to talk together about stuff. I can debug things best by having a mental model of the socio-technical system - the code and the people who have tacit domain knowledge.
Alex Miller: How did we all end up omitting the "Tree" and just saying "LSMs"? Without it, we're just saying "log-structured merges have great write amplification", as if nouns just don't matter anymore. No one calls a B-tree a B.
shachaf: I think this is the answer -- there's no actual tree anywhere, unless you consider all binary searches to be implicit trees. I don't think the level structure itself is very tree-like. (And the basic LSM idea, a log compacted into mergeable snapshots, works with even less tree-like structures.)
Jaana Dogan: (A popular) unpopular opinion: When you design by committee, you are mostly focusing on designing pieces that are narrow and poorly put together to make the individuals in the committee happy, rather than building anything that is cohesive and representative of the broader problem space.
Siddharth Goel: The best idea should win. I believe committees are important since you would want the feedback rather than building something in a silo. However, if it comes down to “happiness” then I sense politics rather than meritocracy.
Jaana Dogan: No one gets promoted for removing chaos or saying no to prematurely built artifacts that are guaranteed to drag the company down. Until this culture changes, I don't expect anything to change.
Sung Kim: Oh wow! A surprising result from Databricks when measuring embeddings and rerankers on internal evals.
1- Reranking few docs improves recall (expected).
2- Reranking many docs degrades quality (!).
3- Reranking too many documents is quite often worse than using the embedding model alone (!!).
Paper: Drowning in Documents: Consequences of Scaling Reranker Inference (arxiv.org/abs/2411.11767)
Apurva Mehta kicked off a long thread about what stream processing is, with this post: I'm finding two divergent interpretations of 'stream processing'. For some, it's primarily 'better' data processing, e.g. streaming is better than batch.
For others, it's a way to build applications and experiences that wouldn't be possible in a non-event-driven fashion. [some interesting replies, see the thread for better context]
Micah Wylde: I agree with this distinction—building real-time datasets vs. building event-driven applications. The former leans towards "stream processors" like Flink or Arroyo which are complete systems that host your code. The latter towards libraries like KStreams/Restate that are embedded in your code.
Chris Riccomini: Makes me think of push vs. pull. When something needs to happen based on an event, push (reactive) is required. OTOH, when you want to read stuff, pull (incremental materialized views) works fine.
Nico Ritschel: The Iceberg-on-pg-via-duckdb space is heating up. With write (& compaction) support, too!
Gunnar Morling: It always makes me sad a little when seeing legacy Java API types like java.util.Date being used in newly written code. Is there any prescriptive build-time checker people are using to flag usage of not-quite-officially-deprecated-yet-non-recommended types like this?
Simon Späti: Data modeling is one of the most essential tasks when starting a data project. But why don't we take more care of it? Why do we define the same metrics differently across departments? Why do we keep reinventing data models with each new tool we adopt?
Andy Pavlo: Correction: @glaredb.bsky.social is moving *away* from DataFusion! Their talk discusses the problems with building a DBMS using off-the-shelf parts. Like @duckdb.org, the new GlareDB rewrite borrows ideas from the Germans' HyPer system but it's written in Rust. YouTube video.
Abbey Perini: Another day, another developer is convinced they can find a tech solution to a human problem.
Lalith Suresh: The single-most valuable tool when doing latency measurements is an ECDF (empirical cumulative distribution function). Collect every single latency sample, plot the ECDF to look at the *entire distribution*, and resist the temptation to compute summaries until you can explain the ECDF.
Why? Summaries (avg, median, ranges, etc.) routinely mask pathological behaviors (tail latencies, skews, step behaviors, etc.). All these problems show up visibly in an ECDF.
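[editor's note: a minimal sketch of what this looks like in Python; the synthetic latency data and library choices are illustrative, not from Lalith's setup]

```python
import numpy as np
import matplotlib.pyplot as plt

# Illustrative stand-in for real measurements: every raw latency sample, in ms.
rng = np.random.default_rng(42)
latencies_ms = rng.lognormal(mean=2.0, sigma=0.7, size=10_000)

# ECDF: sort the samples; the y-value at the i-th point is the fraction of
# samples less than or equal to it. No binning, no summarization.
xs = np.sort(latencies_ms)
ys = np.arange(1, len(xs) + 1) / len(xs)

plt.step(xs, ys, where="post")
plt.xscale("log")  # a log x-axis makes tail behavior easier to see
plt.xlabel("latency (ms)")
plt.ylabel("fraction of requests")
plt.title("Latency ECDF")
plt.show()
```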
Chris Riccomini: Maybe the driver for data lakehouse adoption is actually B2B SaaS integration (prequel.co). Even if you've got Snowflake/BQ, if all your SaaS vendors are exposing data through Iceberg/Parquet, you're gonna adopt that. Once adopted, might as well go all in.
Paul Dix: So I have this theory that DataFusion, despite being a SQL engine, will actually enable a new breed of data systems to create non-SQL languages for working with data.
François Chollet: Asking lots of "dumb" questions isn't a sign of stupidity. If anything it is more likely to be the sign of a person who is very strict about always keeping a crystal clear mental model of the topic at hand -- the questions are just a symptom of their perpetual model refinement process. Mildly correlates with intelligence, and highly correlates with competence.
Alex Edmans: I asked my guest speaker the one thing a young person starting a career can do to stand out. He said "inspire trust". It's not about showing how clever you are, but that you're reliable, motivated, truthful, and ethical. I love that advice, and it's relevant for all seniorities.
Debasish Ghosh: Good refactoring vs Bad refactoring - https://www.builder.io/blog/good-vs-bad-refactoring
- Precondition for good refactoring: understand the context and the code
- Postcondition for good refactoring: maintain refactoring invariants for unit and integration tests (https://buttondown.com/hillelwayne/archive/refactoring-invariants)
Rahul Jain: A good tip for data engineering architecture is to treat a "Table" as a logical entity and not tie it with the way it is materialized or mapped to other subsystems. A table can be anything - a set of files, a view, the result of a query. Its definition and physical manifestation should be decoupled.
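[editor's note: a hypothetical sketch of that decoupling; the Table protocol and class names below are made up for illustration, not from any particular system]

```python
from typing import Callable, Iterator, Protocol

Row = dict  # illustrative: a row as a column-name -> value mapping

class Table(Protocol):
    """The logical entity: consumers depend only on this interface."""
    def scan(self) -> Iterator[Row]: ...

class InMemoryTable:
    """One physical manifestation: rows held in memory (could be files instead)."""
    def __init__(self, rows: list[Row]):
        self.rows = rows
    def scan(self) -> Iterator[Row]:
        return iter(self.rows)

class FilteredView:
    """Another manifestation: the result of a query over another Table."""
    def __init__(self, source: Table, predicate: Callable[[Row], bool]):
        self.source, self.predicate = source, predicate
    def scan(self) -> Iterator[Row]:
        return (row for row in self.source.scan() if self.predicate(row))

# A downstream consumer only sees "a Table"; swapping the physical
# manifestation requires no change on this side.
orders: Table = FilteredView(
    InMemoryTable([{"id": 1, "amount": 10}, {"id": 2, "amount": 99}]),
    predicate=lambda row: row["amount"] > 50,
)
print(list(orders.scan()))  # [{'id': 2, 'amount': 99}]
```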
Chris Riccomini: This does make me wonder: are we on the cusp of many, many IVM implementations? If so, gives me strong vector search vibes. Useful, but, "is this a feature or a product," uneasiness. [editor’s note: IVM stands for Incremental View Maintenance]
Gwen Shapira: Vector search is a wonderful Postgres feature.
On the other hand, JSON is also a wonderful Postgres feature and MongoDB seems... fine?
Gunnar Morling: Yeah, I've always felt once pg_ivm is in a usable state, it will render many external solutions mostly unnecessary.
Chris Riccomini: Found this Benn Stancil post. It's wrestling with something that I've felt for a while. Creating an "analytics engineer" was a mistake. AE is too narrow a role. They're getting squeezed on budget 'cuz their cost > value. AEs are flailing to find something... reverse ETL, data products, semantic layers.
Alessandro Martinello: Because I thought my main product was data, and the slides were just a wrapper for that. Now, I believe that to make change happen, the insights are my main product.
Katie Bauer: Product managers and designers get credit for all kinds of things that they didn't literally do, because they rightfully recognize that they're part of the team that built something. This is no different from the role a data professional plays.
JD Long: In discussions of data engineering (DE) there’s huge focus on DE feeding data science and analytics. But rarely is there discussion around DE feeding finops (accounting & finance).
Joe Reis: I see some people confusing the Medallion architecture (bronze, silver, gold) with data modeling. This is a workflow that might facilitate data modeling, but it’s NOT a data model.
Via Joe Reis:
Peopleware: Quality, far beyond that required by the end user, is a means to higher productivity.
Peopleware: A policy of “Quality - If Time Permits” will assure that no quality at all sneaks into the product.
Ethan Mollick: There is a lot of energy going into fine-tuning models, but specialized medical AI models lost to their general versions 38% of the time and won only 12%. Before spending millions on specialized training, it might be worth exploring what base models can do with well-designed prompts.
Ethan Mollick: Paper shows that AI (in this case a diffusion model) accelerates innovation. Among key findings: 1) GenAI increases novel discoveries: a 39% increase in patent filings! 2) It boosts the best performers by acting as a co-intelligence 3) It takes away some of the fun parts of work
LaurieWired: CPU % usage is really complicated. On Apple Silicon, you could use as little as 27% of the CPU's maximum frequency, yet Activity Monitor will show 100% usage of the core. Why? It all has to do with active residency.
One of the coolest options on linux is PSI, or "Pressure Stall Information". PSI focuses on task delays, instead of traditional CPU metrics. With PSI, you can pinpoint whether CPU, memory, or I/O pressure is causing application slowdowns. The docs on http://kernel.org have a great overview of how this measurement works: https://docs.kernel.org/accounting/psi.html. In any case, it's important to keep in mind that different operating systems measure CPU usage differently. A process may be slowing down for reasons you may not expect!
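[editor's note: a small sketch of reading PSI metrics in Python; it assumes a Linux kernel built with PSI support (roughly 4.20+), and the parsing helper is mine]

```python
# Parse Linux PSI ("Pressure Stall Information") metrics from /proc/pressure.
# Each file has a 'some' line (share of time at least one task was stalled on
# the resource) and, for memory/io (and newer kernels, cpu), a 'full' line
# (share of time all non-idle tasks were stalled at once).

def read_psi(resource: str) -> dict:
    """Return PSI averages for 'cpu', 'memory', or 'io'."""
    metrics = {}
    with open(f"/proc/pressure/{resource}") as f:
        for line in f:
            kind, *fields = line.split()  # kind is 'some' or 'full'
            metrics[kind] = {
                k: float(v) for k, v in (field.split("=") for field in fields)
            }
    return metrics

for resource in ("cpu", "memory", "io"):
    psi = read_psi(resource)
    print(resource, psi.get("some", {}).get("avg10"), "% stalled (10s avg)")
```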
Murat Demirbas: DBSP is a simplified version of differential dataflow. I talked about differential dataflow in 2017 here. Differential dataflow represents time as a lattice; DBSP represents it as a single array. In other words, in DBSP time is consecutive and each state has a unique predecessor. In differential dataflow, time is defined as a partial order to capture causality and concurrency. That is a better model for distributed systems, but it introduces a lot of complexity. DBSP makes a big simplification when it assumes linear synchronous time, which probably opens up issues related to managing out-of-order inputs, incorporating data from multiple streams, and tracking progress.
But the simplification buys us powerful properties. The chain rule below emerges as a particularly powerful composition technique for incremental computation. If you apply two queries in sequence and you want to incrementalize that composition, it's enough to incrementalize each query independently. This is big. This allows independent optimization of query components, enabling efficient and modular query processing.
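[editor's note: a toy sketch of the chain rule using Python Counters as Z-sets (weighted multisets); the two example queries here are linear, so each one's incremental version is just itself applied to the delta. The names are illustrative, not Feldera's API]

```python
from collections import Counter

# A Z-set: a multiset with signed weights; positive = insert, negative = delete.
ZSet = Counter

def q1(zset: ZSet) -> ZSet:  # filter: keep even values
    return ZSet({x: w for x, w in zset.items() if x % 2 == 0})

def q2(zset: ZSet) -> ZSet:  # map: double each value
    out = ZSet()
    for x, w in zset.items():
        out[x * 2] += w
    return out

state = ZSet({2: 1, 3: 1})
view = q2(q1(state))          # materialized output of the composed query: {4: 1}

# Chain rule: to incrementalize q2 . q1, compose the incremental versions.
# For linear operators like these, the incremental version is the query itself.
delta = ZSet({4: 1, 2: -1})   # insert 4, delete 2
view.update(q2(q1(delta)))    # apply only the change, no recomputation
view += ZSet()                # drop zero-weight entries
print(view)                   # Counter({8: 1}) — identical to a full recompute
```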
Nisan Haramati: but at the very core, the real physical revolution is the ability to perform huge matrix computations at scales beyond the capacity of any single computer, and the mind boggling rate at which this capacity has been increasing over the last few years. What NVIDIA is selling is time compression. You can now execute 10,000 linear compute years of training (if they were run on one single-threaded CPU) in less than a day. … I think they (LLMs) are the first example of a generalizable application utilizing this technology we now have access to.
Nisan Haramati: In his Universal Scalability Law (USL), Neil Gunther describes four phases, which can be labelled as follows (a sketch of the USL curve follows the list):
1. Roughly Linear - early scaling, no/low contention
2. Slowing down - increasing contention reduces efficiency
3. Plateau - the cost of contention rises to the point of "eating up" the value added by new resources
4. Negative returns - where the operational cost of new resources actually diminishes the overall system capacity
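[editor's note: the USL models relative capacity as C(N) = N / (1 + σ(N−1) + κN(N−1)), where σ captures contention and κ coherency (crosstalk) costs. The sketch below uses made-up coefficients; all four phases appear as N grows]

```python
# Gunther's USL: relative capacity C(N) for N processors/nodes,
# with sigma = contention and kappa = coherency (crosstalk) coefficients.
def usl_capacity(n: int, sigma: float, kappa: float) -> float:
    return n / (1 + sigma * (n - 1) + kappa * n * (n - 1))

# Illustrative coefficients: near-linear at first, then slowing,
# then a plateau around N~32, then negative returns beyond it.
sigma, kappa = 0.05, 0.001
for n in (1, 4, 16, 32, 64, 128):
    print(f"N={n:4d}  capacity={usl_capacity(n, sigma, kappa):7.2f}")
```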
David Asboth: We shouldn't be afraid to question basic assumptions; in fact, progress often lies there. Think of it like a reverse impostor syndrome: by freeing yourself from the weight of knowing everything, you boldly go into a meeting as someone who is ignorant but eager to learn.
Alex Miller: [wrote about hardware advances and database architectures, a couple of quotes follow, but really the whole thing is pure gold (if a little depressing considering the realities of the cloud discussed at the end)]
Marc Richards put together a nice Linux Kernel vs DPDK benchmark that ends with DPDK offering a 50% increase in throughput, followed by an enumeration of the slew of drawbacks one accepts to gain that 50%. It seems to be a tradeoff most databases aren’t interested in, and even ScyllaDB has mostly dropped its investment into it.
Similar to SMR, this reduces the cost of a ZNS SSD as compared to a "regular" SSD, but there’s an additional focus on application-driven[4] garbage collection being more efficient, thus decreasing total write amplification and increasing drive lifetime. Consider LSMs on SSDs, which already operate via incremental appending and large erase blocks. Removing the FTL between an LSM and the SSD opens opportunity for optimizations. More recently, Google and Meta have collaborated on a proposal for Flexible Data Placement (FDP), which acts as more of a hint for grouping writes with related lifetimes than strictly enforcing the partitioning as ZNS does. The goal is to enable an easier upgrade path where an SSD could ignore the FDP part of the write request and still be semantically correct, just have worse performance or write amplification.
Michael Youssef, byzheyi Y and Daniel Cheng: we present our Stateful Workload Operator, an alternative model to the traditional approach: all stateful applications now share a common operator with a single custom resource, while application-specific customizations are handled by pluggable external policy engines. At LinkedIn, we've inverted the traditional stateful application operator model, providing application owners with a generic building block and a centralized point to manage storage, external integrations, tooling, and other features.
The benefits of using StatefulSet were minimal, and it would have only reduced a small portion of the overall code complexity versus developing our own pod management. By developing our own custom resources, we were able to overcome these challenges, achieve the flexibility we needed, and align more closely with our specific requirements.
Erik Bernhardsson: [shares his thoughts on serverless inference]
So why is most GPU demand driven by training even though inference is where you make the money? …for the economics of this to make sense eventually, we need to see a much larger % of GPU spend going towards inference.
Pooling lots of users into the same underlying pool of compute can improve utilization drastically. It reduces the amount of capacity that has to be reserved in aggregate. Instead of provisioning for the sum of the peaks, you can provision for the peak of the sum. This is a much smaller number!
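[editor's note: a quick toy demonstration of that claim, with made-up numbers for 50 bursty tenants]

```python
import random

# 50 tenants, each idle most of the time and occasionally bursting to 10 units.
# Provisioning each tenant for its own peak (sum of peaks) costs far more than
# provisioning a shared pool for the peak of the combined demand (peak of sum).
random.seed(0)
tenants = [[random.choice([0, 0, 0, 10]) for _ in range(1000)] for _ in range(50)]

sum_of_peaks = sum(max(demand) for demand in tenants)
peak_of_sum = max(map(sum, zip(*tenants)))  # combined demand per time step

print(f"sum of peaks: {sum_of_peaks}")  # 50 tenants x 10 = 500
print(f"peak of sum:  {peak_of_sum}")   # far less, since bursts rarely align
```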
Fast initialization of models is a hard problem. A typical workload needs to fire up a Python interpreter with a lot of modules, and load gigabytes of model weights onto the GPU. Doing this fast (as in, seconds or less) takes a lot of low-level work.
Published Humans
Hype, Sustainability, and the Price of the Bigger-is-Better Paradigm in AI (paper): With the growing attention and investment in recent AI approaches such as large language models, the narrative that the larger the AI system, the more valuable, powerful and interesting it is, is increasingly seen as common sense. But what is this assumption based on, and how are we measuring value, power, and performance? And what are the collateral consequences of this race to ever-increasing scale? Here, we scrutinize the current scaling trends and trade-offs across multiple axes and refute two common assumptions underlying the ‘bigger-is-better’ AI paradigm:
1) that improved performance is a product of increased scale, and
2) that all interesting problems addressed by AI require large-scale models. Rather, we argue that this approach is not only fragile scientifically, but comes with undesirable consequences.
First, it is not sustainable, as its compute demands increase faster than model performance, leading to unreasonable economic requirements and a disproportionate environmental footprint.
Second, it implies focusing on certain problems at the expense of others, leaving aside important applications, e.g. health, education, or the climate.
Finally, it exacerbates a concentration of power, which centralizes decision-making in the hands of a few actors while threatening to disempower others in the context of shaping both AI research and its applications throughout society.
Interesting topic #1: Using modeling and simulation in distributed systems
Forgive me for including a topic I just wrote about, but literally on the same day that I blogged about using simulation to understand properties and characteristics of complex systems, Datadog engineers (Arun Parthiban, Sesh Nalla, Cecilia Wat-Kim) released a post on the same topic.
The post explores the complexities of building distributed systems and how traditional testing methods often fall short at the system level. It describes how the team combined formal modeling, lightweight simulations, and chaos testing to analyze the design, expected performance, and real-world behavior of Courier, their massive-scale queueing system. It’s an interesting read: How we use formal modeling, lightweight simulations, and chaos testing to design reliable distributed systems
For my part, I finally wrote about my own use of modeling and simulation over my career in distributed data systems in two blog posts.
Obtaining statistical properties through modeling and simulation
The Law of Large Numbers: A Foundation for Statistical Modeling in Distributed Systems
I would also be remiss if I didn’t include blog posts by Marc Brooker (distinguished engineer at AWS). Marc uses simulation regularly in blog posts to explore algorithms, and this blog post directly advocates for use of simulation: Simple Simulations for System Builders.
If anyone knows of other accessible material on real-world usage of simulation in engineering, then please let me know.
Interesting topic #2: Incremental View Maintenance (IVM)
A database view is basically a stored query that can be queried by name (typically read-only). A materialized view physically stores the result of that query rather than re-running it on every read. Incremental View Maintenance (IVM) is the process of keeping a materialized view up to date efficiently by applying only the changes (inserts, updates, or deletes) made to the underlying data, rather than recomputing the entire view from scratch. IVM poses a number of challenges, such as efficiently (and consistently) handling complex queries with joins, aggregates, nested queries, correlated subqueries, and so on.
Materialized views are typically used to reduce the cost of reads that need to run frequently (such as dashboards, user-facing analytics, etc.). We trade some write overhead for read performance. The benefit to the reads should, on balance, make up for the additional write cost, just as is the case with database indexes.
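To make this concrete, here is a minimal sketch of incrementally maintaining a per-group sum view from a change stream; the data and names are purely illustrative:

```python
from collections import defaultdict

base_table = [("eu", 10), ("us", 20), ("eu", 5)]

# Initial (full) computation of the materialized view:
# SELECT region, SUM(amount) ... GROUP BY region.
view = defaultdict(int)
for region, amount in base_table:
    view[region] += amount

# Incremental maintenance: apply only the deltas from a change stream,
# instead of recomputing the view over the whole base table.
changes = [("insert", "us", 7), ("delete", "eu", 5)]
for op, region, amount in changes:
    view[region] += amount if op == "insert" else -amount

print(dict(view))  # {'eu': 10, 'us': 27} — matches a full recompute
```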
With that preamble out of the way, I have a number of articles and papers that cover the topic of IVM:
Differential dataflow (Materialize being the prominent vendor built on this approach)
DBSP (Feldera being the prominent vendor built on this approach)
pg_ivm (A Postgres extension for IVM)
Snowflake
Snowflake’s paper, What’s the Difference? Incremental Processing with Change Queries in Snowflake, doesn’t actually discuss IVM itself, but does explore the topic of materializing changes to a table as a stream. I have also performed an analysis of change streams in the open table formats.