Humans of the Data Sphere Issue #7 January 29th 2025
Your biweekly dose of insights, observations, commentary and opinions from interesting people across the world of databases, AI, streaming, distributed systems and the data engineering/analytics space.
Welcome to Humans of the Data Sphere issue #7!
First, the meme of the issue:
Quotable Humans
First up. The world reacted to DeepSeek:
Andrej Karpathy: I will say that Deep Learning has a legendary ravenous appetite for compute, like no other algorithm that has ever been developed in AI.
…
Last thought. Not sure if this is obvious. There are two major types of learning, in both children and in deep learning. There is 1) imitation learning (watch and repeat, i.e. pretraining, supervised finetuning), and 2) trial-and-error learning (reinforcement learning). My favorite simple example is AlphaGo - 1) is learning by imitating expert players, 2) is reinforcement learning to win the game. Almost every single shocking result of deep learning, and the source of all *magic* is always 2. 2 is significantly significantly more powerful. 2 is what surprises you. 2 is when the paddle learns to hit the ball behind the blocks in Breakout. 2 is when AlphaGo beats even Lee Sedol. And 2 is the "aha moment" when the DeepSeek (or o1 etc.) discovers that it works well to re-evaluate your assumptions, backtrack, try something else, etc. It's the solving strategies you see this model use in its chain of thought.
Jim Fan: those who think RL uses less compute don’t know RL at all
Jack Morris: i guess DeepSeek broke the proverbial four-minute-mile barrier. people used to think this was impossible. and suddenly, RL on language models just works and it reproduces on a small-enough scale that a PhD student can reimplement it in only a few days. this year is going to be wild
Alexander Doria: Starting to think DeepSeek is a blessing in disguise for the AI market: can be cleansed sufficiently early on and avoid a dramatic bubble popping. Investors will price in competition and commoditization accurately from now on.
Dean W. Ball: Part of the reason DeepSeek looks so impressive (apart from just being impressive!) is that they are among the only truly cracked teams releasing detailed frontier AI research.
Jim Fan: We are living in a timeline where a non-US company is keeping the original mission of OpenAI alive - truly open, frontier research that empowers all. It makes no sense. The most entertaining outcome is the most likely. DeepSeek-R1 not only open-sources a barrage of models but also spills all the training secrets. They are perhaps the first OSS project that shows major, sustained growth of an RL flywheel.
Ethan Mollick: Lesson here is that investors do not understand that the paradigm for AI has been undergoing a shift from one which was about models getting smarter due to more computing power being used for training to models getting smarter due to more computing power being used for inference.
Ethan Mollick: Also “Jevons Paradox” is just a variation of saying prices are elastic and use goes up when price goes down. Hard to imagine compute not being the constraint for the foreseeable future when reasoning models like DeepSeek or o1 depend on inference-time scaling.
Quotes from other topics of the data and AI sphere:
Lorin Hochstein (December): Like so many other incidents that came before, this is another tale of saturation, where the failure mode involves overload.
…
The trigger for this incident was Canva deploying a new version of their editor page. It’s notable that there was nothing wrong with this new version. The incident wasn’t triggered by a bug in the code in the new version, or even by some unexpected emergent behavior in the code of this version. No, while the incident was triggered by a deploy, the changes from the previous version are immaterial to this outage. Rather, it was the system behavior that emerged from clients downloading the new version that led to the outage. Specifically, it was clients downloading the new javascript files from the CDN that set the ball in motion.
Elena Dyachkova: Amazing things unfolding on LinkedIn. Avinash Kaushik penned an ‘A/B testing is dead’ essay with some valid points but a very narrow definition of A/B testing as CRO, Ronny penned a debunk amply peppered with ‘that’s not been my personal experience’…
Debasish Ghosh: Columnar file formats offer compression that results in storage savings. Row-skipping metadata also accelerates columnar scans. How do you optimise partition size to get the best of both? This paper proposes a solution by decoupling the actual storage from the search acceleration axes.
LaurieWired: Most hashing algorithms are designed to avoid collisions. What if they weren’t? Locality-sensitive-hashing (LSH) is a way to group similar inputs into the same “buckets” with high probability. Collisions are maximized, not minimized. As a malware researcher, I’m quite experienced with fuzzy hashing. LSH algorithms are a bit different. LSH algos specifically reduce the dimensionality of data while preserving relative distance. Think spam filters, copyright media detection, even music recommendations.
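For anyone who hasn’t bumped into LSH before, here is a minimal sketch of the random-hyperplane flavor (SimHash-style); the names and parameters are my own illustrative choices, not from LaurieWired’s post. Similar vectors agree on most signature bits because they fall on the same side of most hyperplanes.

```python
# A minimal, illustrative sketch of random-hyperplane LSH (SimHash-style).
import random

DIM = 8          # dimensionality of the input vectors
NUM_PLANES = 16  # more hyperplanes -> finer-grained buckets

random.seed(42)
# Each hyperplane is a random vector; one signature bit records which side
# of that plane the input falls on.
planes = [[random.gauss(0, 1) for _ in range(DIM)] for _ in range(NUM_PLANES)]

def lsh_signature(v: list[float]) -> int:
    """Pack one bit per hyperplane into an integer bucket id."""
    sig = 0
    for plane in planes:
        dot = sum(p * x for p, x in zip(plane, v))
        sig = (sig << 1) | (1 if dot >= 0 else 0)
    return sig

a = [1.0, 0.9, 0.0, 0.1, 0.0, 0.0, 0.2, 0.0]
b = [0.9, 1.0, 0.1, 0.0, 0.0, 0.1, 0.2, 0.0]  # similar to a
c = [0.0, 0.0, 1.0, 0.0, 0.9, 1.0, 0.0, 0.8]  # dissimilar to both

# Similar vectors agree on most bits (often colliding exactly); dissimilar
# vectors land far apart in Hamming distance.
print(f"{lsh_signature(a):016b} {lsh_signature(b):016b} {lsh_signature(c):016b}")
```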
Ryanne Dolan: The MVs-as-pipelines metaphor breaks down when you want multiple pipelines working together to materialize a single big view. We've introduced "partial views" to solve this problem. You can now create a bunch of MVs that write to the same place.
Phil Eaton: I think there definitely is a vibe shift among experienced programmers. Minimize dependencies. Invest more in comprehensive standard libraries. (JavaScript and Rust notably have the least comprehensive standard libraries.)
David Lindner: New Google DeepMind safety paper! LLM agents are coming – how do we stop them finding complex plans to hack the reward? Our method, MONA, prevents many such hacks, *even if* humans are unable to detect them!
Jeffrey Emanuel wrote an incredible piece on The Short Case for Nvidia stock. A couple of quotes:
But if the next big scaling law that people are excited about is for inference level compute— and if the biggest drawback of COT models is the high latency introduced by having to generate all those intermediate logic tokens before they can respond— then even a company that only does inference compute, but which does it dramatically faster and more efficiently than Nvidia can— can introduce a serious competitive threat in the coming years. At the very least, Cerebras and Groq can chip away at the lofty expectations for Nvidia's revenue growth over the next 2-3 years that are embedded in the current equity valuation.
Now, it's no secret that there is a strong power law distribution of Nvidia's hyper-scaler customer base, with the top handful of customers representing the lion's share of high-margin revenue. How should one think about the future of this business when literally every single one of these VIP customers is building their own custom chips specifically for AI training and inference?
Rahul Jain: Devin's Paradox: AI tools doing so well that junior developers are no longer hired or upskilled. Meanwhile the demand for senior devs keeps increasing but there are no senior devs available - thus leading to decreased productivity.
Debasish Ghosh: The talk starts claiming that linked lists are an immoral data structure and if you are using them for anything for which you care about performance, you are committing sin. That's obviously because of the cache misses that a linked list will suffer. And yet Linux VMAs were based on linked lists only till 1995.
Gunnar Morling: Kinda blowing my mind that we're still largely using text-based formats (JSON) for logging, rather than binary formats. Such a waste of compute resources.
flaneur2024: finally read the paper <Can Applications Recover from fsync Failures?>. surprised to see that `fsync()` will simply mark the dirty page as clean after you get an EIO error, so a successful retry of `fsync()` will not actually persist your data; it fails to persist silently 😮
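To make the trap concrete, here is a minimal illustration of the naive retry pattern the paper warns about (my own sketch, not the paper’s code):

```python
# On Linux, a failed fsync() can mark the dirty pages clean, so a retried
# fsync() may "succeed" without the data ever reaching disk.
import os

def naive_durable_write(path: str, data: bytes) -> None:
    fd = os.open(path, os.O_WRONLY | os.O_CREAT, 0o644)
    try:
        os.write(fd, data)
        try:
            os.fsync(fd)  # suppose this raises OSError with errno EIO
        except OSError:
            # WRONG: the failed fsync may already have dropped the dirty
            # pages, so this retry can return success while the write was
            # never persisted.
            os.fsync(fd)
        # Safer responses: crash and recover from a write-ahead log, or
        # rewrite the data from an application-side buffer before retrying.
    finally:
        os.close(fd)
```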
LaurieWired: We aren’t far off from the theoretical limits of CPU clockspeed! A soft limit at around ~10GHz where speed-of-light across the entire die in one cycle starts to become a major limitation!
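The arithmetic behind that soft limit is easy to sanity-check; these are my own back-of-the-envelope numbers, not LaurieWired’s:

```python
# How far does a signal travel in one clock cycle at 10 GHz?
C = 3.0e8           # speed of light in vacuum, m/s
FREQ = 10e9         # 10 GHz
cycle_s = 1 / FREQ  # 100 picoseconds per cycle
print(C * cycle_s * 100, "cm")  # ~3 cm in vacuum, comparable to a large die;
                                # on-die signals propagate slower still
```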
Char writes about “high-leverage generalists”:
The "T-shaped person" discourse is played out. The real move is being an emergent complexity monster. I've spent the last 6 years watching my career accelerate dramatically through what I call "high-leverage generalism." Not the wishy-washy "jack of all trades" kind, but a deliberate approach to building a unique combination of skills that compounds in value over time.
The traditional narrative around career development is broken. We're told to pick a lane early, specialise hard, and climb the ladder in our chosen field. This advice made sense in a world of stable, well-defined industries. But that world is dead. Today's most interesting opportunities exist at the intersections - where different domains collide and create new possibilities.
Debasish Ghosh: I have been posting a bit about writing cache-aware code, aligned with modern CPU architectures, and by a fortunate stroke of serendipity found this talk by Scott Meyers on one of the basics of data-oriented design: CPU Caches and Why You Care
Karthic Rao: Looking ahead, we can anticipate bi-directional flows between tables and streams, letting data teams materialize Iceberg tables as real-time topics (and vice versa) with minimal friction. With no ETL duplication and no wasted data movement, the opportunity for innovation and cost savings is vast.
Chad Sanderson on consumer-driven data contracts:
An application developer will never comprehensively understand how data downstream is being used, nor will they fully understand the constraints on data that might be necessary for certain use cases and not others. They will have no concept of which SLAs are useful and which are meaningless. They will not understand how the data model needs to evolve (or how it should have been originally defined). They will have no grasp of how data is replicated across multiple services and where strong relationships must be built between owning producer teams. They won’t understand which data is necessary to be under contract and which isn’t.
This is the consumer-defined data contract. The consumer-defined contract is created by the owners of data applications, with requirements derived from their needs and use cases. While it contains the same information as the producer-defined contract, it is implemented primarily to draw awareness to the request and inform data producers when new expectations and dependencies exist on the data they maintain.
Published Humans
The Five-Minute Rule for the Cloud: Caching in Analytics Systems: A useful paper for system designers building analytics systems on object storage. The paper surveys a number of caching architectures in terms of properties such as latency variability and implementation/operational complexity. It then introduces a comprehensive cost model that evaluates these caching architectures, taking into account factors such as data access patterns and latency-sensitive vs non-latency-sensitive workloads, to determine optimal caching policies. Using this model, the authors propose new rules of thumb for cloud-native databases that must balance cost/latency trade-offs in dynamic cloud environments. The upshot is a need for adaptable caching mechanisms that can respond to fluctuating workloads and data access patterns while maintaining the desired cost/performance profile.
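To give a flavor of the rule-of-thumb reasoning involved (with made-up prices, not the paper’s actual numbers or model): keep an object cached while the requests it saves cost more than the cache capacity it occupies. A minimal sketch:

```python
# Back-of-the-envelope break-even in the spirit of the five-minute rule;
# all prices are illustrative.
GET_COST = 0.40 / 1_000_000                 # $ per S3-style GET (illustrative)
CACHE_COST_GB_S = 0.025 / (30 * 24 * 3600)  # $ per GB-second of cache (illustrative)

def worth_caching(object_gb: float, accesses_per_sec: float) -> bool:
    saved_per_sec = accesses_per_sec * GET_COST  # request cost avoided
    spent_per_sec = object_gb * CACHE_COST_GB_S  # cache capacity cost
    return saved_per_sec > spent_per_sec

# e.g. a 1 GB object accessed once a minute:
print(worth_caching(1.0, 1 / 60))
```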
Towards Query Optimizer as a Service (QOaaS) in a Unified LakeHouse Ecosystem: Can One QO Rule Them All? This paper explores whether query optimization (QO), in the context of data lakehouses, can be decoupled from data systems and made into a separate service: QOaaS. The paper explains that QO implementations across most relational analytics systems involve the common steps of parsing/algebrization, simplification/normalization, cost-based exploration/implementation, and post-optimization. The QOaaS model decouples the QO from individual engines, allowing it to function as an independent service that interacts with multiple engines via remote procedure calls. The theoretical benefits of QOaaS include accelerated innovation, enhanced engineering efficiency, reduced time-to-market for new engines, and the capability for cross-engine optimization and scalability. However, implementing QOaaS presents challenges such as defining a universal query plan format and accommodating diverse engine capabilities within a unified cost model. The authors themselves note that the work is ambitious and exploratory, inviting discussion from the community on its feasibility.
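The paper doesn’t pin down a concrete API, but a QOaaS boundary might look something like this hypothetical sketch, where every name below is invented for illustration:

```python
# A hypothetical sketch of a QOaaS boundary; not the paper's design.
from dataclasses import dataclass

@dataclass
class LogicalPlan:
    """Engine-agnostic plan, e.g. serialized in a Substrait-like format."""
    operators: list[dict]

@dataclass
class PhysicalPlan:
    operators: list[dict]
    estimated_cost: float

class QueryOptimizerService:
    """Runs as an independent service; engines would call it via RPC."""

    def optimize(self, plan: LogicalPlan, engine: str) -> PhysicalPlan:
        simplified = self._normalize(plan)        # simplification/normalization
        return self._explore(simplified, engine)  # cost-based exploration

    def _normalize(self, plan: LogicalPlan) -> LogicalPlan:
        return plan  # placeholder for algebraic rewrite rules

    def _explore(self, plan: LogicalPlan, engine: str) -> PhysicalPlan:
        # A real implementation would consult a per-engine capability list
        # and cost model here - the hard part the paper calls out.
        return PhysicalPlan(operators=plan.operators, estimated_cost=0.0)
```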
Interesting topic #1 - Well- and ill-conditioned APIs
Back in September 2023, Marc Brooker observed:
“The declarative nature of SQL is a major strength, but also a common source of operational problems. This is because SQL obscures one of the most important practical questions about running a program: how much work are we asking the computer to do?”
Mahesh Balakrishnan riffed on that, remarking:
To misuse some terminology from math, SQL is an “ill-conditioned” API: small changes in input can trigger very different amounts of work. The opposite would be block storage, which is “well-conditioned”. Another famous example of an ill-conditioned abstraction is IP Multicast.
This week I’ve seen two people touch on this property of APIs and their “conditioning”:
James Cowling: A good design principle is that APIs that look cheap should be cheap and APIs that are expensive should look expensive. One mistake we made in an old version of the Dropbox filesystem was that directory moves looked cheap (drag a folder around) but were expensive on the server (update metadata for every file in the tree). A similar mistake we made with @convex_dev is that the .filter() operator looks convenient, even though it does a table scan, while the .withIndex() syntax for efficiently reading from an index looks a little cumbersome and expensive.
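Here is a toy analogue of Cowling’s Convex example (in Python, not Convex’s actual API): two calls that read almost identically at the call site, where one scans every row and the other reads only what it returns.

```python
# A toy illustration of an ill-conditioned vs well-conditioned read path.
from collections import defaultdict

class Table:
    def __init__(self, rows: list[dict]):
        self.rows = rows
        self.indexes: dict[str, dict] = {}

    def create_index(self, column: str) -> None:
        idx = defaultdict(list)
        for row in self.rows:
            idx[row[column]].append(row)
        self.indexes[column] = idx

    def filter(self, column: str, value) -> list[dict]:
        # Looks cheap to call, but touches every row: work grows with table size.
        return [r for r in self.rows if r[column] == value]

    def with_index(self, column: str, value) -> list[dict]:
        # Slightly clunkier to call, but work is proportional to the result.
        return self.indexes[column].get(value, [])

users = Table([{"id": i, "country": "NZ" if i % 100 == 0 else "US"}
               for i in range(1_000_000)])
users.create_index("country")

slow = users.filter("country", "NZ")      # full scan: 1,000,000 rows touched
fast = users.with_index("country", "NZ")  # index lookup: 10,000 rows touched
assert slow == fast
```

Making the expensive path look heavier at the call site, as Convex’s .withIndex() does, is one way to keep an API well-conditioned in Balakrishnan’s sense.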
Jeffrey Emanuel discusses the impact that CoT-based reasoning has on inference scaling:
But essentially, until recently, inference compute was generally a lot less intensive than training compute, and scaled basically linearly with the number of requests you are handling— the more demand for text completions from ChatGPT, for instance, the more inference compute you used up.
With the advent of the revolutionary Chain-of-Thought ("COT") models introduced in the past year, most notably in OpenAI's flagship O1 model (but very recently in DeepSeek's new R1 model, which we will talk about later in much more detail), all that changed.
Some of the most exciting news in the AI world came out just a few weeks ago and concerned OpenAI's new unreleased O3 model, which was able to solve a large variety of tasks that were previously deemed to be out of reach of current AI approaches in the near term. And the way it was able to do these hardest problems (which include exceptionally tough "foundational" math problems that would be very hard for even highly skilled professional mathematicians to solve), is that OpenAI threw an insane amount of compute resources at the problems— in some cases, spending $3k+ worth of compute power to solve a single task (compare this to traditional inference costs for a single task, which would be unlikely to exceed a couple dollars using regular Transformer models without chain-of-thought).
On the subject of CoT reasoning, it seems likely that prompts on advanced reasoning models will consume wildly different amounts of compute, depending on the complexity of the question, any false steps during the CoT process that must be corrected, and the upper-bound cost the prompter is willing to pay for an answer.
Interesting topic #2 - Contextual data quality challenges and AI
This week I encountered this article on data quality, Big data quality framework: a holistic approach to continuous quality management (2021).
One figure was particularly interesting and useful to me. It categorizes data quality into intrinsic, contextual, representational and accessibility dimensions.
It made me realize that the definition of data quality that I and many others working on data tooling are using is overly narrow, focused almost entirely on the Intrinsic and Accessibility categories. The reason? These two categories are the ones that are easy to build tooling for. We can design data systems to offer consistency, we can use schemas, types and validation rules to ensure data adheres to strict forms, and we can measure whether data arrives on time or late. We can control access to the data: who can see what, and when.
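To illustrate how mechanical these checks are, here is a minimal sketch with a hypothetical schema and thresholds of my own invention:

```python
# Intrinsic checks (completeness, type conformance, timeliness) are pure
# rule evaluation; the schema and lag threshold here are hypothetical.
from datetime import datetime, timedelta, timezone

SCHEMA = {"order_id": int, "amount": float, "created_at": datetime}

def intrinsic_checks(row: dict) -> list[str]:
    """Completeness and type conformance: mechanical rule evaluation."""
    errors = []
    for field, expected_type in SCHEMA.items():
        if field not in row:
            errors.append(f"missing field: {field}")
        elif not isinstance(row[field], expected_type):
            errors.append(f"wrong type for {field}: {type(row[field]).__name__}")
    return errors

def is_fresh(event_time: datetime, max_lag: timedelta = timedelta(hours=1)) -> bool:
    """Timeliness: did the data arrive within the agreed window?"""
    return datetime.now(timezone.utc) - event_time <= max_lag

# No rule of this kind can tell you whether the rows are *relevant* or
# *trustworthy* for a given analysis - that is the Contextual dimension.
```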
But what about the Contextual category? The Contextual category is also extremely important, but characteristics such as relevancy, trustworthiness and suitability are hard to quantify and evaluate through software. The Contextual category is where data engineers and analysts use their human intellect and experience to pick good sources of data, judge whether data is trustworthy, and decide how to use and combine data effectively.
AI becomes relevant to this in two ways:
Data generation: AI can automatically generate data, and with it data quality issues, in all four categories.
Data control: AI can automate more advanced data quality controls.
From deterministic pipelines to stochastic pipelines
The data being ingested into data warehouses and data lakehouses can vary from low to high quality. Typically, data engineers have to cleanse the data before it can be used to generate business insights. This need is known ahead of time, so data is put through various cleansing stages first. Once data is in a high-quality state, we can apply deterministic processing/analysis to produce high-quality outputs. What data engineers and analysts must guard against is Garbage In, Garbage Out (GIGO) and errors in their own processes that turn valid inputs into wrong outputs. But this can be done; we’re talking about deterministic processing, where age-old techniques of testing, validation and end-user feedback catch most problems.
Enter AI. The benefit of AI is automated intelligence (whether it’s simple models doing classification or sophisticated models doing advanced reasoning). The output of AI is generally more data. AI also requires valid inputs, but due to its stochastic nature it can produce bad data from good.
The data quality problems it can generate span all four categories from the article:
Intrinsic: Data may be incomplete or fail to conform to the correct types, etc.
Contextual: Data may not be relevant or suitable. Data may be structurally correct but factually wrong.
Representational: AI systems that produce outputs in complex or inconsistent formats can hinder user understanding. For example, if an AI model presents data in a convoluted manner without clear explanations, it can lead to misinterpretations, affecting decision-making processes.
Accessibility: If AI systems do not implement robust access controls, sensitive information could be exposed.
Intrinsic data quality issues may be detected and remediated automatically via software, using the same strategies we use today on ingested data. But what about Contextual or Accessibility data quality issues? How do we apply automated controls to ensure relevancy and suitability of data (currently the preserve of the data engineer and analyst)? How do we detect that sensitive data has leaked?
Handling AI-generated contextual data quality issues
There is little room for contextual data quality issues with simple classifiers, but LLMs generate open-ended text that presents a much larger data quality challenge. So the question is: how do we benefit from automated intelligence without it causing data quality incidents?
It seems that the answer is also AI.
AI solutions to AI problems:
Confidence scores
Self-Evaluation Scoring. The LLM itself can estimate confidence by generating multiple responses and ranking them.
Threshold-Based Filtering. If confidence is below a threshold (e.g., 60%), the model can flag the answer as uncertain or refuse to answer.
Verification
Self-Consistency Sampling. Instead of generating one response, the LLM generates multiple responses and picks the most consistent one (see the sketch after this list).
AI Critique Models (Verifier LLMs). Another LLM can check the primary LLM’s responses, filtering out incorrect, irrelevant, or misleading text.
Fact-Checking with Retrieval-Augmented Generation (RAG).
Adversarial AI (GAN-Style Verifiers)
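Here is a minimal sketch of what the first two ideas, self-consistency sampling plus threshold-based filtering, might look like in practice. The `generate()` function is a hypothetical stand-in for a real LLM call, and the 60% threshold mirrors the example above:

```python
# Self-consistency sampling with threshold-based filtering (illustrative).
import random
from collections import Counter

def generate(prompt: str) -> str:
    # Stand-in for a real model call; this stub returns noisy canned
    # answers so the example runs end to end.
    return random.choice(["42", "42", "42", "41", "43"])

def self_consistent_answer(prompt: str, n: int = 5, threshold: float = 0.6):
    # Sample several responses and treat agreement as a confidence proxy.
    answers = [generate(prompt) for _ in range(n)]
    best, count = Counter(answers).most_common(1)[0]
    confidence = count / n
    if confidence < threshold:
        return None, confidence  # flag as uncertain rather than emit bad data
    return best, confidence

print(self_consistent_answer("What is 6 x 7?"))
```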
The verification side seems to be more domain-specific. In order to catch contextual data quality issues, the verifier itself must understand that context. The challenge for data engineers and analysts wanting to adopt more sophisticated uses of LLMs will be in automated, domain-specific, AI-based verification.
AI in 2025 and onwards
Contextual data quality remains largely a human-driven responsibility today, relying on the expertise of data engineers and analysts to make informed decisions about data sources, transformations, and interpretations. Notably, humans assess contextual data quality during data source selection, ingestion, or cleansing. However, AI can generate as-yet-unverified contextual data quality issues from inside the “safe zone”, where ingested data has already been cleansed.
While AI can generate data quality problems across all four dimensions, Contextual issues stand out to me as the most complex and the hardest to automate away. The challenge ahead for data professionals is to build domain-aware AI verification pipelines that understand context-specific relevancy, suitability and correctness. The future of data engineering isn’t just about generating data with AI but about curating and refining it, subjecting AI-produced outputs to the same rigorous scrutiny that human analysts apply. As AI increasingly permeates data pipelines, AI-powered data quality management will become ever more necessary to ensure that insights remain reliable and relevant. In other words, verification and correction will become as fundamental as ingestion and transformation.