Humans of the Data Sphere Issue #1, October 15th, 2024
Your biweekly dose of insights, observations, commentary, and opinions from interesting people in the world of databases, AI, distributed systems, and the data engineering/analytics space.
This is issue #1 of Humans of the Data Sphere.
Aims of the publication
This publication aims to bring together the words of humans from across the data landscape. This is broader than just the “data space”: it encompasses all kinds of databases, messaging/streaming, OLAP, (cloud) data warehouses, lakehouses, distributed systems, and some AI and ML. My hope is that it is narrow enough to be cohesive, but broad enough for readers to be exposed to new ideas and new people that might sit just outside their normal career focus.
This format is heavily inspired by the highscalability.com “Stuff the internet says on scalability” posts. While I am using a similar format as a launchpad, it’s still too early to know what HOTDS will be. Right now it reflects the subjects and people that I find most interesting and that I think will be interesting to others.
Let me know what you think of issue #1 at feedback@hotds.dev.
Quotable humans
Ryanne Dolan: The best thing about streaming data pipelines is NOT that they are "real-time". It's that they are tremendously easier to operate.
sahilthapar: It took a long time at my previous company, but we were finally able to convince the upstream sources to own the quality of the data they send out. I totally agree with this comment; it's a slippery slope: the more you "handle" it, the less likely they are to ever fix the systems.
Soumith Chintala: "How to train a model on 10k H100 GPUs?" has now been immortalized on my blog: https://soumith.ch/blog/2024-10-02-training-10k-scale.md.html
Murat Demirbas: Complexity creeps into distributed systems through failures, asynchrony, and change. Mahesh also confessed that he didn't realize the extent of the importance of managing change until his days in industry. While other fields in computer science have successfully built robust abstractions (such as layered protocols in networking, and block storage, process, address space abstractions in operating systems), distributed systems have lagged behind in this aspect.
Data Engineering Podcast E400 (39’): Why did Snowflake evolve its own grammar? DuckDB is also creating its own grammar. The reason for this is that there is no extensibility, so the only thing that’s left is to change the language.
Jaana Dogan: When I retrospectively think about all the globally successful projects I worked on, the common denominator wasn't the buy-in from everyone. A few strongly opinionated people came together, identified a major problem, built a solution extensible enough. Growth was organic from that point on.
Lorin Hochstein: Reminder about the importance of getting good at recovering from incidents: 1. You can’t prevent all incidents from occurring. 2. You must recover from all of the incidents that occur.
SuperDataScience E825 (7’): My definition of data quality is a bit different from other people’s. In the software world, people think about quality as, it’s very deterministic. So I am writing a feature, I am building an application, I have a set of requirements for that application and if the software no longer meets those requirements that is known as a bug, it’s a quality issue. But in the data space you might have a producer of data that is emitting data or collecting data in some way, that makes a change which is totally sensible for their use case. As an example, maybe I have a column called timestamp that is being recorded in local time, but I decide to change that to UTC format. Totally fine, makes complete sense, probably exactly what you should do. But if there’s someone downstream of me that’s expecting local time, they’re going to experience a data quality issue. So my perspective is that data quality is actually a result of mismanaged expectations between the data producers and data consumers, and that is the function of the data contract. It’s to help these two sides actually collaborate better with each other.
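That timestamp example maps neatly onto an executable check. Here's a minimal sketch of a data-contract assertion (the field name and record shape are my own invention, not from the episode): the producer declares the timestamp to be timezone-aware UTC, and the pipeline fails loudly when the expectation breaks instead of silently corrupting downstream consumers.

```python
from datetime import datetime, timedelta, timezone

# Hypothetical contract: `timestamp` must be a timezone-aware UTC datetime.
def check_record(record: dict) -> None:
    ts = record["timestamp"]
    # Naive datetimes have utcoffset() == None, so they fail this check too.
    if not isinstance(ts, datetime) or ts.utcoffset() != timedelta(0):
        raise ValueError(f"contract violation: timestamp must be tz-aware UTC, got {ts!r}")

check_record({"timestamp": datetime.now(timezone.utc)})  # passes
# check_record({"timestamp": datetime.now()})            # raises: naive local time
```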
Murat Demirbas: What can go wrong? Computers can crash. Unfortunately they don't fail cleanly. Fail-fast is failing fast! And again unfortunately, partial failures (limping computers) are very difficult to deal with. Even worse, with transistor density so high, we now need to deal with silent failures. We have memory corruption and even silent faults from CPUs. The HPTS'24 session on failures was bleak indeed. That was only the beginning. Then there are network failures. And clock synchronization failures. Did we leave anything out? There are metastable failures, which are emergent distributed systems behavior. We can then venture into more malicious cases, like Byzantine failures. If you take security/hacking into account as failures, there are so many more problems to list. There are hurricanes, natural disasters, and even cyberattacks during natural disasters. The failure modes are too many to enumerate. So maybe instead of enumerating and explaining all these in detail, it is better to give the fundamental impossibility results haunting distributed systems: the Coordinated Attack impossibility and the FLP (Fischer-Lynch-Paterson) impossibility.
Benn Stancil: The mood in the room—the music in the air—however, was relief. Sure, Coalesce is less of the cheery circus it once was, but that was never sustainable anyway. Sure, the dbt product itself is slowly tilting the floor towards a line of cash registers, but dbt Labs was never going to survive by selling a thin web app on top of an approachable Python package. But this version of dbt, one person told me, feels like it can last. The session that they said was most memorable was a panel of dbt execs, not for what they said, but for who they were: People with long resumes of successfully selling stuff.
Gergely Orosz: Years back, my team was part of an outage where people couldn’t take rides for some time (I think 30-60 minutes or so). It was bad. In the postmortem, we always needed to list the business impact. An Eng1 with an interest in product filled it out: he put in a negative number!! (commenting on):
Kelly Sommers: You would be surprised how many orgs are unable to answer a simple question like “does this product make money?”. They don’t have correct things in place to separate out the costs.
Chris Riccomini: Please hook LLMs up to ops data.
"Which team is costing me the most on my s3 bill?"
"Which service returned 500 errors between 2am and 2:15 am last night?"
"Which service is calling the profile service with missing fields?"
Aviv Dozorets replied: I want to see more actual data separated from the noise. “What metrics changed in an upstream service before my service crashed?” “How did a crash of service X affect other systems (that I might be unaware of)?”
Murat Demirbas: As a user of a database you won't need to understand exactly how the database is implemented, but you need to have some mechanical sympathy, dammit.
chasd00: I toured a data center in Tornado alley back when leasing cages was pretty common. I asked them about disaster planning regarding getting completely wiped off the map and they sorta scoffed at me. Literally two weeks later a tornado missed them by about a 1/4 mile. Would have loved to be a fly on the wall after that.
Jaana Dogan: One of the main reasons why there is so much burnout in this space. Everyone is copying everyone, pretending to be ticking all the boxes instead of trying to build a cohesive great product that solves some hard problems in ways others don’t solve. (referring to):
Phillip Carter: one of the things that sucks about the observability space is that it's really hard to properly evaluate all the tools/products, and everyone is incentivized to "observabilitywash" their value props to make it sound like they do everything well, when they absolutely do not
Simon Späti: Why is reusability so hard in the data space? That is a fascinating question.
François Chollet: One more piece of evidence to add to the pile. This was an extremely heretical viewpoint in early 2023, and now it is increasingly becoming self-evident conventional wisdom. (referring to):
“Furthermore, we investigate the fragility of mathematical reasoning in these models and demonstrate that their performance significantly deteriorates as the number of clauses in a question increases. We hypothesize that this decline is due to the fact that current LLMs are not capable of genuine logical reasoning; instead, they attempt to replicate the reasoning steps observed in their training data.” GSM-Symbolic: Understanding the Limitations of Mathematical Reasoning in Large Language Models
François Chollet: My a-priori expectation of ChatGPT is that it will be able to solve a previously seen task but will not be able to adapt to any original task no matter how simple, because its ability to solve problems does not depend on task complexity but task familiarity.
Shreya Shankar: the correct way of thinking about the RAG vs long context debate is that RAG is predicate pushdown and predicates should always be pushed down if possible
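To make the analogy concrete, here's a toy sketch (random vectors standing in for real embeddings): retrieval "pushes the predicate down" by selecting the few relevant documents before the expensive LLM call, rather than materializing the whole corpus into a long context and filtering afterwards.

```python
import numpy as np

# Pretend-embedded corpus and query; in a real pipeline these come from an embedding model.
rng = np.random.default_rng(0)
doc_vecs = rng.normal(size=(1000, 64))
query_vec = rng.normal(size=64)

# "Push down" the relevance predicate: score all docs cheaply, keep only the top 5.
scores = doc_vecs @ query_vec / (np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec))
top_k = np.argsort(scores)[-5:][::-1]
print(top_k)  # only these few documents would go into the LLM's context
```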
Alex Miller: I had missed that there was *any* filesystem which supported multi-block atomic writes, but F2FS does, and there's sqlite support for leveraging it https://oslab.kaist.ac.kr/wp-content/uploads/esos_files/publication/conferences/international/HPCCT2016-108_final.pdf. See https://lore.kernel.org/lkml/1411707287-21760-2-git-send-email-jaegeuk@kernel.org/ for API.
François Chollet: It's surprisingly easy to do "hard" things -- for the most part, you need to get started and keep at it
Aleksey Charapko: It is time for a new list of papers! Papers #181-190 in our Distributed Systems Reading group! Follow the link for the schedule and instructions on how to join and participate. https://charap.co/fall-2024-reading-group-papers-papers-181-190/
Yingjun Wu: As a founder, I can confirm that the consensus within the data streaming community favors @ApacheIceberg
Shaun Thomas (on code comments): "Self describing code" can't cite RFCs, explain an underlying algorithm, or justify project architecture decisions. Comment blocks must mean something more than pure description. If you think comments are unnecessary, you're using them wrong.
François Chollet: Interesting work on reviving RNNs. https://arxiv.org/abs/2410.01201 -- in general the fact that there are many recent architectures coming from different directions that roughly match Transformers is proof that architectures aren't fundamentally important in the curve-fitting paradigm (aka deep learning). Curve-fitting is about embedding a dataset on a curve. The critical factor is the dataset, not the specific hard-coded bells and whistles that constrain the curve's shape. As long as your curve is sufficiently expressive all architectures will converge to the same performance in the large-data regime.
Gunnar Morling: Treating the database your application uses as an implementation detail which could be replaced with something else at any time is inefficient at best, and sets you up for failure at worst. Embrace and take advantage of the tools you're using.
François Chollet: In order to get high device utilization when training, the most important best practice is to do both data prefetching (moving the next batch of data to GPU memory while the previous batch is being processed) and asynchronous logging (moving the metrics from the previous batch to host memory while the next batch is being processed).
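For readers who want to see what that looks like in practice, here's a minimal PyTorch-flavored sketch (synthetic data, toy model, my own construction): pinned host memory plus non_blocking copies let the next batch's transfer and the metric copy-back overlap with compute.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
data = TensorDataset(torch.randn(10_000, 128), torch.randint(0, 10, (10_000,)))
loader = DataLoader(data, batch_size=256, pin_memory=True)  # pinned memory enables async host->GPU copies

model = torch.nn.Linear(128, 10).to(device)
opt = torch.optim.SGD(model.parameters(), lr=0.01)
loss_fn = torch.nn.CrossEntropyLoss()

it = iter(loader)
x, y = next(it)
x, y = x.to(device, non_blocking=True), y.to(device, non_blocking=True)
for nx, ny in it:
    # Prefetch: kick off the next batch's copy before computing on the current one.
    nx, ny = nx.to(device, non_blocking=True), ny.to(device, non_blocking=True)
    loss = loss_fn(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
    # Asynchronous logging: copy the metric to host without blocking the GPU.
    logged = loss.detach().to("cpu", non_blocking=True)
    x, y = nx, ny
# Step on the final prefetched batch.
loss = loss_fn(model(x), y)
opt.zero_grad(); loss.backward(); opt.step()
```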
LaurieWired: Your files are dying. That SSD you keep in the closet, the one from your old system "just in case". Yup, degrading as we speak. SSDs are *shockingly* bad at power-off retention, especially if the drive is near its endurance rating.
LaurieWired: Improving productivity is a scam; our brains are *terrible* at determining what is useful or not. Instead, focus on eliminating small fake work. Fake work feels productive, has indicators of progress, and gives a similar dopamine hit, all without output.
Leonie: Don’t underestimate the impact of reranking in your RAG pipeline.
Matt Turck: Today’s market, a summary: * $3B: valuation of a pre-IPO software company, because it’s valued at public market multiples of 8x $375m revenue. * Also, $3B: valuation of an AI agent thing with barely any revenue because, vibes.
Simon Späti: Many have asked how to "break into data engineering", as the saying goes. To be honest, if you just read whitepapers, you could go far.
numbsafari: "Cloud 3-2-1" to the rescue… I take the typical formulation (e.g., [1]), and translate it into:
- Keep 3 copies of your data: production + 2.
- Keep 2 snapshots, stored separately.
- Keep 1 snapshot on-hand (literally in your possession), or with a different provider.
Noah Pepper: Hilarious and poetic irony that regulators are looking at breaking up Google just as technology shifts are causing their search monopoly to melt away organically.
Ethan Mollick: Our research a year ago found that people stopped fact-checking the AI when it got good enough, and just took what it said as right (even if it wasn’t). I think that line has firmly & permanently been crossed for many text summarization applications.
Gunnar Morling: The biggest problem of #Java is poor perception. It's technically super-solid, but too often folks discard it based on misconceptions or on information that's been outdated for years.
Simon Späti: The medallion architecture gets some hype here and there. Still, IMO, it's the classical data warehouse layering `stage -> cleansing -> core -> mart` that we have done since the inception of data warehouse modeling, revamped with simplified names and optimized for data lakes.
Ethan Mollick: Individuals are seeing big gains from AI. Organizations less so.
T Greer: the inability of gen z to read makes me think that if I don’t switch from writing essays to giving YouTube lectures and videos in the next decade I will have consigned myself to irrelevancy.
Jonathan Ellis: The current state of AI is frustrating but only because we keep getting glimpses of the magic that is possible but not yet reliable.
Debasish Ghosh: TIL: Parquet uses Split Bloom Filters for predicate pushdown for high cardinality columns.
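For intuition, here's a toy bloom filter in plain Python. It is deliberately not the Parquet spec's exact split-block layout (which packs bits into cache-friendly 256-bit blocks), but it shows the skipping trick: the reader asks "might this row group contain the value?" and skips the row group entirely on a definite no.

```python
import hashlib

# Toy bloom filter: false positives possible, false negatives impossible.
class Bloom:
    def __init__(self, m_bits: int = 1 << 16, k: int = 4):
        self.m, self.k, self.bits = m_bits, k, bytearray(m_bits // 8)

    def _positions(self, value: str):
        # Derive k bit positions from independent-ish hashes of the value.
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{value}".encode()).digest()
            yield int.from_bytes(h[:8], "little") % self.m

    def add(self, value: str):
        for p in self._positions(value):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, value: str) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(value))

row_group_filter = Bloom()
row_group_filter.add("user_42")
print(row_group_filter.might_contain("user_42"))   # True
print(row_group_filter.might_contain("user_999"))  # False with high probability -> skip the row group
```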
Rahul Jain: Interesting how the Iceberg features you can use are the Venn diagram intersection of: 1. Iceberg, 2. choice of processing engine, 3. choice of catalog.
Rahul Jain: The best thing about data engineering subreddit is that it's more practitioners and fewer vendors. Much better place to talk DE than Twitter, imo. And the community is friendlier than /r/programming.
Neelesh Salian: At this point, let’s just accept that Iceberg is the standard format and build on top of it? I don’t see the value in having multiple formats in an organization unless you have teams who do their own thing and have settled into a format already.
Ethan Mollick: Among the key questions shaping the AI industry is how long Meta will keep releasing open weights models for. Gen3 (GPT-5 class)? Gen4 (GPT-6 class)? At some point the logic they have been using might shift in the face of rising risks, costs & opportunities for advantage.
Gergely Orosz: Claiming zero errors and hallucinations for LLMs is the equivalent of claiming 100% uptime for services. Just marketing.
Birdy: You'd think that the majority of data platform engineering is solving tech problems at large scale. Unfortunately it's once again the people problem that's all-consuming.
Oleg Šelajev: Debezium is like the Observer pattern for your database! It taps into transaction logs and propagates changes to the outside world.
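If you haven't seen it in action, here's a sketch of registering a Debezium Postgres connector via the Kafka Connect REST API (the hostnames and credentials are placeholders; the config keys are standard Debezium Postgres settings):

```python
import requests

# Hypothetical environment: a Kafka Connect worker at connect.example.com
# and a Postgres database at db.example.com.
connector = {
    "name": "inventory-connector",
    "config": {
        "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
        "database.hostname": "db.example.com",
        "database.port": "5432",
        "database.user": "debezium",
        "database.password": "secret",
        "database.dbname": "inventory",
        "topic.prefix": "inventory",            # prefix for change-event topics
        "table.include.list": "public.orders",  # only observe this table
    },
}
# Kafka Connect creates the connector; row-level changes then flow to Kafka topics.
requests.post("http://connect.example.com:8083/connectors", json=connector).raise_for_status()
```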
Gunnar Morling (reposted from 2023): One data architecture I expect we'll see more in 2023 is #SQLite/#DuckDB deployed as caches at the edge, updated via change feeds from system-of-record: stellar read performance due to close local proximity to users and fully queryable data models tailored for specific use cases.
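The consumer side of that architecture can be tiny. A minimal sketch, assuming Debezium-style create/update/delete events arriving from the change feed and a local SQLite cache (the table and event shapes are illustrative):

```python
import sqlite3

# Local edge cache kept fresh by applying change-feed events from the system of record.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")

def apply_change(event: dict) -> None:
    if event["op"] in ("c", "u"):  # create or update: upsert the new row image
        db.execute(
            "INSERT INTO users VALUES (?, ?) "
            "ON CONFLICT(id) DO UPDATE SET name = excluded.name",
            (event["after"]["id"], event["after"]["name"]),
        )
    elif event["op"] == "d":       # delete: drop the old row image
        db.execute("DELETE FROM users WHERE id = ?", (event["before"]["id"],))
    db.commit()

apply_change({"op": "c", "after": {"id": 1, "name": "Ada"}})
apply_change({"op": "u", "after": {"id": 1, "name": "Ada Lovelace"}})
print(db.execute("SELECT * FROM users").fetchall())  # [(1, 'Ada Lovelace')]
```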
Seattle Data Guy: If you don't set a consistent coding/design standard, people will all create their own inconsistent coding/design standard.
Jay Graves: You can build the most beautiful data visualization layer in the history of history and most users will still ask for an Excel export.
Yingjun Wu: Small data is the future. But that doesn’t mean big data is dead—it means the old "big data problems" can now be solved on smaller scales, often with just a few machines or even a single machine. There’s no longer a strong need to stress test systems at a thousand-node level because, for most real-world workloads, focusing on smaller-scale settings makes more sense. That said, single-node systems aren’t the universal solution. Systems still need to be distributed to overcome memory limitations, ensure fault tolerance, and maintain high availability. We’ll continue to see specialized systems emerge for different use cases—there’s still no "one size fits all."
Rick Houlihan: Every RDBMS backed service I have ever seen is denormalized in some way when it needs to scale out. No matter what database you use, when it comes to performance and efficiency, friends don't let friends join tables.
Jakob Foerster: When I discussed quitting Google to do a PhD, my manager, Steve Cheng, gave me the advice of "6 shots": Doing something meaningful usually takes about 5 years and we are productive for roughly 30 years. That gives you 6 attempts. So pick each one carefully and give it your best.
Peter Kraft: “Everybody knows that you deploy software on servers--physical machines with fixed pools of resources. But what if things were more flexible? What if CPUs and memory were disaggregated so you could allocate them in seconds from a network-attached pool?” I really like this paper…
Shaun Thomas: It's an interesting concept, but it requires a complete overhaul of both hardware and OS design, and still must contend with PACELC tradeoffs. You can't defeat the speed of light.
Yann LeCun: Worth repeating: Do not confuse retrieval with reasoning. Do not confuse rote learning with understanding. Do not confuse accumulated knowledge with intelligence.
Murat Demirbas: Cristina said that adversarial testing using a Byzantine adversary (for example, adversarial testing for congestion control) is better than a straightforward application of fuzzing, which is most commonly used in security/networking conferences. She then introduced their approach to adversarial testing. They use an abstract model to generate abstract strategies, map abstract strategies to concrete strategies, and execute concrete strategies. Formal methods play a crucial role in this process. They help clarify system specifications, make implicit assumptions explicit, and identify flaws through counterexamples.
ScyllaDB: Second: performance converges over time. In-memory caches have been (for a long time) regarded as one of the fastest infrastructure components around. Yet, it’s been a few years now since caching solutions started to look into the realm of flash disks. These initiatives obviously pose an interesting question: If an in-memory cache can rely on flash storage, then why can’t a persistent database also work as a cache?
Bernd Wessely: The definition of the ‘data engineering lifecycle’, as helpful and organizing as it might be, is actually a direct consequence of silo specialization.
It made us believe that ingestion is the unavoidable first step of working with data, followed by transformation, before the final step of data serving concludes the process. It almost seems like everyone accepted this pattern to represent what data engineering is all about.
Humans With Opinions
It’s easy to talk about how things work, harder to take a reasoned position on a subject.
These are some opinions I found interesting.
Tristan Handy (of dbt Labs) on Cross-Platform dbt Mesh
[snippet]
I fundamentally do not believe we are going to see one, or even two, winners in the data platform space. This is not Windows in the ‘90s, or even iOS and Android in 2012: the data platform ecosystem is not a monopoly or a duopoly; at best it is an oligopoly with 6-10 real players. But in reality I think it is better to just think about it as a competitive market.
This is good for users—no one needs the Oracle-vs-Microsoft dynamic that existed in 2003 at the start of my career. But it also creates complexity and bifurcation. Because today, different teams that use different data platforms inside the same company typically do not know about or have any access to the data assets that live inside the other platform. This leads to duplication, inefficiency, and inaccuracy.
Under the hood, dbt’s new cross-platform ref capabilities are powered by its support for Iceberg. Iceberg without dbt can be a real pain to use, but I am a huge believer in its ability to move the market in practitioner-favorable ways. I’m delighted by our ability to abstract away the complexity behind a perfectly dbtonic interface.
Sriram Subramanian on S3 and the hype train
[whole X post]
There is a general trend created by developer marketing that influences how developers adopt a technology and also affects how others build new technologies.
A few examples
Big data is dead; long live small data
Any data system not built on s3 will become obsolete
All infra providers need to support all deployment options - SaaS, BYOC, embedded.
This is far from reality and the truth is way more nuanced. The unfortunate downside to this is that new infrastructure companies will end up blindly following the trends vs understanding what is needed for their use case and users.
To make this point clear, consider the deployment case above (option 3). There are many things to consider to decide what deployment options you want to support for your customers -
Can you devise an architecture that unifies all three deployment modes? How much more time will it take to do so well? Can the experience across all deployment modes be exactly the same?
What do your target users want? What is their level of trust? Are you really secure by choosing to support BYOC model?
Can you build a data plane that is truly stateless for BYOA? If not, you need full access to your customer’s account. Can you build an embedded offering for your system? Can it be at feature parity? How do permissions work? Does it solve any problems for your users? Can you instead dockerize your offering?
What is the impact on GTM? How many pricing options are you going to support? Can you enable your sales to explain all the different options clearly to end users? How does it affect the total number of GTM assets to be created? Does your margin profile fundamentally change between them?
How does product prioritization work? How does support work if the user needs help on your embedded or BYOC offering?
Can you capture a significant market with just one of the deployment options, have more focus, and better execution?
You can possibly do the same for every major architecture decision. Each technology decision has many implications and will depend greatly on the use case and customers you are solving for.
Build from first principles, focus, strategize, listen to your customers, and execute.
Roy Hasson opines on the planned convergence of Apache Iceberg and Delta Lake
[whole LinkedIn post]
So how will the formats evolve into convergence?
The ideas they shared highlight the following:
1. Converging the data format to provide a single, consistent way to write and store physical data. Unifying how columns are represented, how deletes are encoded, data types, etc. This is a massive win!
2. Although both formats have a lot in common, maintaining separate metadata formats for the time being is preferred, to give each community autonomy and the ability to innovate for their users. If the data layer is consistent (per #1), both table formats can operate independently without requiring users to duplicate physical data.
What should users do if they want to store data today? They recommend, not surprisingly...
3. Utilize the Unity Catalog as a way to enable readers to translate between formats. So if your engine can only read Iceberg and the data is using Delta format, the catalog can quickly generate the appropriate manifests to make the table readable via Iceberg. Databricks does this today with Unity Catalog and Uniform.
This is where things fall apart for me...
With regards to the catalog and format conversion, I agree with the premise; however, I'm hesitant because this feels like a Databricks lock-in to get users into Unity Catalog and Delta. Yes, you could read Iceberg with this approach, but not write it. The interoperability is one way - Delta to Iceberg. If I'm not a Databricks customer and prefer Iceberg as my format of choice, this approach will not work for me.
My expectation is that if #1 becomes a reality (and it should, sooner rather than later), an engine's ability to support both Iceberg and Delta equally becomes significantly simpler (popular engines already, for the most part, support both formats). So why do I need the potential lock-in of Uniform and Unity Catalog?
If you're in the Databricks ecosystem, then Unity Catalog is a great product with governance, security, and lots of other goodies. But if you're not, then why force me into this setup?
Anyway, interesting conversation and insight into how Databricks is thinking about this table format interoperability challenge.
Charity Majors on the danger of premature seniority
[snippet]
What you are experiencing now is the alluring comfort of premature seniority. You’re the smartest kid in the room, you know every corner of the system inside and out, you win every argument and anticipate every objection and you are part of every decision and you feel so deeply, pleasingly needed by the people around you.
It’s a trap.
Get the fuck out of there.
Ethan Mollick on AI in organizations: Some tactics
[snippet]
Over the past few months, we have gotten increasingly clear evidence of two key points about AI at work:
A large percentage of people are using AI at work. We know this is happening in the EU, where a representative study of knowledge workers in Denmark from January found that 65% of marketers, 64% of journalists, and 30% of lawyers, among others, had used AI at work. We also know it from a new study of American workers in August, where a third of workers had used Generative AI at work in the last week. (ChatGPT is by far the most used tool in that study, followed by Google’s Gemini.)
We know that individuals are seeing productivity gains at work for some important tasks. You have almost certainly seen me reference our work showing consultants completed 18 different tasks 25% more quickly using GPT-4. But another new study of actual deployments of the original GitHub Copilot for coding found a 26% improvement in productivity (and this used the now-obsolete GPT-3.5 and is far less advanced than current coding tools). This aligns with self-reported data. For example, the Denmark study found that users thought that AI halved their working time for 41% of the tasks they do at work.
Yet, when I talk to leaders and managers about AI use in their company, they often say they see little AI use and few productivity gains outside of narrow permitted use cases. So how do we reconcile these two experiences with the points above?
Yaroslav Tkachenko on “Shifting left to make it right”
[snippet]
“Shifting left” was mentioned during the keynotes five times (I counted). I also heard it in the hallways a lot. In case you don’t know, in the data platform context, shifting left means working more closely with operational / application development teams. For example, it means shared ownership over data products or data pipelines with the goal of stopping data artifacts from being treated as a second-class citizen. Data Mesh architecture is one of the ways to implement this principle.
It’s quite refreshing to hear this not just from consultants or vendors but large enterprises as well. I suspect that execs have finally started to understand the importance of high quality data. If you want to build actually useful user-facing “AI” products, you can’t do it without clean and fresh datasets. And yet, most of the enterprises still struggle with basic BI projects…
Bernd Wessely on Data Architecture: Lessons Learned
After we have built all too many brittle data pipelines, it’s time for data engineers to acknowledge that fundamental software engineering principles are just as crucial for data engineering. Since data engineering is essentially a form of software engineering, it makes sense that foundational practices such as CI/CD, agile development practices, clean coding using version control, Test-Driven Development (TDD), modularized architectures, and considering security aspects early in the development cycle should also be applied in data engineering.
But the narrow focus within an engineering discipline often leads to a kind of intellectual and organizational isolation, where the greater commonalities and interdisciplinary synergies are no longer recognized. This has led to the formation of the ‘data engineering silo’ in which not only knowledge and resources, but also concepts and ways of thinking were isolated from the software engineering discipline. Collaboration and understanding between these disciplines became more difficult. I think this undesirable situation needs to be corrected as quickly as possible.
Unfortunately, the very same silo thinking seems to start with the hype around artificial intelligence (AI) and its sub-discipline machine learning (ML). ML engineering is about to create the next big silo.
Helpful Humans
Murat Demirbas explains different transaction isolation levels [advanced level warning].
Alex Merced
On implementing Change Data Capture (CDC) when there is no CDC
Publishes the Ultimate Directory of Apache Iceberg Resources
Andy Grove on accelerating Apache Spark using Apache DataFusion Comet [video].
Medium Engineering shares some useful tips on reducing the Snowflake bill
Alex Miller on Erasure Coding for Distributed Systems
Daniel Beach
Discussed Should you use DuckDB or Polars?
Published Humans
Two papers, two query engines, same subject: Making query optimizers that produce better query plans in the face of stale or incomplete data statistics.
Presto’s History-based Query Optimizer
An important component of every query engine is its query optimizer. This is the part of the system responsible for taking the input query tree (typically an abstract query tree produced by the parser/analyzer) and converting it into an efficient execution plan. As the complexity of queries grows, so does the search space of possible plans, and having a good query optimizer becomes critical for navigating that search space and producing an efficient execution plan. Today, most enterprise-grade query optimizers are cost-based [7, 11, 15, 26, 28, 30, 31, 37], meaning they use a costing function to predict how computationally expensive a query plan is and select the one with the lowest cost estimate for execution. The costing module typically uses knowledge of data statistics and computation cost to compare different query plans and guide the optimizer into selecting the best query plan. This module often relies heavily on estimated data distribution and cardinalities.
First, it requires data to be analyzed before it can be queried. In addition, cardinality estimators make a number of simplifying assumptions, such as data uniformity, independence of filters and columns, etc. They are often incapable of estimating the selectivity of complex expressions, such as conditional expressions, function calls, and multi-key aggregations. There have been attempts to store more complex statistics, such as multi-column and join histograms, but those require additional time and space to compute, and are often non-trivial to work with. As a result, it is not surprising that even industry-strength cardinality estimators routinely produce large errors in estimation.
To overcome the challenges presented above, in this paper we present Presto’s history-based query optimizer (HBO) that has been used in production for several years at several large data infrastructure groups including those of Meta and Uber. In a nutshell, HBO tracks query execution statistics at the operator node, and uses those to predict future performance for similar queries.
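The independence assumption the Presto authors mention is easy to see failing. Here's a toy example of my own with two perfectly correlated columns, where the classic per-column estimate is off by an order of magnitude:

```python
import random

# 10% of rows are ("CA", "San Francisco"); the rest are ("NY", "New York").
# State and city are therefore perfectly correlated.
random.seed(42)
rows = [("CA", "San Francisco") if random.random() < 0.1 else ("NY", "New York")
        for _ in range(100_000)]

sel_state = sum(r[0] == "CA" for r in rows) / len(rows)            # ~0.10
sel_city = sum(r[1] == "San Francisco" for r in rows) / len(rows)  # ~0.10
estimated = sel_state * sel_city  # independence assumption: ~0.01
actual = sum(r == ("CA", "San Francisco") for r in rows) / len(rows)  # ~0.10
print(f"estimated={estimated:.3f} actual={actual:.3f}")  # ~10x underestimate
```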
(Databricks’ Photon engine) Adaptive and Robust Query Execution for Lakehouses at Scale
Firstly, in large-scale, open Lakehouses with uncurated data, high ingestion rates, external tables, or deeply nested schemas, it is often costly or wasteful to maintain perfect and up-to-date table and column statistics. Secondly, inherently imperfect cardinality estimates with conjunctive predicates, joins and user-defined functions can lead to bad query plans. Thirdly, for the sheer magnitude of data involved, strictly relying on static query plan decisions can result in performance and stability issues such as excessive data movement, substantial disk spillage, or high memory pressure.
The paper discusses some of the reasons for poor statistics or no statistics at all (which necessitate a query optimizer not based on statistics alone):
Supporting raw, uncurated data (lacking statistics).
Supporting external tables (lacking statistics).
Supporting deeply nested data (lacking statistics).
Supporting rapidly evolving data and workloads (stale statistics and volatile histories).
Supporting UDFs (lacking information for cardinality estimation).
Supporting diverse workloads (amplifying bad plans).
To address these challenges, we built an adaptive query execution (AQE) framework. The key idea is to collect statistics during query execution from task metrics of completed and ongoing query plan fragments, and subsequently re-optimize unfinished execution plan fragments into better ones based on these runtime statistics.
My writing
I’ve been writing about the table formats for months now. My latest post questions the need for table format interoperability in the long term.
A snippet from Table format interoperability, future or fantasy?
[snippet]
The third alternative is to align the table formats at the data layer so that cross-publishing can utilize the vast majority of features, support merge-on-read without rewriting delete/DV files, and so on. If cross-publishing table formats ever really works well, it will be because the remaining table formats will have standardized some things, like partitioning, clustering, delete files and so on. There is also the potential for common standards for things like secondary indexing. This is similar to the standardized protocols that sit above TCP and UDP, like DNS or BGP, supporting interoperability and core workflows, but currently there is no standardization mechanism like RFCs for the open table formats’ data layer.
But if all that did happen, why have a bunch of competing formats at all?
Let me know of interesting humans at feedback@hotds.dev