I'm working on an alternative Iceberg client that works better for write-heavy use cases. Instead of writing many small files, it keeps rewriting the same data file until it reaches 1 MB, giving it a new name each time. Then I update the manifest with the new filename and checksum. I keep old files on disk for 60 seconds so pending queries can finish. I'm also working on auto compaction: when I have ten 1 MB files I compact them, same with ten 10 MB files, etc...
I feel like this could be a game changer for the ecosystem. It's more CPU- and network-heavy on writes, but reads are always fast. And the writes are still faster than pyiceberg.
I want to hear opinions, or reasons why this could never work.
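Roughly, the compaction loop could look like the sketch below (not my actual code; the tier sizes and the update_manifest() function are stand-ins for the real Iceberg metadata swap):

    import os
    import time
    import uuid
    import pyarrow.parquet as pq

    TIER_BYTES = 1 << 20   # tier 0 ~ 1 MB files, tier 1 ~ 10 MB, ...
    FILES_PER_TIER = 10    # compact once ten files share a tier
    GRACE_SECONDS = 60     # keep retired files around for pending queries

    def tier_of(path):
        # Bucket a file into a tier by size: 1 MB, 10 MB, 100 MB, ...
        size, tier, limit = os.path.getsize(path), 0, TIER_BYTES * 10
        while size >= limit:
            tier, limit = tier + 1, limit * 10
        return tier

    def compact(paths, out_dir):
        # Merge same-tier files into one bigger file under a fresh name.
        out_path = os.path.join(out_dir, f"{uuid.uuid4()}.parquet")
        schema = pq.read_schema(paths[0])
        with pq.ParquetWriter(out_path, schema) as writer:
            for p in paths:
                writer.write_table(pq.read_table(p, schema=schema))
        return out_path

    def compact_tier(paths, out_dir):
        assert len(paths) == FILES_PER_TIER   # ten same-tier files per merge
        new_file = compact(paths, out_dir)
        update_manifest(new_file, retired=paths)  # stand-in for the atomic manifest swap
        time.sleep(GRACE_SECONDS)                 # let in-flight queries drain
        for p in paths:
            os.remove(p)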
This approach reminds me of ClickHouse's MergeTree.
Also, https://paimon.apache.org/ seems to be better for streaming use cases.
Interesting. My personal feeling is that we're slowly headed to a world where we can have our cake and eat it: fast bulk ingestion, fast OLAP, fast OLTP, low latency, all together in the same datastore. I'm hoping we just get to collapse whole complex data platforms into a single consistent store with great developer experience, and never look back.
I’ve felt the same way. It’s so inefficient to have two patterns - OLAP and OLTP - both using SQL interfaces but requiring syncing between systems. There are some physical limits at play, though. OLAP will always need less processing and disk if the data it reads sits right next to each other (columnar storage), whereas OLTP’s need for fast writes usually makes row-based storage more efficient. I think the solution would be one system that stores data consistently both ways and knows which method to use for a given query.
In a sense, OLAP is just a series of indexing strategies that takes OLTP data and formats it for particular use cases (sometimes with eventual consistency). Some of these indexing strategies in enterprises today involve building out entire bespoke platforms to extract and transform the data. Incremental view maintenance is a step in the right direction - tools like Materialize give you good performance to keep calculated data up to date, and also break out of the streaming world of only paying attention to recent data. But you need to close the loop and also be able to do massive crunchy queries on top of that. I have no doubt we'll get there, really exciting times.
I think it's possible too, and the Iceberg spec allows it, but the current implementations aren't suited for every use case.
nice! anywhere we can follow your progress?
Not right now, sadly; I have some work obligations taking my time. But I can't wait to share more.
I'm using a basic implementation that's not backed by Iceberg, just Parquet files in Hive partitions that I can query using DuckDB.
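For anyone curious, querying that layout from DuckDB is a one-liner; the paths and column names below are made up:

    import duckdb

    # Hypothetical layout: data/date=2024-01-01/sensor=abc/part-0.parquet
    con = duckdb.connect()
    rows = con.execute("""
        SELECT sensor, avg(value)
        FROM read_parquet('data/*/*/*.parquet', hive_partitioning = true)
        WHERE date = '2024-01-01'  -- partition column, pruned from the path
        GROUP BY sensor
    """).fetchall()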
so... sharding?
This is a bit overblown.
Is Iceberg "easy" to set up? No.
Can you get set up in a week? Yes.
If you really need a datalake, spending a week setting it up is not so bad. We have a guide[0] here that will get you started in under an hour.
For smaller (e.g. under 10 TB) data where you don't need real-time, DuckDB is becoming a really solid option. Here's one setup[1] we've played around with using Arrow Flight.
If you don't want to mess with any of this, we[2] spin it all up for you.
0 - https://www.definite.app/blog/cloud-iceberg-duckdb-aws
I think Iceberg can work in real time but the current implementations make it impossible.
I have a vision for a way to make it work; I made another comment here. Your blog posts were helpful, and I dug a bit into the Duck Takes Flight code in Python and Rust.
This is a huge challenge with Iceberg. I have found that there is substantial bang for your buck in tuning how parquet files are written, particularly in terms of row group size and column-level bloom filters. In addition to that, I make heavy use of the encoding options (dictionary/RLE) while denormalizing data into as few files as possible. This has allowed me to rely on DuckDB for querying terabytes of data at low cost and acceptable performance.
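To give a flavor of the knobs involved, here is a toy PyArrow example (the data and column names are placeholders, and whether you can write column-level bloom filters depends on your writer library and version):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Toy stand-in for real ingest output.
    table = pa.table({
        "user_id": ["a", "a", "b", "b"],
        "event": ["click", "click", "view", "view"],
        "value": [1.0, 2.0, 3.0, 4.0],
    })

    pq.write_table(
        table,
        "events.parquet",
        row_group_size=100_000,               # smaller groups -> finer pruning on selective reads
        use_dictionary=["user_id", "event"],  # dictionary (and RLE) for low-cardinality columns
        compression="zstd",
        write_statistics=True,                # min/max stats that engines use to skip row groups
    )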
What we are lacking now is tooling that gives you insight into how you should configure Iceberg. Does something like this exist? I have been looking for something that would show me the query plan that is developed from Iceberg metadata, but didn’t find anything. It would go a long way to showing where the bottleneck is for queries.
Have you written about your parquet strategy anywhere? Or have suggested reading related to the tuning you've done? Super interested.
Also very interested in the parquet tuning. I have been building my data lake and most optimization I do is just with efficient partitioning.
I will write something up when the dust settles; I’m still testing things out. It’s a project where the data is fairly standardized but there is about a petabyte to deal with, so I think it makes sense to invest in efficiency at the lower level rather than throw tons of resources at it. That has meant a custom parser for the input data written in Rust, lots of analysis of the statistics of the data, etc. It has been a different approach to data engineering and one that I hope we see more of.
Regarding reading materials, I found this DuckDB post to be especially helpful in realizing how parquet could be better leveraged for efficiency: https://duckdb.org/2024/03/26/42-parquet-a-zip-bomb-for-the-...
¿chatgpt?
Better article (imo) on similar topic: https://www.dataengineeringweekly.com/p/is-apache-iceberg-th...
I think the posted article was generated from this one - the structure of the content is so similar.
Does anyone have a good alternative for storing large amounts of very small files that need to be individually queryable? We are dealing with a large volume of sensor readings that we need to query per sensor and per timespan, and we are hitting the problem mentioned in the article: storing millions of small files in S3 is expensive.
Do you absolutely have to write the data to files directly? If not, then using a time series database might be the better option. Most of them are pretty much designed for workloads with large numbers of append operations. You could always export to individual files later on if you need it.
Another option if you have enough local storage would be to use something like JuiceFS that creates a virtual file system where the files are initially written to the local cache before JuiceFS writes the data to your S3 provider as larger chunks.
SeaweedFS can do something similar if you configure it the right way. But both options require that you have enough storage outside of your object storage.
We tried some ready-made options, but they were roughly 10x more expensive than our custom-built S3 solution. I think we tried Timescale and AWS Timestream. I haven't heard of SeaweedFS.
https://github.com/mxmlnkn/ratarmount
> To use all fsspec features, either install via pip install ratarmount[fsspec] or pip install ratarmount[fsspec]. It should also suffice to simply pip install fsspec if ratarmountcore is already installed.
If you want to keep them in S3, consolidate into sorted Parquet files. You get random access to row groups, and only the columns you need are read, so it’s very efficient. DuckDB can both build and access these files efficiently. You could compact files hourly/nightly/weekly, whatever fits your latency needs.
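A sketch of that pattern with DuckDB (the table and column names are invented, and it assumes the httpfs extension plus S3 credentials are configured):

    import duckdb

    con = duckdb.connect()

    # Consolidate a batch covering many sensors into one sorted file.
    # Sorting by (sensor_id, ts) keeps each row group's min/max stats
    # tight, so a single-sensor read can skip most of the file.
    con.execute("""
        COPY (SELECT * FROM readings ORDER BY sensor_id, ts)
        TO 's3://bucket/batch-0001.parquet'
        (FORMAT PARQUET, ROW_GROUP_SIZE 50000)
    """)

    # Point read: DuckDB prunes row groups via their statistics and only
    # fetches the byte ranges it needs from S3.
    df = con.execute("""
        SELECT ts, value
        FROM read_parquet('s3://bucket/batch-0001.parquet')
        WHERE sensor_id = 'sensor-1234'
          AND ts BETWEEN TIMESTAMP '2024-01-01' AND TIMESTAMP '2024-01-02'
    """).df()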
Of course, for a simpler solution, you could also use Aurora: a clean, scalable Postgres that can survive zone failures.
The problem is that the initial writing is already so expensive; I guess we'd have to write multiple sensors into the same file instead of having one file per sensor per interval. I'll look into Parquet access options. If we could write 10k sensors into one file but still read a single sensor from that file, that could work.
Something like Redis instead? Key on [sensorid-timerange] so one lookup returns the values for that sensor and that time range.
No more files. You might be able to avoid per-usage pricing just by hosting this on a regular VPS.
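With redis-py this could be a sorted set per sensor, scored by timestamp; the key scheme below is just an example:

    import redis

    r = redis.Redis()

    # One sorted set per sensor; the score is the epoch timestamp,
    # so a time-range read is a single ZRANGEBYSCORE.
    def record(sensor_id, ts, value):
        r.zadd(f"sensor:{sensor_id}", {f"{ts}:{value}": ts})

    def read_range(sensor_id, start, end):
        return r.zrangebyscore(f"sensor:{sensor_id}", start, end)

    record("sensor-1234", 1704067200.0, 21.5)
    print(read_range("sensor-1234", 1704067200.0, 1704070800.0))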
We use Redis for buffering over a certain time period, and then we write one sensor's data for that period to S3. However, we fill up large Redis clusters pretty fast, so we can only buffer for a shortish period.
New S3 Table Buckets [1] do automatic compaction
[1] https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tab...
Table buckets are currently quite hard to use for a lot of use cases as they _only_ support primitive types. No nested types.
Hopefully this will come at some point. Product looks very cool otherwise.
Most of these issues will ring true for lots of folks using Iceberg at the moment. But this does not:
> Yet, competing table formats like Delta Lake and Hudi mirror this fragmentation. [...]
> Just as Spark emerged as the dominant engine in the Hadoop ecosystem, a dominant table format and catalog may appear in the Iceberg era.
I think extremely few people are making bets on any other open source table format now - that consolidation already happened in 2023-2024 (see e.g. Databricks, who have their own competing format, leaning heavily into Iceberg; or adoption by all of the major data warehouse providers).
Microsoft is right now making a huge bet on Delta by way of their “Microsoft Fabric” initiative (as always with Microsoft: Is it a product? Is it a branding scheme? Yes.)
They seem to be the only vendor crazy enough to try to fast-follow Databricks, who is clearly driving the increasingly elaborate and sophisticated Delta ecosystem (check the GitHub traffic…)
But Microsoft + Databricks is a lot of momentum for Delta.
On the merits of open & simple, I agree, better for everyone if Iceberg wins out—as Iceberg and not as some Frankenstandard mashed together with Delta by the force of 1,000 Databricks engineers.
The only reason Microsoft is using Delta is to signal to CTOs and investors that Fabric is as good as Databricks, even though that is obviously false to anyone who has smelled the evaporative scent of vaporware before.
Very different business, of course, but Databricks v. Fabric reminds me a lot of Slack v. Teams.
Regardless of the relative merits now, I think everyone agrees that a few years ago Slack was clearly superior. Microsoft certainly could have bought Slack instead of pumping probably billions into development, marketing, and discounts to destroy them.
I think Microsoft could and would consider buying Databricks—$80–100B is a lot, but not record-shattering.
If I were them, though, I’d spend a few billion competing as an experiment, first.
Anti-trust is the reason a lot of the kinds of deals you’re talking about don’t happen.
I agree. If the anti-trust regime had been different Microsoft would have bought Databricks years ago. Satya Nadella has surely been tapping his foot watching their valuation grow and grow.
The Trump folks have given mixed messages on the Biden-era FTC; I'd put the odds that with the right tap dancing (sigh) Microsoft could make a blockbuster like this in the B2B space work.
Microsoft's gonna Microsoft.
Does this feel about 3x too verbose, like it’s generated?
Idk if it's the verbosity but yes, reads as generated to me. Specifically sounds like ChatGPT's writing.
100%, might be gpt4.5