I'm working on an alternative Iceberg client that works better for write-heavy use cases. Instead of writing many small files, it keeps rewriting the same data file until it reaches 1 MB, giving it a new name each time. Then I update the manifest with the new filename and checksum. I keep old files on disk for 60 seconds so pending queries can finish. I'm also working on auto compaction: when I have ten 1 MB files I compact them, same with ten 10 MB files, etc...
I feel like this could be a game changer for the ecosystem. It's more CPU- and network-heavy on writes, but reads are always fast. And the writes are still faster than pyiceberg.
I want to hear opinions, or reasons why this could never work.
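Roughly, the compaction loop could look like the sketch below (not my actual code; the tier sizes and the update_manifest() function are stand-ins for the real Iceberg metadata swap):

    import os
    import time
    import uuid
    import pyarrow.parquet as pq

    TIER_BYTES = 1 << 20   # tier 0 ~ 1 MB files, tier 1 ~ 10 MB, ...
    FILES_PER_TIER = 10    # compact once ten files share a tier
    GRACE_SECONDS = 60     # keep retired files around for pending queries

    def tier_of(path):
        # Bucket a file into a tier by size: 1 MB, 10 MB, 100 MB, ...
        size, tier, limit = os.path.getsize(path), 0, TIER_BYTES * 10
        while size >= limit:
            tier, limit = tier + 1, limit * 10
        return tier

    def compact(paths, out_dir):
        # Merge same-tier files into one bigger file under a fresh name.
        out_path = os.path.join(out_dir, f"{uuid.uuid4()}.parquet")
        schema = pq.read_schema(paths[0])
        with pq.ParquetWriter(out_path, schema) as writer:
            for p in paths:
                writer.write_table(pq.read_table(p, schema=schema))
        return out_path

    def compact_tier(paths, out_dir):
        assert len(paths) == FILES_PER_TIER   # ten same-tier files per merge
        new_file = compact(paths, out_dir)
        update_manifest(new_file, retired=paths)  # stand-in for the atomic manifest swap
        time.sleep(GRACE_SECONDS)                 # let in-flight queries drain
        for p in paths:
            os.remove(p)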
This approach reminds me of ClickHouse's MergeTree.
Also, https://paimon.apache.org/ seems to be better for streaming use cases.
Interesting. My personal feeling is that we're slowly headed to a world where we can have our cake and eat it: fast bulk ingestion, fast OLAP, fast OLTP, low latency, all together in the same datastore. I'm hoping we just get to collapse whole complex data platforms into a single consistent store with great developer experience, and never look back.
I’ve felt the same way. It’s so inefficient to have two patterns - OLAP and OLTP - both using SQL interfaces but requiring syncing between systems. There are some physical limits at play, though. OLAP will always need less processing and disk if the data it reads sits right next to each other (columnar storage), whereas OLTP’s need for fast writes usually makes row-based storage more efficient. I think the solution would be one system that stores data consistently both ways and knows which method to use for a given query.
In a sense, OLAP is just a series of indexing strategies that takes OLTP data and formats it for particular use cases (sometimes with eventual consistency). Some of these indexing strategies in enterprises today involve building out entire bespoke platforms to extract and transform the data. Incremental view maintenance is a step in the right direction - tools like Materialize give you good performance to keep calculated data up to date, and also break out of the streaming world of only paying attention to recent data. But you need to close the loop and also be able to do massive crunchy queries on top of that. I have no doubt we'll get there, really exciting times.
I think it's possible too, and the Iceberg spec allows it, but the current implementations aren't suited for every use case.
nice! anywhere we can follow your progress?
Not right now, sadly; I have some work obligations taking my time. But I can't wait to share more.
I'm using a basic implementation that's not backed by Iceberg, just Parquet files in Hive partitions that I can query using DuckDB.
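For anyone curious, querying that layout from DuckDB is a one-liner; the paths and column names below are made up:

    import duckdb

    # Hypothetical layout: data/date=2024-01-01/sensor=abc/part-0.parquet
    con = duckdb.connect()
    rows = con.execute("""
        SELECT sensor, avg(value)
        FROM read_parquet('data/*/*/*.parquet', hive_partitioning = true)
        WHERE date = '2024-01-01'  -- partition column, pruned from the path
        GROUP BY sensor
    """).fetchall()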
so... sharding?
This is a bit overblown.
Is Iceberg "easy" to set up? No.
Can you get set up in a week? Yes.
If you really need a datalake, spending a week setting it up is not so bad. We have a guide[0] here that will get you started in under an hour.
For smaller (e.g. under 10 TB) data where you don't need real-time, DuckDB is becoming a really solid option. Here's one setup[1] we've played around with using Arrow Flight.
If you don't want to mess with any of this, we[2] spin it all up for you.
0 - https://www.definite.app/blog/cloud-iceberg-duckdb-aws
I think Iceberg can work in real time but the current implementations make it impossible.
I have a vision for a way to make it work; I made another comment here. Your blog posts were helpful, and I dug a bit into the Duck Takes Flight code in Python and Rust.
This is a huge challenge with Iceberg. I have found that there is substantial bang for your buck in tuning how parquet files are written, particularly in terms of row group size and column-level bloom filters. In addition to that, I make heavy use of the encoding options (dictionary/RLE) while denormalizing data into as few files as possible. This has allowed me to rely on DuckDB for querying terabytes of data at low cost and acceptable performance.
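To give a flavor of the knobs involved, here is a toy PyArrow example (the data and column names are placeholders, and whether you can write column-level bloom filters depends on your writer library and version):

    import pyarrow as pa
    import pyarrow.parquet as pq

    # Toy stand-in for real ingest output.
    table = pa.table({
        "user_id": ["a", "a", "b", "b"],
        "event": ["click", "click", "view", "view"],
        "value": [1.0, 2.0, 3.0, 4.0],
    })

    pq.write_table(
        table,
        "events.parquet",
        row_group_size=100_000,               # smaller groups -> finer pruning on selective reads
        use_dictionary=["user_id", "event"],  # dictionary (and RLE) for low-cardinality columns
        compression="zstd",
        write_statistics=True,                # min/max stats that engines use to skip row groups
    )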
What we are lacking now is tooling that gives you insight into how you should configure Iceberg. Does something like this exist? I have been looking for something that would show me the query plan that is developed from Iceberg metadata, but didn’t find anything. It would go a long way to showing where the bottleneck is for queries.
Have you written about your parquet strategy anywhere? Or have suggested reading related to the tuning you've done? Super interested.
Also very interested in the parquet tuning. I have been building my data lake and most optimization I do is just with efficient partitioning.
I will write something up when the dust settles; I’m still testing things out. It’s a project where the data is fairly standardized but there is about a petabyte to deal with, so I think it makes sense to invest in efficiency at the lower level rather than throw tons of resources at it. That has meant a custom parser for the input data written in Rust, lots of analysis of the statistics of the data, etc. It has been a different approach to data engineering and one that I hope we see more of.
Regarding reading materials, I found this DuckDB post to be especially helpful in realizing how parquet could be better leveraged for efficiency: https://duckdb.org/2024/03/26/42-parquet-a-zip-bomb-for-the-...
¿chatgpt?
Better article (imo) on similar topic: https://www.dataengineeringweekly.com/p/is-apache-iceberg-th...
I think the posted article was generated from this one - the structure of the content is so similar.
Does anyone have a good alternative for storing large amounts of very small files that need to be individually queryable? We are dealing with a large volume of sensor readings that we need to query per sensor and per timespan, and we are hitting the problem mentioned in the article: storing millions of small files in S3 is expensive.
Do you absolutely have to write the data to files directly? If not, then using a time series database might be the better option. Most of them are pretty much designed for workloads with large numbers of append operations. You could always export to individual files later on if you need it.
Another option if you have enough local storage would be to use something like JuiceFS that creates a virtual file system where the files are initially written to the local cache before JuiceFS writes the data to your S3 provider as larger chunks.
SeaweedFS can do something similar if you configure it the right way. But both options require that you have enough storage outside of your object storage.
We tried some ready-made options, but they were roughly 10x more expensive than our custom-built S3 solution. I think we tried Timescale and AWS Timestream. I haven't heard of SeaweedFS.
https://github.com/mxmlnkn/ratarmount
> To use all fsspec features, either install via pip install ratarmount[fsspec] or pip install ratarmount[fsspec]. It should also suffice to simply pip install fsspec if ratarmountcore is already installed.
If you want to keep them in S3, consolidate into sorted Parquet files. You get random access to row groups, and only the columns you need are read, so it’s very efficient. DuckDB can both build and access these files efficiently. You could compact files hourly/nightly/weekly, whatever fits your latency needs.
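A sketch of that pattern with DuckDB (the table and column names are invented, and it assumes the httpfs extension plus S3 credentials are configured):

    import duckdb

    con = duckdb.connect()

    # Consolidate a batch covering many sensors into one sorted file.
    # Sorting by (sensor_id, ts) keeps each row group's min/max stats
    # tight, so a single-sensor read can skip most of the file.
    con.execute("""
        COPY (SELECT * FROM readings ORDER BY sensor_id, ts)
        TO 's3://bucket/batch-0001.parquet'
        (FORMAT PARQUET, ROW_GROUP_SIZE 50000)
    """)

    # Point read: DuckDB prunes row groups via their statistics and only
    # fetches the byte ranges it needs from S3.
    df = con.execute("""
        SELECT ts, value
        FROM read_parquet('s3://bucket/batch-0001.parquet')
        WHERE sensor_id = 'sensor-1234'
          AND ts BETWEEN TIMESTAMP '2024-01-01' AND TIMESTAMP '2024-01-02'
    """).df()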
Of course, for a simpler solution, you could also use Aurora: a clean, scalable Postgres that can survive zone failures.
The problem is that the initial writing is already so expensive; I guess we'd have to write multiple sensors into the same file instead of having one file per sensor per interval. I'll look into Parquet access options. If we could write 10k sensors into one file but still read a single sensor from that file, that could work.
Something like Redis instead? Key on [sensorid-timerange] so one lookup returns the values for that sensor and that time range.
No more files. You might be able to avoid per-usage pricing just by hosting this on a regular VPS.
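With redis-py this could be a sorted set per sensor, scored by timestamp; the key scheme below is just an example:

    import redis

    r = redis.Redis()

    # One sorted set per sensor; the score is the epoch timestamp,
    # so a time-range read is a single ZRANGEBYSCORE.
    def record(sensor_id, ts, value):
        r.zadd(f"sensor:{sensor_id}", {f"{ts}:{value}": ts})

    def read_range(sensor_id, start, end):
        return r.zrangebyscore(f"sensor:{sensor_id}", start, end)

    record("sensor-1234", 1704067200.0, 21.5)
    print(read_range("sensor-1234", 1704067200.0, 1704070800.0))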
We use Redis for buffering over a certain time period, and then we write one sensor's data for that period to S3. However, we fill up large Redis clusters pretty fast, so we can only buffer for a shortish period.
New S3 Table Buckets [1] do automatic compaction
[1] https://docs.aws.amazon.com/AmazonS3/latest/userguide/s3-tab...
Table buckets are currently quite hard to use for a lot of use cases as they _only_ support primitive types. No nested types.
Hopefully this will come at some point. Product looks very cool otherwise.
Most of these issues will ring true for lots of folks using Iceberg at the moment. But this does not:
> Yet, competing table formats like Delta Lake and Hudi mirror this fragmentation. [...]
> Just as Spark emerged as the dominant engine in the Hadoop ecosystem, a dominant table format and catalog may appear in the Iceberg era.
I think extremely few people are making bets on any other open source table format now - that consolidation already happened in 2023-2024 (see e.g. Databricks, who have their own competing format, leaning heavily into Iceberg; or adoption by all of the major data warehouse providers).
Microsoft is right now making a huge bet on Delta by way of their “Microsoft Fabric” initiative (as always with Microsoft: Is it a product? Is it a branding scheme? Yes.)
They seem to be the only vendor crazy enough to try to fast-follow Databricks, who is clearly driving the increasingly elaborate and sophisticated Delta ecosystem (check the GitHub traffic…)
But Microsoft + Databricks is a lot of momentum for Delta.
On the merits of open & simple, I agree, better for everyone if Iceberg wins out—as Iceberg and not as some Frankenstandard mashed together with Delta by the force of 1,000 Databricks engineers.
The only reason Microsoft is using Delta is to signal to CTOs and investors that Fabric is as good as Databricks, even though that is obviously false to anyone who has smelled the evaporative scent of vaporware before.
Very different business, of course, but Databricks v. Fabric reminds me a lot of Slack v. Teams.
Regardless of the relative merits now, I think everyone agrees that a few years ago Slack was clearly superior. Microsoft certainly could have bought Slack instead of pumping probably billions into development, marketing, and discounts to destroy them.
I think Microsoft could and would consider buying Databricks—$80–100B is a lot, but not record-shattering.
If I were them, though, I’d spend a few billion competing as an experiment, first.
Anti-trust is the reason a lot of the kinds of deals you’re talking about don’t happen.
I agree. If the anti-trust regime had been different Microsoft would have bought Databricks years ago. Satya Nadella has surely been tapping his foot watching their valuation grow and grow.
The Trump folks have given mixed messages on the Biden-era FTC; I'd put the odds that with the right tap dancing (sigh) Microsoft could make a blockbuster like this in the B2B space work.
Microsoft's gonna Microsoft.
Does this feel about 3x too verbose, like it’s generated?
Idk if it's the verbosity but yes, reads as generated to me. Specifically sounds like ChatGPT's writing.
100%, might be gpt4.5