Amazon switched compression from Gzip to Zstd for its own service data

A tweet from Adrian Cockcroft, a former VP at Amazon, recently highlighted the benefits of switching from gzip compression to Zstandard compression at Amazon and sparked community discussion about the compression algorithm. Other large companies, including Twitter and Honeycomb, have also reported significant gains from using zstd.

Analyzing the savings at Twitter, Dan Luu recently started a conversation by tweeting:

I wonder how much waste was eliminated by Yann Collet creating zstd. When I ran the numbers at Twitter, which is tiny compared to the huge tech companies, the move to zstd for HDFS was worth around 8 figures/year. Worldwide (not annualized), it looks like it must be >= 9 figures?

Cockcroft replied:

A lot was saved when moving AWS from gzip to zstd – around 30% reduction in compressed S3 storage, at exabyte scale.

Zstandard, better known by its C implementation zstd, is a lossless data compression algorithm developed by Yann Collet of Facebook that provides a high compression ratio with very good performance across varied data sets. Distributed as open source software under the BSD license, the reference library offers a wide range of speed/compression trade-offs with an extremely fast decoder.
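As an illustration of that speed/compression trade-off, here is a minimal sketch using the python-zstandard bindings (the package choice, levels, and sample payload are assumptions for illustration, not anything from Amazon's setup):

```python
# Minimal sketch using the python-zstandard bindings (pip install zstandard);
# the levels and sample payload are illustrative assumptions.
import zstandard as zstd

# A repetitive, log-like payload compresses well at any level.
data = b"2023-01-01T00:00:00Z GET /index.html 200\n" * 10_000

for level in (3, 19):  # 3 is the default level; 19 trades speed for a higher ratio
    cctx = zstd.ZstdCompressor(level=level)
    compressed = cctx.compress(data)
    print(f"level {level}: {len(data):,} -> {len(compressed):,} bytes")

# Decoding uses the same fast decoder regardless of the level chosen at compression time.
dctx = zstd.ZstdDecompressor()
assert dctx.decompress(compressed) == data
```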

Cockcroft’s statement initially raised doubts in the community, with some developers questioning how AWS compresses customer data on S3. An AWS employee clarified:

Adrian misspoke, or everyone misunderstands what he meant. What he meant wasn’t that S3 changed the way it stores compressed customer data. What he meant was that AWS changed the way it stores its own service data (mostly logs) in S3 – by going (as a customer of S3 itself) from gzipped logs to zstd-compressed logs, we were able to reduce our S3 storage costs by 30%.
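For readers who want to reproduce this kind of comparison on their own logs, a small sketch along the same lines (not AWS's actual pipeline; the payload below is synthetic, so the resulting ratio will differ from the 30% figure quoted above):

```python
# Hedged sketch: compress the same synthetic log payload with gzip and zstd
# and compare sizes; real savings depend entirely on the actual log data.
import gzip
import zstandard as zstd

logs = b"".join(
    f"2023-01-01T00:00:{i % 60:02d}Z service=frontend status=200 latency_ms={i % 500}\n".encode()
    for i in range(50_000)
)

gz = gzip.compress(logs, compresslevel=6)          # common gzip default level
zs = zstd.ZstdCompressor(level=3).compress(logs)   # zstd's default level

print(f"raw:  {len(logs):>12,} bytes")
print(f"gzip: {len(gz):>12,} bytes ({len(gz) / len(logs):.1%} of raw)")
print(f"zstd: {len(zs):>12,} bytes ({len(zs) / len(logs):.1%} of raw)")
```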

Liz Fong-Jones, principal developer advocate at Honeycomb, agreed on the benefits of switching to zstd:

We don’t use it for column files because it’s too slow, but we do use it for Kafka (…) Honeycomb sees 25% bandwidth savings after switching from Snappy to zstd in prod. (…) It’s not just storage and compute; for us, it’s the NETWORK. AWS inter-AZ data transfer is absurdly expensive.
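In Kafka, the compression codec is primarily a producer-side setting, so a Snappy-to-zstd switch like the one Fong-Jones describes is largely a configuration change (zstd requires brokers on Kafka 2.1 or later). Here is a minimal sketch with the confluent-kafka Python client; the broker address, topic, and payload are placeholders, not Honeycomb's actual setup:

```python
# Producer-side compression sketch with confluent-kafka (pip install confluent-kafka);
# bootstrap server, topic, and payload are placeholders.
from confluent_kafka import Producer

producer = Producer({
    "bootstrap.servers": "localhost:9092",
    "compression.type": "zstd",   # previously "snappy"; batches are compressed on the producer
    "linger.ms": 50,              # small batching delay so batches are large enough to compress well
})

producer.produce("telemetry-events", value=b'{"service": "frontend", "latency_ms": 42}')
producer.flush()
```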

In a popular Reddit thread, user black Knight shared one of many positive comments:

My company did something similar a few years ago and saw similar benefits. We run zstandard everywhere we can, not just storage, but other things like internal HTTP traffic.

On Hacker News, user treffer comments:

Particularly fast compression algorithms (zstd, lz4, snappy, lzo, …) are worth the CPU cost with virtually no downside. The problem is finding the right sweet spot that reduces the current bottleneck without creating a CPU bottleneck, but zstd offers the most flexibility there too.
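One practical way to look for that sweet spot is to sweep zstd levels over a representative sample of one's own data and compare ratio against throughput. A rough sketch follows; the synthetic payload means the absolute numbers are purely illustrative:

```python
# Rough level sweep comparing compression ratio against single-threaded throughput;
# the synthetic payload means the absolute numbers are illustrative only.
import time
import zstandard as zstd

sample = b'{"user_id": 12345, "event": "click", "ts": "2023-01-01T00:00:00Z"}\n' * 100_000

for level in (1, 3, 9, 19):
    cctx = zstd.ZstdCompressor(level=level)
    start = time.perf_counter()
    out = cctx.compress(sample)
    elapsed = time.perf_counter() - start
    mbps = len(sample) / (1024 * 1024) / elapsed
    print(f"level {level:>2}: ratio {len(sample) / len(out):5.1f}x at {mbps:7.1f} MB/s")
```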

AWS supports Zstandard and other compression algorithms in the APIs of some managed services. For example, after introducing Zstandard support for Amazon Redshift, the cloud provider developed its own AZ64 algorithm for the cloud data warehouse. According to the cloud provider, the proprietary encoding consumes 5-10% less storage and is 70% faster than zstd encoding.

Amazon has not issued any official comment regarding the compression technology used for its own internal data or the S3 storage savings involved.
