Let's summarize data compression methods...

It roughly comes out as below. Far too many formats I don't care about got tacked on, so let's cut it down to the important ones and look again.

| Algorithm | Compression Type | Compression Ratio | Compression Speed | Decompression Speed | Use Cases | License | Notes |
|---|---|---|---|---|---|---|---|
| DEFLATE | Lossless | Moderate (20-50%) | Fast | Very Fast | General-purpose, ZIP, gzip files | BSD-like | Widely used, good balance between compression and speed. |
| LZMA (xz) | Lossless | High (30-70%) | Slow | Moderate | High compression, Linux packages | Public domain/0BSD, GPL (scripts) | Excellent compression but slower than DEFLATE. |
| Brotli | Lossless | High (30-70%) | Slow | Fast | Web assets, static content | BSD-like | Optimized for text and HTML; slower to compress but fast to decompress. |
| Zstandard (zstd) | Lossless | High (30-70%) | Fast | Very Fast | General-purpose, real-time compression | BSD-like | Modern, high-speed algorithm with adjustable compression levels. |
| LZ4 | Lossless | Moderate (20-50%) | Very Fast | Very Fast | Real-time data, databases, gaming | BSD-like | Extremely fast, suitable for scenarios where speed is critical. |
| Snappy (Google) | Lossless | Moderate (20-50%) | Fast | Very Fast | Google protocols, real-time systems | BSD-like | Designed for speed, slightly lower compression ratio than LZ4. |
| Gzip | Lossless | Moderate (20-50%) | Slow | Fast | Web servers, backups | GPL | Uses DEFLATE under the hood; older but still widely used. |
| XZ | Lossless | High (30-70%) | Slow | Moderate | Linux packages, high compression | Public domain/0BSD, GPL (scripts) | Uses LZMA; excellent for archiving but slower than other algorithms. |
| BZIP2 | Lossless | High (30-70%) | Slow | Moderate | Tarballs, backups | BSD-like | Older algorithm with good compression but slower than modern alternatives. |
| Zlib | Lossless | Moderate (20-50%) | Fast | Very Fast | General-purpose, games, networking | BSD-like | Lightweight and widely used, but less efficient than newer algorithms. |
| RLE (Run-Length Encoding) | Lossless | Low (10-30%) | Fast | Fast | Simple data, images, audio | Public domain | Simple and fast but limited to repetitive data. |
| Huffman Coding | Lossless | Moderate (20-50%) | Slow | Fast | Text, images, audio | Public domain | Simple and effective for small datasets but not ideal for large data. |
| LZW | Lossless | Moderate (20-50%) | Fast | Fast | GIF, TIFF images | Public domain (patents expired) | Older algorithm with moderate compression; its patents have since expired. |
| LZSS | Lossless | Moderate (20-50%) | Fast | Fast | General-purpose | Public domain | Simple and efficient for small-scale compression. |
| LZ5 | Lossless | High (30-70%) | Moderate | Moderate | High compression, archives | BSD-like | Since renamed Lizard; better compression than LZ4 or Snappy but slower. |
| APNG | Lossless | Moderate (20-50%) | Fast | Fast | Animated images | BSD-like | Optimized for PNG animations. |
| WebP | Lossy/Lossless | High (30-80%) | Moderate | Fast | Web images | BSD-like | Developed by Google for web images, supports both lossy and lossless compression. |
| JPEG | Lossy | High (30-80%) | Slow | Fast | Images | Public domain | Widely used for images, lossy compression. |
| PNG | Lossless | Moderate (20-50%) | Moderate | Fast | Images | Public domain | Lossless compression, widely used for web images. |
| HEIF | Lossy | High (30-80%) | Slow | Fast | Images | Proprietary | Modern image format with better compression than JPEG but patent-encumbered. |
| MP3 | Lossy | High (30-80%) | Slow | Fast | Audio | Various | Popular for audio compression. |
| AAC | Lossy | High (30-80%) | Moderate | Fast | Audio | Various | High-quality audio compression, widely used in multimedia. |
| AV1 | Lossy | High (30-80%) | Slow | Fast | Video | BSD-like | Modern, royalty-free video compression standard. |
| H.264/AVC | Lossy | High (30-80%) | Slow | Fast | Video | MPEG LA | Widely used for video compression, but patent-encumbered. |
| H.265/HEVC | Lossy | High (30-80%) | Slow | Moderate | Video | MPEG LA | Better compression than H.264 but slower and patent-encumbered. |
| VP9 | Lossy | High (30-80%) | Slow | Fast | Video | BSD-like | Google’s royalty-free video compression standard. |
| FLAC | Lossless | Moderate (20-50%) | Slow | Fast | Audio | BSD-like | Lossless audio compression, popular for high-quality audio. |
| OGG Vorbis | Lossy | High (30-70%) | Moderate | Fast | Audio | BSD-like | Popular for lossy audio compression. |
| ZIP | Lossless | Moderate (20-50%) | Moderate | Fast | General-purpose archiving | Public domain | Usually DEFLATE (the format also allows other methods such as bzip2 and LZMA); widely used for file archiving. |
| Tar | None | None | Fast | Fast | Data archiving | Public domain | Archive format with no compression of its own; usually combined with a compressor (e.g., tar.gz). |
| RAR | Lossless | High (30-70%) | Moderate | Moderate | Archiving, backups | Proprietary | Popular but proprietary archive format. |
| 7-Zip | Lossless | High (30-70%) | Slow | Moderate | Archiving, backups | LGPL | Supports multiple algorithms (e.g., LZMA, DEFLATE). |
| WebAssembly (Wasm) | N/A | N/A | N/A | N/A | Web applications | Open standard (W3C) | Binary instruction format, not a compression algorithm; compression libraries can be compiled to it and run in browsers. |
| AEC (Adaptive Entropy Coding) | Lossless | Moderate (20-50%) | Moderate | Fast | Audio, video | Various | Generic term for entropy coders that adapt their probability model; Opus uses a range coder and VP9 an adaptive arithmetic coder. |
| Opus | Lossy | High (30-70%) | Moderate | Fast | Audio | BSD-like | High-quality audio compression for VoIP and streaming. |
| Speex | Lossy | Moderate (20-50%) | Moderate | Fast | Audio | BSD-like | Older audio codec, widely used in VoIP. |
| Theora | Lossy | Moderate (20-50%) | Slow | Fast | Video | BSD-like | Older video codec, widely used in web applications. |
| VP8 | Lossy | Moderate (20-50%) | Moderate | Fast | Video | BSD-like | Google’s older video codec, used in WebM. |
| AVIF | Lossy | High (30-70%) | Slow | Fast | Images | BSD-like | Next-generation image format, based on AV1. |
| TIFF | Lossless | Moderate (20-50%) | Moderate | Fast | Images | Public domain | Flexible format, supports lossless and lossy compression. |
| PDF | Lossless | Moderate (20-50%) | Moderate | Fast | Documents | ISO standard | Supports multiple compression algorithms (e.g., DEFLATE, JPEG). |
| SVG | Lossless | Moderate (20-50%) | Fast | Fast | Vector graphics | Public domain | Text-based format, not optimized for compression. |
| MPEG-4 | Lossy | High (30-70%) | Slow | Fast | Video | MPEG LA | Older video compression standard, widely used. |
| MPEG-2 | Lossy | Moderate (20-50%) | Slow | Fast | Video, DVDs | MPEG LA | Older standard, widely used in broadcasting and storage. |
| MPEG-7 | N/A | N/A | N/A | N/A | Multimedia | MPEG LA | Multimedia content description (metadata) standard, not a compression codec. |
| MPEG-H | Lossy | High (30-70%) | Slow | Fast | Audio, video | MPEG LA | Modern audio and video compression standard. |
| MP4 | N/A (container) | High (30-70%) | Slow | Fast | Video | MPEG LA | Container format; the compression comes from the codecs inside, commonly H.264/MPEG-4 AVC video. |
| AVCHD | Lossy | High (30-70%) | Slow | Fast | Video | MPEG LA | High-definition video compression for Blu-ray and HD camcorders. |
| VP6 | Lossy | Moderate (20-50%) | Moderate | Fast | Video | Proprietary (On2) | Older video codec, widely used in Flash and video streaming. |
| VP7 | Lossy | High (30-70%) | Slow | Fast | Video | Proprietary (On2) | On2's proprietary codec, the predecessor of VP8. |
| VP10 | Lossy | High (30-70%) | Slow | Fast | Video | BSD-like | Planned successor to VP9; never released on its own, as Google folded its development into AV1. |
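
The ratio and speed labels in a table like this are coarse, and real numbers depend heavily on the input. Below is a minimal sketch for checking this on your own data, using only the compressors that ship with Python's standard library (`zlib` for DEFLATE, `bz2`, and `lzma` for xz). The `size_reduction` helper and both sample inputs are made up for illustration, and "percent saved" is just one possible way to express a ratio.

```python
import bz2
import lzma
import os
import zlib


def size_reduction(data: bytes) -> dict:
    """Return the percentage of bytes saved by each stdlib compressor."""
    compressors = {
        "zlib (DEFLATE)": zlib.compress,
        "bz2": bz2.compress,
        "lzma (xz)": lzma.compress,
    }
    return {
        name: 100.0 * (1 - len(fn(data)) / len(data))
        for name, fn in compressors.items()
    }


# Hypothetical samples: highly repetitive text versus random bytes.
text = b"the quick brown fox jumps over the lazy dog\n" * 2000
noise = os.urandom(len(text))

for label, sample in (("text", text), ("random", noise)):
    for name, saved in size_reduction(sample).items():
        print(f"{label:>6}  {name:<15} {saved:7.1f}% saved")
```

Random bytes stand in here for data that is already compressed: every compressor ends up at roughly zero or slightly negative savings on them, while the repetitive text shrinks to a tiny fraction of its size.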

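For now I have re-sorted just the ones I actually use these days. zip is probably the most common and has by far the longest history, but lately I find myself reaching for bzip2 and xz more and more, even though in the old days they were so slow I never even considered them. Run multithreaded, bzip2 is reasonably fast and quite usable, and zlib is quietly embedded in all sorts of apps, so I end up using it fairly often without noticing.

As the table shows at a glance, the one with the fastest compression and decompression that still has a decent ratio is LZ4. In practice it is extremely fast and the compression ratio is not bad at all.

Images and video are already compressed, so compressing them again has almost no effect; text documents and uncompressed user data are what shrink a lot. For that kind of data, the differences between compression methods mostly come down to a trade-off against speed. The basic scheme is to scan the whole file, count how often each symbol occurs, and assign code lengths accordingly (that is, Huffman coding). When the data is large, scanning all of it is impractical, so you either slide a window across it or split it into chunks and compress each one. Another option is to first transform the data into a form that compresses better. The more of these steps get stacked on, though, the more the result depends on the input: they compress well in some cases and barely help in others, so practical usefulness inevitably differs between methods.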
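
The frequency-counting idea behind Huffman coding can be sketched in a few lines of Python. This is only an illustration of the principle (one full pass to count symbols, then shorter codes for more frequent ones), not how any particular compressor implements it. The function names and the sample input are hypothetical, and the cost of storing the code table itself is ignored.

```python
import heapq
from collections import Counter


def huffman_code_lengths(data: bytes) -> dict:
    """Map each byte value in data to its Huffman code length in bits."""
    freq = Counter(data)  # full pass over the input to count symbols
    if not freq:
        return {}
    if len(freq) == 1:  # a lone symbol still needs a 1-bit code
        return {next(iter(freq)): 1}
    # Heap entries: (weight, tie_breaker, {symbol: depth_so_far}).
    # The unique tie_breaker keeps the dicts from ever being compared.
    heap = [(w, i, {sym: 0}) for i, (sym, w) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)  # two least frequent subtrees
        w2, _, b = heapq.heappop(heap)
        merged = {s: d + 1 for s, d in {**a, **b}.items()}  # one level deeper
        heapq.heappush(heap, (w1 + w2, tie, merged))
        tie += 1
    return heap[0][2]


def encoded_bits(data: bytes) -> int:
    """Total payload size in bits when data is coded with those lengths."""
    lengths = huffman_code_lengths(data)
    return sum(lengths[b] for b in data)


sample = b"aaaaaaaabbbbccd"  # hypothetical input: a x8, b x4, c x2, d x1
lengths = huffman_code_lengths(sample)
print({chr(s): n for s, n in sorted(lengths.items())})
print(encoded_bits(sample), "bits vs", 8 * len(sample), "bits raw")
```

For this sample the most frequent byte `a` gets a 1-bit code and the rare `c` and `d` get 3 bits each, so the payload is 25 bits instead of 120. Because the counts come from scanning the entire input first, the same approach on very large data is what pushes real tools toward windows or fixed-size blocks.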

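Compression pays off in hardware too: if data is compressed before being moved around, bus occupancy and memory usage drop noticeably. Data going over a bus or into temporary memory has to be handled quickly, which is exactly where compression would give a large efficiency gain, but for the same reason the compression itself has to be extremely fast. On the OS side, I have heard that when memory pressure is high, inactive apps are quickly compressed and swapped to disk to keep the system stable.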
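
As a rough software analogy for this, the sketch below compresses a buffer one fixed-size page at a time with zlib's fastest level and reports how many bytes would actually need to be moved or stored. Everything here is an assumption for illustration: the 4096-byte page size, the synthetic "inactive app" memory, and the use of zlib, which is chosen only because it ships with Python (real systems tend to use faster compressors such as LZ4).

```python
import zlib

PAGE_SIZE = 4096  # assumed page size; real systems differ


def compress_pages(memory: bytes, level: int = 1) -> list:
    """Compress each page independently using a fast zlib level."""
    return [
        zlib.compress(memory[off:off + PAGE_SIZE], level)
        for off in range(0, len(memory), PAGE_SIZE)
    ]


def decompress_pages(pages: list) -> bytes:
    """Restore the original buffer from independently compressed pages."""
    return b"".join(zlib.decompress(p) for p in pages)


# Hypothetical memory image: six zeroed pages plus some structured data.
memory = bytes(PAGE_SIZE * 6) + b"key=value;" * 1000
pages = compress_pages(memory)
stored = sum(len(p) for p in pages)
print(f"{len(memory)} bytes -> {stored} bytes in {len(pages)} pages")
```

Compressing page by page gives up some ratio compared with compressing the whole buffer at once, but any single page can be restored without touching the others, which is the property that matters when latency is the constraint.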

| Algorithm | Compression Type | Compression Ratio | Compression Speed | Decompression Speed | Use Cases | License | Notes |
|---|---|---|---|---|---|---|---|
| DEFLATE | Lossless | Moderate (20-50%) | Fast | Very Fast | General-purpose, ZIP, gzip files | BSD-like | Widely used, good balance between compression and speed. |
| LZMA (xz) | Lossless | High (30-70%) | Slow | Moderate | High compression, Linux packages | Public domain/0BSD, GPL (scripts) | Excellent compression but slower than DEFLATE. |
| Brotli | Lossless | High (30-70%) | Slow | Fast | Web assets, static content | BSD-like | Optimized for text and HTML; slower to compress but fast to decompress. |
| Zstandard (zstd) | Lossless | High (30-70%) | Fast | Very Fast | General-purpose, real-time compression | BSD-like | Modern, high-speed algorithm with adjustable compression levels. |
| LZ4 | Lossless | Moderate (20-50%) | Very Fast | Very Fast | Real-time data, databases, gaming | BSD-like | Extremely fast, suitable for scenarios where speed is critical. |
| Snappy (Google) | Lossless | Moderate (20-50%) | Fast | Very Fast | Google protocols, real-time systems | BSD-like | Designed for speed, slightly lower compression ratio than LZ4. |
| Gzip | Lossless | Moderate (20-50%) | Slow | Fast | Web servers, backups | GPL | Uses DEFLATE under the hood; older but still widely used. |
| XZ | Lossless | High (30-70%) | Slow | Moderate | Linux packages, high compression | Public domain/0BSD, GPL (scripts) | Uses LZMA; excellent for archiving but slower than other algorithms. |
| BZIP2 | Lossless | High (30-70%) | Slow | Moderate | Tarballs, backups | BSD-like | Older algorithm with good compression but slower than modern alternatives. |
| Zlib | Lossless | Moderate (20-50%) | Fast | Very Fast | General-purpose, games, networking | BSD-like | Lightweight and widely used, but less efficient than newer algorithms. |
| RLE (Run-Length Encoding) | Lossless | Low (10-30%) | Fast | Fast | Simple data, images, audio | Public domain | Simple and fast bu