Investigate the .NET Framework bzip2 library

I looked into libraries that support bzip2 on the .NET Framework and compared their processing speed with Python and bzcat.

This article has a Python version.

Library

I found three different libraries that support bzip2.

  1. SharpZipLib: Reimplemented in managed code
  2. SharpCompress: Reimplemented in managed code
  3. AR.Compression.BZip2: Native library wrapper

For each library, I check how it handles multi-stream data and how long decompression takes.

Multi-stream

Multi-stream refers to data made by compressing several pieces individually with bzip2 and concatenating the results.

Creation example


$ echo -n hello | bzip2 > a.bz2
$ echo -n world | bzip2 > b.bz2
$ cat a.bz2 b.bz2 > ab.bz2

Commands such as bzcat can handle it as is.

$ bzcat ab.bz2
helloworld
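
The same construction can be sketched in Python with the standard bz2 module. This is only an illustration and not one of the measured scripts: each payload is compressed on its own, the results are concatenated, and bz2.decompress is then given the concatenated data.

```python
import bz2

# Compress each piece independently, as bzip2 does for a.bz2 and b.bz2.
a = bz2.compress(b"hello")
b = bz2.compress(b"world")

# Concatenating the compressed streams yields a multi-stream,
# like `cat a.bz2 b.bz2 > ab.bz2`.
ab = a + b

# bz2.decompress accepts concatenated streams, much like bzcat.
result = bz2.decompress(ab)
print(result.decode())  # helloworld
```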

Multi-stream is used by the parallel compressor pbzip2 and by Wikipedia dumps.

[Addition] The same can be done with gzip, which is investigated in the following article.

SharpZipLib

Supports multiple compression algorithms.

The bzip2 implementation is below.

Minimum configuration

We have confirmed that bzip2 can be extracted with only the following 6 files.

Test

Use the library obtained by NuGet instead of the minimum configuration.

nuget install SharpZipLib
cp SharpZipLib.1.2.0/lib/net45/ICSharpCode.SharpZipLib.dll .

Try decompressing ab.bz2, the file created in the multi-stream example.

#r "SharpZipLib.1.2.0/lib/net45/ICSharpCode.SharpZipLib.dll"
open System.IO
open ICSharpCode.SharpZipLib.BZip2
do
    use fs = new FileStream("ab.bz2", FileMode.Open)
    use bz = new BZip2InputStream(fs)
    use sr = new StreamReader(bz)
    printfn "%s" (sr.ReadToEnd())
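
Execution result


hello

Only the first stream is processed.

Looking at BZip2InputStream.cs, it does not appear to be designed for multi-stream, so the streams have to be read one after another. A new BZip2InputStream is created for each stream, with IsStreamOwner = false so that the base FileStream is not closed along with it.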

#r "SharpZipLib.1.2.0/lib/net45/ICSharpCode.SharpZipLib.dll"
open System.IO
open ICSharpCode.SharpZipLib.BZip2
do
    use fs = new FileStream("ab.bz2", FileMode.Open)
    while fs.Position < fs.Length do
        use bz = new BZip2InputStream(fs, IsStreamOwner = false)
        use sr = new StreamReader(bz)
        printfn "%s" (sr.ReadToEnd())

Execution result


hello
world
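
For comparison, Python's bz2.BZ2Decompressor also handles exactly one stream per object. The sketch below is an illustration using in-memory data rather than ab.bz2. It shows that a single decompressor stops after "hello" and leaves the second stream untouched in unused_data, which is why a fresh decompressor is needed for each stream.

```python
import bz2

ab = bz2.compress(b"hello") + bz2.compress(b"world")

# One decompressor object handles exactly one bzip2 stream.
d = bz2.BZ2Decompressor()
first = d.decompress(ab)
print(first)  # b'hello'
print(d.eof)  # True: the end of the first stream was reached

# Bytes that follow the first stream are kept in unused_data.
rest = d.unused_data

# A new decompressor is required to continue with the next stream.
second = bz2.BZ2Decompressor().decompress(rest)
print(second)  # b'world'
```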

SharpCompress

Supports multiple compression algorithms.

The bzip2 implementation is below.

This directory can be extracted and used on its own. The only thing it lacks is a definition of CompressionMode, which can be substituted with the existing enum in System.IO.Compression.

BZip2Stream.cs (additional)


using System.IO.Compression;

Minimum configuration

We have confirmed that bzip2 can be extracted with only the following 3 files.

However, the CBZip2InputStream class is internal, so it must be changed to public.

Test

Use the library obtained by NuGet instead of the minimum configuration.

nuget install sharpcompress
cp SharpCompress.0.25.1/lib/net46/SharpCompress.dll .

Try decompressing ab.bz2, the file created in the multi-stream example.

#r "SharpCompress.dll"
open System.IO
open SharpCompress.Compressors
do
    use fs = new FileStream("ab.bz2", FileMode.Open)
    use bz = new BZip2.BZip2Stream(fs, CompressionMode.Decompress, true)
    use sr = new StreamReader(bz)
    printfn "%s" (sr.ReadToEnd())

Execution result


helloworld

Because it supports multi-stream, everything was read in one go.

If the multi-stream flag is set to false, fs is closed when the end of one stream is reached. Looking at CBZip2InputStream.cs, leaving the base stream open does not seem to be anticipated. So the only way to read the streams sequentially seems to be a workaround that ignores Dispose.

#r "SharpCompress.dll"
open System.IO
open SharpCompress.Compressors
do
    let mutable ignore = true
    use fs = { new FileStream("ab.bz2", FileMode.Open) with
        override __.Dispose disposing = if not ignore then base.Dispose disposing }
    while fs.Position < fs.Length do
        use bz = new BZip2.BZip2Stream(fs, CompressionMode.Decompress, false)
        use sr = new StreamReader(bz)
        printfn "%s" (sr.ReadToEnd())
    ignore <- false

Execution result


hello
world
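
The idea of ignoring the close request until the loop has finished can be sketched in Python as a rough analogy. Everything here is hypothetical and for illustration only: KeepOpen is an in-memory stand-in for the FileStream, and read_one_stream imitates a reader that always closes its base stream after one bzip2 stream. Python's own bz2 module does not need this trick.

```python
import bz2
import io


class KeepOpen(io.BytesIO):
    """Hypothetical stand-in for the FileStream: close() is ignored until allowed."""

    allow_close = False

    def close(self):
        if self.allow_close:
            super().close()


def read_one_stream(f):
    """Hypothetical reader: decompress one bzip2 stream, then close f,
    imitating a reader that always closes its base stream."""
    d = bz2.BZ2Decompressor()
    out = b""
    while not d.eof:
        byte = f.read(1)  # one byte at a time keeps the position exact
        if not byte:
            break
        out += d.decompress(byte)
    f.close()  # ignored by KeepOpen while allow_close is False
    return out


data = bz2.compress(b"hello") + bz2.compress(b"world")
fs = KeepOpen(data)
parts = []
while fs.tell() < len(data):
    parts.append(read_one_stream(fs))

fs.allow_close = True  # from here on, close() really closes
fs.close()
print(parts)  # [b'hello', b'world']
```

Without the overridden close(), the second call to fs.tell() would fail because the stream would already be closed, which mirrors the problem described above.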

AR.Compression.BZip2

Unlike the other libraries, it is dedicated to bzip2. Decompression and compression are implemented in a single class, so I skip the minimum configuration here.

It is registered on NuGet.

Build

Sharing the library with Mono takes a little extra work, so this time I build it myself instead of using NuGet.

Change the DLL name specified for P/Invoke.

10:        private const string DllName = "libbz2";

This will use libbz2.dll on Windows and libbz2.so on WSL. WSL references /usr/lib/libbz2.so even if it is not in the current directory.
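
As a loose analogy in Python, ctypes also lets the platform loader resolve a native library from a short name. The sketch below is only an illustration of that lookup and is unrelated to how the .NET wrapper is built. It assumes libbz2 may or may not be installed, so it falls back to None when the library cannot be found or loaded.

```python
import ctypes
import ctypes.util

# Ask the platform how a library called "bz2" is named; None if it is not found.
name = ctypes.util.find_library("bz2")

version = None
if name is not None:
    try:
        lib = ctypes.CDLL(name)
        # BZ2_bzlibVersion() belongs to the libbz2 C API and returns a C string.
        lib.BZ2_bzlibVersion.restype = ctypes.c_char_p
        version = lib.BZ2_bzlibVersion().decode()
    except OSError:
        # The library was located but could not be loaded on this system.
        version = None

print(name, version)
```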

Build the DLL.

csc -o -out:AR.Compression.BZip2.dll -t:library -unsafe sources/AR.BZip2/*.cs

For bzip2 on Windows, I use the binaries distributed below.

Test

Try decompressing ab.bz2, the file created in the multi-stream example.

#r "AR.Compression.BZip2.dll"
open System.IO
open System.IO.Compression
do
    use fs = new FileStream("ab.bz2", FileMode.Open)
    use bz = new BZip2Stream(fs, CompressionMode.Decompress, false)
    use sr = new StreamReader(bz)
    printfn "%s" (sr.ReadToEnd())

Execution result


helloworld

All streams were decompressed at once. Looking at BZip2Stream.cs, sequential reading does not seem to be anticipated. Given how it handles the base stream, discussed next, sequential reading does not seem possible without modifying the library.

Reading the base stream

Check where each library reads from the base stream passed to it.

SharpZipLib

417:    thech = baseStream.ReadByte();

SharpCompress

231:    int magic0 = bsStream.ReadByte();
232:    int magic1 = bsStream.ReadByte();
233:    int magic2 = bsStream.ReadByte();
242:    int magic3 = bsStream.ReadByte();
378:    thech = (char)bsStream.ReadByte();
649:    thech = (char)bsStream.ReadByte();
717:    thech = (char)bsStream.ReadByte();
814:    thech = (char)bsStream.ReadByte();

AR.Compression.BZip2

  9:    private const int BufferSize = 128 * 1024;
 14:    private readonly byte[] _buffer = new byte[BufferSize];
368:    _data.avail_in = _stream.Read(_buffer, 0, BufferSize);

SharpZipLib and SharpCompress read one byte at a time with ReadByte as needed, so they should not overrun even when they stop at a bzip2 stream boundary. Both use the variable name thech (the character?), so they may share a common origin. (I don't see this variable name in libbz2.)

AR.Compression.BZip2 reads into a fixed-length buffer. Since it does not do the decompression itself, it probably cannot read byte by byte. Even if processing were split at each bzip2 stream, the read would overrun the boundary, so some countermeasure is required.

Reading byte by byte keeps the position of the base stream exact, but it is a disadvantage in terms of processing speed.
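
The overrun caused by fixed-length reads, and one possible countermeasure, can be illustrated with Python's bz2 module. This is a sketch on in-memory data and makes no claim about how AR.Compression.BZip2 could be modified. It relies on BZ2Decompressor.unused_data to learn how many buffered bytes belong to the next stream and then rewinds the base stream by that amount.

```python
import bz2
import io

a = bz2.compress(b"hello")
b = bz2.compress(b"world")
f = io.BytesIO(a + b)  # stand-in for the base stream

# A fixed-length read swallows both streams at once.
chunk = f.read(128 * 1024)
d = bz2.BZ2Decompressor()
out = d.decompress(chunk)

# The base stream is now past the boundary. unused_data tells us
# how many of the bytes already read belong to the next stream.
overrun = len(d.unused_data)
boundary = f.tell() - overrun

# Rewind so that the next reader starts exactly at the second stream.
f.seek(boundary)
next_magic = f.read(3)
print(out, overrun, boundary, next_magic)
```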

Measurement

Compare the time it takes to unpack a huge file.

I use the dump of the Japanese edition of Wikipedia. This file is multi-stream.

Sequential decompression

Decompress the streams one after another with SharpZipLib and SharpCompress.

test1.fsx


#r "SharpZipLib.1.2.0/lib/net45/ICSharpCode.SharpZipLib.dll"
open System
open System.IO
open ICSharpCode.SharpZipLib.BZip2
let target = "jawiki-20200501-pages-articles-multistream.xml.bz2"
do
    use fs = new FileStream(target, FileMode.Open)
    let buffer = Array.zeroCreate<byte>(1024 * 1024)
    let mutable streams, bytes = 0, 0L
    while fs.Position < fs.Length do
        use bz = new BZip2InputStream(fs, IsStreamOwner = false)
        let mutable len = 1
        while len > 0 do
            len <- bz.Read(buffer, 0, buffer.Length)
            bytes <- bytes + int64 len
        streams <- streams + 1
    Console.WriteLine("streams: {0:#,0}, bytes: {1:#,0}", streams, bytes)

test2.fsx


#r "SharpCompress.dll"
open System
open System.IO
open SharpCompress.Compressors
let target = "jawiki-20200501-pages-articles-multistream.xml.bz2"
do
    let mutable ignore = true
    use fs = { new FileStream(target, FileMode.Open) with
        override __.Dispose disposing = if not ignore then base.Dispose disposing }
    let buffer = Array.zeroCreate<byte>(1024 * 1024)
    let mutable streams, bytes = 0, 0L
    while fs.Position < fs.Length do
        use bz = new BZip2.BZip2Stream(fs, CompressionMode.Decompress, false)
        let mutable len = 1
        while len > 0 do
            len <- bz.Read(buffer, 0, buffer.Length)
            bytes <- bytes + int64 len
        streams <- streams + 1
    ignore <- false
    Console.WriteLine("streams: {0:#,0}, bytes: {1:#,0}", streams, bytes)

Execution result


$ time ./test1.exe  # SharpZipLib
streams: 24,957, bytes: 13,023,068,290

real    16m2.849s

$ time ./test2.exe  # SharpCompress
streams: 24,957, bytes: 13,023,068,290

real    18m26.520s

SharpZipLib seems to be faster.

Bulk decompression

Decompress all streams at once with SharpCompress and AR.Compression.BZip2.

test3.fsx


#r "SharpCompress.dll"
open System
open System.IO
open SharpCompress.Compressors
let target = "jawiki-20200501-pages-articles-multistream.xml.bz2"
do
    use fs = new FileStream(target, FileMode.Open)
    use bz = new BZip2.BZip2Stream(fs, CompressionMode.Decompress, true)
    let buffer = Array.zeroCreate<byte>(1024 * 1024)
    let mutable bytes, len = 0L, 1
    while len > 0 do
        len <- bz.Read(buffer, 0, buffer.Length)
        bytes <- bytes + int64 len
    Console.WriteLine("bytes: {0:#,0}", bytes)

test4.fsx


#r "AR.Compression.BZip2.dll"
open System
open System.IO
open System.IO.Compression
let target = "jawiki-20200501-pages-articles-multistream.xml.bz2"
do
    use fs = new FileStream(target, FileMode.Open)
    use bz = new BZip2Stream(fs, CompressionMode.Decompress, false)
    let buffer = Array.zeroCreate<byte>(1024 * 1024)
    let mutable bytes, len = 0L, 1
    while len > 0 do
        len <- bz.Read(buffer, 0, buffer.Length)
        bytes <- bytes + int64 len
    Console.WriteLine("bytes: {0:#,0}", bytes)

Execution result


$ time ./test3.exe  # SharpCompress
bytes: 13,023,068,290

real    17m36.925s

$ time ./test4.exe  # AR.Compression.BZip2
bytes: 13,023,068,290

real    8m23.916s

AR.Compression.BZip2 is fast because it calls the native library.

Python

Compare with Python's bz2 module. It is also a wrapper around a native library.

[Reference] Sequentially decompressing multi-stream bzip2 with Python

test5.py (sequential)


import bz2
target  = "jawiki-20200501-pages-articles-multistream.xml.bz2"
streams = 0
bytes   = 0
size    = 1024 * 1024  # 1MB
with open(target, "rb") as f:
    decompressor = bz2.BZ2Decompressor()
    data = b''
    while data or (data := f.read(size)):
        bytes += len(decompressor.decompress(data))
        data = decompressor.unused_data
        if decompressor.eof:
            decompressor = bz2.BZ2Decompressor()
            streams += 1
print(f"streams: {streams:,}, bytes: {bytes:,}")

test6.py (bulk)


import bz2
target = "jawiki-20200501-pages-articles-multistream.xml.bz2"
bytes  = 0
size   = 1024 * 1024  # 1MB
with bz2.open(target, "rb") as f:
    while (data := f.read(size)):
        bytes += len(data)
print(f"bytes: {bytes:,}")

Execution result


$ time py.exe test5.py  #Sequential
streams: 24,957, bytes: 13,023,068,290

real    8m12.155s

$ time py.exe test6.py  #Bulk
bytes: 13,023,068,290

real    8m1.476s

Most of the processing is done by libbz2, so the remaining overhead is small and the run is fast.

bzcat

The bzcat command on WSL1 is also measured.

$ time bzcat jawiki-20200501-pages-articles-multistream.xml.bz2 > /dev/null

real    8m21.056s
user    8m5.563s
sys     0m15.422s

Summary

Here is a summary of the results, with measurements on WSL1 (Mono) added. Python's speed stands out.

                      Sequential (Win)  Sequential (WSL1)  Bulk (Win)   Bulk (WSL1)
SharpZipLib           16m02.849s        22m49.375s         -            -
SharpCompress         18m26.520s        23m56.694s         17m36.925s   22m54.247s
AR.Compression.BZip2  -                 -                  8m23.916s    8m36.495s
Python (bz2)          8m12.155s         8m45.590s          8m01.476s    8m28.749s
bzcat                 -                 -                  -            8m21.056s

Libraries implemented in managed code took roughly two to three times as long as the native wrappers. Unless you specifically need a pure managed implementation, AR.Compression.BZip2 is the safer choice.
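
To put the comparison in numbers, the Windows timings can be converted to seconds and divided by the AR.Compression.BZip2 result. The helper name seconds is made up for this sketch. The timing strings are copied from the measurements above.

```python
import re


def seconds(t):
    """Convert a `time`-style value such as '16m02.849s' to seconds."""
    m = re.fullmatch(r"(\d+)m([\d.]+)s", t)
    return int(m.group(1)) * 60 + float(m.group(2))


# Windows timings copied from the measurements above.
timings = {
    "SharpZipLib (sequential)": "16m02.849s",
    "SharpCompress (sequential)": "18m26.520s",
    "SharpCompress (bulk)": "17m36.925s",
    "Python bz2 (bulk)": "8m01.476s",
}
baseline = seconds("8m23.916s")  # AR.Compression.BZip2, bulk, Windows

ratios = {name: seconds(t) / baseline for name, t in timings.items()}
for name, r in ratios.items():
    print(f"{name}: {r:.2f}x")
```

On Windows this gives about 1.9x for SharpZipLib and about 2.1x to 2.2x for SharpCompress, while Python stays slightly below 1x.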

Starting with .NET Framework 4.5, the DeflateStream class uses the zlib library for compression.

Related article

See the following articles for Wikipedia dumps.

References

This article deals with bzip2 in SharpZipLib.

An article that mentions SharpCompress.
