This article investigates libraries that support bzip2 on the .NET Framework and compares their processing speed with Python and bzcat.

This article has a Python version.

Sequentially expand multi-stream bzip2 with Python

Library

I found three different libraries that support bzip2.

SharpZipLib: Reimplemented in managed code
SharpCompress: Reimplemented in managed code
AR.Compression.BZip2: Native library wrapper

Check how each of these libraries handles multi-stream data and how long decompression takes.

Multi-stream

Data that is individually compressed with bzip2 and concatenated is called a multi-stream.

`Creation example`


$ echo -n hello | bzip2 > a.bz2
$ echo -n world | bzip2 > b.bz2
$ cat a.bz2 b.bz2 > ab.bz2

A command such as bzcat can handle it as is.

$ bzcat ab.bz2
helloworld
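The same behaviour can be reproduced without the shell. The following is a minimal sketch using Python's standard bz2 module (the variable names are chosen only for illustration): two payloads are compressed independently, concatenated, and decompressed in a single call.

```python
import bz2

# Compress two payloads independently; each call yields one complete bzip2 stream.
a = bz2.compress(b"hello")
b = bz2.compress(b"world")

# Concatenating the streams gives a multi-stream, like `cat a.bz2 b.bz2 > ab.bz2`.
ab = a + b

# bz2.decompress accepts multi-stream input (Python 3.3 or later).
combined = bz2.decompress(ab)
print(combined.decode())
```

Each stream starts with its own `BZh` header, which is why a reader that stops after the first end-of-stream marker sees only `hello`.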

It is used by the parallel compressor pbzip2 and by Wikipedia dumps.

[Addition] The same can be done with gzip, which is investigated in the following article.

Multi-stream GZIP can be read as it is with GZipStream (C#)

SharpZipLib

Supports multiple compression algorithms.

The bzip2 implementation is below.

src/ICSharpCode.SharpZipLib/BZip2

Minimum configuration

I have confirmed that bzip2 data can be decompressed with only the following 6 files.

src/ICSharpCode.SharpZipLib/BZip2/BZip2Constants.cs
src/ICSharpCode.SharpZipLib/BZip2/BZip2Exception.cs
src/ICSharpCode.SharpZipLib/BZip2/BZip2InputStream.cs
src/ICSharpCode.SharpZipLib/Checksum/BZip2Crc.cs
src/ICSharpCode.SharpZipLib/Checksum/IChecksum.cs
src/ICSharpCode.SharpZipLib/Core/Exceptions/SharpZipBaseException.cs

Test

Use the library obtained by NuGet instead of the minimum configuration.

https://www.nuget.org/packages/SharpZipLib/

nuget install SharpZipLib
cp SharpZipLib.1.2.0/lib/net45/ICSharpCode.SharpZipLib.dll .

Try decompressing ab.bz2, which was created in the multi-stream example.

Mono bundles an old version of ICSharpCode.SharpZipLib.dll. If no path is given, that bundled version is referenced, so specify the path to the DLL obtained from NuGet.

#r "SharpZipLib.1.2.0/lib/net45/ICSharpCode.SharpZipLib.dll"
open System.IO
open ICSharpCode.SharpZipLib.BZip2
do
    use fs = new FileStream("ab.bz2", FileMode.Open)
    use bz = new BZip2InputStream(fs)
    use sr = new StreamReader(bz)
    printfn "%s" (sr.ReadToEnd())

hello

Only the first stream is processed.

Looking at BZip2InputStream.cs, it does not appear to be designed for multi-stream data, so the streams must be read one after another.

If `IsStreamOwner` is `true`, the passed stream `fs` is closed when processing completes. The default is `true`, so specify `false`.

#r "SharpZipLib.1.2.0/lib/net45/ICSharpCode.SharpZipLib.dll"
open System.IO
open ICSharpCode.SharpZipLib.BZip2
do
    use fs = new FileStream("ab.bz2", FileMode.Open)
    while fs.Position < fs.Length do
        use bz = new BZip2InputStream(fs, IsStreamOwner = false)
        use sr = new StreamReader(bz)
        printfn "%s" (sr.ReadToEnd())

hello
world
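For comparison, the same one-stream-at-a-time loop can be sketched with Python's bz2 module. This is an illustration under assumptions, not part of any of the .NET libraries: `split_streams` is a hypothetical helper, and it keeps the whole input in memory, which is only reasonable for small data.

```python
import bz2

def split_streams(data: bytes):
    """Yield the decompressed payload of each bzip2 stream in `data`."""
    while data:
        decompressor = bz2.BZ2Decompressor()
        payload = decompressor.decompress(data)
        if not decompressor.eof:
            raise ValueError("truncated bzip2 stream")
        yield payload
        # Bytes after the end of this stream belong to the next one.
        data = decompressor.unused_data

ab = bz2.compress(b"hello") + bz2.compress(b"world")
parts = list(split_streams(ab))
for part in parts:
    print(part.decode())
```

As in the F# loop, a new decompressor is created for every stream, and the loop ends when no input bytes remain.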

SharpCompress

Supports multiple compression algorithms.

The bzip2 implementation is below.

src/SharpCompress/Compressors/BZip2

This directory can be extracted and used independently. The only thing missing is SharpCompress's own definition of CompressionMode, which can be replaced with the existing enum in System.IO.Compression.

`BZip2Stream.cs (additional)`


using System.IO.Compression;

Minimum configuration

I have confirmed that bzip2 data can be decompressed with only the following 3 files.

src/SharpCompress/Compressors/BZip2/BZip2Constants.cs
src/SharpCompress/Compressors/BZip2/CBZip2InputStream.cs
src/SharpCompress/Compressors/BZip2/CRC.cs

However, the class CBZip2InputStream is `internal`, so it must be changed to `public`.

Test

Use the library obtained by NuGet instead of the minimum configuration.

https://www.nuget.org/packages/SharpCompress/

nuget install sharpcompress
cp SharpCompress.0.25.1/lib/net46/SharpCompress.dll .

Try decompressing ab.bz2, which was created in the multi-stream example.

The last argument of the BZip2Stream constructor is a flag to read the multi-stream.

#r "SharpCompress.dll"
open System.IO
open SharpCompress.Compressors
do
    use fs = new FileStream("ab.bz2", FileMode.Open)
    use bz = new BZip2.BZip2Stream(fs, CompressionMode.Decompress, true)
    use sr = new StreamReader(bz)
    printfn "%s" (sr.ReadToEnd())

`Execution result`


helloworld

With support for multi-stream, I was able to read it all at once.

If the multi-stream flag is set to false, fs is closed when the end of one stream is reached. Looking at CBZip2InputStream.cs, keeping the stream open does not seem to be anticipated. The only way to read sequentially therefore appears to be a workaround that ignores Dispose.

#r "SharpCompress.dll"
open System.IO
open SharpCompress.Compressors
do
    let mutable ignore = true
    use fs = { new FileStream("ab.bz2", FileMode.Open) with
        override __.Dispose disposing = if not ignore then base.Dispose disposing }
    while fs.Position < fs.Length do
        use bz = new BZip2.BZip2Stream(fs, CompressionMode.Decompress, false)
        use sr = new StreamReader(bz)
        printfn "%s" (sr.ReadToEnd())
    ignore <- false

`Execution result`


hello
world

AR.Compression.BZip2

Unlike other libraries, it specializes in bzip2. Decompression and compression are implemented in one class, so we'll skip looking at the minimum configuration separately.

It's registered with NuGet.

https://www.nuget.org/packages/AR.Compression.BZip2/

Build

Sharing the library with Mono requires a bit of work, so this time I'll build it myself instead of using NuGet.

Change the DLL name specified for P/Invoke.

sources/AR.BZip2/BZip2Stream_Interop.cs

10:        private const string DllName = "libbz2";

This will use libbz2.dll on Windows and libbz2.so on WSL. WSL references /usr/lib/libbz2.so even if it is not in the current directory.

Build the DLL.

csc -o -out:AR.Compression.BZip2.dll -t:library -unsafe sources/AR.BZip2/*.cs

For bzip2 on Windows, use the binaries distributed below.

Releases · philr/bzip2-windows
DLL only (libbz2.dll) | 64-bit (x64)

Test

Try decompressing ab.bz2, which was created in the multi-stream example.

If the last argument of the BZip2Stream constructor is true, the passed stream fs remains open after processing completes. The default is false, and the example below passes false explicitly. The purpose is the same as SharpZipLib's IsStreamOwner, but the meaning of the flag is reversed. It has nothing to do with multi-stream.

#r "AR.Compression.BZip2.dll"
open System.IO
open System.IO.Compression
do
    use fs = new FileStream("ab.bz2", FileMode.Open)
    use bz = new BZip2Stream(fs, CompressionMode.Decompress, false)
    use sr = new StreamReader(bz)
    printfn "%s" (sr.ReadToEnd())

`Execution result`


helloworld

All streams were expanded at once. Looking at BZip2Stream.cs, sequential reading does not seem to be anticipated. Given how the base stream is handled, which is examined next, sequential reading seems impossible without modifying the library.

Reading the base stream

Check where each library reads the base stream passed to it.

A .NET Stream is called a base stream here to distinguish it from a stream in the bzip2 sense.

SharpZipLib: src/ICSharpCode.SharpZipLib/BZip2/BZip2InputStream.cs

417:                            thech = baseStream.ReadByte();

SharpCompress: src/SharpCompress/Compressors/BZip2/CBZip2InputStream.cs

231:            int magic0 = bsStream.ReadByte();
232:            int magic1 = bsStream.ReadByte();
233:            int magic2 = bsStream.ReadByte();
242:            int magic3 = bsStream.ReadByte();
378:                    thech = (char)bsStream.ReadByte();
649:                                    thech = (char)bsStream.ReadByte();
717:                                                thech = (char)bsStream.ReadByte();
814:                                            thech = (char)bsStream.ReadByte();

AR.Compression.BZip2: sources/AR.BZip2/BZip2Stream.cs

  9:             private const int BufferSize = 128 * 1024;
 14:             private readonly byte[] _buffer = new byte[BufferSize];
368:                                             _data.avail_in = _stream.Read(_buffer, 0, BufferSize);

SharpZipLib and SharpCompress read one byte at a time with ReadByte as needed, so they should not overrun even when they stop at a bzip2 stream boundary. Both use the variable name thech (the character?), so they may share a common origin. (This variable name does not appear in libbz2.)

AR.Compression.BZip2 reads into a fixed-length buffer. Because it does not perform the decompression itself, it probably cannot read byte by byte. Even if processing were split per bzip2 stream, the read would overrun the boundary, so some countermeasure is needed.

Reading byte by byte keeps the position of the base stream accurate, but it is a disadvantage in terms of processing speed.
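The overrun caused by fixed-length buffered reads can be observed with a small Python sketch. It is an analogy under assumptions rather than the behaviour of AR.Compression.BZip2 itself: `io.BytesIO` stands in for the base stream, the buffer size is copied from BZip2Stream.cs, and rewinding by the length of `unused_data` is only one possible countermeasure.

```python
import bz2
import io

BUFFER_SIZE = 128 * 1024  # same value as BufferSize in BZip2Stream.cs

first = bz2.compress(b"hello")
base = io.BytesIO(first + bz2.compress(b"world"))  # stands in for the base stream

decompressor = bz2.BZ2Decompressor()
chunk = base.read(BUFFER_SIZE)  # one fixed-length read swallows both streams
output = decompressor.decompress(chunk)

# The decompressor stopped at the end of the first stream, but the position
# of the base stream has already moved past that point.
overrun = len(decompressor.unused_data)
stream_end = base.tell() - overrun  # true end offset of the first stream

print(output, decompressor.eof, base.tell(), stream_end)

# One possible countermeasure: rewind by the overrun before the next stream.
base.seek(stream_end)
second = bz2.decompress(base.read())
```

A reader that pulls one byte at a time never needs this correction, which is the trade-off between positional accuracy and speed described above.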

Measurement

Compare the time it takes to decompress a huge file.

The dump of the Japanese Wikipedia is used. This file has a multi-stream structure.

https://dumps.wikimedia.org/jawiki/
jawiki-20200501-pages-articles-multistream.xml.bz2 3.0 GB

Sequential expansion

Sequentially expand the stream with SharpZipLib and SharpCompress.

`test1.fsx`


#r "SharpZipLib.1.2.0/lib/net45/ICSharpCode.SharpZipLib.dll"
open System
open System.IO
open ICSharpCode.SharpZipLib.BZip2
let target = "jawiki-20200501-pages-articles-multistream.xml.bz2"
do
    use fs = new FileStream(target, FileMode.Open)
    let buffer = Array.zeroCreate<byte>(1024 * 1024)
    let mutable streams, bytes = 0, 0L
    while fs.Position < fs.Length do
        use bz = new BZip2InputStream(fs, IsStreamOwner = false)
        let mutable len = 1
        while len > 0 do
            len <- bz.Read(buffer, 0, buffer.Length)
            bytes <- bytes + int64 len
        streams <- streams + 1
    Console.WriteLine("streams: {0:#,0}, bytes: {1:#,0}", streams, bytes)

`test2.fsx`


#r "SharpCompress.dll"
open System
open System.IO
open SharpCompress.Compressors
let target = "jawiki-20200501-pages-articles-multistream.xml.bz2"
do
    let mutable ignore = true
    use fs = { new FileStream(target, FileMode.Open) with
        override __.Dispose disposing = if not ignore then base.Dispose disposing }
    let buffer = Array.zeroCreate<byte>(1024 * 1024)
    let mutable streams, bytes = 0, 0L
    while fs.Position < fs.Length do
        use bz = new BZip2.BZip2Stream(fs, CompressionMode.Decompress, false)
        let mutable len = 1
        while len > 0 do
            len <- bz.Read(buffer, 0, buffer.Length)
            bytes <- bytes + int64 len
        streams <- streams + 1
    ignore <- false
    Console.WriteLine("streams: {0:#,0}, bytes: {1:#,0}", streams, bytes)

`Execution result`


$ time ./test1.exe  # SharpZipLib
streams: 24,957, bytes: 13,023,068,290

real    16m2.849s

$ time ./test2.exe  # SharpCompress
streams: 24,957, bytes: 13,023,068,290

real    18m26.520s

SharpZipLib seems to be faster.

Bulk expansion

Expand all streams at once with SharpCompress and AR.Compression.BZip2.

`test3.fsx`


#r "SharpCompress.dll"
open System
open System.IO
open SharpCompress.Compressors
let target = "jawiki-20200501-pages-articles-multistream.xml.bz2"
do
    use fs = new FileStream(target, FileMode.Open)
    use bz = new BZip2.BZip2Stream(fs, CompressionMode.Decompress, true)
    let buffer = Array.zeroCreate<byte>(1024 * 1024)
    let mutable bytes, len = 0L, 1
    while len > 0 do
        len <- bz.Read(buffer, 0, buffer.Length)
        bytes <- bytes + int64 len
    Console.WriteLine("bytes: {0:#,0}", bytes)

`test4.fsx`


#r "AR.Compression.BZip2.dll"
open System
open System.IO
open System.IO.Compression
let target = "jawiki-20200501-pages-articles-multistream.xml.bz2"
do
    use fs = new FileStream(target, FileMode.Open)
    use bz = new BZip2Stream(fs, CompressionMode.Decompress, false)
    let buffer = Array.zeroCreate<byte>(1024 * 1024)
    let mutable bytes, len = 0L, 1
    while len > 0 do
        len <- bz.Read(buffer, 0, buffer.Length)
        bytes <- bytes + int64 len
    Console.WriteLine("bytes: {0:#,0}", bytes)

`Execution result`


$ time ./test3.exe  # SharpCompress
bytes: 13,023,068,290

real    17m36.925s

$ time ./test4.exe  # AR.Compression.BZip2
bytes: 13,023,068,290

real    8m23.916s

AR.Compression.BZip2 is fast because it calls the native library.
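To put the two timings on a common scale, the elapsed times can be converted into decompressed throughput. The following Python sketch is only illustrative arithmetic on the figures printed above; the helper name `throughput_mib_per_s` is hypothetical.

```python
DECOMPRESSED_BYTES = 13_023_068_290  # byte count reported by test3 and test4

def throughput_mib_per_s(minutes: int, seconds: float) -> float:
    """Return decompressed MiB per second for a `real` time of minutes:seconds."""
    elapsed = minutes * 60 + seconds
    return DECOMPRESSED_BYTES / elapsed / (1024 * 1024)

sharpcompress = throughput_mib_per_s(17, 36.925)  # test3.exe
ar_bzip2 = throughput_mib_per_s(8, 23.916)        # test4.exe

print(f"SharpCompress:        {sharpcompress:.1f} MiB/s")
print(f"AR.Compression.BZip2: {ar_bzip2:.1f} MiB/s")
```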

Python

Compare with Python's bz2 module, which is also a native wrapper.

[Reference] Sequentially expand multi-stream bzip2 with Python

`test5.py (sequential)`


import bz2
target  = "jawiki-20200501-pages-articles-multistream.xml.bz2"
streams = 0
bytes   = 0
size    = 1024 * 1024  # 1MB
with open(target, "rb") as f:
    decompressor = bz2.BZ2Decompressor()
    data = b''
    while data or (data := f.read(size)):
        bytes += len(decompressor.decompress(data))
        data = decompressor.unused_data
        if decompressor.eof:
            decompressor = bz2.BZ2Decompressor()
            streams += 1
print(f"streams: {streams:,}, bytes: {bytes:,}")

`test6.py (bulk)`


import bz2
target = "jawiki-20200501-pages-articles-multistream.xml.bz2"
bytes  = 0
size   = 1024 * 1024  # 1MB
with bz2.open(target, "rb") as f:
    while (data := f.read(size)):
        bytes += len(data)
print(f"bytes: {bytes:,}")

`Execution result`


$ time py.exe test5.py  #Sequential
streams: 24,957, bytes: 13,023,068,290

real    8m12.155s

$ time py.exe test6.py  #Bulk
bytes: 13,023,068,290

real    8m1.476s

Most of the processing is done by libbz2 and the remaining overhead is small, so it is fast.

bzcat

The bzcat command on WSL1 is also measured.

$ time bzcat jawiki-20200501-pages-articles-multistream.xml.bz2 > /dev/null

real    8m21.056s
user    8m5.563s
sys     0m15.422s

Summary

The results are summarized below, with the measurements on WSL1 (Mono) added. The speed of Python stands out.

Library               Sequential (Win)  Sequential (WSL1)  Bulk (Win)  Bulk (WSL1)
SharpZipLib           16m02.849s        22m49.375s         -           -
SharpCompress         18m26.520s        23m56.694s         17m36.925s  22m54.247s
AR.Compression.BZip2  -                 -                  8m23.916s   8m36.495s
Python (bz2)          8m12.155s         8m45.590s          8m01.476s   8m28.749s
bzcat                 -                 -                  -           8m21.056s

Libraries implemented in managed code took roughly twice as long or more. Unless you specifically need a managed implementation, AR.Compression.BZip2 is the safer choice.
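The ratios behind this comparison can be checked with a few lines of Python. This is only a sketch: `to_seconds` is a hypothetical helper, and using the Windows bulk time of AR.Compression.BZip2 as the baseline is an assumption made for illustration.

```python
import re

def to_seconds(text: str) -> float:
    """Convert a `time`-style duration such as '16m02.849s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", text)
    if match is None:
        raise ValueError(f"unexpected duration: {text!r}")
    return int(match.group(1)) * 60 + float(match.group(2))

# Windows measurements copied from the summary table.
baseline = to_seconds("8m23.916s")  # AR.Compression.BZip2, bulk
managed = {
    "SharpZipLib (sequential)": "16m02.849s",
    "SharpCompress (sequential)": "18m26.520s",
    "SharpCompress (bulk)": "17m36.925s",
}

ratios = {name: to_seconds(t) / baseline for name, t in managed.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```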

DeflateStream used by System.IO.Compression.GZipStream seems to have switched from its own implementation to a native wrapper.
GZipStream class (System.IO.Compression) | Microsoft Docs

Starting with .NET Framework 4.5, the DeflateStream class uses the zlib library for compression.

See the following articles for Wikipedia dumps.

Retrieve pages from Wikipedia dump

References

This article deals with bzip2 in SharpZipLib.

How to compress and decompress bzip2 in C# - Notes

An article that mentions SharpCompress.

Tsuboyaki ver2 continued Sharp Compress
