This article surveys libraries that support bzip2 in the .NET Framework and compares their processing speed with Python and bzcat.
This article has a Python version.
I found three different libraries that support bzip2.
We check how each library handles multi-stream data and how long decompression takes.
Data that is individually compressed with bzip2 and concatenated is called a multi-stream.
Creation example
$ echo -n hello | bzip2 > a.bz2
$ echo -n world | bzip2 > b.bz2
$ cat a.bz2 b.bz2 > ab.bz2
It can be handled as is by commands such as bzcat.
$ bzcat ab.bz2
helloworld
Multi-stream data is produced by the parallel compressor pbzip2 and is used in Wikipedia dumps.
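The same behavior can be reproduced with Python's standard bz2 module. The following is a minimal sketch with illustrative variable names: the one-shot `bz2.decompress` accepts concatenated streams, while a single `BZ2Decompressor` stops at the end of the first stream and exposes the remainder as `unused_data`.

```python
import bz2

# Compress two payloads independently, then concatenate them
# (the in-memory equivalent of `cat a.bz2 b.bz2 > ab.bz2`).
a = bz2.compress(b"hello")
b = bz2.compress(b"world")
ab = a + b

# The one-shot helper decodes every stream in the input.
whole = bz2.decompress(ab)

# A single decompressor object stops at the end of the first stream.
d = bz2.BZ2Decompressor()
first = d.decompress(ab)
rest = d.unused_data  # bytes that belong to the following stream(s)

print(whole)
print(first, d.eof)
print(rest == b)
```

Whether a library behaves like the first call or the second is exactly the difference examined in the sections that follow.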
[Addition] The same can be done with gzip, which is investigated in the following article.
SharpZipLib
Supports multiple compression algorithms.
The bzip2 implementation is below.
We have confirmed that bzip2 can be extracted with only the following 6 files.
Here we use the library obtained via NuGet rather than the minimum configuration.
nuget install SharpZipLib
cp SharpZipLib.1.2.0/lib/net45/ICSharpCode.SharpZipLib.dll .
Try decompressing ab.bz2, which was created as the multi-stream example.
#r "SharpZipLib.1.2.0/lib/net45/ICSharpCode.SharpZipLib.dll"
open System.IO
open ICSharpCode.SharpZipLib.BZip2
do
    use fs = new FileStream("ab.bz2", FileMode.Open)
    use bz = new BZip2InputStream(fs)
    use sr = new StreamReader(bz)
    printfn "%s" (sr.ReadToEnd())
hello
Only the first stream is processed.
Looking at BZip2InputStream.cs, it seems that it is not made for multi-stream, so it is necessary to read it sequentially.
If IsStreamOwner is true, the stream fs passed in is closed once processing completes. The default is true, so specify false.
#r "SharpZipLib.1.2.0/lib/net45/ICSharpCode.SharpZipLib.dll"
open System.IO
open ICSharpCode.SharpZipLib.BZip2
do
    use fs = new FileStream("ab.bz2", FileMode.Open)
    while fs.Position < fs.Length do
        use bz = new BZip2InputStream(fs, IsStreamOwner = false)
        use sr = new StreamReader(bz)
        printfn "%s" (sr.ReadToEnd())
hello
world
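For comparison, the same "one decoder per stream, shared base stream" loop can be sketched in Python. `iter_streams` is a hypothetical helper name, not part of any library, and it buffers each stream's payload in memory, so it is only meant for small inputs like ab.bz2.

```python
import bz2
import io


def iter_streams(fileobj, chunk_size=64 * 1024):
    """Yield the decompressed payload of each bzip2 stream in fileobj.

    Hypothetical helper: a fresh decompressor is created whenever the
    previous one reports the end of its stream.
    """
    decompressor = bz2.BZ2Decompressor()
    parts = []
    data = b""
    while data or (data := fileobj.read(chunk_size)):
        parts.append(decompressor.decompress(data))
        # Anything left over belongs to the next stream.
        data = decompressor.unused_data
        if decompressor.eof:
            yield b"".join(parts)
            parts = []
            decompressor = bz2.BZ2Decompressor()


# In-memory stand-in for ab.bz2.
base = io.BytesIO(bz2.compress(b"hello") + bz2.compress(b"world"))
payloads = list(iter_streams(base))
for payload in payloads:
    print(payload.decode())
```

The base stream stays open across iterations here, which is the role IsStreamOwner = false plays in the F# version.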
SharpCompress
Supports multiple compression algorithms.
The bzip2 implementation is below.
This directory can be extracted and used independently. The only thing missing is its own definition of CompressionMode, which can be replaced with the existing enum.
BZip2Stream.cs (additional)
using System.IO.Compression;
We have confirmed that bzip2 can be extracted with only the following 3 files.
However, the class CBZip2InputStream is internal, so it must be changed to public.
Here we use the library obtained via NuGet rather than the minimum configuration.
nuget install sharpcompress
cp SharpCompress.0.25.1/lib/net46/SharpCompress.dll .
Try decompressing ab.bz2, which was created as the multi-stream example.
The third argument of the BZip2Stream constructor is a flag for reading multi-stream data.
#r "SharpCompress.dll"
open System.IO
open SharpCompress.Compressors
do
    use fs = new FileStream("ab.bz2", FileMode.Open)
    use bz = new BZip2.BZip2Stream(fs, CompressionMode.Decompress, true)
    use sr = new StreamReader(bz)
    printfn "%s" (sr.ReadToEnd())
Execution result
helloworld
With support for multi-stream, I was able to read it all at once.
If you set the multistream flag to false, fs is closed when the end of one stream is reached. Looking at CBZip2InputStream.cs, it does not seem designed to leave the stream open. Therefore, the only way to read sequentially seems to be a workaround that ignores Dispose.
#r "SharpCompress.dll"
open System.IO
open SharpCompress.Compressors
do
    let mutable ignore = true
    use fs = { new FileStream("ab.bz2", FileMode.Open) with
                    override __.Dispose disposing = if not ignore then base.Dispose disposing }
    while fs.Position < fs.Length do
        use bz = new BZip2.BZip2Stream(fs, CompressionMode.Decompress, false)
        use sr = new StreamReader(bz)
        printfn "%s" (sr.ReadToEnd())
    ignore <- false
Execution result
hello
world
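The "ignore Dispose until we are done" trick is not specific to .NET. As a rough Python sketch of the same pattern, the class below swallows close() until a flag is set. `GuardedBytesIO` and `allow_close` are hypothetical names, and `io.TextIOWrapper` is used only as an example of a wrapper that closes the stream it wraps.

```python
import io


class GuardedBytesIO(io.BytesIO):
    """In-memory stream whose close() is ignored until allow_close is set."""

    def __init__(self, data):
        super().__init__(data)
        self.allow_close = False

    def close(self):
        # Forward the call only once the owner has released the guard.
        if self.allow_close:
            super().close()


base = GuardedBytesIO(b"hello")

# TextIOWrapper closes the stream it wraps when it is closed itself.
with io.TextIOWrapper(base, encoding="ascii") as reader:
    text = reader.read()

still_open = not base.closed  # the wrapper's close() was swallowed
base.allow_close = True       # corresponds to `ignore <- false`
base.close()
print(text, still_open, base.closed)
```

As in the F# version, the owner of the base stream has to remember to release the guard, otherwise the stream is never closed.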
AR.Compression.BZip2
Unlike other libraries, it specializes in bzip2. Decompression and compression are implemented in one class, so we'll skip looking at the minimum configuration separately.
It's registered with NuGet.
Sharing the library with Mono requires a bit of work, so this time I'll build it myself instead of using NuGet.
Change the DLL name specified for P/Invoke.
10: private const string DllName = "libbz2";
This will use libbz2.dll on Windows and libbz2.so on WSL. WSL references /usr/lib/libbz2.so even if it is not in the current directory.
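To get a feel for this kind of platform-dependent name resolution, here is a small Python sketch using ctypes. It is an analogy rather than what Mono or .NET does internally, and it assumes a Linux or WSL environment; on Windows the lookup rules differ and the name passed to `find_library` would likely need to be adjusted.

```python
import ctypes
import ctypes.util

# Ask the platform where a library called "bz2" lives. The result is
# platform dependent and is None when no such library can be located.
name = ctypes.util.find_library("bz2")
print("resolved name:", name)

version = None
if name is not None:
    try:
        lib = ctypes.CDLL(name)
        # BZ2_bzlibVersion() returns a C string describing the library version.
        lib.BZ2_bzlibVersion.restype = ctypes.c_char_p
        version = lib.BZ2_bzlibVersion().decode()
    except (OSError, AttributeError):
        # Loading failed or the symbol is missing; leave version as None.
        pass
print("libbz2 version:", version)
```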
Build the DLL.
csc -o -out:AR.Compression.BZip2.dll -t:library -unsafe sources/AR.BZip2/*.cs
For Windows, the bzip2 binaries distributed below are used.
Try decompressing ab.bz2, which was created as the multi-stream example.
If the third argument of the BZip2Stream constructor is true, the stream fs passed in remains open after processing completes. The default is false, so specify true to keep it open. It serves the same purpose as SharpZipLib's IsStreamOwner, but the meaning of the value is reversed. It has nothing to do with multi-stream.
#r "AR.Compression.BZip2.dll"
open System.IO
open System.IO.Compression
do
    use fs = new FileStream("ab.bz2", FileMode.Open)
    use bz = new BZip2Stream(fs, CompressionMode.Decompress, false)
    use sr = new StreamReader(bz)
    printfn "%s" (sr.ReadToEnd())
Execution result
helloworld
All streams have been expanded at once. Looking at BZip2Stream.cs, it seems that it is not supposed to be read sequentially. Considering the handling of the base stream, which will be seen next, it seems that it cannot be handled without modification.
In each library, check where to read the passed base stream.
Stream in .NET is called base stream to distinguish it from stream in the meaning of bzip2.
SharpZipLib: src/ICSharpCode.SharpZipLib/BZip2/BZip2InputStream.cs
417: thech = baseStream.ReadByte();
SharpCompress: CBZip2InputStream.cs
231: int magic0 = bsStream.ReadByte();
232: int magic1 = bsStream.ReadByte();
233: int magic2 = bsStream.ReadByte();
242: int magic3 = bsStream.ReadByte();
378: thech = (char)bsStream.ReadByte();
649: thech = (char)bsStream.ReadByte();
717: thech = (char)bsStream.ReadByte();
814: thech = (char)bsStream.ReadByte();
AR.Compression.BZip2: BZip2Stream.cs
9: private const int BufferSize = 128 * 1024;
14: private readonly byte[] _buffer = new byte[BufferSize];
368: _data.avail_in = _stream.Read(_buffer, 0, BufferSize);
SharpZipLib and SharpCompress read one byte at a time with ReadByte as needed, so they do not appear to overrun the base stream even when stopping at a bzip2 stream boundary. Since both use the variable name thech (the character?), they may share a common origin. (I don't see this variable name in libbz2.)
AR.Compression.BZip2 reads into a fixed-length buffer. Since it does not perform the decompression itself, reading byte by byte may not be an option. Even if processing were split per bzip2 stream, the read would overrun the boundary, so some countermeasure is required.
Reading byte by byte is accurate with respect to the location of the base stream, but at a disadvantage in terms of processing speed.
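The overrun caused by buffered reads can be observed with Python's bz2 module, which also wraps libbz2. The sketch below is an illustration, not what AR.Compression.BZip2 does: it reads a fixed-size chunk, then uses `unused_data` to work out how far the base stream ran past the boundary and seeks back. The 4096-byte chunk size is an arbitrary assumption.

```python
import bz2
import io

a = bz2.compress(b"hello")
b = bz2.compress(b"world")
base = io.BytesIO(a + b)

decompressor = bz2.BZ2Decompressor()
out = b""
while not decompressor.eof:
    chunk = base.read(4096)  # fixed-size read, like a buffered wrapper
    if not chunk:
        raise EOFError("input ended before the end of the bzip2 stream")
    out += decompressor.decompress(chunk)

# The base stream is now positioned past the end of the first stream.
overrun = len(decompressor.unused_data)
boundary = base.tell() - overrun

# Seek back so that the next reader starts exactly at the second stream.
base.seek(boundary)
second = bz2.BZ2Decompressor().decompress(base.read())
print(out, boundary == len(a), second)
```

A wrapper that does not expose the unconsumed part of its buffer gives the caller no way to perform this correction, which is why modification would be needed.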
Compare the time it takes to unpack a huge file.
The dump data of the Japanese Wikipedia is used. This file has a multi-stream structure.
Sequentially expand the stream with SharpZipLib and SharpCompress.
test1.fsx
#r "SharpZipLib.1.2.0/lib/net45/ICSharpCode.SharpZipLib.dll"
open System
open System.IO
open ICSharpCode.SharpZipLib.BZip2
let target = "jawiki-20200501-pages-articles-multistream.xml.bz2"
do
    use fs = new FileStream(target, FileMode.Open)
    let buffer = Array.zeroCreate<byte>(1024 * 1024)
    let mutable streams, bytes = 0, 0L
    while fs.Position < fs.Length do
        use bz = new BZip2InputStream(fs, IsStreamOwner = false)
        let mutable len = 1
        while len > 0 do
            len <- bz.Read(buffer, 0, buffer.Length)
            bytes <- bytes + int64 len
        streams <- streams + 1
    Console.WriteLine("streams: {0:#,0}, bytes: {1:#,0}", streams, bytes)
test2.fsx
#r "SharpCompress.dll"
open System
open System.IO
open SharpCompress.Compressors
let target = "jawiki-20200501-pages-articles-multistream.xml.bz2"
do
    let mutable ignore = true
    use fs = { new FileStream(target, FileMode.Open) with
                    override __.Dispose disposing = if not ignore then base.Dispose disposing }
    let buffer = Array.zeroCreate<byte>(1024 * 1024)
    let mutable streams, bytes = 0, 0L
    while fs.Position < fs.Length do
        use bz = new BZip2.BZip2Stream(fs, CompressionMode.Decompress, false)
        let mutable len = 1
        while len > 0 do
            len <- bz.Read(buffer, 0, buffer.Length)
            bytes <- bytes + int64 len
        streams <- streams + 1
    ignore <- false
    Console.WriteLine("streams: {0:#,0}, bytes: {1:#,0}", streams, bytes)
Execution result
$ time ./test1.exe # SharpZipLib
streams: 24,957, bytes: 13,023,068,290
real 16m2.849s
$ time ./test2.exe # SharpCompress
streams: 24,957, bytes: 13,023,068,290
real 18m26.520s
SharpZipLib seems to be faster.
Extract all streams at once with SharpCompress and AR.Compression.BZip2.
test3.fsx
#r "SharpCompress.dll"
open System
open System.IO
open SharpCompress.Compressors
let target = "jawiki-20200501-pages-articles-multistream.xml.bz2"
do
    use fs = new FileStream(target, FileMode.Open)
    use bz = new BZip2.BZip2Stream(fs, CompressionMode.Decompress, true)
    let buffer = Array.zeroCreate<byte>(1024 * 1024)
    let mutable bytes, len = 0L, 1
    while len > 0 do
        len <- bz.Read(buffer, 0, buffer.Length)
        bytes <- bytes + int64 len
    Console.WriteLine("bytes: {0:#,0}", bytes)
test4.fsx
#r "AR.Compression.BZip2.dll"
open System
open System.IO
open System.IO.Compression
let target = "jawiki-20200501-pages-articles-multistream.xml.bz2"
do
    use fs = new FileStream(target, FileMode.Open)
    use bz = new BZip2Stream(fs, CompressionMode.Decompress, false)
    let buffer = Array.zeroCreate<byte>(1024 * 1024)
    let mutable bytes, len = 0L, 1
    while len > 0 do
        len <- bz.Read(buffer, 0, buffer.Length)
        bytes <- bytes + int64 len
    Console.WriteLine("bytes: {0:#,0}", bytes)
Execution result
$ time ./test3.exe # SharpCompress
bytes: 13,023,068,290
real 17m36.925s
$ time ./test4.exe # AR.Compression.BZip2
bytes: 13,023,068,290
real 8m23.916s
AR.Compression.BZip2 is fast because it calls the native library.
Python
Compare with Python's bz2 module. This is also a native wrapper.
[Reference] Sequentially expand multi-stream bzip2 with Python
test5.py (sequential)
import bz2
target = "jawiki-20200501-pages-articles-multistream.xml.bz2"
streams = 0
bytes = 0
size = 1024 * 1024 # 1MB
with open(target, "rb") as f:
    decompressor = bz2.BZ2Decompressor()
    data = b''
    while data or (data := f.read(size)):
        bytes += len(decompressor.decompress(data))
        data = decompressor.unused_data
        if decompressor.eof:
            decompressor = bz2.BZ2Decompressor()
            streams += 1
print(f"streams: {streams:,}, bytes: {bytes:,}")
test6.py (bulk)
import bz2
target = "jawiki-20200501-pages-articles-multistream.xml.bz2"
bytes = 0
size = 1024 * 1024 # 1MB
with bz2.open(target, "rb") as f:
    while (data := f.read(size)):
        bytes += len(data)
print(f"bytes: {bytes:,}")
Execution result
$ time py.exe test5.py #Sequential
streams: 24,957, bytes: 13,023,068,290
real 8m12.155s
$ time py.exe test6.py #Bulk
bytes: 13,023,068,290
real 8m1.476s
Most of the processing is done by libbz2, so the remaining overhead is small and the script runs fast.
bzcat
We also measure the bzcat command on WSL1.
$ time bzcat jawiki-20200501-pages-articles-multistream.xml.bz2 > /dev/null
real 8m21.056s
user 8m5.563s
sys 0m15.422s
The results are summarized below, with the measurements on WSL1 (Mono) added. Python's speed stands out.
| | Sequential (Win) | Sequential (WSL1) | Bulk (Win) | Bulk (WSL1) |
|---|---|---|---|---|
| SharpZipLib | 16m02.849s | 22m49.375s | | |
| SharpCompress | 18m26.520s | 23m56.694s | 17m36.925s | 22m54.247s |
| AR.Compression.BZip2 | | | 8m23.916s | 8m36.495s |
| Python (bz2) | 8m12.155s | 8m45.590s | 8m01.476s | 8m28.749s |
| bzcat | | | | 8m21.056s |
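Libraries implemented in managed code took roughly two to three times as long as the native wrappers (about 1.9x to 2.8x the time of AR.Compression.BZip2 on the same platform). Unless you specifically need a managed implementation, AR.Compression.BZip2 is the safer choice.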
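To make the comparison concrete, the measured times can be converted to seconds and expressed as ratios. The following sketch does this for the Windows measurements, using AR.Compression.BZip2 as the baseline. `to_seconds` is a hypothetical helper, and the choice of baseline is an assumption made here for illustration.

```python
import re


def to_seconds(text):
    """Convert a `time`-style duration such as '16m02.849s' to seconds."""
    minutes, seconds = re.fullmatch(r"(\d+)m([\d.]+)s", text).groups()
    return int(minutes) * 60 + float(seconds)


# Windows measurements copied from the results table.
windows = {
    "SharpZipLib (sequential)": "16m02.849s",
    "SharpCompress (sequential)": "18m26.520s",
    "SharpCompress (bulk)": "17m36.925s",
    "Python bz2 (bulk)": "8m01.476s",
}
baseline = to_seconds("8m23.916s")  # AR.Compression.BZip2 (bulk)

# Ratio above 1 means slower than the baseline, below 1 means faster.
ratios = {name: to_seconds(t) / baseline for name, t in windows.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```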
DeflateStream used by System.IO.Compression.GZipStream seems to have switched from its own implementation to a native wrapper.
Starting with .NET Framework 4.5, the DeflateStream class uses the zlib library for compression.
See the following articles for Wikipedia dumps.
This article deals with bzip2 in SharpZipLib.
An article that mentions SharpCompress.