I previously implemented the process of reading the full text from the Wiktionary dump in Python. This time I port it to F# and compare the processing speed.

This article is part of a series.

The scripts for this article are posted in the following repositories.
When I first started working with the Wiktionary dump, I intended to use F#, but the .NET Framework cannot handle bzip2 out of the box, so I implemented it in Python instead. After parallelizing, the whole process finished in a little over a minute, so Python felt fast enough.

Still, I was curious how fast F# would be, so I decided to fill in the missing library and find out.
The environment used for the measurement is as follows.
An external library is required to work with bzip2 in the .NET Framework.
First, read all the lines of the dump and count them. The table below lists the code and the time taken for each approach.
language | code | Time required |
---|---|---|
Python | python/research/countlines.py | 3m34.911s |
F# | fsharp/research/countlines.fsx | 2m40.270s |
command | `bzcat FILE.xml.bz2 \| wc -l` | 2m32.203s |
F# approaches the speed of the raw command.
#r "AR.Compression.BZip2.dll"
open System
open System.IO
open System.IO.Compression
let target = "enwiktionary-20200501-pages-articles-multistream.xml.bz2"
let mutable lines = 0
do
use fs = new FileStream(target, FileMode.Open)
use bs = new BZip2Stream(fs, CompressionMode.Decompress)
use sr = new StreamReader(bs)
while not (isNull (sr.ReadLine())) do
lines <- lines + 1
Console.WriteLine("lines: {0:#,0}", lines)
Next, read the bzip2 file stream by stream. To avoid the overhead of copying through byte arrays, mask the `FileStream` and read from it directly, using the `SubStream` created in the following article.
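The `SubStream` itself is defined in the article referenced above. For orientation, here is a minimal sketch of the idea; this is my own illustrative reconstruction under assumed requirements, not the article's exact code:

```fsharp
open System
open System.IO

// Minimal sketch (assumed implementation): a read-only view of the next
// `length` bytes of a base stream, implementing just enough for
// sequential readers such as BZip2Stream.
type SubStream(baseStream: Stream, length: int) =
    inherit Stream()
    let mutable remaining = length
    override _.CanRead = true
    override _.CanSeek = false
    override _.CanWrite = false
    override _.Length = int64 length
    override _.Position
        with get () = int64 (length - remaining)
        and set _ = raise (NotSupportedException())
    // Cap every read so the consumer never sees past the sub-range.
    override _.Read(buffer: byte[], offset, count) =
        let count = min count remaining
        if count = 0 then 0 else
        let read = baseStream.Read(buffer, offset, count)
        remaining <- remaining - read
        read
    override _.Flush() = ()
    override _.Seek(_, _) = raise (NotSupportedException())
    override _.SetLength(_) = raise (NotSupportedException())
    override _.Write(_, _, _) = raise (NotSupportedException())
    // On Dispose, skip any unread bytes (assumes a seekable base stream).
    override _.Dispose(disposing) =
        if remaining > 0 then
            baseStream.Seek(int64 remaining, SeekOrigin.Current) |> ignore
            remaining <- 0
        base.Dispose(disposing)
```

Skipping the unread remainder on `Dispose` keeps the underlying `FileStream` positioned at the next stream boundary even if the decompressor stops reading early.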
Use the stream length data (`streamlen.tsv`) generated in the previous article.
language | code | Time required |
---|---|---|
Python | python/research/countlines-BytesIO.py | 3m37.827s |
F# | fsharp/research/countlines-split.fsx | 5m23.122s |
Apparently there is non-negligible overhead in starting each `BZip2Stream` read, and this version is quite slow. I used `SubStream` to shave off what overhead I could, but it doesn't come close.
#r "AR.Compression.BZip2.dll"
#load "StreamUtils.fsx"
open System
open System.IO
open System.IO.Compression
open StreamUtils
let target, slen =
use sr = new StreamReader("streamlen.tsv")
sr.ReadLine(),
[| while not <| sr.EndOfStream do
yield sr.ReadLine() |> Convert.ToInt32 |]
let mutable lines = 0
do
use fs = new FileStream(target, FileMode.Open)
for length in slen do
use ss = new SubStream(fs, length)
use bs = new BZip2Stream(ss, CompressionMode.Decompress)
use sr = new StreamReader(bs)
while not (isNull (sr.ReadLine())) do
lines <- lines + 1
Console.WriteLine("lines: {0:#,0}", lines)
When I parallelized in Python, I read 10 streams at a time to reduce the overhead of interprocess communication, so I do the same here. For comparison, I also try reading 100 streams at a time.
language | code | Time required | Remarks |
---|---|---|---|
F# | fsharp/research/countlines-split-10.fsx | 2m50.913s | Every 10 streams |
F# | fsharp/research/countlines-split-100.fsx | 2m40.727s | Every 100 streams |
This is much faster. Since 100 streams at a time is the faster of the two, subsequent steps split every 100 streams.
```fsharp
let mutable lines = 0
do
    use fs = new FileStream(target, FileMode.Open)
    for lengths in Seq.chunkBySize 100 slen do
        use ss = new SubStream(fs, Array.sum lengths)
        use bs = new BZip2Stream(ss, CompressionMode.Decompress)
        use sr = new StreamReader(bs)
        while not (isNull (sr.ReadLine())) do
            lines <- lines + 1
Console.WriteLine("lines: {0:#,0}", lines)
```
Split the lengths into chunks of 100 with `Seq.chunkBySize` and sum each chunk with `Array.sum`.
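For reference, a quick illustration of what these two functions do:

```fsharp
// Seq.chunkBySize splits a sequence into arrays of at most n elements;
// the final chunk may be shorter.
Seq.chunkBySize 3 [1 .. 8] |> Seq.toList
// => [[|1; 2; 3|]; [|4; 5; 6|]; [|7; 8|]]

// Array.sum then gives the combined byte length of each group of streams.
Array.sum [| 10; 20; 30 |]  // => 60
```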
In Python, it was faster to decompress to a string and read it with `StringIO`. Let's try the same approach in F#.
language | code | Time required | Remarks |
---|---|---|---|
Python | python/research/countlines-StringIO.py | 3m18.568s | Per stream |
F# | fsharp/research/countlines-split-string.fsx | 7m50.915s | Per stream |
F# | fsharp/research/countlines-split-string-10.fsx | 3m55.453s | Every 10 streams |
F# | fsharp/research/countlines-split-string-100.fsx | 3m23.417s | Every 100 streams |
This approach is rejected because it is not fast in F#.
```fsharp
open System.Text // for Encoding

let mutable lines = 0
do
    use fs = new FileStream(target, FileMode.Open)
    for length in slen do
        let text =
            use ss = new SubStream(fs, length)
            use bs = new BZip2Stream(ss, CompressionMode.Decompress)
            use ms = new MemoryStream()
            bs.CopyTo(ms)
            Encoding.UTF8.GetString(ms.ToArray())
        use sr = new StringReader(text)
        while not (isNull (sr.ReadLine())) do
            lines <- lines + 1
Console.WriteLine("lines: {0:#,0}", lines)
```
Extract the contents of the XML `<text>` tag and count the lines.
Common part

```fsharp
let mutable lines = 0
do
    use fs = new FileStream(target, FileMode.Open)
    fs.Seek(int64 slen.[0], SeekOrigin.Begin) |> ignore
    for lengths in Seq.chunkBySize 100 slen.[1 .. slen.Length - 2] do
        for _, text in getPages(fs, Array.sum lengths) do
            for _ in text do
                lines <- lines + 1
Console.WriteLine("lines: {0:#,0}", lines)
```
We try several methods below, each of which swaps in a different implementation of `getPages`.

First, parse the tags with line-by-line string processing.
language | code | Time required | Remarks |
---|---|---|---|
Python | python/research/countlines-text.py | 4m06.555s | startswith |
F# | fsharp/research/countlines-text-split-StartsWith.fsx | 4m42.877s | StartsWith |
F# | fsharp/research/countlines-text-split-slice.fsx | 4m14.069s | slice |
F# | fsharp/research/countlines-text-split.fsx | 4m05.507s | Substring |
The .NET Framework's `StartsWith` is slow, apparently because it performs culture-sensitive work such as Unicode normalization. Using `Substring` to do the equivalent of `StartsWith`, F# finally becomes about as fast as Python.
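Incidentally, `StartsWith` also has an overload that takes a `StringComparison`, and passing `StringComparison.Ordinal` should skip the culture-sensitive machinery. I did not measure this variant; a minimal sketch:

```fsharp
open System

// Unmeasured variant: the ordinal overload bypasses the culture-sensitive
// comparison that appears to make the default StartsWith slow.
let inline startsWithOrdinal (target: string) (value: string) =
    target.StartsWith(value, StringComparison.Ordinal)
```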
StartsWith
```fsharp
let getPages(stream, length) = seq {
    use ss = new SubStream(stream, length)
    use bs = new BZip2Stream(ss, CompressionMode.Decompress)
    use sr = new StreamReader(bs)
    let mutable ns, id = 0, 0
    while sr.Peek() <> -1 do
        let mutable line = sr.ReadLine().TrimStart()
        if line.StartsWith "<ns>" then
            ns <- Convert.ToInt32 line.[4 .. line.IndexOf('<', 4) - 1]
            id <- 0
        elif id = 0 && line.StartsWith "<id>" then
            id <- Convert.ToInt32 line.[4 .. line.IndexOf('<', 4) - 1]
        elif line.StartsWith "<text " then
            let p = line.IndexOf '>'
            if line.[p - 1] = '/' then () else
            if ns <> 0 then
                while not <| line.EndsWith "</text>" do
                    line <- sr.ReadLine()
            else
                line <- line.[p + 1 ..]
                yield id, seq {
                    while not <| isNull line do
                        if line.EndsWith "</text>" then
                            if line.Length > 7 then yield line.[.. line.Length - 8]
                            line <- null
                        else
                            yield line
                            line <- sr.ReadLine() } }
```
Next, define a replacement function for `StartsWith` and swap it in.
slice

```fsharp
let inline startsWith (target:string) (value:string) =
    target.Length >= value.Length && target.[.. value.Length - 1] = value
```

Substring

```fsharp
let inline startsWith (target:string) (value:string) =
    target.Length >= value.Length && target.Substring(0, value.Length) = value
```
Next, compare building a tree with an XML parser against just parsing without building a tree.

XML parsing requires a root element. As we have already seen, F# slows down when it goes through strings, so we stitch `Stream`s together instead, using the `ConcatStream` created in the following article.
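As with `SubStream`, the real `ConcatStream` is defined in the referenced article; the following is my minimal illustrative sketch, not the article's exact code. It presents a list of streams as one continuous read-only stream, which lets us wrap the decompressed XML fragment in a synthetic root element without building a string:

```fsharp
open System
open System.IO

// Minimal sketch (assumed implementation): chains a list of streams into
// one sequential read-only stream.
type ConcatStream(streams: Stream list) =
    inherit Stream()
    let mutable rest = streams
    override _.CanRead = true
    override _.CanSeek = false
    override _.CanWrite = false
    override _.Length = raise (NotSupportedException())
    override _.Position
        with get () = raise (NotSupportedException())
        and set _ = raise (NotSupportedException())
    override this.Read(buffer: byte[], offset, count) =
        match rest with
        | [] -> 0
        | s :: tail ->
            let read = s.Read(buffer, offset, count)
            if read > 0 then read
            else
                // Current stream exhausted: dispose it and move on.
                s.Dispose()
                rest <- tail
                this.Read(buffer, offset, count)
    override _.Flush() = ()
    override _.Seek(_, _) = raise (NotSupportedException())
    override _.SetLength(_) = raise (NotSupportedException())
    override _.Write(_, _, _) = raise (NotSupportedException())
    override _.Dispose(disposing) =
        if disposing then for s in rest do s.Dispose()
        base.Dispose(disposing)
```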
language | code | Time required | Remarks |
---|---|---|---|
Python | python/research/countlines-text-xml.py | 5m50.826s | ElementTree.fromstring |
F# | fsharp/research/countlines-text-split-doc.fsx | 6m21.588s | XmlDocument |
F# is slower than Python here.
```fsharp
open System.Xml

let getPages(stream, length) = seq {
    use ss = new SubStream(stream, length)
    use cs = new ConcatStream([
        new MemoryStream("<pages>"B) :> Stream
        new BZip2Stream(ss, CompressionMode.Decompress) :> Stream
        new MemoryStream("</pages>"B) :> Stream ])
    use xr = XmlReader.Create(cs)
    let doc = XmlDocument()
    doc.Load(xr)
    for page in doc.ChildNodes.[0].ChildNodes do
        let ns = Convert.ToInt32(page.SelectSingleNode("ns").InnerText)
        if ns = 0 then
            let id = Convert.ToInt32(page.SelectSingleNode("id").InnerText)
            let text = page.SelectSingleNode("revision/text").InnerText
            use sr = new StringReader(text)
            yield id, seq {
                while sr.Peek() <> -1 do
                    yield sr.ReadLine() } }
```
See the following articles for details on the various Python parsers.

On the F# side, the pull-style `XmlReader` is used.
language | code | Time required | Remarks |
---|---|---|---|
Python | python/research/countlines-text-xmlparser.py | 6m46.163s | XMLParser |
Python | python/research/countlines-text-xmlpull.py | 6m04.553s | XMLPullParser |
Python | python/research/countlines-text-xmliter.py | 6m29.298s | ElementTree.iterparse |
F# | fsharp/research/countlines-text-split-reader.fsx | 3m17.916s | XmlReader |
F# | fsharp/research/countlines-text-reader.fsx | 3m16.122s | XmlReader(undivided) |
In the .NET Framework, `XmlDocument` is itself constructed using `XmlReader`, so using `XmlReader` alone is overwhelmingly faster. In the steps that follow, only the `XmlReader` method is used.

The situation should be the same in Python, but its tree construction is apparently efficient enough that building the tree actually comes out faster.
```fsharp
let getPages(stream, length) = seq {
    use ss = new SubStream(stream, length)
    use cs = new ConcatStream([
        new MemoryStream("<pages>"B) :> Stream
        new BZip2Stream(ss, CompressionMode.Decompress) :> Stream
        new MemoryStream("</pages>"B) :> Stream ])
    use xr = XmlReader.Create(cs)
    let mutable ns, id = 0, 0
    while xr.Read() do
        if xr.NodeType = XmlNodeType.Element then
            match xr.Name with
            | "ns" ->
                if xr.Read() then ns <- Convert.ToInt32 xr.Value
                id <- 0
            | "id" ->
                if id = 0 && xr.Read() then id <- Convert.ToInt32 xr.Value
            | "text" ->
                if ns = 0 && not xr.IsEmptyElement && xr.Read() then
                    yield id, seq {
                        use sr = new StringReader(xr.Value)
                        while sr.Peek() <> -1 do
                            yield sr.ReadLine() }
            | _ -> () }
```
From the code so far, we factor out the common parts.

Create a table recording which entries contain data for which language (`output1.tsv`). Language names are normalized out into a separate table (`output2.tsv`).
```fsharp
do
    use sw = new StreamWriter("output1.tsv")
    sw.NewLine <- "\n"
    for id, lid in results do
        fprintfn sw "%d\t%d" id lid

do
    use sw = new StreamWriter("output2.tsv")
    sw.NewLine <- "\n"
    for kv in langs do
        fprintfn sw "%d\t%s" kv.Value kv.Key
```
Within an article, each language's section is introduced by a line of the form `== language name ==`, so we detect those lines and associate them with the page id. They must be distinguished from lower-level headings, which use `=== item name ===`.
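As an illustration of the distinction (the sample lines here are made up), the regular expression used later picks out only the level-2 heading:

```fsharp
open System.Text.RegularExpressions

// "^==" anchors at the start of the line; "[^=]" rejects a third '=',
// so "===" sub-headings do not match.
let r = Regex "^==([^=].*)=="

for line in [ "==English=="; "===Noun==="; "plain text" ] do
    let m = r.Match line
    if m.Success then
        printfn "language: %s" (m.Groups.[1].Value.Trim())
// Prints only: language: English
```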
As we have already seen, `StartsWith` is slow. We compare examining the line one character at a time from the start against using a regular expression.

For the XML parsing, Python uses string processing and F# uses `XmlReader`.
language | code | Time required | Remarks |
---|---|---|---|
Python | python/research/checklang.py | 4m26.421s | startswith |
F# | fsharp/research/checklang-StartsWith.fsx | 3m43.965s | StartsWith |
Python | python/research/checklang-ch.py | 4m30.566s | Character by character |
F# | fsharp/research/checklang.fsx | 3m24.302s | Character by character |
Python | python/research/checklang-re.py | 5m9.869s | Regular expressions |
F# | fsharp/research/checklang-re.fsx | 3m46.270s | Regular expressions |
In F#, examining each character is fast, but it is tedious to implement and not very versatile. Regular expressions are almost as fast as `StartsWith`; considering versatility, they seem like the safe choice.
StartsWith
```fsharp
for line in text do
    if line.StartsWith "==" && not <| line.StartsWith "===" then
        let lang = line.[2..].Trim()
        let mutable e = lang.Length - 1
        while e > 0 && lang.[e] = '=' do e <- e - 1
        let lang = lang.[..e].Trim()
```
Character by character
```fsharp
for line in text do
    if line.Length >= 3 && line.[0] = '=' && line.[1] = '=' && line.[2] <> '=' then
```
Regular expressions
```fsharp
for line in text do
    let m = r.Match line
    if m.Success then
        let lang = m.Groups.[1].Value.Trim()
```
Parallelization uses multithreading in F# and multiprocessing in Python.
language | code | Time required | Remarks |
---|---|---|---|
Python | python/research/checklang-parallel.py | 1m16.566s | startswith |
F# | fsharp/research/checklang-parallel.fsx | 1m03.941s | Character by character |
Python | python/research/checklang-parallel-re.py | 1m19.372s | Regular expressions |
F# | fsharp/research/checklang-parallel-re.fsx | 1m07.009s | Regular expressions |
It's a close race. The speed-up from parallelization is larger in Python.
Character by character
```fsharp
let getlangs(pos, length) = async { return [
    use fs = new FileStream(target, FileMode.Open, FileAccess.Read)
    fs.Seek(pos, SeekOrigin.Begin) |> ignore
    for id, text in MediaWikiParse.getPages(fs, length) do
        for line in text do
            if line.Length >= 3 && line.[0] = '=' && line.[1] = '=' && line.[2] <> '=' then
                let lang = line.[2..].Trim()
                let mutable e = lang.Length - 1
                while e > 0 && lang.[e] = '=' do e <- e - 1
                yield id, lang.[..e].Trim() ]}
```
Regular expressions
```fsharp
let getlangs(pos, length) = async { return [
    let r = Regex "^==([^=].*)=="
    use fs = new FileStream(target, FileMode.Open, FileAccess.Read)
    fs.Seek(pos, SeekOrigin.Begin) |> ignore
    for id, text in MediaWikiParse.getPages(fs, length) do
        for line in text do
            let m = r.Match line
            if m.Success then
                yield id, m.Groups.[1].Value.Trim() ]}
```
```fsharp
let results =
    sposlen.[1 .. sposlen.Length - 2]
    |> Seq.chunkBySize 100
    |> Seq.map Array.unzip
    |> Seq.map (fun (ps, ls) -> Array.min ps, Array.sum ls)
    |> Seq.map getlangs
    |> Async.Parallel
    |> Async.RunSynchronously
    |> List.concat
```
```fsharp
open System.Collections.Generic

let langs = Dictionary<string, int>()

do
    use sw = new StreamWriter("output1.tsv")
    sw.NewLine <- "\n"
    for id, lang in results do
        let lid =
            if langs.ContainsKey lang then langs.[lang] else
            let lid = langs.Count + 1
            langs.[lang] <- lid
            lid
        fprintfn sw "%d\t%d" id lid

do
    use sw = new StreamWriter("output2.tsv")
    sw.NewLine <- "\n"
    for kv in langs do
        fprintfn sw "%d\t%s" kv.Value kv.Key
```
I also measured with .NET Core and Mono on WSL 1, to compare with the .NET Framework results on Windows.

The abbreviations in the table correspond as follows.
code | Framework | Core | Mono | Remarks |
---|---|---|---|---|
fsharp/research/checklang.fsx | 3m24.302s | 3m25.545s | 4m22.330s | |
fsharp/research/checklang-re.fsx | 3m46.270s | 3m42.882s | 4m51.236s | Regular expressions |
fsharp/research/checklang-parallel.fsx | 1m03.941s | 0m59.014s | 2m39.716s | Parallel |
fsharp/research/checklang-parallel-re.fsx | 1m07.009s | 1m06.136s | 3m28.074s | Parallel, regular expression |
.NET Core seems to be as fast as or a bit faster than the .NET Framework.
.NET Core normally expects you to create a project, but since there are multiple executables and that would have been tedious, I coped by writing the configuration using the method described in the following article.
Using the fastest method for each language, F# came out slightly ahead. It was striking that a straight port of the same processing could still lose to Python in some cases.

As I found in the following article, character-level processing is where the difference becomes overwhelming.