I previously implemented the process of reading the full text from the Wiktionary dump in Python. This time I port it to F# and compare the processing speed.

This article is part of a series.

The scripts for this article are posted in the following repositories.
When I first started working with the Wiktionary dump, I intended to use F#, but the .NET Framework cannot handle bzip2 out of the box, so I implemented it in Python instead. After parallelizing, the whole process finished in a little over a minute, so Python felt fast enough.

Still, I was curious how fast F# would be, so I decided to fill in the missing library and find out.
The environment used for the measurement is as follows.
An external library is required to work with bzip2 in the .NET Framework.
First, read all the lines of the dump and count them. The table below lists the code and the time taken for each approach.
language | code | Time required |
---|---|---|
Python | python/research/countlines.py | 3m34.911s |
F# | fsharp/research/countlines.fsx | 2m40.270s |
command | `bzcat FILE.xml.bz2 \| wc -l` | 2m32.203s |
F# approaches the speed of the raw command.
#r "AR.Compression.BZip2.dll"
open System
open System.IO
open System.IO.Compression
let target = "enwiktionary-20200501-pages-articles-multistream.xml.bz2"
let mutable lines = 0
do
use fs = new FileStream(target, FileMode.Open)
use bs = new BZip2Stream(fs, CompressionMode.Decompress)
use sr = new StreamReader(bs)
while not (isNull (sr.ReadLine())) do
lines <- lines + 1
Console.WriteLine("lines: {0:#,0}", lines)
Next, read the bzip2 file stream by stream. To avoid the overhead of copying through byte arrays, mask the `FileStream` and read from it directly, using the `SubStream` created in the following article.
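The `SubStream` itself is defined in the article referenced above. For orientation, here is a minimal sketch of the idea; this is my own illustrative reconstruction under assumed requirements, not the article's exact code:

```fsharp
open System
open System.IO

// Minimal sketch (assumed implementation): a read-only view of the next
// `length` bytes of a base stream, implementing just enough for
// sequential readers such as BZip2Stream.
type SubStream(baseStream: Stream, length: int) =
    inherit Stream()
    let mutable remaining = length
    override _.CanRead = true
    override _.CanSeek = false
    override _.CanWrite = false
    override _.Length = int64 length
    override _.Position
        with get () = int64 (length - remaining)
        and set _ = raise (NotSupportedException())
    // Cap every read so the consumer never sees past the sub-range.
    override _.Read(buffer: byte[], offset, count) =
        let count = min count remaining
        if count = 0 then 0 else
        let read = baseStream.Read(buffer, offset, count)
        remaining <- remaining - read
        read
    override _.Flush() = ()
    override _.Seek(_, _) = raise (NotSupportedException())
    override _.SetLength(_) = raise (NotSupportedException())
    override _.Write(_, _, _) = raise (NotSupportedException())
    // On Dispose, skip any unread bytes (assumes a seekable base stream).
    override _.Dispose(disposing) =
        if remaining > 0 then
            baseStream.Seek(int64 remaining, SeekOrigin.Current) |> ignore
            remaining <- 0
        base.Dispose(disposing)
```

Skipping the unread remainder on `Dispose` keeps the underlying `FileStream` positioned at the next stream boundary even if the decompressor stops reading early.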
Use the stream length data (`streamlen.tsv`) generated in the previous article.
language | code | Time required |
---|---|---|
Python | python/research/countlines-BytesIO.py | 3m37.827s |
F# | fsharp/research/countlines-split.fsx | 5m23.122s |
Apparently there is non-negligible overhead in starting each `BZip2Stream` read, and this version is quite slow. I used `SubStream` to shave off what overhead I could, but it doesn't come close.
#r "AR.Compression.BZip2.dll"
#load "StreamUtils.fsx"
open System
open System.IO
open System.IO.Compression
open StreamUtils
let target, slen =
use sr = new StreamReader("streamlen.tsv")
sr.ReadLine(),
[| while not <| sr.EndOfStream do
yield sr.ReadLine() |> Convert.ToInt32 |]
let mutable lines = 0
do
use fs = new FileStream(target, FileMode.Open)
for length in slen do
use ss = new SubStream(fs, length)
use bs = new BZip2Stream(ss, CompressionMode.Decompress)
use sr = new StreamReader(bs)
while not (isNull (sr.ReadLine())) do
lines <- lines + 1
Console.WriteLine("lines: {0:#,0}", lines)
When I parallelized in Python, I read 10 streams at a time to reduce the overhead of interprocess communication, so I do the same here. For comparison, I also try reading 100 streams at a time.
language | code | Time required | Remarks |
---|---|---|---|
F# | fsharp/research/countlines-split-10.fsx | 2m50.913s | Every 10 streams |
F# | fsharp/research/countlines-split-100.fsx | 2m40.727s | Every 100 streams |
This is much faster. Since 100 streams at a time is the faster of the two, subsequent steps split every 100 streams.
```fsharp
let mutable lines = 0
do
    use fs = new FileStream(target, FileMode.Open)
    for lengths in Seq.chunkBySize 100 slen do
        use ss = new SubStream(fs, Array.sum lengths)
        use bs = new BZip2Stream(ss, CompressionMode.Decompress)
        use sr = new StreamReader(bs)
        while not (isNull (sr.ReadLine())) do
            lines <- lines + 1
Console.WriteLine("lines: {0:#,0}", lines)
```
Split the lengths into chunks of 100 with `Seq.chunkBySize` and sum each chunk with `Array.sum`.
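For reference, a quick illustration of what these two functions do:

```fsharp
// Seq.chunkBySize splits a sequence into arrays of at most n elements;
// the final chunk may be shorter.
Seq.chunkBySize 3 [1 .. 8] |> Seq.toList
// => [[|1; 2; 3|]; [|4; 5; 6|]; [|7; 8|]]

// Array.sum then gives the combined byte length of each group of streams.
Array.sum [| 10; 20; 30 |]  // => 60
```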
In Python, it was faster to decompress to a string and read it with `StringIO`. Let's try the same approach in F#.
language | code | Time required | Remarks |
---|---|---|---|
Python | python/research/countlines-StringIO.py | 3m18.568s | Per stream |
F# | fsharp/research/countlines-split-string.fsx | 7m50.915s | Per stream |
F# | fsharp/research/countlines-split-string-10.fsx | 3m55.453s | Every 10 streams |
F# | fsharp/research/countlines-split-string-100.fsx | 3m23.417s | Every 100 streams |
This approach is rejected because it is not fast in F#.
```fsharp
open System.Text // for Encoding

let mutable lines = 0
do
    use fs = new FileStream(target, FileMode.Open)
    for length in slen do
        let text =
            use ss = new SubStream(fs, length)
            use bs = new BZip2Stream(ss, CompressionMode.Decompress)
            use ms = new MemoryStream()
            bs.CopyTo(ms)
            Encoding.UTF8.GetString(ms.ToArray())
        use sr = new StringReader(text)
        while not (isNull (sr.ReadLine())) do
            lines <- lines + 1
Console.WriteLine("lines: {0:#,0}", lines)
```
Extract the contents of the XML `<text>` tag and count the lines.
Common part

```fsharp
let mutable lines = 0
do
    use fs = new FileStream(target, FileMode.Open)
    fs.Seek(int64 slen.[0], SeekOrigin.Begin) |> ignore
    for lengths in Seq.chunkBySize 100 slen.[1 .. slen.Length - 2] do
        for _, text in getPages(fs, Array.sum lengths) do
            for _ in text do
                lines <- lines + 1
Console.WriteLine("lines: {0:#,0}", lines)
```
We try several methods below, each of which swaps in a different implementation of `getPages`.

First, parse the tags with line-by-line string processing.
language | code | Time required | Remarks |
---|---|---|---|
Python | python/research/countlines-text.py | 4m06.555s | startswith |
F# | fsharp/research/countlines-text-split-StartsWith.fsx | 4m42.877s | StartsWith |
F# | fsharp/research/countlines-text-split-slice.fsx | 4m14.069s | slice |
F# | fsharp/research/countlines-text-split.fsx | 4m05.507s | Substring |
The .NET Framework's `StartsWith` is slow, apparently because it performs culture-sensitive work such as Unicode normalization. Using `Substring` to do the equivalent of `StartsWith`, F# finally becomes about as fast as Python.
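Incidentally, `StartsWith` also has an overload that takes a `StringComparison`, and passing `StringComparison.Ordinal` should skip the culture-sensitive machinery. I did not measure this variant; a minimal sketch:

```fsharp
open System

// Unmeasured variant: the ordinal overload bypasses the culture-sensitive
// comparison that appears to make the default StartsWith slow.
let inline startsWithOrdinal (target: string) (value: string) =
    target.StartsWith(value, StringComparison.Ordinal)
```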
StartsWith
```fsharp
let getPages(stream, length) = seq {
    use ss = new SubStream(stream, length)
    use bs = new BZip2Stream(ss, CompressionMode.Decompress)
    use sr = new StreamReader(bs)
    let mutable ns, id = 0, 0
    while sr.Peek() <> -1 do
        let mutable line = sr.ReadLine().TrimStart()
        if line.StartsWith "<ns>" then
            ns <- Convert.ToInt32 line.[4 .. line.IndexOf('<', 4) - 1]
            id <- 0
        elif id = 0 && line.StartsWith "<id>" then
            id <- Convert.ToInt32 line.[4 .. line.IndexOf('<', 4) - 1]
        elif line.StartsWith "<text " then
            let p = line.IndexOf '>'
            if line.[p - 1] = '/' then () else
            if ns <> 0 then
                while not <| line.EndsWith "</text>" do
                    line <- sr.ReadLine()
            else
                line <- line.[p + 1 ..]
                yield id, seq {
                    while not <| isNull line do
                        if line.EndsWith "</text>" then
                            if line.Length > 7 then yield line.[.. line.Length - 8]
                            line <- null
                        else
                            yield line
                            line <- sr.ReadLine() } }
```
Next, define a replacement function for `StartsWith` and swap it in.
slice

```fsharp
let inline startsWith (target:string) (value:string) =
    target.Length >= value.Length && target.[.. value.Length - 1] = value
```

Substring

```fsharp
let inline startsWith (target:string) (value:string) =
    target.Length >= value.Length && target.Substring(0, value.Length) = value
```
Next, compare building a tree with an XML parser against just parsing without building a tree.

XML parsing requires a root element. As we have already seen, F# slows down when it goes through strings, so we stitch `Stream`s together instead, using the `ConcatStream` created in the following article.
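As with `SubStream`, the real `ConcatStream` is defined in the referenced article; the following is my minimal illustrative sketch, not the article's exact code. It presents a list of streams as one continuous read-only stream, which lets us wrap the decompressed XML fragment in a synthetic root element without building a string:

```fsharp
open System
open System.IO

// Minimal sketch (assumed implementation): chains a list of streams into
// one sequential read-only stream.
type ConcatStream(streams: Stream list) =
    inherit Stream()
    let mutable rest = streams
    override _.CanRead = true
    override _.CanSeek = false
    override _.CanWrite = false
    override _.Length = raise (NotSupportedException())
    override _.Position
        with get () = raise (NotSupportedException())
        and set _ = raise (NotSupportedException())
    override this.Read(buffer: byte[], offset, count) =
        match rest with
        | [] -> 0
        | s :: tail ->
            let read = s.Read(buffer, offset, count)
            if read > 0 then read
            else
                // Current stream exhausted: dispose it and move on.
                s.Dispose()
                rest <- tail
                this.Read(buffer, offset, count)
    override _.Flush() = ()
    override _.Seek(_, _) = raise (NotSupportedException())
    override _.SetLength(_) = raise (NotSupportedException())
    override _.Write(_, _, _) = raise (NotSupportedException())
    override _.Dispose(disposing) =
        if disposing then for s in rest do s.Dispose()
        base.Dispose(disposing)
```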
language | code | Time required | Remarks |
---|---|---|---|
Python | python/research/countlines-text-xml.py | 5m50.826s | ElementTree.fromstring |
F# | fsharp/research/countlines-text-split-doc.fsx | 6m21.588s | XmlDocument |
F# is slower than Python here.
```fsharp
open System.Xml

let getPages(stream, length) = seq {
    use ss = new SubStream(stream, length)
    use cs = new ConcatStream([
        new MemoryStream("<pages>"B) :> Stream
        new BZip2Stream(ss, CompressionMode.Decompress) :> Stream
        new MemoryStream("</pages>"B) :> Stream ])
    use xr = XmlReader.Create(cs)
    let doc = XmlDocument()
    doc.Load(xr)
    for page in doc.ChildNodes.[0].ChildNodes do
        let ns = Convert.ToInt32(page.SelectSingleNode("ns").InnerText)
        if ns = 0 then
            let id = Convert.ToInt32(page.SelectSingleNode("id").InnerText)
            let text = page.SelectSingleNode("revision/text").InnerText
            use sr = new StringReader(text)
            yield id, seq {
                while sr.Peek() <> -1 do
                    yield sr.ReadLine() } }
```
See the following articles for details on the various Python parsers.

On the F# side, the pull-style `XmlReader` is used.
language | code | Time required | Remarks |
---|---|---|---|
Python | python/research/countlines-text-xmlparser.py | 6m46.163s | XMLParser |
Python | python/research/countlines-text-xmlpull.py | 6m04.553s | XMLPullParser |
Python | python/research/countlines-text-xmliter.py | 6m29.298s | ElementTree.iterparse |
F# | fsharp/research/countlines-text-split-reader.fsx | 3m17.916s | XmlReader |
F# | fsharp/research/countlines-text-reader.fsx | 3m16.122s | XmlReader(undivided) |
In the .NET Framework, `XmlDocument` is itself constructed using `XmlReader`, so using `XmlReader` alone is overwhelmingly faster. In the steps that follow, only the `XmlReader` method is used.

The situation should be the same in Python, but its tree construction is apparently efficient enough that building the tree actually comes out faster.
```fsharp
let getPages(stream, length) = seq {
    use ss = new SubStream(stream, length)
    use cs = new ConcatStream([
        new MemoryStream("<pages>"B) :> Stream
        new BZip2Stream(ss, CompressionMode.Decompress) :> Stream
        new MemoryStream("</pages>"B) :> Stream ])
    use xr = XmlReader.Create(cs)
    let mutable ns, id = 0, 0
    while xr.Read() do
        if xr.NodeType = XmlNodeType.Element then
            match xr.Name with
            | "ns" ->
                if xr.Read() then ns <- Convert.ToInt32 xr.Value
                id <- 0
            | "id" ->
                if id = 0 && xr.Read() then id <- Convert.ToInt32 xr.Value
            | "text" ->
                if ns = 0 && not xr.IsEmptyElement && xr.Read() then
                    yield id, seq {
                        use sr = new StringReader(xr.Value)
                        while sr.Peek() <> -1 do
                            yield sr.ReadLine() }
            | _ -> () }
```
From the code so far, we factor out the common parts.

Create a table recording which entries contain data for which language (`output1.tsv`). Language names are normalized out into a separate table (`output2.tsv`).
```fsharp
do
    use sw = new StreamWriter("output1.tsv")
    sw.NewLine <- "\n"
    for id, lid in results do
        fprintfn sw "%d\t%d" id lid

do
    use sw = new StreamWriter("output2.tsv")
    sw.NewLine <- "\n"
    for kv in langs do
        fprintfn sw "%d\t%s" kv.Value kv.Key
```
Within an article, each language's section is introduced by a line of the form `== language name ==`, so we detect those lines and associate them with the page id. They must be distinguished from lower-level headings, which use `=== item name ===`.
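As an illustration of the distinction (the sample lines here are made up), the regular expression used later picks out only the level-2 heading:

```fsharp
open System.Text.RegularExpressions

// "^==" anchors at the start of the line; "[^=]" rejects a third '=',
// so "===" sub-headings do not match.
let r = Regex "^==([^=].*)=="

for line in [ "==English=="; "===Noun==="; "plain text" ] do
    let m = r.Match line
    if m.Success then
        printfn "language: %s" (m.Groups.[1].Value.Trim())
// Prints only: language: English
```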
As we have already seen, `StartsWith` is slow. We compare examining the line one character at a time from the start against using a regular expression.

For the XML parsing, Python uses string processing and F# uses `XmlReader`.
language | code | Time required | Remarks |
---|---|---|---|
Python | python/research/checklang.py | 4m26.421s | startswith |
F# | fsharp/research/checklang-StartsWith.fsx | 3m43.965s | StartsWith |
Python | python/research/checklang-ch.py | 4m30.566s | Character by character |
F# | fsharp/research/checklang.fsx | 3m24.302s | Character by character |
Python | python/research/checklang-re.py | 5m9.869s | Regular expressions |
F# | fsharp/research/checklang-re.fsx | 3m46.270s | Regular expressions |
In F#, examining each character is fast, but it is tedious to implement and not very versatile. Regular expressions are almost as fast as `StartsWith`; considering versatility, they seem like the safe choice.
StartsWith
```fsharp
for line in text do
    if line.StartsWith "==" && not <| line.StartsWith "===" then
        let lang = line.[2..].Trim()
        let mutable e = lang.Length - 1
        while e > 0 && lang.[e] = '=' do e <- e - 1
        let lang = lang.[..e].Trim()
```
Character by character
```fsharp
for line in text do
    if line.Length >= 3 && line.[0] = '=' && line.[1] = '=' && line.[2] <> '=' then
```
Regular expressions
```fsharp
for line in text do
    let m = r.Match line
    if m.Success then
        let lang = m.Groups.[1].Value.Trim()
```
Parallelization uses multithreading in F# and multiprocessing in Python.
language | code | Time required | Remarks |
---|---|---|---|
Python | python/research/checklang-parallel.py | 1m16.566s | startswith |
F# | fsharp/research/checklang-parallel.fsx | 1m03.941s | Character by character |
Python | python/research/checklang-parallel-re.py | 1m19.372s | Regular expressions |
F# | fsharp/research/checklang-parallel-re.fsx | 1m07.009s | Regular expressions |
It's a close race. The speed-up from parallelization is larger in Python.
Character by character
```fsharp
let getlangs(pos, length) = async { return [
    use fs = new FileStream(target, FileMode.Open, FileAccess.Read)
    fs.Seek(pos, SeekOrigin.Begin) |> ignore
    for id, text in MediaWikiParse.getPages(fs, length) do
        for line in text do
            if line.Length >= 3 && line.[0] = '=' && line.[1] = '=' && line.[2] <> '=' then
                let lang = line.[2..].Trim()
                let mutable e = lang.Length - 1
                while e > 0 && lang.[e] = '=' do e <- e - 1
                yield id, lang.[..e].Trim() ]}
```
Regular expressions
```fsharp
let getlangs(pos, length) = async { return [
    let r = Regex "^==([^=].*)=="
    use fs = new FileStream(target, FileMode.Open, FileAccess.Read)
    fs.Seek(pos, SeekOrigin.Begin) |> ignore
    for id, text in MediaWikiParse.getPages(fs, length) do
        for line in text do
            let m = r.Match line
            if m.Success then
                yield id, m.Groups.[1].Value.Trim() ]}
```
```fsharp
let results =
    sposlen.[1 .. sposlen.Length - 2]
    |> Seq.chunkBySize 100
    |> Seq.map Array.unzip
    |> Seq.map (fun (ps, ls) -> Array.min ps, Array.sum ls)
    |> Seq.map getlangs
    |> Async.Parallel
    |> Async.RunSynchronously
    |> List.concat
```
```fsharp
open System.Collections.Generic

let langs = Dictionary<string, int>()

do
    use sw = new StreamWriter("output1.tsv")
    sw.NewLine <- "\n"
    for id, lang in results do
        let lid =
            if langs.ContainsKey lang then langs.[lang] else
            let lid = langs.Count + 1
            langs.[lang] <- lid
            lid
        fprintfn sw "%d\t%d" id lid

do
    use sw = new StreamWriter("output2.tsv")
    sw.NewLine <- "\n"
    for kv in langs do
        fprintfn sw "%d\t%s" kv.Value kv.Key
```
I also measured with .NET Core and Mono on WSL 1, to compare with the .NET Framework results on Windows.

The abbreviations in the table correspond as follows.
code | Framework | Core | Mono | Remarks |
---|---|---|---|---|
fsharp/research/checklang.fsx | 3m24.302s | 3m25.545s | 4m22.330s | |
fsharp/research/checklang-re.fsx | 3m46.270s | 3m42.882s | 4m51.236s | Regular expressions |
fsharp/research/checklang-parallel.fsx | 1m03.941s | 0m59.014s | 2m39.716s | Parallel |
fsharp/research/checklang-parallel-re.fsx | 1m07.009s | 1m06.136s | 3m28.074s | Parallel, regular expression |
.NET Core seems to be as fast as or a bit faster than the .NET Framework.
.NET Core normally expects you to create a project, but since there are multiple executables and that would have been tedious, I coped by writing the configuration using the method described in the following article.
Using the fastest method for each language, F# came out slightly ahead. It was striking that a straight port of the same processing could still lose to Python in some cases.

As I found in the following article, character-level processing is where the difference becomes overwhelming.