Extract specific languages from Wiktionary

The full Wiktionary dump is too large to work with when investigating a single language, so I wrote scripts that extract a specified language as a preprocessing step.

This article is part of a series.

  1. Search for efficient Wiktionary processing
  2. Compare Wiktionary processing speed between F# and Python
  3. Get the language code of Wiktionary
  4. Extract a specific language from Wiktionary ← This article
  5. Investigate English Irregular Verbs in Wiktionary

The script for this article is posted in the following repositories.

Overview

Processing the entire dump just to investigate one language is wasteful, so the target language is extracted in advance as a preprocessing step.

The extracted data is a plain text file that can be opened in an editor, so it is easy to handle. Information can be pulled out with ordinary text processing, without the special speed-up techniques used in the earlier articles.

Preparation

Use the dump file of the English version of Wiktionary.

The dump data is provided compressed with bzip2. The May 1, 2020 edition, available at the time of writing, is used as it is, still compressed and without expanding it. (It is about 6 GB when expanded.)

You need to keep the downloaded xml.bz2 file somewhere. Any location works; here I create a dedicated folder in my home directory.

The script examines the length of each stream while expanding the data. In parallel, it collects page information (equivalent to an index) and language headings from the expanded data.
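As a rough illustration of measuring stream lengths while expanding, the sketch below walks through concatenated bzip2 streams held in memory. It is not the code in db-make.py: the function name iter_streams is hypothetical, and a real script would read the 900 MB file in chunks rather than slicing one bytes object.

```python
import bz2


def iter_streams(data: bytes):
    """Yield (offset, compressed_length, expanded_bytes) for each bzip2 stream.

    Assumption: `data` is a sequence of complete bzip2 streams placed back
    to back, which is how a multistream dump is laid out.
    """
    offset = 0
    while offset < len(data):
        decomp = bz2.BZ2Decompressor()
        expanded = decomp.decompress(data[offset:])
        # Bytes that follow the end of this stream are left in unused_data,
        # so the compressed length is whatever was actually consumed.
        length = len(data) - offset - len(decomp.unused_data)
        yield offset, length, expanded
        offset += length


if __name__ == "__main__":
    sample = bz2.compress(b"<page>1</page>") + bz2.compress(b"<page>2</page>")
    for off, size, text in iter_streams(sample):
        print(off, size, text)
```

Recording the offset and length of every stream is what later allows a single stream to be expanded without reading the rest of the file.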

Execution result


$ python db-make.py ~/share/wiktionary/enwiktionary-20200501-pages-articles-multistream.xml.bz2
934,033,103 / 934,033,103 | 68,608
checking redirects...
reading language codes...
checking language names...
writing DB files...

Eight files will be generated.

File name          Contents
db-settings.tsv    Settings (file name)
db-streams.tsv     Stream information (ID, offset, length)
db-namespaces.tsv  Namespaces (ns tag)
db-pages.tsv       Page information (ID, stream, namespace, title, redirect, namespace)
db-idlang.tsv      Language IDs contained in each page (by page ID)
db-langname.tsv    Language ID to language name mapping (including aliases)
db-langcode.tsv    Language code to language name mapping
db-templates.tsv   Embedded templates

Import the files into SQLite using the provided SQL. The SQL is simple, so reading it is the quickest way to understand the table structure.

$ sqlite3 enwiktionary.db ".read db.sql"
importing 'db-settings.tsv'...
importing 'db-streams.tsv'...
importing 'db-namespaces.tsv'...
importing 'db-pages.tsv'...
importing 'db-idlang.tsv'...
importing 'db-langname.tsv'...
importing 'db-langcode.tsv'...
importing 'db-templates.tsv'...

The preparation is complete.

Once the data has been imported, the generated db-*.tsv files are no longer needed. If you want to inspect the data with commands like grep in addition to SQLite, though, it is worth keeping them.

Parallelization and generator

This section introduces the idea behind the parallelization.

db-make.py (excerpt)


    with concurrent.futures.ProcessPoolExecutor() as executor:
        for pgs, idl in executor.map(getlangs, f(getstreams(target))):

f and getstreams are generators. executor.map runs getlangs in parallel and, from the main process, behaves like a generator.

getstreams is a step that cannot be parallelized. The data that f filters and yields is passed to getlangs. f is more than a filter: it also displays progress information and handles the data that is not passed to getlangs.
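The shape of this pipeline can be sketched with toy data. The functions below are stand-ins written for illustration, not the ones in db-make.py: get_streams plays the sequential producer, filter_streams the filter that also handles items itself, and get_langs the worker.

```python
import concurrent.futures


def get_streams(blobs):
    """Sequential stage: yield (index, text) one item at a time."""
    for index, text in enumerate(blobs):
        yield index, text


def filter_streams(streams, skipped):
    """Filter stage: handle some items locally and yield the rest."""
    for index, text in streams:
        if not text:
            # Nothing to parse, so record it here instead of sending it on.
            skipped.append(index)
            continue
        yield index, text


def get_langs(item):
    """Worker stage: must be a top-level function so it can be pickled."""
    index, text = item
    return index, sorted(set(text.split()))


def run(blobs, executor):
    """Chain the generators and let executor.map distribute the work."""
    skipped = []
    results = list(executor.map(get_langs, filter_streams(get_streams(blobs), skipped)))
    return results, skipped


if __name__ == "__main__":
    with concurrent.futures.ProcessPoolExecutor() as executor:
        print(run(["English Latin", "", "Latin Latin"], executor))
```

The results come back in input order, so the main process can consume them with an ordinary for loop, as in the excerpt above.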

Language name

You can find the language names in the generated db-langname.tsv.

I have prepared SQL that ranks the languages in descending order of the number of recorded words.

$ sqlite3 enwiktionary.db ".read rank.sql" > rank.tsv
$ head -n 10 rank.tsv
1       English 928987
2       Latin   805426
3       Spanish 668035
4       Italian 559757
5       Russian 394340
6       French  358570
7       Portuguese      282596
8       German  272451
9       Chinese 192619
10      Finnish 176100
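
A ranking of this kind boils down to a join and a GROUP BY. The sketch below shows the idea on an in-memory database. The schema is an assumption made for illustration (columns pid, lid and name are hypothetical), so check db.sql for the real table definitions. The real langname table also contains aliases, which a real query would have to exclude to avoid counting a language twice.

```python
import sqlite3


def make_demo_db():
    """Build a tiny in-memory database with an assumed, simplified schema."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE langname (lid INTEGER, name TEXT);  -- hypothetical columns
        CREATE TABLE idlang   (pid INTEGER, lid INTEGER); -- hypothetical columns
        INSERT INTO langname VALUES (1, 'English'), (2, 'Latin');
        INSERT INTO idlang   VALUES (1, 1), (2, 1), (2, 2), (3, 1);
    """)
    return con


def rank_languages(con):
    """Return (rank, language name, page count), most pages first."""
    rows = con.execute("""
        SELECT langname.name, COUNT(*) AS words
        FROM idlang JOIN langname ON langname.lid = idlang.lid
        GROUP BY langname.name
        ORDER BY words DESC, langname.name
    """).fetchall()
    return [(rank, name, words) for rank, (name, words) in enumerate(rows, 1)]


if __name__ == "__main__":
    for row in rank_languages(make_demo_db()):
        print(*row, sep="\t")
```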

Language extraction

I have prepared a script that takes a language name and extracts that language's entries into a file named after it (for example, English.txt).

In the extracted text, page boundaries are marked with comments of the following form. The title corresponds to the headword.

<!-- <title>title</title> -->
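
Because each page starts with such a comment, the extracted file can be split back into entries with a few lines of ordinary text processing. The following is a minimal sketch; split_pages is a hypothetical helper name, not part of the published scripts.

```python
import re

# Matches the page-break comment and captures the headword.
TITLE = re.compile(r"^<!-- <title>(.*)</title> -->$")


def split_pages(lines):
    """Yield (title, body_lines) for each page in the extracted text."""
    title, body = None, []
    for line in lines:
        line = line.rstrip("\n")
        match = TITLE.match(line)
        if match:
            if title is not None:
                yield title, body
            title, body = match.group(1), []
        elif title is not None:
            body.append(line)
    if title is not None:
        yield title, body


if __name__ == "__main__":
    sample = [
        "<!-- <title>cat</title> -->\n",
        "==English==\n",
        "<!-- <title>dog</title> -->\n",
        "===Noun===\n",
    ]
    for headword, text in split_pages(sample):
        print(headword, text)
```

Passing an open file object, such as open("English.txt", encoding="utf-8"), in place of the sample list processes the file line by line without loading it all into memory.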

Extract English as an example.

$ time python collect-lang.py enwiktionary.db English
reading positions... 928,987 / 928,987
optimizing... 49,835 -> 6,575
reading streams... 6,575 / 6,575
English: 928,988

Check the number of lines and file size.

$ wc -l English.txt
14461960 English.txt
$ wc --bytes English.txt
452471057 English.txt

English has the largest number of recorded words, but even so the extracted file is only about 430 MB and can be opened in an editor.

It is possible to specify multiple language names.

$ python collect-lang.py enwiktionary.db Arabic Estonian Hebrew Hittite Ido Interlingua Interlingue Novial "Old English" "Old High German" "Old Saxon" Phoenician Vietnamese Volapük Yiddish
reading positions... 143,926 / 143,926
optimizing... 25,073 -> 10,386
reading streams... 10,386 / 10,386
Arabic: 50,380
Estonian: 8,756
Hebrew: 9,845
Hittite: 392
Ido: 19,978
Interlingua: 3,271
Interlingue: 638
Novial: 666
Old English: 10,608
Old High German: 1,434
Old Saxon: 1,999
Phoenician: 129
Vietnamese: 25,588
Volapük: 3,918
Yiddish: 6,324

Languages stored separately

Newly added artificial languages and reconstructed proto-languages cannot be extracted with the previous script because Wiktionary stores them differently.

In these languages, each word has its own dedicated page.

Use the script to find out what languages are available.

$ python search-title.py enwiktionary.db
reading `pages`... 6,860,637 / 6,860,637

The script writes search-title.tsv. The word part is stripped from each page title, and the remaining prefixes are listed in descending order of frequency.

$ grep Appendix search-title.tsv | head -n 5
3492    Appendix:Lojban/
3049    Appendix:Proto-Germanic/
2147    Appendix:Klingon/
1851    Appendix:Quenya/
888     Appendix:Proto-Slavic/
$ grep Reconstruction search-title.tsv | head -n 5
5096    Reconstruction:Proto-Germanic/
3009    Reconstruction:Proto-Slavic/
1841    Reconstruction:Proto-West Germanic/
1724    Reconstruction:Proto-Indo-European/
1451    Reconstruction:Proto-Samic/
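
The counting itself can be sketched as follows. This is an assumption about what search-title.py does, not its actual code: the sketch cuts each title at the first slash and ignores titles without one, and count_prefixes is a hypothetical name.

```python
from collections import Counter


def count_prefixes(titles):
    """Count title prefixes up to and including the first slash."""
    counts = Counter()
    for title in titles:
        head, slash, _word = title.partition("/")
        if slash:  # titles without a slash have no prefix to count
            counts[head + slash] += 1
    return counts.most_common()  # most frequent prefix first


if __name__ == "__main__":
    sample = [
        "Appendix:Lojban/broda",
        "Appendix:Lojban/klama",
        "Reconstruction:Proto-Germanic/wulfaz",
        "dog",
    ]
    for prefix, count in count_prefixes(sample):
        print(count, prefix, sep="\t")
```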

I have prepared a script that extracts the pages whose titles match a regular expression.

Here is an example of use. You must specify the output file name.

Proto-Indo-European


$ python collect-title.py enwiktionary.db PIE.txt "^Reconstruction:Proto-Indo-European/"
reading `pages`... 6,860,557 / 6,860,557
Sorting...
writing `pages`... 1,726 / 1,726

Toki Pona (artificial language)


$ python collect-title.py enwiktionary.db Toki_Pona.txt "^Appendix:Toki Pona/"
reading `pages`... 6,860,637 / 6,860,637
Sorting...
writing `pages`... 130 / 130

Since the script simply applies the regular expression to the titles, it can also extract every page whose title contains a particular language name.

Novial (artificial language)


$ python collect-title.py enwiktionary.db Novial2.txt Novial
reading `pages`... 6,860,557 / 6,860,557
Sorting...
writing `pages`... 148 / 148
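
The difference between the anchored patterns and the bare word Novial can be reproduced with the re module. The sketch below assumes the script searches each title for the pattern, as re.search does; select_titles and the sample titles are made up for illustration.

```python
import re


def select_titles(pattern, titles):
    """Return the titles in which the pattern is found anywhere."""
    regex = re.compile(pattern)
    return [title for title in titles if regex.search(title)]


if __name__ == "__main__":
    sample = [
        "Appendix:Toki Pona/toki",
        "Appendix:Novial/li",
        "Category:Novial nouns",
        "Reconstruction:Proto-Indo-European/wódr̥",
    ]
    # Anchored with ^: only titles that start with the prefix.
    print(select_titles(r"^Appendix:Toki Pona/", sample))
    # Unanchored: any title that contains the word.
    print(select_titles(r"Novial", sample))
```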

Script template

I have prepared a template to use as a reference when writing your own scripts. It reads all the data while showing progress.

$ python db-template.py enwiktionary.db
reading `settings`... 1 / 1
reading `streams`... 68,609 / 68,609
reading `namespaces`... 46 / 46
reading `pages`... 6,860,557 / 6,860,557
reading `idlang`... 6,916,807 / 6,916,807
reading `langname`... 3,978 / 3,978
reading `langcode`... 8,146 / 8,146
reading `templates`... 32,880 / 32,880
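
For readers who want to start from something smaller, a progress-reporting read loop can be sketched like this. It is not db-template.py itself: read_table is a hypothetical helper, and the table name is interpolated into the SQL, so it must come from a trusted list such as the table names above.

```python
import sqlite3
import sys


def read_table(con, table, every=1000):
    """Read every row of `table`, printing progress to stderr."""
    total = con.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    rows = []
    for count, row in enumerate(con.execute(f"SELECT * FROM {table}"), 1):
        rows.append(row)
        # Redraw the same line every `every` rows and once more at the end.
        if count % every == 0 or count == total:
            print(f"\rreading `{table}`... {count:,} / {total:,}",
                  end="", file=sys.stderr)
    print(file=sys.stderr)
    return rows


if __name__ == "__main__":
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE pages (id INTEGER, title TEXT)")
    con.executemany("INSERT INTO pages VALUES (?, ?)",
                    [(1, "cat"), (2, "dog"), (3, "bird")])
    print(len(read_table(con, "pages")))
```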
