Wikipedia provides a dump of all of its pages. The data is huge, but an index is provided so that the dump can be handled while it stays compressed. Let's actually retrieve some data from it.
A description of the dump data is below.
Due to the huge file size, be careful not to open the decompressed XML in an ordinary editor or browser.
The data for the Japanese version of Wikipedia is below.
From the May 1, 2020 edition, which was available at the time of writing, the following two files will be used.
The first file is the XML body data. It is already this large while compressed, so it would be an enormous size when decompressed; however, the format is designed to be usable while still compressed, so we will not decompress it this time.
The second file, the index, is the one we do decompress. It comes to about 107 MB.
The following article examines the structure of dumped XML tags.
The main structure is as follows. One item is stored in each page tag.
<mediawiki>
<siteinfo> ⋯ </siteinfo>
<page> ⋯ </page>
<page> ⋯ </page>
⋮
<page> ⋯ </page>
</mediawiki>
The bz2 file is not simply the entire XML compressed in one piece; it consists of blocks of 100 items each. Any single block can be extracted and decompressed on its own. This structure is called **multi-stream**.
siteinfo | page × 100 | page × 100 | ⋯ |
Each line of the index has the following structure.
offset in the bz2 file:id:title
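Note that a title can itself contain colons (for example "Wikipedia:Sandbox"), so a line should be split on at most the first two colons. A minimal sketch in Python, with a hypothetical helper name:

# Minimal sketch: parse one index line into (offset, id, title).
# Titles such as "Wikipedia:Sandbox" contain colons themselves,
# so split only on the first two colons.
def parse_index_line(line):
    offset, page_id, title = line.rstrip("\n").split(":", 2)
    return int(offset), int(page_id), title

print(parse_index_line("690:6:Wikipedia:Sandbox"))
# (690, 6, 'Wikipedia:Sandbox')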
Check the actual data.
$ head -n 5 jawiki-20200501-pages-articles-multistream-index.txt
690:1:Wikipedia:Upload log April 2004
690:2:Wikipedia:Delete record/Past log December 2002
690:5:Ampersand
690:6:Wikipedia:Sandbox
690:10:language
To know the length of the block starting at offset 690, you need to know where the next block starts.
$ head -n 101 jawiki-20200501-pages-articles-multistream-index.txt | tail -n 2
690:217:List of musicians(group)
814164:219:List of song titles
Since each line corresponds to one item, you can find the total number of items by counting the lines. There are about 2.5 million items.
$ wc -l jawiki-20200501-pages-articles-multistream-index.txt
2495246 jawiki-20200501-pages-articles-multistream-index.txt
Let's actually take out a specific item. The target is "Qiita".
Search for "Qiita".
$ grep Qiita jawiki-20200501-pages-articles-multistream-index.txt
2919984762:3691277:Qiita
3081398799:3921935:Template:Qiita tag
3081398799:3921945:Template:Qiita tag/doc
Ignore the Template pages and target the first result, id = 3691277.
In principle each block holds 100 items, but there are exceptions that throw the alignment off, so check the start position of the next block manually (a sketch that automates this follows below).
2919984762:3691305:Category:Gabon's Bilateral Relations
2920110520:3691306:Category:Japan-Cameroon relations
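As a rough sketch, this lookup can be automated: scan the index for the line whose id matches, then take the first offset that is larger than that line's offset. Comparing offsets is safer than counting lines, since blocks do not always hold exactly 100 items. The function name below is only for illustration.

# Rough sketch: given a page id, return the offset of its block and
# the offset of the next block, by scanning the decompressed index.
def find_block_range(index_path, target_id):
    block_offset = None
    with open(index_path, encoding="utf-8") as f:
        for line in f:
            offset, page_id, _ = line.rstrip("\n").split(":", 2)
            offset, page_id = int(offset), int(page_id)
            if block_offset is None:
                if page_id == target_id:
                    block_offset = offset
            elif offset > block_offset:
                # The first larger offset is the start of the next block.
                return block_offset, offset
    return block_offset, None  # None means the item is in the last block

print(find_block_range(
    "jawiki-20200501-pages-articles-multistream-index.txt", 3691277))
# (2919984762, 2920110520)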
You have all the information you need.
| id | block | next block |
|---|---|---|
| 3691277 | 2919984762 | 2920110520 |
Python
Start Python.
$ python
Python 3.8.2 (default, Apr 8 2020, 14:31:25)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Open the compressed file.
>>> f = open("jawiki-20200501-pages-articles-multistream.xml.bz2", "rb")
Specify the offset to retrieve the block containing the Qiita item.
>>> f.seek(2919984762)
2919984762
>>> block = f.read(2920110520 - 2919984762)
Decompress the block to get a string.
>>> import bz2
>>> data = bz2.decompress(block)
>>> xml = data.decode(encoding="utf-8")
Check the contents. It contains 100 page tags.
>>> print(xml)
<page>
<title>Category:Mayor of Eniwa</title>
<ns>14</ns>
<id>3691165</id>
(Omitted)
It is hard to work with as-is, so parse it as XML. A root element is required for parsing, so wrap the string with one.
>>> import xml.etree.ElementTree as ET
>>> root = ET.fromstring("<root>" + xml + "</root>")
Check the contents. There are 100 page tags under the root.
>>> len(root)
100
>>> [child.tag for child in root]
['page', 'page',(Omitted), 'page']
Get the page by specifying its id. The argument to find is written in a notation called XPath.
>>> page = root.find("page/[id='3691277']")
And check the contents.
>>> page.find("title").text
'Qiita'
>>> page.find("revision/text").text[:50]
'{{Infobox Website\n|Site name=Qiita\n|logo=\n|screenshot=\n|Skull'
Save as a file.
>>> tree = ET.ElementTree(page)
>>> tree.write("Qiita.xml", encoding="utf-8")
You will get a file that looks like this:
Qiita.xml
<page>
<title>Qiita</title>
<ns>0</ns>
<id>3691277</id>
<revision>
<id>77245770</id>
<parentid>75514051</parentid>
<timestamp>2020-04-26T12:21:10Z</timestamp>
<contributor>
<username>Linuxmetel</username>
<id>1613984</id>
</contributor>
<comment>Added explanation of Qiita controversy and LGTM stock</comment>
<model>wikitext</model>
<format>text/x-wiki</format>
<text bytes="4507" xml:space="preserve">{{Infobox Website
|Site name=Qiita
(Omitted)
[[Category:Japanese website]]</text>
<sha1>mtwuh9z42c7j6ku1irgizeww271k4dc</sha1>
</revision>
</page>
I have summarized this flow in a script. The index is stored in SQLite and used from there.
SQLite
The script converts the index to TSV and generates SQL for ingestion.
python conv_index.py jawiki-20200501-pages-articles-multistream-index.txt
Three files will be generated.
Import into SQLite.
sqlite3 jawiki.db ".read jawiki-20200501-pages-articles-multistream-index.sql"
You are now ready.
The DB contains only the index, so the xml.bz2 file must be in the same directory. Do not rename the xml.bz2 file, since its file name is recorded in the DB.
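To illustrate what such a lookup does internally, here is a minimal sketch. It assumes a hypothetical table pages(title, id, offset, next_offset); the actual schema created by conv_index.py may well differ.

# Illustration only: assumes a hypothetical table
# pages(title, id, offset, next_offset), not necessarily the schema
# that conv_index.py actually creates.
import bz2
import sqlite3
import xml.etree.ElementTree as ET

con = sqlite3.connect("jawiki.db")
offset, next_offset, page_id = con.execute(
    "SELECT offset, next_offset, id FROM pages WHERE title = ?", ("Qiita",)
).fetchone()

# The xml.bz2 file itself is still needed; the DB only says where to look.
with open("jawiki-20200501-pages-articles-multistream.xml.bz2", "rb") as f:
    f.seek(offset)
    xml = bz2.decompress(f.read(next_offset - offset)).decode("utf-8")

page = ET.fromstring("<root>" + xml + "</root>").find(f"page/[id='{page_id}']")
print(page.find("revision/text").text[:50])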
If you specify the DB and an item name, the result is displayed. By default only the contents of the text tag are output, but if you specify -x, all the tags inside the page tag are output.
python mediawiki.py jawiki.db Qiita
python mediawiki.py -x jawiki.db Qiita
You can output to a file.
python mediawiki.py -o Qiita.txt jawiki.db Qiita
python mediawiki.py -o Qiita.xml -x jawiki.db Qiita
mediawiki.py is designed to be used as a library as well.
import mediawiki
db = mediawiki.DB("jawiki.db")
print(db["Qiita"].text)
For multi-stream and the bz2 module, I referred to existing articles. For the index, I referred to the Wikipedia index specifications. For the ElementTree XML API, I referred to the documentation. I had looked into how to use SQLite when processing example-sentence data.