Wikipedia provides a dump of all of its pages. The data is huge, but an index is provided so that the dump can be handled while it stays compressed. Let's actually retrieve some data from it.
A description of the dump data is below.
Due to the huge file size, be careful not to open the decompressed XML in an ordinary editor or browser.
The data for the Japanese version of Wikipedia is below.
From the May 1, 2020 edition, which was available at the time of writing, the following two files will be used.
The first file is the XML body data. It is already this large while compressed, so it would be an enormous size when decompressed; however, the format is designed to be usable while still compressed, so we will not decompress it this time.
The second file, the index, is the one we do decompress. It comes to about 107 MB.
The following article examines the structure of dumped XML tags.
The main structure is as follows. One item is stored in each page tag.
<mediawiki>
<siteinfo> ⋯ </siteinfo>
<page> ⋯ </page>
<page> ⋯ </page>
⋮
<page> ⋯ </page>
</mediawiki>
The bz2 file is not simply the entire XML compressed in one piece; it consists of blocks of 100 items each. Any single block can be extracted and decompressed on its own. This structure is called **multi-stream**.
siteinfo | page × 100 | page × 100 | ⋯ |
Each line of the index has the following structure.
offset in the bz2 file:id:title
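Note that a title can itself contain colons (for example "Wikipedia:Sandbox"), so a line should be split on at most the first two colons. A minimal sketch in Python, with a hypothetical helper name:

# Minimal sketch: parse one index line into (offset, id, title).
# Titles such as "Wikipedia:Sandbox" contain colons themselves,
# so split only on the first two colons.
def parse_index_line(line):
    offset, page_id, title = line.rstrip("\n").split(":", 2)
    return int(offset), int(page_id), title

print(parse_index_line("690:6:Wikipedia:Sandbox"))
# (690, 6, 'Wikipedia:Sandbox')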
Check the actual data.
$ head -n 5 jawiki-20200501-pages-articles-multistream-index.txt
690:1:Wikipedia:Upload log April 2004
690:2:Wikipedia:Delete record/Past log December 2002
690:5:Ampersand
690:6:Wikipedia:Sandbox
690:10:language
To know the length of the block starting at offset 690, you need to know where the next block starts.
$ head -n 101 jawiki-20200501-pages-articles-multistream-index.txt | tail -n 2
690:217:List of musicians(group)
814164:219:List of song titles
Since each line corresponds to one item, you can find the total number of items by counting the lines. There are about 2.5 million items.
$ wc -l jawiki-20200501-pages-articles-multistream-index.txt
2495246 jawiki-20200501-pages-articles-multistream-index.txt
Let's actually take out a specific item. The target is "Qiita".
Search for "Qiita".
$ grep Qiita jawiki-20200501-pages-articles-multistream-index.txt
2919984762:3691277:Qiita
3081398799:3921935:Template:Qiita tag
3081398799:3921945:Template:Qiita tag/doc
Ignore the Template pages and target the first result, id = 3691277.
In principle each block holds 100 items, but there are exceptions that throw the alignment off, so check the start position of the next block manually (a sketch that automates this follows below).
2919984762:3691305:Category:Gabon's Bilateral Relations
2920110520:3691306:Category:Japan-Cameroon relations
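As a rough sketch, this lookup can be automated: scan the index for the line whose id matches, then take the first offset that is larger than that line's offset. Comparing offsets is safer than counting lines, since blocks do not always hold exactly 100 items. The function name below is only for illustration.

# Rough sketch: given a page id, return the offset of its block and
# the offset of the next block, by scanning the decompressed index.
def find_block_range(index_path, target_id):
    block_offset = None
    with open(index_path, encoding="utf-8") as f:
        for line in f:
            offset, page_id, _ = line.rstrip("\n").split(":", 2)
            offset, page_id = int(offset), int(page_id)
            if block_offset is None:
                if page_id == target_id:
                    block_offset = offset
            elif offset > block_offset:
                # The first larger offset is the start of the next block.
                return block_offset, offset
    return block_offset, None  # None means the item is in the last block

print(find_block_range(
    "jawiki-20200501-pages-articles-multistream-index.txt", 3691277))
# (2919984762, 2920110520)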
You have all the information you need.
| id | block | next block |
|---|---|---|
| 3691277 | 2919984762 | 2920110520 |
Python
Start Python.
$ python
Python 3.8.2 (default, Apr 8 2020, 14:31:25)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
Open the compressed file.
>>> f = open("jawiki-20200501-pages-articles-multistream.xml.bz2", "rb")
Specify the offset to retrieve the block containing the Qiita item.
>>> f.seek(2919984762)
2919984762
>>> block = f.read(2920110520 - 2919984762)
Decompress the block to get a string.
>>> import bz2
>>> data = bz2.decompress(block)
>>> xml = data.decode(encoding="utf-8")
Check the contents. It contains 100 page tags.
>>> print(xml)
<page>
<title>Category:Mayor of Eniwa</title>
<ns>14</ns>
<id>3691165</id>
(Omitted)
It is hard to work with as-is, so parse it as XML. A root element is required for parsing, so wrap the string with one.
>>> import xml.etree.ElementTree as ET
>>> root = ET.fromstring("<root>" + xml + "</root>")
Check the contents. There are 100 page tags under the root.
>>> len(root)
100
>>> [child.tag for child in root]
['page', 'page',(Omitted), 'page']
Get the page by specifying its id. The argument to find is written in a notation called XPath.
>>> page = root.find("page/[id='3691277']")
And check the contents.
>>> page.find("title").text
'Qiita'
>>> page.find("revision/text").text[:50]
'{{Infobox Website\n|Site name=Qiita\n|logo=\n|screenshot=\n|Skull'
Save as a file.
>>> tree = ET.ElementTree(page)
>>> tree.write("Qiita.xml", encoding="utf-8")
You will get a file that looks like this:
Qiita.xml
<page>
<title>Qiita</title>
<ns>0</ns>
<id>3691277</id>
<revision>
<id>77245770</id>
<parentid>75514051</parentid>
<timestamp>2020-04-26T12:21:10Z</timestamp>
<contributor>
<username>Linuxmetel</username>
<id>1613984</id>
</contributor>
<comment>Added explanation of Qiita controversy and LGTM stock</comment>
<model>wikitext</model>
<format>text/x-wiki</format>
<text bytes="4507" xml:space="preserve">{{Infobox Website
|Site name=Qiita
(Omitted)
[[Category:Japanese website]]</text>
<sha1>mtwuh9z42c7j6ku1irgizeww271k4dc</sha1>
</revision>
</page>
I have summarized this flow in a script. The index is stored in SQLite and used from there.
SQLite
The script converts the index to TSV and generates SQL for ingestion.
python conv_index.py jawiki-20200501-pages-articles-multistream-index.txt
Three files will be generated.
Import into SQLite.
sqlite3 jawiki.db ".read jawiki-20200501-pages-articles-multistream-index.sql"
You are now ready.
The DB contains only the index, so the xml.bz2 file must be in the same directory. Do not rename the xml.bz2 file, since its file name is recorded in the DB.
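To illustrate what such a lookup does internally, here is a minimal sketch. It assumes a hypothetical table pages(title, id, offset, next_offset); the actual schema created by conv_index.py may well differ.

# Illustration only: assumes a hypothetical table
# pages(title, id, offset, next_offset), not necessarily the schema
# that conv_index.py actually creates.
import bz2
import sqlite3
import xml.etree.ElementTree as ET

con = sqlite3.connect("jawiki.db")
offset, next_offset, page_id = con.execute(
    "SELECT offset, next_offset, id FROM pages WHERE title = ?", ("Qiita",)
).fetchone()

# The xml.bz2 file itself is still needed; the DB only says where to look.
with open("jawiki-20200501-pages-articles-multistream.xml.bz2", "rb") as f:
    f.seek(offset)
    xml = bz2.decompress(f.read(next_offset - offset)).decode("utf-8")

page = ET.fromstring("<root>" + xml + "</root>").find(f"page/[id='{page_id}']")
print(page.find("revision/text").text[:50])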
If you specify the DB and an item name, the result is displayed. By default only the contents of the text tag are output, but if you specify -x, all the tags inside the page tag are output.
python mediawiki.py jawiki.db Qiita
python mediawiki.py -x jawiki.db Qiita
You can output to a file.
python mediawiki.py -o Qiita.txt jawiki.db Qiita
python mediawiki.py -o Qiita.xml -x jawiki.db Qiita
mediawiki.py is designed to be used as a library as well.
import mediawiki
db = mediawiki.DB("jawiki.db")
print(db["Qiita"].text)
For multi-stream and the bz2 module, I referred to existing articles. For the index, I referred to the Wikipedia index specifications. For the ElementTree XML API, I referred to the documentation. I had looked into how to use SQLite when processing example-sentence data.