Preprocessing Wikipedia dump files and word segmentation of large amounts of data with MeCab

I recently loaded and processed a Wikipedia dump file, so this article summarizes what I looked up along the way and the points to watch out for when reading it.

Target audience of the article

- People who want to use Wikipedia dump files as a natural language processing dataset
- People who want to use MeCab for word segmentation of large amounts of data
- People who run into OOM (out of memory) errors when reading and processing large amounts of data

Download the Wikipedia dump file

Download the Wikipedia dump file from the Wikimedia dumps site (dumps.wikimedia.org), or run the following wget command.

> wget https://dumps.wikimedia.org/jawiki/20200601/jawiki-20200601-pages-articles-multistream.xml.bz2
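Before extraction, it can be handy to confirm that the download is intact by looking at its first few lines. The following is a minimal sketch using only the Python standard library. The helper name peek_bz2 is hypothetical and is not part of any tool used in this article.

```python
import bz2
from itertools import islice


def peek_bz2(path, n_lines=5):
    """Return the first n_lines of a bz2-compressed text file.

    bz2.open decompresses incrementally, so only the beginning of the
    file is read rather than the whole dump.
    """
    with bz2.open(path, mode="rt", encoding="utf-8") as f:
        return [line.rstrip("\n") for line in islice(f, n_lines)]
```

For example, peek_bz2("jawiki-20200601-pages-articles-multistream.xml.bz2") should return the opening lines of the XML, assuming the dump was saved under that name in the current directory.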

Extract with WikiExtractor

Install with pip.

> pip install wikiextractor

Run the extraction. This time, the output is collected into a single text file (wiki.txt) by passing - to -o and redirecting standard output. To write to a directory instead, pass its path to -o.

> python -m wikiextractor.WikiExtractor jawiki-20200601-pages-articles-multistream.xml.bz2 -o - --processes 4 > wiki.txt

Preprocessing of wiki.txt

If you check the contents, you can see that the doc tags are still there, so remove them to be safe.

> head wiki.txt
------------------------------------------------------
<doc id="17230" url="https://ja.wikipedia.org/wiki?curid=17230" title="Yoshimoto Imagawa">
Yoshimoto Imagawa

Yoshimoto Imagawa was the guardian daimyo and warring lord of Suruga and Totomi during the Warring States period. Mr. Imagawa, the 11th head of the family. Due to his marriage with his sisters, Shingen Takeda and Hojo Ujiyasu are brothers-in-law. The ruler of the vast area of Tokaido, nicknamed "Kaido's No. 1 Yumitori".

In addition to territorial management such as rational military reforms by establishing a parent-child system, he also demonstrated his talent in terms of expedition and succeeded in transforming Imagawa into a Sengoku daimyo. The territory was expanded from Suruga and Totomi to parts of Mikawa and Owari. Although he built the heyday of the Imagawa clan during the Warring States period, he was defeated by Oda Nobunaga's army in the Battle of Okehazama when he invaded Owari Province and was defeated by Yoshikatsu Mori (Shinsuke).

Born in Eisho 16 (1519) as the third son of Imagawa's parents. My mother is the daughter of Nakamikado Nobutane (Jukei-ni), who is my father's regular room. However, there is a theory that Yoshimoto was originally a child of a concubine and adopted Jukei-ni after the Hanakura turbulence (described later). When he was born, he was sent to the Buddhist gate at the age of four because he had his mother, Imagawa Ujiteru, and Hikogoro, and was entrusted to Kotokei Shogun at Seko Zentokuji Temple in Fuji District, Suruga Province. His disciple, Sessai Choro (later Sessai Ohara), took over the role because of the death of Shogun in the second year of Kyoroku (1529). After that, he entered Kenninji Temple with Sessai and became Sengaku Shoho under the guidance of Ryutaka Tsunean. In addition, he and Sessai studied at Myoshinji Temple during Okyu Sokyu and deepened his scholarship.

After that, he returned to Suruga from Kyoto at the behest of Ujiteru, but immediately after that, Ujiteru died suddenly in the 5th year of Tenbun (1536). At this point, he had no inheritance right because his brother Hikogoro was still there, but even Hikogoro died on the same day as Ujiteru, so the inheritance right came around. Being a student of Jukei-ni, who is the same as Ujiteru and Hikogoro, was also a boost, and Yoshiharu, who was begged for repatriation by his senior vassals, was given a bias from the mainstream Shogun Yoshiharu Ashikaga, and named himself Yoshimoto. It was. However, the succession of the head was confused by the opposition of the influential vassal Fukushima, and in the end, Mr. Fukushima rebelled against Yoshimoto's half-brother, Genko Etan, who draws his own blood (Hanakura Ran). ).

Delete the lines that consist only of a tag, that is, the doc tags that mark the start and end of each article.

> cat wiki.txt | sed '/^<[^>]*>$/d' > wiki_removed_doc_tag.txt
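The same filtering can also be done in Python, for example where sed is not available. This is a minimal sketch, assuming the file is UTF-8 encoded. The function names strip_tag_lines and remove_doc_tags are hypothetical.

```python
import re

# Same pattern as the sed command: a line consisting solely of one tag.
TAG_LINE = re.compile(r"^<[^>]*>$")


def strip_tag_lines(lines):
    """Yield only the lines that are not a lone <...> tag."""
    for line in lines:
        if not TAG_LINE.match(line.rstrip("\r\n")):
            yield line


def remove_doc_tags(src_path, dst_path):
    """Stream src_path to dst_path, dropping the <doc ...> and </doc> lines."""
    with open(src_path, encoding="utf-8") as f_in, \
         open(dst_path, "w", encoding="utf-8") as f_out:
        f_out.writelines(strip_tag_lines(f_in))
```

Calling remove_doc_tags("wiki.txt", "wiki_removed_doc_tag.txt") would then play the role of the sed command. Because the input is processed as a stream, memory use stays small even for the full dump.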

Word segmentation with MeCab

Japanese is not written with spaces between words, so the words have to be identified by a segmenter. SentencePiece and similar tools can handle languages without spaces to some extent, but in some cases the results seem to be better when the text is segmented with MeCab beforehand.

Install MeCab

# Other than Windows
> pip install mecab-python3
# Windows
> pip install mecab-python-windows

Word segmentation using MeCab

import MeCab

# MeCab segments Japanese, so in practice `text` is a Japanese sentence;
# it is shown here in English translation.
text = "Word segmentation is not an easy task, but it is a challenging task."

tokenizer = MeCab.Tagger("-Owakati")  # wakati mode: tokens separated by spaces
tokens = tokenizer.parse(text).split()
print(tokens)
# => a list of tokens; for the original Japanese sentence this was 15 items,
#    with particles and punctuation split out as separate tokens

You could pass the whole of wiki_removed_doc_tag.txt as this text in one go, but this time I read it line by line to limit memory usage. Note that on Windows, where the default encoding is typically not UTF-8, an error occurs unless you specify the encoding in open(), so pass encoding="utf-8_sig".

mecab_tokenization.py


import MeCab

file_path = "wiki_removed_doc_tag.txt"
output_path = "wiki_mecab_space_separated.txt"
tokenizer = MeCab.Tagger("-Owakati")

# Read one line at a time and write each result immediately,
# so neither the input nor the output is held in memory as a whole.
with open(file_path, "r", encoding="utf-8_sig") as f_in, \
     open(output_path, "w", encoding="utf-8") as f_out:
    for line in f_in:
        tokens = tokenizer.parse(line).split()  # word segmentation
        text = " ".join(tokens)  # join the tokens with spaces
        f_out.write(text + "\n")  # keep one output line per input line

Because the file is read one line at a time, the whole input is never loaded into memory at once, which makes it possible to process it without running into OOM.

At this point you have a large amount of space-separated text, which can be used, for example, to train SentencePiece.
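The line-by-line approach can also be packaged as a small reusable function. The sketch below is one possible arrangement, not the method used in this article: the names segment_lines and segment_file are hypothetical, and the tokenizer is passed in as a callable so that the streaming logic can be tried without MeCab installed.

```python
def segment_lines(lines, tokenize):
    """Lazily yield each non-empty line as space-separated tokens.

    `tokenize` is any callable that maps a string to a list of tokens.
    """
    for line in lines:
        tokens = tokenize(line)
        if tokens:  # skip blank lines
            yield " ".join(tokens)


def segment_file(src_path, dst_path, tokenize, encoding="utf-8"):
    """Stream src_path to dst_path one line at a time; return lines written."""
    written = 0
    with open(src_path, encoding=encoding) as f_in, \
         open(dst_path, "w", encoding=encoding) as f_out:
        for segmented in segment_lines(f_in, tokenize):
            f_out.write(segmented + "\n")
            written += 1
    return written
```

With MeCab, tokenize could be lambda s: tagger.parse(s).split(), where tagger = MeCab.Tagger("-Owakati"). Only one line and its tokens are held in memory at any time, so memory use does not grow with the size of the input file.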
