Aozora Bunko is a website where volunteers digitize and publish literary works whose copyright has expired. For engineers who want to work with Aozora Bunko data, **the text-format and HTML-format versions of all works are now mirrored on Aozora Bunko's GitHub repository, updated daily, and can be downloaded in bulk**.
If you want to use Aozora Bunko data for natural language processing, the text files follow a fixed format, so going from data collection to preprocessing for just one particular author takes a bit of work. So I wrote **a way to bulk-download the works of a specific author plus a Python script that handles the basic preprocessing**, and decided to leave it here on Qiita.
- Bulk-download only the data of a specific author from Aozora Bunko
- Preprocess the downloaded text files and save them as UTF-8 TSV files
  - Save as TSV with the body text in the first column and the work title in the second column
  - Text formatting is as follows
| Item | Before formatting (example) | After formatting (example) |
|---|---|---|
| Notes and bibliographic information | "【About the symbols that appear in the text】", "底本：" (source-book info), "※［＃「勹＜夕」、第3水準1-14-76］" | (deleted) |
| Section delimiters | 「―――」「×××」「***」 | (deleted) |
| Typographic symbols | dash 「―」, ellipsis 「…」, reference mark 「※」 | (deleted) |
| Lines of one character or less, blank lines | 「三」 (numbers used as section breaks, etc.) | (deleted) |
| Ruby notation | 「東京｜行《ゆ》きの急行」 (ruby marked with ｜ and 《 》) | 「東京行きの急行」 |
| Indentation (full-width space at the start of a line) | 「　In the study,」 (leading full-width space) | 「In the study,」 |
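As a minimal illustration of the kind of substitutions involved, here is a sketch on a made-up line (the sentence and the transcriber's note are invented examples; the patterns mirror the ones used in the script later):

```python
import re

# A made-up line with ruby, a ruby start marker, a transcriber's note and leading indentation
line = '　東京｜行《ゆ》きの急行［＃「行」に傍点］'

line = re.sub('《.*?》', '', line)   # remove ruby
line = re.sub('［.*?］', '', line)   # remove transcriber's notes
line = line.replace('｜', '')        # remove the ruby start marker
line = line.lstrip('　')             # remove the leading full-width space
print(line)  # 東京行きの急行
```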
The easiest way to bulk-download the data of a particular author is to use the **svn command**.

The URL is `https://github.com/aozorabunko/aozorabunko/trunk/cards/{author ID}`. The `{author ID}` is the six-digit number that appears in the URL when you open a specific author's page on the Aozora Bunko website. For Ryunosuke Akutagawa, for example, it is `000879`. You can download all files of a specific author (here, Ryunosuke Akutagawa) by passing that URL to `svn export` as shown below.

Incidentally, `svn` should be available out of the box on Linux and macOS. I am not sure about Windows, but even Windows users can easily **start Ubuntu with WSL and type the commands**.
```bash:Download all data of Ryunosuke Akutagawa
svn export https://github.com/aozorabunko/aozorabunko/trunk/cards/000879/
```

This creates a local `./000879/` directory.
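As a quick check, you can count the ZIP archives that were exported (a minimal sketch; the `files` subdirectory is where Aozora Bunko keeps each author's ZIP files in the repository layout above):

```python
from pathlib import Path

# Count the ZIP archives under the exported directory
zip_files = sorted(Path('./000879/files/').glob('*.zip'))
print(f'{len(zip_files)} zip files downloaded')
print(zip_files[:3])  # show the first few paths
```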
The following Python script formats and preprocesses the downloaded ZIP files in one go and saves them as TSV. The outline of the processing is as follows; each step is described in the comments. It could have been written as a class, but for now it is implemented with plain functions.
- Find all ZIP files under a specific directory and store them in a list
- Create the output directories
- Loop over the files in the list (`for`)
  - Read the ZIP-compressed txt as a Pandas DataFrame: `save_cleanse_text()`
  - Convert the original data to UTF-8 and save it as a text file
  - Format the text: `text_cleanse_df()`
  - Save as TSV with the work title in the second column
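For reference, assuming `author_id = '000879'`, the script reads the svn export from `./000879/files/` and writes its output roughly like this:

```
./000879/files/        # ZIP files downloaded with svn export (input)
./out_000879/org/      # original text converted to UTF-8 (only if save_utf8_org is True)
./out_000879/edit/     # formatted text saved as TSV
```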
```python:aozora_preprocess.py
import pandas as pd
from pathlib import Path

author_id = '000879'  # Aozora Bunko author ID
author_name = '芥川龍之介'  # Author name in Aozora Bunko notation (Ryunosuke Akutagawa)
write_title = True  # Whether to put the work title in the second column
write_header = True  # Whether the first line holds the column names ("text", "title")
save_utf8_org = True  # Whether to also save the original data converted to UTF-8 as a text file

out_dir = Path(f'./out_{author_id}/')  # Output destination
tx_org_dir = out_dir / 'org'  # Where to save the UTF-8 converted original text
tx_edit_dir = out_dir / 'edit'  # Where to save the files after text formatting


def text_cleanse_df(df):
    # Find the start of the body text (assumes the body starts right after the '---…' delimiter)
    head_tx = list(df[df['text'].str.contains(
        '-------------------------------------------------------')].index)
    # Find the end of the body text (assumes the body ends right before the '底本：' source-book info)
    atx = list(df[df['text'].str.contains('底本：')].index)
    if head_tx == []:
        # If there is no '---…' delimiter, assume the body starts right after the author name
        head_tx = list(df[df['text'].str.contains(author_name)].index)
        head_tx_num = head_tx[0] + 1
    else:
        # The body starts right after the second '---…' delimiter
        head_tx_num = head_tx[1] + 1
    df_e = df[head_tx_num:atx[0]].copy()

    # Remove Aozora Bunko markup
    df_e = df_e.replace({'text': {'《.*?》': ''}}, regex=True)   # ruby
    df_e = df_e.replace({'text': {'［.*?］': ''}}, regex=True)   # transcriber's notes
    df_e = df_e.replace({'text': {'｜': ''}}, regex=True)        # ruby start marker
    # Remove indentation (full-width space at the beginning of the line)
    df_e = df_e.replace({'text': {'　': ''}}, regex=True)
    # Remove section delimiters
    df_e = df_e.replace({'text': {'^.$': ''}}, regex=True)
    df_e = df_e.replace({'text': {'^―――.*$': ''}}, regex=True)
    df_e = df_e.replace({'text': {r'^\*\*\*.*$': ''}}, regex=True)
    df_e = df_e.replace({'text': {'^×××.*$': ''}}, regex=True)
    # Remove symbols and the empty brackets left behind by the deletions
    df_e = df_e.replace({'text': {'―': ''}}, regex=True)
    df_e = df_e.replace({'text': {'…': ''}}, regex=True)
    df_e = df_e.replace({'text': {'※': ''}}, regex=True)
    df_e = df_e.replace({'text': {'「」': ''}}, regex=True)
    # Remove lines consisting of one character or less
    df_e['length'] = df_e['text'].map(lambda x: len(x))
    df_e = df_e[df_e['length'] > 1]
    # The index now has gaps, so reset it
    df_e = df_e.reset_index().drop(['index'], axis=1)
    # Remove blank lines (just in case)
    df_e = df_e[~(df_e['text'] == '')]
    # Reset the index again and drop the character-length column
    df_e = df_e.reset_index().drop(['index', 'length'], axis=1)
    return df_e

def save_cleanse_text(target_file):
    try:
        # Read the file
        print(target_file)
        # Read as a Pandas DataFrame (some variant characters cannot be read unless cp932 is used)
        df_tmp = pd.read_csv(target_file, encoding='cp932', names=['text'])
        # Convert the original data to UTF-8 and save it as a text file
        if save_utf8_org:
            out_org_file_nm = Path(target_file.stem + '_org_utf-8.txt')
            df_tmp.to_csv(Path(tx_org_dir / out_org_file_nm), sep='\t',
                          encoding='utf-8', index=None)
        # Format the text
        df_tmp_e = text_cleanse_df(df_tmp)
        if write_title:
            # Add a title column (the first line of the original file is the work title)
            df_tmp_e['title'] = df_tmp['text'][0]
        out_edit_file_nm = Path(target_file.stem + '_clns_utf-8.tsv')
        df_tmp_e.to_csv(Path(tx_edit_dir / out_edit_file_nm), sep='\t',
                        encoding='utf-8', index=None, header=write_header)
    except Exception:
        print(f'ERROR: {target_file}')

def main():
    tx_dir = Path(f'./{author_id}/files/')  # svn export puts the ZIP files here
    # Create the list of ZIP files
    zip_list = list(tx_dir.glob('*.zip'))
    # Create the save directories
    tx_edit_dir.mkdir(exist_ok=True, parents=True)
    if save_utf8_org:
        tx_org_dir.mkdir(exist_ok=True, parents=True)
    for target_file in zip_list:
        save_cleanse_text(target_file)


if __name__ == '__main__':
    main()
```
100_ruby_1154_org_utf-8.txt (original data)

```text
Momotaro
Ryunosuke Akutagawa

-------------------------------------------------------
【About the symbols that appear in the text】

《》: ruby
(Example) Peach《momo》

｜: marks the start of a string that ruby is attached to
(Example) Around the time heaven and earth｜began《...》

［＃］: transcriber's note; mainly explanations of gaiji and emphasis marks
(the numbers are the JIS X 0213 plane-row-cell number or Unicode, and the page and line in the source book)
(Example) ※［＃"word + making of mound", level 4 2-88-74］
-------------------------------------------------------
［＃8-character indent］1［＃"1" is a middle heading］

　Once upon a time, deep, deep in the mountains, there was a large peach tree. (...)
(…)
What kind of person picked up this baby《akago》 after it left the depths of the mountains? ――There is no need to tell it any more. At the end of a mountain stream an old woman was, as every child in Japan knows, washing the kimono or something of an old man who had gone to cut brushwood. ……
(…)
```
100_ruby_1154_clns_utf-8.tsv (preprocessed data)

```text
text	title
Once upon a time, deep, deep in the mountains, there was a large peach tree. (...)	Momotaro
(…)
What kind of person picked up this baby after it left the depths of the mountains? There is no need to tell it any more. At the end of a mountain stream an old woman was, as every child in Japan knows, washing the kimono or something of an old man who had gone to cut brushwood.	Momotaro
(…)
```
The parts that are not needed for natural language processing have been removed as intended. I hope you will adapt it, for example by turning it into a class or adding any other formatting steps you need.
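If you want to use the output right away, the preprocessed TSV can be read back with pandas like this (a minimal sketch; the file name is taken from the example above):

```python
import pandas as pd

# Load one of the preprocessed TSV files produced by the script
df = pd.read_csv('./out_000879/edit/100_ruby_1154_clns_utf-8.tsv',
                 sep='\t', encoding='utf-8')
print(df.columns.tolist())  # ['text', 'title'] when write_title and write_header are True
print(df.head())
```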