This article uses the **ConcordanceIndex** class of NLTK. NLTK (Natural Language Toolkit) is one of the libraries for natural language processing in Python.

```python
import re              # regular expression handling
import zipfile         # working with zip files
import urllib.request  # downloading data from the web
import os.path         # manipulating path names
import glob            # getting file path names

def download(URL):
    # Download the zip file
    zip_file = re.split(r'/', URL)[-1]
    urllib.request.urlretrieve(URL, zip_file)
    dir_name = os.path.splitext(zip_file)[0]

    # Unzip the file and save its contents
    with zipfile.ZipFile(zip_file) as zip_object:
        zip_object.extractall(dir_name)
    os.remove(zip_file)

    # Return the path of the extracted text file
    path = os.path.join(dir_name, '*.txt')
    file_list = glob.glob(path)
    return file_list[0]
```
```python
def convert(download_text):
    # Read the file (Aozora Bunko texts are encoded in Shift_JIS)
    with open(download_text, 'rb') as f:
        data = f.read()
    text = data.decode('shift_jis')

    # Extract the body text
    text = re.split(r'\-{5,}', text)[2]          # drop the Aozora Bunko header
    text = re.split(r'底本：', text)[0]           # drop the trailer (source-book note)
    text = re.split(r'［＃改ページ］', text)[0]    # cut at the page-break marker

    # Remove noise
    text = re.sub(r'《.+?》', '', text)    # ruby (reading) annotations
    text = re.sub(r'［＃.+?］', '', text)  # editorial annotations
    text = re.sub(r'｜', '', text)         # ruby base markers
    text = re.sub(r'\r\n', '', text)       # line breaks
    text = re.sub(r'\u3000', '', text)     # full-width spaces
    text = re.sub(r'「', '', text)         # opening quotation marks
    text = re.sub(r'」', '', text)         # closing quotation marks
    text = re.sub(r'、', '', text)         # commas
    text = re.sub(r'。', '', text)         # full stops
    return text
```
```python
URL = 'https://www.aozora.gr.jp/cards/000081/files/43737_ruby_19028.zip'
download_file = download(URL)
text = convert(download_file)
print(text)
```
The code above downloads the zipped text from Aozora Bunko with the download function and passes it to the convert function to extract only the body.
The ConcordanceIndex class is designed for English text, so we use MeCab to convert the Japanese text into a **space-separated (wakati-gaki) format**, with a single space between words.

```
!apt install aptitude
!aptitude install mecab libmecab-dev mecab-ipadic-utf8 git make curl xz-utils file -y
!pip install mecab-python3==0.7
```
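If the installation succeeded, MeCab should be able to segment a short sentence right away. A quick sanity check (the sample sentence here is an arbitrary illustration, not from the original article):

```python
import MeCab

# Verify that MeCab works in -Owakati (word-segmentation) mode.
# The sample sentence is arbitrary; expect space-separated words.
print(MeCab.Tagger("-Owakati").parse("今日はいい天気です"))
```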
Instantiate the MeCab.Tagger class with the word-segmentation output mode -Owakati, then split the text into words with the parse method.

```python
import MeCab

mecab = MeCab.Tagger("-Owakati")
words = mecab.parse(text).split()
```
Use join to concatenate the words, with a single-byte space as the delimiter.

```python
doc = ' '.join(words)
print(doc)
```
We use nltk here, but it will not work unless you also download a tokenizer called punkt. Tokenize doc with **NLTK and convert it to the Text format**.

```python
import nltk
nltk.download('punkt')

text_ = nltk.Text(nltk.word_tokenize(doc))
```
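As a side note, the nltk.Text object created here also has its own concordance method, so a quick KWIC check is possible without building a ConcordanceIndex instance yourself; a minimal sketch:

```python
# Quick KWIC view straight from the Text object
# (concordance() is standard nltk.Text API).
text_.concordance('ジョバンニ', width=40, lines=5)
```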
Create an instance of the ConcordanceIndex class with text_ as the input text, and display the **output in KWIC format for the keyword**.

```python
word = 'ジョバンニ'  # Giovanni

# Create an instance and specify the input text
c = nltk.text.ConcordanceIndex(text_)

# Display KWIC output for the keyword
c.print_concordance(word, width=40, lines=50)
```
The print_concordance method for displaying KWIC output lets you specify the **display width** with width and the **maximum number of lines** with lines.

All 196 matched positions, which were the original goal of the search, can be displayed with the offsets method.

```python
print(c.offsets(word))
```
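Since offsets returns token positions, you can also rebuild a small KWIC display yourself. A minimal sketch, assuming a context window of 5 tokens on each side (the window size is an arbitrary choice for illustration):

```python
# Rebuild a simple KWIC view from the token offsets.
# The 5-token context window is arbitrary.
for i in c.offsets(word)[:5]:
    left = ' '.join(text_[max(0, i - 5):i])
    right = ' '.join(text_[i + 1:i + 6])
    print(f'{left} [{text_[i]}] {right}')
```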