This article implements a morphological-analysis full-text search engine over Wikipedia. Full-text search is realized by computing the similarity between the search string and each article and displaying the results in descending order of similarity. For fast lookups we build both an index and an inverted index.
This is a continuation of the previously written [Recommendation] Summary of advantages and disadvantages of content-based and collaborative filtering / implementation method. If you think of full-text search as a function that recommends articles similar to the search string, a morphological-analysis full-text search engine can be read as a content-based filtering recommendation engine.
There are two main approaches to full-text search: morphological analysis and N-grams. Their characteristics are as follows.
| | Morphological analysis | N-gram |
|---|---|---|
| Indexing speed | Slow | Fast |
| Index size | Small | Large |
| Search noise | Low | High |
| Search omissions | Many | Few |
| Search speed | Fast | Slow |
| Language dependence | Dictionary required | No dictionary required |
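For concreteness, here is a minimal sketch of the N-gram side (my own illustration, not part of the original code): character bigrams need no dictionary, but they produce search noise, because for example 京都 (Kyoto) appears inside 東京都 (Tokyo Metropolis).

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals


def char_ngrams(text, n=2):
    """Split text into overlapping character N-grams (bigrams by default)."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

# The bigram index for 東京都庁 contains the spurious term 京都 (Kyoto)
print char_ngrams('東京都庁')  # [u'東京', u'京都', u'都庁']
```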
Let me explain only the points that may be hard to grasp: search omissions and language dependence, both of which stem from the dictionary. Morphological-analysis full-text search depends on its dictionary. For example, if "Tokyo Skytree" (東京スカイツリー) is not registered in the dictionary, morphological analysis splits it into "Tokyo", "Sky", and "Tree", and unintended results such as an article about a Christmas tree in Tokyo may be returned. The weakness of morphological-analysis full-text search is that it cannot recognize the surrounding context.
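This is easy to observe with janome, the tokenizer used later in tfidf.py (the exact split depends on the bundled dictionary version, so treat the output as illustrative):

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
from janome.tokenizer import Tokenizer

t = Tokenizer()
# A word missing from the dictionary is split into smaller known words,
# e.g. 東京スカイツリー (Tokyo Skytree) -> 東京 / スカイ / ツリー
for token in t.tokenize('東京スカイツリー'):
    print token.surface, token.part_of_speech
```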
We will develop a data acquisition batch that accesses and analyzes every page of the Japanese Wikipedia, and a search function on top of it.
■ Operation flow of the Wikipedia data acquisition batch
Read every title from the Wikipedia title dump file, fetch the corresponding page over HTTP, extract a feature vector from the page text, and store a per-title index and a per-word inverted index in Redis.
■ Operation flow of the search function
Extract the feature vector of the search string by morphological analysis, look up each feature word in the inverted index, aggregate the scores per article title, and display the articles in descending order of similarity.
■ Output example of the feature vector of a search string
input1: 紅白歌合戦 (Kouhaku Uta Gassen, the Red and White Song Battle)
output1: [[u'紅白', 0.5], [u'歌合戦', 0.5]]
input2: the lightest elementary particle with mass, excluding the neutrino
output2: [[u'ニュー', 0.16666666666666666], [u'トリノ', 0.16666666666666666], [u'粒子', 0.16666666666666666], [u'素粒子', 0.16666666666666666], [u'質量', 0.16666666666666666]]
Note how ニュートリノ ("neutrino"), absent from the dictionary as a single word, is split into ニュー ("new") and トリノ ("Turin") in output2.
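These vectors are produced by `TFIDF.gen`, implemented in tfidf.py below. A usage sketch with the query of input1 (each score is the word's count divided by the total noun count of the text):

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
from tx.tfidf import TFIDF

# Feature vector of the query: [[feature word, normalized term frequency], ...]
print TFIDF.gen('紅白歌合戦')  # e.g. [[u'紅白', 0.5], [u'歌合戦', 0.5]]
```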
■ Example of similarity calculation from index data
Let's calculate the similarity between the search string 紅白歌合戦 (Kouhaku Uta Gassen) and each article. (It may be faster to just read the implementation in search.py.)
index-data: fetched from the feature-word inverted index for 紅白 (Kouhaku):
title | value |
---|---|
Anime Kouhaku Uta Gassen | 0.0226161369193 |
MAESTRO-T | 0.00769230769231 |
Electric comics and Mika Orihara's rugged home run | 0.00740740740741 |
index-data: fetched from the feature-word inverted index for 歌合戦 (Uta Gassen):
title | value |
---|---|
1996 in television (Japan) | 0.005555555555555556 |
Anime Kouhaku Uta Gassen | 0.00105263157895 |
2003 in television (Japan) | 0.00100840336134 |
The similarity of "Anime Kouhaku Uta Gassen" to the query is then:
0.5 * 0.0226161369193 + 0.5 * 0.00105263157895 = 0.011834384249125
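The same aggregation, as a self-contained sketch of what search.py does further below (titles romanized as placeholders):

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
from collections import defaultdict

# Feature vector of the query "Kouhaku Uta Gassen"
query = [('Kouhaku', 0.5), ('UtaGassen', 0.5)]
# Inverted-index lookups for each feature word (title -> score), abridged
r_index = {
    'Kouhaku': {'Anime Kouhaku Uta Gassen': 0.0226161369193,
                'MAESTRO-T': 0.00769230769231},
    'UtaGassen': {'Anime Kouhaku Uta Gassen': 0.00105263157895},
}

result = defaultdict(float)
for word, weight in query:
    for title, score in r_index[word].items():
        # similarity contribution = query weight * index score
        result[title] += weight * score

print result['Anime Kouhaku Uta Gassen']  # 0.011834384249125
```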
github - text_serarch_prototype
install
easy_install lxml
pip install beautifulsoup4
pip install janome
pip install nltk
pip install redis
data_import.py
# -*- coding: utf-8 -*-
from __future__ import absolute_import, unicode_literals
import ssl
import time

import requests

from tx.storage import Storage
from tx.tfidf import TFIDF

# List of all Wikipedia (ja) article titles
DATA_PATH = './data/jawiki-latest-all-titles-in-ns0'


def create_index(title):
    """
    Access the Wikipedia page over HTTP,
    extract the feature vector of the text, and store a search index.
    """
    url = Storage.get_wikipedia_url(str(title))
    print title, url
    # HTTP access to the Wikipedia page to extract the text's feature vector
    tfidf = TFIDF.gen_web(url)
    # Store the index used for searching
    s.save_tfidf(title, tfidf)
    return


print 'start'
# Generate URLs from the file listing all Wikipedia titles
s = Storage()
f = open(DATA_PATH, 'r')
ct = 0
for title in f:
    ct += 1
    if ct < 0:  # raise this threshold to resume an interrupted import
        continue
    try:
        create_index(title)
    except UnicodeDecodeError:
        print "ERROR", title
    except requests.exceptions.ConnectionError:
        time.sleep(2)
    except requests.exceptions.Timeout:
        time.sleep(2)
    except ssl.SSLError:
        time.sleep(2)
storage.py
# -*- coding: utf-8 -*-
from __future__ import absolute_import, unicode_literals
import urllib

import redis

PREFIX = 'tx'
KEY_TITLE = '%s:TITLE:{}' % PREFIX
KEY_INDEX_TITLE = '%s:INDEX:{}' % PREFIX
KEY_R_INDEX_WORD = '%s:R_INDEX:{}' % PREFIX


class KeyMixin(object):
    @classmethod
    def get_key_title(cls, title):
        title = urllib.quote_plus(title)
        return KEY_TITLE.format(title)

    @classmethod
    def get_key_index(cls, title):
        title = urllib.quote_plus(title)
        return KEY_INDEX_TITLE.format(title)

    @classmethod
    def get_key_r_index(cls, word):
        return KEY_R_INDEX_WORD.format(word)


class Storage(KeyMixin):
    """
    Save the per-title index and the per-feature-word inverted index in Redis.
    title ... Wikipedia title
    word  ... feature word
    """
    _cli = None
    timeout = 60 * 60 * 24 * 30

    @classmethod
    def get_wikipedia_url(cls, title):
        """
        Generate the Wikipedia URL from a title.
        """
        _base_url = "https://ja.wikipedia.org/wiki/{}"
        url = _base_url.format(urllib.quote_plus(title))
        return url[:-3]  # strip the URL-encoded trailing newline ('%0A')

    @property
    def client(self):
        """
        :rtype : Redis
        """
        if Storage._cli is None:
            Storage._cli = redis.Redis(host='localhost', port=6379, db=2)
        return Storage._cli

    def save_tfidf(self, title, tfidf):
        self.set_index(title, tfidf)
        self.set_r_index(title, tfidf)

    def set_index(self, title, tfidf):
        """
        TF-IDF per title.
        Overwrite the TF-IDF values in a ZSET keyed by title.
        """
        key = Storage.get_key_index(title)
        self.client.delete(key)
        for word, score in tfidf:
            self.client.zadd(key, word, score)  # redis-py 2.x zadd signature

    def set_r_index(self, title, tfidf):
        """
        Per-word inverted index.
        Allows looking up titles from a feature word.
        """
        for word, score in tfidf:
            key = Storage.get_key_r_index(word)
            self.client.zadd(key, title, score)

    def get_r_index(self, word):
        """
        Look up articles from a feature word, highest score first.
        """
        key = Storage.get_key_r_index(word)
        return self.client.zrevrange(key, 0, 1000, withscores=True)
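A quick round-trip through Storage, assuming a local Redis on localhost:6379 (db 2) as hard-coded above; the title and feature words are romanized placeholders:

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
from tx.storage import Storage

s = Storage()
# Store both the per-title index and the inverted index for one article
s.save_tfidf('Anime_Kouhaku_Uta_Gassen',
             [['Kouhaku', 0.0226161369193], ['UtaGassen', 0.00105263157895]])
# Reverse lookup: titles scored for the feature word, best first
print s.get_r_index('Kouhaku')
# -> [('Anime_Kouhaku_Uta_Gassen', 0.0226161369193)]
```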
tfidf.py
# -*- coding: utf-8 -*-
from __future__ import absolute_import, unicode_literals
from collections import defaultdict
from math import sqrt
import re

import nltk
import requests
from bs4 import BeautifulSoup
from janome.tokenizer import Tokenizer


class TFIDF(object):
    _t = None

    @classmethod
    def gen(cls, text, enable_one_char=False):
        """
        Get TF-IDF
        :param text: str
        :rtype : list[list[str, float]]
        """
        _text = cls.filter(text)
        return cls.analysis(_text, enable_one_char=enable_one_char)

    @classmethod
    def gen_web(cls, url, enable_one_char=False):
        """
        Get TF-IDF from a URL
        :param url: str
        :rtype: list[list[str, float]]
        """
        # HTTP GET
        response = requests.get(url, timeout=2)
        # strip HTML tags
        soup = BeautifulSoup(response.text, "lxml")
        # page title text plus body text
        text = (soup.title.string or '') + soup.get_text()
        return cls.gen(text, enable_one_char=enable_one_char)

    @classmethod
    def similarity(cls, tfidf1, tfidf2):
        """
        Cosine similarity of two TF-IDF vectors:
        cos(theta) = A . B / (|A| * |B|)
        :param tfidf1: list[list[str, float]]
        :param tfidf2: list[list[str, float]]
        :rtype : float
        """
        tfidf2_dict = {key: value for key, value in tfidf2}
        ab = 0  # dot product A . B
        for key, value in tfidf1:
            value2 = tfidf2_dict.get(key)
            if value2:
                ab += float(value * value2)
        # |A| and |B|
        a = sqrt(sum([v ** 2 for k, v in tfidf1]))
        b = sqrt(sum([v ** 2 for k, v in tfidf2]))
        return float(ab / (a * b))

    @classmethod
    def some_similarity(cls, base_url, data):
        """
        :param base_url: str
        :param data: list[list[str, str]]
        :rtype : list[list[str, str, float]]
        """
        base_tfidf = cls.gen_web(base_url)
        return [[title, url, cls.similarity(base_tfidf, cls.gen_web(url))]
                for title, url in data]

    @classmethod
    def analysis(cls, text, enable_one_char):
        """
        Morphological analysis of the text; counts nouns and returns
        normalized term frequencies (the "TF-IDF" here is actually plain TF).
        :param text: str
        :rtype : list[list[str, float]]
        """
        result = defaultdict(int)
        result2 = {}
        count = 0
        t = cls._get_tokenizer()
        # part-of-speech tags from janome's ipadic dictionary are Japanese
        f_in = lambda x, y: x in y
        for token in t.tokenize(text):
            if not f_in('名詞', token.part_of_speech):  # noun
                continue
            count += 1
            if f_in('非自立', token.part_of_speech):  # non-independent
                continue
            if f_in('接尾', token.part_of_speech):  # suffix
                continue
            if f_in('数', token.part_of_speech):  # number
                continue
            if not enable_one_char:
                if len(token.surface) == 1:
                    continue
            result[token.surface] += 1
            result2[token.surface] = token

        # normalized term frequency for the top 100 words
        result3 = []
        for key in result:
            result3.append([key, result[key]])
        result3.sort(key=lambda x: x[1], reverse=True)

        result4 = []
        for r in result3[:100]:
            result4.append([r[0], float(float(r[1]) / float(count))])
        return result4

    @classmethod
    def filter(cls, text):
        """
        Filter the text to eliminate noise
        :param text: str
        :rtype : str
        """
        # remove alphabets, half-width alphanumerics and some punctuation
        text = re.sub(r'[a-zA-Z0-9"\.\,\@]+', '', text)
        text = re.sub(r'[!"“#$%&()\*\+\-\.,\/:;<=>?@\[\\\]^_`{|}~]', '', text)
        # remove line breaks, tabs and the date characters 年月日 (year/month/day)
        text = re.sub(r'[\n\r\t年月日]', '', text)
        # keep only Japanese character runs: hiragana, katakana and kanji
        # (drops Korean, Hebrew, and other non-Japanese characters)
        jp_chartype_tokenizer = nltk.RegexpTokenizer(
            u'([ぁ-んー]+|[ァ-ンー]+|[\u4e00-\u9FFF]+|[ぁ-んァ-ンー\u4e00-\u9FFF]+)')
        text = "".join(jp_chartype_tokenizer.tokenize(text))
        return text

    @classmethod
    def _get_tokenizer(cls):
        if TFIDF._t is not None:
            return TFIDF._t
        TFIDF._t = Tokenizer()
        return TFIDF._t
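A small sanity check of `TFIDF.similarity` on two hand-built vectors (values chosen for illustration):

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals
from tx.tfidf import TFIDF

a = [['red', 0.5], ['white', 0.5]]
b = [['red', 0.5], ['blue', 0.5]]
# Only 'red' overlaps: cos = 0.25 / (sqrt(0.5) * sqrt(0.5)) = 0.5
print TFIDF.similarity(a, b)  # 0.5
```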
■ Execution result of data_import.py
search.py
# -*- coding: utf-8 -*-
from __future__ import absolute_import, unicode_literals
from collections import defaultdict

from tx.storage import Storage
from tx.tfidf import TFIDF


class Search(object):
    @classmethod
    def search(cls, v, count=5):
        """
        Search Wikipedia pages related to v
        """
        # extract the feature vector of v by morphological analysis
        tfidf = TFIDF.gen(v)
        s = Storage()
        result = defaultdict(float)
        for search_word, search_score in tfidf:
            # query the inverted index
            title_score_map = s.get_r_index(search_word)
            for _title, _score in title_score_map:
                # accumulate the similarity per title
                result[_title] += search_score * _score

        # sort by similarity score, descending
        search_result = [(k, result[k]) for k in result]
        search_result = sorted(search_result, key=lambda x: x[1], reverse=True)
        if len(search_result) >= count:
            return search_result[:count]
        return search_result


def printer(l, keyword):
    print "++++++++++++++++++++++"
    print "Search results: {}".format(keyword)
    print "++++++++++++++++++++++"
    count = 1
    for title, score in l:
        print "- Rank {}: similarity: {}".format(count, score)
        print "-", title, Storage.get_wikipedia_url(title)
        count += 1


def main():
    v = "Antioxidant and detoxifying power of the body"
    printer(Search.search(v), v)
    v = "Regional JP domain name"
    printer(Search.search(v), v)
    v = "African Identity Competitive Capitalist System Class Struggle"
    printer(Search.search(v), v)

main()
Execution result
>>> python search.py
++++++++++++++++++++++
Search results: Antioxidant and detoxifying power of the body
++++++++++++++++++++++
- Rank 1: similarity: 0.00787401574803
- 2-Adamantanone https://ja.wikipedia.org/wiki/2-%E3%82%A2%E3%83%80%E3%83%9E%E3%83%B3%E3%82%BF%E3%83%8E%E3%83%B3
- Rank 2: similarity: 0.00654878847413
- (R)-1-Isothiocyanato-4-(methylsulfinyl)butane https://ja.wikipedia.org/wiki/%28R%29-1-%E3%82%A4%E3%82%BD%E3%83%81%E3%82%AA%E3%82%B7%E3%82%A2%E3%83%8A%E3%83%88-4-%28%E3%83%A1%E3%83%81%E3%83%AB%E3%82%B9%E3%83%AB%E3%83%95%E3%82%A3%E3%83%8B%E3%83%AB%29%E3%83%96%E3%82%BF%E3%83%B3
- Rank 3: similarity: 0.00507614213198
- 2-Iodoxybenzoic acid https://ja.wikipedia.org/wiki/2-%E3%83%A8%E3%83%BC%E3%83%89%E3%82%AD%E3%82%B7%E5%AE%89%E6%81%AF%E9%A6%99%E9%85%B8
- Rank 4: similarity: 0.00465313028765
- Tributyltin oxide https://ja.wikipedia.org/wiki/%E3%83%88%E3%83%AA%E3%83%96%E3%83%81%E3%83%AB%E3%82%B9%E3%82%BA%E3%82%AA%E3%82%AD%E3%82%B7%E3%83%89
- Rank 5: similarity: 0.00440917107584
- Anoxomer https://ja.wikipedia.org/wiki/%E3%82%A2%E3%83%8E%E3%82%AD%E3%82%BD%E3%83%9E%E3%83%BC
++++++++++++++++++++++
Search results: Regional JP domain name
++++++++++++++++++++++
- Rank 1: similarity: 0.0217041800643
- Abitibi-Témiscamingue region https://ja.wikipedia.org/wiki/%E3%82%A2%E3%83%93%E3%83%86%E3%82%A3%E3%83%93%E3%83%BB%E3%83%86%E3%83%9F%E3%82%B9%E3%82%AB%E3%83%9E%E3%83%B3%E3%82%B0%E5%9C%B0%E5%9F%9F
- Rank 2: similarity: 0.0203389830508
- .jp https://ja.wikipedia.org/wiki/.jp
- Rank 3: similarity: 0.0203217612193
- .shimane.jp https://ja.wikipedia.org/wiki/.shimane.jp
- Rank 4: similarity: 0.0203217612193
- .okinawa.jp https://ja.wikipedia.org/wiki/.okinawa.jp
- Rank 5: similarity: 0.0203217612193
- .chiba.jp https://ja.wikipedia.org/wiki/.chiba.jp
++++++++++++++++++++++
Search results: African Identity Competitive Capitalist System Class Struggle
++++++++++++++++++++++
- Rank 1: similarity: 0.017993795243
- African socialism https://ja.wikipedia.org/wiki/%E3%82%A2%E3%83%95%E3%83%AA%E3%82%AB%E7%A4%BE%E4%BC%9A%E4%B8%BB%E7%BE%A9
- Rank 2: similarity: 0.0109634551495
- Africa Liberal Network https://ja.wikipedia.org/wiki/%E3%82%A2%E3%83%95%E3%83%AA%E3%82%AB%E8%87%AA%E7%94%B1%E3%83%8D%E3%83%83%E3%83%88%E3%83%AF%E3%83%BC%E3%82%AF
- Rank 3: similarity: 0.00844594594595
- New Partnership for Africa's Development https://ja.wikipedia.org/wiki/%E3%82%A2%E3%83%95%E3%83%AA%E3%82%AB%E9%96%8B%E7%99%BA%E3%81%AE%E3%81%9F%E3%82%81%E3%81%AE%E6%96%B0%E3%83%91%E3%83%BC%E3%83%88%E3%83%8A%E3%83%BC%E3%82%B7%E3%83%83%E3%83%97
- Rank 4: similarity: 0.00833333333333
- Flag of the African Union https://ja.wikipedia.org/wiki/%E3%82%A2%E3%83%95%E3%83%AA%E3%82%AB%E9%80%A3%E5%90%88%E3%81%AE%E6%97%97
- Rank 5: similarity: 0.008203125
- African Central Bank https://ja.wikipedia.org/wiki/%E3%82%A2%E3%83%95%E3%83%AA%E3%82%AB%E4%B8%AD%E5%A4%AE%E9%8A%80%E8%A1%8C
It would be nice to find a cleaner way to write the code that operates Redis. It also seems that DBpedia can be used to extract information from Wikipedia.