Full-text search of Wikipedia by implementing a morphological-analysis full-text search engine

This article implements a morphological-analysis full-text search engine that searches Wikipedia. Full-text search is realized by calculating the similarity between the search string and each article and displaying the results in order of similarity. For fast search, we also implement a per-article index and an inverted index.

This is a continuation of the previously written [Recommendation] Summary of advantages and disadvantages of content-based and collaborative filtering / implementation method. If you think of full-text search as a function that recommends articles similar to the search string, you can see that a morphological-analysis full-text search engine is essentially a content-based-filtering recommendation engine.

Characteristics of a morphological-analysis full-text search engine

There are two main ways to realize full-text search: morphological analysis and N-gram. Their characteristics are as follows.

| | Morphological analysis | N-gram |
| --- | --- | --- |
| Indexing speed | Slow | Fast |
| Index size | Small | Large |
| Search noise | Little | Much |
| Search omissions | Many | Few |
| Search speed | Fast | Slow |
| Language dependence | Dictionary required | No dictionary required |

Let me explain just the points that are hard to understand: search omissions and language dependence (the need for a dictionary). Morphological-analysis full-text search depends on its dictionary. For example, if "Tokyo Sky Tree" is not registered in the dictionary, morphological analysis splits it into "Tokyo", "Sky", and "Tree". As a result, unintended search results such as a Christmas tree in Tokyo get displayed. The weakness of morphological-analysis full-text search is that it cannot recognize the surrounding context.
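For example, a minimal sketch with janome (the morphological analyzer used later in this article) shows how such a dictionary miss plays out; the exact split depends on the dictionary bundled with your janome version:

# -*- coding: utf-8 -*-
from janome.tokenizer import Tokenizer

t = Tokenizer()
# 東京スカイツリー ("Tokyo Sky Tree"): if the compound word is not in the
# dictionary, it is split into pieces such as 東京 / スカイ / ツリー
for token in t.tokenize(u'東京スカイツリー'):
    print token.surface, token.part_of_speech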

Finished product search results

■ input1: Antioxidant and detoxifying power of the body

output1:

++++++++++++++++++++++
Search results: Antioxidant and detoxifying power of the body
++++++++++++++++++++++
- Rank 1: Similarity: 0.00787401574803
- [2-Adamantanone](https://ja.wikipedia.org/wiki/2-%E3%82%A2%E3%83%80%E3%83%9E%E3%83%B3%E3%82%BF%E3%83%8E%E3%83%B3)
- Rank 2: Similarity: 0.00654878847413
- [(R)-1-Isothiocyanato-4-(methylsulfinyl)butane](https://ja.wikipedia.org/wiki/%28R%29-1-%E3%82%A4%E3%82%BD%E3%83%81%E3%82%AA%E3%82%B7%E3%82%A2%E3%83%8A%E3%83%88-4-%28%E3%83%A1%E3%83%81%E3%83%AB%E3%82%B9%E3%83%AB%E3%83%95%E3%82%A3%E3%83%8B%E3%83%AB%29%E3%83%96%E3%82%BF%E3%83%B3)
- Rank 3: Similarity: 0.00507614213198
- [2-Iodoxybenzoic acid](https://ja.wikipedia.org/wiki/2-%E3%83%A8%E3%83%BC%E3%83%89%E3%82%AD%E3%82%B7%E5%AE%89%E6%81%AF%E9%A6%99%E9%85%B8)
- Rank 4: Similarity: 0.00465313028765
- [Tributyltin oxide](https://ja.wikipedia.org/wiki/%E3%83%88%E3%83%AA%E3%83%96%E3%83%81%E3%83%AB%E3%82%B9%E3%82%BA%E3%82%AA%E3%82%AD%E3%82%B7%E3%83%89)
- Rank 5: Similarity: 0.00440917107584
- [Anoxomer](https://ja.wikipedia.org/wiki/%E3%82%A2%E3%83%8E%E3%82%AD%E3%82%BD%E3%83%9E%E3%83%BC)

Full-text search engine specifications

We will develop a data acquisition batch that accesses and analyzes all pages of the Japanese Wikipedia, and a search function that runs on the resulting indexes.

■ Operation flow of Wikipedia data acquisition batch

  1. Get an article title from the file containing all Wikipedia titles (jawiki-latest-all-titles-in-ns0.gz).
  2. Generate a Wikipedia URL from the title.
  3. Execute an HTTP GET on the URL and load the body into memory.
  4. Calculate TF-IDF, the feature vector of the HTTP body. (For TF-IDF, see the previous article.)
  5. Store a per-article index and a per-feature-word inverted index in Redis.
  6. Return to 1.

■ Operation flow of search function

  1. Calculate TF-IDF, the feature vector of the search string.
  2. For each feature word that makes up the feature vector, obtain data from the inverted (feature word) index.
  3. Calculate the similarity of each article from the index data and display the results.

The similarity used here is the inner product of the query's TF-IDF vector and each article's TF-IDF vector: for every feature word the two have in common, the product of the two scores is accumulated per article (a worked example follows below).

■ Output example of the feature vector of a search string

input1: Red and White Song Battle
output1: [[u'Red and White', 0.5], [u'Song Battle', 0.5]]

input2: The lightest elementary particle among particles with mass, excluding neutrinos
output2: [[u'New', 0.16666666666666666], [u'Turin', 0.16666666666666666], [u'Particles', 0.16666666666666666], [u'Elementary Particles', 0.16666666666666666], [u'Mass', 0.16666666666666666]]

(Note that "neutrino" has been split into "New" and "Turin" here, an example of the dictionary dependence described above.)
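These vectors can be reproduced with the TFIDF class implemented later in this article. A minimal sketch, assuming the tx package from the source code below is importable and the query is given in Japanese:

# -*- coding: utf-8 -*-
from tx.tfidf import TFIDF

# feature vector of a search string: morphological analysis + normalized TF
print TFIDF.gen(u'紅白歌合戦')
# => [[u'紅白', 0.5], [u'歌合戦', 0.5]]  (the exact tokens depend on the dictionary)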

■ Example of similarity calculation from index data

Let's calculate the similarity between the search string "Red and White Song Battle" and each article. (It may be faster to just read the implementation of search.py.)

index-data: data obtained from the feature word index for "Red and White"

| title | value |
| --- | --- |
| Anime Red and White Song Battle | 0.0226161369193 |
| MAESTRO-T | 0.00769230769231 |
| Electric comics and Mika Orihara's rugged home run | 0.00740740740741 |

index-data: data obtained from the feature word index for "Song Battle"

| title | value |
| --- | --- |
| 1996 television_(Japan) | 0.005555555555555556 |
| Anime Red and White Song Battle | 0.00105263157895 |
| 2003 television_(Japan) | 0.00100840336134 |

In this case, the similarity of "Anime Red and White Song Battle" is calculated as follows:

0.5 * 0.0226161369193 + 0.5 * 0.00105263157895 = 0.011834384249125
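The same calculation in a few lines of Python, with the scores taken from the two index tables above:

# TF-IDF vector of the search string "Red and White Song Battle"
query = [(u'Red and White', 0.5), (u'Song Battle', 0.5)]

# scores of "Anime Red and White Song Battle" in the two feature word indexes
index_score = {u'Red and White': 0.0226161369193,
               u'Song Battle': 0.00105263157895}

similarity = sum(weight * index_score[word] for word, weight in query)
print similarity  # 0.011834384249125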

Source code

github - text_serarch_prototype


Install


easy_install lxml
pip install beautifulsoup4
pip install janome
pip install nltk
pip install redis
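Note that the scripts below also assume a Redis server listening on localhost:6379 (db 2, see storage.py); start one locally (for example with redis-server) before running the batch.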

Wikipedia data acquisition batch

data_import.py


# -*- coding: utf-8 -*-
from __future__ import absolute_import, unicode_literals
import ssl
import time
from tx.storage import Storage
import requests
from tx.tfidf import TFIDF

#wikipedia full title list
DATA_PATH = './data/jawiki-latest-all-titles-in-ns0'


def create_index(title):
    """
    HTTP-access the Wikipedia page for the title,
    extract the feature vector of the text, and build the search indexes.
    """
    url = Storage.get_wikipedia_url(str(title))
    print title, url

    # HTTP-access the Wikipedia page and extract the text's feature vector
    tfidf = TFIDF.gen_web(url)

    # generate the indexes for searching (per-article index and inverted index)
    s.save_tfidf(title, tfidf)
    return


print 'start'

# generate a URL from each line of the all-titles file
s = Storage()
f = open(DATA_PATH, 'r')
ct = 0
for title in f:
    # each line keeps its trailing newline; it is stripped when the URL
    # is generated (see Storage.get_wikipedia_url)
    ct += 1
    if ct < 0:
        # raise this threshold to skip already-imported titles when resuming
        continue
    try:
        create_index(title)
    except UnicodeDecodeError:
        print "ERROR", title
    except requests.exceptions.ConnectionError:
        time.sleep(2)
    except requests.exceptions.Timeout:
        time.sleep(2)
    except ssl.SSLError:
        time.sleep(2)

storage.py


# -*- coding: utf-8 -*-
from __future__ import absolute_import, unicode_literals
import redis
import urllib

PREFIX = 'tx'
KEY_TITLE = '%s:TITLE:{}' % PREFIX
KEY_INDEX_TITLE = '%s:INDEX:{}' % PREFIX
KEY_R_INDEX_WORD = '%s:R_INDEX:{}' % PREFIX


class KeyMixin(object):
    @classmethod
    def get_key_title(cls, title):
        title = urllib.quote_plus(title)
        return KEY_TITLE.format(title)

    @classmethod
    def get_key_index(cls, title):
        title = urllib.quote_plus(title)
        return KEY_INDEX_TITLE.format(title)

    @classmethod
    def get_key_r_index(cls, word):
        return KEY_R_INDEX_WORD.format(word)


class Storage(KeyMixin):
    """
    Save the per-title index and the per-feature-word inverted index in Redis.

    title ... Wikipedia article title
    word  ... feature word
    """

    _cli = None
    timeout = 60 * 60 * 24 * 30

    @classmethod
    def get_wikipedia_url(cls, title):
        """
        Generate a Wikipedia URL from a title.
        """
        _base_url = "https://ja.wikipedia.org/wiki/{}"
        url = _base_url.format(urllib.quote_plus(title))
        # drop the trailing newline of the title line, which quote_plus
        # has encoded as the three characters '%0A'
        return url[:-3]

    @property
    def client(self):
        """
        :rtype : Redis
        """
        if Storage._cli is None:
            Storage._cli = redis.Redis(host='localhost', port=6379, db=2)
        return Storage._cli

    def save_tfidf(self, title, tfidf):
        self.set_index(title, tfidf)
        self.set_r_index(title, tfidf)

    def set_index(self, title, tfidf):
        """
        Per-title index: overwrite the tfidf values in a ZSET keyed by the title.
        """
        key = Storage.get_key_index(title)
        self.client.delete(key)
        for word, score in tfidf:
            # note: the redis-py 2.x Redis class takes zadd(key, member, score);
            # redis-py 3.x changed this to zadd(key, {member: score})
            self.client.zadd(key, word, score)

    def set_r_index(self, title, tfidf):
        """
        Inverted index per feature word:
        the title can be looked up in reverse from a feature word.
        """
        for word, score in tfidf:
            key = Storage.get_key_r_index(word)
            self.client.zadd(key, title, score)

    def get_r_index(self, word):
        """
        Look up articles from a feature word.
        """
        key = Storage.get_key_r_index(word)
        return self.client.zrevrange(key, 0, 1000, withscores=True)
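As a usage sketch, saving one article's TF-IDF and querying it back through the inverted index looks like this (the title and scores are hypothetical, and a local Redis is assumed):

# -*- coding: utf-8 -*-
from tx.storage import Storage

s = Storage()
# writes ZSET tx:INDEX:<quoted title> plus one tx:R_INDEX:<word> per feature word
s.save_tfidf('Anime Kouhaku Uta Gassen',
             [['紅白', 0.0226161369193], ['歌合戦', 0.00105263157895]])

# reverse lookup: which articles contain this feature word?
print s.get_r_index('紅白')
# => [('Anime Kouhaku Uta Gassen', 0.0226161369193)]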

tfidf.py


# -*- coding: utf-8 -*-
from __future__ import absolute_import, unicode_literals
from collections import defaultdict
from math import sqrt
import re
import requests
from bs4 import BeautifulSoup
from janome.tokenizer import Tokenizer
import nltk


class TFIDF(object):
    _t = None

    @classmethod
    def gen(cls, text, enable_one_char=False):
        """
        Get TF-IDF
        :param text: str
        :rtype :list[list[str, float]]
        """
        _text = cls.filter(text)
        return cls.analysis(_text, enable_one_char=enable_one_char)

    @classmethod
    def gen_web(cls, url, enable_one_char=False):
        """
        Get TF-IDF from url
        :param url: str
        :rtype: list[list[str, float]]
        """
        # HTTP GET
        response = requests.get(url, timeout=2)

        # strip HTML tags from the response
        soup = BeautifulSoup(response.text, "lxml")
        # prepend the page title text to the body text
        text = (soup.title.string or '') + soup.get_text()
        return cls.gen(text, enable_one_char=enable_one_char)

    @classmethod
    def similarity(cls, tfidf1, tfidf2):
        """
        Get TF-IDF and Cosine Similarity
        cosθ =A / B/|A||B|
        :param tfidf1: list[list[str, float]]
        :param tfidf2: list[list[str, float]]
        :rtype : float
        """
        tfidf2_dict = {key: value for key, value in tfidf2}

        ab = 0  #A / B
        for key, value in tfidf1:
            value2 = tfidf2_dict.get(key)
            if value2:
                ab += float(value * value2)

        # |A| and |B|
        a = sqrt(sum([v ** 2 for k, v in tfidf1]))
        b = sqrt(sum([v ** 2 for k, v in tfidf2]))

        return float(ab / (a * b))

    @classmethod
    def some_similarity(cls, base_url, data):
        """
        :param base_url: str
        :param data: list[list[str, str]]
        :rtype : list[list[str, str, float]]
        """
        base_tfidf = cls.gen_web(base_url)
        return [[title, url, cls.similarity(base_tfidf, cls.gen_web(url))] for title, url in data]

    @classmethod
    def analysis(cls, text, enable_one_char):
        """
        Calc TF-IDF
        Morphological analysis of the text, counting the nouns it contains.
        :param text: str
        :rtype : list[list[str, float]]
        """
        result = defaultdict(int)
        result2 = {}
        count = 0
        t = cls._get_tokenizer()

        # morphological analysis; janome's part-of-speech labels are Japanese
        f_in = lambda x, y: x in y

        for token in t.tokenize(text):
            if not f_in('名詞', token.part_of_speech):  # noun
                continue
            count += 1

            if f_in('非自立', token.part_of_speech):  # non-independent noun
                continue

            if f_in('接尾', token.part_of_speech):  # suffix
                continue

            if f_in('数', token.part_of_speech):  # number
                continue

            if not enable_one_char:
                if len(token.surface) == 1:
                    continue

            result[token.surface] += 1
            result2[token.surface] = token

        # score = normalized term frequency of the top 100 nouns
        result3 = []
        for key in result:
            result3.append([key, result[key]])

        result3.sort(key=lambda x: x[1], reverse=True)
        result4 = []
        for r in result3[:100]:
            result4.append([r[0], float(float(r[1]) / float(count))])
        return result4

    @classmethod
    def filter(cls, text):
        """
        Filter the text to eliminate noise.
        :param text: str
        :rtype : str
        """
        # eliminate alphanumeric characters, symbols, line breaks and tabs
        text = re.sub(r'[a-zA-Z0-9".,@]+', '', text)
        text = re.sub(r'[!"“#$%&()\*\+\-\.,\/:;<=>?@\[\\\]^_`{|}~]', '', text)
        text = re.sub(r'[\n\r\t年月日]', '', text)

        # keep only runs of hiragana, katakana and kanji
        # (drops Korean, Hebrew and other non-Japanese characters)
        jp_chartype_tokenizer = nltk.RegexpTokenizer(u'([ぁ-んー]+|[ァ-ンー]+|[\u4e00-\u9FFF]+|[ぁ-んァ-ンー\u4e00-\u9FFF]+)')
        text = "".join(jp_chartype_tokenizer.tokenize(text))
        return text

    @classmethod
    def _get_tokenizer(cls):
        if TFIDF._t is not None:
            return TFIDF._t
        TFIDF._t = Tokenizer()
        return TFIDF._t
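As a quick sanity check of TFIDF.similarity, here are two hand-made vectors (hypothetical values):

# -*- coding: utf-8 -*-
from tx.tfidf import TFIDF

a = [[u'紅白', 0.5], [u'歌合戦', 0.5]]
b = [[u'紅白', 0.8], [u'アニメ', 0.2]]

# cosθ = A·B / (|A||B|) = 0.4 / (sqrt(0.5) * sqrt(0.68)) ≈ 0.686
print TFIDF.similarity(a, b)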

■ Execution result of data_import.py (screenshot omitted)

Search function source code

search.py


# -*- coding: utf-8 -*-
from __future__ import absolute_import, unicode_literals
from collections import defaultdict
from tx.storage import Storage
from tx.tfidf import TFIDF


class Search(object):
    @classmethod
    def search(cls, v, count=5):
        """
Search wikipedia pages related to v
        """
        #Extract feature vector by morphological analysis of v
        tfidf = TFIDF.gen(v)
        s = Storage()
        result = defaultdict(float)
        for search_word, search_score in tfidf:
            #Inquire about the reverse index
            title_score_map = s.get_r_index(search_word)

            for _title, _score in title_score_map:
                #Calculate and aggregate similarity
                result[_title] += search_score * _score

        #Descending sort by similarity score
        search_result = [(k, result[k]) for k in result]
        search_result = sorted(search_result, key=lambda x: x[1], reverse=True)

        if len(search_result) >= count:
            return search_result[:count]
        return search_result


def printer(l, keyword):
    print "++++++++++++++++++++++"
    print "Search results: {}".format(keyword)
    print "++++++++++++++++++++++"
    count = 1
    for title, score in l:
        print "- Rank {}: Similarity: {}".format(count, score)
        print "-", title, Storage.get_wikipedia_url(title)

        count += 1


def main():
    v = "Antioxidant and detoxifying power of the body"
    printer(Search.search(v), v)

    v = "Regional JP domain name"
    printer(Search.search(v), v)

    v = "African Identity Competitive Capitalist System Class Struggle"
    printer(Search.search(v), v)


if __name__ == '__main__':
    main()

Execution result


>>> python search.py
++++++++++++++++++++++
Search results: Antioxidant and detoxifying power of the body
++++++++++++++++++++++
- Rank 1: Similarity: 0.00787401574803
- 2-Adamantanone https://ja.wikipedia.org/wiki/2-%E3%82%A2%E3%83%80%E3%83%9E%E3%83%B3%E3%82%BF%E3%83%8E%E3%83%B3
- Rank 2: Similarity: 0.00654878847413
- (R)-1-Isothiocyanato-4-(methylsulfinyl)butane https://ja.wikipedia.org/wiki/%28R%29-1-%E3%82%A4%E3%82%BD%E3%83%81%E3%82%AA%E3%82%B7%E3%82%A2%E3%83%8A%E3%83%88-4-%28%E3%83%A1%E3%83%81%E3%83%AB%E3%82%B9%E3%83%AB%E3%83%95%E3%82%A3%E3%83%8B%E3%83%AB%29%E3%83%96%E3%82%BF%E3%83%B3
- Rank 3: Similarity: 0.00507614213198
- 2-Iodoxybenzoic acid https://ja.wikipedia.org/wiki/2-%E3%83%A8%E3%83%BC%E3%83%89%E3%82%AD%E3%82%B7%E5%AE%89%E6%81%AF%E9%A6%99%E9%85%B8
- Rank 4: Similarity: 0.00465313028765
- Tributyltin oxide https://ja.wikipedia.org/wiki/%E3%83%88%E3%83%AA%E3%83%96%E3%83%81%E3%83%AB%E3%82%B9%E3%82%BA%E3%82%AA%E3%82%AD%E3%82%B7%E3%83%89
- Rank 5: Similarity: 0.00440917107584
- Anoxomer https://ja.wikipedia.org/wiki/%E3%82%A2%E3%83%8E%E3%82%AD%E3%82%BD%E3%83%9E%E3%83%BC
++++++++++++++++++++++
Search results: Regional JP domain name
++++++++++++++++++++++
- Rank 1: Similarity: 0.0217041800643
- Abitibi-Témiscamingue region https://ja.wikipedia.org/wiki/%E3%82%A2%E3%83%93%E3%83%86%E3%82%A3%E3%83%93%E3%83%BB%E3%83%86%E3%83%9F%E3%82%B9%E3%82%AB%E3%83%9E%E3%83%B3%E3%82%B0%E5%9C%B0%E5%9F%9F
- Rank 2: Similarity: 0.0203389830508
- .jp https://ja.wikipedia.org/wiki/.jp
- Rank 3: Similarity: 0.0203217612193
- .shimane.jp https://ja.wikipedia.org/wiki/.shimane.jp
- Rank 4: Similarity: 0.0203217612193
- .okinawa.jp https://ja.wikipedia.org/wiki/.okinawa.jp
- Rank 5: Similarity: 0.0203217612193
- .chiba.jp https://ja.wikipedia.org/wiki/.chiba.jp
++++++++++++++++++++++
Search results: African Identity Competitive Capitalist System Class Struggle
++++++++++++++++++++++
- Rank 1: Similarity: 0.017993795243
- African socialism https://ja.wikipedia.org/wiki/%E3%82%A2%E3%83%95%E3%83%AA%E3%82%AB%E7%A4%BE%E4%BC%9A%E4%B8%BB%E7%BE%A9
- Rank 2: Similarity: 0.0109634551495
- Africa Liberal Network https://ja.wikipedia.org/wiki/%E3%82%A2%E3%83%95%E3%83%AA%E3%82%AB%E8%87%AA%E7%94%B1%E3%83%8D%E3%83%83%E3%83%88%E3%83%AF%E3%83%BC%E3%82%AF
- Rank 3: Similarity: 0.00844594594595
- New Partnership for Africa's Development https://ja.wikipedia.org/wiki/%E3%82%A2%E3%83%95%E3%83%AA%E3%82%AB%E9%96%8B%E7%99%BA%E3%81%AE%E3%81%9F%E3%82%81%E3%81%AE%E6%96%B0%E3%83%91%E3%83%BC%E3%83%88%E3%83%8A%E3%83%BC%E3%82%B7%E3%83%83%E3%83%97
- Rank 4: Similarity: 0.00833333333333
- Flag of the African Union https://ja.wikipedia.org/wiki/%E3%82%A2%E3%83%95%E3%83%AA%E3%82%AB%E9%80%A3%E5%90%88%E3%81%AE%E6%97%97
- Rank 5: Similarity: 0.008203125
- African Central Bank https://ja.wikipedia.org/wiki/%E3%82%A2%E3%83%95%E3%83%AA%E3%82%AB%E4%B8%AD%E5%A4%AE%E9%8A%80%E8%A1%8C

I wish I could write a cleaner program for operating Redis. It also seems possible to use DBpedia to extract information from Wikipedia.
