What I did to speed up the string search task

I tried it because there was an algorithm problem to speed up the character string search at the company development camp. By the way, I am a so-called copy / paste programmer who is good at making similar products and refactoring them with reference to existing ones.

1. Challenges

Samples, test data, and processing result files implemented in advance with some of the following tasks hidden together will be provided.

■ Program contents

・ Search for multiple character strings for multiple text files and output the following -How many relevant character strings are included (total number in multiple texts) -Number of text files containing the corresponding character string

-Run on command prompt or Windows PowerShell ・ Specify the following arguments for "program" Program folder input file output file Folder: Stores search target files (plural) Input file: List the search string Output file: Search results

・ When searching, distinguish between uppercase and lowercase letters and full-width and half-width characters. ・ The text to be searched and the search character string are utf-8.

・ Search string -150 items in one file, separated by line breaks Search the number of occurrences in the file for all of them -Any length of 1 or more characters -Does not include spaces, tabs, line breaks, or special symbols -Any character string, which may be part of a word

・ Search results -Output the number of occurrences in the order of appearance of the search character string, separated by tabs The number of appearances is the sum of the results of multiple files

2. Sample program

sample,

-Number of text files containing the corresponding character string

The part of was not implemented.

"""
Count the number of occurrences of the specified character string from all the files in the specified folder
The result is output in TSV
python sample1.py text_folder strings_file result_file
"""
import sys
from pathlib import Path

class TextManager:
    """
A class that manages text file information
    """
    def __init__(self, foldername):
        '''
Read and store the files in the folder

        Parameters
        ----------
        foldername : str
Folder name
        '''
        self.__alllines = []
        p = Path(foldername)
        for filename in p.iterdir():
            with open(filename, 'r', encoding='utf-8') as f:
                lines = [s.strip() for s in f.readlines() if s.strip() != '']
                self.__alllines.extend(lines)

    def search(self, string_list):
        '''
Search
Search for registered text in the search string list

        Parameters
        ----------
        string_list : List[str]
List of search strings
        
        Returns
        -------
        Dict[str, int]
search results
        '''
        result = {}
        for s in string_list:
            count = self.__search_one(s)
            result[s] = count
        return result

    def __search_one(self, string):
        '''
Search for strings

        Parameters
        ----------
        string : str
Search string
        
        Returns
        -------
        int
Number of search strings that existed
        '''
        l = len(string)
        count = 0
        for line in self.__alllines:
            for i in range(0, len(line)-l+1):
                if string == line[i:i+l]:
                    count += 1
        return count

if __name__ == "__main__":
    args = len(sys.argv)
    if args != 4:
        print("USAGE: python sample1.py text_folder strings_file result_file")
        sys.exit()
    text_folder = sys.argv[1]
    strings_file = sys.argv[2]
    result_file = sys.argv[3]
    #Read the target file group (files in the folder)
    text_manager = TextManager(text_folder)
    #Read the search string
    search_strings = []
    with open(strings_file, 'r', encoding='utf-8') as fi:
        search_strings = [s.strip() for s in fi.readlines() if s.strip() != '']
    #Search execution: List[str] -> Dict[str, int]
    result = text_manager.search(search_strings)
    with open(result_file, 'w', encoding='utf-8') as fo:
        #Output in the order described in the search string file
        for s in search_strings:
            c = result.get(s, 0)
            fo.write("{}\t{}\n".format(s, c))

3. Find the number of text files that contain the corresponding character string

It takes a lot of effort to review the whole logic, so as a copy / paste programmer, use the current logic as it is. First, divide the part that was read from multiple files together into each file and save it. Some consideration is given to high-speed chips.

class TextManager:
    """
A class that manages text file information
    """
    def __init__(self, foldername):
        '''
Read and store the files in the folder

        Parameters
        ----------
        foldername : str
Folder name
        '''
        self.__alllines = []
        __alllines_extend = self.__alllines.extend
        p = Path(foldername)
        for filename in p.iterdir():
            __filelines = []
            __filelines_extends = __filelines.extend
            with open(filename, 'r', encoding='utf-8') as f:
                f_readlines = f.readlines
                lines = [s.strip() for s in f_readlines() if s.strip() != '']
                __filelines_extends(lines)
            self.__alllines.append(__filelines)

A loop for each file is added to the for loop of the part that searches one search string from the whole, and if the search string is found in it, the file count is incremented by 1, and finally the search string exists. Returns the number and the number of files in which the search string existed. The commented out print () is a reminder that I modified it while checking if it was working properly.

    def __search_one(self, string):
        '''
Search for strings

        Parameters
        ----------
        string : str
Search string
        
        Returns
        -------
        int
Number of search strings that existed

        int
Number of files where the search string existed
        '''
        l = len(string)
        count = 0
        fcount = 0
        # print(string, count, fcount)
        for filelines in self.__alllines:
            # print(filelines)
            find = False
            for line in filelines:
                # print(line)
                for i in range(0, len(line)-l+1):
                    if string == line[i:i+l]:
                        count += 1
                        find = True
            if find:
                fcount +=1
            # print(string, count, fcount)
        # print(string, count, fcount)
        return count, fcount

The search summary corresponds to getting two return values from individual search.

    def search(self, string_list):
        '''
Search
Search for registered text in the search string list

        Parameters
        ----------
        string_list : List[str]
List of search strings
        
        Returns
        -------
        Dict[str, int]
search results

        Dict[str, int]
Number of search result files
        '''
        result = {}
        fresult = {}
        for s in string_list:
            rtn = self.__search_one(s)
            result[s] = rtn[0]
            fresult[s] = rtn[1]
        return result, fresult

The output section has the same support, so it is omitted. Now you can easily get the correct answer, which is the minimum requirement for the task.

4. Eliminate unnecessary punctuation matching

Currently, the same judgment is made by shifting each character from the beginning to the end of the line read by __search_one (). The point is how to speed up this, but if you cut out blocks that do not contain non-target characters from the search target character string, matching efficiency should increase. That's why it is separated by the reading part. This is the part of .split ('[,." "<< >> | [#]]'). Other parentheses should be targeted, but since the target of the assignment was the file brought from Aozora Bunko, this is enough for the time being.

    def __init__(self, foldername):
        '''
Read and store the files in the folder

        Parameters
        ----------
        foldername : str
Folder name
        '''
        self.__alllines = []
        __alllines_append = self.__alllines.append
        p = Path(foldername)
        for filename in p.iterdir():
            __filelines = []
            __filelines_extends = __filelines.extend
            with open(filename, 'r', encoding='utf-8') as f:
                f_readlines = f.readlines
                lines = [s.strip().split('[、。「」《》|[#]]')
                         for s in f_readlines() if s.strip() != '']
                __filelines_extends(lines)
            __alllines_append(__filelines)

__Search_one () also increases the loop corresponding to the carving. There should be a way to write it that doesn't have to be increased, but for the time being, a loop was needed if it was a defeat method. Print () is useful for checking around. Actually, the test data that was distributed with the sample in advance was in English, so I coded it with split () without arguments, but that doesn't work in Japanese. Since the result is strange, when I print, it was chopped character by character in Japanese. So, once I removed it, I corresponded to the number of files, and then I split again.

    def __search_one(self, string):
        '''
Search for strings

        Parameters
        ----------
        string : str
Search string
        
        Returns
        -------
        int
Number of search strings that existed

        int
Number of files where the search string existed
        '''
        l = len(string)
        count = 0
        fcount = 0
        # print(string, count, fcount)
        for filelines in self.__alllines:
            # print(filelines)
            find = False
            for line in filelines:
                # print(line)
                for sentence in line:
                    for i in range(0, len(sentence)-l+1):
                        if string == sentence[i:i+l]:
                            count += 1
                            find = True
                            # print(sentence, string, sep='\t')
            if find:
                fcount +=1
            # print(string, count, fcount)
        # print(string, count, fcount)
        return count, fcount

5. Review of matching

It is very inefficient to compare the appearance of character strings by shifting them character by character. No matter how much you devise a scripting language, the speed will not increase if you compare it yourself. If you can manage with the functions of the standard library, that is absolutely faster. I knew the method of String in String, but I rejected it because it wouldn't work if it appeared multiple times in one target. I'm not confident in my memory, so I'll search for it. [Search on google "python string match"](https://www.google.com/search?q=python+%E6%96%87%E5%AD%97%E5%88%97+%E3%83 % 9E% E3% 83% 83% E3% 83% 81 & oq = python +% E3% 83% 9E% E3% 83% 83% E3% 83% 81 & aqs = chrome.2.69i57j0i7i30l2j0j0i7i30j0j0i7i30l2.11559j0j7 & sourceid = chrome & ie Oh, there was a function I wanted. Search string with Python (determine whether it contains ~, get position, count) So, rewrite the process in the middle of __search_one.

            for line in filelines:
                # print(line)
                for sentence in line:
                    # for i in range(0, len(sentence)-l+1):
                    #     if string == sentence[i:i+l]:
                    c = sentence.count(string)
                    if c > 0:
                        count += c
                        find = True
 ~~~

This makes the processing time an order of magnitude faster.

# 6.Review of processing
When you can find the number of appearances from the target in one shot, 4.Added in`split`Is rather a cause of slowdown as it only increases loops.
I don't need to carve it anymore, so I'll put it back.
here`__search_one`I will write only inside.

```python
            for line in filelines:
                # print(line)
                c = line.count(string)
                if c > 0:
                    count += c
                    find = True

7.Last push

countSince you can find out the number of appearances in one shot using, there is no point in reading line by line. In this issue, there is no limit on the amount of memory used, only the installed memory of the verification environment is written. If you read the entire file, it makes no difference whether you read it line by line or by file. That's why I changed to file batch reading. I should have changed the variable name to something more appropriate, but I didn't really think about it.

    def __init__(self, foldername):
        '''
 Read and store the files in the folder

        Parameters
        ----------
        foldername : str
 Folder name
        '''
        self.__alllines = []
        __alllines_extend = self.__alllines.extend
        p = Path(foldername)
        for filename in p.iterdir():
            with open(filename, 'r', encoding='utf-8') as f:
                lines = f.read()
            self.__alllines.append(lines)

The actual search part is simple

    def __search_one(self, string):
        '''
 Search for strings

        Parameters
        ----------
        string : str
 Search string
        
        Returns
        -------
        int
 Number of search strings that existed

        int
 Number of files where the search string existed
        '''
        count = 0
        fcount = 0
        # print(string, count, fcount)
        for lines in self.__alllines:
            c = lines.count(string)
            if c > 0:
                count += c
                fcount += 1
            # print(string, count, fcount)
        # print(string, count, fcount)
        return count, fcount

This is 116 times faster than the first correct answer.

8.Summary

Since it is an algorithm problem, we should have devised a matching method. In fact,countSome people wrote code that would take half the time of the first code without using. However, even if you devise various algorithms in the scripting language, you cannot beat the functions of the library whose entity is written in C. If you know if the right library exists, you'll quickly reach the correct answer. Actually, there was a person who got an order of magnitude score in a blink of an eye by participating in the middle, and from there I thought about various things and searched and reached the final point. I would like to know what kind of algorithm is used to process in half the time without shifting with the library function.

Recommended Posts

What I did to speed up the string search task
What I did to ssh to the VPS Ubuntu environment
What I did to welcome the Python2 EOL with confidence
What I did to save Python memory
What I did to keep track of the humidity and temperature of the archive
[At Coder] What I did to reach the green rank in Python
[Python] What I did to do Unit Test
What I did when updating from Python 2.6 to 2.7
What I did to get started with Linux commands
I tried to summarize the string operations of Python
What I did when I was angry to put it in with the enable-shared option
I tried to speed up video creation by parallel processing
Ventilation is important. What I did to keep track of the C02 concentration in the room
Speed up the netstat command
Mongodb Shortest Introduction (3) I tried to speed up even millions
What I did when I wanted to make Python faster -Numba edition-
What to do when you get "I can't see the site !!!!"
Numba to speed up as Python
H29.2.27 ~ 3.5 Summary of what I did
I calculated the stochastic integral (I to integral)
Project Euler 4 Attempt to speed up
How to speed up Python calculations
[DRF] Snippet to speed up PrimaryKeyRelatedField
I tried to move the ball
I tried to estimate the interval.
What skills do I need to program with the FBX SDK Python?
I tried to summarize the general flow up to service creation by self-education.
I want to batch convert the result of "string" .split () in Python
I want to pin Spyder to the taskbar
I want to output to the console coolly
How to speed up instantiation of BeautifulSoup
I tried to summarize the umask command
I want to handle the rhyme part1
I tried to recognize the wake word
What I referred to when studying tkinter
I want to handle the rhyme part3
"Lie ... What have you been up to?"
I tried to summarize the graphical modeling.
[Pandas] Expand the character string to DataFrame
I tried to estimate the pi stochastically
I tried to touch the COTOHA API
What I did with a Python array
I want to display the progress bar
What I always add to my ~ / .bashrc
What I was addicted to Python autorun
I want to handle the rhyme part2
I want to handle the rhyme part5
I want to handle the rhyme part4
I want to visualize the transfer status of the 2020 J League, what should I do?
[Django version up] The way to convert to string when saving TextFiled has changed
What I did when I got stuck in the time limit with lambda python
AtCoder AGC 041 C --I was addicted to the full search of Domino Quality
I want to set a life cycle in the task definition of ECS
What you want to memorize with the basic "string manipulation" grammar of python
Organize Python tools to speed up the initial movement of data analysis competitions
I made a tool to automatically back up the metadata of the Salesforce organization
[Python] I tried to get the type name as a string from the type function
Python program is slow! I want to speed up! In such a case ...
I made a program to look up words on the window (previous development)