I tried this because an algorithm problem on speeding up string search came up at a company development camp. For the record, I am a so-called copy-and-paste programmer whose strength is building similar things by referring to existing code and refactoring it.
We were given a sample implementation with some of the following requirements left out, along with test data and expected-result files.
■ Program contents
・Search multiple text files for multiple search strings and output, for each string:
  - the number of occurrences (totalled across all the texts)
  - the number of text files that contain the string
・Run at a command prompt or in Windows PowerShell.
・Pass the following arguments to the program: `program folder input_file output_file`
  - folder: stores the files to be searched (plural)
  - input file: lists the search strings
  - output file: receives the search results
・Searching distinguishes uppercase from lowercase and full-width from half-width characters.
・Both the texts to be searched and the search strings are UTF-8.
・Search strings:
  - 150 per file, one per line; count the occurrences in the files for all of them
  - any length of one character or more
  - contain no spaces, tabs, line breaks, or special symbols
  - arbitrary strings, possibly fragments of a word
・Search results:
  - output the occurrence counts, tab-separated, in the order the search strings appear
  - each count is the sum of the results over all files
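To make the spec concrete, here is a made-up illustration of the two file formats (placeholder strings, not the actual assignment data). The strings file lists one search string per line; the result file repeats them in the same order with tab-separated totals:

```
strings_file:        result_file (tab-separated):
spam                 spam    12
eggs                 eggs    3
```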
In the provided sample, the part for

- the number of text files containing each search string

was not implemented.
```python
"""
Count the occurrences of the specified strings in all files in the
specified folder and output the result as TSV.
python sample1.py text_folder strings_file result_file
"""
import sys
from pathlib import Path


class TextManager:
    """
    A class that manages text file information
    """
    def __init__(self, foldername):
        '''
        Read and store the files in the folder

        Parameters
        ----------
        foldername : str
            Folder name
        '''
        self.__alllines = []
        p = Path(foldername)
        for filename in p.iterdir():
            with open(filename, 'r', encoding='utf-8') as f:
                lines = [s.strip() for s in f.readlines() if s.strip() != '']
            self.__alllines.extend(lines)

    def search(self, string_list):
        '''
        Search the registered text for each string in the search list

        Parameters
        ----------
        string_list : List[str]
            List of search strings

        Returns
        -------
        Dict[str, int]
            Search results
        '''
        result = {}
        for s in string_list:
            count = self.__search_one(s)
            result[s] = count
        return result

    def __search_one(self, string):
        '''
        Search for one string

        Parameters
        ----------
        string : str
            Search string

        Returns
        -------
        int
            Number of occurrences found
        '''
        l = len(string)
        count = 0
        for line in self.__alllines:
            for i in range(0, len(line) - l + 1):
                if string == line[i:i+l]:
                    count += 1
        return count


if __name__ == "__main__":
    args = len(sys.argv)
    if args != 4:
        print("USAGE: python sample1.py text_folder strings_file result_file")
        sys.exit()
    text_folder = sys.argv[1]
    strings_file = sys.argv[2]
    result_file = sys.argv[3]
    # Read the target file group (files in the folder)
    text_manager = TextManager(text_folder)
    # Read the search strings
    search_strings = []
    with open(strings_file, 'r', encoding='utf-8') as fi:
        search_strings = [s.strip() for s in fi.readlines() if s.strip() != '']
    # Run the search: List[str] -> Dict[str, int]
    result = text_manager.search(search_strings)
    with open(result_file, 'w', encoding='utf-8') as fo:
        # Output in the order listed in the search string file
        for s in search_strings:
            c = result.get(s, 0)
            fo.write("{}\t{}\n".format(s, c))
```
Reviewing the whole logic would take real effort, so, true to copy-and-paste form, I keep the current logic as it is. As a first step, the part that read every file into one combined list is changed to store each file's lines separately. A few speed tips (caching bound methods in local variables) are thrown in as well.
```python
class TextManager:
    """
    A class that manages text file information
    """
    def __init__(self, foldername):
        '''
        Read and store the files in the folder

        Parameters
        ----------
        foldername : str
            Folder name
        '''
        self.__alllines = []
        __alllines_extend = self.__alllines.extend
        p = Path(foldername)
        for filename in p.iterdir():
            __filelines = []
            __filelines_extends = __filelines.extend
            with open(filename, 'r', encoding='utf-8') as f:
                f_readlines = f.readlines
                lines = [s.strip() for s in f_readlines() if s.strip() != '']
                __filelines_extends(lines)
            self.__alllines.append(__filelines)
```
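The "speed tips" above are mainly caching bound methods in local variables to skip one attribute lookup per call. A minimal sketch of the idea (my own illustration, not part of the assignment code):

```python
import timeit

# Caching the bound method in a local variable avoids one attribute
# lookup per call inside the loop
setup = "data = []"
plain = "for i in range(1000): data.append(i)"
cached = "ap = data.append\nfor i in range(1000): ap(i)"

print(timeit.timeit(plain, setup=setup, number=2000))
print(timeit.timeit(cached, setup=setup, number=2000))
```

The cached version is usually slightly faster, though the gain is tiny compared with the algorithmic changes that come later.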
A per-file loop is wrapped around the for loop that scans the whole text for one search string; if the string is found anywhere in a file, that file's counter goes up by one. The method now returns both the number of occurrences and the number of files in which the search string occurred.
The commented-out print() calls are left over from checking that the modification worked properly.
```python
def __search_one(self, string):
    '''
    Search for one string

    Parameters
    ----------
    string : str
        Search string

    Returns
    -------
    int
        Number of occurrences found
    int
        Number of files in which the search string occurred
    '''
    l = len(string)
    count = 0
    fcount = 0
    # print(string, count, fcount)
    for filelines in self.__alllines:
        # print(filelines)
        find = False
        for line in filelines:
            # print(line)
            for i in range(0, len(line) - l + 1):
                if string == line[i:i+l]:
                    count += 1
                    find = True
        if find:
            fcount += 1
            # print(string, count, fcount)
    # print(string, count, fcount)
    return count, fcount
```
The search wrapper is updated to receive the two return values from the individual search.
```python
def search(self, string_list):
    '''
    Search the registered text for each string in the search list

    Parameters
    ----------
    string_list : List[str]
        List of search strings

    Returns
    -------
    Dict[str, int]
        Search results (occurrence counts)
    Dict[str, int]
        Search results (file counts)
    '''
    result = {}
    fresult = {}
    for s in string_list:
        rtn = self.__search_one(s)
        result[s] = rtn[0]
        fresult[s] = rtn[1]
    return result, fresult
```
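As a quick sanity check, the two-value logic can be exercised on in-memory text. This is a simplified stand-in for the class above (function and variable names are my own):

```python
def search_one(files, string):
    """Count total occurrences and the number of 'files' containing string."""
    count = fcount = 0
    for text in files:
        # Same sliding-window comparison as the original code
        c = sum(1 for i in range(len(text) - len(string) + 1)
                if text[i:i+len(string)] == string)
        if c > 0:
            count += c
            fcount += 1
    return count, fcount

files = ["spam and eggs", "spam spam", "eggs"]
print(search_one(files, "spam"))  # → (3, 2)
```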
The output section gets the same treatment, so it is omitted. At this point the program produces the correct answer, which is the minimum requirement of the task.
Currently __search_one() slides a window one character at a time from the start to the end of each line, comparing at each position.
The question is how to speed this up. If we carve the search target into blocks at characters that can never appear in a search string (punctuation), each block is shorter and matching should become more efficient.
That is why the reading part now splits each line; this is the .split('[、。「」《》|[#]]') part.
Other brackets ought to be covered too, but since the assignment data came from Aozora Bunko, this is enough for the time being.
```python
def __init__(self, foldername):
    '''
    Read and store the files in the folder

    Parameters
    ----------
    foldername : str
        Folder name
    '''
    self.__alllines = []
    __alllines_append = self.__alllines.append
    p = Path(foldername)
    for filename in p.iterdir():
        __filelines = []
        __filelines_extends = __filelines.extend
        with open(filename, 'r', encoding='utf-8') as f:
            f_readlines = f.readlines
            lines = [s.strip().split('[、。「」《》|[#]]')
                     for s in f_readlines() if s.strip() != '']
            __filelines_extends(lines)
        __alllines_append(__filelines)
```
__search_one() correspondingly gains one more loop for the carved-up segments.
There is probably a way to write it that avoids the extra loop, but as a brute-force fix, a loop was the quickest way for now.
print() is handy for checking what is going on around here.
Actually, the test data distributed with the sample was in English, so I first coded this with split() with no arguments, but that does not work for Japanese. The results looked strange, and when I printed the data, the Japanese text had been chopped up character by character. So I removed the split once, got the file counting working, and then put the split back in.
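One caveat I should note: str.split() treats its argument as a literal separator, not a character class, so the bracketed string above never actually splits anything (each line just becomes a one-element list). re.split() is what a character class needs. A quick illustration:

```python
import re

s = "春は、あけぼの。やうやう白く"
# str.split looks for the literal separator string, which never occurs,
# so nothing is split
print(s.split('[、。]'))      # → ['春は、あけぼの。やうやう白く']
# re.split treats the brackets as a character class, as intended
print(re.split('[、。]', s))  # → ['春は', 'あけぼの', 'やうやう白く']
```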
```python
def __search_one(self, string):
    '''
    Search for one string

    Parameters
    ----------
    string : str
        Search string

    Returns
    -------
    int
        Number of occurrences found
    int
        Number of files in which the search string occurred
    '''
    l = len(string)
    count = 0
    fcount = 0
    # print(string, count, fcount)
    for filelines in self.__alllines:
        # print(filelines)
        find = False
        for line in filelines:
            # print(line)
            for sentence in line:
                for i in range(0, len(sentence) - l + 1):
                    if string == sentence[i:i+l]:
                        count += 1
                        find = True
                        # print(sentence, string, sep='\t')
        if find:
            fcount += 1
            # print(string, count, fcount)
    # print(string, count, fcount)
    return count, fcount
```
Comparing strings by sliding the window one character at a time is very inefficient.
No matter how cleverly you write the comparison yourself in a scripting language, it will not get fast.
If a standard-library function can do the job, that is invariably faster.
I knew about the `string in string` membership test, but rejected it because it cannot tell how many times the string appears in one target.
I am not confident in my memory, so off to search.
[Search Google for "python string match"](https://www.google.com/search?q=python+%E6%96%87%E5%AD%97%E5%88%97+%E3%83%9E%E3%83%83%E3%83%81&oq=python+%E3%83%9E%E3%83%83%E3%83%81&aqs=chrome.2.69i57j0i7i30l2j0j0i7i30j0j0i7i30l2.11559j0j7&sourceid=chrome&ie
Oh, there was a function I wanted.
Search string with Python (determine whether it contains ~, get position, count)
So, rewrite the middle part of __search_one.
```python
for line in filelines:
    # print(line)
    for sentence in line:
        # for i in range(0, len(sentence)-l+1):
        #     if string == sentence[i:i+l]:
        c = sentence.count(string)
        if c > 0:
            count += c
            find = True
```
This makes the processing time an order of magnitude faster.
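One behavioural difference worth noting (my own observation, not something the assignment tested): str.count() counts non-overlapping occurrences, while the original sliding-window loop counts overlapping ones.

```python
def naive_count(text, s):
    # The original sliding-window comparison: counts overlapping matches
    return sum(1 for i in range(len(text) - len(s) + 1)
               if text[i:i+len(s)] == s)

print(naive_count("aaa", "aa"))  # → 2 (overlapping)
print("aaa".count("aa"))         # → 1 (non-overlapping)
```

For typical search strings this rarely matters, but it is the kind of silent change worth checking against the provided expected results.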
# 6. Review of processing

Now that the number of occurrences can be found in one shot, the `split` added in step 4 only multiplies loops and has become a cause of slowdown instead.
The carving is no longer needed, so it goes back out.
Here only the inside of `__search_one` is shown.
```python
for line in filelines:
    # print(line)
    c = line.count(string)
    if c > 0:
        count += c
        find = True
```

Since `count` can find the number of occurrences in one shot, there is no longer any point in reading the files line by line.
The assignment put no limit on memory usage; only the installed memory of the verification environment was stated.
If the whole file ends up being read anyway, it makes no difference whether it is read line by line or in one go.
So I changed to reading each file in a single call.
I should have changed the variable names to something more fitting, but did not give it much thought.
```python
def __init__(self, foldername):
    '''
    Read and store the files in the folder

    Parameters
    ----------
    foldername : str
        Folder name
    '''
    self.__alllines = []
    __alllines_extend = self.__alllines.extend
    p = Path(foldername)
    for filename in p.iterdir():
        with open(filename, 'r', encoding='utf-8') as f:
            lines = f.read()
        self.__alllines.append(lines)
```
The actual search part is simple
```python
def __search_one(self, string):
    '''
    Search for one string

    Parameters
    ----------
    string : str
        Search string

    Returns
    -------
    int
        Number of occurrences found
    int
        Number of files in which the search string occurred
    '''
    count = 0
    fcount = 0
    # print(string, count, fcount)
    for lines in self.__alllines:
        c = lines.count(string)
        if c > 0:
            count += c
            fcount += 1
            # print(string, count, fcount)
    # print(string, count, fcount)
    return count, fcount
```
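Putting the final pieces together, here is a self-contained sketch of the whole approach (the temporary files and names are my own illustration, not the assignment data):

```python
import tempfile
from pathlib import Path

def load_texts(foldername):
    # Read each file in one shot, as in the final version
    return [f.read_text(encoding='utf-8') for f in Path(foldername).iterdir()]

def search_one(texts, string):
    # count = total occurrences, fcount = number of files containing string
    count = fcount = 0
    for text in texts:
        c = text.count(string)
        if c > 0:
            count += c
            fcount += 1
    return count, fcount

with tempfile.TemporaryDirectory() as d:
    (Path(d) / "a.txt").write_text("spam and eggs\nspam", encoding='utf-8')
    (Path(d) / "b.txt").write_text("eggs only", encoding='utf-8')
    texts = load_texts(d)
    print(search_one(texts, "spam"))  # → (2, 1)
    print(search_one(texts, "eggs"))  # → (2, 2)
```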
This version is 116 times faster than my first correct answer.
Since it was posed as an algorithm problem, we were presumably expected to devise a matching algorithm. In fact, some participants wrote code that ran in half the time of my first version without using `count`.
However, no matter how much you refine an algorithm in a scripting language, you cannot beat library functions whose implementation is written in C.
If you know whether the right library function exists, you reach the correct answer quickly.
As it happened, someone who joined midway posted an order-of-magnitude better score in no time; starting from there I thought things over, searched, and arrived at the final version.
I would like to know what algorithm processes the task in half the time without a library function and without sliding character by character.
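I can only guess, but a classical candidate for a fast hand-rolled matcher is Knuth-Morris-Pratt, which never re-examines text characters. A sketch of the idea (my own code, not the actual competitor's):

```python
def kmp_count(text, pat):
    # Build the failure table: fail[i] = length of the longest proper
    # prefix of pat[:i+1] that is also a suffix of it
    fail = [0] * len(pat)
    k = 0
    for i in range(1, len(pat)):
        while k and pat[i] != pat[k]:
            k = fail[k - 1]
        if pat[i] == pat[k]:
            k += 1
        fail[i] = k
    # Scan the text once, never backing up over matched characters
    count = k = 0
    for ch in text:
        while k and ch != pat[k]:
            k = fail[k - 1]
        if ch == pat[k]:
            k += 1
        if k == len(pat):
            count += 1
            k = fail[k - 1]  # continue, allowing overlapping matches
    return count

print(kmp_count("spam spam spam", "spam"))  # → 3
```

For many patterns at once, Aho-Corasick (the same failure-link idea generalized to a trie) would be the textbook answer, though in CPython both still tend to lose to str.count for a single pattern.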