I wrote a Python tool that extracts text using prefix, suffix, partial, or exact matching against multiple search words. I had originally written a Python tool that extracts and removes anything containing specific wording from a text, but I thought extraction on its own would also be useful, so I reworked it by changing just that part.
This time there is also an exe, so if you just want to run it, you don't need Python.
Using the search words in resources/Search data.xlsx, extraction is performed with prefix, suffix, partial, or exact matching, according to the settings in resources/appConfig.ini. The files to search are the ones in the data directory.
The following process builds the search condition. .* is added to each search word according to the match type, such as prefix or suffix match, and the resulting conditions are joined with |.
def createReg(self):
    # Needs re, numpy (np) and pandas (pd); iniFile, SORT_ENUM and REG_ENUM are defined elsewhere.
    searchItems=pd.read_excel('resources/Search data.xlsx')
    sortTypeCode=iniFile.get('info','sortType')
    searchItemArray=np.asarray(searchItems['Search word'])
    sortType=SORT_ENUM(sortTypeCode)
    # Optionally sort the search words by length before joining them.
    if sortType==SORT_ENUM.SORT_LENGTH_ASC or sortType==SORT_ENUM.SORT_LENGTH_DESC:
        searchItemIndex=[]
        for item in searchItemArray:
            searchItemIndex.append(len(item))
        searchSeries=pd.Series(searchItemIndex)
        # Column 0 of the combined frame holds the length of each word.
        serchItemDataFrame=pd.concat([searchItems['Search word'],searchSeries],axis=1)
        if sortType==SORT_ENUM.SORT_LENGTH_ASC:
            sortItems=serchItemDataFrame.sort_values(0,ascending=True)
        else:
            sortItems=serchItemDataFrame.sort_values(0,ascending=False)
        searchItemArray=np.asarray(sortItems['Search word'])
    regTypeCode=iniFile.get('info','regType')
    regType=REG_ENUM(regTypeCode)
    # Build one alternation: word1|word2|...
    regStr=''
    for item in searchItemArray:
        if regStr!='':
            regStr=regStr+'|'
        sItem=item
        if REG_ENUM.REG_TYPE_CONTAIN==regType:
            sItem='.*'+item+'.*'
        elif REG_ENUM.REG_TYPE_FRONT==regType:
            sItem=item+'.*'
        elif REG_ENUM.REG_TYPE_BACKWARD==regType:
            # '.*' must come before the word; '*.' is not a valid pattern.
            sItem='.*'+item
        elif REG_ENUM.REG_TYPE_EXACT_MATCH==regType:
            sItem=item
        regStr=regStr+sItem
    return re.compile(regStr)
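One thing to keep in mind with the pattern above: when it is used with `search`, a pattern such as `word.*` can still match in the middle of a line, and search words containing characters like `.` or `+` are interpreted as regex syntax. The following is a minimal sketch of one way to handle both, not the code used in this tool. The function name `build_pattern` and the string codes `"partial"`, `"prefix"`, `"suffix"`, and `"exact"` are hypothetical stand-ins for the REG_ENUM values.

```python
import re


def build_pattern(words, match_type):
    """Join search words into one alternation (match_type is a hypothetical string code)."""
    # Longest words first, so a longer word is tried before a shorter one it starts with.
    ordered = sorted(words, key=len, reverse=True)
    # re.escape keeps characters such as '.' or '+' in a search word literal.
    body = "|".join(re.escape(word) for word in ordered)
    # Anchors (^ and $) express prefix/suffix/exact matching instead of '.*'.
    templates = {
        "partial": "(?:{})",
        "prefix": "^(?:{})",
        "suffix": "(?:{})$",
        "exact": "^(?:{})$",
    }
    return re.compile(templates[match_type].format(body))


if __name__ == "__main__":
    pattern = build_pattern(["ab", "abc"], "prefix")
    print(bool(pattern.search("abcdef")))   # line starts with a search word
    print(bool(pattern.search("xxabc")))    # search word is not at the start
```

Because `$` also matches just before a trailing newline, the suffix and exact patterns still work on lines read directly from a file.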
The following process extracts lines based on the condition created in the process above. It reads each file using with open and checks line by line whether the line matches.
def extract(self):
    # Needs glob and os; iniFile is defined elsewhere.
    reg=self.createReg()
    paths=glob.glob('data/*.csv')
    fileDict={}
    # Collect the matching lines of each CSV file, keyed by file name.
    for pathName in paths:
        extractList=[]
        with open(pathName,encoding=iniFile.get('info','encoding')) as f:
            for targetStr in f:
                extractStr=reg.search(targetStr)
                if extractStr:
                    extractList.append(targetStr)
        fileDict[os.path.basename(pathName)]=extractList
    # outputPath is used as a plain prefix, so it has to end with a path separator.
    outputPath=iniFile.get('info','outputPath')
    for key,data in fileDict.items():
        outputFile=outputPath+'extract_'+key+'.txt'
        with open(outputFile,encoding='utf-8',mode='w') as f:
            for d in data:
                f.write(d)
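The line-by-line extraction above can also be sketched as a standalone function that does not depend on the ini file. This is only an illustration under assumptions: the function name `extract_matching_lines` and its parameters are hypothetical, and the pattern is passed in instead of being built by createReg.

```python
import re
from pathlib import Path


def extract_matching_lines(pattern, data_dir, output_dir, encoding="utf-8"):
    """Write the lines of each CSV in data_dir that match pattern into output_dir."""
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = {}
    for path in sorted(Path(data_dir).glob("*.csv")):
        # Keep only the lines where the pattern is found somewhere in the line.
        with open(path, encoding=encoding) as src:
            hits = [line for line in src if pattern.search(line)]
        # Same naming scheme as above: extract_<file name>.txt
        target = out / f"extract_{path.name}.txt"
        with open(target, "w", encoding="utf-8") as dst:
            dst.writelines(hits)
        written[path.name] = len(hits)
    return written


if __name__ == "__main__":
    counts = extract_matching_lines(re.compile("apple"), "data", "output")
    print(counts)  # number of extracted lines per input file
```

Returning the number of extracted lines per file makes it easy to check at a glance whether the search words matched anything.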
To use it, put the files to be searched in data, enter the search words in resources/Search data.xlsx, and run regExtract.exe.