Extract lines matching given conditions from a text file with Python

Overview

I wrote a Python tool that extracts lines from text files using multiple search words, matched by prefix, suffix, partial, or exact match. I originally wrote a tool that both extracts and removes lines containing specific wording from a text, but I thought extraction on its own would be useful, so I rebuilt it with that part changed.

Requirements

Python 3.7.2
pandas
numpy

An exe is also provided this time, so you don't need Python if you just want to run it.

Where it is published

Published on GitHub.

What it does

Based on the wording defined in resources/Search data.xlsx, lines are extracted by prefix, suffix, partial, or exact match, according to the settings in resources/appConfig.ini.
Every CSV file stored in the `data` directory is processed (the code globs `data/*.csv`).
The results are written to the directory set as `outputPath` in the config (the `output` directory).
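As a rough illustration of the settings file, here is a minimal sketch that reads an ini with the standard `configparser` module. The section name `info` and the keys `sortType`, `regType`, `encoding`, and `outputPath` are the ones the source code reads; the values shown and the helper name `load_settings` are assumptions made for this example, not the project's actual defaults.

```python
import configparser

# Hypothetical contents of resources/appConfig.ini.
# Section and key names come from the source code; the values are assumed.
SAMPLE_INI = """
[info]
sortType = 1
regType = 2
encoding = utf-8
outputPath = output/
"""

def load_settings(text):
    """Parse ini text and return the settings the tool reads from [info]."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    # parser.get() looks option names up case-insensitively by default.
    return {
        "sortType": parser.get("info", "sortType"),
        "regType": parser.get("info", "regType"),
        "encoding": parser.get("info", "encoding"),
        "outputPath": parser.get("info", "outputPath"),
    }

settings = load_settings(SAMPLE_INI)
print(settings["encoding"], settings["outputPath"])
```

Every value comes back as a string, which is why the source code passes the codes straight into its enum classes.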

Source walkthrough

The following method builds the search pattern.

Get the search words from resources/Search data.xlsx.
Sort them by string length according to the sort setting.
Add `.*` according to the match type (prefix match, suffix match, and so on), then join the conditions with `|`.

    def createReg(self):
        searchItems=pd.read_excel('resources/Search data.xlsx')
        sortTypeCode=iniFile.get('info','sortType')

        searchItemArray=np.asarray(searchItems['Search word'])
        sortType=SORT_ENUM(sortTypeCode)
        if sortType==SORT_ENUM.SORT_LENGTH_ASC or sortType==SORT_ENUM.SORT_LENGTH_DESC:
            searchItemIndex=[]
            for item in searchItemArray:
                searchItemIndex.append(len(item))
            searchSeries=pd.Series(searchItemIndex)
            serchItemDataFrame=pd.concat([searchItems['Search word'],searchSeries],axis=1)
            if sortType==SORT_ENUM.SORT_LENGTH_ASC:
                sortItems=serchItemDataFrame.sort_values(0,ascending=True)
            else:
                sortItems=serchItemDataFrame.sort_values(0,ascending=False)
            searchItemArray=np.asarray(sortItems['Search word'])
        regTypeCode=iniFile.get('info','regType')
        regType=REG_ENUM(regTypeCode)
        regStr=''
        for item in searchItemArray:
            if regStr!='':
                regStr=regStr+'|'
            sItem=item
            if REG_ENUM.REG_TYPE_CONTAIN==regType:
                sItem='.*'+item+'.*'
            elif REG_ENUM.REG_TYPE_FRONT==regType:
                sItem=item+'.*'
            elif REG_ENUM.REG_TYPE_BACKWARD==regType:
                sItem='.*'+item  # '.*' must come before the word; '*.' is not a valid pattern
            elif REG_ENUM.REG_TYPE_EXACT_MATCH==regType:
                sItem=item
            regStr=regStr+sItem
        return re.compile(regStr)
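The method above joins the raw search words, so a word containing `.` or `(` is read as regex syntax, and because `search` scans the whole line, a leading or trailing `.*` alone does not pin the word to the start or end. One possible variant, shown here as a sketch rather than the project's implementation, escapes each word and uses `^` and `$` anchors. The function name `build_pattern` and the mode strings are hypothetical.

```python
import re

def build_pattern(words, mode, longest_first=True):
    """Combine search words into one compiled regex.

    mode is one of "prefix", "suffix", "partial", "exact"
    (names assumed for this sketch).
    """
    # Sort by length so longer words are tried before their own substrings.
    ordered = sorted(words, key=len, reverse=longest_first)
    # re.escape keeps characters such as "." or "(" literal.
    body = "|".join(re.escape(word) for word in ordered)
    templates = {
        "prefix": r"^(?:{})",    # word at the start of the line
        "suffix": r"(?:{})$",    # word at the end ($ also matches before a final newline)
        "partial": r"(?:{})",    # word anywhere in the line
        "exact": r"^(?:{})$",    # the line is exactly the word
    }
    return re.compile(templates[mode].format(body))

prefix = build_pattern(["ab", "abc"], "prefix")
print(bool(prefix.search("abcdef\n")), bool(prefix.search("xabc\n")))
```

The non-capturing group `(?:...)` matters here: without it, an anchor would bind only to the first or last alternative of the `|` list.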

The following method extracts lines using the pattern built above.

Open each file using `with open` and check it line by line for a match.
Store the matching lines in a list.
Finally, write them out as a text file.

    def extract(self):
        reg=self.createReg()
        paths=glob.glob('data/*.csv')
        
        fileDict={}

        for pathName in paths:
            extractList=[]
            with open(pathName,encoding=iniFile.get('info','encoding')) as f:
                # targetStrs=f.read()
                for targetStr in f:
                    extractStr=reg.search(targetStr)
                    if extractStr:
                        extractList.append(targetStr)
            fileDict[os.path.basename(pathName)]=extractList
        outputPath=iniFile.get('info','outputPath')
        for key,data in fileDict.items():
            outputFile=outputPath+'extract_'+key+'.txt'
            with open(outputFile,encoding='utf-8',mode='w') as f:
                for d in data:
                    f.write(d)
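The core of the loop above, checking each line against the compiled pattern, can be isolated into a small function that is easy to try out without any files. This is a minimal sketch; the function name `extract_matching_lines` and the sample data are made up for illustration.

```python
import io
import re

def extract_matching_lines(lines, pattern):
    """Return the lines (newlines preserved) that the pattern matches."""
    return [line for line in lines if pattern.search(line)]

# io.StringIO stands in for an opened file here;
# a real text file object is iterated line by line in the same way.
sample = io.StringIO("apple pie\nbanana\npineapple\n")
matched = extract_matching_lines(sample, re.compile(r"^apple"))
print(matched)
```

Because each line keeps its trailing newline, the result can be written back with `f.write` or `f.writelines` as is, which is what the output loop above relies on.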

How to use

See the readme on GitHub.
If you just want to run it:
Store the files you want to process in data.
Set the wording you want to extract in resources/Search data.xlsx.
Run regExtract.exe.

Use cases

When there are multiple words whose presence you want to check in files such as linked files.
As a base to modify, for example so that specific wording is converted instead of extracted.