Extract strings from files in Python

Introduction

This article provides sample code for the following tasks:

- Create a list of files under a specified directory
- Check whether the text in a file contains a particular string
- Extract the text enclosed between two specific strings in a file

Development environment

- Python 2.7 or later

Create a list of files under the specified directory

Code

import os

def generate_file_list(dirpath_to_search):
    # Walk the tree recursively and collect the path of every file found.
    file_list = []
    for dirpath, dirnames, filenames in os.walk(dirpath_to_search):
        for filename in filenames:
            file_list.append(os.path.join(dirpath, filename))

    return file_list

How to use

The following example recursively lists the files under sample1, given this directory structure.

`Sample directory structure`


sample1/
├── dir01
│   ├── dir11
│   │   └── file21.txt
│   └── file11.txt
├── file01.txt
└── file02.txt

`how to use`


file_list = generate_file_list('sample1')
for file in file_list:
    print(file)

#output
# sample1/file01.txt
# sample1/file02.txt
# sample1/dir01/file11.txt
# sample1/dir01/dir11/file21.txt

API used

os.walk(top, topdown=True, onerror=None, followlinks=False)

Generates the file names in a directory tree by walking the tree either top-down or bottom-up. For each directory in the tree rooted at directory top (including top itself), it yields a 3-tuple (dirpath, dirnames, filenames).
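
As a minimal sketch of the difference between the two traversal orders, the snippet below builds a small temporary tree and records the directories that os.walk visits with topdown=True and topdown=False. The helper walk_order and the directory names are made up for illustration; they are not part of the article's code.

```python
import os
import tempfile

def walk_order(top, topdown):
    """Return the directories visited by os.walk, relative to top."""
    visited = []
    for dirpath, dirnames, filenames in os.walk(top, topdown=topdown):
        visited.append(os.path.relpath(dirpath, top))
    return visited

# Build a throwaway tree: <tmp>/dir01/dir11 (names are illustrative only).
root = tempfile.mkdtemp()
os.makedirs(os.path.join(root, 'dir01', 'dir11'))

top_down = walk_order(root, topdown=True)
bottom_up = walk_order(root, topdown=False)

print(top_down)   # the root ('.') comes first
print(bottom_up)  # the root ('.') comes last
```

Top-down order is the default and is what generate_file_list relies on. Bottom-up order is mainly useful when a directory has to be processed after its contents, for example when deleting a tree.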

Find out if the text in the file contains a particular string

Code

def contain_text_in_file(filepath, text):
    with open(filepath) as f:
        return any(text in line for line in f)

How to use

The following example assumes two files, contain.txt and not_contain.txt, with the contents shown below, and checks which of them contains "2020/02/02".

`contain.txt`


Update date: 2020/02/02
This article is about python file operations.

`not_contain.txt`


Update date: 2019/10/15
This article is about python file operations.

`how to use`


filepath1 = './contain.txt'
text = '2020/02/02'
result1 = contain_text_in_file(filepath1, text)
print(result1) # True

filepath2 = './not_contain.txt'
text = '2020/02/02'
result2 = contain_text_in_file(filepath2, text)
print(result2) # False
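
The directory walk from the first section and the containment check can be combined to find every file under a directory that contains a given string. The following is only a sketch: the function name find_files_containing is hypothetical, and it assumes the files are UTF-8 text (bytes that cannot be decoded are ignored).

```python
import io
import os
import tempfile

def find_files_containing(dirpath_to_search, text):
    """Return sorted paths of files under dirpath_to_search that contain text."""
    matched = []
    for dirpath, dirnames, filenames in os.walk(dirpath_to_search):
        for filename in filenames:
            filepath = os.path.join(dirpath, filename)
            # Assumption: files are UTF-8 text; undecodable bytes are ignored.
            with io.open(filepath, encoding='utf-8', errors='ignore') as f:
                if any(text in line for line in f):
                    matched.append(filepath)
    return sorted(matched)

# Demo with the two sample files, written to a temporary directory.
demo_dir = tempfile.mkdtemp()
samples = {
    'contain.txt': u'Update date: 2020/02/02\n',
    'not_contain.txt': u'Update date: 2019/10/15\n',
}
for name, body in samples.items():
    with io.open(os.path.join(demo_dir, name), 'w', encoding='utf-8') as f:
        f.write(body)

hits = find_files_containing(demo_dir, u'2020/02/02')
print([os.path.basename(p) for p in hits])  # ['contain.txt']
```

Because the check is done line by line, a search string that spans a line break will not be found.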

API used

open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)

Opens file and returns the corresponding file object.

any(iterable)

Returns True if any element of iterable is true. Returns False if iterable is empty.
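
For reference, the behavior of any() can be written in pure Python as follows. This mirrors the equivalent code given in the official documentation; the name any_equivalent is used here only to avoid shadowing the built-in.

```python
def any_equivalent(iterable):
    """Behave like the built-in any(): stop at the first truthy element."""
    for element in iterable:
        if element:
            return True
    return False

print(any_equivalent([0, '', 'x']))  # True
print(any_equivalent([]))            # False
```

Because the loop returns at the first truthy element, contain_text_in_file stops reading the file as soon as one line contains the search string.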

Extract the text in the range enclosed by a specific string from the text in the file

Code

import re

def extract_text_in_file(filepath, pattern_prev, pattern_next):
    extracted_text_array = []
    # pattern_prev and pattern_next are used as regular expressions.
    # (.*?) is non-greedy, so a match ends at the first pattern_next.
    pattern = pattern_prev + '(.*?)' + pattern_next
    with open(filepath) as f:
        # Iterate over the file line by line instead of loading it all.
        for line in f:
            extracted_text_array.extend(re.findall(pattern, line))

    return extracted_text_array

How to use

The following example assumes a file named file.txt with the contents shown below and extracts the date between "Update date:" and " by".

`file.txt`


Update date:2020/02/01 by taro
This article is about python file operations.

Update date:2020/02/02 by jiro
This article is about python file operations.

`how to use`


filepath = './file.txt'
pattern_prev = 'Update date:'
pattern_next = ' by'
extracted_text_array = extract_text_in_file(filepath, pattern_prev, pattern_next)

for extracted_text in extracted_text_array:
    print(extracted_text)

#output
# 2020/02/01
# 2020/02/02
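
Note that extract_text_in_file passes pattern_prev and pattern_next to the regular-expression engine as they are, so delimiters such as '[' or '(' would be interpreted as regular-expression syntax rather than literal text. If the delimiters are meant to be plain strings, one option is to escape them with re.escape. The helper below, extract_between, is a hypothetical sketch of that idea and works on a single line of text.

```python
import re

def extract_between(line, prev_text, next_text):
    """Return every substring of line found between prev_text and next_text."""
    # re.escape makes the delimiters match literally; (.*?) is non-greedy,
    # so each match stops at the nearest following next_text.
    pattern = re.escape(prev_text) + '(.*?)' + re.escape(next_text)
    return re.findall(pattern, line)

print(extract_between('a[1] b[2]', '[', ']'))  # ['1', '2']
print(extract_between('Update date:2020/02/01 by taro', 'Update date:', ' by'))
# ['2020/02/01']
```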

API used

re.findall(pattern, string, flags=0)

Returns all non-overlapping matches of pattern in string as a list of strings. The string is scanned from left to right, and matches are returned in the order they are found. If the pattern contains one or more groups, a list of groups is returned; if it contains more than one group, this is a list of tuples. Empty matches are included in the result.
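
The shape of the return value depends on how many groups the pattern contains. The short example below illustrates the three cases on one line of the sample file; the patterns are chosen for illustration only.

```python
import re

line = 'Update date:2020/02/01 by taro'

# No group: the whole match is returned.
no_group = re.findall(r'\d{4}/\d{2}/\d{2}', line)
print(no_group)    # ['2020/02/01']

# One group: only the text captured by the group is returned.
one_group = re.findall(r'date:(\S+) by', line)
print(one_group)   # ['2020/02/01']

# Two groups: each match becomes a tuple of the captured groups.
two_groups = re.findall(r'date:(\S+) by (\w+)', line)
print(two_groups)  # [('2020/02/01', 'taro')]
```

extract_text_in_file uses the one-group form, which is why it returns only the text between the two delimiters and not the delimiters themselves.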

open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)

Opens file and returns the corresponding file object.