Use decorators to prevent re-execution of data processing

Overview

A common workflow is to process data once, save the result to disk, and reuse it afterwards (skipping the processing from the second run on). Once you take into account how the result depends on parameters at reuse time, though, this tends to get unexpectedly complicated. This article considers an implementation that avoids repeating the same processing by letting a Python decorator decide whether to skip it.

A library that packages this approach is available at github.com/sotetsuk/memozo.

Motivation

For example, suppose you have a huge amount of sentence data (one sentence per line):

1. I have a pen.
2. I have an apple.
3. ah! Apple pen!

...

9999...

# PPAP (copyright belongs to Pikotaro)

Now suppose you want to extract only the sentences that contain a specific keyword from this data (for example, the sentences that contain the keyword `pen`).

One naive implementation of the filter is a generator that yields a sentence each time it finds one that meets the criterion:

import codecs


def filter_data(keyword):
    path_to_raw_data = './data/sentences.txt'
    with codecs.open(path_to_raw_data, 'r', 'utf-8') as f:
        for line in f:
            if keyword in line:
                yield line

gen = filter_data('pen')
for line in gen:
    print(line, end='')

If you want to reuse this filtered data many times, scanning all the raw data each time is not always a good idea. You may want to cache the filtered data on disk once and read from the cache afterwards. This processing also depends on a parameter (`keyword`), so when it is run with a different `keyword`, you want it to scan all the data again and cache that result on disk as well. Finally, you want to get all of this simply by wrapping the function with a decorator.

In summary, the goal is a decorator, awesome_decorator, that caches the output of the generator and, when the function is called again with the same parameters, returns the output from the cache:

@awesome_decorator
def filter_data(keyword):
    path_to_raw_data = './data/sentences.txt'
    with codecs.open(path_to_raw_data, 'r', 'utf-8') as f:
        for line in f:
            if keyword in line:
                yield line


# The first call scans all the data and returns the result.
# At this point, the filtered sentences are cached in './data/pen.txt'.
gen_pen_sentences1 = filter_data('pen')
for line in gen_pen_sentences1:
    print(line, end='')

# Called with the same parameter, so the data is returned from the cache './data/pen.txt'.
gen_pen_sentences2 = filter_data('pen')
for line in gen_pen_sentences2:
    print(line, end='')

# A new parameter, so the raw data is filtered again.
gen_apple_sentences = filter_data('apple')
for line in gen_apple_sentences:
    print(line, end='')

This example is a function that returns a generator, but there are other situations where you would want to cache to disk the result of a function that returns an object serializable with pickle (for example, a preprocessed `ndarray` or a trained machine learning model that depends on parameters).
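
As a rough illustration of the pickle case, here is a minimal sketch. The name `pickle_cache`, the cache file naming, and the toy function `count_chars` are all made up for this example, and, like the generator version in this article, it assumes a single argument that is safe to use in a file name.

```python
import functools
import os
import pickle
import tempfile


def pickle_cache(cache_dir):
    """Hypothetical helper: cache a function's return value on disk with pickle."""

    def decorator(func):
        @functools.wraps(func)
        def wrapper(keyword):
            # Assumption: one argument whose value is safe to use in a file name.
            file_name = '{}_{}.pickle'.format(func.__name__, keyword)
            file_path = os.path.join(cache_dir, file_name)

            # Cache hit: load the stored object instead of calling func.
            if os.path.exists(file_path):
                with open(file_path, 'rb') as f:
                    return pickle.load(f)

            # Cache miss: compute the result, store it, and return it.
            result = func(keyword)
            with open(file_path, 'wb') as f:
                pickle.dump(result, f)
            return result

        return wrapper

    return decorator


cache_dir = tempfile.mkdtemp()  # throwaway directory for the demonstration
calls = []                      # records which calls actually ran the function


@pickle_cache(cache_dir)
def count_chars(keyword):
    calls.append(keyword)
    return {'keyword': keyword, 'length': len(keyword)}


first = count_chars('pen')
second = count_chars('pen')    # served from the pickle file; the body is not re-run
third = count_chars('apple')   # new parameter, so the function runs again
```

Because the whole return value is available at once, this version is simpler than the generator case: there is nothing to wrap, only a load or a dump.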

Implementation

awesome_decorator is easy to implement. It checks whether a cached file already exists, and then:

  1. If there is a cache, it creates a new generator that returns values from the cache and returns it in place of the original generator.
  2. If there is no cache, it wraps the original generator and returns a generator that writes each value to the cache as it is yielded.

That is all there is to it (the same goes when using `pickle` and so on):

def awesome_decorator(func):

    @functools.wraps(func)
    def _wrapper(keyword):
        # For simplicity, assume the function takes a single argument, keyword.
        # To handle general (*args, **kwargs), use inspect or similar to extract the argument names and values.
        file_path = './data/{}.txt'.format(keyword)

        # If cached data exists, return a generator that reads the sentences from it.
        if os.path.exists(file_path):
            def gen_cached_data():
                with codecs.open(file_path, 'r', 'utf-8') as f:
                    for line in f:
                        yield line
            return gen_cached_data()

        # If there is no cached data, create the generator that returns sentences from the raw data as usual.
        gen = func(keyword)

        # Wrap it so that the values it yields are also written to the cache.
        def generator_with_cache(gen, file_path):
            with codecs.open(file_path, 'w', 'utf-8') as f:
                for e in gen:
                    f.write(e)
                    yield e

        return generator_with_cache(gen, file_path)

    return _wrapper

For an easy-to-follow explanation of decorators themselves, see the article 12 Steps to Understand Python Decorators.

Putting it all together looks like this (it works as is, given `./data/sentences.txt`):

awesome_generator.py


# -*- coding: utf-8 -*-

import os
import functools
import codecs


def awesome_decorator(func):

    @functools.wraps(func)
    def _wrapper(keyword):
        # For simplicity, assume the function takes a single argument, keyword.
        # To handle general (*args, **kwargs), use inspect or similar to extract the argument names and values.
        file_path = './data/{}.txt'.format(keyword)

        # If cached data exists, return a generator that reads the sentences from it.
        if os.path.exists(file_path):
            def gen_cached_data():
                with codecs.open(file_path, 'r', 'utf-8') as f:
                    for line in f:
                        yield line
            return gen_cached_data()

        # If there is no cached data, create the generator that returns sentences from the raw data as usual.
        gen = func(keyword)

        # Wrap it so that the values it yields are also written to the cache.
        def generator_with_cache(gen, file_path):
            with codecs.open(file_path, 'w', 'utf-8') as f:
                for e in gen:
                    f.write(e)
                    yield e

        return generator_with_cache(gen, file_path)

    return _wrapper


@awesome_decorator
def filter_data(keyword):
    path_to_raw_data = './data/sentences.txt'
    with codecs.open(path_to_raw_data, 'r', 'utf-8') as f:
        for line in f:
            if keyword in line:
                yield line


if __name__ == '__main__':
    # The first call scans all the data and returns the result.
    # At this point, the filtered sentences are cached in './data/pen.txt'.
    gen_pen_sentences1 = filter_data('pen')
    for line in gen_pen_sentences1:
        print(line, end='')

    # Called with the same parameter, so the data is returned from the cache './data/pen.txt'.
    gen_pen_sentences2 = filter_data('pen')
    for line in gen_pen_sentences2:
        print(line, end='')

    # A new parameter, so the raw data is filtered again.
    gen_apple_sentences = filter_data('apple')
    for line in gen_apple_sentences:
        print(line, end='')
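
The comments in `_wrapper` mention using inspect to handle general `(*args, **kwargs)`. One possible way to do that is sketched below, assuming Python 3.5 or later. The helper `cache_key` and the extra `ignore_case` parameter are hypothetical and only serve the illustration; the sketch also assumes the argument values are short and safe to embed in a file name.

```python
import inspect


def cache_key(func, *args, **kwargs):
    """Hypothetical helper: build a cache key from the function name and its arguments."""
    # Bind the call to the signature so that positional, keyword, and default
    # arguments all end up in one name -> value mapping, in parameter order.
    bound = inspect.signature(func).bind(*args, **kwargs)
    bound.apply_defaults()
    parts = ['{}-{}'.format(name, value) for name, value in bound.arguments.items()]
    return '_'.join([func.__name__] + parts)


def filter_data(keyword, ignore_case=False):
    # Body omitted; only the signature matters for building the key.
    pass


# The same logical call yields the same key, however the arguments are passed.
key1 = cache_key(filter_data, 'pen')
key2 = cache_key(filter_data, keyword='pen', ignore_case=False)
key3 = cache_key(filter_data, 'apple')
```

Inside a decorator, such a key could replace `keyword` when building `file_path`, so that calls written in different styles still share one cache file.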

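memozo

The implementation above handled the shape of the parameters, the file name, and so on in a fixed form. I extended it slightly to handle arbitrary forms and put it together as a package at github.com/sotetsuk/memozo. With it, this processing can be written like this: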

import codecs

from memozo import Memozo

m = Memozo('./data')

@m.generator(file_name='filtered_sentences', ext='txt')
def filter_data(keyword):
    path_to_raw_data = './data/sentences.txt'
    with codecs.open(path_to_raw_data, 'r', 'utf-8') as f:
        for line in f:
            if keyword in line:
                yield line

The cache file is saved as `./data/filtered_sentences_1fec01f.txt`, and the history of the parameters used is written to `./data/.memozo`. The hash is computed from (file name, function name, parameters), and if both a history entry and a cache file with the same hash already exist, execution of the function is skipped. In other words, calling with the same (file name, function name, parameters) returns the values from the cache, and changing any one of them results in a different hash, so that cache is not reused.
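
To make the idea of hashing (file name, function name, parameters) concrete, here is a minimal sketch. It is not memozo's actual implementation: the helper `short_hash`, the choice of SHA-1, and the 7-character length are assumptions for illustration, so the digests it produces will not match the ones memozo writes.

```python
import hashlib


def short_hash(file_name, func_name, params, length=7):
    """Hypothetical helper: derive a short hex digest from the three inputs."""
    # Sort the parameters so the digest does not depend on dict ordering.
    canonical = repr((file_name, func_name, sorted(params.items())))
    return hashlib.sha1(canonical.encode('utf-8')).hexdigest()[:length]


h_pen = short_hash('filtered_sentences', 'filter_data', {'keyword': 'pen'})
h_pen_again = short_hash('filtered_sentences', 'filter_data', {'keyword': 'pen'})
h_apple = short_hash('filtered_sentences', 'filter_data', {'keyword': 'apple'})

# Same inputs give the same digest, so the same cache file name is produced.
cache_file = './data/filtered_sentences_{}.txt'.format(h_pen)
```

Changing the file name, the function name, or any parameter changes the digest, which is what lets one cache directory hold results for many parameter combinations.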

In addition to the generator version, there are versions of the decorator for `pickle`, `codecs`, and the ordinary `open`.

I think the implementation is still incomplete in places, so I would be grateful for feedback via issues, PRs, and so on.

Related

If there are complex dependencies between tasks, it is better to use a DAG-based workflow tool. One example is github.com/spotify/luigi.

References

- github.com/sotetsuk/memozo: The package that collects this implementation
- github.com/spotify/luigi: If you have complex dependencies between tasks, you should use a DAG-based workflow tool; luigi is one example
- github.com/petered/plato/pulls/56: An implementation with the same motivation
- lru_cache: Caches in memory with a decorator
- 12 Steps to Understand Python Decorators: An explanation of decorators themselves
