Motivation

When you write a program, you may want to reuse it later, or you may want other team members to use it.

In that case, the code is easier for others to use if you split it into modules by function, package it, and keep its documentation up to date.

VS Code is a powerful tool for creating Python packages, so this article explains how to create one with it.

It also covers points that are useful when packaging a data analysis program.

Environment

Item	Version	Remarks
OS	Windows 10	
conda	4.8.3	Checked with `conda -V` in Anaconda Prompt
Anaconda	2020.02	Checked with `conda list anaconda` in Anaconda Prompt
Python	3.8.2	
VS Code	1.43.2	

Setting up the environment

To prepare a Python execution environment in VS Code, please refer to this article: How to build Python and Jupyter execution environment with VS Code

Directory structure

Once you have an execution environment in VS Code, create the folders and files for developing the Python package there.

.
├── (package name)
│   ├── __init__.py
│   └── (file name).py
├── setup.py
└── script.py

The package itself lives in the folder that carries the package name.

I created this layout in VS Code as shown above. In this article, the package is named mypackage.
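The same layout can also be created from a script. The following is a minimal sketch using only the standard library; the helper name `scaffold_package` and the default module name `preprocessing.py` are assumptions chosen for illustration.

```python
from pathlib import Path


def scaffold_package(root, package_name="mypackage", module_name="preprocessing.py"):
    """Create the empty folder and file layout for a new package under root."""
    root = Path(root)
    package_dir = root / package_name
    package_dir.mkdir(parents=True, exist_ok=True)

    created = [
        package_dir / "__init__.py",  # marks the folder as a package
        package_dir / module_name,    # hypothetical module name
        root / "setup.py",
        root / "script.py",
    ]
    for path in created:
        path.touch(exist_ok=True)
    return created


if __name__ == "__main__":
    import tempfile

    # Try it in a throwaway directory and list what was created.
    with tempfile.TemporaryDirectory() as tmp:
        for path in scaffold_package(tmp):
            print(path.relative_to(tmp))
```

The files are created empty; their contents are filled in by the steps that follow.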

Write setup.py

setup.py is the file that defines the package's name, version, and dependency information.

setup.py

from setuptools import setup, find_packages

setup(
    name='mypackage',
    install_requires=['pandas','scikit-learn'],
    packages=find_packages()
)

Write the settings as arguments to the `setup` function. For example, `install_requires` lists the modules that the package depends on.

There are many other options, so please check the official documentation (Writing the Setup Script) as needed.

Write a program

Now let's write a program. As an example, this article builds a package that analyzes Kaggle's Titanic data.

Suppose you write the following data preprocessing program in the file preprocessing.py inside the package.

preprocessing.py

class Preprocesser:
    """
    Class that preprocesses the data
    """
    def process(self, data):
        """
        Preprocess the data and return a processed copy
        """
        processed_data = data.copy()

        # Fill missing Age values with the median
        age_m = processed_data['Age'].median()
        processed_data['Age'] = processed_data['Age'].fillna(age_m)

        # ----- omitted -----
        # Write further preprocessing here

        return processed_data
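To see what the median fill does, here is a small sketch on made-up data. The three-row DataFrame is invented for illustration (it is not the Titanic data), and the function `fill_age_with_median` is a hypothetical helper that simply repeats the two median lines from `process` so that the block runs on its own.

```python
import pandas as pd


def fill_age_with_median(data):
    """Return a copy of data with missing 'Age' values replaced by the median."""
    processed = data.copy()
    age_median = processed["Age"].median()  # missing values are skipped by default
    processed["Age"] = processed["Age"].fillna(age_median)
    return processed


# Made-up sample: the second passenger has no recorded age.
sample = pd.DataFrame({"Age": [22.0, None, 38.0]})
filled = fill_age_with_median(sample)

print(filled["Age"].tolist())           # [22.0, 30.0, 38.0]
print(int(sample["Age"].isna().sum()))  # 1 (the input is left unchanged)
```

Because the method works on a copy, the caller's DataFrame still contains the missing value afterwards.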

Run the program

Let's run the program. Create a script for running it, script.py, in the same folder as setup.py, next to the mypackage folder. As an example, write a program that preprocesses the training data and displays it.

script.py

def main():
    from mypackage import preprocessing
    import pandas as pd

    # Load the training data from the current folder
    train = pd.read_csv('train.csv')

    # Create a Preprocesser instance and run the preprocessing
    preprocesser = preprocessing.Preprocesser()
    train_processed = preprocesser.process(train)

    print(train_processed.head())

if __name__ == '__main__':
    main()

As for importing your own package: if the script is in the folder that directly contains the package folder, you can import a module from the package with

from mypackage import preprocessing

The general form is

from (package name) import (module file name without .py)
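The import works because Python searches the directories listed in `sys.path`, and when you run script.py its own folder is placed there, so the package folder next to it is found. The following sketch reproduces this with a throwaway package in a temporary directory; the names `demo_pkg_import`, `greeting`, and `hello` are hypothetical.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as project_root:
    # Build project_root/demo_pkg_import/ with __init__.py and greeting.py
    package_dir = Path(project_root) / "demo_pkg_import"
    package_dir.mkdir()
    (package_dir / "__init__.py").write_text("")
    (package_dir / "greeting.py").write_text(
        "def hello():\n    return 'hello from the package'\n"
    )

    # Running script.py puts its folder on sys.path; emulate that here.
    sys.path.insert(0, project_root)
    importlib.invalidate_caches()
    try:
        # Equivalent to: from demo_pkg_import import greeting
        greeting = importlib.import_module("demo_pkg_import.greeting")
        message = greeting.hello()
    finally:
        sys.path.remove(project_root)

print(message)  # hello from the package
```

If script.py were moved somewhere else without installing the package, the same import would fail, which is what the installation step later in the article addresses.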

With script.py open, press F5 in VS Code to run the program as shown above; the result is displayed in the Terminal.

Debug the package

As described in [How to build Python and Jupyter execution environment with VS Code #Utilization of debugging](https://qiita.com/SolKul/items/f078877acd23bb1ea5b5#%E3%83%87%E3%83%90%E3%83%83%E3%82%B0%E3%81%AE%E6%B4%BB%E7%94%A8), you can use VS Code's debugging features in package development as well.

For example, press F9 on line 7 of preprocessing.py in the package, as shown above. A red dot appears at the left edge of that line. This is called a **breakpoint**. In this state, go back to script.py and press F5 to run it.

Execution pauses at line 7 inside the package as shown above, and the variables defined at that point (here, the variables in preprocessing.py) are displayed in the left sidebar. Using breakpoints in this way should make fixing bugs (= **debugging**) more efficient.

Install the package

Next, install this self-made package in another environment and check whether it works there.

Open Anaconda Prompt and create a new environment.

conda create -n (environment name) python=(Python version)

This time, I created an environment called setup_test as an example.

Then activate this environment.

conda activate setup_test

Then move to the folder that contains the setup.py edited above.

cd (directory containing setup.py)

Then install the self-made package.

python setup.py install

After installation, try running the script.py from above. Copy script.py and train.csv to another folder of your choice and run them there.

python script.py

The script runs as shown above, and the preprocessed training data is displayed. This folder contains only the script and the data, not the self-made package folder. In other words, if script.py runs in this folder, the self-made package has been installed in this environment.
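If you want to confirm that Python is picking up the installed copy rather than a local folder, you can ask the import system where it found the package. This is a minimal sketch using the standard library; the helper name `package_location` is an assumption, and `json` is used only so that the example runs without mypackage installed.

```python
import importlib.util


def package_location(name):
    """Return the file a top-level package or module would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        return None
    return spec.origin


# With mypackage installed, package_location("mypackage") should point into
# the environment's site-packages folder rather than your project folder.
print(package_location("json"))
print(package_location("package_that_does_not_exist_12345"))
```

A result of `None` means the package cannot be imported from the current environment at all.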

Prepare demo data in your own package

When creating your own package, you may want to include data files other than the source code.

For example, suppose you create and distribute a package for data analysis. When other members want to try the package, they may want to see how the analysis behaves even though they cannot prepare data right away. If demo data is included in the package, you can walk them through it smoothly.

As an example, this section explains how to include the Titanic training data in the package. Add some folders and files to the directory.

.
├── mypackage
│   ├── __init__.py
│   ├── preprocessing.py
│   ├── load_data.py *
│   └── resources *
│       └── train.csv *
├── setup.py
└── script.py

*: Newly added files and folders

First, create a folder for data inside your package. Here it is called resources. Put the training data (train.csv) in it.

Read the data in the package

Write the following code, which loads the demo data, and add it to the package.

load_data.py

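import io
import pkgutil

import pandas as pd

class DataLoader:
    def load_demo(self):
        # Read the bundled CSV as bytes from inside the package
        train_b = pkgutil.get_data('mypackage', 'resources/train.csv')
        # Wrap the bytes in a file-like object so that pandas can read them
        train_f = io.BytesIO(train_b)
        train = pd.read_csv(train_f)
        return train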

Here we use pkgutil, a module from Python's standard library. Given a package name and a file path inside that package, the function pkgutil.get_data() returns the file's contents as bytes.

Also, `io` is used to handle the bytes that were read like a file (a file-like object).
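These two steps can be tried without pandas. The sketch below builds a throwaway package with a tiny CSV in a temporary directory, reads it with `pkgutil.get_data`, and wraps the bytes in `io.BytesIO`. The package name `demo_pkg_data` and the CSV contents are made up, and the standard `csv` module stands in for `pd.read_csv` only to keep the example free of dependencies.

```python
import csv
import importlib
import io
import pkgutil
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as project_root:
    # Build project_root/demo_pkg_data/resources/train.csv
    resources_dir = Path(project_root) / "demo_pkg_data" / "resources"
    resources_dir.mkdir(parents=True)
    (resources_dir.parent / "__init__.py").write_text("")
    (resources_dir / "train.csv").write_bytes(b"PassengerId,Age\n1,22\n2,38\n")

    sys.path.insert(0, project_root)
    importlib.invalidate_caches()
    try:
        # get_data returns the file contents as bytes
        raw = pkgutil.get_data("demo_pkg_data", "resources/train.csv")
    finally:
        sys.path.remove(project_root)

# BytesIO makes the bytes behave like an opened binary file;
# TextIOWrapper decodes it so the csv module can read text lines.
binary_file = io.BytesIO(raw)
text_file = io.TextIOWrapper(binary_file, encoding="utf-8", newline="")
rows = list(csv.DictReader(text_file))

print(rows)
# [{'PassengerId': '1', 'Age': '22'}, {'PassengerId': '2', 'Age': '38'}]
```

The resource path is given relative to the package folder and always uses `/` as the separator, regardless of the operating system.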

Test whether the demo data can be read. Rewrite main() of script.py as follows and run it with F5 in VS Code.

script.py

def main():
    from mypackage import load_data

    data_loader=load_data.DataLoader()
    train=data_loader.load_demo()
    print(train.head())

The demo data can be read as shown above.

Make sure that data is installed at the same time as package installation

However, this alone does not install the data when the package is installed. Add a line to setup.py so that the data is installed together with the package.

setup.py

from setuptools import setup, find_packages

setup(
    name='mypackage',
    install_requires=['pandas','scikit-learn'],
    packages=find_packages(),
    package_data={'mypackage': ['resources/*']}
)

By specifying the package name and the file patterns in package_data, you can specify the data to be installed together with the package.
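The patterns in package_data are glob patterns interpreted relative to the package folder, and only files are picked up. The following sketch approximates that matching with the standard `glob` module to show which files `'resources/*'` would cover. It is not setuptools' actual code; the helper `match_package_data` and the file names other than train.csv are invented for illustration.

```python
import glob
import os
import tempfile
from pathlib import Path


def match_package_data(package_dir, patterns):
    """Approximate which files the given package_data patterns select."""
    matched = []
    for pattern in patterns:
        for path in glob.glob(os.path.join(package_dir, pattern)):
            if os.path.isfile(path):  # directories themselves are not data files
                matched.append(Path(path).relative_to(package_dir).as_posix())
    return sorted(matched)


with tempfile.TemporaryDirectory() as tmp:
    package_dir = Path(tmp) / "mypackage"
    (package_dir / "resources" / "extra").mkdir(parents=True)
    (package_dir / "resources" / "train.csv").write_text("Age\n22\n")
    (package_dir / "resources" / "extra" / "notes.txt").write_text("nested\n")

    selected = match_package_data(package_dir, ["resources/*"])

print(selected)  # ['resources/train.csv']
```

In this approximation, a file in a subfolder of resources is not covered by `'resources/*'` and would need its own pattern, such as `'resources/extra/*'`.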

For details, refer to Official document (2.6. Install package data).

Then, as explained above, if you create a new environment and install your own package using setup.py, you can see that the demo data can be used in the installed environment.

In conclusion

Packaging alone is not enough to make your program easy to explain to other members and easy to use. Ideally, there is more to do, such as writing tests with `unittest` or pytest and describing the program's inputs and outputs in docstrings.

However, I think packaging is the first step toward that.

If you have come this far, please make your program easier to understand by writing tests, writing docstrings, and converting the docstrings into program specifications, as described in the references below.
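As a starting point, here is a sketch of what a docstring that describes inputs and outputs, together with a small `unittest` test, might look like. The function `fill_missing_with_median` is a hypothetical pure-Python stand-in for the preprocessing step, not code from the package above.

```python
import statistics
import unittest


def fill_missing_with_median(values):
    """Replace None entries with the median of the remaining values.

    Args:
        values: list of numbers, where missing entries are None.

    Returns:
        A new list of the same length with every None replaced.
        An input with no numbers at all is returned as an unchanged copy.
    """
    present = [v for v in values if v is not None]
    if not present:
        return list(values)
    median = statistics.median(present)
    return [median if v is None else v for v in values]


class FillMissingWithMedianTest(unittest.TestCase):
    def test_none_is_replaced_by_median(self):
        self.assertEqual(fill_missing_with_median([22, None, 38]), [22, 30.0, 38])

    def test_input_is_not_modified(self):
        values = [1, None, 3]
        fill_missing_with_median(values)
        self.assertEqual(values, [1, None, 3])


# Run the tests programmatically so the block works as a plain script.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(FillMissingWithMedianTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

With pytest, the same checks can be written as plain functions that use `assert`.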

Additional Information

This article was very helpful for packaging Python code. It also describes how to write tests using `unittest`, so please refer to it. How to make a Python package (written for internships)

However, when it comes to testing, pytest is easier to use. Once you are used to `unittest`, please try pytest. pytest (official documentation)

Next, documentation that describes your program. You can turn docstrings into a specification document. How to use Sphinx. Read docstring and generate specifications

You may also want to use diagrams and formulas to explain how to use the program and the theory behind it.

In such a case, I recommend MkDocs, a tool that lets you write documentation in Markdown. Document creation with MkDocs

If you create documentation with Sphinx or MkDocs and host it on AWS S3 or similar, then when a member asks how to use the program, you can simply send the URL, which is very convenient when you are busy.

For the Titanic data analysis, I referred to this article: [Introduction to Kaggle Beginners] Who will survive the Titanic?

How to make a Python package using VS Code