When you write a program, you often want to reuse it yourself or let other team members use it.
In that case, the code is easier for others to use if you split it into modules by function, package it, and keep the documentation up to date.
VS Code is a powerful tool for creating Python packages, so this article explains how to create one with it.
It also covers points that are useful when packaging a data analysis program.
Item | Version | Remarks |
---|---|---|
OS | Windows 10 | |
conda | 4.8.3 | Run `conda -V` in Anaconda Prompt |
Anaconda | 2020.02 | Run `conda list anaconda` in Anaconda Prompt |
Python | 3.8.2 | |
VSCode | 1.43.2 | |
To set up a Python execution environment in VS Code, see this article: How to build Python and Jupyter execution environment with VS Code
Once the execution environment is ready, create the folders and files for developing the Python package there.
.
├── Favorite package name
│ ├── __init__.py
│ └── Favorite file name.py
├── setup.py
└── script.py
The package itself lives under the "Favorite package name" folder.
As an example, I created this layout in VS Code and named the package mypackage.
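If you prefer to create this layout from code rather than by hand, the following is a minimal sketch using only the standard library. The function name `create_skeleton` and the default module name `preprocessing` are my own choices for illustration, and the files are created empty inside a temporary folder.

```python
import tempfile
from pathlib import Path


def create_skeleton(root, package_name="mypackage", module_name="preprocessing"):
    """Create the package layout described above under `root`."""
    root = Path(root)
    package_dir = root / package_name
    package_dir.mkdir(parents=True, exist_ok=True)
    # An empty __init__.py marks the folder as a regular package
    (package_dir / "__init__.py").touch()
    (package_dir / f"{module_name}.py").touch()
    # setup.py and script.py sit next to the package folder, not inside it
    (root / "setup.py").touch()
    (root / "script.py").touch()
    return sorted(p.relative_to(root).as_posix() for p in root.rglob("*"))


with tempfile.TemporaryDirectory() as tmp:
    created = create_skeleton(tmp)
    print(created)
```

In a real project you would pass your project folder instead of a temporary one.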
setup.py is the file that declares the package name, version, and dependencies of the package you are creating.
setup.py
from setuptools import setup, find_packages

setup(
    name='mypackage',
    install_requires=['pandas', 'scikit-learn'],
    packages=find_packages()
)
Write the settings as arguments to the setup function.
For example, `install_requires` lists the modules that the package depends on.
There are many other options, so check the official documentation (Writing the Setup Script) as needed.
Let's write a program right away. As an example, I will build a package for analyzing Kaggle's Titanic data.
Suppose you write the following program in the file preprocessing.py in the package.
It preprocesses the data.
preprocessing.py
class Preprocesser:
    """
    Class that preprocesses the data
    """
    def process(self, data):
        """
        Preprocess the data and return a processed copy
        """
        processed_data = data.copy()
        # Fill missing Age values with the median age
        age_m = processed_data['Age'].median()
        processed_data['Age'] = processed_data['Age'].fillna(age_m)
        # ----- omitted -----
        # Write further preprocessing here
        return processed_data
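To make the median step concrete, here is a small sketch of the same idea in plain Python. It is not the pandas code above: the helper `fill_missing_with_median` and the sample ages are made up for illustration, and missing values are represented by `None`. Like the `median()` call in pandas, it ignores the missing entries when computing the median.

```python
import statistics


def fill_missing_with_median(values):
    """Return a copy of `values` where None entries are replaced by the median."""
    # Compute the median from the known values only
    known = [v for v in values if v is not None]
    median = statistics.median(known)
    # Build a new list so the input is left untouched
    return [median if v is None else v for v in values]


ages = [22.0, None, 38.0, 26.0, None, 35.0]
filled = fill_missing_with_median(ages)
print(filled)  # [22.0, 30.5, 38.0, 26.0, 30.5, 35.0]
```

The input list is not modified, which mirrors the `data.copy()` at the start of `process`.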
Let's run the program. Create a script called script.py for running it, next to the mypackage folder (in the same directory as setup.py). As an example, write a program that preprocesses the training data and displays it.
script.py
def main():
    from mypackage import preprocessing
    import pandas as pd
    train = pd.read_csv('train.csv')
    # Create a Preprocesser instance and run the preprocessing
    preprocesser = preprocessing.Preprocesser()
    train_processed = preprocesser.process(train)
    print(train_processed.head())

if __name__ == '__main__':
    main()
To import from your own package when the script sits in the directory that contains the package folder, write
from mypackage import preprocessing
In general, the form is
from (package name) import (file name of the individual Python file, without .py)
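The import works because Python searches the directories on `sys.path` for the package folder, and the directory of the script being run is placed on that path automatically. The sketch below reproduces this with a throwaway package; the names `demo_import_pkg` and `greeting` are hypothetical and exist only for this demonstration.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical package created only for this demonstration
    package_dir = Path(tmp) / "demo_import_pkg"
    package_dir.mkdir()
    (package_dir / "__init__.py").write_text("")
    (package_dir / "greeting.py").write_text(
        "def hello():\n    return 'hello from the package'\n"
    )

    # The directory that CONTAINS the package folder must be on sys.path
    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        from demo_import_pkg import greeting
        message = greeting.hello()
        module_name = greeting.__name__
    finally:
        sys.path.remove(tmp)

print(module_name, "->", message)
```

Here the path is added by hand only because the package lives in a temporary folder; for script.py next to the package folder, no such step is needed.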
With script.py open, press the F5 key in VS Code to run the program. The result is displayed in the Terminal.
As described in [How to build Python and Jupyter execution environment with VS Code #Utilization of debugging](https://qiita.com/SolKul/items/f078877acd23bb1ea5b5#%E3%83%87%E3%83%90%E3%83%83%E3%82%B0%E3%81%AE%E6%B4%BB%E7%94%A8), you can use VS Code's debugging features in package programming as well.
For example, put the cursor on line 7 of preprocessing.py in the package and press the F9 key. A red dot appears at the left edge of that line. This is called a **breakpoint**. With the breakpoint set, go back to script.py and press F5 to run it.
Execution pauses at line 7 inside the package, and the variables defined at that point (here, the variables in preprocessing.py) are displayed in the left sidebar. Using breakpoints in this way makes fixing bugs (= **debugging**) more efficient.
Next, install this self-made package in another environment and check that it works there.
Open Anaconda Prompt and create a new environment.
conda create -n (environment name) python=(Python version)
This time, I created an environment called setup_test as an example.
Then activate this environment.
conda activate setup_test
Then move to the folder that contains the setup.py edited above.
cd (directory containing setup.py)
Then install the self-made package.
python setup.py install
After installation, try running script.py in this environment. Copy script.py and train.csv to another folder of your choice and run the script there.
python script.py
The script runs and the preprocessed training data is displayed. This folder contains only the script and the data, not the self-made package folder. In other words, if script.py runs in this folder, the package has been installed into this environment.
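Another way to confirm which copy of the package is being used is to ask Python where it would import it from. This is a minimal sketch using the standard library; the helper name `locate_package` is my own. After installation, I would expect the path for `mypackage` to point inside the environment's site-packages folder rather than your development folder.

```python
import importlib.util


def locate_package(name):
    """Return the file a top-level package would be imported from, or None."""
    spec = importlib.util.find_spec(name)
    if spec is None:
        # The package cannot be imported in this environment
        return None
    return spec.origin


# "mypackage" is the package name used in this article;
# this prints None if it is not installed in the current environment
print(locate_package("mypackage"))
# A standard-library package, shown for comparison
print(locate_package("json"))
```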
When creating your own package, you may want to include data files in addition to the source code.
For example, suppose you create and distribute a package for data analysis. Other members who want to use it may want to see how the analysis behaves before they can prepare their own data. If you ship demo data in the package, you can walk them through it smoothly.
As an example, I will explain the case where the Titanic training data is bundled in the package. Add some folders and files to the directory.
.
├── mypackage
│ ├── __init__.py
│ ├── preprocessing.py
│ ├── load_data.py *
│ └── resources *
│ └── train.csv *
├── setup.py
└── script.py
*:Newly added files and folders
First, create a folder for data in your own package. Here, it is called resources. Put the training data (train.csv) in it.
Then add the following code, which loads the demo data, to the package.
load_data.py
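import io
import pkgutil

import pandas as pd


class DataLoader:
    def load_demo(self):
        # Read the bundled CSV file as bytes
        train_b = pkgutil.get_data('mypackage', 'resources/train.csv')
        # Wrap the bytes in a file-like object so pandas can read them
        train_f = io.BytesIO(train_b)
        train = pd.read_csv(train_f)
        return train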
Here we use pkgutil, a module in Python's standard library. Given a package name and a file path relative to that package, the function pkgutil.get_data() returns the file contents as bytes.
`io.BytesIO` then wraps those bytes so that they can be handled like a file (a file-like object).
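The following sketch shows the same mechanism without pandas so that it can be run anywhere. The package `demo_data_pkg`, the file `sample.csv`, and its two rows are made up for this demonstration and are not the real Titanic data.

```python
import csv
import importlib
import io
import pkgutil
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Hypothetical package with a resources folder, created only for this demo
    package_dir = Path(tmp) / "demo_data_pkg"
    (package_dir / "resources").mkdir(parents=True)
    (package_dir / "__init__.py").write_text("")
    (package_dir / "resources" / "sample.csv").write_text(
        "PassengerId,Age\n1,22\n2,38\n"
    )

    sys.path.insert(0, tmp)
    importlib.invalidate_caches()
    try:
        # get_data returns the raw file contents as bytes
        raw = pkgutil.get_data("demo_data_pkg", "resources/sample.csv")
    finally:
        sys.path.remove(tmp)

# BytesIO makes the bytes behave like a binary file; TextIOWrapper decodes it
text_file = io.TextIOWrapper(io.BytesIO(raw), encoding="utf-8", newline="")
rows = list(csv.DictReader(text_file))
print(rows)
```

Because the data is located through the package name, the same call keeps working after the package is installed somewhere else.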
Test whether the demo data can be read. Rewrite main() in script.py as follows and run it with F5 in VS Code.
script.py
def main():
    from mypackage import load_data
    data_loader = load_data.DataLoader()
    train = data_loader.load_demo()
    print(train.head())
The demo data is loaded and its first rows are printed.
However, this alone does not install the data along with the package. Add a line to setup.py so that the data is installed together with the package.
setup.py
from setuptools import setup, find_packages

setup(
    name='mypackage',
    install_requires=['pandas', 'scikit-learn'],
    packages=find_packages(),
    package_data={'mypackage': ['resources/*']}
)
package_data maps a package name to a list of file patterns, relative to the package folder, that are installed together with the package.
For details, refer to the official documentation (2.6. Installing Package Data).
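As far as I understand, the patterns are treated as glob patterns relative to the package folder, and only files are picked up. The sketch below approximates that matching with the standard `glob` module to show what `resources/*` would cover. The helper `match_package_data` and the `nested` folder are hypothetical, and this is an illustration of the pattern, not the actual setuptools implementation.

```python
import glob
import os
import tempfile
from pathlib import Path


def match_package_data(package_dir, patterns):
    """Return files under `package_dir` matched by glob `patterns`."""
    matched = set()
    for pattern in patterns:
        for path in glob.glob(os.path.join(package_dir, pattern)):
            if os.path.isfile(path):  # directories themselves are not data files
                matched.add(Path(path).relative_to(package_dir).as_posix())
    return sorted(matched)


with tempfile.TemporaryDirectory() as tmp:
    package_dir = Path(tmp) / "mypackage"
    (package_dir / "resources" / "nested").mkdir(parents=True)
    (package_dir / "resources" / "train.csv").write_text("Age\n22\n")
    (package_dir / "resources" / "nested" / "extra.csv").write_text("Age\n38\n")

    matched = match_package_data(package_dir, ["resources/*"])
    print(matched)  # ['resources/train.csv']
```

Note that the file in the nested folder is not matched, so data in subfolders would need its own pattern such as `resources/nested/*`.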
Then, as explained above, create a new environment and install your package using setup.py. You can confirm that the demo data is available in the installed environment.
Packaging alone is not enough to make your program easy to explain to other members and easy to use. Ideally there is more to do, such as writing tests with `unittest` or `pytest` and describing the program's inputs and outputs in docstrings.
However, I think packaging is the first step toward that.
If you have come this far, make your program easier to understand by writing tests, writing docstrings, and turning the docstrings into a program specification, using the references below.
This article was very helpful for packaging Python code. It also explains how to write tests with `unittest`, so please refer to it: How to make a Python package (written for internships)
That said, when it comes to testing, pytest is easier to use. If you are used to `unittest`, give pytest a try.
pytest (official documentation)
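To give a feel for the pytest style, here is a minimal sketch. pytest collects functions whose names start with `test_` and checks plain `assert` statements. The helper `fill_missing` is a hypothetical function under test, not part of the package above, and the tests are called directly at the end so the example also runs without pytest installed.

```python
def fill_missing(values, fill_value):
    """Hypothetical helper under test: replace None entries with `fill_value`."""
    return [fill_value if v is None else v for v in values]


# pytest collects functions whose names start with "test_"
def test_missing_values_are_filled():
    assert fill_missing([1, None, 3], 0) == [1, 0, 3]


def test_input_is_not_modified():
    original = [None, 2]
    fill_missing(original, 0)
    assert original == [None, 2]


# The tests are plain functions, so they can also be called directly
test_missing_values_are_filled()
test_input_is_not_modified()
print("all tests passed")
```

With pytest you would save this in a file whose name starts with `test_`, drop the direct calls at the end, and run `pytest` in that folder.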
Documentation that describes your program matters as well. You can turn docstrings into a specification document.
How to use Sphinx. Read docstring and generate specifications
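As a sketch of what such a docstring might look like, the function below describes its inputs and outputs in the docstring. The function `fill_missing` and the "Args / Returns" layout are my own choices for illustration; use whatever docstring convention your documentation tool is configured for.

```python
import inspect


def fill_missing(values, fill_value):
    """Replace missing entries in a list.

    Args:
        values: List whose missing entries are represented by None.
        fill_value: Value used in place of each None.

    Returns:
        A new list of the same length; the input list is left unchanged.
    """
    return [fill_value if v is None else v for v in values]


# inspect.getdoc returns the docstring with its indentation cleaned up
doc = inspect.getdoc(fill_missing)
print(doc.splitlines()[0])  # Replace missing entries in a list.
```

The same text is what `help(fill_missing)` shows, so the docstring serves both readers of the code and the generated specification.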
You may also want to use diagrams and formulas to explain how to use the program and the theory behind it.
In that case, I recommend mkdocs, a module that builds documents written in Markdown.
Document creation with MkDocs
If you build documents with Sphinx and MkDocs and host them on AWS S3 or a similar service, then when a member asks "How do I use this program?" while you are busy, you can simply send the URL, which is very convenient.
I referred to the following for the Titanic data analysis: [Introduction to Kaggle Beginners] Who will survive the Titanic?