[DOCKER] I measured how long pip install takes for C-dependent modules on Alpine

Let's measure

pip install of C-dependent modules on Alpine makes the build heavy

I wanted to see how long it actually takes when you simply run pip install as-is. Running it against a machine-learning requirements.txt I had on hand produced the following log.

Building wheels for collected packages: backcall, h5py, kiwisolver, matplotlib, numpy, pandas, pathspec, Pillow, PyYAML, pyzmq, regex, retrying, scikit-learn, scipy, tornado, typed-ast

I decided to pick out and measure only the modules that are likely to be used frequently during development.

Experiment environment

I had two environments on hand that could run the build, so I tried both (case1 and case2 in the results).

Measurement method

The execution time is measured by prefixing each command with the time command. Both cases use the same Dockerfile, which embeds time pip install hoge in a RUN statement.


FROM python:3.8-alpine3.11
RUN apk update \
  && apk add --virtual .build --no-cache openblas-dev lapack-dev freetype-dev \
  gfortran libxml2 g++ gcc zip unzip cmake make \
  libpng-dev openssl-dev musl libffi-dev python3-dev libxslt-dev \
  libxml2-dev jpeg-dev \
  && apk add --no-cache -X http://dl-cdn.alpinelinux.org/alpine/edge/testing hdf5-dev

RUN pip install --upgrade --no-cache-dir pip setuptools wheel && \
  time pip install --no-cache-dir Cython && \
...

That is the general form. The table below summarizes the times logged during docker build.

Results

[^1]: none-any marks a module that does not depend on the OS or CPU architecture; such modules are mostly written in pure Python. See also the official PEP 425.

| module | case1 real | case1 user | case1 sys | case2 real | case2 user | case2 sys |
|---|---|---|---|---|---|---|
| Cython | 3.07s | 2.33s | 0.42s | 1.52s | 0.99s | 0.07s |
| numpy | 5m 28.27s | 7m 6.90s | 22.11s | 2m 11.15s | 3m 23.69s | 6.05s |
| pandas | 31m 46.58s | 30m 36.25s | 53.83s | 14m 8.24s | 13m 53.10s | 15.88s |
| Pillow | 50.81s | 44.09s | 5.99s | 24.79s | 19.88s | 1.69s |
| scipy | 30m 45.99s | 36m 29.33s | 1m 45.89s | 12m 52.81s | 17m 54.87s | 42.58s |
| scikit-learn | 14m 38.63s | 14m 3.37s | 28.98s | 6m 33.10s | 6m 24.84s | 10.11s |
| h5py | 3m 45.58s | 3m 34.79s | 9.30s | 1m 45.87s | 1m 42.42s | 4.18s |
| matplotlib | 2m 51.50s | 2m 35.53s | 13.59s | 1m 21.77s | 1m 13.70s | 6.75s |
| regex | 30.52s | 28.84s | 1.24s | 13.75s | 13.06s | 0.34s |

Wow, that is slow. Really slow.

In case1, numpy alone takes more than 5 minutes, and pandas and scipy each take more than 30 minutes. Because scikit-learn depends on numpy and scipy, in practice you have to allow close to an hour just to get it installed. All of the modules above together take over an hour and a half. What does that mean? Assuming you work 8 hours a day:

**You can only deploy about 5 times a day**

It becomes hard to casually deploy the latest branch to the development environment, even just to check a minor bug fix. Once the time spent actually working is taken into account, the number drops further.

Roughly speaking, doubling the memory also doubled the processing speed here, so shortening the time is not impossible. If you are using AWS, you can save time by choosing an instance type with more memory. However, even case2, which has more than twice the performance of case1 and finishes in less than half the time, still takes about 40 minutes in total. That allows at most 12 deployments a day.
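The totals quoted above can be checked with a short script. The following is a minimal sketch in Python that uses only the real column of the results table; the helper names `to_seconds` and `total_seconds` and the dictionary layout are my own.

```python
import re

# "real" times copied from the results table, as (case1, case2).
REAL_TIMES = {
    "Cython": ("3.07s", "1.52s"),
    "numpy": ("5m 28.27s", "2m 11.15s"),
    "pandas": ("31m 46.58s", "14m 8.24s"),
    "Pillow": ("50.81s", "24.79s"),
    "scipy": ("30m 45.99s", "12m 52.81s"),
    "scikit-learn": ("14m 38.63s", "6m 33.10s"),
    "h5py": ("3m 45.58s", "1m 45.87s"),
    "matplotlib": ("2m 51.50s", "1m 21.77s"),
    "regex": ("30.52s", "13.75s"),
}

# Matches strings such as "5m 28.27s" or "50.81s".
_TIME_RE = re.compile(r"^(?:(\d+)m\s*)?(\d+(?:\.\d+)?)s$")


def to_seconds(text):
    """Convert a time string from the table into seconds."""
    match = _TIME_RE.match(text.strip())
    if match is None:
        raise ValueError(f"unrecognized time: {text!r}")
    minutes = int(match.group(1) or 0)
    return minutes * 60 + float(match.group(2))


def total_seconds(case_index):
    """Sum the real times of every module for case1 (0) or case2 (1)."""
    return sum(to_seconds(times[case_index]) for times in REAL_TIMES.values())


WORKDAY_SECONDS = 8 * 60 * 60  # the 8-hour working day assumed in the text

for index, label in enumerate(("case1", "case2")):
    total = total_seconds(index)
    builds = int(WORKDAY_SECONDS // total)
    print(f"{label}: about {total / 60:.0f} min per build, {builds} builds per day")
```

This reproduces the figures in the text: about 91 minutes and 5 builds per day for case1, about 40 minutes and 12 builds per day for case2.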


By the way, pip install was run in order from the top of the table, so the execution time of some modules includes the time needed to install their dependencies. kiwisolver also appears to be fetched as a tar.gz and built from source, so matplotlib on its own should be a little faster.

  Successfully built pandas
  Installing collected packages: six, python-dateutil, pytz, pandas
  Successfully installed pandas-0.25.3 python-dateutil-2.8.1 pytz-2019.3 six-1.14.0
------
  Successfully built scikit-learn
  Installing collected packages: joblib, scikit-learn
  Successfully installed joblib-0.14.1 scikit-learn-0.22.2.post1
-----
  Successfully built matplotlib kiwisolver
  Installing collected packages: cycler, kiwisolver, pyparsing, matplotlib
  Successfully installed cycler-0.10.0 kiwisolver-1.2.0 matplotlib-3.2.1 

Summary

Even relatively light modules take 1 to 2 minutes each, so it is worth designing your workflow so that C-dependent modules have to be built as rarely as possible. ~~Just use conda already~~

Digression: pip

During pip install, immediately after a module has been built into a wheel, the whl is unpacked again and the necessary build products are moved directly under site-packages. That means everything, including the shared libraries (*.so) produced by the build. If the python3.8 binary sits directly under /usr/local/bin, as it does in the alpine image, the files are moved to /usr/local/lib/python3.8/site-packages/.

Unpacking the wheel again may look like a hassle, but it is the safe, PEP-compliant way to install a module. PEP 491 also says so. [^2]

[^2]: As an aside, Anaconda uses its own module installation method, so it does not follow the PEP. This is one reason why mixing the two is risky.


As a test, `unzip` the pre-built whl file and search for the `.so` files.

./numpy/linalg/lapack_lite.cpython-38-x86_64-linux-gnu.so
./numpy/linalg/_umath_linalg.cpython-38-x86_64-linux-gnu.so
./numpy/core/_operand_flag_tests.cpython-38-x86_64-linux-gnu.so
...
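The same check can be done without unpacking anything, since a whl file is an ordinary zip archive. A minimal sketch in Python using only the standard library; the function name `list_shared_objects` and the stand-in archive it is run on are my own, not part of the original experiment.

```python
import tempfile
import zipfile
from pathlib import Path


def list_shared_objects(wheel_path):
    """Return the names of all .so members inside a wheel (a zip archive)."""
    with zipfile.ZipFile(wheel_path) as wheel:
        return sorted(name for name in wheel.namelist() if name.endswith(".so"))


# Demo on a stand-in archive; the package and file names are made up for illustration.
with tempfile.TemporaryDirectory() as tmp:
    demo_wheel = Path(tmp) / "demo-0.0.0-cp38-cp38-linux_x86_64.whl"
    with zipfile.ZipFile(demo_wheel, "w") as archive:
        archive.writestr("demo/__init__.py", "")
        archive.writestr("demo/_speedups.cpython-38-x86_64-linux-gnu.so", b"")
    found = list_shared_objects(demo_wheel)

print(found)
```

To inspect a real wheel, pass its path to `list_shared_objects` instead of the demo archive.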

After installing each whl file with pip install, searching the library path for .so files shows that the pre-built files are now in place for each module.

/usr/local/lib/python3.8/site-packages/numpy/linalg/lapack_lite.cpython-38-x86_64-linux-gnu.so
/usr/local/lib/python3.8/site-packages/numpy/linalg/_umath_linalg.cpython-38-x86_64-linux-gnu.so
/usr/local/lib/python3.8/site-packages/numpy/core/_operand_flag_tests.cpython-38-x86_64-linux-gnu.so
...
/usr/local/lib/python3.8/site-packages/pandas/io/sas/_sas.cpython-38-x86_64-linux-gnu.so
/usr/local/lib/python3.8/site-packages/pandas/_libs/hashtable.cpython-38-x86_64-linux-gnu.so
...
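The search over the installed files can be sketched in a similar way. This assumes that `sysconfig.get_paths()["platlib"]` points at the site-packages directory in question, and the function name `count_shared_objects` is hypothetical. The output depends entirely on what is installed, so none is shown.

```python
import sysconfig
from collections import Counter
from pathlib import Path


def count_shared_objects(site_packages):
    """Count .so files per top-level entry under the given directory."""
    root = Path(site_packages)
    counts = Counter()
    for so_file in root.rglob("*.so"):
        # The first path component is the package directory (or the file itself).
        counts[so_file.relative_to(root).parts[0]] += 1
    return counts


# "platlib" is where the running interpreter installs platform-specific packages.
site_packages = sysconfig.get_paths()["platlib"]
for package, count in sorted(count_shared_objects(site_packages).items()):
    print(f"{package}: {count}")
```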

Only now can the module finally be imported from Python. Put the other way around, a module can be used as long as its build products sit on a path that Python searches for modules. That is why the Workaround in the original article, on the assumption that nothing other than Python-related files is added later, copies everything directly under /usr/local/ using Docker's multi-stage build.
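The point about the search path can be tried out with a throwaway module. In the sketch below a plain `.py` file stands in for the build products, and `hoge_demo` is a made-up name; the same lookup applies to compiled `.so` extension modules.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # Stand-in for a build product; "hoge_demo" is a made-up module name.
    (Path(tmp) / "hoge_demo.py").write_text("ANSWER = 42\n")

    sys.path.insert(0, tmp)  # make the directory part of the module search path
    importlib.invalidate_caches()
    try:
        module = importlib.import_module("hoge_demo")
        origin = module.__file__
        value = module.ANSWER
    finally:
        sys.path.remove(tmp)  # restore the original search path

print(value)              # 42
print(Path(origin).name)  # hoge_demo.py
```

Nothing was installed with pip here; placing the file in a directory on `sys.path` was enough for the import to succeed.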
