Python VM profile

Overview

I saw it in "Differences between top-level loops and loops in functions"

`loop_toplevel.py`


for i in range(10**7):
    pass

As a sample, let's profile the processing of the Python VM and see that the processing of STORE_NAME is effective for the execution time.

Check in Ubuntu 18.04 environment.

Get Python source code

Obtained from https://github.com/python/cpython

~$ git clone https://github.com/python/cpython.git

Python build

install gcc

~$ sudo apt install -y gcc

Build with profiler enabled, optimization disabled

~$ cd cpython
~/cpython$ ./configure --enable-profiling --with-pydebug
~/cpython$ make

Profile

Prepare the target script

~/cpython$ vi loop_toplevel.py
~/cpython$ cat loop_toplevel.py
for i in range(10**7):
    pass

Simply run the built python and it will profile it for you.

~/cpython$ ./python loop_toplevel.py

The profile result is saved in gmon.out. You can see the summary by gprof ./python.

~/cpython$ ls -l gmon.out
-rw-rw-r-- 1 vagrant vagrant 1284656 Nov  4 03:16 gmon.out
~/cpython$ gprof ./python|head -n 20
Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls   s/call   s/call  name
 17.46      0.22     0.22 20043118     0.00     0.00  read_size_t
  9.52      0.34     0.12     2229     0.00     0.00  _PyEval_EvalFrameDefault
  7.94      0.44     0.10 10007889     0.00     0.00  insertdict
  6.35      0.52     0.08 10026827     0.00     0.00  _PyMem_DebugRawAlloc
  6.35      0.60     0.08 10021559     0.00     0.00  _PyMem_DebugCheckAddress
  4.76      0.66     0.06 20045319     0.00     0.00  PyGILState_Check
  3.97      0.71     0.05 10020541     0.00     0.00  address_in_range
  3.97      0.76     0.05 10001148     0.00     0.00  PyLong_FromLong
  3.97      0.81     0.05 10046396     0.00     0.00  lookdict_unicode_nodummy
  3.97      0.86     0.05 10021024     0.00     0.00  _PyMem_DebugRawFree
  3.57      0.91     0.05 10019912     0.00     0.00  _PyObject_Free
  3.17      0.95     0.04 10027556     0.00     0.00  write_size_t
  3.17      0.99     0.04 10019415     0.00     0.00  _PyMem_DebugFree
  2.38      1.02     0.03 10017088     0.00     0.00  _PyDict_CheckConsistency
  2.38      1.05     0.03 10008245     0.00     0.00  PyDict_SetItem

Visualization with gprof2dot

Let's visualize the call graph using gprof2dot.

~/cpython$ sudo apt install -y graphviz
~/cpython$ sudo apt install -y python3-pip
~/cpython$ pip3 install --user gprof2dot
~/cpython$ gprof ./python | ~/.local/bin/gprof2dot | dot -Tpng -o output.png

We saw that PyDict_SetItem was called on the order of the number of loops, and less than that was consuming time (total time 51.14%, self time 2.31%, 10008245 calls).

Check with source

Looking at the Python VM's STORE_NAME implementation, it does call PyDict_SetItem.

`cpython/Python/ceval.c`


case TARGET(STORE_NAME): {
    PyObject *name = GETITEM(names, oparg);
    PyObject *v = POP();
    PyObject *ns = f->f_locals;
    int err;
    if (ns == NULL) {
        _PyErr_Format(tstate, PyExc_SystemError,
                      "no locals found when storing %R", name);
        Py_DECREF(v);
        goto error;
    }
    if (PyDict_CheckExact(ns))
        err = PyDict_SetItem(ns, name, v);
    else
        err = PyObject_SetItem(ns, name, v);
    Py_DECREF(v);
    if (err != 0)
        goto error;
    DISPATCH();
}