Matplotlib is commonly used to visualize data and save the results as image files. When saving an image with savefig, however, a small memory leak appears to occur. This is no real problem when producing a few hundred or a few thousand images, but it becomes a serious one when you want to output tens of thousands to hundreds of thousands of images overnight for a large dataset.
Environment: Python 3.7.7, Matplotlib 3.2.2
I checked how much memory usage grows using the following code.
```python
import os
import datetime as dt

import numpy as np
import matplotlib.pyplot as plt
import psutil

# Make sure the output directory exists
os.makedirs('./output', exist_ok=True)

# Baseline values for the measurement
memory_start = psutil.virtual_memory().used
time_start = dt.datetime.now()
fw = open('./measurement_log.csv', 'w')
fw.write('i,time_delta[s],memory[KB]\n')

# Output 10,000 images
for i in range(10000):
    # Generate two kinds of data
    size = 10000
    x1 = np.random.randn(size)
    y1 = 0.5*x1 + 0.5**0.5*np.random.randn(size)
    x2 = np.random.randn(size)
    y2 = np.random.randn(size)

    # Initialize the figure and draw the scatter plots
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.scatter(x1, y1, alpha=0.1, color='r', label='data1')
    ax.scatter(x2, y2, alpha=0.1, color='g', label='data2')

    # Add axis labels and a legend
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.legend(loc='upper right')

    # Save the image
    plt.savefig('./output/{:05}.png'.format(i))

    # Release the figure's memory (*)
    #plt.clf()
    #plt.cla()
    plt.close()

    # Record memory growth since the start and the duration of this loop
    # (total_seconds(), not .microseconds, so whole seconds are not dropped)
    memory_delta = psutil.virtual_memory().used - memory_start
    time_end = dt.datetime.now()
    time_delta = time_end - time_start
    time_start = time_end
    fw.write('{},{},{}\n'.format(i+1, time_delta.total_seconds(), memory_delta/1e3))
fw.close()
```
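To examine the resulting log afterwards, something like the following works. This is not from the original post; it assumes pandas is available, and `memory_growth.png` is just an illustrative filename.

```python
# Minimal sketch: load the measurement log and plot memory growth over iterations
import pandas as pd
import matplotlib.pyplot as plt

log = pd.read_csv('./measurement_log.csv')
log.plot(x='i', y='memory[KB]')  # memory growth per saved image
plt.savefig('./memory_growth.png')
```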
To release the figure's memory, I measured two cases: calling close() alone, and calling cla() and clf() first and then close().
Running cla() and clf() before close() keeps the memory growth to roughly half of the close()-only case. However, memory can be seen to increase in both cases.
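For reference, here is a minimal sketch of the cla/clf variant, shortened by me to a smaller loop with a single dataset; only the cleanup lines at the end of the loop body differ from the measurement code above.

```python
import os
import numpy as np
import matplotlib.pyplot as plt

os.makedirs('./output', exist_ok=True)

# The "cla + clf, then close" cleanup variant measured above
for i in range(100):
    fig, ax = plt.subplots(figsize=(8, 8))
    ax.scatter(np.random.randn(1000), np.random.randn(1000), alpha=0.1)
    plt.savefig('./output/{:05}.png'.format(i))
    plt.cla()    # clear the current axes
    plt.clf()    # clear the current figure
    plt.close()  # then close it
```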
So what should you do so that memory does not keep growing, no matter how many images you output? After various experiments, I was able to suppress the memory leak and output the images correctly by rewriting the code as follows.
```python
size = 10000

# Initialize the figure only once, before the loop
fig, ax = plt.subplots(figsize=(8, 8))
for i in range(10000):
    # Generate two kinds of data
    x1 = np.random.randn(size)
    y1 = 0.5*x1 + 0.5**0.5*np.random.randn(size)
    x2 = np.random.randn(size)
    y2 = np.random.randn(size)
    ax.scatter(x1, y1, alpha=0.1, color='r', label='data1')
    ax.scatter(x2, y2, alpha=0.1, color='g', label='data2')
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.legend(loc='upper right')
    plt.savefig('./output/{:05}.png'.format(i))
    # Do only cla()
    # If you call clf() or close() here, you can no longer draw on the figure afterwards.
    # Conversely, if you skip cla(), each new plot is drawn on top of the previous ones.
    plt.cla()
```
As long as you don't need to change the figure size partway through, I think this approach can handle any kind of plot, but I'm not entirely sure, so use it with care.
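As a side note that goes beyond what was measured here: Matplotlib's object-oriented API lets you build a figure that is never registered with pyplot's global figure manager, which is another commonly suggested way to sidestep this kind of growth. A minimal sketch, assuming the Agg backend and the same data shape as above:

```python
import numpy as np
from matplotlib.figure import Figure
from matplotlib.backends.backend_agg import FigureCanvasAgg

size = 10000
fig = Figure(figsize=(8, 8))       # never touches pyplot's global state
FigureCanvasAgg(fig)               # attach an Agg canvas so fig.savefig() can render
ax = fig.add_subplot(111)
for i in range(10000):
    ax.scatter(np.random.randn(size), np.random.randn(size),
               alpha=0.1, color='r', label='data1')
    ax.legend(loc='upper right')
    fig.savefig('./output/{:05}.png'.format(i))
    ax.cla()  # clear the axes for the next iteration, as in the code above
```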