I want to fix the memory leak that occurs when outputting a large number of images with Matplotlib

Overview

Matplotlib is commonly used to visualize data and output images. When images are output with savefig, a slight memory leak seems to occur. This is not a particular problem for several hundred to several thousand images, but it becomes one when you want to output tens of thousands to hundreds of thousands of images overnight from a fairly large amount of data.

Operating environment

Python: 3.7.7, Matplotlib: 3.2.2

Investigation

I checked the amount of memory increase using the following code.

import os
import numpy as np
import datetime as dt
import matplotlib.pyplot as plt
import psutil

#Data for measurement
memory_start = psutil.virtual_memory().used
time_start = dt.datetime.now()
fw = open('./Measurement log.csv','w')
fw.write('i,time_delta[s],memory[KB]\n')
 
#Create the output directory if it does not exist yet
os.makedirs('./output', exist_ok=True)

#Output 10,000 images
for i in range(10000):
    #Generate two types of data
    size = 10000
    x1 = np.random.randn(size)
    y1 = 0.5*x1 + 0.5**0.5*np.random.randn(size)
    x2 = np.random.randn(size)
    y2 = np.random.randn(size)

    #Initialize the graph and create a scatter plot
    fig, ax = plt.subplots(figsize=(8,8))
    ax.scatter(x1, y1, alpha=0.1, color='r', label='data1')
    ax.scatter(x2, y2, alpha=0.1, color='g', label='data2')
    
    #Add labels and legends
    ax.set_xlabel('X')
    ax.set_ylabel('Y')
    ax.legend(loc='upper right')
    
    #Image output
    plt.savefig('./output/{:05}.png'.format(i))
    
    #Release the graph (* two variants measured: close only, or clf and cla before close)
    #plt.clf()
    #plt.cla()
    plt.close()
    
    #Measure the amount of memory increase from the start and the time of one loop
    memory_delta = psutil.virtual_memory().used - memory_start
    time_end = dt.datetime.now()
    time_delta = time_end - time_start
    time_start = time_end
    #total_seconds() is used because .microseconds drops the whole-second part
    fw.write('{},{},{}\n'.format(i+1, time_delta.total_seconds(), memory_delta/1e3))

fw.close()

I measured two ways of releasing the graph's memory: calling only close, and calling cla and clf before close.

[Figure: graph2.PNG, memory increase for close only versus cla and clf followed by close]

Calling cla and clf before close suppresses the memory increase to about half of the close-only case. However, memory still grows in both cases.
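The measurement above uses psutil.virtual_memory().used, which is a system-wide figure and can be affected by other running programs. As a minimal sketch, assuming you would rather track only the Python process itself, the resident set size of the current process can be read with psutil.Process. The helper name process_memory_kb is hypothetical and is not part of the original measurement code.

```python
import os

import psutil


def process_memory_kb():
    """Return the resident set size (RSS) of the current process in KB.

    Hypothetical helper: unlike psutil.virtual_memory().used, this only
    counts memory held by this Python process.
    """
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / 1e3


# Record a baseline, do some work, then look at the difference
memory_start = process_memory_kb()
work = [str(i) for i in range(100000)]  # placeholder for the plotting loop
memory_delta = process_memory_kb() - memory_start
print('memory increase: {:.0f} KB'.format(memory_delta))
```

In the loop above, memory_delta could be computed this way instead of from virtual_memory().used. The absolute numbers will differ from the original measurement, so the two should not be compared directly.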

What to do?

What should I do so that memory does not grow no matter how many images are output? After various trials, I was able to suppress the memory leak by rewriting the code as follows, and the images were still output correctly.

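import numpy as np
import matplotlib.pyplot as plt

#Number of points per data set (same value as in the measurement code)
size = 10000

#Initialize only once at the beginning
fig, ax = plt.subplots(figsize=(8,8))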

for i in range(10000):
    #Generate two types of data
    x1 = np.random.randn(size)
    y1 = 0.5*x1 + 0.5**0.5*np.random.randn(size)
    x2 = np.random.randn(size)
    y2 = np.random.randn(size)
    
    ax.scatter(x1, y1, alpha=0.1, color='r', label='data1')
    ax.scatter(x2, y2, alpha=0.1, color='g', label='data2')
    
    ax.set_xlabel('X')
    ax.set_ylabel('Y')    
    ax.legend(loc='upper right')
    
    plt.savefig('./output/{:05}.png'.format(i))

    #Call only cla here
    #If you call clf or close, you can no longer draw on this graph afterwards
    #Conversely, without cla, each image is drawn on top of the previous ones
    plt.cla()

[Figure: graph.PNG, memory increase after the rewrite]

As long as you do not need to change the image size partway through, I think this approach can handle any kind of graph. I am not entirely sure, though, so be careful when using it.
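The reuse pattern above can be wrapped in a function. The following is only a sketch under some assumptions: the helper name save_scatter_series and its parameters are hypothetical, the Agg backend is selected on the assumption that the job runs without a display, and a temporary directory stands in for ./output. It creates the figure once, calls cla after every savefig, and closes the figure once at the end.

```python
import os
import tempfile

import matplotlib
matplotlib.use('Agg')  # assumption: batch job, no window is needed
import matplotlib.pyplot as plt
import numpy as np


def save_scatter_series(out_dir, n_images, size=10000):
    """Save n_images scatter plots while reusing one figure (hypothetical helper)."""
    os.makedirs(out_dir, exist_ok=True)
    fig, ax = plt.subplots(figsize=(8, 8))  # created only once
    paths = []
    try:
        for i in range(n_images):
            x1 = np.random.randn(size)
            y1 = 0.5 * x1 + 0.5 ** 0.5 * np.random.randn(size)
            ax.scatter(x1, y1, alpha=0.1, color='r', label='data1')
            ax.set_xlabel('X')
            ax.set_ylabel('Y')
            ax.legend(loc='upper right')
            path = os.path.join(out_dir, '{:05}.png'.format(i))
            fig.savefig(path)
            paths.append(path)
            ax.cla()  # clear the axes so the next image does not overlap
    finally:
        plt.close(fig)  # close once, after the whole batch
    return paths


# Small run into a temporary directory to check the behaviour
out_dir = tempfile.mkdtemp()
paths = save_scatter_series(out_dir, n_images=3, size=100)
```

Because labels and the legend are removed by cla, they are set again in every iteration, just as in the loop above.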
