The copy method of pandas.DataFrame is a deep copy by default

The conclusion is exactly what the title says, and it is also stated in the official documentation.

A colleague said, "The assign method of pandas makes a copy of the data frame internally, so it's slow and troublesome because it consumes memory."

At the time, having read "Recursion Substitution Eradication Committee for Python / pandas Data Processing", I was hooked on writing statistical batch processing neatly with method chains.

However, when I looked at the actual pandas code for assign,

# Comments and other methods are omitted
class DataFrame(NDFrame):

    def assign(self, **kwargs):
        data = self.copy()

        # do all calculations first...
        results = {}
        for k, v in kwargs.items():
            results[k] = com._apply_if_callable(v, data)

        # ... and then assign
        for k, v in sorted(results.items()):
            data[k] = v

        return data

My first reaction was, "Wait, isn't Python's copy method a shallow copy, as it is for dictionaries and lists?"

With a dictionary or list, copy creates only a new container. The objects inside are shared with the original, so copying does not duplicate them and does not eat up extra memory for them.

a = {'a': [1, 2, 3]}
b = a.copy()

# a and b share the same inner list object
assert a['a'] is b['a']

# Destructive changes leak into the copy!
a['a'].append(4)
print(b)
# => {'a': [1, 2, 3, 4]}
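
For contrast, here is a minimal sketch of what a deep copy does to the same dictionary, using only the standard library's copy module. The variable names are chosen for illustration.

```python
import copy

original = {'a': [1, 2, 3]}
shallow = copy.copy(original)    # equivalent to original.copy()
deep = copy.deepcopy(original)   # also duplicates the inner list

# The shallow copy shares the inner list; the deep copy owns a separate one
assert shallow['a'] is original['a']
assert deep['a'] is not original['a']

# Mutating the original's list reaches the shallow copy but not the deep copy
original['a'].append(4)
print(shallow)  # {'a': [1, 2, 3, 4]}
print(deep)     # {'a': [1, 2, 3]}
```

The deep copy is the one that costs memory, because every nested object is duplicated.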

The copy method of pandas is different: it makes a deep copy by default (deep=True). So when dealing with huge data frames, it is worth worrying about memory when calling copy (or assign, which calls it).

import pandas as pd

df_a = pd.DataFrame({'a': [1, 2, 3]})
df_b = df_a.copy()

# The columns of df_a and df_b are different objects!
assert df_a['a'] is not df_b['a']
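
An identity check on the column objects cannot by itself tell a deep copy from a shallow one, because a shallow copy also hands back new Series objects. A more direct check, sketched below, asks NumPy whether the underlying buffers overlap. The names df_deep and df_shallow are illustrative, and the sketch assumes a pandas version that provides to_numpy.

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame({'a': [1, 2, 3]})
df_deep = df_a.copy()               # deep=True is the default
df_shallow = df_a.copy(deep=False)  # new frame object over the same data

# Ask NumPy whether each copy's column buffer overlaps the original's
deep_shares = np.shares_memory(df_a['a'].to_numpy(), df_deep['a'].to_numpy())
shallow_shares = np.shares_memory(df_a['a'].to_numpy(), df_shallow['a'].to_numpy())

print('deep copy shares memory:   ', deep_shares)
print('shallow copy shares memory:', shallow_shares)
```

If the default copy really is deep, the first result should be False and the second True.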

Half jokingly, I told my colleague that this would not be a problem in Haskell: with no destructive changes, a shallow copy would be enough.

The following is my colleague's comment. I would like to know if there is a writing style that is both easy to understand and memory-efficient.

[A caution on eradicating recursive assignment in pandas] I think assign or pipe is often used to avoid recursive assignment, but be aware that assign copies the df itself, so it becomes considerably slower. pipe, on the other hand, does not copy, so it is fine.

assign: https://github.com/pandas-dev/pandas/blob/v0.20.3/pandas/core/frame.py#L2492
pipe: https://github.com/pandas-dev/pandas/blob/v0.19.2/pandas/core/generic.py#L2698-L2708
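
The claim above can be checked roughly on whatever pandas version is installed. The sketch below uses a hypothetical helper, remember, to capture the frame that pipe passes along, and then compares memory with np.shares_memory. The pipe result should be True. The assign result depends on the version: an eager deep copy, as in the code linked above, gives False, while newer versions with Copy-on-Write may, as far as I understand, defer the copy.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3]})

seen = []

def remember(frame):
    # Hypothetical helper: record the frame that pipe hands over, then pass it on
    seen.append(frame)
    return frame

piped = df.pipe(remember)
assigned = df.assign(b=lambda d: d['a'] * 2)

# Compare the buffer of column 'a' in each result with the original's buffer
pipe_shares = np.shares_memory(df['a'].to_numpy(), seen[0]['a'].to_numpy())
assign_shares = np.shares_memory(df['a'].to_numpy(), assigned['a'].to_numpy())

print('pipe shares memory with the original:  ', pipe_shares)
print('assign shares memory with the original:', assign_shares)  # version-dependent
```

Either way, assign leaves the original frame untouched and returns a new frame with the extra column.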

That said, I do think assign is easy to read as an interface, so the best I can do is probably one of the following:

  • Use as few assign calls as possible (adding several columns in a single assign is fine, because the copy happens only once per call)
  • Accept the trade-off and use recursive assignment
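
The first option can be sketched as follows. The column names (price, qty, total, doubled) are made up for illustration, and the new columns are kept independent of each other so that the result does not depend on the order in which a given pandas version evaluates the keyword arguments.

```python
import pandas as pd

df = pd.DataFrame({'price': [100, 200, 300], 'qty': [1, 2, 3]})

# Two chained assign calls: per the assign source, each call starts with its own copy
chained = (
    df
    .assign(total=lambda d: d['price'] * d['qty'])
    .assign(doubled=lambda d: d['price'] * 2)
)

# One assign call with both columns: a single copy covers both additions
single = df.assign(
    total=lambda d: d['price'] * d['qty'],
    doubled=lambda d: d['price'] * 2,
)

# Both styles produce the same values
assert chained['total'].tolist() == single['total'].tolist()
assert chained['doubled'].tolist() == single['doubled'].tolist()
```

The chain stays readable, and the number of copies drops from one per column to one per assign call.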

By the way, I also tried narrowing down the columns before passing the frame to assign and then using concat to join the returned data frame back onto the original, but that turned out to be considerably slower, so it is not a good option either.
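
One possible reading of that attempt is sketched below. This is a guess at the approach rather than the exact code that was measured, and the column names are made up.

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 3], 'x': ['p', 'q', 'r'], 'y': [0.1, 0.2, 0.3]})

# Select only the column the calculation needs, so assign copies a narrow frame
narrow = df[['a']].assign(b=lambda d: d['a'] * 2)

# Attach the new column back to the full frame; concat itself builds another frame
result = pd.concat([df, narrow[['b']]], axis=1)

print(result.columns.tolist())  # ['a', 'x', 'y', 'b']
```

Since concat constructs a new frame of its own, any saving from the narrower copy can easily be lost again.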
