Pandas DataFrames are convenient, but I wasn't sure how they manage memory. I was curious about where and how the data is actually laid out, so I looked into it.
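import pandas as pd

df = pd.DataFrame({'A': [1, 2], 'B': [3.0, 4.0], 'C': [5, 6]})

# df._data is the internal BlockManager (a private API)
for block in df._data.blocks:
    # Address of the first byte of the block's underlying NumPy buffer
    memory_address = block.values.__array_interface__['data'][0]
    # Raw bytes of the buffer as a hex string
    memory_hex = block.values.data.hex()
    print(f"({id(block)}) {block}")
    print(f"<{memory_address}> {memory_hex}")
    print()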
(4886642416) FloatBlock: slice(1, 2, 1), 1 x 2, dtype: float64
<140474854679968> 00000000000008400000000000001040
(4886642608) IntBlock: slice(0, 4, 2), 2 x 2, dtype: int64
<140474585659872> 0100000000000000020000000000000005000000000000000600000000000000
The number in angle brackets is the memory address, and the string after it is a hexadecimal dump of the memory contents. Because columns A and C both hold int64 values, they are stored together in a single contiguous block of memory: 1, 2 (column A) followed by 5, 6 (column C). So that's how it works.
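As a sanity check on reading these dumps, the hex strings above can be decoded back into numbers with NumPy. This is a small sketch that assumes the dumps come from a little-endian machine (hence the explicit little-endian dtypes); the variable names are only for illustration.

```python
import numpy as np

# Hex dumps copied from the output above
int_hex = "0100000000000000020000000000000005000000000000000600000000000000"
float_hex = "00000000000008400000000000001040"

# Interpret the raw bytes as little-endian 64-bit integers / floats
int_values = np.frombuffer(bytes.fromhex(int_hex), dtype="<i8")
float_values = np.frombuffer(bytes.fromhex(float_hex), dtype="<f8")

print(int_values.tolist())    # [1, 2, 5, 6]
print(float_values.tolist())  # [3.0, 4.0]
```

The integer dump holds column A (1, 2) followed by column C (5, 6), and the float dump holds column B (3.0, 4.0).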
The DataFrame manages its data in blocks through a class called BlockManager. The idea behind this is explained clearly in the article "[A Roadmap for Rich Scientific Data Structures in Python](https://wesmckinney.com/blog/a-roadmap-for-rich-scientific-data-structures-in-python/)" by the author of pandas.
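If you follow the types of the variables that appear in the code above, df._data is a BlockManager, each element of df._data.blocks is a Block (here a FloatBlock or an IntBlock), and block.values is a NumPy ndarray.
In other words, each block holds a NumPy ndarray.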
So from here on we are in the world of NumPy. As described in "2.2. Advanced NumPy" of the Scipy lecture notes, you can get the memory address with ndarray.__array_interface__['data'][0]. And since ndarray.data gives you a memoryview, you can also look at the memory contents.
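The two attributes can be tried on a plain ndarray without pandas. A minimal sketch, assuming a small int64 array created just for illustration (the little-endian dtype is spelled out so the byte order of the dump is predictable):

```python
import numpy as np

arr = np.array([1, 2], dtype="<i8")  # little-endian int64, stated explicitly

# Address of the first byte of the array's data buffer
address = arr.__array_interface__["data"][0]

# arr.data is a memoryview over the same buffer; tobytes() copies its contents
memory_hex = arr.data.tobytes().hex()

print(f"<{address}> {memory_hex}")
```

The address differs from run to run, but the hex part should read 01000000000000000200000000000000.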
Note that when you print the memoryview, it is displayed as something like <memory at 0x11b6a3ad0>, but this is the address of the memoryview instance itself, which is different from the address of the data. For more information, see "[Numpy, Python3.6 - not able to understand why address is different? - Stack Overflow](https://stackoverflow.com/questions/52032545/numpy-python3-6-not-able-to-understand-why-address-is-different)".
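The difference between the two addresses can be seen side by side. A minimal sketch, assuming CPython (where id() returns an object's address) and an array created only for illustration:

```python
import numpy as np

arr = np.array([1, 2], dtype="<i8")
view = arr.data  # a memoryview object wrapping the array's buffer

data_address = arr.__array_interface__["data"][0]

print(view)               # <memory at 0x...>: describes the memoryview object
print(hex(id(view)))      # in CPython, the address of that Python object
print(hex(data_address))  # the address of the actual array data
```

The first two lines refer to the wrapper object, while the last one is where the numbers themselves live.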
Let's experiment with how the memory allocation changes by doing some simple data frame operations.
df1 = df[0:1]
(4886726416) FloatBlock: slice(1, 2, 1), 1 x 1, dtype: float64
<140474854679968> 0000000000000840
(4886727088) IntBlock: slice(0, 4, 2), 2 x 1, dtype: int64
<140474585659872> 01000000000000000500000000000000
First, a slice of the first row. You can see that the memory address has not changed and only the referenced range has become shorter. The block instance itself is a new object.
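The same behavior can be reproduced with NumPy alone, since slicing an ndarray returns a view. The sketch below uses a 2 x 2 array as a stand-in for the IntBlock values (one array row per DataFrame column); this layout is inferred from the output above, not taken from the pandas source.

```python
import numpy as np

# Stand-in for the IntBlock values: one row per column (A, C),
# one column per DataFrame row
values = np.array([[1, 2], [5, 6]], dtype="<i8")

first_row = values[:, 0:1]  # analogous to df[0:1]: a view, not a copy

base_address = values.__array_interface__["data"][0]
slice_address = first_row.__array_interface__["data"][0]

print(slice_address == base_address)        # True: same starting address
print(np.shares_memory(values, first_row))  # True: no data was copied
print(first_row.tobytes().hex())            # 01000000000000000500000000000000
```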
df2 = df[1:2]
(4886798416) FloatBlock: slice(1, 2, 1), 1 x 1, dtype: float64
<140474854679976> 0000000000001040
(4886798896) IntBlock: slice(0, 4, 2), 2 x 1, dtype: int64
<140474585659880> 02000000000000000600000000000000
This is a slice of the second row. Every memory address is shifted by +8 (the size of one int64 or float64 element), so you can see that the slice refers to the same memory block with only the pointer moved.
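The +8 is simply the stride of one element. A minimal NumPy sketch, again using a 2 x 2 int64 array as an assumed stand-in for the IntBlock values:

```python
import numpy as np

values = np.array([[1, 2], [5, 6]], dtype="<i8")
second_row = values[:, 1:2]  # analogous to df[1:2]

base_address = values.__array_interface__["data"][0]
slice_address = second_row.__array_interface__["data"][0]

# Moving one element along the last axis advances the pointer by strides[1] bytes
offset = slice_address - base_address
print(offset, values.strides[1], values.itemsize)  # 8 8 8
print(second_row.tobytes().hex())  # 02000000000000000600000000000000
```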
df['D'] = [True, False]
(4886642416) FloatBlock: slice(1, 2, 1), 1 x 2, dtype: float64
<140474854679968> 00000000000008400000000000001040
(4886642608) IntBlock: slice(0, 4, 2), 2 x 2, dtype: int64
<140474585659872> 0100000000000000020000000000000005000000000000000600000000000000
(4886800144) BoolBlock: slice(3, 4, 1), 1 x 2, dtype: bool
<140474855093504> 0100
Next, I added a column. For the existing columns, neither the memory address nor the block instance changes.
df3 = df.append(df)  # DataFrame.append was removed in pandas 2.0; pd.concat([df, df]) is the modern equivalent
(4886726224) IntBlock: slice(0, 1, 1), 1 x 4, dtype: int64
<140474855531008> 0100000000000000020000000000000001000000000000000200000000000000
(4509301648) FloatBlock: slice(1, 2, 1), 1 x 4, dtype: float64
<140474585317312> 0000000000000840000000000000104000000000000008400000000000001040
(4509301840) IntBlock: slice(2, 3, 1), 1 x 4, dtype: int64
<140474585630688> 0500000000000000060000000000000005000000000000000600000000000000
(4509301552) BoolBlock: slice(3, 4, 1), 1 x 4, dtype: bool
<140474855008224> 01000100
Next, I concatenated rows. The memory layout has changed drastically, and there are now two IntBlocks. This causes fragmentation, so it would be nice if they were merged at an appropriate time.
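Unlike slicing, concatenation cannot reuse the existing buffers, because the result needs one larger contiguous array. A minimal NumPy sketch of that point, using a 1 x 2 array as an assumed stand-in for a single block:

```python
import numpy as np

a = np.array([[1, 2]], dtype="<i8")        # stand-in for one block's values
combined = np.concatenate([a, a], axis=1)  # append the same rows again

print(combined.tolist())              # [[1, 2, 1, 2]]
print(np.shares_memory(a, combined))  # False: the result is a fresh copy
```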
df4 = df3._consolidate()
(4509301552) BoolBlock: slice(3, 4, 1), 1 x 4, dtype: bool
<140474855008224> 01000100
(4509301648) FloatBlock: slice(1, 2, 1), 1 x 4, dtype: float64
<140474585317312> 0000000000000840000000000000104000000000000008400000000000001040
(4886728240) IntBlock: slice(0, 4, 2), 2 x 4, dtype: int64
<140475125920528> 01000000000000000200000000000000010000000000000002000000000000000500000000000000060000000000000005000000000000000600000000000000
When I called the private method _consolidate(), the Int values were grouped together and placed at a new memory address.
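The idea of consolidation can be sketched in plain NumPy: group arrays by dtype and stack each group into one new array. This is not how pandas implements it; consolidate_by_dtype is a hypothetical helper written for illustration, and the four arrays are stand-ins for the blocks of df3.

```python
import numpy as np
from collections import defaultdict

def consolidate_by_dtype(blocks):
    """Hypothetical helper: merge 2-D arrays that share a dtype into one array each."""
    groups = defaultdict(list)
    for block in blocks:
        groups[block.dtype].append(block)
    # np.vstack allocates a new buffer and copies each group into it
    return [np.vstack(group) for group in groups.values()]

# Stand-ins for the four blocks of df3 (columns A, B, C, D)
blocks = [
    np.array([[1, 2, 1, 2]], dtype="<i8"),
    np.array([[3.0, 4.0, 3.0, 4.0]], dtype="<f8"),
    np.array([[5, 6, 5, 6]], dtype="<i8"),
    np.array([[True, False, True, False]], dtype=bool),
]

merged = consolidate_by_dtype(blocks)
int_block = next(b for b in merged if b.dtype == np.dtype("<i8"))
print(len(merged), int_block.shape)  # 3 (2, 4)
print(int_block.tobytes().hex())
```

The hex dump of int_block should match the 2 x 4 IntBlock shown above. One difference from the pandas output: this sketch copies every group, whereas the Bool and Float blocks above kept their original addresses.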