When you want to use data created as a Spark DataFrame from ordinary Python code, you can convert it to a pandas DataFrame with the toPandas() method, but that conversion often fails with a memory error.
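For reference, a minimal sketch of the conversion that can fail; the SparkSession variable and the source path are assumptions for illustration:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A large Spark DataFrame (the source path here is hypothetical)
df = spark.read.parquet("/data/large_table.parquet")

# toPandas() collects the entire dataset onto the driver,
# which is where the memory error tends to occur
pandas_df = df.toPandas()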
Through trial and error, I have summarized the approaches that seemed effective for getting the data to fit in memory.
There is probably a better way, so if you know one, please let me know!
The conversion done by Spark is limited by spark.driver.memory and spark.driver.maxResultSize, whereas dask is not, so going through dask makes the error easy to avoid.
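If you want to stay with toPandas(), one option is to raise those two settings when the session is created. A sketch, assuming the SparkSession is built inside the script itself (the sizes are arbitrary placeholders, and spark.driver.memory only takes effect if set before the driver JVM starts):

from pyspark.sql import SparkSession

# Raise the driver heap and the collected-result limit;
# tune these placeholder values to your machine
spark = (
    SparkSession.builder
    .config("spark.driver.memory", "8g")
    .config("spark.driver.maxResultSize", "4g")
    .getOrCreate()
)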
Conversion using dask
import dask.dataframe as dd

# Write the Spark DataFrame out as parquet (parquet_path is any writable location)
df.write.parquet(parquet_path)

# Read it back with dask, then materialize it as a pandas DataFrame
dask_df = dd.read_parquet(parquet_path)
pandas_df = dask_df.compute()
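This route sidesteps the driver-side limits because the parquet files are written out by the executors and dask reads them straight from disk, so the data never has to pass through the Spark driver as one collected result.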
You can also reduce the byte width of each column by changing its data type.
Change data type
# For example, convert int32 columns (4 bytes) to int8 (1 byte)
dask_df = dask_df.astype({k: 'int8' for k in dask_df.dtypes[dask_df.dtypes == 'int32'].index})
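Note that int8 only holds values from -128 to 127, so this downcast is safe only when the column values fit in that range. A small sketch to check the effect, assuming the pandas_df obtained above:

# Compare memory usage before and after the downcast
before = pandas_df.memory_usage(deep=True).sum()
small_df = pandas_df.astype({k: 'int8' for k in pandas_df.dtypes[pandas_df.dtypes == 'int32'].index})
after = small_df.memory_usage(deep=True).sum()
print(f'{before} bytes -> {after} bytes')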