Working with 3D data structures in pandas

Handling 3D data

The main data structures in pandas are the one-dimensional Series (pandas.Series.html) and the two-dimensional, tabular DataFrame (http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.html). These are the main objects in pandas and are also covered in detail in Python for Data Analysis.

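There is actually a third major object: the three-dimensional Panel, which appears in Intro to Data Structures (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Panel.html). Note that Panel was deprecated in pandas 0.20 and removed in 0.25, so the Panel code in this article needs an older pandas release.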

This three-dimensional data structure is useful, for example, when you want to pull arbitrary values out of daily tables and run statistical analysis on time-series logs.

Create a Panel object

A Panel can be created from a dictionary of DataFrames or from a 3D ndarray. Let's try it with a concrete example.

import numpy as np
import pandas as pd
rng = pd.date_range('1/1/2014',periods=100,freq='D')

#Create data frames of random numbers with columns A to D, indexed by date
df1 = pd.DataFrame(np.random.randn(100, 4), index = rng, columns = ['A','B','C','D'])
df2 = pd.DataFrame(np.random.randn(100, 4), index = rng, columns = ['A','B','C','D'])
df3 = pd.DataFrame(np.random.randn(100, 4), index = rng, columns = ['A','B','C','D'])

#Create a Panel object by combining these data frames
pf = pd.Panel({'df1':df1,'df2':df2,'df3':df3})

pf
#=>
# <class 'pandas.core.panel.Panel'>
# Dimensions: 3 (items) x 100 (major_axis) x 4 (minor_axis)
# Items axis: df1 to df3
# Major_axis axis: 2014-01-01 00:00:00 to 2014-04-10 00:00:00
# Minor_axis axis: A to D

This creates the Panel object. Its three dimensions are called items, major_axis, and minor_axis.

See the documentation (http://pandas.pydata.org/pandas-docs/stable/generated/pandas.Panel.html) for the methods this object provides.
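Because Panel is no longer available from pandas 0.25 onward, the following is a minimal sketch of two possible stand-ins on newer versions: a DataFrame with a two-level row index built by pd.concat, and a plain 3D ndarray built by np.stack. The names dates, gen, frames, stacked, and cube are chosen only for illustration and are not part of the original program.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2014-01-01", periods=100, freq="D")
gen = np.random.default_rng(0)  # seeded so the sketch is reproducible

# Three tables with the same shape, as in the Panel example
frames = {
    name: pd.DataFrame(gen.standard_normal((100, 4)), index=dates, columns=list("ABCD"))
    for name in ("df1", "df2", "df3")
}

# Stand-in 1: one DataFrame whose row index has two levels (item, date)
stacked = pd.concat(frames, names=["item", "date"])
print(stacked.shape)  # (300, 4)

# Stand-in 2: a 3D ndarray with axes (items, major_axis, minor_axis)
cube = np.stack([df.to_numpy() for df in frames.values()])
print(cube.shape)  # (3, 100, 4)
```

The stacked form keeps the date and column labels, while the ndarray form keeps only the values, so which one fits depends on whether label-based access is needed.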

Key operations on the Panel object

The most common operation is probably access by index.

pf.ix[0] #Access df1 by position
pf.ix[1] #Access df2 by position
pf['df1'] #Access df1 by key

In this way, you can access each table held in the Panel.
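On pandas versions without Panel, a similar per-table lookup can be sketched against a DataFrame with an (item, date) row index. This assumes the tables were combined with pd.concat; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2014-01-01", periods=100, freq="D")
gen = np.random.default_rng(0)
frames = {
    name: pd.DataFrame(gen.standard_normal((100, 4)), index=dates, columns=list("ABCD"))
    for name in ("df1", "df2", "df3")
}
stacked = pd.concat(frames, names=["item", "date"])

# Access by key, comparable to pf['df1']
by_label = stacked.loc["df1"]

# Access by position, comparable to pf.ix[0]: look up the first item label first
first_item = stacked.index.get_level_values("item").unique()[0]
by_position = stacked.xs(first_item, level="item")

print(by_label.shape)  # (100, 4)
```

Both lookups return an ordinary DataFrame indexed by date, so the usual DataFrame methods apply to the result.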

#Add new column to table
pf['df1']['E'] = pd.DataFrame(np.random.randn(100, 1), index = rng)
pf['df2']['E'] = pd.DataFrame(np.random.randn(100, 1), index = rng)

#Check the data structure
pf.shape
#=> (3, 100, 4)

#Access the last 10 rows of column E in table df1
pf.ix['df1',-10:,'E']
#=>
# 2014-04-01   -1.623615
# 2014-04-02    1.878481
# 2014-04-03   -0.890555
# 2014-04-04    0.736037
# 2014-04-05   -1.451665
# 2014-04-06    0.126473
# 2014-04-07    0.997485
# 2014-04-08   -1.252981
# 2014-04-09   -1.136791
# 2014-04-10   -1.873199
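The same steps (add a column E to some tables, then read its last 10 rows for df1) can be sketched without Panel as follows. This is an assumption-laden equivalent, not the original code: the column is added to the individual DataFrames before they are combined, so tables that lack E end up with NaN in that column.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2014-01-01", periods=100, freq="D")
gen = np.random.default_rng(0)
frames = {
    name: pd.DataFrame(gen.standard_normal((100, 4)), index=dates, columns=list("ABCD"))
    for name in ("df1", "df2", "df3")
}

# Add column E to df1 and df2 only, as in the Panel example
frames["df1"]["E"] = gen.standard_normal(100)
frames["df2"]["E"] = gen.standard_normal(100)

stacked = pd.concat(frames, names=["item", "date"])
print(stacked.shape)  # (300, 5)

# Last 10 rows of column E in table df1
last10 = stacked.loc["df1"]["E"].tail(10)
print(last10)
```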

A Panel can also be converted to a stacked DataFrame with to_frame(). Statistical functions can be applied to this stacked DataFrame, and it can be converted back to the original Panel with to_panel().

pf.to_frame().to_panel()
#=>
# <class 'pandas.core.panel.Panel'>
# Dimensions: 3 (items) x 100 (major_axis) x 4 (minor_axis)
# Items axis: df1 to df3
# Major_axis axis: 2014-01-01 00:00:00 to 2014-04-10 00:00:00
# Minor_axis axis: A to D
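For readers on a pandas version without Panel, here is a rough sketch of a layout similar to what to_frame() returns: rows are (date, column) pairs and each table becomes one column. It is built from a dictionary of DataFrames with DataFrame.stack, which is a substitute technique, and the level names major and minor are chosen here only to mirror the Panel terminology.

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2014-01-01", periods=100, freq="D")
gen = np.random.default_rng(0)
frames = {
    name: pd.DataFrame(gen.standard_normal((100, 4)), index=dates, columns=list("ABCD"))
    for name in ("df1", "df2", "df3")
}

# One column per table; each row is a (date, column name) pair
long = pd.DataFrame({name: df.stack() for name, df in frames.items()})
long.index.names = ["major", "minor"]
print(long.shape)  # (400, 3)

# Statistical functions work directly on the stacked layout
print(long.mean())
summary = long.describe()

# Going back: unstack the inner level to recover one table
df1_again = long["df1"].unstack("minor")
```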

Use Panel to analyze log data

Suppose your application writes log files to a directory every day, for example through Fluentd. To analyze these logs across dates, it is convenient to load each day's data as a table and stack the tables into a three-dimensional data structure, which lets you analyze them as a time series.

As a sample, I will rewrite and adapt the program for listing the files in a directory that appeared in an earlier article.

import sys
import os
import pandas as pd

def list_files(path):
    #Collect one DataFrame per log file, keyed by file name
    dic = {}
    for root, dirs, files in os.walk(path):
        for filename in files:
            fullname = os.path.join(root, filename)
            if filename.startswith("fluent") \
               and filename.endswith(".log"):
                try:
                    print("Reading: %(filename)s" % locals())
                    #Use the full path so files in subdirectories are found
                    df = pd.read_table(fullname, header=None)
                    dic[filename] = df
                except pd.parser.CParserError:
                    print("Skip: %(filename)s" % locals())
    return pd.Panel(dic)

The Panel object returned by this function is a three-dimensional data structure that collects multiple log files, so you can apply statistical functions to it to analyze the time-series data.
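On pandas versions without Panel, the same idea can be sketched by returning one DataFrame with a (file, row) index instead. Everything here beyond the fluent*.log naming rule is an assumption made for illustration: the function name load_logs, the use of relative paths as keys, and the made-up log lines in the demo.

```python
import os
import tempfile

import pandas as pd


def load_logs(path):
    """Read every fluent*.log file under `path` into one DataFrame indexed by (file, row)."""
    tables = {}
    for root, _dirs, files in os.walk(path):
        for filename in sorted(files):
            if not (filename.startswith("fluent") and filename.endswith(".log")):
                continue
            fullname = os.path.join(root, filename)
            # Relative path as key, so equal file names in different subdirectories do not collide
            key = os.path.relpath(fullname, path)
            try:
                tables[key] = pd.read_csv(fullname, sep="\t", header=None)
            except (pd.errors.ParserError, pd.errors.EmptyDataError):
                print(f"Skip: {fullname}")
    if not tables:
        return pd.DataFrame()  # nothing matched
    return pd.concat(tables, names=["file", "row"])


# Small demo with hypothetical tab-separated log lines
with tempfile.TemporaryDirectory() as tmp:
    for day, value in [("20140101", 3), ("20140102", 5)]:
        with open(os.path.join(tmp, f"fluent.{day}.log"), "w") as fh:
            fh.write(f"{day}\tinfo\t{value}\n{day}\twarn\t{value + 1}\n")
    with open(os.path.join(tmp, "other.txt"), "w") as fh:
        fh.write("ignored\n")
    logs = load_logs(tmp)

# Sum the third column per file
per_file_total = logs.groupby(level="file")[2].sum()
print(per_file_total)
```

Grouping on the file level gives per-day statistics, which plays the role that statistics along the items axis play for a Panel.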

Summary

You can use a Panel in pandas to work with 3D data structures. Adding a third dimension on top of the usual rows and columns makes it useful for time-series data analysis, such as comparing daily tables across dates.
