Add totals to rows and columns in pandas

I will write it as a memorandum when using * pandas *. That's what the answer here says.

python - Appending column totals to a Pandas DataFrame - Stack Overflow

df = pd.concat([df,pd.DataFrame(df.sum(axis=0),columns=['Grand Total']).T])
df = pd.concat([df,pd.DataFrame(df.sum(axis=1),columns=['Total'])],axis=1)

The basics of * pandas * are summarized in the official manual so that it can be read in about 10 minutes.

10 Minutes to pandas — pandas 0.14.1 documentation

As test data, we will use past weather information that can be downloaded from the Japan Meteorological Agency. The weather conditions for Tokyo for one year (January-December 2013) are totaled monthly and the total of rows and columns is calculated.

Japan Meteorological Agency | Past Meteorological Data Download

気象庁過去の気象データ・ダウンロード.png

Environmental setting

Make the independent Python library available in pyvenv.

$ PYVENV_DIR=$HOME/.pyvenv/data-analytics
$ pyvenv-3.4 $PYVENV_DIR
$ source $PYVENV_DIR/bin/activate
$ pip install pandas

Let's call the file downloaded from the Japan Meteorological Agency weather-tokyo-2013.csv. As it is difficult to handle with * pandas * as it is downloaded, pre-processing is performed.

Converts the character code to UTF-8.
Keep the header with multiple lines on one line.
Rewrite the header line to the English name.
Set the date column to "day".
Set the weather condition (daytime) column to "weather_day".

Check data reading with pandas

>>> import pandas as pd

# Read CSV file, whose "day" column is parsed with `dateutil.parser.parse`.
>>> fname = 'weather-tokyo-2013.csv'
>>> df = pd.read_csv(fname, parse_dates=['day'])

# Check DataFrame overview.
>>> df.dtypes
>>> df.head(3)
>>> df.describe()

I was able to confirm that the CSV file was loaded into * DataFrame *.

Then extract the "month" part from the "day" column and add it to the "month" column. The "day" column is recognized as a datetime when reading the CSV, so you can use the datetime API.

>>> df['month'] = df['day'].apply(lambda x: x.month)

Aggregate by groupby and add row and column total value

DataFrame * has a groupby method that provides SQL" GROUP BY "like behavior. This time, I will add up the number of days by "weather overview (daytime)" and "month". For example, you can see the number of sunny days per month.

>>> grouped = df.groupby(['weather_day', 'month'])['day'].count()
>>> type(grouped)
>>> grouped.index
>>> grouped.head(3)

The return value of groupby is * Series *, so convert it to * DataFrame * with the ʻunstack` method. This will allow you to calculate the row and column totals in the opening process, in this case the monthly totals for the weather overview.

>>> monthly = grouped.unstack()

# Add "Total" column on each row.
>>> monthly = pd.concat([monthly,
                         pd.DataFrame(monthly.sum(axis=1), columns=['Total'])],
                        axis=1)
# Add "Grand Total" row on each column.
>>> monthly = pd.concat([monthly,
                         pd.DataFrame(monthly.sum(axis=0), columns=['Grand Total']).T])

In addition to simple sums, you can also filter by row or column name to add up. If you think of a day that includes "rain" as a rainy day and a day that includes "fine" as a sunny day, you can use the index name to calculate as follows. (Since the actual data includes weather such as "sunny temporary rain", there are days when it is duplicated.)

>>> monthly = pd.concat([monthly,
                         pd.DataFrame(monthly[monthly.index.map(lambda x: 'rain' in x)].sum(axis=0), columns=['Rainy']).T])
>>> monthly = pd.concat([monthly,
                         pd.DataFrame(monthly[monthly.index.map(lambda x: 'Fine' in x)].sum(axis=0), columns=['Fine']).T])

Finally, write the aggregation result to a text file. Headers, index names, etc. can be controlled with optional arguments. For example, specify the delimiter with * sep *.

>>> fname = 'weather-tokyo-2013-monthly-weather_name.csv'
>>> monthly.to_csv(fname, sep=',')

After that, check the result with spreadsheet software such as Excel.

Draw a graph using ipython

You can open the data file in Excel and draw the graph, but you can also draw in Python using * matplotlib *. If you want to use it for a while, use * ipython *. It will perform the chores of graph drawing processing behind the scenes.

$ pip install matplotlib ipython
$ ipython

You are now in the IPython interactive shell. Read the data you just exported and draw a graph of the monthly transition between sunny and rainy days.


In [1]: import pandas as pd

In [2]: pd.options.display.mpl_style = 'default'

In [3]: fname = 'weather-tokyo-2013-monthly-weather_name.csv'

In [4]: df = pd.read_csv(fname, index_col=0)

In [5]: df.T[['Fine', 'Rainy']][:12].plot(color=['r', 'b']).set_xlabel('Month')
Out[5]: <matplotlib.text.Text at 0x106cb89e8>

The options on the second line are passed by * pandas * to * matplotlib *.
- The "T" in DataFrame * stands for transpose (), which is the transpose of rows and columns.
- pandas.DataFrame.transpose — pandas 0.14.1 documentation
Since the total value is given to the row and column in the previous process, the data up to December is sliced.
You are passing the color argument toplot ()to specify the color for each series. Since the order of column specification is matched, Fine is "r" = Red and Rainy is "b" = Blue.

A graph like this should be drawn in a separate window.

月別の天候変化_東京.png

Since June is the rainy season, it rains a lot (less than half the number of days), and you can see that there are almost no rainy days in January. The weather changes easily in October, but it seems that there are still more sunny days than rainy days. In addition, since the rough aggregation conditions are easy to implement and the fluctuations are easy to understand, when performing detailed analysis, let's aggregate the values of the weather overview after categorizing them.