I will write it as a memorandum when using * pandas *. That's what the answer here says.
df = pd.concat([df,pd.DataFrame(df.sum(axis=0),columns=['Grand Total']).T])
df = pd.concat([df,pd.DataFrame(df.sum(axis=1),columns=['Total'])],axis=1)
The basics of * pandas * are summarized in the official manual so that it can be read in about 10 minutes.
As test data, we will use past weather information that can be downloaded from the Japan Meteorological Agency. The weather conditions for Tokyo for one year (January-December 2013) are totaled monthly and the total of rows and columns is calculated.
Japan Meteorological Agency | Past Meteorological Data Download
Make the independent Python library available in pyvenv
.
$ PYVENV_DIR=$HOME/.pyvenv/data-analytics
$ pyvenv-3.4 $PYVENV_DIR
$ source $PYVENV_DIR/bin/activate
$ pip install pandas
Let's call the file downloaded from the Japan Meteorological Agency weather-tokyo-2013.csv
. As it is difficult to handle with * pandas * as it is downloaded, pre-processing is performed.
>>> import pandas as pd
# Read CSV file, whose "day" column is parsed with `dateutil.parser.parse`.
>>> fname = 'weather-tokyo-2013.csv'
>>> df = pd.read_csv(fname, parse_dates=['day'])
# Check DataFrame overview.
>>> df.dtypes
>>> df.head(3)
>>> df.describe()
I was able to confirm that the CSV file was loaded into * DataFrame *.
Then extract the "month" part from the "day" column and add it to the "month" column. The "day" column is recognized as a datetime when reading the CSV, so you can use the datetime API.
>>> df['month'] = df['day'].apply(lambda x: x.month)
groupby
method that provides SQL" GROUP BY "like behavior. This time, I will add up the number of days by "weather overview (daytime)" and "month". For example, you can see the number of sunny days per month.>>> grouped = df.groupby(['weather_day', 'month'])['day'].count()
>>> type(grouped)
>>> grouped.index
>>> grouped.head(3)
The return value of groupby
is * Series *, so convert it to * DataFrame * with the ʻunstack` method. This will allow you to calculate the row and column totals in the opening process, in this case the monthly totals for the weather overview.
>>> monthly = grouped.unstack()
# Add "Total" column on each row.
>>> monthly = pd.concat([monthly,
pd.DataFrame(monthly.sum(axis=1), columns=['Total'])],
axis=1)
# Add "Grand Total" row on each column.
>>> monthly = pd.concat([monthly,
pd.DataFrame(monthly.sum(axis=0), columns=['Grand Total']).T])
In addition to simple sums, you can also filter by row or column name to add up. If you think of a day that includes "rain" as a rainy day and a day that includes "fine" as a sunny day, you can use the index name to calculate as follows. (Since the actual data includes weather such as "sunny temporary rain", there are days when it is duplicated.)
>>> monthly = pd.concat([monthly,
pd.DataFrame(monthly[monthly.index.map(lambda x: 'rain' in x)].sum(axis=0), columns=['Rainy']).T])
>>> monthly = pd.concat([monthly,
pd.DataFrame(monthly[monthly.index.map(lambda x: 'Fine' in x)].sum(axis=0), columns=['Fine']).T])
Finally, write the aggregation result to a text file. Headers, index names, etc. can be controlled with optional arguments. For example, specify the delimiter with * sep *.
>>> fname = 'weather-tokyo-2013-monthly-weather_name.csv'
>>> monthly.to_csv(fname, sep=',')
After that, check the result with spreadsheet software such as Excel.
You can open the data file in Excel and draw the graph, but you can also draw in Python using * matplotlib *. If you want to use it for a while, use * ipython *. It will perform the chores of graph drawing processing behind the scenes.
$ pip install matplotlib ipython
$ ipython
You are now in the IPython interactive shell. Read the data you just exported and draw a graph of the monthly transition between sunny and rainy days.
In [1]: import pandas as pd
In [2]: pd.options.display.mpl_style = 'default'
In [3]: fname = 'weather-tokyo-2013-monthly-weather_name.csv'
In [4]: df = pd.read_csv(fname, index_col=0)
In [5]: df.T[['Fine', 'Rainy']][:12].plot(color=['r', 'b']).set_xlabel('Month')
Out[5]: <matplotlib.text.Text at 0x106cb89e8>
transpose ()
, which is the transpose of rows and columns.color
argument toplot ()
to specify the color for each series. Since the order of column specification is matched, Fine is "r" = Red and Rainy is "b" = Blue.A graph like this should be drawn in a separate window.
Since June is the rainy season, it rains a lot (less than half the number of days), and you can see that there are almost no rainy days in January. The weather changes easily in October, but it seems that there are still more sunny days than rainy days. In addition, since the rough aggregation conditions are easy to implement and the fluctuations are easy to understand, when performing detailed analysis, let's aggregate the values of the weather overview after categorizing them.
Recommended Posts