Data analysis is popular these days, so this article walks through a simple analysis with sample code.
The execution environment is Python 3.
In this article we will do the following:
- Read a CSV file
- Do a simple column conversion
- Aggregate and plot the data from several perspectives
Seaborn is used for drawing.
Seaborn: statistical data visualization
The data to be analyzed is as follows.
target.csv
datetime,id,value
20170606121314,1,2
20170606121315,1,3
20170606121316,1,4
20170608121616,1,4
20170608121617,1,1
20170608121618,1,2
20170606121540,2,10
20170606121541,2,8
20170606121542,2,11
20170608121543,2,4
20170606134002,3,21
20170606134003,3,10
20170606134004,3,4
20170608134005,3,50
The datetime column is a string of year, month, day, hour, minute, and second (YYYYMMDDhhmmss).
Also assume that, for each id, a value is recorded once per second over a span of a few seconds.
python
import pandas as pd
# Read the CSV file
df = pd.read_csv("target.csv", sep=",")
# Set the column names explicitly
df.columns = ["datetime", "id", "value"]
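To try the read step without creating a file, a minimal sketch can feed a few of the same rows to read_csv from an in-memory string. Using io.StringIO and only four rows of the sample is an assumption for illustration; the article itself reads target.csv from disk.

```python
import io
import pandas as pd

# A few rows of the sample data, held in memory instead of target.csv
csv_text = """datetime,id,value
20170606121314,1,2
20170606121315,1,3
20170606121540,2,10
20170606134002,3,21
"""

df = pd.read_csv(io.StringIO(csv_text), sep=",")
df.columns = ["datetime", "id", "value"]

# Every column is inferred as an integer type, including datetime
print(df.dtypes)
print(df.shape)  # (4, 3)
```

Because the datetime column comes in as integers, it needs an explicit conversion before it can be treated as a date.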
To check that the data was read correctly, use
df.head()
The output is as follows.
| | datetime | id | value |
|---|---|---|---|
| 0 | 20170606121314 | 1 | 2 |
| 1 | 20170606121315 | 1 | 3 |
| 2 | 20170606121316 | 1 | 4 |
| 3 | 20170608121616 | 1 | 4 |
| 4 | 20170608121617 | 1 | 1 |
The head() method displays the first 5 rows of the data and is often used to check its contents.
There is also a tail() method, which displays the last 5 rows.
The result of tail() is as follows. (This output was captured after the datetime conversion described later, so the datetime column is shown as timestamps.)
| | datetime | id | value |
|---|---|---|---|
| 9 | 2017-06-08 12:15:43 | 2 | 4 |
| 10 | 2017-06-06 13:40:02 | 3 | 21 |
| 11 | 2017-06-06 13:40:03 | 3 | 10 |
| 12 | 2017-06-06 13:40:04 | 3 | 4 |
| 13 | 2017-06-08 13:40:05 | 3 | 50 |
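Both methods accept an optional row count. The sketch below uses a small stand-in DataFrame with made-up values (an assumption for illustration, not the article's data) to show head(n) and tail(n).

```python
import pandas as pd

# Stand-in data: 14 rows, like the sample CSV, but with made-up values
df = pd.DataFrame({"id": range(14), "value": range(100, 114)})

print(df.head())    # first 5 rows (the default)
print(df.head(3))   # first 3 rows
print(df.tail(2))   # last 2 rows, index 12 and 13
```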
The following line sets the column names of the DataFrame.
python
df.columns = ["datetime","id","value"]
python
from datetime import datetime as dt
# Parse each YYYYMMDDhhmmss value into a datetime object
df.datetime = df.datetime.apply(lambda d: dt.strptime(str(d), "%Y%m%d%H%M%S"))
The purpose of this step is to make the date column easier to work with. df.datetime accesses the datetime column, and apply runs the given function on the value in each row. The values are read from the CSV as integers, so str(d) first turns each one into a string, and the strptime method then parses that string. As a result, values that were plain numbers are converted to a date and time type.
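As an alternative sketch, pandas can do the same conversion on the whole column at once with pd.to_datetime. This is not what the article uses; it assumes the column holds 14-digit integers in the layout shown in the sample data.

```python
import pandas as pd

# Two values in the same YYYYMMDDhhmmss layout as the sample data
df = pd.DataFrame({"datetime": [20170606121314, 20170608121618]})

# Convert to strings first, then parse the whole column with an explicit format
df["datetime"] = pd.to_datetime(df["datetime"].astype(str), format="%Y%m%d%H%M%S")

print(df["datetime"].iloc[0])  # 2017-06-06 12:13:14
```

Giving the format explicitly avoids any guessing about how the digits should be split into date parts.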
python
df_by_id = df.groupby("id")["value"].count().reset_index()
df_by_id
groupby("id") groups the records by the value in the id column, and count() counts the number of records for each id.
The contents of df_by_id are as follows.
| | id | value |
|---|---|---|
| 0 | 1 | 6 |
| 1 | 2 | 4 |
| 2 | 3 | 4 |
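One detail worth knowing: count() counts only non-missing values, while size() counts rows. The sample data has no missing values, so both give the same result there. The sketch below uses a hypothetical frame with one missing value to show the difference.

```python
import numpy as np
import pandas as pd

# Hypothetical data: id 1 has one missing value
df = pd.DataFrame({"id": [1, 1, 1, 2], "value": [2.0, np.nan, 4.0, 10.0]})

non_null = df.groupby("id")["value"].count()  # skips NaN
rows = df.groupby("id")["value"].size()       # counts every row

print(non_null.tolist())  # [2, 1]
print(rows.tolist())      # [3, 1]
```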
python
import seaborn as sns
import matplotlib.pyplot as plt
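# Jupyter notebook magic that shows plots inline
%matplotlib inline
id_df = pd.DataFrame(df_by_id)
# Histogram of record counts per id
# (distplot is deprecated in seaborn 0.11 and later; histplot is the newer equivalent)
sns.distplot(id_df.value, kde=False, rug=False, axlabel="record_count", bins=10)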
We use a library called seaborn, which draws attractive plots.
python
df_value_sum = df.groupby("id")["value"].sum().reset_index()
The only change from the previous aggregation is that count() is replaced by sum().
The contents of df_value_sum are as follows.
| | id | value |
|---|---|---|
| 0 | 1 | 16 |
| 1 | 2 | 33 |
| 2 | 3 | 85 |
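The count and the sum can also be computed in a single pass. The sketch below uses pandas named aggregation (available from pandas 0.25) on the rows for ids 2 and 3 from the sample data. The output column names record_count and value_sum are chosen here for illustration and do not appear in the article.

```python
import pandas as pd

# Rows for ids 2 and 3, taken from the sample data
df = pd.DataFrame({
    "id": [2, 2, 2, 2, 3, 3, 3, 3],
    "value": [10, 8, 11, 4, 21, 10, 4, 50],
})

# Named aggregation: one output column per (source column, function) pair
summary = df.groupby("id").agg(
    record_count=("value", "count"),
    value_sum=("value", "sum"),
).reset_index()

print(summary)
```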
python
# Take the first datetime that appears for each id
start_datetime_by_id = df.groupby(["id"])["datetime"].first().reset_index()
df_date = pd.DataFrame(start_datetime_by_id)
The contents of df_date are as follows.
| | id | datetime |
|---|---|---|
| 0 | 1 | 2017-06-06 12:13:14 |
| 1 | 2 | 2017-06-06 12:15:40 |
| 2 | 3 | 2017-06-06 13:40:02 |
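Note that first() returns the first row of each group in the order the rows appear, so it yields the earliest time only when the rows for each id are already sorted by time, as they are in the sample data. If that ordering is not guaranteed, min() is a safer choice. The sketch below uses hypothetical unsorted rows to show the difference.

```python
import pandas as pd

# Hypothetical rows: the two rows for id 1 are deliberately out of time order
df = pd.DataFrame({
    "id": [1, 1, 2],
    "datetime": pd.to_datetime([
        "2017-06-08 12:16:16",
        "2017-06-06 12:13:14",
        "2017-06-06 12:15:40",
    ]),
})

first_seen = df.groupby("id")["datetime"].first().reset_index()
earliest = df.groupby("id")["datetime"].min().reset_index()

print(first_seen)  # id 1 -> 2017-06-08 12:16:16 (row order)
print(earliest)    # id 1 -> 2017-06-06 12:13:14 (true minimum)
```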
python
sns.distplot(df_date.datetime.dt.day, kde=False, rug=False, axlabel="record_generate_date", hist_kws={"range": [1, 30]}, bins=30)
Here dt.day extracts the day of the month from each start time. With the option hist_kws={"range": [1, 30]}, the horizontal axis is drawn over the range 1 to 30, which covers the 30 days of June 2017. This makes it clear on which days of the month the data started to occur.
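The bar heights of such a histogram can also be checked numerically. The sketch below rebuilds df_date from the table of start times and counts ids per start day of the month. Using value_counts() here is an addition for illustration, not part of the original article.

```python
import pandas as pd

# Start time per id, copied from the df_date table
df_date = pd.DataFrame({
    "id": [1, 2, 3],
    "datetime": pd.to_datetime([
        "2017-06-06 12:13:14",
        "2017-06-06 12:15:40",
        "2017-06-06 13:40:02",
    ]),
})

# Number of ids whose first record falls on each day of the month
ids_per_day = df_date["datetime"].dt.day.value_counts().sort_index()

print(ids_per_day.index.tolist(), ids_per_day.tolist())  # [6] [3]
```

All three ids start on the 6th, so the histogram has a single bar of height 3 at that day.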