(Continuing from the previous post.) The input is an Excel file exported from a database. Each row is one record, with the text stored in a single field, and each row also has a date field. This time the goal is to extract specified keywords from the text field and plot how the number of occurrences changes from month to month. The input and output are Excel files on Windows; the processing in between is done on a Mac.
Character-encoding conversion and Excel conversion are the same as last time, so they are omitted.
Load the CSV into df with pd.read_csv(). MeCab is required. The code is written for Python 2 (it uses unicode and iteritems()).
import MeCab
import pandas as pd

def group_by_month(df):
    e = df['comment']  # Field that holds the text
    e.index = pd.to_datetime(df['datetime'])  # Use the date field as the index
    m = MeCab.Tagger('-Ochasen')  # ChaSen-compatible output format
    result_df = None
    for k, v in e.iteritems():
        if type(v) != unicode:  # Skip cells that do not contain text
            continue
        target_dic = {  # Target keywords; reset to zero for every record
            'XXX' : 0,
            'YYY' : 0,
            'ZZZ' : 0,
        }
        s8 = v.encode('utf-8')
        node = m.parseToNode(s8)
        while node:
            word = node.feature.split(',')[0]  # Part of speech (not used below)
            key = node.surface
            if key in target_dic:
                target_dic[key] += 1  # Increase the count if found
            node = node.next
        if result_df is None:
            result_df = pd.DataFrame(target_dic, index=[k])
        else:
            result_df = result_df.append(pd.DataFrame(target_dic, index=[k]))
    # Monthly grouping
    # Grouping on the index didn't seem to work, so copy it into a column
    result_df['index1'] = result_df.index
    result_df = result_df.groupby(pd.Grouper(key='index1', freq='M')).sum()
    return result_df
For every record I reset the dictionary, count the occurrences, convert the result to a one-row DataFrame, and append it. I suspect this could be made simpler, but I don't know how.
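One possible simplification, as a minimal sketch rather than a drop-in replacement: accumulate the counts per month in plain dictionaries and build a table only once at the end. It is written for Python 3 and uses only the standard library. `count_by_month` and its `tokenize` parameter are hypothetical names, and `str.split` stands in for the MeCab step so the example runs without MeCab installed. It also avoids `DataFrame.append`, which was removed in pandas 2.0.

```python
from collections import Counter, defaultdict
from datetime import datetime

KEYWORDS = ("XXX", "YYY", "ZZZ")  # placeholder keywords, as in the article


def count_by_month(records, tokenize=str.split, keywords=KEYWORDS):
    """Return {(year, month): Counter} of keyword hits.

    records  : iterable of (datetime, text) pairs
    tokenize : hypothetical stand-in for the MeCab step; any callable
               that turns a text into a list of surface strings
    """
    monthly = defaultdict(Counter)
    for when, text in records:
        if not isinstance(text, str):  # skip cells without text
            continue
        bucket = monthly[(when.year, when.month)]
        for token in tokenize(text):
            if token in keywords:
                bucket[token] += 1
    return monthly


# Made-up sample records (illustration only)
records = [
    (datetime(2014, 6, 3), "YYY is mentioned here"),
    (datetime(2014, 8, 1), "XXX and YYY and XXX"),
    (datetime(2014, 8, 20), None),
    (datetime(2014, 8, 25), "ZZZ"),
]
monthly = count_by_month(records)
for month in sorted(monthly):
    print(month, dict(monthly[month]))
```

With real data, `tokenize` would wrap the `parseToNode` loop and return the `surface` of each node. The nested dictionary can then be converted to a DataFrame in a single step for plotting.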
At this point, the following data will be stored in result_df.
            XXX  YYY  ZZZ
index1
2014-06-30    0    1    0
2014-07-31    0    6    0
2014-08-31    3   19    6
2014-09-30    1    8    0
2014-10-31    5   29    7
2014-11-30   10    8    0
2014-12-31   10   31    8
2015-01-31   12   41   15
2015-02-28   45   82   22
2015-03-31   21   58    9
2015-04-30   23   60   19
2015-05-31    4   36    3
2015-06-30   11   40    8
2015-07-31   13   49   11
2015-08-31    8   14    2
2015-09-30   13   13    9
2015-10-31    5   31    9
2015-11-30   11   21    3
2015-12-31   12   21    3
2016-01-31    2   19    0
2016-02-29   12   15    5
2016-03-31    9   32    7
2016-04-30    2   22    4
2016-05-31    6   24    2
2016-06-30    7   21    4
2016-07-31    9   22    4
2016-08-31    5   21    1
2016-09-30    7   31    6
2016-10-31    0   12    1
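The month-end dates in the index come from `freq='M'`. The code comment above notes that grouping on the index itself did not seem to work, which is why the index is copied into `index1`. As a minimal sketch, assuming Python 3 and a reasonably recent pandas (not necessarily the version used for this post), one way to group directly on a DatetimeIndex is to convert it to monthly periods. The sample data is made up for illustration.

```python
import pandas as pd

# Made-up per-record counts, indexed by timestamp (illustration only)
counts = pd.DataFrame(
    {"XXX": [0, 3, 1], "YYY": [1, 2, 4]},
    index=pd.to_datetime(["2014-06-03", "2014-06-20", "2014-07-01"]),
)

# Group on the index itself: each timestamp is mapped to its calendar month
monthly = counts.groupby(counts.index.to_period("M")).sum()
print(monthly)
```

The result is indexed by month (a PeriodIndex such as 2014-06, 2014-07) rather than by month-end date, and no helper column is needed.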
import matplotlib.pyplot as plt

def plot_init(title):
    '''
    Prepare the graph area
    '''
    fig = plt.figure()
    ax = fig.add_subplot(1,1,1)
    ax.set_title(title)
    return fig, ax

def plot_count_of_day(df):
    '''
    Plot
    '''
    title = 'test_data'
    fig, ax = plot_init(title)
    for c in df.columns:  # One line per keyword column
        df[c].plot(label=c, ax=ax)
    ax.legend()
    ax.set(xlabel='month', ylabel='count')
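Calling plot_count_of_day(result_df) draws one line per keyword, with the month on the x-axis and the count on the y-axis.
That's all.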