This is a note on a script I expect to reuse whenever I analyze data in Python. It runs on Python 2 with Spark 2.0 in IBM's Data Science Experience environment (although nothing here actually requires Spark). Since real analysis work often involves a large number of fields, I tried to come up with an approach that does not hard-code field names (column names) into the script, so the analysis can be done efficiently. I also try column expansion and flagging of category data, the equivalent of SPSS Modeler's "field reorganization", which is needed when preparing data for machine learning. I didn't get to missing-value handling this time, so I'll save that for another occasion. (The data used in this article is assumed to be already loaded into df_wiskey.)
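For readers who want to follow along without the original dataset, here is a minimal sketch of a stand-in DataFrame. The column names and values are my assumptions (the article's real df_wiskey is loaded elsewhere); only the general shape — a name, two category columns, and a numeric column — matters for the steps below.

```python
import pandas as pd

# Hypothetical sample data standing in for the article's whiskey dataset.
df_wiskey = pd.DataFrame({
    'Name': ['A', 'B', 'C', 'D'],
    'Country': ['USA', 'Scotland', 'Japan', 'USA'],
    'Category': ['Bourbon', 'Single Malt', 'Single Malt', 'Bourbon'],
    'Rating': [85.0, 92.0, 90.0, 88.0],
})
print(df_wiskey.head(10))
```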
#First, check the contents of the DataFrame
df_wiskey.head(10)
#Next, check the attributes (dtypes) of each column (field) (this time I'll just go with whatever pandas inferred, lol)
df_wiskey.dtypes
#Basic statistics of numerical data
df_wiskey.describe()
#Graph the distribution of numerical data
#Put matplotlib in inline mode
%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
for x in df_wiskey.columns[df_wiskey.dtypes == 'float64']:
    xdesc = df_wiskey[x].describe()
    plt.hist(df_wiskey[x], range=(xdesc['min'], xdesc['max']))
    plt.title(x)
    plt.show()
#Pairwise correlations between the numerical columns
df_wiskey.corr()
#Data other than numerical data
df_wiskey[df_wiskey.columns[df_wiskey.dtypes == 'object']].head(5)
#Tally value frequencies for each non-numerical column (assumed to hold category values)
for x in df_wiskey.columns[df_wiskey.dtypes == 'object']:
    valcal = df_wiskey[x].value_counts()
    print('-- ' + x + ' -----------------------------------')
    print(valcal.head(10))
    print('--------------------------------------------')
#Cross-tabulation between category columns -- simple, but the display feels a little off
import pandas as pd
pd.crosstab(df_wiskey.Country, df_wiskey.Category)
#Heatmap of Country vs Category (Bourbon is concentrated in the USA; Single Malt spans most countries)
df_wiskey_pd = pd.pivot_table(data=df_wiskey, columns='Country', index='Category', values='Name', aggfunc='count')
plt.imshow(df_wiskey_pd , aspect= 'auto' ,interpolation='nearest')
plt.colorbar()
plt.xticks(range(df_wiskey_pd.shape[1]), df_wiskey_pd.columns , rotation='vertical')
plt.yticks(range(df_wiskey_pd.shape[0]), df_wiskey_pd.index)
plt.show()
#Turn the values in the Country column into T/F flag fields to feed into a modeling technique
# (the new column names are Country_XXXXXXXX)
for x in df_wiskey.groupby('Country').count().index:
    x1 = 'Country_' + x
    df_wiskey[x1] = 'F'
    # Where the Country column equals x, set Country_XXXXXXXX to 'T'
    df_wiskey.loc[df_wiskey[df_wiskey.Country == x].index, x1] = 'T'
#Display only the first 3 lines
df_wiskey.head(3)
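As an aside, the same flag expansion can be done more concisely with pandas' built-in one-hot encoding. A minimal sketch on hypothetical data (note it produces 1/0 dummy columns rather than the 'T'/'F' strings above):

```python
import pandas as pd

df = pd.DataFrame({'Country': ['USA', 'Scotland', 'USA']})
# get_dummies creates one column per distinct value, prefixed with 'Country'
flags = pd.get_dummies(df['Country'], prefix='Country')
df = df.join(flags)
print(df.columns.tolist())  # ['Country', 'Country_Scotland', 'Country_USA']
```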
Data Science Experience notebooks may turn out to be pretty easy to use :grinning: