Previously, I wrote about extracting a sample from a population by scanning 100% of it with Hadoop. When you have little prior information about the data and want to explore it by trial and error, you first analyze the extracted sample ad hoc from various angles to grasp the characteristics and trends of the data.
Sampling with Hadoop and pandas work very well together. As previously introduced, the combination of pandas + matplotlib lets you analyze data with two data structures, Series and DataFrame, and visualize the results.
Since Hadoop output is normally plain tab-delimited text, it can be read as it is with the pd.read_table() function.
import pandas as pd
df = pd.read_table('hadoop-out.txt')
df.describe()  # Compute several summary statistics at once
#=> count              38156219  # Total count (population size)
#   unique              6536847  # Number of unique values
#   top      0024D69XXXXX,Area9  # Most frequent value
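Hadoop output files often have no header row, in which case pandas would treat the first record as column names. The following is a minimal sketch under that assumption: the sample rows, the column names `key` and `count`, and the in-memory buffer standing in for `hadoop-out.txt` are all made up for illustration.

```python
import io

import pandas as pd

# Hypothetical stand-in for a tab-delimited Hadoop output file (key<TAB>count).
sample_output = io.StringIO(
    "0024D69XXXXX,Area9\t3\n"
    "0024D69YYYYY,Area1\t1\n"
    "0024D69XXXXX,Area9\t2\n"
)

# header=None stops pandas from treating the first record as column names;
# the names "key" and "count" are assumptions for this sketch.
sample_df = pd.read_table(sample_output, header=None, names=["key", "count"])

# For a column of strings, describe() reports count, unique, top and freq.
key_summary = sample_df["key"].describe()
print(key_summary)
```

If the real file does carry a header line, the `header=None` and `names` arguments can simply be dropped.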
You can also convert a dictionary object to a DataFrame as follows:
df = pd.DataFrame(list(self.dic.values()), index=list(self.dic.keys()))
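As a self-contained illustration of the same idea, the sketch below builds a DataFrame from a plain dictionary. The dictionary `word_counts` and the column name `count` are hypothetical. `pd.DataFrame.from_dict` with `orient="index"` is shown as one alternative that also uses the keys as the row index.

```python
import pandas as pd

# Hypothetical dictionary, e.g. counts accumulated while parsing records.
word_counts = {"hadoop": 12, "pandas": 7, "fluentd": 3}

# Same pattern as the line above: values become rows, keys become the index.
df_from_lists = pd.DataFrame(
    list(word_counts.values()),
    index=list(word_counts.keys()),
    columns=["count"],
)

# Alternative: from_dict with orient="index" also uses the keys as the index.
df_from_dict = pd.DataFrame.from_dict(word_counts, orient="index", columns=["count"])

print(df_from_lists)
print(df_from_dict)
```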
Data is usually already structured by the time Hadoop processes it, for example because it was collected with Fluentd, so it naturally fits pandas, which is designed to handle structured data.
The value_counts() function is useful for further aggregating results such as word counts. It computes how often each value occurs in a one-dimensional data structure such as a Series, an array, or a sequence.
pandas also provides the fillna() function for filling in missing values, which lets you plug holes left by the extraction process with a value of your choice.
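A small sketch of frequency counting with the `Series.value_counts()` method follows. The area labels are invented sample data, not taken from the original data set.

```python
import pandas as pd

# Hypothetical sample of area labels extracted from the population.
areas = pd.Series(["Area9", "Area1", "Area9", "Area3", "Area9", "Area1"])

# value_counts() returns the frequency of each distinct value,
# sorted in descending order of frequency by default.
area_freq = areas.value_counts()
print(area_freq)

# normalize=True returns relative frequencies instead of raw counts.
area_ratio = areas.value_counts(normalize=True)
print(area_ratio)
```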
| Argument | Description |
|---|---|
| value | Scalar value used to fill the holes (a dictionary is also accepted) |
| axis | Axis to fill along: 0 for rows, 1 for columns |
| limit | Maximum number of consecutive values to fill |
| method | Interpolation method such as 'ffill' (forward fill) or 'bfill' (backward fill). To fill with the mean or median, pass that statistic as value instead |
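The arguments in the table can be sketched as follows. The DataFrame `sample` is hypothetical. Because recent pandas releases deprecate the `method` argument of `fillna()`, this sketch uses the equivalent `ffill()` method for forward filling.

```python
import numpy as np
import pandas as pd

# Hypothetical extracted sample with holes in it.
sample = pd.DataFrame(
    {"a": [1.0, np.nan, 3.0, np.nan], "b": [np.nan, 2.0, np.nan, 4.0]}
)

# Fill every hole with a single scalar.
filled_zero = sample.fillna(0)

# Fill with a different value per column by passing a dictionary.
filled_per_column = sample.fillna({"a": -1.0, "b": 99.0})

# Fill each column with its own mean (NaN is skipped when computing the mean).
filled_mean = sample.fillna(sample.mean())

# Propagate the previous valid value forward, at most one consecutive fill.
filled_forward = sample.ffill(limit=1)
print(filled_forward)
```

Note that forward filling cannot fill a hole in the first row, since there is no earlier value to propagate.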
The duplicated() method of a DataFrame returns a boolean Series. It can be used to check for duplicates, because it returns True for every row whose values have already appeared earlier in that DataFrame.
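A minimal sketch of duplicate checking is shown below. The `records` DataFrame and its column names `mac` and `area` are invented for illustration.

```python
import pandas as pd

# Hypothetical records; rows 0 and 2 are identical.
records = pd.DataFrame(
    {
        "mac": ["AAA", "BBB", "AAA", "AAA"],
        "area": ["Area9", "Area1", "Area9", "Area2"],
    }
)

# True marks a row whose values all appeared in an earlier row.
is_dup = records.duplicated()
print(is_dup)

# drop_duplicates() keeps only the first occurrence of each row.
deduped = records.drop_duplicates()

# subset restricts the comparison to the listed columns.
dup_mac = records.duplicated(subset=["mac"])
```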
The replace() function replaces values. For example, to treat 99999 as a missing value and replace it with NaN:
series.replace('99999', np.nan)
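One detail worth checking is whether the sentinel is stored as a number or as text, since the value passed to `replace()` has to match the stored type. The sketch below uses made-up Series to show both cases.

```python
import numpy as np
import pandas as pd

# Hypothetical readings where 99999 is a sentinel for "no measurement".
numeric = pd.Series([1.5, 99999, 2.5])
as_text = pd.Series(["1.5", "99999", "2.5"])

# Pass a number for numeric data and a string for data read as text.
numeric_clean = numeric.replace(99999, np.nan)
text_clean = as_text.replace("99999", np.nan)

# Several sentinels can be replaced at once by passing a list.
multi_clean = pd.Series([1.5, 99999, -1, 2.5]).replace([99999, -1], np.nan)
print(multi_clean)
```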
It is also easy to remove or cap outliers that fall outside a reference range.
# Set values whose absolute value exceeds 3 (outside the range -3 to 3) to NaN
data[np.abs(data) > 3] = np.nan
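Both options, removing and capping, can be sketched as below. The Series `values` is hypothetical sample data, and the threshold of 3 simply follows the line above.

```python
import pandas as pd

# Hypothetical standardized values.
values = pd.Series([0.5, -3.2, 2.9, 4.1, -0.7])

# Option 1: turn outliers into NaN without modifying the original, then drop them.
masked = values.mask(values.abs() > 3)
kept = masked.dropna()

# Option 2: cap outliers at the boundary instead of discarding them.
capped = values.clip(lower=-3, upper=3)
print(capped)
```

Unlike the in-place assignment, `mask()` and `clip()` return new objects, so the original sample stays available for comparison.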
Using these pandas functions helps you narrow down what to analyze within the extracted sample. Because it works so well with Hadoop output, pandas is essential for running fast PDCA cycles of analysis.