Start studying: Saturday, December 7th
Teaching materials, etc.:
・ Miyuki Oshige, "Details! Python 3 Introductory Note" (Sotec, 2017): completed on Thursday, December 19th
・ Progate Python course (5 courses in total): completed on Saturday, December 21st
・ Andreas C. Müller and Sarah Guido, "(Japanese title) Machine learning starting with Python" (O'Reilly Japan, 2017): completed on Saturday, December 23rd
・ Kaggle: Real or Not? NLP with Disaster Tweets: submitted on Saturday, December 28th, with adjustments through Friday, January 3rd
・ **Wes McKinney, "(Japanese title) Introduction to Data Analysis with Python" (O'Reilly Japan, 2018)**: from Saturday, January 4th
Read through p.181: finished Chapter 5, the introduction to pandas.
-pandas is designed to handle tabular and heterogeneous data. Like NumPy, it favors data processing that does not use for loops. Series and DataFrames are used heavily.
-Series: A one-dimensional array-like object that holds a sequence of values together with an associated array of labels called the index. Values can be referenced by label or selected by condition, much like a numerical array. You can also create a Series by passing a Python dictionary. If an index label has no corresponding value, the value is treated as NaN. NaN can be detected with pandas' isnull and notnull functions.
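As a minimal sketch of the Series notes above (the city names and numbers are made-up illustration data, not from the book), a dictionary can be combined with an explicit index so that a label with no entry becomes NaN:

```python
import pandas as pd

# Hypothetical data, chosen only for illustration.
populations = {"Tokyo": 13.9, "Osaka": 8.8, "Aichi": 7.5}
cities = ["Tokyo", "Osaka", "Fukuoka"]  # "Fukuoka" has no entry in the dict

# Build a Series from a dict; labels missing from the dict become NaN.
s = pd.Series(populations, index=cities)

by_label = s["Tokyo"]    # reference by label
large = s[s > 10]        # selection by condition
missing = s.isnull()     # True where the value is NaN
present = s.notnull()    # the opposite of isnull
```

Here `missing` is True only for "Fukuoka", because that label was requested in the index but had no value in the dictionary.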
-DataFrame: A tabular data structure with ordered columns. It can be pictured as a set of Series that all share one index. It covers many of the operations I used in Kaggle pre-processing, such as head, loc, and extracting a column by name. (An extracted column is a Series with the same index as the DataFrame.) Passing a nested dictionary interprets the outer keys as the column index and the inner keys as the row index.
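The nested-dictionary behavior is easy to check with a small example. This is only a sketch with invented column names and values:

```python
import pandas as pd

# Hypothetical nested dict: outer keys -> columns, inner keys -> row labels.
nested = {
    "price": {"a": 100, "b": 200, "c": 300},
    "stock": {"a": 5, "b": 0, "c": 12},
}
df = pd.DataFrame(nested)

first_rows = df.head(2)   # first two rows
price = df["price"]       # one column, returned as a Series
row_b = df.loc["b"]       # one row selected by label, also a Series

# The extracted column keeps the DataFrame's index.
same_index = price.index.equals(df.index)
```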
-Index objects hold the axis labels and metadata. They are therefore treated as immutable, which makes it possible to handle data safely. If you want to change the index, use pandas' reindex method, which returns a new object arranged to match the new index. You can also reindex the columns by passing the columns argument. The drop method, which deletes entries, returns a new object by default; setting inplace=True as an argument overwrites the original data instead.
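A short sketch of these three points (immutability, reindex, and drop), using a made-up DataFrame:

```python
import pandas as pd

# Hypothetical data for illustration.
df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]}, index=["a", "b", "c"])

# Index objects are immutable: assigning to an element raises TypeError.
try:
    df.index[0] = "z"
    index_is_immutable = False
except TypeError:
    index_is_immutable = True

# reindex returns a new object; labels that did not exist are filled with NaN.
reindexed = df.reindex(["c", "b", "a", "d"])
only_y = df.reindex(columns=["y"])

dropped = df.drop("a")        # returns a new DataFrame; df is unchanged
df.drop("b", inplace=True)    # modifies df itself and returns None
```

Because inplace=True discards the original rows, the default form that returns a new object is usually the safer choice while exploring data.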
・ Data selection: Use loc to refer to a DataFrame by label and iloc to refer to it by integer position. Slicing with labels differs from ordinary Python slicing in that the endpoint is included. (For example, [:2] with loc includes the label 2.)
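The endpoint difference shows up clearly on a DataFrame with the default integer index, where labels and positions happen to be the same numbers. A minimal sketch with invented values:

```python
import pandas as pd

# Hypothetical data; the default index is the labels 0, 1, 2, 3.
df = pd.DataFrame({"value": [10, 20, 30, 40]})

by_label = df.loc[:2]      # label slice: the endpoint label 2 is included
by_position = df.iloc[:2]  # position slice: position 2 is excluded

n_label_rows = len(by_label)        # 3 rows (labels 0, 1, 2)
n_position_rows = len(by_position)  # 2 rows (positions 0, 1)
```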
-When using the arithmetic methods (add, sub, div ...), passing the fill_value keyword argument lets the calculation go ahead where only one side has a value, by substituting fill_value for the missing side. (Normally, wherever the axis labels do not overlap, the result is NaN.)
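As a sketch of the difference between the plain operator and the method with fill_value (the two Series are made up for illustration):

```python
import pandas as pd

# Hypothetical Series whose labels only partly overlap.
s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0], index=["b", "d"])

# Plain addition: every label that exists on only one side becomes NaN.
plain = s1 + s2

# add with fill_value: the missing side is treated as 0 before adding.
filled = s1.add(s2, fill_value=0)
```

With these inputs only "b" has a value in `plain`, while `filled` has a value for all four labels.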
・ Summary statistics can also be computed. Methods such as sum work down each column by default, and across the columns (one result per row) when axis=1 (or axis='columns') is passed as an argument. idxmax returns the index label at which the maximum value occurs. describe returns several summary statistics at once: for numerical data these include the count, mean, and standard deviation, and for non-numerical data the count, the number of unique values, and the most frequent value. This was also often used in Kaggle. value_counts returns the number of occurrences of each value; its sort argument (True or False) controls whether the result is sorted by count. The isin method tests whether each element is among the specified values, returning True where it is. You can use the result to create a subset of just the values you want.
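The methods above can be tried together on a small example. The scores and colors below are invented for illustration only:

```python
import pandas as pd

# Hypothetical numerical data.
df = pd.DataFrame(
    {"math": [70, 85, 90], "english": [80, 60, 95]},
    index=["ann", "bob", "cho"],
)

col_sums = df.sum()         # one total per column (default axis)
row_sums = df.sum(axis=1)   # one total per row; same as axis="columns"
best = df.idxmax()          # index label of the maximum in each column
stats = df.describe()       # count, mean, std, min, quartiles, max

# Hypothetical non-numerical data.
colors = pd.Series(["red", "blue", "red", "green", "red", "blue"])

counts = colors.value_counts()         # occurrences, sorted by count by default
unsorted_counts = colors.value_counts(sort=False)
text_stats = colors.describe()         # count, unique, top, freq

# isin returns a boolean mask that can be used to build a subset.
mask = colors.isin(["red", "green"])
subset = colors[mask]
```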