Learning record (25th day)

Started studying: Saturday, December 7th

Teaching materials, etc.:
・ Miyuki Oshige, "Details! Python3 Introductory Note" (Sotec, 2017): read 12/7 (Sat) to 12/19 (Thu)
・ Progate Python course (5 courses in total): finished 12/19 (Thu) to 12/21 (Sat)
・ Andreas C. Müller, Sarah Guido, "(Japanese title) Machine learning starting with Python" (O'Reilly Japan, 2017): 12/21 (Sat) to 12/23 (Sat)
・ Kaggle: Real or Not? NLP with Disaster Tweets: submitted and tuned from 12/28 (Sat) to 1/3 (Fri)
・ **Wes McKinney, "(Japanese title) Introduction to data analysis by Python" (O'Reilly Japan, 2018)**: read 1/4 (Wed) to 1/13 (Mon)

"Introduction to Data Analysis with Python"

Finished reading on January 13th

Chapter 11 Time Series Data

-Any data observed at points in time constitutes a time series. Time series can be characterized in several ways, for example by timestamps, fixed periods, or time intervals, and the appropriate method depends on the application. pandas offers many tools for time series, which are useful for analyzing financial and log data.

-datetime, time, and calendar modules: a datetime object can be converted to a string with str or, with an explicit format, with strftime. %Y is a 4-digit year, %y is a 2-digit year, and so on. Use it on a datetime object like datetime.strftime('%Y-%m-%d').
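The formatting codes above can be tried with a small sketch like the following. The date and the variable names are made up for illustration; strptime is included only to show the reverse direction.

```python
from datetime import datetime

# Hypothetical example date, chosen only for illustration.
stamp = datetime(2011, 1, 3)

# str() gives the default representation; strftime() takes an explicit format.
default_text = str(stamp)
iso_text = stamp.strftime('%Y-%m-%d')   # 4-digit year, month, day
short_year = stamp.strftime('%y')       # 2-digit year

# strptime() parses a string back into a datetime using the same codes.
parsed = datetime.strptime('2011-01-03', '%Y-%m-%d')

print(default_text, iso_text, short_year, parsed)
```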

-Index reference: when the index holds dates, date['2000'] selects the data for the corresponding period (here, the year 2000). Range generation: date_range creates a range of dates. Moving data: shift moves the data forward or backward, and an offset can also be specified.
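A minimal sketch of these three operations, assuming a small daily series invented for this example. It uses .loc for the year selection, which should behave the same as plain bracket indexing on a Series.

```python
import pandas as pd

# Hypothetical daily series spanning the 1999/2000 boundary.
idx = pd.date_range('1999-12-30', periods=5, freq='D')
ts = pd.Series(range(5), index=idx)

# Partial string indexing: every row whose timestamp falls in the year 2000.
year_2000 = ts.loc['2000']

# shift(1) moves the values down one row and leaves the index unchanged.
shifted_values = ts.shift(1)

# shift with freq moves the index itself instead of the values.
shifted_index = ts.shift(1, freq='D')

# A date offset can also be added directly to a timestamp.
two_days_later = idx[0] + pd.offsets.Day(2)

print(year_2000)
print(shifted_values)
print(shifted_index)
print(two_days_later)
```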

・ Most time series are handled in Coordinated Universal Time (UTC). Get or create a time zone object with pytz.timezone. Localize with tz_localize and convert to another time zone with tz_convert. You can also specify the time zone when creating a timestamp.
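The localize-then-convert flow might look like the sketch below. It passes time zone names as strings rather than pytz.timezone objects, which pandas also accepts; the series and the choice of 'Asia/Tokyo' are assumptions made for this example.

```python
import pandas as pd

# Hypothetical naive (time-zone-unaware) daily series.
naive = pd.Series(range(3), index=pd.date_range('2020-01-01', periods=3, freq='D'))

# Attach a time zone to the naive index without changing the clock time.
utc = naive.tz_localize('UTC')

# Convert the same instants to another time zone.
tokyo = utc.tz_convert('Asia/Tokyo')

# A time zone can also be given when the timestamp is created.
stamp = pd.Timestamp('2020-01-01 09:00', tz='Asia/Tokyo')

print(utc.index[0], tokyo.index[0], stamp)
```

Comparing tokyo.index[0] with utc.index[0] returns True, because both refer to the same instant and only the displayed clock time differs.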

-The frequency of a time series can be converted with the resample method. Downsampling aggregates the data to a lower frequency, and upsampling does the reverse. In resample('5min', closed=XXX), closed determines which end of each interval, left or right, is closed (that is, included in the interval). The OHLC (Open-High-Low-Close) aggregation computes the open, high, low, and close prices.

-Window function: a function that is 0 outside a certain finite interval. It helps smooth out noise and gaps in the data. Weights that decrease exponentially can also be applied to the data. Use rolling, expanding, and ewm (with span), and apply your own functions with apply.
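As a rough sketch of downsampling, OHLC aggregation, and window functions, assuming a made-up one-minute series (the values are just 0 to 11):

```python
import numpy as np
import pandas as pd

# Hypothetical one-minute data: 12 points starting at midnight.
idx = pd.date_range('2000-01-01', periods=12, freq='min')
ts = pd.Series(np.arange(12), index=idx)

# Downsample to 5-minute bins; closed='left' includes the left edge of each bin.
five_min_sum = ts.resample('5min', closed='left').sum()

# Open, high, low, and close of each 5-minute bin.
ohlc = ts.resample('5min').ohlc()

# Moving window of 3 points, expanding window, and exponentially weighted mean.
rolling_mean = ts.rolling(3).mean()
expanding_max = ts.expanding().max()
ewm_mean = ts.ewm(span=3).mean()

# A user-defined function applied to each rolling window (here: the range).
rolling_range = ts.rolling(3).apply(lambda w: w.max() - w.min())

print(five_min_sum)
print(ohlc)
```

With closed='right' the bin edges would shift so that each timestamp on a boundary belongs to the earlier bin instead.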

Chapter 12 Advanced pandas

・ pandas Categorical: using it can improve processing speed and reduce memory usage. When performing many analyses on a specific dataset, converting to categorical variables can improve performance, and replacing columns in a data frame with categorical representations also saves a lot of memory. Convert with astype('category').
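The memory effect can be checked with a small sketch such as this one. The fruit labels and the repetition count are invented for the example, and the exact byte counts depend on the pandas version.

```python
import pandas as pd

# Hypothetical column with a few distinct labels repeated many times.
labels = pd.Series(['apple', 'orange', 'apple', 'banana'] * 1000)

# Convert to the categorical representation (integer codes plus a category table).
as_category = labels.astype('category')

# deep=True counts the memory of the stored strings as well.
plain_bytes = labels.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

print(plain_bytes, category_bytes)
print(as_category.cat.categories)
```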

-Category methods: adding categories, setting an order relationship, removing categories, and so on. add_categories, as_ordered, remove_categories
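These methods are reached through the .cat accessor of a categorical Series. A minimal sketch, assuming a toy Series of single letters:

```python
import pandas as pd

# Hypothetical categorical Series.
s = pd.Series(['a', 'b', 'c', 'a'], dtype='category')

# Add a category that does not occur in the data yet.
with_d = s.cat.add_categories(['d'])

# Mark the categories as ordered (in their current order: a < b < c).
ordered = s.cat.as_ordered()

# Remove a category; values that belonged to it become missing.
without_c = s.cat.remove_categories(['c'])

print(with_d.cat.categories)
print(ordered.min())
print(without_c)
```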

-When using machine learning tools and the like, it may be necessary to convert categorical data to dummy variables (one-hot encoding), where each category is expressed as 0 or 1. This can be done with get_dummies.
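A short sketch of one-hot encoding with get_dummies, using a made-up Series. dtype=int is passed explicitly because recent pandas versions return boolean columns by default.

```python
import pandas as pd

# Hypothetical categorical values.
s = pd.Series(['a', 'b', 'a', 'c'])

# One column per distinct value; 1 where the row has that value, 0 elsewhere.
dummies = pd.get_dummies(s, prefix='category', dtype=int)

print(dummies)
```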

-groupby can apply a common operation to each group of specified elements. transform does the same per group and accepts a lambda expression such as lambda x: x.mean(), for example b.transform(lambda x: x.mean()), where b is the grouped object. Group-wise calculations can also be built from transform, for example normalized = (df['A'] - b.transform('mean')) / b.transform('std'). The per-group aggregation may then be computed several times, but the overall benefit of vectorized operations often outweighs that cost.
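The group-wise normalization can be sketched as follows. The data frame, the 'key' column, and the assumption that b is df.groupby('key')['A'] are all made up for this example.

```python
import pandas as pd

# Hypothetical data: two groups, 'a' and 'b'.
df = pd.DataFrame({'key': ['a', 'b', 'a', 'b'],
                   'A': [1.0, 2.0, 3.0, 6.0]})

# Assumed definition of b: column A grouped by key.
b = df.groupby('key')['A']

# transform returns a result aligned with the original rows.
group_mean = b.transform('mean')
group_mean_lambda = b.transform(lambda x: x.mean())

# Normalize each value by its own group's mean and standard deviation.
normalized = (df['A'] - b.transform('mean')) / b.transform('std')

print(group_mean)
print(normalized)
```

Passing the string 'mean' instead of a lambda lets pandas use its built-in aggregation, which is usually faster on large data.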

Chapter 13 Introduction to Modeling Libraries in Python

-The point of contact between pandas and analysis libraries is usually a NumPy array. Use the .values attribute to convert a data frame to NumPy (it becomes an ndarray): data.values. To convert back, pass the two-dimensional ndarray and specify the column names: pd.DataFrame(data.values, columns=['one', 'two', 'three'])
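The round trip between a data frame and an ndarray might look like this sketch. The contents of data are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data frame with three numeric columns.
data = pd.DataFrame({'x0': [1, 2, 3],
                     'x1': [0.1, 0.2, 0.3],
                     'y': [-1.5, 0.0, 3.6]})

# DataFrame -> two-dimensional ndarray (column names are lost).
arr = data.values

# ndarray -> DataFrame, supplying the column names again.
restored = pd.DataFrame(arr, columns=['one', 'two', 'three'])

print(type(arr), arr.shape)
print(restored)
```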

-When using only some of the columns, it is better to select them with loc and then take values: model_cols = ['x0', 'x1'], then data.loc[:, model_cols].values. This extracts only **x0 and x1** for **all rows** as an array.
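A minimal sketch of the column selection, assuming a small data frame that happens to contain x0, x1, and one extra column:

```python
import pandas as pd

# Hypothetical data frame; only x0 and x1 are wanted as model inputs.
data = pd.DataFrame({'x0': [1, 2, 3],
                     'x1': [0.1, 0.2, 0.3],
                     'y': [-1.5, 0.0, 3.6]})

model_cols = ['x0', 'x1']

# All rows (:) and only the listed columns, converted to an ndarray.
features = data.loc[:, model_cols].values

print(features.shape)
print(features)
```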

`Replacing some columns with dummy variables`


dummies = pd.get_dummies(data.category, prefix='category')
data_with_dummies = data.drop('category', axis=1).join(dummies)

# Create the dummy columns, drop the original column with drop, and add the dummies with join.

Learning record No. 21 (25th day)