I passed the Python data analysis certification exam, so here is a summary of the key points.
Supervised learning is a learning method that uses labels giving the correct answers.
The target data, i.e. the correct label, is called the **objective variable**.
The data other than the objective variable are called the **explanatory variables**.
Supervised learning predicts the objective variable using the **explanatory variables**.
Unsupervised learning, on the other hand, is a learning method that does not use correct labels. Since there are no correct labels, it is **a learning method without an objective variable**.
In classification, which is supervised learning, **how many groups to divide into is clearly defined in advance**. For example, if you want to classify dogs and cats, you would divide them into two groups.
Clustering, on the other hand, is unsupervised learning, and **it is not known in advance how many groups there will be**. Maybe there are 3 groups, maybe 5.
Machine learning proceeds in this order:
Get data -> Process data -> Visualize data -> Select algorithm -> Train -> Evaluate accuracy -> Trial operation -> Use results (run the service)
Above all, machine learning needs **data**.
The main packages for data analysis are NumPy, pandas, Matplotlib, SciPy, and scikit-learn.
Django is a web application framework, so it is never used for data analysis, no matter what.
Although SciPy has little presence in reference books, it too is a package used for data analysis.
Adding the `-U` option to the pip command updates the installed libraries to their latest versions.
To explicitly install the latest versions, it looks like this:
$ pip install -U numpy pandas
Use the **strip** method to remove the **leading and trailing whitespace characters**.
in
bird = ' Condor Penguin Duck '
print("before strip: {}".format(bird))
print("after strip: {}".format(bird.strip()))
out
before strip:  Condor Penguin Duck 
after strip: Condor Penguin Duck
The **pickle module** serializes Python objects so that they can be written to and read from files.
If you want to handle file paths in Python, use the **pathlib module**.
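As a minimal sketch combining the two (the file name and dictionary contents are arbitrary examples):

```python
import pickle
from pathlib import Path

data = {"name": "Condor", "count": 3}
path = Path("sample.pickle")  # arbitrary file name

# Serialize the object and write it to the file
path.write_bytes(pickle.dumps(data))

# Read the bytes back and deserialize them
restored = pickle.loads(path.read_bytes())
print(restored)  # {'name': 'Condor', 'count': 3}

path.unlink()  # remove the temporary file
```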
Jupyter Notebook has so-called **magic commands**.
For example, `%timeit` and `%%timeit`.
Both are commands that execute a program repeatedly and measure its execution time.
`%timeit` measures the time of a single line of code.
`%%timeit`, on the other hand, measures the processing time of the entire cell.
in (this assumes `import numpy as np` and `import matplotlib.pyplot as plt` were run in an earlier cell, since `%%timeit` must be the first line of its own cell)
%%timeit
x = np.arange(10000)
fig, ax = plt.subplots()
ax.pie(x, shadow=True)
ax.axis('equal')
plt.show()
out
#Output of figures is omitted
12 s ± 418 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Mathematics takes time to write up on Qiita, so I will only introduce it briefly. I think it is a good idea to look closely at the graphs of these functions to see how they behave.
The function expressed by the following formula is called ** logarithmic function **.
f\left( x\right) =\log _{2}x
One way to express the magnitude of a vector as a scalar, i.e. to find its norm, is the **Manhattan distance** (L1 norm):
\left\| x\right\| _{1}=\left| x_{1}\right| +\left| x_{2}\right| +\ldots +\left| x_{n}\right|
Simply put, the absolute values of the elements of the vector are added together. The **Euclidean distance** (L2 norm), by contrast, is the square root of the sum of the squared elements:
\left\| x\right\| _{2}=\sqrt{x_{1}^{2}+x_{2}^{2}+\ldots +x_{n}^{2}}
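Both norms can be computed with NumPy's `np.linalg.norm` (the vector `[3, -4]` is just a convenient example):

```python
import numpy as np

x = np.array([3, -4])

l1 = np.linalg.norm(x, ord=1)  # Manhattan: |3| + |-4| = 7
l2 = np.linalg.norm(x, ord=2)  # Euclidean: sqrt(3**2 + (-4)**2) = 5
print(l1, l2)  # 7.0 5.0
```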
Multiplying an m × s matrix by an s × n matrix gives an m × n matrix.
As with an m × s matrix and an x × n matrix where x ≠ s, matrices cannot be multiplied unless the inner dimensions match.
Also, unlike ordinary multiplication of numbers, matrix multiplication generally gives a different result when the order of the factors is changed.
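A small NumPy sketch of both points (the shapes and values are arbitrary examples):

```python
import numpy as np

a = np.arange(6).reshape(2, 3)    # a 2x3 matrix
b = np.arange(12).reshape(3, 4)   # a 3x4 matrix

c = a @ b            # (2x3) @ (3x4) -> 2x4
print(c.shape)       # (2, 4)
# b @ a would raise ValueError: the inner dimensions 4 and 2 do not match

# Even for square matrices, the order changes the result
p = np.array([[1, 2], [3, 4]])
q = np.array([[0, 1], [1, 0]])
print(np.array_equal(p @ q, q @ p))  # False
```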
The function $f\left( x\right) =e^{x}$ **does not change even when differentiated**.
f'\left( x\right) =e^{x}
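This can be checked numerically with a forward difference (the point x = 1 and the step h are arbitrary choices):

```python
import math

x = 1.0
h = 1e-6

# (e^(x+h) - e^x) / h approximates the derivative, which is e^x itself
numeric = (math.exp(x + h) - math.exp(x)) / h
print(numeric, math.exp(x))  # both are approximately 2.71828...
```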
4.1 NumPy
You can check the **data type of the elements** of a NumPy ndarray with the `dtype` attribute.
Incidentally, Python's built-in `type` function checks the type of the array object itself (`numpy.ndarray`).
in
a = np.array([1, 2, 3])
print("ndarray dtype: {}".format(a.dtype))
print("ndarray type: {}".format(type(a)))
out
ndarray dtype: int32
ndarray type: <class 'numpy.ndarray'>
With ndarray, the operation b = a creates a **reference**. (If you change a value of b, the value of **a also changes**.)
If you use b = a.copy() instead, b is a **copy**. (Changing a value of b does not change the value of **a**.)
If you slice a standard Python list, you get a **copy**, but if you slice a NumPy ndarray, you get a **reference** (a view).
If you try various combinations, you will get a better understanding.
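Trying the cases side by side makes the difference concrete:

```python
import numpy as np

a = np.array([1, 2, 3])
b = a              # reference: b and a are the same object
b[0] = 100
print(a)           # [100   2   3] -- a changed too

a = np.array([1, 2, 3])
c = a.copy()       # independent copy
c[0] = 100
print(a)           # [1 2 3] -- a is unchanged

a = np.array([1, 2, 3])
s = a[0:2]         # NumPy slice: a view (reference)
s[0] = 100
print(a)           # [100   2   3] -- a changed through the slice

lst = [1, 2, 3]
t = lst[0:2]       # list slice: a copy
t[0] = 100
print(lst)         # [1, 2, 3] -- the list is unchanged
```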
nan
Use `np.nan` to represent a non-number (missing value) in NumPy.
in
a = np.array([1, np.nan, 3])
print(a)
out
[ 1. nan 3.]
The vsplit function splits a matrix in the **row direction**, and the hsplit function splits it in the **column direction**.
in
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
first1, second1 = np.vsplit(a, [2])
first2, second2 = np.hsplit(second1, [2])
print(second2)
out
[[9]]
Use the mean method to find the mean of the matrix.
in
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
a.mean()
out
5.0
Applying a comparison operator to an ndarray performs the comparison element-wise and returns an array of True / False values.
in
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
a > 4
out
array([[False, False, False],
[False, True, True],
[ True, True, True]])
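The resulting boolean array can be used directly as a mask to extract the matching elements:

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
mask = a > 4
print(a[mask])  # [5 6 7 8 9] -- a flattened array of the True positions
```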
4.2 Pandas
Use the **loc / iloc methods** to extract data from a DataFrame by specifying indexes or columns.
The loc method specifies them by **index name and column name**. The iloc method specifies them by **integer position or range**.
in
df = pd.DataFrame([[1, 2, 3], [5, 7, 11], [13, 17, 19]])
df.index = ["01", "02", "03"]
df.columns = ["A", "B", "C"]
display(df.loc[["01", "03"], ["A", "C"]])
display(df.iloc[[0, 2], [0, 2]])
Data is written with the `to_xxx` methods and read with the `read_xxx` functions.
Excel, CSV, pickle, and other formats are supported.
in
df.to_excel("FileName.xlsx")
df = pd.read_excel("FileName.xlsx")
Data is sorted with the sort_values method. **By default, the sort is in ascending order.** Pass `ascending=False` as an argument to sort in descending order.
in
df = pd.DataFrame([[1, 2, 3], [5, 7, 11], [13, 17, 19]])
df.index = ["01", "02", "03"]
df.columns = ["A", "B", "C"]
df.sort_values(by="C", ascending=False)
You can convert categorical data to one-hot encoding with the `get_dummies` function.
One-hot encoding adds **one column per category** of a categorical variable.
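A minimal sketch (the `size` column and its categories are arbitrary examples):

```python
import pandas as pd

df = pd.DataFrame({"size": ["S", "M", "L", "M"]})

# One column is added per category (L, M, S)
dummies = pd.get_dummies(df["size"])
print(dummies.columns.tolist())  # ['L', 'M', 'S']
```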
Use the `date_range` function to get an array of dates.
You can set the dates with the **start** and **end** arguments.
in
dates = pd.date_range(start="2020-01-01", end="2020-12-31")
print(dates)
out
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
'2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
'2020-01-09', '2020-01-10',
...
'2020-12-22', '2020-12-23', '2020-12-24', '2020-12-25',
'2020-12-26', '2020-12-27', '2020-12-28', '2020-12-29',
'2020-12-30', '2020-12-31'],
dtype='datetime64[ns]', length=366, freq='D')
4.3 Matplotlib
Specify the number of subplots to place in the arguments of the subplots method. **A single number places that many rows of subplots; the ncols argument places that many columns**.
in
fig, axes = plt.subplots(2)
plt.show()
in
fig, axes = plt.subplots(ncols=2)
plt.show()
Scatter plots can be drawn with the scatter method.
Histograms can be drawn with the hist method.
You can specify the number of bins with the **bins** argument.
Pie charts can be drawn with the **pie** method.
By default, a pie chart is drawn **counterclockwise** starting from the right.
For colors, you can specify **the color names defined in HTML/X11 or CSS4**. Font styles can be **defined in a dictionary and applied collectively, or applied individually**.
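A minimal sketch (the color names and the contents of the font dictionary are arbitrary choices):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so no window is needed
import matplotlib.pyplot as plt

# Define a font style once in a dictionary and apply it to several labels
font = {"size": 14, "color": "navy", "family": "serif"}

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4], color="orchid")  # a CSS4/X11 color name
ax.set_title("Title", fontdict=font)
ax.set_xlabel("x", fontdict=font)
ax.set_ylabel("y", fontdict=font)
fig.savefig("style_sample.png")
```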
4.4 scikit-learn
The dataset for a classification model is divided into **training data** and **test data**. This is because the model's **generalization ability**, its performance on data not seen during training, needs to be evaluated.
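With scikit-learn this split is done with `train_test_split` (the Iris dataset and the 30% test ratio are arbitrary example choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features

# Hold out 30% of the samples as test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

print(X_train.shape, X_test.shape)  # (105, 4) (45, 4)
```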
A decision tree has the advantage that the model can be visualized and its contents are easy to understand. Its parameters must be set by the user. The goal of a decision tree is to **maximize information gain**, or equivalently to **minimize impurity**. (The two mean the same thing.)
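A minimal sketch with scikit-learn's `DecisionTreeClassifier` (the dataset and `max_depth=3` are arbitrary example choices; `max_depth` is one of the parameters the user must set):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# max_depth is a user-set parameter; the split criterion reduces impurity
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on the held-out test data
```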
Dimensionality reduction is the task of reducing the number of dimensions while losing as little of the information in the data as possible. For example, from 2-D data with variables X and Y, you can drop an unimportant Y to obtain 1-D data consisting of X only.
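One common dimensionality-reduction technique is principal component analysis (PCA); a minimal sketch with synthetic 2-D data whose second axis carries almost no variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic 2-D data: the first axis varies a lot, the second hardly at all
data = np.column_stack([
    rng.normal(scale=10.0, size=100),   # informative axis
    rng.normal(scale=0.1, size=100),    # nearly constant axis
])

pca = PCA(n_components=1)
reduced = pca.fit_transform(data)       # 2-D -> 1-D
print(reduced.shape)                    # (100, 1)
print(pca.explained_variance_ratio_)    # close to [1.0]: little information lost
```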
The ROC curve is drawn by sorting the data in descending order of predicted probability and, at each probability, predicting all data at or above it to be positive examples. The closer the AUC is to 1, the more strongly samples with relatively high probabilities tend to be positives and samples with relatively low probabilities tend to be negatives. In other words, AUC lets you compare the quality of different models.
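AUC can be computed with scikit-learn's `roc_auc_score` (the labels and probabilities below are toy values):

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]            # true labels
y_score = [0.1, 0.4, 0.35, 0.8]  # predicted probabilities of the positive class

# Fraction of (negative, positive) pairs ranked correctly: 3 out of 4
print(roc_auc_score(y_true, y_score))  # 0.75
```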
Reference: A new textbook for data analysis using Python