I passed the Python data analysis certification exam, so here is a summary of the key points.
Supervised learning is a learning method that uses labels giving the correct answers.
The target data, i.e. the correct label, is called the **objective variable**.
The data other than the objective variable are called the **explanatory variables**.
Supervised learning predicts the objective variable using the **explanatory variables**.
Unsupervised learning, on the other hand, is a learning method that does not use correct labels. Since there are no correct labels, it is **a learning method without an objective variable**.
In classification, which is supervised learning, **how many groups to divide into is clearly defined in advance**. For example, if you want to classify dogs and cats, you would divide them into two groups.
Clustering, on the other hand, is unsupervised learning, and **it is not known in advance how many groups there will be**. Maybe there are 3 groups, maybe 5.
Machine learning proceeds in this order:
Get data -> Process data -> Visualize data -> Select algorithm -> Train -> Evaluate accuracy -> Trial operation -> Use results (run the service)
Above all, machine learning needs **data**.
The main packages for data analysis are NumPy, pandas, Matplotlib, SciPy, and scikit-learn.
Django is a web application framework, so it is never used for data analysis, no matter what.
Although SciPy has little presence in reference books, it too is a package used for data analysis.
Adding the `-U` option to the pip command updates the installed libraries to their latest versions.
To explicitly install the latest versions, it looks like this:
$ pip install -U numpy pandas
Use the **strip** method to remove the **leading and trailing whitespace characters**.
in
bird = ' Condor Penguin Duck '
print("before strip: {}".format(bird))
print("after strip: {}".format(bird.strip()))
out
before strip:  Condor Penguin Duck 
after strip: Condor Penguin Duck
The **pickle module** serializes Python objects so that they can be written to and read from files.
If you want to handle file paths in Python, use the **pathlib module**.
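As a minimal sketch combining the two (the file name and dictionary contents are arbitrary examples):

```python
import pickle
from pathlib import Path

data = {"name": "Condor", "count": 3}
path = Path("sample.pickle")  # arbitrary file name

# Serialize the object and write it to the file
path.write_bytes(pickle.dumps(data))

# Read the bytes back and deserialize them
restored = pickle.loads(path.read_bytes())
print(restored)  # {'name': 'Condor', 'count': 3}

path.unlink()  # remove the temporary file
```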
Jupyter Notebook has so-called **magic commands**.
For example, `%timeit` and `%%timeit`.
Both are commands that execute a program repeatedly and measure its execution time.
`%timeit` measures the time of a single line of code.
`%%timeit`, on the other hand, measures the processing time of the entire cell.
in (this assumes `import numpy as np` and `import matplotlib.pyplot as plt` were run in an earlier cell, since `%%timeit` must be the first line of its own cell)
%%timeit
x = np.arange(10000)
fig, ax = plt.subplots()
ax.pie(x, shadow=True)
ax.axis('equal')
plt.show()
out
#Output of figures is omitted
12 s ± 418 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Mathematics takes time to write up on Qiita, so I will only introduce it briefly. I think it is a good idea to look closely at the graphs of these functions to see how they behave.
The function expressed by the following formula is called ** logarithmic function **.
f\left( x\right) =\log _{2}x
One way to express the magnitude of a vector as a scalar, i.e. to find its norm, is the **Manhattan distance** (L1 norm):
\left\| x\right\| _{1}=\left| x_{1}\right| +\left| x_{2}\right| +\ldots +\left| x_{n}\right|
Simply put, the absolute values of the elements of the vector are added together. The **Euclidean distance** (L2 norm), by contrast, is the square root of the sum of the squared elements:
\left\| x\right\| _{2}=\sqrt{x_{1}^{2}+x_{2}^{2}+\ldots +x_{n}^{2}}
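Both norms can be computed with NumPy's `np.linalg.norm` (the vector `[3, -4]` is just a convenient example):

```python
import numpy as np

x = np.array([3, -4])

l1 = np.linalg.norm(x, ord=1)  # Manhattan: |3| + |-4| = 7
l2 = np.linalg.norm(x, ord=2)  # Euclidean: sqrt(3**2 + (-4)**2) = 5
print(l1, l2)  # 7.0 5.0
```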
Multiplying an m × s matrix by an s × n matrix gives an m × n matrix.
As with an m × s matrix and an x × n matrix where x ≠ s, matrices cannot be multiplied unless the inner dimensions match.
Also, unlike ordinary multiplication of numbers, matrix multiplication generally gives a different result when the order of the factors is changed.
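A small NumPy sketch of both points (the shapes and values are arbitrary examples):

```python
import numpy as np

a = np.arange(6).reshape(2, 3)    # a 2x3 matrix
b = np.arange(12).reshape(3, 4)   # a 3x4 matrix

c = a @ b            # (2x3) @ (3x4) -> 2x4
print(c.shape)       # (2, 4)
# b @ a would raise ValueError: the inner dimensions 4 and 2 do not match

# Even for square matrices, the order changes the result
p = np.array([[1, 2], [3, 4]])
q = np.array([[0, 1], [1, 0]])
print(np.array_equal(p @ q, q @ p))  # False
```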
The function $f\left( x\right) =e^{x}$ **does not change even when differentiated**.
f'\left( x\right) =e^{x}
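This can be checked numerically with a forward difference (the point x = 1 and the step h are arbitrary choices):

```python
import math

x = 1.0
h = 1e-6

# (e^(x+h) - e^x) / h approximates the derivative, which is e^x itself
numeric = (math.exp(x + h) - math.exp(x)) / h
print(numeric, math.exp(x))  # both are approximately 2.71828...
```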
4.1 NumPy
You can check the **data type of the elements** of a NumPy ndarray with the `dtype` attribute.
Incidentally, Python's built-in `type` function checks the type of the array object itself (`numpy.ndarray`).
in
a = np.array([1, 2, 3])
print("ndarray dtype: {}".format(a.dtype))
print("ndarray type: {}".format(type(a)))
out
ndarray dtype: int32
ndarray type: <class 'numpy.ndarray'>
With ndarray, the operation b = a creates a **reference**. (If you change a value of b, the value of **a also changes**.)
If you use b = a.copy() instead, b is a **copy**. (Changing a value of b does not change the value of **a**.)
If you slice a standard Python list, you get a **copy**, but if you slice a NumPy ndarray, you get a **reference** (a view).
If you try various combinations, you will get a better understanding.
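Trying the cases side by side makes the difference concrete:

```python
import numpy as np

a = np.array([1, 2, 3])
b = a              # reference: b and a are the same object
b[0] = 100
print(a)           # [100   2   3] -- a changed too

a = np.array([1, 2, 3])
c = a.copy()       # independent copy
c[0] = 100
print(a)           # [1 2 3] -- a is unchanged

a = np.array([1, 2, 3])
s = a[0:2]         # NumPy slice: a view (reference)
s[0] = 100
print(a)           # [100   2   3] -- a changed through the slice

lst = [1, 2, 3]
t = lst[0:2]       # list slice: a copy
t[0] = 100
print(lst)         # [1, 2, 3] -- the list is unchanged
```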
nan
Use `np.nan` to represent a non-number (missing value) in NumPy.
in
a = np.array([1, np.nan, 3])
print(a)
out
[ 1. nan 3.]
The vsplit function splits a matrix in the **row direction**, and the hsplit function splits it in the **column direction**.
in
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
first1, second1 = np.vsplit(a, [2])
first2, second2 = np.hsplit(second1, [2])
print(second2)
out
[[9]]
Use the mean method to find the mean of the matrix.
in
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
a.mean()
out
5.0
Applying a comparison operator to an ndarray performs the comparison element-wise and returns an array of True / False values.
in
a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
a > 4
out
array([[False, False, False],
[False, True, True],
[ True, True, True]])
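The resulting boolean array can be used directly as a mask to extract the matching elements:

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
mask = a > 4
print(a[mask])  # [5 6 7 8 9] -- a flattened array of the True positions
```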
4.2 Pandas
Use the **loc / iloc methods** to extract data from a DataFrame by specifying indexes or columns.
The loc method specifies them by **index name and column name**. The iloc method specifies them by **integer position or range**.
in
df = pd.DataFrame([[1, 2, 3], [5, 7, 11], [13, 17, 19]])
df.index = ["01", "02", "03"]
df.columns = ["A", "B", "C"]
display(df.loc[["01", "03"], ["A", "C"]])
display(df.iloc[[0, 2], [0, 2]])
Data is written with the `to_xxx` methods and read with the `read_xxx` functions.
Excel, CSV, pickle, and other formats are supported.
in
df.to_excel("FileName.xlsx")
df = pd.read_excel("FileName.xlsx")
Data is sorted with the sort_values method. **By default, the sort is in ascending order.** Pass `ascending=False` as an argument to sort in descending order.
in
df = pd.DataFrame([[1, 2, 3], [5, 7, 11], [13, 17, 19]])
df.index = ["01", "02", "03"]
df.columns = ["A", "B", "C"]
df.sort_values(by="C", ascending=False)
You can convert categorical data to one-hot encoding with the `get_dummies` function.
One-hot encoding adds **one column per category** of a categorical variable.
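A minimal sketch (the `size` column and its categories are arbitrary examples):

```python
import pandas as pd

df = pd.DataFrame({"size": ["S", "M", "L", "M"]})

# One column is added per category (L, M, S)
dummies = pd.get_dummies(df["size"])
print(dummies.columns.tolist())  # ['L', 'M', 'S']
```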
Use the `date_range` function to get an array of dates.
You can set the dates with the **start** and **end** arguments.
in
dates = pd.date_range(start="2020-01-01", end="2020-12-31")
print(dates)
out
DatetimeIndex(['2020-01-01', '2020-01-02', '2020-01-03', '2020-01-04',
'2020-01-05', '2020-01-06', '2020-01-07', '2020-01-08',
'2020-01-09', '2020-01-10',
...
'2020-12-22', '2020-12-23', '2020-12-24', '2020-12-25',
'2020-12-26', '2020-12-27', '2020-12-28', '2020-12-29',
'2020-12-30', '2020-12-31'],
dtype='datetime64[ns]', length=366, freq='D')
4.3 Matplotlib
Specify the number of subplots to place in the arguments of the subplots method. **A single number places that many rows of subplots; the ncols argument places that many columns**.
in
fig, axes = plt.subplots(2)
plt.show()
in
fig, axes = plt.subplots(ncols=2)
plt.show()
Scatter plots can be drawn with the scatter method.
Histograms can be drawn with the hist method.
You can specify the number of bins with the **bins** argument.
Pie charts can be drawn with the **pie** method.
By default, a pie chart is drawn **counterclockwise** starting from the right.
For colors, you can specify **the color names defined in HTML/X11 or CSS4**. Font styles can be **defined in a dictionary and applied collectively, or applied individually**.
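A minimal sketch (the color names and the contents of the font dictionary are arbitrary choices):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so no window is needed
import matplotlib.pyplot as plt

# Define a font style once in a dictionary and apply it to several labels
font = {"size": 14, "color": "navy", "family": "serif"}

fig, ax = plt.subplots()
ax.plot([0, 1, 2], [0, 1, 4], color="orchid")  # a CSS4/X11 color name
ax.set_title("Title", fontdict=font)
ax.set_xlabel("x", fontdict=font)
ax.set_ylabel("y", fontdict=font)
fig.savefig("style_sample.png")
```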
4.4 scikit-learn
The dataset for a classification model is divided into **training data** and **test data**. This is because the model's **generalization ability**, its performance on data not seen during training, needs to be evaluated.
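With scikit-learn this split is done with `train_test_split` (the Iris dataset and the 30% test ratio are arbitrary example choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)  # 150 samples, 4 features

# Hold out 30% of the samples as test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

print(X_train.shape, X_test.shape)  # (105, 4) (45, 4)
```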
A decision tree has the advantage that the model can be visualized and its contents are easy to understand. Its parameters must be set by the user. The goal of a decision tree is to **maximize information gain**, or equivalently to **minimize impurity**. (The two mean the same thing.)
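A minimal sketch with scikit-learn's `DecisionTreeClassifier` (the dataset and `max_depth=3` are arbitrary example choices; `max_depth` is one of the parameters the user must set):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# max_depth is a user-set parameter; the split criterion reduces impurity
clf = DecisionTreeClassifier(max_depth=3, random_state=0)
clf.fit(X_train, y_train)
print(clf.score(X_test, y_test))  # accuracy on the held-out test data
```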
Dimensionality reduction is the task of reducing the number of dimensions while losing as little of the information in the data as possible. For example, from 2-D data with variables X and Y, you can drop an unimportant Y to obtain 1-D data consisting of X only.
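One common dimensionality-reduction technique is principal component analysis (PCA); a minimal sketch with synthetic 2-D data whose second axis carries almost no variance:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic 2-D data: the first axis varies a lot, the second hardly at all
data = np.column_stack([
    rng.normal(scale=10.0, size=100),   # informative axis
    rng.normal(scale=0.1, size=100),    # nearly constant axis
])

pca = PCA(n_components=1)
reduced = pca.fit_transform(data)       # 2-D -> 1-D
print(reduced.shape)                    # (100, 1)
print(pca.explained_variance_ratio_)    # close to [1.0]: little information lost
```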
The ROC curve is drawn by sorting the data in descending order of predicted probability and, at each probability, predicting all data at or above it to be positive examples. The closer the AUC is to 1, the more strongly samples with relatively high probabilities tend to be positives and samples with relatively low probabilities tend to be negatives. In other words, AUC lets you compare the quality of different models.
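AUC can be computed with scikit-learn's `roc_auc_score` (the labels and probabilities below are toy values):

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]            # true labels
y_score = [0.1, 0.4, 0.35, 0.8]  # predicted probabilities of the positive class

# Fraction of (negative, positive) pairs ranked correctly: 3 out of 4
print(roc_auc_score(y_true, y_score))  # 0.75
```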
Reference: A new textbook for data analysis using Python