One of Python's strengths is its large number of modules and libraries. Instead of defining general-purpose functions and classes yourself one by one, you can save a lot of work by reusing modules you wrote in the past or open source libraries created by others.
Now let's use a typical library!
Numpy
NumPy is a library for fast and efficient numerical calculations in Python. Use the `import` statement to load a library.
import numpy as np
By adding `as <name>` to the end of the `import` statement, you can refer to the module by that name in your code. Here, `numpy` is imported as `np`.
NumPy uses its own array type, the `ndarray` class. Objects of the `ndarray` class have attributes such as the following:
import numpy as np
#Declaration of numpy array
a = np.arange(15).reshape(3, 5)
b = np.array([1.2,3.5,5.1])
c = [(1.5,2,3),(4,5,6)] #Note: c is a plain Python list of tuples, not an ndarray
print(a)
print(b)
print(c)
#Array shape
print(a.shape)
#Array dimensions
print(a.ndim)
#Array element types
print(a.dtype.name)
#Number of elements in the array
print(a.size)
#Display of array
#Array a
[[ 0  1  2  3  4]
 [ 5  6  7  8  9]
 [10 11 12 13 14]]
#Array b
[1.2 3.5 5.1]
#Array c
[(1.5, 2, 3), (4, 5, 6)]
#Array shape
(3, 5)
#Number of dimensions of array
2
#Array element types
int64
#Number of elements in the array
15
You can generate a NumPy array with the **array function**. Multidimensional arrays can also be created easily. Use the **shape attribute** to check the number of elements along each axis, such as rows and columns, and the **ndim attribute** to check the number of dimensions. You can check the total number of elements with the **size attribute**. To check the type of the array elements (int, float, etc.), use the **dtype attribute**.
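As a small illustration of these attributes, the sketch below builds a three-dimensional array and an explicitly typed array. The variable names `cube` and `floats` are made up for this example.

```python
import numpy as np

# A 3-D array: 2 blocks, each with 3 rows and 4 columns
cube = np.arange(24).reshape(2, 3, 4)
print(cube.shape)  # (2, 3, 4)
print(cube.ndim)   # 3
print(cube.size)   # 24

# The element type can be requested explicitly with dtype
floats = np.array([1, 2, 3], dtype=np.float64)
print(floats.dtype.name)  # float64
```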
You can also generate a zero matrix, a matrix with all elements set to 1, and an empty (uninitialized) matrix.
import numpy as np
print(np.zeros((3,4)))
print(np.ones((2,3,4),dtype = np.int16))
print(np.empty((2,3)))
#Zero matrix
[[0. 0. 0. 0.]
 [0. 0. 0. 0.]
 [0. 0. 0. 0.]]
#Matrix with all elements 1
[[[1 1 1 1]
  [1 1 1 1]
  [1 1 1 1]]

 [[1 1 1 1]
  [1 1 1 1]
  [1 1 1 1]]]
#Empty Numpy matrix (the values are uninitialized and differ from run to run)
[[1.39069238e-309 1.39069238e-309 1.39069238e-309]
 [1.39069238e-309 1.39069238e-309 1.39069238e-309]]
Challenges
Make a 5 × 5 identity matrix starting from a zero matrix. How you set the diagonal elements is up to you.
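One possible approach to this challenge, shown only as a sketch: `np.fill_diagonal` writes a value onto the diagonal of an existing array in place. The variable name `identity` is chosen for this example, and `np.eye` is used only to check the result.

```python
import numpy as np

# Start from a 5x5 zero matrix
identity = np.zeros((5, 5))

# np.fill_diagonal modifies the array in place
np.fill_diagonal(identity, 1)
print(identity)

# np.eye builds an identity matrix directly, so it serves as a check
print(np.array_equal(identity, np.eye(5)))  # True
```

A loop such as `for i in range(5): identity[i, i] = 1` would work equally well.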
As shown earlier, you can also use the **arange function** to generate a NumPy array whose elements follow a specified sequence.
`np.arange(stop)` generates the integers n with 0 <= n < stop in order.
`np.arange(start, stop)` generates the integers n with start <= n < stop in order.
`np.arange(start, stop, step)` generates the values n with start <= n < stop in increments of step. The step may also be a decimal number.
print(np.arange(10))
print(np.arange(0,10))
print(np.arange(10,30,5))
print(np.arange(0,2,0.3))
[0 1 2 3 4 5 6 7 8 9]
[0 1 2 3 4 5 6 7 8 9]
[10 15 20 25]
[0.  0.3 0.6 0.9 1.2 1.5 1.8]
There is also a **linspace function** that divides the range [start, stop] into the specified number of evenly spaced elements. Unlike arange, the stop value is included by default.
print(np.linspace(0,2,9))
[0.   0.25 0.5  0.75 1.   1.25 1.5  1.75 2.  ]
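The relationship between `linspace` and `arange` can be sketched as follows. The `endpoint` and `retstep` keyword arguments belong to `np.linspace`; the variable names are assumptions made for this example.

```python
import numpy as np

# By default the stop value is included
closed = np.linspace(0, 2, 9)
print(closed[-1])  # 2.0

# endpoint=False excludes stop, which matches np.arange with a fixed step
half_open = np.linspace(0, 2, 8, endpoint=False)
print(np.allclose(half_open, np.arange(0, 2, 0.25)))  # True

# retstep=True additionally returns the spacing between samples
samples, step = np.linspace(0, 2, 9, retstep=True)
print(step)  # 0.25
```

When the number of points matters more than the exact step, `linspace` tends to be the safer choice, because a decimal step in `arange` is subject to floating-point rounding.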
Arithmetic operations on NumPy arrays are basically performed element-wise, between the elements at the same row and column.
A = np.array([[1,1],[0,1]])
B = np.array([[2,0],[3,4]])
print(A+B)
print(A-B)
print(A*B)
#A+B
[[3 1]
 [3 5]]
#A-B
[[-1  1]
 [-3 -3]]
#A*B
[[2 0]
 [0 4]]
If you want the matrix product rather than the element-wise product, use the `@` operator or the `dot` function.
print(A@B)
print(A.dot(B))
#A@B
[[5 4]
 [3 4]]
#A.dot(B)
[[5 4]
 [3 4]]
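To make the difference between the two kinds of product concrete, here is a short sketch using the same A and B. `np.matmul` is the function form of the `@` operator; the names `elementwise` and `matrix_product` are chosen for this example.

```python
import numpy as np

A = np.array([[1, 1], [0, 1]])
B = np.array([[2, 0], [3, 4]])

elementwise = A * B      # multiplies elements at matching positions
matrix_product = A @ B   # rows of A times columns of B

print(elementwise)
print(matrix_product)

# np.matmul is the function form of the @ operator
print(np.array_equal(matrix_product, np.matmul(A, B)))  # True

# Matrix multiplication is generally not commutative
print(np.array_equal(A @ B, B @ A))  # False
```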
Exponentiation of NumPy arrays is also performed element-wise: for `A ** B` (or `np.power(A, B)`), each `A[i]` is raised to the power `B[i]`. The square root is likewise computed for each element.
a = np.arange(1, 11, 1)
b = np.array([1,2,1,2,1,2,1,2,1,2])
print(np.power(a, b))
print(np.sqrt(a))
[  1   4   3  16   5  36   7  64   9 100]
[1.         1.41421356 1.73205081 2.         2.23606798 2.44948974
 2.64575131 2.82842712 3.         3.16227766]
NumPy also supports trigonometric and hyperbolic functions. Note that the argument is in **radians**.
print(np.sin(0))
print(np.cos(0))
print(np.tan(np.pi/4))
print(np.tanh(2))
0.0
1.0
0.9999999999999999
0.9640275800758169
There are also inverse trigonometric functions. Their output is likewise in **radians**.
print(np.arcsin(1.0))
print(np.arcsin(1.0)*2/np.pi)
print(np.arccos(-1.0))
print(np.arctan(-0.5))
#arcsin(1.0)
1.5707963267948966
#arcsin(1.0)*2/np.pi
1.0
#arccos(-1.0)
3.141592653589793
#arctan(-0.5)
-0.4636476090008061
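Because both the arguments and the results are in radians, angles given in degrees need converting. A minimal sketch using `np.deg2rad` and `np.rad2deg` follows; the variable names are assumptions for illustration.

```python
import numpy as np

# Convert degrees to radians before calling a trigonometric function
angle_deg = 30.0
sin_30 = np.sin(np.deg2rad(angle_deg))
print(sin_30)  # approximately 0.5

# Convert the result of an inverse function back to degrees
asin_deg = np.rad2deg(np.arcsin(1.0))
print(asin_deg)  # approximately 90
```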
There are several functions for counting the elements of a NumPy array.
onehot = np.array([0, 1, 1, 0, 1, 0, 1, 1])
#Example of counting the number of 1
print(np.count_nonzero(onehot))
print(len(np.where(onehot != 0)[0]))
#Example of counting the number of 0s
print(np.count_nonzero(1 - onehot))
#Example of counting 0 and 1 at the same time
#Use of unique function
print(np.unique(onehot, return_counts=True))
#Display in dictionary type
unique, count = np.unique(onehot, return_counts=True)
print(dict(zip(unique, count)))
#Use of bincount function
c = np.bincount(onehot)
print(c)
#Number of 1
5 #count_nonzero
5 #len(np.where(onehot != 0)[0])
#Number of 0
3 #count_nonzero(1 - onehot)
#Count at the same time
(array([0, 1]), array([3, 5])) #unique
{0: 3, 1: 5} #Dictionary type
[3 5] #bincount
In particular, the `unique` function and the `bincount` function count the occurrences of every distinct value. The difference between the two is that the `unique` function does not report counts for values that do not appear in the array, whereas `bincount` reports 0 for them.
A = np.array([2,3,4,3,4,4,4,6,7,1])
#Use of unique function
u,c = np.unique(A,return_counts=True)
print(dict(zip(u,c)))
#Use of bincount function
print(np.bincount(A))
#Use of unique function
{1: 1, 2: 1, 3: 2, 4: 4, 6: 1, 7: 1}
#Use of bincount function
[0 1 1 2 4 0 1 1]
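Two further details may be worth knowing, sketched here under the assumption that a fixed-length result is wanted: `np.bincount` accepts a `minlength` argument, and it only handles non-negative integers, whereas `np.unique` also works on other values such as floats.

```python
import numpy as np

A = np.array([2, 3, 4, 3, 4, 4, 4, 6, 7, 1])

# minlength pads the result so that indices 0..9 are always present
counts = np.bincount(A, minlength=10)
print(counts)  # [0 1 1 2 4 0 1 1 0 0]

# np.unique also works for values bincount cannot handle, such as floats
values, freq = np.unique(np.array([0.5, 0.5, -1.0]), return_counts=True)
print(values, freq)
```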
Challenges
Create an array of the values of the sin function evaluated in steps of π / 4 from 0 to 4π, and count each element.
Use the `argmax` and `argmin` functions to retrieve the indexes of the largest and smallest elements of an array.
The index is the position of the element within the array.
A = np.array([2,4,3,6,7,8,8,5,4,7,8])
#Maximum value of the element
print(np.max(A))
#Maximum index
print(np.argmax(A))
print(A.argmax())
#Minimum value of element
print(np.min(A))
#Minimum index
print(np.argmin(A))
print(A.argmin())
#Maximum value of the element
8
#Index of maximum value of element
5
5
#Minimum value of element
2
#Index of the minimum value of the element
0
0
The `argmax` and `argmin` functions each come in two forms: `np.argmax` (`np.argmin`) and `np.ndarray.argmax` (`np.ndarray.argmin`). The `np.argmax` function takes the array as its first argument, while `np.ndarray.argmax` is called as a method of the array. Both output `5`, the index of the largest element of array A, counting from index 0. When the maximum occurs more than once, as 8 does here, the index of the first occurrence is returned. Both forms accept an `axis` parameter, which searches for the maximum and minimum values along the specified axis of a multidimensional array.
B = np.array([[2,4,5],[9,2,8]])
#Maximum value along axis=0 (for each column)
print(B.max(axis=0))
#Index of the maximum value along axis=0
print(B.argmax(axis=0))
#Maximum value along axis=1 (for each row)
print(B.max(axis=1))
#Index of the maximum value along axis=1
print(B.argmax(axis=1))
#Maximum value along axis=0 (for each column)
[9 4 8]
#Index of the maximum value along axis=0
[1 0 1]
#Maximum value along axis=1 (for each row)
[5 9]
#Index of the maximum value along axis=1
[2 0]
When `axis=0`, the index of the maximum value of each column of the array is output. When `axis=1`, the index of the maximum value of each row is output.
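When no `axis` is given for a multidimensional array, `argmax` returns an index into the flattened array. A sketch of converting that index back to a row and column position with `np.unravel_index` follows; the variable names are chosen for this example.

```python
import numpy as np

B = np.array([[2, 4, 5], [9, 2, 8]])

# Without axis, argmax returns an index into the flattened array
flat_index = B.argmax()
print(flat_index)  # 3

# np.unravel_index converts it back to a (row, column) position
row, col = np.unravel_index(flat_index, B.shape)
print(row, col)     # 1 0
print(B[row, col])  # 9
```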
Challenges
For the array from the previous challenge, find the maximum and minimum values and the index at which each first occurs.
Pandas
pandas is a fast, powerful, flexible and easy-to-use open source data analysis and manipulation tool. First, let's import the library.
import pandas as pd
Next, let's load the Iris dataset from the UCI Machine Learning Repository.
In pandas, a data table is called a **DataFrame**.
The variable name `df` is simply an abbreviation of DataFrame and has no special meaning, but DataFrames are conventionally named `df`.
import pandas as pd
#Displaying data frames
df = pd.read_csv('https://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',header=None)
print(df)
#Displaying data frames
0 1 2 3 4
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
.. ... ... ... ... ...
145 6.7 3.0 5.2 2.3 Iris-virginica
146 6.3 2.5 5.0 1.9 Iris-virginica
147 6.5 3.0 5.2 2.0 Iris-virginica
148 6.2 3.4 5.4 2.3 Iris-virginica
149 5.9 3.0 5.1 1.8 Iris-virginica
[150 rows x 5 columns]
The data has been loaded successfully.
If you want to check only the beginning or the end, use the `head` and `tail` functions.
#Display of the first 10 lines
print(df.head(n=10))
#Display of the last 5 lines
print(df.tail(n=5))
#Display of the first 10 lines
0 1 2 3 4
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
5 5.4 3.9 1.7 0.4 Iris-setosa
6 4.6 3.4 1.4 0.3 Iris-setosa
7 5.0 3.4 1.5 0.2 Iris-setosa
8 4.4 2.9 1.4 0.2 Iris-setosa
9 4.9 3.1 1.5 0.1 Iris-setosa
#Display of the last 5 lines
0 1 2 3 4
145 6.7 3.0 5.2 2.3 Iris-virginica
146 6.3 2.5 5.0 1.9 Iris-virginica
147 6.5 3.0 5.2 2.0 Iris-virginica
148 6.2 3.4 5.4 2.3 Iris-virginica
149 5.9 3.0 5.1 1.8 Iris-virginica
Let's check the shape of the data set.
#Check the number of rows and columns
print(df.shape)
#Check index
print(df.index)
#Check the column
print(df.columns)
#Check the data type of each column of dataframe
print(df.dtypes)
#Check the number of rows and columns
(150, 5)
#Check index
RangeIndex(start=0, stop=150, step=1)
#Check the column
Int64Index([0, 1, 2, 3, 4], dtype='int64')
#Check the data type of each column of dataframe
0 float64
1 float64
2 float64
3 float64
4 object
dtype: object
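Because the file has no header row, the columns above are labelled 0 to 4. The sketch below shows one way to assign names at load time with the `names` argument of `read_csv`. The column names are hypothetical labels chosen for this example, and a few rows are read from an in-memory string instead of the URL so that the example runs without a network connection.

```python
from io import StringIO
import pandas as pd

# Three rows in the same layout as iris.data (no header row)
sample = """5.1,3.5,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica"""

# Hypothetical column names chosen for this example
names = ["sepal_length", "sepal_width", "petal_length", "petal_width", "species"]
iris = pd.read_csv(StringIO(sample), header=None, names=names)

print(iris.columns.tolist())
print(iris.dtypes)
```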
Summary statistics of a DataFrame can be obtained with the `describe()` function.
print(df.describe())
0 1 2 3
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.054000 3.758667 1.198667
std 0.828066 0.433594 1.764420 0.763161
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
You can also retrieve only part of the data frame.
#Get rows 100 to 109
print(df[100:110],"\n\n")
#Get only row 100
print(df.loc[100],"\n\n")
#Get only column 4 of rows 0 to 99
print(df.iloc[0:100,4])
0 1 2 3 4
100 6.3 3.3 6.0 2.5 Iris-virginica
101 5.8 2.7 5.1 1.9 Iris-virginica
102 7.1 3.0 5.9 2.1 Iris-virginica
103 6.3 2.9 5.6 1.8 Iris-virginica
104 6.5 3.0 5.8 2.2 Iris-virginica
105 7.6 3.0 6.6 2.1 Iris-virginica
106 4.9 2.5 4.5 1.7 Iris-virginica
107 7.3 2.9 6.3 1.8 Iris-virginica
108 6.7 2.5 5.8 1.8 Iris-virginica
109 7.2 3.6 6.1 2.5 Iris-virginica
0 6.3
1 3.3
2 6
3 2.5
4 Iris-virginica
Name: 100, dtype: object
0 Iris-setosa
1 Iris-setosa
2 Iris-setosa
3 Iris-setosa
4 Iris-setosa
...
95 Iris-versicolor
96 Iris-versicolor
97 Iris-versicolor
98 Iris-versicolor
99 Iris-versicolor
Name: 4, Length: 100, dtype: object
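The difference between `loc` and `iloc` is easy to miss when the index labels happen to equal the row positions, as they do above. The following sketch uses a small made-up frame (the name `frame` and its values are assumptions) whose labels differ from the positions.

```python
import pandas as pd

# Index labels deliberately differ from positions
frame = pd.DataFrame({"value": [10, 20, 30, 40]}, index=[100, 101, 102, 103])

# loc selects by label, and the end label is included
by_label = frame.loc[100:102]
print(by_label)     # rows labelled 100, 101, 102

# iloc selects by position, and the end position is excluded
by_position = frame.iloc[0:2]
print(by_position)  # first two rows
```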
When a feature has a natural order, such as clothing size, mapping its values to integers lets learning algorithms interpret the ordinal feature correctly. The mapping can also be inverted.
df = pd.DataFrame([
['green','M',10.1,'class1'],
['red','L',13.5,'class2'],
['blue','XL',15.3,'class1']
])
#Set column name
df.columns = ['color','size','price','classlabel']
print(df,"\n")
size_mapping = {'XL':3,'L':2,'M':1}
df['size']= df['size'].map(size_mapping)
print(df,"\n")
#Define a dictionary for the inverse transformation
inv_size_mapping = {v:k for k,v in size_mapping.items()}
df['size']=df['size'].map(inv_size_mapping)
print(df)
color size price classlabel
0 green M 10.1 class1
1 red L 13.5 class2
2 blue XL 15.3 class1
#After conversion
color size price classlabel
0 green 1 10.1 class1
1 red 2 13.5 class2
2 blue 3 15.3 class1
#After inverse conversion
color size price classlabel
0 green M 10.1 class1
1 red L 13.5 class2
2 blue XL 15.3 class1
Machine learning libraries often require class labels to be integer values, so encode them. As with ordinal features, an inverse transformation can be prepared.
class_mapping = {label:idx for idx, label in enumerate(np.unique(df['classlabel']))}
df['classlabel']=df['classlabel'].map(class_mapping)
print(df,"\n")
inv_class_mapping = {v : k for k, v in class_mapping.items()}
df['classlabel'] = df['classlabel'].map(inv_class_mapping)
print(df)
#After conversion
color size price classlabel
0 green M 10.1 0
1 red L 13.5 1
2 blue XL 15.3 0
#After inverse conversion
color size price classlabel
0 green M 10.1 class1
1 red L 13.5 class2
2 blue XL 15.3 class1
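As an alternative to building the mapping dictionary by hand, pandas provides `pd.factorize`, which returns integer codes together with the unique labels. This is only a sketch of a different technique from the one used above; the variable names are chosen for this example.

```python
import pandas as pd

labels = pd.Series(["class1", "class2", "class1"])

# sort=True assigns codes in sorted label order
codes, uniques = pd.factorize(labels, sort=True)
print(codes)    # [0 1 0]
print(uniques)

# Inverse transformation: index the unique labels with the codes
restored = uniques[codes]
print(list(restored))  # ['class1', 'class2', 'class1']
```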
In machine learning classification tasks, the categories are rarely given as numerical values such as class 1, class 2, ... from the start.
For example, when classifying by sex, it may be easier to convert the categories to dummy variables such as 0 and 1.
You may also want to convert multi-class features into a one-hot representation.
In such cases, the `get_dummies()` function is convenient.
Here, we perform **one-hot encoding** on color, which is a nominal feature, that is, a feature whose values have no order.
String columns such as `color` are one-hot encoded, while numeric columns are left as they are; the output below assumes that the `size` column holds the integer mapping from before.
pd.get_dummies(df[['price','color','size']])
price size color_blue color_green color_red
0 10.1 1 0 1 0
1 13.5 2 0 0 1
2 15.3 3 1 0 0
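A few options of `get_dummies` are sketched below on a rebuilt copy of the sample data: `columns` limits the encoding to the listed columns, `dtype=int` requests 0/1 integers (recent pandas versions return booleans by default), and `drop_first=True` drops one dummy column per feature. The names `encoded` and `reduced` are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "color": ["green", "red", "blue"],
    "size": [1, 2, 3],
    "price": [10.1, 13.5, 15.3],
})

# Encode only the color column; dtype=int gives 0/1 instead of booleans
encoded = pd.get_dummies(df, columns=["color"], dtype=int)
print(encoded)

# drop_first=True drops one dummy column to avoid redundancy
reduced = pd.get_dummies(df, columns=["color"], dtype=int, drop_first=True)
print(reduced.columns.tolist())
```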
Real data is not always complete. In such cases, you need to fill in the missing values (NaN).
from io import StringIO
#Creating sample data
csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''
df = pd.read_csv(StringIO(csv_data))
print(df)
print(df.describe())
Here, `StringIO` is used so that the string assigned to the variable `csv_data` can be read into a DataFrame object as if it were a file.
A B C D
0 1.0 2.0 3.0 4.0
1 5.0 6.0 NaN 8.0
2 10.0 11.0 12.0 NaN
A B C D
count 3.000000 3.000000 2.000000 2.000000
mean 5.333333 6.333333 7.500000 6.000000
std 4.509250 4.509250 6.363961 2.828427
min 1.000000 2.000000 3.000000 4.000000
25% 3.000000 4.000000 5.250000 5.000000
50% 5.000000 6.000000 7.500000 6.000000
75% 7.500000 8.500000 9.750000 7.000000
max 10.000000 11.000000 12.000000 8.000000
Two missing values were found in this DataFrame. When a DataFrame is huge, visual inspection is impractical, so find the missing data with the `isnull()` function.
print(df.isnull(),"\n")
print(df.isnull().sum())
A B C D
0 False False False False
1 False False True False
2 False False False True
A 0
B 0
C 1
D 1
dtype: int64
Now we know that there are missing values in columns C and D.
This time, we will fill them with the median of the other values in the same column.
Use the `fillna()` function to fill them in.
df["C"]=df["C"].fillna(df["C"].median())
df["D"]=df["D"].fillna(df["D"].median())
print(df.isnull().sum())
A 0
B 0
C 0
D 0
dtype: int64
This fills in the missing values.
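When every column is numeric, the per-column fill can be written in one step, since `df.median()` returns one median per column and `fillna` matches them by column name. A sketch on the same sample data (the name `filled` is chosen for this example):

```python
from io import StringIO
import pandas as pd

csv_data = """A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,"""
df = pd.read_csv(StringIO(csv_data))

# df.median() yields one median per column; fillna matches them by column name
filled = df.fillna(df.median())
print(filled)
```

The original `df` is left unchanged here, because `fillna` returns a new DataFrame by default.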
Alternatively, you can delete rows or columns that contain missing values with the `dropna()` method.
csv_data = '''A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,'''
df = pd.read_csv(StringIO(csv_data))
print(df.dropna(),"\n")
print(df.dropna(axis=1))
A B C D
0 1.0 2.0 3.0 4.0
A B
0 1.0 2.0
1 5.0 6.0
2 10.0 11.0
If no argument is given, the `dropna()` method deletes the rows that contain missing values (`axis=0`). Passing `axis=1` deletes the columns that contain missing values instead.
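`dropna` has a few more parameters that control which rows are removed. The sketch below shows `how`, `thresh` and `subset` on the same sample data; the result variable names are assumptions for illustration.

```python
from io import StringIO
import pandas as pd

csv_data = """A,B,C,D
1.0,2.0,3.0,4.0
5.0,6.0,,8.0
10.0,11.0,12.0,"""
df = pd.read_csv(StringIO(csv_data))

# how="all" drops only rows in which every value is missing
all_missing_dropped = df.dropna(how="all")
print(all_missing_dropped)

# thresh=4 keeps rows with at least 4 non-missing values
complete_rows = df.dropna(thresh=4)
print(complete_rows)

# subset restricts the check to the listed columns
c_present = df.dropna(subset=["C"])
print(c_present)
```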
Challenges from here
Get the dataset from Kaggle: https://www.kaggle.com/
Open My Account.
Click "Create New API Token".
The kaggle.json file should now be saved. Move on to Google Colab.
Enter the following:
from google.colab import files
files.upload()
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!chmod 600 /root/.kaggle/kaggle.json
!pip install kaggle
!kaggle competitions download -c titanic
When `files.upload()` runs, a file selection field appears, so select the kaggle.json file you saved earlier.
The commands below it then download the Titanic data, so use the training data (train.csv) from it.
df = pd.read_csv('/content/train.csv',header=None)
print(df.tail(n=10))
0 1 2 ... 9 10 11
882 882 0 3 ... 7.8958 NaN S
883 883 0 3 ... 10.5167 NaN S
884 884 0 2 ... 10.5 NaN S
885 885 0 3 ... 7.05 NaN S
886 886 0 3 ... 29.125 NaN Q
887 887 0 2 ... 13 NaN S
888 888 1 1 ... 30 B42 S
889 889 0 3 ... 23.45 NaN S
890 890 1 1 ... 30 C148 C
891 891 0 3 ... 7.75 NaN Q
[10 rows x 12 columns ]
Matplotlib
**Matplotlib** is a comprehensive library for creating static, animated, and interactive visualizations. Visualization is an important task: correlations that are hard to notice in raw numbers and text often become apparent once the data is plotted. Let's import it now.
import matplotlib.pyplot as plt
Let's draw a simple line graph.
plt.plot([1, 2, 3, 4])
plt.ylabel('some numbers')
plt.show()
Plot the data with the `plot` function.
Here the x values are not given explicitly: when a single array is passed, it is treated as the sequence of y values and the x values are generated automatically (0, 1, 2, ...).
You can specify the y-axis label with the `ylabel` function.
An x-versus-y plot works as follows.
plt.plot([1, 2, 3, 4], [1, 4, 9, 16])
plt.xlabel('x-axis')
plt.ylabel('y-axis')
Of course, you can also plot the data points as markers, as in a scatter plot.
import numpy as np
# evenly sampled time at 200ms intervals
t = np.arange(0., 5., 0.2)
# red dashes, blue filled x markers, green triangles and yellow stars
plt.plot(t, t, 'r--', t, t**2, 'bX', t, t**3, 'g^',t,t**4,'y*')
plt.show()
Here, `r--`, `bX`, `g^`, `y*`, etc. are format strings that set the plot style; the first letter is the color abbreviation.
There are eight abbreviations: {b(lue), g(reen), r(ed), c(yan), m(agenta), y(ellow), k (black, the key plate), w(hite)}.
If you want to try other colors, see:
https://qiita.com/KntKnk0328/items/5ef40d9e77308dd0d0a4
In addition, the line types of the plot are as follows.
symbol | Line type |
---|---|
: | Dotted line |
- | Solid line |
-. | Dash-dot line |
-- | Dashed line |
There are various markers.
symbol | marker | symbol | marker |
---|---|---|---|
. | point | * | Star |
, | pixel | 1 | Y |
o | Round | 2 | Y(↓) |
v | Lower triangle | 3 | Y(←) |
^ | Upper triangle | 4 | Y(→) |
< | Left triangle | + | + |
> | Right triangle | x | x |
s | Square | X | x(filled) |
p | Pentagon | D | Rhombus |
h | Hexagon | d | Thin rhombus |
8 | Octagon | None | No marker |
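The color, line type and marker from the tables above can also be given as separate keyword arguments instead of a single format string, which can be easier to read. A minimal sketch follows; the label text and variable names are assumptions for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

t = np.arange(0., 5., 0.2)

fig, ax = plt.subplots()
# Equivalent to the format string 'r--' with an added circle marker
(line,) = ax.plot(t, t**2, color="r", linestyle="--", marker="o", label="t squared")
ax.legend()
plt.show()
```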
It is also possible to display a function of two variables as contour lines or as a 3D graph.
from matplotlib import cm
from mpl_toolkits.mplot3d import Axes3D
#Creating grid points
X = np.arange(-5, 5, 0.25)
Y = np.arange(-5, 5, 0.25)
X, Y = np.meshgrid(X, Y)
#Calculation of the value of the function at each point
R = np.sqrt(X**2 + Y**2)
Z = np.sin(R)
fig_1 = plt.figure()
ax = fig_1.add_subplot(projection='3d') #Axes3D(fig_1) no longer adds itself to the figure in recent Matplotlib
ax.plot_surface(X, Y, Z, rstride=1, cstride=1, cmap=cm.viridis)
fig_2, ax_2 = plt.subplots(figsize=(8, 6))
#Contour line display
contour = ax_2.contourf(X, Y, Z)
#Color bar display
fig_2.colorbar(contour)
plt.show()
First, create the grid points with the `numpy.meshgrid` function.
Then calculate the value of the function at the grid points.
Here, Z = sin(sqrt(X^2 + Y^2)) is drawn.
The 3D graph is drawn by passing X, Y and Z as arguments to `plot_surface`.
The other arguments have the following meanings.
Parameter name | meaning |
---|---|
rstride | Row stride (step between sampled rows) |
cstride | Column stride (step between sampled columns) |
rcount | Maximum number of samples used in the row direction |
ccount | Maximum number of samples used in the column direction |
color | Surface color |
cmap | Surface color map |
facecolors | Individual patch colors |
norm | Normalize instance used when converting map values to colors |
vmax | Map maximum |
vmin | Map minimum |
shade | Presence or absence of shadow |
Filled contour lines are drawn with the `contourf` function.
In addition, a color bar showing which values the contour colors correspond to is displayed by passing the contour set to the `colorbar` function.
Challenges
Let's draw your favorite 3D data!
The `fill_between` function fills the area between two curves.
x = np.arange(0,5,0.01)
y1 = np.sin(x)
y2 = np.cos(x)
plt.plot(x,y1,color='k',label="sin")
plt.plot(x,y2,color='r',label="cos")
plt.legend()
plt.fill_between(x,y1,y2,color='b',alpha = 0.1)
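`fill_between` also accepts a `where` argument, a boolean array that restricts the fill to part of the x range. The sketch below fills only the region in which sin is above cos; the names `mask` and `region` are chosen for this example.

```python
import numpy as np
import matplotlib.pyplot as plt

x = np.arange(0, 5, 0.01)
y1 = np.sin(x)
y2 = np.cos(x)

fig, ax = plt.subplots()
ax.plot(x, y1, color="k", label="sin")
ax.plot(x, y2, color="r", label="cos")

# Fill only where sin is above cos
mask = y1 > y2
region = ax.fill_between(x, y1, y2, where=mask, color="b", alpha=0.1)
ax.legend()
plt.show()
```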
Use the `imshow` function to display an existing image or a two-dimensional array containing the value of each pixel. The image is displayed by passing an array object or a PIL image as the first argument.
Save an image you like in JPEG format and upload it from the Files panel on the left side of Colab.
import matplotlib.image as mpimg
img = mpimg.imread("〇〇.jpg")
plt.imshow(img)
If you replace 〇〇.jpg with the name of your file, the image should be displayed.
An array can be displayed as follows. Let's plot a two-dimensional array of random numbers.
x = np.random.random((100, 100))
plt.imshow(x, cmap="plasma")
plt.colorbar()
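A color image is simply an array with a third axis for the red, green and blue channels. As a sketch of this, the following builds a synthetic gradient image in place of a file; the size and the variable names are assumptions for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt

height, width = 64, 128
# Shape is (rows, columns, 3 channels); float values range from 0 to 1
image = np.zeros((height, width, 3))
image[:, :, 0] = np.linspace(0, 1, width)            # red increases left to right
image[:, :, 2] = np.linspace(0, 1, height)[:, None]  # blue increases top to bottom

plt.imshow(image)
plt.show()
```

No `cmap` is needed in this case, because the colors are given directly by the three channels.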