2020.8.11 I will share the spreadsheet link because the table is hard to see.
This article summarizes what I learned about the Python 3 Engineer Certification Data Analysis Exam, which began on June 8, 2020. It organizes information from Prime Strategy's practice exams and various web pages. The term "textbook" in this article refers to the following book, which is the main teaching material.
Main teaching material: "A New Textbook for Data Analysis Using Python" (Shoeisha), released on September 19, 2018 (2,678 yen including tax). Authors: Manabu Terada, Shingo Tsuji, Takanori Suzuki, Shintaro Fukushima (honorifics omitted)
No. | Question range | Number of questions | Question distribution |
---|---|---|---|
1 | Role of data engineer | 2 | 5.00% |
2 | Python and environment | | |
2.1 | Execution environment construction | 1 | 2.50% |
2.2 | Python basics | 3 | 7.50% |
2.3 | Jupyter Notebook | 1 | 2.50% |
3 | Foundations of mathematics | | |
3.1 | Basic knowledge for reading mathematical formulas | 1 | 2.50% |
3.2 | Linear algebra | 2 | 5.00% |
3.3 | Basic analysis | 1 | 2.50% |
3.4 | Probability and statistics | 2 | 5.00% |
4 | Analysis practice by library | | |
4.1 | NumPy | 6 | 15.00% |
4.2 | pandas | 7 | 17.50% |
4.3 | Matplotlib | 6 | 15.00% |
4.4 | scikit-learn | 8 | 20.00% |
Major items | Sub-item | Overview | Details | reference | |
---|---|---|---|---|---|
2 | Python and environment | pip | The pip command is a utility that installs Python packages published in The Python Package Index. Use the pip install command to install the package. | ||
About pip's -U option Ex.) pip install -U numpy pandas |
Adding the -U option to pip install updates an already installed library to the latest version. To install the latest version explicitly, run the command as shown in the example. |
||||
PEP8 | PEP 8 is Python's standard coding convention. Importing several names from the same module on one line is allowed, but imports of different modules should be written on separate lines. | [Python coding conventions] Read PEP8 - Qiita https://qiita.com/simonritchie/items/bb06a7521ae6560738a7 | |||
Log level | Python's logging module has five standard levels. In descending order of severity: 1. CRITICAL 2. ERROR 3. WARNING 4. INFO 5. DEBUG |
||||
Convenient module | The pickle module can serialize Python objects so that they can be written to and read from files. Boolean values, numbers, strings, and so on can be pickled. The pathlib module is useful for working with file paths. Its glob method also accepts file names that contain a wildcard (*). |
||||
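As a minimal sketch of the two modules described above, the following round-trips an object through pickle and then lists files with a pathlib wildcard. The file names (`record.pkl`, `a.txt`, `b.txt`) and the `record` dictionary are illustrative assumptions, not part of the exam material.

```python
import pickle
import tempfile
from pathlib import Path

# Work inside a throwaway directory so the sketch leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)

    # Serialize a Python object to a file; pickle needs binary mode ("wb").
    record = {"name": "condor", "count": 3, "seen": True}
    with open(base / "record.pkl", "wb") as f:
        pickle.dump(record, f)

    # Read it back in binary mode ("rb").
    with open(base / "record.pkl", "rb") as f:
        restored = pickle.load(f)

    # Create two text files and find them with a wildcard pattern.
    (base / "a.txt").write_text("alpha")
    (base / "b.txt").write_text("beta")
    txt_names = sorted(p.name for p in base.glob("*.txt"))

print(restored)
print(txt_names)
```

The restored object compares equal to the original, and only the `.txt` files are picked up by the pattern.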
Copy and reference | If you assign an array to another variable, the new variable refers to the original array. To create a separate object, use copy() or copy.deepcopy(). * NumPy's ravel and flatten both make an array one-dimensional. ravel() returns a view whenever possible, whereas flatten() always returns a copy. reshape(), like ravel(), returns a view whenever possible. |
||||
Reading and writing data | To read or write a binary file, open it with the open function using a mode that contains b (for example 'rb' or 'wb'), which returns a file object. Read with read() and write with write(). | ||||
strip method Ex.) bird = ' Condor Penguin Duck ' print("before strip: {}".format(bird)) print("after strip: {}".format(bird.strip())) |
Whitespace characters at both ends are removed. | ||||
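Regular expressions (metacharacter : meaning; example pattern → matching strings)
. : any one character; a.c → abc, acc, aac
^ : beginning of the line; ^abc → abcdef
* : repeat 0 or more times; ab* → a, ab, abb, abbb
+ : repeat 1 or more times; ab+ → ab, abb, abbb
? : 0 times or 1 time; ab? → a, ab
{m} : repeat m times; a{3} → aaa
{m,n} : repeat m to n times; a{2,4} → aa, aaa, aaaa
[...] : any one of the listed characters; [a-c] → a, b, c
vertical bar : either alternative; a or b written with a vertical bar between them → a, b | |||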
Regular expression special sequences
\d : any digit; [0-9]
\D : any non-digit; [^0-9]
\s : any whitespace character; [ \t\n\r\f\v]
\S : any non-whitespace character; [^ \t\n\r\f\v]
\w : any alphanumeric character or underscore; [a-zA-Z0-9_]
\W : any character other than alphanumerics and underscore; [^a-zA-Z0-9_]
\A : beginning of the string; ^
\Z : end of the string; $ |
|||||
Regular expression functions
findall() → Returns all matching substrings as a list (the re module has no find(); use search() when a single match is enough)
match() → Checks whether the beginning of the string matches
fullmatch() → Checks whether the entire string matches
search() → Checks for a match anywhere, not just at the beginning. Used when you want to extract part of a string
replace() → String method that replaces a fixed substring
sub() → Replaces the parts that match the pattern and returns the resulting string
subn() → Returns a tuple of the replaced string (the same value sub() returns) and the number of replacements (the number of parts that matched the pattern)
match() and search() return a match object. The following methods can be used on match objects.
Get the matched position: start(), end(), span()
Get the matched string: group()
Get the string for each group: groups()
* If you enclose part of the regular expression pattern in parentheses (), that part is treated as a group. groups() then returns the strings matched by each group as a tuple.
When groups are formed with parentheses (), sub can use the matched strings in the replacement string. By default \1, \2, \3 ... correspond to the parts matched by the first, second, third ... pair of parentheses. Note that in a normal string that is not a raw string, the backslash must be escaped, as in '\\1'.
At the beginning of the parentheses () in the regular expression pattern, ?P |
re.search("category/(.+?)/", "https://foo.com/category/books/murakami").group(1) # Obtained string: 'books'
>>> text = "123456abcedf789ghi"
>>> matchobj = re.search(r'[a-z]+', text)
>>> if matchobj:
...     print(matchobj.group())
...     print(matchobj.start())
...     print(matchobj.end())
...     print(matchobj.span())
* Note that re.search retrieves information only for the first match.
The syntax of replace is target_string.replace(old, new[, count]).
>>> raw_abc = r"aaaaabbbbbccccc"
>>> rep_raw_abc = raw_abc.replace("c", "C")
>>> print("Change before:", raw_abc, "After change:", rep_raw_abc)
Change before: aaaaabbbbbccccc After change: aaaaabbbbbCCCCC
The syntax of sub is re.sub(pattern, replacement, target_string[, count]). Note that the argument order differs from replace. |
[Python] Very convenient regular expressions! - Qiita https://qiita.com/hiroyuki_mrp/items/29e87bf5fe46de62983c | |||
Regular expression flags
Limit matching to ASCII characters: re.ASCII
Case-insensitive: re.IGNORECASE
Match the beginning and end of each line: re.MULTILINE
To specify multiple flags, combine them with the bitwise OR operator |
|||||
Compiling the pattern
p = re.compile(r'([a-z]+)@([a-z]+)\.com') # the dot is escaped so that it matches a literal "."
m = p.match(s)
result = p.sub('new-address', s) |
|||||
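The regular expression functions described above can be sketched together as follows. The sample strings and the e-mail address are illustrative assumptions; only standard `re` functions are used.

```python
import re

text = "123456abcdef789ghi"

# search() finds the first match anywhere in the string.
m = re.search(r"([a-z]+)(\d+)", text)
first_match = m.group()     # the whole matched string
first_groups = m.groups()   # one string per parenthesized group
first_span = m.span()       # (start, end) of the whole match

# match() only succeeds at the beginning of the string.
at_start = re.match(r"[a-z]+", text)

# findall() returns every non-overlapping match as a list.
all_words = re.findall(r"[a-z]+", text)

# sub() can reuse groups in the replacement via \1, \2, ...
swapped = re.sub(r"([a-z]+)@([a-z]+)\.com", r"\2@\1.com", "alice@example.com")

# subn() also reports how many replacements were made.
masked, n_replaced = re.subn(r"\d", "#", text)

print(first_match, first_groups, first_span)
print(at_start, all_words)
print(swapped, masked, n_replaced)
```

Because the string starts with digits, `match()` returns `None` here while `search()` still finds the letters further in.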
Virtual environment | venv can isolate the modules to be installed for each virtual environment. Use pyenv or Anaconda to switch the Python interpreter. | https://tinyurl.com/y4ypsz9r | |||
Commands that begin with % or %% are magic commands. Use ! to execute an OS shell command. Shift + Tab displays the docstring. |
How to use the magic command (magic function) of Jupyter Notebook https://miyukimedaka.com/2019/07/28/blog-0083-jupyter-notebook-magic-command-explanation/ | ||||
Frequently used magic commands
%time: Measures the execution time of the code that follows and displays the result.
%timeit: Measures the execution time of the code that follows several times and displays the fastest result and the average.
%env: Gets and sets environment variables.
%who: Shows the currently declared variables.
%whos: Shows the currently declared variables, their types, and their contents.
%pwd: Shows the current directory.
%history: Displays a list of the code cell execution history.
%ls: Shows a list of files in the current directory.
%matplotlib inline: When you draw a graph with pyplot and similar tools, the result normally opens in a separate window; with this magic command the graph is displayed inside the notebook.
%%timeit: Applies %timeit to all the code in the cell.
%%html, %%HTML: Lets you write and render HTML code. |
|||||
Jupyter Notebook storage format | The notebook format (.ipynb) is a JSON file | ||||
3 | Foundations of mathematics | Matrix | For matrix multiplication: commutative law: does not hold (×), associative law: holds (○), distributive law: holds (○). The commutative law does not hold in general (note that some pairs of matrices do commute). A matrix with 1 row or 1 column is a vector. If the number of columns of the matrix equals the size of the vector, their product is defined, and the result is a vector whose size equals the number of rows of the original matrix. |
||
Common logarithm and natural logarithm | The common logarithm is the base 10 logarithm. The natural logarithm is based on e. | ||||
Euclidean distance | Straight-line distance | ||||
Manhattan distance | Zigzag distance (derived from Manhattan's grid) | ||||
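A small worked example may help separate the two distances. The helper names `euclidean` and `manhattan` and the sample points are assumptions made for illustration.

```python
import math

def euclidean(p, q):
    """Straight-line distance: square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def manhattan(p, q):
    """Grid distance: sum of the absolute differences along each axis."""
    return sum(abs(a - b) for a, b in zip(p, q))

p, q = (1, 2), (4, 6)
d_euclid = euclidean(p, q)     # differences are 3 and 4
d_manhattan = manhattan(p, q)

print(d_euclid, d_manhattan)
```

For these points the differences are 3 and 4, so the straight-line distance is 5 while the grid distance is 7.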
When differentiating a function F(x) gives f(x), F is called the primitive function of f and f is called the derivative of F. | |||||
Integral | An integral whose range of integration is not defined is called an indefinite integral. Since the derivative of any constant is 0, the indefinite integral usually includes the constant of integration "C". | ||||
Differentiation and integration | The derivative can be regarded as the slope, and the integral as the area. In data analysis and machine learning, the point that the slope of the function is 0 is used as useful information. | ||||
Partial differential | The derivative of a multivariable function with two or more variables is called the partial derivative. In partial differentiation, it is necessary to show which variable was differentiated. | ||||
Probability | The expected value of a 12-sided (dodecahedral) die is 6.5. For random variables: discrete → probability mass function, continuous → probability density function | ||||
Factorial 0! | Note that 0! = 1. Also remember that the logarithm of 1 is 0. | ||||
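The expected value of the 12-sided die can be checked directly from its probability mass function. This is a minimal sketch that assumes a fair die with faces 1 to 12; the variable names are illustrative.

```python
from fractions import Fraction

# Probability mass function of a fair 12-sided die: each face has probability 1/12.
pmf = {face: Fraction(1, 12) for face in range(1, 13)}

# Expected value = sum over faces of (value * probability).
expected = sum(face * prob for face, prob in pmf.items())

print(expected, float(expected))
```

The faces sum to 78, and 78 / 12 = 13/2 = 6.5.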
sin and cos | sin/cos are called sine and cosine, respectively. tan is tangent. | ||||
4 | Analysis practice by library | Numpy | dtype attribute | You can check the data type of the element of ndarray. | |
Convenient ways to generate an ndarray
# Define an array of numbers from -5 to 5 in steps of 0.1
x = np.arange(-5, 5, 0.1)
# Generate an arithmetic progression from 1 to 10 with the number of elements specified by num
np.linspace(1, 10) |
The syntax is np.linspace(start, stop, num=50, endpoint=True). num specifies the number of elements and is 50 by default. | ||||
np.random module | Note that, unlike the standard random module, np.random.randint does not include the value given as the upper bound (high).
random.random() / np.random.rand(size) generate random numbers from 0 to less than 1.
import numpy as np
import random
print(random.random()) # 0.9733563537374995
print(np.random.rand(3)) # [ 0.69962497 0.61950126 0.7944733 ]
print(np.random.rand(2, 3))
# [[ 0.29315148 0.06560949 0.56110618]
# [ 0.62784039 0.19218867 0.07087768]]
np.random.randn(size) generates random numbers that follow the standard normal distribution.
print(np.random.randn(3, 3)) # 3x3 array drawn from the standard normal distribution
# [[-0.52434526 0.16597271 -2.22295048]
# [ 0.46995083 -0.64576356 -2.73155503]
# [ 1.04575168 0.05712791 -0.46522963]]
To generate random numbers that follow a normal distribution with a given mean and standard deviation, do as follows.
np.random.normal(mu, sd, 10000)
To generate integer random numbers, use np.random.randint(low, high, size).
np.random.randint(1, 10, 2) # Generates an ndarray of two integers from 1 to less than 10.
np.random.randint(1, 10, (2, 3)) # Generates a 2-by-3 ndarray.
np.random.randint(2, size=8) # If high is omitted, the value of low is treated as high.
# array([1, 0, 0, 0, 1, 1, 1, 0])
np.random.randint(1, size=8) # Only integers less than 1, that is, 0.
# array([0, 0, 0, 0, 0, 0, 0, 0])
choice differs from the standard module as follows.
random.choice(seq): selects one element from seq
np.random.choice(a, size): can select multiple elements from a
seq1 = [0, 1, 2, 3]
random.choice(seq1) # One element chosen
random.choice("hello") # One character chosen from 5 characters
np.random.choice(seq1, 4) # Array of 4 choices, with replacement
np.random.choice([2, 4, 6], 2) # Array of 2 choices, with replacement
np.random.choice([0, 1], (3, 3)) # Fill a 3x3 array with 0s and 1s
np.random.choice(5, 2) # Equivalent to np.random.randint(0, 5, 2) |
How to use NumPy (12) Random numbers, random-Remrin's python capture diary http://python-remrin.hatenadiary.jp/entry/2017/04/26/233717 | |||
Conversion to a one-dimensional array | You can use the ravel or flatten methods to convert a two-dimensional NumPy array to one dimension. The ravel method returns a reference (view) and the flatten method returns a copy. | ||||
Copy and reference a = np.array([1, 2, 3]) b = a ① b = a.copy() ② |
① is a reference and ② is a copy. Note that slicing a standard Python list returns a copy, whereas slicing a NumPy array returns a reference (view). | ||||
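The difference between a reference (view) and a copy shows up as soon as the original is modified. The following is a minimal sketch; the array contents and variable names are illustrative assumptions.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)

flat_view = a.ravel()     # a view of a when the memory layout allows it
flat_copy = a.flatten()   # always an independent copy
part = a[0, :2]           # a NumPy slice is a view, not a copy

# A standard Python list behaves differently: slicing makes a copy.
py_list = [1, 2, 3]
py_slice = py_list[:2]

# Modify the originals and see which objects follow the change.
a[0, 0] = 99
py_list[0] = 99

print(flat_view[0], flat_copy[0], part[0], py_slice[0])
print(np.shares_memory(a, flat_view), np.shares_memory(a, flat_copy))
```

The view and the NumPy slice pick up the new value, while the flattened copy and the list slice keep the old one.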
Matrix division a = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]) first1, second1 = np.vsplit(a, [2]) first2, second2 = np.hsplit(second1, [2]) print(second2) |
The vsplit function splits the matrix in the row direction, and the hsplit function splits the matrix in the column direction. | ||||
About display of print statement import numpy as np a = np.array([[1, 2, 3], [4, 5, 6]]) b = np.array([7,8,9]) print(a[-1:, [1,2]], b.shape) |
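[[5 6]] (3,) a[-1:, [1,2]] takes the last row ([4, 5, 6]) and then its elements at positions [1, 2], so 5 and 6 are extracted. Because -1: is a slice, the result stays two-dimensional. Note that b is one-dimensional because it has only one pair of brackets, so b.shape is (3,). |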
||||
Number of elements generated by np.arange | x = np.arange(0.0, 1.5, 0.1) gives array([0. , 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1. , 1.1, 1.2, 1.3, 1.4]), which has 15 elements. If the second argument (stop) is 15 instead, there are 10 times as many, that is, 150. np.sin(x) treats x as an angle in radians. | ||||
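The element counts for arange and linspace can be confirmed with a short check. This is a sketch of the cases discussed in these notes; the variable names are illustrative.

```python
import numpy as np

# arange excludes the stop value; the count follows from (stop - start) / step.
x15 = np.arange(0.0, 1.5, 0.1)
x150 = np.arange(0.0, 15, 0.1)

# linspace includes the stop value by default and takes the element count directly.
default_points = np.linspace(1, 10)       # num defaults to 50
four_points = np.linspace(1, 10, num=4)   # 1, 4, 7, 10

print(len(x15), len(x150), len(default_points))
print(four_points)
```

With a floating-point step, arange can be sensitive to rounding, so linspace is often the safer choice when the number of elements matters.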
pandas | Date array dates = pd.date_range(start="2020-01-01", end="2020-12-31") print(dates) |
date_range() generates an array of dates. You can specify the start and end dates and times with start and end. | |||
Combining DataFrames
Concatenation: Connect the contents of the data in a certain direction as they are. pd.concat, DataFrame.append
Join: Connect the contents of the data by associating them through the value of some key. pd.merge, DataFrame.join |
pd.concat([df1, df2]) concatenates vertically; add axis=1 to concatenate horizontally. If nothing is specified the result is a full outer join, so specify join='inner' for an inner join. The rows/columns to join on could also be specified, as in join_axes=[df1.index] (join_axes was removed in pandas 1.0). For simple vertical concatenation you can also use df1.append(df2) (DataFrame.append was removed in pandas 2.0). Passing a Series in place of df2 adds a row. Note that unless ignore_index=True is specified, the indexes are concatenated as they are. Joins are performed with merge. The syntax is pd.merge(left, right, on='key', how='inner'). how can be inner/left/right/outer. To join on multiple keys, pass a list to on. DataFrame.join is convenient when you want to join using the index as the key. The default is a left outer join, but it can be changed with how (left.join(right, how='inner')). |
Python pandas: data concatenation / join processing seen in diagrams - StatsFragments http://sinhrks.hatenablog.com/entry/2015/01/28/073327 | |||
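The concatenation and join operations above can be sketched on two small frames. The frames `df1` and `df2` and their columns are illustrative assumptions; only `pd.concat`, `pd.merge`, and `DataFrame.join` are used, since they are available across pandas versions.

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
df2 = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})

# Concatenation: stack the rows as they are; ignore_index renumbers the index.
stacked = pd.concat([df1, df2], ignore_index=True)

# axis=1 places the frames side by side, aligned on the index.
side_by_side = pd.concat([df1, df2], axis=1)

# Join: match rows through the value of a key column.
inner = pd.merge(df1, df2, on="key", how="inner")
outer = pd.merge(df1, df2, on="key", how="outer")

# DataFrame.join uses the index as the key and is a left outer join by default.
left_on_index = df1.set_index("key").join(df2.set_index("key"))

print(stacked.shape, side_by_side.shape)
print(inner)
print(left_on_index)
```

The inner join keeps only the keys present in both frames, while the default `join` keeps every row of the left frame and fills the unmatched `y` with NaN.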
read_html() | If there are multiple tables, get them as a list of DataFrames | ||||
Missing value processing | With the fillna() arguments method='ffill' and method='bfill', each missing element is filled from a neighboring element, so different values can be stored within the same column. With method='ffill', a missing value is filled with the value of the nearest preceding element (smaller index); with method='bfill', it is filled with the value of the nearest following element (larger index).
data['Age'].fillna(20) # Fill the missing values in column Age with 20
data['Age'].fillna(data['Age'].mean()) # Fill the missing values in column Age with the mean of Age
data['Age'].fillna(data['Age'].median()) # Fill the missing values in column Age with the median of Age
data['Age'].fillna(data['Age'].mode()[0]) # Fill the missing values in column Age with the mode of Age (mode() returns a Series, so take its first element) |
Missing value handling with Pandas- Qiita https://qiita.com/0NE_shoT_/items/8db6d909e8b48adcb203 | |||
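Forward and backward filling can be sketched on a small Series. The sample values are illustrative assumptions. The sketch uses the `ffill()` and `bfill()` methods, which give the same result as `fillna(method='ffill')` and `fillna(method='bfill')`; recent pandas releases deprecate the `method` argument of `fillna`.

```python
import numpy as np
import pandas as pd

age = pd.Series([20.0, np.nan, 30.0, 30.0, np.nan])

forward = age.ffill()    # each gap takes the nearest preceding value
backward = age.bfill()   # each gap takes the nearest following value

# Filling with a single statistic gives every gap the same value.
with_median = age.fillna(age.median())
with_mode = age.fillna(age.mode().iloc[0])  # mode() returns a Series

print(forward.tolist())
print(backward.tolist())
print(with_mode.tolist())
```

The last element has no following value, so it stays missing after a backward fill.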
Mutual conversion between NumPy and pandas | To convert pandas → NumPy, use the values attribute of the DataFrame. For the reverse, pass the ndarray as an argument to pd.DataFrame(). Index names and column names are not retained when converting to NumPy. |
||||
DataFrame.describe() | describe returns summary statistics for each column, such as the mean, standard deviation, maximum, and minimum; for non-numeric columns it also returns the mode. std is the standard deviation and top is the mode. | https://tinyurl.com/y3gn3dz4 | |||
How to use groupby and Grouper import numpy as np import pandas as pd np.random.seed(123) dates = pd.date_range(start="2017-04-01", periods=365) df = pd.DataFrame(np.random.randint(1, 31, 365), index=dates, columns=["rand"]) df_year = pd.DataFrame(df.groupby(pd.Grouper(freq='W-SAT')).sum(), columns=["rand"]) |
Grouper lets you group flexibly by specifying a frequency with freq; W-SAT groups by week, with each week ending on Saturday. * The 5th line creates a DataFrame that uses the dates as its index. Each value in the rand column is a random integer from 1 to 30. |
||||
Matplotlib | MATLAB Style and OOP (Object Oriented) Style | The former needs less code but does not allow fine-grained settings, so the latter should generally be used. In the MATLAB style, users do not need to prepare a Figure or Axes to create a single graph; these objects are generated automatically. |
|||
Generation of drawing objects and subplot objects fig, axes = plt.subplots(2) |
As shown on the left, fig and axes can be generated at once. It is also possible to add subplots to fig one at a time with fig.add_subplot().
■ When creating fig and ax individually
# Create an area to place Axes
fig = plt.figure(facecolor = "lightgray")
# Add Axes to Figure
ax = fig.add_subplot(111)
subplots(2) gives two rows of subplots; specifying ncols=2 gives two columns. |
||||
How to arrange multiple subplots ax_1 = fig.add_subplot(221) ax_2 = fig.add_subplot(222) ax_3 = fig.add_subplot(223) #Plot the data in Axes in the 3rd row and 2nd column ax[2, 1].plot(x, y) |
pyplot.subplots()You can use to create multiple Axes objects at once. For the first argument nrows and the second argument ncols, pass the number of Axes in the row direction and the number in the column direction, respectively. | [Matplotlib]OOP and MATLAB style https://python.atelierkobato.com/matplotlib/ | |||
Axis settings #Axes settings ax.grid() #Show grid ax.set_title("Axes sample", fontsize=14) #Show title ax.set_xlim([-5, 5]) #x-axis range ax.set_ylim([-5, 5]) #y-axis range |
|||||
Formatting a Figure object #Creating and formatting Figure objects fig = plt.figure( #size figsize = (5, 5), #Fill color facecolor = "lightgray", #Border display frameon = True, #Border color edgecolor = "black", #Border thickness linewidth = 4) |
|||||
#Axes on the figure(Subplot)Add ax = fig.add_subplot( #Number of rows and columns, Axes number 111, #Fill color facecolor = "lightgreen", #x-axis and y-axis range xlim = [-4,4], ylim = [0,40]) |
|||||
Graph display plt.show() |
Display the graph with the show method. | ||||
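import matplotlib.pyplot as plt
fig, ax = plt.subplots()
x = [1, 2, 3]
y1 = [10, 2, 3]
y2 = [5, 3, 6]
labels = ['Setosa', 'Versicolor', 'Virginica']
# y_total was undefined in the snippet as posted; it is assumed here to be the element-wise sum of y1 and y2
y_total = [num1 + num2 for num1, num2 in zip(y1, y2)]
ax.bar(x, y_total, tick_label=labels, label='y1')
ax.bar(x, y2, label='y2')
ax.legend()
plt.show() |
Note that y1 itself is never passed to bar: the bars labeled 'y1' are drawn from y_total, and y2 is drawn over them, which gives a stacked bar chart. | ||||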
import numpy as np import matplotlib.pyplot as plt np.random.seed(123) mu = 100 sigma = 15 x = np.random.normal(mu, sigma, 1000) fig, ax = plt.subplots() n, bins, patches = ax.hist(x, bins=25, orientation='horizontal') for i, num in enumerate(n): print('{:.2f} - {:.2f} {}'.format(bins[i], bins[i + 1], num)) plt.show() |
The default value for bins is 10. See textbook P192. The bins returned by hist are the bin boundaries, so there is one more of them than the number of bins. The variable mu is the mean and the variable sigma is the standard deviation. The histogram is drawn horizontally. Of the return values n, bins, and patches of the hist method, bins contains the bin boundary values, and here there are 26 of them. When this script is executed, a frequency distribution table is output in addition to the histogram. The print statement in the code on the left produces that table.
51.53 - 55.62 2.0
55.62 - 59.70 3.0
59.70 - 63.78 6.0
63.78 - 67.86 7.0
67.86 - 71.94 16.0
71.94 - 76.02 29.0
76.02 - 80.11 37.0 |
||||
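The relationship between the number of bins and the number of boundaries can be checked without drawing anything. This sketch uses `np.histogram`, which Matplotlib's `hist` relies on for binning; the random generator and seed are illustrative assumptions and will not reproduce the exact table above.

```python
import numpy as np

rng = np.random.default_rng(123)
x = rng.normal(100, 15, 1000)   # mean 100, standard deviation 15

# counts has one entry per bin; edges has one more entry than counts.
counts, edges = np.histogram(x, bins=25)

# Build the frequency distribution table from adjacent pairs of boundaries.
table = [(edges[i], edges[i + 1], counts[i]) for i in range(len(counts))]

print(len(counts), len(edges))
for low, high, num in table[:3]:
    print('{:.2f} - {:.2f} {}'.format(low, high, num))
```

With 25 bins there are 26 boundaries, and the counts add up to the number of samples.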
Pie chart display | See textbook P198. To keep the aspect ratio, call ax.axis('equal'). autopct displays each value as a percentage. Use explode to emphasize a wedge. Example: plt.pie(x, labels=label, counterclock=False, startangle=90) draws clockwise starting from directly above. |
https://tinyurl.com/yyl8yml6 | |||
Scikit-learn | DBSCAN | The DBSCAN method, which is one of unsupervised learning, is a density-based clustering algorithm that focuses on the distance between feature vectors. | |||
Evaluation metrics for classification: Precision, Recall, F1 Score, Accuracy (correct answer rate) |
Precision and Recall are in a trade-off relationship, so you should also look at the F1 Score. Using the common example of cancer diagnosis: Precision → emphasized when you want to reduce false positives (misdiagnosis). Recall → emphasized when you want to avoid overlooking positive cases. Accuracy → general index for checking the overall correctness of classification |
Machine learning practice (supervised learning: classification)- KIKAGAKU https://www.kikagaku.ai/tutorial/basic_of_machine_learning/learn/machine_learning_classification | |||
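The four classification metrics can be computed by hand from the counts of true/false positives and negatives. This is a minimal sketch in plain Python; the function name `classification_scores` and the sample labels are illustrative assumptions (scikit-learn offers equivalent functions in `sklearn.metrics`).

```python
def classification_scores(y_true, y_pred):
    """Return (precision, recall, f1, accuracy) for binary labels 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    precision = tp / (tp + fp)   # of the predicted positives, how many are right
    recall = tp / (tp + fn)      # of the actual positives, how many were found
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    accuracy = (tp + tn) / len(pairs)
    return precision, recall, f1, accuracy

y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

precision, recall, f1, accuracy = classification_scores(y_true, y_pred)
print(precision, recall, f1, accuracy)
```

Here two of the four positives are missed, so recall (0.5) is lower than precision (2/3), and the F1 score falls between the two.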
Evaluation metrics for regression models | MSE (Mean Squared Error), RMSE (Root Mean Squared Error), and MAE (Mean Absolute Error) are well known. | https://tinyurl.com/y2xc9c58 https://tinyurl.com/y5k8gc9a Meaning of various errors (RMSE, MAE, etc.)-Mathematics learned with concrete examples https://mathwords.net/rmsemae#:~:text=MAE%EF%BC%88Mean%20Absolute%20Error%EF%BC%89,-%E3%83%BB%E5%AE%9A%E7%BE%A9%E5%BC%8F%E3%81%AF&text=%E3%83%BB%E5%B9%B3%E5%9D%87%E7%B5%B6%E5%AF%BE%E8%AA%A4%E5%B7%AE%E3%81%A8%E3%82%82%E8%A8%80%E3%81%84,%E3%81%A8%E3%81%97%E3%81%A6%E6%89%B1%E3%81%86%E5%82%BE%E5%90%91%E3%81%8C%E3%81%82%E3%82%8A%E3%81%BE%E3%81%99%E3%80%82 |
|||
Datasets bundled with scikit-learn: load_iris, load_boston |
iris records the length and width of the sepals ("gaku") and petals of 150 irises, together with the species. There are 4 explanatory variables and 1 objective variable. boston is a dataset that records, for each area in the suburbs of Boston, USA, 13 features such as the per-capita crime rate and the average number of rooms per dwelling, together with housing prices (14 attributes in total). load_boston was removed in scikit-learn 1.2. | ||||
Decision tree Algorithm for regression and classification. It has the advantage of being easy to interpret and requiring little preprocessing. |
Textbook P235. Information gain = impurity of the parent node - sum of the impurities of the child nodes. If it is positive, the node should be split into child nodes, and if it is negative, it should not be split. | Tree structure (data structure) - Wikipedia https://ja.wikipedia.org/wiki/%E6%9C%A8%E6%A7%8B%E9%80%A0_(%E3%83%87%E3%83%BC%E3%82%BF%E6%A7%8B%E9%80%A0)#%E7%94%A8%E8%AA%9E | |||
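Information gain can be illustrated with a small calculation. This sketch makes two assumptions beyond the one-line formula above: it uses Gini impurity as the impurity measure, and it weights each child's impurity by its share of the samples, which is the usual convention. The function names and sample labels are illustrative.

```python
from collections import Counter

def gini(labels):
    """Gini impurity: 1 minus the sum of squared class proportions."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())

def information_gain(parent, children):
    """Parent impurity minus the sample-weighted impurity of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * gini(child) for child in children)
    return gini(parent) - weighted

parent = ["a"] * 5 + ["b"] * 5

# A split that separates the classes completely removes all impurity.
perfect_gain = information_gain(parent, [["a"] * 5, ["b"] * 5])

# A split that keeps the same class mix in each child gains nothing.
useless_gain = information_gain(
    parent, [["a", "a", "b", "b"], ["a", "a", "a", "b", "b", "b"]]
)

print(perfect_gain, useless_gain)
```

The parent has impurity 0.5; the clean split recovers all of it, while the second split leaves both children as mixed as the parent.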
SVM Draws a decision boundary so that the margin is maximized. The technique that makes linearly inseparable data linearly separable is called the kernel trick. |
from sklearn.svm import SVC svc = SVC() C is the cost parameter and represents the penalty for wrong predictions. If it is too large, it causes overfitting. gamma determines the complexity of the model: the larger the value, the more complex the model becomes, which leads to overfitting. |
||||
Sigmoid function y = 1 / (1 + exp(-x)). The curve passes through (0, 0.5), and 0 < y < 1. |
Logistic regression, which uses the sigmoid function, is a model that performs binary classification. Classification into three or more classes can be handled by performing binary classification once per class. | ||||
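The shape of the sigmoid function, y = 1 / (1 + exp(-x)), can be checked numerically. This is a minimal sketch; the helper name `sigmoid` and the sample inputs are illustrative.

```python
import math

def sigmoid(x):
    """Map any real number into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

inputs = [-5, -1, 0, 1, 5]
outputs = [sigmoid(x) for x in inputs]

for x, y in zip(inputs, outputs):
    print(x, round(y, 4))
```

The output is exactly 0.5 at x = 0, increases with x, and approaches but never reaches 0 and 1.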
Normalization | Well-known methods are standardization to mean 0 and variance 1 (StandardScaler) and normalization to minimum 0 and maximum 1 (MinMaxScaler). | ||||
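The two kinds of scaling can be written out with NumPy to show what each one computes. This sketch reproduces the arithmetic by hand rather than calling scikit-learn; the sample array is an illustrative assumption, and the population standard deviation (NumPy's default) is used, which is also what StandardScaler uses.

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Standardization: subtract the mean and divide by the standard deviation.
standardized = (x - x.mean()) / x.std()

# Min-max normalization: rescale so the minimum is 0 and the maximum is 1.
min_max = (x - x.min()) / (x.max() - x.min())

print(standardized.mean(), standardized.std())
print(min_max)
```

After standardization the mean is 0 and the variance is 1 (up to rounding); after min-max scaling the values run from 0 to 1.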
Separation of training data and test data | from sklearn.model_selection import train_test_split | ||||
Linear model | Linear regression (LinearRegression) is divided into simple regression, with one explanatory variable, and multiple regression, with multiple explanatory variables. | ||||
Principal component analysis | A method of compressing data to a dimension equal to or lower than the original by finding the directions in which the variance is largest. Principal component analysis can be run with the PCA class in scikit-learn's decomposition module. |
||||
Grid search from sklearn.datasets import load_iris from sklearn.model_selection import GridSearchCV from sklearn.tree import DecisionTreeClassifier from sklearn.model_selection import train_test_split iris = load_iris() X, y = iris.data, iris.target X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123) clf = DecisionTreeClassifier() param_grid = {'max_depth': [3, 4, 5]} cv = GridSearchCV(clf, param_grid=param_grid, cv=10) cv.fit(X_train, y_train) y_pred = cv.predict(X_test) |
In the code on the left, the optimum value of the depth of the decision tree may change each time it is executed. If you want to have reproducibility, do as follows. clf = DecisionTreeClassifier(random_state=0) |
Parameter explanation of decision tree analysis – S-Analysis http://data-analysis-stats.jp/2019/01/14/%E6%B1%BA%E5%AE%9A%E6%9C%A8%E5%88%86%E6%9E%90%E3%81%AE%E3%83%91%E3%83%A9%E3%83%A1%E3%83%BC%E3%82%BF%E8%A7%A3%E8%AA%AC/ | |||
Clustering | k-means first places the cluster centers at random, then repeatedly computes the distance from each data point to the centers, reassigns the points, and updates the centers until the centers converge. Clustering can be broadly divided into partition-optimizing clustering and hierarchical clustering. Partition-optimizing clustering prepares in advance a function that measures the quality of a clustering and searches for the clustering that minimizes its value. Hierarchical clustering, on the other hand, builds clusters hierarchically by dividing or merging clusters. Hierarchical clustering is further divided into the agglomerative type and the divisive type. The agglomerative type starts with each data point as its own cluster and merges similar clusters one after another. The divisive type starts with all data points in a single cluster and successively splits off groups of dissimilar data points. The divisive type tends to require more computation than the agglomerative type. |
https://tinyurl.com/y6cgp24f https://tinyurl.com/y2df2w4c |