Overview

This is a collection of practice questions I wrote myself as one of my study methods for the Python 3 Engineer Certification Data Analysis Exam, which I took in November 2020. I hope it helps those who are going to take the exam.

My report on the exam experience is summarized in this article: https://qiita.com/pon_maeda/items/a6c008fb3d993278fccb

Notes

- This collection uses short-answer and fill-in-the-blank questions so that you can work through it quickly in spare moments.
- **Please note that the actual exam is in a four-choice format (as of November 15, 2020).**
- It is a little more difficult than the actual exam.
- It was put together roughly for personal use, so some items may not be well-formed questions. Please bear with me.

Exercise books

1. Role of data analysis engineer

Machine learning is broadly divided into three types: () learning, () learning, and () learning.

Answer

- Supervised learning
- Unsupervised learning
- Reinforcement learning

The () variable, also known as the correct label, is used only for () learning.

Answer

- Objective variable
- Supervised learning

The method used when this correct label is a continuous value is (), and the method used when it is another value is ().

Answer

Continuous value: Regression
Other values: Classification

What are the two main methods of unsupervised learning?

Answer

- Clustering
- Dimensionality reduction

2. Python and environment

venv is a tool that allows you to use different versions of Python. (Yes / No)

Answer

No. venv creates virtual environments on top of an existing Python installation, so it cannot manage versions of Python itself.

What is the function that lets you specify file names with wildcards in Python?

Answer

glob function (glob.glob in the standard library)
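
As a small sketch of wildcard matching with glob, the following creates a temporary directory and lists only the CSV files in it. The file names are made up for illustration.

```python
import glob
import os
import tempfile

# Create a throwaway directory with a few empty files (hypothetical names).
with tempfile.TemporaryDirectory() as tmp_dir:
    for name in ["a.csv", "b.csv", "notes.txt"]:
        open(os.path.join(tmp_dir, name), "w").close()

    # "*.csv" matches every file name that ends in .csv
    csv_paths = sorted(glob.glob(os.path.join(tmp_dir, "*.csv")))
    csv_names = [os.path.basename(path) for path in csv_paths]

print(csv_names)  # ['a.csv', 'b.csv']
```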

3. Foundations of mathematics

How are sin, cos, and tan read?

Answer

sin: sine
cos: cosine
tan: tangent

What is the value of Napier's number (e)?

Answer

2.7182…

What is the logarithm of 1?

Answer

0

What is the factorial of 1?

Answer

1

Suppose a six-sided die is rolled once. You do not know which number came up, but you are told that it is odd. A probability calculated under such given information is called the () probability, which is the basis of the () theorem.

Answer

- Conditional probability
- Bayes' theorem
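
To make the die example concrete, here is a minimal sketch that computes the probability that the roll is 1 given that it is odd. The choice of "the roll is 1" as the event of interest is my own assumption for illustration.

```python
from fractions import Fraction

outcomes = range(1, 7)                      # faces of a fair six-sided die
odd = [n for n in outcomes if n % 2 == 1]   # condition B: the roll is odd
one_and_odd = [n for n in odd if n == 1]    # event A and B: the roll is 1 (illustrative event)

p_b = Fraction(len(odd), 6)                 # P(B) = 3/6
p_a_and_b = Fraction(len(one_and_odd), 6)   # P(A and B) = 1/6
p_a_given_b = p_a_and_b / p_b               # P(A | B) = P(A and B) / P(B)

print(p_a_given_b)  # 1/3
```

Without the condition the probability of rolling a 1 is 1/6; knowing that the roll is odd raises it to 1/3.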

4. Practice of analysis by library

4.1. NumPy

4.1.1. Overview of NumPy

NumPy has a type for arrays () and a type for matrices ().

Answer

For arrays: ndarray
For matrices: matrix
* In the data analysis exam, ndarray plays the leading role.

Can an ndarray hold elements of multiple types, or must all elements be of one type?

Answer

Must be one type. This is the difference from DataFrame.

4.1.2. Handle data with NumPy

How to check the size (shape) of an array

Answer

shape attribute (e.g. a.shape)

The ravel function returns (), while the flatten function returns ().

Answer

ravel function: returns a reference (a view of the original data) where possible
flatten function: always returns a copy
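
The difference can be checked with a short sketch like the one below. The array values are arbitrary, and np.shares_memory is used only to confirm whether the result still points at the original data.

```python
import numpy as np

a = np.array([[1, 2, 3], [4, 5, 6]])

r = a.ravel()    # a view of `a` here, because `a` is contiguous in memory
f = a.flatten()  # an independent copy

r[0] = 100  # writes through to `a`
f[1] = 200  # does not affect `a`

print(a[0, 0], a[0, 1])        # 100 2
print(np.shares_memory(a, r))  # True
print(np.shares_memory(a, f))  # False
```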

How to check the element type of an array

Answer

dtype attribute (e.g. a.dtype)

How to convert the element type of an array

Answer

astype method (e.g. a.astype(np.float64))

A function that generates uniformly distributed random integers

Answer

np.random.randint function
* Generates values greater than or equal to the first argument and less than the second argument.
* If you pass a tuple as the third argument, an array of that shape is generated.

A function that generates uniformly distributed random decimals

Answer

np.random.uniform function
* The arguments are the same as for the np.random.randint function.

A function that generates random numbers from the standard normal distribution

Answer

np.random.randn function

The standard normal distribution is the distribution with mean () and variance ().

Answer

Distribution of mean 0, variance 1

What is the function to generate a normal distribution random number by specifying the mean and standard deviation?

Answer

np.random.normal function
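
The four random-number functions above can be tried side by side as follows. This is only a sketch: the seed, the ranges, and the sample sizes are arbitrary choices of mine.

```python
import numpy as np

np.random.seed(0)  # fix the seed so that runs are repeatable (arbitrary value)

ints = np.random.randint(0, 10, (2, 3))       # integers in [0, 10), shape (2, 3)
floats = np.random.uniform(0.0, 1.0, (2, 3))  # decimals in [0.0, 1.0), shape (2, 3)
std_norm = np.random.randn(1000)              # standard normal: mean 0, variance 1
norm = np.random.normal(50, 10, 1000)         # mean 50, standard deviation 10

print(ints.shape, floats.shape)      # (2, 3) (2, 3)
print(std_norm.mean(), norm.mean())  # close to 0 and 50, respectively
```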

A function that creates an identity matrix with the specified number of diagonal elements

Answer

np.eye function
np.eye(3) produces the following:
array([[1., 0., 0.],
       [0., 1., 0.],
       [0., 0., 1.]])

A function that creates an array of specified values for all elements

Answer

np.full function
Example: np.full((2, 4), np.pi)

A function that creates an evenly divided array in a specified range

Answer

np.linspace function
Example: np.linspace(0, 1, 5)  # → array([0., 0.25, 0.5, 0.75, 1.0])

A function that allows you to see the differences between the elements of an array

Answer

np.diff function
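
A quick sketch of the array-creation helpers above together with np.diff. The input values for np.diff are arbitrary.

```python
import numpy as np

identity = np.eye(3)                      # 3x3 identity matrix
filled = np.full((2, 4), np.pi)           # 2x4 array in which every element is pi
points = np.linspace(0, 1, 5)             # 5 evenly spaced values from 0 to 1, both ends included
steps = np.diff(np.array([1, 4, 9, 16]))  # differences between neighbouring elements

print(points)  # [0.   0.25 0.5  0.75 1.  ]
print(steps)   # [3 5 7]
```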

a = [1, 2, 3]
b = [4, 5, 6]
np.concatenate([a, b])

Which of the following is the result?

1. [1, 2, 3, 4, 5, 6]
2. [[1, 2, 3], [4, 5, 6]]
3. [1, 2, 3, [4, 5, 6]]

Answer

1. `[1, 2, 3, 4, 5, 6]`

When concatenating one-dimensional arrays, the np.concatenate function concatenates in the (row or column) direction.

Answer

Connected in the column direction. (Same behavior as hstack function)

When concatenating two-dimensional arrays, the np.concatenate function concatenates in the (row or column) direction by default.

Answer

Concatenated in the row direction. (Same behavior as vstack function)

If the argument axis = 1 is specified for this function, it becomes () direction concatenation.

Answer

Connected in the column direction. (Same behavior as hstack function)
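
The effect of the axis argument on two-dimensional arrays can be confirmed with a sketch like this (the arrays are made-up examples):

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

rows = np.concatenate([a, b])          # axis=0 by default: rows are stacked, shape (4, 2)
cols = np.concatenate([a, b], axis=1)  # axis=1: columns are stacked, shape (2, 4)

print(rows.shape, cols.shape)                   # (4, 2) (2, 4)
print(np.array_equal(rows, np.vstack([a, b])))  # True
print(np.array_equal(cols, np.hstack([a, b])))  # True
```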

A function that divides a two-dimensional array in the column direction.

Answer

np.hsplit function
Example: first, second = np.hsplit(hoge_array, [2])  # → split before the 3rd column (index 2)

A function that splits a two-dimensional array in the row direction

Answer

np.vsplit function
Example: first, second = np.vsplit(hoge_array, [2])  # → split before the 3rd row (index 2)
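
A small sketch of both split functions on a 3x4 array. The array itself is an arbitrary example.

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)  # 3 rows, 4 columns

left, right = np.hsplit(arr, [2])  # columns 0-1 and columns 2-3
top, bottom = np.vsplit(arr, [2])  # rows 0-1 and row 2

print(left.shape, right.shape)  # (3, 2) (3, 2)
print(top.shape, bottom.shape)  # (2, 4) (1, 4)
```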

What does transpose of a two-dimensional array mean?

Answer

Swap rows and columns

If you have a two-dimensional array called a, how do you transpose it?

Answer

a.T

What can be used to increase the dimensions of a one-dimensional array without specifying the number of elements?

Answer

np.newaxis (strictly an indexing constant rather than a function)
* If you can specify the number of elements, you can also use the reshape function.

a = np.array([1, 5, 4])
# array([[1, 5, 4]])

How can np.newaxis be used to give a the two-dimensional shape shown in the comment above?

Answer

a[np.newaxis, :]

a = np.array([1, 5, 4])
# array([[1],
#        [5],
#        [4]])

How can np.newaxis be used to give a the two-dimensional shape shown in the comment above?

Answer

a[:, np.newaxis]
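
Both uses of np.newaxis can be checked in one short sketch, along with the reshape alternative mentioned earlier:

```python
import numpy as np

a = np.array([1, 5, 4])

row = a[np.newaxis, :]   # shape (1, 3): a single row
col = a[:, np.newaxis]   # shape (3, 1): a single column
same = a.reshape(1, -1)  # reshape alternative; -1 lets NumPy infer the length

print(row.shape, col.shape)        # (1, 3) (3, 1)
print(np.array_equal(row, same))   # True
print(np.newaxis is None)          # True: np.newaxis is an alias for None
```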

What is the function that generates the grid data?

Answer

np.meshgrid function
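
As a minimal illustration of grid data, np.meshgrid turns two coordinate vectors into two 2D arrays that together list every (x, y) combination. The coordinates are arbitrary.

```python
import numpy as np

x = np.array([0, 1, 2])
y = np.array([10, 20])

xx, yy = np.meshgrid(x, y)  # both have shape (len(y), len(x)) = (2, 3)

print(xx)
# [[0 1 2]
#  [0 1 2]]
print(yy)
# [[10 10 10]
#  [20 20 20]]
```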

np.arange(1, 10, 3)

What will happen to this result?

Answer

array([1, 4, 7])
Values from 1 up to but not including 10 are generated in steps of 3.

4.1.3. NumPy features

What is the name of NumPy's group of convenience functions, such as sin() and log(), that transform all elements of an array at once?

Answer

Universal function

A function that returns the absolute value of an array element

Answer

np.abs function

a = np.array([0, 1, 2])
b = np.array([[-3, -2, -1],
              [0, 1, 2]])
a + b

What is the result of adding the one-dimensional array and the two-dimensional array as above?

Answer

array([[-3, -1, 1],
       [0, 2, 4]])
a is added to each row of b, as if a had been repeated over two rows.

What is the mechanism called that makes it possible to compute an array together with a scalar (or with an array of a different shape)?

Answer

Broadcasting
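
The stretching that broadcasting performs can be made visible with np.broadcast_to, as in this sketch using the same a and b as the earlier question:

```python
import numpy as np

a = np.array([0, 1, 2])
b = np.array([[-3, -2, -1],
              [0, 1, 2]])

# `a` (shape (3,)) is conceptually stretched to shape (2, 3) before the addition
a_stretched = np.broadcast_to(a, b.shape)
total = a + b
scaled = b * 10  # a scalar is broadcast to every element

print(a_stretched)
# [[0 1 2]
#  [0 1 2]]
print(total)
# [[-3 -1  1]
#  [ 0  2  4]]
```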

What does the @ operator mean?

Answer

Infix operator for matrix multiplication

How else can the following be written?

A_matrix @ B_matrix

Answer

np.dot(A_matrix, B_matrix)
or A_matrix.dot(B_matrix)
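
For two-dimensional arrays the three spellings give the same matrix product, which a sketch with small made-up matrices can confirm:

```python
import numpy as np

A_matrix = np.array([[1, 2], [3, 4]])
B_matrix = np.array([[5, 6], [7, 8]])

p1 = A_matrix @ B_matrix
p2 = np.dot(A_matrix, B_matrix)
p3 = A_matrix.dot(B_matrix)

print(p1)
# [[19 22]
#  [43 50]]
print(np.array_equal(p1, p2) and np.array_equal(p1, p3))  # True
```

Note that A_matrix * B_matrix is a different operation: it multiplies element by element.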

A function that counts the number of True values in a boolean array.

Answer

np.count_nonzero function, or the np.sum function

- np.count_nonzero function
  - Returns the number of non-zero elements.
  - Python treats False as 0, so it counts the number of True values.
- np.sum function
  - Adds up the elements.
  - Python treats True as 1, so the result is the number of True values.

A function that checks whether a boolean array contains at least one True.

Answer

np.any function

A function that checks whether all elements of a boolean array are True.

Answer

np.all function
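
The four boolean helpers above, tried on one small made-up array:

```python
import numpy as np

flags = np.array([True, False, True, True])

n_true = np.count_nonzero(flags)  # counts non-zero (True) elements
n_sum = np.sum(flags)             # True counts as 1, False as 0
has_any = np.any(flags)           # at least one True?
all_true = np.all(flags)          # every element True?

print(n_true, n_sum)      # 3 3
print(has_any, all_true)  # True False
```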

4.2. pandas

4.2.1. Overview of pandas

By default, df.head() and df.tail() output only the first and last () rows of the DataFrame, respectively.

Answer

5 rows

How to check the size of df

Answer

df.shape (an attribute)

How to get the two columns A and B from df

Answer

`df[["A", "B"]]` or `df.loc[:, ["A", "B"]]`, etc.

4.2.2. Reading / writing data

4.2.3. Data shaping

How to extract only the records with 10,000 steps or more, assuming df is a DataFrame of step counts and calorie intake

Answer

`df[df["steps"] >= 10000]`

Or `df[df.loc[:, "steps"] >= 10000]`, `df.query('steps >= 10000')`, etc.

How to sort in descending order of steps, assuming df is a DataFrame of step counts and calorie intake

Answer

df.sort_values(by="steps", ascending=False)

One-hot encode the exercise index column, which contains the three values High, Mid, and Low, using "exercise" as the prefix.

Answer

pd.get_dummies(df.loc[:, "exercise index"], prefix="exercise")
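
The data-shaping operations in this section can be tried on a tiny DataFrame. The column names follow the questions, but the values are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: daily steps, calorie intake, exercise index
df = pd.DataFrame({
    "steps": [12000, 8000, 15000, 3000],
    "calories": [2100, 2500, 1900, 2800],
    "exercise index": ["High", "Mid", "High", "Low"],
})

active = df[df["steps"] >= 10000]                     # boolean indexing
active_q = df.query("steps >= 10000")                 # same filter via query
ranked = df.sort_values(by="steps", ascending=False)  # descending sort
dummies = pd.get_dummies(df["exercise index"], prefix="exercise")  # one-hot encoding

print(active["steps"].tolist())  # [12000, 15000]
print(ranked["steps"].tolist())  # [15000, 12000, 8000, 3000]
print(dummies.columns.tolist())  # ['exercise_High', 'exercise_Low', 'exercise_Mid']
```

get_dummies is a top-level pandas function, so it is called as pd.get_dummies rather than on the DataFrame.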

4.2.4. Time series data

How to create an array of dates from 2020-01-01 to 2020-10-01.

Answer

pd.date_range(start="2020-01-01", end="2020-10-01")

Create an array of dates for 100 days from 2020-01-01.

Answer

pd.date_range(start="2020-01-01", periods=100)

Create an array only for Saturday among the dates from 2020-01-01 to 2020-10-01.

Answer

pd.date_range(start="2020-01-01", end="2020-10-01", freq="W-SAT")

Group the time series data df by month and take the mean.

Answer

`df.groupby(pd.Grouper(freq='M')).mean()`

Or `df.resample('M').mean()`, etc.
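
The following sketch runs the three date_range calls and a monthly mean on made-up values. One swap to note: recent pandas versions warn about the 'M' frequency alias for resample and Grouper, so the monthly grouping here uses the index's to_period("M") instead; that is my substitution, not the answer expected by the exam.

```python
import numpy as np
import pandas as pd

dates = pd.date_range(start="2020-01-01", end="2020-10-01")
hundred = pd.date_range(start="2020-01-01", periods=100)
saturdays = pd.date_range(start="2020-01-01", end="2020-10-01", freq="W-SAT")

print(len(dates))           # 275
print(hundred[-1].date())   # 2020-04-09
print(saturdays[0].date())  # 2020-01-04

# Hypothetical daily values 0, 1, 2, ... indexed by the 100 dates
s = pd.Series(np.arange(len(hundred), dtype=float), index=hundred)
monthly = s.groupby(s.index.to_period("M")).mean()  # one mean per calendar month

print(monthly.iloc[0])  # 15.0 (mean of 0..30 for January)
```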

4.2.5. Missing value processing

Argument used with the fillna function when you want to fill NaN with the previous value.

Answer

`df.fillna(method='ffill')`

For a DataFrame, each NaN is filled with the value one row above. With bfill, it is filled with the value one row below.

What if you want to give the median value to the argument of the fillna function?

Answer

`df.fillna(df.median())`
* Note that it is not `method='median'`.
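
Here is a small sketch of the three ways of filling missing values. It uses the ffill() and bfill() methods, which have the same effect as fillna with method='ffill' or 'bfill'; newer pandas versions deprecate the method argument, so I use the shorthand here. The sample values are made up.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, np.nan, 3.0, np.nan, 100.0]})  # hypothetical data

forward = df.ffill()                 # fill with the value one row above
backward = df.bfill()                # fill with the value one row below
by_median = df.fillna(df.median())   # median of 1, 3, 100 is 3

print(forward["x"].tolist())    # [1.0, 1.0, 3.0, 3.0, 100.0]
print(backward["x"].tolist())   # [1.0, 3.0, 3.0, 100.0, 100.0]
print(by_median["x"].tolist())  # [1.0, 3.0, 3.0, 3.0, 100.0]
```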

4.2.6. Data consolidation

Create df_merge by concatenating df_1 and df_2 in the column direction.

Answer

df_merge = pd.concat([df_1, df_2], axis=1)

4.2.7. Handling of statistical data

Function to check the mode

Answer

mode function

Function that gives the median

Answer

median function

A function that yields the standard deviation (sample standard deviation)

Answer

std function

Functions and arguments that give the standard deviation (population)

Answer

Pass the ddof = 0 argument to the std function
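
The statistics functions above can be compared on a small made-up Series. The point to notice is that pandas' std uses ddof=1 (sample) by default, while ddof=0 gives the population value.

```python
import pandas as pd

s = pd.Series([2, 4, 4, 4, 5, 5, 7, 9])  # hypothetical data with mean 5

most_common = s.mode()     # Series of the most frequent value(s)
middle = s.median()        # 4.5
sample_std = s.std()       # ddof=1 by default (sample standard deviation)
pop_std = s.std(ddof=0)    # population standard deviation

print(most_common.tolist())  # [4]
print(middle)                # 4.5
print(pop_std)               # 2.0
print(sample_std > pop_std)  # True
```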

4.3. Matplotlib

From which position is a pie chart generally drawn?

Answer

From the top (the 12 o'clock position)

A pie chart is generally arranged (clockwise or counterclockwise).

Answer

Clockwise

To draw a pie chart clockwise, pass the () argument to the () method.

Answer

Pass `counterclock=False` to the `pie` method. For some reason, Matplotlib draws in the opposite direction to the usual convention. Why? (lol) The default is `counterclock=True`.

To specify where to start drawing in a pie chart, pass the () argument to the () method.

Answer

Pass `startangle={{angle at which to start drawing}}` to the `pie` method. The default value is None, which draws from the 3 o'clock position. Specifying 90 degrees starts the drawing from 12 o'clock.
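
Putting the two arguments together, a pie chart that starts at 12 o'clock and runs clockwise might be drawn as in this sketch. The labels and sizes are made up, and the Agg backend is selected only so that the code runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend (assumption: no display available)
import matplotlib.pyplot as plt

labels = ["A", "B", "C"]  # hypothetical categories
sizes = [50, 30, 20]      # hypothetical shares

fig, ax = plt.subplots()
# startangle=90 starts at 12 o'clock; counterclock=False draws clockwise
wedges, texts = ax.pie(sizes, labels=labels, startangle=90, counterclock=False)
ax.axis("equal")  # keep the pie circular

first_wedge_end = wedges[0].theta2  # the first wedge touches the 90 degree line
plt.close(fig)
```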

4.4. scikit-learn

4.4.1. Preprocessing

Missing value

What class is used to complement the data if there are missing values?

Answer

Imputer class (removed in scikit-learn 0.22; its replacement is the SimpleImputer class in sklearn.impute)

About the values passed to the strategy argument of the above class:

mean = ①, median = ②, most_frequent = ③

Answer

1. Mean
2. Median
3. Mode
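
A minimal sketch of mean imputation follows. It assumes a recent scikit-learn, where the class is SimpleImputer in sklearn.impute rather than the older Imputer; the data is made up.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical data with one missing value in each column
X = np.array([[1.0, 10.0],
              [np.nan, 20.0],
              [3.0, np.nan]])

imputer = SimpleImputer(strategy="mean")  # "median" and "most_frequent" also work
X_filled = imputer.fit_transform(X)

print(X_filled)
# [[ 1. 10.]
#  [ 2. 20.]
#  [ 3. 15.]]
```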

Categorical variable encoding

What is the class that encodes categorical variables?

Answer

LabelEncoder class

What is the attribute that confirms the original value after encoding?

Answer

.classes_ attribute
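
How LabelEncoder and its classes_ attribute fit together can be seen in this sketch, which uses made-up blood type values:

```python
from sklearn.preprocessing import LabelEncoder

blood = ["A", "B", "O", "AB", "A"]  # hypothetical categorical data

encoder = LabelEncoder()
codes = encoder.fit_transform(blood)  # each category becomes an integer

print(encoder.classes_)  # ['A' 'AB' 'B' 'O']  (original values, in sorted order)
print(codes)             # [0 2 3 1 0]
print(list(encoder.inverse_transform(codes)) == blood)  # True
```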

Besides the encoding above, what is the other major method for processing categorical variables?

Answer

One-hot encoding
For example, if there are 4 blood types, 4 columns are added and used as flags.

What is another name for this encoding?

Answer

Dummy variable

What are a matrix whose components are mostly 0 and a matrix whose components are mostly non-zero called, respectively?

Answer

Sparse matrix and dense matrix

Feature normalization

Variance normalization (standardization) is the process of converting features so that the mean of each feature is () and the standard deviation is ().

Answer

Mean 0, standard deviation 1

What is the class that performs variance normalization?

Answer

StandardScaler class

Minimum / maximum normalization is the process of converting features so that the minimum value of the feature is () and the maximum value is ().

Answer

Minimum value 0, maximum value 1

What is the class that performs minimum / maximum normalization?

Answer

MinMaxScaler class
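
Both kinds of normalization can be compared on a single made-up feature column:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])  # hypothetical single feature

standardized = StandardScaler().fit_transform(X)  # mean 0, standard deviation 1
rescaled = MinMaxScaler().fit_transform(X)        # minimum 0, maximum 1

print(standardized.mean(), standardized.std())  # approximately 0 and 1
print(rescaled.ravel())                         # [0.   0.25 0.5  0.75 1.  ]
```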

4.4.2. Classification

Classification is a typical task of () learning.

Answer

Supervised learning
Classification uses known data as the teacher and learns a model that assigns each data point to a class.

The above uses the correct label, which is called the () variable.

Answer

Objective variable

Three typical classification algorithms

Answer

- Support vector machine
- Decision tree
- Random forest

Flow of classification model construction

To build a classification model, the data at hand is ().

Answer

Divide into a training dataset and a test dataset.

"Learning" in classification refers to building a classification model using () datasets.

Answer

Training dataset

What is the ability to handle unknown data called, which is calculated from the constructed model's predictions on the test dataset?

Answer

Generalization ability

What is the function that separates each dataset?

Answer

model_selection.train_test_split function

scikit-learn uses the () method for learning and the () method for prediction.

Answer

Learning: fit method
Prediction: predict method
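
The whole flow (split, fit, predict) might look like the sketch below. The iris dataset, the SVC classifier, the 30% test size, and the random seed are all my own choices for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)  # 150 samples, 3 classes

# Hold out 30% of the data as the test dataset (arbitrary split and seed)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

model = SVC()                   # support vector machine classifier
model.fit(X_train, y_train)     # learning uses only the training dataset
y_pred = model.predict(X_test)  # prediction on data the model has not seen

accuracy = (y_pred == y_test).mean()
print(X_train.shape, X_test.shape)  # (105, 4) (45, 4)
print(accuracy)
```

The accuracy on the test dataset is a rough estimate of the generalization ability mentioned above.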

Support vector machine

Support vector machines are algorithms that can be used not only for classification and regression, but also for ().

Answer

Outlier detection

When considering 2D data belonging to two classes, what is the data closest to the boundary among the data of each class?

Answer

Support vector

When considering 2D data belonging to two classes, a straight line is drawn so that the distance between the support vectors is as () as possible; this line is called the ().

Answer

- Large (far)
- Decision boundary

The distance between this straight line and the support vector is called ().

Answer

margin

Random forest

What is the data called that consists of randomly selected samples and features (explanatory variables) and is used in a random forest?

Answer

Bootstrap data

A random forest is a collection of decision trees. What is learning that uses multiple learners in this way called?

Answer

Ensemble learning

4.4.3. Regression

Regression is the task of explaining the () variable with the () variables represented by features.

Answer

- Objective variable
- Explanatory variables

In linear regression, when the explanatory variable is one variable, it is called (), and when there are two or more variables, it is called ().

Answer

- Simple regression
- Multiple regression

4.4.4. Dimensionality reduction

A task that () the data while losing as little as possible of the information it holds.

Answer

Compresses

Principal component analysis

In scikit-learn, which class of which module is used for principal component analysis?

Answer

decomposition.PCA class
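
As a rough sketch of dimensionality reduction with PCA, the following compresses two strongly correlated features into one. The synthetic data and the seed are assumptions made purely for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)  # arbitrary seed
x = rng.normal(size=200)
# Second feature is almost a multiple of the first, plus a little noise
X = np.column_stack([x, 2 * x + rng.normal(scale=0.1, size=200)])

pca = PCA(n_components=1)         # keep only the first principal component
X_reduced = pca.fit_transform(X)  # shape goes from (200, 2) to (200, 1)

print(X_reduced.shape)                # (200, 1)
print(pca.explained_variance_ratio_)  # share of the variance kept, close to 1 here
```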

4.4.5. Model evaluation

Category classification accuracy

Four indicators that quantify how well data has been assigned to categories.

(), (), () value, ()

Answer

- Precision
- Recall
- F-value
- Accuracy

In addition, these indicators are calculated from the () matrix.

Answer

Confusion matrix

There is a trade-off between () and ().

Answer

- Precision
- Recall

Prediction probability accuracy

The () curve and () calculated from it are used as indicators to quantify the accuracy of the prediction probability for the data.

Answer

- ROC curve
- AUC
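
The evaluation indicators in this section can all be computed with sklearn.metrics. In this sketch the true labels, predicted labels, and predicted probabilities are made up by hand so that the numbers are easy to verify.

```python
from sklearn.metrics import (accuracy_score, confusion_matrix, f1_score,
                             precision_score, recall_score, roc_auc_score)

# Hypothetical labels (1 = positive), predictions, and predicted probabilities
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.6, 0.3, 0.2, 0.1]

cm = confusion_matrix(y_true, y_pred)  # rows: true class, columns: predicted class
precision = precision_score(y_true, y_pred)  # TP / (TP + FP) = 3 / 4
recall = recall_score(y_true, y_pred)        # TP / (TP + FN) = 3 / 4
f_value = f1_score(y_true, y_pred)           # harmonic mean of precision and recall
accuracy = accuracy_score(y_true, y_pred)    # (TP + TN) / all = 6 / 8
auc = roc_auc_score(y_true, y_score)         # area under the ROC curve

print(cm)
# [[3 1]
#  [1 3]]
print(precision, recall, f_value, accuracy)  # 0.75 0.75 0.75 0.75
print(auc)                                   # 0.9375
```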

4.4.6. Hyperparameter optimization

Hyperparameter values are (determined / not determined) during training.

Answer

Not determined. The user needs to specify the values separately from training.

Two typical methods for optimizing hyperparameters.

Answer

- Grid search
- Random search
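
A grid search can be sketched with GridSearchCV as below. The dataset, the decision tree model, the candidate max_depth values, and the 5-fold setting are my own assumptions for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Candidate values for one hyperparameter (arbitrary choices)
param_grid = {"max_depth": [2, 3, 4]}

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,  # 5-fold cross-validation for each candidate
)
search.fit(X, y)  # tries every combination in the grid

print(search.best_params_)  # the candidate with the best cross-validation score
print(search.best_score_)
```

Random search follows the same pattern with RandomizedSearchCV, which samples candidates instead of trying them all.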

In closing

The questions are rough, but I hope they help someone. If you find any mistakes, I would be grateful if you could point them out in the comments. Thank you for reading to the end.
