This article describes binarization and discretization as preprocessing techniques for count data. It is mainly based on the book "Feature Engineering for Machine Learning"; please check it out if you are interested.
Also, the content of this article is explained in more detail on YouTube, so if you are interested, please check it out.
** Binarization **
Binarization, as the name implies, is the process of converting a value into a binary (0/1) value. For example, consider the following scenario.
Example
I want to build a system that recommends songs to users.
I want to use the number of times the user listened to a song as a feature, but how should I format the data?
Suppose the data for a certain user looks like the following. The first column is the song ID, and the second column is the number of times the song has been played.
The histogram of this data is as follows.
Now, to recommend songs, what matters is whether the user was interested in a song. However, if the raw counts are used as-is, a song that has been played 20 times tells the model that the user likes it 20 times as much as a song played only once. Instead, assume the user is interested in a song if they have played it even once: songs played one or more times are binarized to 1, and songs never played are binarized to 0. This removes the scale differences between songs and simply divides them into songs the user was interested in and songs they were not.
This is represented in a graph as follows.
The implemented code is shown below.
binary.py

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Fix the random seed
np.random.seed(100)

## Generate pseudo data
data_array = []
for i in range(1, 1000):
    s = np.random.randint(0, i * 10, 10)
    data_array.extend(s)
data_array.extend(np.zeros(9000))
data = pd.DataFrame({'Listen Count': data_array})

data_binary = pd.DataFrame()
## Multiplying by 1 converts True to 1 and False to 0
data_binary['Listen Count'] = (data['Listen Count'] > 0) * 1
```
** Discretization **
Discretization lets continuous values be treated as members of a common group, which brings the following benefits:

- The effect of scale can be removed
- The influence of outliers can be reduced
For example, if a person's age is given as numerical data, all ages can be divided into groups: 0 to 10 as group 1, 10 to 20 as group 2, ..., and 80 and over as group 9. You might feel that the raw numerical values could be left as they are, but if, say, the data contains a few people who lived to 110, the model can be pulled toward those large values and the influence of other factors reduced. By putting ages 80 and 110 together in the same "elderly" group, such problems can be avoided.
Here the ages are divided into 10-year bins, but depending on lifestyle, you might instead put 0 to 12 (childhood through elementary school) in group 1 and 12 to 17 (junior high and high school students) in group 2.
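As a sketch of this kind of grouping (the ages and bin edges below are made up for illustration), pandas' `cut` can assign each age either to a fixed-width bin or to a custom bin:

```python
import pandas as pd

ages = pd.Series([3, 15, 22, 45, 67, 83, 110])

## Fixed-width bins: every 10 years (0-10 -> 0, 10-20 -> 1, ...)
fixed = pd.cut(ages, bins=range(0, 121, 10), labels=False, right=False)
print(fixed.tolist())   # -> [0, 1, 2, 4, 6, 8, 11]

## Custom bins based on life stage rather than equal widths
stages = pd.cut(ages, bins=[0, 12, 18, 65, 120],
                labels=['child', 'student', 'adult', 'senior'], right=False)
print(stages.tolist())  # -> ['child', 'student', 'adult', 'adult', 'senior', 'senior', 'senior']
```

With `labels=False`, `cut` returns the index of the bin each value falls into; passing a list of labels returns the group names instead.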
Also, if the values span multiple orders of magnitude, it may be better to group them by powers of 10: 0-9, 10-99, 100-999, and so on.
** When dividing by 10 **
discretization.py

```python
import numpy as np

small_counts = np.random.randint(0, 100, 20)
print(small_counts)

## Integer division by 10 maps 0-9 -> 0, 10-19 -> 1, ...
print(np.floor_divide(small_counts, 10))
```
** When grouping by a power of 10 **
discretization.py

```python
import numpy as np

large_counts = []
for i in range(1, 100, 10):
    ## Lower bound of 1 so that log10 below never sees 0
    tmp = np.random.randint(1, i * 1000, 5)
    large_counts.extend(tmp)
print(np.array(large_counts))

## floor(log10(x)) maps 1-9 -> 0, 10-99 -> 1, 100-999 -> 2, ...
print(np.floor(np.log10(large_counts)))
```
Fixed-width discretization is very convenient, but when the count data has large gaps, many bins end up containing no data at all. In such cases, quantiles are used. Quantiles divide the sorted data into groups containing equal numbers of points: the median splits the data into two halves, quartiles divide it into four groups, and deciles divide it into ten.
For example, the deciles of the following distribution are shown in the table below.
Plotting these values gives the figure below, where you can see that the bin widths are chosen so that each bin contains roughly the same amount of data.
The implemented program is shown below.
** When grouping by quantiles (graph) **
quantile.py

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

## Fix the random seed
np.random.seed(100)

## Generate pseudo data
data_array = []
for i in range(1, 1000):
    s = np.random.randint(0, i * 10, 10)
    data_array.extend(s)
data_array.extend(np.zeros(2000))
data = pd.DataFrame({'Listen Count': data_array})

## Compute the deciles (10%, 20%, ..., 90% points)
deciles = data['Listen Count'].quantile([.1, .2, .3, .4, .5, .6, .7, .8, .9])
print(deciles)

## Overlay the deciles on the histogram as dashed vertical lines
data['Listen Count'].hist(bins=100)
plt.vlines(deciles, 0, 5500, "blue", linestyles='dashed')
plt.show()
```
** When grouping by quantiles **
quantile.py

```python
import numpy as np
import pandas as pd

large_counts = []
for i in range(1, 100, 10):
    tmp = np.random.randint(0, i * 1000, 5)
    large_counts.extend(tmp)
print(np.array(large_counts))

## Divide into quartiles (4 bins with equal numbers of points)
print(pd.qcut(large_counts, 4, labels=False))
```
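To make the difference from fixed-width binning concrete, here is a small sketch (the data is my own illustration) contrasting `pd.cut` with `pd.qcut` on the same values:

```python
import pandas as pd

data = [1, 2, 3, 4, 5, 6, 7, 8, 100]

## Fixed-width bins: the outlier 100 stretches the range,
## so most values crowd into the first bin
print(pd.cut(data, 4, labels=False).tolist())   # -> [0, 0, 0, 0, 0, 0, 0, 0, 3]

## Quantile bins: each bin holds roughly the same number of values
print(pd.qcut(data, 4, labels=False).tolist())  # -> [0, 0, 0, 1, 1, 2, 2, 3, 3]
```

With quantile-based bins, the outlier no longer dominates: it simply lands in the top group together with the other large values.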
I plan to post reviews and explanation videos of technical books on YouTube, focusing on machine learning. I also introduce companies worth knowing about if you work in IT. Likes, channel subscriptions, and high ratings motivate me to keep updating YouTube and Qiita.
YouTube: https://www.youtube.com/channel/UCywlrxt0nEdJGYtDBPW-peg Twitter: https://twitter.com/tatelabo