If you have domain knowledge when doing machine learning, you can improve accuracy by designing appropriate features and supplying them to the model. Even without domain knowledge, you can apply arithmetic operations and aggregations to the existing columns and hope that a useful feature turns up by chance. Because this approach tries every possible combination from one end to the other, it is apparently called brute force feature engineering.
featuretools is a Python library that semi-automates feature creation, which is tedious to do by hand. It is very convenient.
Official featuretools tutorial: https://docs.featuretools.com/en/stable/
In this article, I follow the code from this blog post: https://blog.amedama.jp/entry/featuretools-brute-force-feature-engineering
Install it with pip
terminal
pip install featuretools
You can also integrate with other libraries by installing add-ons. https://docs.featuretools.com/en/stable/install.html
When multiple DataFrames are given, features are created by aggregating across them, computing statistics, and applying the four arithmetic operations between features. Deep Feature Synthesis (DFS) handles these tasks for you in a sensible way. The function that does this is featuretools.dfs(). https://docs.featuretools.com/en/stable/getting_started/afe.html
To make this automation work well, you need to specify more detailed data types than pandas.DataFrame provides. For example, just for types that represent time there are Datetime, DateOfBirth, DatetimeTimeIndex, NumericTimeIndex, and so on, which makes inappropriate combinations less likely to occur. https://docs.featuretools.com/en/stable/getting_started/variables.html
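featuretools calls each input dataset an entity. You will often bring data in as a pandas.DataFrame, in which case one pandas.DataFrame corresponds to one entity. Note that this is the terminology of the API before featuretools 1.0, which the code in this article targets; from 1.0 onward the calls were renamed (for example, entity_from_dataframe became add_dataframe).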
trans_primitives performs calculations between features
python
import pandas as pd
data = {'name': ['a', 'b', 'c'],
        'x': [1, 2, 3],
        'y': [2, 4, 6],
        'z': [3, 6, 9]}
df = pd.DataFrame(data)
df
First, create an empty featuretools.EntitySet. An EntitySet is an object that defines the relationships between entities and the processing to apply to them, but only the id is set below. The id can be omitted, but here we use id='example'.
python
import featuretools as ft
es = ft.EntitySet(id='example')
es
Below, the df created earlier is registered so that it can be referred to by the name 'locations'. index= is the argument that specifies which column is the index; if it is omitted, the first column of the DataFrame is treated as the index.
python
es = es.entity_from_dataframe(entity_id='locations',
                              dataframe=df,
                              index='name')
es
output
Entityset: example
Entities:
locations [Rows: 3, Columns: 4]
Relationships:
No relationships
An entity registered in the EntitySet can be accessed as follows
python
es['locations']
output
Entity: locations
Variables:
name (dtype: index)
x (dtype: numeric)
y (dtype: numeric)
z (dtype: numeric)
Shape:
(Rows: 3, Columns: 4)
python
es['locations'].df
Now that we have an EntitySet, all that remains is to pass it to ft.dfs() to create the features. target_entity is the main entity, trans_primitives lists the calculations to apply to combinations of features, and agg_primitives lists the calculations to use for aggregation.
The available primitives are listed at https://primitives.featurelabs.com/
Below, add_numeric tells dfs to add the sum of each pair of features, and subtract_numeric tells it to add the difference of each pair.
python
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_entity='locations',
                                      trans_primitives=['add_numeric', 'subtract_numeric'],
                                      agg_primitives=[],
                                      max_depth=1)
feature_matrix
In addition to the original x, y, and z, the sums x + y, x + z, y + z and the differences x - y, x - z, y - z have been added.
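To make the result concrete, here is a minimal pandas-only sketch of the same six columns, built by hand without featuretools. The column labels such as 'x + y' are my own naming, chosen to resemble the output, and the sketch assumes that each unordered pair of columns is combined once.

```python
from itertools import combinations

import pandas as pd

df = pd.DataFrame({'name': ['a', 'b', 'c'],
                   'x': [1, 2, 3],
                   'y': [2, 4, 6],
                   'z': [3, 6, 9]}).set_index('name')

numeric_columns = ['x', 'y', 'z']

# Start from the original columns, then append one sum and one difference
# for every unordered pair (assumption: each pair is used once, in this order).
manual = df.copy()
for left, right in combinations(numeric_columns, 2):
    manual[f'{left} + {right}'] = df[left] + df[right]
for left, right in combinations(numeric_columns, 2):
    manual[f'{left} - {right}'] = df[left] - df[right]

print(manual)
```

With three numeric columns there are three pairs, so three sums and three differences are added. The number of pairs grows quadratically with the number of columns, which is why this is a brute force approach.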
For agg_primitives, specify the calculations to use for aggregation.
python
data = {'item_id': [1, 2, 3, 4, 5],
        'name': ['apple', 'broccoli', 'cabbage', 'dorian', 'eggplant'],
        'category': ['fruit', 'vegetable', 'vegetable', 'fruit', 'vegetable'],
        'price': [100, 200, 300, 4000, 500]}
item_df = pd.DataFrame(data)
item_df
You now have a DataFrame with two categorical variables to use for aggregation.
The steps are the same as before up to the point where the entity is added.
python
es = ft.EntitySet(id='example')
es = es.entity_from_dataframe(entity_id='items',
                              dataframe=item_df,
                              index='item_id')
es
Next, add a relationship to use for aggregation.
The code below creates a new entity called category from the items entity and uses the category column as the index of the new entity.
python
es = es.normalize_entity(base_entity_id='items',
                         new_entity_id='category',
                         index='category')
es
output
Entityset: example
Entities:
items [Rows: 5, Columns: 4]
category [Rows: 2, Columns: 1]
Relationships:
items.category -> category.category
As for what this does: first of all, items is left as it is.
python
es['items'].df
The category entity, on the other hand, is indexed by the category column specified above, so it looks like this.
python
es['category'].df
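As a rough pandas analogy (my own sketch, not how featuretools implements it), the new category entity corresponds to the distinct values of the category column, one row per value. The variable name category_df is hypothetical.

```python
import pandas as pd

item_df = pd.DataFrame({
    'item_id': [1, 2, 3, 4, 5],
    'name': ['apple', 'broccoli', 'cabbage', 'dorian', 'eggplant'],
    'category': ['fruit', 'vegetable', 'vegetable', 'fruit', 'vegetable'],
    'price': [100, 200, 300, 4000, 500],
})

# One row per distinct category value; this plays the role of the parent
# table that items.category points to.
category_df = item_df[['category']].drop_duplicates().reset_index(drop=True)

print(category_df)
```

The result has 2 rows and 1 column, matching the "category [Rows: 2, Columns: 1]" line in the EntitySet summary.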
Try specifying count, sum, and mean for agg_primitives.
python
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_entity='items',
                                      trans_primitives=[],
                                      agg_primitives=['count', 'sum', 'mean'],
                                      max_depth=2)
feature_matrix
Since the only column that can be aggregated is items.price, the count, mean, and sum per category are calculated as category.COUNT(items), category.MEAN(items.price), and category.SUM(items.price), and a DataFrame with 3 additional columns is created.
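The three added columns can be reproduced by hand with a pandas groupby, which may help when checking the values. This is a sketch under the assumption that each item simply receives the statistics of its own category; the column names are my own labels modelled on the feature naming.

```python
import pandas as pd

item_df = pd.DataFrame({
    'item_id': [1, 2, 3, 4, 5],
    'name': ['apple', 'broccoli', 'cabbage', 'dorian', 'eggplant'],
    'category': ['fruit', 'vegetable', 'vegetable', 'fruit', 'vegetable'],
    'price': [100, 200, 300, 4000, 500],
}).set_index('item_id')

# Group prices by category, then broadcast each group's statistic back
# onto every item that belongs to that category.
price_by_category = item_df.groupby('category')['price']

manual = item_df.copy()
manual['category.COUNT(items)'] = price_by_category.transform('count')
manual['category.MEAN(items.price)'] = price_by_category.transform('mean')
manual['category.SUM(items.price)'] = price_by_category.transform('sum')

print(manual)
```

For example, both fruit rows (apple and dorian) get a count of 2, a sum of 4100, and a mean of 2050.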
python
data = {'item_id': [1, 2, 3],
        'name': ['apple', 'banana', 'cherry'],
        'price': [100, 200, 300]}
item_df = pd.DataFrame(data)
item_df
python
from datetime import datetime
data = {'transaction_id': [10, 20, 30, 40],
        'time': [
            datetime(2016, 1, 2, 3, 4, 5),
            datetime(2017, 2, 3, 4, 5, 6),
            datetime(2018, 3, 4, 5, 6, 7),
            datetime(2019, 4, 5, 6, 7, 8),
        ],
        'item_id': [1, 2, 3, 1],
        'amount': [1, 2, 3, 4]}
tx_df = pd.DataFrame(data)
tx_df
Add the entities as before
python
es = ft.EntitySet(id='example')
es = es.entity_from_dataframe(entity_id='items',
                              dataframe=item_df,
                              index='item_id')
es = es.entity_from_dataframe(entity_id='transactions',
                              dataframe=tx_df,
                              index='transaction_id',
                              time_index='time')
es
Create a relationship that connects the two entities. They are joined on the item_id column of items and the item_id column of transactions.
python
relationship = ft.Relationship(es['items']['item_id'], es['transactions']['item_id'])
es = es.add_relationship(relationship)
es
output
Entityset: example
Entities:
items [Rows: 3, Columns: 3]
transactions [Rows: 4, Columns: 4]
Relationships:
transactions.item_id -> items.item_id
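The relationship above is one-to-many: one item can have many transactions. As a hedged sketch, the two properties such a link usually relies on can be checked in plain pandas before building the EntitySet. These checks are my own illustration, not something featuretools is shown doing in this article, and the variable names are hypothetical.

```python
import pandas as pd

item_df = pd.DataFrame({'item_id': [1, 2, 3],
                        'name': ['apple', 'banana', 'cherry'],
                        'price': [100, 200, 300]})
tx_df = pd.DataFrame({'transaction_id': [10, 20, 30, 40],
                      'item_id': [1, 2, 3, 1],
                      'amount': [1, 2, 3, 4]})

# Parent side: each item_id should identify exactly one row of items.
parent_key_is_unique = item_df['item_id'].is_unique

# Child side: every transaction should point at an item that exists.
children_have_parent = bool(tx_df['item_id'].isin(item_df['item_id']).all())

print(parent_key_is_unique, children_have_parent)
```

If the second check fails, the affected transactions have no item to be aggregated into.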
python
feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_entity='items',
                                      trans_primitives=['add_numeric', 'subtract_numeric'],
                                      agg_primitives=['count', 'sum', 'mean'],
                                      max_depth=2)
feature_matrix
Listing only the column headings gives the following. Both the aggregations and the calculations between features are performed.
output
['name',
'price',
'COUNT(transactions)',
'MEAN(transactions.amount)',
'SUM(transactions.amount)',
'COUNT(transactions) + MEAN(transactions.amount)',
'COUNT(transactions) + SUM(transactions.amount)',
'COUNT(transactions) + price',
'MEAN(transactions.amount) + SUM(transactions.amount)',
'MEAN(transactions.amount) + price',
'price + SUM(transactions.amount)',
'COUNT(transactions) - MEAN(transactions.amount)',
'COUNT(transactions) - SUM(transactions.amount)',
'COUNT(transactions) - price',
'MEAN(transactions.amount) - SUM(transactions.amount)',
'MEAN(transactions.amount) - price',
'price - SUM(transactions.amount)']
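To see where these columns come from, here is a pandas-only sketch that builds the three aggregation columns and two of the twelve combined columns by hand. It assumes the aggregations are per-item statistics of transactions.amount joined back onto items; the column labels copy the names in the list above.

```python
import pandas as pd

item_df = pd.DataFrame({'item_id': [1, 2, 3],
                        'name': ['apple', 'banana', 'cherry'],
                        'price': [100, 200, 300]})
tx_df = pd.DataFrame({'transaction_id': [10, 20, 30, 40],
                      'item_id': [1, 2, 3, 1],
                      'amount': [1, 2, 3, 4]})

# Step 1 (aggregation): summarise the child rows per parent key.
agg = tx_df.groupby('item_id')['amount'].agg(['count', 'mean', 'sum'])
agg.columns = ['COUNT(transactions)',
               'MEAN(transactions.amount)',
               'SUM(transactions.amount)']

# Step 2: attach the aggregates to the parent table.
manual = item_df.set_index('item_id').join(agg)

# Step 3 (transform): combine numeric columns pairwise; two examples only.
manual['COUNT(transactions) + price'] = manual['COUNT(transactions)'] + manual['price']
manual['price - SUM(transactions.amount)'] = manual['price'] - manual['SUM(transactions.amount)']

print(manual)
```

Item 1 (apple) has two transactions with amounts 1 and 4, so its count is 2, its mean is 2.5, and its sum is 5.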
Let's see what happens as max_depth is gradually increased in the code above.
max_depth=1
['name',
'price',
'COUNT(transactions)',
'MEAN(transactions.amount)',
'SUM(transactions.amount)']
Added when max_depth goes from 1 to 2
['COUNT(transactions) + MEAN(transactions.amount)',
'COUNT(transactions) + SUM(transactions.amount)',
'COUNT(transactions) + price',
'MEAN(transactions.amount) + SUM(transactions.amount)',
'MEAN(transactions.amount) + price',
'price + SUM(transactions.amount)',
'COUNT(transactions) - MEAN(transactions.amount)',
'COUNT(transactions) - SUM(transactions.amount)',
'COUNT(transactions) - price',
'MEAN(transactions.amount) - SUM(transactions.amount)',
'MEAN(transactions.amount) - price',
'price - SUM(transactions.amount)']
Added when max_depth goes from 2 to 3
['MEAN(transactions.amount + items.price)',
'MEAN(transactions.amount - items.price)',
'SUM(transactions.amount + items.price)',
'SUM(transactions.amount - items.price)']
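These four features combine a transform with an aggregation: the item's price is first brought onto each transaction row, added to or subtracted from amount, and the result is then aggregated per item. A pandas sketch of that reading follows; it is my interpretation of the feature names, not the library's internals.

```python
import pandas as pd

item_df = pd.DataFrame({'item_id': [1, 2, 3],
                        'name': ['apple', 'banana', 'cherry'],
                        'price': [100, 200, 300]})
tx_df = pd.DataFrame({'transaction_id': [10, 20, 30, 40],
                      'item_id': [1, 2, 3, 1],
                      'amount': [1, 2, 3, 4]})

# Bring the parent's price onto each transaction row (items.price).
joined = tx_df.merge(item_df[['item_id', 'price']], on='item_id')

# Transform on the child rows.
joined['amount + items.price'] = joined['amount'] + joined['price']
joined['amount - items.price'] = joined['amount'] - joined['price']

# Aggregate the transformed columns back to one row per item.
deep = joined.groupby('item_id')[['amount + items.price',
                                  'amount - items.price']].agg(['mean', 'sum'])

# Flatten the (column, function) pairs into names like the ones listed above.
deep.columns = [f'{func.upper()}(transactions.{col})' for col, func in deep.columns]

print(deep)
```

For item 1 the transformed values are 101 and 104, so MEAN(transactions.amount + items.price) is 102.5 and the corresponding SUM is 205.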
Added when max_depth goes from 3 to 4
['COUNT(transactions) + MEAN(transactions.amount + items.price)',
'COUNT(transactions) + MEAN(transactions.amount - items.price)',
'COUNT(transactions) + SUM(transactions.amount + items.price)',
'COUNT(transactions) + SUM(transactions.amount - items.price)',
'MEAN(transactions.amount + items.price) + MEAN(transactions.amount - items.price)',
'MEAN(transactions.amount + items.price) + MEAN(transactions.amount)',
'MEAN(transactions.amount + items.price) + SUM(transactions.amount + items.price)',
'MEAN(transactions.amount + items.price) + SUM(transactions.amount - items.price)',
'MEAN(transactions.amount + items.price) + SUM(transactions.amount)',
'MEAN(transactions.amount + items.price) + price',
'MEAN(transactions.amount - items.price) + MEAN(transactions.amount)',
'MEAN(transactions.amount - items.price) + SUM(transactions.amount + items.price)',
'MEAN(transactions.amount - items.price) + SUM(transactions.amount - items.price)',
'MEAN(transactions.amount - items.price) + SUM(transactions.amount)',
'MEAN(transactions.amount - items.price) + price',
'MEAN(transactions.amount) + SUM(transactions.amount + items.price)',
'MEAN(transactions.amount) + SUM(transactions.amount - items.price)',
'SUM(transactions.amount + items.price) + SUM(transactions.amount - items.price)',
'SUM(transactions.amount + items.price) + SUM(transactions.amount)',
'SUM(transactions.amount - items.price) + SUM(transactions.amount)',
'price + SUM(transactions.amount + items.price)',
'price + SUM(transactions.amount - items.price)',
'COUNT(transactions) - MEAN(transactions.amount + items.price)',
'COUNT(transactions) - MEAN(transactions.amount - items.price)',
'COUNT(transactions) - SUM(transactions.amount + items.price)',
'COUNT(transactions) - SUM(transactions.amount - items.price)',
'MEAN(transactions.amount + items.price) - MEAN(transactions.amount - items.price)',
'MEAN(transactions.amount + items.price) - MEAN(transactions.amount)',
'MEAN(transactions.amount + items.price) - SUM(transactions.amount + items.price)',
'MEAN(transactions.amount + items.price) - SUM(transactions.amount - items.price)',
'MEAN(transactions.amount + items.price) - SUM(transactions.amount)',
'MEAN(transactions.amount + items.price) - price',
'MEAN(transactions.amount - items.price) - MEAN(transactions.amount)',
'MEAN(transactions.amount - items.price) - SUM(transactions.amount + items.price)',
'MEAN(transactions.amount - items.price) - SUM(transactions.amount - items.price)',
'MEAN(transactions.amount - items.price) - SUM(transactions.amount)',
'MEAN(transactions.amount - items.price) - price',
'MEAN(transactions.amount) - SUM(transactions.amount + items.price)',
'MEAN(transactions.amount) - SUM(transactions.amount - items.price)',
'SUM(transactions.amount + items.price) - SUM(transactions.amount - items.price)',
'SUM(transactions.amount + items.price) - SUM(transactions.amount)',
'SUM(transactions.amount - items.price) - SUM(transactions.amount)',
'price - SUM(transactions.amount + items.price)',
'price - SUM(transactions.amount - items.price)']
Added when max_depth goes from 4 to 5
[]
Added when max_depth goes from 5 to 6
[]
Judging from this, the behavior seems to be that agg_primitives, trans_primitives, agg_primitives, and trans_primitives are applied in that order, after which no further features are generated.
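The sizes of the lists above are consistent with simple pair counting. The sketch below uses only the standard library and assumes that add_numeric and subtract_numeric each produce one feature per unordered pair of numeric features. The operand order inside a name may differ from the real output, so only the counts are compared.

```python
from itertools import combinations

# Numeric features available on items after the first aggregation step.
base = ['price',
        'COUNT(transactions)',
        'MEAN(transactions.amount)',
        'SUM(transactions.amount)']

# The four features that appeared when max_depth went from 2 to 3.
extra = ['MEAN(transactions.amount + items.price)',
         'MEAN(transactions.amount - items.price)',
         'SUM(transactions.amount + items.price)',
         'SUM(transactions.amount - items.price)']


def pair_features(features):
    """Return one '+' name and one '-' name per unordered pair of features."""
    pairs = list(combinations(features, 2))
    return ([f'{a} + {b}' for a, b in pairs] +
            [f'{a} - {b}' for a, b in pairs])


depth2 = pair_features(base)

# Pairs that already existed at depth 2 are not new at depth 4.
already_seen = set(depth2)
new_at_depth4 = [f for f in pair_features(base + extra) if f not in already_seen]

print(len(depth2), len(new_at_depth4))  # 12 44
```

Four features give 6 pairs and therefore 12 columns at depth 2. Eight features give 28 pairs, of which 22 are new, so 44 columns are added at depth 4, which matches the length of the list for the step from 3 to 4.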
Custom primitives
It seems that you can also define your own primitives and use them in the calculation: https://docs.featuretools.com/en/stable/getting_started/primitives.html#simple-custom-primitives
Very convenient!