What is featuretools

If you have domain knowledge when doing machine learning, you can improve accuracy by designing appropriate features and feeding them to the model. Even without domain knowledge, you can still combine and aggregate the existing columns and hope to find a useful feature by chance. Because this approach tries every possible combination, it is apparently called brute force feature engineering.
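
To get a feel for why this counts as brute force, here is a minimal plain-Python sketch that only enumerates candidate feature names. The column names and the two operations are assumptions chosen for illustration; this is not how featuretools itself is implemented.

```python
from itertools import combinations

# Hypothetical numeric column names; any columns would do.
columns = ["x", "y", "z"]
# Two illustrative binary operations, keyed by the symbol used in the name.
operations = {"+": lambda a, b: a + b, "-": lambda a, b: a - b}


def enumerate_candidates(cols, ops):
    """Name every (operation, pair of columns) combination."""
    return [f"{a} {sym} {b}" for sym in ops for a, b in combinations(cols, 2)]


candidates = enumerate_candidates(columns, operations)
print(candidates)
# ['x + y', 'x + z', 'y + z', 'x - y', 'x - z', 'y - z']
```

With n columns and k operations this produces k * n * (n - 1) / 2 candidates, so 10 columns already give 90 pairwise features for just these two operations.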

featuretools is a Python library that semi-automates feature creation, which is tedious to do by hand. Very convenient.

featuretools official tutorial https://docs.featuretools.com/en/stable/

In this article, I follow the code from this blog post: https://blog.amedama.jp/entry/featuretools-brute-force-feature-engineering

Install

Install with pip

`terminal`


pip install featuretools

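You can also install add-ons to integrate with other libraries. https://docs.featuretools.com/en/stable/install.html

The code in this article uses the Entity-based API (entity_from_dataframe, normalize_entity, target_entity) of featuretools releases before 1.0. From 1.0 onward these were renamed (add_dataframe, normalize_dataframe, target_dataframe_name).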

Deep Feature Synthesis

Given multiple DataFrames, you create features by aggregating them, computing statistics, and applying arithmetic operations between features. Deep Feature Synthesis (DFS) handles these tasks for you in a sensible way. The function that does this is featuretools.dfs(). https://docs.featuretools.com/en/stable/getting_started/afe.html

To make this automation work well, you need to specify data types in more detail than pandas.DataFrame does. For example, for time-related data alone there are Datetime, DateOfBirth, DatetimeTimeIndex, NumericTimeIndex, and so on, which makes inappropriate combinations less likely. https://docs.featuretools.com/en/stable/getting_started/variables.html
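
As a rough illustration of why detailed types help, the sketch below gates each primitive on the type of its input column. The registry, the type labels, and the joined column are all hypothetical; they only mimic the idea and are not featuretools internals.

```python
# Hypothetical semantic types for the columns of a small table.
column_types = {"name": "index", "x": "numeric", "y": "numeric", "joined": "datetime"}

# Hypothetical registry: each primitive lists the input type it accepts.
primitive_inputs = {"add_numeric": "numeric", "year": "datetime"}


def applicable_columns(primitive, types=column_types):
    """Return the columns whose declared type matches the primitive's input type."""
    wanted = primitive_inputs[primitive]
    return [col for col, kind in types.items() if kind == wanted]


print(applicable_columns("add_numeric"))  # ['x', 'y']
print(applicable_columns("year"))         # ['joined']
```

Because name is typed as an index and joined as a datetime, neither is ever offered to add_numeric, which is the kind of inappropriate combination the finer types rule out.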

3. In case of one entity

featuretools calls each input table an entity. You will often bring in data as a pandas.DataFrame; in that case, one pandas.DataFrame corresponds to one entity.

3-1. trans_primitives only

trans_primitives specifies the calculations to perform between features.

Create a DataFrame to use

`python`


import pandas as pd
data = {'name': ['a', 'b', 'c'],
        'x': [1, 2, 3],
        'y': [2, 4, 6],
        'z': [3, 6, 9],}
df = pd.DataFrame(data)
df

Create an EntitySet

First, create an empty featuretools.EntitySet. An EntitySet is an object that holds the entities and the relationships between them; here only id is specified. The id can be omitted, but below I set id='example'.

`python`


import featuretools as ft
es = ft.EntitySet(id='example')
es

Add entity to EntitySet

Below, the df created earlier is registered under the name 'locations'. The index argument specifies which column serves as the index; if omitted, the first column of the DataFrame is used as the index.

`python`


es = es.entity_from_dataframe(entity_id='locations',
                              dataframe=df,
                              index='name')
es

`output`


Entityset: example
  Entities:
    locations [Rows: 3, Columns: 4]
  Relationships:
    No relationships
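
A column only works as an index if it identifies each row exactly once. As far as I know featuretools rejects a non-unique index, so a quick check beforehand can save a confusing error. The helper below is a hypothetical plain-Python sketch, not part of featuretools.

```python
def is_valid_index(values):
    """An index column should identify each row exactly once (no repeats, no None)."""
    return None not in values and len(set(values)) == len(values)


rows = {"name": ["a", "b", "c"], "x": [1, 2, 3]}
print(is_valid_index(rows["name"]))     # True
print(is_valid_index(["a", "b", "a"]))  # False
```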

An entity registered in the EntitySet can be accessed as follows.

`python`


es['locations']

`output`


Entity: locations
  Variables:
    name (dtype: index)
    x (dtype: numeric)
    y (dtype: numeric)
    z (dtype: numeric)
  Shape:
    (Rows: 3, Columns: 4)

`python`


es['locations'].df

Run dfs

Now that we have an EntitySet, all that remains is to pass it to ft.dfs() to create features. target_entity is the main entity, trans_primitives lists the calculations to apply between features, and agg_primitives lists the calculations to use for aggregation.

The available primitives are summarized below https://primitives.featurelabs.com/

Below, add_numeric adds the sums of pairs of features, and subtract_numeric adds the differences of pairs of features.

`python`


feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_entity='locations',
                                      trans_primitives=['add_numeric', 'subtract_numeric'],
                                      agg_primitives=[],
                                      max_depth=1)
feature_matrix

In addition to the original x, y, and z, their sums x + y, x + z, y + z and their differences x - y, x - z, y - z have been added.
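
The added columns can be reproduced by hand, which is a handy way to sanity-check the result. This is a plain-Python sketch over the same x, y, z values; the dictionary layout is my own choice and not what featuretools returns.

```python
from itertools import combinations

# Same numeric columns as the DataFrame above.
data = {"x": [1, 2, 3], "y": [2, 4, 6], "z": [3, 6, 9]}

features = {}
for a, b in combinations(data, 2):
    # Element-wise sum and difference of each pair of columns.
    features[f"{a} + {b}"] = [p + q for p, q in zip(data[a], data[b])]
    features[f"{a} - {b}"] = [p - q for p, q in zip(data[a], data[b])]

for name, values in features.items():
    print(name, values)
# x + y [3, 6, 9]
# x - y [-1, -2, -3]
# x + z [4, 8, 12]
# x - z [-2, -4, -6]
# y + z [5, 10, 15]
# y - z [-1, -2, -3]
```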

3-2. agg_primitives only

agg_primitives specifies the calculations used to create aggregates.

Create a DataFrame to use

`python`


data = {'item_id': [1, 2, 3, 4, 5],
        'name': ['apple', 'broccoli', 'cabbage', 'durian', 'eggplant'],
        'category': ['fruit', 'vegetable', 'vegetable', 'fruit', 'vegetable'],
        'price': [100, 200, 300, 4000, 500]}
item_df = pd.DataFrame(data)
item_df

You now have a DataFrame with a categorical column to group by for aggregation (category, which takes two values).

Create an EntitySet

The steps are the same as before up to adding the entity.

`python`


es = ft.EntitySet(id='example')
es = es.entity_from_dataframe(entity_id='items',
                              dataframe=item_df,
                              index='item_id')
es

Next, add a relationship to use for aggregation.

The call below creates a new entity called category from the items entity and uses the category column as its index.

`python`


es = es.normalize_entity(base_entity_id='items',
                         new_entity_id='category',
                         index='category')
es

`output`


Entityset: example
  Entities:
    items [Rows: 5, Columns: 4]
    category [Rows: 2, Columns: 1]
  Relationships:
    items.category -> category.category

As a result, the items entity is left as it was.

`python`


es['items'].df

The category entity, on the other hand, is indexed by the specified category column, so it looks like this.

`python`


es['category'].df
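
Conceptually, the new entity is just the distinct values of the chosen column, with the original table pointing back at them. Here is a minimal plain-Python sketch of that idea; the list-of-dicts layout is an assumption for illustration, and the name column is left out for brevity.

```python
items = [
    {"item_id": 1, "category": "fruit", "price": 100},
    {"item_id": 2, "category": "vegetable", "price": 200},
    {"item_id": 3, "category": "vegetable", "price": 300},
    {"item_id": 4, "category": "fruit", "price": 4000},
    {"item_id": 5, "category": "vegetable", "price": 500},
]

# Collect each distinct category once, in order of first appearance;
# these values play the role of the index of the new parent table.
category_table = list(dict.fromkeys(row["category"] for row in items))
print(category_table)  # ['fruit', 'vegetable']
```

Two distinct values give two rows, which matches the category [Rows: 2, Columns: 1] line in the EntitySet output.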

Run dfs

Specify count, sum, and mean for agg_primitives.

`python`


feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_entity='items',
                                      trans_primitives=[],
                                      agg_primitives=['count', 'sum', 'mean'],
                                      max_depth=2)
feature_matrix

Since the only numeric column that can be aggregated is items.price, the features category.COUNT(items), category.MEAN(items.price), and category.SUM(items.price) are calculated, and the result is a DataFrame with 3 additional columns.
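
The three aggregate columns amount to a group-by on category whose results are attached back to every item. Below is a hedged plain-Python sketch of that calculation; the column names are my assumption of how the features are labeled, and the name column is omitted.

```python
from collections import defaultdict
from statistics import mean

items = [
    {"item_id": 1, "category": "fruit", "price": 100},
    {"item_id": 2, "category": "vegetable", "price": 200},
    {"item_id": 3, "category": "vegetable", "price": 300},
    {"item_id": 4, "category": "fruit", "price": 4000},
    {"item_id": 5, "category": "vegetable", "price": 500},
]

# Group prices by category (the parent entity).
prices_by_category = defaultdict(list)
for row in items:
    prices_by_category[row["category"]].append(row["price"])

# Attach each category-level aggregate back to every item in that category.
for row in items:
    prices = prices_by_category[row["category"]]
    row["category.COUNT(items)"] = len(prices)
    row["category.SUM(items.price)"] = sum(prices)
    row["category.MEAN(items.price)"] = mean(prices)

print(items[0])
```

For the two fruit items the count is 2, the sum is 4100, and the mean is 2050; for the three vegetables the count is 3 and the sum is 1000.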

4. In case of two entities

Create a DataFrame

`python`


data = {'item_id': [1, 2, 3],
        'name': ['apple', 'banana', 'cherry'],
        'price': [100, 200, 300]}
item_df = pd.DataFrame(data)
item_df

`python`


from datetime import datetime
data = {'transaction_id': [10, 20, 30, 40],
        'time': [
            datetime(2016, 1, 2, 3, 4, 5),
            datetime(2017, 2, 3, 4, 5, 6),
            datetime(2018, 3, 4, 5, 6, 7),
            datetime(2019, 4, 5, 6, 7, 8),
        ],
        'item_id': [1, 2, 3, 1],
        'amount': [1, 2, 3, 4]}
tx_df = pd.DataFrame(data)
tx_df

Create an EntitySet

Add the entities as before.

`python`


es = ft.EntitySet(id='example')
es = es.entity_from_dataframe(entity_id='items',
                              dataframe=item_df,
                              index='item_id')
es = es.entity_from_dataframe(entity_id='transactions',
                              dataframe=tx_df,
                              index='transaction_id',
                              time_index='time')
es

Create a relationship that connects the two entities. It joins the item_id column of items with the item_id column of transactions.

`python`


relationship = ft.Relationship(es['items']['item_id'], es['transactions']['item_id'])
es = es.add_relationship(relationship)
es

`output`


Entityset: example
  Entities:
    items [Rows: 3, Columns: 3]
    transactions [Rows: 4, Columns: 4]
  Relationships:
    transactions.item_id -> items.item_id

Run dfs

`python`


feature_matrix, feature_defs = ft.dfs(entityset=es,
                                      target_entity='items',
                                      trans_primitives=['add_numeric', 'subtract_numeric'],
                                      agg_primitives=['count', 'sum', 'mean'],
                                      max_depth=2)
feature_matrix

Listing only the column names gives the following. Both the aggregations and the calculations between features have been performed.

`output`


['name',
 'price',
 'COUNT(transactions)',
 'MEAN(transactions.amount)',
 'SUM(transactions.amount)',
 'COUNT(transactions) + MEAN(transactions.amount)',
 'COUNT(transactions) + SUM(transactions.amount)',
 'COUNT(transactions) + price',
 'MEAN(transactions.amount) + SUM(transactions.amount)',
 'MEAN(transactions.amount) + price',
 'price + SUM(transactions.amount)',
 'COUNT(transactions) - MEAN(transactions.amount)',
 'COUNT(transactions) - SUM(transactions.amount)',
 'COUNT(transactions) - price',
 'MEAN(transactions.amount) - SUM(transactions.amount)',
 'MEAN(transactions.amount) - price',
 'price - SUM(transactions.amount)']
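
To see where these numbers come from, the sketch below recomputes the three aggregates per item by hand, plus one of the stacked features. It is plain Python over the same values, with the time column left out because these aggregates do not use it; the dictionary layout is my own.

```python
from statistics import mean

items = {
    1: {"name": "apple", "price": 100},
    2: {"name": "banana", "price": 200},
    3: {"name": "cherry", "price": 300},
}
transactions = [
    {"transaction_id": 10, "item_id": 1, "amount": 1},
    {"transaction_id": 20, "item_id": 2, "amount": 2},
    {"transaction_id": 30, "item_id": 3, "amount": 3},
    {"transaction_id": 40, "item_id": 1, "amount": 4},
]

features = {}
for item_id, item in items.items():
    # Child rows related to this item through the item_id relationship.
    amounts = [t["amount"] for t in transactions if t["item_id"] == item_id]
    count, total, avg = len(amounts), sum(amounts), mean(amounts)
    features[item_id] = {
        "COUNT(transactions)": count,
        "SUM(transactions.amount)": total,
        "MEAN(transactions.amount)": avg,
        # One example of a transform stacked on top of an aggregate.
        "COUNT(transactions) + price": count + item["price"],
    }

print(features[1])
```

Item 1 has two transactions (amounts 1 and 4), so its count is 2, its sum is 5, its mean is 2.5, and COUNT(transactions) + price is 102.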

How max_depth works

Let's see what happens as we gradually increase max_depth in the code above. Each listing after the first shows only the columns newly added at that depth.

`max_depth=1`


['name',
 'price',
 'COUNT(transactions)',
 'MEAN(transactions.amount)',
 'SUM(transactions.amount)']

`max_depth=1 → 2 increase`


['COUNT(transactions) + MEAN(transactions.amount)',
 'COUNT(transactions) + SUM(transactions.amount)',
 'COUNT(transactions) + price',
 'MEAN(transactions.amount) + SUM(transactions.amount)',
 'MEAN(transactions.amount) + price',
 'price + SUM(transactions.amount)',
 'COUNT(transactions) - MEAN(transactions.amount)',
 'COUNT(transactions) - SUM(transactions.amount)',
 'COUNT(transactions) - price',
 'MEAN(transactions.amount) - SUM(transactions.amount)',
 'MEAN(transactions.amount) - price',
 'price - SUM(transactions.amount)']

`max_depth=2 → 3 increase`


['MEAN(transactions.amount + items.price)',
 'MEAN(transactions.amount - items.price)',
 'SUM(transactions.amount + items.price)',
 'SUM(transactions.amount - items.price)']

`max_depth=3 → 4 increase`


['COUNT(transactions) + MEAN(transactions.amount + items.price)',
 'COUNT(transactions) + MEAN(transactions.amount - items.price)',
 'COUNT(transactions) + SUM(transactions.amount + items.price)',
 'COUNT(transactions) + SUM(transactions.amount - items.price)',
 'MEAN(transactions.amount + items.price) + MEAN(transactions.amount - items.price)',
 'MEAN(transactions.amount + items.price) + MEAN(transactions.amount)',
 'MEAN(transactions.amount + items.price) + SUM(transactions.amount + items.price)',
 'MEAN(transactions.amount + items.price) + SUM(transactions.amount - items.price)',
 'MEAN(transactions.amount + items.price) + SUM(transactions.amount)',
 'MEAN(transactions.amount + items.price) + price',
 'MEAN(transactions.amount - items.price) + MEAN(transactions.amount)',
 'MEAN(transactions.amount - items.price) + SUM(transactions.amount + items.price)',
 'MEAN(transactions.amount - items.price) + SUM(transactions.amount - items.price)',
 'MEAN(transactions.amount - items.price) + SUM(transactions.amount)',
 'MEAN(transactions.amount - items.price) + price',
 'MEAN(transactions.amount) + SUM(transactions.amount + items.price)',
 'MEAN(transactions.amount) + SUM(transactions.amount - items.price)',
 'SUM(transactions.amount + items.price) + SUM(transactions.amount - items.price)',
 'SUM(transactions.amount + items.price) + SUM(transactions.amount)',
 'SUM(transactions.amount - items.price) + SUM(transactions.amount)',
 'price + SUM(transactions.amount + items.price)',
 'price + SUM(transactions.amount - items.price)',
 'COUNT(transactions) - MEAN(transactions.amount + items.price)',
 'COUNT(transactions) - MEAN(transactions.amount - items.price)',
 'COUNT(transactions) - SUM(transactions.amount + items.price)',
 'COUNT(transactions) - SUM(transactions.amount - items.price)',
 'MEAN(transactions.amount + items.price) - MEAN(transactions.amount - items.price)',
 'MEAN(transactions.amount + items.price) - MEAN(transactions.amount)',
 'MEAN(transactions.amount + items.price) - SUM(transactions.amount + items.price)',
 'MEAN(transactions.amount + items.price) - SUM(transactions.amount - items.price)',
 'MEAN(transactions.amount + items.price) - SUM(transactions.amount)',
 'MEAN(transactions.amount + items.price) - price',
 'MEAN(transactions.amount - items.price) - MEAN(transactions.amount)',
 'MEAN(transactions.amount - items.price) - SUM(transactions.amount + items.price)',
 'MEAN(transactions.amount - items.price) - SUM(transactions.amount - items.price)',
 'MEAN(transactions.amount - items.price) - SUM(transactions.amount)',
 'MEAN(transactions.amount - items.price) - price',
 'MEAN(transactions.amount) - SUM(transactions.amount + items.price)',
 'MEAN(transactions.amount) - SUM(transactions.amount - items.price)',
 'SUM(transactions.amount + items.price) - SUM(transactions.amount - items.price)',
 'SUM(transactions.amount + items.price) - SUM(transactions.amount)',
 'SUM(transactions.amount - items.price) - SUM(transactions.amount)',
 'price - SUM(transactions.amount + items.price)',
 'price - SUM(transactions.amount - items.price)']

`max_depth=4 → 5 increase`

[]

`max_depth=5 → 6 increase`

[]

It appears that dfs applies agg_primitives, trans_primitives, agg_primitives, and trans_primitives in turn and then stops producing new features.
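
The features that first appear at max_depth=3, such as MEAN(transactions.amount + items.price), stack several steps on top of each other. The sketch below traces my reading of that feature name in plain Python; it is an illustration of the stacking, not featuretools' actual implementation.

```python
from statistics import mean

prices = {1: 100, 2: 200, 3: 300}                # item_id -> price
transactions = [(1, 1), (2, 2), (3, 3), (1, 4)]  # (item_id, amount)

# Steps 1 and 2: on each transaction row, look up the parent item's price
# (items.price) and apply the transform amount + items.price.
child_values = {}
for item_id, amount in transactions:
    child_values.setdefault(item_id, []).append(amount + prices[item_id])

# Step 3: aggregate the transformed child rows back up to each item.
mean_feature = {item_id: mean(vals) for item_id, vals in child_values.items()}
sum_feature = {item_id: sum(vals) for item_id, vals in child_values.items()}

print(mean_feature)  # {1: 102.5, 2: 202, 3: 303}
print(sum_feature)   # {1: 205, 2: 202, 3: 303}
```

Applying add_numeric or subtract_numeric once more to these aggregates gives the long max_depth=4 listing, for example COUNT(transactions) + MEAN(transactions.amount + items.price).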

Custom primitives

It seems you can also define your own primitives and use them in the calculation. https://docs.featuretools.com/en/stable/getting_started/primitives.html#simple-custom-primitives

Summary

Convenient!!
