This is Miyano (@estie_mynfire), CTO of estie. On day one I wrote some slightly niche content (I tried SQL upsert with pandas), but this time I'm writing about something we do at estie: **forecasting appropriate rent for offices**.
Unlike residential rent, office asking rents are largely non-public (only about a third or fewer of properties in Tokyo's five central wards are publicly listed), and contracted rents are essentially unavailable, so **it is hard to collect ground-truth data**. Under these circumstances, when estimating real estate rent we verify model accuracy jointly with the business side, incorporating the trained eyes of office real estate professionals.
Since there is a lot of interaction with the business side, I will write about what I pay particular attention to and what we have devised.
As mentioned above, because we work together with members on the business side, the following issues arise more readily than when developing with ML engineers alone.
**Feedback from the business side and expansion of source data**

We frequently change the logic (removing outlier properties, adding features, swapping training data) and improve daily toward a more accurate model. However, as mentioned above, the accuracy of these models cannot always be evaluated with numerical metrics alone (a professional eye is also required). So even when I believe I have built a better model than before,

**"Huh, the model from two months ago gave a good value here, but now the value is strange!"**

happens all the time. ~~Most of the time I notice it when I'm in a hurry, so my heart skips a beat.~~

In such cases, if you can immediately switch back to the state of a past model, you can investigate the cause right away and quickly reflect the fix in the production data.
When verifying accuracy jointly, we are often asked to explain causes, e.g. "Why does this property get this estimated value?" If you can answer that a particular feature had a large influence on it, the discussion becomes much more meaningful.
As mentioned above, we build several new models every day, output their estimates, and have the business side check accuracy at high speed. If the output is only tabular, intuitive analysis is difficult, and a request like "show me only the values in this area" can take several rounds of back-and-forth.
We take the following measures to address these issues.
Tickets for logic changes (including hyperparameters) and training data changes are managed as GitHub issues, and a branch is cut for each issue. If the version to be released next is v1.1.0 and the corresponding issue numbers are 4 and 6, the branch is named something like `dev/v1.1.0/issue4_6`. At release time it is merged into the `v1.1.0` branch, and tags are managed as well.

All files used for training and estimation are managed in S3. A bucket for machine learning is prepared, and intermediate files are stored under the same directory structure as the branch name (`dev/v1.1.0/issue4_6`), so the code version and the data version stay in sync.
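As a minimal sketch of how this pairing can work, the snippet below derives the S3 prefix from the current git branch and uploads an intermediate file under it. The bucket name `hogehoge` matches the placeholder used later in this post, and the helper names are hypothetical, not our actual tooling.

```python
import subprocess

import boto3  # assumed: AWS credentials are already configured

BUCKET = 'hogehoge'  # placeholder bucket name, as in the paths below


def current_branch() -> str:
    '''Return the current git branch name, e.g. dev/v1.1.0/issue4_6.'''
    return subprocess.check_output(
        ['git', 'rev-parse', '--abbrev-ref', 'HEAD']).decode().strip()


def upload_intermediate(local_path: str) -> str:
    '''Upload an intermediate file under the S3 prefix matching the branch name.'''
    key = f'{current_branch()}/{local_path}'
    boto3.client('s3').upload_file(local_path, BUCKET, key)
    return f's3://{BUCKET}/{key}'


# e.g. upload_intermediate('ld.csv') -> 's3://hogehoge/dev/v1.1.0/issue4_6/ld.csv'
```

With this convention, reverting to a past model is just checking out the old branch and reading from the matching prefix.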
When asked "Why is this estimated value here?", It is possible to find out the cause by using SHAP, so add the shap_value column when estimating. It is. There are various articles about SHAP, so please refer to them.
Explanation of interpretation of machine learning model using Shap
Simply put, it tells us "how much each feature contributed to the estimated value".
```python
import shap
import pandas as pd


def calc_shap(df_, feature_list, model, rank_th=5) -> pd.DataFrame:
    '''Add a shap_value column.

    Args:
        df_ (pd.DataFrame): data, with features added, for which you want to estimate rent
        feature_list ([str]): list of feature names
        model: trained model
        rank_th (int): how many of the highest-|shap_value| features to keep. Default 5.
    Returns:
        pd.DataFrame
    '''
    df = df_.copy()
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(df[feature_list])
    shap_df = pd.DataFrame(shap_values, columns=feature_list)  # SHAP values for df[feature_list]
    # For each record, rank features by absolute SHAP value (larger = higher contribution)
    shap_rank = shap_df.abs().rank(axis=1, ascending=False, method='min')
    # For each record, list the columns whose contribution rank is within rank_th
    main_contri_col = {i: [col for col in r.keys() if r[col] <= rank_th]
                       for i, r in shap_rank.iterrows()}
    # For each record, collect those columns and their contributions
    main_contri_val = [shap_df.loc[i, main_contri_col[i]].to_dict()
                       for i in main_contri_col.keys()]
    df['shap_value'] = main_contri_val
    return df
```
The value of this `shap_value` column is a JSON-style string such as

```
{'Area': 22627, 'Age': 717, 'hoge1': -5409, 'hoge2': 2968, 'hoge3': 3791}
```

This is useful: for example, you can spot that Area contributes an unreasonable amount and, on investigation, discover that the recorded area of the estimated property was off by an order of magnitude.
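For illustration, here is a hypothetical end-to-end sketch of calling `calc_shap`. It assumes a LightGBM regressor and toy data; the real pipeline trains on property features pulled from S3.

```python
import lightgbm as lgb
import pandas as pd

# Hypothetical toy data; column names mirror the example output above
train = pd.DataFrame({
    'Area': [100, 250, 80, 400, 120, 300],
    'Age': [5, 20, 35, 10, 15, 8],
    'rent': [30000, 28000, 15000, 32000, 24000, 31000],
})
feature_list = ['Area', 'Age']

# min_child_samples=1 so the tiny toy dataset still produces splits
model = lgb.LGBMRegressor(n_estimators=10, min_child_samples=1)
model.fit(train[feature_list], train['rent'])

df = calc_shap(train, feature_list, model, rank_th=2)
print(df['shap_value'].iloc[0])  # e.g. {'Area': 1234.5, 'Age': -67.8}
```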
**Letting business-side members check accuracy independently and quickly**

We visualize not only the plain accuracy metrics and the diff against the previous logic's results, but also the training data and the estimated values for individual properties. The image below is a past sample: properties with high estimates are plotted in orange, and properties with low estimates in light blue. (Normally the training data is also plotted as black circles.)
The visualization code is below.
```python
'''Visualization of estimated rent with folium

Required:
    pandas
    folium
    matplotlib
'''
import subprocess

import pandas as pd
import folium
import matplotlib.colors as cl


def calc_RGB_value(norm_rent: float) -> str:
    '''Convert a value compressed to the 0-1 scale into RGB hex notation.

    The cheapest properties are light blue; the most expensive are orange.

    Args:
        norm_rent (float): value compressed to the 0-1 scale
    Returns:
        str
            e.g. #54b0c5
    '''
    R_val = 41 + (255 - 41) * norm_rent
    G_val = 182 + (150 - 182) * norm_rent
    B_val = 246 + (0 - 246) * norm_rent
    return cl.to_hex((R_val / 255, G_val / 255, B_val / 255, 1))


def add_color_col(df_: pd.DataFrame) -> pd.DataFrame:
    '''Add a color column.

    Args:
        df_ (pd.DataFrame): data frame containing an estimated_rent column
    Returns:
        pd.DataFrame
            the input with a 'color' column added
    '''
    df = df_.copy()
    norm = cl.Normalize(vmin=df['estimated_rent'].min(), vmax=df['estimated_rent'].max())
    norm_rent_ = [norm(v) for v in df['estimated_rent']]  # compress estimated rent to the 0-1 scale
    color_ = [calc_RGB_value(norm_rent) for norm_rent in norm_rent_]
    df['color'] = color_
    return df


class Drawer:
    def __init__(self, ld_path, ed_path):
        self.read_ld(ld_path)
        self.read_ed(ed_path)
        self.add_color_col()
        self.init_map()

    def read_ld(self, ld_path):
        '''Read the training data.'''
        self.ld = pd.read_csv(ld_path)
        assert 'answer_rent' in self.ld.columns

    def read_ed(self, ed_path):
        '''Read the data after estimated-rent calculation.'''
        self.ed = pd.read_csv(ed_path)
        assert 'estimated_rent' in self.ed.columns

    def add_color_col(self):
        self.ed = add_color_col(self.ed)
        self.ld['color'] = '#262626'  # black

    def init_map(self):
        '''Initialize the map.'''
        self.map = folium.Map(
            location=[self.ed.latitude.mean(), self.ed.longitude.mean()],
            zoom_start=6, tiles='cartodbpositron')

    def add_ld_plot(self, size=15):
        '''Plot the training data.

        size (int): radius of the plotted circle. Default 15.
        '''
        for i, row in self.ld.iterrows():
            name = row['Property name'] if 'Property name' in row.keys() else ''
            folium.Circle(
                radius=size, location=[row['latitude'], row['longitude']],
                popup='Property name: %s' % name
                      + '<br/>Correct rent: {:,.0f} yen/tsubo'.format(row['answer_rent']),
                color=row['color'], fill_color=row['color']).add_to(self.map)

    def add_ed_plot(self, size=5):
        '''Plot the estimated rents.

        size (int): radius of the plotted circle. Default 5.
        '''
        for i, row in self.ed.iterrows():
            name = row['Property name'] if 'Property name' in row.keys() else ''
            folium.Circle(
                radius=size, location=[row['latitude'], row['longitude']],
                popup='Property name: %s' % name
                      + '<br/>Estimated rent: {:,.0f} yen/tsubo'.format(row['estimated_rent']),
                color=row['color'], fill_color=row['color']).add_to(self.map)


if __name__ == '__main__':
    drawer = Drawer(
        ld_path='s3://hogehoge/dev/v1.1.0/issue4_6/ld.csv',
        ed_path='s3://hogehoge/dev/v1.1.0/issue4_6/ed.csv')
    drawer.add_ld_plot()
    drawer.add_ed_plot()
    drawer.map.save('map.html')
    subprocess.call(
        ['aws', 's3', 'mv', 'map.html', 's3://hogehoge/dev/v1.1.0/issue4_6/map.html'])
```
Although the approach is still primitive, we keep the data and code versions in sync with the method above. Going forward I am thinking of introducing MLflow to make this easier to manage, and I will write a sequel as soon as we have introduced it.
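As a rough idea of what that could look like, here is a sketch under the assumption that each issue branch maps to one MLflow run; none of this is in place yet, and the experiment name and logged values are hypothetical.

```python
import mlflow

# Hypothetical: one run per issue branch, mirroring the dev/v1.1.0/issue4_6 naming
mlflow.set_experiment('rent-estimation')
with mlflow.start_run(run_name='dev/v1.1.0/issue4_6'):
    mlflow.log_param('rank_th', 5)     # hyperparameters currently tracked via branches
    mlflow.log_metric('mae', 1234.5)   # hypothetical accuracy metric
    mlflow.log_artifact('map.html')    # the folium map produced above
```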
At estie, we are always looking for engineers who are enthusiastic about new technologies, and for full-stack engineers! https://www.wantedly.com/companies/company_6314859/projects
estie → https://www.estie.jp
estie pro → https://pro.estie.jp
Company site → https://www.estie.co.jp