This is Miyano (@estie_mynfire), CTO of estie. On day one I wrote some slightly niche content (I tried SQL upsert with pandas), but this time I'm writing about something we do at estie: **forecasting appropriate rent for offices**.
Unlike residential rent, office asking rents are largely non-public (only about a third or fewer of properties in Tokyo's five central wards are publicly listed), and contracted rents are essentially unavailable, so **it is hard to collect ground-truth data**. Under these circumstances, when estimating real estate rent we verify model accuracy jointly with the business side, incorporating the trained eyes of office real estate professionals.
Since there is a lot of interaction with the business side, I will write about what I pay particular attention to and what we have devised.
As mentioned above, because we work together with members on the business side, the following issues arise more readily than when developing with ML engineers alone.
**Feedback from the business side and expansion of source data**

We frequently change the logic (removing outlier properties, adding features, swapping training data) and improve daily toward a more accurate model. However, as mentioned above, the accuracy of these models cannot always be evaluated with numerical metrics alone (a professional eye is also required). So even when I believe I have built a better model than before,

**"Huh, the model from two months ago gave a good value here, but now the value is strange!"**

happens all the time. ~~Most of the time I notice it when I'm in a hurry, so my heart skips a beat.~~

In such cases, if you can immediately switch back to the state of a past model, you can investigate the cause right away and quickly reflect the fix in the production data.
When verifying accuracy jointly, we are often asked to explain causes, e.g. "Why does this property get this estimated value?" If you can answer that a particular feature had a large influence on it, the discussion becomes much more meaningful.
As mentioned above, we build several new models every day, output their estimates, and have the business side check accuracy at high speed. If the output is only tabular, intuitive analysis is difficult, and a request like "show me only the values in this area" can take several rounds of back-and-forth.
We take the following measures to address these issues.
Tickets for logic changes (including hyperparameters) and training data changes are managed as GitHub issues, and a branch is cut for each issue. If the version to be released next is v1.1.0 and the corresponding issue numbers are 4 and 6, the branch is named something like `dev/v1.1.0/issue4_6`. At release time it is merged into the `v1.1.0` branch, and tags are managed as well.

All files used for training and estimation are managed in S3. A bucket for machine learning is prepared, and intermediate files are stored under the same directory structure as the branch name (`dev/v1.1.0/issue4_6`), so the code version and the data version stay in sync.
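As a minimal sketch of how this pairing can work, the snippet below derives the S3 prefix from the current git branch and uploads an intermediate file under it. The bucket name `hogehoge` matches the placeholder used later in this post, and the helper names are hypothetical, not our actual tooling.

```python
import subprocess

import boto3  # assumed: AWS credentials are already configured

BUCKET = 'hogehoge'  # placeholder bucket name, as in the paths below


def current_branch() -> str:
    '''Return the current git branch name, e.g. dev/v1.1.0/issue4_6.'''
    return subprocess.check_output(
        ['git', 'rev-parse', '--abbrev-ref', 'HEAD']).decode().strip()


def upload_intermediate(local_path: str) -> str:
    '''Upload an intermediate file under the S3 prefix matching the branch name.'''
    key = f'{current_branch()}/{local_path}'
    boto3.client('s3').upload_file(local_path, BUCKET, key)
    return f's3://{BUCKET}/{key}'


# e.g. upload_intermediate('ld.csv') -> 's3://hogehoge/dev/v1.1.0/issue4_6/ld.csv'
```

With this convention, reverting to a past model is just checking out the old branch and reading from the matching prefix.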
When asked "Why is this estimated value here?", It is possible to find out the cause by using SHAP, so add the shap_value column when estimating. It is. There are various articles about SHAP, so please refer to them.
Explanation of interpretation of machine learning model using Shap
Simply put, it tells us "how much each feature contributed to the estimated value".
```python
import shap
import pandas as pd


def calc_shap(df_, feature_list, model, rank_th=5) -> pd.DataFrame:
    '''Add a shap_value column.

    Args:
        df_ (pd.DataFrame): data, with features added, for which you want to estimate rent
        feature_list ([str]): list of feature names
        model: trained model
        rank_th (int): how many of the highest-|shap_value| features to keep. Default 5.
    Returns:
        pd.DataFrame
    '''
    df = df_.copy()
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(df[feature_list])
    shap_df = pd.DataFrame(shap_values, columns=feature_list)  # SHAP values for df[feature_list]
    # For each record, rank features by absolute SHAP value (larger = higher contribution)
    shap_rank = shap_df.abs().rank(axis=1, ascending=False, method='min')
    # For each record, list the columns whose contribution rank is within rank_th
    main_contri_col = {i: [col for col in r.keys() if r[col] <= rank_th]
                       for i, r in shap_rank.iterrows()}
    # For each record, collect those columns and their contributions
    main_contri_val = [shap_df.loc[i, main_contri_col[i]].to_dict()
                       for i in main_contri_col.keys()]
    df['shap_value'] = main_contri_val
    return df
```
The value of this `shap_value` column is a JSON-style string such as

```
{'Area': 22627, 'Age': 717, 'hoge1': -5409, 'hoge2': 2968, 'hoge3': 3791}
```

This is useful: for example, you can spot that Area contributes an unreasonable amount and, on investigation, discover that the recorded area of the estimated property was off by an order of magnitude.
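For illustration, here is a hypothetical end-to-end sketch of calling `calc_shap`. It assumes a LightGBM regressor and toy data; the real pipeline trains on property features pulled from S3.

```python
import lightgbm as lgb
import pandas as pd

# Hypothetical toy data; column names mirror the example output above
train = pd.DataFrame({
    'Area': [100, 250, 80, 400, 120, 300],
    'Age': [5, 20, 35, 10, 15, 8],
    'rent': [30000, 28000, 15000, 32000, 24000, 31000],
})
feature_list = ['Area', 'Age']

# min_child_samples=1 so the tiny toy dataset still produces splits
model = lgb.LGBMRegressor(n_estimators=10, min_child_samples=1)
model.fit(train[feature_list], train['rent'])

df = calc_shap(train, feature_list, model, rank_th=2)
print(df['shap_value'].iloc[0])  # e.g. {'Area': 1234.5, 'Age': -67.8}
```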
**Letting business-side members check accuracy independently and quickly**

We visualize not only the plain accuracy metrics and the diff against the previous logic's results, but also the training data and the estimated values for individual properties. The image below is a past sample: properties with high estimates are plotted in orange, and properties with low estimates in light blue. (Normally the training data is also plotted as black circles.)
The visualization code is below.
```python
'''Visualization of estimated rent with folium

Required:
    pandas
    folium
    matplotlib
'''
import subprocess

import pandas as pd
import folium
import matplotlib.colors as cl


def calc_RGB_value(norm_rent: float) -> str:
    '''Convert a value compressed to the 0-1 scale into RGB hex notation.

    The cheapest properties are light blue; the most expensive are orange.

    Args:
        norm_rent (float): value compressed to the 0-1 scale
    Returns:
        str
            e.g. #54b0c5
    '''
    R_val = 41 + (255 - 41) * norm_rent
    G_val = 182 + (150 - 182) * norm_rent
    B_val = 246 + (0 - 246) * norm_rent
    return cl.to_hex((R_val / 255, G_val / 255, B_val / 255, 1))


def add_color_col(df_: pd.DataFrame) -> pd.DataFrame:
    '''Add a color column.

    Args:
        df_ (pd.DataFrame): data frame containing an estimated_rent column
    Returns:
        pd.DataFrame
            the input with a 'color' column added
    '''
    df = df_.copy()
    norm = cl.Normalize(vmin=df['estimated_rent'].min(), vmax=df['estimated_rent'].max())
    norm_rent_ = [norm(v) for v in df['estimated_rent']]  # compress estimated rent to the 0-1 scale
    color_ = [calc_RGB_value(norm_rent) for norm_rent in norm_rent_]
    df['color'] = color_
    return df


class Drawer:
    def __init__(self, ld_path, ed_path):
        self.read_ld(ld_path)
        self.read_ed(ed_path)
        self.add_color_col()
        self.init_map()

    def read_ld(self, ld_path):
        '''Read the training data.'''
        self.ld = pd.read_csv(ld_path)
        assert 'answer_rent' in self.ld.columns

    def read_ed(self, ed_path):
        '''Read the data after estimated-rent calculation.'''
        self.ed = pd.read_csv(ed_path)
        assert 'estimated_rent' in self.ed.columns

    def add_color_col(self):
        self.ed = add_color_col(self.ed)
        self.ld['color'] = '#262626'  # black

    def init_map(self):
        '''Initialize the map.'''
        self.map = folium.Map(
            location=[self.ed.latitude.mean(), self.ed.longitude.mean()],
            zoom_start=6, tiles='cartodbpositron')

    def add_ld_plot(self, size=15):
        '''Plot the training data.

        size (int): radius of the plotted circle. Default 15.
        '''
        for i, row in self.ld.iterrows():
            name = row['Property name'] if 'Property name' in row.keys() else ''
            folium.Circle(
                radius=size, location=[row['latitude'], row['longitude']],
                popup='Property name: %s' % name
                      + '<br/>Correct rent: {:,.0f} yen/tsubo'.format(row['answer_rent']),
                color=row['color'], fill_color=row['color']).add_to(self.map)

    def add_ed_plot(self, size=5):
        '''Plot the estimated rents.

        size (int): radius of the plotted circle. Default 5.
        '''
        for i, row in self.ed.iterrows():
            name = row['Property name'] if 'Property name' in row.keys() else ''
            folium.Circle(
                radius=size, location=[row['latitude'], row['longitude']],
                popup='Property name: %s' % name
                      + '<br/>Estimated rent: {:,.0f} yen/tsubo'.format(row['estimated_rent']),
                color=row['color'], fill_color=row['color']).add_to(self.map)


if __name__ == '__main__':
    drawer = Drawer(
        ld_path='s3://hogehoge/dev/v1.1.0/issue4_6/ld.csv',
        ed_path='s3://hogehoge/dev/v1.1.0/issue4_6/ed.csv')
    drawer.add_ld_plot()
    drawer.add_ed_plot()
    drawer.map.save('map.html')
    subprocess.call(
        ['aws', 's3', 'mv', 'map.html', 's3://hogehoge/dev/v1.1.0/issue4_6/map.html'])
```
Although the approach is still primitive, we keep the data and code versions in sync with the method above. Going forward I am thinking of introducing MLflow to make this easier to manage, and I will write a sequel as soon as we have introduced it.
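As a rough idea of what that could look like, here is a sketch under the assumption that each issue branch maps to one MLflow run; none of this is in place yet, and the experiment name and logged values are hypothetical.

```python
import mlflow

# Hypothetical: one run per issue branch, mirroring the dev/v1.1.0/issue4_6 naming
mlflow.set_experiment('rent-estimation')
with mlflow.start_run(run_name='dev/v1.1.0/issue4_6'):
    mlflow.log_param('rank_th', 5)     # hyperparameters currently tracked via branches
    mlflow.log_metric('mae', 1234.5)   # hypothetical accuracy metric
    mlflow.log_artifact('map.html')    # the folium map produced above
```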
At estie, we are always looking for engineers who are enthusiastic about new technologies, and for full-stack engineers! https://www.wantedly.com/companies/company_6314859/projects
estie → https://www.estie.jp
estie pro → https://pro.estie.jp
Company site → https://www.estie.co.jp