Amsterdam, the capital of the Netherlands, is a famous tourist destination with a beautiful cityscape. Its many canals are characteristic of European cities, and it is so popular that the sheer number of tourists has become a problem.
Airbnb is a well-known private lodging service. The name comes from "Air Bed and Breakfast". The service is said to have started when Brian Chesky rented out space in his loft from time to time. It is now a mainstream form of accommodation overseas.
The goal is to understand the Airbnb situation that tourists face when they look for a place to stay in Amsterdam. I analyzed Airbnb accommodation data for Amsterdam to find out what its characteristics are and which variables affect the price.
[Inside Airbnb - Adding data to the debate](http://insideairbnb.com/get-the-data.html) Inside Airbnb is a site that publishes actual Airbnb data. The data is well organized and provided in CSV format, so even a beginner like me can analyze it easily.
https://towardsdatascience.com/exploring-machine-learning-for-airbnb-listings-in-toronto-efdbdeba2644 https://note.com/ryohei55/n/n56f723bc3f90
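# Libraries used throughout the analysis
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

calendar = pd.read_csv('calendar.csv')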
print(calendar.date.nunique(), 'days', calendar.listing_id.nunique(), 'unique listings')
366 days 20025 unique listings

The data is stated to run from 2020-12-08 to 2020-12-06, yet the count comes out at 366 days, which does not quite match. The data itself looks fine as far as I can tell, so I will proceed. There are 20,025 listings, and I am grateful for such a large amount of data.
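One possible explanation for a count of 366, offered as a guess and not checked against the file: counting unique dates includes both endpoints of the range, and a year-long window that contains 29 February 2020 is one day longer than usual. A minimal sketch with a hypothetical date range (not the dataset's actual dates):

```python
from datetime import date

def inclusive_days(start: date, end: date) -> int:
    """Number of calendar days from start to end, counting both endpoints."""
    return (end - start).days + 1

# Hypothetical range chosen for illustration; not the dataset's actual dates.
start, end = date(2020, 1, 1), date(2020, 12, 31)
print(inclusive_days(start, end))  # 366, because 2020 is a leap year
```

Comparing this number with `calendar.date.nunique()` for the real minimum and maximum dates would show whether any days are missing or duplicated.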
calendar.head(5)
I plotted, in chronological order, what share of Airbnb listings is already booked and what share is still available.
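calendar_new = calendar[['date', 'available']].copy()  # copy to avoid SettingWithCopyWarning
# 't' means the night is still available, so everything else counts as busy
calendar_new['busy'] = calendar_new.available.map(lambda x: 0 if x == 't' else 1)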
calendar_new = calendar_new.groupby('date')['busy'].mean().reset_index()
calendar_new['date'] = pd.to_datetime(calendar_new['date'])
plt.figure(figsize=(10, 5))
plt.plot(calendar_new['date'], calendar_new['busy'] * 100)  # scale the 0-1 share to a percentage to match the axis label
plt.title('Airbnb Amsterdam Calendar')
plt.ylabel('Busy %')
plt.show()
**Consideration** With an overall occupancy rate of over 80%, Airbnb in Amsterdam appears to be busy all year round. It gets especially crowded around the New Year, possibly because of tourists who come to see the New Year's fireworks.
The busy rate rises after March, and a similar sudden rise is seen in June. However, these rises may simply reflect hosts who have not yet opened their calendars for dates far in the future because their own schedules are still undecided.
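The busy rate plotted above is the share of listings per date whose `available` flag is not `'t'`. A dependency-free sketch of that calculation on made-up rows (the rows and the helper name `busy_rate_by_date` are illustrative assumptions):

```python
from collections import defaultdict

# Toy rows standing in for calendar.csv: (date, available), where 't' means bookable.
rows = [
    ("2020-01-01", "f"), ("2020-01-01", "f"), ("2020-01-01", "t"), ("2020-01-01", "f"),
    ("2020-01-02", "t"), ("2020-01-02", "f"),
]

def busy_rate_by_date(rows):
    """Return {date: share of listings that are not available on that date}."""
    flags = defaultdict(list)
    for day, available in rows:
        flags[day].append(0 if available == "t" else 1)
    return {day: sum(v) / len(v) for day, v in flags.items()}

rates = busy_rate_by_date(rows)
print(rates)  # {'2020-01-01': 0.75, '2020-01-02': 0.5}
```

Because an unavailable night can be either booked or blocked by the host, this rate is probably best read as an upper bound on true occupancy.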
calendar['date'] = pd.to_datetime(calendar['date'])
# Strip the currency symbol and thousands separator; regex=False treats '$' as a literal character
calendar['price'] = calendar['price'].str.replace('$', '', regex=False)
calendar['price'] = calendar['price'].str.replace(',', '', regex=False)
calendar['price'] = calendar['price'].astype(float)
mean_of_month = calendar.groupby(calendar['date'].dt.strftime('%B'), sort=False)['price'].mean()
mean_of_month.plot(kind = 'barh', figsize=(12, 7))
plt.xlabel('Average Monthly Price')
**Consideration** The average Airbnb price in Amsterdam over the year is around 160 euros (about ¥18,000) per night. If I had to pick, January and February seem slightly cheaper.
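The monthly averages depend on first turning price strings into numbers. A small plain-Python sketch of that cleaning step, using made-up sample values and a hypothetical helper `parse_price`:

```python
def parse_price(text: str) -> float:
    """Convert a price string such as '$1,250.00' to a float."""
    return float(text.replace("$", "").replace(",", ""))

samples = ["$160.00", "$1,250.00", "$85.50"]  # made-up values
prices = [parse_price(s) for s in samples]
print(prices)  # [160.0, 1250.0, 85.5]
```

In pandas, the default of the `regex` argument of `Series.str.replace` has changed between versions, so passing `regex=False` explicitly when removing `'$'` avoids it being read as an end-of-string anchor.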
calendar['dayofweek'] = calendar.date.dt.day_name()  # weekday_name is no longer available in newer pandas
cats = ['Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday']
# Average price per weekday, ordered Monday to Sunday
price_week = calendar.groupby('dayofweek')['price'].mean().reindex(cats)
price_week.plot(grid=True)
ticks = list(range(0, 7, 1))
labels = "Mon Tues Weds Thurs Fri Sat Sun".split()
plt.xticks(ticks, labels)
![download (14).png](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/505543/155809c6-0f06-2623-2d6e-78405d07ab30.png)
**Consideration** On average, prices stay below €170 from Monday to Thursday, but they rise sharply for Friday and Saturday nights. Demand for Airbnb is probably concentrated on weekends because people travel on Fridays and Saturdays, when schools and workplaces are closed.
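The weekday comparison comes down to grouping nightly prices by day of the week and averaging each group. A minimal plain-Python sketch on made-up rows (the data and the helper `mean_price_by_weekday` are assumptions for illustration):

```python
from collections import defaultdict
from datetime import date

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday", "Sunday"]

# Made-up (date, nightly price) pairs; 2020-01-06 is a Monday.
rows = [
    (date(2020, 1, 6), 160.0), (date(2020, 1, 7), 162.0),
    (date(2020, 1, 10), 175.0), (date(2020, 1, 10), 185.0),
    (date(2020, 1, 11), 190.0),
]

def mean_price_by_weekday(rows):
    """Average price per weekday, returned in Monday-to-Sunday order."""
    buckets = defaultdict(list)
    for day, price in rows:
        buckets[DAY_NAMES[day.weekday()]].append(price)
    return {name: sum(buckets[name]) / len(buckets[name])
            for name in DAY_NAMES if name in buckets}

weekday_means = mean_price_by_weekday(rows)
print(weekday_means)
```

Indexing a fixed list with `date.weekday()` keeps the day names independent of the system locale.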
The listings data contains information about each accommodation.
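listings = pd.read_csv('listings.csv')
# Assumption: price may arrive as a string such as "$1,000.00"; convert it so the numeric filters below work
if not pd.api.types.is_numeric_dtype(listings['price']):
    listings['price'] = listings['price'].replace(r'[\$,]', '', regex=True).astype(float)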
print('We have', listings.id.nunique(), 'listings in the listing data.')
listings.head(5)
It looks like this.
listings.groupby(by = 'neighbourhood_cleansed').count()[['id']].sort_values(by='id', ascending=False).head(10)
listings.loc[(listings.price <= 1000) & (listings.price > 0)].price.hist(bins=200)
plt.ylabel('Count')
plt.xlabel('Listing price in EUR')
plt.title('Histogram of listing prices')
The price distribution is like this.
select_neighbourhood_over_100 = listings.loc[(listings.price <= 1000) & (listings.price > 0)].groupby('neighbourhood_cleansed')\
.filter(lambda x: len(x)>=100)["neighbourhood_cleansed"].values
listings_neighbourhood_over_100 = listings.loc[listings['neighbourhood_cleansed'].isin(select_neighbourhood_over_100)]
sort_price = listings_neighbourhood_over_100.loc[(listings_neighbourhood_over_100.price <= 1000) & (listings_neighbourhood_over_100.price > 0)]\
.groupby('neighbourhood_cleansed')['price'].median().sort_values(ascending=False).index
sns.boxplot(y='price', x='neighbourhood_cleansed', data=listings_neighbourhood_over_100.loc[(listings_neighbourhood_over_100.price <= 1000) & (listings_neighbourhood_over_100.price > 0)],
order=sort_price)
ax = plt.gca()
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')
plt.show()
**Consideration** As Centrum-West and Centrum-Oost show, prices near the central station are quite high. The cheapest price range is found in areas like Bijlmer, which is about 30 minutes away by tram. Broadly speaking, the price of an Airbnb seems to be determined by its distance from the central station.
![xxamsterdam-train-stations-map.jpg.pagespeed.ic.POsCpucKFr.jpg](https://qiita-image-store.s3.ap-northeast-1.amazonaws.com/0/505543/2a9486e8-df14-64f2-c508-7e172c11656b.jpeg)
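The neighbourhood comparison keeps only groups with enough listings and then orders them by median price. A minimal plain-Python sketch of that logic; the neighbourhood names, prices, and the helper `rank_groups_by_median` are made up for illustration:

```python
from collections import defaultdict
from statistics import median

# Made-up (neighbourhood, price) pairs; the names are placeholders.
rows = [
    ("Centre", 200.0), ("Centre", 180.0), ("Centre", 240.0),
    ("Suburb", 90.0), ("Suburb", 110.0), ("Suburb", 100.0),
    ("Tiny", 500.0),  # only one listing, so it is filtered out below
]

def rank_groups_by_median(rows, min_count):
    """Keep groups with at least min_count rows and sort them by median price, highest first."""
    groups = defaultdict(list)
    for name, price in rows:
        if 0 < price <= 1000:  # same price window as in the article
            groups[name].append(price)
    medians = {name: median(p) for name, p in groups.items() if len(p) >= min_count}
    return sorted(medians.items(), key=lambda item: item[1], reverse=True)

ranking = rank_groups_by_median(rows, min_count=3)
print(ranking)  # [('Centre', 200.0), ('Suburb', 100.0)]
```

Using the median instead of the mean keeps a handful of very expensive listings from dominating the ordering.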
select_property_over_100 = listings.loc[(listings.price <= 1000) & (listings.price > 0)].groupby('property_type')\
.filter(lambda x:len(x) >=20)["property_type"].values
listings_property_over_100 = listings.loc[listings['property_type'].isin(select_property_over_100)]
sort_price = listings_property_over_100.loc[(listings_property_over_100.price <= 1000) & (listings_property_over_100.price >0)]\
.groupby('property_type')['price'].median().sort_values(ascending=False).index
sns.boxplot(y='price', x ='property_type', data=listings_property_over_100.loc[(listings_property_over_100.price <= 1000) & (listings_property_over_100.price >0)],
order = sort_price)
ax = plt.gca()
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right')
plt.show()
**Consideration** First of all, a boxplot shows the spread of the data: the center line marks the median, the lower edge of the box is the first quartile, and the upper edge is the third quartile. A hostel is a cheap type of accommodation that is common in Europe. Even so, quite a few listings classified as hostels in Amsterdam cost close to 1000 EUR. Keep in mind that listings above 1000 EUR are excluded as outliers this time.
The Hotel data also varies widely, probably because some hotels are upscale. However, since the median is about 180 EUR, Airbnb listings classified as Hotel seem to be basically an inexpensive category.
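To make the boxplot reading concrete, here is a small sketch that computes the quartiles and the usual 1.5 × IQR outlier fences for made-up prices. The numbers are illustrative, and the quartile interpolation method is an assumption that may differ slightly from what the plotting library uses:

```python
from statistics import quantiles

prices = [50, 80, 100, 120, 150, 180, 200, 250, 900]  # made-up nightly prices

# method="inclusive" interpolates linearly between data points
q1, med, q3 = quantiles(prices, n=4, method="inclusive")
iqr = q3 - q1
lower_fence, upper_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [p for p in prices if p < lower_fence or p > upper_fence]

print(q1, med, q3)   # 100.0 150.0 200.0
print(outliers)      # [900]
```

Points beyond the fences are the ones a boxplot typically draws as individual markers above or below the whiskers.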
listings.loc[(listings.price <= 1000) & (listings.price > 0)].pivot(columns='room_type', values='price').plot.hist(stacked=True, bins=100)
plt.xlabel('Listing Price in EUR')
**Consideration** The first thing you notice is that there are few Shared rooms and Hotel rooms. Hosts can rent out an entire house / apartment or only a room, and most listings appear to be entire houses / apartments. If you want to keep costs down, it seems more efficient to narrow your search to private rooms. Naturally, renting an entire house / apartment costs more than renting only a room.
pd.Series(np.concatenate(listings['amenities'].map(lambda amns: amns.split(",")))).value_counts().head(20).plot(kind='bar')
ax = plt.gca()
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right', fontsize=12)
plt.show()
**Consideration** Wifi is very common. Winters in the Netherlands are cold, so most listings are equipped with Heating. Many places do not list amenities such as irons, shampoo, and hair dryers, so it is worth checking in advance.
Family/kid friendly is a little ... but I would rather not be disturbed when staying at an Airbnb, so I consider this an advantage. You can also see that free parking is not near the top, so be careful if you come by car.
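The amenity counts depend on how the raw `amenities` strings are split. The sketch below assumes a brace-wrapped, comma-separated format such as `{Wifi,Heating,"Hair dryer"}`; that format is my assumption about the file and is not shown in the article, and `split_amenities` is a hypothetical helper:

```python
from collections import Counter

# Assumed raw format: a brace-wrapped, comma-separated string with optional quotes.
raw_amenities = [
    '{Wifi,Heating,"Hair dryer"}',
    '{Wifi,Kitchen}',
    '{}',
]

def split_amenities(raw: str) -> list:
    """Strip braces and quotes, split on commas, and drop empty entries."""
    cleaned = raw.strip("{}").replace('"', "")
    return [item.strip() for item in cleaned.split(",") if item.strip()]

counts = Counter(a for raw in raw_amenities for a in split_amenities(raw))
print(counts.most_common(1))  # [('Wifi', 2)]
```

If the braces and quotes are not removed first, tokens such as `{Wifi` and `Wifi` would be counted as different amenities.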
amenities = np.unique(np.concatenate(listings['amenities'].map(lambda amns: amns.split(","))))
amenity_prices = [(amn, listings[listings['amenities'].map(lambda amns: amn in amns)]['price'].mean()) for amn in amenities if amn != ""]
amenity_srs = pd.Series(data=[a[1] for a in amenity_prices], index=[a[0] for a in amenity_prices])
amenity_srs.sort_values(ascending=False)[:20].plot(kind='bar')
ax = plt.gca()
ax.set_xticklabels(ax.get_xticklabels(), rotation=45, ha='right', fontsize=12)
plt.show()
**Consideration** I am not sure why Washer / Dryer is the amenity most associated with price. What does feel typical of Amsterdam is Suitable for events. Many events are held in Amsterdam, and rooms that are well located for attending them seem to be expensive. Apart from those two, the average prices are almost uniform.
listings.loc[(listings.price <= 1000)&(listings.price > 0)].pivot(columns = 'beds', values='price').plot.hist(stacked=True, bins=100)
plt.xlabel('Listing price in EUR')
**Consideration** Most listings have one or two beds. Is this the result you imagined? By the way, one listing has 32 beds, which surprised me, so I looked it up. https://www.airbnb.jp/rooms/779175?source_impression_id=p3_1577402659_vntGlW7Yj5I5pX4U It turned out to be a ferry with 32 beds. I was surprised.
col = ['host_listings_count', 'accommodates', 'bedrooms', 'price', 'number_of_reviews', 'review_scores_rating']
corr = listings.loc[(listings.price<=1000)&(listings.price > 0)][col].dropna().corr()
plt.figure(figsize=(6,6))
sns.set(font_scale=1)
sns.heatmap(corr, cbar=True, annot=True, square=True, fmt='.2f', xticklabels=col, yticklabels=col)
plt.show()
**Consideration** This heat map uses color to make the correlations among the listings variables easy to see. In this case, most pairs show little or no correlation. The exception is a strong correlation between bedrooms and accommodates. Since these are the number of bedrooms and the number of guests who can stay, the correlation is understandable. However, because hosts set the guest capacity largely according to the number of bedrooms, I regard it less as an informative finding and more as a relationship that holds almost by construction.
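Each cell of the heat map is a Pearson correlation coefficient. A minimal plain-Python sketch of how one such value is computed, using made-up numbers in which capacity is exactly twice the number of bedrooms:

```python
from math import sqrt, isclose

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

bedrooms = [1, 1, 2, 2, 3, 4]       # made-up values
accommodates = [2, 2, 4, 4, 6, 8]   # exactly two guests per bedroom here
r = pearson(bedrooms, accommodates)
print(round(r, 3))  # 1.0
```

When one variable is set as a fixed multiple of the other, the coefficient reaches 1 regardless of anything else about the listing, which is why a high value here says little on its own.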
The following is data preparation. Categorical data is converted into dummy variables.
from sklearn.feature_extraction.text import CountVectorizer
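# token_pattern is unused with a custom tokenizer, so set it to None
count_vectorizer = CountVectorizer(tokenizer=lambda x: x.split(','), token_pattern=None)
amenities = count_vectorizer.fit_transform(listings['amenities'])
# get_feature_names_out() replaces get_feature_names() in newer scikit-learn
df_amenities = pd.DataFrame(amenities.toarray(), columns=count_vectorizer.get_feature_names_out())
df_amenities = df_amenities.drop(columns='', errors='ignore')  # drop the empty-string token if present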
columns = ['host_is_superhost', 'host_identity_verified', 'host_has_profile_pic', 'is_location_exact', 'requires_license', 'instant_bookable', 'require_guest_profile_picture', 'require_guest_phone_verification']
for c in columns:
    # Map the 't'/'f' flags to 1/0
    listings[c] = listings[c].map({'t': 1, 'f': 0})
listings['security_deposit'] = listings['security_deposit'].fillna(value=0)
listings['security_deposit'] = listings['security_deposit'].replace(r'[\$,]', '', regex=True).astype(float)
listings['cleaning_fee'] = listings['cleaning_fee'].fillna(value=0)
listings['cleaning_fee'] = listings['cleaning_fee'].replace(r'[\$,]', '', regex=True).astype(float)
listings_new = listings[['host_is_superhost', 'host_identity_verified', 'host_has_profile_pic', 'is_location_exact',
                         'requires_license', 'instant_bookable', 'require_guest_profile_picture',
                         'require_guest_phone_verification', 'security_deposit', 'cleaning_fee',
                         'host_listings_count', 'host_total_listings_count', 'minimum_nights',
                         'bathrooms', 'bedrooms', 'guests_included', 'number_of_reviews', 'review_scores_rating', 'price']].copy()
# Fill remaining missing values with each column's median
for col in listings_new.columns[listings_new.isnull().any()]:
    listings_new[col] = listings_new[col].fillna(listings_new[col].median())
# One-hot encode the categorical features
for cat_feature in ['zipcode', 'property_type', 'room_type', 'cancellation_policy', 'neighbourhood_cleansed', 'bed_type']:
    listings_new = pd.concat([listings_new, pd.get_dummies(listings[cat_feature])], axis=1)
listings_new = pd.concat([listings_new, df_amenities], axis=1, join='inner')
listings_new.columns = listings_new.columns.astype(str)  # newer scikit-learn expects string column names
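The dummy-variable step above turns each category into its own 0/1 column. A minimal plain-Python sketch of the idea, using a few sample values and a hypothetical helper `one_hot` (the article itself uses `pd.get_dummies`):

```python
def one_hot(values):
    """Return (sorted category names, list of 0/1 rows), one column per category."""
    categories = sorted(set(values))
    rows = [[1 if v == c else 0 for c in categories] for v in values]
    return categories, rows

room_types = ["Private room", "Entire home/apt", "Private room", "Shared room"]  # sample values
categories, encoded = one_hot(room_types)
print(categories)  # ['Entire home/apt', 'Private room', 'Shared room']
print(encoded)     # [[0, 1, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]]
```

A column with many distinct values, such as zipcode, produces one new feature per value, so the feature matrix can become very wide.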
We will use RandomForestRegressor.
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
y = listings_new['price']
x = listings_new.drop('price', axis=1)
X_train, X_test, y_train, y_test = train_test_split(x, y, test_size=0.25, random_state=123)
rf = RandomForestRegressor(n_estimators=500, random_state=123, n_jobs=-1)
rf.fit(X_train, y_train)
y_train_pred = rf.predict(X_train)
y_test_pred = rf.predict(X_test)
rmse_rf = (mean_squared_error(y_test, y_test_pred))**(1/2)
print('RMSE test: %.3f' % rmse_rf)
print('R^2 test: %.3f' % (r2_score(y_test, y_test_pred)))
RMSE test: 73.245
R^2 test: 0.479

The result looks like this. An R^2 of 0.479 on the test set means the model explains a little under half of the variance in price, so the accuracy is only moderate. For now, let's look at which features the random forest judged to be important.
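For reference, the two metrics can be written out directly. This is a plain-Python sketch on made-up numbers, not the scikit-learn implementation:

```python
from math import sqrt

def rmse(y_true, y_pred):
    """Root mean squared error: typical size of a prediction error, in price units."""
    return sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [100.0, 150.0, 200.0, 250.0]  # made-up prices
y_pred = [110.0, 140.0, 210.0, 240.0]
print(rmse(y_true, y_pred))  # 10.0
print(r2(y_true, y_pred))
```

Read this way, an RMSE of about 73 means predictions are typically off by roughly 73 in price units, and an R^2 of 0.479 means about 48% of the price variance is explained.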
coefs_df = pd.DataFrame()
coefs_df['est_int'] = X_train.columns
coefs_df['coefs'] = rf.feature_importances_
coefs_df.sort_values('coefs', ascending=False).head(20)
**Consideration** You can see that the number of bedrooms has a significant effect on the price. Also, on Airbnb a cleaning fee is charged separately from the room rate, and you can see that the cleaning fee is also related to the price. This seems to be a fairly direct effect.
from sklearn.linear_model import Lasso
lasso = Lasso()
lasso.fit(X_train, y_train)
# Regression coefficients
print(lasso.coef_)
# Intercept (the constant term, not an error)
print(lasso.intercept_)
# Coefficient of determination (R^2) on the test set
print(lasso.score(X_test, y_test))
[ 1.85022916e-03 1.31073590e+00 -0.00000000e+00 0.00000000e+00 5.23464952e+00 5.97640655e-01 6.42296851e-01 3.67942959e+01 8.80302532e+00 -3.96520183e-02 8.39294507e-01]
-30.055848397234712
0.27054071146797
I also tried a linear model (Lasso regression), but it did not improve the accuracy: its R^2 on the test set is about 0.27, lower than the random forest's 0.479. Perhaps this is unavoidable given how many dummy variables there are?
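The exact zeros in the coefficient output are typical of L1 regularization. A simplified sketch of the soft-thresholding step that produces them for a single coefficient; this is an illustration of the idea, not scikit-learn's actual solver, and `soft_threshold` is a hypothetical helper:

```python
def soft_threshold(z: float, alpha: float) -> float:
    """Shrink z toward zero by alpha; values within [-alpha, alpha] become exactly 0."""
    if z > alpha:
        return z - alpha
    if z < -alpha:
        return z + alpha
    return 0.0

alpha = 1.0  # Lasso() in scikit-learn uses alpha=1.0 by default
for z in [3.0, 0.4, -0.7, -2.5]:
    print(z, "->", soft_threshold(z, alpha))
```

A larger `alpha` zeroes out more coefficients, so trying a smaller value might be one way to check whether the default penalty is too strong for this data.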
I have only just started learning data analysis, and as a beginner I am grateful that Inside Airbnb provides such well-organized data. Open data that lends itself to this kind of analysis is a little hard to find, so I hope this is a useful reference.
This time there are many parts that I simply copied, but I am glad I was able to learn what each step is meant to do and how to process the data.
I would appreciate any advice!