[First data science ⑤] I tried to help my friend find their first property through data analysis.

Nice to meet you. My name is S.I., and I am a third-year university student in the Department of Computer Science. My Python experience amounts to little more than university coursework.

As an intern in the data science division of Cacco Inc., my task for the trial period was to build a crawler to collect data, then process and visualize it, and to briefly discuss what I learned.

Task

Theme

A friend from university is about to start living alone, but real-estate websites list far too many properties to choose from. Solve this problem with data analysis.

Constraint

Within 60 minutes of commuting time from JR Kanamachi Station

Background

Reflecting on my own property-search experience, I figured that the "information you want to know" when searching is really the "set of conditions" you would hand a real-estate agent, so I decided to derive those conditions through data analysis.

Policy

  1. Crawl the property site "Smighty" and save each results page as an HTML file
  2. Scrape each variable from the saved HTML files
  3. Analyze the data
  4. Present the conditions for finding a property

Crawling

This time I crawl Smighty's "Commute / School Time Search", restricting the results to properties within 60 minutes of Kanamachi Station.

  1. Save the total number of properties posted on the site in a text file
  2. Specify the URL of the first page and save it as an HTML file
  3. Get the URL of the next page from pagination and transition
  4. Save the destination page as an HTML file
  5. Repeat steps 3 and 4 until there is no next page

The crawling code looks like this:

crawling.py


import requests
from bs4 import BeautifulSoup
import time
import os
import datetime

def crawling():
    #Path of directory for saving html files
    dirname = './html_files'
    if not os.path.exists(dirname):
        #Create directory if it does not exist
        os.mkdir(dirname)

    #Fetch the first page
    url = "https://sumaity.com/chintai/commute_list/list.php?search_type=c&text_from_stname%5B%5D=%E9%87%91%E7%94%BA&cost_time%5B%5D=60&price_low=&price_high="
    response = requests.get(url)
    time.sleep(1)
    #Save to file
    page_count = 1    #Page count
    with open('./html_files/page{}.html'.format(page_count), 'w', encoding='utf-8') as file:
        file.write(response.text)

    #Get the total number of listed properties (used later as the acceptance check for scraping)
    soup = BeautifulSoup(response.content, "lxml")
    num_bukken = int(soup.find(class_='searchResultHit').contents[1].text.replace(',', ''))
    print("Total number of properties within 60 minutes of commuting time:", num_bukken)
    #Save the total number of properties in a text file as it will be used to check the acceptance conditions when scraping.
    path = './data.txt'
    with open(path, mode='w') as f:
        f.write("{}\n".format(num_bukken))

    #Crawl the second and subsequent pages until there is no next page
    while True:
        page_count += 1

        #Find the next url
        next_url = soup.find("li", class_="next")

        #Break and finish when the next page runs out
        if next_url is None:
            print("Total number of pages:", page_count-1)
            with open(path, mode='a') as f:
                f.write("{}\n".format(page_count-1))
            break

        #Get the next page url and save it as an html file
        url = next_url.a.get('href')
        response = requests.get(url)
        time.sleep(1)
        with open('./html_files/page{}.html'.format(page_count), 'w', encoding='utf-8') as file:
            file.write(response.text)

        #Prepare for analysis to get the url of the next page
        soup = BeautifulSoup(response.content, "lxml")

        #Crawling progress output
        if page_count % 10 == 0:
            print(page_count, 'pages fetched')

#Entry point
if __name__ == "__main__":
    date_now = datetime.datetime.now()
    print("Start crawling:", date_now)
    crawling()
    date_now = datetime.datetime.now()
    print("Finished crawling:", date_now)



Scraping

The variables scraped this time are as follows. [Figure: list of scraped variables]

  1. Scrape each variable for each property
  2. Once all the variables are available, append them as one record to the CSV file
  3. As a consistency check, match the number of records against the total property count obtained during crawling

The scraping code looks like this:

scraping.py


from bs4 import BeautifulSoup
import datetime
import csv
import re

#Regular expression for splitting a Japanese address into prefecture and city
#(the literals must stay Japanese to match the site's address text)
pat = '(...??[都道府県])((?:旭川|伊達|石狩|盛岡|奥州|田村|南相馬|那須塩原|東村山|武蔵村山|羽村|十日町|上越|富山|野々市|大町|蒲郡|四日市|姫路|大和郡山|廿日市|下松|岩国|田川|大村|宮古|富良野|別府|佐伯|黒部|小諸|塩尻|玉野|周南)市|(?:余市|高市|[^市]{2,3}?)郡(?:玉村|大町|.{1,5}?)[町村]|(?:.{1,4}市)?[^町]{1,4}?区|.{1,7}?[市町村])(.+)'

def scraping(total_page, room_num):
    #Initialization of the number of properties
    room_count = 0

    #Preparation of csv file (add header)
    with open('room_data.csv', 'w', newline='', encoding='CP932') as file:
        header = ['No', 'building_name', 'category', 'prefecture', 'city', 'station_num', 'station', 'method', 'time', 'age', 'total_stairs', 'stairs', 'layout', 'room_num', 'space', 'south', 'corner', 'rent', 'unit_price', 'url']
        writer = csv.DictWriter(file, fieldnames=header)
        writer.writeheader()


    for page_num in range(total_page):
        #Scraping progress output
        if page_num % 10 == 0:
            print(page_num , '/', total_page)

        #Open the html file to be scraped with Beautiful Soup
        with open('./html_files/page{}.html'.format(page_num + 1), 'r', encoding='utf-8') as file:
            page = file.read()
        soup = BeautifulSoup(page, "lxml")

        #Get information for each building
        building_list = soup.find_all("div", class_="building")
        for building in building_list:
            #Building category: Condominium or apartment or detached house
            buildingCategory = building.find(class_="buildingCategory").getText()

            #Building name (strip the category and the site's Japanese "new arrival" badge '新着')
            buildingName = building.find(class_="buildingName").h3.getText().replace("{}".format(buildingCategory), "").replace("新着", "")

            #Extraction of candidates for the nearest station and the distance from the station
            traffic = building.find("ul", class_="traffic").find_all("li")
            #Number of nearest stations
            station_num = len(traffic)
            #Extract those with short walking time
            min_time = 1000000    #Initialize the minimum required time
            for j in range(station_num):
                traffic[j] = traffic[j].text
                figures = re.findall(r'\d+', traffic[j])
                time = 0
                for figure in figures:
                    #Calculation of required time
                    time += int(figure)
                #Store minimum time required and index if minimum
                if time < min_time:
                    min_time = time
                    index = j

            #If station or route information is available
            if len(traffic[index].split(' ')) > 1:
                #Route (line) name
                line = traffic[index].split(' ')[0]
                #Nearest station (the site's text is Japanese, so split on '駅' = "station")
                station = traffic[index].split(' ')[1].split('駅')[0]
                #Means of transportation to the station (bus, car, or walking)
                if len(traffic[index].split(' ')) > 2:
                    if 'バス' in traffic[index].split(' ')[1]:    #'バス' = bus
                        method = "bus"
                    elif '車' in traffic[index].split(' ')[2]:    #'車' = car
                        method = "car"
                    else:
                        method = "walk"
                #No transportation information to the station
                else:
                    method = None
            #No station or route information at all
            else:
                station = None
                line = None
                method = None
                min_time = None    #min_time is what gets written to the CSV, so clear it too

            #Street address: split into prefecture and city with the regex above
            address = building.find(class_="address").getText().replace('\n','')
            address = re.split(pat, address)
            if len(address) < 3:
                #Fallback when the pattern does not match
                prefecture = "東京都"    #Tokyo
                city = "足立区"          #Adachi Ward
            else:
                prefecture = address[1]
                city = address[2]

            #Details of the building (age, structure, total number of floors)
            building_detail = building.find(class_="detailData").find_all("td")
            for j in range(len(building_detail)):
                building_detail[j] = building_detail[j].text

            # ----Extract just the building age in years----
            #Age unknown (the site shows the Japanese label '築年不明')
            if building_detail[0] == '築年不明':
                building_detail[0] = None
            #Less than one year old ('未満' = "less than")
            elif '未満' in building_detail[0]:
                building_detail[0] = 0
            #Normal value
            else:
                building_detail[0] = int(re.findall(r'\d+', building_detail[0])[0])

            #Get only the total number of floors
            building_detail[2] = int(re.findall(r'\d+', building_detail[2])[0])


            # ----Get room details----
            rooms = building.find(class_="detail").find_all("tr",
                                                            {'class': ['estate applicable', 'estate applicable gray']})
            for j in range(len(rooms)):
                #Counting the number of properties
                room_count += 1

                # ----Floor number----
                stairs = rooms[j].find("td", class_="roomNumber").text
                #Keep only the number ("-" means missing)
                if stairs == "-":
                    stairs = None
                else:
                    stairs = int(re.findall(r'\d+', stairs)[0])

                #Convert the rent to an integer (the site shows it like "5.6万円", i.e. in units of 10,000 yen)
                price = rooms[j].find(class_="roomPrice").find_all("p")[0].text
                price = round(10000 * float(price.split('万')[0]))

                #Management fee
                kanri_price = rooms[j].find(class_="roomPrice").find_all("p")[1].text
                #Normalize notation: treat "-" and "0円" as zero, otherwise strip the '円' (yen) suffix
                if "-" in kanri_price or kanri_price == '0円':
                    kanri_price = 0
                else:
                    kanri_price = int(kanri_price.split('円')[0].replace(',',''))

                #Room type (floor plan); the site labels studios 'ワンルーム', normalized here to "1R"
                room_type = rooms[j].find(class_="type").find_all("p")[0].text
                if room_type == 'ワンルーム':
                    room_type = "1R"
                #Number of rooms
                num_of_rooms = int(re.findall(r'\d+', room_type)[0])


                #Room area; strip the unit (e.g. "25.5m²")
                room_area = rooms[j].find(class_="type").find_all("p")[1].text
                room_area = float(room_area.split('m')[0])

                #South-facing / corner-room flags (the site's labels are Japanese: '南向き' and '角部屋')
                special = rooms[j].find_all("span", class_="specialLabel")
                south = 0
                corner = 0
                for label in special:
                    if '南向き' in label.text:
                        south = 1
                    if '角部屋' in label.text:
                        corner = 1

                #Get detailed url
                room_url = rooms[j].find("td", class_="btn").a.get('href')

                #Total rent = rent + management fee
                rent = price + kanri_price

                #Rent per square meter (unit price)
                unit_price = rent / room_area

                #Append the record to the CSV file (CP932 so Japanese displays correctly on Windows; the default would be UTF-8)
                with open('room_data.csv', 'a', newline='', encoding='CP932') as file:
                    writer = csv.DictWriter(file, fieldnames=header)
                    writer.writerow(
                        {'No':room_count, 'building_name':buildingName, 'category':buildingCategory, 'prefecture':prefecture, 'city':city, 'station_num':station_num, 'station':station,
                              'method':method, 'time':min_time, 'age':building_detail[0], 'total_stairs':building_detail[2], 'stairs':stairs,
                              'layout':room_type, 'room_num':num_of_rooms, 'space':room_area, 'south':south, 'corner':corner, 'rent':rent, 'unit_price':unit_price, 'url':room_url})

    print("{}We have acquired the property data.".format(room_count))
    #Confirmation of acceptance conditions
    if room_count == room_num:
        print("Clear acceptance conditions")
    else:
        print("{}There are differences. The acceptance conditions have not been cleared.".format(abs(room_count-room_num)))

if __name__ == "__main__":
    date_now = datetime.datetime.now()
    print("Start scraping:", date_now)
    #Pass the total number of pages and the number of properties to the scraping function (acceptance condition)
    path = './data.txt'
    with open(path) as f:
        data = f.readlines()
    scraping(int(data[1].replace("\n","")), int(data[0].replace("\n","")))
    date_now = datetime.datetime.now()
    print("Finished scraping:", date_now)

Data visualization

First, I plotted a histogram of the rent distribution and removed the properties whose rent was too high to be realistic for living alone. [Figure: rent histogram]
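As a reference, here is a minimal pandas/matplotlib sketch of this step. It assumes the room_data.csv produced by scraping.py above; the 150,000-yen cutoff is an illustrative assumption, not the exact threshold I used.

import pandas as pd
import matplotlib.pyplot as plt

#Load the CSV written by scraping.py (CP932-encoded, see above)
df = pd.read_csv('room_data.csv', encoding='CP932')

#Histogram of the rent distribution
df['rent'].hist(bins=50)
plt.xlabel('rent (yen)')
plt.ylabel('number of properties')
plt.show()

#Keep only properties affordable for living alone
#(the 150,000-yen cutoff is an assumed threshold for illustration)
df = df[df['rent'] <= 150000]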

From here, let's see how each variable affects rent.

Floor plan

Let's look at the number of properties and the distribution of rent for each floor plan.

[Figure: number of properties and rent distribution by floor plan]

A bar graph of the number of properties per floor plan shows that the plans from 1R to 3LDK account for 98% of the total. A violin plot of the rent for those floor plans shows that the rent distribution differs by plan, so the floor plan is likely a variable that affects rent.
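The bar graph and violin plot could be produced roughly as follows (a sketch continuing from the df above; restricting the violin plot to the eight most common layouts is my assumption, for readability):

import seaborn as sns
import matplotlib.pyplot as plt

#Bar graph: number of properties per floor plan
df['layout'].value_counts().plot(kind='bar')
plt.ylabel('number of properties')
plt.show()

#Violin plot of rent for the most common floor plans
top_layouts = df['layout'].value_counts().head(8).index
sns.violinplot(x='layout', y='rent', data=df[df['layout'].isin(top_layouts)])
plt.show()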

Location

Let's see where the properties are concentrated. [Figure: property counts by prefecture and city]

By prefecture, most properties are in Tokyo and Chiba, with Saitama at only about 3%. Looking at individual cities, Adachi Ward, Katsushika Ward, Matsudo City, Kashiwa City, and Arakawa Ward each have more than 1,000 properties, so they look like good places to search. Let's look at the rent distribution in each of these districts. [Figure: rent distribution by district]

The rent histogram by prefecture shows that Tokyo has many properties but also many expensive ones, while Chiba has many cheaper ones. A closer look at the rent box plot for each city shows the green boxes for the Chiba area sitting toward the bottom: cheap properties seem easy to find in Matsudo, Kashiwa, Nagareyama, Ichikawa, Abiko, Yoshikawa, and Soka. Since the box plot shows the rent distribution differing by district, location also seems to affect rent.

Travel time from the station and means of transportation

[Figure: rent vs. travel time, by means of transportation]

There is a weak negative correlation between travel time and rent: the longer the travel time, the cheaper the rent. I also illustrated how rent differs by the means of transportation used (bus or walking). Rent for the blue walking properties is higher than for the bus ones, so both the means of transportation and the travel time are likely to affect rent.
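The correlation and a scatter plot colored by means of transportation could be computed like this (a sketch using the same df):

import matplotlib.pyplot as plt

#Correlation between travel time to the station and rent
print(df[['time', 'rent']].corr())

#Scatter plot of travel time vs. rent, one color per means of transportation
for method, group in df.groupby('method'):
    plt.scatter(group['time'], group['rent'], s=5, alpha=0.3, label=method)
plt.xlabel('travel time to station (min)')
plt.ylabel('rent (yen)')
plt.legend()
plt.show()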

Building age

I grouped building age into five-year bins and drew a box plot of rent. [Figure: rent by building age]

Rent falls gradually for properties more than 15 years old, so building age is also likely a variable that affects rent.
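The five-year binning can be done with pd.cut, for example (a sketch; the 0-50-year range is an assumption):

import pandas as pd
import matplotlib.pyplot as plt

#Group building age into 5-year bins (0-50 years is an assumed range)
df['age_bin'] = pd.cut(df['age'], bins=range(0, 55, 5), right=False)
df.boxplot(column='rent', by='age_bin')
plt.xlabel('building age (5-year bins)')
plt.ylabel('rent (yen)')
plt.show()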

Total floors and building type

Let's look at the rent distribution by the building's total number of floors, together with a histogram of the total floors. [Figure: rent by total floors; histogram of total floors]

Properties of up to two stories tend to be cheap, and the histogram shows that most two-story properties are apartments. Moreover, 95% of the properties are in buildings of 10 stories or fewer: you may long for a high-rise for your first time living alone, but that seems hard to realize among single-occupancy properties. So building information also affects rent.

South facing

Next, I check whether facing south, a common selling point, affects rent. I created histograms of south-facing properties and the rest. [Figure: rent histograms, south-facing vs. not]

The two distributions look similar, so I tested whether the difference in mean rent is significant. An F-test did not reject homoscedasticity, so I ran a t-test assuming equal variances. The difference was significant: south-facing properties are about 1,500 yen cheaper on average. So whether a property faces south does affect rent.
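The F-test followed by the matching t-test could look like this with scipy (a sketch; the 5% significance level is an assumption):

from scipy import stats

south = df.loc[df['south'] == 1, 'rent'].dropna()
other = df.loc[df['south'] == 0, 'rent'].dropna()

#Two-sided F-test for equality of variances
f = south.var(ddof=1) / other.var(ddof=1)
dfn, dfd = len(south) - 1, len(other) - 1
p_var = 2 * min(stats.f.cdf(f, dfn, dfd), 1 - stats.f.cdf(f, dfn, dfd))

#Student's t-test if equal variances were not rejected, otherwise Welch's test
t, p = stats.ttest_ind(south, other, equal_var=(p_var >= 0.05))
print('F-test p = {:.3f}, t-test p = {:.3f}, mean difference = {:.0f} yen'
      .format(p_var, p, south.mean() - other.mean()))

The corner-room comparison in the next section is the same recipe with the corner column, where the rejected F-test forces equal_var=False.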

Corner room

Similarly, let's look at the effect of being a corner room. [Figure: rent histograms, corner room vs. not]

This time the F-test rejected homoscedasticity, so I ran a t-test that does not assume equal variances (Welch's test). The difference was significant: corner rooms are about 2,000 yen more expensive. So being a corner room also affects rent.

Data visualization 2

Based on the results so far, I will analyze the data once more to pin down the concrete property conditions to recommend to my university friend.

In what district should I actually look for a property?

I plotted the number of properties against the average rent for each city. [Figure: property count vs. average rent by city]

You can see that Matsudo City has many properties and the average rent is low.

Which floor plan?

The number of properties and the average rent for each floor plan are as follows. [Figure: property count and average rent by floor plan]

The average rent for 1R, 2K, and 3K is lower, but the number of 1K properties is overwhelming. Living alone does not require much space, so a 1K floor plan seems like a good choice. Using the median, the market rent for a 1K property in Matsudo comes out to 56,000 yen.
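The median is a one-liner on the DataFrame (a sketch; '松戸市' assumes the city column keeps the site's Japanese names, as produced by the scraper above):

#Median rent of 1K properties in Matsudo City
matsudo_1k = df[(df['city'] == '松戸市') & (df['layout'] == '1K')]
print('1K market rent in Matsudo:', matsudo_1k['rent'].median())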

How old should the building be?

Here is a box plot of how the rent distribution of 1K properties in Matsudo City changes with building age. [Figure: rent of Matsudo 1K properties by building age]

Many properties fall below the market price once they are 15 years old or more, so it seems sensible to search among properties around 15 years old or older.

How far from the station?

I made a bar graph of the number of properties by travel time from the station, color-coded by means of transportation. [Figure: property counts by travel time and means of transportation]

95% of the single-occupancy properties are within a 20-minute walk. Plotting the number of properties and the average rent for each travel time shows that extending the search to within 15 minutes brings the rent down.
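The color-coded bar graph can be built from a crosstab (a sketch using the same df):

import pandas as pd
import matplotlib.pyplot as plt

#Number of properties per travel time, stacked by means of transportation
counts = pd.crosstab(df['time'], df['method'])
counts.plot(kind='bar', stacked=True)
plt.xlabel('travel time from station (min)')
plt.ylabel('number of properties')
plt.show()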

Result

Based on the above, the property-search conditions to pass on to my friend are as follows. [Figure: recommended search conditions]

Searching under these conditions mostly turns up **apartment**-type rooms. With these conditions in hand, I think my friend can find a good room by visiting a real-estate agent in Matsudo.

In conclusion

Reflections

My main regret is that I started analyzing the data before settling on a policy. That may be fine as a hobby, but in real work you end up producing charts with no clear use, which wastes an enormous amount of time. Data analysis has to be done with a purpose.

What I learned from the feedback

  1. **Code review**: improve readability by adding comments and following the coding standards.
  2. **Presentation review**: depending on how you present them, the same results can come across as good or bad. Check that the results actually serve the stated purpose, build a narrative around them, and present it in an easy-to-understand way.

Other

This task went beyond analysis to preparing materials, presenting, and receiving feedback, which made it a valuable experience. I realized how long data collection and analysis take, and how hard it is to convey results to others; I want to apply these lessons going forward.

That's all.
