Make a scraping app with Python + Django + AWS and change jobs

I'm mogken, and I work in the Corporate Planning Office of an IT venture. Recently I've been thinking about changing jobs, and I'm studying programming as something to show prospective employers. Simply saying "I'm studying" doesn't make much of an impression, so I built a simple web application with Python and Django, deployed it on AWS, and published the source on GitHub.

In this post I'd like to walk through the Python program I wrote, partly as an output exercise to consolidate what I've learned. I've been studying Python for three months, and this is the first time I've built something like this from scratch. If you have any advice on how to improve the code, I'd appreciate your comments.

The finished product

I created a web application that scrapes two company review (word-of-mouth) sites plus listed-company information and displays them side by side, in the hope that it would help with comparing companies when changing jobs. I didn't end up using it much ... orz

Overview

The UI is put together quickly with Bootstrap so as not to spend too much time on it, the framework is Django, and the server is built on Amazon Linux on AWS.

| UI | Language | Framework | Server |
| --- | --- | --- | --- |
| Bootstrap 4 | Python | Django | AWS (Amazon Linux + Nginx) |

Program details

Program flow

- Enter the company name you want to search for in the search window
- Fetch the pages (HTML) containing the searched company from the scraping target sites
- Extract (parse) only the necessary information from the fetched HTML
  - Required information: company review points, plus data such as the number of employees (for listed companies)
- Display the results as search results

__ Get the HTML containing the searched company from the target sites __

Beautiful Soup is used to fetch (scrape) the pages of the target sites.

searchCompany.py



import requests
import bs4
import mojimoji
from operator import itemgetter

#Add sites to be parsed here
targetSite = ['vorkers', 'hyoban', 'jyoujyou']

class GetHtml:
    """
    GetHtml as text
    """
    #Search site URL registration
    def __init__(self, company):
        self.headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0",}
        self.urls = {
            targetSite[0]:'https://www.vorkers.com/company_list?field=&pref=&src_str={}&sort=1&ct=top'.format(company), 
            targetSite[1]:'https://en-hyouban.com/search/?SearchWords={}'.format(company),
            targetSite[2]:'https://xn--vckya7nx51ik9ay55a3l3a.com/searches/result?utf8=✓&search_query={}'.format(company)
        }

    #Get the full Html text of the target page
    def getText(self):
        textList = {}
        for url in self.urls:
            res = requests.get(self.urls[url], headers=self.headers)
            text = bs4.BeautifulSoup(res.text, "html.parser")
            textList.setdefault(url, text)
        return textList

__init__ / self.headers: Sets a User-Agent so the request looks like it comes from a web browser. If the target site's server recognizes that it is being scraped it may block the request, so the access needs to be disguised as coming from an ordinary browser.

self.urls: Registers the search-page URL of each scraped site. Company information is retrieved from each site by inserting the company name entered by the user into the designated spot in each URL's query string (the part after ?).

** (query string) ** A string at the end of a URL that describes the information you want to send to the server; it starts after a ?. When you search for a keyword on a site, the keyword is placed in the query string and sent to the server. The server reads the search word out of that string, sends back the required information, and the browser displays it.

In other words, if you put the keyword you want to search for directly into the query string, you can run the search without typing anything into the site's search window.
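For example, here is a minimal sketch of building such a search URL by hand, using the openWork search URL from the code above (the actual code inserts the name with str.format; urllib.parse.quote is added here only to show the keyword being URL-encoded into the query string):

```python
from urllib.parse import quote

company = 'Softbank'
# Place the URL-encoded keyword into the src_str parameter of the query string
url = 'https://www.vorkers.com/company_list?field=&pref=&src_str={}&sort=1&ct=top'.format(quote(company))
print(url)
```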

getText

A method that returns a dictionary whose keys are the entries of targetSite and whose values are the fetched HTML (as BeautifulSoup objects). Storing the fetched HTML in a dictionary makes it easy for the following steps to pick out only the information they need.
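As a rough usage sketch (assuming the three sites respond normally), the returned dictionary is keyed by the targetSite entries:

```python
# Usage sketch: fetch the three search pages for one company
getHtml = GetHtml('Softbank')
textList = getHtml.getText()
print(textList.keys())  # dict_keys(['vorkers', 'hyoban', 'jyoujyou'])
```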

__ Extraction of necessary information (from word-of-mouth site) __

Target sites: openWork (formerly Vorkers) / Kaisha no Hyouban (en-hyouban.com)
Required information: evaluation points / company name

searchCompany.py


class ParseHtml:
    """
    Parse Html to get the required values
    """
    #Acquisition of company name and evaluation points
    def parseNamePoint(self, textList):
        #Tags registered for parsing
        nameTag =  {
            targetSite[0]:["h3", "fs-18 lh-1o3 p-r"],
            targetSite[1]:["h2", "companyName"],
        }
        pointTag = {
            targetSite[0]:["p", "totalEvaluation_item fs-15 fw-b"],
            targetSite[1]:["span", "point"],
        } 

        comNamePoint = {}
        for site in targetSite[:2]:
            try:
                #Acquisition of company name
                parseCname =  textList[site].find(nameTag[site][0], class_=nameTag[site][1])
                cname = parseCname.getText().replace('\n','').replace(' ', '')

                #Obtaining company evaluation points
                parseCpoint = textList[site].find(pointTag[site][0], class_=pointTag[site][1])              
                cpoint = parseCpoint.getText().replace('\n','').replace(' ', '')
        
            #Processing when there is no search result
            except AttributeError:
                comNamePoint.setdefault(site, ['No results','No results'])
               
            #Processing when there is a search result
            else:
                comNamePoint.setdefault(site, [cname, cpoint])

        return comNamePoint

parseNamePoint(self, textList) / nameTag: Registers the HTML tag and class used to pick up the company name on each site.

pointTag: Registers the HTML tag and class used to pick up the review rating.

** (HTML tag) ** The browser renders the web page from the HTML the server sends, so the necessary information can be extracted from that HTML by specifying the tag (and class) that encloses it.
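As a minimal, self-contained sketch of what find(tag, class_=...) does (the HTML string below is made up for illustration; the class names are the ones used for the hyoban site above):

```python
import bs4

# Toy HTML standing in for a fetched search-result page
html = '<div><h2 class="companyName">Example Inc.</h2><span class="point">3.5</span></div>'
soup = bs4.BeautifulSoup(html, "html.parser")

# find() returns the first element matching the tag name and class
name = soup.find("h2", class_="companyName").getText()
point = soup.find("span", class_="point").getText()
print(name, point)  # Example Inc. 3.5
```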

** for statement ** The loop goes over the two review sites. Because the entered company name is simply placed into each URL's query string, the result page may not exist depending on the name, so exception handling for that case is done inside the loop.

Overview of the exception handling (see the sketch after this list):
- If there is a search result, a dictionary keyed by targetSite with the company name and evaluation points as the value is returned.
- If there is no search result, the AttributeError is caught and a dictionary with 'No results' as the value is returned.
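A stripped-down sketch of that try/except/else pattern (the soup here is a toy page with no matching tag, so find() returns None and calling getText() on it raises AttributeError):

```python
import bs4

soup = bs4.BeautifulSoup('<p>no matching tag here</p>', 'html.parser')
result = {}
try:
    # find() returns None when nothing matches, so .getText() raises AttributeError
    name = soup.find('h2', class_='companyName').getText()
except AttributeError:
    result.setdefault('hyoban', ['No results', 'No results'])
else:
    result.setdefault('hyoban', [name, 'rating goes here'])
print(result)  # {'hyoban': ['No results', 'No results']}
```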

__ Extraction of necessary information (from listed company search) __

Target site: Listed company search
Required information: company name / industry / number of employees / average age / average years of service / average wage

searchCompany.py


    #If the company is listed, get its details
    def parseInfo(self, textList):
        #Tags registered for parsing
        cnumberTag = {
            targetSite[2]:['dl', 'well'],
        }
        cinfoTag = {
            targetSite[2]:['dd', 'companies_data']
        }
        
        comInfo = {}

        #Obtaining the company details URL from the company name
        try:
            parseCnumber =  textList[targetSite[2]].find(cnumberTag[targetSite[2]][0], class_=cnumberTag[targetSite[2]][1])
            cnumber = parseCnumber.getText()
            cname = mojimoji.han_to_zen(cnumber[5:].replace('\n', '').replace(' ', ''))
            detail = 'https://xn--vckya7nx51ik9ay55a3l3a.com/companies/{}'.format(cnumber[:5])
        #Processing when there is no search result
        except AttributeError:
            comInfo.setdefault(targetSite[2], ['No data','','','','','',''])
        #Processing when there is a search result
        else:
            #Get Html on company details page
            res = requests.get(detail)
            text = bs4.BeautifulSoup(res.text, "html.parser")

            #Parse the company detail page
            parseCinfo =  text.find_all(cinfoTag[targetSite[2]][0], class_=cinfoTag[targetSite[2]][1])
            cinfo = parseCinfo

            #Acquisition of parsed content
            cinfoList = []
            for info in cinfo:
                infoText = info.getText().replace('\n', '').replace('\t', '')
                cinfoList.append(infoText)
            #Add company name
            cinfoList.append(cname)

            if len(cinfoList) <= 18:
                cinfoList.append('')
            
            #Pick out the required fields
            useList = itemgetter(0,10,14,15,16,17,18)(cinfoList)
            comInfo.setdefault(targetSite[2], useList)
            
        return comInfo

parseInfo(self, textList) / cnumberTag: On the listed company search site, just putting the company name into the query string does not lead to a page where the company details can be obtained, so cnumberTag specifies the HTML tag used to obtain the URL of the company detail page.

cinfoTag: Specifies the tag used to pick up the required information from the company detail page.

** try, except (exception handling) ** Since there may be no search results here as well, the same kind of exception handling is applied.

** for statement ** The loop cleans up each extracted string so that the method can return a dictionary keyed by targetSite whose value is the list of required information.
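parseInfo also uses operator.itemgetter to pull specific positions out of cinfoList. Here is a tiny self-contained sketch of that call, with placeholder data standing in for the real page fields:

```python
from operator import itemgetter

# Placeholder list standing in for cinfoList (19 items; the company name is appended last)
data = ['field{}'.format(i) for i in range(18)] + ['company name']

# itemgetter picks the items at the given indices and returns them as a tuple
print(itemgetter(0, 10, 14, 15, 16, 17, 18)(data))
# ('field0', 'field10', 'field14', 'field15', 'field16', 'field17', 'company name')
```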

__ Display everything together as search results __

searchCompany.py


def main(company):
    aboutCompany = {}

    #Get URL and Html
    getHtml = GetHtml(company)
    text = getHtml.getText()
    urls = getHtml.urls

    #Parse the html
    parseHtml = ParseHtml()
    comNamePoint = parseHtml.parseNamePoint(text)
    comInfo = parseHtml.parseInfo(text)
    
    #Format the output data
    #Company name and evaluation points
    for site in targetSite[:2]:
        comNamePoint[site].append(urls[site])
    aboutCompany.update(comNamePoint)
    
    #Detailed company information
    for info in comInfo:
        aboutCompany.setdefault(info, comInfo[info])
    
    #Search word
    words = mojimoji.han_to_zen(company)
    aboutCompany['searchWord'] = words


    return aboutCompany

if __name__ == "__main__":
    print(main('Softbank'))

main(company) / aboutCompany: Fetches and parses the HTML with the entered company name as the argument, stores the results in a dictionary named aboutCompany, and returns it. Django then takes that dictionary and renders it into the HTML page.
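The Django side is left for the next article, but as a rough, hypothetical sketch of how main() could be hooked into a view (the view name, GET parameter, module import, and template path are assumptions for illustration, not the actual project code):

```python
# views.py -- hypothetical sketch, not the project's actual view
from django.shortcuts import render

from . import searchCompany  # assumed module location

def search(request):
    context = {}
    company = request.GET.get('company')  # value typed into the search window
    if company:
        # {'vorkers': [...], 'hyoban': [...], 'jyoujyou': [...], 'searchWord': ...}
        context['aboutCompany'] = searchCompany.main(company)
    return render(request, 'search/result.html', context)
```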

mojimoji: An external library that converts half-width characters to full-width. Because the sites the information is pulled from don't use half-width and full-width characters consistently, everything is converted to full-width in one go.
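A minimal example of mojimoji.han_to_zen (installed with pip install mojimoji); half-width katakana and ASCII characters are converted to their full-width forms:

```python
import mojimoji

# Half-width katakana and digits become full-width
print(mojimoji.han_to_zen('ｿﾌﾄﾊﾞﾝｸ123'))  # ソフトバンク１２３
```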

At the end

That's all for this walkthrough. I finished the program more than a month ago, so I had forgotten the details and piecing them back together was hard. Reading back my own messy code was painful enough that I'm in no position to give anyone advice ...

Next time, I'll write about the Django and AWS parts.
