Make a scraping app with Python + Django + AWS and change jobs

I'm mogken, and I work in the corporate planning office of an IT venture. I've recently been thinking about changing jobs, and I'm studying programming to make myself more appealing to prospective employers. Simply saying that I'm studying isn't very convincing, so I made a simple web application with Python and Django, deployed it on AWS, and published the source on GitHub.

In this post I'll walk through the Python program I wrote, partly as a way to consolidate what I've learned. I've been studying Python for three months, and this is the first time I've built something like this from scratch. If you have any advice on how to improve the code, I'd appreciate a comment.

The finished product

I created a web application that scrapes two company review sites plus a listed-company information site and displays the results, hoping it would help me compare companies while job hunting. I didn't end up using it much ... orz

Overview

The UI was put together quickly with Bootstrap to keep things simple. The framework is Django. The server runs on Amazon Linux on AWS.

UI	language	Framework	server
Bootstrap4	Python	Django	AWS(Amazon Linux + Nginx)

Program details

Program flow

・ Enter the company name you want to search for in the search box
・ Fetch the HTML containing the search results for that company from each target site
・ Parse the fetched HTML and extract only the necessary information
  - Review scores from the company review sites
  - Details such as the number of employees (for listed companies)
・ Display the search results

Get information (HTML) containing search companies from the target site

Use requests to fetch each target site's search page, then parse the HTML with Beautiful Soup (scraping).

`searchCompany.py`



import requests
import bs4
import mojimoji
from operator import itemgetter

#Add sites to be parsed here
targetSite =['vorkers', 'hyoban', 'jyoujyou']

class GetHtml:
    """
    GetHtml as text
    """
    #Search site URL registration
    def __init__(self, company):
        self.headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0",}
        self.urls = {
            targetSite[0]:'https://www.vorkers.com/company_list?field=&pref=&src_str={}&sort=1&ct=top'.format(company), 
            targetSite[1]:'https://en-hyouban.com/search/?SearchWords={}'.format(company),
            targetSite[2]:'https://xn--vckya7nx51ik9ay55a3l3a.com/searches/result?utf8=✓&search_query={}'.format(company)
        }

    #Get the full Html text of the target page
    def getText(self):
        textList = {}
        for url in self.urls:
            res = requests.get(self.urls[url], headers=self.headers)
            text = bs4.BeautifulSoup(res.text, "html.parser")
            textList.setdefault(url, text)
        return textList

`__init__`

self.headers: Sets a browser-like User-Agent header. If the target site's server detects that a request comes from a scraper, it may block it, so the request is made to look like it comes from a web browser.

self.urls: Holds the search page URL of each site to scrape. The company name entered by the user is inserted into each site's query string (the part after `?`), so the search results for that company can be fetched from each site.

**Query string**: The part at the end of a URL that carries information to send to the server. It starts with `?` and consists of `key=value` pairs joined by `&`. When you search a site for a keyword, the keyword is placed in the query string and sent to the server. The server reads the keyword from it and sends back the matching information, which the browser then displays.

In other words, if you put the keyword directly into the query string, you can run the search without typing anything into the search box.
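As a small illustration of the query-string idea, the following sketch builds a search URL with the standard library instead of `str.format`. The helper name `build_search_url` is hypothetical; the parameter name `SearchWords` is the one seen in the en-hyouban URL in the code above. `urlencode` also percent-encodes characters such as spaces, which plain string formatting does not.

```python
from urllib.parse import urlencode, urlsplit, parse_qs


def build_search_url(base_url, params):
    """Append params to base_url as a query string (hypothetical helper)."""
    return "{}?{}".format(base_url, urlencode(params))


# 'SearchWords' is the parameter name seen in the en-hyouban search URL.
url = build_search_url("https://en-hyouban.com/search/", {"SearchWords": "Soft bank"})
print(url)

# A server can recover the keyword from the query string like this.
print(parse_qs(urlsplit(url).query))
```

If you prefer to stay inside requests, passing the same dictionary through the `params` argument of `requests.get` should achieve a similar result.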

getText

A method that returns a dictionary with each targetSite entry as the key and the parsed HTML of that site as the value. The acquired HTML is stored in this dictionary so that the next step can extract only the information it needs.

Extraction of necessary information (from word-of-mouth site)

Target sites: OpenWork (formerly Vorkers) / Kaisha no Hyoban (en-hyouban)

Required information: review score / company name

`searchCompany.py`


class ParseHtml:
    """
    Parse HTML to get the required values
    """
    #Acquisition of company name and evaluation points
    def parseNamePoint(self, textList):
        #Tags (name, class) used for parsing
        nameTag =  {
            targetSite[0]:["h3", "fs-18 lh-1o3 p-r"],
            targetSite[1]:["h2", "companyName"],
        }
        pointTag = {
            targetSite[0]:["p", "totalEvaluation_item fs-15 fw-b"],
            targetSite[1]:["span", "point"],
        } 

        comNamePoint = {}
        for site in targetSite[:2]:
            try:
                #Acquisition of company name
                parseCname =  textList[site].find(nameTag[site][0], class_=nameTag[site][1])
                cname = parseCname.getText().replace('\n','').replace(' ', '')

                #Obtaining company evaluation points
                parseCpoint = textList[site].find(pointTag[site][0], class_=pointTag[site][1])              
                cpoint = parseCpoint.getText().replace('\n','').replace(' ', '')
        
            #Processing when there is no search result
            except AttributeError:
                comNamePoint.setdefault(site, ['No results','No results'])
               
            #Processing when there is a search result
            else:
                comNamePoint.setdefault(site, [cname, cpoint])

        return comNamePoint

parseNamePoint(self, textList)

nameTag: The HTML tag and class used to locate the company name.

pointTag: The HTML tag and class used to locate the review score.

**HTML tag**: The browser renders a web page from the HTML sent by the server, so the necessary information can be located by the tag and class that enclose it.

**for statement**: Because the entered company name is inserted into the query string as-is, the search may return no results for some company names. The exception handling inside the loop covers that case.

Overview of the exception handling:
・ If there is a search result, a dictionary keyed by targetSite with the company name and review score as the value is returned.
・ If there is no search result, an AttributeError is raised and caught, and the value is set to 'No results' instead.
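To make the try / except / else pattern concrete, here is a minimal offline sketch. The HTML snippets and the helper `extract_name` are made up for illustration; only the `h2` tag and the `companyName` class come from `nameTag` above. `find` returns `None` when nothing matches, so calling `get_text()` on the result raises `AttributeError` (`get_text()` is the same method the code above calls as `getText()`).

```python
import bs4

# Made-up HTML: one page with a hit, one page without.
FOUND = '<div><h2 class="companyName">\n Example Corp \n</h2></div>'
NOT_FOUND = '<div><p>No companies matched your search.</p></div>'


def extract_name(html):
    """Return the company name, or 'No results' if the tag is missing."""
    soup = bs4.BeautifulSoup(html, "html.parser")
    try:
        # find() returns None when no element matches the tag and class.
        tag = soup.find("h2", class_="companyName")
        name = tag.get_text().replace("\n", "").replace(" ", "")
    except AttributeError:
        # None has no get_text(), so a missing result ends up here.
        return "No results"
    else:
        return name


print(extract_name(FOUND))
print(extract_name(NOT_FOUND))
```

An explicit `if tag is None:` check would work as well; the exception-based version simply mirrors the structure of `parseNamePoint`.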

Extraction of necessary information (from listed company search)

Target site: Listed company search

Required information: company name / industry / number of employees / average age / average years of service / average wage

`searchCompany.py`


    #For listed companies, get company details (method of ParseHtml)
    def parseInfo(self, textList):
        #Tags (name, class) used for parsing
        cnumberTag = {
            targetSite[2]:['dl', 'well'],
        }
        cinfoTag = {
            targetSite[2]:['dd', 'companies_data']
        }
        
        comInfo = {}

        #Obtaining the company details URL from the company name
        try:
            parseCnumber =  textList[targetSite[2]].find(cnumberTag[targetSite[2]][0], class_=cnumberTag[targetSite[2]][1])
            cnumber = parseCnumber.getText()
            cname = mojimoji.han_to_zen(cnumber[5:].replace('\n', '').replace(' ', ''))
            detail = 'https://xn--vckya7nx51ik9ay55a3l3a.com/companies/{}'.format(cnumber[:5])
        #Processing when there is no search result
        except AttributeError:
            comInfo.setdefault(targetSite[2], ['No data','','','','','',''])
        #Processing when there is a search result
        else:
            #Get Html on company details page
            res = requests.get(detail)
            text = bs4.BeautifulSoup(res.text, "html.parser")

            #Parse the company details page
            parseCinfo =  text.find_all(cinfoTag[targetSite[2]][0], class_=cinfoTag[targetSite[2]][1])
            cinfo = parseCinfo

            #Acquisition of parsed content
            cinfoList = []
            for info in cinfo:
                infoText = info.getText().replace('\n', '').replace('\t', '')
                cinfoList.append(infoText)
            #Add company name
            cinfoList.append(cname)

            if len(cinfoList) <= 18:
                cinfoList.append('')
            
            #Select the necessary items
            useList = itemgetter(0,10,14,15,16,17,18)(cinfoList)
            comInfo.setdefault(targetSite[2], useList)
            
        return comInfo

parseInfo(self, textList)

cnumberTag: On the listed company search site, putting the company name in the query string does not lead directly to a URL with the company details. cnumberTag therefore specifies the HTML tag used to obtain the URL of the page with the detailed information (the first five characters of the tag's text are used to build that URL).

cinfoTag: Specifies the tag used to get the required information from the company details page.

**try / except (exception handling)**: There may be no search results here either, so that case is handled with an exception handler.

**for statement**: Collects the text of each parsed element into a list. The required items are then selected so that a dictionary keyed by targetSite, with the list of required information as the value, can be returned.
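The selection step can be tried in isolation. This is a minimal sketch with made-up data: `rows` stands in for `cinfoList`, and the indices are shortened for readability (the real code indexes up to 18). Called with several indices, `itemgetter` returns a tuple of the items at those positions, and it raises `IndexError` if the list is too short, which is why the list is padded first.

```python
from operator import itemgetter

# Stand-in for cinfoList, with made-up values.
rows = ["Example Corp", "Tokyo", "Services", "1,234 employees"]

# Pad so that index 4 exists, mirroring the length check before itemgetter.
if len(rows) <= 4:
    rows.append("")

# Pick positions 0, 2, 3 and 4 in one call; the result is a tuple.
picked = itemgetter(0, 2, 3, 4)(rows)
print(picked)
```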

Displayed together as search results

`searchCompany.py`


def main(company):
    aboutCompany = {}

    #Get URL and Html
    getHtml = GetHtml(company)
    text = getHtml.getText()
    urls = getHtml.urls

    #Parse the HTML
    parseHtml = ParseHtml()
    comNamePoint = parseHtml.parseNamePoint(text)
    comInfo = parseHtml.parseInfo(text)
    
    #Shape the output data
    #Company name and evaluation points
    for site in targetSite[:2]:
        comNamePoint[site].append(urls[site])
    aboutCompany.update(comNamePoint)
    
    #Detailed company information
    for info in comInfo:
        aboutCompany.setdefault(info, comInfo[info])
    
    #Search word
    words = mojimoji.han_to_zen(company)
    aboutCompany['searchWord'] = words


    return aboutCompany

if __name__ =="__main__":
    print(main('Softbank'))

main(company)

aboutCompany: Taking the entered company name as an argument, main fetches and parses the HTML, stores the results in a dictionary named aboutCompany, and returns it. Django then takes care of rendering it nicely as HTML.
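The way the result dictionary is assembled can be sketched without any network access. All values below are made up; `name_point`, `info` and `urls` stand in for the return values of `parseNamePoint()`, `parseInfo()` and `GetHtml.urls`.

```python
# Made-up parser outputs standing in for parseNamePoint() and parseInfo().
name_point = {"vorkers": ["ExampleCorp", "3.5"], "hyoban": ["ExampleCorp", "3.2"]}
info = {"jyoujyou": ("ExampleCorp", "Services", "", "", "", "", "")}
urls = {"vorkers": "https://example.com/a", "hyoban": "https://example.com/b"}

about = {}

# Attach each site's search URL to its [name, score] list.
for site, values in name_point.items():
    values.append(urls[site])
about.update(name_point)

# setdefault() adds a key only if it is not already present.
for site, values in info.items():
    about.setdefault(site, values)

about["searchWord"] = "ExampleCorp"
print(sorted(about))
```

Each review site ends up with a three-item list (name, score, URL), which is presumably the shape the Django template expects.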

mojimoji: An external library that converts half-width characters to full-width characters. Half-width and full-width usage differs between the sites the information is pulled from, so everything is converted to full-width in one go.
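For readers without mojimoji installed, the idea can be sketched with the standard library. This is not how mojimoji is implemented, and it covers only printable ASCII (spaces are left unchanged, and half-width katakana is not handled); the helper name `ascii_to_fullwidth` is hypothetical. It relies on the fact that the full-width forms U+FF01 to U+FF5E mirror ASCII 0x21 to 0x7E at a fixed offset of 0xFEE0.

```python
import unicodedata

# Full-width forms U+FF01..U+FF5E mirror ASCII 0x21..0x7E at offset 0xFEE0.
_TO_FULLWIDTH = {code: code + 0xFEE0 for code in range(0x21, 0x7F)}


def ascii_to_fullwidth(text):
    """Convert printable ASCII to full-width (hypothetical helper)."""
    return text.translate(_TO_FULLWIDTH)


wide = ascii_to_fullwidth("Softbank123")
print(wide)

# NFKC normalization folds the full-width forms back to ASCII.
print(unicodedata.normalize("NFKC", wide))
```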

At the end

That is all for this walkthrough. I finished the program over a month ago, so I had forgotten many of the details, which made writing this hard. Rereading my own messy code was painful, and since I wrote it myself, I can't complain ...

Next, I'll write about Django and AWS.
