I'm mogken, and I work in the Corporate Planning Office of an IT venture. I've recently been thinking about changing jobs, so I'm studying programming to make myself more attractive to prospective employers. Simply saying that I'm studying isn't very convincing, so I built a simple web application with Python and Django, deployed it on AWS, and published the source on GitHub.
In this post I'll walk through the Python program I wrote, partly as a way to consolidate what I've learned. I've been studying Python for three months, and this is the first time I've built something like this from scratch. If you have any advice on improving the code, I'd appreciate a comment.
The application scrapes two company review sites plus a listed-company information site and displays the results, in the hope of making it easier to compare companies when changing jobs. I didn't use it much in the end ... orz
The UI was put together quickly with Bootstrap, the framework is Django, and the server runs Amazon Linux on AWS.
UI | Language | Framework | Server |
---|---|---|---|
Bootstrap4 | Python | Django | AWS(Amazon Linux + Nginx) |
・ Enter the company name you want to search for in the search box
・ Fetch the search result pages (HTML) for that company from the target sites
・ Parse the fetched HTML and extract only the required information
  - Company review scores
  - Details such as the number of employees (listed companies only)
・ Display the search results
The target sites are scraped with requests and Beautiful Soup.
searchCompany.py
from operator import itemgetter

import bs4
import mojimoji
import requests

# Add sites to be parsed here
targetSite = ['vorkers', 'hyoban', 'jyoujyou']

class GetHtml:
    """
    Get HTML as text
    """
    # Register the search URL of each site
    def __init__(self, company):
        self.headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0",}
        self.urls = {
            targetSite[0]:'https://www.vorkers.com/company_list?field=&pref=&src_str={}&sort=1&ct=top'.format(company),
            targetSite[1]:'https://en-hyouban.com/search/?SearchWords={}'.format(company),
            # Same host as the detail URL used in parseInfo
            targetSite[2]:'https://xn--vckya7nx51ik9ay55a3l3a.com/searches/result?utf8=✓&search_query={}'.format(company)
        }

    # Get the full HTML of each target page
    def getText(self):
        textList = {}
        for url in self.urls:
            res = requests.get(self.urls[url], headers=self.headers)
            text = bs4.BeautifulSoup(res.text, "html.parser")
            textList.setdefault(url, text)
        return textList
__init__
self.headers: Sets a User-Agent header so that the request looks like it comes from a web browser. If the target site's server detects that a client is scraping, it may block the access, so the request is made to look like ordinary browser traffic.
self.urls: Holds the search page URL of each site to be scraped. The company name entered by the user is inserted as the value of the search parameter in each site's query string (the part after the ?), so that the search results for that company can be fetched from each site.
**(Query string)** A string appended to the end of a URL to describe the information you want to send to the server. It starts with ? and consists of key=value pairs joined by &. When you search for a keyword on a site, the keyword is placed in the query string and sent to the server. The server reads the keyword from that string and sends back the matching information, which the browser then displays.
In other words, if you put the keyword directly into the query string, you can run a search without typing anything into the search box.
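As a rough illustration of the query-string idea above, the sketch below builds a search URL with the standard library's urllib.parse.urlencode, which also percent-encodes spaces and non-ASCII company names. The base URL https://example.com/search, the parameter name q, and the helper name build_search_url are placeholders for illustration, not the endpoints or code of the sites used in this article.

```python
from urllib.parse import urlencode


def build_search_url(base_url, params):
    """Append a percent-encoded query string to base_url."""
    return "{}?{}".format(base_url, urlencode(params))


# Hypothetical endpoint and parameter name, for illustration only.
url = build_search_url("https://example.com/search", {"q": "Soft Bank"})
print(url)  # https://example.com/search?q=Soft+Bank
```

The article's code inserts the company name with str.format instead; requests generally copes with that, but encoding the value explicitly avoids surprises when the name contains characters such as & or #.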
getText
A method that returns a dictionary with each targetSite name as the key and the fetched HTML as the value. Storing the HTML in a dictionary makes it easy to pick out only what the next step needs.
Target sites: OpenWork (formerly Vorkers) / Kaisha no Hyoban (en-hyouban.com)
Required information: review score / company name
searchCompany.py
class ParseHtml:
    """
    Parse HTML to get the required values
    """
    # Get the company name and review score
    def parseNamePoint(self, textList):
        # Tags and classes used for parsing
        nameTag = {
            targetSite[0]:["h3", "fs-18 lh-1o3 p-r"],
            targetSite[1]:["h2", "companyName"],
        }
        pointTag = {
            targetSite[0]:["p", "totalEvaluation_item fs-15 fw-b"],
            targetSite[1]:["span", "point"],
        }
        comNamePoint = {}
        for site in targetSite[:2]:
            try:
                # Get the company name
                parseCname = textList[site].find(nameTag[site][0], class_=nameTag[site][1])
                cname = parseCname.getText().replace('\n','').replace(' ', '')
                # Get the company's review score
                parseCpoint = textList[site].find(pointTag[site][0], class_=pointTag[site][1])
                cpoint = parseCpoint.getText().replace('\n','').replace(' ', '')
            # No search result
            except AttributeError:
                comNamePoint.setdefault(site, ['No results','No results'])
            # Search result found
            else:
                comNamePoint.setdefault(site, [cname, cpoint])
        return comNamePoint
parseNamePoint(self, textList)
nameTag: The HTML tag and class used to get the company name.
pointTag: The HTML tag and class used to get the review score.
**(HTML tag)** The browser renders a web page from the HTML sent by the server, so the necessary information can be located by the tag and class that enclose it.
**for statement** Because the entered company name is simply inserted into the query string of each URL before scraping, some company names produce a page with no search results. That case is handled here with exception handling.
・ Overview of the exception handling
・ When there is a search result, a dictionary keyed by targetSite with the company name and review score as the value is returned.
・ When there is no search result, find returns None, the following getText call raises AttributeError, and a dictionary whose value is 'No results' is returned.
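The no-result handling described above can be sketched on its own as follows. find() returns None when no tag matches, so calling a method on the result raises AttributeError; the helper here checks for None explicitly, which is one alternative to the try/except used in the article rather than the article's own method. The HTML snippets and the helper name extract_text are made up for illustration; only the h2 / companyName pair comes from the code above.

```python
import bs4


def extract_text(soup, tag, css_class, default="No results"):
    """Return the cleaned text of the first matching tag, or default."""
    node = soup.find(tag, class_=css_class)
    if node is None:  # find() returns None when nothing matches
        return default
    return node.get_text().replace("\n", "").replace(" ", "")


# Made-up HTML snippets standing in for a search result page.
hit = bs4.BeautifulSoup('<h2 class="companyName"> Example Inc.\n</h2>', "html.parser")
miss = bs4.BeautifulSoup("<p>nothing found</p>", "html.parser")

print(extract_text(hit, "h2", "companyName"))   # ExampleInc.
print(extract_text(miss, "h2", "companyName"))  # No results
```

Note that removing every space, as the article's code does, also removes spaces inside the name; str.strip() would keep them if that matters.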
Target site: Listed Company Search
Required information: company name / industry / number of employees / average age / average years of service / average wage
searchCompany.py
    # For a listed company, get the company details (method of ParseHtml)
    def parseInfo(self, textList):
        # Tags and classes used for parsing
        cnumberTag = {
            targetSite[2]:['dl', 'well'],
        }
        cinfoTag = {
            targetSite[2]:['dd', 'companies_data']
        }
        comInfo = {}
        # Build the company detail URL from the search result
        try:
            parseCnumber = textList[targetSite[2]].find(cnumberTag[targetSite[2]][0], class_=cnumberTag[targetSite[2]][1])
            cnumber = parseCnumber.getText()
            cname = mojimoji.han_to_zen(cnumber[5:].replace('\n', '').replace(' ', ''))
            detail = 'https://xn--vckya7nx51ik9ay55a3l3a.com/companies/{}'.format(cnumber[:5])
        # No search result
        except AttributeError:
            comInfo.setdefault(targetSite[2], ['No data','','','','','',''])
        # Search result found
        else:
            # Get the HTML of the company detail page
            res = requests.get(detail)
            text = bs4.BeautifulSoup(res.text, "html.parser")
            # Parse the company detail page
            parseCinfo = text.find_all(cinfoTag[targetSite[2]][0], class_=cinfoTag[targetSite[2]][1])
            cinfo = parseCinfo
            # Collect the text of the parsed elements
            cinfoList = []
            for info in cinfo:
                infoText = info.getText().replace('\n', '').replace('\t', '')
                cinfoList.append(infoText)
            # Add the company name
            cinfoList.append(cname)
            # Pad with one empty item so that index 18 exists
            if len(cinfoList) <= 18:
                cinfoList.append('')
            # Select the required fields
            useList = itemgetter(0,10,14,15,16,17,18)(cinfoList)
            comInfo.setdefault(targetSite[2], useList)
        return comInfo
parseInfo(self, textList)
cnumberTag: On Listed Company Search, putting the company name into the query string only returns a search result page, not the page that holds the company details. cnumberTag therefore specifies the HTML tag used to obtain the code that makes up the detail page URL.
cinfoTag: The tag used to get the required information from the company detail page.
**try, except (exception handling)** A search may return no results here as well, so that case is handled with exception handling.
**for statement** Collects the text of each parsed element into a list, so that the method can return a dictionary keyed by targetSite whose value is the list of required information.
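To make the slicing and field selection in parseInfo easier to follow, here is a small standalone sketch. It assumes, as the slices cnumber[:5] and cnumber[5:] in the code above imply, that the search result text starts with a five-character code followed by the company name. The sample string, the field list, and the helper names split_code_and_name and pick_fields are invented placeholders, not data or code from the site or the application.

```python
from operator import itemgetter


def split_code_and_name(text, code_len=5):
    """Split 'CCCCCname' into (code, name), dropping newlines and spaces."""
    code = text[:code_len]
    name = text[code_len:].replace("\n", "").replace(" ", "")
    return code, name


def pick_fields(fields, indexes, size):
    """Pad fields with '' up to size, then return the requested positions."""
    padded = list(fields) + [""] * max(0, size - len(fields))
    return itemgetter(*indexes)(padded)


# Invented sample values, for illustration only.
code, name = split_code_and_name("12345\n Example Inc.")
print(code, name)  # 12345 ExampleInc.

fields = ["f{}".format(i) for i in range(18)]  # one field short of 19
print(pick_fields(fields, (0, 10, 18), size=19))  # ('f0', 'f10', '')
```

Unlike the article's code, which appends a single empty string, this sketch pads up to a fixed size, so itemgetter does not raise IndexError even when several fields are missing.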
searchCompany.py
def main(company):
    aboutCompany = {}
    # Get the URLs and HTML
    getHtml = GetHtml(company)
    text = getHtml.getText()
    urls = getHtml.urls
    # Parse the HTML
    parseHtml = ParseHtml()
    comNamePoint = parseHtml.parseNamePoint(text)
    comInfo = parseHtml.parseInfo(text)
    # Shape the output data
    # Company name and review score
    for site in targetSite[:2]:
        comNamePoint[site].append(urls[site])
    aboutCompany.update(comNamePoint)
    # Detailed company information
    for info in comInfo:
        aboutCompany.setdefault(info, comInfo[info])
    # Search word
    words = mojimoji.han_to_zen(company)
    aboutCompany['searchWord'] = words
    return aboutCompany

if __name__ == "__main__":
    print(main('Softbank'))
main(company)
aboutCompany: The HTML is fetched and parsed with the entered company name as the argument, and the results are stored in a dictionary named aboutCompany and returned. Django then takes that dictionary and renders it into HTML.
mojimoji: An external library that converts half-width characters to full-width. Whether names are written in half-width or full-width varies by site, so they are all converted to full-width.
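For readers without mojimoji installed, the width problem can be illustrated with the standard library. This is a different technique, not what the article's code does: unicodedata.normalize with the NFKC form folds full-width ASCII letters and digits to half-width and half-width katakana to full-width, whereas mojimoji.han_to_zen converts half-width to full-width. The helper name normalize_width is hypothetical.

```python
import unicodedata


def normalize_width(text):
    """Fold width variants into one canonical form via NFKC normalization."""
    return unicodedata.normalize("NFKC", text)


print(normalize_width("ＡＢＣ１２３"))  # ABC123
print(normalize_width("ｿﾌﾄﾊﾞﾝｸ"))  # ソフトバンク
# Two spellings that differ only in width compare equal after normalization.
print(normalize_width("Ｓｏｆｔｂａｎｋ") == normalize_width("Softbank"))  # True
```

Either direction works for comparison purposes, as long as every string is passed through the same conversion before it is displayed or compared.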
That's all for this walkthrough. I finished the program over a month ago, so I had forgotten some of the details and had a hard time recalling them. I'm in no position to talk, but rereading my own code made me realize how painful messy code is to read ...
Next, I'll write about Django and AWS.