This is a follow-up to the previous article, "Python programming: I tried to get (crawling) news articles using Selenium and BeautifulSoup4".
After that article, I also needed an overview (business description, officers, shareholders, etc.) of the companies that appear in the news articles.
So, let's implement the process of acquiring company information in English with a Python program. This time, the information source is **Yahoo! Finance**.
- Obtaining the Profile from Yahoo! Finance
In addition, the author has confirmed operation with the following versions.
- How to install and use the Python library
Since there is not much code, I will show all of it. There are two points to note.
First, always implement wait processing (sleep) **so as not to put load on the site being accessed**. Unlike the previous article, this one does not use Selenium, but when pages are fetched in a for loop it is still better to wait between requests so that the program does not issue a burst of HTTP requests per unit of time.
Second, look at the source of each page, identify the target elements from the tag structure, and extract the information with BeautifulSoup4. In most cases, you specify the class attribute attached to a tag to get the target tag (and the text inside it).
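As a small illustration of the second point, the sketch below runs BeautifulSoup4 against an inline HTML string instead of a live page. The markup, the class names, and the helper name `extract_values` are made up for this example; they are not Yahoo! Finance's actual structure.

```python
from bs4 import BeautifulSoup

# Hypothetical markup for illustration only; not Yahoo! Finance's real HTML.
SAMPLE_HTML = """
<div class="profile-box">
  <span class="label">Sector(s)</span><span class="value">Technology</span>
  <span class="label">Industry</span><span class="value">Consumer Electronics</span>
</div>
<div class="other-box"><span class="value">ignored</span></div>
"""


def extract_values(html):
    """Narrow down to a wrapper tag by class, then collect the target tags inside it."""
    soup = BeautifulSoup(html, "html.parser")
    wrapper = soup.find("div", class_="profile-box")
    return [span.text for span in wrapper.find_all("span", class_="value")]


values = extract_values(SAMPLE_HTML)
print(values)
```

Searching inside the wrapper first keeps the `value` span in the unrelated `other-box` div out of the result.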
When you run the code, the output of print() appears on the console.
crawler_yahoo.py
import requests
from bs4 import BeautifulSoup


def getSoup(url):
    # Fetch the page and parse it with the lxml parser
    html = requests.get(url)
    #soup = BeautifulSoup(html.content, "html.parser")
    soup = BeautifulSoup(html.content, "lxml")
    return soup


def getAssetProfile(soup):
    # Basic information: Sector(s), Industry, Full Time Employees
    wrapper = soup.find("div", class_="asset-profile-container")
    paragraph = [element.text for element in wrapper.find_all("span", class_="Fw(600)")]
    return paragraph


def getKeyExecutives(soup):
    # One [name, title, pay] list per row of the executives table
    wrapper = soup.find("section", class_="Bxz(bb) quote-subsection undefined")
    paragraph = []
    for element in wrapper.find_all("tr", class_="C($primaryColor) BdB Bdc($seperatorColor) H(36px)"):
        name = element.find("td", class_="Ta(start)").find("span").text
        title = element.find("td", class_="Ta(start) W(45%)").find("span").text
        pay = element.find("td", class_="Ta(end)").find("span").text
        paragraph.append([name, title, pay])
    return paragraph


def getDescription(soup):
    # Paragraphs of the business description
    wrapper = soup.find("section", class_="quote-sub-section Mt(30px)")
    paragraph = [element.text for element in wrapper.find_all("p", class_="Mt(15px) Lh(1.6)")]
    return paragraph


def getMajorHolders(soup):
    # One [share, heldby] list per row of the major holders summary
    wrapper = soup.find("div", class_="W(100%) Mb(20px)")
    paragraph = []
    for element in wrapper.find_all("tr", class_="BdT Bdc($seperatorColor)"):
        share = element.find("td", class_="Py(10px) Va(m) Fw(600) W(15%)").text
        heldby = element.find("td", class_="Py(10px) Ta(start) Va(m)").find("span").text
        paragraph.append([share, heldby])
    return paragraph


def getTopHolders(soup, category):
    # category selects the table: 'Institutional' (first) or 'MutualFund' (second)
    idx = {'Institutional': 0, 'MutualFund': 1}
    wrapper = soup.find_all("div", class_="Mt(25px) Ovx(a) W(100%)")[idx[category]]
    paragraph = []
    for element in wrapper.find_all("tr", class_="BdT Bdc($seperatorColor) Bgc($hoverBgColor):h Whs(nw) H(36px)"):
        # The first column is the holder name; the remaining columns are right-aligned values
        tmp = [element.find("td", class_="Ta(start) Pend(10px)").text, ]
        tmp.extend([col.text for col in element.find_all("td", class_="Ta(end) Pstart(10px)")])
        paragraph.append(tmp)
    return paragraph
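Note that `soup.find()` returns `None` when nothing matches, so the functions above raise `AttributeError` if the page layout or class names differ from what the code expects. A minimal defensive sketch follows; the helper name `find_texts` and its parameters are hypothetical, and the inline HTML is only a stand-in for a fetched page.

```python
from bs4 import BeautifulSoup


def find_texts(soup, wrapper_tag, wrapper_class, item_tag, item_class):
    """Return the stripped text of matching items, or [] when the wrapper is missing."""
    wrapper = soup.find(wrapper_tag, class_=wrapper_class)
    if wrapper is None:
        # The expected wrapper is not on the page; report "nothing found" instead of crashing
        return []
    return [el.get_text(strip=True) for el in wrapper.find_all(item_tag, class_=item_class)]


# Stand-in markup that reuses two class names from the code above
html = '<div class="asset-profile-container"><span class="Fw(600)"> Technology </span></div>'
soup = BeautifulSoup(html, "html.parser")

found = find_texts(soup, "div", "asset-profile-container", "span", "Fw(600)")
missing = find_texts(soup, "div", "no-such-class", "span", "Fw(600)")
print(found, missing)
```

Returning an empty list is one possible design; raising an explicit error with the missing class name would be an equally reasonable choice when a silent empty result could hide a layout change.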
The usage is shown with Apple (ticker symbol: AAPL), a hot topic because of the iPhone 12, as an example. First, the basic information.
python
soup = getSoup('https://finance.yahoo.com/quote/AAPL/profile?p=AAPL')
profile = getAssetProfile(soup)
print('\r\n'.join(profile))
#profile[0]: Sector(s)
#profile[1]: Industry
#profile[2]: Full Time Employees
Below is the execution result.
python
Technology
Consumer Electronics
147,000
Next is the list of officers (key executives).
python
exs = getKeyExecutives(soup)
#print('\r\n'.join(exs))
for ex in exs:
    print(ex)
#ex[0]: Name
#ex[1]: Title
#ex[2]: Pay
Below is the execution result.
['Mr. Timothy D. Cook', 'CEO & Director', '11.56M']
['Mr. Luca Maestri', 'CFO & Sr. VP', '3.58M']
['Mr. Jeffrey E. Williams', 'Chief Operating Officer', '3.57M']
['Ms. Katherine L. Adams', 'Sr. VP, Gen. Counsel & Sec.', '3.6M']
["Ms. Deirdre O'Brien", 'Sr. VP of People & Retail', '2.69M']
['Mr. Chris Kondo', 'Sr. Director of Corp. Accounting', 'N/A']
['Mr. James Wilson', 'Chief Technology Officer', 'N/A']
['Ms. Mary Demby', 'Chief Information Officer', 'N/A']
['Ms. Nancy Paxton', 'Sr. Director of Investor Relations & Treasury', 'N/A']
['Mr. Greg Joswiak', 'Sr. VP of Worldwide Marketing', 'N/A']
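The pay column comes back as text such as `'11.56M'` or `'N/A'`. If numeric values are needed, a conversion along the following lines is one option. The function name `parse_pay` is hypothetical, and reading the suffixes K, M, and B as thousand, million, and billion is an assumption about the page's notation.

```python
def parse_pay(text):
    """Convert a string such as '11.56M' to a float; return None for 'N/A' or empty text."""
    # Assumed meaning of the suffixes: thousand, million, billion
    multipliers = {"K": 1e3, "M": 1e6, "B": 1e9}
    text = text.strip().replace(",", "")
    if not text or text == "N/A":
        return None
    suffix = text[-1].upper()
    if suffix in multipliers:
        return float(text[:-1]) * multipliers[suffix]
    return float(text)


# Two rows copied from the execution result above
sample = [
    ['Mr. Timothy D. Cook', 'CEO & Director', '11.56M'],
    ['Mr. Chris Kondo', 'Sr. Director of Corp. Accounting', 'N/A'],
]
pays = [parse_pay(ex[2]) for ex in sample]
print(pays)
```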
Next is the business description.
python
desc = getDescription(soup)
print('\r\n'.join(desc))
Below is the execution result.
Apple Inc. designs, manufactures, and markets smartphones, personal computers, tablets, wearables, and accessories worldwide. It also sells various related services. The company offers iPhone, a line of smartphones; Mac, a line of personal computers; iPad, a line of multi-purpose tablets; and wearables, home, and accessories comprising AirPods, Apple TV, Apple Watch, Beats products, HomePod, iPod touch, and other Apple-branded and third-party accessories. It also provides AppleCare support services; cloud services store services; and operates various platforms, including the App Store, that allow customers to discover and download applications and digital content, such as books, music, video, games, and podcasts. In addition, the company offers various services, such as Apple Arcade, a game subscription service; Apple Music, which offers users a curated listening experience with on-demand radio stations; Apple News+, a subscription news and magazine service; Apple TV+, which offers exclusive original content; Apple Card, a co-branded credit card; and Apple Pay, a cashless payment service, as well as licenses its intellectual property. The company serves consumers, and small and mid-sized businesses; and the education, enterprise, and government markets. It sells and delivers third-party applications for its products through the App Store. The company also sells its products through its retail and online stores, and direct sales force; and third-party cellular network carriers, wholesalers, retailers, and resellers. Apple Inc. was founded in 1977 and is headquartered in Cupertino, California.
From here, the URL changes to the page with shareholder information. First is the summary.
python
soup = getSoup('https://finance.yahoo.com/quote/AAPL/holders?p=AAPL')
holders = getMajorHolders(soup)
for holder in holders:
    print(holder)
#holder[0]: share
#holder[1]: heldby
Below is the execution result.
['0.07%', '% of Shares Held by All Insider']
['62.12%', '% of Shares Held by Institutions']
['62.16%', '% of Float Held by Institutions']
['4,296', 'Number of Institutions Holding Shares']
Next is shareholder information (institutional holders).
python
topholders = getTopHolders(soup, 'Institutional')
for holder in topholders:
    print(holder)
#holder[0]: Holder
#holder[1]: Shares
#holder[2]: Date Reported
#holder[3]: % Out
#holder[4]: Value
Below is the execution result.
['Vanguard Group, Inc. (The)', '1,315,961,000', 'Jun 29, 2020', '7.69%', '120,015,643,200']
['Blackrock Inc.', '1,101,824,048', 'Jun 29, 2020', '6.44%', '100,486,353,177']
['Berkshire Hathaway, Inc', '980,622,264', 'Jun 29, 2020', '5.73%', '89,432,750,476']
['State Street Corporation', '709,057,472', 'Jun 29, 2020', '4.15%', '64,666,041,446']
['FMR, LLC', '383,300,188', 'Jun 29, 2020', '2.24%', '34,956,977,145']
['Geode Capital Management, LLC', '251,695,416', 'Jun 29, 2020', '1.47%', '22,954,621,939']
['Price (T.Rowe) Associates Inc', '233,087,540', 'Jun 29, 2020', '1.36%', '21,257,583,648']
['Northern Trust Corporation', '214,144,092', 'Jun 29, 2020', '1.25%', '19,529,941,190']
['Norges Bank Investment Management', '187,425,092', 'Dec 30, 2019', '1.10%', '13,759,344,566']
['Bank Of New York Mellon Corporation', '171,219,584', 'Jun 29, 2020', '1.00%', '15,615,226,060']
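Every column in these rows is a string. For sorting or aggregation, each row can be converted to typed values, for example as below. The function name `parse_holder_row` and the dictionary keys are hypothetical; the column order follows the comments in the code above, and the date format is inferred from the output shown.

```python
from datetime import datetime


def parse_holder_row(row):
    """Convert one [Holder, Shares, Date Reported, % Out, Value] row to typed values."""
    holder, shares, reported, pct_out, value = row
    return {
        "holder": holder,
        "shares": int(shares.replace(",", "")),
        # %b matches abbreviated month names in the current locale (English by default)
        "date_reported": datetime.strptime(reported, "%b %d, %Y").date(),
        "pct_out": float(pct_out.rstrip("%")),
        "value": int(value.replace(",", "")),
    }


# First row of the execution result above
row = ['Vanguard Group, Inc. (The)', '1,315,961,000', 'Jun 29, 2020', '7.69%', '120,015,643,200']
parsed = parse_holder_row(row)
print(parsed)
```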
Next is shareholder information (mutual fund holders).
python
topholders = getTopHolders(soup, 'MutualFund')
for holder in topholders:
    print(holder)
#holder[0]: Holder
#holder[1]: Shares
#holder[2]: Date Reported
#holder[3]: % Out
#holder[4]: Value
Below is the execution result.
['Vanguard Total Stock Market Index Fund', '444,698,584', 'Jun 29, 2020', '2.60%', '40,556,510,860']
['Vanguard 500 Index Fund', '338,116,248', 'Jun 29, 2020', '1.98%', '30,836,201,817']
['SPDR S&P 500 ETF Trust', '169,565,200', 'Sep 29, 2020', '0.99%', '19,637,345,812']
['Invesco ETF Tr-Invesco QQQ Tr, Series 1 ETF', '155,032,988', 'Aug 30, 2020', '0.91%', '20,005,456,771']
['Fidelity 500 Index Fund', '145,557,920', 'Aug 30, 2020', '0.85%', '18,782,793,996']
['Vanguard Institutional Index Fund-Institutional Index Fund', '143,016,840', 'Jun 29, 2020', '0.84%', '13,043,135,808']
['iShares Core S&P 500 ETF', '123,444,255', 'Sep 29, 2020', '0.72%', '14,296,079,171']
['Vanguard Growth Index Fund', '123,245,072', 'Jun 29, 2020', '0.72%', '11,239,950,566']
['Vanguard Information Technology Index Fund', '79,770,560', 'Aug 30, 2020', '0.47%', '10,293,593,062']
['Select Sector SPDR Fund-Technology', '69,764,960', 'Sep 29, 2020', '0.41%', '8,079,480,017']
As shown, the information displayed in the web browser can be obtained properly. If you collect information on various companies, you may be able to see, for example, a list of the companies in which a famous individual investor appears as a shareholder, or some kind of trend.
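Collecting holders for several companies and then looking them up by holder name could be sketched as follows. This is only an outline under assumptions: `collect_holders`, `invert`, the one-second default wait, and the sample data are all hypothetical, and the network access is replaced by a stand-in function so that the example runs offline.

```python
import time
from collections import defaultdict


def collect_holders(tickers, fetch_holders, wait_seconds=1.0, sleep=time.sleep):
    """Fetch holder names per ticker, waiting between consecutive requests."""
    holders_by_ticker = {}
    for i, ticker in enumerate(tickers):
        if i > 0:
            sleep(wait_seconds)  # pause so that requests are not sent in a burst
        holders_by_ticker[ticker] = fetch_holders(ticker)
    return holders_by_ticker


def invert(holders_by_ticker):
    """Map each holder name to the sorted list of tickers it appears in."""
    tickers_by_holder = defaultdict(list)
    for ticker, holders in holders_by_ticker.items():
        for holder in holders:
            tickers_by_holder[holder].append(ticker)
    return {holder: sorted(tickers) for holder, tickers in tickers_by_holder.items()}


# Made-up data standing in for the pages; real use would fetch and parse each page instead
FAKE_PAGES = {"AAA": ["Holder X", "Holder Y"], "BBB": ["Holder Y"]}
waits = []  # records the waits instead of actually sleeping in this demo

result = invert(collect_holders(["AAA", "BBB"], FAKE_PAGES.__getitem__, sleep=waits.append))
print(result)
```

In real use, `fetch_holders` could wrap `getSoup` and `getTopHolders` from this article and return the first column of each row, and the default `time.sleep` would be left in place so that the wait actually happens.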
This article introduced how to acquire (crawl) company information from **Yahoo! Finance** using BeautifulSoup4.