Introduction

HTML tables can be scraped in a few lines using pandas' pd.read_html(), but this time I would like to show you how to scrape a table without using read_html().

Preparation

Install Beautiful Soup. (This time, we will also use requests to fetch the page and pandas to create a data frame, so install them as appropriate.)

$ pip install beautifulsoup4  # or conda install

Policy

As an example, let's get the list of CPUs from the Wikipedia page on transistor counts (https://en.wikipedia.org/wiki/Transistor_count).

Reference

For reference, here is the very easy method that uses pd.read_html().

import pandas as pd

url = 'https://en.wikipedia.org/wiki/Transistor_count'  # Target web page url
dfs = pd.read_html(url)  # If the web page has multiple tables, they are stored in dfs as a list

Here the target table is stored at index 1 of dfs, so let's output dfs[1] (dfs[0] stores a table of another class).

dfs[1]

The output is the table as a data frame, so it can certainly be scraped this way.

Overview

Before scraping the table with BeautifulSoup, let's take a look at the web page to be scraped. Open the Wikipedia page from the link shown earlier and open the developer tools (in Chrome, right-click the table ⇒ Inspect, or press option + command + I). Looking at the HTML source of the page with the developer tools, the target table sits under the <table> tag with the hierarchy <tbody> (table body) ⇒ <tr> (table row component) ⇒ <td> (table cell data). Under <tbody> there is also a <th> tag at the same level as the <td> tag; it corresponds to the column name part of the table (Processor ~ MOS process).
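To make this hierarchy concrete, here is a minimal sketch that parses a small hand-written table and walks from the table down to its header and data cells. The HTML string is made up for illustration and is much simpler than the real Wikipedia markup.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature table that mimics the table/tbody/tr/th/td hierarchy.
SAMPLE_HTML = """
<table class="wikitable">
  <tbody>
    <tr><th>Processor</th><th>Designer</th></tr>
    <tr><td>Intel 4004</td><td>Intel</td></tr>
  </tbody>
</table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
tbody = soup.find("table").tbody

# The direct children of <tbody> are the <tr> rows.
rows = tbody.find_all("tr", recursive=False)
header_cells = [th.name for th in rows[0].find_all("th")]
data_cells = [td.name for td in rows[1].find_all("td")]

print(len(rows))      # number of <tr> rows
print(header_cells)   # tag names found in the header row
print(data_cells)     # tag names found in the first data row
```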

Code

Let's write the code with the above overview in mind.

import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd

url = "https://en.wikipedia.org/wiki/Transistor_count"
# Get web page data
page = requests.get(url)
# Parse html
soup = BeautifulSoup(page.text, 'html.parser')
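If the request fails, the code above would silently parse an error page. As a hedged sketch, the fetch can be wrapped so that HTTP errors are raised and the request cannot hang forever. The function names, the timeout value, and the User-Agent string are assumptions made for this example, not part of the original code.

```python
import requests
from bs4 import BeautifulSoup

URL = "https://en.wikipedia.org/wiki/Transistor_count"
# Hypothetical identifier; replace it with one that describes your own script.
HEADERS = {"User-Agent": "table-scraping-example/0.1"}


def parse_html(html: str) -> BeautifulSoup:
    """Parse an HTML string with Python's built-in parser."""
    return BeautifulSoup(html, "html.parser")


def fetch_soup(url: str = URL, timeout: float = 10.0) -> BeautifulSoup:
    """Download a page and parse it, raising on HTTP errors."""
    response = requests.get(url, headers=HEADERS, timeout=timeout)
    response.raise_for_status()  # raises requests.HTTPError for 4xx/5xx responses
    return parse_html(response.text)
```

Calling soup = fetch_soup() would then play the same role as the two-step version above.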

Let's take a look at the parsed data.

print(soup.prettify())

As shown below, the output goes on at length and shows the same hierarchical structure seen in the developer tools.

`output`


<!DOCTYPE html>
<html class="client-nojs" dir="ltr" lang="en">
 <head>
  <meta charset="utf-8"/>
  <title>
   Transistor count - Wikipedia
  </title>
  <script>

・ ・ ・

Let's extract the part that corresponds to the table. Use the find() method to extract the relevant part by specifying the <table> tag and the wikitable class.

table = soup.find('table', {'class':'wikitable'}).tbody

You might think that you don't necessarily need to specify the table class, but you should specify it if the page contains tables of other classes. On this page, a table with another class name, box-More, also exists, so the wikitable class is specified explicitly.
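The effect of specifying the class can be seen on a small example. The sketch below uses a made-up page with two tables; the class_ keyword and select_one() are alternative spellings of the same lookup in Beautiful Soup.

```python
from bs4 import BeautifulSoup

# Hypothetical page with two tables of different classes.
SAMPLE_HTML = """
<table class="box-More"><tbody><tr><td>notice</td></tr></tbody></table>
<table class="wikitable sortable"><tbody><tr><td>data</td></tr></tbody></table>
"""

soup = BeautifulSoup(SAMPLE_HTML, "html.parser")

first_any = soup.find("table")                     # first table, whatever its class
by_class = soup.find("table", class_="wikitable")  # first table whose classes include "wikitable"
by_css = soup.select_one("table.wikitable")        # same lookup written as a CSS selector

print(first_any.td.text)
print(by_class.td.text)
print(by_css.td.text)
```

Without the class, find() returns the first table on the page, which is not the one we want here.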

Then, in the extracted table body, get the <tr> tag parts (the row components of the table). The following find_all('tr') stores each row component in list format.


rows = table.find_all('tr')

Let's look at the 0th element of the retrieved row component.


print(rows[0])

As shown below, there is an additional hierarchy of <th> tags, and you can see that these correspond to the header part of the table.

`output`



<tr>
<th><a href="/wiki/Microprocessor" title="Microprocessor">Processor</a>
</th>
<th data-sort-type="number"><a class="mw-redirect" href="/wiki/MOS_transistor" title="MOS transistor">MOS transistor</a> count
</th>
<th>Date of<br/>introduction
</th>
<th>Designer
</th>
<th data-sort-type="number"><a href="/wiki/MOSFET" title="MOSFET">MOS</a><br/><a href="/wiki/Semiconductor_device_fabrication" title="Semiconductor device fabrication">process</a>
</th>
<th data-sort-type="number">Area
</th></tr>

Next, let's look at the element after the 0th one (index 1) of the acquired row components.


print(rows[1])

As you can see, there is an additional hierarchy of <td> tags, which correspond to the data component of each cell in the first row of the table.

`output`



<tr>
<td><a class="mw-redirect" href="/wiki/MP944" title="MP944">MP944</a> (20-bit, <i>6-chip</i>)
</td>
<td><i><b>?</b></i>
</td>
<td>1970<sup class="reference" id="cite_ref-F-14_20-1"><a href="#cite_note-F-14-20">[20]</a></sup><sup class="reference" id="cite_ref-22"><a href="#cite_note-22">[a]</a></sup>
</td>
<td><a href="/wiki/Garrett_AiResearch" title="Garrett AiResearch">Garrett AiResearch</a>
</td>
<td><i><b>?</b></i>
</td>
<td><i><b>?</b></i>
</td></tr>
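Since header rows contain <th> tags and data rows contain <td> tags, the two kinds of rows can be told apart programmatically. This is a small sketch on a made-up two-row table, not the method used in the rest of this article.

```python
from bs4 import BeautifulSoup

# Hypothetical two-row table: one header row and one data row.
SAMPLE_HTML = """
<table><tbody>
<tr><th>Processor</th><th>Designer</th></tr>
<tr><td>MP944</td><td>Garrett AiResearch</td></tr>
</tbody></table>
"""

rows = BeautifulSoup(SAMPLE_HTML, "html.parser").find_all("tr")

# A row is treated as a header row if it contains a <th>, as a data row if it contains a <td>.
header_rows = [row for row in rows if row.find("th") is not None]
data_rows = [row for row in rows if row.find("td") is not None]

print(len(header_rows), len(data_rows))
```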

Creating a data frame

Next, create a data frame from the extracted data. Let's start with the column names of the data frame. Get all the <th> tags, which are the header components, from the 0th row of the table, and extract only the text component (v.text).


columns = [v.text for v in rows[0].find_all('th')]
print(columns)

The result is as follows, but the \n indicating a line break gets in the way.

`output`


['Processor\n', 'MOS transistor count\n', 'Date ofintroduction\n', 'Designer\n', 'MOSprocess\n', 'Area\n']

So let's modify the above code as follows.


columns = [v.text.replace('\n', '') for v in rows[0].find_all('th')]
print(columns)

The result is as follows. Only the column names are extracted, cleanly.

`output`


['Processor', 'MOS transistor count', 'Date ofintroduction', 'Designer', 'MOSprocess', 'Area']
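In the output above, 'Date ofintroduction' and 'MOSprocess' have lost the space where the header contained a <br/> tag. One possible alternative, shown here only as a sketch, is get_text() with a separator and strip=True, which joins the text pieces with a space and trims the newlines in one step. The header row below is a simplified copy of the real one with the link attributes removed.

```python
from bs4 import BeautifulSoup

# Simplified header row (link attributes removed for brevity).
HEADER_HTML = """
<tr>
<th><a>Processor</a>
</th>
<th><a>MOS transistor</a> count
</th>
<th>Date of<br/>introduction
</th>
<th><a>MOS</a><br/><a>process</a>
</th>
</tr>
"""

row = BeautifulSoup(HEADER_HTML, "html.parser").tr
# Join the text fragments of each header cell with a single space and strip whitespace.
columns = [th.get_text(" ", strip=True) for th in row.find_all("th")]
print(columns)
```

Note that the separator is inserted between every pair of text fragments, so this may add unwanted spaces in cells where inline tags are followed directly by punctuation.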

Now, let's prepare an empty data frame by specifying the above column names.


df = pd.DataFrame(columns=columns)
df

The result is an empty data frame: only the column names appear in the header part.

Now that we have extracted the columns, let's extract each data component of the table.

# Loop over every row component
for row in rows:
    # Get all <td> tags (cell data) in the row as a list
    tds = row.find_all('td')
    # Skip rows whose number of cells does not match the number of columns (header row, blank rows, etc.)
    if len(tds) == len(columns):
        # Store the text component of every cell in the row in values
        values = [td.text.replace('\n', '').replace('\xa0', ' ') for td in tds]
        # Add values as a new row of the data frame (DataFrame.append was removed in pandas 2.0)
        df.loc[len(df)] = values
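Growing a data frame one row at a time copies data repeatedly. As an alternative sketch, the rows can be collected in a plain list and the data frame built once at the end; this also works on pandas versions where DataFrame.append is no longer available (it was removed in pandas 2.0). The miniature table and the name records are assumptions made for this example.

```python
import pandas as pd
from bs4 import BeautifulSoup

# Hypothetical miniature table; the last row is missing a cell on purpose.
SAMPLE_HTML = """
<table><tbody>
<tr><th>Processor
</th><th>Process
</th></tr>
<tr><td>Intel 4004
</td><td>10,000&nbsp;nm
</td></tr>
<tr><td>incomplete row
</td></tr>
</tbody></table>
"""

rows = BeautifulSoup(SAMPLE_HTML, "html.parser").find_all("tr")
columns = [th.text.replace("\n", "") for th in rows[0].find_all("th")]

records = []
for row in rows:
    tds = row.find_all("td")
    if len(tds) == len(columns):  # skips the header row and the incomplete row
        records.append([td.text.replace("\n", "").replace("\xa0", " ") for td in tds])

# Build the data frame in a single step instead of growing it row by row.
df = pd.DataFrame(records, columns=columns)
print(df)
```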

Let's output the created data frame.

df

The result is a data frame containing the rows of the table. I was able to scrape the table cleanly with Beautiful Soup.

By the way, if td.text.replace('\n', '').replace('\xa0', ' ') above is replaced with just td.text, the values will be as follows (one example of values is shown).

`output`


['Intel 4004 (4-bit, 16-pin)\n', '2,250\n', '1971\n', 'Intel\n', '10,000\xa0nm\n', '12\xa0mm²\n']

As with the header, the line feed code \n and the no-break space code \xa0 are included. Therefore, each must be replaced with the replace() method.
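The two replace() calls can also be folded into one small helper. The sketch below is an assumption-laden variant rather than the code used in this article: clean_cell is a hypothetical name, and the optional removal of bracketed markers such as [20] or [a] (which appear in cells like the 1970 date shown earlier) goes beyond what the original code does.

```python
import re

_CITATION = re.compile(r"\[[^\]]*\]")  # matches bracketed markers such as [20] or [a]


def clean_cell(text: str, drop_citations: bool = True) -> str:
    """Normalize whitespace in a cell and optionally remove citation markers."""
    if drop_citations:
        text = _CITATION.sub("", text)
    # str.split() with no arguments splits on any whitespace, including \n and \xa0.
    return " ".join(text.split())


print(clean_cell("10,000\xa0nm\n"))
print(clean_cell("1970[20][a]\n"))
```

Be aware that the pattern removes any bracketed text, so it should be switched off for tables whose cells legitimately contain square brackets.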

Save the created data frame in csv format as appropriate.

# No index column, tab specified as the delimiter
df.to_csv('processor.csv', index=False, sep='\t' )
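To check that the tab-separated output can be read back, a round trip can be sketched as follows. An in-memory buffer stands in for processor.csv so that the example does not touch the file system, and the one-row data frame is sample data taken from the values shown earlier.

```python
import io

import pandas as pd

df = pd.DataFrame(
    [["Intel 4004 (4-bit, 16-pin)", "2,250", "1971"]],
    columns=["Processor", "MOS transistor count", "Date ofintroduction"],
)

buffer = io.StringIO()  # in-memory stand-in for processor.csv
df.to_csv(buffer, index=False, sep="\t")

buffer.seek(0)
# dtype=str keeps values such as "2,250" and "1971" as text instead of guessing numeric types.
restored = pd.read_csv(buffer, sep="\t", dtype=str)
print(restored)
```

Because the cells contain commas (for example 2,250), a tab delimiter keeps the file easy to inspect by eye, and sep='\t' has to be passed again when reading.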

Code summary

import requests
from bs4 import BeautifulSoup
import csv
import pandas as pd

url = 'https://en.wikipedia.org/wiki/Transistor_count'
page = requests.get(url)

soup = BeautifulSoup(page.text, 'html.parser')
table = soup.find('table', {'class':'wikitable'}).tbody

rows = table.find_all('tr')
columns = [v.text.replace('\n', '') for v in rows[0].find_all('th')]

df = pd.DataFrame(columns=columns)

for row in rows:
    tds = row.find_all('td')

    if len(tds) == len(columns):
        values = [td.text.replace('\n', '').replace('\xa0', ' ') for td in tds]
        # DataFrame.append was removed in pandas 2.0, so add the row with .loc instead
        df.loc[len(df)] = values

df.to_csv('processor.csv', index=False, sep='\t' )
