Learning record (3rd day) #CSS selector description method #BeautifulSoup scraping

What I studied

How to write CSS selectors

Below are 10 ways to write CSS selectors. Each one extracts the Eurasia element from continents.html.

<ul id="continents">
    <li id="au">Australia</li>
    <li id="na">NorthAmerica</li>
    <li id="sa">SouthAmerica</li>
    <li id="ea">Eurasia</li>
    <li id="af">Africa</li>
</ul>
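
from bs4 import BeautifulSoup

# Parse the HTML file shown above; the with block closes the file afterwards
with open("continents.html", encoding="utf-8") as fp:
    soup = BeautifulSoup(fp, 'html.parser')

# Helper: print the text of the first element that matches selector q
sel = lambda q: print(soup.select_one(q).string)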
sel("#ea")   # (1)
sel("li#ea")   # (2)
sel("ul > li#ea")   # (3)
sel("#continents #ea")   # (4)
sel("#continents > #ea")   # (5)
sel("ul#continents > li#ea")   # (6)
sel("li[id='ea']")   # (7)
sel("li:nth-of-type(4)")   # (8)

print(soup.select("li")[3].string)   # (9)
print(soup.find_all("li")[3].string)   # (10)

(1) Extract the element whose id attribute is ea.
(2) Extract the <li> element whose id attribute is ea.
(3) Extract (2), additionally requiring that it sit directly below a <ul> tag.
(4) Extract the element with id attribute ea anywhere below (as a descendant of) the element with id attribute continents.
(5) Extract the element with id attribute ea directly below (as a child of) the element with id attribute continents.
(6) Extract the <li> element with id attribute ea directly below the <ul> element with id attribute continents.
(7) Extract the <li> element whose id attribute is ea, using an attribute selector.
(8) Extract the 4th <li> element among its siblings.
(9) Use select() to get all <li> elements and take index [3] (counting from 0, that is, the 4th).
(10) Use find_all() to get all <li> elements and take index [3] (counting from 0, that is, the 4th).
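In continents.html, (4) and (5) return the same element because the <li> is a direct child of the <ul>. The difference only shows up with deeper nesting. Here is a minimal sketch using a made-up HTML snippet (the <div> wrapper and the id regions are assumptions for illustration, not part of continents.html):

```python
from bs4 import BeautifulSoup

# Hypothetical HTML: <li id="ea"> is a grandchild of #continents,
# not a direct child.
html = """
<div id="continents">
  <ul id="regions">
    <li id="ea">Eurasia</li>
  </ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Descendant combinator (space): matches at any depth below #continents
descendant = soup.select_one("#continents #ea")

# Child combinator (>): matches only direct children of #continents
child = soup.select_one("#continents > #ea")

print(descendant.string)  # Eurasia
print(child)              # None
```

With this nesting, only the descendant form finds the element; the child form finds nothing and select_one() returns None.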

    Execution result

    Eurasia
    Eurasia
    Eurasia
    Eurasia
    Eurasia
    Eurasia
    Eurasia
    Eurasia
    Eurasia
    Eurasia
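One thing to watch out for: select_one() returns None when nothing matches, so the sel helper above would raise AttributeError on a selector with no match. A slightly more defensive variant could look like the sketch below. The function name sel_text and its default parameter are my own choices for illustration, and the HTML is inlined so the snippet runs by itself.

```python
from bs4 import BeautifulSoup

html = '<ul id="continents"><li id="ea">Eurasia</li></ul>'
soup = BeautifulSoup(html, "html.parser")

def sel_text(soup, query, default=None):
    """Return the text of the first match, or default if nothing matches."""
    element = soup.select_one(query)
    if element is None:
        return default
    return element.get_text(strip=True)

print(sel_text(soup, "#ea"))               # Eurasia
print(sel_text(soup, "#xx", "not found"))  # not found
```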

    Scraping with Beautiful Soup

    Here is a summary of the functions used for scraping.

    • find() method, find_all() method

    Extract elements by specifying a tag name or arbitrary attributes. The find() method returns the first matching element (or None if there is no match), and the find_all() method returns all matching elements at once as a list.

    Example of use

     title = soup.find(id="title")  # Get the element whose id attribute is title
     links = soup.find_all("a")  # Get all <a> elements
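To see what the two methods return, here is a small self-contained sketch. The HTML snippet and the href values are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML used only for this illustration
html = """
<h1 id="title">Continents</h1>
<a href="/au">Australia</a>
<a href="/ea">Eurasia</a>
"""
soup = BeautifulSoup(html, "html.parser")

title = soup.find(id="title")      # first match, a single Tag
links = soup.find_all("a")         # list-like ResultSet of every <a>
missing = soup.find(id="nothing")  # no match: find() returns None

print(title.string)                # Continents
print([a.string for a in links])   # ['Australia', 'Eurasia']
print([a["href"] for a in links])  # ['/au', '/ea']
print(missing)                     # None
```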
    
    • select_one() method, select() method

    Specify a CSS selector as the argument and get the matching elements. The select_one() method returns the first matching element, and the select() method returns all matching elements as a list. The usage example is as in sel-continents.py above.
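As far as I can tell, Beautiful Soup's selector methods are select_one() for a single element and select() for a list of elements. The sketch below contrasts the two on the same list as continents.html, inlined as a string so that it runs without the file.

```python
from bs4 import BeautifulSoup

html = """
<ul id="continents">
    <li id="au">Australia</li>
    <li id="na">NorthAmerica</li>
    <li id="sa">SouthAmerica</li>
    <li id="ea">Eurasia</li>
    <li id="af">Africa</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

first = soup.select_one("#continents > li")  # first matching Tag only
items = soup.select("#continents > li")      # list of all matching Tags
names = [li.string for li in items]
no_match = soup.select("li#none")            # no match: empty list

print(first.string)   # Australia
print(names)          # ['Australia', 'NorthAmerica', 'SouthAmerica', 'Eurasia', 'Africa']
print(len(no_match))  # 0
```

Note that select() gives an empty list when nothing matches, while select_one() gives None, so the two need different checks.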

    Summary

    I understand how scraping works, but I often get stuck on Python grammar itself, so I want to keep in mind that the underlying goal is to understand Python grammar.
