Web scraping with Python: first steps

This article is for beginners to web scraping with Python 3 and BeautifulSoup4.

I referred to earlier articles, but some of their code displayed warnings or did not work because of version differences, so I have summarized the process again here.

Overview

The basic process of web scraping is as follows.

① Get the web page. ② Parse the acquired page into elements and extract the parts you need. ③ Save the results in a database.

Use requests for ① (getting the web page) and BeautifulSoup4 for ② (parsing and extraction). Since ③ differs depending on the environment, it is not covered in this article.

Preparation

After installing Python 3, use the pip command to install the three packages requests, lxml and BeautifulSoup4.

$ pip install requests 
$ pip install lxml
$ pip install beautifulsoup4

Program execution

Create the following script file.

`sample.py`


import requests
from bs4 import BeautifulSoup

target_url = 'http://example.co.jp'   # example.co.jp is a fictitious domain; change it to any URL
r = requests.get(target_url)          # Download the page with requests
soup = BeautifulSoup(r.text, 'lxml')  # Parse the HTML with the lxml parser

for a in soup.find_all('a'):
    print(a.get('href'))              # Print the href of each link (None if the tag has no href)

Start a command prompt and execute the following command.

$ python sample.py

If the links on the page are printed to the console, it worked!

BeautifulSoup methods

Here are some useful BeautifulSoup methods and attributes.
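As a variation on sample.py, the sketch below moves the link extraction into a function so that it can be tried without network access. The function name `extract_links`, the sample HTML, and the use of the standard-library 'html.parser' instead of 'lxml' are assumptions made for illustration; they are not part of the original script. It also skips a tags that have no href and resolves relative links against a base URL with `urllib.parse.urljoin`.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def extract_links(html, base_url):
    """Return the href of every <a> tag, resolved against base_url (hypothetical helper)."""
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    for a in soup.find_all('a'):
        href = a.get('href')
        if href is not None:  # skip <a> tags that have no href attribute
            links.append(urljoin(base_url, href))
    return links


# Made-up HTML standing in for r.text from sample.py
sample_html = (
    '<a href="/about">About</a>'
    '<a name="top">Top</a>'
    '<a href="http://example.com/">Example</a>'
)
links = extract_links(sample_html, 'http://example.co.jp')
for link in links:
    print(link)
```

In a real run, the `html` argument would be the `r.text` value obtained with requests.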

soup.a.string                 # Get the text of the first a tag
soup.a.attrs                  # Get all attributes of the first a tag as a dict
soup.a.parent                 # Get the parent element of the first a tag

soup.find('a')                # Returns the first matching element
soup.find_all(id='log')       # Returns a list of all elements with id="log"

soup.select('head > title')   # Returns a list of elements matching the CSS selector

BeautifulSoup has many other methods. For details, refer to the official documentation: https://www.crummy.com/software/BeautifulSoup/bs4/doc/
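To see what the methods above return, here is a minimal sketch run against a small inline page. The HTML and the variable names are made up for illustration, and 'html.parser' (the parser in the standard library) is used so the example does not depend on lxml.

```python
from bs4 import BeautifulSoup

# Made-up page used only for this illustration
html = """
<html><head><title>Sample page</title></head>
<body>
<div id="log"><a href="/first">First link</a></div>
<a href="/second">Second link</a>
</body></html>
"""
soup = BeautifulSoup(html, 'html.parser')

first_a = soup.a                        # the first <a> tag, same element as soup.find('a')
print(first_a.string)                   # text inside the first <a> tag
print(first_a.attrs)                    # attributes of the tag as a dict
print(first_a.parent.name)              # tag name of the parent element

log_elements = soup.find_all(id='log')  # list of elements whose id is "log"
print(len(log_elements))

titles = soup.select('head > title')    # list of elements matching the CSS selector
print(titles[0].string)
```

Note that `find_all` and `select` return lists even when only one element matches, while `find` returns a single element (or None).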

Narrow down the elements

Regular expressions from the re module are convenient for narrowing down the target elements.

import re

soup.find_all('a', href=re.compile("^http"))      # Links whose href starts with http

soup.find_all('a', href=re.compile("^(?!http)"))  # Links whose href does not start with http (negative lookahead)

# string= is the current name of the older text= argument
soup.find_all('a', string=re.compile("N"), title=re.compile("W"))  # a tags whose text contains N and whose title contains W
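The three filters can be checked against a small inline page, as in the sketch below. The HTML and variable names are invented for illustration, 'html.parser' is used instead of lxml, and the text filter is written with `string=`, the current name for the older `text=` argument.

```python
import re

from bs4 import BeautifulSoup

# Made-up links used only for this illustration
html = """
<a href="http://example.com/news" title="World">News</a>
<a href="/local" title="Weather">Now</a>
<a href="https://example.com/top" title="Top">Home</a>
"""
soup = BeautifulSoup(html, 'html.parser')

# href starts with http (this also matches https)
absolute = [a['href'] for a in soup.find_all('a', href=re.compile("^http"))]

# href does not start with http
relative = [a['href'] for a in soup.find_all('a', href=re.compile("^(?!http)"))]

# text contains N and title contains W
n_and_w = [a['href'] for a in soup.find_all('a', string=re.compile("N"), title=re.compile("W"))]

print(absolute)
print(relative)
print(n_and_w)
```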

Manipulating strings

The following string operations are useful to remember when scraping.

・ Remove whitespace before and after a string

"  abc  ".strip()
→abc

・ Split a string

"a,b,c".split(',')
→['a', 'b', 'c']

・ Search within a string

"abcde".find('c') #Returns the index of the first occurrence, or -1 if it is not found.
→2

・ Character replacement

"abcdc".replace('c', 'x')
→abxdx
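These operations are often chained when cleaning scraped text. The following sketch applies them to a made-up string; the sample text and variable names are assumptions for illustration only.

```python
raw = "  Python, BeautifulSoup4 , requests  "  # made-up text standing in for scraped content

cleaned = raw.strip()                            # remove leading and trailing whitespace
parts = [p.strip() for p in cleaned.split(',')]  # split on commas, then trim each piece
position = cleaned.find('requests')              # index where 'requests' starts
missing = cleaned.find('lxml')                   # -1 because 'lxml' does not appear
replaced = cleaned.replace(',', ' /')            # replace every comma

print(parts)
print(cleaned[position:])
print(missing)
print(replaced)
```

Stripping each piece after `split` avoids the stray spaces that remain when the separator is followed by whitespace.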

Referenced articles

http://qiita.com/itkr/items/513318a9b5b92bd56185