Environment: Ubuntu Linux (Xfce)
Chrome
curl -sS https://dl-ssl.google.com/linux/linux_signing_key.pub | sudo apt-key add -
echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" | sudo tee -a /etc/apt/sources.list.d/google-chrome.list
sudo apt-get -y update
sudo apt-get -y install google-chrome-stable
Other dependencies
sudo apt install chromium-chromedriver liblzma-dev \
&& pip install beautifulsoup4 selenium pandas
bs4 provides a wide variety of methods, and by combining them with regular expressions (re) **there is practically nothing you cannot extract**.
The lxml parser is the fastest and supports the widest range of CSS selectors.
from bs4 import BeautifulSoup
html_doc = '<html>...</html>'
soup = BeautifulSoup(html_doc, 'lxml')
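As a small sketch of combining the two, find_all accepts a compiled regular expression (the HTML and pattern below are made-up illustrations):

import re
from bs4 import BeautifulSoup

html = '<a href="/item/12">A</a><a href="/about">B</a>'
soup = BeautifulSoup(html, 'lxml')
# keep only the links whose href matches the pattern
item_links = soup.find_all('a', href=re.compile(r'^/item/\d+'))
print(item_links)  # [<a href="/item/12">A</a>]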
If you forget to quit the driver, leftover browser processes will pile up.
from selenium import webdriver
driver = webdriver.Chrome()
# Close the current window
driver.close()
# Quit the driver and release all browser processes
driver.quit()
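To guarantee this cleanup even when the scraping code raises an exception, a try/finally block works; a minimal sketch:

from selenium import webdriver

driver = webdriver.Chrome()
try:
    driver.get('https://www.example.com/')
    # ... scraping work ...
finally:
    driver.quit()  # always runs, so no browser processes are left behind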
Once Selenium has finished fetching and rendering the page, hand the HTML over to bs4 and hunt for the data.
from bs4 import BeautifulSoup
from selenium.webdriver import Chrome, ChromeOptions
options = ChromeOptions()
options.add_argument('--headless')  # windowless (headless) mode
driver = Chrome(options=options)
url = 'https://www.example.com/'
driver.get(url)
# Selenium operations start here
...
...
...
# Selenium operations end here
html = driver.page_source.encode('utf-8')
soup = BeautifulSoup(html, 'lxml')
# bs4 processing starts here
...
...
...
# bs4 processing ends here
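Inside the Selenium part above you often need to wait for the page to finish rendering before grabbing page_source. A sketch using an explicit wait (the 10-second timeout and the selector are illustrative assumptions):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# block for up to 10 seconds until the element is present in the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'body'))
)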
You can access a tag directly as an attribute of the BeautifulSoup object.
This is convenient when the document contains only a few tags, as in the example below.
from bs4 import BeautifulSoup
html_doc = '''
<html>
<head>
<title>hello soup</title>
</head>
<body>
<p class="my-story">my story</p>
</body>
</html>
'''
soup = BeautifulSoup(html_doc, 'lxml')
print(soup.title)
print(soup.title.text)
print(soup.p)
print(soup.p['class'])
print(soup.p.text)
Execution result
<title>hello soup</title>
hello soup
<p class="my-story">my story</p>
['my-story']
my story
BeautifulSoup has four object types: Tag, NavigableString, BeautifulSoup, and Comment.
Of these, the two used most often are BeautifulSoup and Tag.
BeautifulSoup: converts the HTML source into a Python-friendly format (a tree structure)
Tag: a Tag object is created when certain methods are called on a BeautifulSoup object
You can find almost anything with the find and find_all methods on a BeautifulSoup object, but to search effectively you need to know what each method returns.
**Objects returned by each method**
find → bs4.element.Tag
find_all → bs4.element.ResultSet
**Return value when nothing is found**
find → None
find_all → [] (an empty list)
bs4.element.Tag
You can think of a Tag as what is produced by bs4 methods other than find_all and select (which return a ResultSet) and the BeautifulSoup constructor (which returns a BeautifulSoup object).
from bs4 import BeautifulSoup
html_doc = '''
<html>
<head>
<title>hello soup</title>
</head>
<body>
<p class="my-story">my story</p>
<a class='brother' href='http://example.com/1' id='link1'>Link 1</a>
<a class='brother' href='http://example.com/2' id='link2'>Link 2</a>
<a class='brother' href='http://example.com/3' id='link3'>Link 3</a>
</body>
</html>
'''
soup = BeautifulSoup(html_doc, 'lxml')
print('tag1')
tag1 = soup.find('a')
print(tag1)
print(type(tag1))
print('tag2')
tag2 = soup.a
print(tag2)
print(type(tag2))
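Execution result (running the snippet above should print something like this):

tag1
<a class="brother" href="http://example.com/1" id="link1">Link 1</a>
<class 'bs4.element.Tag'>
tag2
<a class="brother" href="http://example.com/1" id="link1">Link 1</a>
<class 'bs4.element.Tag'>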
bs4.element.ResultSet
Generated by the find_all and select methods.
Picture it as a list filled with bs4.element.Tag objects (**this mental image is quite important**):
bs4.element.ResultSet = [bs4.element.Tag, bs4.element.Tag, bs4.element.Tag, ...]
A ResultSet therefore cannot be searched directly; you first take an element out of the list.
Once extracted, the element supports the same methods as bs4.element.Tag above.
"The method cannot be used!" errors are almost always caused by calling a bs4.element.Tag method on a bs4.element.ResultSet; see the sketch after the next example.
from bs4 import BeautifulSoup
html_doc = '''
<html>
<head>
<title>hello soup</title>
</head>
<body>
<p class="my-story">my story</p>
<a class='brother' href='http://example.com/1' id='link1'>Link 1</a>
<a class='brother' href='http://example.com/2' id='link2'>Link 2</a>
<a class='brother' href='http://example.com/3' id='link3'>Link 3</a>
</body>
</html>
'''
soup = BeautifulSoup(html_doc, 'lxml')
print('tag3')
tag3 = soup.select('a:nth-of-type(2)') # the second <a> element, via a CSS pseudo-class
print(tag3)
print(type(tag3))
print('tag4')
tag4 = soup.select('.brother') # CSS class selector (matches all three links)
print(tag4)
print(type(tag4))
print('tag5')
tag5 = soup.select('a[href]') # <a> tags that have an href attribute
print(tag5)
print(type(tag5))
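A short sketch of the typical mistake and its fix, reusing soup from the example above:

tags = soup.find_all('a')   # bs4.element.ResultSet (a list of Tags)
# tags.get('href')          # AttributeError: ResultSet has no Tag methods
print(tags[0].get('href'))  # take one Tag out of the list first
for tag in tags:            # or simply iterate over the ResultSet
    print(tag['href'])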
If you keep the defaults, printing a large amount of output raises the error `IOPub data rate exceeded.`, so raise the limit.
Create a configuration file
jupyter notebook --generate-config
python:~/.jupyter/jupyter_notebook_config.py
# before change: 1000000 → after change: 1e10
c.NotebookApp.iopub_data_rate_limit = 1e10
Alternatively, pass the limit when starting the notebook:
jupyter notebook --NotebookApp.iopub_data_rate_limit=1e10
Fast because it reads and writes in binary format (the 'b' in the code means binary).
The joblib library offers the same functionality; it is a good choice when you want to reduce file size at the expense of speed.
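A minimal sketch of the joblib variant (the file name is arbitrary; compress trades write speed for file size):

import joblib

example = {'key': 'value'}
joblib.dump(example, 'example.joblib', compress=3)  # smaller file, slower write
restored = joblib.load('example.joblib')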
Write (dump)
import pickle
example = 'example'
with open('example.pickle', 'wb') as f:
pickle.dump(example, f)
Read (load)
with open('example.pickle', 'rb') as f:
example = pickle.load(f)
When you try to pickle a bs4 object (bs4.BeautifulSoup, etc.), you get the error "maximum recursion depth exceeded while pickling an object", so convert it to a string before saving.
dump
import pickle
# example here is a bs4 object such as a BeautifulSoup instance
with open('example.pickle', 'wb') as f:
    pickle.dump(str(example), f)  # convert to str before pickling
load
from bs4 import BeautifulSoup
with open('example.pickle', 'rb') as f:
    example = BeautifulSoup(pickle.load(f), 'lxml')
What you read back is a plain str, which bs4 cannot work with directly.
So convert it back to a bs4 object at load time, as above.
**If the above method doesn't work**
If you cannot pickle something such as a dict, dumping it with json is a good alternative.
dump
import json
example = {'key': 'value'}  # e.g. a dict that would not pickle cleanly
with open('example.json', 'w') as f:
    json.dump(example, f)
load
with open('example.json', 'r') as f:
    example = json.load(f)
Jupyter Notebook
When viewing a pandas DataFrame, the default cell width cuts long text off, so set the cell width to the maximum.
css:~/.jupyter/custom/custom.css
.container { width:100% !important; }
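If you prefer fixing this from pandas rather than CSS, the standard display option below also stops long cell text from being truncated:

import pandas as pd

# None removes the column-width limit entirely
pd.set_option('display.max_colwidth', None)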
Use %time, which is only available in the Jupyter (IPython) environment.
It is a built-in magic command, so no import is required.
How to use
%time example_function()
To get scraping from https://www.example.com/topics/scraping, split the URL on / and take the last element.
code
url = 'https://www.example.com/topics/scraping'
print(url.split('/'))
#['https:', '', 'www.example.com', 'topics', 'scraping']
print(url.split('/')[-1])
#scraping
Pandas
Pandas shows `UserWarning: Could not import the lzma module. Your installed Python is incomplete` when Python was built without the lzma library.
sudo apt install liblzma-dev
(If Python was built from source, e.g. with pyenv, rebuild it after installing the package so the lzma module is compiled in.)
import pandas as pd
# data1, data2, data3 are assumed to be list-like, each of length 3
df = pd.DataFrame({'Column1': data1, 'Column2': data2, 'Column3': data3},
                  index=[1, 2, 3])
# Extract Column3 as a list
col3 = df['Column3'].tolist()
Cells are right-aligned by default, which makes URLs and English text hard to read.
df.style.set_properties(**{'text-align': 'left'}) #Left justified