Environment: Ubuntu Linux (Xfce)
Chrome
curl -sS https://dl-ssl.google.com/linux/linux_signing_key.pub | sudo apt-key add -
echo "deb [arch=amd64] http://dl.google.com/linux/chrome/deb/ stable main" | sudo tee -a /etc/apt/sources.list.d/google-chrome.list
sudo apt-get -y update
sudo apt-get -y install google-chrome-stable
Other dependencies
sudo apt install chromium-chromedriver liblzma-dev \
&& pip install beautifulsoup4 selenium pandas
bs4 provides a wide variety of methods, and by combining them with regular expressions (re) **there is practically nothing you cannot extract**.
The lxml parser is the fastest and supports the widest range of CSS selectors.
from bs4 import BeautifulSoup
html_doc = '<html>...</html>'
soup = BeautifulSoup(html_doc, 'lxml')
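As a small sketch of combining the two, find_all accepts a compiled regular expression (the HTML and pattern below are made-up illustrations):

import re
from bs4 import BeautifulSoup

html = '<a href="/item/12">A</a><a href="/about">B</a>'
soup = BeautifulSoup(html, 'lxml')
# keep only the links whose href matches the pattern
item_links = soup.find_all('a', href=re.compile(r'^/item/\d+'))
print(item_links)  # [<a href="/item/12">A</a>]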
If you forget to quit the driver, leftover browser processes will pile up.
from selenium import webdriver
driver = webdriver.Chrome()
# Close the current window
driver.close()
# Quit the driver and release all browser processes
driver.quit()
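To guarantee this cleanup even when the scraping code raises an exception, a try/finally block works; a minimal sketch:

from selenium import webdriver

driver = webdriver.Chrome()
try:
    driver.get('https://www.example.com/')
    # ... scraping work ...
finally:
    driver.quit()  # always runs, so no browser processes are left behind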
Once Selenium has finished fetching and rendering the page, hand the HTML over to bs4 and hunt for the data.
from bs4 import BeautifulSoup
from selenium.webdriver import Chrome, ChromeOptions
options = ChromeOptions()
options.add_argument('--headless')  # windowless (headless) mode
driver = Chrome(options=options)
url = 'https://www.example.com/'
driver.get(url)
# Selenium operations start here
...
...
...
# Selenium operations end here
html = driver.page_source.encode('utf-8')
soup = BeautifulSoup(html, 'lxml')
# bs4 processing starts here
...
...
...
# bs4 processing ends here
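Inside the Selenium part above you often need to wait for the page to finish rendering before grabbing page_source. A sketch using an explicit wait (the 10-second timeout and the selector are illustrative assumptions):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# block for up to 10 seconds until the element is present in the DOM
WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.CSS_SELECTOR, 'body'))
)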
You can access a tag directly as an attribute of the BeautifulSoup object.
This is convenient when the document contains only a few tags, as in the example below.
from bs4 import BeautifulSoup
html_doc = '''
<html>
<head>
<title>hello soup</title>
</head>
<body>
<p class="my-story">my story</p>
</body>
</html>
'''
soup = BeautifulSoup(html_doc, 'lxml')
print(soup.title)
print(soup.title.text)
print(soup.p)
print(soup.p['class'])
print(soup.p.text)
Execution result
<title>hello soup</title>
hello soup
<p class="my-story">my story</p>
['my-story']
my story
BeautifulSoup has four object types: Tag, NavigableString, BeautifulSoup, and Comment.
Of these, the two used most often are BeautifulSoup and Tag.
BeautifulSoup: converts the HTML source into a Python-friendly format (a tree structure)
Tag: a Tag object is created when certain methods are called on a BeautifulSoup object
You can find almost anything with the find and find_all methods on a BeautifulSoup object, but to search effectively you need to know what each method returns.
**Objects returned by each method**
find → bs4.element.Tag
find_all → bs4.element.ResultSet
**Return value when nothing is found**
find → None
find_all → [] (an empty list)
bs4.element.Tag
You can think of a Tag as what is produced by bs4 methods other than find_all and select (which return a ResultSet) and the BeautifulSoup constructor (which returns a BeautifulSoup object).
from bs4 import BeautifulSoup
html_doc = '''
<html>
<head>
<title>hello soup</title>
</head>
<body>
<p class="my-story">my story</p>
<a class='brother' href='http://example.com/1' id='link1'>Link 1</a>
<a class='brother' href='http://example.com/2' id='link2'>Link 2</a>
<a class='brother' href='http://example.com/3' id='link3'>Link 3</a>
</body>
</html>
'''
soup = BeautifulSoup(html_doc, 'lxml')
print('tag1')
tag1 = soup.find('a')
print(tag1)
print(type(tag1))
print('tag2')
tag2 = soup.a
print(tag2)
print(type(tag2))
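Execution result (running the snippet above should print something like this):

tag1
<a class="brother" href="http://example.com/1" id="link1">Link 1</a>
<class 'bs4.element.Tag'>
tag2
<a class="brother" href="http://example.com/1" id="link1">Link 1</a>
<class 'bs4.element.Tag'>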
bs4.element.ResultSet
Generated by the find_all and select methods.
Picture it as a list filled with bs4.element.Tag objects (**this mental image is quite important**):
bs4.element.ResultSet = [bs4.element.Tag, bs4.element.Tag, bs4.element.Tag, ...]
A ResultSet therefore cannot be searched directly; you first take an element out of the list.
Once extracted, the element supports the same methods as bs4.element.Tag above.
"The method cannot be used!" errors are almost always caused by calling a bs4.element.Tag method on a bs4.element.ResultSet; see the sketch after the next example.
from bs4 import BeautifulSoup
html_doc = '''
<html>
<head>
<title>hello soup</title>
</head>
<body>
<p class="my-story">my story</p>
<a class='brother' href='http://example.com/1' id='link1'>Link 1</a>
<a class='brother' href='http://example.com/2' id='link2'>Link 2</a>
<a class='brother' href='http://example.com/3' id='link3'>Link 3</a>
</body>
</html>
'''
soup = BeautifulSoup(html_doc, 'lxml')
print('tag3')
tag3 = soup.select('a:nth-of-type(2)') # the second <a> element, via a CSS pseudo-class
print(tag3)
print(type(tag3))
print('tag4')
tag4 = soup.select('.brother') # CSS class selector (matches all three links)
print(tag4)
print(type(tag4))
print('tag5')
tag5 = soup.select('a[href]') # <a> tags that have an href attribute
print(tag5)
print(type(tag5))
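A short sketch of the typical mistake and its fix, reusing soup from the example above:

tags = soup.find_all('a')   # bs4.element.ResultSet (a list of Tags)
# tags.get('href')          # AttributeError: ResultSet has no Tag methods
print(tags[0].get('href'))  # take one Tag out of the list first
for tag in tags:            # or simply iterate over the ResultSet
    print(tag['href'])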
If you keep the defaults, printing a large amount of output raises the error `IOPub data rate exceeded.`, so raise the limit.
Create a configuration file
jupyter notebook --generate-config
python:~/.jupyter/jupyter_notebook_config.py
# before change: 1000000 → after change: 1e10
c.NotebookApp.iopub_data_rate_limit = 1e10
Alternatively, pass the limit when starting the notebook:
jupyter notebook --NotebookApp.iopub_data_rate_limit=1e10
Fast because it reads and writes in binary format (the 'b' in the code means binary).
The joblib library offers the same functionality; it is a good choice when you want to reduce file size at the expense of speed.
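A minimal sketch of the joblib variant (the file name is arbitrary; compress trades write speed for file size):

import joblib

example = {'key': 'value'}
joblib.dump(example, 'example.joblib', compress=3)  # smaller file, slower write
restored = joblib.load('example.joblib')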
Write (dump)
import pickle
example = 'example'
with open('example.pickle', 'wb') as f:
pickle.dump(example, f)
Read (load)
with open('example.pickle', 'rb') as f:
example = pickle.load(f)
When you try to pickle a bs4 object (bs4.BeautifulSoup, etc.), you get the error "maximum recursion depth exceeded while pickling an object", so convert it to a string before saving.
dump
import pickle
# example here is a bs4 object such as a BeautifulSoup instance
with open('example.pickle', 'wb') as f:
    pickle.dump(str(example), f)  # convert to str before pickling
load
from bs4 import BeautifulSoup
with open('example.pickle', 'rb') as f:
    example = BeautifulSoup(pickle.load(f), 'lxml')
What you read back is a plain str, which bs4 cannot work with directly.
So convert it back to a bs4 object at load time, as above.
**If the above method doesn't work**
If you cannot pickle something such as a dict, dumping it with json is a good alternative.
dump
import json
example = {'key': 'value'}  # e.g. a dict that would not pickle cleanly
with open('example.json', 'w') as f:
    json.dump(example, f)
load
with open('example.json', 'r') as f:
    example = json.load(f)
Jupyter Notebook
When viewing a pandas DataFrame, the default cell width cuts long text off, so set the cell width to the maximum.
css:~/.jupyter/custom/custom.css
.container { width:100% !important; }
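If you prefer fixing this from pandas rather than CSS, the standard display option below also stops long cell text from being truncated:

import pandas as pd

# None removes the column-width limit entirely
pd.set_option('display.max_colwidth', None)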
Use %time, which is only available in the Jupyter (IPython) environment.
It is a built-in magic command, so no import is required.
How to use
%time example_function()
To get scraping from https://www.example.com/topics/scraping, split the URL on / and take the last element.
code
url = 'https://www.example.com/topics/scraping'
print(url.split('/'))
#['https:', '', 'www.example.com', 'topics', 'scraping']
print(url.split('/')[-1])
#scraping
Pandas
Pandas shows `UserWarning: Could not import the lzma module. Your installed Python is incomplete` when Python was built without the lzma library.
sudo apt install liblzma-dev
(If Python was built from source, e.g. with pyenv, rebuild it after installing the package so the lzma module is compiled in.)
import pandas as pd
# data1, data2, data3 are assumed to be list-like, each of length 3
df = pd.DataFrame({'Column1': data1, 'Column2': data2, 'Column3': data3},
                  index=[1, 2, 3])
# Extract Column3 as a list
col3 = df['Column3'].tolist()
Cells are right-aligned by default, which makes URLs and English text hard to read.
df.style.set_properties(**{'text-align': 'left'}) #Left justified