I've been practicing scraping for the past few days. There were a few things I couldn't do at first but eventually figured out, so I'm writing them up in this article.
- I wanted to scrape the text and the link-destination URL that appear together in a table structure, as a set (using a pandas DataFrame).
- The table contained multiple a hrefs for the link destinations, and none of them had an identifiable name, so it was hard to pick out the right one even with a regular expression. → I decided to use XPath, since it lets me specify the text of a link and then scrape the href of that exact link (see the sketch after this list). A DataFrame raises an error if the columns have different numbers of rows, so I wanted to drop unneeded data and capture the values reliably.
- BeautifulSoup cannot use XPath, but it can be done with lxml.
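As a minimal sketch of that XPath idea (the HTML, link text, and URLs here are made up for illustration), lxml can select the href of exactly the a tag whose text matches, which is the part BeautifulSoup alone cannot do:

#coding: utf-8
import lxml.html

# Made-up table fragment with two links in the same row
sample = '<table><tr><td class="xx">item1</td>' \
         '<td><a href="/a/1">detail</a><a href="/b/1">other</a></td></tr></table>'
dom = lxml.html.fromstring(sample)
# Select only the href of the <a> whose text is "detail"
print(dom.xpath(u"//a[text()='detail']/@href"))  # ['/a/1']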
[Sites I referred to]
http://gci.t.u-tokyo.ac.jp/tutorial/crawling/
http://www.slideshare.net/tushuhei/python-xpath
http://qiita.com/tamonoki/items/a341657a86ff7a945224
scraping.py
#coding: utf-8
from bs4 import BeautifulSoup
import urllib2
import pandas as pd
import time
import lxml.html

aaa = []  # cell texts taken with BeautifulSoup
bbb = []  # href values taken with XPath

for page in range(1, 2):
    url = "http://www.~~~" + str(page)
    # Fetch the page twice: one response for BeautifulSoup, one for lxml
    html = urllib2.urlopen(url)
    html2 = urllib2.urlopen(url)
    soup = BeautifulSoup(html, "lxml")
    dom = lxml.html.fromstring(html2.read())
    # Collect the text of the target table cells
    for o1 in soup.findAll("td", class_="xx"):
        aaa.append(o1.string)
    # Get href by specifying the link text for the xxx part
    for o2 in dom.xpath(u"//a[text()='xxx']/@href"):
        bbb.append(o2)
    time.sleep(2)  # wait between pages to avoid hammering the server

df = pd.DataFrame({"aaa": aaa, "bbb": bbb})
print(df)
df.to_csv("xxxx.csv", index=False, encoding='utf-8')
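One caveat mentioned above: pandas raises an error when the lists passed to DataFrame have different lengths. A hypothetical safeguard, not part of the script above, is to check the lengths before building the DataFrame (the values here are made up):

import pandas as pd

aaa = ["item1", "item2"]   # made-up cell texts
bbb = ["/a/1"]             # one href missing, so the lengths differ
if len(aaa) == len(bbb):
    df = pd.DataFrame({"aaa": aaa, "bbb": bbb})
    print(df)
else:
    print("length mismatch: %d texts vs %d links" % (len(aaa), len(bbb)))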
It's a simple script, but that's it for today.