I've been practicing scraping for the past few days. There were a few things I couldn't do at first but eventually figured out, so I'm writing them up in this article.
- I wanted to scrape the text and the link-destination URL that appear together in a table structure, as a set (using a pandas DataFrame).
- The table contained multiple a hrefs for the link destinations, and none of them had an identifiable name, so it was hard to pick out the right one even with a regular expression. → I decided to use XPath, since it lets me specify the text of a link and then scrape the href of that exact link (see the sketch after this list). A DataFrame raises an error if the columns have different numbers of rows, so I wanted to drop unneeded data and capture the values reliably.
- BeautifulSoup cannot use XPath, but it can be done with lxml.
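As a minimal sketch of that XPath idea (the HTML, link text, and URLs here are made up for illustration), lxml can select the href of exactly the a tag whose text matches, which is the part BeautifulSoup alone cannot do:

#coding: utf-8
import lxml.html

# Made-up table fragment with two links in the same row
sample = '<table><tr><td class="xx">item1</td>' \
         '<td><a href="/a/1">detail</a><a href="/b/1">other</a></td></tr></table>'
dom = lxml.html.fromstring(sample)
# Select only the href of the <a> whose text is "detail"
print(dom.xpath(u"//a[text()='detail']/@href"))  # ['/a/1']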
[Sites I referred to]
http://gci.t.u-tokyo.ac.jp/tutorial/crawling/
http://www.slideshare.net/tushuhei/python-xpath
http://qiita.com/tamonoki/items/a341657a86ff7a945224
scraping.py
#coding: utf-8
from bs4 import BeautifulSoup
import urllib2
import pandas as pd
import time
import lxml.html

aaa = []  # cell texts taken with BeautifulSoup
bbb = []  # href values taken with XPath

for page in range(1, 2):
    url = "http://www.~~~" + str(page)
    # Fetch the page twice: one response for BeautifulSoup, one for lxml
    html = urllib2.urlopen(url)
    html2 = urllib2.urlopen(url)
    soup = BeautifulSoup(html, "lxml")
    dom = lxml.html.fromstring(html2.read())
    # Collect the text of the target table cells
    for o1 in soup.findAll("td", class_="xx"):
        aaa.append(o1.string)
    # Get href by specifying the link text for the xxx part
    for o2 in dom.xpath(u"//a[text()='xxx']/@href"):
        bbb.append(o2)
    time.sleep(2)  # wait between pages to avoid hammering the server

df = pd.DataFrame({"aaa": aaa, "bbb": bbb})
print(df)
df.to_csv("xxxx.csv", index=False, encoding='utf-8')
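One caveat mentioned above: pandas raises an error when the lists passed to DataFrame have different lengths. A hypothetical safeguard, not part of the script above, is to check the lengths before building the DataFrame (the values here are made up):

import pandas as pd

aaa = ["item1", "item2"]   # made-up cell texts
bbb = ["/a/1"]             # one href missing, so the lengths differ
if len(aaa) == len(bbb):
    df = pd.DataFrame({"aaa": aaa, "bbb": bbb})
    print(df)
else:
    print("length mismatch: %d texts vs %d links" % (len(aaa), len(bbb)))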
It's a simple script, but that's it for today.