I've been programming in Python for about three months, on weekday nights and weekends, and I'm still having fun.
What I did recently
1. Morphological analysis
・I wanted to get a feel for the flow of feeding data into MeCab, narrowing the results down to nouns only, calculating frequencies, then adding a user dictionary and running it again, so I gave it a quick try.
・Since it was finished right away, I won't describe the details.
[Sites I referred to]
http://qiita.com/fantm21/items/d3d44f7d86f09acda86f
http://qiita.com/naoyu822/items/473756fb8e8bbdc4d734
http://www.mwsoft.jp/programming/munou/mecab_command.html
http://shimz.me/blog/d3-js/2711
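The noun-filtering and frequency-counting step described above can be sketched with just the standard library. This is a minimal sketch assuming MeCab's default output format (surface form, a tab, then comma-separated features starting with the part of speech); MeCab itself is not invoked here, and the sample string only imitates its output.

```python
from collections import Counter

def noun_frequencies(mecab_output):
    """Count noun frequencies from text in MeCab's default output
    format: one token per line, 'surface<TAB>features', where the
    first feature is the part of speech."""
    counts = Counter()
    for line in mecab_output.splitlines():
        if line == "EOS" or "\t" not in line:
            continue  # skip sentence terminators and malformed lines
        surface, features = line.split("\t", 1)
        if features.split(",")[0] == "名詞":  # keep nouns only
            counts[surface] += 1
    return counts

# Sample text in MeCab's output format (hand-written, not real MeCab output)
sample = "\n".join([
    "Python\t名詞,固有名詞,*,*,*,*,*",
    "を\t助詞,格助詞,*,*,*,*,を",
    "勉強\t名詞,サ変接続,*,*,*,*,勉強",
    "する\t動詞,自立,*,*,*,*,する",
    "EOS",
])
print(noun_frequencies(sample).most_common())
```

With a user dictionary added, MeCab's output format stays the same, so the same counting code works unchanged on the re-tokenized text.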
2. Scraping
・Scraping text and images comes up very often in my work, so I wanted to study it properly; this time I started with a book: https://www.amazon.co.jp/dp/4873117615
・First, I learned that Python + Beautiful Soup can quickly handle a single page with an easy-to-understand structure.
・Next, I found that sites generated with JavaScript are difficult with that combination, but that PhantomJS and CasperJS exist, and by writing the scraper in JS I could handle those sites quickly as well.
・After that, I found that even in Python you can scrape JS-generated sites with the combination of Selenium + PhantomJS.
・When I tried to write the final code's pandas DataFrame out to CSV, I got stuck on a UnicodeEncodeError, but I solved it by specifying the encoding where the DataFrame is converted to CSV.
[Sites I referred to]
http://doz13189.hatenablog.com/entry/2016/08/21/154219
http://zipsan.hatenablog.jp/entry/20150413/1428861548
http://qiita.com/okadate/items/7b9620a5e64b4e906c42
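The first step above, Python + Beautiful Soup on a simply structured page, can be sketched in isolation. This is a minimal sketch with the HTML supplied inline as a stand-in for a fetched page; the class name "hoge" just echoes the code below, and bs4 is assumed to be installed.

```python
from bs4 import BeautifulSoup

# A stand-in for a fetched page (driver.page_source or a requests
# response) with the kind of simple structure the post describes.
html = """
<html><body>
  <h3 class="hoge">First title</h3>
  <h3 class="hoge">Second title</h3>
  <div class="hoge">Some body text</div>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")  # html.parser needs no extra install
titles = [h3.string for h3 in soup.find_all("h3", class_="hoge")]
texts = [div.get_text() for div in soup.find_all("div", class_="hoge")]
print(titles)  # ['First title', 'Second title']
print(texts)   # ['Some body text']
```

For a static page this is all there is to it; the JS-generated case is what needs Selenium + PhantomJS, as in the full script below.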
I just combined code from the sites I referred to by copy-pasting, but I ended up with the following.
scraping.py
```python
from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd
import time

aaa = []
bbb = []
ccc = []

driver = webdriver.PhantomJS()  # create the driver once, not once per page
for page in range(1, 2):  # set the page limit as appropriate
    driver.get("https://www.~~=page=" + str(page))
    data = driver.page_source.encode('utf-8')
    soup = BeautifulSoup(data, "lxml")
    for o in soup.findAll("h3", class_="hoge"):  # I see this a lot, but why does everyone call it hoge?
        aaa.append(o.string)
    for o1 in soup.findAll("h3", class_="hoge"):  # why hoge?
        bbb.append(o1.string)
    for o2 in soup.findAll("div", class_="hoge"):  # what...?
        ccc.append(o2.get_text())
    time.sleep(3)  # wait between pages to be polite to the server
driver.quit()

df = pd.DataFrame({"aaa": aaa, "bbb": bbb, "ccc": ccc})
print(df)
df.to_csv("hogehoge.csv", index=False, encoding='utf-8')  # explicit encoding avoids UnicodeEncodeError
```
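The UnicodeEncodeError fix mentioned earlier can be seen on its own: if the platform's default encoding can't represent Japanese text, writing the DataFrame fails unless an encoding is passed to to_csv. This is a minimal sketch; the temporary-directory path and sample data are just for illustration.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"aaa": ["データ"], "bbb": ["収集"]})

# Passing encoding='utf-8' explicitly avoids the UnicodeEncodeError that
# can occur when the platform default encoding can't represent Japanese.
path = os.path.join(tempfile.mkdtemp(), "hogehoge.csv")
df.to_csv(path, index=False, encoding="utf-8")

with open(path, encoding="utf-8") as f:
    print(f.read())
```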
There are still many places I'm not sure about, but it worked for the time being.
I will continue to study.