There are times when you need to retrieve data online, through an API or by scraping, and export it to CSV. When that happens I find myself referring back to articles I posted before, but since the material was scattered across several of them, I'm summarizing it in one place. I usually reach for Python or Ruby in these cases, so this article describes my personal approach in those two languages.

The earlier articles:

- Get the upcoming weather from python weather api
- Topic model by LDA with gensim ~ Thinking about user's taste from Qiita tags ~
- How to use Rails scraping method Mechanize
- Notes for handling Ruby CSV
This article is explained mainly through code. It covers:

- Getting data with `requests` and `BeautifulSoup` in Python and converting it to CSV
- Getting data with `Mechanize` in Ruby and converting it to CSV
Python
urllib2
In the following article I wrote earlier, I used `urllib2` to fetch the data, as shown in the code below: Get the upcoming weather from python weather api

I was using Python 2 at the time, so this is Python 2 code. Note that the `urllib2` library changed in Python 3:

> The urllib2 module has been split into urllib.request and urllib.error in Python 3. The 2to3 tool will automatically fix the source code imports. (http://docs.python.jp/2/library/urllib2.html)
```python
import urllib2, sys
import json

try: citycode = sys.argv[1]
except: citycode = '460010'  # Default region

# Fetch the JSON from the weather API
resp = urllib2.urlopen('http://weather.livedoor.com/forecast/webservice/json/v1?city=%s' % citycode).read()
# Convert the JSON data that was read into a dictionary
resp = json.loads(resp)

print '**************************'
print resp['title']
print '**************************'
print resp['description']['text']
for forecast in resp['forecasts']:
    print '**************************'
    print forecast['dateLabel'] + '(' + forecast['date'] + ')'
    print forecast['telop']
print '**************************'
```
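Incidentally, the same fetch in Python 3 could be written with `urllib.request`, as the doc note above describes. A minimal, untested sketch:

```python
import urllib.request
import json

citycode = '460010'  # Default region
url = 'http://weather.livedoor.com/forecast/webservice/json/v1?city=%s' % citycode
# urlopen returns bytes in Python 3, so decode before parsing
resp = json.loads(urllib.request.urlopen(url).read().decode('utf-8'))
print(resp['title'])
```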
requests
In Python 3, I now use `requests`. Rewritten, the code looks like this:
```python
import requests, sys

try: citycode = sys.argv[1]
except: citycode = '460010'  # Default region

resp = requests.get('http://weather.livedoor.com/forecast/webservice/json/v1?city=%s' % citycode)
resp = resp.json()  # Parse the JSON body into a dictionary

print('**************************')
print(resp['title'])
print('**************************')
print(resp['description']['text'])
for forecast in resp['forecasts']:
    print('**************************')
    print(forecast['dateLabel'] + '(' + forecast['date'] + ')')
    print(forecast['telop'])
print('**************************')
```
You can check the details in the documentation; I'm glad the Requests documentation is written so carefully: Requests: HTTP for Humans. For a guide on how to use it, the following article is helpful: How to use Requests (Python Library).
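One convenience worth noting: rather than formatting the query string by hand as above, `requests` can build it from a dict via the `params` argument. A minimal sketch:

```python
import requests

# params builds the ?city=... query string for us
resp = requests.get('http://weather.livedoor.com/forecast/webservice/json/v1',
                    params={'city': '460010'})
print(resp.status_code)       # HTTP status code
print(resp.json()['title'])   # Parsed JSON body
```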
Once again, I'll use `requests` to fetch the data. The code below scrapes the names of Japanese actors and actresses from Wikipedia. I use `BeautifulSoup` to parse the retrieved HTML; it's convenient because it can handle XML as well. In other words, scraping in Python comes down to `requests` plus `BeautifulSoup`. With `BeautifulSoup`, I find it easiest to select elements via the `select` method with a CSS selector.
```python
import requests
from bs4 import BeautifulSoup
import csv
import time

base_url = 'https://en.wikipedia.org/wiki/'
url_list = ['List_of_Japanese_actors', 'List_of_Japanese_actresses']

for i in range(len(url_list)):
    target_url = base_url + url_list[i]
    target_html = requests.get(target_url).text
    soup = BeautifulSoup(target_html, 'html.parser')
    # Pick out the linked names with a CSS selector
    names = soup.select('#mw-content-text > h2 + ul > li > a')
    for k, name in enumerate(names):
        print(name.get_text())
    time.sleep(1)  # Be polite: wait a second between requests
    print('scraping page: ' + str(i + 1))
```
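Since I mentioned that `BeautifulSoup` can also handle XML, here is a minimal sketch with made-up sample data; the `'xml'` parser assumes the `lxml` package is installed:

```python
from bs4 import BeautifulSoup

xml = '<actors><actor>Hiroshi Abe</actor><actor>Jin Akanishi</actor></actors>'
soup = BeautifulSoup(xml, 'xml')  # XML mode requires lxml
for actor in soup.find_all('actor'):
    print(actor.get_text())
```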
For more information, see the Beautiful Soup Documentation. For a rough overview, see Scraping with Python and Beautiful Soup.
Now, let's write the actor and actress names above to CSV. It's easy with the `csv` library.
```python
import requests
from bs4 import BeautifulSoup
import csv
import time

base_url = 'https://en.wikipedia.org/wiki/'
url_list = ['List_of_Japanese_actors', 'List_of_Japanese_actresses']

all_names = []
for i in range(len(url_list)):
    target_url = base_url + url_list[i]
    target_html = requests.get(target_url).text
    soup = BeautifulSoup(target_html, 'html.parser')
    names = soup.select('#mw-content-text > h2 + ul > li > a')
    for k, name in enumerate(names):
        all_names.append(name.get_text())
    time.sleep(1)
    print('scraping page: ' + str(i + 1))

f = open('all_names.csv', 'w')
writer = csv.writer(f, lineterminator='\n')
writer.writerow(['name'])  # Header row
for name in all_names:
    writer.writerow([name])
f.close()
```
all_names.csv

```
name
Hiroshi Abe
Abe Tsuyoshi
Osamu Adachi
Jin Akanishi
...
```
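As a side note, the Python 3 `csv` docs recommend opening the file with `newline=''`, and a `with` block closes it for you; here is an equivalent sketch of the writing step above:

```python
# Equivalent writing step; newline='' is the idiom the csv docs recommend
with open('all_names.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name'])
    for name in all_names:
        writer.writerow([name])
```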
How to use the `csv` library is neatly summarized in the following article: Reading and writing CSV with Python

That article uses `open` for reading CSV, which is perfectly fine, but I recommend `pandas`, since it's quite common to use it with subsequent analysis in mind.
```python
import csv

with open('all_names.csv', 'r') as f:
    reader = csv.reader(f)
    header = next(reader)  # Skip the header row
    for row in reader:
        print(row)
```
```python
import pandas as pd

df = pd.read_csv('all_names.csv')
```
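From there the data is easy to work with; for example:

```python
print(df.head())             # Peek at the first few rows
names = df['name'].tolist()  # Back to a plain list when needed
```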
Ruby
In Ruby, I use `Mechanize`. Here we parse the JSON received via `Mechanize`, doing the same thing as the Python weather API example above.
```ruby
require 'mechanize'
require 'json'

citycode = '460010'

agent = Mechanize.new
page = agent.get("http://weather.livedoor.com/forecast/webservice/json/v1?city=#{citycode}")
data = JSON.parse(page.body)

puts '**************************'
puts data['title']
puts '**************************'
puts data['description']['text']
data['forecasts'].each do |forecast|
  puts '**************************'
  puts "#{forecast['dateLabel']}(#{forecast['date']})"
  puts forecast['telop']
end
puts '**************************'
```
As a bonus, you could also use something like httparty (jnunemaker/httparty), but `Mechanize` will suffice.

For the basics, I think the following article is sufficient: How to use Rails scraping method Mechanize
As shown below, use `get` to fetch the data, the `search` method to extract the relevant parts, and `inner_text` or `get_attribute` to pull out the text and attributes.
```ruby
require 'mechanize'

agent = Mechanize.new
page = agent.get("http://qiita.com")
elements = page.search('li a')
elements.each do |ele|
  puts ele.inner_text
  puts ele.get_attribute(:href)
end
```
This time, with a concrete usage example, I'll introduce fetching data via the POST method, which the article above doesn't cover.
The Oracle of Bacon is a site that returns an actor's "Bacon number" when you enter their name. This is a digression from the main topic, but the Bacon number is how many steps through co-stars it takes to get from a given actor to the actor Kevin Bacon. It's interesting to think about in connection with [Six Degrees of Separation](https://ja.wikipedia.org/wiki/%E5%85%AD%E6%AC%A1%E3%81%AE%E9%9A%94%E3%81%9F%E3%82%8A). As of 2011, the average number of links separating any two Facebook users worldwide was reportedly 4.74, which shows that the world is surprisingly small.
Since the Python code above already produced a CSV of Japanese actor and actress names, let's look up the Bacon number for each of them and write the results to CSV.
The CSV of actors and actresses looks like this:

all_names.csv

```
name
Hiroshi Abe
Abe Tsuyoshi
Osamu Adachi
Jin Akanishi
...
```
Below is the code. The point is how to use `post` in `Mechanize`. Also, the Bacon number I wanted couldn't simply be pulled out of the HTML (it was untagged text), so I extracted it with a regular expression. Reference: How to use Ruby regular expressions

Handling CSV is described in Notes for handling Ruby CSV. Since `CSV.open` can be used in the same way as `File.open`, that's what I used here.
```ruby
require 'mechanize'
require 'csv'
require 'kconv'

def get_bacon_num_to(person)
  agent = Mechanize.new
  # POST the form fields instead of building a query string
  page = agent.post('http://oracleofbacon.org/movielinks.php', { a: 'Kevin Bacon', b: person })
  main_text = page.at('#main').inner_text.toutf8
  match_result = main_text.match(/has a Bacon number of ([0-9]+)/)
  bacon_number = 0
  if match_result.nil?
    puts "#{person}: Not found."
  else
    bacon_number = match_result[1]
    puts "#{person}: #{bacon_number}"
  end
  return bacon_number
end

people = CSV.read('all_names.csv', headers: true)
CSV.open("result.csv", 'w') do |file|
  people.each do |person|
    num = get_bacon_num_to(person['name'])
    file << [person['name'], num]
    sleep(1)  # Pause between requests
  end
end
```
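For comparison, the same POST request could be sketched in Python with `requests.post`, reusing the form fields `a` and `b` from the Ruby code above (a sketch, assuming the site's form hasn't changed):

```python
import re
import requests

def get_bacon_num_to(person):
    # Form fields a and b mirror the Mechanize post above
    resp = requests.post('http://oracleofbacon.org/movielinks.php',
                         data={'a': 'Kevin Bacon', 'b': person})
    match = re.search(r'has a Bacon number of ([0-9]+)', resp.text)
    return 0 if match is None else match.group(1)

print(get_bacon_num_to('Hiroshi Abe'))
```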
There are various other approaches, but I think the tools introduced here can handle most cases. By all means, give them a try!