There are times when you need to retrieve data online, through an API or by scraping, and export it to CSV. When that happens I find myself referring back to articles I posted before, but since the material was scattered across several of them, I'm summarizing it in one place. I usually reach for Python or Ruby in these cases, so this article describes my personal approach in those two languages.

The earlier articles:

- Get the upcoming weather from python weather api
- Topic model by LDA with gensim ~ Thinking about user's taste from Qiita tags ~
- How to use Rails scraping method Mechanize
- Notes for handling Ruby CSV
This article is explained mainly through code. It covers:

- Getting data with `requests` and `BeautifulSoup` in Python and converting it to CSV
- Getting data with `Mechanize` in Ruby and converting it to CSV
Python
urllib2
In the following article I wrote earlier, I used `urllib2` to fetch the data, as shown in the code below: Get the upcoming weather from python weather api

I was using Python 2 at the time, so this is Python 2 code. Note that the `urllib2` library changed in Python 3:

> The urllib2 module has been split into urllib.request and urllib.error in Python 3. The 2to3 tool will automatically fix the source code imports. (http://docs.python.jp/2/library/urllib2.html)
```python
import urllib2, sys
import json

try: citycode = sys.argv[1]
except: citycode = '460010'  # Default region

# Fetch the JSON from the weather API
resp = urllib2.urlopen('http://weather.livedoor.com/forecast/webservice/json/v1?city=%s' % citycode).read()
# Convert the JSON data that was read into a dictionary
resp = json.loads(resp)

print '**************************'
print resp['title']
print '**************************'
print resp['description']['text']
for forecast in resp['forecasts']:
    print '**************************'
    print forecast['dateLabel'] + '(' + forecast['date'] + ')'
    print forecast['telop']
print '**************************'
```
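Incidentally, the same fetch in Python 3 could be written with `urllib.request`, as the doc note above describes. A minimal, untested sketch:

```python
import urllib.request
import json

citycode = '460010'  # Default region
url = 'http://weather.livedoor.com/forecast/webservice/json/v1?city=%s' % citycode
# urlopen returns bytes in Python 3, so decode before parsing
resp = json.loads(urllib.request.urlopen(url).read().decode('utf-8'))
print(resp['title'])
```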
requests
In Python 3, I now use `requests`. Rewritten, the code looks like this:
```python
import requests, sys

try: citycode = sys.argv[1]
except: citycode = '460010'  # Default region

resp = requests.get('http://weather.livedoor.com/forecast/webservice/json/v1?city=%s' % citycode)
resp = resp.json()  # Parse the JSON body into a dictionary

print('**************************')
print(resp['title'])
print('**************************')
print(resp['description']['text'])
for forecast in resp['forecasts']:
    print('**************************')
    print(forecast['dateLabel'] + '(' + forecast['date'] + ')')
    print(forecast['telop'])
print('**************************')
```
You can check the details in the documentation; I'm glad the Requests documentation is written so carefully: Requests: HTTP for Humans. For a guide on how to use it, the following article is helpful: How to use Requests (Python Library).
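One convenience worth noting: rather than formatting the query string by hand as above, `requests` can build it from a dict via the `params` argument. A minimal sketch:

```python
import requests

# params builds the ?city=... query string for us
resp = requests.get('http://weather.livedoor.com/forecast/webservice/json/v1',
                    params={'city': '460010'})
print(resp.status_code)       # HTTP status code
print(resp.json()['title'])   # Parsed JSON body
```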
Once again, I'll use `requests` to fetch the data. The code below scrapes the names of Japanese actors and actresses from Wikipedia. I use `BeautifulSoup` to parse the retrieved HTML; it's convenient because it can handle XML as well. In other words, scraping in Python comes down to `requests` plus `BeautifulSoup`. With `BeautifulSoup`, I find it easiest to select elements via the `select` method with a CSS selector.
```python
import requests
from bs4 import BeautifulSoup
import csv
import time

base_url = 'https://en.wikipedia.org/wiki/'
url_list = ['List_of_Japanese_actors', 'List_of_Japanese_actresses']

for i in range(len(url_list)):
    target_url = base_url + url_list[i]
    target_html = requests.get(target_url).text
    soup = BeautifulSoup(target_html, 'html.parser')
    # Pick out the linked names with a CSS selector
    names = soup.select('#mw-content-text > h2 + ul > li > a')
    for k, name in enumerate(names):
        print(name.get_text())
    time.sleep(1)  # Be polite: wait a second between requests
    print('scraping page: ' + str(i + 1))
```
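Since I mentioned that `BeautifulSoup` can also handle XML, here is a minimal sketch with made-up sample data; the `'xml'` parser assumes the `lxml` package is installed:

```python
from bs4 import BeautifulSoup

xml = '<actors><actor>Hiroshi Abe</actor><actor>Jin Akanishi</actor></actors>'
soup = BeautifulSoup(xml, 'xml')  # XML mode requires lxml
for actor in soup.find_all('actor'):
    print(actor.get_text())
```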
For more information, see the Beautiful Soup Documentation. For a rough overview, see Scraping with Python and Beautiful Soup.
Now, let's write the actor and actress names above to CSV. It's easy with the `csv` library.
```python
import requests
from bs4 import BeautifulSoup
import csv
import time

base_url = 'https://en.wikipedia.org/wiki/'
url_list = ['List_of_Japanese_actors', 'List_of_Japanese_actresses']

all_names = []
for i in range(len(url_list)):
    target_url = base_url + url_list[i]
    target_html = requests.get(target_url).text
    soup = BeautifulSoup(target_html, 'html.parser')
    names = soup.select('#mw-content-text > h2 + ul > li > a')
    for k, name in enumerate(names):
        all_names.append(name.get_text())
    time.sleep(1)
    print('scraping page: ' + str(i + 1))

f = open('all_names.csv', 'w')
writer = csv.writer(f, lineterminator='\n')
writer.writerow(['name'])  # Header row
for name in all_names:
    writer.writerow([name])
f.close()
```
all_names.csv

```
name
Hiroshi Abe
Abe Tsuyoshi
Osamu Adachi
Jin Akanishi
...
```
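As a side note, the Python 3 `csv` docs recommend opening the file with `newline=''`, and a `with` block closes it for you; here is an equivalent sketch of the writing step above:

```python
# Equivalent writing step; newline='' is the idiom the csv docs recommend
with open('all_names.csv', 'w', newline='') as f:
    writer = csv.writer(f)
    writer.writerow(['name'])
    for name in all_names:
        writer.writerow([name])
```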
How to use the `csv` library is neatly summarized in the following article: Reading and writing CSV with Python

That article uses `open` for reading CSV, which is perfectly fine, but I recommend `pandas`, since it's quite common to use it with subsequent analysis in mind.
```python
import csv

with open('all_names.csv', 'r') as f:
    reader = csv.reader(f)
    header = next(reader)  # Skip the header row
    for row in reader:
        print(row)
```
```python
import pandas as pd

df = pd.read_csv('all_names.csv')
```
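From there the data is easy to work with; for example:

```python
print(df.head())             # Peek at the first few rows
names = df['name'].tolist()  # Back to a plain list when needed
```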
Ruby
In Ruby, I use `Mechanize`. Here we parse the JSON received via `Mechanize`, doing the same thing as the Python weather API example above.
```ruby
require 'mechanize'
require 'json'

citycode = '460010'

agent = Mechanize.new
page = agent.get("http://weather.livedoor.com/forecast/webservice/json/v1?city=#{citycode}")
data = JSON.parse(page.body)

puts '**************************'
puts data['title']
puts '**************************'
puts data['description']['text']
data['forecasts'].each do |forecast|
  puts '**************************'
  puts "#{forecast['dateLabel']}(#{forecast['date']})"
  puts forecast['telop']
end
puts '**************************'
```
As a bonus, you could also use something like httparty (jnunemaker/httparty), but `Mechanize` will suffice.

For the basics, I think the following article is sufficient: How to use Rails scraping method Mechanize
As shown below, use `get` to fetch the data, the `search` method to extract the relevant parts, and `inner_text` or `get_attribute` to pull out the text and attributes.
```ruby
require 'mechanize'

agent = Mechanize.new
page = agent.get("http://qiita.com")
elements = page.search('li a')
elements.each do |ele|
  puts ele.inner_text
  puts ele.get_attribute(:href)
end
```
This time, with a concrete usage example, I'll introduce fetching data via the POST method, which the article above doesn't cover.
The Oracle of Bacon is a site that returns an actor's "Bacon number" when you enter their name. This is a digression from the main topic, but the Bacon number is how many steps through co-stars it takes to get from a given actor to the actor Kevin Bacon. It's interesting to think about in connection with [Six Degrees of Separation](https://ja.wikipedia.org/wiki/%E5%85%AD%E6%AC%A1%E3%81%AE%E9%9A%94%E3%81%9F%E3%82%8A). As of 2011, the average number of links separating any two Facebook users worldwide was reportedly 4.74, which shows that the world is surprisingly small.
Since the Python code above already produced a CSV of Japanese actor and actress names, let's look up the Bacon number for each of them and write the results to CSV.
The CSV of actors and actresses looks like this:

all_names.csv

```
name
Hiroshi Abe
Abe Tsuyoshi
Osamu Adachi
Jin Akanishi
...
```
Below is the code. The point is how to use `post` in `Mechanize`. Also, the Bacon number I wanted couldn't simply be pulled out of the HTML (it was untagged text), so I extracted it with a regular expression. Reference: How to use Ruby regular expressions

Handling CSV is described in Notes for handling Ruby CSV. Since `CSV.open` can be used in the same way as `File.open`, that's what I used here.
```ruby
require 'mechanize'
require 'csv'
require 'kconv'

def get_bacon_num_to(person)
  agent = Mechanize.new
  # POST the form fields instead of building a query string
  page = agent.post('http://oracleofbacon.org/movielinks.php', { a: 'Kevin Bacon', b: person })
  main_text = page.at('#main').inner_text.toutf8
  match_result = main_text.match(/has a Bacon number of ([0-9]+)/)
  bacon_number = 0
  if match_result.nil?
    puts "#{person}: Not found."
  else
    bacon_number = match_result[1]
    puts "#{person}: #{bacon_number}"
  end
  return bacon_number
end

people = CSV.read('all_names.csv', headers: true)
CSV.open("result.csv", 'w') do |file|
  people.each do |person|
    num = get_bacon_num_to(person['name'])
    file << [person['name'], num]
    sleep(1)  # Pause between requests
  end
end
```
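For comparison, the same POST request could be sketched in Python with `requests.post`, reusing the form fields `a` and `b` from the Ruby code above (a sketch, assuming the site's form hasn't changed):

```python
import re
import requests

def get_bacon_num_to(person):
    # Form fields a and b mirror the Mechanize post above
    resp = requests.post('http://oracleofbacon.org/movielinks.php',
                         data={'a': 'Kevin Bacon', 'b': person})
    match = re.search(r'has a Bacon number of ([0-9]+)', resp.text)
    return 0 if match is None else match.group(1)

print(get_bacon_num_to('Hiroshi Abe'))
```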
There are various other approaches, but I think the tools introduced here can handle most cases. By all means, give them a try!