The other day I finished the Coursera Machine Learning course, so I wanted to try it out in practice. I will try to predict the three types (Cu, Co, Pa) of Idolmaster Cinderella Girls (https://ja.wikipedia.org/wiki/%E3%82%A2%E3%82%A4%E3%83%89%E3%83%AB%E3%83%9E%E3%82%B9%E3%82%BF%E3%83%BC_%E3%82%B7%E3%83%B3%E3%83%87%E3%83%AC%E3%83%A9%E3%82%AC%E3%83%BC%E3%83%AB%E3%82%BA) idols from their profile data.
First comes getting the data to train on. I looked for a Delemas counterpart of the Pokémon API, but nothing looked promising, so I decided to scrape the data from the Delemas wiki I usually use (https://imascg-slstage-wiki.gamerch.com/).
For the scraping itself, I referred to the following page: http://qiita.com/Azunyan/items/9b3d16428d2bcc7c9406
```python
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import urllib2
import csv
from bs4 import BeautifulSoup
#URL to access
url = "https://imascg-slstage-wiki.gamerch.com/%E3%82%A2%E3%82%A4%E3%83%89%E3%83%AB%E4%B8%80%E8%A6%A7"
#Read URL
html = urllib2.urlopen(url)
#Handle html with Beautiful Soup
soup = BeautifulSoup(html, "html.parser")
#Get all the contents of the first table
table = soup.findAll("table")[0]
#Decompose table row by row
rows = table.findAll("tr")
csvFile = open("aimasudata.csv", 'wt')
writer = csv.writer(csvFile)
for row in rows:
    csvRow = []
    #Collect the text of every header and data cell in this row
    for cell in row.findAll(['td', 'th']):
        csvRow.append(cell.get_text().encode('utf-8'))
    writer.writerow(csvRow)
csvFile.close()
```
That is about all it takes.
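As a quick sanity check (my own addition, not part of the original script), you can read the generated CSV back and print the first few rows. This assumes the script above has already run and produced aimasudata.csv:

```python
# -*- coding: utf-8 -*-
#Quick check: assumes aimasudata.csv was created by the script above
import csv

with open("aimasudata.csv", "r") as f:
    reader = csv.reader(f)
    for i, row in enumerate(reader):
        print(row)      #each cell is a UTF-8 encoded byte string (Python 2)
        if i >= 4:      #only look at the first five rows
            break
```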
- I didn't know how to read HTML tags, so it took me a while to figure out what to pass to soup.findAll. If you just want the data in a table, you can pass "table" and pick the table you need by its index within the page (see the sketch after these notes).
- cell.get_text() returns Japanese text that cannot be written as ASCII, so encoding it to UTF-8 is required.
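For reference, here is a minimal sketch of how to look for the right table index. This snippet is my own illustration, not from the original post, and it reuses the soup object from the script above:

```python
#Print the index and header row of every table on the page, so you can
#see which index to pass to soup.findAll("table")[...]
for i, t in enumerate(soup.findAll("table")):
    first_row = t.find("tr")
    if first_row is None:
        continue
    headers = [c.get_text().encode('utf-8') for c in first_row.findAll(['th', 'td'])]
    print("%d: %s" % (i, headers))
```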