Continuing from last time, we will build the part that fetches drafts from Qiita.
Qiita draft scraping
Get a list of your own drafts on Qiita
This requires logging in, just like last time, so we use mechanize again.
crawler.rb
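# Needed for the parsing below (skip if crawler.rb already requires them)
require 'nokogiri'
require 'json'

# `agent` is the logged-in Mechanize agent created last time
page = agent.get("https://qiita.com/drafts")
doc = Nokogiri::HTML.parse(page.body, nil, 'utf-8')
# The draft list is embedded as JSON in the second react-on-rails component
json = JSON.parse(doc.css('.js-react-on-rails-component')[1].inner_html)
json['creating_draft_items'].each do |item|
  # Only handle drafts whose body contains the marker phrase
  if item['raw_body'].match(/Reservation posting/)
    id = item['item_uuid']
    title = item['title']
    raw_body = item['raw_body']
    # tag_notation is a space-separated string of tag names
    tags = item['tag_notation'].split(' ')
    # Open the individual draft page
    agent.get("https://qiita.com/drafts/#{id}")
    # Build one hash per tag, each with an empty versions list
    tag_data = tags.map { |tag| { name: tag, versions: [] } }
  end
end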
Add the code above to last time's script, adjusting it as needed. The code fetches the list of drafts and, for each draft whose body contains the phrase "Reservation posting", extracts its ID, title, body, and tags. Last time I put the draft ID in the URL when fetching drafts, but a request to /drafts gets redirected anyway, so fetching /drafts directly works.
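To try the filtering step without logging in to Qiita, here is a minimal sketch that runs the same selection on a hand-written JSON payload. The helper name `select_reserved_drafts` and the sample data are made up for illustration. Only the key names (`creating_draft_items`, `item_uuid`, `title`, `raw_body`, `tag_notation`) come from the code above, and the real payload may well contain more fields.

```ruby
require 'json'

# Hypothetical helper: keep only the drafts whose body contains the marker
# phrase, and reshape each one into a small hash.
def select_reserved_drafts(payload, marker = 'Reservation posting')
  items = payload.fetch('creating_draft_items', [])
  matching = items.select { |item| item['raw_body'].to_s.include?(marker) }
  matching.map do |item|
    {
      id: item['item_uuid'],
      title: item['title'],
      raw_body: item['raw_body'],
      # tag_notation is assumed to be a space-separated string of tag names
      tags: item['tag_notation'].to_s.split(' ').map { |tag| { name: tag, versions: [] } }
    }
  end
end

# Made-up sample payload, for illustration only
sample = JSON.parse(<<~PAYLOAD)
  {
    "creating_draft_items": [
      { "item_uuid": "aaa111", "title": "Scheduled article",
        "raw_body": "Reservation posting. Hello", "tag_notation": "Ruby mechanize" },
      { "item_uuid": "bbb222", "title": "Plain draft",
        "raw_body": "Just notes", "tag_notation": "memo" }
    ]
  }
PAYLOAD

drafts = select_reserved_drafts(sample)
drafts.each do |draft|
  puts "#{draft[:id]} #{draft[:title]} #{draft[:tags].map { |t| t[:name] }.join(',')}"
end
```

Keeping the selection in a function that takes the already-parsed hash means it can be checked separately from the login and HTML parsing, which depend on Qiita's current page structure.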
【Next time】 I'm finally going to build the posting part, but it looks harder than I expected... I may end up relying on Selenium...