I've been studying deep learning recently, and I've been playing with the seq2seq model from the TensorFlow tutorial to build something that can reply in a conversation. However, I couldn't find Japanese conversation data anywhere, so I decided to scrape and collect it myself.
The full source code is here: https://github.com/ryosuke1217/askfm_q-a_scraper/blob/master/askfm.py
The scraping is done by driving Chrome with Selenium.
askfm.py
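from selenium import webdriver

driver = webdriver.Chrome()  # launch Chrome through Selenium
driver.get("https://ask.fm/" + word)  # word is supplied on the command line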
For word, pass the part of the target URL that comes after "ask.fm/" on the command line.
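One way the command-line value might be turned into the URL is sketched below. This is an assumption for illustration, not the script's actual code: `build_url` and `word_from_argv` are hypothetical helpers.

```python
BASE_URL = "https://ask.fm/"


def build_url(word):
    """Return the page URL for the part of the address after "ask.fm/"."""
    # Tolerate stray whitespace or a pasted leading slash such as "/partyhike".
    return BASE_URL + word.strip().lstrip("/")


def word_from_argv(argv):
    """Return the first command-line argument, or None if it is missing."""
    return argv[1] if len(argv) > 1 else None


# In a script: import sys; word = word_from_argv(sys.argv)
```

With this sketch, running `python askfm.py partyhike` would give `word_from_argv(sys.argv)` the value `'partyhike'`.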
askfm.py
from time import sleep

while True:
    # Vertical scroll position before scrolling
    previous_h = driver.execute_script("return window.pageYOffset;")
    # Scroll to the bottom of the content loaded so far
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    sleep(3)  # wait for more items to load
    # Vertical scroll position after scrolling
    after_h = driver.execute_script("return window.pageYOffset;")
    if previous_h == after_h:
        break  # the position did not change, so the bottom was reached
print('load complete')
By reading the vertical scroll position before and after each scroll and repeating until it stops changing, you can scroll all the way to the bottom of the page.
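The loop above can also be wrapped in a small helper. The following is a sketch rather than the script's actual code: `scroll_to_bottom`, `pause`, `max_rounds`, and `wait` are hypothetical names, and the cap on rounds is an added assumption so that a page that never stops loading cannot loop forever. It relies only on `execute_script`, which the snippet above already uses.

```python
from time import sleep


def scroll_to_bottom(driver, pause=3, max_rounds=1000, wait=sleep):
    """Scroll until the vertical offset stops changing; return rounds used."""
    for rounds in range(1, max_rounds + 1):
        before = driver.execute_script("return window.pageYOffset;")
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        wait(pause)  # give the page time to load more items
        after = driver.execute_script("return window.pageYOffset;")
        if before == after:
            return rounds  # no movement, so the bottom was reached
    return max_rounds  # gave up because the cap was reached
```

Passing `wait` as a parameter is only there so the helper can be exercised without real delays; in normal use the default `sleep` is enough.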
askfm.py
from selenium.webdriver.common.by import By  # find_elements_by_* was removed in Selenium 4.3

questions = driver.find_elements(By.CLASS_NAME, "streamItemContent-question")
answers = driver.find_elements(By.CLASS_NAME, "answerWrapper")
qas = [(q.find_element(By.TAG_NAME, 'h2').text, a.find_element(By.TAG_NAME, 'p').text)
       for q, a in zip(questions, answers)]
This collects the question and answer elements from the page's HTML and pairs up their text.
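Because `zip` stops at the shorter of its inputs, a page where the number of question elements and answer elements differs would silently lose pairs. A quick count check before pairing can catch that. The sketch below uses plain strings in place of Selenium elements; `pair_up` is a hypothetical helper, not part of the original script.

```python
def pair_up(questions, answers):
    """Pair questions with answers, refusing to guess when the counts differ."""
    if len(questions) != len(answers):
        raise ValueError(
            "found %d questions but %d answers"
            % (len(questions), len(answers))
        )
    return list(zip(questions, answers))
```

Raising an error here is one possible design choice; logging the mismatch and continuing would also be reasonable for a one-off data collection run.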
askfm.py
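import codecs

with codecs.open('data/askfm_data_' + word + '.txt', 'w', 'utf-8') as f:
    for q, a in qas:
        # Skip empty entries and anything containing a URL
        if q == '' or a == '' or 'http' in q or 'http' in a:
            continue
        # Flatten each question and answer onto a single line
        q = q.replace('\n', '')
        a = a.replace('\n', '')
        # One pair per block: question, answer, blank line
        f.write(q)
        f.write('\n')
        f.write(a)
        f.write('\n')
        f.write('\n')
driver.quit()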
Finally, clean the data into the required format, write it to a file, and quit the driver.
askfm_data_partyhike.txt
When I was young, Ayumi Hamasaki hated the rear aura, but recently Ayumi Hamasaki feels blues.
Isn't it a debooth, not a blues?
I was tired of job hunting. Please give me some advice ...
The hard work of this time is 90% of the rest of my life. It's better to keep running without giving up even if you overdo it a little. If you think that the remaining decades will be decided in a few months at most, you should be able to do your best.
Occasionally, people are invited to drink, but how many people will participate each time?
Regardless of gender, I only drink it by hand. If you do more than one, there will be a mix of people who take voyeurs and write personal information on 2channel. I get 5 to 10 DMs every time, but most of the time I don't get it because I don't get many people who I can trust.
Would you like to elope?
I don't think.
Lips, lips, eyes, eyes, hands, hands Isn't God banning anything?
I love you ~ × 3
Is your uncle taking any measures against false accusations on the commuter train?
I rarely get on a crowded train because I come to work late, but once in a while I grab a strap with both hands and protect myself completely.
・
・
(The rest is omitted because the file is huge.)
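Since the script writes each pair as a question line, an answer line, and a blank line, the file can be read back into pairs for training. A minimal sketch, assuming that layout holds throughout the file and that no question or answer consists only of whitespace; `parse_pairs` and `load_pairs` are hypothetical helpers, not part of the original script.

```python
def parse_pairs(lines):
    """Turn lines laid out as question, answer, blank into (q, a) tuples."""
    # Drop the blank separator lines and the trailing newlines.
    text = [line.rstrip("\n") for line in lines if line.strip()]
    # What remains alternates question, answer, question, answer, ...
    return list(zip(text[0::2], text[1::2]))


def load_pairs(path):
    """Read pairs from a UTF-8 file written in the layout described above."""
    with open(path, encoding="utf-8") as f:
        return parse_pairs(f)
```

For the sample file, `load_pairs('data/askfm_data_partyhike.txt')` would return a list of (question, answer) tuples in file order.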
It's question-and-answer text rather than real conversation, but I think I can make good use of it, so that's fine.
Thank you for reading.