I'm @marutaku, an M2 (second-year master's student). This article is day 23 of the MYJLab Advent Calendar. The previous article was "I tried to visualize the comment flow rate of the Twitch distribution archive" by @kanekom.
In this article, I was planning to build a silly joke app in Go and present it, but I gave up because my master's thesis is on fire. I knew this would happen. So instead, I would like to introduce a fake news generator that I made in the past!
Kyoko Shimbun is so entertaining that I wanted to generate articles like it automatically! That was my simple motivation. Kyoko Shimbun writes parody (?) articles based on current affairs, so I decided to generate fake news from real news published on Yahoo News. Sentences are generated with a Markov chain. I will skip a detailed explanation of Markov chains because many other articles cover them well. I chose a Markov chain because, although the output as a whole makes no sense, the individual sentences it generates are not completely incomprehensible. It might be interesting to try BERT now.
I collected news from December 2018 from Yahoo News, which shows how old this project is. I used Scrapy as the crawler framework because it saves me from writing the tedious processing that a crawler written from scratch would need. For details, please refer to the articles I wrote in the past.
The following is part of the crawler that I actually wrote.
import scrapy
from news.items import NewsIndexItem, NewsDetailItem
from bs4 import BeautifulSoup


class YahooSpider(scrapy.Spider):
    name = 'yahoo'
    allowed_domains = ['yahoo.co.jp']
    # Headline list pages used as crawl entry points
    start_urls = [
        'https://news.yahoo.co.jp/hl?c=dom',
        'https://news.yahoo.co.jp/hl?c=c_int',
        'https://news.yahoo.co.jp/hl?c=bus',
        'https://news.yahoo.co.jp/hl?c=c_ent',
        'https://news.yahoo.co.jp/hl?c=c_spo',
        'https://news.yahoo.co.jp/hl?c=c_sci',
        'https://news.yahoo.co.jp/hl?c=c_life',
        'https://news.yahoo.co.jp/hl?c=loc'
    ]

    def parse(self, response):
        # Entry page: follow each "more" link to a sub-topic index.
        # Use a separate name so the Scrapy response is not shadowed.
        soup = BeautifulSoup(response.text, 'html.parser')
        next_urls = soup.select('.epConMore > a')
        for next_url in next_urls:
            # urljoin keeps absolute URLs as-is and resolves relative ones
            yield scrapy.Request(response.urljoin(next_url['href']), callback=self.parse_sub_topic_index)

    def parse_sub_topic_index(self, response):
        # Index page: request every article, then move on to the next page
        for content in response.css('li.listFeedWrap'):
            href = content.css('a::attr(href)').extract_first()
            if href:
                yield scrapy.Request(response.urljoin(href), callback=self.parse_detail)
        next_page_url = BeautifulSoup(response.text, 'html.parser').select_one('li.next > a')
        if next_page_url:
            self.logger.debug('next page: %s', next_page_url)
            yield scrapy.Request(response.urljoin(next_page_url['href']), callback=self.parse_sub_topic_index)

    def parse_detail(self, response):
        # Article page: extract the title and the body text
        soup = BeautifulSoup(response.text, 'html.parser')
        content_dom = soup.select_one('div#ym_newsarticle')
        if content_dom is None:
            return  # page layout differs from what this spider expects
        title_dom = content_dom.select_one('h1')
        text_dom = content_dom.select_one('p.ynDetailText')
        if title_dom is None or text_dom is None:
            return
        item = NewsDetailItem(
            title=title_dom.text,
            content=text_dom.text
        )
        yield item
This crawler collects the title and body of each article from Yahoo News.
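Before the collected articles can be used for text generation, they have to be read back in. The following is a minimal sketch of that step, not part of the original crawler. It assumes the items were exported as JSON Lines (for example with Scrapy's `-o news.jl` option) and carry the `title` and `content` fields of `NewsDetailItem`. The file name and the helper `load_articles` are hypothetical.

```python
import json
from typing import Iterable, Iterator


def load_articles(lines: Iterable[str]) -> Iterator[dict]:
    """Yield {'title', 'content'} dicts from JSON Lines text, skipping unusable rows."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        try:
            item = json.loads(line)
        except json.JSONDecodeError:
            continue  # ignore lines that are not valid JSON
        if not isinstance(item, dict):
            continue  # ignore JSON values that are not objects
        title = item.get("title")
        content = item.get("content")
        # Keep only rows where both fields are non-empty strings
        if isinstance(title, str) and isinstance(content, str) and title and content:
            yield {"title": title.strip(), "content": content.strip()}


# In-memory sample standing in for the exported file (made-up data)
sample = [
    '{"title": "Headline A", "content": "Body of article A."}',
    '',
    'not json',
    '{"title": "Headline B"}',
]
articles = list(load_articles(sample))
print(articles)
```

With a real export, an open file object can be passed directly to `load_articles`, since iterating over a text file yields its lines.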
My Markov chain code was too messy to show, so I will omit it. Please forgive me. If you were to do this today, convenient libraries such as markovify exist, so I recommend using one of those.
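For readers who want to see the idea, here is a minimal sketch of a word-level Markov chain text generator. It is not the code used for this article. It assumes the text is already split into space-separated words (Japanese news text would first need a morphological analyzer such as MeCab), and the names `build_chain` and `generate` are made up for illustration.

```python
import random
from collections import defaultdict

# Sentinel tokens marking the start and end of a sentence (hypothetical names)
BEGIN, END = "__BEGIN__", "__END__"


def build_chain(sentences, order=2):
    """Map each state (the previous `order` words) to the words seen right after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        words = sentence.split()
        if not words:
            continue
        padded = [BEGIN] * order + words + [END]
        for i in range(len(padded) - order):
            state = tuple(padded[i:i + order])
            chain[state].append(padded[i + order])
    return chain


def generate(chain, order=2, max_words=50, rng=random):
    """Walk the chain from the start state until END or max_words is reached."""
    state = (BEGIN,) * order
    out = []
    while len(out) < max_words:
        followers = chain.get(state)
        if not followers:
            break  # unseen state: stop here
        word = rng.choice(followers)  # repeated followers are proportionally more likely
        if word == END:
            break
        out.append(word)
        state = state[1:] + (word,)
    return " ".join(out)


# Made-up toy corpus; real input would be the crawled article bodies
corpus = [
    "the mayor opened the new bridge",
    "the mayor praised the new museum",
    "the team opened the season",
]
chain = build_chain(corpus)
print(generate(chain))
```

Because each word depends only on the previous `order` words, the output is locally plausible but can drift between unrelated source sentences, which is the effect the generated examples in this article show.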
Here are some examples of the text that was actually generated.
**Note: The sentences generated below have nothing to do with real people, incidents, or groups. Also, the writer has no intention of slandering anyone.**
There were two in Kanagawa), but on the day when it was held 24 times (with elbows), the applicant commented that it was "held in this town", corresponding to 4K from Yangon.
Honda CBR1000RR and CBR1000RR SP, which are present at the special move Cross Arm Rainbow of Mitsuharu Misawa, who boasted the 65th Macau Grand Prix consecutive championship in the year, are in Chicago Red, Black, White while attracting spectators in terms of convenience. The two people involved in the shrine are "happy with the blast."
■ Nagasaki's change to 3G, which was raised by improving the shortage of medicines, will increase the entrustment fee to other devices.
To read the dentures, the sharp part has begun to be demonstrated, and the feeling of developing the "super" that introduces the brewery's brewery bar is a challenge for me, so I always show it moving. Then, let's train an occupational therapist? The person in charge of Kao was EVIL's EVIL, SANADA (3), who was the president of North Korea on the 29th.
I will replace it with a creative work of the Supreme Court two-part work "(205 Bleecker store".
Shugo Imahira and "Boyfriend" (Japanese title undecided, Japanese version) who succeeded in coming to the environment have been working on it, and the first peak was lowered.
It's been about 3 weeks in total.
It will be sold in the first secretary and the short story "Manpuku" of each country.
It will be handed over to the office.
The temperature is -8, Funabashi, and dirt are sought for the death penalty, and a light red flame is rushing in.
`The people involved in the shrine are "happy with the blast"`, `The temperature is -8, Funabashi, and dirt are sought for the death penalty, and a light red flame is rushing in.`
The people involved in the shrine are at the end of the century. I do not understand.
There is a risk of developing the first quarter of the second grade beef (1) scheme before the direct call occurs.
It seems to be 46 people.
The auxiliary power was on, and we had a press conference in Osaka city.
Initially distributed in Daegu on September 13th, about 237GB in Japanese yen will only give that power for five years, and the impression of a young man is "I'm taking a nap" (IS) in Seoul. There are still many opportunities for return visits.
However, this season No. 27 (18) is in its first year.
It is used at this time.
That's what I made clear. "
Planned from Sungmo's thoughts.
2010: I'll spend time in the newly created Sierter by the Constitutional Court.
I can imagine anyone.
`2010: I'll spend time in the newly created Sierter by the Constitutional Court. I can imagine anyone.`
What the hell happened ...
Regarding the reason, the 70th anniversary that was measured the next morning, the 29th of the same day ◇ Chiba / Ichihara GC Ichihara C made a professional debut, and it was a fresh start of the global season.
Jockey also said and made me laugh.
Although it increased from September 2018, it was 1 draw and 3 losses, but it is actually done.
Stop the rear tires.
The horizontally opposed 6-cylinder engine can be revived as a model that appealed, "I want you to protect yourself in the chairman's business."
It is responsible for the under part on the south side.
In a building occupied by another person, is it called "HP Lifecycle Service"? "And so on.
At OCR, Becky Lynch (3), who is the same "servant" as the airport, will participate in the Centers for Disease Control and Prevention (CDC) and will not comment that the skull of the "road" will shift back and forth as "menstrual movement". It was held on a schedule that there was a car for scary illness.
`there was a car for scary illness`
?????
For the time being, I was able to generate sentences with a Markov chain. There are many problems, such as parentheses that never close, but personally I am satisfied because it produces quite entertaining sentences. If I have free time, I would like to write an article that generates sentences with another method. If I have time, that is.
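The unclosed parentheses could be reduced by discarding generated sentences whose brackets do not match. The following is a minimal sketch of such a filter and is not something the generator in this article does. The function name `is_balanced` and the list of bracket pairs are assumptions for illustration. Straight double quotes are not handled, because the opening and closing characters are identical.

```python
# Opening bracket -> expected closing bracket (illustrative list, extend as needed)
PAIRS = {"(": ")", "（": "）", "「": "」", "『": "』", "[": "]"}
CLOSERS = set(PAIRS.values())


def is_balanced(sentence: str) -> bool:
    """Return True if every bracket in the sentence is closed in the right order."""
    stack = []
    for ch in sentence:
        if ch in PAIRS:
            stack.append(PAIRS[ch])  # remember which closer we now expect
        elif ch in CLOSERS:
            if not stack or stack.pop() != ch:
                return False  # closer without a matching opener
    return not stack  # leftover openers mean something never closed


# Made-up candidates in the style of the generated text
candidates = [
    "There were two in Kanagawa), but the applicant commented.",
    "Shugo Imahira (Japanese title undecided) succeeded.",
    "「held in this town",
]
kept = [s for s in candidates if is_balanced(s)]
print(kept)
```

In practice the generator would be called repeatedly until a sentence passes the check, at the cost of throwing away some output.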