The story of making a fake news generator

This is @marutaku, a second-year master's student (M2). This article is day 23 of the MYJLab Advent Calendar. The previous article was "I tried to visualize the comment flow rate of Twitch stream archives" by @kanekom.

For this article I was planning to build a joke app in Go and present it, but I gave up because my master's thesis is in flames. I saw that coming. So instead, I would like to introduce a fake news generator that I made in the past!

What kind of app is it?

Kyoko Shimbun is very funny, so I wanted to generate articles like it automatically. That simple motivation is where this started. Kyoko Shimbun writes parody (?) articles based on current affairs, so I decided to generate fake news from real news published on Yahoo News. Sentences are generated with a Markov chain. I will skip a detailed explanation of Markov chains, since many other articles cover them well. I chose a Markov chain because, although the meaning of the output is often lost, it can generate sentences that are not completely incomprehensible. It might be interesting to try BERT now.

Crawler

I collected news articles from December 2018 from Yahoo News, which shows how old this project is. I used Scrapy as the crawler framework because it spares you the tedious processing you would have to write when building a crawler from scratch. For details, please refer to an article I wrote in the past.

The following is part of the crawler I actually wrote.

import scrapy
from news.items import NewsIndexItem, NewsDetailItem
from bs4 import BeautifulSoup


class YahooSpider(scrapy.Spider):
    name = 'yahoo'
    allowed_domains = ['yahoo.co.jp']
    start_urls = [
        'https://news.yahoo.co.jp/hl?c=dom',
        'https://news.yahoo.co.jp/hl?c=c_int',
        'https://news.yahoo.co.jp/hl?c=bus',
        'https://news.yahoo.co.jp/hl?c=c_ent',
        'https://news.yahoo.co.jp/hl?c=c_spo',
        'https://news.yahoo.co.jp/hl?c=c_sci',
        'https://news.yahoo.co.jp/hl?c=c_life',
        'https://news.yahoo.co.jp/hl?c=loc'
    ]

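    def parse(self, response):
        # Parse with BeautifulSoup under a separate name so the Scrapy response stays usable.
        soup = BeautifulSoup(response.text, 'html.parser')
        for link in soup.select('.epConMore > a'):
            # urljoin resolves relative hrefs; scrapy.Request needs an absolute URL.
            yield scrapy.Request(response.urljoin(link['href']), callback=self.parse_sub_topic_index)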

    def parse_sub_topic_index(self, response):
        # Follow every article link on the listing page.
        for content in response.css('li.listFeedWrap'):
            href = content.css('a::attr(href)').extract_first()
            if href:
                yield scrapy.Request(response.urljoin(href), callback=self.parse_detail)
        # Follow the pagination link, if there is one.
        next_page_link = BeautifulSoup(response.text, 'html.parser').select_one('li.next > a')
        if next_page_link:
            yield scrapy.Request(response.urljoin(next_page_link['href']), callback=self.parse_sub_topic_index)

    def parse_detail(self, response):
        soup = BeautifulSoup(response.text, 'html.parser')
        content_dom = soup.select_one('div#ym_newsarticle')
        if content_dom is None:
            # Skip pages that do not use the expected article layout.
            return
        title_dom = content_dom.select_one('h1')
        text_dom = content_dom.select_one('p.ynDetailText')
        if title_dom is None or text_dom is None:
            return
        yield NewsDetailItem(
            title=title_dom.text,
            content=text_dom.text
        )

This retrieves the title and body of each article from Yahoo News.

Generated text

My Markov chain code was too messy to show, so I will omit it. Please forgive me. If I were doing this today, I would use a convenient library such as markovify.
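
For reference, a word-level Markov chain generator can be sketched roughly as follows. This is not the code that produced the text in this article: the function names `build_chain` and `generate`, the `BEGIN`/`END` markers, the order of 2, and the toy corpus are all assumptions made for illustration. It also assumes the input is already split into whitespace-separated tokens; Japanese news text would first have to be segmented into words with a morphological analyzer.

```python
import random
from collections import defaultdict

# Hypothetical sentinel tokens marking sentence boundaries (assumption, not from the article).
BEGIN = "__BEGIN__"
END = "__END__"


def build_chain(sentences, order=2):
    """Map every run of `order` consecutive words to the words seen right after it."""
    chain = defaultdict(list)
    for sentence in sentences:
        # Pad with markers so the chain learns where sentences start and stop.
        words = [BEGIN] * order + sentence.split() + [END]
        for i in range(len(words) - order):
            state = tuple(words[i:i + order])
            chain[state].append(words[i + order])
    return chain


def generate(chain, order=2, max_words=50, rng=random):
    """Walk the chain from the start state until END or max_words is reached."""
    state = (BEGIN,) * order
    output = []
    for _ in range(max_words):
        candidates = chain.get(state)
        if not candidates:  # empty corpus or unseen state
            break
        next_word = rng.choice(candidates)
        if next_word == END:
            break
        output.append(next_word)
        # Slide the window forward by one word.
        state = state[1:] + (next_word,)
    return " ".join(output)


# Toy corpus made up for this sketch; the real input would be the crawled article bodies.
corpus = [
    "the mayor opened the new station on Monday",
    "the mayor cancelled the festival on Monday",
    "the team opened the season with a win",
]
chain = build_chain(corpus, order=2)
print(generate(chain, order=2))
```

A higher `order` copies longer runs from the source text and reads more naturally, while a lower one mixes articles more freely and produces stranger sentences.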

Here are some examples of the text that was actually generated.

Note: The sentences generated below have nothing to do with real people, incidents, or groups. Also, the writer has no intention of slandering anyone.

Example sentence ①

There were two in Kanagawa), but on the day when it was held 24 times (with elbows), the applicant commented that it was "held in this town", corresponding to 4K from Yangon.
Honda CBR1000RR and CBR1000RR SP, which are present at the special move Cross Arm Rainbow of Mitsuharu Misawa, who boasted the 65th Macau Grand Prix consecutive championship in the year, are in Chicago Red, Black, White while attracting spectators in terms of convenience. The two people involved in the shrine are "happy with the blast."
■ Nagasaki's change to 3G, which was raised by improving the shortage of medicines, will increase the entrustment fee to other devices.
To read the dentures, the sharp part has begun to be demonstrated, and the feeling of developing the "super" that introduces the brewery's brewery bar is a challenge for me, so I always show it moving. Then, let's train an occupational therapist? The person in charge of Kao was EVIL's EVIL, SANADA (3), who was the president of North Korea on the 29th.
I will replace it with a creative work of the Supreme Court two-part work "(205 Bleecker store".
Shugo Imahira and "Boyfriend" (Japanese title undecided, Japanese version) who succeeded in coming to the environment have been working on it, and the first peak was lowered.
It's been about 3 weeks in total.
It will be sold in the first secretary and the short story "Manpuku" of each country.
It will be handed over to the office.
The temperature is -8, Funabashi, and dirt are sought for the death penalty, and a light red flame is rushing in.

"The two people involved in the shrine are 'happy with the blast.'" "The temperature is -8, Funabashi, and dirt are sought for the death penalty, and a light red flame is rushing in."

Are the shrine people living at the end of the century? I don't get it.

Example sentence ②

There is a risk of developing the first quarter of the second grade beef (1) scheme before the direct call occurs.
It seems to be 46 people.
The auxiliary power was on, and we had a press conference in Osaka city.
Initially distributed in Daegu on September 13th, about 237GB in Japanese yen will only give that power for five years, and the impression of a young man is "I'm taking a nap" (IS) in Seoul. There are still many opportunities for return visits.
However, this season No. 27 (18) is in its first year.
It is used at this time.
That's what I made clear. "
Planned from Sungmo's thoughts.
2010: I'll spend time in the newly created Sierter by the Constitutional Court.
I can imagine anyone.

"2010: I'll spend time in the newly created Sierter by the Constitutional Court." "I can imagine anyone." What the hell happened...?

Example sentence ③

Regarding the reason, the 70th anniversary that was measured the next morning, the 29th of the same day ◇ Chiba / Ichihara GC Ichihara C made a professional debut, and it was a fresh start of the global season.
Jockey also said and made me laugh.
Although it increased from September 2018, it was 1 draw and 3 losses, but it is actually done.
Stop the rear tires.
The horizontally opposed 6-cylinder engine can be revived as a model that appealed, "I want you to protect yourself in the chairman's business."
It is responsible for the under part on the south side.
In a building occupied by another person, is it called "HP Lifecycle Service"? "And so on.
At OCR, Becky Lynch (3), who is the same "servant" as the airport, will participate in the Centers for Disease Control and Prevention (CDC) and will not comment that the skull of the "road" will shift back and forth as "menstrual movement". It was held on a schedule that there was a car for scary illness.

"There was a car for scary illness"?????

In closing

For the time being, I managed to generate sentences with a Markov chain. There are still many problems, such as parentheses that never close, but personally I am satisfied because it produces quite entertaining sentences. If I have free time, I would like to write an article that generates sentences with another method. If I have time, that is.
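
As for the unclosed parentheses mentioned above, one possible mitigation is to throw away generated sentences whose brackets do not balance and generate again. The sketch below is only an illustration and was not part of this generator; the function name `has_balanced_brackets` and the chosen bracket pairs are assumptions. Straight double quotes are ignored because the opening and closing marks are the same character.

```python
# Closing bracket -> matching opening bracket (ASCII and common Japanese pairs).
PAIRS = {")": "(", "]": "[", "）": "（", "」": "「", "』": "『"}


def has_balanced_brackets(text, pairs=PAIRS):
    """Return True if every bracket in `text` is opened and closed in order."""
    openers = set(pairs.values())
    stack = []
    for ch in text:
        if ch in openers:
            stack.append(ch)
        elif ch in pairs:
            # A closer must match the most recent unclosed opener.
            if not stack or stack.pop() != pairs[ch]:
                return False
    # Anything left on the stack was opened but never closed.
    return not stack


candidates = [
    "It was held 24 times (with elbows) in this town.",
    "There were two in Kanagawa), but it was held anyway.",
]
kept = [s for s in candidates if has_balanced_brackets(s)]
print(kept)
```

Filtering like this trades variety for tidiness: the stricter the check, the more generated sentences are discarded.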
