Now that Google Reader is gone and services like SmartNews and Gunosy, which recommend content for you even while you sleep, are getting all the attention, I wanted to build something smart of my own.
So I wrote a program that pulls articles from the Hatena Bookmark hot entries and summarizes them in one line.
Here it is.
Summary http://xiidec.appspot.com/markov.html
Run it, and you get things like...
The truth of Japan is that the lazy cat lion can ride the elite course in this country.
Like this
What to do with the story of wanting productivity as to why highly educated discriminatory remarks are required.
The hot topics get jumbled together and summarized in one line.
Regarding Ayumi Hamasaki, the reactor did not reach enough, and discriminatory remarks about core meltdown continued.
You can see the current state of the Web in one line!
It's almost like this.
It runs on a free Google App Engine server. The feed-fetching mechanism is almost the same as in my earlier post where I had a feed parser automatically pick up cat images; the server fetches the titles and passes them to the client.
The client then breaks the received string into words with a magical JavaScript library called TinySegmenter.
「今日はいい天気ですね。」 ("It's nice weather today.") ↓ 今日|は|いい|天気|です|ね|。
Such an image.
Then the words are recombined using an algorithm called a Markov chain. For details, see the Wikipedia article on Markov chains (http://ja.wikipedia.org/wiki/%E3%83%9E%E3%83%AB%E3%82%B3%E3%83%95%E9%80%A3%E9%8E%96), though I don't think reading it will make things much clearer. Roughly explained, it works like this.
今日→は→いい→天気→です→ね→。
吾輩→は→猫→で→ある→。→名前→は→まだ→無い→。
親譲り→の→無鉄砲→で→子供→の→頃→から→損→ばかり→。
Suppose there are several sentences like these (the second is the opening of "I Am a Cat", the third the opening of "Botchan").
First, a starting word is picked at random. → 「今日」
The only word that follows 「今日」 is 「は」. → 「今日は」
The words that can follow 「は」 are 「いい」 and 「猫」; choose one at random. → 「今日は猫」
The only word that follows 「猫」 is 「で」. → 「今日は猫で」
The words that can follow 「で」 are 「ある」 and 「子供」; again, choose at random. → 「今日は猫で子供」
After 「子供」 comes 「の」. → 「今日は猫で子供の」
After 「の」 comes 「無鉄砲」 or 「頃」. → 「今日は猫で子供の無鉄砲」
After 「無鉄砲」, 「で」 is selected again. → 「今日は猫で子供の無鉄砲で」
Let's finish it off. → 「今日は猫で子供の無鉄砲である。」 (roughly, "Today is a cat and reckless like a child.")
And out comes something like that. Real Markov chains actually go a little deeper, and some difficult formulas show up. But whatever the theory says, let's just hope the results turn out like this.
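The walkthrough above can be sketched in a few lines of Python. This is a minimal stand-in for the JavaScript version shown later, assuming the sentences have already been split into word lists (the function names here are my own):

```python
import random

def make_dict(sentences):
    """Build a table mapping each word to the words that may follow it.
    "_BOS_" collects possible sentence-starting words; "_EOS_" marks an end."""
    table = {"_BOS_": []}
    for words in sentences:
        if not words:
            continue
        table["_BOS_"].append(words[0])
        for now, nxt in zip(words, words[1:] + ["_EOS_"]):
            table.setdefault(now, []).append(nxt)
    return table

def shuffle(table):
    """Walk the table from a random starting word until "_EOS_"."""
    word = random.choice(table["_BOS_"])
    out = []
    while word != "_EOS_":
        out.append(word)
        word = random.choice(table[word])
    return "".join(out) + "。"  # joined without spaces, as in Japanese text
```

Fed the three example sentences, shuffle() can produce chimeras like 「今日は猫で子供の無鉄砲である。」.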
This is the source on the server side.
markov.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
import webapp2
import os
from google.appengine.ext.webapp import template
from xml.etree.ElementTree import *
import re
import urllib

class Markov(webapp2.RequestHandler):
    def get(self):
        mes = ""
        if self.request.get('mode') == "2ch":
            mes = self.get_2ch()
        else:
            mes = self.get_hotentry_title()
        template_values = {
            'mes': mes
        }
        path = os.path.join(os.path.dirname(__file__), 'html/markov.html')
        self.response.out.write(template.render(path, template_values))

    def get_hotentry_title(self):
        titles = ""
        tree = parse(urllib.urlopen('http://feeds.feedburner.com/hatena/b/hotentry'))
        for i in tree.findall('./{http://purl.org/rss/1.0/}item'):
            titles += re.sub("[-:|/|:].{1,30}$", "", i.find('{http://purl.org/rss/1.0/}title').text) + "\n"
        return titles

    def get_2ch(self):
        titles = ""
        response = urllib.urlopen('http://engawa.2ch.net/poverty/subject.txt')
        html = unicode(response.read(), "cp932", 'ignore').encode("utf-8")
        for line in html.split("\n"):
            if line != "":
                titles += re.sub("\(.*?\)$", "", line.split("<>", 2)[1]) + "\n"
        return titles

app = webapp2.WSGIApplication([
    ('/markov.html', Markov)
], debug=True)
The Markov class's get method runs when a user accesses the page. get_hotentry_title() fetches the list of hot-entry titles and passes it to markov.html. ElementTree is used to parse the RSS; getting feedparser to work on GAE looked like a hassle.
get_2ch() is a bonus feature. Instead of Hatena entries, it picks up thread titles from 2ch. Append "?mode=2ch" to the end of the URL to switch to 2ch mode. If you extend it to change the fetched data based on parameters like this, the possibilities open up.
re.sub("[-:|/|:].{1,30}$", "", ~~~)
This mysterious-looking re.sub call removes unwanted noise.
"◯◯: the only clear way to pick from 100 △△ - XX Blog"
Titles like that have the "- XX Blog" part deleted to keep them simple.
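In isolation, that cleanup looks like this (strip_site_name is a hypothetical name; the pattern is the one from get_hotentry_title):

```python
import re

def strip_site_name(title):
    # From the first separator character (-, :, |, /) onward, delete
    # up to 30 trailing characters -- typically a "- XX Blog" suffix.
    return re.sub(r"[-:|/|:].{1,30}$", "", title)
```

Note it only fires when the tail after the separator is 30 characters or fewer, so a hyphen buried deep in a long title is left alone.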
Next is the client side.
markov.html
<html>
<head>
<meta charset="UTF-8">
<title>Summary-kun</title>
<link rel="stylesheet" href="http://code.jquery.com/mobile/1.1.0/jquery.mobile-1.1.0.min.css" />
<script type="text/javascript" src="http://code.jquery.com/jquery-1.7.1.min.js"></script>
<script type="text/javascript" src="http://code.jquery.com/mobile/1.1.0/jquery.mobile-1.1.0.min.js"></script>
<script type="text/javascript" src="jscss/tiny_segmenter-0.1.js" charset="UTF-8"></script>
<script type="text/javascript">
var segmenter;
$(function(){
    segmenter = new TinySegmenter(); // create an instance
});
// Run
function doAction(){
    var wkIn = $("#txtIN").val(); // input
    var dict = makeDic(wkIn);
    var wkbest = doShuffle(dict);
    for(var i = 0; i <= 10; i++){
        var wkOut = doShuffle(dict).replace(/\n/g, "");
        if(Math.abs(40 - wkOut.length) < Math.abs(40 - wkbest.length)){
            wkbest = wkOut;
        }
    }
    $("#txtOUT").val(wkbest); // output
}
// Shuffle sentences
function doShuffle(wkDic){
    var wkNowWord = "";
    var wkStr = "";
    wkNowWord = wkDic["_BOS_"][Math.floor(Math.random() * wkDic["_BOS_"].length)];
    wkStr += wkNowWord;
    while(wkNowWord != "_EOS_"){
        wkNowWord = wkDic[wkNowWord][Math.floor(Math.random() * wkDic[wkNowWord].length)];
        wkStr += wkNowWord;
    }
    wkStr = wkStr.replace(/_EOS_$/, "。");
    return wkStr;
}
// Build the dictionary
function makeDic(wkStr){
    wkStr = nonoise(wkStr);
    var wkLines = wkStr.split("。");
    var wkDict = new Object();
    for(var i = 0; i <= wkLines.length - 1; i++){
        var wkWords = segmenter.segment(wkLines[i]); // split the line into words
        if(!wkDict["_BOS_"]){ wkDict["_BOS_"] = new Array(); }
        if(wkWords[0]){ wkDict["_BOS_"].push(wkWords[0]); } // beginning of a sentence
        for(var w = 0; w <= wkWords.length - 1; w++){
            var wkNowWord = wkWords[w];      // current word
            var wkNextWord = wkWords[w + 1]; // next word
            if(wkNextWord == undefined){     // end of a sentence
                wkNextWord = "_EOS_";
            }
            if(!wkDict[wkNowWord]){
                wkDict[wkNowWord] = new Array();
            }
            wkDict[wkNowWord].push(wkNextWord);
            if(wkNowWord == "、"){ // a word after "、" can also start a sentence
                wkDict["_BOS_"].push(wkNextWord);
            }
        }
    }
    return wkDict;
}
// Noise removal
function nonoise(wkStr){
    wkStr = wkStr.replace(/\n/g, "。");
    wkStr = wkStr.replace(/[\?\!?!]/g, "。");
    wkStr = wkStr.replace(/[-||:: ・]/g, "。");
    wkStr = wkStr.replace(/[「」()\(\)\[\]【】]/g, " ");
    return wkStr;
}
</script>
</head>
<body>
<div data-role="page" id="first">
<div data-role="content">
<p>To summarize the topical articles on the net in one line ...</p>
<p><textarea cols="60" rows="8" name="txtIN" id="txtIN" style="max-height:200px;">{{ mes }}</textarea></p>
<input type="button" name="" value="Generate" onClick="doAction()"><br>
<textarea cols="60" rows="8" name="txtOUT" id="txtOUT"></textarea>
</div>
</div>
</body>
</html>
What a mess. First, doAction() is the main function. It builds a dictionary of word-to-word connections from the input string with makeDic(), which uses segmenter.segment() to break each sentence into words. Then it mixes the dictionary ten times with doShuffle() and adopts the resulting string closest to 40 characters.
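That "closest to 40 characters" selection can be isolated like this (a hypothetical Python helper mirroring the loop in doAction(); generate stands in for one call to doShuffle()):

```python
def best_candidate(generate, tries=10, target=40):
    # Generate several candidate strings and keep whichever one
    # has a length closest to `target` characters.
    best = generate()
    for _ in range(tries):
        cand = generate()
        if abs(target - len(cand)) < abs(target - len(best)):
            best = cand
    return best
```

Swapping in a different scoring rule here is the easiest way to change what kind of "summary" wins.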
Complete.
It seems like all sorts of improvements are possible: change which information is fetched from the Web, or tweak the criteria for scoring the mixed-up sentences to your liking.
Not very practical.