Hi, this is Bython Chogo. I need to learn English, so I try to post articles in both English and Japanese :(
I'm currently studying machine learning and practicing by writing test scripts with Bayesian filtering. My plan is to estimate tags for web posts after training on several posts and their tags. A sample Bayesian filtering script is available on the Gihyo web page; I'll introduce it later. Before that, today's topic (and problem) is scraping the contents of a post.
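Before getting into scraping, just to make the tagging plan concrete, here is a minimal sketch of the idea. This is only my own rough illustration, not the Gihyo sample script, and all the class and variable names are mine: a naive Bayes tagger that learns word counts per tag and picks the most likely tag for a new post.
import math
from collections import Counter, defaultdict

class NaiveBayesTagger:
    """Estimate a tag for a post from its words (add-one smoothing)."""
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # tag -> word frequencies
        self.tag_counts = Counter()              # tag -> number of training posts
        self.vocab = set()

    def train(self, words, tag):
        self.tag_counts[tag] += 1
        self.word_counts[tag].update(words)
        self.vocab.update(words)

    def classify(self, words):
        total_posts = sum(self.tag_counts.values())
        best_tag, best_score = None, float("-inf")
        for tag in self.tag_counts:
            # log P(tag) + sum of log P(word | tag), with add-one smoothing
            score = math.log(self.tag_counts[tag] / total_posts)
            total_words = sum(self.word_counts[tag].values())
            for w in words:
                count = self.word_counts[tag][w] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_tag, best_score = tag, score
        return best_tag

tagger = NaiveBayesTagger()
tagger.train(["python", "scraping", "beautifulsoup"], "python")
tagger.train(["bayes", "filtering", "learning"], "machine-learning")
print(tagger.classify(["scraping", "with", "python"]))  # prints "python"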
I found a good slide that describes what I want to say, but I've lost it... orz. I'll add the link later. According to it, there are two ways to scrape the body contents. One is to use the characteristic format of each site. I don't need the header or footer data for learning words, because they are probably not useful for identifying the tag.
As an example, I try to scrape only the article body on Hatena Blog; the article sits between the tags below.
<div class="entry-content">
CONTENTS to SCRAPE!
</div>
For this case, I wrote the code below.
import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(html, "lxml")  # html is the raw page source
encoding = soup.original_encoding   # detected encoding of the page
tag = soup.find("div", {"class": "entry-content"})
text = ""
pattern = re.compile(r'<.*?>')      # strip any HTML tags left inside the div
for con in tag.contents:
    text += pattern.sub('', str(con))
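I didn't show it above, but the html variable is just the raw page source fetched beforehand, for example with a plain urllib request like this (the URL here is a made-up example):
import urllib.request

url = "https://example.hatenablog.com/entry/sample-post"  # hypothetical URL
with urllib.request.urlopen(url) as res:
    html = res.read()  # bytes; BeautifulSoup detects the encoding itself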
It doesn't look cool... but it works :( The problem is that I have to prepare a format like this for every site I want to scrape, which is very tiring. So the second way is to use a learning method instead! But that still looks difficult for me.
To be continued...