Purpose

This program takes an arbitrary text file as input and generates N-grams from it. This time we generate **word** N-grams.
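To make the distinction concrete, here is a minimal sketch contrasting character N-grams with word N-grams. The function names `char_ngrams` and `word_ngrams` are made up for this illustration and are not part of ngram.py. Whitespace splitting stands in for a real tokenizer.

```python
def char_ngrams(text, n):
    """Character N-grams: every run of n consecutive characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def word_ngrams(words, n):
    """Word N-grams: every run of n consecutive words, as tuples."""
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]


sentence = "the rocket can be reused"

# Character 2-grams of a single word
print(char_ngrams("rocket", 2))
# ['ro', 'oc', 'ck', 'ke', 'et']

# Word 2-grams of the whole sentence (split on whitespace for illustration)
print(word_ngrams(sentence.split(), 2))
# [('the', 'rocket'), ('rocket', 'can'), ('can', 'be'), ('be', 'reused')]
```

The unit that slides through the window is the only difference: characters in the first case, tokenized words in the second.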

Data set

Example: a news article

We generate N-grams for the following article. The article is assumed to be saved as ./data/news.txt, relative to the directory containing the program.

It is a result that can be said to have overturned the conventional wisdom of space development, and it is attracting attention as an epoch-making technology that reduces launch costs. At a press conference held at the Kennedy Space Center in Florida after the successful launch of the rocket, SpaceX CEO Elon Musk said, "I was able to prove that the rocket can be returned," expressing his joy at the success of the experiment. He commented that ground tests will follow to check whether the returned rocket is in normal condition and that, if there are no problems, the same rocket will be launched again next month or the month after, saying, "The rocket could be reused thousands of times in the future, but at present I think it can be reused 10 to 20 times. Reuse of all rockets, including other rockets, will be the norm in the future."
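If you want the data path to follow the program's location rather than the current working directory, something like the sketch below might help. The helper name `default_news_path` is hypothetical. Inside ngram.py you would pass `__file__`, and a literal path is used here only so the example runs on its own.

```python
from pathlib import Path


def default_news_path(script_file):
    """Return <directory of script_file>/data/news.txt as an absolute path."""
    return Path(script_file).resolve().parent / "data" / "news.txt"


# Illustrative script location; in ngram.py this would be default_news_path(__file__)
print(default_news_path("/home/user/project/ngram.py"))
```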

Program

text2bow is a function that converts text into a list of words. Specify mod = "file" when the input is a file path, and mod = "str" when the input is a character string. (If you use the program as a module, the string mode is probably the more common one.)
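The two input modes can be illustrated on their own, without MeCab installed. This is only a sketch: `read_input` and `text2bow_sketch` are hypothetical names, and whitespace splitting is a stand-in for the morphological analysis that the real text2bow delegates to MeCab.

```python
import os
import tempfile


def read_input(obj, mod):
    """Return the text to tokenize: file contents for mod="file", obj itself for mod="str"."""
    if mod == "file":
        with open(obj, encoding="utf-8") as f:
            return f.read()
    if mod == "str":
        return obj
    raise ValueError('mod must be "file" or "str", got %r' % (mod,))


def text2bow_sketch(obj, mod, tokenize=str.split):
    """Stand-in for text2bow: whitespace splitting replaces the MeCab call (assumption)."""
    return tokenize(read_input(obj, mod))


# String input: obj is the text itself
print(text2bow_sketch("rockets can be reused", mod="str"))

# File input: obj is a path, so write a small temporary file first
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False, encoding="utf-8") as f:
    f.write("rockets can be reused\n")
    path = f.name
try:
    print(text2bow_sketch(path, mod="file"))
finally:
    os.remove(path)
```

Both calls return the same word list. Only the meaning of the first argument changes with mod.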

`ngram.py`


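#!/usr/bin/env python3
# -*- coding: utf-8 -*-
# Requires Python 3.7+ and the mecab command on the PATH

import subprocess
import sys

# Text -> list of words (morphemes), tokenized with MeCab
def text2bow(obj, mod):

    # mod="file": obj is the path of a text file; mod="str": obj is the text itself
    if mod == "file":
        with open(obj, encoding="utf-8") as f:
            text = f.read()
    elif mod == "str":
        text = obj
    else:
        print('error: mod must be "file" or "str"', file=sys.stderr)
        sys.exit(1)

    # Pass the text to mecab on stdin instead of building a shell command,
    # so quotes or spaces in the input cannot break the call
    morp = subprocess.run(["mecab", "-Owakati"], input=text,
                          capture_output=True, encoding="utf-8", check=True)

    # split() without arguments also drops the newlines and the empty
    # strings caused by the trailing space of each wakati output line
    bow = morp.stdout.split()

    return bow

# N-gram generation
def gen_Ngram(words, N):

    ngram = []

    # Slide a window of N words over the list; each N-gram is the
    # concatenation (no separator) of the N words ending at position i
    for i in range(N - 1, len(words)):
        ngram.append("".join(words[i - N + 1:i + 1]))

    return ngram

# Output: one N-gram per line
def output_Ngram(ngram):

    for gram in ngram:
        print(gram)

def main():

    argvs = sys.argv

    if len(argvs) != 3:
        print("usage: python ngram.py N textfile", file=sys.stderr)
        sys.exit(1)

    # Input from a file
    bow = text2bow(argvs[2], mod="file")

    # Input from a string
    #bow = text2bow(obj="This is a program that generates N-grams.", mod="str")

    ngram = gen_Ngram(bow, int(argvs[1]))

    output_Ngram(ngram)

if __name__ == "__main__":

    main()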
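One detail of gen_Ngram worth noting is that the N words of each window are concatenated with nothing between them. That reads naturally for text written without spaces, but less so for space-delimited languages. The sketch below shows a variant with an optional separator. `gen_ngram_sep` and its `sep` parameter are assumptions made for illustration and not part of ngram.py, and it expects n to be 1 or greater.

```python
def gen_ngram_sep(words, n, sep=""):
    """Join each window of n consecutive words with sep (hypothetical variant of gen_Ngram)."""
    return [sep.join(words[i:i + n]) for i in range(len(words) - n + 1)]


words = ["rocket", "can", "be", "reused"]

# Default: joined without a separator, like gen_Ngram
print(gen_ngram_sep(words, 2))
# ['rocketcan', 'canbe', 'bereused']

# With a space, the word boundaries stay visible
print(gen_ngram_sep(words, 2, sep=" "))
# ['rocket can', 'can be', 'be reused']

# Fewer than n words: no window fits, so the result is empty
print(gen_ngram_sep(words, 5))
# []
```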

Execution method

For now, the program assumes that a text file is passed as input. (To pass a character string from your own program instead, import ngram.py and call its functions; the only thing to watch is the mod value of text2bow.) Run it as follows.

`ngram.py`


$ python ngram.py N textfile

- N: any number (e.g. for 2-grams, N = 2)
- textfile: file path of the input text file
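The two positional arguments above could also be parsed and validated with the standard argparse module. This is a sketch of one possible approach, not how ngram.py reads its arguments. The names `positive_int` and `build_parser` are hypothetical.

```python
import argparse


def positive_int(value):
    """argparse type: accept only integers of 1 or greater."""
    n = int(value)
    if n < 1:
        raise argparse.ArgumentTypeError("N must be 1 or greater")
    return n


def build_parser():
    parser = argparse.ArgumentParser(
        prog="ngram.py", description="Generate word N-grams from a text file.")
    parser.add_argument("N", type=positive_int,
                        help="size of the N-gram (e.g. 2 for 2-grams)")
    parser.add_argument("textfile", help="file path of the input text file")
    return parser


# Parse an explicit argument list here; in a script, parse_args() reads sys.argv
args = build_parser().parse_args(["2", "data/news.txt"])
print(args.N, args.textfile)
```

With this in place, a missing argument or a non-positive N produces a usage message and a non-zero exit status instead of a traceback.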

Run

Output the 2-grams of the news article above.

`ngram.py`


$ python ngram.py 2 data/news.txt

Output result

Space exploration Of development Common sense Common sense Overturn Overturned Tato Tomo Can also be said ...

If you get output like the above, the program is working.

Summary

This time, I created a program that handles word N-grams in Python. To use it as a module, import the program and call its functions. I wrote it with general-purpose use in mind, so it should be easy to import and use.
