This program takes an arbitrary text file as input and generates N-grams from it. This time we generate **word** N-grams.
We will generate N-grams for the following news article. The article is assumed to be stored at ./data/news.txt, relative to the directory that contains the program.
This result can be said to have overturned the conventional wisdom of space development, and it is attracting attention as a groundbreaking technology that reduces launch costs. At a press conference held at the Kennedy Space Center in Florida after the successful launch, SpaceX CEO Elon Musk expressed his joy at the success of the experiment, saying, "We were able to prove that a rocket can be brought back." He added that the returned rocket will be tested on the ground to check that it is in normal condition and that, if there are no problems, the same rocket will be launched again next month or the month after. "Rockets could be reused thousands of times in the future, but at present I think they can be reused 10 to 20 times. Reuse of all rockets, including other rockets, will be the norm in the future," he said.
text2bow is a function that converts text into a list of words. Specify mod = "file" when the input is a file, and mod = "str" when the input is a character string. (If you use the program as a module, the string mode will probably be the more common one.)
ngram.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Python 2 script: it relies on the `commands` module and the print statement.
import sys
import pipes
import commands as cmd

# text -> list of words (morphemes)
def text2bow(obj, mod):
    # mod="file": obj is a file path, mod="str": obj is a unicode string
    if mod == "file":
        # pipes.quote keeps paths with spaces or shell metacharacters intact
        morp = cmd.getstatusoutput("cat " + pipes.quote(obj) + " | mecab -Owakati")
    elif mod == "str":
        morp = cmd.getstatusoutput("echo " + pipes.quote(obj.encode('utf-8')) + " | mecab -Owakati")
    else:
        print "error!!"
        sys.exit(1)
    words = morp[1].decode('utf-8')
    # split() drops the newlines and the trailing space that mecab emits,
    # so no empty token ends up in the list
    bow = words.split()
    return bow

# N-gram generation
def gen_Ngram(words, N):
    ngram = []
    for i in range(len(words)):
        cw = ""
        if i >= N-1:
            # concatenate the N words that end at position i
            for j in reversed(range(N)):
                cw += words[i-j]
        else:
            continue
        ngram.append(cw)
    return ngram

# output: one N-gram per line
def output_Ngram(ngram):
    for gram in ngram:
        print gram.encode('utf-8')

def main():
    argvs = sys.argv
    # input from a file
    bow = text2bow(argvs[2], mod="file")
    # input from a string
    #bow = text2bow(obj=u"This is a program that generates N-grams.", mod="str")
    ngram = gen_Ngram(bow, int(argvs[1]))
    output_Ngram(ngram)

if __name__ == "__main__":
    main()
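The listing above relies on the `commands` module, which exists only in Python 2. As a rough sketch of how the same text2bow idea might look in Python 3 (3.7 or later), the text can be piped to MeCab with `subprocess.run` instead. This is not the article's code: it assumes the `mecab` command is installed and on PATH, and `parse_wakati` is a helper name introduced here only for illustration.

```python
import subprocess


def parse_wakati(output):
    """Split MeCab's space-separated (-Owakati) output into a list of words."""
    # str.split() with no argument drops newlines and empty strings.
    return output.split()


def text2bow(obj, mod):
    """Return the words of a file (mod="file") or of a string (mod="str")."""
    if mod == "file":
        with open(obj, encoding="utf-8") as f:
            text = f.read()
    elif mod == "str":
        text = obj
    else:
        raise ValueError('mod must be "file" or "str"')
    # Assumption: the `mecab` command is installed and on PATH.
    # Sending the text on stdin avoids building a shell command string.
    result = subprocess.run(
        ["mecab", "-Owakati"],
        input=text,
        capture_output=True,
        text=True,
        encoding="utf-8",
        check=True,
    )
    return parse_wakati(result.stdout)
```

Raising `ValueError` instead of calling `sys.exit` is a design choice of this sketch: it lets a program that imports the function decide how to handle a wrong mod value.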
For now, the program assumes that a text file is passed as input. (To feed it a character string from within another program, import ngram.py and call its functions; the only thing to watch is the mod value passed to text2bow.) Run it as follows.
ngram.py
$ python ngram.py N textfile
- N: the N-gram size (e.g. N = 2 for 2-grams)
- textfile: file path of the input text file
Let's output the 2-grams of the news article above.
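The script reads the two arguments straight from sys.argv, so a missing or non-numeric N only shows up as a traceback. As one possible alternative (a Python 3 sketch, not what the article's program does), the same `N textfile` layout could be parsed with the standard argparse module. The function name `parse_args` and the rule that N must be at least 1 are assumptions made for this example.

```python
import argparse


def parse_args(argv=None):
    """Parse the `N textfile` command line described above."""
    parser = argparse.ArgumentParser(description="Print word N-grams of a text file.")
    parser.add_argument("N", type=int, help="N-gram size, e.g. 2 for 2-grams")
    parser.add_argument("textfile", help="path of the input text file")
    args = parser.parse_args(argv)
    # Assumption for this sketch: an N below 1 is treated as a usage error.
    if args.N < 1:
        parser.error("N must be 1 or greater")
    return args


args = parse_args(["2", "data/news.txt"])
print(args.N, args.textfile)  # 2 data/news.txt
```

Passing `argv=None` makes argparse fall back to the real command line, so the same function works both in a script and in a quick interactive check.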
ngram.py
$ python ngram.py 2 data/news.txt
Space exploration Of development Common sense Common sense Overturn Overturned Tato Tomo Can also be said ...
If you get output like the above, the program is working.
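To see what shape of output to expect without MeCab installed, the sliding window can be tried on a hand-made token list. The following is a minimal Python 3 sketch rather than the article's listing: `gen_ngram` and its `sep` parameter are names chosen here for illustration. With the default `sep=""` it joins the words with nothing in between, as gen_Ngram does; a space is used below only to keep the English tokens readable.

```python
def gen_ngram(words, n, sep=""):
    """Return every run of n consecutive words, joined with sep (n >= 1)."""
    # There are len(words) - n + 1 windows; range() is empty if n is too large.
    return [sep.join(words[i:i + n]) for i in range(len(words) - n + 1)]


tokens = ["the", "rocket", "can", "be", "reused"]  # hand-made token list
print(gen_ngram(tokens, 2, sep=" "))
# ['the rocket', 'rocket can', 'can be', 'be reused']
print(gen_ngram(tokens, 3, sep=" "))
# ['the rocket can', 'rocket can be', 'can be reused']
```

A list of five tokens yields four 2-grams and three 3-grams, and an N larger than the number of tokens yields an empty list.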
This time I wrote a Python program that generates word N-grams. To use it as a module, import the program and call its functions. I wrote it with reusability in mind, so it should be easy to import and use.