This is my record of problem 70 of Language Processing 100 Knock 2015. Until now my answers were almost the same as those in "Amateur language processing 100 knocks", so I did not post them to the blog. For "Chapter 8: Machine Learning", however, I took the time to work through the problems seriously and changed the answers to some extent, so I am posting them. I will mainly use Stanford NLP.

Reference link

Link	Remarks
070.Obtaining and shaping data.ipynb	GitHub link to the answer program
100 amateur language processing knocks:70	I always rely on this series when working through the 100 knocks
Getting Started with Stanford NLP in Python	Made the difference from Stanford CoreNLP easy to understand

Environment

Type	Version	Contents
OS	Ubuntu 18.04.01 LTS	Running in a virtual machine
pyenv	1.2.15	I use pyenv because I sometimes use multiple Python environments
Python	3.6.9	Python 3.6.9 on pyenv. There is no deep reason for not using the 3.7 or 3.8 series. Packages are managed using venv

Problem

Chapter 8: Machine Learning

In this chapter, we use [sentence polarity dataset v1.0](http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.README.1.0.txt) from the Movie Review Data published by Bo Pang and Lillian Lee to work on the task of classifying sentences as positive or negative (polarity analysis).

70. Obtaining and shaping data

Using the correct-answer data for sentence polarity analysis, create the correct-answer data (sentiment.txt) as follows.

1. Add the string "+1" to the beginning of each line in rt-polarity.pos (the polarity label "+1", followed by a space, followed by the content of the positive sentence)

2. Add the string "-1" to the beginning of each line in rt-polarity.neg (the polarity label "-1", followed by a space, followed by the content of the negative sentence)

3. Concatenate the contents of 1 and 2 above and rearrange the lines randomly

After creating sentiment.txt, check the number of positive examples (positive sentences) and the number of negative examples (negative sentences).

Precautions for the file to be read

The character encoding appears to be Windows-1252 rather than UTF-8 (I have not confirmed this properly, but I read the files the same way as "Amateur language processing 100 knocks: 70")
Some text contains characters with umlauts (such as "Ü") in addition to plain English letters
Basically all characters are in lowercase
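
To illustrate why the encoding matters here, the following is a minimal sketch using a made-up byte string (not an actual line from the dataset). It assumes the file really is Windows-1252, as noted above, and shows that the same bytes decode cleanly as cp1252 but are rejected as UTF-8.

```python
# Hypothetical sample: the bytes for a word starting with "ü" as stored in Windows-1252.
raw = b'\xfcber-violent'

# Decoding with cp1252 recovers the umlaut.
text = raw.decode('cp1252')
print(text)

# Decoding the same bytes as UTF-8 fails, because 0xFC is not a valid UTF-8 start byte.
try:
    raw.decode('utf-8')
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False
print(utf8_ok)

# Re-encoding as UTF-8 stores the umlaut as two bytes instead of one.
utf8_bytes = text.encode('utf-8')
print(utf8_bytes)
```

This is why the files need an explicit encoding when they are read, and why the byte length of the text changes once it is saved as UTF-8.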

Answer

Answer premise

The folder structure under the Jupyter Notebook directory is as follows. The original data has been extracted and placed there.

└── rt-polaritydata
    ├── rt-polarity.neg
    └── rt-polarity.pos

Answer program [070. Obtaining and shaping data.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/08.%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92/070.%E3%83%87%E3%83%BC%E3%82%BF%E3%81%AE%E5%85%A5%E6%89%8B%E3%83%BB%E6%95%B4%E5%BD%A2.ipynb)

import codecs
import random

FNAME_SMT = 'sentiment.txt'
pos_prefix = '+1'
neg_prefix = '-1'

result = []

def read_file(fname, prefix):
    # Not verified whether the built-in open() would also work here instead of codecs.open()
    with codecs.open(fname, encoding='cp1252') as file:  # Source encoding is Windows-1252
        # Prepend the polarity label and a space to each line
        return ['{0} {1}'.format(prefix, line.strip()) for line in file]

# Read the positive examples
result.extend(read_file('./rt-polaritydata/rt-polarity.pos', pos_prefix))

# Read the negative examples
result.extend(read_file('./rt-polaritydata/rt-polarity.neg', neg_prefix))

random.shuffle(result)

# Write as UTF-8 explicitly rather than relying on the platform default
with open(FNAME_SMT, 'w', encoding='utf-8') as file_out:
    file_out.write('\n'.join(result))

# Check the number of positive and negative examples
cnt_pos = 0
cnt_neg = 0
with open(FNAME_SMT, encoding='utf-8') as file:
    for line in file:
        if line.startswith(pos_prefix):
            cnt_pos += 1
        elif line.startswith(neg_prefix):
            cnt_neg += 1

print('pos:{}, neg:{}'.format(cnt_pos, cnt_neg))

Answer commentary

This is basically just reading and writing files, so there is not much worth mentioning. I use the `codecs` library to open the files, but that part is simply copied from "100 amateur language processing knocks: 70", so I have not verified whether the normal `open` function would also work. I did not want to use the `codecs` library again in the subsequent programs, however, so I saved the output in UTF-8. Even so, the characters containing umlauts are saved correctly.
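
For reference, Python 3's built-in `open` function accepts an `encoding` argument, so the conversion should also be possible without `codecs`. The following is a minimal sketch under that assumption. The function name `convert_to_utf8` and the sample file are hypothetical, and the sketch runs against a temporary stand-in file rather than the real dataset.

```python
import os
import tempfile

def convert_to_utf8(src_path, dst_path, src_encoding='cp1252'):
    """Read src_path with the given encoding and rewrite it as UTF-8."""
    # The built-in open() takes an encoding argument in Python 3.
    with open(src_path, encoding=src_encoding) as src:
        lines = [line.rstrip('\n') for line in src]
    with open(dst_path, 'w', encoding='utf-8') as dst:
        dst.write('\n'.join(lines))
    return lines

# Hypothetical stand-in for rt-polarity.pos: one cp1252 line containing an umlaut.
with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, 'sample.pos')
    dst_path = os.path.join(tmp, 'sample_utf8.txt')
    with open(src_path, 'wb') as f:
        f.write('an über-stylish film\n'.encode('cp1252'))

    convert_to_utf8(src_path, dst_path)

    # Read the result back as UTF-8 to confirm the umlaut survived.
    with open(dst_path, encoding='utf-8') as f:
        round_trip = f.read()

print(round_trip)
```

Passing the encoding explicitly on both the read and the write side avoids depending on the platform's default encoding.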

When executed, the final print call outputs the counts as shown below.

pos:5331, neg:5331
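
The same check can also be written with `collections.Counter`, keyed on the label at the start of each line. This is only a sketch: the helper name `count_labels` and the sample lines are made up for illustration, and it works on an in-memory list rather than on sentiment.txt itself.

```python
from collections import Counter

def count_labels(lines):
    """Count lines per polarity label (the text before the first space)."""
    return Counter(line.split(' ', 1)[0] for line in lines if line.strip())

# Hypothetical sample in the same "label text" format as sentiment.txt.
sample = [
    '+1 a gorgeous , witty , seductive movie .',
    '-1 a tedious and lifeless film .',
    '+1 simply delightful .',
]
counts = count_labels(sample)
print('pos:{}, neg:{}'.format(counts['+1'], counts['-1']))
```

Because the counter is keyed on whatever label appears, a line with an unexpected prefix would show up as its own key instead of being silently skipped.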

100 Language Processing Knock-70 (using Stanford NLP): Obtaining and shaping data