This is my record of problem 70 of Language Processing 100 Knock 2015. Until now my answers were almost the same as those in "Amateur language processing 100 knocks", so I had not been posting them to this blog. For "Chapter 8: Machine Learning", however, I took the time to work through the problems seriously and changed the answers to some extent, so I am posting them. I will mainly use Stanford NLP.
Link | Remarks |
---|---|
070.Obtaining and shaping data.ipynb | GitHub link to the answer program |
100 amateur language processing knocks:70 | I always rely on this series when working through the 100 knocks |
Getting Started with Stanford NLP in Python | Explains the difference from Stanford Core NLP clearly |
Type | Version | Notes |
---|---|---|
OS | Ubuntu18.04.01 LTS | Running in a virtual machine |
pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
Python | 3.6.9 | Python 3.6.9 on pyenv. There is no deep reason for not using the 3.7 or 3.8 series. Packages are managed with venv |
In this chapter, the [sentence polarity dataset v1.0](http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.README.1.0.txt) from the Movie Review Data published by Bo Pang and Lillian Lee is used to work on the task of classifying sentences as positive or negative (polarity analysis).
Using the labeled data for sentence polarity analysis, create the correct answer data (sentiment.txt) as follows.
1. Add the string "+1 " to the beginning of each line in rt-polarity.pos (the polarity label "+1", followed by a space, followed by the positive sentence)
2. Add the string "-1 " to the beginning of each line in rt-polarity.neg (the polarity label "-1", followed by a space, followed by the negative sentence)
3. Concatenate the contents of 1 and 2 above and shuffle the lines randomly
After creating sentiment.txt, check the number of positive examples (positive sentences) and the number of negative examples (negative sentences).
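To make the target line format concrete, here is a minimal sketch of the three steps on toy data. The helper name `build_labeled_lines`, the sample sentences, and the `seed` parameter are my own assumptions for illustration; the task itself only asks for a random order, and the seed is there only so that the shuffle can be reproduced.

```python
import random


def build_labeled_lines(pos_lines, neg_lines, seed=None):
    """Prefix each sentence with its polarity label, then shuffle the lines."""
    labeled = ['+1 {}'.format(s.strip()) for s in pos_lines]
    labeled += ['-1 {}'.format(s.strip()) for s in neg_lines]
    # A dedicated Random instance leaves the global random state untouched.
    # Passing a seed makes the order reproducible (an assumption, not a task requirement).
    random.Random(seed).shuffle(labeled)
    return labeled


# Made-up sentences, not taken from the dataset.
sample_pos = ['a warm and funny film .\n']
sample_neg = ['dull and far too long .\n', 'the plot makes no sense .\n']

lines = build_labeled_lines(sample_pos, sample_neg, seed=0)
for line in lines:
    print(line)
```

Each printed line starts with the label, then a single space, then the sentence, which is the format the later problems in this chapter read back.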
Notes on the files to be read
The folder structure under the Jupyter Notebook directory is as follows. The original data has been unzipped and placed there.

```
└── rt-polaritydata
    ├── rt-polarity.neg
    └── rt-polarity.pos
```
```python
import codecs
import random

FNAME_SMT = 'sentiment.txt'
pos_prefix = '+1'
neg_prefix = '-1'

result = []


def read_file(fname, prefix):
    # Not verified whether the built-in open() can read this without codecs
    with codecs.open(fname, encoding='cp1252') as file:  # Encoding is Windows-1252
        return ['{0} {1}'.format(prefix, line.strip()) for line in file]


# Read the positive examples
result.extend(read_file('./rt-polaritydata/rt-polarity.pos', pos_prefix))

# Read the negative examples
result.extend(read_file('./rt-polaritydata/rt-polarity.neg', neg_prefix))

random.shuffle(result)

with open(FNAME_SMT, 'w') as file_out:
    file_out.write('\n'.join(result))

# Check the counts
cnt_pos = 0
cnt_neg = 0

with open(FNAME_SMT) as file:
    for line in file:
        if line.startswith(pos_prefix):
            cnt_pos += 1
        elif line.startswith(neg_prefix):
            cnt_neg += 1

print('pos:{}, neg:{}'.format(cnt_pos, cnt_neg))
```
This is basically just reading and writing files, so there is not much worth mentioning.
I'm using the `codecs` library to open the files, but this is simply copied from the relevant part of 100 amateur language processing knocks: 70, so I have not verified whether the same is possible with the normal `open` function. However, I didn't want to use the `codecs` library again in the subsequent programs, so I saved the output in UTF-8. Even so, the characters containing umlauts are saved correctly.
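On the question of the normal `open` function: in Python 3 the built-in `open` accepts an `encoding` argument, so it should be able to read Windows-1252 text without `codecs`. The following is a minimal sketch that checks this on a temporary file rather than on the actual dataset. The sample string and the file names are made up for illustration.

```python
import codecs
import os
import tempfile

# Made-up text with umlauts and accents, standing in for the dataset content.
sample = 'über-naïve déjà vu'

with tempfile.TemporaryDirectory() as tmp_dir:
    src_path = os.path.join(tmp_dir, 'sample_cp1252.txt')
    dst_path = os.path.join(tmp_dir, 'sample_utf8.txt')

    # Write the sample as Windows-1252 bytes to mimic the original files.
    with open(src_path, 'wb') as f:
        f.write(sample.encode('cp1252'))

    # Read it once with codecs.open and once with the built-in open.
    with codecs.open(src_path, encoding='cp1252') as f:
        via_codecs = f.read()
    with open(src_path, encoding='cp1252') as f:
        via_open = f.read()

    # Save as UTF-8 explicitly instead of relying on the locale default.
    with open(dst_path, 'w', encoding='utf-8') as f:
        f.write(via_open)
    with open(dst_path, encoding='utf-8') as f:
        round_trip = f.read()

print(via_codecs == via_open == sample)
print(round_trip == sample)
```

One difference I am aware of is newline handling: `codecs.open` opens the underlying file in binary mode and does not translate line endings, while the built-in `open` does so by default. Since each line is passed through `strip()` in the program above, this should not change the result there.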
When executed, the last `print` function outputs the counts as shown below.
pos:5331, neg:5331
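As an alternative way to do the same check, the labels can be tallied with `collections.Counter`. This is only a sketch under my own assumptions: the helper name `count_labels` and the sample lines are made up, and it relies on each line having the form "label, space, sentence".

```python
from collections import Counter


def count_labels(lines):
    """Count lines by polarity label (the text before the first space)."""
    return Counter(line.split(' ', 1)[0] for line in lines if line.strip())


# Made-up lines in the sentiment.txt format.
sample_lines = [
    '+1 a warm and funny film .',
    '-1 dull and far too long .',
    '-1 the plot makes no sense .',
]

counts = count_labels(sample_lines)
print('pos:{}, neg:{}'.format(counts['+1'], counts['-1']))
```

For the real file, an open file object can be passed in place of `sample_lines`, since the function only iterates over lines.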