100 Language Processing Knock-70 (using Stanford NLP): Obtaining and shaping data

This is the 70th record of Language Processing 100 Knock 2015. My answers have mostly been the same as those in "Amateur language processing 100 knocks", so I had not been posting them to my blog, but for "Chapter 8: Machine Learning" I took the time to work through the problems seriously and changed my answers to some extent, so I am posting them. I will mainly use Stanford NLP.

Reference links

| Link | Remarks |
|------|---------|
| [070. Obtaining and shaping data.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/08.%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92/070.%E3%83%87%E3%83%BC%E3%82%BF%E3%81%AE%E5%85%A5%E6%89%8B%E3%83%BB%E6%95%B4%E5%BD%A2.ipynb) | Answer program on GitHub |
| 100 amateur language processing knocks: 70 | The page I always rely on when working through the 100 knocks |
| Getting Started with Stanford NLP in Python | An easy-to-understand explanation of the difference from Stanford CoreNLP |

Environment

| Type | Version | Contents |
|------|---------|----------|
| OS | Ubuntu 18.04.01 LTS | Running on a virtual machine |
| pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.6.9 | python 3.6.9 on pyenv (there is no deep reason not to use the 3.7 or 3.8 series); packages are managed with venv |

Problem

Chapter 8: Machine Learning

In this chapter, the [sentence polarity dataset v1.0](http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.README.1.0.txt) of Movie Review Data published by Bo Pang and Lillian Lee is used to work on the task of classifying sentences as positive or negative (polarity analysis).

70. Obtaining and shaping data

Using the correct answer data for sentence polarity analysis, create correct answer data (sentiment.txt) as follows:

  1. Add the string "+1" to the beginning of each line of rt-polarity.pos (polarity label "+1", a space, then the content of the positive sentence)
  2. Add the string "-1" to the beginning of each line of rt-polarity.neg (polarity label "-1", a space, then the content of the negative sentence)
  3. Concatenate the results of 1 and 2 above and rearrange the lines randomly

After creating sentiment.txt, check the number of positive examples (positive sentences) and the number of negative examples (negative sentences).

Notes on the files to be read

  1. The character encoding appears to be WINDOWS-1252, not UTF-8 (I have not confirmed this rigorously, but I read the files the same way as "Amateur language processing 100 knocks: 70"; a quick way to check is sketched right after this list)
  2. Some sentences contain non-English characters such as umlauts (e.g. "Ü")
  3. Basically all characters are lowercase
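
As a quick check of the encoding mentioned in point 1, the files can be run through the third-party chardet package (an assumption on my part; install it with pip install chardet). This is only a sketch, and the detection result is itself a statistical guess:

import chardet

# Read the raw bytes and let chardet guess the encoding
with open('./rt-polaritydata/rt-polarity.pos', 'rb') as file:
    raw = file.read()

# detect() returns a dict such as {'encoding': ..., 'confidence': ..., 'language': ...}
print(chardet.detect(raw))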

Answer

Answer premise

The folder structure under the Jupyter Notebook directory is as follows; the original data is unzipped and placed here.

└── rt-polaritydata
    ├── rt-polarity.neg
    └── rt-polarity.pos

Answer program [070. Obtaining and shaping data.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/08.%E6%A9%9F%E6%A2%B0%E5%AD%A6%E7%BF%92/070.%E3%83%87%E3%83%BC%E3%82%BF%E3%81%AE%E5%85%A5%E6%89%8B%E3%83%BB%E6%95%B4%E5%BD%A2.ipynb)

import codecs
import random

FNAME_SMT = 'sentiment.txt'
pos_prefix = '+1'
neg_prefix = '-1'

result = []

def read_file(fname, prefix):
    # Not verified whether this can be read with the plain open() function instead of codecs
    with codecs.open(fname, encoding='cp1252') as file:  # the encoding is Windows-1252
        return ['{0} {1}'.format(prefix, line.strip()) for line in file]

# Read the positive examples
result.extend(read_file('./rt-polaritydata/rt-polarity.pos', pos_prefix))

# Read the negative examples
result.extend(read_file('./rt-polaritydata/rt-polarity.neg', neg_prefix))

random.shuffle(result)

with open(FNAME_SMT, 'w') as file_out:
    file_out.write('\n'.join(result))

# Count the positive and negative examples
cnt_pos = 0
cnt_neg = 0
with open(FNAME_SMT) as file:
    for line in file:
        if line.startswith(pos_prefix):
            cnt_pos += 1
        elif line.startswith(neg_prefix):
            cnt_neg += 1

print('pos:{}, neg:{}'.format(cnt_pos, cnt_neg))

Answer commentary

It is basically just reading and writing files, so there is not much worth mentioning. I use the `codecs` library to open the file, but this is simply copied from the relevant part of "100 amateur language processing knocks: 70", so I have not verified whether the plain `open` function would also work. However, since I did not want to use the `codecs` library again in the subsequent programs, I saved the output in UTF-8. Even so, the characters containing umlauts are saved correctly.
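
For what it's worth, in Python 3 the built-in `open` function also accepts an encoding argument, so the following variant should read the files the same way without `codecs`. A minimal sketch that I have not tested against these exact files:

def read_file(fname, prefix):
    # Built-in open() takes an encoding argument in Python 3
    with open(fname, encoding='cp1252') as file:
        return ['{0} {1}'.format(prefix, line.strip()) for line in file]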

When executed, the final print function outputs the counts as shown below.

pos:5331, neg:5331
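
As an aside, the same counts can also be obtained in a few lines with collections.Counter. A minimal sketch, assuming sentiment.txt was written as above:

from collections import Counter

# Count lines by their polarity label (the token before the first space)
with open('sentiment.txt') as file:
    counts = Counter(line.split(' ', 1)[0] for line in file)

print(counts)  # e.g. Counter({'+1': 5331, '-1': 5331})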
