This is a record of knock 51, "Cutting out words", from "Chapter 6: Processing English Text" of Language Processing 100 Knocks 2015 (.tohoku.ac.jp/nlp100/#ch6). Technically it is almost the same as the previous knock: a simple one that takes fewer than 10 lines of code.

Reference link

Link	Remarks
051.Cut out words.ipynb	GitHub link to the answer program
100 amateur language processing knocks:51	Source from which many parts of the code were copied

Environment

Type	Version	Notes
OS	Ubuntu 18.04.01 LTS	Running as a virtual machine
pyenv	1.2.16	Used because I sometimes work with multiple Python environments
Python	3.8.1	Python 3.8.1 on pyenv; packages are managed with venv

Chapter 6: Processing English Text

Content of study

Get an overview of the basic techniques of natural language processing by processing English text with Stanford Core NLP.

Keywords: Stanford Core NLP, stemming, part-of-speech tagging, named entity recognition, coreference resolution, dependency parsing, phrase structure analysis, S-expressions

Knock content

For the English text (nlp.txt), execute the following processing.

51. Cutting out words

Treating whitespace as the word delimiter, take the output of knock 50 as input and output one word per line. Output a blank line at the end of each sentence.

Answer

Answer program: [051. Word Clipping.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/06.%E8%8B%B1%E8%AA%9E%E3%83%86%E3%82%AD%E3%82%B9%E3%83%88%E3%81%AE%E5%87%A6%E7%90%86/051.%E5%8D%98%E8%AA%9E%E3%81%AE%E5%88%87%E3%82%8A%E5%87%BA%E3%81%97.ipynb)

import re

with open('./050.result.txt') as file_in, \
     open('./051.result.txt', 'w') as file_out:
    for line in file_in:
        # Skip blank lines in the knock 50 output
        if line != '\n':
            line = re.sub(r'''
                         [.;:?!,]*   # zero or more of . ; : ? ! ,
                         \s          # one whitespace character
                       ''', '\n', line, flags=re.VERBOSE)
            # print() appends a newline, giving the blank line after each sentence
            print(line, file=file_out)

Answer commentary

Regular expressions

As in the previous knock, the processing uses regular expressions. This time, each whitespace character is replaced with a line break. It is simpler than before because no lookahead or lookbehind assertions are needed. Any punctuation immediately before the whitespace is removed along with it.
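
To see the substitution in isolation, here is a minimal sketch that applies a pattern of the same shape to an in-memory string instead of a file. The helper name `cut_words` and the sample sentence are assumptions made for illustration, and the character class is written without `|` separators, since `|` inside `[...]` is a literal character rather than an alternation.

```python
import re

# Optional punctuation followed by one whitespace character is replaced
# by a newline. `cut_words` is a hypothetical helper name, not part of
# the original notebook.
PATTERN = re.compile(r'''
    [.;:?!,]*   # zero or more of . ; : ? ! ,
    \s          # one whitespace character (space, tab, newline, ...)
    ''', re.VERBOSE)


def cut_words(line):
    """Return `line` with each word on its own line."""
    return PATTERN.sub('\n', line)


sample = 'From Wikipedia, the free encyclopedia\n'
result = cut_words(sample)
print(result)
```

The trailing newline of the input line is itself matched by `\s` and kept as a newline, so when `print` appends its own newline the result is the blank line that separates sentences.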

Output result (execution result)

When the program is executed, the following result (excerpt from the first 20 lines) is output.

051.result.txt (excerpt from the first 20 lines)


Natural
language
processing

From
Wikipedia
the
free
encyclopedia

Natural
language
processing
(NLP)
is
a
field
of
computer
science
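
One way to sanity-check output in this format is to read it back and regroup the words into sentences, using the blank lines as separators. The sketch below works on an in-memory string rather than on `051.result.txt`; the helper name `group_sentences` and the sample text are assumptions for illustration.

```python
def group_sentences(text):
    """Split one-word-per-line text into lists of words, one list per sentence.

    A blank line marks the end of a sentence. `group_sentences` is a
    hypothetical helper, not part of the original notebook.
    """
    sentences = []
    current = []
    for line in text.splitlines():
        word = line.strip()
        if word:
            current.append(word)
        elif current:
            # Blank line: close the sentence collected so far.
            sentences.append(current)
            current = []
    if current:
        # Handle a final sentence that is not followed by a blank line.
        sentences.append(current)
    return sentences


sample = ('Natural\nlanguage\nprocessing\n\n'
          'From\nWikipedia\nthe\nfree\nencyclopedia\n\n')
for words in group_sentences(sample):
    print(len(words), ' '.join(words))
```

Each printed line shows the word count of one sentence followed by its words, which makes it easy to compare against the corresponding line of the knock 50 output.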
