100 Language Processing Knock: Chapter 1 Preparatory Movement

This is a record of working through "Chapter 1: Preparatory Movement" of Language Processing 100 Knock 2015 (tohoku.ac.jp/nlp100/#ch1). It is a review of what I did over a year ago. Looking at my code from that time again, I found many things to correct, which I take as a sign of my own growth; the amount of code has shrunk to about half of what I wrote back then. Now that I have some Python experience, I can say it is a **good tutorial for learning Python and language processing**. Each knock is lighter than those in the later chapters, so the name "preparatory movement" fits well.


| Type | Version | Contents |
|---|---|---|
| OS | Ubuntu 18.04.01 LTS | Running in a virtual machine |
| pyenv | 1.2.15 | I use pyenv because I sometimes use multiple Python environments |
| Python | 3.6.9 | I'm using Python 3.6.9 on pyenv. There is no deep reason for not using the 3.7 or 3.8 series. Packages are managed using venv |

Chapter 1: Preparatory Movement

Review some advanced topics in programming languages while working on subjects dealing with texts and strings.

Strings, Unicode, list type, dictionary type, set type, iterators, slices, random numbers

00. Reverse order of character strings

Get a string in which the characters of the string "stressed" are arranged in reverse (from the end to the beginning).

Answer: [000. Reverse order of strings.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/01.%E6%BA%96%E5%82%99%E9%81%8B%E5%8B%95/000.%E6%96%87%E5%AD%97%E5%88%97%E3%81%AE%E9%80%86%E9%A0%86.ipynb)

Specify a slice as [start:stop:step]; a negative step reverses the order.

python:000.Reverse order of strings.ipynb

print('stressed'[::-1])

Terminal output result

desserts

01. "Patatokukashi" (パタトクカシーー)

Take out the 1st, 3rd, 5th, and 7th characters of the string "パタトクカシーー" and get the concatenated string.

Answer: [001. "Patatokukashi".ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/01.%E6%BA%96%E5%82%99%E9%81%8B%E5%8B%95/001.%E3%83%91%E3%82%BF%E3%83%88%E3%82%AF%E3%82%AB%E3%82%B7%E3%83%BC%E3%83%BC.ipynb)

Specify the slice as [start:stop:step]: start at the first character, stop before the 8th, and take every second character. For the string "パタトクカシーー" this yields "パトカー" (police car).

python:001."Patatokukashi".ipynb

print('パタトクカシーー'[0:7:2])

Terminal output result

パトカー
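As a small aside, the following sketch shows how start, stop, and step interact on the same string. The variable names are my own and do not come from the notebook.

```python
# Illustrative sketch; variable names are assumptions, not from the notebook.
text = 'パタトクカシーー'

odd_chars = text[0:7:2]     # indices 0, 2, 4, 6
same_chars = text[::2]      # omitted start/stop cover the whole string
even_chars = text[1::2]     # indices 1, 3, 5, 7
reversed_text = text[::-1]  # a negative step walks from the end

print(odd_chars, same_chars, even_chars, reversed_text)
```

Taking the characters at the odd indices recovers the other interleaved word, which is the string used in the next knock.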

02. "Police car" (パトカー) + "Taxi" (タクシー) = "Patatokukashi" (パタトクカシーー)

Obtain the string "パタトクカシーー" by alternately concatenating the characters of "パトカー" (police car) and "タクシー" (taxi) from the beginning.

Answer: [002. "Police car" + "Taxi" = "Patatokukashi".ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/01.%E6%BA%96%E5%82%99%E9%81%8B%E5%8B%95/002.%E3%80%8C%E3%83%91%E3%83%88%E3%82%AB%E3%83%BC%E3%80%8D%EF%BC%8B%E3%80%8C%E3%82%BF%E3%82%AF%E3%82%B7%E3%83%BC%E3%80%8D%EF%BC%9D%E3%80%8C%E3%83%91%E3%82%BF%E3%83%88%E3%82%AF%E3%82%AB%E3%82%B7%E3%83%BC%E3%83%BC%E3%80%8Dipynb)

Use the zip function to loop over the two words "パトカー" (police car) and "タクシー" (taxi) together, and build the list ['パタ', 'トク', 'カシ', 'ーー'] with a list comprehension. Then connect the list with the join function and print it. I understand the zip function in my head, but I have not used anything like it in other languages, so it is hard for me to come up with the idea of using it.

python:002."Police car" + "taxi" = "patatokukashi".ipynb

result = [char1+char2 for char1, char2 in zip('パトカー', 'タクシー')]
print(''.join(result))

Terminal output result

パタトクカシーー
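To make the pairing step visible, here is a minimal sketch that prints what zip produces before the pairs are joined. The variable names are assumptions for illustration only.

```python
# Illustrative sketch; variable names are assumptions, not from the notebook.
word1, word2 = 'パトカー', 'タクシー'

# zip yields one tuple per position and stops at the shorter input
pairs = list(zip(word1, word2))
merged = ''.join(a + b for a, b in pairs)

print(pairs)
print(merged)
```

Because zip stops at the shorter input, this approach only interleaves completely when both words have the same length, as they do here.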

03. Pi

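Break down the sentence "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics." into words, and create a list of the number of (alphabetic) characters in each word, in order of appearance from the beginning.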

Answer: [003. Pi.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/01.%E6%BA%96%E5%82%99%E9%81%8B%E5%8B%95/003.%E5%86%86%E5%91%A8%E7%8E%87.ipynb)

Use the split function to split on whitespace; it is very handy when processing English text. The strip function removes the commas and periods at the ends of words.


sentence = 'Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics.'
for word in sentence.split():
    print(len(word.strip(',.')), word.strip(',.'))

The character counts are the digits of pi (3.14159265358979).

Terminal output result

3 Now
1 I
4 need
1 a
5 drink
9 alcoholic
2 of
6 course
5 after
3 the
5 heavy
8 lectures
9 involving
7 quantum
9 mechanics
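The knock asks for a list rather than printed lines, so here is a hedged sketch of collecting the counts into a list with a list comprehension. The names `lengths` and `digits` are my own.

```python
# Illustrative sketch; `lengths` and `digits` are assumed names.
sentence = ('Now I need a drink, alcoholic of course, '
            'after the heavy lectures involving quantum mechanics.')

# Strip trailing commas/periods, then count the characters of each word
lengths = [len(word.strip(',.')) for word in sentence.split()]

# Join the counts into a decimal string to compare with pi
digits = ''.join(str(n) for n in lengths)
pi_text = digits[0] + '.' + digits[1:]

print(lengths)
print(pi_text)
```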

04. Element symbol

Break down the sentence "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can." into words. Take the first character of the 1st, 5th, 6th, 7th, 8th, 9th, 15th, 16th, and 19th words and the first two characters of the other words, and create an associative array (dictionary type or map type) from each extracted string to the position of the word (its number counting from the beginning).

Answer: [004. Element symbol.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/01.%E6%BA%96%E5%82%99%E9%81%8B%E5%8B%95/004.%E5%85%83%E7%B4%A0%E8%A8%98%E5%8F%B7.ipynb)

I use a dictionary comprehension (I struggled at first because I did not know how to combine it with an if expression). The result is sorted by word position so that the output follows the order of the element symbols. Finally, I used pprint for the output because I wanted a line break after each element.

python:004.Element symbol.ipynb

from pprint import pprint

sentence = 'Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can.'
word_list = sentence.split()

result = {word[0] if i in {1, 5, 6, 7, 8, 9, 15, 16, 19} else word[:2]: i for i, word in enumerate(word_list, 1)}
pprint(sorted(result.items(), key=lambda x: x[1]))

Terminal output result

[('H', 1),
 ('He', 2),
 ('Li', 3),
 ('Be', 4),
 ('B', 5),
 ('C', 6),
 ('N', 7),
 ('O', 8),
 ('F', 9),
 ('Ne', 10),
 ('Na', 11),
 ('Mi', 12),
 ('Al', 13),
 ('Si', 14),
 ('P', 15),
 ('S', 16),
 ('Cl', 17),
 ('Ar', 18),
 ('K', 19),
 ('Ca', 20)]
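Since combining the comprehension with a conditional was the tricky part, the following sketch spells out the same logic as an explicit loop. It is only an equivalent rewrite for illustration; the names `single_char_positions` and `symbols` are assumptions.

```python
# Illustrative sketch; names are assumptions, not from the notebook.
sentence = ('Hi He Lied Because Boron Could Not Oxidize Fluorine. '
            'New Nations Might Also Sign Peace Security Clause. Arthur King Can.')
single_char_positions = {1, 5, 6, 7, 8, 9, 15, 16, 19}

symbols = {}
for position, word in enumerate(sentence.split(), 1):
    # Conditional expression: one character for the listed positions, else two
    key = word[0] if position in single_char_positions else word[:2]
    symbols[key] = position

for key, position in sorted(symbols.items(), key=lambda item: item[1]):
    print(position, key)
```

In the comprehension form, the conditional expression sits in the key position, before the colon, which is what the loop body above makes explicit.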

05. n-gram

Create a function that creates an n-gram from a given sequence (string, list, etc.). Use this function to get the word bi-gram and the letter bi-gram from the sentence "I am an NLPer".

Answer: [005.n-gram.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/01.%E6%BA%96%E5%82%99%E9%81%8B%E5%8B%95/005.n-gram.ipynb)

The only new technical element here is probably the use of range in a for loop.


def generate_ngram(sentence):
    # Split on spaces into a list of words
    words = sentence.split()
    # Remove whitespace
    chars = sentence.replace(' ', '')
    # Generate word bi-grams
    bigram_word = [words[i-1] + ' ' + words[i] for i in range(1, len(words))]
    # Generate character bi-grams
    bigram_char = [chars[i-1] + chars[i] for i in range(1, len(chars))]
    return bigram_word, bigram_char

print(generate_ngram('I am an NLPer'))

Terminal output result

(['I am', 'am an', 'an NLPer'], ['Ia', 'am', 'ma', 'an', 'nN', 'NL', 'LP', 'Pe', 'er'])
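The function above is fixed to bi-grams. As a hedged sketch of the more general function the knock describes, the version below takes any sliceable sequence and a value of n. The name `ngram` and its signature are my own, and removing spaces before building character n-grams is a choice carried over from the code above, not a requirement.

```python
# Illustrative sketch; `ngram` and its signature are assumptions.
def ngram(sequence, n):
    """Return all length-n slices of a string or list, in order."""
    return [sequence[i:i + n] for i in range(len(sequence) - n + 1)]

sentence = 'I am an NLPer'

# Word bi-grams: slices of the word list
word_bigrams = ngram(sentence.split(), 2)
# Character bi-grams: slices of the string with spaces removed
char_bigrams = ngram(sentence.replace(' ', ''), 2)

print(word_bigrams)
print(char_bigrams)
```

Slicing works on both lists and strings, so one function covers the word and character cases.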

06. Set

Find the sets of character bi-grams contained in "paraparaparadise" and "paragraph" as X and Y, respectively, and find the union, intersection, and difference of X and Y. In addition, find out whether the bi-gram 'se' is included in X and in Y.

Answer: [006. Set.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/01.%E6%BA%96%E5%82%99%E9%81%8B%E5%8B%95/006.%E9%9B%86%E5%90%88.ipynb)

Python has a set type, which makes it easy to compute the union, intersection, and difference.


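def generate_ngram(sentence):
    # Remove whitespace
    chars = sentence.replace(' ', '')
    # Generate character bi-grams
    bigram_char = [chars[i-1] + chars[i] for i in range(len(chars)) if i > 0]
    return bigram_char

bigram_x = set(generate_ngram('paraparaparadise'))
bigram_y = set(generate_ngram('paragraph'))

# Union
print(bigram_x.union(bigram_y))

# Intersection
print(bigram_x.intersection(bigram_y))

# Difference set
print(bigram_x.difference(bigram_y))

# Check whether 'se' is contained in X and in Y
search_word = {'se'}
print(search_word.issubset(bigram_x))
print(search_word.issubset(bigram_y))

Terminal output result

{'ag', 'ap', 'se', 'ra', 'is', 'pa', 'ad', 'ph', 'di', 'ar', 'gr'}
{'pa', 'ar', 'ap', 'ra'}
{'ad', 'se', 'di', 'is'}
True
False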
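The same set operations can also be written with operators. The sketch below is illustrative only: it starts from literal sets equal to X and Y so that it runs on its own, and the variable names are assumptions.

```python
# Illustrative sketch; the literal sets below are the bi-gram sets X and Y.
x = {'pa', 'ar', 'ra', 'ap', 'ad', 'di', 'is', 'se'}
y = {'pa', 'ar', 'ra', 'ag', 'gr', 'ap', 'ph'}

union = x | y         # same as x.union(y)
intersection = x & y  # same as x.intersection(y)
difference = x - y    # same as x.difference(y)

# Membership of a single element can be tested with `in`
se_in_x = 'se' in x
se_in_y = 'se' in y

print(union, intersection, difference, se_in_x, se_in_y)
```

Sets are unordered, so the printed element order can differ between runs.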


07. Sentence generation by template

Implement a function that takes arguments x, y, z and returns the string "y at x o'clock is z". Furthermore, set x = 12, y = "temperature", z = 22.4, and check the execution result.

Answer: [007. Sentence generation by template.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/01.%E6%BA%96%E5%82%99%E9%81%8B%E5%8B%95/007.%E3%83%86%E3%83%B3%E3%83%97%E3%83%AC%E3%83%BC%E3%83%88%E3%81%AB%E3%82%88%E3%82%8B%E6%96%87%E7%94%9F%E6%88%90.ipynb)

The strings are concatenated with +. The same sentence can also be built with str.format, for example 'The {} at {}:00 is {}'.format(y, x, z).

python:007.Sentence generation by template.ipynb

def create_sentence(x, y, z):
    return 'The ' + str(y) + ' at ' + str(x) + ':00 is ' + str(z)

print(create_sentence(12, 'temperature', 22.4))

Terminal output result

The temperature at 12:00 is 22.4
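For comparison, here is a minimal sketch of the same template written with str.format and with an f-string (available from Python 3.6). The function names are hypothetical, and the English wording of the template follows the output shown above.

```python
# Illustrative sketch; function names are assumptions.
def create_sentence_format(x, y, z):
    # Positional placeholders are filled in argument order
    return 'The {} at {}:00 is {}'.format(y, x, z)

def create_sentence_fstring(x, y, z):
    # f-strings embed the expressions directly (Python 3.6+)
    return f'The {y} at {x}:00 is {z}'

print(create_sentence_format(12, 'temperature', 22.4))
print(create_sentence_fstring(12, 'temperature', 22.4))
```

Both forms convert the numbers to strings automatically, so the explicit str() calls needed with + are not required.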

08. Ciphertext

Implement the function cipher that converts each character of the given character string with the following specifications.

- Lowercase letters are replaced with the character whose code is (219 - character code)
- Other characters are output as they are

Use this function to encrypt / decrypt English messages.

Answer: [008. Ciphertext.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/01.%E6%BA%96%E5%82%99%E9%81%8B%E5%8B%95/008.%E6%9A%97%E5%8F%B7%E6%96%87.ipynb)

"219 - character code" seems to mean the following. The character code of a is 97, and 219 - 97 = 122, which is the character code of z. Conversely, the character code of z is 122, and 219 - 122 = 97, which is a. In other words, the cipher replaces the lowercase letters a to z with z to a, in reverse order. The built-in functions ord and chr convert between characters and character codes. I considered using a list comprehension, but decided against it because having to call join at the end seemed like extra trouble.


def cipher(sentence):
    result = ''
    for char in sentence:
        if char.islower():
            result += chr(219-ord(char))
        else:
            result += char
    return result

print(cipher('I Am An Idiot'))

Terminal output result

I An Am Iwrlg
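Because 219 - (219 - code) = code, applying the cipher twice restores the original message, which is how the same function covers both encryption and decryption. The sketch below checks this, using the comprehension-plus-join form mentioned above. Restricting the test to 'a' through 'z' instead of islower() is my own assumption, made so that non-ASCII lowercase letters pass through unchanged.

```python
# Illustrative sketch; the ASCII range check is an assumption of this example.
def cipher(sentence):
    # Mirror a..z onto z..a; leave every other character as it is
    return ''.join(
        chr(219 - ord(char)) if 'a' <= char <= 'z' else char
        for char in sentence
    )

message = 'I Am An Idiot'
encrypted = cipher(message)
decrypted = cipher(encrypted)

print(encrypted)
print(decrypted)
```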

09. Typoglycemia

Create a program that, for a string of words separated by spaces, randomly rearranges the letters of each word other than the first and last. However, words with a length of 4 or less are not rearranged. Give it an appropriate English sentence (for example, "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind .") and check the execution result.

Answer: [009.Typoglycemia.ipynb](https://github.com/YoheiFukuhara/nlp100/blob/master/01.%E6%BA%96%E5%82%99%E9%81%8B%E5%8B%95/009.Typoglycemia.ipynb)

Typoglycemia is the phenomenon whereby words in a sentence can still be read correctly even when the order of the letters other than the first and last is changed.

I see, you really can read it somehow. The middle characters are shuffled using the shuffle function of the random module.


from random import shuffle

def typoglycemia(word):
    mid_chars = list(word[1:-1])
    # Shuffle the middle characters in place
    shuffle(mid_chars)
    return word[0] + ''.join(mid_chars) + word[-1]

sentence = "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind ."
' '.join([word if len(word) <= 4 else typoglycemia(word) for word in sentence.split(' ')])

Terminal output result

"I cul'dnot beilvee that I culod altualcy udnnrseatd what I was riadeng : the paemhnenol peowr of the hmuan mind ."

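Since the output changes on every run, it can help to check the properties that must always hold: each word keeps its first and last characters and the same multiset of letters. The sketch below is a hedged variant using a seeded random.Random instance and sample instead of shuffle. The seed value and the extra `rng` parameter are assumptions for illustration.

```python
import random

# Illustrative sketch; the `rng` parameter and the seed are assumptions.
def typoglycemia(word, rng):
    if len(word) <= 4:
        return word  # short words are left unchanged
    middle = list(word[1:-1])
    # sample() returns a new shuffled list and leaves `middle` untouched
    shuffled_middle = rng.sample(middle, len(middle))
    return word[0] + ''.join(shuffled_middle) + word[-1]

rng = random.Random(0)  # fixed seed so that repeated runs give the same result
sentence = ("I couldn't believe that I could actually understand "
            "what I was reading : the phenomenal power of the human mind .")
words = sentence.split(' ')
shuffled_words = [typoglycemia(word, rng) for word in words]

print(' '.join(shuffled_words))
```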
