100 Language Processing Knocks 2015

http://www.cl.ecei.tohoku.ac.jp/nlp100/

I'm a beginner working through these exercises in Python (3.x). There are probably many similar articles already, but I'm keeping this one as a personal memo. If you have any advice or suggestions, please leave a comment!

The source code is also on GitHub: https://github.com/hbkr/nlp100

Chapter 1

00. Reverse order of strings

Get the string in which the characters of "stressed" are arranged in reverse order (from the end to the beginning).

`000.py`


s = "stressed"
print(s[::-1])

desserts

s[i:j:k] is the slice of s from i to j with step k, so s[::-1] walks from the end to the beginning one character at a time.
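To make the three slice parameters concrete, here is a small sketch. The variable names are only for illustration and are not part of 000.py.

```python
s = "stressed"

# s[i:j:k] takes characters from index i up to (not including) j, stepping by k.
first_half = s[0:4]      # "stre"
every_other = s[::2]     # "srse"
reversed_s = s[::-1]     # "desserts"

# reversed() returns an iterator, so it needs join() to become a string again.
assert "".join(reversed(s)) == reversed_s

print(first_half, every_other, reversed_s)
```

Omitting i or j means "from the start" or "to the end", and with a negative step those defaults are swapped, which is why s[::-1] needs no explicit indices.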

01. "パタトクカシーー"

Take the 1st, 3rd, 5th, and 7th characters of the string "パタトクカシーー" (patatokukashii) and get the concatenated string.

`001.py`


s = "パタトクカシーー"
print(s[::2])

パトカー

As explained above, s[::2] takes every other character from the beginning to the end, which gives "パトカー" (patokaa, "police car").

02. "パトカー" + "タクシー" = "パタトクカシーー"

Obtain the string "パタトクカシーー" by alternately concatenating the characters of "パトカー" (police car) and "タクシー" (taxi) from the beginning.

`002.py`


s = "".join(i+j for i, j in zip("パトカー", "タクシー"))
print(s)

パタトクカシーー

zip lets you loop over multiple sequences at the same time. sep.join(seq) concatenates the elements of seq into one string with sep as the delimiter. Here the generator expression is joined with an empty string.
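As a rough sketch of how zip and join fit together, the snippet below uses the Japanese strings from the original problem (パトカー, "police car", and タクシー, "taxi"). The zip_longest part is my own addition for inputs of unequal length and is not needed for this exercise.

```python
from itertools import zip_longest

a, b = "パトカー", "タクシー"

# zip pairs up elements position by position and stops at the shorter input.
pairs = list(zip(a, b))

# join concatenates the strings produced by the generator expression.
interleaved = "".join(x + y for x, y in pairs)

# zip_longest continues to the end of the longer input, padding with fillvalue.
uneven = "".join(x + y for x, y in zip_longest("abc", "12345", fillvalue=""))

print(interleaved)  # パタトクカシーー
print(uneven)       # a1b2c345
```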

03. Pi

Break the sentence "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics." into words, and create a list of the number of (alphabetic) characters in each word, in order of appearance.

`003.py`


s = "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."
count = [len(i.strip(",.")) for i in s.split()]
print(count)

[3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]

str.split(sep) divides the string into a list with sep as the delimiter. If no delimiter is given, it splits on whitespace such as spaces, tabs, and newlines. The characters are counted with len() after str.strip(",.") removes any leading and trailing "," and ".".
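The following sketch shows the same split-then-strip idea on a shortened sentence. Using string.punctuation instead of ",." is an assumption on my part, to cover other punctuation marks as well.

```python
import string

s = "Now I need a drink, alcoholic of course."

# split() with no argument splits on any run of whitespace.
words = s.split()

# strip(chars) removes any of the given characters from both ends only;
# characters inside the word are left alone.
lengths = [len(w.strip(string.punctuation)) for w in words]

print(words)
print(lengths)  # [3, 1, 4, 1, 5, 9, 2, 6]
```

Because strip only works on the ends, a word such as "couldn't" keeps its apostrophe and would be counted as 8 characters.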

04. Element symbols

Break the sentence "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can." into words. Take the first character of the 1st, 5th, 6th, 7th, 8th, 9th, 15th, 16th, and 19th words and the first two characters of the other words, and create an associative array (dictionary or map type) from each extracted string to the position of the word (its number counting from the beginning).

`004.py`


s = "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can."
dic = {word[:2-(i in (1,5,6,7,8,9,15,16,19))]:i for i, word in enumerate(s.replace(".", "").split(), 1)}
print(dic)

{'He': 2, 'K': 19, 'S': 16, 'Ar': 18, 'Si': 14, 'O': 8, 'F': 9, 'P': 15, 'Na': 11, 'Cl': 17, 'B': 5, 'Ca': 20, 'Ne': 10, 'Be': 4, 'N': 7, 'C': 6, 'Mi': 12, 'Li': 3, 'H': 1, 'Al': 13}

`enumerate(seq[, start=0])` gives you both the index and the element. The index is passed straight to the `in` operator, and the resulting True or False (1 or 0 as an integer) is subtracted from 2 to adjust the number of characters taken.
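For readers who find the boolean arithmetic in the one-liner hard to follow, here is a longer sketch that should build the same dictionary. The names single and symbols are mine, not from 004.py.

```python
s = ("Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might "
     "Also Sign Peace Security Clause. Arthur King Can.")

# Positions (1-based) of the words that contribute only their first character.
single = {1, 5, 6, 7, 8, 9, 15, 16, 19}

symbols = {}
# enumerate(..., 1) starts counting at 1 instead of 0.
for pos, word in enumerate(s.replace(".", "").split(), 1):
    n = 1 if pos in single else 2
    symbols[word[:n]] = pos

print(symbols)
```

The twelfth word gives "Mi" rather than "Mg" because the rule simply takes the first two letters of "Might".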

05. n-gram

Create a function that builds n-grams from a given sequence (string, list, etc.). Use this function to get the word bi-grams and the character bi-grams of the sentence "I am an NLPer".

`005.py`


def n_gram(s, n): return {tuple(s[i:i+n]) for i in range(len(s)-n+1)}

s = "I am an NLPer"
print(n_gram(s, 2))
print(n_gram([t.strip(".,") for t in s.split()], 2))

{('m', ' '), ('n', ' '), ('e', 'r'), ('N', 'L'), (' ', 'N'), ('a', 'm'), ('a', 'n'), ('L', 'P'), ('I', ' '), (' ', 'a'), ('P', 'e')}
{('an', 'NLPer'), ('I', 'am'), ('am', 'an')}

An n-gram is a run of n consecutive elements of a sequence (characters of a string, words of a list, and so on), and n-grams are often used as index terms for text. The n_gram(s, n) function slices out every run of n consecutive elements of the sequence s and returns them as a set, so no element appears twice.
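Returning a set drops both the order and the duplicates. If those matter, a list-based variant is one option; the sketch below uses a hypothetical name n_gram_list to keep it apart from the function in 005.py.

```python
def n_gram_list(seq, n):
    """Return all runs of n consecutive elements, in order, duplicates kept."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

s = "I am an NLPer"

word_bigrams = n_gram_list(s.split(), 2)
char_bigrams = n_gram_list(s, 2)

print(word_bigrams)  # [('I', 'am'), ('am', 'an'), ('an', 'NLPer')]
print(len(char_bigrams), len(set(char_bigrams)))  # 12 11
```

The character bi-gram (' ', 'a') occurs twice in the sentence, which is why the list has 12 entries but the set only 11.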

06. Sets

Let X and Y be the sets of character bi-grams contained in "paraparaparadise" and "paragraph", respectively, and find the union, intersection, and difference of X and Y. In addition, check whether the bi-gram "se" is included in X and in Y.

`006.py`


n_gram = lambda s, n: {tuple(s[i:i+n]) for i in range(len(s)-n+1)}

X = n_gram("paraparaparadise", 2)
Y = n_gram("paragraph", 2)

print("X: %s" % X)
print("Y: %s" % Y)
print("union: %s" % str(X|Y))
print("difference: %s" % str(X-Y))
print("intersection: %s" % str(X&Y))

if n_gram("se", 2) <= X: print("'se' is included in X.")
if n_gram("se", 2) <= Y: print("'se' is included in Y.")

X: {('a', 'd'), ('a', 'p'), ('i', 's'), ('s', 'e'), ('a', 'r'), ('p', 'a'), ('d', 'i'), ('r', 'a')}
Y: {('g', 'r'), ('p', 'h'), ('a', 'p'), ('a', 'r'), ('p', 'a'), ('r', 'a'), ('a', 'g')}
union: {('a', 'd'), ('g', 'r'), ('p', 'h'), ('a', 'p'), ('i', 's'), ('s', 'e'), ('a', 'r'), ('p', 'a'), ('d', 'i'), ('r', 'a'), ('a', 'g')}
difference: {('i', 's'), ('d', 'i'), ('a', 'd'), ('s', 'e')}
intersection: {('a', 'r'), ('p', 'a'), ('a', 'p'), ('r', 'a')}
'se' is included in X.

I reuse the n_gram from 005.py, but this time as a lambda expression (the problem didn't say to create a function this time). X | Y is the union, X - Y is the difference, and X & Y is the intersection. n_gram("se", 2) <= X checks whether the one-element set {('s', 'e')} is a subset of X.
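A few more set operations may be worth seeing side by side. This is only a sketch: it redefines n_gram so that it runs on its own, and only_x, only_y, and se are illustrative names.

```python
def n_gram(s, n):
    return {tuple(s[i:i + n]) for i in range(len(s) - n + 1)}

X = n_gram("paraparaparadise", 2)
Y = n_gram("paragraph", 2)

# Set difference is not symmetric: X - Y and Y - X generally differ.
only_x = X - Y
only_y = Y - X

# X ^ Y (symmetric difference) is everything that is in exactly one of the sets.
assert X ^ Y == only_x | only_y

# A single bi-gram can also be tested directly with "in".
se = ("s", "e")
print(se in X, se in Y)  # True False
print(sorted(only_x))    # [('a', 'd'), ('d', 'i'), ('i', 's'), ('s', 'e')]
print(sorted(only_y))    # [('a', 'g'), ('g', 'r'), ('p', 'h')]
```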

07. Sentence generation by template

Implement a function that takes arguments x, y, z and returns the string "y at x is z". Then set x = 12, y = "temperature", z = 22.4 and check the result.

`007.py`


def f(x, y, z): return "%s at %s is %s" % (y, x, z)

print(f(12, "temperature", 22.4))

temperature at 12 is 22.4

"{1} at {0} is {2}".format(x, y, z) works just as well.

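For comparison, here are three ways to fill the same English template "y at x is z". The function names are made up for this sketch, and the f-string version assumes Python 3.6 or later.

```python
def f_percent(x, y, z):
    # printf-style formatting: arguments are consumed left to right.
    return "%s at %s is %s" % (y, x, z)

def f_format(x, y, z):
    # str.format: the numbers are positions in the argument list.
    return "{1} at {0} is {2}".format(x, y, z)

def f_fstring(x, y, z):
    # f-string: expressions are written directly inside the braces.
    return f"{y} at {x} is {z}"

for f in (f_percent, f_format, f_fstring):
    print(f(12, "temperature", 22.4))  # temperature at 12 is 22.4
```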
08. Ciphertext

Implement the function cipher that converts each character of the given string according to the following specification.

- Lowercase letters are replaced with the character whose code is (219 - character code).
- All other characters are output unchanged.

Use this function to encrypt and decrypt an English message.

`008.py`


def cipher(s): return "".join(chr(219-ord(c)) if c.islower() else c for c in s)

s = "Hi He Lied Because Boron Could Not Oxidize Fluorine."
print(cipher(s))
print(cipher(cipher(s)))

Hr Hv Lrvw Bvxzfhv Blilm Clfow Nlg Ocrwrav Foflirmv.
Hi He Lied Because Boron Could Not Oxidize Fluorine.

It seems that "a" <= c <= "z" can be used instead of `islower`. Is that faster?
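I have not measured the speed, but the two checks do behave differently on non-English text: islower() is also true for lowercase letters outside a-z, and for those 219 - ord(c) goes negative. The sketch below contrasts the two; cipher_islower and cipher_ascii are names I made up for the comparison.

```python
def cipher_islower(s):
    # Same logic as 008.py: relies on str.islower().
    return "".join(chr(219 - ord(c)) if c.islower() else c for c in s)

def cipher_ascii(s):
    # Only touches the ASCII letters a-z; 219 == ord("a") + ord("z"),
    # so a maps to z, b to y, and so on.
    return "".join(chr(219 - ord(c)) if "a" <= c <= "z" else c for c in s)

print(cipher_ascii("abc xyz"))      # zyx cba
print(cipher_ascii("caf\u00e9"))    # xzué (the é is left as it is)

try:
    cipher_islower("caf\u00e9")     # é is lowercase, and 219 - 233 < 0
except ValueError as e:
    print("islower version failed:", e)
```

For plain English input, as in this exercise, both versions give the same result.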

09. Typoglycemia

Create a program that, for a string of words separated by spaces, randomly rearranges the letters of each word while leaving its first and last letters in place. Words of length 4 or less are not rearranged. Give it a suitable English sentence (for example, "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind .") and check the result.

`009.py`


from random import random

typo = lambda s: " ".join(t[0]+"".join(sorted(t[1:-1], key=lambda k:random()))+t[-1] if len(t) > 4 else t for t in s.split())

s = "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind ."
print(typo(s))

I cdnlu'ot blieeve that I culod aclualty uetdnnsard what I was rdeniag : the pnneehmoal pwoer of the huamn mind .

I got a bit stubborn and squeezed it into one line. I use sorted() with a random key because shuffle() shuffles a list in place and has no return value.
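If the one-line constraint is dropped, random.shuffle can be used directly on a list of the inner letters. This is just one possible sketch, and the function name typo_shuffle is hypothetical.

```python
import random

def typo_shuffle(sentence):
    words = []
    for w in sentence.split():
        if len(w) > 4:
            middle = list(w[1:-1])   # shuffle needs a mutable sequence
            random.shuffle(middle)   # shuffles in place and returns None
            w = w[0] + "".join(middle) + w[-1]
        words.append(w)
    return " ".join(words)

s = "I couldn't believe that I could actually understand what I was reading"
print(typo_shuffle(s))
```

The output differs on every run, so no expected output is shown. Only the first and last letter of each long word are guaranteed to stay put.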
