http://www.cl.ecei.tohoku.ac.jp/nlp100/
I'm a beginner doing my best to work through these exercises in Python (3.x). There are probably many similar articles, but I'm keeping this one as a personal memo. If you have any advice or suggestions, please leave a comment!
The source code is also posted on GitHub: https://github.com/hbkr/nlp100
Chapter1
000.py
s = "stressed"
print(s[::-1])
desserts
`s[i:j:k]` means the slice of `s` from `i` to `j` with step `k`, so `s[::-1]` walks from the end to the beginning one character at a time.
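To make the slice notation a bit more concrete, here is a small sketch of my own. The variable names are only illustrative and are not part of the exercise.

```python
s = "stressed"

# s[i:j:k]: start at index i, stop before index j, move k positions each time.
first_three = s[0:3]       # "str"
every_other = s[::2]       # "srse"
reversed_s = s[::-1]       # "desserts"

# With a negative step the defaults flip: the slice starts at the last
# character and runs toward the front, here stopping before index 3.
tail_reversed = s[:3:-1]   # "dess"

print(first_three, every_other, reversed_s, tail_reversed)
```

Omitting `i` or `j` means "as far as possible in that direction", which is why `s[::-1]` covers the whole string.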
001.py
# "パタトクカシーー" interleaves パトカー (patrol car) and タクシー (taxi)
s = "パタトクカシーー"
print(s[::2])
パトカー
As explained above, `s[::2]` takes every second character from the beginning to the end, skipping the characters in between.
002.py
s = "".join(i+j for i, j in zip("パトカー", "タクシー"))
print(s)
パタトクカシーー
`zip` lets you loop over several sequences at the same time. `sep.join(seq)` concatenates the elements of `seq` into one string, with `sep` as the delimiter. Here a generator expression is joined with an empty string.
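As a rough illustration of how `zip` and `join` fit together, here is a sketch with ASCII strings that I made up for the purpose (they are not from the exercise).

```python
a = "ace"
b = "bdf"

# zip pairs up elements position by position.
pairs = list(zip(a, b))          # [('a', 'b'), ('c', 'd'), ('e', 'f')]

# It stops as soon as the shorter input runs out.
short = list(zip("abc", "xy"))   # [('a', 'x'), ('b', 'y')]

# join accepts any iterable of strings, including a generator expression.
merged = "".join(x + y for x, y in pairs)   # "abcdef"
dashed = "-".join("abc")                    # "a-b-c"

print(pairs, short, merged, dashed)
```

Because `zip` silently truncates to the shorter input, the interleaving trick only works cleanly when both strings have the same length.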
003.py
s = "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."
count = [len(i.strip(",.")) for i in s.split()]
print(count)
[3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]
`str.split(sep)` divides the string into a list, using `sep` as the delimiter. If no delimiter is given, the string is split on runs of whitespace (spaces, tabs, newlines, and so on). The leading and trailing `,` and `.` are removed with `str.strip(",.")` before the characters are counted with `len()`.
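The behaviour of `split` and `strip` is easy to misremember, so here is a small sketch with made-up input strings.

```python
text = "Hello,  world.\tGood\nmorning, all."

# With no argument, split() cuts on any run of whitespace.
tokens = text.split()
# ['Hello,', 'world.', 'Good', 'morning,', 'all.']

# With an explicit separator every occurrence cuts, so empty strings can appear.
fields = "a,,b".split(",")
# ['a', '', 'b']

# strip(",.") removes any mix of ',' and '.' from both ends,
# but leaves the same characters alone inside the string.
cleaned = [t.strip(",.") for t in tokens]
inner_kept = "...e.g.,".strip(",.")
# 'e.g'

print(tokens, fields, cleaned, inner_kept)
```

Note that the argument of `strip` is a set of characters, not a prefix or suffix string.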
004.py
s = "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can."
dic = {word[:2-(i in (1,5,6,7,8,9,15,16,19))]:i for i, word in enumerate(s.replace(".", "").split(), 1)}
print(dic)
{'He': 2, 'K': 19, 'S': 16, 'Ar': 18, 'Si': 14, 'O': 8, 'F': 9, 'P': 15, 'Na': 11, 'Cl': 17, 'B': 5, 'Ca': 20, 'Ne': 10, 'Be': 4, 'N': 7, 'C': 6, 'Mi': 12, 'Li': 3, 'H': 1, 'Al': 13}
`enumerate(seq, start=0)` gives you both the index and the element. The index is passed straight to the `in` operator, and the resulting True or False (which counts as 1 or 0 in arithmetic) adjusts how many characters are taken.
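The one-liner is dense, so here is the same idea unrolled on a shorter word list. This is only a sketch: `single_letter_positions` is a hypothetical name, and the list is cut down to four words.

```python
words = ["Hi", "He", "Lied", "Because"]
single_letter_positions = (1,)   # hypothetical: only the 1st word keeps one letter

# enumerate(words, 1) yields (1, "Hi"), (2, "He"), ...
numbered = list(enumerate(words, 1))

# `i in single_letter_positions` is a bool, and bool is a subclass of int,
# so 2 - True == 1 and 2 - False == 2.
symbols = {}
for i, word in numbered:
    width = 2 - (i in single_letter_positions)
    symbols[word[:width]] = i

print(numbered)
print(symbols)
```

Writing `width = 1 if i in single_letter_positions else 2` would be more explicit, at the cost of a few more characters.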
005.py
def n_gram(s, n): return {tuple(s[i:i+n]) for i in range(len(s)-n+1)}
s = "I am an NLPer"
print(n_gram(s, 2))
print(n_gram([t.strip(".,") for t in s.split()], 2))
{('m', ' '), ('n', ' '), ('e', 'r'), ('N', 'L'), (' ', 'N'), ('a', 'm'), ('a', 'n'), ('L', 'P'), ('I', ' '), (' ', 'a'), ('P', 'e')}
{('an', 'NLPer'), ('I', 'am'), ('am', 'an')}
An N-gram is a run of N consecutive elements of a sequence: N consecutive characters for a character N-gram, N consecutive words for a word N-gram. The `n_gram(s, n)` function cuts the passed sequence `s` into overlapping pieces of `n` elements and returns them as a set. Because a set is returned, duplicate N-grams are collapsed into one, and the order of the output is not fixed.
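If the duplicates or the order matter, a list works instead of a set. The following is a sketch of that variant. `n_gram_list` is a hypothetical helper name of mine, not something from the exercise.

```python
from collections import Counter

def n_gram_list(seq, n):
    """Return every n-gram of seq in order, duplicates included."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

s = "I am an NLPer"
char_bigrams = n_gram_list(s, 2)
word_bigrams = n_gram_list(s.split(), 2)

# The list keeps all 12 character bigrams; a set keeps only 11,
# because (' ', 'a') occurs twice.
counts = Counter(char_bigrams)
print(len(char_bigrams), len(set(char_bigrams)))
print(counts[(" ", "a")])
print(word_bigrams)
```

The set version is convenient for 006.py, where only membership matters. The list version is closer to what you would want for counting frequencies.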
006.py
n_gram = lambda s, n: {tuple(s[i:i+n]) for i in range(len(s)-n+1)}
X = n_gram("paraparaparadise", 2)
Y = n_gram("paragraph", 2)
print("X: %s" % X)
print("Y: %s" % Y)
print("union: %s" % str(X|Y))
print("difference: %s" % str(X-Y))
print("intersection: %s" % str(X&Y))
if n_gram("se", 2) <= X: print("'se' is included in X.")
if n_gram("se", 2) <= Y: print("'se' is included in Y.")
X: {('a', 'd'), ('a', 'p'), ('i', 's'), ('s', 'e'), ('a', 'r'), ('p', 'a'), ('d', 'i'), ('r', 'a')}
Y: {('g', 'r'), ('p', 'h'), ('a', 'p'), ('a', 'r'), ('p', 'a'), ('r', 'a'), ('a', 'g')}
union: {('a', 'd'), ('g', 'r'), ('p', 'h'), ('a', 'p'), ('i', 's'), ('s', 'e'), ('a', 'r'), ('p', 'a'), ('d', 'i'), ('r', 'a'), ('a', 'g')}
difference: {('i', 's'), ('d', 'i'), ('a', 'd'), ('s', 'e')}
intersection: {('a', 'r'), ('p', 'a'), ('a', 'p'), ('r', 'a')}
'se' is included in X.
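I reuse the `n_gram` created in 005.py, but this time I wrote it as a `lambda` expression (since the problem didn't say to create a function this time). `X | Y` is the union, `X - Y` is the difference, and `X & Y` is the intersection. `<=` tests whether the left-hand set is a subset of the right-hand one, which is why nothing is printed for Y.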
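For reference, here is a sketch of the set operators on two small hand-made sets. The elements are plain strings chosen for readability, not real `n_gram` output.

```python
X = {"ap", "ar", "pa", "ra", "se"}
Y = {"ap", "ar", "pa", "ra", "gr"}

union = X | Y            # same as X.union(Y)
difference = X - Y       # elements of X that are not in Y
intersection = X & Y     # elements found in both
symmetric = X ^ Y        # elements found in exactly one of the two

# <= is the subset test; for a single element, `in` is the more direct check.
se_in_X = {"se"} <= X
se_in_Y = "se" in Y

print(union, difference, intersection, symmetric, se_in_X, se_in_Y)
```

With the tuples produced by `n_gram`, the membership check would look like `("s", "e") in X`.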
007.py
def f(x, y, z): return "%s time%s is%s" % (x, y, z)
print(f(12, "temperature", 22.4))
The temperature at 12:00 is 22.4
" {1} at {0} is {2} ".format (x, y, z)
is fine.
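For comparison, here are the common formatting styles side by side. This is only a sketch: the English templates are my own stand-ins, not the wording of the exercise.

```python
x, y, z = 12, "temperature", 22.4

# %-formatting fills the placeholders strictly in order.
percent_style = "%s o'clock: %s = %s" % (x, y, z)

# str.format can refer to arguments by position, so they can be reordered.
format_style = "{0} o'clock: {1} = {2}".format(x, y, z)
reordered = "{1} at {0} o'clock is {2}".format(x, y, z)

# f-strings (Python 3.6+) embed the expressions directly.
f_string = f"{x} o'clock: {y} = {z}"

print(percent_style)
print(reordered)
```

The positional indexes of `str.format` are handy when the word order of the sentence differs from the order of the arguments.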
008.py
def cipher(s): return "".join(chr(219-ord(c)) if c.islower() else c for c in s)
s = "Hi He Lied Because Boron Could Not Oxidize Fluorine."
print(cipher(s))
print(cipher(cipher(s)))
Hr Hv Lrvw Bvxzfhv Blilm Clfow Nlg Ocrwrav Foflirmv.
Hi He Lied Because Boron Could Not Oxidize Fluorine.
It seems that `"a" <= c <= "z"` can be used instead of `islower()`. Is that faster?
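I have not settled the speed question, but the two checks are not quite equivalent, and `timeit` can measure the difference on your own machine. The sketch below assumes two hypothetical function names, `cipher_islower` and `cipher_ascii`, for the two variants.

```python
import timeit

def cipher_islower(s):
    # Original approach: transform whatever str.islower() accepts.
    return "".join(chr(219 - ord(c)) if c.islower() else c for c in s)

def cipher_ascii(s):
    # Range check: only the 26 ASCII letters a to z are transformed.
    return "".join(chr(219 - ord(c)) if "a" <= c <= "z" else c for c in s)

sample = "Hi He Lied Because Boron Could Not Oxidize Fluorine."
same_on_ascii = cipher_islower(sample) == cipher_ascii(sample)

# 'é' counts as lowercase for islower(), but 219 - ord('é') is negative,
# so chr() raises ValueError. The range check simply leaves 'é' alone.
ascii_result = cipher_ascii("café")
try:
    cipher_islower("café")
    islower_failed = False
except ValueError:
    islower_failed = True

# Timing depends on the machine and Python version, so measure rather than guess.
t_islower = timeit.timeit(lambda: cipher_islower(sample), number=2000)
t_ascii = timeit.timeit(lambda: cipher_ascii(sample), number=2000)
print(same_on_ascii, ascii_result, islower_failed)
print(f"islower: {t_islower:.4f}s  range check: {t_ascii:.4f}s")
```

For the ASCII-only sentence of the exercise both versions give the same result, so the choice only matters for speed or for input containing other alphabets.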
009.py
from random import random
typo = lambda s: " ".join(t[0]+"".join(sorted(t[1:-1], key=lambda k:random()))+t[-1] if len(t) > 4 else t for t in s.split())
s = "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind ."
print(typo(s))
I cdnlu'ot blieeve that I culod aclualty uetdnnsard what I was rdeniag : the pnneehmoal pwoer of the huamn mind .
Somehow I got stubborn and insisted on doing it in one line. I'm using `sorted()` with a random key because `random.shuffle()` shuffles a list in place and has no return value.
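If the one-line constraint is dropped, `random.sample` is another way to get a shuffled copy as a return value. The following is a sketch of that alternative, not the code in the repository. `typo_word` is a hypothetical helper name.

```python
import random

letters = list("understand")

# random.shuffle reorders the list in place and returns None,
# so it cannot be used inside an expression.
shuffle_result = random.shuffle(letters)

def typo_word(word):
    """Shuffle the inner letters of words longer than 4 characters."""
    if len(word) <= 4:
        return word
    inner = list(word[1:-1])
    # random.sample(seq, len(seq)) returns a new list in random order.
    middle = random.sample(inner, len(inner))
    return word[0] + "".join(middle) + word[-1]

def typo(sentence):
    return " ".join(typo_word(w) for w in sentence.split())

print(shuffle_result)
print(typo("I couldn't believe that I could actually understand what I was reading"))
```

The output changes on every run, but each word keeps its first and last letter and the same set of inner letters.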