This article is for people who want to start natural language processing with Python and who have some understanding of C but little knowledge of Python. We aim for the shortest route to solving Chapter 1 of 100 Language Processing Knock 2015, which is well known as an introduction to natural language processing. (4/8 postscript: Chapter 1 is the same in the 2020 version.)
There are already many articles on Qiita with example answers to the 100 knocks, but their explanations are not very thorough and I thought they would be difficult for Python beginners, so I wrote this article.
The official documentation for Python is pretty kind, and I think you can study on your own by reading the Tutorial, but in this article I would like to touch only the items necessary to solve 100 knocks.
Let's do our best.
To install Python 3, run `$ brew install python3` on macOS or `$ sudo apt install python3.7 python3.7-dev` on Ubuntu.
On plain Windows, it seems easiest to refer to the article "Python installation (Win10)".
(Google Colaboratory is also fine.)
It is OK if you can start Python 3 by typing a command such as `$ python3` or `$ python3.7` on the command line. Python is now running in interactive mode. In this mode, enter an expression and press Enter, and the result of evaluating the expression is displayed.
>>> 1+2
3
One of the features of Python is "dynamic typing". Unlike C, you don't have to declare the type of a variable, and a variable that held an integer (int) can later be assigned a floating point number (float).
>>> a = 1
>>> a = 5/2
>>> a
2.5
First, let's see what types (built-in types) are available in standard Python. Numeric types such as the int and float mentioned in the above example are among them. You can solve Chapter 1 of the 100 knocks if you know only the following types.
- String (text sequence)
- List
- Set
- Dictionary (mapping)
Since we are doing language processing, we will start with the string type. To write a string in Python, just enclose it in `'` or `"`! Japanese works perfectly too. You can also easily concatenate strings.
>>> "Welcome"
'Welcome'
>>> 'hoge' + 'fuga'
'hogefuga'
- You can access characters with subscripts, like an array in C.
- Negative subscripts let you access characters from the end of the string.
>>> word = 'Python'
>>> word[0]
'P'
>>> word[-1]
'n'
- You can easily get a substring by using a "slice"!
- `word[i:j]` gets the elements from the `i`th up to, but not including, the `j`th.
- If you omit `i` or `j`, it means the corresponding end of the string (the beginning for `i`, the end for `j`).
>>> word[1:4]
'yth'
>>> word[2:-1]
'tho'
>>> word[:2]
'Py'
>>> word[2:]
'thon'
- With `word[i:j:k]`, you can get every `k`th element from the `i`th up to, but not including, the `j`th.
>>> word[1:5:2]
'yh'
>>> word[::2]
'Pto'
>>> word[::-2]
'nhy'
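The slice rules above can be checked with a few extra cases. This is only a small sketch: the variable names are made up for illustration, and it relies on the behavior that slice bounds beyond the end of the string are clamped rather than raising an error.

```python
word = 'Python'

# A negative subscript counts from the end: -1 is the last character.
last_char = word[-1]
same_char = word[len(word) - 1]

# Omitted bounds default to the two ends, so these two halves rebuild the word.
rebuilt = word[:2] + word[2:]

# Slice bounds beyond the end are clamped instead of raising an error.
clamped = word[2:100]

print(last_char, same_char, rebuilt, clamped)  # n n Python thon
```

By contrast, a single out-of-range subscript such as `word[100]` raises an `IndexError`.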
Let's make full use of slices.
Get a string in which the characters of the string "stressed" are arranged in reverse (from the end to the beginning).
Below is an example of the answer.
nlp00.py
word = 'stressed'
word[::-1]
'desserts'
Take out the 1st, 3rd, 5th, and 7th characters of the string "パタトクカシーー" and get the concatenated string.
Below is an example of the answer.
nlp01.py
word = 'パタトクカシーー'
word[::2]
'パトカー'
You can think of the list type as a powered-up version of the arrays you learned in C. Write one as follows.
squares = [1, 4, 9, 16, 25]
Write an empty list as follows:
empty = []
The list type can be subscripted and sliced in the same way as the string type. Such built-in types are collectively called **sequence types**.
>>> squares[:3]
[1, 4, 9]
In Python, there are functions dedicated to each data type, which are called **methods**.
For example, the `append` method of the list type adds an element to the list. To call it, write `list.append()` as follows.
>>> squares.append(36)
>>> squares
[1, 4, 9, 16, 25, 36]
I will also introduce some string methods.
- `x.split(sep)`: Creates a list by splitting the string `x` at `sep`.
- `sep.join(list)`: Creates a string by joining the elements of `list` with `sep`.
- `x.strip(chars)`: Returns a string with `chars` removed from both ends of `x`.
- `x.rstrip(chars)`: Returns a string with `chars` removed from the right end of `x`.
**Note: when the arguments of `split()`, `strip()`, and `rstrip()` are omitted, they mean "any whitespace character".**
>>> 'I have a pen.'.split(' ')
['I', 'have', 'a', 'pen.']
>>> ' '.join(['I', 'have', 'a', 'pen.'])
'I have a pen.'
It may be a little hard to remember that `join()` is a string method rather than a list method. To summarize the discussion on Stack Overflow, this is because it cannot be applied to a list whose elements are not all strings.
>>> 'ehoge'.strip('e')
'hog'
>>> 'ehoge'.rstrip('e')
'ehog'
>>> 'ehoge'.rstrip('eg') #It means to remove e or g from the right end as much as possible.
'eho'
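The note about omitted arguments matters more than it may seem, because `split(' ')` and `split()` behave differently when spaces are repeated. The following is a small sketch with a made-up input string; the variable names are only for illustration.

```python
text = 'I  have a pen. '   # note the double space and the trailing space

by_space = text.split(' ')   # splits at every single space
by_default = text.split()    # splits at runs of whitespace and ignores the ends

print(by_space)     # ['I', '', 'have', 'a', 'pen.', '']
print(by_default)   # ['I', 'have', 'a', 'pen.']

# With no argument, strip() removes whitespace, including newlines.
stripped = '  hoge \n'.strip()
print(stripped)     # hoge
```

For splitting ordinary sentences into words, the argument-free form is usually the safer choice.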
Now that you understand lists, let's look at the for statement, which is used to repeat something. For example, to calculate the sum of the elements of an array in C, you would write the following.
int i;
int squares[6] = {1, 4, 9, 16, 25, 36};
int total = 0;
for(i = 0; i < 6; i++) {
    total += squares[i];
}
The Python for statement looks like this.
total = 0
for square in squares:
    total += square
Each iteration takes one element of `squares` and assigns it to `square`. In other words, `square = 1`, `square = 4`, and so on.
Written formally, the for statement looks like this.
for variable_for_each_element in list:
    processing
I glossed over what comes immediately after `in` by calling it a "list", but a string is also OK (because each character can be regarded as an element of the string). The general name for things that can be placed immediately after `in` in a for statement is an **iterable (object)**.
Indentation was optional in C, but it is mandatory in Python!
print()
Let's actually look at the value of the variable inside the for loop. In interactive mode, variables inside a for block are not evaluated and displayed automatically, so we use the `print()` function. Arguments are written to standard output even if they are not strings. You can use it without anything like `#include` in C. Such functions are called **built-in functions**.
for square in squares:
    print(square)
1
4
9
16
25
36
As you can see, Python's `print()` function automatically adds a line break (by default).
To prevent the line break, specify the optional argument `end` as follows.
for square in squares:
    print(square, end=' ')
1 4 9 16 25 36
len()
Here are some useful built-in functions. `len()` returns the length of a list, string, etc.
>>> len(squares)
6
>>> len('AIUEO')
5
range()
`range()` is generally used when you want to run a for loop n times.
for i in range(3):
    print(i)
0
1
2
`range()` returns a range object, which is a kind of sequence type. However, in practice it is used almost only in for statements or converted to a list.
>>> range(3)
range(0, 3)
>>> list(range(4))
[0, 1, 2, 3]
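Like a slice, `range()` also accepts a start, a stop (exclusive), and a step. The following sketch shows this with made-up variable names; converting to a list is only done here to make the contents visible.

```python
# range(start, stop, step): stop is not included, just like the j of a slice.
evens = list(range(0, 10, 2))
print(evens)      # [0, 2, 4, 6, 8]

# A negative step counts down.
countdown = list(range(5, 0, -1))
print(countdown)  # [5, 4, 3, 2, 1]
```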
Let's also try saving the source code to a file and executing it.
For example, if you save a file containing just `print('Hello World')` as `hello.py` and type `$ python3 hello.py` on the command line, you should see `Hello World`.
With that, let's continue with the 100 knocks.
Obtain the string "パタトクカシーー" by alternately concatenating the characters of "パトカー" ("police car") and "タクシー" ("taxi") from the beginning.
Let's do it using only what we have learned so far. Below is an example of the answer.
nlp02.py
str_a = 'パトカー'
str_b = 'タクシー'
for i in range(len(str_a)):
    print(str_a[i]+str_b[i], end='')
パタトクカシーー
In such a case, you can write it more simply by using the built-in function `zip()`. This function pairs up the i-th elements of the iterable objects passed as arguments.
>>> list(zip(str_a, str_b))
[('パ', 'タ'), ('ト', 'ク'), ('カ', 'シ'), ('ー', 'ー')]
So the above code can be rewritten as follows. This is a very frequently used function.
for a, b in zip(str_a, str_b):
    print(a+b, end='')
パタトクカシーー
By the way, the name seems to come from fastening a zipper, and has nothing to do with the `ZIP` file format.
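One detail worth knowing is what `zip()` does when its arguments have different lengths. The sketch below uses made-up strings of unequal length, not the strings from the problem.

```python
str_a = 'abcde'
str_b = 'xyz'

# zip() stops at the end of the shorter string.
pairs = list(zip(str_a, str_b))
print(pairs)   # [('a', 'x'), ('b', 'y'), ('c', 'z')]

merged = ''
for a, b in zip(str_a, str_b):
    merged += a + b
print(merged)  # axbycz
```

An index-based loop over `range(len(str_a))` would instead raise an `IndexError` on `str_b[3]` here, so the two styles only agree when the lengths match.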
If you did not think "What is this comma?" after seeing the above explanation, skip this section and solve problem 03.
The object that `zip()` yields in each iteration is like a list with `str_a[i]` and `str_b[i]` as elements. To be precise, it is a tuple, an immutable version of a list.
Tuples are a kind of sequence, so you can access them with subscripts, but you cannot add elements with `append()`. In Python, things that can be changed are called mutable, and things that cannot be changed are called immutable. I have glossed over this so far, but lists are mutable, and strings and tuples are immutable.
>>> a = 'abc'
>>> a[1] = 'a'
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-56-ae17b2fd35d6> in <module>
1 a = 'abc'
----> 2 a[1] = 'a'
TypeError: 'str' object does not support item assignment
>>> a = [0, 9, 2]
>>> a[1] = 1
>>> a
[0, 1, 2]
Tuples are written by enclosing them in `()`, but usually the outer `()` can be omitted (it cannot be omitted inside the `()` where function arguments are written).
>>> 1, 2, 3
(1, 2, 3)
Using the rule that the `()` of a tuple can be omitted, assigning multiple values separated by `,` to one variable is called tuple packing. Conversely, assigning a sequence to multiple variables at once is called sequence unpacking.
>>> a = 'a'
>>> b = 'b'
>>> t = a, b
>>> t
('a', 'b')
>>> x, y = t
>>> print(x)
a
>>> print(y)
b
That was a long detour, but now you know what `for a, b in zip(str_a, str_b):` is. The tuple yielded by `zip()` in each iteration is unpacked into the variables `a` and `b`.
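Packing and unpacking can also be combined in a single statement. The following sketch shows a common idiom (swapping two variables) and unpacking in a for loop; the variable names and values are made up for illustration.

```python
a = 'left'
b = 'right'

# The right side packs (b, a) into a tuple; the left side unpacks it.
a, b = b, a
print(a, b)   # right left

# Each element of the list is a tuple, unpacked into word and count.
pairs = [('I', 141), ('you', 112)]
total = 0
for word, count in pairs:
    total += count
print(total)  # 253
```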
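Let's do it.
Break down the sentence "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics." into words, and create a list of the number of (alphabetic) characters in each word, in order of appearance from the beginning.
In this problem, periods and commas are not alphabetic characters, so they must be removed to get the digits of pi.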
Below is an example of the answer.
nlp03.py
sent = "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."
words = sent.split()
ans = []
for word in words:
    ans.append(len(word.rstrip('.,')))
print(ans)
[3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]
There is actually a smarter way to write it (a list comprehension), but it's okay if you don't understand it at this stage.
ans = [len(word.rstrip('.,')) for word in words]
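To see how a list comprehension relates to the for loop, here is a minimal sketch that builds the same list both ways. It reuses the idea of the `squares` list from earlier; the names `squares_loop` and `squares_comp` are made up for this comparison.

```python
# Building a list with an explicit loop and append().
squares_loop = []
for n in range(1, 7):
    squares_loop.append(n * n)

# The same list as a list comprehension: [expression for variable in iterable].
squares_comp = [n * n for n in range(1, 7)]

print(squares_loop)  # [1, 4, 9, 16, 25, 36]
print(squares_comp)  # [1, 4, 9, 16, 25, 36]
```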
With strings, lists, and so on, we used subscripts, that is, a range of integers, to access elements, but a dictionary can be accessed with "keys" that we define ourselves. For example, if you want to record the number of occurrences of each word, you can create a dictionary with words as keys and occurrence counts as values. A dictionary object is defined and its values are retrieved as follows. You can also easily add key/value pairs to a dictionary.
>>> dic = {'I':141, 'you':112}
>>> dic
{'I': 141, 'you': 112}
>>> dic['I']
141
>>> dic['have'] = 256
>>> dic
{'I': 141, 'you': 112, 'have': 256}
Evaluating an equality or inequality returns True or False.
>>> 1 == 1
True
>>> 1 == 2
False
>>> 1 < 2 <= 3
True
You can invert a bool value with `not`.
`in` without `for` tests membership.
>>> 1 in [1, 2, 3]
True
Types for which the `in` operation is defined are called **container types**. Strings, lists, tuples and dictionaries are all container types.
The Python if statement is similar to C's, but note that you write `elif` instead of `else if`.
if 1==2:
    print('hoge')
elif 1==3:
    print('fuga')
else:
    print('bar')
bar
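Combining the dictionary, `in`, and `if` introduced above, the word-counting idea mentioned earlier can be sketched as follows. The sentence is made up for illustration, and this is only one straightforward way to write it.

```python
words = 'I think that you think that I have a pen'.split()

counts = {}
for word in words:
    if word in counts:       # for a dictionary, `in` tests the keys
        counts[word] += 1
    else:
        counts[word] = 1     # first occurrence: add a new key

print(counts)
# {'I': 2, 'think': 2, 'that': 2, 'you': 1, 'have': 1, 'a': 1, 'pen': 1}
```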
Let's make full use of what we have learned so far.
Break down the sentence "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can." into words. Take the first character of the 1st, 5th, 6th, 7th, 8th, 9th, 15th, 16th, and 19th words and the first two characters of the other words, and create an associative array (dictionary type or map type) from each extracted string to the word's position (its number counted from the beginning).
Below is an example of the answer.
nlp04.py
sent = "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can."
positions = [1, 5, 6, 7, 8, 9, 15, 16, 19]
words = sent.split()
i = 1
ans = {}
for word in words:
    if i in positions:
        key = word[0]
    else:
        key = word[:2]
    ans[key] = i
    i += 1
print(ans)
{'H': 1, 'He': 2, 'Li': 3, 'Be': 4, 'B': 5, 'C': 6, 'N': 7, 'O': 8, 'F': 9, 'Ne': 10, 'Na': 11, 'Mi': 12, 'Al': 13, 'Si': 14, 'P': 15, 'S': 16, 'Cl': 17, 'Ar': 18, 'K': 19, 'Ca': 20}
(Wait, there is no element symbol Mi, is there? It seems that Mg is the stumbling block in the English version of this periodic table mnemonic in the first place...)
The built-in function `enumerate()` supplies the loop count along with each element, making this easier to write. Since it starts from 0 by default, the optional argument `start=1` is specified.
ans = {}
for i, word in enumerate(words, start=1):
    if i in positions:
        key = word[0]
    else:
        key = word[:2]
    ans[key] = i
`enumerate()` is very convenient and frequently used like this, so be sure to remember it. In fact, in problem 03 too, there is a way to modify the elements of `words` in place using `enumerate()` (which saves the trouble of creating a new list `ans`).
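The article does not show the in-place version for problem 03, so the following is only one possible reading of that remark: each element of `words` is overwritten with its length, using the index supplied by `enumerate()`.

```python
sent = "Now I need a drink, alcoholic of course, after the heavy lectures involving quantum mechanics."
words = sent.split()

# Overwrite each word with its length; the list length itself never changes,
# so assigning to words[i] during the loop is safe.
for i, word in enumerate(words):
    words[i] = len(word.rstrip('.,'))

print(words)
# [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9, 7, 9]
```

After the loop, the name `words` no longer describes its contents well, which is one reason a new list may still be preferable.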
Next, let's try 100 knock 05.
Create a function that creates an n-gram from a given sequence (string, list, etc.). Use this function to get the word bi-gram and the letter bi-gram from the sentence "I am an NLPer".
If you want to define your own function in Python, you can write it as follows.
def function_name(argument):
    processing
The output format is not specified, but let's write code that outputs something like the following, for example.
[['i', 'am'], ['am', 'an'], ['an', 'nlper']]
It should be enough to create a function that takes a sequence and n as arguments.
Below is an example of the answer.
def ngram(seq, n):
    lis = []
    for i in range(len(seq) - n + 1):
        lis.append(seq[i:i+n])
    return lis
Once you have come this far, save the following in a file named `nlp05.py` and run it on the command line with `$ python3 nlp05.py`.
nlp05.py
def ngram(seq, n):
    lis = []
    for i in range(len(seq) - n + 1):
        lis.append(seq[i:i+n])
    return lis
if __name__ == '__main__':
    sent = 'I am an NLPer'
    words = sent.split(' ')
    lis = ngram(words, 2)
    print(lis)
    lis = ngram(sent, 2)
    print(lis)
The `if __name__ == '__main__':` part is needed because this file will be imported in the next problem. If you are curious why, try once without writing this line.
Also, the ngram function has the following alternative solution, which is faster, but the explanation is a little longer, so I will omit it.
def ngram(seq, n):
    return list(zip(*(seq[i:] for i in range(n))))
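The two versions do not return exactly the same thing, which is easy to overlook. The sketch below puts them side by side under the made-up names `ngram_slice` and `ngram_zip`: the slice version keeps the type of the input (substrings for a string), while the `zip()` version always yields tuples.

```python
def ngram_slice(seq, n):
    """Slice-based version: each n-gram has the same type as seq."""
    lis = []
    for i in range(len(seq) - n + 1):
        lis.append(seq[i:i+n])
    return lis

def ngram_zip(seq, n):
    """zip-based version: each n-gram is a tuple of elements."""
    return list(zip(*(seq[i:] for i in range(n))))

print(ngram_slice('para', 2))  # ['pa', 'ar', 'ra']
print(ngram_zip('para', 2))    # [('p', 'a'), ('a', 'r'), ('r', 'a')]

# Joining each tuple recovers the substrings.
joined = [''.join(gram) for gram in ngram_zip('para', 2)]
print(joined)                  # ['pa', 'ar', 'ra']
```

This matters for membership tests: with the `zip()` version, a check such as `'se' in x` would need to be written as `('s', 'e') in x`.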
Next is 100 knock 06.
Find the sets of character bi-grams contained in "paraparaparadise" and "paragraph" as X and Y, respectively, and find the union, intersection, and difference of X and Y. In addition, find out whether the bi-gram 'se' is included in X and in Y.
For now, let's run the following code.
from nlp05 import ngram
x = set(ngram('paraparaparadise', 2))
print(x)
{'is', 'se', 'di', 'ap', 'ad', 'ar', 'pa', 'ra'}
The `from nlp05 import ngram` part "imports" the ngram function created earlier.
A set can be written by hand as `x = {'pa', 'ar'}`, but it can also be created from another iterable object using `set()`.
You can find the union with `|`, the intersection with `&`, and the difference with `-`.
Below is an example of the answer.
nlp06.py
from nlp05 import ngram
x = set(ngram('paraparaparadise', 2))
y = set(ngram('paragraph', 2))
print(x | y)
print(x & y)
print(x - y)
print(y - x)
print('se' in x)
print('se' in y)
{'is', 'se', 'di', 'ph', 'ap', 'ag', 'ad', 'ar', 'pa', 'gr', 'ra'}
{'ra', 'ap', 'pa', 'ar'}
{'is', 'se', 'ad', 'di'}
{'ag', 'gr', 'ph'}
True
False
If you apply this `in` operation to a list, as in problem 04, it takes O(n) time, so use a set where possible.
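As a small sketch of that advice, the position list from problem 04 could be written as a set instead. Only the brackets change; the membership test is written the same way but takes constant time on average.

```python
# Same positions as in problem 04, written as a set instead of a list.
positions = {1, 5, 6, 7, 8, 9, 15, 16, 19}

print(5 in positions)   # True
print(2 in positions)   # False
```

For nine elements the difference is negligible, but the habit pays off with large collections.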
Solve 100 knocks 07.
Implement a function that takes arguments x, y, z and returns the string "x時のyはz" ("y at x o'clock is z"). Furthermore, set x = 12, y = "temperature", z = 22.4, and check the execution result.
It is cumbersome to do this by concatenating strings with `+`. There are three ways to format strings in Python.
>>> x = 12
>>> y = 'temperature'
>>> "%d o'clock %s" % (x, y)
"12 o'clock temperature"
>>> "{} o'clock {}".format(x, y)
"12 o'clock temperature"
>>> f"{x} o'clock {y}"
"12 o'clock temperature"
Basically, f-strings are the easiest. However, backslashes cannot be used inside `{}` (this restriction was lifted in Python 3.12). Also, when specifying the number of digits, the printf-style format seems to be faster (Reference).
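For reference, here is how the number of digits can be specified in each of the three styles. The values are made up for illustration; this sketch says nothing about which style is faster.

```python
z = 22.4567

# One digit after the decimal point, in each of the three styles.
print('%.1f' % z)          # 22.5
print('{:.1f}'.format(z))  # 22.5
print(f'{z:.1f}')          # 22.5

# Zero-padding an integer to three digits.
n = 7
print('%03d' % n)          # 007
print(f'{n:03d}')          # 007
```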
With this, you can do 07 without any hesitation.
08 can be done using the built-in functions `ord()` and `chr()`, which convert between characters and code points. You can use code points to detect lowercase letters, or you can use `str.islower()`. A code point is like an index number for a character on a computer. Converting a byte string to code points is called decoding, and the reverse is called encoding. The correspondence between them is called a character encoding. It may be good to know this background.
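The functions mentioned above can be tried out as follows. This is a sketch of the building blocks only, not an answer to problem 08, and UTF-8 is chosen here just as an example encoding.

```python
# Characters <-> code points.
print(ord('a'))            # 97
print(chr(97))             # a
print(chr(ord('a') + 1))   # b

# Lowercase check without looking at code points.
print('a'.islower(), 'A'.islower())  # True False

# Strings <-> byte strings (encoding and decoding).
encoded = 'あ'.encode('utf-8')
print(encoded)                  # b'\xe3\x81\x82'
print(encoded.decode('utf-8'))  # あ
```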
09 can be done using `shuffle()` or `sample()` from the `random` module. Note that `shuffle()` modifies the original data in place and has no return value. Of course, don't forget the import.
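The difference between the two functions can be seen in a short sketch. The list `cards` is made up for illustration, and the printed order varies from run to run because the result is random.

```python
import random

cards = [1, 2, 3, 4, 5]

# sample() returns a new list and leaves the original untouched.
picked = random.sample(cards, len(cards))
print(cards)   # still [1, 2, 3, 4, 5]
print(picked)  # the same elements in a random order

# shuffle() reorders the list in place and returns None.
result = random.shuffle(cards)
print(result)  # None
print(cards)   # now in a random order
```

Because strings are immutable, `shuffle()` cannot be applied to a string directly; it needs a list of characters.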
You should now have the basics of Python. Actually, I also wanted to cover comprehensions, iterators, generators, lambda expressions, argument list unpacking, and `map()`, but I reluctantly left them out. If you are interested, please look them up.
(4/25 postscript) [Chapter 2] Introduction to Python with 100 knocks of language processing has been released! Using the problems in Chapter 2, I will explain some of the above-mentioned contents that I wanted to handle, in addition to file input / output and Unix commands.