This is the second article in which I solve "Language Processing 100 Knocks 2020" in Python (3.7). The exercises were created by the Inui/Okazaki Lab (currently the Inui/Suzuki Lab) at Tohoku University as teaching material for a basic programming study session, which is part of its training for newcomers.
I taught myself Python, so there may be mistakes or more efficient ways to do things. I would appreciate it if you could point out any improvements you find.
The source code is also available on GitHub.
This chapter reviews some advanced topics in programming languages through exercises that deal with text and strings.
05.n-gram
Create a function that builds n-grams from a given sequence (string, list, etc.). Use this function to get the word bi-grams and the character bi-grams of the sentence "I am an NLPer".
What is an n-gram?
An n-gram is any run of n consecutive elements (characters or words) taken from a document or string.
A character bi-gram is therefore a string of two consecutive characters.
Reference: What is n-gram? Weblio Dictionary
05.py
def n_gram(target, n):
    return [target[index: index + n] for index in range(len(target) - n + 1)]
words = "I am an NLPer"
print(n_gram(words.split(), 2))
# >> [['I', 'am'], ['am', 'an'], ['an', 'NLPer']]
print(n_gram(words, 2))
# >> ['I ', ' a', 'am', 'm ', ' a', 'an', 'n ', ' N', 'NL', 'LP', 'Pe', 'er']
I will use slices this time as well.
The n_gram function returns a list built by shifting the start index one position at a time over the given list or string and slicing out n elements at each position.
For the word bi-grams, the input string is split on whitespace with the split method and the resulting list is passed as the argument.
For the character bi-grams, the string is passed as is and sliced directly.
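To make the sliding window concrete, here is a small sketch that returns each start index together with the slice taken from it. The helper name n_gram_windows and the sample strings are my own choices for illustration and are not part of the original solution.

```python
def n_gram_windows(target, n):
    """Return (start_index, slice) pairs for every n-gram window."""
    # The last valid start index is len(target) - n, hence the + 1 in range().
    return [(index, target[index:index + n]) for index in range(len(target) - n + 1)]


windows = n_gram_windows("NLPer", 3)
for index, piece in windows:
    print(index, piece)
# 0 NLP
# 1 LPe
# 2 Per

# When n is larger than the sequence, range() is empty and no n-grams are produced.
print(n_gram_windows("NLP", 5))
# []
```

Because slicing works the same way on strings and lists, the same function also handles a list of words without any changes.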
Let X and Y be the sets of character bi-grams contained in "paraparaparadise" and "paragraph", respectively. Find the union, intersection, and difference of X and Y. In addition, check whether the bi-gram 'se' is included in X and in Y.
06.py
def n_gram(target, n):
    return [target[index: index + n] for index in range(len(target) - n + 1)]
word1 = "paraparaparadise"
word2 = "paragraph"
x = set(n_gram(word1, 2))
y = set(n_gram(word2, 2))
# x = {'is', 'ra', 'ad', 'se', 'ar', 'ap', 'pa', 'di'}
# y = {'ag', 'gr', 'ra', 'ar', 'ap', 'pa', 'ph'}
# Union
print(x | y)
# >> {'ph', 'ap', 'is', 'ad', 'pa', 'se', 'di', 'ar', 'gr', 'ag', 'ra'}
# Intersection
print(x & y)
# >> {'ar', 'pa', 'ap', 'ra'}
# Difference
print(x - y)
# >> {'di', 'is', 'se', 'ad'}
print(y - x)
# >> {'ph', 'ag', 'gr'}
print('se' in x)
# >> True
print('se' in y)
# >> False
The part that creates the character bi-grams is the same as before, so I will omit the explanation.
In Python, sets are handled with the set type.
The lists of bi-grams returned by the n_gram function are converted to sets, and each set operation is then applied to them.
Whether a given string is contained in a set can be checked with the in operator.
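As a supplementary sketch, the operators used above also have equivalent named methods on the set type, and there is a symmetric difference operator as well. The two sets below are written out by hand from the bi-grams listed earlier. Since sets are unordered, the printed order of a set can change between runs, so this example sorts before printing.

```python
x = {"pa", "ar", "ra", "ap", "ad", "di", "is", "se"}
y = {"pa", "ar", "ra", "ap", "ag", "gr", "ph"}

# Each operator has an equivalent named method.
assert x.union(y) == x | y
assert x.intersection(y) == x & y
assert x.difference(y) == x - y

# Symmetric difference: elements that are in exactly one of the two sets.
print(sorted(x ^ y))
# ['ad', 'ag', 'di', 'gr', 'is', 'ph', 'se']

# Sorting gives a stable order for display.
print(sorted(x & y))
# ['ap', 'ar', 'pa', 'ra']
```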
Implement a function that takes arguments x, y, z and returns the string "y at x is z". Then set x = 12, y = "temperature", z = 22.4 and check the execution result.
07.py
def something_at_that_time(hour, something, predicate):
    # Fill the template "y at x is z" with the given values.
    return "{} at {} is {}".format(something, hour, predicate)
x = 12
y = "temperature"
z = 22.4
print(something_at_that_time(x, y, z))
# >> temperature at 12 is 22.4
The format method is useful for generating text from a template.
Put {} placeholders in the string and pass the values as arguments to the format method; each value is converted to a string and inserted in the order in which the {} placeholders appear.
You can also put a name inside the braces, like {h1}, and supply the value as a keyword argument, as in "{h1}".format(h1=variable_a).
You can also specify how the value is formatted, such as the number of decimal places or zero padding.
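The three usages described above can be sketched as follows. The variable names are only illustrative, and the English template is the one given in the problem statement.

```python
hour, item, value = 12, "temperature", 22.4

# Positional placeholders are filled in argument order.
positional = "{} at {} is {}".format(item, hour, value)

# Named placeholders are filled from keyword arguments, in any order.
named = "{y} at {x} is {z}".format(x=hour, y=item, z=value)

# A format spec after the colon controls the conversion.
two_decimals = "{:.2f}".format(value)  # two decimal places
zero_padded = "{:03d}".format(hour)    # zero-padded to width 3

print(positional)    # temperature at 12 is 22.4
print(named)         # temperature at 12 is 22.4
print(two_decimals)  # 22.40
print(zero_padded)   # 012
```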
Implement the function cipher that converts each character of the given character string with the following specifications.
- Lowercase letters are replaced with the character whose code is (219 - character code)
- All other characters are output as they are
Use this function to encrypt / decrypt English messages.
08.py
def cipher(string):
    encryption = ""
    for char in string:
        if char.islower():
            # Mirror the letter: 219 - code point maps a <-> z, b <-> y, ...
            encryption += chr(219 - ord(char))
        else:
            encryption += char
    return encryption
test = "Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can."
encryption = cipher(test)
print(encryption)
# >> Hr Hv Lrvw Bvxzfhv Blilm Clfow Nlg Ocrwrav Foflirmv. Nvd Nzgrlmh Mrtsg Aohl Srtm Pvzxv Svxfirgb Cozfhv. Aigsfi Krmt Czm.
normal = cipher(encryption)
print(normal)
# >> Hi He Lied Because Boron Could Not Oxidize Fluorine. New Nations Might Also Sign Peace Security Clause. Arthur King Can.
Encryption and decryption are both done with the cipher function, which takes a string as its argument.
As the problem states, for each lowercase letter the ord function gets its Unicode code point, subtracting that from 219 gives the encrypted code point, and the chr function converts the result back to a character.
Since 219 is ord('a') + ord('z') (97 + 122), the mapping mirrors the alphabet (a becomes z, b becomes y, and so on), so applying it a second time restores the original text.
The key point of this problem is that decryption is done by the very same subtraction from 219.
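As a cross-check of the symmetry, here is an alternative sketch that builds the same mirror mapping with str.maketrans and str.translate. The names MIRROR_TABLE and cipher_translate are hypothetical and not from the original solution. One difference I am assuming matters little here: this version only touches the ASCII letters a to z, whereas islower() is also true for non-ASCII lowercase letters.

```python
import string

# 219 is the sum of the code points of 'a' (97) and 'z' (122).
assert ord("a") + ord("z") == 219

lower = string.ascii_lowercase
# Translation table mapping a -> z, b -> y, ..., z -> a.
MIRROR_TABLE = str.maketrans(lower, lower[::-1])


def cipher_translate(text):
    """Mirror ASCII lowercase letters; leave everything else unchanged."""
    return text.translate(MIRROR_TABLE)


message = "Hi He Lied"
encrypted = cipher_translate(message)
print(encrypted)
# Hr Hv Lrvw
print(cipher_translate(encrypted))
# Hi He Lied
```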
Create a program that, for a sequence of words separated by spaces, randomly rearranges the letters of each word while leaving its first and last letters in place. However, words with a length of 4 or less are not rearranged. Give it an appropriate English sentence (for example, "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind .") and check the execution result.
This reproduces the phenomenon that people can still read a word whose inner letters are scrambled, as long as the first and last letters are in place.
09.py
import random
input_line = "I couldn't believe that I could actually understand what I was reading : the phenomenal power of the human mind ."
words_list = input_line.split()
ans = []
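for word in words_list:
    # Words of 4 characters or fewer are kept as they are.
    if len(word) <= 4:
        ans.append(word)
        continue
    char = list(word)
    middle_char = char[1:-1]
    # random.sample returns the middle characters in a new random order.
    ans.append(char[0] + "".join(random.sample(middle_char, len(middle_char))) + char[-1])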
print(" ".join(ans))
# >> I cod'nult bvieele that I culod auclatly unserdatnd what I was rdienag : the panmhoeenl poewr of the hmaun mind .
Because the sample function of the random module is used, the answer is different on every run.
The sentence is first split into a list of words, and each word is then processed in turn. If a word has 4 letters or fewer, it is added to the answer list as is.
Otherwise, the word is converted to a list of characters and the characters in the middle are extracted with a slice. random.sample returns a list of all the middle characters in random order, without picking any element twice; this list is joined into a string with the join method, the fixed first and last characters are concatenated to it, and the result is added to the answer list.
Finally, the join method is called on a single space with the answer list as its argument to produce the answer.
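Since the output changes on every run, it can be hard to compare results. One possible variation, sketched below, wraps the logic in a function that takes an optional seed and uses its own random.Random generator with an in-place shuffle. The function name typoglycemia and the seed parameter are my own additions, not part of the original solution.

```python
import random


def typoglycemia(sentence, seed=None):
    """Shuffle the inner letters of every word longer than 4 characters."""
    rng = random.Random(seed)  # private generator; seed=None means unseeded
    result = []
    for word in sentence.split():
        if len(word) <= 4:
            result.append(word)
            continue
        middle = list(word[1:-1])
        rng.shuffle(middle)  # in-place shuffle of the inner letters
        result.append(word[0] + "".join(middle) + word[-1])
    return " ".join(result)


text = "I could actually understand what I was reading"
first = typoglycemia(text, seed=0)
second = typoglycemia(text, seed=0)
print(first == second)
# True
```

With the same seed the scrambled sentence is reproducible, while omitting the seed keeps the original behavior of a different result on each run.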
In this article, I solved problems 05 to 09 of Chapter 1 (Preparatory Movement) of Language Processing 100 Knocks 2020.
The set type turned out to be useful more often than I expected, and converting between characters and character codes is characteristic of language processing, so I learned a lot this time as well.
I'm still learning, so if you have a better solution, please let me know! Thank you.
- Language processing 100 knocks 2020 version [Chapter 2: UNIX commands 10-14]
- I tried to solve 100 language processing knock 2020 version [Chapter 2: UNIX commands 15-19]
- Language processing 100 knocks 2020 version [Chapter 3: Regular expressions 20-24]
- Language processing 100 knocks 2020 version [Chapter 3: Regular expressions 25-29]