GiNZA, a library for ultra-high-precision Japanese natural language processing and dependency parsing, is amazing. With Colaboratory you don't even need to build an environment: you can try it right away with just a browser.
To emphasize that snappiness during an LT (Lightning Talk), I tried a "magic trick": building the environment and writing the code on the spot, running natural language processing live, and showing that it is nevertheless highly accurate and highly functional. At first glance it looks real, but the "magic trick" has a secret... and this is the story of that secret.
If you read to the end, you will pick up the following two pieces of know-how:
・How to do high-precision natural language processing with GiNZA in 3 minutes, starting from zero
・The magic trick behind "live coding" in an LT
When I tried "GiNZA", a Japanese natural language processing open-source library announced in April 2019, I was surprised that it could handle all the main natural language processing tasks, including dependency parsing and vectorization, both easily and with high accuracy (super important).
Reference: https://www.recruit.co.jp/newsroom/2019/0402_18331.html
Building an environment for natural language processing is often quite painful (MeCab, etc.), and installing extra dictionaries for high accuracy is troublesome. With "GiNZA", however, **a single pip install brings in a high-precision dictionary as well**. On top of that, using **Colaboratory**, **you can get it running from a completely blank state with just a browser**.
And it is not just morphological analysis: dependency parsing, extraction of person and place names, a trained model for vectorizing sentences, and more; a wide variety of core functions are available in one shot (a rough summary).
Perhaps because it is relatively new, there is still little information about it and it seems hard to find. ⇒ ★So I decided to summarize the main usage in this article★
That said, there are a few stumbling blocks:
・Parts of speech follow the UD (Universal Dependencies) standard, so tags like "PROPN" are hard to read at first (familiar labels such as "noun" and "verb" can also be produced)
・The analysis results carry so many attributes that it is hard to know which one you want (so many functions that they are hard to use well)
・A little know-how is needed to run it on Colaboratory (the startup method, the graph display method, etc. each took a bit of research)
It is easy to understand overall, but I felt it was missing a first introduction.
I had the opportunity to give an LT (Lightning Talk) at a study session. I wanted to convey how great "GiNZA" is, but a normal introduction **would not convey the "quick and easy" feeling**, and **above all it would lack "interestingness"**. (Not everyone is interested in natural language processing.)
So I came up with a **spectacle**: starting from a **blank slate** with nothing but a freshly launched browser, implementing high-precision natural language parsing on the spot via "**live coding**". **It conveys the quick-and-easy feeling**, **it surprises people ("What, this little code can do such great analysis!?")**, and **the value of seeing it happen live makes it interesting as an event**.
But there are problems: **keeping in time while talking in parallel**, and **I don't have the skill to code without mistakes!**
That's it! **Let Python do the live coding (!?)** ⇒ In the end everything was done automatically, which also works as a punch line for the talk.
So this is also **the story of developing a magic trick that makes it look like live coding**.
The code is given later, so if you replace the GiNZA part, **anyone can easily pretend to be live coding in an LT!!** ~~What a terrifying piece of know-how~~
First, as the GiNZA know-how, here is how to do high-precision, multifunctional natural language processing super easily with just a browser: no prior knowledge and no environment required.
Search for "Colaboratory" in your browser, open a new Python 3 notebook, and just run the following code (under 100 lines) in order with Shift+Enter. Of course it's free. Please do try it out.
I wrote code that covers everything from installing GiNZA to sample usage of the key features. (The live coding used a simplified version of this.)
Installation is pip only; all related modules and dictionary data come with it. Easy.
!pip install ginza
# ★ Since the v3 release on 2020-01-15, a plain pip install like this works
#The previous installation method was as follows (this is what the LT used):
# !pip install "https://github.com/megagonlabs/ginza/releases/download/latest/ginza-latest.tar.gz"
Due to how Colaboratory handles module paths, the following incantation must be run once after the pip install:
import pkg_resources, imp
imp.reload(pkg_resources)  # make the freshly installed packages visible without restarting the runtime
First, here is the execution result. As shown, I made sample code that displays in one shot the analyses you are likely to use most: morphological analysis, dependency (structure) parsing, and recognition of abstract categories such as person and place names, together with their visualizations.
Here is the function that produced the above output.
#A function that displays the main elements of the dependency parsing result.
#Load the model outside the function:
#import spacy
#nlp = spacy.load('ja_ginza')
#easy_display_nlp(nlp, "test text")
def easy_display_nlp(my_nlp, input_str):
    doc = my_nlp(input_str)

    ###Tabular display of the dependency parsing results
    result_list = []
    for sent in doc.sents:
        #Print each sentence on its own line (sentence-segmented display)
        print(sent)
        #Analyze each sentence and append the results to one list
        #(multiple sentences are combined into a single table)
        for token in sent:
            #https://spacy.io/api/token
            #print(dir(token))
            #The comments below are my own interpretation, not from the official docs, so treat them as reference only.
            info_dict = {}
            info_dict[".i"] = token.i                    #token index (a serial number across sentences; it does not reset to 0)
            info_dict[".orth_"] = token.orth_            #original text
            info_dict["._.reading"] = token._.reading    #kana reading
            info_dict[".pos_"] = token.pos_              #part of speech (UD)
            info_dict[".tag_"] = token.tag_              #part of speech (Japanese)
            info_dict[".lemma_"] = token.lemma_          #lemma (after normalization)
            info_dict["._.inf"] = token._.inf            #inflection information
            info_dict[".rank"] = token.rank              #can apparently be treated like a frequency
            info_dict[".norm_"] = token.norm_            #normalized form
            info_dict[".is_oov"] = token.is_oov          #out-of-vocabulary word?
            info_dict[".is_stop"] = token.is_stop        #stop word?
            info_dict[".has_vector"] = token.has_vector  #does it have a word2vec vector?
            info_dict["list(.lefts)"] = list(token.lefts)    #dependent words (left side)
            info_dict["list(.rights)"] = list(token.rights)  #dependent words (right side)
            info_dict[".dep_"] = token.dep_              #dependency relation
            info_dict[".head.i"] = token.head.i          #index of the head token
            info_dict[".head.text"] = token.head.text    #text of the head token
            result_list.append(info_dict)

    #Convert the list of dicts to a DataFrame for a tidy display on Jupyter
    import pandas as pd
    #pd.set_option('display.max_columns', 100)
    df = pd.DataFrame(result_list)
    from IPython.display import display
    display(df)

    ###Dependency visualization
    #Draw the dependency structure as a graph.
    #A little ingenuity is needed to render it directly inside Colaboratory:
    #https://stackoverflow.com/questions/58892382/displacy-from-spacy-in-google-colab
    from spacy import displacy
    displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})

    ###Visualization of abstract categories (named entities)
    #If the input text contains no place names etc., you get the warning:
    #UserWarning: [W006] No entities to visualize found in Doc object
    ent_result_list = []
    for ent in doc.ents:
        ent_dict = {}
        ent_dict[".text"] = ent.text
        ent_dict[".start_char"] = ent.start_char
        ent_dict[".end_char"] = ent.end_char
        ent_dict[".label_"] = ent.label_
        ent_result_list.append(ent_dict)
    #Tabular display as a DataFrame
    display(pd.DataFrame(ent_result_list))
    #Highlighted (marked-up) display
    displacy.render(doc, style='ent', jupyter=True, options={'distance': 90})

    ###Enumeration of keywords (noun chunks)
    #Served with prefixes/suffixes attached
    for chunks in doc.noun_chunks:
        print(chunks, end=", ")
Here is how to run it:
#Usage sample
import spacy
nlp = spacy.load('ja_ginza')
target_str = "Gonbei's baby caught a cold. Tokyo Patent Approval Office."
easy_display_nlp(nlp, target_str)
The morphological analysis results and their main attributes are shown in tabular form, and the dependency structure and extracted person/place names are drawn as figures. The function is meant as a first look at what kinds of analysis you can get out of GiNZA.
As shown below, sentences can also be vectorized and their similarity computed (a trained model is built in).
doc1 = nlp("This ramen is delicious")
doc2 = nlp("Let's go eat curry")
doc3 = nlp("I'm sorry I can't go to the alumni association")
print(doc1.similarity(doc2))
print(doc2.similarity(doc3))
print(doc3.similarity(doc1))
> 0.8385934558551319
> 0.6690503605367146
> 0.5515445470506148
> #The two foods are the most similar.
Word-level vectorization can be done in the same way: check whether a token carries vector information with token.has_vector, and consider normalizing to the base form with token.lemma_ first.
As for GiNZA's trained models themselves, I haven't looked into them closely yet, so I'd like to check them later.
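As a sketch of that idea, here is a small helper (my own naming, not part of GiNZA; it assumes a loaded pipeline such as `nlp = spacy.load('ja_ginza')`) that collects word vectors keyed by lemma, skipping tokens that carry no vector:

```python
def collect_word_vectors(doc):
    """Collect {lemma: vector} for tokens that actually carry a word vector."""
    vectors = {}
    for token in doc:
        # token.has_vector guards against out-of-vocabulary tokens;
        # token.lemma_ folds inflected forms back onto the base form
        if token.has_vector:
            vectors[token.lemma_] = token.vector
    return vectors
```

You would call it as `collect_word_vectors(nlp(target_str))`; anything without a vector (symbols, unknown words) simply drops out of the result.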
That's all for the GiNZA know-how. No prior knowledge and no environment required: just open Colaboratory in your browser and run the code above in order. **Anyone can achieve high-precision, multifunctional natural language analysis very easily!**
The syntax of "the Party to Protect the People from NHK" is genuinely hard to parse: you can't easily tell who is protecting whom from what. But don't worry. You can see it in one shot. **Yes, with GiNZA.**
target_str = "From the party that protects the people from NHK From the party that protects NHK From the party that protects the people from NHK From the party that protects the people from NHK From the party that protects the people from NHK"
easy_display_nlp(nlp, target_str)
With just the pip install, you get an error at runtime. According to information from some senpai there are workarounds, such as restarting the Colaboratory runtime, and I went back and forth before arriving at the method above (reloading pkg_resources). If you look closely, though, that code is actually written on the official GiNZA site as well.
Before I found it, I used a different workaround: instead of
nlp = spacy.load('ja_ginza')
I passed the path directly, like this:
nlp = spacy.load(r'/usr/local/lib/python3.6/dist-packages/ja_ginza/ja_ginza-2.2.0')
I can report that this also works, with no extra code needed. The trick is to find GiNZA's installation path on Colaboratory with
!find / | grep spacy | grep data
and then give that absolute path directly to spacy.load.
In some articles, the explanation ends with something like "See, this word was parsed as PROPN!", which means nothing unless you know the tag stands for "proper noun". These classifications follow Universal Dependencies, an international standard (so the problem is just my own lack of study), but I looked up the correspondence and wrote familiar Japanese-style labels into the comments as well.
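For readability, that correspondence can be kept in a small lookup table like the one below (a partial mapping I put together myself; the full UD tag set has 17 tags, see https://universaldependencies.org/u/pos/):

```python
# Partial mapping from UD part-of-speech tags to familiar Japanese/English labels.
# (My own summary for readability; not exhaustive.)
UD_POS_LABELS = {
    "NOUN":  "名詞 (noun)",
    "PROPN": "固有名詞 (proper noun)",
    "VERB":  "動詞 (verb)",
    "ADJ":   "形容詞 (adjective)",
    "ADV":   "副詞 (adverb)",
    "ADP":   "助詞・接置詞 (adposition)",
    "AUX":   "助動詞 (auxiliary)",
    "PUNCT": "句読点 (punctuation)",
}

def friendly_pos(ud_tag):
    """Return a familiar label for a UD tag, or the tag itself if unmapped."""
    return UD_POS_LABELS.get(ud_tag, ud_tag)
```

You could then print `friendly_pos(token.pos_)` alongside `token.tag_` in the table above.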
Also, because the analysis exposes so much attribute information, I put everything into a DataFrame table to make it easier to scan.
If you use the graph display (displacy.render) for dependencies and person/place-name extraction in the normal way, a display server starts up with a message like
Serving on http://0.0.0.0:5000 ...
and the expected flow is to access that server to see the figure. Instead, I set the jupyter=True option so that it renders directly inside Colaboratory.
That's all for GiNZA's know-how.
From here on, we enter **the world of bad boys**, which I'm not sure I should be introducing. First, here is how the live coding demonstration looked; please see the gif video.
Live coding magic trick (the pip part of the environment construction is omitted)
Despite my declaring it a "magic trick", people said "it really looks like you're live coding." I had expected someone to call out "hey, you weren't actually typing!", and had even planted a small giveaway, but I put in so many extra touches that, unexpectedly, it apparently looked like I was a genuinely super engineer (I'm not). ~~Normally nobody implements something this silly~~
The synergy between GiNZA, which lets you do advanced things in just a few lines, and **automatic live coding** is amazing!
The basic policy is automatic typing using pyautogui. For details on automation with pyautogui, see the article below.
I tried to create a mechanism to automate "copying sutras" and accumulate merit automatically.
However, that project was "fully automated", and **there is no live feeling at all**: the code just writes itself, ad-libbing is impossible, and it drifts out of sync with the talk.
So, as an additional policy, **keyboard event handling is the key**.
Each time a special command is pressed (this time I assigned Ctrl+Q), the code for one cell is typed automatically. In other words, there is a momentary manual feel as each cell starts being written, and above all, **only the "execution" of each cell is done by hand!**
With this moderate blend of automatic and manual, the coding/typing is automated while **only the "pleasure of banging the Enter key" stays manual**. That makes it perfect for an event format like an LT, where you keep explaining while the demo runs.
Use a library called keyboard.
pip install keyboard
See below for details on the specifications of this library. https://github.com/boppreh/keyboard#keyboardon_press_keykey-callback-suppressfalse
The most important point is the following. (See the full code further below for details.)
keyboard.add_hotkey('ctrl+q', my_auto_func, args=(1, ) )
Register a hotkey like this, so that every press of Ctrl+Q runs the automatic-typing function (my_auto_func) built with pyautogui.
The strings you want typed are listed in advance, and my_auto_func outputs them in order (the first press types the string for the first cell, the second press the string for the second cell, and so on).
There seems to be a bug when the "half-width/full-width" (IME toggle) key is pressed: the hotkey registration apparently gets released.
Initially I planned to type the sentences to be analyzed by hand, but I could not solve this problem, so at that point I decided to automate all of the typing. (Only the Shift+Enter for execution is manual ♪)
On anything other than a US keyboard, trying to enter ":" sends the "Shift+:" key and "*" is entered instead. Reference: https://teratail.com/questions/79973
I took countermeasures based on the link above, but in the end I abandoned typing character by character with the typewrite command:
Rejected implementation = typing strings character by character
pyautogui.typewrite("abcdefg", interval = 0.1)
and instead implemented it via the clipboard, pasting everything with just Ctrl+V:
Adopted implementation = output everything via the clipboard
#Copy the string to the clipboard
pyperclip.copy(cp_str)
#Paste the whole registered string at once
pyautogui.hotkey('ctrl', 'v')
One reason is the "half-width/full-width" key problem: if you want to enter Japanese, ~~faking IME conversion with automation, as in the sutra-copying project, is a pain~~ it is simply easier to go through the clipboard. This time I accept the unnaturalness of Japanese appearing directly without passing through conversion.
Also, Colaboratory auto-indents when you hit Enter, so with key-emulation methods other than the clipboard, the indentation shifts unless you preprocess the code strings being typed. A big advantage of the clipboard method is that you can pass the original code as-is, without worrying about any of that ♪
Or so I thought, but once I built it and watched the automatic typing, I hit **the biggest miscalculation**:
because the characters appear one by one at perfectly even intervals, **it doesn't feel human at all!**
**The typing proceeds at exactly the pace of the sutra-copying machine.** ~~I never noticed while copying sutras~~
Even on the premise that the trick will be revealed in the end (I planned to expose it myself), **it's no fun if it gives itself away that easily.**
So I worked on tuning it to create a natural atmosphere.
The method I tried first was to randomize the typing interval of pyautogui.hotkey and pyautogui.typewrite: set the interval argument and insert sleeps of random length.
The problem was that typing became far too slow. pyautogui seems to have a minimum interval per operation, so there is always a fixed wait for each one, and operating character by character is unavoidably sluggish.
I combined various waits (interval, sleep), input methods (hotkey, typewrite), and random-number ranges, but it never looked natural.
Watching the automatic typing closely, I noticed that it looks more natural when several characters appear together than when single characters appear at "appropriate" intervals. Humans type a few characters at a time in quick bursts.
So the implementation batches several characters together with a certain probability (that is, it accumulates the output string on the clipboard before each Ctrl+V) and outputs them all at once. See the full code below for the detailed implementation.
Even though the result is a speed impossible in reality (several keystrokes appearing at once), this looked more natural (a personal impression). With this, the automatic typing finally felt like human typing.
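The batching idea can be isolated as a pure function, roughly like this (a simplified sketch of my own; the full script below inlines the same logic together with the clipboard pasting and the sleeps):

```python
import random

def chunk_text(text, rng, p_flush=0.3):
    """Split text into human-like bursts: after each character,
    flush the current burst with probability p_flush."""
    chunks = []
    buf = ""
    for ch in text:
        buf += ch
        if rng.random() < p_flush:  # burst boundary
            chunks.append(buf)
            buf = ""
    if buf:  # flush whatever remains at the end
        chunks.append(buf)
    return chunks
```

Each chunk would then be put on the clipboard and pasted in one go, with a short random sleep between chunks; joining the chunks always reproduces the original text exactly.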
~~I ended up putting more effort into natural typing than into the original natural language processing~~ ~~a dangerous sign, given the theme was only decided three days beforehand~~
After all this trial and error, the following code was completed. Replace the strings to be typed, and anyone can give a demo that looks live-coded in an LT!
~~I wrote that abuse is strictly prohibited, but it's code that can only be abused~~
All code for automatic live coding
import pyautogui
import pyperclip
import time
import random
#Keyboard event handling
import keyboard
#https://github.com/boppreh/keyboard#keyboardon_press_keykey-callback-suppressfalse

#Triple-quoted strings keep their line breaks as-is.
#Write any code you like here; each hotkey press types the next entry in the list.
#(Pseudo IME conversion of Japanese is not supported: easy, but troublesome.)
my_str_list = [
'''!pip install "https://github.com/megagonlabs/ginza/releases/download/latest/ginza-latest.tar.gz"
''',
'''import pkg_resources, imp
imp.reload(pkg_resources)
''',
'''import spacy
nlp = spacy.load('ja_ginza')
''',
'''
doc = nlp("Today is a pizza party in Tokyo. Gonbei's baby caught a cold.")
''',
'''for s in doc.sents:
    for t in s:
        info = [t.orth_, t._.reading, t.tag_]
        print(info)
''',
'''from spacy import displacy
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})
''',
'''
displacy.render(doc, style='ent', jupyter=True, options={'distance': 90})
''',
'''doc = nlp("From the party that protects the people from NHK The party that protects the people from the party that protects the people")
displacy.render(doc, style='dep', jupyter=True, options={'distance': 90})
''',
'''doc1 = nlp("This ramen is delicious")
doc2 = nlp("Let's go eat curry")
doc3 = nlp("I'm sorry I can't go to the alumni association")
''',
'''print(doc1.similarity(doc2))
print(doc2.similarity(doc3))
print(doc3.similarity(doc1))
''',
]

#Global variable
now_counter = 0

#Main execution function (argument unused)
def my_auto_func(arg_val):
    global now_counter
    #Wait a moment so the keys used for the hotkey are released before typing starts.
    #If the Ctrl key in particular is still held down, the paste can turn into a different operation.
    time.sleep(1.5)
    print("called: " + str(now_counter))
    #If there is nothing left to output on this key press, finish
    if now_counter >= len(my_str_list):
        print("END: finish")
        exit()
    cp_str = ""
    #Clipboard version
    for my_char in my_str_list[now_counter]:
        #To adjust the speed, several characters are pasted at once under certain conditions
        sl_time = random.uniform(-0.03, 0.10)
        cp_str += my_char
        if sl_time < 0:
            #Keep accumulating without pasting
            continue
        else:
            #Copy the accumulated characters to the clipboard
            pyperclip.copy(cp_str)
            #Paste the whole accumulated string at once
            pyautogui.hotkey('ctrl', 'v')
            #Clear what was just pasted
            cp_str = ""
            #Sleep for the random length
            time.sleep(sl_time)
    #After the loop, paste whatever remains
    if len(cp_str) > 0:
        #Copy the remaining characters to the clipboard
        pyperclip.copy(cp_str)
        #Paste them at once
        pyautogui.hotkey('ctrl', 'v')
        cp_str = ""
    now_counter += 1
    print("END: my_auto_func : " + str(now_counter))

#Bonus: a function for recovering from mistakes during the demo.
#Moves the counter back one so the same cell can be typed again, or finishes.
def my_sub_func(arg_val):
    global now_counter
    print("called: " + "for_before")
    now_counter -= 1
    #If the counter goes negative, finish
    if now_counter < 0:
        print("END: finish")
        exit()
    print("END: my_sub_func : " + str(now_counter))

#Main routine below.
#Toggling half-width/full-width input mid-run seems to break the added hotkeys, so don't touch it
#(run in half-width mode).
#To stop, press Ctrl+C to force-quit this Python side.
if __name__ == "__main__":
    try:
        #Add each hotkey and its event only once.
        #Main hotkey setting: must not clash with other shortcuts in the target app
        keyboard.add_hotkey('ctrl+q', my_auto_func, args=(1, ))
        #Sub hotkey setting: must not clash with other shortcuts in the target app
        keyboard.add_hotkey('ctrl+i', my_sub_func, args=(1, ))
        print("Press ESC to stop.")
        keyboard.wait('esc')
        print("esc-END")
    except:
        import traceback
        print(traceback.format_exc())
        exit()
    exit()
Given the limited time of an LT, even granting that it's a Colaboratory demo, it is obviously more sensible to prepare the code in advance. Doing live coding is crazy. But **crazy is interesting** (by Akagi).
In reality, though, the code is prepared in advance, so while it looks dangerous, it is a mechanism for tasting **"a pleasure named safety, a pleasure named safety"** (by Tonegawa).
With this, **anyone may be able to pass themselves off as a "super engineer at first glance"**.
Misdirection: a technique, used mainly in stage magic, of directing the audience's attention somewhere other than the place that matters.
To make it look "live", it helps to "stage" the audience's assumptions:
・Dare to launch Colaboratory manually at the start: for example, actually type "Colaboratory" into the browser search box.
・Exaggerate the Shift+Enter motion at execution time to emphasize the manual feel.
・While auto-typing, keep your hands on the keyboard (naturally).
・Rather than jumping straight into live coding, first build the atmosphere of "this person looks like they really could implement this at super speed", etc.
In the end, I managed to convey that with GiNZA you can do natural language processing that is both very easy (browser only, no environment needed) and seriously capable (high precision, multifunctional). ~Thunderous applause~ ~~a world-class boke~~ (by Red Hot Chili Pepper)
This LT slot was actually quite long for an LT, 15 minutes (~~is that even still an LT?~~), and I talked about two things: "how to come up with interesting ideas" and "a learning method for getting by with technology".
As an illustration of that "learning method for getting by with technology", I used this live coding demonstration to explain the "HB pencil method".
**Just as you can snap an HB pencil in two without a second thought**, this is **a learning method where you simply take it for granted that the thing can be done**. **The important part is just to "recognize" it.**
For example, a learning method where all you do is recognize that natural language processing is easy and that being able to do it is natural.
Then, when something comes up (when you are brainstorming an idea), you simply think "if I look up GiNZA, I can do that." If you genuinely need it, you can come back to this article or research GiNZA properly at that point. Just by watching this LT, the learning is already over! ~~Because the moment the word comes to mind~~ ~~it's already done~~ ~~you can use it the instant you've "learned" it!~~
● Learning from introductory books / curricula / certification exams ⇒ Great for those who can manage it, but I fall asleep partway through, so I can't. It's hard.
● Learning by moving your hands and building something yourself ⇒ This may be the answer in the end. But human time is finite, so you cannot do it for everything. And where does the motivation come from?
● Lazy-evaluation learning ⇒ "Do it when you need it", which resembles the HB pencil method. But without the information "what becomes possible if I learn this", the ideas never come in the first place. Also, "when you need it" sounds passive, so personally I dislike the expression.
● HB pencil method ⇒ Learning only to the extent of, say, glancing at the images of GiNZA's output in this article. You merely recognize "this can be done in about 3 minutes". The merit is that it is easy: even without time for the details, you vaguely grasp what is possible, which makes it easier to come up with ideas involving techniques you have not yet learned, and the motivation to learn them properly at a later date arises naturally.
~~Like breathing in and breathing out!~~ ~~Controlling your Stand is just natural willpower!~~ ~~Truly a learning method fit to dominate the world~~
So this time I introduced two examples: "recognize that high-precision natural language analysis can be done with just a browser!" and "recognize that semi-automatic live coding can be demonstrated in an LT!"
If this article works even a little in your favor the next time an idea revelation strikes you, nothing would make me happier.
Finally, I would like to express my gratitude to the GiNZA developers and everyone involved.
that's all.