fastText is a natural language processing tool published by Facebook. It is known for processing text at high speed. Repository: GitHub fastText
For how it works, please refer to this explanatory article: "What can you do with fastText that learns 1 billion words published by Facebook in minutes"
This time, I would like to use fastText to vectorize horse-racing pedigrees rather than natural language. The idea of using fastText for something other than natural language processing was inspired by the article below. "Use fastText to get distributed representations of non-words"
I have uploaded the vector file and a Jupyter notebook showing the procedure to GitHub: github keiba_ketto_vec
Pedigree data for the past three generations is put into the fastText input format.
Below is the pedigree table for the racehorse Satono Diamond.
Pedigree | Horse name |
---|---|
Child | Satono Diamond |
Father | Deep Impact |
Mother | Malpensa |
Father's father | Sunday Silence |
Father's mother | Wind in Her Hair |
Mother's father | Orpen |
Mother's mother | Marsella |
Father's father's father | Halo |
Father's father's mother | Wishing Well |
Father's mother's father | Alzao |
Father's mother's mother | Burghclere |
Mother's father's father | Lure |
Mother's father's mother | Bonita Francita |
Mother's mother's father | Southern Halo |
Mother's mother's mother | Riviere |
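Convert the horse names in the table above into a single line separated by half-width (ASCII) spaces, in the same order as the table: the horse itself, then its parents, grandparents, and great-grandparents, 15 names in all. Do the same for every other racehorse, one horse per line.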
input.csv
Satono Diamond Deep Impact Malpensa Sunday Silence Wind in Her Hair Orpen Marsella Halo WishingWell Alzao Burghclere Lure Bonita Francita Southern Halo Riviere
Simon Trunale Gold Allure Humoresque Sunday Silence Nikiya Afleet Allie Win Halo WishingWell Nureyev ReluctantGuest Mr.Prospector PoliteLady Alydar FleetVictress
Water Lourdes Water League Water Henin Dehere Solo BostonHarbor Scrape DeputyMinister SisterDot Halo MineOnly Capote HarborSprings Mr.Prospector File
...
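The flattening step can be sketched roughly as follows. The nested-dictionary layout, the `node` helper, and the function name `pedigree_to_line` are assumptions made for illustration, not the format used in my notebook. The sketch also removes spaces inside each name, on the assumption that every horse should end up as exactly one whitespace-separated token.

```python
from collections import deque

# Hypothetical pedigree layout (an assumption for illustration):
# each node is {"name": ..., "father": node or None, "mother": node or None}.
def node(name, father=None, mother=None):
    return {"name": name, "father": father, "mother": mother}

def pedigree_to_line(horse, generations=3):
    """Flatten a horse and its ancestors, generation by generation, into one line."""
    names = []
    queue = deque([(horse, 0)])
    while queue:
        current, depth = queue.popleft()
        if current is None:
            continue  # unknown ancestor: skipped in this sketch
        # Tokens are split on whitespace, so drop spaces inside a name.
        names.append(current["name"].replace(" ", ""))
        if depth < generations:
            queue.append((current["father"], depth + 1))
            queue.append((current["mother"], depth + 1))
    return " ".join(names)

satono_diamond = node(
    "Satono Diamond",
    father=node("Deep Impact",
                father=node("Sunday Silence", node("Halo"), node("Wishing Well")),
                mother=node("Wind in Her Hair", node("Alzao"), node("Burghclere"))),
    mother=node("Malpensa",
                father=node("Orpen", node("Lure"), node("Bonita Francita")),
                mother=node("Marsella", node("Southern Halo"), node("Riviere"))),
)

line = pedigree_to_line(satono_diamond)
print(line)
```

The breadth-first walk visits father before mother at every level, which reproduces the row order of the pedigree table. Skipping unknown ancestors shifts the positions of the remaining names, so a real pipeline may prefer a placeholder token instead.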
Use the fasttext skipgram command for vectorization. Running it should generate a .bin file and a .vec file (here, ketto_model.bin and ketto_model.vec).
$ fasttext skipgram -input input.csv -output ketto_model -minn 50
About the minn option: besides the space-separated words themselves, fastText apparently also breaks each word down into character-level n-grams and learns from those (see the article "Looked at the implementation of fastText"). minn sets the minimum length of those n-grams.
With this feature enabled, horses such as "Gold Allure" and "Gold Ship" would be pulled together simply because both names contain "Gold". Here the spelling of a name carries no meaning, so I set minn to 50, longer than any n-gram fastText would extract from a horse name, which effectively disables the feature and keeps names from being broken down to the character level.
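To make the effect of minn concrete, here is a simplified sketch of character n-gram extraction. It is not fastText's actual implementation (which works on UTF-8 and hashes the n-grams into buckets); the function name `char_ngrams` is hypothetical, and the defaults of 3 and 6 are what I understand fastText's skipgram defaults for minn and maxn to be.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Simplified character n-grams of a word wrapped in < and > boundary markers."""
    wrapped = f"<{word}>"
    grams = set()
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            grams.add(wrapped[start:start + n])
    return grams

# Two unrelated horses share several n-grams just because of "Gold".
shared = char_ngrams("GoldAllure") & char_ngrams("GoldShip")
print(sorted(shared))

# With minn=50 and maxn left at 6 there is no valid n-gram length,
# so nothing is extracted and only the whole name is learned.
no_subwords = char_ngrams("GoldAllure", minn=50)
print(no_subwords)
```

In this sketch the shared pieces are fragments such as "Gold" and "old", which is exactly the kind of spurious similarity the minn setting is meant to avoid.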
Use gensim to read the vector file and perform vector arithmetic.
Let's check whether Linate, Satono Diamond's younger half-sister by a different father, can be derived from Satono Diamond by vector arithmetic.
That is, swapping the father Deep Impact for Stay Gold,
Satono Diamond + Stay Gold - Deep Impact = Linate
should hold.
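The arithmetic itself can be sketched in plain Python. The three-dimensional vectors below are made up purely for illustration and are not values from the trained model; the function names are hypothetical. The ranking step mirrors, as far as I understand it, what gensim's `most_similar` does: combine the normalized positive and negative vectors, then rank the remaining names by cosine similarity.

```python
import math

# Made-up toy vectors (NOT from the real model), one per horse name.
toy_vectors = {
    "SatonoDiamond": [0.9, 0.8, 0.1],
    "DeepImpact":    [0.9, 0.1, 0.0],
    "StayGold":      [0.1, 0.1, 0.9],
    "Linate":        [0.1, 0.8, 0.9],
    "Halo":          [0.5, 0.0, 0.3],
}

def unit(vec):
    """Scale a vector to length 1."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    return sum(x * y for x, y in zip(unit(a), unit(b)))

def analogy(vectors, positive, negative):
    """Rank all names not used in the query by similarity to sum(positive) - sum(negative)."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for name in positive:
        query = [q + x for q, x in zip(query, unit(vectors[name]))]
    for name in negative:
        query = [q - x for q, x in zip(query, unit(vectors[name]))]
    used = set(positive) | set(negative)
    scores = [(name, cosine(query, vec)) for name, vec in vectors.items() if name not in used]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)

ranked = analogy(toy_vectors, positive=["StayGold", "SatonoDiamond"], negative=["DeepImpact"])
print(ranked)
```

With these toy values Linate comes out on top because its vector combines the "mother" component of Satono Diamond with the "father" component of Stay Gold. Whether the real pedigree vectors behave the same way is what the gensim check verifies.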
howto.py
import gensim

# Read the vector data (text .vec format) using gensim
model = gensim.models.KeyedVectors.load_word2vec_format('ketto_model.vec', binary=False)

# Do the arithmetic with the most_similar method:
# pass the names to add as a list in positive,
# and the names to subtract as a list in negative.
# The names must match the tokens in input.csv exactly.
model.most_similar(
    positive=["Stay Gold", "Satono Diamond"],
    negative=["Deep Impact"]
)
See the following for how to use gensim: gensim models.word2vec
Result:
[('Paulen', 0.8220623731613159),
('Marquessa', 0.8190209865570068),
('Malpensa', 0.814713716506958),
('Linate', 0.80884850025177),
('Shapira', 0.8080180287361145),
('Moonlight knight', 0.8041872382164001),
('Semplice', 0.7995823621749878),
('OnAir', 0.7940067648887634),
('Fusion lock', 0.7933699488639832),
('Orpen', 0.7927322387695312)]
Paulen, from the line of the mother's father Orpen, came out as the most similar, but after the half-sister Marquessa (father Orfevre, father's father Stay Gold) and the mother Malpensa, Linate also appears clearly in the results, in fourth place. So the vectorization can be said to have worked.
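The "did it work" check can be made explicit with a small helper. The list below is copied from the result above, with scores rounded to four decimals; the function name `rank_of` is hypothetical.

```python
# Result list from the most_similar call above (scores rounded to 4 decimals).
results = [
    ("Paulen", 0.8221),
    ("Marquessa", 0.8190),
    ("Malpensa", 0.8147),
    ("Linate", 0.8088),
    ("Shapira", 0.8080),
    ("Moonlight knight", 0.8042),
    ("Semplice", 0.7996),
    ("OnAir", 0.7940),
    ("Fusion lock", 0.7934),
    ("Orpen", 0.7927),
]

def rank_of(name, ranked_results):
    """Return the 1-based rank of name in the result list, or None if absent."""
    for position, (candidate, _score) in enumerate(ranked_results, start=1):
        if candidate == name:
            return position
    return None

linate_rank = rank_of("Linate", results)
print(linate_rank)
```

A top-k test like this is a loose criterion: it shows the expected horse is near the query vector, not that the model is accurate in general.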
That's all.