Overview

fastText is a natural language processing library published by Facebook. Its main selling point is that it trains very quickly. GitHub fastText

For how it works, please refer to this explainer. What can you do with fastText that learns 1 billion words published by Facebook in minutes

This time, instead of natural language, I would like to use fastText to vectorize horse racing pedigrees. The idea of using fastText for something other than natural language processing came from the article below. Use fastText to get distributed representations of non-words

Execution result

I have uploaded the vector file, together with a Jupyter notebook showing how to use it, to GitHub. GitHub keiba_ketto_vec

How to make

Dataset

Each horse's pedigree, going back three generations, is put into the format fastText expects.

As an example, here is the pedigree table for the racehorse Satono Diamond.

Pedigree	Horse name
Child	Satono Diamond
Father	Deep Impact
Mother	Malpensa
Father father	Sunday Silence
Father mother	Wind in Her Hair
Mother father	Orpen
Mother mother	Marsella
Father father father	Halo
Father father mother	Wishing Well
Father mother father	Alzao
Father mother mother	Burghclere
Mother father father	Lure
Mother father mother	Bonita Francita
Mother mother father	Southern Halo
Mother mother mother	Riviere

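Convert the horse names in the table above into a single line separated by half-width (ASCII) spaces, and do the same for every other racehorse, one horse per line. Because fastText splits each line on whitespace, each horse name has to be a single token with no internal spaces.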

`input.csv`


SatonoDiamond DeepImpact Malpensa SundaySilence WindInHerHair Orpen Marsella Halo WishingWell Alzao Burghclere Lure BonitaFrancita SouthernHalo Riviere
SimonTrunale GoldAllure Humoresque SundaySilence Nikiya Afleet AllieWin Halo WishingWell Nureyev ReluctantGuest Mr.Prospector PoliteLady Alydar FleetVictress
WaterLourdes WaterLeague WaterHenin Dehere Solo BostonHarbor Scrape DeputyMinister SisterDot Halo MineOnly Capote HarborSprings Mr.Prospector File
...
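The conversion from a pedigree table to one training line can be sketched roughly as follows. This is only an illustration, not the script used for the published data: the `PARENTS` dictionary and the `pedigree_line` function are hypothetical names, and the sketch assumes each horse name is written without internal spaces so that fastText treats it as one token.

```python
# Hypothetical in-memory pedigree: horse -> (father, mother).
# Only the entries needed for Satono Diamond's three generations are listed.
PARENTS = {
    "SatonoDiamond": ("DeepImpact", "Malpensa"),
    "DeepImpact": ("SundaySilence", "WindInHerHair"),
    "Malpensa": ("Orpen", "Marsella"),
    "SundaySilence": ("Halo", "WishingWell"),
    "WindInHerHair": ("Alzao", "Burghclere"),
    "Orpen": ("Lure", "BonitaFrancita"),
    "Marsella": ("SouthernHalo", "Riviere"),
}


def pedigree_line(horse, generations=3):
    """Return the horse followed by its ancestors, generation by generation."""
    tokens = [horse]
    current = [horse]
    for _ in range(generations):
        next_generation = []
        for name in current:
            father, mother = PARENTS[name]
            next_generation.extend([father, mother])
        tokens.extend(next_generation)
        current = next_generation
    # One horse per line, names separated by single ASCII spaces.
    return " ".join(tokens)


line = pedigree_line("SatonoDiamond")
print(line)
```

With three generations each line holds 15 names: the horse itself, 2 parents, 4 grandparents and 8 great-grandparents.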

Vectorization

Use the fasttext skipgram command for vectorization. Running it should generate a .bin file and a .vec file.

$ fasttext skipgram -input input.csv -output ketto_model -minn 50

-input option: specifies the input file
-output option: specifies the name of the output model
-minn option: specifies the minimum length of the character n-grams that each word is decomposed into

Regarding the -minn option: as I understand the mechanism, fastText does not only learn from the space-separated words themselves, it also decomposes each word further into character-level pieces and learns from those. Looked at the implementation of fastText

With this feature, for example, the horses "Gold Allure" and "Gold Ship" would be pulled together by the shared "Gold". Here the name itself carries no meaning, so -minn is set to a large value (50) to effectively disable the feature and keep names from being broken down to the character level.
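To make the effect of -minn concrete, here is a simplified sketch of character n-gram extraction. It is not fastText's actual code (which works on UTF-8 text and hashes the n-grams into buckets); the function name `char_ngrams` is hypothetical. The `<` and `>` boundary markers and the default lengths of 3 to 6 follow fastText's documented behaviour.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return the set of character n-grams of '<word>' with minn <= n <= maxn."""
    wrapped = "<" + word + ">"
    grams = set()
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            grams.add(wrapped[start:start + n])
    return grams


# With the default lengths the two names share every n-gram that fits inside "<Gold".
shared = char_ngrams("GoldAllure") & char_ngrams("GoldShip")
print(sorted(shared))

# With minn=50 (and maxn left at 6) the range of lengths is empty,
# so no n-grams are produced and only the whole name is learned.
print(char_ngrams("GoldAllure", minn=50))
```

The second call returns an empty set, which is why a large -minn keeps unrelated horses from being drawn together just because their names share a prefix.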

Check the result

Use gensim to read the vectorized file and perform vector operations.

Check whether Linate, Satono Diamond's younger half-sister by a different father, can be reached from Satono Diamond using vector operations.

Satono Diamond (Father: Deep Impact)
Linate (Father: Stay Gold)

Satono Diamond + Stay Gold - Deep Impact = Linate

Should hold.
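The arithmetic above can be illustrated with a toy example in plain Python. The three-dimensional vectors below are invented for illustration and are not taken from the trained model; `cosine` and `analogy` are hypothetical helper names. gensim's most_similar works on the same principle but differs in details such as how it normalizes the input vectors.

```python
import math

# Toy vectors, invented for illustration only.
# Axes: [mother Malpensa, father Deep Impact, father Stay Gold]
vectors = {
    "SatonoDiamond": [1.0, 1.0, 0.0],
    "DeepImpact": [0.0, 1.0, 0.0],
    "StayGold": [0.0, 0.0, 1.0],
    "Linate": [1.0, 0.0, 1.0],
    "Malpensa": [1.0, 0.0, 0.0],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(positive, negative):
    """Add the positive vectors, subtract the negative ones, rank the rest."""
    dim = len(next(iter(vectors.values())))
    query = [0.0] * dim
    for name in positive:
        query = [q + v for q, v in zip(query, vectors[name])]
    for name in negative:
        query = [q - v for q, v in zip(query, vectors[name])]
    used = set(positive) | set(negative)
    scored = [(name, cosine(query, vec))
              for name, vec in vectors.items() if name not in used]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = analogy(positive=["SatonoDiamond", "StayGold"], negative=["DeepImpact"])
print(ranked)
```

In this toy setup Linate ranks first and the mother Malpensa second, because swapping the father component leaves the mother component untouched.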

`howto.py`


import gensim

# Read the vector data written by fastText using gensim
model = gensim.models.KeyedVectors.load_word2vec_format('ketto_model.vec', binary=False)

# Do the calculation with the most_similar method.
# Pass the names to add as a list in positive and the names to subtract as a list in negative.
# The names must match the tokens in input.csv exactly (lookups are case-sensitive).
model.most_similar(
    positive=["StayGold", "SatonoDiamond"],
    negative=["DeepImpact"]
)

See the following for how to use gensim. gensim models.word2vec

`result.`


[('Paulen', 0.8220623731613159),
 ('Marquessa', 0.8190209865570068),
 ('Malpensa', 0.814713716506958),
 ('Linate', 0.80884850025177),
 ('Shapira', 0.8080180287361145),
 ('MoonlightKnight', 0.8041872382164001),
 ('Semplice', 0.7995823621749878),
 ('OnAir', 0.7940067648887634),
 ('FusionLock', 0.7933699488639832),
 ('Orpen', 0.7927322387695312)]

Paulen, from the line of the mother's father Orpen, came out as the most similar, but Linate also appears clearly in the result, right after the half-sister Marquessa (father Orfevre, father's father Stay Gold) and the mother Malpensa. So it seems fair to say that the vectorization worked.

That's all.

Vectorization of horse racing pedigree using fastText