Machine learning in Java (DeepLearning4j): learn a document and extract the words most closely related to a specific word

What to do

Using the Japanese Wikipedia article ディープインパクト (競走馬), that is, Deep Impact (racehorse), as training data, extract the words (nouns and verbs) that are highly related to "Deep Impact".


- Anyway, the focus is on what can be built and tried easily, without specialized knowledge
- The language is Java, using DeepLearning4j
- It is really nonsense for the training data (corpus: a collection of documents) to be a single Wikipedia article when this is supposed to be deep learning
- But I still want to try it out easily, so I deliberately train on just one article

Text to prepare in advance

Paste the text of the Wikipedia article ディープインパクト (競走馬) into a file as appropriate, and save it in UTF-8.


Since it is the easy route, set the project up with Maven



            <id>ATILIKA dependencies</id>






Wrap Kuromoji's Tokenizer

Out of the box, dl4j can only tokenize English, so kuromoji is used for Japanese morphological analysis. An existing article on this approach was very helpful; the code is taken from it almost as is, with a few modifications.

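import java.util.ArrayList;
import java.util.List;

import org.atilika.kuromoji.Token;
import org.deeplearning4j.text.tokenization.tokenizer.TokenPreProcess;
import org.deeplearning4j.text.tokenization.tokenizer.Tokenizer;

/**
 * Japanese morphological analysis is required, so this class
 * wraps Kuromoji's Tokenizer in dl4j's Tokenizer interface.
 * @author 
 */
public class KuromojiIpadicTokenizer implements Tokenizer {
    private List<Token> tokens;
    private int index;
    private TokenPreProcess preProcess;
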
    /**
     * The user dictionary just doesn't work ...
     * For the time being, set Mode to SEARCH, which gives relatively good results.
     */
    public KuromojiIpadicTokenizer(String toTokenize) {
        // No user dictionary is loaded here, so the builder throws no checked
        // exception and the try/catch for IOException is not needed.
        org.atilika.kuromoji.Tokenizer tokenizer
            = org.atilika.kuromoji.Tokenizer.builder()
                .mode(org.atilika.kuromoji.Tokenizer.Mode.SEARCH)
                .build();
        tokens = tokenizer.tokenize(toTokenize);
        index = (tokens.isEmpty()) ? -1 : 0;
    }

    @Override
    public int countTokens() {
        return tokens.size();
    }

    @Override
    public List<String> getTokens() {
        List<String> ret = new ArrayList<String>();
        while (hasMoreTokens()) {
            ret.add(nextToken());
        }
        return ret;
    }

    @Override
    public boolean hasMoreTokens() {
        if (index < 0)
            return false;
        else
            return index < tokens.size();
    }

    /**
     * Narrows the related words down to nouns and verbs (base forms).
     * Custom nouns may also be needed if kuromoji's user dictionary works.
     * Every other part of speech is turned into a half-width space so that it is not analyzed.
     * Kuromoji (IPADIC) reports parts of speech in Japanese: 名詞 = noun, 動詞 = verb, カスタム名詞 = custom noun.
     * @return the token to learn, or " " for an excluded part of speech
     */
    @Override
    public String nextToken() {
        if (index < 0)
            return null;

        Token tok = tokens.get(index);
        index++;
        if (!tok.getPartOfSpeech().startsWith("名詞")
            && !tok.getPartOfSpeech().startsWith("動詞")
            && !tok.getPartOfSpeech().startsWith("カスタム名詞")) {
            return " ";
        } else if (preProcess != null) {
            // Verbs are learned in their base form, nouns in their surface form
            return preProcess.preProcess(tok.getPartOfSpeech().startsWith("動詞") ? tok.getBaseForm() : tok.getSurfaceForm());
        } else {
            return tok.getSurfaceForm();
        }
    }

    @Override
    public void setTokenPreProcessor(TokenPreProcess preProcess) {
        this.preProcess = preProcess;
    }
}
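import java.io.InputStream;

import org.deeplearning4j.text.tokenization.tokenizer.TokenPreProcess;
import org.deeplearning4j.text.tokenization.tokenizer.Tokenizer;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

/**
 * Wraps the Kuromoji tokenizer in dl4j's TokenizerFactory.
 * @author 
 */
public class KuromojiIpadicTokenizerFactory implements TokenizerFactory {

    private TokenPreProcess preProcess;

    private static String preValue = "";

    @Override
    public Tokenizer create(String toTokenize) {
//        System.out.println(toTokenize);
        if (toTokenize == null || toTokenize.isEmpty()) {
            // Feed a half-width space to skip analysis instead of throwing an exception;
            // otherwise documents with consecutive line breaks cannot be learned
            toTokenize = " ";
        }
        KuromojiIpadicTokenizer ret = new KuromojiIpadicTokenizer(toTokenize);
        // Hand the preprocessor to the tokenizer so that nextToken() can apply it
        ret.setTokenPreProcessor(preProcess);
        return ret;
    }

    @Override
    public Tokenizer create(InputStream paramInputStream) {
        throw new UnsupportedOperationException();
    }

    @Override
    public void setTokenPreProcessor(TokenPreProcess preProcess) {
        this.preProcess = preProcess;
    }

    public TokenPreProcess getTokenPreProcessor() {
        return this.preProcess;
    }
}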
Train the model and output the results

Use the following class.

import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.deeplearning4j.models.embeddings.loader.WordVectorSerializer;
import org.deeplearning4j.models.word2vec.Word2Vec;
import org.deeplearning4j.text.sentenceiterator.LineSentenceIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
import org.deeplearning4j.text.tokenization.tokenizer.preprocessor.EndingPreProcessor;

/**
 * Sample that extracts words from sentences, learns their relevance
 * (loosely speaking, the theory that words which often appear close together are highly related),
 * and calculates the words most related to a given word.
 * @author 
 */
public class WordVecSample {
    public static void main(String[] args) throws IOException {

        /*
         * Stop words (words that are not evaluated).
         * The defaults are meant for English, so the stop set of the JapaneseAnalyzer is used.
         * Even though the targets are narrowed down to verbs (base forms) and nouns,
         * it is a nuisance if words such as "aru" (ある) and "iru" (いる) become targets.
         */
        List<String> stopWords = new ArrayList<>();
        // CharArraySet.toString() yields "[a, b, ...]", so strip the brackets before splitting
        String stopSet = JapaneseAnalyzer.getDefaultStopSet().toString();
        stopWords.addAll(Arrays.asList(stopSet.substring(1, stopSet.length() - 1).split(", ")));
        //This time kanji numerals were also added as a trial

        /*
         * Load the corpus (collection of sentences).
         * Note: the training data is overwhelmingly small this time, so this is nowhere near practical use.
         * It is only a sample ...
         */
        System.out.println( "Reading training data..." );
        File f = new File( "D:\\deepleaning\\deepImpact.txt" );
        SentenceIterator ite = new LineSentenceIterator( f );
        //Set a sentence preprocessor to normalize the notation
        //At this point, it may be better to include half-width kana conversion.
        ite.setPreProcessor((String sentence) -> sentence.toLowerCase().replaceAll("\n", " "));

        /*
         * Break sentences down into words.
         * The point is that preProcess normalizes the notation before tokenization.
         * Note: morphological analysis is switched to kuromoji so that Japanese is supported.
         */
        final EndingPreProcessor preProcessor = new EndingPreProcessor();
        KuromojiIpadicTokenizerFactory tokenizer = new KuromojiIpadicTokenizerFactory();
        tokenizer.setTokenPreProcessor((String token) -> {
            if(token == null) {
                return " ";
            } else {
                token = token.toLowerCase();
                String base = preProcessor.preProcess( token );
                return base;
            }
        });

        /*
         * Create the model (tokenizer and various settings).
         * This time the parameters were adjusted so that a plausible result comes out of only a few sentences.
         */
        System.out.println( "Building a model..." );
        Word2Vec vec = new Word2Vec.Builder()
            .minWordFrequency( 1 )          //Do not learn words that appear fewer times than this; set low because the corpus is small
            .iterations( 3 )                //Number of iterations during learning
            .batchSize( 1000 )              //Maximum number of words to learn in one iteration
            .layerSize( 120 )               //Number of dimensions of the word vectors
            .learningRate( 0.09 )           //Learning rate
            .minLearningRate( 1e-3 )        //Minimum learning rate
            .useAdaGrad( false )            //AdaGrad is not used
            .negativeSample( 30 )           //Number of negative samples used in skip-gram; with a large corpus it should probably be reduced
            .stopWords(stopWords)           //Excluded words: words that appear everywhere, such as "aru"
            .iterate( ite )                 //Corpus (sentence iterator)
            .tokenizerFactory(tokenizer)    //Tokenizer
            .build();

        System.out.println( "Learning..." );
        vec.fit();

        /*
         * Output the trained model for analysis.
         * In practice, it is a good idea to persist this by appropriate means.
         */
        WordVectorSerializer.writeWordVectors( vec , "D:\\words2.txt" );

        /*
         * Output of the learning results.
         * The similarity between two words can be calculated as the cosine similarity,
         * but since there are few samples it is left commented out.
         */
        // String word1 = "Word 1";
        // String word2 = "Word 2";
        // double similarity = vec.similarity( word1 , word2 );
        // System.out.println( String.format( "The similarity between \"%s\" and \"%s\" is %f" , word1 , word2 , similarity ) );

        //Try to select 5 words that are similar to any word
        String word = "ディープインパクト";   // "Deep Impact"; the corpus is Japanese, so the query word must be too
        int ranking = 5;
        Collection<String> similarWords = vec.wordsNearest( word , ranking );
        System.out.println( String.format( "Words presumed to be strongly related to \"%s\" ⇒ %s" , word , similarWords ) );
    }
}
Output result

The previous learning results are not carried over to the second and later runs. Every run starts training from scratch.

1st time

Words presumed to be strongly related to "Deep Impact"
⇒ [Straight line,Jockey,Baba,Run,Symboli Rudolf]

2nd time

Words presumed to be strongly related to "Deep Impact"
⇒ [Bull,Andre,Writer,win,Pointed out]

3rd time

Words presumed to be strongly related to "Deep Impact"
⇒ [receive,Betting ticket,For,magnification,Horse]

4th time

Words presumed to be strongly related to "Deep Impact"
⇒ [Match,Western text,Evaluation,Hong Kong,Self]

5th time

Words presumed to be strongly related to "Deep Impact"
⇒ [Grades,iron,run,sirocco,come]

The first run looks about right. However, the second and later runs are terrible...

This result is unavoidable, since the model was trained on only one article. Increasing the training data and tuning the parameters should improve it.

That said, a dictionary of racehorse names should be added to keep horse names from being split into separate words, and a few more stop words seem necessary in any case. Perhaps the verbs were not needed at all.
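
As a small illustration of the extra stop words idea, here is a minimal sketch in plain Java. It is not part of the code above: `removeStopWords` is a hypothetical helper, and the extra stop words are only examples ("aru" ある and "iru" いる come from the comments in the sample, する is my own addition).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class StopWordSketch {
    /** Drops blank tokens and tokens contained in the stop word set. */
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            String t = token.trim();
            if (!t.isEmpty() && !stopWords.contains(t)) {
                kept.add(t);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Example stop words only; a real list would be tuned against the corpus.
        Set<String> extraStopWords = new HashSet<>(Arrays.asList("ある", "いる", "する"));
        // Tokens as the tokenizer wrapper might emit them, including " " placeholders.
        List<String> tokens = Arrays.asList("馬", " ", "いる", "走る", "する");
        System.out.println(removeStopWords(tokens, extraStopWords));
    }
}
```

In the sample itself the same effect would come from adding such words to the stopWords list that is passed to the Word2Vec builder, so a separate filtering pass like this is only for checking a candidate list by hand.
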

