Using the [Wikipedia article](https://ja.wikipedia.org/wiki/%E3%83%87%E3%82%A3%E3%83%BC%E3%83%97%E3%82%A4%E3%83%B3%E3%83%91%E3%82%AF%E3%83%88_%28%E7%AB%B6%E8%B5%B0%E9%A6%AC%29) as training data, extract words (nouns and verbs) that are highly related to "Deep Impact".
- Anyway, focus on something that can be built and tried easily, without specialized knowledge
- Language: Java, using DeepLearning4j
- It is frankly absurd to call it deep learning when the training data (corpus: the document collection) is a single Wiki article
- But I still want to try it the easy way, so I deliberately train on that one article alone
Paste the text as appropriate from [here](https://ja.wikipedia.org/wiki/%E3%83%87%E3%82%A3%E3%83%BC%E3%83%97%E3%82%A4%E3%83%B3%E3%83%91%E3%82%AF%E3%83%88_%28%E7%AB%B6%E8%B5%B0%E9%A6%AC%29) and save it in UTF-8.
pom.xml
(omitted)
<repositories>
    <repository>
        <id>ATILIKA dependencies</id>
        <url>http://www.atilika.org/nexus/content/repositories/atilika</url>
    </repository>
</repositories>
(omitted)
<dependency>
    <artifactId>lucene-core</artifactId>
    <groupId>org.apache.lucene</groupId>
    <version>5.1.0</version>
</dependency>
<dependency>
    <artifactId>lucene-analyzers-kuromoji</artifactId>
    <groupId>org.apache.lucene</groupId>
    <version>5.1.0</version>
</dependency>
<dependency>
    <groupId>org.deeplearning4j</groupId>
    <artifactId>deeplearning4j-ui</artifactId>
    <version>0.5.0</version>
</dependency>
<dependency>
    <groupId>org.deeplearning4j</groupId>
    <artifactId>deeplearning4j-nlp</artifactId>
    <version>0.5.0</version>
</dependency>
<dependency>
    <groupId>org.nd4j</groupId>
    <artifactId>nd4j-native</artifactId>
    <version>0.5.0</version>
</dependency>
<dependency>
    <groupId>org.atilika.kuromoji</groupId>
    <artifactId>kuromoji</artifactId>
    <version>0.7.7</version>
    <type>jar</type>
</dependency>
Out of the box, DeepLearning4j can only morphologically analyze English, so Kuromoji is used for Japanese. An article I found on this was very helpful; in fact, the code below is almost identical to it, with only a few modifications.
KuromojiIpadicTokenizer.java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.logging.Level;
import java.util.logging.Logger;

import org.atilika.kuromoji.Token;
import org.deeplearning4j.text.tokenization.tokenizer.TokenPreProcess;
import org.deeplearning4j.text.tokenization.tokenizer.Tokenizer;

/**
 * Japanese morphological analysis is required, so this wraps Kuromoji's
 * Tokenizer in dl4j's Tokenizer interface.
 * @author
 */
public class KuromojiIpadicTokenizer implements Tokenizer {

    private List<Token> tokens;
    private int index;
    private TokenPreProcess preProcess;

    /**
     * The user dictionary just doesn't seem to work...
     * For now, set Mode to SEARCH, which gives reasonably good results.
     */
    public KuromojiIpadicTokenizer(String toTokenize) {
        try {
            org.atilika.kuromoji.Tokenizer tokenizer
                    = org.atilika.kuromoji.Tokenizer.builder()
                            .userDictionary("D:\\deepleaning\\mydic.txt")
                            .mode(org.atilika.kuromoji.Tokenizer.Mode.SEARCH)
                            .build();
            tokens = tokenizer.tokenize(toTokenize);
            index = (tokens.isEmpty()) ? -1 : 0;
        } catch (IOException ex) {
            Logger.getLogger(KuromojiIpadicTokenizer.class.getName()).log(Level.SEVERE, null, ex);
        }
    }

    @Override
    public int countTokens() {
        return tokens.size();
    }

    @Override
    public List<String> getTokens() {
        List<String> ret = new ArrayList<String>();
        while (hasMoreTokens()) {
            ret.add(nextToken());
        }
        return ret;
    }

    @Override
    public boolean hasMoreTokens() {
        if (index < 0)
            return false;
        else
            return index < tokens.size();
    }

    /**
     * Narrow the related-word candidates down to nouns and verbs (base forms).
     * The custom-noun check would matter if kuromoji's user dictionary worked.
     * Every other part of speech is reduced to a half-width space so it stays
     * out of the analysis.
     * Note: Kuromoji (IPADIC) returns Japanese part-of-speech tags, so the
     * comparisons use 名詞 (noun), 動詞 (verb), and カスタム名詞 (custom noun).
     * @return
     */
    @Override
    public String nextToken() {
        if (index < 0)
            return null;
        Token tok = tokens.get(index);
        index++;
        if (!tok.getPartOfSpeech().startsWith("名詞")
                && !tok.getPartOfSpeech().startsWith("動詞")
                && !tok.getPartOfSpeech().startsWith("カスタム名詞")) {
            return " ";
        } else if (preProcess != null) {
            // Verbs are normalized to their base (dictionary) form
            return preProcess.preProcess(tok.getPartOfSpeech().startsWith("動詞")
                    ? tok.getBaseForm() : tok.getSurfaceForm());
        } else {
            return tok.getSurfaceForm();
        }
    }

    @Override
    public void setTokenPreProcessor(TokenPreProcess preProcess) {
        this.preProcess = preProcess;
    }
}
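To sanity-check the wrapper in isolation, a minimal sketch like the following can be used (the class name and sample sentence are my own illustration, not part of the original post). Nouns and verbs should come through as tokens, while other parts of speech collapse into single spaces.
import org.deeplearning4j.text.tokenization.tokenizer.Tokenizer;

public class TokenizerCheck {
    public static void main(String[] args) {
        // Any Japanese sentence works; this one says "Deep Impact is a Japanese racehorse."
        Tokenizer t = new KuromojiIpadicTokenizer("ディープインパクトは日本の競走馬である。");
        while (t.hasMoreTokens()) {
            // Brackets make the space placeholders for dropped tokens visible
            System.out.println("[" + t.nextToken() + "]");
        }
    }
}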
KuromojiIpadicTokenizerFactory.java
import java.io.InputStream;

import org.deeplearning4j.text.tokenization.tokenizer.TokenPreProcess;
import org.deeplearning4j.text.tokenization.tokenizer.Tokenizer;
import org.deeplearning4j.text.tokenization.tokenizerfactory.TokenizerFactory;

/**
 * Wraps Kuromoji in dl4j's TokenizerFactory.
 * @author
 */
public class KuromojiIpadicTokenizerFactory implements TokenizerFactory {

    private TokenPreProcess preProcess;

    @Override
    public Tokenizer create(String toTokenize) {
        if (toTokenize == null || toTokenize.isEmpty()) {
            // Sidestep the analysis with a half-width space instead of throwing;
            // otherwise documents with consecutive line breaks cannot be learned.
            toTokenize = " ";
        }
        KuromojiIpadicTokenizer ret = new KuromojiIpadicTokenizer(toTokenize);
        ret.setTokenPreProcessor(preProcess);
        return ret;
    }

    @Override
    public Tokenizer create(InputStream paramInputStream) {
        throw new UnsupportedOperationException();
    }

    @Override
    public void setTokenPreProcessor(TokenPreProcess preProcess) {
        this.preProcess = preProcess;
    }

    @Override
    public TokenPreProcess getTokenPreProcessor() {
        return this.preProcess;
    }
}
Now put these classes to use.
WordVecSample.java
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collection;
import java.util.List;

import org.apache.lucene.analysis.ja.JapaneseAnalyzer;
import org.deeplearning4j.models.embeddings.loader.WordVectorSerializer;
import org.deeplearning4j.models.word2vec.Word2Vec;
import org.deeplearning4j.text.sentenceiterator.LineSentenceIterator;
import org.deeplearning4j.text.sentenceiterator.SentenceIterator;
import org.deeplearning4j.text.tokenization.tokenizer.preprocessor.EndingPreProcessor;

/**
 * Sample that extracts words from sentences, learns their relatedness
 * (loosely put, on the theory that words appearing frequently close together
 * are highly related), and computes the most closely related words.
 * @author
 */
public class WordVecSample {

    public static void main(String[] args) throws IOException {

        /**
         * dl4j's defaults are meant for English, so the JapaneseAnalyzer's
         * stop set is used for the stop words (words excluded from evaluation).
         * Even with evaluation narrowed to verbs (base forms) and nouns, words
         * like ある and いる ("to be") are hard to treat as targets.
         */
        List<String> stopWords = new ArrayList<>();
        stopWords.addAll(Arrays.asList(JapaneseAnalyzer.getDefaultStopSet().toString().split(", ")));
        // This time the kanji numerals were also added as a trial
        stopWords.addAll(Arrays.asList("一", "二", "三", "四", "五", "六", "七", "八", "九", "十"));

        /**
         * Load the corpus (collection of sentences).
         * NOTE: the training data this time is overwhelmingly small, so this is
         * nowhere near practical use. It is strictly a sample...
         */
        System.out.println("Reading training data...");
        File f = new File("D:\\deepleaning\\deepImpact.txt");
        SentenceIterator ite = new LineSentenceIterator(f);
        // Override the preprocessor to smooth out the text.
        // At this point it might also be good to normalize half-width kana.
        ite.setPreProcessor((String sentence) -> sentence.toLowerCase().replaceAll("\n", " "));

        /**
         * Split sentences into words.
         * The point is to normalize the notation in preProcess before token
         * splitting.
         * NOTE: morphological analysis is swapped out for kuromoji so that
         * Japanese is supported.
         */
        final EndingPreProcessor preProcessor = new EndingPreProcessor();
        KuromojiIpadicTokenizerFactory tokenizer = new KuromojiIpadicTokenizerFactory();
        tokenizer.setTokenPreProcessor((String token) -> {
            if (token == null) {
                return " ";
            } else {
                token = token.toLowerCase();
                return preProcessor.preProcess(token);
            }
        });

        /**
         * Build the model (tokenizer and various settings).
         * The parameters were tuned so that even a handful of sentences
         * produces plausible-looking results.
         */
        System.out.println("Building a model...");
        Word2Vec vec = new Word2Vec.Builder()
                .minWordFrequency(1)         // Ignore words appearing fewer times than this; set low because the corpus is tiny
                .iterations(3)               // Number of training iterations
                .batchSize(1000)             // Maximum number of words to learn in one iteration
                .layerSize(120)              // Dimensionality of the word vectors
                .learningRate(0.09)          // Learning rate
                .minLearningRate(1e-3)       // Minimum learning rate
                .useAdaGrad(false)           // AdaGrad (http://www.jmlr.org/papers/volume12/duchi11a/duchi11a.pdf) is not used
                .negativeSample(30)          // Number of negative samples for skip-gram; with a large corpus this should probably be reduced
                .stopWords(stopWords)        // Excluded words: words that appear everywhere, such as "to be"
                .iterate(ite)                // The corpus
                .tokenizerFactory(tokenizer) // The tokenizer
                .build();

        /**
         * Train.
         */
        System.out.println("Learning...");
        vec.fit();

        /**
         * Write the trained model out for inspection.
         * In practice it should be persisted by some appropriate means.
         */
        WordVectorSerializer.writeWordVectors(vec, "D:\\words2.txt");

        /**
         * Output the training results.
         */
        /*
        // The similarity of two words can be computed as a cosine distance,
        // but with so few samples this is left commented out.
        String word1 = "word 1";
        String word2 = "word 2";
        double similarity = vec.similarity(word1, word2);
        System.out.println(String.format("The similarity between 「%s」 and 「%s」 is %f", word1, word2, similarity));
        */

        // Pick the 5 words most similar to a given word
        String word = "ディープインパクト"; // "Deep Impact"
        int ranking = 5;
        Collection<String> similarWords = vec.wordsNearest(word, ranking);
        System.out.println(String.format("Words presumed to be closely related to 「%s」 ⇒ %s", word, similarWords));
    }
}
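Since the vectors above are written to a text file, they can be read back later without retraining. Below is a rough sketch, assuming the text-format loader DL4J shipped around version 0.5.0 (WordVectorSerializer.loadTxtVectors); verify the exact method name against the version actually in use.
import java.io.File;

import org.deeplearning4j.models.embeddings.loader.WordVectorSerializer;
import org.deeplearning4j.models.embeddings.wordvectors.WordVectors;

public class LoadVectorsSketch {
    public static void main(String[] args) throws Exception {
        // Load the vectors persisted by WordVecSample
        WordVectors vec = WordVectorSerializer.loadTxtVectors(new File("D:\\words2.txt"));
        // Query without retraining
        System.out.println(vec.wordsNearest("ディープインパクト", 5));
    }
}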
Each run starts learning from scratch; results are not carried over from the previous run. Here are the results of five separate runs:
1st run:
Words presumed to be strongly related to "Deep Impact"
⇒ [Straight line, Jockey, Baba, Run, Symboli Rudolf]

2nd run:
Words presumed to be strongly related to "Deep Impact"
⇒ [Bull, Andre, Writer, Win, Pointed out]

3rd run:
Words presumed to be strongly related to "Deep Impact"
⇒ [Receive, Betting ticket, For, Magnification, Horse]

4th run:
Words presumed to be strongly related to "Deep Impact"
⇒ [Match, Western text, Evaluation, Hong Kong, Self]

5th run:
Words presumed to be strongly related to "Deep Impact"
⇒ [Grades, Iron, Run, Sirocco, Come]
The first run looks about right. The second and later runs, however, are terrible...
Given that training used just one article, this result can't be helped. Increasing the training data and tuning the parameters should improve things.
Beyond that, a user dictionary of racehorse names is needed so that horse names don't get split apart into separate words (a sketch of the entry format follows), more stop words should certainly be added, and perhaps verbs weren't needed at all.
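For reference, the Atilika Kuromoji user dictionary (the mydic.txt the tokenizer points at) is a CSV of surface form, segmentation, reading, and part-of-speech tag. A hypothetical entry for the horse name might look like the line below; カスタム名詞 ("custom noun") is the tag that nextToken() in the tokenizer checks for. This entry is my own illustration, not from the original post.
mydic.txt
ディープインパクト,ディープインパクト,ディープインパクト,カスタム名詞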