Let's analyze some text by combining the results of morphological analysis with simple statistical processing in NLP4J.
In cooking terms, "morphological analysis" and "syntax analysis" are like knowing how to handle a kitchen knife. Add "statistical processing" on top of them, and text analysis becomes the cooking itself. The statistics used here are simple, but machine learning or more elaborate statistical methods could be used in the same place.
Now, suppose you have the following documents, where each line is one record.
"Toyota", "I am making a hybrid car."
"Toyota", "We sell hybrid cars."
"Toyota", "I'm making a car."
"Toyota", "I sell cars."
"Nissan", "I'm making an EV."
"Nissan", "I sell EVs."
"Nissan", "I sell cars."
"Nissan", "We are affiliated with Renault."
"Nissan", "I sell light cars."
"Honda", "I'm making a car."
"Honda", "I sell cars."
"Honda", "I'm making a motorcycle."
"Honda", "I sell motorcycles."
"Honda", "I sell light cars."
"Honda", "I am making a light car."
If you split these documents into "Toyota", "Nissan", and "Honda", which keywords are characteristic of each group? Let's extract them with NLP4J; no complicated processing is needed. The key point is that the statistical processing is done by the "SimpleDocumentIndex" class.
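Before bringing in NLP4J, the split by maker can be pictured in plain Java. The following is only a sketch using the standard library: the class name `RecordGrouping` and the method `groupByItem` are hypothetical, and it assumes every record has exactly the `"item", "text"` shape shown above with no embedded double quotes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RecordGrouping {

    // Matches one record of the form "item", "text" (assumes no embedded double quotes).
    private static final Pattern RECORD =
            Pattern.compile("^\"([^\"]*)\",\\s*\"([^\"]*)\"$");

    /** Groups the record texts by their item field, preserving first-seen order. */
    public static Map<String, List<String>> groupByItem(List<String> lines) {
        Map<String, List<String>> groups = new LinkedHashMap<>();
        for (String line : lines) {
            Matcher m = RECORD.matcher(line.trim());
            if (!m.matches()) {
                continue; // skip lines that are not in the expected format
            }
            groups.computeIfAbsent(m.group(1), k -> new ArrayList<>()).add(m.group(2));
        }
        return groups;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "\"Toyota\", \"I sell cars.\"",
                "\"Nissan\", \"I sell EVs.\"",
                "\"Nissan\", \"We are affiliated with Renault.\"",
                "\"Honda\", \"I sell motorcycles.\"");
        // Print how many records fall under each maker.
        groupByItem(lines).forEach((item, texts) ->
                System.out.println(item + " -> " + texts.size() + " record(s)"));
    }
}
```

In the NLP4J code that follows, the same first field is stored as the "item" attribute of each document, and that attribute is what the keyword query filters on.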
Maven
<dependency>
    <groupId>org.nlp4j</groupId>
    <artifactId>nlp4j</artifactId>
    <version>1.0.0.0</version>
</dependency>
Code1
import java.util.ArrayList;
import java.util.List;

// NLP4J imports: these package names are assumed for nlp4j 1.0.0.0; adjust them if your version differs.
import nlp4j.Document;
import nlp4j.DocumentAnnotator;
import nlp4j.Index;
import nlp4j.Keyword;
import nlp4j.impl.DefaultDocument;
import nlp4j.index.SimpleDocumentIndex;
import nlp4j.yhoo_jp.YJpMaAnnotator;

public class HelloTextMiningMain1 {
    public static void main(String[] args) throws Exception {
        // Prepare the documents (they could also be read from CSV or another source)
        List<Document> docs = new ArrayList<Document>();
        {
            docs.add(createDocument("Toyota", "I am making a hybrid car."));
            docs.add(createDocument("Toyota", "We sell hybrid cars."));
            docs.add(createDocument("Toyota", "I'm making a car."));
            docs.add(createDocument("Toyota", "I sell cars."));
            docs.add(createDocument("Nissan", "I'm making an EV."));
            docs.add(createDocument("Nissan", "I sell EVs."));
            docs.add(createDocument("Nissan", "I sell cars."));
            docs.add(createDocument("Nissan", "We are affiliated with Renault."));
            docs.add(createDocument("Nissan", "I sell light cars."));
            docs.add(createDocument("Honda", "I'm making a car."));
            docs.add(createDocument("Honda", "I sell cars."));
            docs.add(createDocument("Honda", "I'm making a motorcycle."));
            docs.add(createDocument("Honda", "I sell motorcycles."));
            docs.add(createDocument("Honda", "I sell light cars."));
            docs.add(createDocument("Honda", "I am making a light car."));
        }
        // Morphological analysis annotator
        DocumentAnnotator annotator = new YJpMaAnnotator();
        // Run morphological analysis on every document
        annotator.annotate(docs);
        // Prepare the keyword index (statistical processing)
        Index index = new SimpleDocumentIndex();
        // Index the keywords of the annotated documents
        index.addDocuments(docs);
        // Print the nouns that co-occur strongly with each maker
        printKeywords(index, "Nissan");
        printKeywords(index, "Toyota");
        printKeywords(index, "Honda");
    }

    // Prints "correlation,keyword" for the nouns that co-occur strongly with the given item
    static void printKeywords(Index index, String item) {
        List<Keyword> kwds = index.getKeywords("noun", "item=" + item);
        System.out.println("Keywords(noun) for " + item);
        for (Keyword kwd : kwds) {
            System.out.println(String.format("%.1f,%s", kwd.getCorrelation(), kwd.getLex()));
        }
    }

    // Creates a document whose "item" attribute holds the maker name
    static Document createDocument(String item, String text) {
        Document doc = new DefaultDocument();
        doc.putAttribute("item", item);
        doc.setText(text);
        return doc;
    }
}
Output
Keywords(noun) for Nissan
3.0,EV
3.0,Renault
3.0,Alliance
1.0,Light car
0.6,Automobile
Keywords(noun) for Toyota
3.8,hybrid
3.8,car
1.5,Automobile
Keywords(noun) for Honda
2.5,bike
1.7,Light car
1.0,Automobile
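That's all it takes. Do the results match your intuition? "EV" and "Renault", which appear only in the Nissan records, score highest for Nissan, while "Automobile", which appears under all three makers, has the lowest score in every list.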
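The scores in the output are consistent with a simple ratio: the share of a maker's records that contain the keyword, divided by the share of all records that contain it. This is an inference from the printed numbers, not a statement about how SimpleDocumentIndex is implemented. The sketch below recomputes that ratio in plain Java. The class `LiftSketch`, its methods, and the hand-assigned nouns per record are all assumptions made for illustration; NLP4J derives the nouns itself through morphological analysis.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class LiftSketch {

    /** One record: the maker it belongs to and the nouns assumed to occur in its text. */
    public static final class Rec {
        final String item;
        final Set<String> nouns;

        Rec(String item, String... nouns) {
            this.item = item;
            this.nouns = new HashSet<>(Arrays.asList(nouns));
        }
    }

    /** Hand-assigned nouns for the 15 sample records (an assumption, labelled as in the output). */
    public static List<Rec> sampleRecords() {
        List<Rec> recs = new ArrayList<>();
        recs.add(new Rec("Toyota", "hybrid", "car"));
        recs.add(new Rec("Toyota", "hybrid", "car"));
        recs.add(new Rec("Toyota", "Automobile"));
        recs.add(new Rec("Toyota", "Automobile"));
        recs.add(new Rec("Nissan", "EV"));
        recs.add(new Rec("Nissan", "EV"));
        recs.add(new Rec("Nissan", "Automobile"));
        recs.add(new Rec("Nissan", "Renault", "Alliance"));
        recs.add(new Rec("Nissan", "Light car"));
        recs.add(new Rec("Honda", "Automobile"));
        recs.add(new Rec("Honda", "Automobile"));
        recs.add(new Rec("Honda", "bike"));
        recs.add(new Rec("Honda", "bike"));
        recs.add(new Rec("Honda", "Light car"));
        recs.add(new Rec("Honda", "Light car"));
        return recs;
    }

    /** (share of the item's records containing the noun) / (share of all records containing it). */
    public static double lift(List<Rec> recs, String item, String noun) {
        int itemSize = 0;
        int itemHits = 0;
        int allHits = 0;
        for (Rec r : recs) {
            boolean hit = r.nouns.contains(noun);
            if (hit) {
                allHits++;
            }
            if (r.item.equals(item)) {
                itemSize++;
                if (hit) {
                    itemHits++;
                }
            }
        }
        if (itemSize == 0 || allHits == 0) {
            return 0.0; // unknown item or noun: no meaningful ratio
        }
        return ((double) itemHits / itemSize) / ((double) allHits / recs.size());
    }

    public static void main(String[] args) {
        List<Rec> recs = sampleRecords();
        // Same "score,keyword" format as the output above, for the Nissan group.
        for (String noun : new String[] {"EV", "Renault", "Light car", "Automobile"}) {
            System.out.println(String.format("%.1f,%s", lift(recs, "Nissan", noun), noun));
        }
    }
}
```

For example, "EV" occurs in 2 of Nissan's 5 records and in 2 of all 15 records, so the ratio is (2/5) / (2/15) = 3.0. A value above 1 means the keyword is over-represented for that maker, and a value below 1 means it is under-represented.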
Introduction to NLP4J: [000] Index of Natural Language Processing in Java
https://www.nlp4j.org/