
Let's analyze text with NLP4J by combining the results of morphological analysis with simple statistical processing.

In cooking terms, "morphological analysis" and "syntax analysis" are like knowing how to use a kitchen knife. Add "statistical processing" on top of them and, I think, you have text analysis, which corresponds to cooking itself. The statistical processing used here is simple, but machine learning or more sophisticated statistical methods could be used as well.

Now, suppose you have the following documents, one record per line.

"Toyota", "I am making a hybrid car."
"Toyota", "We sell hybrid cars."
"Toyota", "I'm making a car."
"Toyota", "I sell cars."
"Nissan", "I'm making an EV."
"Nissan", "I sell EVs."
"Nissan", "I sell cars."
"Nissan", "We are affiliated with Renault."
"Nissan", "I sell light cars."
"Honda", "I'm making a car."
"Honda", "I sell cars."
"Honda", "I'm making a motorcycle."
"Honda", "I sell motorcycles."
"Honda", "I sell light cars."
"Honda", "I am making a light car."
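
Records in this "item", "text" layout can be loaded with a few lines of plain Java before they are handed to NLP4J. The following is only a minimal sketch: the class name RecordParser and its methods are hypothetical helpers, not part of NLP4J, and it assumes each line holds exactly two double-quoted fields with no embedded double quotes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of NLP4J): parses lines of the form "item", "text".
public class RecordParser {

    // Two double-quoted fields separated by a comma; embedded double quotes are not supported.
    private static final Pattern RECORD =
            Pattern.compile("^\"([^\"]*)\"\\s*,\\s*\"([^\"]*)\"$");

    /** Returns {item, text} for a well-formed line, or null if the line does not match. */
    public static String[] parseLine(String line) {
        Matcher m = RECORD.matcher(line.trim());
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    /** Parses every well-formed line and silently skips the rest. */
    public static List<String[]> parseAll(List<String> lines) {
        List<String[]> records = new ArrayList<>();
        for (String line : lines) {
            String[] rec = parseLine(line);
            if (rec != null) {
                records.add(rec);
            }
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "\"Toyota\", \"I am making a hybrid car.\"",
                "\"Nissan\", \"I'm making an EV.\"",
                "\"Honda\", \"I sell motorcycles.\"");
        for (String[] rec : parseAll(lines)) {
            System.out.println(rec[0] + " -> " + rec[1]);
        }
    }
}
```

Each parsed pair maps directly onto the item attribute and the text of one document.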

If you split these documents by "Toyota", "Nissan", and "Honda", which keywords are characteristic of each maker? Let's extract them with NLP4J; no difficult processing is required. The key point is that the statistical processing is performed by the "SimpleDocumentIndex" class.

Maven

<dependency>
  <groupId>org.nlp4j</groupId>
  <artifactId>nlp4j</artifactId>
  <version>1.0.0.0</version>
</dependency>

Code1

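import java.util.ArrayList;
import java.util.List;
// The NLP4J types used below (Document, DocumentAnnotator, Index, Keyword,
// DefaultDocument, SimpleDocumentIndex, YJpMaAnnotator) must also be imported from the NLP4J library.

public class HelloTextMiningMain1 {
	public static void main(String[] args) throws Exception {
		// Prepare the documents (they could also be read from CSV, etc.)
		List<Document> docs = new ArrayList<Document>();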
		{
			docs.add(createDocument("Toyota", "I am making a hybrid car."));
			docs.add(createDocument("Toyota", "We sell hybrid cars."));
			docs.add(createDocument("Toyota", "I'm making a car."));
			docs.add(createDocument("Toyota", "I sell cars."));
			docs.add(createDocument("Nissan", "I'm making an EV."));
			docs.add(createDocument("Nissan", "I sell EVs."));
			docs.add(createDocument("Nissan", "I sell cars."));
			docs.add(createDocument("Nissan", "We are affiliated with Renault."));
			docs.add(createDocument("Nissan", "I sell light cars."));
			docs.add(createDocument("Honda", "I'm making a car."));
			docs.add(createDocument("Honda", "I sell cars."));
			docs.add(createDocument("Honda", "I'm making a motorcycle."));
			docs.add(createDocument("Honda", "I sell motorcycles."));
			docs.add(createDocument("Honda", "I sell light cars."));
			docs.add(createDocument("Honda", "I am making a light car."));
		}

		// Morphological analysis annotator
		DocumentAnnotator annotator = new YJpMaAnnotator();
		// Run morphological analysis on the documents
		annotator.annotate(docs);

		// Prepare the keyword index (statistical processing)
		Index index = new SimpleDocumentIndex();
		// Add the annotated documents to the keyword index
		index.addDocuments(docs);
		{
			// Get the noun keywords that co-occur strongly with item=Nissan
			List<Keyword> kwds = index.getKeywords("noun", "item=Nissan");
			System.out.println("Keywords(noun) for Nissan");
			for (Keyword kwd : kwds) {
				System.out.println(String.format("%.1f,%s", kwd.getCorrelation(), kwd.getLex()));
			}
		}
		{
			// Get the noun keywords that co-occur strongly with item=Toyota
			List<Keyword> kwds = index.getKeywords("noun", "item=Toyota");
			System.out.println("Keywords(noun) for Toyota");
			for (Keyword kwd : kwds) {
				System.out.println(String.format("%.1f,%s", kwd.getCorrelation(), kwd.getLex()));
			}
		}
		{
			// Get the noun keywords that co-occur strongly with item=Honda
			List<Keyword> kwds = index.getKeywords("noun", "item=Honda");
			System.out.println("Keywords(noun) for Honda");
			for (Keyword kwd : kwds) {
				System.out.println(String.format("%.1f,%s", kwd.getCorrelation(), kwd.getLex()));
			}
		}
	}

	static Document createDocument(String item, String text) {
		Document doc = new DefaultDocument();
		doc.putAttribute("item", item);
		doc.setText(text);
		return doc;
	}

}

Output

Keywords(noun) for Nissan
3.0,EV
3.0,Renault
3.0,Alliance
1.0,Light car
0.6,Automobile
Keywords(noun) for Toyota
3.8,hybrid
3.8,car
1.5,Automobile
Keywords(noun) for Honda
2.5,bike
1.7,Light car
1.0,Automobile

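It's that easy. Do the results match human intuition? "EV" and "Renault" stand out for Nissan, "hybrid" for Toyota, and "bike" for Honda, while "Automobile", which appears under all three makers, gets the lowest score in each list.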
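
To build some intuition for these scores, here is a minimal sketch of one way such a value can be computed. It assumes the score is a simple ratio (often called lift): the share of an item's records that contain a keyword, divided by the share of all records that contain it. The article does not say how SimpleDocumentIndex computes getCorrelation(), so this is an illustration rather than NLP4J's implementation, although this ratio does happen to agree with the figures in the Output above. The class LiftSketch is hypothetical, and the nouns per record are assigned by hand to mirror the keywords in the Output, because the sketch performs no morphological analysis.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration (not NLP4J code): keyword score as a lift ratio.
public class LiftSketch {

    /** One record: the maker ("item") and the nouns assumed to be extracted from its text. */
    public static final class Doc {
        final String item;
        final List<String> nouns;

        Doc(String item, String... nouns) {
            this.item = item;
            this.nouns = Arrays.asList(nouns);
        }
    }

    /**
     * lift = (records of item containing noun / records of item)
     *      / (records containing noun / all records),
     * rearranged so the integer counts are combined before the single division.
     */
    public static double lift(List<Doc> docs, String item, String noun) {
        int total = docs.size();
        int inGroup = 0;
        int withNoun = 0;
        int both = 0;
        for (Doc d : docs) {
            boolean g = d.item.equals(item);
            boolean n = d.nouns.contains(noun);
            if (g) inGroup++;
            if (n) withNoun++;
            if (g && n) both++;
        }
        if (inGroup == 0 || withNoun == 0) {
            return 0.0; // no evidence for this item or noun
        }
        return (double) (both * total) / (inGroup * withNoun);
    }

    /** Hand-labeled stand-in for the 15 sample records (assumed noun extraction). */
    public static List<Doc> sampleRecords() {
        List<Doc> docs = new ArrayList<>();
        docs.add(new Doc("Toyota", "hybrid", "car"));
        docs.add(new Doc("Toyota", "hybrid", "car"));
        docs.add(new Doc("Toyota", "Automobile"));
        docs.add(new Doc("Toyota", "Automobile"));
        docs.add(new Doc("Nissan", "EV"));
        docs.add(new Doc("Nissan", "EV"));
        docs.add(new Doc("Nissan", "Automobile"));
        docs.add(new Doc("Nissan", "Renault", "Alliance"));
        docs.add(new Doc("Nissan", "Light car"));
        docs.add(new Doc("Honda", "Automobile"));
        docs.add(new Doc("Honda", "Automobile"));
        docs.add(new Doc("Honda", "bike"));
        docs.add(new Doc("Honda", "bike"));
        docs.add(new Doc("Honda", "Light car"));
        docs.add(new Doc("Honda", "Light car"));
        return docs;
    }

    public static void main(String[] args) {
        List<Doc> docs = sampleRecords();
        for (String noun : Arrays.asList("EV", "Renault", "Light car", "Automobile")) {
            System.out.println(String.format("%.1f,%s", lift(docs, "Nissan", noun), noun));
        }
    }
}
```

Under this reading, a score above 1 means the keyword appears more often in that maker's records than in the collection as a whole, and a score below 1 means it appears less often.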

Return to Index

Introduction of NLP4J-[000] Natural Language Processing Index in Java

Project URL

https://www.nlp4j.org/

NLP4J [003] Try text analysis using natural language processing and part-speech statistical processing in Java
