NLP4J [003] Try text analysis using natural language processing and part-of-speech statistical processing in Java

Let's analyze text with NLP4J by combining the results of morphological analysis with simple statistical processing.

"Morphological analysis" and "syntax analysis" are to text analysis what knife skills are to cooking. Add "statistical processing" on top of them and, I think, you have the cooking itself: text analysis. The statistical processing used here is simple, but machine learning or more sophisticated statistics could be included as well.

Now, suppose you have the following documents, one record per line. Each record consists of an item (the company name) and a sentence.

"Toyota", "I am making a hybrid car."
"Toyota", "We sell hybrid cars."
"Toyota", "I'm making a car."
"Toyota", "I sell cars."
"Nissan", "I'm making an EV."
"Nissan", "I sell EVs."
"Nissan", "I sell cars."
"Nissan", "We are affiliated with Renault."
"Nissan", "I sell light cars."
"Honda", "I'm making a car."
"Honda", "I sell cars."
"Honda", "I'm making a motorcycle."
"Honda", "I sell motorcycles."
"Honda", "I sell light cars."
"Honda", "I am making a light car."
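
Records in this form can also be loaded from a file instead of being written out by hand. The following is a minimal sketch in plain Java, assuming every line has exactly the shape "item", "text" with no quotation marks inside either field. The class name RecordParser and its methods are hypothetical helpers for illustration and are not part of NLP4J.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not part of NLP4J): splits lines of the form
//   "item", "text"
// into their two fields.
public class RecordParser {

    // Two double-quoted fields separated by a comma; surrounding whitespace is ignored.
    // Assumption: the fields themselves contain no double quotes.
    private static final Pattern LINE =
            Pattern.compile("\\s*\"([^\"]*)\"\\s*,\\s*\"([^\"]*)\"\\s*");

    /** Returns {item, text} for a well-formed line, or null if the line does not match. */
    public static String[] parseLine(String line) {
        Matcher m = LINE.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    /** Parses all lines and silently skips the ones that are not well-formed records. */
    public static List<String[]> parseAll(List<String> lines) {
        List<String[]> records = new ArrayList<>();
        for (String line : lines) {
            String[] record = parseLine(line);
            if (record != null) {
                records.add(record);
            }
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("\"Toyota\", \"We sell hybrid cars.\"");
        lines.add("\"Nissan\", \"I sell EVs.\"");
        for (String[] record : parseAll(lines)) {
            System.out.println(record[0] + " -> " + record[1]);
        }
    }
}
```

Each parsed pair could then be passed to a helper such as the createDocument(item, text) method used in the code further down. For real CSV files with escaped quotes, a dedicated CSV library would be a safer choice than this regular expression.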

If you split these documents into "Toyota", "Nissan", and "Honda", what are the characteristic keywords of each group? Let's extract them with NLP4J. No difficult processing is required: the key point is that the statistical processing is handled by the "SimpleDocumentIndex" class.

Maven

<dependency>
  <groupId>org.nlp4j</groupId>
  <artifactId>nlp4j</artifactId>
  <version>1.0.0.0</version>
</dependency>

Code1

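import java.util.ArrayList;
import java.util.List;

// The NLP4J types used below (Document, DefaultDocument, DocumentAnnotator, YJpMaAnnotator,
// Index, SimpleDocumentIndex, Keyword) must be imported as well; check the package names
// against the NLP4J version you are using.

public class HelloTextMiningMain1 {
	public static void main(String[] args) throws Exception {
		// Prepare the documents (they could also be read from CSV, etc.)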
		List<Document> docs = new ArrayList<Document>();
		{
			docs.add(createDocument("Toyota", "I am making a hybrid car."));
			docs.add(createDocument("Toyota", "We sell hybrid cars."));
			docs.add(createDocument("Toyota", "I'm making a car."));
			docs.add(createDocument("Toyota", "I sell cars."));
			docs.add(createDocument("Nissan", "I'm making an EV."));
			docs.add(createDocument("Nissan", "I sell EVs."));
			docs.add(createDocument("Nissan", "I sell cars."));
			docs.add(createDocument("Nissan", "We are affiliated with Renault."));
			docs.add(createDocument("Nissan", "I sell light cars."));
			docs.add(createDocument("Honda", "I'm making a car."));
			docs.add(createDocument("Honda", "I sell cars."));
			docs.add(createDocument("Honda", "I'm making a motorcycle."));
			docs.add(createDocument("Honda", "I sell motorcycles."));
			docs.add(createDocument("Honda", "I sell light cars."));
			docs.add(createDocument("Honda", "I am making a light car."));
		}

		// Morphological analysis annotator
		DocumentAnnotator annotator = new YJpMaAnnotator();
		// Run morphological analysis on the documents
		annotator.annotate(docs);

		// Prepare the keyword index (statistical processing)
		Index index = new SimpleDocumentIndex();
		// Add the annotated documents to the index
		index.addDocuments(docs);
		{
			// Get the nouns that co-occur strongly with item=Nissan
			List<Keyword> kwds = index.getKeywords("noun", "item=Nissan");
			System.out.println("Keywords(noun) for Nissan");
			for (Keyword kwd : kwds) {
				System.out.println(String.format("%.1f,%s", kwd.getCorrelation(), kwd.getLex()));
			}
		}
		{
			// Get the nouns that co-occur strongly with item=Toyota
			List<Keyword> kwds = index.getKeywords("noun", "item=Toyota");
			System.out.println("Keywords(noun) for Toyota");
			for (Keyword kwd : kwds) {
				System.out.println(String.format("%.1f,%s", kwd.getCorrelation(), kwd.getLex()));
			}
		}
		{
			// Get the nouns that co-occur strongly with item=Honda
			List<Keyword> kwds = index.getKeywords("noun", "item=Honda");
			System.out.println("Keywords(noun) for Honda");
			for (Keyword kwd : kwds) {
				System.out.println(String.format("%.1f,%s", kwd.getCorrelation(), kwd.getLex()));
			}
		}
	}

	static Document createDocument(String item, String text) {
		Document doc = new DefaultDocument();
		doc.putAttribute("item", item);
		doc.setText(text);
		return doc;
	}

}

Output

Keywords(noun) for Nissan
3.0,EV
3.0,Renault
3.0,Alliance
1.0,Light car
0.6,Automobile
Keywords(noun) for Toyota
3.8,hybrid
3.8,car
1.5,Automobile
Keywords(noun) for Honda
2.5,bike
1.7,Light car
1.0,Automobile

It's that easy! Do these results match human intuition?
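
To get a feel for where such scores can come from, it helps to work one out by hand. The printed values are consistent with a lift-style ratio: the share of a company's records that contain a keyword, divided by the share of all records that contain it. For "EV" and Nissan that is (2/5) / (2/15) = 3.0. The sketch below reproduces this arithmetic in plain Java. It is an assumption about how the score behaves, not NLP4J's confirmed implementation, and the class LiftSketch, its Doc type, and the hand-assigned keywords are hypothetical stand-ins for the annotator's output.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch (not NLP4J code): lift of a keyword within one group of records.
public class LiftSketch {

    /** One record: its "item" attribute plus hand-assigned keywords. */
    public static final class Doc {
        final String item;
        final Set<String> keywords;

        public Doc(String item, String... keywords) {
            this.item = item;
            this.keywords = new HashSet<>(Arrays.asList(keywords));
        }
    }

    /**
     * lift = (records of item containing keyword / records of item)
     *      / (records containing keyword / all records)
     * Rearranged into a single division: both * total / (inItem * withKeyword).
     */
    public static double lift(List<Doc> docs, String item, String keyword) {
        int inItem = 0;
        int withKeyword = 0;
        int both = 0;
        for (Doc d : docs) {
            boolean isItem = d.item.equals(item);
            boolean hasKeyword = d.keywords.contains(keyword);
            if (isItem) inItem++;
            if (hasKeyword) withKeyword++;
            if (isItem && hasKeyword) both++;
        }
        if (inItem == 0 || withKeyword == 0) {
            return 0.0; // nothing to compare against
        }
        return (double) both * docs.size() / ((double) inItem * withKeyword);
    }

    /** The 15 sample records, with keywords assigned by hand (an assumption). */
    public static List<Doc> sampleDocs() {
        List<Doc> docs = new ArrayList<>();
        docs.add(new Doc("Toyota", "hybrid"));
        docs.add(new Doc("Toyota", "hybrid"));
        docs.add(new Doc("Toyota", "automobile"));
        docs.add(new Doc("Toyota", "automobile"));
        docs.add(new Doc("Nissan", "EV"));
        docs.add(new Doc("Nissan", "EV"));
        docs.add(new Doc("Nissan", "automobile"));
        docs.add(new Doc("Nissan", "Renault", "alliance"));
        docs.add(new Doc("Nissan", "light car"));
        docs.add(new Doc("Honda", "automobile"));
        docs.add(new Doc("Honda", "automobile"));
        docs.add(new Doc("Honda", "motorcycle"));
        docs.add(new Doc("Honda", "motorcycle"));
        docs.add(new Doc("Honda", "light car"));
        docs.add(new Doc("Honda", "light car"));
        return docs;
    }

    public static void main(String[] args) {
        List<Doc> docs = sampleDocs();
        System.out.println(String.format("%.2f,%s", lift(docs, "Nissan", "EV"), "EV"));
        System.out.println(String.format("%.2f,%s", lift(docs, "Toyota", "hybrid"), "hybrid"));
        System.out.println(String.format("%.2f,%s", lift(docs, "Honda", "motorcycle"), "motorcycle"));
    }
}
```

A value above 1 means the keyword is over-represented in that company's records, and a value below 1 means it is under-represented. That is why a word shared by all three companies scores low for Nissan (1 of 5 records against 5 of 15 overall, giving 0.6) while "EV" and "Renault" score high.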

Return to Index

Introducing NLP4J - [000] Natural Language Processing Index in Java

Project URL

https://www.nlp4j.org/

