NLP4J [004] Try text analysis in Java using natural language processing: parsing and statistical processing

Return to Index: [003] Part-of-speech statistical processing > [004] Parsing and statistical processing > [005-1] NLP4J + Twitter4J (data collection)

Let's use NLP4J to analyze text by applying simple statistical processing to the results of morphological analysis and parsing.

As in the previous articles, "morphological analysis" and "parsing" are like knowing how to handle a kitchen knife in cooking. Add "statistical processing" on top of them, and text analysis becomes the cooking itself. The statistical processing used here is simple, but machine learning or more elaborate statistical methods could be used as well.

Now, suppose you have the following documents, one record per line.

"Toyota", "I am making a hybrid car."
"Toyota", "We sell hybrid cars."
"Toyota", "I'm making a car."
"Toyota", "I sell cars."
"Nissan", "I'm making an EV."
"Nissan", "I sell EVs."
"Nissan", "I sell cars."
"Nissan", "We are affiliated with Renault."
"Nissan", "I sell light cars."
"Honda", "I'm making a car."
"Honda", "I sell cars."
"Honda", "I'm making a motorcycle."
"Honda", "I sell motorcycles."
"Honda", "I sell light cars."
"Honda", "I am making a light car."
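
Records in this shape are easy to split into an item and a text before they are handed to NLP4J. The following is a minimal sketch in plain Java, assuming each line holds exactly two double-quoted fields with no embedded double quotes. The class and method names (RecordParserSketch, parseLine, parseAll) are hypothetical and are not part of NLP4J.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of NLP4J.
class RecordParserSketch {

    // Matches one record of the form: "item", "text"
    // Assumption: neither field contains a double quote.
    private static final Pattern RECORD =
            Pattern.compile("\\s*\"([^\"]*)\"\\s*,\\s*\"([^\"]*)\"\\s*");

    /** Returns {item, text} for a well-formed line, or null if the line does not match. */
    static String[] parseLine(String line) {
        Matcher m = RECORD.matcher(line);
        if (!m.matches()) {
            return null;
        }
        return new String[] { m.group(1), m.group(2) };
    }

    /** Parses every line and silently skips lines that are not well-formed records. */
    static List<String[]> parseAll(List<String> lines) {
        List<String[]> records = new ArrayList<>();
        for (String line : lines) {
            String[] rec = parseLine(line);
            if (rec != null) {
                records.add(rec);
            }
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "\"Toyota\", \"I am making a hybrid car.\"",
                "\"Nissan\", \"I sell EVs.\"",
                "\"Honda\", \"I sell motorcycles.\"");
        for (String[] rec : parseAll(lines)) {
            System.out.println("item=" + rec[0] + ", text=" + rec[1]);
        }
    }
}
```

Each parsed pair could then be passed to a helper such as the createDocument(item, text) method in the listing below. For real CSV files with escaped quotes, a dedicated CSV parser would be a safer choice than this regular expression.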

When the documents are grouped by "Toyota", "Nissan", and "Honda", which dependency keywords are characteristic of each group? Let's extract them with NLP4J; no difficult processing is required. The key point is to combine parsing with the statistical processing provided by the "SimpleDocumentIndex" class.

Maven

<dependency>
  <groupId>org.nlp4j</groupId>
  <artifactId>nlp4j</artifactId>
  <version>1.0.0.0</version>
</dependency>

Code1

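import java.util.ArrayList;
import java.util.List;

// NLP4J imports: package names assumed for version 1.0.0.0; verify them against your copy of the library.
import nlp4j.Document;
import nlp4j.DocumentAnnotator;
import nlp4j.Index;
import nlp4j.Keyword;
import nlp4j.impl.DefaultDocument;
import nlp4j.index.SimpleDocumentIndex;
import nlp4j.yhoo_jp.YjpAllAnnotator;

public class HelloTextMiningMain2B {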
	public static void main(String[] args) throws Exception {
		// Prepare the documents (they could also be read from CSV, etc.)
		List<Document> docs = new ArrayList<Document>();
		{
			docs.add(createDocument("Toyota", "I am making a hybrid car."));
			docs.add(createDocument("Toyota", "We sell hybrid cars."));
			docs.add(createDocument("Toyota", "I'm making a car."));
			docs.add(createDocument("Toyota", "I sell cars."));
			docs.add(createDocument("Nissan", "I'm making an EV."));
			docs.add(createDocument("Nissan", "I sell EVs."));
			docs.add(createDocument("Nissan", "I sell cars."));
			docs.add(createDocument("Nissan", "We are affiliated with Renault."));
			docs.add(createDocument("Nissan", "I sell light cars."));
			docs.add(createDocument("Honda", "I'm making a car."));
			docs.add(createDocument("Honda", "I sell cars."));
			docs.add(createDocument("Honda", "I'm making a motorcycle."));
			docs.add(createDocument("Honda", "I sell motorcycles."));
			docs.add(createDocument("Honda", "I sell light cars."));
			docs.add(createDocument("Honda", "I am making a light car."));
		}
		// Annotator that performs morphological analysis and parsing
		DocumentAnnotator annotator = new YjpAllAnnotator(); // morphological analysis + parsing
		{
			System.err.println("Morphological analysis + parsing");
			long time1 = System.currentTimeMillis();
			// Run morphological analysis and parsing on all documents
			annotator.annotate(docs);
			long time2 = System.currentTimeMillis();
			System.err.println("processing time[ms]:" + (time2 - time1));
		}
		// Prepare the keyword index (statistical processing)
		Index index = new SimpleDocumentIndex();
		{
			System.err.println("Indexing");
			long time1 = System.currentTimeMillis();
			// Add the annotated documents to the keyword index
			index.addDocuments(docs);
			long time2 = System.currentTimeMillis();
			System.err.println("processing time[ms]:" + (time2 - time1));
		}
		{
			// Get the "noun...verb" keywords that co-occur strongly with Nissan
			List<Keyword> kwds = index.getKeywords("noun...verb", "item=Nissan");
			System.out.println("noun...Verb for Nissan");
			for (Keyword kwd : kwds) {
				System.out.println(String.format("count=%d,correlation=%.1f,lex=%s", kwd.getCount(),
						kwd.getCorrelation(), kwd.getLex()));
			}
		}
		{
			// Get the "noun...verb" keywords that co-occur strongly with Toyota
			List<Keyword> kwds = index.getKeywords("noun...verb", "item=Toyota");
			System.out.println("noun...Verb for Toyota");
			for (Keyword kwd : kwds) {
				System.out.println(String.format("count=%d,correlation=%.1f,lex=%s", kwd.getCount(),
						kwd.getCorrelation(), kwd.getLex()));
			}
		}
		{
			// Get the "noun...verb" keywords that co-occur strongly with Honda
			List<Keyword> kwds = index.getKeywords("noun...verb", "item=Honda");
			System.out.println("noun...Verb for Honda");
			for (Keyword kwd : kwds) {
				System.out.println(String.format("count=%d,correlation=%.1f,lex=%s", kwd.getCount(),
						kwd.getCorrelation(), kwd.getLex()));
			}
		}
	}
	static Document createDocument(String item, String text) {
		Document doc = new DefaultDocument();
		doc.putAttribute("item", item);
		doc.setText(text);
		return doc;
	}
}

Output

Morphological analysis + parsing
processing time[ms]:9618
Indexing
processing time[ms]:3
noun...Verb for Nissan
count=1,correlation=3.0,lex=EV...sell
count=1,correlation=3.0,lex=EV...create
count=1,correlation=1.5,lex=Light car...sell
count=1,correlation=1.0,lex=Automobile...sell
noun...Verb for Toyota
count=1,correlation=3.8,lex=car...sell
count=1,correlation=3.8,lex=car...create
count=1,correlation=3.8,lex=hybrid...sell
count=1,correlation=3.8,lex=hybrid...create
count=1,correlation=1.9,lex=Automobile...create
count=1,correlation=1.3,lex=Automobile...sell
noun...Verb for Honda
count=1,correlation=2.5,lex=bike...sell
count=1,correlation=2.5,lex=Light car...create
count=1,correlation=2.5,lex=bike...create
count=1,correlation=1.3,lex=Automobile...create
count=1,correlation=1.3,lex=Light car...sell
count=1,correlation=0.8,lex=Automobile...sell

That was easy! Do these results match your intuition?
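
One way to read the correlation column is as a lift ratio: the share of a maker's records that contain the keyword, divided by the share of all records that contain it. This is inferred from the numbers in the output above, which it reproduces. It is not taken from the NLP4J source, so treat it as an assumption about how the value behaves rather than a description of the library's implementation. The sketch below uses a hypothetical class and method (LiftSketch, lift) and the record counts from the sample data: 15 records in total, 4 for Toyota, 5 for Nissan, and 6 for Honda.

```java
import java.util.Locale;

// Hypothetical illustration; not NLP4J code.
class LiftSketch {

    /**
     * Lift of a keyword for one group:
     * (keywordInGroup / docsInGroup) / (keywordOverall / docsOverall),
     * rearranged into a single division.
     */
    static double lift(int keywordInGroup, int docsInGroup,
                       int keywordOverall, int docsOverall) {
        return ((double) keywordInGroup * docsOverall)
                / ((double) docsInGroup * keywordOverall);
    }

    public static void main(String[] args) {
        // "EV...sell": 1 of 5 Nissan records, 1 of 15 records overall.
        System.out.println(String.format(Locale.ROOT, "%.1f", lift(1, 5, 1, 15))); // 3.0
        // "Automobile...sell": 1 of 5 Nissan records, 3 of 15 records overall.
        System.out.println(String.format(Locale.ROOT, "%.1f", lift(1, 5, 3, 15))); // 1.0
        // "bike...sell": 1 of 6 Honda records, 1 of 15 records overall.
        System.out.println(String.format(Locale.ROOT, "%.1f", lift(1, 6, 1, 15))); // 2.5
    }
}
```

Read this way, a value above 1.0 means the keyword is over-represented in that maker's records, and a value below 1.0, such as 0.8 for "Automobile...sell" under Honda, means it is under-represented.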


Project URL

https://www.nlp4j.org/

