Return to Index: [003] Statistical processing of parts of speech > [004] Statistical processing of parsing results > [005-1] NLP4J + Twitter4J (data collection)

Let's use NLP4J to analyze text by combining the results of morphological analysis and parsing with simple statistical processing.

As before, "morphological analysis" and "syntax analysis" are like knowing how to use a kitchen knife in cooking. Add "statistical processing" on top of them and, in my view, you get text analysis, which corresponds to the cooking itself. The statistical processing used here is simple, but machine learning or more sophisticated statistics would work just as well.

Now, suppose you have the following documents, with one record per line.

"Toyota", "I am making a hybrid car."
"Toyota", "We sell hybrid cars."
"Toyota", "I'm making a car."
"Toyota", "I sell cars."
"Nissan", "I'm making an EV."
"Nissan", "I sell EVs."
"Nissan", "I sell cars."
"Nissan", "We are affiliated with Renault."
"Nissan", "I sell light cars."
"Honda", "I'm making a car."
"Honda", "I sell cars."
"Honda", "I'm making a motorcycle."
"Honda", "I sell motorcycles."
"Honda", "I sell light cars."
"Honda", "I am making a light car."

When the records are grouped by "Toyota", "Nissan", and "Honda", which dependency keywords (noun-verb pairs linked by a dependency relation) are characteristic of each group? Let's extract them using NLP4J. No difficult processing is involved: the key is to combine parsing with the statistical processing provided by the "SimpleDocumentIndex" class.
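Because each line of the sample data above is one record of the form "item", "text", the lines can be split into those two fields before any analysis. The following is a minimal sketch in plain Java, not part of NLP4J: the class name `RecordParserSketch`, the holder type `ItemText`, and the regular expression are assumptions made for illustration, and escaped quotes inside a field are not handled.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RecordParserSketch {

    // Hypothetical holder for one record: the group label and its sentence.
    static final class ItemText {
        final String item;
        final String text;

        ItemText(String item, String text) {
            this.item = item;
            this.text = text;
        }
    }

    // Matches lines of the form: "item", "text" (escaped quotes are not supported).
    private static final Pattern LINE =
            Pattern.compile("^\\s*\"([^\"]*)\"\\s*,\\s*\"([^\"]*)\"\\s*$");

    // Converts raw lines into records; lines that do not match are skipped.
    static List<ItemText> parse(List<String> lines) {
        List<ItemText> records = new ArrayList<>();
        for (String line : lines) {
            Matcher m = LINE.matcher(line);
            if (m.matches()) {
                records.add(new ItemText(m.group(1), m.group(2)));
            }
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "\"Toyota\", \"We sell hybrid cars.\"",
                "\"Nissan\", \"I sell EVs.\"",
                "\"Honda\", \"I sell motorcycles.\"");
        for (ItemText r : parse(lines)) {
            System.out.println(r.item + " -> " + r.text);
        }
    }
}
```

Each parsed record carries the same two values that the `createDocument(item, text)` helper in Code1 expects, so a loop over the parsed list could replace the hand-written `docs.add(...)` calls.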

Maven

<dependency>
  <groupId>org.nlp4j</groupId>
  <artifactId>nlp4j</artifactId>
  <version>1.0.0.0</version>
</dependency>

Code1

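import java.util.ArrayList;
import java.util.List;

// The NLP4J types used below (Document, DefaultDocument, DocumentAnnotator, YjpAllAnnotator,
// Index, SimpleDocumentIndex, Keyword) must also be imported from the nlp4j packages.

public class HelloTextMiningMain2B {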
	public static void main(String[] args) throws Exception {
		//Preparation of documents (Reading CSV etc. is also possible)
		List<Document> docs = new ArrayList<Document>();
		{
			docs.add(createDocument("Toyota", "I am making a hybrid car."));
			docs.add(createDocument("Toyota", "We sell hybrid cars."));
			docs.add(createDocument("Toyota", "I'm making a car."));
			docs.add(createDocument("Toyota", "I sell cars."));
			docs.add(createDocument("Nissan", "I'm making an EV."));
			docs.add(createDocument("Nissan", "I sell EVs."));
			docs.add(createDocument("Nissan", "I sell cars."));
			docs.add(createDocument("Nissan", "We are affiliated with Renault."));
			docs.add(createDocument("Nissan", "I sell light cars."));
			docs.add(createDocument("Honda", "I'm making a car."));
			docs.add(createDocument("Honda", "I sell cars."));
			docs.add(createDocument("Honda", "I'm making a motorcycle."));
			docs.add(createDocument("Honda", "I sell motorcycles."));
			docs.add(createDocument("Honda", "I sell light cars."));
			docs.add(createDocument("Honda", "I am making a light car."));
		}
		//Morphological analysis annotator + parsing annotator
		DocumentAnnotator annotator = new YjpAllAnnotator(); //Morphological analysis + parsing
		{
			System.err.println("Morphological analysis + parsing");
			long time1 = System.currentTimeMillis();
			//Morphological analysis + parsing
			annotator.annotate(docs);
			long time2 = System.currentTimeMillis();
			System.err.println("processing time[ms]：" + (time2 - time1));
		}
		//Preparation of keyword index (statistical processing)
		Index index = new SimpleDocumentIndex();
		{
			System.err.println("Indexing");
			long time1 = System.currentTimeMillis();
			//Keyword indexing process
			index.addDocuments(docs);
			long time2 = System.currentTimeMillis();
			System.err.println("processing time[ms]：" + (time2 - time1));
		}
		{
			//Get "noun...verb" keywords that co-occur strongly with Nissan
			List<Keyword> kwds = index.getKeywords("noun...verb", "item=Nissan");
			System.out.println("noun...Verb for Nissan");
			for (Keyword kwd : kwds) {
				System.out.println(String.format("count=%d,correlation=%.1f,lex=%s", kwd.getCount(),
						kwd.getCorrelation(), kwd.getLex()));
			}
		}
		{
			//Get "noun...verb" keywords that co-occur strongly with Toyota
			List<Keyword> kwds = index.getKeywords("noun...verb", "item=Toyota");
			System.out.println("noun...Verb for Toyota");
			for (Keyword kwd : kwds) {
				System.out.println(String.format("count=%d,correlation=%.1f,lex=%s", kwd.getCount(),
						kwd.getCorrelation(), kwd.getLex()));
			}
		}
		{
			//Get "noun...verb" keywords that co-occur strongly with Honda
			List<Keyword> kwds = index.getKeywords("noun...verb", "item=Honda");
			System.out.println("noun...Verb for Honda");
			for (Keyword kwd : kwds) {
				System.out.println(String.format("count=%d,correlation=%.1f,lex=%s", kwd.getCount(),
						kwd.getCorrelation(), kwd.getLex()));
			}
		}
	}
	static Document createDocument(String item, String text) {
		Document doc = new DefaultDocument();
		doc.putAttribute("item", item);
		doc.setText(text);
		return doc;
	}
}

Output

Morphological analysis + parsing
processing time[ms]：9618
Indexing
processing time[ms]：3
noun...Verb for Nissan
count=1,correlation=3.0,lex=EV...sell
count=1,correlation=3.0,lex=EV...create
count=1,correlation=1.5,lex=Light car...sell
count=1,correlation=1.0,lex=Automobile...sell
noun...Verb for Toyota
count=1,correlation=3.8,lex=car...sell
count=1,correlation=3.8,lex=car...create
count=1,correlation=3.8,lex=hybrid...sell
count=1,correlation=3.8,lex=hybrid...create
count=1,correlation=1.9,lex=Automobile...create
count=1,correlation=1.3,lex=Automobile...sell
noun...Verb for Honda
count=1,correlation=2.5,lex=bike...sell
count=1,correlation=2.5,lex=Light car...create
count=1,correlation=2.5,lex=bike...create
count=1,correlation=1.3,lex=Automobile...create
count=1,correlation=1.3,lex=Light car...sell
count=1,correlation=0.8,lex=Automobile...sell

It's that easy! These are the results. Do they match your intuition?
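The correlation values in the output are consistent with a simple ratio: the keyword's share of records inside the group, divided by its share of all 15 records. For example, "EV...sell" appears in 1 of 5 Nissan records and in 1 of 15 records overall, giving (1/5) / (1/15) = 3.0. I have not confirmed that SimpleDocumentIndex computes the value this way; the sketch below is only an assumption that happens to reproduce the printed numbers, and the class and method names are hypothetical.

```java
import java.util.Locale;

class CorrelationSketch {

    // Ratio of the keyword's frequency inside a group to its frequency overall:
    // (countInGroup / groupSize) / (countOverall / totalDocs),
    // rearranged so that only one floating-point division is performed.
    static double lift(int countInGroup, int groupSize, int countOverall, int totalDocs) {
        return ((double) countInGroup * totalDocs) / ((double) groupSize * countOverall);
    }

    static void print(String label, double value) {
        // Locale.ROOT keeps the decimal separator a period regardless of system locale.
        System.out.println(String.format(Locale.ROOT, "%s: correlation=%.1f", label, value));
    }

    public static void main(String[] args) {
        int totalDocs = 15; // 4 Toyota + 5 Nissan + 6 Honda records

        // "EV...sell" occurs once, only in the Nissan group (5 records).
        print("Nissan EV...sell", lift(1, 5, 1, totalDocs));

        // "Automobile...sell" occurs once in each of the three groups (3 overall).
        print("Nissan Automobile...sell", lift(1, 5, 3, totalDocs));
        print("Toyota Automobile...sell", lift(1, 4, 3, totalDocs));
        print("Honda Automobile...sell", lift(1, 6, 3, totalDocs));
    }
}
```

Under this reading, a value above 1.0 means the keyword is over-represented in the group, and a value below 1.0 (such as "Automobile...sell" for Honda, 0.83 before rounding) means it is under-represented. That is why keywords unique to one maker rank first.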

Project URL

https://www.nlp4j.org/

NLP4J [004] Try text analysis using natural language processing and parsing statistical processing in Java
