Return to Index: [004] Statistical processing of parsing > [005-1] NLP4J + Twitter4J (data collection) > Next page
Next, let's analyze Twitter data with NLP4J.
At this stage, NLP4J is positioned as a tool for "easy analysis", so let's keep the analysis simple.
As an analysis scenario, let's look at what the official Twitter accounts of three automobile companies (Toyota, Nissan, Honda) are tweeting about:
https://twitter.com/TOYOTA_PR
https://twitter.com/NissanJP
https://twitter.com/HondaJP
By adjusting the query appropriately, you should be able to analyze the data from various other perspectives as well.
Let's start by collecting the data.
For Twitter data collection, I will use "Twitter4J", which is well known as a Java wrapper for the Twitter API.
Twitter4J http://twitter4j.org/ja/index.html
Let's add NLP4J and Twitter4J to the Maven pom.xml.
Maven POM
<dependency>
    <groupId>org.nlp4j</groupId>
    <artifactId>nlp4j</artifactId>
    <version>1.0.0.0</version>
</dependency>
<dependency>
    <groupId>org.twitter4j</groupId>
    <artifactId>twitter4j-core</artifactId>
    <version>[4.0,)</version>
</dependency>
Access the following page to register your application: https://apps.twitter.com/
After registering the application, you can obtain the following values on apps.twitter.com, so note them down:
Application Settings: note the Consumer Key (API Key) -> (1), and the Consumer Secret (API Secret) -> (2)
Your Access Token: note the Access Token -> (3), and the Access Token Secret -> (4)
Prepare the following property file and place it on the classpath (Twitter4J reads its settings from a file named twitter4j.properties):
debug=false
http.prettyDebug=false
oauth.consumerKey=(value noted as (1))
oauth.consumerSecret=(value noted as (2))
oauth.accessToken=(value noted as (3))
oauth.accessTokenSecret=(value noted as (4))
jsonStoreEnabled=true
It is a good idea to place the file at the root of the classpath, like this:
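For a standard Maven project, that placement corresponds to the layout below (directory names are the Maven defaults; adjust them to your own project):

```
src/main/resources/twitter4j.properties   <- copied to the classpath root at build time
```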
Code
I will feed the collected tweets directly into NLP4J. (Of course, you could also save them to a file such as CSV or JSON first.)
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

import nlp4j.Document;
import nlp4j.DocumentAnnotator;
import nlp4j.Index;
import nlp4j.Keyword;
import nlp4j.impl.DefaultDocument;
import nlp4j.index.SimpleDocumentIndex;
import nlp4j.yhoo_jp.YjpAllAnnotator;
import twitter4j.Query;
import twitter4j.QueryResult;
import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterFactory;

public class HelloTextMiningTwitter {

    public static void main(String[] args) throws Exception {
        String[] accounts = { "NissanJP", "TOYOTA_PR", "HondaJP" };
        List<Document> docs = new ArrayList<Document>();
        for (String account : accounts) {
            docs.addAll(createDocumentTwitter(account));
        }
        // Morphological analysis annotator + parsing annotator
        DocumentAnnotator annotator = new YjpAllAnnotator(); // morphological analysis + parsing
        {
            System.err.println("Morphological analysis + parsing");
            annotator.annotate(docs);
        }
        // Prepare the keyword index (statistical processing)
        Index index = new SimpleDocumentIndex();
        {
            System.err.println("Indexing");
            // keyword indexing
            index.addDocuments(docs);
        }
        {
            // Retrieve frequent keywords
            System.out.println("Noun frequency order");
            List<Keyword> kwds = index.getKeywords();
            kwds = kwds.stream() //
                    .filter(o -> o.getCount() > 1) // count of 2 or more
                    .filter(o -> o.getFacet().equals("noun")) // part of speech is noun
                    .collect(Collectors.toList());
            for (Keyword kwd : kwds) {
                System.out.println(
                        String.format("count=%d,facet=%s,lex=%s", kwd.getCount(), kwd.getFacet(), kwd.getLex()));
            }
        }
        for (String account : accounts) {
            {
                // Retrieve keywords with high co-occurrence
                List<Keyword> kwds = index.getKeywords("noun", "item=" + account);
                System.out.println("Noun for " + account);
                for (Keyword kwd : kwds) {
                    System.out.println(String.format("count=%d,correlation=%.1f,lex=%s", kwd.getCount(),
                            kwd.getCorrelation(), kwd.getLex()));
                }
            }
            {
                // Retrieve keywords with high co-occurrence
                List<Keyword> kwds = index.getKeywords("noun...verb", "item=" + account);
                System.out.println("Noun...Verb for " + account);
                for (Keyword kwd : kwds) {
                    System.out.println(String.format("count=%d,correlation=%.1f,lex=%s", kwd.getCount(),
                            kwd.getCorrelation(), kwd.getLex()));
                }
            }
        }
    }

    static List<Document> createDocumentTwitter(String item) {
        ArrayList<Document> docs = new ArrayList<Document>();
        try {
            Twitter twitter = TwitterFactory.getSingleton();
            Query query = new Query("from:" + item);
            query.setCount(10);
            QueryResult result = twitter.search(query);
            for (Status status : result.getTweets()) {
                // System.out.println("@" + status.getUser().getScreenName() + ":" +
                // status.getText());
                Document doc = new DefaultDocument();
                doc.putAttribute("item", item);
                doc.setText(status.getText());
                docs.add(doc);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return docs;
    }
}
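The noun-frequency step in the code above is essentially a frequency cut-off over keywords (the `getCount() > 1` filter). As a standalone illustration of that idea with plain Java streams, with no NLP4J dependency (the `KeywordFrequency` class and `frequentTokens` method are my own names, not part of NLP4J):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class KeywordFrequency {

    // Count token frequencies, keep only tokens that occur more than once,
    // and sort by descending count (mirrors the getCount() > 1 filter above).
    static Map<String, Long> frequentTokens(String text) {
        return Arrays.stream(text.toLowerCase().split("\\W+"))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.groupingBy(t -> t, Collectors.counting()))
                .entrySet().stream()
                .filter(e -> e.getValue() > 1)
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                        (a, b) -> a, LinkedHashMap::new));
    }

    public static void main(String[] args) {
        System.out.println(frequentTokens("new car new car new model")); // prints {new=3, car=2}
    }
}
```

NLP4J does the same kind of counting, but over lemmatized keywords with part-of-speech facets rather than raw whitespace tokens.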
It's easy! On the next page, we will look at the analysis results.
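As noted above, instead of feeding the tweets straight into NLP4J you could save them to CSV first. A minimal sketch using only the standard library (the `TweetCsvWriter` class and its methods are hypothetical helpers, not part of NLP4J or Twitter4J):

```java
import java.util.List;

public class TweetCsvWriter {

    // Quote one CSV field per RFC 4180: wrap it in double quotes
    // and double any embedded double quotes.
    static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Render (item, text) rows as CSV text with a header line;
    // the result can be written out with java.nio.file.Files.writeString.
    static String toCsv(List<String[]> rows) {
        StringBuilder sb = new StringBuilder("item,text\n");
        for (String[] row : rows) {
            sb.append(quote(row[0])).append(',').append(quote(row[1])).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String[]> rows = List.of(
                new String[] { "TOYOTA_PR", "Tweet with a \"quote\" and, a comma" });
        System.out.print(toCsv(rows));
    }
}
```

Quoting every field keeps commas, quotes, and line breaks inside tweet text from breaking the CSV structure.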
https://www.nlp4j.org/