Return to Index: [004] Statistical processing of parsing > [005-1] NLP4J + Twitter4J (data collection) > Next page
Next, let's analyze Twitter data with NLP4J.
At this stage, NLP4J is positioned as a tool for "easy analysis", so let's keep the analysis simple.
As an analysis scenario, let's look at what the official Twitter accounts of three automobile companies (Toyota, Nissan, Honda) are tweeting about:
https://twitter.com/TOYOTA_PR
https://twitter.com/NissanJP
https://twitter.com/HondaJP
By adjusting the query appropriately, you should be able to analyze the data from various other perspectives as well.
Let's start by collecting the data.
For Twitter data collection, I will use "Twitter4J", which is well known as a Java wrapper for the Twitter API.
Twitter4J http://twitter4j.org/ja/index.html
Let's add NLP4J and Twitter4J to the Maven pom.xml.
Maven POM
<dependency>
    <groupId>org.nlp4j</groupId>
    <artifactId>nlp4j</artifactId>
    <version>1.0.0.0</version>
</dependency>
<dependency>
    <groupId>org.twitter4j</groupId>
    <artifactId>twitter4j-core</artifactId>
    <version>[4.0,)</version>
</dependency>
Access the following page to register your application: https://apps.twitter.com/
After registering the application, you can obtain the following values on apps.twitter.com, so note them down:
Application Settings: note the Consumer Key (API Key) -> (1), and the Consumer Secret (API Secret) -> (2)
Your Access Token: note the Access Token -> (3), and the Access Token Secret -> (4)
Prepare the following property file and place it on the classpath (Twitter4J reads its settings from a file named twitter4j.properties):
debug=false
http.prettyDebug=false
oauth.consumerKey=(value noted as (1))
oauth.consumerSecret=(value noted as (2))
oauth.accessToken=(value noted as (3))
oauth.accessTokenSecret=(value noted as (4))
jsonStoreEnabled=true
It is a good idea to place the file at the root of the classpath, like this:
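For a standard Maven project, that placement corresponds to the layout below (directory names are the Maven defaults; adjust them to your own project):

```
src/main/resources/twitter4j.properties   <- copied to the classpath root at build time
```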
Code
I will feed the collected tweets directly into NLP4J. (Of course, you could also save them to a file such as CSV or JSON first.)
import java.util.ArrayList;
import java.util.List;
import java.util.stream.Collectors;

import nlp4j.Document;
import nlp4j.DocumentAnnotator;
import nlp4j.Index;
import nlp4j.Keyword;
import nlp4j.impl.DefaultDocument;
import nlp4j.index.SimpleDocumentIndex;
import nlp4j.yhoo_jp.YjpAllAnnotator;
import twitter4j.Query;
import twitter4j.QueryResult;
import twitter4j.Status;
import twitter4j.Twitter;
import twitter4j.TwitterFactory;

public class HelloTextMiningTwitter {

    public static void main(String[] args) throws Exception {
        String[] accounts = { "NissanJP", "TOYOTA_PR", "HondaJP" };
        List<Document> docs = new ArrayList<Document>();
        for (String account : accounts) {
            docs.addAll(createDocumentTwitter(account));
        }
        // Morphological analysis annotator + parsing annotator
        DocumentAnnotator annotator = new YjpAllAnnotator(); // morphological analysis + parsing
        {
            System.err.println("Morphological analysis + parsing");
            annotator.annotate(docs);
        }
        // Prepare the keyword index (statistical processing)
        Index index = new SimpleDocumentIndex();
        {
            System.err.println("Indexing");
            // keyword indexing
            index.addDocuments(docs);
        }
        {
            // Retrieve frequent keywords
            System.out.println("Noun frequency order");
            List<Keyword> kwds = index.getKeywords();
            kwds = kwds.stream() //
                    .filter(o -> o.getCount() > 1) // count of 2 or more
                    .filter(o -> o.getFacet().equals("noun")) // part of speech is noun
                    .collect(Collectors.toList());
            for (Keyword kwd : kwds) {
                System.out.println(
                        String.format("count=%d,facet=%s,lex=%s", kwd.getCount(), kwd.getFacet(), kwd.getLex()));
            }
        }
        for (String account : accounts) {
            {
                // Retrieve keywords with high co-occurrence
                List<Keyword> kwds = index.getKeywords("noun", "item=" + account);
                System.out.println("Noun for " + account);
                for (Keyword kwd : kwds) {
                    System.out.println(String.format("count=%d,correlation=%.1f,lex=%s", kwd.getCount(),
                            kwd.getCorrelation(), kwd.getLex()));
                }
            }
            {
                // Retrieve keywords with high co-occurrence
                List<Keyword> kwds = index.getKeywords("noun...verb", "item=" + account);
                System.out.println("Noun...Verb for " + account);
                for (Keyword kwd : kwds) {
                    System.out.println(String.format("count=%d,correlation=%.1f,lex=%s", kwd.getCount(),
                            kwd.getCorrelation(), kwd.getLex()));
                }
            }
        }
    }

    static List<Document> createDocumentTwitter(String item) {
        ArrayList<Document> docs = new ArrayList<Document>();
        try {
            Twitter twitter = TwitterFactory.getSingleton();
            Query query = new Query("from:" + item);
            query.setCount(10);
            QueryResult result = twitter.search(query);
            for (Status status : result.getTweets()) {
                // System.out.println("@" + status.getUser().getScreenName() + ":" +
                // status.getText());
                Document doc = new DefaultDocument();
                doc.putAttribute("item", item);
                doc.setText(status.getText());
                docs.add(doc);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return docs;
    }
}
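The noun-frequency step in the code above is essentially a frequency cut-off over keywords (the `getCount() > 1` filter). As a standalone illustration of that idea with plain Java streams, with no NLP4J dependency (the `KeywordFrequency` class and `frequentTokens` method are my own names, not part of NLP4J):

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class KeywordFrequency {

    // Count token frequencies, keep only tokens that occur more than once,
    // and sort by descending count (mirrors the getCount() > 1 filter above).
    static Map<String, Long> frequentTokens(String text) {
        return Arrays.stream(text.toLowerCase().split("\\W+"))
                .filter(t -> !t.isEmpty())
                .collect(Collectors.groupingBy(t -> t, Collectors.counting()))
                .entrySet().stream()
                .filter(e -> e.getValue() > 1)
                .sorted(Map.Entry.<String, Long>comparingByValue().reversed())
                .collect(Collectors.toMap(Map.Entry::getKey, Map.Entry::getValue,
                        (a, b) -> a, LinkedHashMap::new));
    }

    public static void main(String[] args) {
        System.out.println(frequentTokens("new car new car new model")); // prints {new=3, car=2}
    }
}
```

NLP4J does the same kind of counting, but over lemmatized keywords with part-of-speech facets rather than raw whitespace tokens.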
It's easy! On the next page, we will look at the analysis results.
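As noted above, instead of feeding the tweets straight into NLP4J you could save them to CSV first. A minimal sketch using only the standard library (the `TweetCsvWriter` class and its methods are hypothetical helpers, not part of NLP4J or Twitter4J):

```java
import java.util.List;

public class TweetCsvWriter {

    // Quote one CSV field per RFC 4180: wrap it in double quotes
    // and double any embedded double quotes.
    static String quote(String field) {
        return "\"" + field.replace("\"", "\"\"") + "\"";
    }

    // Render (item, text) rows as CSV text with a header line;
    // the result can be written out with java.nio.file.Files.writeString.
    static String toCsv(List<String[]> rows) {
        StringBuilder sb = new StringBuilder("item,text\n");
        for (String[] row : rows) {
            sb.append(quote(row[0])).append(',').append(quote(row[1])).append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<String[]> rows = List.of(
                new String[] { "TOYOTA_PR", "Tweet with a \"quote\" and, a comma" });
        System.out.print(toCsv(rows));
    }
}
```

Quoting every field keeps commas, quotes, and line breaks inside tweet text from breaking the CSV structure.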
https://www.nlp4j.org/