This article is the 7th day of the Cloud Analytics Advent Calendar.
Here, using the Cloudant data prepared on Day 3, I will use a notebook on Data Science Experience to cover basic Spark RDD handling and to run simple aggregations, using Janome, on the Japanese sentences contained in the article titles. * The article became long, so I have split it ...
Please refer to the Day 3 article for the preparation of Node-RED and Cloudant and for the data used this time. The notebook environment uses Data Science Experience; for details on DSX, refer to the Day 1 article.
Create a connection to the Cloudant instance prepared earlier, starting from add data assets as shown in the figure below.
When the right panel opens, select the Connections tab and create a new connection from Create Connection.
Open the Create Connection screen and set the required items.
Settings for each item (1)
Settings for each item (2)
If you press the Create button, a connection to Cloudant's rss database will be created as shown below.
Cloudant is now ready. It's very easy.
Next, create a new notebook for analysis. Select add notebooks from the project screen to add a new notebook.
This overlaps with what was explained on Day 1, but it is straightforward.
For Language, you can also select a Python 2.x option; since we will be handling Japanese this time, I selected 3.5, which is easier to work with.
For Data Science Experience, please refer to Article on Day 1.
The notebook will open automatically when you create it. To perform morphological analysis in Japanese, install Janome, a morphological analyzer available in Python.
You can use pip on Data Science Experience. Enter the following code in the first cell and execute it.
!pip install janome
For the basic usage of Jupyter such as code execution, refer to the article on the first day.
Run it to install Janome.
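If you want to make sure the installation succeeded before moving on, a minimal check is to import the package in a new cell:

# Raises ImportError if the installation failed
import janome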
After installing Janome, let's run a simple test to see if Janome works normally. Enter the following code in a new cell and execute it.
from janome.tokenizer import Tokenizer

t = Tokenizer()
# Classic Japanese tongue twister often used as a Janome sample:
# "sumomo mo momo mo momo no uchi" (plums and peaches are both kinds of peach)
tokens = t.tokenize("すもももももももものうち")
for token in tokens:
    print(token)
Be careful with indentation when copying. If the results of the morphological analysis are output, everything is working.
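For reference, the output for the tongue twister above should look roughly like the following (each line is the surface form followed by part-of-speech details; this is the standard Janome sample, so the exact output may differ slightly by version):

すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ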
When Janome is ready, we'll write the code to get the data from Cloudant. First, fill a cell with the Cloudant credentials. Call up the Cloudant connection you registered earlier: open the find and add data menu in the upper right and find the news_rss connection you just registered under Connections.
If you press insert to code with a new cell selected, the required information will be filled into the cell automatically.
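The generated cell looks roughly like the following (the values below are placeholders, not real credentials; the keys host, username, and password are the ones used in the next step):

# Auto-generated by insert to code (values are placeholders here)
credentials_1 = {
  'host': 'xxxx-bluemix.cloudant.com',
  'username': 'xxxx-bluemix',
  'password': '********'
}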
Execute the inserted cell to make the credentials_1 variable available. Next, create a DataFrame using Spark SQL. Enter the following code in a new cell and execute it.
from pyspark.sql import SQLContext

# Create an SQLContext from the notebook's SparkContext (sc)
sqlContext = SQLContext(sc)

# Load the rss database from Cloudant into a DataFrame
rss = sqlContext.read.format("com.cloudant.spark").\
    option("cloudant.host", credentials_1["host"]).\
    option("cloudant.username", credentials_1["username"]).\
    option("cloudant.password", credentials_1["password"]).\
    load("rss")
In sqlContext, I specified the Cloudant format, set the host name, user, and password from the credentials information above, and loaded the data from the rss database.
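As a quick sanity check that the load succeeded, you can count the loaded documents with the standard DataFrame count() action:

# Number of documents loaded from the rss database
print(rss.count())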
If you get an exception like Another instance of Derby may have already booted the database ..., restart the notebook kernel and re-execute the code from the first cell. Apache Derby seems to be used internally, and there appear to be cases where connection handling does not go well ...
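As a workaround I have not verified in this environment (an assumption, not something from the original steps), reusing a single SQLContext instead of constructing a new one on every re-run may reduce the chance of duplicate Derby bootstraps:

# Reuse the SQLContext already registered for this SparkContext, if any,
# instead of creating a second one (assumption: avoids re-booting Derby)
from pyspark.sql import SQLContext
sqlContext = SQLContext.getOrCreate(sc)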
Run the following code to see the schema of the loaded data.
rss.printSchema()
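The exact fields depend on the RSS documents stored on Day 3, but as an illustration only (the field names below are hypothetical), a Cloudant-backed DataFrame typically exposes the document fields plus the _id and _rev metadata columns:

root
 |-- _id: string (nullable = true)
 |-- _rev: string (nullable = true)
 |-- description: string (nullable = true)
 |-- title: string (nullable = true)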
Finally, let's check the contents of the data. Execute the following code to display the data.
rss.show()
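By default show() truncates long strings, so Japanese titles may be cut off. To inspect specific columns in full, something like the following works (select() and show() are standard DataFrame methods; the column name title is an assumption based on the RSS data):

# Show the first 5 titles without truncating long strings
# (the column name 'title' is an assumption; adjust to your schema)
rss.select('title').show(5, truncate=False)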
This is the end of Part 1. In Part 2, I will use PySpark and Janome to process Japanese text.