This article is the 7th day of the Cloud Analytics Advent Calendar.
Here, using the Cloudant data prepared on Day 3, I will use a notebook on Data Science Experience to cover basic Spark RDD handling and to run simple aggregations, using Janome, on the Japanese sentences contained in the article titles. * The article became long, so I have split it ...
Please refer to the Day 3 article for the preparation of Node-RED and Cloudant and for the data used this time. The notebook environment uses Data Science Experience; for details on DSX, refer to the Day 1 article.
Create a connection to the Cloudant instance prepared earlier, starting from add data assets as shown in the figure below.
When the right panel opens, select the Connections tab and create a new connection from Create Connection.
Open the Create Connection screen and set the required items.
Settings for each item (1)
Settings for each item (2)
If you press the Create button, a connection to Cloudant's rss database will be created as shown below.
Cloudant is now ready. It's very easy.
Next, create a new notebook for analysis. Select add notebooks from the project screen to add a new notebook.
This overlaps with what was explained on Day 1, but it is straightforward.
For Language, you can also select a Python 2.x option; since we will be handling Japanese this time, I selected 3.5, which is easier to work with.
For Data Science Experience, please refer to Article on Day 1.
The notebook will open automatically when you create it. To perform morphological analysis in Japanese, install Janome, a morphological analyzer available in Python.
You can use pip on Data Science Experience. Enter the following code in the first cell and execute it.
!pip install janome
For the basic usage of Jupyter such as code execution, refer to the article on the first day.
Run it to install Janome.
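If you want to make sure the installation succeeded before moving on, a minimal check is to import the package in a new cell:

# Raises ImportError if the installation failed
import janome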
After installing Janome, let's run a simple test to see if Janome works normally. Enter the following code in a new cell and execute it.
from janome.tokenizer import Tokenizer

t = Tokenizer()
# Classic Japanese tongue twister often used as a Janome sample:
# "sumomo mo momo mo momo no uchi" (plums and peaches are both kinds of peach)
tokens = t.tokenize("すもももももももものうち")
for token in tokens:
    print(token)
Be careful with indentation when copying. If the results of the morphological analysis are output, everything is working.
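For reference, the output for the tongue twister above should look roughly like the following (each line is the surface form followed by part-of-speech details; this is the standard Janome sample, so the exact output may differ slightly by version):

すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ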
When Janome is ready, we'll write the code to get the data from Cloudant. First, fill a cell with the Cloudant credentials. Call up the Cloudant connection you registered earlier: open the find and add data menu in the upper right and find the news_rss connection you just registered under Connections.
If you press insert to code with a new cell selected, the required information will be filled into the cell automatically.
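The generated cell looks roughly like the following (the values below are placeholders, not real credentials; the keys host, username, and password are the ones used in the next step):

# Auto-generated by insert to code (values are placeholders here)
credentials_1 = {
  'host': 'xxxx-bluemix.cloudant.com',
  'username': 'xxxx-bluemix',
  'password': '********'
}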
Execute the inserted cell to make the credentials_1 variable available. Next, create a DataFrame using Spark SQL. Enter the following code in a new cell and execute it.
from pyspark.sql import SQLContext

# Create an SQLContext from the notebook's SparkContext (sc)
sqlContext = SQLContext(sc)

# Load the rss database from Cloudant into a DataFrame
rss = sqlContext.read.format("com.cloudant.spark").\
    option("cloudant.host", credentials_1["host"]).\
    option("cloudant.username", credentials_1["username"]).\
    option("cloudant.password", credentials_1["password"]).\
    load("rss")
In sqlContext, I specified the Cloudant format, set the host name, user, and password from the credentials information above, and loaded the data from the rss database.
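As a quick sanity check that the load succeeded, you can count the loaded documents with the standard DataFrame count() action:

# Number of documents loaded from the rss database
print(rss.count())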
If you get an exception like Another instance of Derby may have already booted the database ..., restart the notebook kernel and re-execute the code from the first cell. Apache Derby seems to be used internally, and there appear to be cases where connection handling does not go well ...
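As a workaround I have not verified in this environment (an assumption, not something from the original steps), reusing a single SQLContext instead of constructing a new one on every re-run may reduce the chance of duplicate Derby bootstraps:

# Reuse the SQLContext already registered for this SparkContext, if any,
# instead of creating a second one (assumption: avoids re-booting Derby)
from pyspark.sql import SQLContext
sqlContext = SQLContext.getOrCreate(sc)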
Run the following code to see the schema of the loaded data.
rss.printSchema()
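The exact fields depend on the RSS documents stored on Day 3, but as an illustration only (the field names below are hypothetical), a Cloudant-backed DataFrame typically exposes the document fields plus the _id and _rev metadata columns:

root
 |-- _id: string (nullable = true)
 |-- _rev: string (nullable = true)
 |-- description: string (nullable = true)
 |-- title: string (nullable = true)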
Finally, let's check the contents of the data. Execute the following code to display the data.
rss.show()
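By default show() truncates long strings, so Japanese titles may be cut off. To inspect specific columns in full, something like the following works (select() and show() are standard DataFrame methods; the column name title is an assumption based on the RSS data):

# Show the first 5 titles without truncating long strings
# (the column name 'title' is an assumption; adjust to your schema)
rss.select('title').show(5, truncate=False)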
This is the end of Part 1. In Part 2, I will use PySpark and Janome to process Japanese text.