A note to load open data in CSV format into Cloudant on Bluemix

This is a memo that solved various problems while trying to load a large amount of data into Cloudant. When developing a system that makes full use of AI technology such as Watson, the more accurate data is stored, the more interesting and valuable AI will be. Therefore, I focused on open data and considered importing it into a database that can handle unstructured data.

What you want to do

I want to easily perform operations such as importing open data in CSV format to my PC, checking and editing in Excel, saving in CSV format, and loading it into Cloudant. Since the application server is Linux, I want to convert the Kanji code from SJIS to UTF-8 and change the line feed code at once. That is the purpose.

スクリーンショット 2017-07-06 8.54.48.png

What is open data?

There are various types of open data, but in this sample code, we will load the postal code data of the post office in reference material (1) into Cloudant.

I will quote the explanation about open data from the website that handles open data.

Open Data should be available in such a way that specific data can be used and reposted as desired by everyone, without any restrictions on control mechanisms such as copyrights and patents. The idea is that there is.

Reference material (2)

From the perspective of contributing to the development of Japan's socio-economy by improving public life and revitalizing corporate activities by widely disclosing public data, the Japanese government has created a data format suitable for machine reading for commercial purposes. We are promoting efforts for "open data" that is disclosed under usage rules that allow secondary use, including.

Reference material (3)

Open data is "data in a data format suitable for machine reading and published under usage rules that allow secondary use" and "data that enables secondary use of data without much manpower". is. In other words, it refers to data that anyone can freely copy, process, and distribute within the permitted rules. Of course, it can also be used for commercial purposes. The purpose is to contribute to the development of socio-economics as a whole by having users (citizens, private companies, etc.) effectively utilize various public data such as "demographics" and "locations of public facilities". , Are undertaking similar attempts around the world.

Reference material (4)

The Japanese government's efforts regarding open data can be found in Reference Material (5), so I think that you can deepen your understanding by referring to it as well.

What is Cloudant

The reasons for choosing and using Cloudant are described below.

IBM Cloudant is an unmanaged NoSQL JSON database service that provides fast, uninterrupted data flow between applications and databases. With Cloudant, you'll be able to develop more value-added apps, grow your business, and reduce management effort.

Reference material (6)

According to this reference material, it has the following features.

In addition, the Cloudant Lite plan used from Bluemix has a big point that it can be used for free with a data amount of less than 1G. Reference material (7)

Cloudant also has Twitter, and you can also collect information at @IBMCludant https://twitter.com/ibmcloudant.

Preparations

You'll need a Bluemix account to actually run this note. Also, after logging in to Bluemix, create a Cloudant service, browse for service credentials, copy vcap-local.json.sample to create vcap-local.json, and set the credentials information. By doing so, you will be able to access Cloudant on Bluemix from your own laptop or other development environment.

Loading program to Cloudant

I edited the following items and put the available Python program on GitHUB https://github.com/takara9/cloudant_loader. Please refer to README.md for the specific editing method.

Explanation of loading program

I will write a description of the program so that the person who read this article can modify it by referring to the program registered on my GitHub.

Specifying inputs and outputs

Specify output_db and input_file before running. input_file can be in DOS format or SJIS as it is. output_db is the cloudant database name.

import sys
import json
import codecs
import uuid
import time
from cloudant.client import Cloudant
from cloudant.query import Query

==================================
# Load destination database name
 output_db ='fukuoka' <-with here

# Input CSV file name
 input_file = '40fukuoka.csv' <-here
output_file = input_file + ".utf-8"
==================================

Once loaded into Cloudant's database, you can see it on the next screen. The part displayed as fukuoka is the result of loading with the above settings.

スクリーンショット 2017-07-06 10.06.33.png

To see this screen, go through the menu as Bluemix Menu-> Services-> Dashboard-> Cloudant-> LAUNCH.

Connect to Cloudant

Load the service credentials in JSON format and connect to the Cloudant database. The format of the JSON file is placed on GitHub with the file name vcap-local.json.sample, so please refer to it and create it.

# Get Cloudant credentials
f = open('./vcap-local.json', 'r')
cred = json.load(f)
f.close()

 print "Connect to Cloudant"
client = Cloudant(cred['services']['cloudantNoSQLDB'][0]['credentials']['username'], 
                  cred['services']['cloudantNoSQLDB'][0]['credentials']['password'], 
                  url=cred['services']['cloudantNoSQLDB'][0]['credentials']['url'])
client.connect()

You can refer to the service credentials as follows. On the Bluemix Cloudant service screen, click View Credentaial-1 credentials and the JSON format data will be displayed, so copy and paste it to use it.

スクリーンショット 2017-07-06 10.17.42.png

Database deletion and creation

In the case of loading by diff, it works as an addition by not deleting the existing database. On the other hand, in the case where it is necessary to reload the data several times until it becomes usable data, it will be deleted and created repeatedly, so it is written as follows.

# Delete if DB exists
 print "Create existing database", output_db, ""
try:
    db = client[output_db]
    if db.exists():
        client.delete_database(output_db)
except:
    pass


# Create a new DB
 print "Create new database", output_db, ""
try: 
    db = client.create_database(output_db)
 print "successful database creation"
except:
 print "database creation failure"
sys.exit()

Conversion from DOS format to UNIX format

In this part, the Kanji code conversion from SJIS to UTF-8 and the line feed code conversion are performed to generate an intermediate file. Here, the reason for not using iconv is that if the open data contains code outside the range of SJIS, iconv may interrupt the process without error, so the data load fails. But you may not notice it.

In this code, try --except is used to supplement the conversion error and display it as the line number of the CSV data, so you can write down the part where the conversion failed and proceed.

# Output preprocessing results such as character code conversion to an intermediate file
fin = codecs.open(input_file, "r", "shift_jis")
fout = codecs.open(output_file, "w", "utf-8")
line_no = 0
while True:
    try:
        line = fin.readline()
        if line == "":
            break
        line = line.rstrip('\r\n') + "\n"
        line = line.replace('"','')
        fout.write(line)
        line_no = line_no + 1
    except Exception as e:
        print "Line = ", line_no, " Error message: ", e
        pass

Create JSON format document data and load it into Cloudant

In the above-mentioned kanji code conversion process, we were able to eliminate the kanji code that would be interrupted due to an error, so this time we will focus on speed and use "reader.readlines ()" to fetch all lines into memory at once. .. Then, create a line-by-line JSON document and register it with Cloudant.

# Read the CSV data from the Excel sheet and register it in the DB
reader = codecs.open(output_file, 'r', 'utf-8')
lines = reader.readlines()
print lines[0]

wait_cnt = 0
line_no = 0
for line in lines:

If there is an item name line in the first line of the CSV file, it is skipped. If you don't have an item line, it's a good idea to comment out the three lines starting with if line_no. Then, you can grasp the progress by displaying the debug display. Next are the four lines starting with wait_cnt, which the free Cloudant Lite plan has a limited write speed and will return an error if more than 10 writes occur per second. In order to keep the speed limit and avoid the occurrence of an error, we put WAIT for 1 second after writing 9 items.

 #Skip header line
    if line_no == 0:
        line_no = line_no + 1
        continue

 #Debug display
    print line_no,line.rstrip('\n')

 # Expand to array
    al = line.split(',')
    line_no = line_no + 1

 Supports write speed limit of #Cloudant Lite plan
    wait_cnt = wait_cnt + 1
    if wait_cnt > 9:
        wait_cnt = 0
        time.sleep(1)

Generate json_doc as json format document data and write it with db.create_document. Here, the JSON item name must be named according to the CSV data item to be imported. How to write json_doc needs to be customized according to the target CSV. The last two lines of this code are comments, but this is a write confirmation process. I'm commenting because it will be slow if this process is included.

 # DOC creation and writing
    id = str(uuid.uuid4())
    json_doc = {
        #"_id": id,
        "_id": al[0],
        "jusho_CD": al[0],
        "todofuken_CD": al[1],
 <Omitted>
        "jigyosyo_mei_kana": al[19],
        "jigyosyo_jyusyo": al[20]
        #"sin_jyusyo_kana": al[21]
     }

 # Write
    cdoc = db.create_document(json_doc)
    #if cdoc.exists():
    #    print "SUCCESS!!"

In the case of Cloudant, if there is no _id item, uuid is generated internally and registered. However, if you retrieve it with a unique key, you can retrieve it at high speed by setting the key in _id. Here, the address code is a unique code, so it is set to _id.

    json_doc = {
        #"_id": id,
        "_id": al[0],

Summary

When developing systems using RDBs, loading data was the first stumbling block. Even if you use Cloudant, it doesn't seem that the data loading is difficult.

Reference material

(1) Post office postal code data download http://www.post.japanpost.jp/zipcode/dl/kogaki-zip.html (2) Wikipedia Open Data https://ja.wikipedia.org/wiki/%E3%82%AA%E3%83%BC%E3%83%97%E3%83%B3%E3%83%87%E3 % 83% BC% E3% 82% BF (3) OPEN DATA JAPAN Data Catalog Site http://www.data.go.jp/ (4) Open-governmentdata https://www.open-governmentdata.org/ (5) Ministry of Internal Affairs and Communications Promotion of Open Data Strategy What is open data? Http://www.soumu.go.jp/menu_seisaku/ictseisaku/ictriyou/opendata/opendata01.html (6) IBM Cloudant (Japanese site) https://www.ibm.com/analytics/jp/ja/technology/cloud-data-services/cloudant/ (7) IBM Bluemix Catalog Cloudant https://console.bluemix.net/catalog/services/cloudant-nosql-db?env_id=ibm:yp:us-south

Recommended Posts

A note to load open data in CSV format into Cloudant on Bluemix
A note on how to load a virtual environment in PyCharm
How to combine all CSVs in a folder into one CSV
Note: [Python3] Convert datetime to a string in any format you like
Books on data science to read in 2020
A small story that outputs table data in CSV format at high speed
Open a ZIP created on Windows in Linux
A note on optimizing blackbox functions in Python
Data analysis in Python: A note about line_profiler
A memorandum on how to use keras.preprocessing.image in Keras
Things to note when initializing a list in Python
Until you insert data into a spreadsheet in Python
A note on handling variables in Python recursive functions
2 ways to read all csv files in a folder
A note on the default behavior of collate_fn in PyTorch
How to output a document in pdf format with Sphinx
Ingenuity to handle data with Pandas in a memory-saving manner
A note on touching Microsoft's face recognition API in Python
Simultaneously input specific data to a specific sheet in many excels
A note when I can't open Jupyter Notebook on Windows
A python script that converts Oracle Database data to csv
How to return the data contained in django model in json format and map it on leaflet