Upload open data using the CKAN API in Python and link it automatically with GitHub Actions

Contents
- Prerequisites
- Data upload mechanism
- Automation of data linkage
- Summary

Overview

Recently, I implemented in Python an automatic linkage between openAFRICA, an African open data portal operated by an organization called Code for Africa, and the water supply vector tile data that I maintain jointly with WASAC, Rwanda's water and sanitation corporation.

The portal is built on CKAN, which also seems to be widely used by the open data sites of Japanese local governments, so the same approach should work whenever you want to publish files owned by your organization as open data through the API. That is why I am sharing it here.

Prerequisites

- You have your own account on an open data platform that uses the CKAN API
- You manage your open data on GitHub

In this article, whenever the open data on GitHub is updated, GitHub Actions automatically syncs it to the platform through the CKAN API.

Incidentally, the openAFRICA page for the open data of the water supply vector tiles of Rwanda's water corporation is at the following link. https://open.africa/dataset/rw-water-vectortiles

The GitHub repository for the water vector tiles is at the link below; it is updated automatically from the water corporation's server every week. https://github.com/WASAC/vt

Data upload mechanism

Repository download and installation

If pipenv is not installed, install it first.

git clone https://github.com/watergis/open-africa-uploader
cd open-africa-uploader
pipenv install
pipenv shell

File upload mechanism using CKAN API

First, here is the full source code of OpenAfricaUploader.py from the repository.

import os
import ckanapi
import requests


class OpanAfricaUploader(object):
  def __init__(self, api_key):
    """Constructor

    Args:
        api_key (string): CKAN api key
    """
    self.data_portal = 'https://africaopendata.org'
    self.APIKEY = api_key
    self.ckan = ckanapi.RemoteCKAN(self.data_portal, apikey=self.APIKEY)

  def create_package(self, url, title):
    """create new package if it does not exist yet.

    Args:
        url (str): the url of package eg. https://open.africa/dataset/{package url}
        title (str): the title of package
    """
    package_name = url
    package_title = title
    try:
        print ('Creating "{package_title}" package'.format(**locals()))
        self.package = self.ckan.action.package_create(name=package_name,
                                            title=package_title,
                                            owner_org = 'water-and-sanitation-corporation-ltd-wasac')
    except (ckanapi.ValidationError) as e:
        if (e.error_dict['__type'] == 'Validation Error' and
          e.error_dict['name'] == ['That URL is already in use.']):
            print ('"{package_title}" package already exists'.format(**locals()))
            self.package = self.ckan.action.package_show(id=package_name)
        else:
            raise

  def resource_create(self, data, path, api="/api/action/resource_create"):
    """create new resource, or update existing resource

    Args:
        data (object): data for creating resource. data must contain package_id, name, format, description. If you overwrite existing resource, id also must be included.
        path (str): file path for uploading
        api (str, optional): API url for creating or updating. Defaults to "/api/action/resource_create". If you want to update, please specify url for "/api/action/resource_update"
    """
    self.api_url = self.data_portal + api
    print ('Creating "{}"'.format(data['name']))
    r = requests.post(self.api_url,
                      data=data,
                      headers={'Authorization': self.APIKEY},
                      files=[('upload', open(path, 'rb'))])

    if r.status_code != 200:
        print ('Error while creating resource: {0}'.format(r.content))
    else:
      print ('Uploaded "{}" successfully'.format(data['name']))

  def resource_update(self, data, path):
    """update existing resource

    Args:
        data (object): data for creating resource. data must contain id, package_id, name, format, description.
        path (str): file path for uploading
    """
    self.resource_create(data, path, "/api/action/resource_update")

  def upload_datasets(self, path, description):
    """upload datasets under the package

    Args:
        path (str): file path for uploading
        description (str): description for the dataset
    """
    filename = os.path.basename(path)
    extension = os.path.splitext(filename)[1][1:].lower()
    
    data = {
      'package_id': self.package['id'],
      'name': filename,
      'format': extension,
      'description': description
    }

    resources = self.package['resources']
    if len(resources) > 0:
      target_resource = None
      for resource in reversed(resources):
        if filename == resource['name']:
          target_resource = resource
          break

      if target_resource == None:
        self.resource_create(data, path)
      else:
        print ('Resource "{}" already exists, it will be overwritten'.format(target_resource['name']))
        data['id'] = target_resource['id']
        self.resource_update(data, path)
    else:
      self.resource_create(data, path)

The code that calls OpenAfricaUploader.py to upload a file looks like this.

import os
from OpenAfricaUploader import OpanAfricaUploader

uploader = OpanAfricaUploader(args.key)
uploader.create_package('rw-water-vectortiles','Vector Tiles for rural water supply systems in Rwanda')
uploader.upload_datasets(os.path.abspath('../data/rwss.mbtiles'), 'mbtiles format of Mapbox Vector Tiles which was created by tippecanoe.')

I will explain each part in turn.

Constructor

Because this module was written for uploading to openAFRICA, the base URL of the portal site is hard-coded in the constructor.

Replace the URL in self.data_portal = 'https://africaopendata.org' with the URL of the CKAN portal used by your organization.

  def __init__(self, api_key):
    """Constructor

    Args:
        api_key (string): CKAN api key
    """
    self.data_portal = 'https://africaopendata.org'
    self.APIKEY = api_key
    self.ckan = ckanapi.RemoteCKAN(self.data_portal, apikey=self.APIKEY)

The constructor is called as follows. Pass the CKAN API key for your account in args.key.

uploader = OpanAfricaUploader(args.key)
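
One way to avoid editing the source for each portal is to pass the portal URL in as a parameter. The following is only a minimal sketch and is not part of the repository: the PortalConfig class and its action_url method are hypothetical names, and only the default URL and the /api/action/ path layout are taken from the code above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PortalConfig:
    """Hypothetical holder for the portal URL and API key (not in the repository)."""
    api_key: str
    data_portal: str = "https://africaopendata.org"  # default used in the article

    def action_url(self, action: str) -> str:
        # Strip a trailing slash so the joined URL never contains "//api".
        return self.data_portal.rstrip("/") + "/api/action/" + action


# "https://ckan.example.org/" is a placeholder, not a real portal.
config = PortalConfig(api_key="dummy-key", data_portal="https://ckan.example.org/")
print(config.action_url("resource_create"))
# https://ckan.example.org/api/action/resource_create
```

With a holder like this, the uploader class could accept the configuration object instead of a bare key, which keeps the portal choice out of the source file.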

Creating a package

Create a package with the package_create API, passing the following arguments.

- name = The string specified here becomes the URL of the package
- title = Package title
- owner_org = ID of the target organization on the CKAN portal

If creation succeeds, the package information is returned. If the package already exists, an error is raised, so the exception handler fetches the existing package information instead.

  def create_package(self, url, title):
    """create new package if it does not exist yet.

    Args:
        url (str): the url of package eg. https://open.africa/dataset/{package url}
        title (str): the title of package
    """
    package_name = url
    package_title = title
    try:
        print ('Creating "{package_title}" package'.format(**locals()))
        self.package = self.ckan.action.package_create(name=package_name,
                                            title=package_title,
                                            owner_org = 'water-and-sanitation-corporation-ltd-wasac')
    except (ckanapi.ValidationError) as e:
        if (e.error_dict['__type'] == 'Validation Error' and
          e.error_dict['name'] == ['That URL is already in use.']):
            print ('"{package_title}" package already exists'.format(**locals()))
            self.package = self.ckan.action.package_show(id=package_name)
        else:
            raise

This function is called as follows.

uploader.create_package('rw-water-vectortiles','Vector Tiles for rural water supply systems in Rwanda')
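
The exception handler above indexes e.error_dict['name'] directly, which would presumably raise a KeyError if the validation error concerned a different field. As a hedged sketch, the check could be moved into a small helper; is_url_in_use is a hypothetical name that does not exist in the repository, and the message string is the one matched in create_package.

```python
def is_url_in_use(error_dict: dict) -> bool:
    """Return True when a CKAN validation error says the package name is taken.

    Hypothetical helper, not part of the repository.
    """
    # .get() returns None instead of raising KeyError when the validation
    # error is about another field, such as owner_org.
    messages = error_dict.get("name") or []
    return "That URL is already in use." in messages


# The second error message is made up purely for illustration.
print(is_url_in_use({"__type": "Validation Error",
                     "name": ["That URL is already in use."]}))   # True
print(is_url_in_use({"__type": "Validation Error",
                     "owner_org": ["illustrative message"]}))     # False
```

Inside the except block, `if is_url_in_use(e.error_dict):` would then replace the two-line condition, and any other validation error would still be re-raised.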

Creating and updating resources

Resources are created with a function called resource_create. It posts the binary file and its metadata to the REST endpoint /api/action/resource_create.

def resource_create(self, data, path, api="/api/action/resource_create"):
    self.api_url = self.data_portal + api
    print ('Creating "{}"'.format(data['name']))
    r = requests.post(self.api_url,
                      data=data,
                      headers={'Authorization': self.APIKEY},
                      files=[('upload', open(path, 'rb'))])

    if r.status_code != 200:
        print ('Error while creating resource: {0}'.format(r.content))
    else:
      print ('Uploaded "{}" successfully'.format(data['name']))
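
The function above only looks at the HTTP status code. CKAN's action API also returns a JSON body with a "success" flag and either a "result" or an "error" object, so the outcome can be reported in more detail. The sketch below assumes that body shape; summarize_action_response is a hypothetical helper, and the sample bodies are invented for illustration.

```python
import json


def summarize_action_response(status_code: int, body: str) -> str:
    """Hypothetical helper: turn a CKAN action API response into a short message."""
    try:
        payload = json.loads(body)
    except ValueError:
        # Proxies or error pages may return a non-JSON body.
        return "HTTP {}: non-JSON response".format(status_code)
    if not isinstance(payload, dict):
        return "HTTP {}: unexpected JSON body".format(status_code)
    if status_code == 200 and payload.get("success"):
        result = payload.get("result") or {}
        return "ok: resource id {}".format(result.get("id"))
    return "failed (HTTP {}): {}".format(status_code, payload.get("error"))


print(summarize_action_response(200, '{"success": true, "result": {"id": "abc123"}}'))
print(summarize_action_response(409, '{"success": false, "error": {"name": ["Missing value"]}}'))
```

In resource_create, this could be called as summarize_action_response(r.status_code, r.text) in place of the two print branches.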

However, resource_create alone can only add resources, so their number would grow with every update. To avoid that, existing resources are updated through the /api/action/resource_update endpoint.

resource_update is used in basically the same way as resource_create; the only difference is that data must also include the id of the resource to update.

def resource_update(self, data, path):
    self.resource_create(data, path, "/api/action/resource_update")

The upload_datasets function combines resource_create and resource_update: it updates the resource if it already exists and creates a new one if it doesn't.

def upload_datasets(self, path, description):
    # Get the file name and its extension
    filename = os.path.basename(path)
    extension = os.path.splitext(filename)[1][1:].lower()
    
    #Create data for resource creation
    data = {
      'package_id': self.package['id'], #Package ID
      'name': filename,                 #File name to be updated
      'format': extension,              #Format (here, extension)
      'description': description        #File description
    }

    # If the package already has resources, look for one with the same name as the file to be uploaded.
    resources = self.package['resources']
    if len(resources) > 0:
      target_resource = None
      for resource in reversed(resources):
        if filename == resource['name']:
          target_resource = resource
          break

      if target_resource == None:
        # If no resource with the same name exists, call resource_create
        self.resource_create(data, path)
      else:
        # If the resource exists, set its ID in data and call resource_update
        print ('Resource "{}" already exists, it will be overwritten'.format(target_resource['name']))
        data['id'] = target_resource['id']
        self.resource_update(data, path)
    else:
      # If the package has no resources at all, call resource_create
      self.resource_create(data, path)

The upload_datasets function is called as follows.

 uploader.upload_datasets(os.path.abspath('../data/rwss.mbtiles'), 'mbtiles format of Mapbox Vector Tiles which was created by tippecanoe.')
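
The decision made by upload_datasets can be isolated from the network calls, which makes it easy to try out. The sketch below reuses the same file-name and extension logic; build_resource_data and choose_action are hypothetical helpers, and the package and resource IDs are made-up values.

```python
import os


def build_resource_data(package_id, path, description):
    """Build the form fields that upload_datasets sends to CKAN."""
    filename = os.path.basename(path)
    extension = os.path.splitext(filename)[1][1:].lower()
    return {"package_id": package_id, "name": filename,
            "format": extension, "description": description}


def choose_action(resources, data):
    """Hypothetical helper: pick the API action and add 'id' when updating."""
    # Search from the newest resource backwards, as the original loop does.
    for resource in reversed(resources):
        if resource["name"] == data["name"]:
            return "resource_update", dict(data, id=resource["id"])
    return "resource_create", data


data = build_resource_data("pkg-1", "../data/rwss.mbtiles", "example description")
existing = [{"id": "r-1", "name": "rwss.mbtiles"}]
print(data["format"])                    # mbtiles
print(choose_action(existing, data)[0])  # resource_update
print(choose_action([], data)[0])        # resource_create
```

Because looping over an empty list simply falls through to the final return, the separate len(resources) > 0 check is not needed in this form.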

Making the uploader callable from the command line

You can call it from the command line with upload2openafrica.py.

import os
import argparse
from OpenAfricaUploader import OpanAfricaUploader

def get_args():
  prog = "upload2openafrica.py"
  usage = "%(prog)s [options]"
  parser = argparse.ArgumentParser(prog=prog, usage=usage)
  parser.add_argument("--key", dest="key", help="Your CKAN api key", required=True)
  parser.add_argument("--pkg", dest="package", help="Target url of your package", required=True)
  parser.add_argument("--title", dest="title", help="Title of your package", required=True)
  parser.add_argument("--file", dest="file", help="Relative path of file which you would like to upload", required=True)
  parser.add_argument("--desc", dest="description", help="any description for your file", required=True)
  args = parser.parse_args()

  return args

if __name__ == "__main__":
  args = get_args()

  uploader = OpanAfricaUploader(args.key)
  uploader.create_package(args.package,args.title)
  uploader.upload_datasets(os.path.abspath(args.file), args.description)
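
Passing the key with --key puts it on the command line. As one possible variation, the parser could fall back to the CKAN_API_KEY environment variable when --key is omitted. This is only a sketch: it keeps just two of the five options for brevity, and the argv and environ parameters are additions made here for illustration.

```python
import argparse
import os


def get_args(argv=None, environ=None):
    """Parse options, taking the API key from CKAN_API_KEY when --key is absent."""
    environ = os.environ if environ is None else environ
    parser = argparse.ArgumentParser(prog="upload2openafrica.py")
    parser.add_argument("--key", dest="key", default=environ.get("CKAN_API_KEY"),
                        help="Your CKAN api key (defaults to CKAN_API_KEY)")
    parser.add_argument("--pkg", dest="package", required=True,
                        help="Target url of your package")
    args = parser.parse_args(argv)
    if not args.key:
        # parser.error() prints the usage text and exits with status 2.
        parser.error("provide --key or set CKAN_API_KEY")
    return args


args = get_args(["--pkg", "rw-water-vectortiles"], environ={"CKAN_API_KEY": "dummy"})
print(args.key, args.package)  # dummy rw-water-vectortiles
```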

In practice it is used as follows. I wrote a shell script called upload_mbtiles.sh. Be sure to set your API key in the CKAN_API_KEY environment variable.


#!/bin/bash

pipenv run python upload2openafrica.py \
  --key ${CKAN_API_KEY} \
  --pkg rw-water-vectortiles \
  --title "Vector Tiles for rural water supply systems in Rwanda" \
  --file ../data/rwss.mbtiles \
  --desc "mbtiles format of Mapbox Vector Tiles which was created by tippecanoe."

You can now upload open data using the CKAN API.

Automation of data linkage

However, linking with CKAN by hand every time is tedious, so I automate it with GitHub Actions. The workflow file looks like this:

name: openAFRICA upload

on:
  push:
    branches: [ master ]
    # Run the workflow only when something under the data folder changes.
    paths:
      - "data/**"

jobs:
  build:

    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v2
    - name: Set up Python 3.8
      uses: actions/setup-python@v2
      with:
        python-version: 3.8
    - name: Install dependencies
      # First, set up Pipenv.
      run: |
        cd scripts
        pip install pipenv
        pipenv install
    - name: upload to openAFRICA
      # If you register the key under the name CKAN_API_KEY in Secrets on the Settings page of the GitHub repository, you can expose it as an environment variable as follows
      env:
        CKAN_API_KEY: ${{secrets.CKAN_API_KEY}}
      # Then call the shell script
      run: |
        cd scripts
        ./upload_mbtiles.sh

With this alone, whenever a file is pushed to GitHub, it is automatically linked to the open data platform. The following image shows the screen when the GitHub Actions workflow for Rwanda's water corporation runs.

Summary

The CKAN API is used on various open data platforms in Japan and abroad. With Python, data linkage through the CKAN API is relatively easy to implement. And if the open data is managed on GitHub, GitHub Actions makes automating the linkage even easier.

I hope the module created for openAFRICA proves useful for other CKAN-based open data platforms in Japan and overseas.