This is a program that:
・scrapes the open data (CSV) released by Gifu Prefecture on a regular schedule with GitHub Actions,
・outputs JSON files as simple dictionary arrays without editing the data,
・pushes to the gh-pages branch when there is a difference, and
・lets you access the JSON files directly on GitHub Pages.
This program was developed for the Gifu Prefecture coronavirus countermeasure site. Similar scrapers have been published for other prefectures, but they build site-specific processing into the CSV-to-JSON conversion, so many modifications were needed to reuse them. This program therefore keeps processing to a minimum and outputs the original CSV data as-is in JSON, which makes it easy for other developers to build on.
Product
GitHub
https://github.com/CODE-for-GIFU/covid19-scraping
GitHub Pages
http://code-for-gifu.github.io/covid19-scraping/patients.json
http://code-for-gifu.github.io/covid19-scraping/testcount.json
http://code-for-gifu.github.io/covid19-scraping/callcenter.json
http://code-for-gifu.github.io/covid19-scraping/advicecenter.json
Gifu Prefecture Open Data https://data.gifu-opendata.pref.gifu.lg.jp/dataset/c11223-001
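Each published file has the same shape: a data array with one dictionary per CSV row, plus a last_update timestamp (see the main.py excerpts below). As a rough illustration of how a published file can be consumed (the URL is one of the files listed above):

# Illustration only: reading one of the published JSON files.
import json
import urllib.request

url = 'http://code-for-gifu.github.io/covid19-scraping/patients.json'
with urllib.request.urlopen(url) as res:
    payload = json.load(res)

print(payload['last_update'])    # last-modified timestamp of the source CSV, in JST
for row in payload['data'][:3]:  # each row is a dict keyed by the original CSV headers
    print(row)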
How to use
The scraping runs on the schedule defined in .github/workflows/main.yml; edit the cron setting to change the interval.
main.yml
on:
  schedule:
    - cron: "*/10 * * * *"
Set the gh-pages branch as the publishing source under Settings -> GitHub Pages -> Source.
For details, refer to the official GitHub Actions documentation: https://help.github.com/ja/actions
To run locally:
pip install -r requirements.txt
python3 main.py
JSON files will be generated in the /data folder.
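As a quick check, you can list the generated files; with this repository's settings they correspond to the four JSON files published above.

import os
print(sorted(os.listdir('./data')))
# e.g. ['advicecenter.json', 'callcenter.json', 'patients.json', 'testcount.json']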
Python
main.py
# Create the output folder and convert every registered CSV source to JSON
os.makedirs('./data', exist_ok=True)
for remotes in REMOTE_SOURCES:
    data = import_csv_from(remotes['url'])
    dumps_json(remotes['jsonname'], data)
settings.py
# External resource definition
REMOTE_SOURCES = [
    {
        'url': 'https://opendata-source.com/source1.csv',
        'jsonname': 'source1.json',
    },
    {
        'url': 'https://opendata-source.com/source2.csv',
        'jsonname': 'source2.json',
    },
    {
        'url': 'https://opendata-source.com/source3.csv',
        'jsonname': 'source3.json',
    },
    {
        'url': 'https://opendata-source.com/source4.csv',
        'jsonname': 'source4.json',
    }
]
jsonname: the output JSON file name

main.py
def import_csv_from(csvurl):
    request_file = urllib.request.urlopen(csvurl)
    if not request_file.getcode() == 200:
        return
    f = decode_csv(request_file.read())
    filename = os.path.splitext(os.path.basename(csvurl))[0]
    datas = csvstr_to_dicts(f)
    timestamp = request_file.getheader('Last-Modified')
    return {
        'data': datas,
        'last_update': dateutil.parser.parse(timestamp).astimezone(JST).isoformat()
    }
data: stores the decoded CSV data itself.
last_update: the last-modified date of the file.

main.py
def decode_csv(csv_data):
    print('csv decoding')
    for codec in CODECS:
        try:
            csv_str = csv_data.decode(codec)
            print('ok:' + codec)
            return csv_str
        except:
            print('ng:' + codec)
            continue
    print('Appropriate codec is not found.')
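For illustration, assuming CODECS is a list of candidate encodings such as ['utf-8', 'cp932'], a Shift_JIS-encoded payload (common in Japanese open data) would typically fail the first attempt and decode on the second:

# Illustration only: the CSV content and the CODECS values are assumptions.
raw = '市町村,人数\n岐阜市,3\n'.encode('cp932')  # Shift_JIS-style bytes
print(decode_csv(raw))
# console output: csv decoding -> ng:utf-8 -> ok:cp932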
main.py
def csvstr_to_dicts(csvstr):
    datas = []
    rows = [row for row in csv.reader(csvstr.splitlines())]
    header = rows[0]
    for i in range(len(header)):
        for j in range(len(UNUSE_CHARACTER)):
            header[i] = header[i].replace(UNUSE_CHARACTER[j], '')
    maindatas = rows[1:]
    for d in maindatas:
        # Skip blank lines
        if d == []:
            continue
        data = {}
        for i in range(len(header)):
            data[header[i]] = d[i]
        datas.append(data)
    return datas
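A small illustration of the conversion, using hypothetical CSV content (UNUSE_CHARACTER is assumed to hold characters such as a BOM or stray quotes that should be stripped from the headers):

# Illustration only: each non-header row becomes a dict keyed by the header cells.
sample = '番号,氏名\n1,岐阜太郎\n2,岐阜花子\n'
print(csvstr_to_dicts(sample))
# [{'番号': '1', '氏名': '岐阜太郎'}, {'番号': '2', '氏名': '岐阜花子'}]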
main.py
def dumps_json(file_name: str, json_data: Dict):
    with codecs.open("./data/" + file_name, "w", "utf-8") as f:
        f.write(json.dumps(json_data, ensure_ascii=False,
                           indent=4, separators=(',', ': ')))
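The main.py excerpts above omit their imports and a few constants (JST, CODECS, UNUSE_CHARACTER) whose exact definitions are not shown in this article. A plausible sketch of that scaffolding, with assumed values, looks like this:

# Sketch only: the concrete values of JST, CODECS and UNUSE_CHARACTER are assumptions.
import codecs
import csv
import json
import os
import urllib.request
from datetime import timedelta, timezone
from typing import Dict

import dateutil.parser

from settings import REMOTE_SOURCES  # the external resource definition shown above

JST = timezone(timedelta(hours=9), 'JST')  # timestamps are converted to Japan Standard Time
CODECS = ['utf-8', 'cp932', 'euc_jp']      # candidate encodings tried by decode_csv
UNUSE_CHARACTER = ['\ufeff', '"']          # characters stripped from CSV headers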
GitHub Actions
The workflow is built with a yml file. Scheduled execution uses a cron setting:
main.yml
on:
  schedule:
    - cron: "*/10 * * * *"
main.yml
steps:
- uses: actions/checkout@v2
- name: Set up Python 3.8
  uses: actions/setup-python@v1
  with:
    python-version: 3.8
- name: Install dependencies
  run: |
    python -m pip install --upgrade pip
    pip install -r requirements.txt
- name: Run script
  run: |
    python main.py
The libraries listed in requirements.txt are installed automatically.

main.yml
- name: deploy
  uses: peaceiris/actions-gh-pages@v3
  with:
    github_token: ${{ secrets.GITHUB_TOKEN }}
    publish_dir: ./data
    publish_branch: gh-pages
secrets.GITHUB_TOKEN: a token representing the repository itself (provided automatically by GitHub Actions).
publish_dir: the output folder setting; specify the data folder where the JSON files are written.
publish_branch: the branch to push to.

Reference
Hokkaido: Python script for scraping - covid19hokkaido_scraping https://github.com/Kanahiro/covid19hokkaido_scraping/blob/master/main.py