This is a program that:
・scrapes the open data (CSV) released by Gifu Prefecture on a regular schedule with GitHub Actions,
・outputs JSON files as simple dictionary arrays without editing the data,
・pushes to the gh-pages branch when there is a difference, and
・lets you access the JSON files directly on GitHub Pages.
This program was developed for the Gifu Prefecture coronavirus countermeasure site. Similar scrapers have been published for other prefectures, but they build site-specific processing into the CSV-to-JSON conversion, so many modifications were needed to reuse them. This program therefore keeps processing to a minimum and outputs the original CSV data as-is in JSON, which makes it easy for other developers to build on.
Product
GitHub
https://github.com/CODE-for-GIFU/covid19-scraping
GitHub Pages
http://code-for-gifu.github.io/covid19-scraping/patients.json
http://code-for-gifu.github.io/covid19-scraping/testcount.json
http://code-for-gifu.github.io/covid19-scraping/callcenter.json
http://code-for-gifu.github.io/covid19-scraping/advicecenter.json
Gifu Prefecture Open Data https://data.gifu-opendata.pref.gifu.lg.jp/dataset/c11223-001
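Each published file has the same shape: a data array with one dictionary per CSV row, plus a last_update timestamp (see the main.py excerpts below). As a rough illustration of how a published file can be consumed (the URL is one of the files listed above):

# Illustration only: reading one of the published JSON files.
import json
import urllib.request

url = 'http://code-for-gifu.github.io/covid19-scraping/patients.json'
with urllib.request.urlopen(url) as res:
    payload = json.load(res)

print(payload['last_update'])    # last-modified timestamp of the source CSV, in JST
for row in payload['data'][:3]:  # each row is a dict keyed by the original CSV headers
    print(row)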
How to use
The scraping runs on the schedule defined in .github/workflows/main.yml; edit the cron setting to change the interval.
main.yml
on:
  schedule:
    - cron: "*/10 * * * *"
Set the gh-pages branch as the publishing source under Settings -> GitHub Pages -> Source.
For details, refer to the official GitHub Actions documentation: https://help.github.com/ja/actions
To run locally:
pip install -r requirements.txt
python3 main.py
JSON files will be generated in the /data folder.
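As a quick check, you can list the generated files; with this repository's settings they correspond to the four JSON files published above.

import os
print(sorted(os.listdir('./data')))
# e.g. ['advicecenter.json', 'callcenter.json', 'patients.json', 'testcount.json']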
Python
main.py
# Create the output folder and convert every registered CSV source to JSON
os.makedirs('./data', exist_ok=True)
for remotes in REMOTE_SOURCES:
    data = import_csv_from(remotes['url'])
    dumps_json(remotes['jsonname'], data)
settings.py
# External resource definition
REMOTE_SOURCES = [
    {
        'url': 'https://opendata-source.com/source1.csv',
        'jsonname': 'source1.json',
    },
    {
        'url': 'https://opendata-source.com/source2.csv',
        'jsonname': 'source2.json',
    },
    {
        'url': 'https://opendata-source.com/source3.csv',
        'jsonname': 'source3.json',
    },
    {
        'url': 'https://opendata-source.com/source4.csv',
        'jsonname': 'source4.json',
    }
]
jsonname: the output JSON file name

main.py
def import_csv_from(csvurl):
    request_file = urllib.request.urlopen(csvurl)
    if not request_file.getcode() == 200:
        return
    f = decode_csv(request_file.read())
    filename = os.path.splitext(os.path.basename(csvurl))[0]
    datas = csvstr_to_dicts(f)
    timestamp = request_file.getheader('Last-Modified')
    return {
        'data': datas,
        'last_update': dateutil.parser.parse(timestamp).astimezone(JST).isoformat()
    }
data: stores the decoded CSV data itself.
last_update: the last-modified date of the file.

main.py
def decode_csv(csv_data):
    print('csv decoding')
    for codec in CODECS:
        try:
            csv_str = csv_data.decode(codec)
            print('ok:' + codec)
            return csv_str
        except:
            print('ng:' + codec)
            continue
    print('Appropriate codec is not found.')
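For illustration, assuming CODECS is a list of candidate encodings such as ['utf-8', 'cp932'], a Shift_JIS-encoded payload (common in Japanese open data) would typically fail the first attempt and decode on the second:

# Illustration only: the CSV content and the CODECS values are assumptions.
raw = '市町村,人数\n岐阜市,3\n'.encode('cp932')  # Shift_JIS-style bytes
print(decode_csv(raw))
# console output: csv decoding -> ng:utf-8 -> ok:cp932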
main.py
def csvstr_to_dicts(csvstr):
    datas = []
    rows = [row for row in csv.reader(csvstr.splitlines())]
    header = rows[0]
    for i in range(len(header)):
        for j in range(len(UNUSE_CHARACTER)):
            header[i] = header[i].replace(UNUSE_CHARACTER[j], '')
    maindatas = rows[1:]
    for d in maindatas:
        # Skip blank lines
        if d == []:
            continue
        data = {}
        for i in range(len(header)):
            data[header[i]] = d[i]
        datas.append(data)
    return datas
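A small illustration of the conversion, using hypothetical CSV content (UNUSE_CHARACTER is assumed to hold characters such as a BOM or stray quotes that should be stripped from the headers):

# Illustration only: each non-header row becomes a dict keyed by the header cells.
sample = '番号,氏名\n1,岐阜太郎\n2,岐阜花子\n'
print(csvstr_to_dicts(sample))
# [{'番号': '1', '氏名': '岐阜太郎'}, {'番号': '2', '氏名': '岐阜花子'}]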
main.py
def dumps_json(file_name: str, json_data: Dict):
    with codecs.open("./data/" + file_name, "w", "utf-8") as f:
        f.write(json.dumps(json_data, ensure_ascii=False,
                           indent=4, separators=(',', ': ')))
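The main.py excerpts above omit their imports and a few constants (JST, CODECS, UNUSE_CHARACTER) whose exact definitions are not shown in this article. A plausible sketch of that scaffolding, with assumed values, looks like this:

# Sketch only: the concrete values of JST, CODECS and UNUSE_CHARACTER are assumptions.
import codecs
import csv
import json
import os
import urllib.request
from datetime import timedelta, timezone
from typing import Dict

import dateutil.parser

from settings import REMOTE_SOURCES  # the external resource definition shown above

JST = timezone(timedelta(hours=9), 'JST')  # timestamps are converted to Japan Standard Time
CODECS = ['utf-8', 'cp932', 'euc_jp']      # candidate encodings tried by decode_csv
UNUSE_CHARACTER = ['\ufeff', '"']          # characters stripped from CSV headers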
GitHub Actions
The workflow is built with a yml file. Scheduled execution uses a cron setting:
main.yml
on:
  schedule:
    - cron: "*/10 * * * *"
main.yml
steps:
- uses: actions/checkout@v2
- name: Set up Python 3.8
  uses: actions/setup-python@v1
  with:
    python-version: 3.8
- name: Install dependencies
  run: |
    python -m pip install --upgrade pip
    pip install -r requirements.txt
- name: Run script
  run: |
    python main.py
The libraries listed in requirements.txt are installed automatically.

main.yml
- name: deploy
  uses: peaceiris/actions-gh-pages@v3
  with:
    github_token: ${{ secrets.GITHUB_TOKEN }}
    publish_dir: ./data
    publish_branch: gh-pages
secrets.GITHUB_TOKEN: a token representing the repository itself (provided automatically by GitHub Actions).
publish_dir: the output folder setting; specify the data folder where the JSON files are written.
publish_branch: the branch to push to.

Reference
Hokkaido: Python script for scraping - covid19hokkaido_scraping https://github.com/Kanahiro/covid19hokkaido_scraping/blob/master/main.py