This article is the final entry (day 25) of the Civictech 1st year Advent Calendar 2020. (That said, it is mostly a digest of things I have already written about, so it may not be a very interesting article...)
Hi, this is y-chan, a contributor to the Hyogo Prefecture version of the coronavirus summary site. I have been keeping busy: actively contributing to the original Tokyo site, developing various things under SecHack365, and running CCC2020. This time I did something that probably nobody else has done (as far as I could find), namely verifying COVID-19 open data, so I thought I should write a little about it. Some of this is based on what I have heard from others, so there may be mistakes; I would appreciate it if you would kindly correct them with an edit request.
Let me start with a preface. The work we do of acquiring, shaping, and processing open data is called data wrangling. The term itself appears to be a coinage, and "wrangling" apparently carries the sense of "taming". Incidentally, I had never worked with data before, so I did not know the term until recently.
As the heading says. Normally, validating something like open data is an odd thing to do. Open data is published by national and local governments, and the data was most likely entered by humans. Naturally, human error can creep in, and the data is rarely perfect. Users of open data deal with such errors, noise, and missing values themselves at the "preprocessing" stage, either by normalizing the erroneous values or by discarding them (treating them as None), before they use the data. This "preprocessing" also seems to be considered part of data wrangling. "Verification", however, is not normally done in data wrangling, nor is it expected to be.
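As a rough illustration of what that kind of preprocessing can look like, here is a minimal Python sketch. The function name `normalize_count` and the sample inputs are my own assumptions for illustration, not code from the summary site. It normalizes a raw cell that should hold a count, and treats anything it cannot interpret as None.

```python
import unicodedata
from typing import Optional


def normalize_count(raw: object) -> Optional[int]:
    """Normalize a raw cell that should hold a count (hypothetical helper).

    Returns an int when the cell can be interpreted, otherwise None.
    """
    if raw is None:
        return None
    # NFKC folds full-width digits such as "１２" into ASCII "12".
    text = unicodedata.normalize("NFKC", str(raw)).strip()
    # Drop thousands separators such as "1,234".
    text = text.replace(",", "")
    # Anything that is still not a plain run of ASCII digits is treated as None.
    if not (text.isascii() and text.isdigit()):
        return None
    return int(text)


print(normalize_count("１２"))     # full-width digits
print(normalize_count("1,234"))  # thousands separator
print(normalize_count("不明"))    # not a number, treated as None
```

Note that this silently papers over the error: the user of the data gets something usable, but nobody learns that the published value was wrong.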
I expect many people will say "What?" on hearing that I verify COVID-19 open data as part of data wrangling. When I said "I'm building a verification mechanism" within the summary site team, one person responded with "What do you mean, verify?" and "It should be preprocessed rather than verified." Preprocessing is of course also necessary, and we do some of it. That said, I believe many of the regional COVID-19 countermeasure sites cooperate with their local governments on open data. The Hyogo Prefecture version has no such cooperation, but I have pointed out to the prefecture before that the data might be wrong. I did so simply because the errors broke the data I was formatting, but partly also because I felt that sensitive data, such as the attributes of patients who tested positive, ought to be correct. Back then there was little data and only a few human errors, but the number of infected people has since grown, and the number of human errors has grown considerably with it. At the same time, the mistakes have become harder to find. So I decided to leave finding mistakes to a program. That is why I set out to verify open data. Admittedly, pressing prefecture officials for data corrections is perhaps a different matter...
Now, I would like to briefly describe how the open data is verified. I call it verification, but what I actually do is simple. I only check the following:
- If the data is a character string, does it fit the expected format?
- If the data is numeric and published in multiple forms (daily, cumulative, etc.), are the numbers consistent with each other (for example, does the sum of the daily values match the cumulative value)?
In practice, mistakes such as typos can be corrected, but numerical errors cannot be corrected accurately on our side; the only options are to use the values as they are or to discard them. So in the end the summary site does not correct them, and the data is posted almost as is... Also, I do not know whether any official rules exist for the character strings, so I set the expected formats based on the data published so far, and legitimate exceptions sometimes get caught by the verification. By the way, the verification script is the same one that does the data scraping.
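The two checks can be sketched roughly as follows. This is not the actual script: the age-group pattern, the function names, and the message wording are all assumptions made up for illustration, and the real rules were derived from the published Hyogo data.

```python
import re
from itertools import accumulate
from typing import List

# Hypothetical format rule, standing in for one derived from past data.
AGE_PATTERN = re.compile(r"under 10|[1-9]0s|undisclosed")


def check_strings(values: List[str]) -> List[str]:
    """Report every string that does not fit the expected format."""
    messages = []
    for row, value in enumerate(values, start=1):
        if AGE_PATTERN.fullmatch(value) is None:
            messages.append(f"row {row}: unexpected value {value!r}")
    return messages


def check_cumulative(daily: List[int], cumulative: List[int]) -> List[str]:
    """Report days where the running sum of daily values differs from the
    published cumulative value."""
    messages = []
    if len(daily) != len(cumulative):
        messages.append(
            f"length mismatch: {len(daily)} daily values, "
            f"{len(cumulative)} cumulative values"
        )
    pairs = zip(accumulate(daily), cumulative)
    for day, (expected, actual) in enumerate(pairs, start=1):
        if expected != actual:
            messages.append(
                f"day {day}: daily values sum to {expected}, "
                f"but the cumulative value is {actual}"
            )
    return messages


# "2Os" uses the letter O instead of the digit 0, a typical typo.
for message in check_strings(["20s", "2Os", "undisclosed"]):
    print(message)
for message in check_cumulative([1, 2, 3], [1, 4, 6]):
    print(message)
```

Both functions only report suspicious spots and leave the data untouched, which matches the point above that numerical errors cannot be corrected accurately on our side.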
The verification results can be viewed by anyone. When the verification finds data that may be incorrect, the script reports the affected part as a message. However, it was admittedly a rush job, and the messages are still hard to understand... Also, these are only the results of checking against criteria that I set myself. A discrepancy between numbers looks like a clear mistake, but whether a character string that does not fit the fixed format is really a human error is less clear-cut. This is another reason why open data is not usually verified.
This is not much of a summary, but I do think it is a bit silly that I have not done "4" yet. However, because the data is sensitive, I am unsure whether I should be changing numbers or rewriting the attribute information of patients who tested positive. I think that is what makes COVID-19 open data difficult.