When developing a web application, it is common to implement a process that formats the results of a SQL query so that they match the shape of an API response.
The data returned by a query is usually a list of instances defined by the driver used for the DB connection, and I suspect many people find formatting this data unexpectedly tedious. (The processing itself is mostly simple, which is probably why it feels that way.)
In this article, I examine how to implement this kind of data formatting efficiently.
For example, suppose the following data is obtained from the DB, and consider reshaping it.
data_list = [
    {
        'user_id': 1,
        'group_name': 'GroupA',
        'user_name': 'user1',
        'email': '[email protected]'
    },
    {
        'user_id': 2,
        'group_name': 'GroupB',
        'user_name': 'user2',
        'email': '[email protected]'
    },
    {
        'user_id': 3,
        'group_name': 'GroupB',
        'user_name': 'user3',
        'email': '[email protected]'
    },
    {
        'user_id': 4,
        'group_name': 'GroupA',
        'user_name': 'user4',
        'email': '[email protected]'
    },
    {
        'user_id': 5,
        'group_name': 'GroupA',
        'user_name': 'user5',
        'email': '[email protected]'
    }
]
The expected result is that the data is grouped by group_name and converted to the following format.
{
    "GroupA": [
        {
            "user_id": 1,
            "user_name": "user1",
            "email": "[email protected]"
        },
        {
            "user_id": 4,
            "user_name": "user4",
            "email": "[email protected]"
        },
        {
            "user_id": 5,
            "user_name": "user5",
            "email": "[email protected]"
        }
    ],
    "GroupB": [
        {
            "user_id": 2,
            "user_name": "user2",
            "email": "[email protected]"
        },
        {
            "user_id": 3,
            "user_name": "user3",
            "email": "[email protected]"
        }
    ]
}
I examined the following two implementation patterns for this formatting process.
The simplest way (pattern 1) is to fill in the result one record at a time with a for statement, as shown below.
# The argument is the data obtained from the DB
def sample1(data_list):
    result_dict = {}
    for data in data_list:
        group_name = data.get('group_name')
        # Handle the case where group_name is not yet registered
        if group_name not in result_dict:
            result_dict[group_name] = []
        # Build a dictionary excluding group_name and append it to the list
        result_dict[group_name].append(
            {key: value for key, value in data.items() if key != 'group_name'}
        )
    return result_dict
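Calling this function on the data_list above yields the expected grouped structure:

result = sample1(data_list)
print(result['GroupA'][0])
# {'user_id': 1, 'user_name': 'user1', 'email': '[email protected]'}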
This formatting process can also be implemented using reduce (pattern 2).
from functools import reduce

# The argument is the data obtained from the DB
def sample2(data_list):
    def accumulate(total, target):
        group_name = target.get('group_name')
        # Handle the case where group_name is not yet registered
        if group_name not in total:
            total[group_name] = []
        # Build a dictionary excluding group_name and append it to the list
        total[group_name].append(
            {key: value for key, value in target.items() if key != 'group_name'}
        )
        return total
    return reduce(accumulate, data_list, {})
To briefly explain this implementation: reduce takes a function as its first argument, an iterable as its second, and an optional initial value as its third, so here it is passed the formatting function (accumulate), the data obtained from the DB (data_list), and an empty dictionary as the initial value. On the first call to accumulate, the empty dictionary is passed to total and the first element of data_list to target; from the second call onward, total receives the return value of the previous call.
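As a minimal illustration of this call sequence (a toy example, not part of the formatting code), summing a list shows how the accumulator is threaded through each call:

from functools import reduce

# reduce(f, [a, b, c], init) evaluates as f(f(f(init, a), b), c)
total = reduce(lambda acc, x: acc + x, [1, 2, 3], 0)
print(total)  # 6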
The advantage of pattern 1 is that any formatting process can be implemented this way, but the disadvantage is that a similar loop must be written from scratch every time such a formatting process is needed (low reusability).
Pattern 2, on the other hand, may reduce readability when the processing is complicated, but it has the advantage that the formatting process can be made generic, for example by dynamically changing the column name that the formatting function references.
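The article does not show this generalization explicitly, but as a sketch, the accumulate function could be parameterized by the grouping column so the same reduce-based helper is reusable (group_by_column is a hypothetical name):

from functools import reduce

# Hypothetical generalization: the grouping column is passed in as a parameter
def group_by_column(data_list, column):
    def accumulate(total, target):
        key = target.get(column)
        if key not in total:
            total[key] = []
        total[key].append({k: v for k, v in target.items() if k != column})
        return total
    return reduce(accumulate, data_list, {})

# group_by_column(data_list, 'group_name') gives the same result as sample2(data_list)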
I was also concerned that using reduce might be slower, so just in case, I measured the time it takes each pattern to format 10,000,000 records. (Measured in a Jupyter notebook.)
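The original measurement does not show how the 10,000,000-record list was built; a minimal sketch, assuming it simply repeats rows shaped like the sample data, might look like this:

# A sketch only: field values and the example.com addresses are assumptions
data_list = [
    {
        'user_id': i,
        'group_name': 'GroupA' if i % 2 == 0 else 'GroupB',
        'user_name': f'user{i}',
        'email': f'user{i}@example.com'
    }
    for i in range(10_000_000)
]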
%timeit sample1(data_list)
11.6 s ± 211 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit sample2(data_list)
12.3 s ± 290 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
The results above show that the implementation using reduce is slightly slower, but even so, the difference is less than a second for 10,000,000 records, so I don't think speed needs to be a concern.
In conclusion, I think a realistic approach is to make reduce the basic policy, in order to standardize the formatting process as much as possible, and to fall back to a for statement for processing that is too complicated.