Given the following two lists, I want to count the occurrences of each element of `a`, weighted by the corresponding value in `b`.
The Python version is 3.7.5.
a = ["A", "B", "C", "A"]
b = [1, 1, 2, 2]
c = hoge(a, b)
print(c)
output
{"A": 3, "B": 1, "C": 2} #I want this kind of output
#The key and value can be separate
# (["A", "B", "C"], [3, 1, 2])
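As a baseline for comparison, a minimal sketch of what a `hoge(a, b)` could look like with nothing but a plain dict (`hoge` is just the placeholder name from the question, not a real library function):

```python
# Minimal weighted count: accumulate each element's weight in a dict.
# hoge is a hypothetical placeholder name, as in the question above.
def hoge(elements, weights):
    counts = {}
    for key, w in zip(elements, weights):
        # missing keys start from 0 via get()
        counts[key] = counts.get(key, 0) + w
    return counts

a = ["A", "B", "C", "A"]
b = [1, 1, 2, 2]
print(hoge(a, b))  # -> {'A': 3, 'B': 1, 'C': 2}
```

The three methods below are essentially different ways of doing this same accumulation.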
Suppose you want to count, for each book, the total number of copies a bookstore has sold so far. [^1] However, all I have is **multiple tables that have already been aggregated by month**. For simplicity, imagine the following two csv files.
■ 2020_01.csv

| Book name | Number of books sold |
|---|---|
| Book_A | 1 |
| Book_B | 2 |
| Book_C | 3 |
■ 2020_02.csv

| Book name | Number of books sold |
|---|---|
| Book_A | 2 |
| Book_C | 1 |
| Book_D | 3 |
Combining these two tables yields exactly the counting problem with "elements" and "weights" described in "What you want to do".

I implemented it in the following three ways. I would be grateful if you could tell me which one is better, or suggest other methods [^2].

1. Create a `label` that uniquely corresponds to each book name, and do a weighted count with `numpy.bincount`
2. Create a `collections.Counter` object for each table, and add up the `Counter` objects of all tables
3. Add/update values in an empty dictionary with a for statement
3'. Use `reduce` instead of the for statement in 3

## 1. Weighted count with `numpy.bincount`

You can do a weighted count by using the `bincount` function of `numpy`.
Reference: Meaning of weight in numpy.bincount
However, every element of the input to `np.bincount` **must be a non-negative integer**.

> numpy.bincount(x, weights=None, minlength=0)
> Count number of occurrences of each value in array of non-negative ints.
> - x : array_like, 1 dimension, nonnegative ints. Input array.
> - weights : array_like, optional. Weights, array of the same shape as x.
> - minlength : int, optional. A minimum number of bins for the output array. New in version 1.6.0.
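A tiny self-contained example of the `weights` argument, using the same values as the lists `a` and `b` from the beginning (with `a` already encoded as the integers 0, 1, 2):

```python
import numpy as np

# x must be non-negative integers; weights has the same shape as x.
x = np.array([0, 1, 2, 0])          # encoded elements: A=0, B=1, C=2
w = np.array([1, 1, 2, 2])          # weights
print(np.bincount(x, weights=w))    # -> [3. 1. 2.]  (note: float64 result)
```

Bin `i` of the result is the sum of `w` over all positions where `x == i`.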
Therefore, in order to use `np.bincount`, we first prepare a `label` that uniquely corresponds to each book name. I used `LabelEncoder` from `sklearn` to create the `label`.
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
#Data preparation
df_01 = pd.DataFrame([["Book_A", 1],
["Book_B", 2],
["Book_C", 3]],
columns=["Name", "Count"])
df_02 = pd.DataFrame([["Book_A", 2],
["Book_C", 1],
["Book_D", 3]],
columns=["Name", "Count"])
#Join table
df_all = pd.concat([df_01, df_02])
#The contents are like this.
# | | Name | Count |
# |--:|:--|--:|
# | 0 | Book_A | 1 |
# | 1 | Book_B | 2 |
# | 2 | Book_C | 3 |
# | 0 | Book_A | 2 |
# | 1 | Book_C | 1 |
# | 2 | Book_D | 3 |
#Label encoding
le = LabelEncoder()
encoded = le.fit_transform(df_all["Name"].values)
#Add a new Label column
df_all["Label"] = encoded
#Weighted count with np.bincount
#Pass the Label column as input and the Count column as the weights.
#The result is a float array, so convert it to int.
count_result = np.bincount(df_all["Label"], weights=df_all["Count"]).astype(int)
#Get the Name corresponding to each count
name_result = le.inverse_transform(range(len(count_result)))
#Create the dictionary you want in the end
result = dict(zip(name_result, count_result))
print(result)
output
{'Book_A': 3, 'Book_B': 2, 'Book_C': 4, 'Book_D': 3}
You can also create the `label` using `np.unique`. Setting the argument `return_inverse` of `np.unique` to True gives the same result as `fit_transform` of `LabelEncoder`. In addition, you get the corresponding names (`name_result` above) at the same time.
#Label encoding using np.unique
name_result, encoded = np.unique(df_all["Name"], return_inverse=True)
print(encoded)
print(name_result)
output
[0 1 2 0 2 3]
['Book_A' 'Book_B' 'Book_C' 'Book_D']
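Putting the two together, `np.unique` and `np.bincount` can do the whole weighted count without sklearn. A sketch, rebuilding the same combined table locally so it runs on its own:

```python
import numpy as np
import pandas as pd

# Same contents as df_all above, recreated here for a self-contained example
df_all = pd.DataFrame({"Name":  ["Book_A", "Book_B", "Book_C", "Book_A", "Book_C", "Book_D"],
                       "Count": [1, 2, 3, 2, 1, 3]})

# One pass: unique names plus the integer codes needed by bincount
names, encoded = np.unique(df_all["Name"], return_inverse=True)
counts = np.bincount(encoded, weights=df_all["Count"]).astype(int)

result = dict(zip(names.tolist(), counts.tolist()))
print(result)  # -> {'Book_A': 3, 'Book_B': 2, 'Book_C': 4, 'Book_D': 3}
```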
You can also do the weighted count with a for statement instead of `np.bincount` [^3].
#Create a zero-padded array with the same length as the dictionary you want
unique_length = len(name_result)
count_result = np.zeros(unique_length, dtype=int)
#For each label i, extract the rows whose encoded value is i and sum their Count values
for i in range(unique_length):
    count_result[i] = df_all.loc[encoded == i, "Count"].sum()
result = dict(zip(name_result, count_result))
print(result)
output
{'Book_A': 3, 'Book_B': 2, 'Book_C': 4, 'Book_D': 3}
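Since the question also asks about other methods: pandas itself can do this aggregation directly with `groupby`, with no encoding step at all. A sketch using the same two tables:

```python
import pandas as pd

df_01 = pd.DataFrame([["Book_A", 1], ["Book_B", 2], ["Book_C", 3]],
                     columns=["Name", "Count"])
df_02 = pd.DataFrame([["Book_A", 2], ["Book_C", 1], ["Book_D", 3]],
                     columns=["Name", "Count"])

# Concatenate, then sum Count per Name
result = pd.concat([df_01, df_02]).groupby("Name")["Count"].sum().to_dict()
print(result)  # -> {'Book_A': 3, 'Book_B': 2, 'Book_C': 4, 'Book_D': 3}
```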
## 2. Create a `Counter` per table and add them up

The `Counter` class in the standard library module `collections` is often introduced for **unweighted** counting.
from collections import Counter
a = ["A", "B", "C", "A"]
#Give Counter a list and do unweighted counting
counter = Counter(a)
print(counter)
#Access to elements is the same as a dictionary
print("A:", counter["A"])
output
Counter({'A': 2, 'B': 1, 'C': 1})
A: 2
Also, when the data has already been aggregated, as in this case, you can create a `Counter` by first converting the data to a dict and passing that.
counter = Counter(dict([["Book_A", 1],
["Book_B", 2],
["Book_C", 3]]))
print(counter)
output
Counter({'Book_A': 1, 'Book_B': 2, 'Book_C': 3})
Incidentally, `Counter` objects support arithmetic operations.
Reference: Various ways to check the number of occurrences of an element with Python's Counter
For our purpose, taking the sum is enough.
from collections import Counter
a = ["A", "B", "C", "A"]
b = ["C", "D"]
counter_a = Counter(a)
counter_b = Counter(b)
#Can be added with sum
counter_ab = sum([counter_a, counter_b], Counter())
print(counter_ab)
output
Counter({'A': 2, 'C': 2, 'B': 1, 'D': 1})
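For just two Counters, the `+` operator gives the same result; `sum()` with an initial `Counter()` mainly pays off when merging a whole list of them. A quick check:

```python
from collections import Counter

counter_a = Counter(["A", "B", "C", "A"])
counter_b = Counter(["C", "D"])

# Counter supports + directly; counts for shared keys are added
merged = counter_a + counter_b  # counts: A=2, B=1, C=2, D=1
print(merged)
```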
import pandas as pd
from collections import Counter
#Data preparation
df_01 = pd.DataFrame([["Book_A", 1],
["Book_B", 2],
["Book_C", 3]],
columns=["Name", "Count"])
df_02 = pd.DataFrame([["Book_A", 2],
["Book_C", 1],
["Book_D", 3]],
columns=["Name", "Count"])
#Creating a Counter
counter_01 = Counter(dict(df_01[["Name", "Count"]].values))
counter_02 = Counter(dict(df_02[["Name", "Count"]].values))
#Calculate the sum
#Supplement: the second argument of sum is the initial value.
#Here an empty Counter is used; the default of 0 (int) cannot be added to a Counter.
result = sum([counter_01, counter_02], Counter())
print(result)
output
Counter({'Book_C': 4, 'Book_A': 3, 'Book_D': 3, 'Book_B': 2})
~~Apparently the result is sorted in descending order of count.~~
## 3. Add/update values in an empty dictionary

If you give a dictionary multiple `value`s for the same `key`, the earlier entries are overwritten by the last given `value`.
print( {"A": 1, "B": 2, "C": 3, "A":10} )
output
{'A': 10, 'B': 2, 'C': 3}
Using this, to update the count of an existing `key`, you can **get the existing value**, add the value you want to **add**, and append the result at the end.
Also, you can expand an existing dictionary into a new one by prefixing the variable with `**` (two stars).
Reference: [Python] Function arguments * (star) and ** (double star)
#Existing dictionary
d = {"A": 1, "B": 2, "C": 3}
#Element to add value
k = "A"
v = 10
#update
d = {**d, k: d[k]+v}  # equivalent to {"A": 1, "B": 2, "C": 3, "A": 1+10}
print(d)
output
{'A': 11, 'B': 2, 'C': 3}
However, if you specify a `key` that does not exist in the dictionary, `d[k]` raises an error, so this cannot add a new `key` as is.
Therefore, we use the dictionary's `get()` method. With `get()` you can specify a default value to return when the `key` does not exist in the dictionary.
Reference: Get a value from a key with the get method of a Python dictionary (non-existent keys are OK)
Reference: Get value from key with get method of Python dictionary (key that does not exist is OK)
d = {"A": 1, "B": 2, "C": 3}
#Specify an existing key
print(d.get("A", "NO KEY"))
#Specify a key that does not exist
print(d.get("D", "NO KEY"))
output
1
NO KEY
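As an aside, the standard library's `collections.defaultdict(int)` gives the same "missing keys count as 0" behavior without calling `get()` at every update; this is an alternative technique not used in the code below:

```python
from collections import defaultdict

d = defaultdict(int)  # missing keys default to int(), i.e. 0
for key, w in [("A", 1), ("B", 2), ("A", 10)]:
    d[key] += w       # no KeyError even on the first occurrence of a key
print(dict(d))        # -> {'A': 11, 'B': 2}
```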
By setting the default value to 0, additions and updates can be handled in the same way.
Putting this together, the code that does a weighted count by adding/updating values into an empty dictionary is as follows.
import pandas as pd
from itertools import chain
#Data preparation
df_01 = pd.DataFrame([["Book_A", 1],
["Book_B", 2],
["Book_C", 3]],
columns=["Name", "Count"])
df_02 = pd.DataFrame([["Book_A", 2],
["Book_C", 1],
["Book_D", 3]],
columns=["Name", "Count"])
#Convert data frame to dictionary
data1 = dict(df_01[["Name", "Count"]].values)
data2 = dict(df_02[["Name", "Count"]].values)
#Function definitions
chain_items = lambda data: chain.from_iterable(d.items() for d in data)  # returns the (key, value) pairs of multiple dictionaries, chained together
add_elem = lambda acc, e: {**acc, e[0]: acc.get(e[0], 0) + e[1]}  # adds an element to the dictionary, updating the value if the key already exists
#A function that receives multiple dictionaries whose keys are elements and values are weights, and merges them
def merge_count(*data):
    result = {}
    for e in chain_items(data):
        result = add_elem(result, e)
    return result
print( merge_count(data1, data2) )
output
{'Book_A': 3, 'Book_B': 2, 'Book_C': 4, 'Book_D': 3}
## 3'. `reduce` instead of the for statement in 3

With `reduce`, you can iterate without writing a for statement.
`reduce` takes the following arguments:

- First argument: a function that takes the accumulated result so far and the current value
- Second argument: an iterable (list, generator, etc.)
- Third argument (optional): the initial value. If omitted, the first element of the iterable is used as the initial value.
from functools import reduce
func = lambda ans, x: ans * x
a = [1, 2, 3, 4]
start = 10
print(reduce(func, a, start))
output
240 # 10*1 = 10
# -> 10*2 = 20
# -> 20*3 = 60
# -> 60*4 = 240
Recreating the above `merge_count` using `reduce` gives:
from functools import reduce
merge_count = lambda *data: reduce(add_elem, chain_items(data), {})  # equivalent to merge_count above
print( merge_count(data1, data2) )
output
{'Book_A': 3, 'Book_B': 2, 'Book_C': 4, 'Book_D': 3}
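The same `reduce` pattern also works directly on the `Counter` objects from method 2, using `operator.add` as the merging function. A sketch combining methods 2 and 3':

```python
from collections import Counter
from functools import reduce
import operator

# The two monthly tables as Counters, as in method 2
counters = [Counter({"Book_A": 1, "Book_B": 2, "Book_C": 3}),
            Counter({"Book_A": 2, "Book_C": 1, "Book_D": 3})]

# Fold the list of Counters into one, starting from an empty Counter
result = reduce(operator.add, counters, Counter())
print(dict(result))  # -> {'Book_A': 3, 'Book_B': 2, 'Book_C': 4, 'Book_D': 3}
```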
The following site was very helpful for understanding `reduce`.
Reference: Introduction to Functional Programming
- [Meaning of weight in numpy.bincount [Categorical variable encoding]](https://qiita.com/ground0state/items/f516b97c7a8641e474c4)
- [[Python] Enumerating list elements: how to use collections.Counter](https://qiita.com/ellio08/items/259388b511e24625c0d7)
- [Various ways to check the number of occurrences of an element with Python's Counter](https://www.headboost.jp/python-counter/)
- [[Python] Function arguments * (star) and ** (double star)](https://qiita.com/supersaiakujin/items/faee48d35f8d80daa1ac)
- [Introduction to Functional Programming](https://postd.cc/an-introduction-to-functional-programming/)
[^1]: I gave a contrived concrete example for clarity, but in reality this was used to aggregate morphological-analysis results over multiple documents.
[^2]: Execution speed, memory efficiency, etc.
[^3]: With my own knowledge I couldn't think of anything other than writing a for statement (excluding list comprehensions).