Given the following two lists, I want to count the occurrences of each element of `a`, weighted by the corresponding value in `b`.
The Python version is 3.7.5.
a = ["A", "B", "C", "A"]
b = [1, 1, 2, 2]
c = hoge(a, b)
print(c)
output
{"A": 3, "B": 1, "C": 2} #I want this kind of output
#The key and value can be separate
# (["A", "B", "C"], [3, 1, 2])
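As a baseline for comparison, a minimal sketch of what a `hoge(a, b)` could look like with nothing but a plain dict (`hoge` is just the placeholder name from the question, not a real library function):

```python
# Minimal weighted count: accumulate each element's weight in a dict.
# hoge is a hypothetical placeholder name, as in the question above.
def hoge(elements, weights):
    counts = {}
    for key, w in zip(elements, weights):
        # missing keys start from 0 via get()
        counts[key] = counts.get(key, 0) + w
    return counts

a = ["A", "B", "C", "A"]
b = [1, 1, 2, 2]
print(hoge(a, b))  # -> {'A': 3, 'B': 1, 'C': 2}
```

The three methods below are essentially different ways of doing this same accumulation.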
Suppose you want to count, for each book, the total number of copies a bookstore has sold so far. [^1] However, all I have is **multiple tables that have already been aggregated by month**. For simplicity, imagine the following two csv files.
■ 2020_01.csv

| Book name | Number of books sold |
|---|---|
| Book_A | 1 |
| Book_B | 2 |
| Book_C | 3 |
■ 2020_02.csv

| Book name | Number of books sold |
|---|---|
| Book_A | 2 |
| Book_C | 1 |
| Book_D | 3 |
Combining these two tables yields exactly the counting problem with "elements" and "weights" described in "What you want to do".

I implemented it in the following three ways. I would be grateful if you could tell me which one is better, or suggest other methods [^2].

1. Create a `label` that uniquely corresponds to each book name, and do a weighted count with `numpy.bincount`
2. Create a `collections.Counter` object for each table, and add up the `Counter` objects of all tables
3. Add/update values in an empty dictionary with a for statement
3'. Use `reduce` instead of the for statement in 3

## 1. Weighted count with `numpy.bincount`

You can do a weighted count by using the `bincount` function of `numpy`.
Reference: Meaning of weight in numpy.bincount
However, every element of the input to `np.bincount` **must be a non-negative integer**.

> numpy.bincount(x, weights=None, minlength=0)
> Count number of occurrences of each value in array of non-negative ints.
> - x : array_like, 1 dimension, nonnegative ints. Input array.
> - weights : array_like, optional. Weights, array of the same shape as x.
> - minlength : int, optional. A minimum number of bins for the output array. New in version 1.6.0.
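A tiny self-contained example of the `weights` argument, using the same values as the lists `a` and `b` from the beginning (with `a` already encoded as the integers 0, 1, 2):

```python
import numpy as np

# x must be non-negative integers; weights has the same shape as x.
x = np.array([0, 1, 2, 0])          # encoded elements: A=0, B=1, C=2
w = np.array([1, 1, 2, 2])          # weights
print(np.bincount(x, weights=w))    # -> [3. 1. 2.]  (note: float64 result)
```

Bin `i` of the result is the sum of `w` over all positions where `x == i`.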
Therefore, in order to use `np.bincount`, we first prepare a `label` that uniquely corresponds to each book name. I used `LabelEncoder` from `sklearn` to create the `label`.
import pandas as pd
import numpy as np
from sklearn.preprocessing import LabelEncoder
#Data preparation
df_01 = pd.DataFrame([["Book_A", 1],
["Book_B", 2],
["Book_C", 3]],
columns=["Name", "Count"])
df_02 = pd.DataFrame([["Book_A", 2],
["Book_C", 1],
["Book_D", 3]],
columns=["Name", "Count"])
#Join table
df_all = pd.concat([df_01, df_02])
#The contents are like this.
# | | Name | Count |
# |--:|:--|--:|
# | 0 | Book_A | 1 |
# | 1 | Book_B | 2 |
# | 2 | Book_C | 3 |
# | 0 | Book_A | 2 |
# | 1 | Book_C | 1 |
# | 2 | Book_D | 3 |
#Label encoding
le = LabelEncoder()
encoded = le.fit_transform(df_all["Name"].values)
#Add a new Label column
df_all["Label"] = encoded
#Weighted count with np.bincount
#Pass the Label column as input and the Count column as the weights.
#The result is a float array, so convert it to int.
count_result = np.bincount(df_all["Label"], weights=df_all["Count"]).astype(int)
#Get the Name corresponding to each count
name_result = le.inverse_transform(range(len(count_result)))
#Create the dictionary you want in the end
result = dict(zip(name_result, count_result))
print(result)
output
{'Book_A': 3, 'Book_B': 2, 'Book_C': 4, 'Book_D': 3}
You can also create the `label` using `np.unique`. Setting the argument `return_inverse` of `np.unique` to True gives the same result as `fit_transform` of `LabelEncoder`. In addition, you get the corresponding names (`name_result` above) at the same time.
#Label encoding using np.unique
name_result, encoded = np.unique(df_all["Name"], return_inverse=True)
print(encoded)
print(name_result)
output
[0 1 2 0 2 3]
['Book_A' 'Book_B' 'Book_C' 'Book_D']
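Putting the two together, `np.unique` and `np.bincount` can do the whole weighted count without sklearn. A sketch, rebuilding the same combined table locally so it runs on its own:

```python
import numpy as np
import pandas as pd

# Same contents as df_all above, recreated here for a self-contained example
df_all = pd.DataFrame({"Name":  ["Book_A", "Book_B", "Book_C", "Book_A", "Book_C", "Book_D"],
                       "Count": [1, 2, 3, 2, 1, 3]})

# One pass: unique names plus the integer codes needed by bincount
names, encoded = np.unique(df_all["Name"], return_inverse=True)
counts = np.bincount(encoded, weights=df_all["Count"]).astype(int)

result = dict(zip(names.tolist(), counts.tolist()))
print(result)  # -> {'Book_A': 3, 'Book_B': 2, 'Book_C': 4, 'Book_D': 3}
```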
You can also do the weighted count with a for statement instead of `np.bincount` [^3].
#Create a zero-padded array with the same length as the dictionary you want
unique_length = len(name_result)
count_result = np.zeros(unique_length, dtype=int)
#For each label i, extract the rows whose encoded value is i and sum their Count values
for i in range(unique_length):
    count_result[i] = df_all.loc[encoded == i, "Count"].sum()
result = dict(zip(name_result, count_result))
print(result)
output
{'Book_A': 3, 'Book_B': 2, 'Book_C': 4, 'Book_D': 3}
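Since the question also asks about other methods: pandas itself can do this aggregation directly with `groupby`, with no encoding step at all. A sketch using the same two tables:

```python
import pandas as pd

df_01 = pd.DataFrame([["Book_A", 1], ["Book_B", 2], ["Book_C", 3]],
                     columns=["Name", "Count"])
df_02 = pd.DataFrame([["Book_A", 2], ["Book_C", 1], ["Book_D", 3]],
                     columns=["Name", "Count"])

# Concatenate, then sum Count per Name
result = pd.concat([df_01, df_02]).groupby("Name")["Count"].sum().to_dict()
print(result)  # -> {'Book_A': 3, 'Book_B': 2, 'Book_C': 4, 'Book_D': 3}
```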
## 2. Create a `Counter` per table and add them up

The `Counter` class in the standard library module `collections` is often introduced for **unweighted** counting.
from collections import Counter
a = ["A", "B", "C", "A"]
#Give Counter a list and do unweighted counting
counter = Counter(a)
print(counter)
#Access to elements is the same as a dictionary
print("A:", counter["A"])
output
Counter({'A': 2, 'B': 1, 'C': 1})
A: 2
Also, when the data has already been aggregated, as in this case, you can create a `Counter` by first converting the data to a dict and passing that.
counter = Counter(dict([["Book_A", 1],
["Book_B", 2],
["Book_C", 3]]))
print(counter)
output
Counter({'Book_A': 1, 'Book_B': 2, 'Book_C': 3})
Incidentally, `Counter` objects support arithmetic operations.
Reference: Various ways to check the number of occurrences of an element with Python's Counter
For our purpose, taking the sum is enough.
from collections import Counter
a = ["A", "B", "C", "A"]
b = ["C", "D"]
counter_a = Counter(a)
counter_b = Counter(b)
#Can be added with sum
counter_ab = sum([counter_a, counter_b], Counter())
print(counter_ab)
output
Counter({'A': 2, 'C': 2, 'B': 1, 'D': 1})
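For just two Counters, the `+` operator gives the same result; `sum()` with an initial `Counter()` mainly pays off when merging a whole list of them. A quick check:

```python
from collections import Counter

counter_a = Counter(["A", "B", "C", "A"])
counter_b = Counter(["C", "D"])

# Counter supports + directly; counts for shared keys are added
merged = counter_a + counter_b  # counts: A=2, B=1, C=2, D=1
print(merged)
```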
import pandas as pd
from collections import Counter
#Data preparation
df_01 = pd.DataFrame([["Book_A", 1],
["Book_B", 2],
["Book_C", 3]],
columns=["Name", "Count"])
df_02 = pd.DataFrame([["Book_A", 2],
["Book_C", 1],
["Book_D", 3]],
columns=["Name", "Count"])
#Creating a Counter
counter_01 = Counter(dict(df_01[["Name", "Count"]].values))
counter_02 = Counter(dict(df_02[["Name", "Count"]].values))
#Calculate the sum
#Supplement: the second argument of sum is the initial value.
#Here an empty Counter is used; the default of 0 (int) cannot be added to a Counter.
result = sum([counter_01, counter_02], Counter())
print(result)
output
Counter({'Book_C': 4, 'Book_A': 3, 'Book_D': 3, 'Book_B': 2})
~~Apparently the result is sorted in descending order of count.~~
## 3. Add/update values in an empty dictionary

If you give a dictionary multiple `value`s for the same `key`, the earlier entries are overwritten by the last given `value`.
print( {"A": 1, "B": 2, "C": 3, "A":10} )
output
{'A': 10, 'B': 2, 'C': 3}
Using this, to update the count of an existing `key`, you can **get the existing value**, add the value you want to **add**, and append the result at the end.
Also, you can expand an existing dictionary into a new one by prefixing the variable with `**` (two stars).
Reference: [Python] Function arguments * (star) and ** (double star)
#Existing dictionary
d = {"A": 1, "B": 2, "C": 3}
#Element to add value
k = "A"
v = 10
#update
d = {**d, k: d[k]+v}  # equivalent to {"A": 1, "B": 2, "C": 3, "A": 1+10}
print(d)
output
{'A': 11, 'B': 2, 'C': 3}
However, if you specify a `key` that does not exist in the dictionary, `d[k]` raises an error, so this cannot add a new `key` as is.
Therefore, we use the dictionary's `get()` method. With `get()` you can specify a default value to return when the `key` does not exist in the dictionary.
Reference: Get a value from a key with the get method of a Python dictionary (non-existent keys are OK)
Reference: Get value from key with get method of Python dictionary (key that does not exist is OK)
d = {"A": 1, "B": 2, "C": 3}
#Specify an existing key
print(d.get("A", "NO KEY"))
#Specify a key that does not exist
print(d.get("D", "NO KEY"))
output
1
NO KEY
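As an aside, the standard library's `collections.defaultdict(int)` gives the same "missing keys count as 0" behavior without calling `get()` at every update; this is an alternative technique not used in the code below:

```python
from collections import defaultdict

d = defaultdict(int)  # missing keys default to int(), i.e. 0
for key, w in [("A", 1), ("B", 2), ("A", 10)]:
    d[key] += w       # no KeyError even on the first occurrence of a key
print(dict(d))        # -> {'A': 11, 'B': 2}
```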
By setting the default value to 0, additions and updates can be handled in the same way.
Putting this together, the code that does a weighted count by adding/updating values into an empty dictionary is as follows.
import pandas as pd
from itertools import chain
#Data preparation
df_01 = pd.DataFrame([["Book_A", 1],
["Book_B", 2],
["Book_C", 3]],
columns=["Name", "Count"])
df_02 = pd.DataFrame([["Book_A", 2],
["Book_C", 1],
["Book_D", 3]],
columns=["Name", "Count"])
#Convert data frame to dictionary
data1 = dict(df_01[["Name", "Count"]].values)
data2 = dict(df_02[["Name", "Count"]].values)
#Function definitions
chain_items = lambda data: chain.from_iterable(d.items() for d in data)  # returns the (key, value) pairs of multiple dictionaries, chained together
add_elem = lambda acc, e: {**acc, e[0]: acc.get(e[0], 0) + e[1]}  # adds an element to the dictionary, updating the value if the key already exists
#A function that receives multiple dictionaries whose keys are elements and values are weights, and merges them
def merge_count(*data):
    result = {}
    for e in chain_items(data):
        result = add_elem(result, e)
    return result
print( merge_count(data1, data2) )
output
{'Book_A': 3, 'Book_B': 2, 'Book_C': 4, 'Book_D': 3}
## 3'. `reduce` instead of the for statement in 3

With `reduce`, you can iterate without writing a for statement.
`reduce` takes the following arguments:

- First argument: a function that takes the accumulated result so far and the current value
- Second argument: an iterable (list, generator, etc.)
- Third argument (optional): the initial value. If omitted, the first element of the iterable is used as the initial value.
from functools import reduce
func = lambda ans, x: ans * x
a = [1, 2, 3, 4]
start = 10
print(reduce(func, a, start))
output
240 # 10*1 = 10
# -> 10*2 = 20
# -> 20*3 = 60
# -> 60*4 = 240
Recreating the above `merge_count` using `reduce` gives:
from functools import reduce
merge_count = lambda *data: reduce(add_elem, chain_items(data), {})  # equivalent to merge_count above
print( merge_count(data1, data2) )
output
{'Book_A': 3, 'Book_B': 2, 'Book_C': 4, 'Book_D': 3}
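The same `reduce` pattern also works directly on the `Counter` objects from method 2, using `operator.add` as the merging function. A sketch combining methods 2 and 3':

```python
from collections import Counter
from functools import reduce
import operator

# The two monthly tables as Counters, as in method 2
counters = [Counter({"Book_A": 1, "Book_B": 2, "Book_C": 3}),
            Counter({"Book_A": 2, "Book_C": 1, "Book_D": 3})]

# Fold the list of Counters into one, starting from an empty Counter
result = reduce(operator.add, counters, Counter())
print(dict(result))  # -> {'Book_A': 3, 'Book_B': 2, 'Book_C': 4, 'Book_D': 3}
```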
The following site was very helpful for understanding `reduce`.
Reference: Introduction to Functional Programming
- [Meaning of weight in numpy.bincount [Categorical variable encoding]](https://qiita.com/ground0state/items/f516b97c7a8641e474c4)
- [[Python] Enumerating list elements: how to use collections.Counter](https://qiita.com/ellio08/items/259388b511e24625c0d7)
- [Various ways to check the number of occurrences of an element with Python's Counter](https://www.headboost.jp/python-counter/)
- [[Python] Function arguments * (star) and ** (double star)](https://qiita.com/supersaiakujin/items/faee48d35f8d80daa1ac)
- [Introduction to Functional Programming](https://postd.cc/an-introduction-to-functional-programming/)
[^1]: I gave a contrived concrete example for clarity, but in reality this was used to aggregate morphological-analysis results over multiple documents.
[^2]: Execution speed, memory efficiency, etc.
[^3]: With my own knowledge I couldn't think of anything other than writing a for statement (excluding list comprehensions).