to_dict did not give me the result I wanted, so I wrote the conversion myself. I wanted to run collaborative filtering on purchase data, but that was awkward to do with the data frame as it was. I also wanted to try recommendation logic like that in the book Programming Collective Intelligence, so I needed to convert the data frame I had on hand into a form that such logic can use.
# coding: utf-8
import pandas as pd
from collections import defaultdict
df = pd.DataFrame({'id':['a','a','b','b','c',], 'shouhin':['x', 'y', 'y','z', 'x']})
Suppose you have the following data
  id shouhin
0  a       x
1  a       y
2  b       y
3  b       z
4  c       x
The purpose is to change this to a dictionary like the one below.
{'a': ['y', 'x'], 'b': ['y', 'z'], 'c': ['x']}
First, create a dictionary with defaultdict. Then fetch each row with df.values and build a nested dictionary keyed first by id and then by product. (df.values returns a numpy.ndarray.)
tempdic = defaultdict(dict)
for d in df.values:
    tempdic[d[0]][d[1]] = 1.0  # Any value is acceptable
Then, you can do the following.
dic = {k: list(tempdic[k].keys()) for k in tempdic}  # list() is needed on Python 3, where keys() returns a view
Looking at dic, it is as expected (the order of the keys and of the items may differ depending on the Python version).
{'a': ['y', 'x'], 'c': ['x'], 'b': ['y', 'z']}
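The conversion above can be wrapped in a small function. The following is a minimal sketch, assuming Python 3; the helper name to_item_lists and its parameters are hypothetical and not part of the original code. For comparison it also shows a pandas-only alternative based on groupby, which is not the method used in this post and which, unlike the dictionary approach, would keep duplicate items.

```python
import pandas as pd
from collections import defaultdict


def to_item_lists(df, key_col, item_col):
    """Group the values of item_col by key_col into plain lists (hypothetical helper)."""
    temp = defaultdict(dict)
    # Selecting the two columns explicitly fixes the order to (key, item).
    for key, item in df[[key_col, item_col]].values:
        temp[key][item] = 1.0  # placeholder value; only the inner keys are used
    return {key: list(items) for key, items in temp.items()}


df = pd.DataFrame({'id': ['a', 'a', 'b', 'b', 'c'],
                   'shouhin': ['x', 'y', 'y', 'z', 'x']})
dic = to_item_lists(df, 'id', 'shouhin')

# Alternative using pandas only (not the method used in this post).
# It keeps duplicates if the same id/product pair appears more than once.
dic_groupby = df.groupby('id')['shouhin'].apply(list).to_dict()
```

On the sample data there are no repeated purchases, so both dictionaries hold the same products for each id.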
If you convert the lists to sets, you can get the products that two users have in common, which makes it easy to calculate the Jaccard coefficient. For example, the products common to a and b are:
{'y'}
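As a rough illustration of the set-based calculation, here is a minimal sketch. The function name jaccard is my own choice, and the dictionary is written out by hand so that the block runs on its own. The Jaccard coefficient is the size of the intersection divided by the size of the union.

```python
def jaccard(items_a, items_b):
    """Size of the intersection divided by the size of the union."""
    a, b = set(items_a), set(items_b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty baskets are treated as similarity 0
    return len(a & b) / float(len(union))


dic = {'a': ['x', 'y'], 'b': ['y', 'z'], 'c': ['x']}

common_ab = set(dic['a']) & set(dic['b'])  # products bought by both a and b
sim_ab = jaccard(dic['a'], dic['b'])       # 1 common product out of 3 distinct
sim_ac = jaccard(dic['a'], dic['c'])       # 1 common product out of 2 distinct
```

With the sample data, a and c come out as more similar than a and b, because c's only purchase is also in a's basket.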
Instead of iterating over df.values in the first step, you can also loop over the row numbers and get the elements of each row with df.iloc[row number].
That works, but it is much slower.
Purchase data tends to be quite large, so being slow at this step would be painful.
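One way to check the speed difference on your own machine is to time both loops with timeit. This is only a sketch: the function names and the repeated sample data are assumptions for illustration, and the measured times will depend on the environment.

```python
import timeit
from collections import defaultdict

import pandas as pd

# Sample data repeated to get 1000 rows (illustrative size only).
df = pd.DataFrame({'id': ['a', 'a', 'b', 'b', 'c'] * 200,
                   'shouhin': ['x', 'y', 'y', 'z', 'x'] * 200})


def build_with_values(df):
    temp = defaultdict(dict)
    for d in df.values:  # df.values is converted to a NumPy array once
        temp[d[0]][d[1]] = 1.0
    return {k: list(v) for k, v in temp.items()}


def build_with_iloc(df):
    temp = defaultdict(dict)
    for i in range(len(df)):
        row = df.iloc[i]  # builds a new Series for every row
        temp[row.iloc[0]][row.iloc[1]] = 1.0
    return {k: list(v) for k, v in temp.items()}


t_values = timeit.timeit(lambda: build_with_values(df), number=3)
t_iloc = timeit.timeit(lambda: build_with_iloc(df), number=3)
print('df.values: %.4f s, df.iloc: %.4f s' % (t_values, t_iloc))
```

Both functions build the same dictionary, so only the time spent fetching rows differs.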
There is probably also a way to do it all in one pass using while or if, but here too I gave priority to speed,
so I try not to use such methods.