This is the classic example of "diapers and beer are bought together on Friday night". In market basket analysis, three indicators are calculated from sales data: support, confidence, and lift. This article focuses on implementing them with PySpark; see other articles for the analysis method itself. There are several on Qiita as well.
Support(A⇒B) = P(A∩B) = \frac{Number of baskets containing A and B}{Total number of baskets}
Confidence(A⇒B) = \frac{P(A∩B)}{P(A)} = \frac{Number of baskets containing A and B}{Number of baskets containing A}
Expected confidence(A⇒B) = P(B) = \frac{Number of baskets containing B}{Total number of baskets}
Lift(A⇒B) = \frac{P(A∩B)}{P(A)P(B)} = \frac{Confidence}{Expected confidence}
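As a quick sanity check with numbers that appear later in this article (736 baskets contain both other vegetables and whole milk, 2513 baskets contain whole milk, and there are 9835 baskets in total):
Support = 736 / 9835 ≈ 7.48%
Confidence(other vegetables⇒whole milk) = 736 / 2513 ≈ 29.3%
Expected confidence = 2513 / 9835 ≈ 25.6%
Lift = 29.3 / 25.6 ≈ 1.15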
I will use the Groceries dataset familiar from association analysis examples in the R language. There are many commentary articles and YouTube videos about it, so it is easy to check the calculation results. The file has one line per basket and contains 9835 baskets in total. A line is sometimes called a transaction. The first five lines are as follows.
groceries.csv
citrus fruit,semi-finished bread,margarine,ready soups
tropical fruit,yogurt,coffee
whole milk
pip fruit,yogurt,cream cheese ,meat spreads
other vegetables,whole milk,condensed milk,long life bakery product
support.py
# -*- coding: utf-8 -*-
import sys
from itertools import combinations
from pprint import pprint

from pyspark import SparkContext

# Read the data, trimming each item and normalizing to lowercase
sc = SparkContext()
baskets = (
    sc.textFile(sys.argv[1])
    .map(lambda row: {word.strip().lower() for word in row.split(",")})
).cache()

# Total number of baskets
total = float(baskets.count())

result = (
    baskets
    # Assign an ID to each basket
    .zipWithIndex()
    # Build product pairs; sorting keeps each pair in a stable order
    .flatMap(lambda items_id: [(tuple(sorted(pair)), (items_id[1],))
                               for pair in combinations(items_id[0], 2)])
    # Collect the IDs of the baskets containing each pair
    .reduceByKey(lambda a, b: a + b)
    # Count the baskets per pair
    .map(lambda pair_baskets: (pair_baskets[0], len(pair_baskets[1])))
    # Add the support (%)
    .map(lambda pair_count: (pair_count[0],
                             (pair_count[1], pair_count[1] / total * 100)))
    # Sort by support in descending order
    .sortBy(lambda pair_stats: -pair_stats[1][1])
)

# Show the top 10 pairs by support
pprint(result.take(10))
(other vegetables, whole milk) tops the list with a count of 736 out of 9835 baskets, a support of 7.48%. The rest are combinations like bread and milk or milk and yogurt, which look reasonable for Western shopping data.
$ spark-submit support.py groceries.csv
[((u'other vegetables', u'whole milk'), (736, 7.483477376715811)),
((u'rolls/buns', u'whole milk'), (557, 5.663446873411286)),
((u'whole milk', u'yogurt'), (551, 5.602440264361973)),
((u'root vegetables', u'whole milk'), (481, 4.89069649211998)),
((u'other vegetables', u'root vegetables'), (466, 4.738179969496695)),
((u'other vegetables', u'yogurt'), (427, 4.341637010676156)),
((u'other vegetables', u'rolls/buns'), (419, 4.260294865277071)),
((u'tropical fruit', u'whole milk'), (416, 4.229791560752415)),
((u'soda', u'whole milk'), (394, 4.006100660904932)),
((u'rolls/buns', u'soda'), (377, 3.833248601931876))]
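If you want to double-check the top result without Spark, a few lines of plain Python will do (assuming groceries.csv is in the current directory):
# Count the baskets that contain both items, without Spark
count = 0
with open("groceries.csv") as f:
    for line in f:
        items = {w.strip().lower() for w in line.split(",")}
        if {"other vegetables", "whole milk"} <= items:
            count += 1
print(count)  # expect 736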
Next, a short detour: what do the worst 10 look like? All we have to do is make the sortBy key ascending, as in the one-liner below. Mayonnaise and white wine, brandy and candy, chewing gum and red wine, artificial sweetener and dog food, jam and light bulbs, and so on.
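The only change to support.py is the sort key:
    # Sort by support in ascending order instead
    .sortBy(lambda pair_stats: pair_stats[1][1])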
[((u'mayonnaise', u'white wine'), (1, 0.010167768174885612)),
((u'chewing gum', u'red/blush wine'), (1, 0.010167768174885612)),
((u'chicken', u'potato products'), (1, 0.010167768174885612)),
((u'brandy', u'candy'), (1, 0.010167768174885612)),
((u'chewing gum', u'instant coffee'), (1, 0.010167768174885612)),
((u'artif. sweetener', u'dog food'), (1, 0.010167768174885612)),
((u'meat spreads', u'uht-milk'), (1, 0.010167768174885612)),
((u'baby food', u'rolls/buns'), (1, 0.010167768174885612)),
((u'baking powder', u'frozen fruits'), (1, 0.010167768174885612)),
((u'jam', u'light bulbs'), (1, 0.010167768174885612))]
Since (X⇒Y) and (Y⇒X) have different confidences, I used permutations instead of combinations so that both directions are listed.
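To see the difference in a standalone snippet (the item names here are just illustrative):
from itertools import combinations, permutations

items = ["milk", "bread"]
print(list(combinations(items, 2)))   # [('milk', 'bread')] -- one per unordered pair
print(list(permutations(items, 2)))   # [('milk', 'bread'), ('bread', 'milk')] -- both directions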
confidence.py
# -*- coding: utf-8 -*-
import sys
from itertools import permutations
from pprint import pprint

from pyspark import SparkContext

# Read the data, trimming each item and normalizing to lowercase
sc = SparkContext()
baskets = (
    sc.textFile(sys.argv[1])
    .map(lambda row: {word.strip().lower() for word in row.split(",")})
).cache()

# Total number of baskets
total = float(baskets.count())

# Assign an ID to each basket
baskets_with_id = baskets.zipWithIndex()

# Build (product pair, number of baskets containing it)
pair_count = (
    baskets_with_id
    .flatMap(lambda items_id: [(pair, (items_id[1],))
                               for pair in permutations(items_id[0], 2)])
    # Collect the IDs of the baskets containing each pair
    .reduceByKey(lambda a, b: a + b)
    # Count the baskets to get (pair, count)
    .map(lambda pair_baskets: (pair_baskets[0], len(pair_baskets[1])))
)

# Number of baskets containing product X
x_count = (
    baskets_with_id
    .flatMap(lambda items_id: [(x, (items_id[1],)) for x in items_id[0]])
    # Collect the IDs of the baskets containing product X
    .reduceByKey(lambda a, b: a + b)
    # Count the baskets to get (x, count)
    .map(lambda x_baskets: (x_baskets[0], len(x_baskets[1])))
)

# Calculate the confidence of X => Y
confidence = (
    pair_count
    # Re-key by X so we can join with x_count: (x, (pair, xy_count))
    .map(lambda pair_cnt: (pair_cnt[0][0], pair_cnt))
    .join(x_count)
    # joined = (x, ((pair, xy_count), x_count)); add the confidence (%)
    .map(lambda joined: (joined[1][0][0],
                         (joined[1][0][1],
                          joined[1][1],
                          float(joined[1][0][1]) / joined[1][1] * 100)))
    # Sort by confidence in descending order
    .sortBy(lambda pair_stats: -pair_stats[1][2])
)

pprint(confidence.take(10))
The result is a tuple of ((product X, product Y), (number of baskets containing both X and Y, number of baskets containing X, confidence %)). Sorting by confidence puts 100%-confidence rules at the top, but these are just rare combinations that appear only once.
$ spark-submit confidence.py groceries.csv
[((u'baby food', u'waffles'), (1, 1, 100.0)),
((u'baby food', u'cake bar'), (1, 1, 100.0)),
((u'baby food', u'dessert'), (1, 1, 100.0)),
((u'baby food', u'brown bread'), (1, 1, 100.0)),
((u'baby food', u'rolls/buns'), (1, 1, 100.0)),
((u'baby food', u'soups'), (1, 1, 100.0)),
((u'baby food', u'chocolate'), (1, 1, 100.0)),
((u'baby food', u'whipped/sour cream'), (1, 1, 100.0)),
((u'baby food', u'fruit/vegetable juice'), (1, 1, 100.0)),
((u'baby food', u'pastry'), (1, 1, 100.0))]
So I sorted instead by (number of baskets containing X, number of baskets containing both X and Y); the sortBy key becomes the one-liner below, and the results follow it. Whole milk is bought the most, with an 11% to 29% confidence that vegetables, bread, or yogurt is bought together with it.
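The sortBy key in confidence.py becomes:
    # Sort by X's basket count, then by the pair's basket count, both descending
    .sortBy(lambda pair_stats: (-pair_stats[1][1], -pair_stats[1][0]))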
[((u'whole milk', u'other vegetables'), (736, 2513, 29.287703939514525)),
((u'whole milk', u'rolls/buns'), (557, 2513, 22.16474333465977)),
((u'whole milk', u'yogurt'), (551, 2513, 21.92598487863112)),
((u'whole milk', u'root vegetables'), (481, 2513, 19.140469558296857)),
((u'whole milk', u'tropical fruit'), (416, 2513, 16.55391961798647)),
((u'whole milk', u'soda'), (394, 2513, 15.678471945881418)),
((u'whole milk', u'bottled water'), (338, 2513, 13.450059689614008)),
((u'whole milk', u'pastry'), (327, 2513, 13.01233585356148)),
((u'whole milk', u'whipped/sour cream'), (317, 2513, 12.614405093513728)),
((u'whole milk', u'citrus fruit'), (300, 2513, 11.937922801432551))]
The source code for the lift calculation has gone missing. I will post it as soon as I find it.
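In the meantime, here is a minimal sketch of how the lift could be computed in the same style as confidence.py. This is my reconstruction, not the lost script: the file name lift.py is hypothetical, and I assume the values were scaled by 100 like the other scripts, since that matches the magnitude of the output below.
lift.py (reconstruction)
# -*- coding: utf-8 -*-
import sys
from itertools import permutations
from pprint import pprint

from pyspark import SparkContext

# Read the data, trimming each item and normalizing to lowercase
sc = SparkContext()
baskets = (
    sc.textFile(sys.argv[1])
    .map(lambda row: {word.strip().lower() for word in row.split(",")})
).cache()

# Total number of baskets
total = float(baskets.count())

# (product pair, number of baskets containing both); each basket is a set,
# so counting 1s per basket is enough here
pair_count = (
    baskets
    .flatMap(lambda items: [(pair, 1) for pair in permutations(items, 2)])
    .reduceByKey(lambda a, b: a + b)
)

# (product, number of baskets containing it)
item_count = (
    baskets
    .flatMap(lambda items: [(x, 1) for x in items])
    .reduceByKey(lambda a, b: a + b)
)

# lift(X=>Y) = P(X∩Y) / (P(X)P(Y)) = xy_count * total / (x_count * y_count)
lift = (
    pair_count
    # Re-key by X and join X's count: (x, ((pair, xy_count), x_count))
    .map(lambda pc: (pc[0][0], pc))
    .join(item_count)
    # Re-key by Y and join Y's count: (y, ((pair, xy_count, x_count), y_count))
    .map(lambda j: (j[1][0][0][1], (j[1][0][0], j[1][0][1], j[1][1])))
    .join(item_count)
    # Compute the lift; the x100 scaling is an assumption to match the output below
    .map(lambda j: (j[1][0][0],
                    j[1][0][1] * total / (j[1][0][2] * j[1][1]) * 100))
    # Sort by lift in descending order
    .sortBy(lambda pair_lift: -pair_lift[1])
)

pprint(lift.take(10))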
In any case, people who buy snack-like items seem far more likely to buy them together with alcohol than alone (laughs).
[((u'cocoa drinks', u'preservation products'), 22352.27272727273),
((u'preservation products', u'cocoa drinks'), 22352.272727272728),
((u'finished products', u'baby food'), 15367.1875),
((u'baby food', u'finished products'), 15367.1875),
((u'baby food', u'soups'), 14679.104477611942),
((u'soups', u'baby food'), 14679.10447761194),
((u'abrasive cleaner', u'preservation products'), 14050.000000000002),
((u'preservation products', u'abrasive cleaner'), 14050.0),
((u'cream', u'baby cosmetics'), 12608.97435897436),
((u'baby cosmetics', u'cream'), 12608.974358974358)]
We have now walked through a market basket analysis with PySpark.
This article was written a long time ago and left as a draft, so it may not work with the current PySpark.