Run Apriori from Python with Orange

Introduction

--I was looking for a Python implementation of Apriori and found that Orange provides one, so this is a memo of what I did when I tried it

Orange is component-based data mining software.

Caution

--Orange comes in two versions: the newer Orange 3 and the older Orange 2 (Orange 2.7.8 as of 9/11/2016)

--I could not find an Apriori module in Orange 3; there is no associate module

--So install Orange 2

Installation

The following is an example of installing on Ubuntu.

Download the source archive from the official site and extract it.

Build and install as described on the PyPI page (https://pypi.python.org/pypi/Orange/2.7)

python setup.py build
python setup.py install

If scipy turns out to be needed, install it as well

Apriori

For this I referred to "Association Analysis".

Prepare the data as follows: one transaction per line, with items separated by commas. The file extension must be .basket

$ more hayes-roth-train1-1.basket
a2,b2,c3,d4,D3
a3,b2,c1,d3,D1
<snip>
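
The basket format is simple enough to inspect without Orange. The following is a minimal sketch, not part of Orange: `parse_basket` is a hypothetical helper that turns such lines into a list of item sets, assuming one transaction per line and comma-separated items as in the two sample lines above.

```python
def parse_basket(lines):
    """Turn basket-format lines into a list of frozensets of item names.

    Hypothetical helper for illustration; this is not an Orange function.
    """
    transactions = []
    for line in lines:
        items = [item.strip() for item in line.split(",")]
        items = [item for item in items if item]  # drop empty fields
        if items:                                 # skip blank lines
            transactions.append(frozenset(items))
    return transactions


# The two sample lines from the file shown above
sample = ["a2,b2,c3,d4,D3", "a3,b2,c1,d3,D1"]
transactions = parse_basket(sample)
```

To read the real file, passing an open file object (for example `parse_basket(open('hayes-roth-train1-1.basket'))`) should work the same way, since iterating over a file yields its lines.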

Run

>>> import Orange
>>> data = Orange.data.Table('hayes-roth-train1-1.basket')
>>> rules = Orange.associate.AssociationRulesSparseInducer(data, support=0.03, confidence=0.2, classification_rules=1, store_examples=True)
>>> print "%4s %4s %4s  %4s" % ("Supp", "Conf", "Lift", "Rule")
Supp Conf Lift  Rule
>>> for r in rules[:5]:
...     print "%4.1f %4.1f %4.1f   %s" % (r.support, r.confidence, r.lift, r)
...
 0.0  0.2  3.6   b4 -> c4
 0.0  0.3  3.6   c4 -> b4
 0.0  0.2  7.2   c4 -> b4 a1
 0.0  0.5  6.0   c4 a1 -> b4
 0.0  0.2  7.2   c4 -> b4 a1 D3
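
To make the three columns concrete, here is a minimal pure-Python sketch of the textbook definitions of support, confidence, and lift for a rule left -> right. It is not how Orange computes them internally; `rule_metrics` and the toy transactions are made up for illustration. Note also that the format `%4.1f` rounds to one decimal place, so any support below 0.05 prints as 0.0 in the table above.

```python
def rule_metrics(transactions, left, right):
    """Return (support, confidence, lift) of the rule left -> right.

    Hypothetical helper using the textbook definitions:
      support    = P(left and right)
      confidence = P(left and right) / P(left)
      lift       = confidence / P(right)
    Assumes left and right each occur in at least one transaction.
    """
    left, right = frozenset(left), frozenset(right)
    n = float(len(transactions))
    n_left = sum(1 for t in transactions if left <= t)
    n_right = sum(1 for t in transactions if right <= t)
    n_both = sum(1 for t in transactions if (left | right) <= t)
    support = n_both / n
    confidence = n_both / float(n_left)
    lift = confidence / (n_right / n)
    return support, confidence, lift


# Toy data, invented for this example
toy = [frozenset(t) for t in (["a", "b"], ["a", "b", "c"], ["a"], ["c"])]
support, confidence, lift = rule_metrics(toy, ["a"], ["b"])
```

A lift above 1 means the left and right sides occur together more often than they would if they were independent.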

Extract rules

I tried this while reading the following documentation:

--Association rules and frequent itemsets
--Orange.data.Instance
--Orange.data.Value

>>> len(rules)
400
>>> rules[383]
b2 a2 -> c1
>>> rules[383].left
[], {"b2":1.000, "a2":1.000}
>>> rules[383].left.get_metas(str).keys()
['a2', 'b2']
>>> rules[383].right.get_metas(str).keys()
['c1']
>>> rules[383].confidence
0.7333333492279053
>>> rules[383].support
0.07692307978868484
>>> rules[383].n_applies_both
11.0
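
These numbers are consistent with each other. As a rough check, assume the data set has 143 transactions and that 15 of them match the left side. Neither count is printed above; both are inferred from the ratios. Under that assumption, support and confidence follow directly from n_applies_both. The long decimals look like single-precision values, which the sketch reproduces by rounding through a 32-bit float.

```python
import struct


def as_float32(x):
    """Round a Python float to single precision and back."""
    return struct.unpack("f", struct.pack("f", x))[0]


n_both = 11     # rules[383].n_applies_both from the session above
n_total = 143   # ASSUMPTION: number of transactions, inferred from the support
n_left = 15     # ASSUMPTION: transactions matching the left side, inferred

support = as_float32(n_both / float(n_total))    # fraction of all transactions
confidence = as_float32(n_both / float(n_left))  # fraction of left-side matches
print("support    = %r" % support)
print("confidence = %r" % confidence)
```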

Which records match the rule?

>>> rule = rules[383]
>>> for d in data:
...     if rule.appliesBoth(d):
...         print d
...
[], {"a2":1.000, "b2":1.000, "c1":1.000, "d3":1.000, "D1":1.000}
[], {"a2":1.000, "b2":1.000, "c1":1.000, "d2":1.000, "D2":1.000}
[], {"a2":1.000, "b2":1.000, "c1":1.000, "d2":1.000, "D2":1.000}
[], {"a2":1.000, "b2":1.000, "c1":1.000, "d2":1.000, "D2":1.000}
[], {"a2":1.000, "b2":1.000, "c1":1.000, "d3":1.000, "D2":1.000}
[], {"a2":1.000, "b2":1.000, "c1":1.000, "d4":1.000, "D3":1.000}
[], {"a2":1.000, "b2":1.000, "c1":1.000, "d1":1.000, "D1":1.000}
[], {"a2":1.000, "b2":1.000, "c1":1.000, "d3":1.000, "D1":1.000}
[], {"a2":1.000, "b2":1.000, "c1":1.000, "d1":1.000, "D1":1.000}
[], {"a2":1.000, "b2":1.000, "c1":1.000, "d2":1.000, "D2":1.000}
[], {"a2":1.000, "b2":1.000, "c1":1.000, "d1":1.000, "D1":1.000}

>>> rule.examples
Orange.data.Table 'table'
>>> rule.match_both
<2, 3, 5, 40, 87, 105, 111, 116, 118, 135, 137>
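
Conceptually, a record matches both sides of a rule when it contains every item of the left and right itemsets. The sketch below mimics that check in plain Python on a few toy records written in the same style as the data; `applies_both` and `match_both_indices` here are hypothetical stand-ins, not the Orange methods, and Orange's own implementation may differ.

```python
def applies_both(transaction, left, right):
    """True if the transaction contains all items of left and of right."""
    return (frozenset(left) | frozenset(right)) <= frozenset(transaction)


def match_both_indices(transactions, left, right):
    """Indices of the transactions that match both sides of the rule."""
    return [i for i, t in enumerate(transactions)
            if applies_both(t, left, right)]


# Toy records, invented for this example
records = [
    ["a2", "b2", "c1", "d3", "D1"],
    ["a3", "b2", "c1", "d3", "D1"],
    ["a2", "b2", "c3", "d4", "D3"],
    ["a2", "b2", "c1", "d2", "D2"],
]
# Rule "b2 a2 -> c1", as in rules[383]
indices = match_both_indices(records, ["b2", "a2"], ["c1"])
```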

Conclusion

--Next, I want to check whether it runs fast enough
--Is there any difference from scikit-learn?