0%

Apriori Association (Implementation)

Apriori Association is a type of unsupervised learning technique. It looks for the relation of one entity to another entity and then maps them accordingly so that it can be more appropriate. The entity can be anything ranging from grocery items like milk to clothing items like a shirt. The technique finds some interesting associations or relations among the variables of dataset. Apriori Association uses a Hash Tree search and breadth-first to calculate the associations in the entities effectively. The process of finding the frequent items for a massive dataset is an iterative process. This algorithm was introduced by the R. Agrawal and Srikant in 1994. It is majorly utilised for market basket analysis. It helps in finding those products which can be bought together. It can also be used in the medical dimain to find potential drug reactions for patients. One of the use case of Association is inside malls where the store owner can use this unsupervised machine learning algorithms to find association between multiple items like eggs, milk, bread or vegetables after taking and reading the costumer insights. If the relation is high, the store owner will keep them together in the store and increase the sales as the probability of the costumer buying them is much higher.
With that said, let's start.
To download the dataset I have used in this article- click here

You will understand more about the model in the article in coming pages

Simple step by Step Process

Please follow this step by step process to get a high view understanding of Association and Apriori Association which is one of the important concept in Machine Learning and Computer Science.
I will execute the model on a sample dataset with by solving a typical association problem in simple steps with a workable accuracy for your understanding.

First we import libraries

In [2]:
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt

Now understanding our downloaded dataset

In [3]:
df = pd.read_csv('090.csv', header = None)
In [4]:
df.head()
Out[4]:
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19
0 shrimp almonds avocado vegetables mix green grapes whole weat flour yams cottage cheese energy drink tomato juice low fat yogurt green tea honey salad mineral water salmon antioxydant juice frozen smoothie spinach olive oil
1 burgers meatballs eggs NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 chutney NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
3 turkey avocado NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
4 mineral water milk energy bar whole wheat rice green tea NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN

Brief on the problem we are dealing with


The dataset is having 20 columns and 7501 columns, we have 20 food items in a grocery store as columns and the rows represents the costumer

This data explains that costumer 0 buys shrimp, almonds, avocado together
Costumer 1 buys burgers, meatballs and eggs toghter and so on

To visualize this data, let us take this image into imagination

Here is the transaction per costumers, the model's role is to understand what kind of items the customer bought together and then give us the output what is the association of item 1 with another, for instance it will tell us that wrt customer buying patterns, how is milk related to egges

In [5]:
df.shape
Out[5]:
(7501, 20)

Data Transformation:

As aprori expects a list and not a dataframe So we will transform this dataset to a list in this step
In [6]:
print(type(df))
<class 'pandas.core.frame.DataFrame'>
In [16]:
listt = []
for i in range(0, 7051):
        listt.append([str(df.values[i,j]) for j in range(0,20)])
In [17]:
listt[:1]

Let's have a look to our data which should be in list form for the next step

Out[17]:
[['shrimp',
  'almonds',
  'avocado',
  'vegetables mix',
  'green grapes',
  'whole weat flour',
  'yams',
  'cottage cheese',
  'energy drink',
  'tomato juice',
  'low fat yogurt',
  'green tea',
  'honey',
  'salad',
  'mineral water',
  'salmon',
  'antioxydant juice',
  'frozen smoothie',
  'spinach',
  'olive oil'],
 [#1st Customer's Purchase Data,
  'burgers',
  'meatballs',
  'eggs',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan'],
 [#2nd Customer's Purchase Data,
  'chutney',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan',
  'nan']]

Model fitting and training on the dataset(list)

As aprori is an unsupervised learning method, We will download the pretrained apriori algorithm file and we will import aprori class from that file which we will later use to work with our store dataset
To download the Pretrained Aprori Model CLICK HERE

The file we need is apyori.py, Ignore other files and continue here

In [22]:

Importing the model

from apyori import apriori

Now, we will need to instantiate the class with an object with which we will call our class later

In [25]:
rules = apriori(listt, min_support = 0.003 , min_confidence = 0.2, min_lift = 3, min_length = 2)

Brief into basic association concepts and the parameters

The Apriori Association takes in the parameters: input list, minimum support, minumum confidence, minimum lift and minimum length

min_support

All right so let's start with the support,
support of a set of item 1 (e.g. milk) and item 2 (e.g egg) will be equal to the number of transactions executed (frequency) in our data of total 7500 transactions which had the frequency of egg and milk together (I) divided by the total number of transactions performed which is 7500 in this case.
The support argument that we're putting here is actually the minimum support you want to have in your rules.
That means that the items that are going to appear in your rules will have a higher support than this, i.e higher frequency of occurance (of milk and egg in this case) in the dataset than the min_support parameter
So what we must ask ourselves Is that what supports Do we want to have our different items in the rules

min_confidence

We can see that the minimum confidence of this rule is 0.2, 0.2 here implies 20 percent of occurances which implies that if for instance 100 people in California and same in Texas purchase mushroom cream sauce in the store, they have 20 percent chance of being as close together as well.

min_lift

We can see that the minimum confidence of this rule is 0.2, 0.2 here implies 20 percent of occurances which implies that if for instance 100 people in California and same in Texas purchase mushroom cream sauce in the store, they have 20 percent chance of being as close together as well.

min_length

min_length: This indicates what is the minimum number of items that should be in a transaction which in our case is 2, therefore we are considering the trnsactions or custumers data which are having minimum 2 in the cart at time of purchase.

Fitting the model

In this step we will fit our model to the dataset, it will now learn and generalize to the list data structure with name"rules" we will provide to it

In [27]:


rules = list(rules)

See results

In [31]:
rules[:1] #getting the 1st item from the output
Out[31]:
[RelationRecord(items=frozenset({'burgers', 'almonds'}), support=0.005531130336122536, ordered_statistics=[OrderedStatistic(items_base=frozenset({'almonds'}), items_add=frozenset({'burgers'}), confidence=0.2671232876712329, lift=3.102942835864684)])]

Here, as we can see. We have our first association from the output list from the Apriori Association algorithm which gives the associative value of "burgers" and "almonds" with support of 55%, confidence of only 26% and a lift of 3
We can say that burgers and almonds are not very well suited together
Let us see another output

In [31]:
rules[30:31] #getting the 30th item from the output
Out[31]:
[RelationRecord(items=frozenset({'milk', 'bread'}), support=0.009592235556175531, ordered_statistics=[OrderedStatistic(items_base=frozenset({'milk'}), items_add=frozenset({'bread'}), confidence=0.7771235575511127, lift=6.000555746668677)])]

Here, as we can see. We have our second association from the 30th item from the output list from the Apriori Association algorithm which gives the associative value of "milk" and "bread" with support of 96%, confidence of 77% and a lift of 6
We can say that milk and bread are highly bough together and must be kept together to increase store sales