
The IBM algorithms

Agrawal et al. [ATA93, AS94] at IBM proposed two algorithms for mining generalized association rules, given a database and a taxonomy over its attributes. One is the basic algorithm; the other is an improved version called Cumulate. Since our medical database is very different from IBM's market-basket database, we designed our own data structure and improved the Cumulate algorithm to fit our purpose of mining the most simplified generalization rules in the medical database.

We first describe the IBM algorithms, and then present our algorithm and improvements.

Algorithm Basic The Basic algorithm makes several passes over the whole database until no new candidates can be generated. In pass k, it counts the support of each candidate of size k (i.e., containing k items). So in the first pass, it simply determines the frequent 1-itemsets. In each subsequent pass k, it first generates the set C(k) of candidates of size k from the frequent itemsets L(k-1) found in the previous pass, and then goes through the database to obtain the support of each candidate. To mine generalized rules, the items in a candidate itemset may come from the leaves or from the interior nodes of the taxonomy tree. To check whether a tuple supports a candidate C, the algorithm first adds to the tuple all ancestors of each of its items, and then checks whether the extended tuple contains C. The pseudocode is as follows:

L1 := frequent 1-itemsets;
k  := 2;  // k represents the pass number
while( L(k-1) is not empty ), do
begin
  C(k) := New candidates of size k generated from L(k-1).
  forall transactions t in D, do
  begin
    Add all ancestors of each item in t to t, removing any duplicates.
    Increment the count of all candidates in C(k) that are contained
    in t.
  end
  L(k) := All candidates in C(k) with minimum support.
  k    := k + 1;
end
Answer := the union of all L(k);
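
To make the control flow concrete, here is a minimal Python sketch of the Basic algorithm. It assumes the taxonomy is given as a child-to-parent map and uses the naive Apriori join/prune for candidate generation; the names (basic, transactions, parent, min_support) are our own, and this is an illustrative sketch rather than IBM's implementation.

    from itertools import combinations

    def ancestors(item, parent):
        """All ancestors of `item` under a child -> parent taxonomy map."""
        result = set()
        while item in parent:
            item = parent[item]
            result.add(item)
        return result

    def basic(transactions, parent, min_support):
        # Extend each transaction once with the ancestors of its items.
        # (The pseudocode re-extends t in every pass; extending once up
        # front is equivalent for counting purposes.)
        extended = []
        for t in transactions:
            ext = set(t)
            for x in t:
                ext |= ancestors(x, parent)
            extended.append(ext)

        # Pass 1: frequent 1-itemsets.
        counts = {}
        for t in extended:
            for x in t:
                counts[x] = counts.get(x, 0) + 1
        L = {frozenset([x]) for x, n in counts.items() if n >= min_support}
        answer = set(L)

        k = 2
        while L:
            # Naive candidate generation: all k-subsets of frequent items
            # whose (k-1)-subsets are all frequent (Apriori join/prune).
            items = sorted(set().union(*L))
            Ck = [frozenset(c) for c in combinations(items, k)
                  if all(frozenset(s) in L for s in combinations(c, k - 1))]
            counts = {c: 0 for c in Ck}
            for t in extended:
                for c in Ck:
                    if c <= t:  # the extended tuple supports c
                        counts[c] += 1
            L = {c for c, n in counts.items() if n >= min_support}
            answer |= L
            k += 1
        return answer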

Algorithm Cumulate The Cumulate algorithm is an extension of the Basic algorithm. It mainly has three improvements:

Optimization 1: filtering the ancestors that get added to a transaction, keeping only those ancestors that appear in at least one candidate.

Optimization 2: pre-computing, for each item, the set of its ancestors in the taxonomy, rather than walking the taxonomy for every transaction.

Optimization 3: pruning any candidate that contains both an item and one of its ancestors, since such an itemset has the same support as the itemset without the ancestor.

The pseudocode of this algorithm is as follows:

Compute T', the set of ancestors of each item, from T.  // Optimization 2
L1 := frequent 1-itemsets;
k  := 2;  // k represents the pass number
while( L(k-1) is not empty ), do
begin
  C(k) := New candidates of size k generated from L(k-1).
  if( k == 2 ) then
    Delete any candidate in C(2) that consists of an item and its
    ancestor.  // Optimization 3
  Delete any ancestors in T' that are not present in any of the
  candidates in C(k).  // Optimization 1
  forall transactions t in D, do
  begin
    for each item x in t, do
      Add all ancestors of x in T' to t.
    Remove any duplicates from t.
    Increment the count of all candidates in C(k) that are contained
    in t.
  end
  L(k) := All candidates in C(k) with minimum support.
  k    := k + 1;
end
Answer := the union of all L(k);
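
The per-pass bookkeeping behind these optimizations can be sketched in Python as follows. The helper ancestors is the one from the previous sketch, the names (cumulate_setup, prune_pass, T_prime) are our own, and candidates are assumed to be frozensets produced by the same join/prune step as in Basic; this is an illustrative sketch, not IBM's implementation.

    def cumulate_setup(parent):
        # Optimization 2: pre-compute T', the full ancestor set of every
        # item, once before the passes begin.
        return {x: ancestors(x, parent) for x in parent}

    def prune_pass(Ck, T_prime, k):
        # Optimization 3: a 2-itemset {x, ancestor-of-x} always has the
        # same support as {x}, so such candidates are redundant.
        if k == 2:
            Ck = [c for c in Ck
                  if not any(a in c for x in c for a in T_prime.get(x, ()))]

        # Optimization 1: keep in T' only the ancestors that occur in
        # some candidate, so tuples are extended with useful items only.
        needed = set().union(*Ck) if Ck else set()
        T_pruned = {x: anc & needed for x, anc in T_prime.items()}
        return Ck, T_pruned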

Our improved algorithm:

LH := empty;
L1 := {frequent 1-itemsets};

Extract all of the very high confidence and high support candidates
from L1 and add them to LH.

k  := 2;
while( L(k-1) is not empty ), do
begin
  C(k) := New candidates of size k generated from L(k-1).
  if( k == 2 ) then
    Delete any candidate in C(k) that consists of an item and its
    ancestor.
  for each tuple t in the database, do
    Increment the count of all candidates in C(k) that are contained
    in t.
  end
  L(k) := All candidates in C(k) with minimum support.
  Extract all of the very high confidence and high support candidates
  from L(k) and add them to LH.
  k    := k + 1;
end

Output all of the non-empty L(k), and LH.

One difference between the above algorithm and the IBM Cumulate algorithm is that we do not need to extend the tuples in the database by adding to each tuple the ancestors of its items. This saves both the space and the time required to match a tuple against a candidate. The benefit comes from our special data structure: we store a template for each node in the taxonomy tree, and any child of the node sets one or several bits inside that template. The node is therefore supported as long as any one of its children appears in the tuple. Since a candidate is simply a list of nodes in the taxonomy trees, the templates already account for the ancestors of each item in the tuple, without any extension, as sketched below.
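
The following is a minimal sketch of this template structure; the encoding, the function names (build_templates, supports), and the toy taxonomy are all our own illustrative choices. Each leaf of the taxonomy is assigned one bit position, each node's template ORs together the bits of the leaves below it, and a tuple supports a node exactly when the tuple's bitmask intersects that node's template.

    def build_templates(children, node, bit_of, templates):
        # Compute each node's template: the OR of the bits of all leaves
        # in its subtree (a leaf carries a single bit of its own).
        if node not in children:                 # leaf
            templates[node] = 1 << bit_of[node]
        else:
            mask = 0
            for child in children[node]:
                mask |= build_templates(children, child, bit_of, templates)
            templates[node] = mask
        return templates[node]

    def supports(tuple_bits, candidate, templates):
        # A tuple supports a candidate iff every node in the candidate
        # has at least one descendant present in the tuple, i.e. the
        # tuple's bitmask intersects that node's template.
        return all(tuple_bits & templates[node] for node in candidate)

    # Hypothetical toy taxonomy: drug -> antibiotic -> {penicillin,
    # tetracycline}.
    children = {"drug": ["antibiotic"],
                "antibiotic": ["penicillin", "tetracycline"]}
    bit_of = {"penicillin": 0, "tetracycline": 1}
    templates = {}
    build_templates(children, "drug", bit_of, templates)

    t = 1 << bit_of["penicillin"]   # a tuple containing only penicillin
    assert supports(t, ["antibiotic", "drug"], templates)  # no extension needed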

The other difference is that if a candidate c in L(k) has very high support and confidence, we extract it from L(k) and output it as an association rule. Therefore, in the following passes, we will not generate candidates that contain c as a subset. The reason for this improvement was discussed earlier.
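
This extraction step could be sketched as follows; the thresholds and the confidence function are hypothetical parameters supplied by the caller, and none of these names appear in the original.

    def extract_high(Lk, support, confidence, hi_sup, hi_conf, LH):
        # Candidates exceeding both thresholds are moved to LH and
        # withheld from later candidate generation, so no superset of
        # them is ever built.
        hi = {c for c in Lk
              if support[c] >= hi_sup and confidence(c) >= hi_conf}
        LH |= hi
        return Lk - hi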


