
Database generalization

Immediate generalization Given an item A and its immediate parent B in the taxonomy, the immediate generalization rule of A is A → B. For example, if the taxonomy places the item cola directly under beverage, the immediate generalization rule of cola is cola → beverage.

Given a most simplified association rule set S, let W be the set of ancestor items of the rules in S. Naively, we could perform an exhaustive search over the subsets of W: for each subset, we generate the corresponding immediate generalization rules, verify their validity, and compute the information loss. At the end we select the valid generalization rule set with the least information loss.
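
To make the baseline concrete, here is a minimal Python sketch of the exhaustive search. The helper names immediate_rule, is_valid, and info_loss are assumptions for illustration; they are not defined in this paper.

    from itertools import combinations

    def exhaustive_search(W, immediate_rule, is_valid, info_loss):
        """Exhaustive baseline: try all 2^|W| subsets of W and keep the
        valid generalization rule set with the least information loss.
        immediate_rule, is_valid and info_loss are hypothetical helpers."""
        best, best_loss = None, float("inf")
        W = list(W)
        for r in range(len(W) + 1):
            for subset in combinations(W, r):
                # immediate generalization rules A -> parent(A) for the subset
                G = {immediate_rule(a) for a in subset}
                if is_valid(G):
                    loss = info_loss(G)
                    if loss < best_loss:
                        best, best_loss = G, loss
        return best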

But this is obviously too expensive, since the number of subsets is exponential in the size of W. Instead, we develop a greedy algorithm which is efficient and whose information loss is, in general, close to that of the optimal solution.

The algorithm works in several passes; a code sketch of the complete procedure follows the description below.

Pass One We compute W1, the set of ancestor items of the size-one association rules in S, and add the immediate generalization rules of W1 to the generalization rule set G1. We then use G1 to prune S, eliminating the association rules that no longer hold in the database generalized with G1. The result is S1, which contains no size-one association rules.

Pass Two We compute W2, the set of ancestor items of the size-two association rules in S1. For each item Ai in W2 we compute a heuristic function H(Ai), explained below. Let Am be the item with the highest H value. We add the immediate generalization rule of Am to G1, then use G1 to prune S1, eliminating the association rules that no longer hold in the generalized database. We repeat this pass until W2 is empty. The result is S2, which contains no size-one or size-two association rules.

Pass K We apply the same steps as in Pass Two to the size-k association rules.

Finally, we obtain G1. If B is an ancestor of A in the taxonomy and the immediate generalization rules of both A and B are in G1, we delete A's immediate generalization rule from G1, since it is subsumed by B's. The result is the pruned generalization rule set Gf.
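
The following Python sketch puts the passes and the final pruning together. It is a sketch under assumptions: rules are represented as frozensets of items, parent maps an item to its immediate parent in the taxonomy, and prune(S, G1) is a hypothetical helper that drops the rules no longer holding under G1; none of these names come from the paper.

    def greedy_generalization(S, parent, H, prune):
        """Greedy multi-pass construction of the generalization rule set.
        S      -- most simplified rule set: a set of frozensets of items
        parent -- dict: item -> its immediate parent in the taxonomy
        H      -- heuristic scoring function H(item, S)
        prune  -- prune(S, G1): rules of S still holding under G1 (assumed)
        """
        def ancestors(item):
            a = parent.get(item)
            while a is not None:
                yield a
                a = parent.get(a)

        def candidate_items(rules):
            # ancestor items of the rules that can still be generalized,
            # i.e. that themselves have a parent in the taxonomy
            found = {a for r in rules for i in r for a in ancestors(i)}
            return {a for a in found if a in parent}

        G1 = set()
        max_k = max((len(r) for r in S), default=0)
        for k in range(1, max_k + 1):
            if k == 1:
                # Pass One: generalize all ancestor items of size-one rules
                W1 = candidate_items(r for r in S if len(r) == 1)
                G1 |= {(a, parent[a]) for a in W1}
                S = prune(S, G1)
            else:
                # Pass k: repeatedly generalize the highest-scoring item
                Wk = candidate_items(r for r in S if len(r) == k)
                while Wk:
                    am = max(Wk, key=lambda a: H(a, S))
                    G1.add((am, parent[am]))
                    S = prune(S, G1)
                    Wk = candidate_items(r for r in S if len(r) == k)
                    Wk -= {a for (a, _) in G1}   # ensure progress
        # Final pruning: drop A -> parent(A) when an ancestor B of A
        # is also generalized in G1 (A's rule is subsumed by B's)
        generalized = {a for (a, _) in G1}
        Gf = {(a, p) for (a, p) in G1
              if not any(b in generalized for b in ancestors(a))}
        return Gf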

Heuristic function H The intuition behind the heuristic function H is that we should favor an item that appears in many association rules in S, and favor it more if it appears in short rules. One possible heuristic function is H(Ai) = Σ weight(Rj), summed over all association rules Rj in S with Ai ∈ Rj. A possible weight function is weight(Rj) = 1/size(Rj); for example, an item appearing in one size-one rule and one size-three rule gets H = 1 + 1/3 = 4/3. Other heuristic functions with similar properties are possible as well.
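
A direct transcription of this heuristic in Python, again assuming rules are sets of items; the descendants helper is hypothetical and accounts for the fact that an ancestor item appears in a rule through the items below it in the taxonomy.

    def H(item, S, descendants):
        """H(Ai) = sum of weight(Rj) over the rules Rj in S containing Ai,
        with weight(Rj) = 1 / size(Rj), so short rules count for more.
        descendants(item) (an assumption) returns the set of items below
        `item` in the taxonomy; an ancestor item 'appears in' a rule if
        the rule contains it or any of its descendants."""
        covered = {item} | descendants(item)
        return sum(1.0 / len(rule) for rule in S if covered & rule)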

This algorithm finds a valid generalization rule set efficiently. With a good heuristic function, the information loss is, in the average case, close to that of the optimal solution.

