
Database generalization

Immediate generalization Given an item A and its immediate parent B in the taxonomy, the immediate generalization rule of A is A → B. For example, if the taxonomy places the item cola directly under beverage, the immediate generalization rule of cola is cola → beverage.

Given a most simplified association rule set S, let W be the set of ancestor items of the rules in S. Naively, we could perform an exhaustive search over the subsets of W: for each subset, we generate the corresponding immediate generalization rules, verify their validity, and compute the information loss. At the end we select the valid generalization rule set with the least information loss.
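
To make the baseline concrete, here is a minimal Python sketch of the exhaustive search. The helper names immediate_rule, is_valid, and info_loss are assumptions for illustration; they are not defined in this paper.

    from itertools import combinations

    def exhaustive_search(W, immediate_rule, is_valid, info_loss):
        """Exhaustive baseline: try all 2^|W| subsets of W and keep the
        valid generalization rule set with the least information loss.
        immediate_rule, is_valid and info_loss are hypothetical helpers."""
        best, best_loss = None, float("inf")
        W = list(W)
        for r in range(len(W) + 1):
            for subset in combinations(W, r):
                # immediate generalization rules A -> parent(A) for the subset
                G = {immediate_rule(a) for a in subset}
                if is_valid(G):
                    loss = info_loss(G)
                    if loss < best_loss:
                        best, best_loss = G, loss
        return best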

But this is obviously too expensive, since the number of subsets is exponential in the size of W. Instead, we develop a greedy algorithm which is efficient and whose information loss is, in general, close to that of the optimal solution.

The algorithm works in several passes; a code sketch of the complete procedure follows the description below.

Pass One We compute W1, the set of ancestor items of the size-one association rules in S, and add the immediate generalization rules of W1 to the generalization rule set G1. We then use G1 to prune S, eliminating the association rules that no longer hold in the database generalized with G1. The result is S1, which contains no size-one association rules.

Pass Two We compute W2, the set of ancestor items of the size-two association rules in S1. For each item Ai in W2 we compute a heuristic function H(Ai), explained below. Let Am be the item with the highest H value. We add the immediate generalization rule of Am to G1, then use G1 to prune S1, eliminating the association rules that no longer hold in the generalized database. We repeat this pass until W2 is empty. The result is S2, which contains no size-one or size-two association rules.

Pass K We apply the same steps as in Pass Two to the size-k association rules.

Finally, we obtain G1. If B is an ancestor of A in the taxonomy and the immediate generalization rules of both A and B are in G1, we delete A's immediate generalization rule from G1, since it is subsumed by B's. The result is the pruned generalization rule set Gf.
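
The following Python sketch puts the passes and the final pruning together. It is a sketch under assumptions: rules are represented as frozensets of items, parent maps an item to its immediate parent in the taxonomy, and prune(S, G1) is a hypothetical helper that drops the rules no longer holding under G1; none of these names come from the paper.

    def greedy_generalization(S, parent, H, prune):
        """Greedy multi-pass construction of the generalization rule set.
        S      -- most simplified rule set: a set of frozensets of items
        parent -- dict: item -> its immediate parent in the taxonomy
        H      -- heuristic scoring function H(item, S)
        prune  -- prune(S, G1): rules of S still holding under G1 (assumed)
        """
        def ancestors(item):
            a = parent.get(item)
            while a is not None:
                yield a
                a = parent.get(a)

        def candidate_items(rules):
            # ancestor items of the rules that can still be generalized,
            # i.e. that themselves have a parent in the taxonomy
            found = {a for r in rules for i in r for a in ancestors(i)}
            return {a for a in found if a in parent}

        G1 = set()
        max_k = max((len(r) for r in S), default=0)
        for k in range(1, max_k + 1):
            if k == 1:
                # Pass One: generalize all ancestor items of size-one rules
                W1 = candidate_items(r for r in S if len(r) == 1)
                G1 |= {(a, parent[a]) for a in W1}
                S = prune(S, G1)
            else:
                # Pass k: repeatedly generalize the highest-scoring item
                Wk = candidate_items(r for r in S if len(r) == k)
                while Wk:
                    am = max(Wk, key=lambda a: H(a, S))
                    G1.add((am, parent[am]))
                    S = prune(S, G1)
                    Wk = candidate_items(r for r in S if len(r) == k)
                    Wk -= {a for (a, _) in G1}   # ensure progress
        # Final pruning: drop A -> parent(A) when an ancestor B of A
        # is also generalized in G1 (A's rule is subsumed by B's)
        generalized = {a for (a, _) in G1}
        Gf = {(a, p) for (a, p) in G1
              if not any(b in generalized for b in ancestors(a))}
        return Gf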

Heuristic function H The intuition behind the heuristic function H is that we should favor an item that appears in many association rules in S, and favor it more if it appears in short rules. One possible heuristic function is H(Ai) = Σ weight(Rj), summed over all association rules Rj in S with Ai ∈ Rj. A possible weight function is weight(Rj) = 1/size(Rj); for example, an item appearing in one size-one rule and one size-three rule gets H = 1 + 1/3 = 4/3. Other heuristic functions with similar properties are possible as well.
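
A direct transcription of this heuristic in Python, again assuming rules are sets of items; the descendants helper is hypothetical and accounts for the fact that an ancestor item appears in a rule through the items below it in the taxonomy.

    def H(item, S, descendants):
        """H(Ai) = sum of weight(Rj) over the rules Rj in S containing Ai,
        with weight(Rj) = 1 / size(Rj), so short rules count for more.
        descendants(item) (an assumption) returns the set of items below
        `item` in the taxonomy; an ancestor item 'appears in' a rule if
        the rule contains it or any of its descendants."""
        covered = {item} | descendants(item)
        return sum(1.0 / len(rule) for rule in S if covered & rule)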

This algorithm finds a valid generalization rule set efficiently. With a good heuristic function, the information loss is, in the average case, close to that of the optimal solution.

