Given the goal, our approach has two main steps. First, we need to find the association rules which imply HIV. Second, we use these association rules to generate the valid generalization rule set with minimize the information loss.
There has been a lot of work done in mining association rules. One of the most popular and state-of-the-art algorithm is the ``Frequent-set'' and its variations and extensions [ATA93] [AS94] [BMUT97] [AM98] [SA95] [SVA97]. Two of them [SA95][SVA97] are most related to our problem. [SA95] introduces several algorithms extended from ``Frequent-set'' mining generalized association rules. [SVA97] can find association rules satisfying given constraints. But our problem has some special properties that we can take advantage of and develop faster algorithms. First, we only need to look for association rules with given consequence, in our example it is ``HIV''. Second, one interesting observation is that if and , and in the generalized database, , so that doesn't hold any more that won't hold any more either. This shows that once we find an association rule , we no longer need to search association rules like . We formalize this as the following theorem.
Simplified Association Rule Set S is a set of association rules with the form , where is a literal or item set. If is a subset of , we say includes . We eliminate such which is included by another association rule in S. Thus we can get a subset of S. is a simplified association rule set of S. The maximum simplified association rule set is the maximum subset of S in which there is no association rule included by another one. We can easily prove that for any S, such a exists and is unique.
Theorem 1. is the most simplified association rule set of . If is a valid generalized rule set that makes doesn't hold any more, then will make not to hold any more either. If interested in the proof detail, please contact for proof appendix.