 
  
  
   
In the area of medical information, 37 states maintain state-level data databases that include a tuple for each medical encounter in the state for the last 7-10 years. That is, a copy of each and every hospital visit is recorded in these databases. Obviously, these are very large databases containing very sensitive information. Obviously, these databases are of great value if we could release them for research or other statistical queries or other legal use. But at the same time, we need to guarantee that the release of information will not reveal any sensitive information.
We can give a better example here. Say, some researchers want a release from a medical database. Almost every state has a written policy (or related legislation) that prevents them from revealing the identities of HIV patients, stated as: RELEASE TO "researcher" PROHIBIT diagnosis WHEN diagnosis="HIV". However, enforcement of this policy must distinguish the explicit statement from the overall intent. Many medications are specific to HIV and so releasing these medications allows one to infer the HIV diagnosis; likewise, HIV patients are often located in the same area of the hospital, so in these cases, room number is also an indicator of HIV status. So what is actually meant by the original stated policy is not to allow one to infer that the diagnosis is HIV. The simple policy stated above in number 1 can be expanded to imply many similar policies such as these: 2. RELEASE TO "researcher" PROHIBIT medication WHEN diagnosis="HIV"; 3. RELEASE TO "researcher" PROHIBIT room WHEN diagnosis="HIV".
It's hard to believe that it is practically feasible in this setting to know a priori all attributes whose values may be related to the inference which the policy seeks to control. After all, the typical medical database has over 350 attributes. Instead, we need some data-mining and classification method to find the correlation among attributes that can infer the diagnosis which is ``HIV''. Then we need some suppression or generalization or adding noise to generate a new database. This new database can be released to the recipient. He can do any query on the released database. Our goal is to enforce the policy which means we want to guarantee that no matter what queries the recipient do he can't get any information that invalidates the policy.
There are many other potential applications as well. For simplicity reason, we stay with the ``HIV'' example in this report. Also we assume the readers are familiar with the basic concept of association rules, support, confidence, taxonomy and generalized association rules. When we mention about association rules in this report, we always mean generalized association rules. We use both item and literal to refer to the basic component of the association rules. They refer to the leave nodes as well as the higher nodes in the taxonomy. The ancestor items of an association rule is the item set of the left-hand side of the association rule. The size of an association rule is the number of ancestor items of the association rule.
This report is organized as the following: in section 2 we give the problem definition; in section 3.2 we give a detailed description of our approach and prove the correctness of the algorithm; in section 4 we describe our data generation engine, which generates database according to some data model; in section 5 we describe our data structure and explain why it's a good data structure; in section 6 we give the experimental results runs on different size of data set generated by our data generation engine; then finally in the discussion section, we compare with other possible methods to deal with the problem of mining association rules, and we also propose some possible further improvement of the algorithms.
 
  
 