Next: Our approach and algorithm Up: Policy enforcement in releasing Previous: Introduction

Problem Statement

Assumption of the database The database we consider can have categorical attributes with taxonomies over them. In reality, many databases have natual taxonomies, also as pointed out in [SA95]. For simplicity reason, in this report and in the implementation we assume all the attributes are binary with a taxonomy over them. Since it's easy to convert a database with categorical attributes to a database with binary attributes, this assumption will not give us any limitation. We also assume that all our interesting association rules have HIV as their consequence because we assume here the policy is simply not to reveal HIV. This won't lose any generality either because it's trivial to change to other more complicated consequences of the association rules.

Generalizing a Database Database DB has a set of attributes ${A$ _i} with a taxonomy T over the attributes. T can be modeled as a DAG which defines a pre-order on the DAG. $T$ _i T_j iff. $T$ _j is an ancestor of $T$ _i in the DAG. For each attribute $A$ _i, we define a generalization rule $A$ _i P_i, where $A$ _i P_i in $T$ and means generalize-to. A tuple ${x$ _i} ${y$ _i} iff. for every i, $x$ _i y_i according to the generalization rules. Hence, given a set of generalization rules, the database DB PDB iff. each tuple in DB the correspondent tuple in PDB.

Information Loss Given a database DB with a set of attributes ${A$ _i} and a set of generalization rules ${A$ _i P_i}, DB PDB. The information loss of PDB from DB is $∑$ _i (R(A_i) - R(P_i)), where R is the rank function of the taxonomy with the R(root) = 0. This formula can also easily be extended to a weighted sum.

Valid generalization rule set Given a database DB, S is the complete set of association rules with the form $A
⇒HIV$ with given threshold of support and confidence. A generalization rule set $G$ is valid iff. in the generalized database any association rule in S doesn't hold any more. It's easy to prove that this also implies that in the generalized database there will be no association rule that will imply HIV. Hence the generalized database will not reveal HIV.

Problem Statement Given a database , our goal is to find a valid generalization rule set so that the generalized database obeys the policy and has minimum information loss.

Next: Our approach and algorithm Up: Policy enforcement in releasing Previous: Introduction

Adrian Perrig
Wed Sep 15 14:27:56 PDT 1999