[Download postscript version]
Next: Future Work
Up: Discussion
Previous: Genetic algorithm
- For each one-item set, we store pointers to its first and last
occurrence. So for each candidate, we can compute the upper-bound
of the first occurrence place and the lower-bound of the last
occurrence place. The tuples between these bounds are the only ones
in the database
that we need to scan. This will speed up a lot, especially in the
case of sparse database like IBM basket database.
- Adaptive support. Right now all of the algorithms don't mention
how to select the threshold of support and confidence to generate
the candidates. They all use fix support threshold which is
predefined by the users. The problem is that for different
databases with different characteristics, users may like to find
rules with different supports. And if the users have not proper
view of this database, then it will be difficult for him to choose
the support thresholds. One possible to deal with this is make the
algorithm to select the support threshold for the user by the
statics of the supports of each 1-itemset.
- Our program can still be improved in running speed. For
instance, we can create a hierarchy of the candidates and first
match the candidates which are more general -- if the more general
candidate does not match a tuple, the more specific one will
certainly not match it either. We can also order the attribute
groups such that we check for infrequent attributes first to stop
the matching as fast as possible. Ideally we could compile code for
each candidate and pass the tuple as argument, the candidate would
then evaluate a match with the highest speed possible and call the
matching function of potential siblings as well. Our candidate
generation is currently also inefficient, with a smart data
structure we could reduce the complexity from to
where are the number of candidates in the current candidate set
and are the number of candidates in the next set. Another
improvement can be achieved by streamlining our candidate hash
function. We currently use the entire bit vector representation to
compute the hash code. An improved version would only take the list
of items in a candidate and hash it, this list is generally much
shorter than the bitvector.
Next: Future Work
Up: Discussion
Previous: Genetic algorithm
Adrian Perrig
Wed Sep 15 14:27:56 PDT 1999