Other possible improvements

[Download postscript version]

Next: Future Work Up: Discussion Previous: Genetic algorithm

Other possible improvements

For each one-item set, we store pointers to its first and last occurrence. So for each candidate, we can compute the upper-bound of the first occurrence place and the lower-bound of the last occurrence place. The tuples between these bounds are the only ones in the database that we need to scan. This will speed up a lot, especially in the case of sparse database like IBM basket database.
Adaptive support. Right now all of the algorithms don't mention how to select the threshold of support and confidence to generate the candidates. They all use fix support threshold which is predefined by the users. The problem is that for different databases with different characteristics, users may like to find rules with different supports. And if the users have not proper view of this database, then it will be difficult for him to choose the support thresholds. One possible to deal with this is make the algorithm to select the support threshold for the user by the statics of the supports of each 1-itemset.
Our program can still be improved in running speed. For instance, we can create a hierarchy of the candidates and first match the candidates which are more general -- if the more general candidate does not match a tuple, the more specific one will certainly not match it either. We can also order the attribute groups such that we check for infrequent attributes first to stop the matching as fast as possible. Ideally we could compile code for each candidate and pass the tuple as argument, the candidate would then evaluate a match with the highest speed possible and call the matching function of potential siblings as well. Our candidate generation is currently also inefficient, with a smart data structure we could reduce the complexity from $O(n$ ²) to $O(m)$ where $n$ are the number of candidates in the current candidate set and $m$ are the number of candidates in the next set. Another improvement can be achieved by streamlining our candidate hash function. We currently use the entire bit vector representation to compute the hash code. An improved version would only take the list of items in a candidate and hash it, this list is generally much shorter than the bitvector.

Next: Future Work Up: Discussion Previous: Genetic algorithm

Adrian Perrig
Wed Sep 15 14:27:56 PDT 1999