We run our experiments on 5 machines: gs5, gs14, gs207 are 266MHz Pentium-II PCs with 64M memory, gs81 is 450MHz Pentium-II PC with 128M memory. Besides the running test program, these machines have low working load.
We designed a set of tests to compare our improved algorithm with algorithm Cumulate. The improved algorithm is expected to improve the performance. For each pass, the candidate sets should be much smaller. Also the improved algorithm is expected to return fewer qualified association rules. This can also be thought of as an automatic rule pruning.
We use our data generation to generate different datasets. The size of these datasets are 200000, 50000, 100000, 500000, and 1000000. The number of attributes vary from 500 to 1000. Our tests are more difficult than some of the IBM testing cases. Although the IBM testing cases have higher number of attributes, the average size of the transactions is 10 and the average size of the frequent itemsets is 4. In our case, the two numbers are higher. In general, the average size of the transactions is the square root of the number of the attributes. We made this decision because it is closer to the real medical databases.
For each test, we output the size of the candidate set for each iteration and the number of rules with high confidence found in every pass. We also output the total time that is used to do the rule mining. The result is shown in the following tables. The basic IBM algorithm described in [SA95] is shown in the first 2 data rows of each table, referred to as ``Cumulate''. Our improved algorithm ``Horizon'' is shown in the last two rows of each table and it has a significant speedup of up to 3.5 times. This is due to our ``horizon'' effect: once a rule has reached the confidence threshold, it is pruned. Therefore all subsequent rules which contain the pruned rule will also be pruned off.
50000 tuples, 552 attributes, 40 attribute groups | ||||||||||
Itemsetsize | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Time |
Cumulate | 113 | 1379 | 5609 | 11793 | 14371 | 11089 | 4617 | 738 | 48 | 1124 |
# of rules | 0 | 0 | 154 | 1391 | 4851 | 6909 | 3933 | 694 | 48 | |
Horizon | 113 | 1379 | 5455 | 10400 | 9508 | 4175 | 683 | 44 | 881 | |
# of rules | 0 | 0 | 154 | 689 | 1542 | 1122 | 281 | 42 |
10000 tuples, 520 attributes, 40 attribute groups | ||||||||||
Itemsetsize | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Time |
Cumulate | 111 | 1380 | 5900 | 12911 | 16618 | 12186 | 3904 | 148 | 12 | 263 |
# of rules | 0 | 12 | 476 | 2971 | 7146 | 7878 | 3255 | 148 | 12 | |
Horizon | 111 | 1368 | 5423 | 9928 | 9453 | 4299 | 646 | 162 | ||
# of rules | 0 | 12 | 367 | 1288 | 1968 | 1404 | 371 |
500000 tuples, 609 attributes, 50 attribute groups | ||||||||||
Itemsetsize | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Time |
Cumulate | 137 | 1361 | 4417 | 6792 | 6754 | 3785 | 1064 | 166 | 24 | 4458 |
# of rules | 0 | 0 | 87 | 1413 | 4008 | 3302 | 1064 | 166 | 24 | |
Horizon | 137 | 1361 | 4330 | 5379 | 2745 | 483 | 3402 | |||
# of rules | 0 | 0 | 87 | 997 | 1670 | 347 |
10000 tuples, 541 attributes, 40 attribute groups | |||||||||
Itemsetsize | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | Time |
Cumulate | 114 | 1280 | 4631 | 7787 | 7990 | 5126 | 1572 | 120 | 152 |
# of rules | 0 | 25 | 641 | 3387 | 6430 | 4959 | 1572 | 120 | |
Horizon | 114 | 1255 | 3990 | 4385 | 1473 | 146 | 73 | ||
# of rules | 0 | 25 | 414 | 871 | 476 | 39 |
10000 tuples, 554 attributes, 40 attribute groups | ||||||||||
Itemsetsize | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Time |
Cumulate | 138 | 1790 | 8413 | 20581 | 27415 | 20506 | 7849 | 1252 | 96 | 724 |
# of rules | 0 | 71 | 1394 | 7729 | 17971 | 17917 | 7659 | 1252 | 96 | |
Horizon | 138 | 1719 | 7016 | 12811 | 9383 | 2574 | 183 | 271 | ||
# of rules | 0 | 71 | 815 | 2406 | 3004 | 1435 | 225 |
100000 tuples, 1272 attributes, 90 attribute groups | ||||||||||
Itemsetsize | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Time |
Cumulate | 170 | 1672 | 4721 | 6992 | 6638 | 3733 | 1244 | 295 | 48 | 1228 |
# of rules | 0 | 2 | 647 | 3936 | 5967 | 3714 | 1244 | 295 | 48 | |
Horizon | 170 | 1670 | 4057 | 3033 | 666 | 12 | 613 | |||
# of rules | 0 | 2 | 551 | 1168 | 280 | 15 |
100000 tuples, 1297 attributes, 90 attribute groups | ||||||||
Itemsetsize | 1 | 2 | 3 | 4 | 5 | 6 | 7 | Time |
Cumulate | 149 | 1526 | 3646 | 4340 | 3882 | 1885 | 314 | 939 |
# of rules | 0 | 0 | 137 | 962 | 2318 | 1660 | 310 | |
Horizon | 149 | 1526 | 3509 | 3375 | 1556 | 222 | 4 | 765 |
# of rules | 0 | 0 | 137 | 557 | 847 | 242 | 17 |
50000 tuples, 556 attributes, 40 attribute groups | |||||||||
Itemsetsize | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | Time |
Cumulate | 111 | 1410 | 4644 | 6702 | 5468 | 2737 | 678 | 36 | 720 |
# of rules | 0 | 0 | 175 | 1520 | 3408 | 2664 | 678 | 36 | |
Horizon | 111 | 1410 | 4469 | 5170 | 2045 | 73 | 510 | ||
# of rules | 0 | 0 | 175 | 949 | 902 | 59 |
100000 tuples, 1187 attributes, 90 attribute groups | ||||||||||
Itemsetsize | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Time |
Cumulate | 164 | 1586 | 5505 | 11659 | 17021 | 15759 | 9828 | 4536 | 972 | 3877 |
# of rules | 0 | 0 | 204 | 2784 | 9348 | 13223 | 9600 | 4536 | 972 | |
Horizon | 164 | 1586 | 5301 | 8851 | 7655 | 2528 | 228 | 2094 | ||
# of rules | 0 | 0 | 204 | 1917 | 3497 | 2255 | 310 |
10000 tuples, 494 attributes, 40 attribute groups | |||||||||
Itemsetsize | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | Time |
Cumulate | 113 | 1560 | 7555 | 19612 | 30853 | 28797 | 13968 | 2574 | 933 |
# of rules | 0 | 20 | 1020 | 7276 | 19450 | 23315 | 12837 | 2556 | |
Horizon | 113 | 1540 | 6532 | 12326 | 11391 | 5481 | 1131 | 18 | 341 |
# of rules | 0 | 20 | 833 | 3526 | 5714 | 4249 | 1591 | 240 |
10000 tuples, 508 attributes, 40 attribute groups | ||||||||
Itemsetsize | 1 | 2 | 3 | 4 | 5 | 6 | 7 | Time |
Cumulate | 126 | 1616 | 6174 | 11841 | 13368 | 7706 | 1656 | 263 |
# of rules | 0 | 37 | 1008 | 5399 | 10089 | 7156 | 1652 | |
Horizon | 126 | 1579 | 5163 | 6434 | 3276 | 550 | 4 | 115 |
# of rules | 0 | 37 | 710 | 2031 | 1487 | 368 | 9 |
10000 tuples, 491 attributes, 40 attribute groups | ||||||||||
Itemsetsize | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | Time |
Cumulate | 124 | 1683 | 8232 | 20381 | 28286 | 21507 | 8062 | 1060 | 48 | 798 |
# of rules | 0 | 31 | 871 | 7329 | 18009 | 18547 | 7729 | 1058 | 48 | |
Horizon | 124 | 1652 | 7360 | 13041 | 10148 | 2738 | 306 | 2 | 312 | |
# of rules | 0 | 31 | 703 | 3432 | 3357 | 1607 | 152 | 4 |
100000 tuples, 1120 attributes, 90 attribute groups | |||||||||
Itemsetsize | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | Time |
Cumulate | 149 | 1349 | 3576 | 5534 | 4839 | 1983 | 490 | 54 | 986 |
# of rules | 0 | 0 | 430 | 2247 | 3520 | 1827 | 489 | 54 | |
Horizon | 149 | 1349 | 3146 | 3280 | 1291 | 143 | 623 | ||
# of rules | 0 | 0 | 430 | 988 | 460 | 26 |
100000 tuples, 1301 attributes, 90 attribute groups | |||||||||
Itemsetsize | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | Time |
Cumulate | 162 | 1711 | 5671 | 10035 | 11963 | 7980 | 2367 | 174 | 2438 |
# of rules | 0 | 2 | 431 | 3520 | 8824 | 7506 | 2367 | 174 | |
Horizon | 162 | 1709 | 5239 | 6505 | 3130 | 472 | 1503 | ||
# of rules | 0 | 2 | 382 | 1990 | 2800 | 683 |
500000 tuples, 1211 attributes, 90 attribute groups | |||||||||
Itemsetsize | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | Time |
Cumulate | 173 | 1502 | 2762 | 4553 | 4778 | 2554 | 528 | 12 | 4051 |
# of rules | 0 | 0 | 71 | 1107 | 2946 | 2341 | 528 | 12 | |
Horizon | 173 | 1502 | 2691 | 3442 | 1825 | 189 | 2929 | ||
# of rules | 0 | 0 | 71 | 782 | 981 | 300 |
1000000 tuples, 477 attributes, 40 attribute groups | ||||||||
Itemsetsize | 1 | 2 | 3 | 4 | 5 | 6 | 7 | Time |
Cumulate | 96 | 1016 | 3187 | 4311 | 2487 | 888 | 146 | 6749 |
# of rules | 0 | 0 | 5 | 184 | 872 | 698 | 146 | |
Horizon | 96 | 1016 | 3182 | 4127 | 1615 | 190 | 6285 | |
# of rules | 0 | 0 | 5 | 156 | 526 | 179 |
We can see that the running time is strongly dependent on the number of rules searched. A data base which contains a high number of associations will take therefore much longer to process.