[Download postscript version]
next up previous
Next: Discussion Up: Policy enforcement in releasing Previous: Feature data structure

Experiment Result

 

We run our experiments on 5 machines: gs5, gs14, gs207 are 266MHz Pentium-II PCs with 64M memory, gs81 is 450MHz Pentium-II PC with 128M memory. Besides the running test program, these machines have low working load.

We designed a set of tests to compare our improved algorithm with algorithm Cumulate. The improved algorithm is expected to improve the performance. For each pass, the candidate sets should be much smaller. Also the improved algorithm is expected to return fewer qualified association rules. This can also be thought of as an automatic rule pruning.

We use our data generation to generate different datasets. The size of these datasets are 200000, 50000, 100000, 500000, and 1000000. The number of attributes vary from 500 to 1000. Our tests are more difficult than some of the IBM testing cases. Although the IBM testing cases have higher number of attributes, the average size of the transactions is 10 and the average size of the frequent itemsets is 4. In our case, the two numbers are higher. In general, the average size of the transactions is the square root of the number of the attributes. We made this decision because it is closer to the real medical databases.

For each test, we output the size of the candidate set for each iteration and the number of rules with high confidence found in every pass. We also output the total time that is used to do the rule mining. The result is shown in the following tables. The basic IBM algorithm described in [SA95] is shown in the first 2 data rows of each table, referred to as ``Cumulate''. Our improved algorithm ``Horizon'' is shown in the last two rows of each table and it has a significant speedup of up to 3.5 times. This is due to our ``horizon'' effect: once a rule has reached the confidence threshold, it is pruned. Therefore all subsequent rules which contain the pruned rule will also be pruned off.

 
50000 tuples, 500 attributes, 40 number of attribute groups
Itemsetsize 1 2 3 4 5 6 7 8 Time
Cumulate 114 1270 3902 5224 3769 1280 146 18 304
# of rules 0 0 196 1466 2703 1252 146 18
Horizon 114 1270 3706 3751 1059 28 229
# of rules 0 0 196 771 633 8
Table 1: Tests on gs81 

50000 tuples, 552 attributes, 40 attribute groups
Itemsetsize 1 2 3 4 5 6 7 8 9 Time
Cumulate 113 1379 5609 11793 14371 11089 4617 738 48 1124
# of rules 0 0 154 1391 4851 6909 3933 694 48
Horizon 113 1379 5455 10400 9508 4175 683 44 881
# of rules 0 0 154 689 1542 1122 281 42

10000 tuples, 520 attributes, 40 attribute groups
Itemsetsize 1 2 3 4 5 6 7 8 9 Time
Cumulate 111 1380 5900 12911 16618 12186 3904 148 12 263
# of rules 0 12 476 2971 7146 7878 3255 148 12
Horizon 111 1368 5423 9928 9453 4299 646 162
# of rules 0 12 367 1288 1968 1404 371

500000 tuples, 609 attributes, 50 attribute groups
Itemsetsize 1 2 3 4 5 6 7 8 9 Time
Cumulate 137 1361 4417 6792 6754 3785 1064 166 24 4458
# of rules 0 0 87 1413 4008 3302 1064 166 24
Horizon 137 1361 4330 5379 2745 483 3402
# of rules 0 0 87 997 1670 347

 
50000 tuples, 524 attributes, 40 attribute groups
Itemsetsize 1 2 3 4 5 6 7 8 Time
Cumulate 114 1250 4290 7153 7063 4204 1125 10 770
# of rules 0 0 80 835 2298 2543 1007 5
Horizon 114 1250 4210 6317 4760 1655 118 5 653
# of rules 0 0 80 607 1092 710 138 2
Table 2: Tests on gs14 

10000 tuples, 541 attributes, 40 attribute groups
Itemsetsize 1 2 3 4 5 6 7 8 Time
Cumulate 114 1280 4631 7787 7990 5126 1572 120 152
# of rules 0 25 641 3387 6430 4959 1572 120
Horizon 114 1255 3990 4385 1473 146 73
# of rules 0 25 414 871 476 39

10000 tuples, 554 attributes, 40 attribute groups
Itemsetsize 1 2 3 4 5 6 7 8 9 Time
Cumulate 138 1790 8413 20581 27415 20506 7849 1252 96 724
# of rules 0 71 1394 7729 17971 17917 7659 1252 96
Horizon 138 1719 7016 12811 9383 2574 183 271
# of rules 0 71 815 2406 3004 1435 225

100000 tuples, 1272 attributes, 90 attribute groups
Itemsetsize 1 2 3 4 5 6 7 8 9 Time
Cumulate 170 1672 4721 6992 6638 3733 1244 295 48 1228
# of rules 0 2 647 3936 5967 3714 1244 295 48
Horizon 170 1670 4057 3033 666 12 613
# of rules 0 2 551 1168 280 15

100000 tuples, 1297 attributes, 90 attribute groups
Itemsetsize 1 2 3 4 5 6 7 Time
Cumulate 149 1526 3646 4340 3882 1885 314 939
# of rules 0 0 137 962 2318 1660 310
Horizon 149 1526 3509 3375 1556 222 4 765
# of rules 0 0 137 557 847 242 17

 
50000 tuples, 568 attributes, 40 number of attribute groups
Itemsetsize 1 2 3 4 5 6 7 8 Time
Cumulate 109 1181 3031 4293 4466 2517 531 18 458
# of rules 0 0 136 1120 2503 2004 507 18
Horizon 109 1181 2895 3173 1935 499 23 336
# of rules 0 0 136 562 636 192 33
Table 3: Tests on gs5 

50000 tuples, 556 attributes, 40 attribute groups
Itemsetsize 1 2 3 4 5 6 7 8 Time
Cumulate 111 1410 4644 6702 5468 2737 678 36 720
# of rules 0 0 175 1520 3408 2664 678 36
Horizon 111 1410 4469 5170 2045 73 510
# of rules 0 0 175 949 902 59

100000 tuples, 1187 attributes, 90 attribute groups
Itemsetsize 1 2 3 4 5 6 7 8 9 Time
Cumulate 164 1586 5505 11659 17021 15759 9828 4536 972 3877
# of rules 0 0 204 2784 9348 13223 9600 4536 972
Horizon 164 1586 5301 8851 7655 2528 228 2094
# of rules 0 0 204 1917 3497 2255 310

10000 tuples, 494 attributes, 40 attribute groups
Itemsetsize 1 2 3 4 5 6 7 8 Time
Cumulate 113 1560 7555 19612 30853 28797 13968 2574 933
# of rules 0 20 1020 7276 19450 23315 12837 2556
Horizon 113 1540 6532 12326 11391 5481 1131 18 341
# of rules 0 20 833 3526 5714 4249 1591 240

 
10000 tuples, 631 attributes, 40 attribute groups
Itemsetsize 1 2 3 4 5 6 7 8 Time
Cumulate 128 1619 4033 4715 3783 1774 384 3 81
# of rules 0 19 442 2068 2926 1681 383 3
Horizon 128 1600 3591 2645 782 73 1 51
# of rules 0 19 297 594 195 36 0
Table 4: Tests on gs207 

10000 tuples, 508 attributes, 40 attribute groups
Itemsetsize 1 2 3 4 5 6 7 Time
Cumulate 126 1616 6174 11841 13368 7706 1656 263
# of rules 0 37 1008 5399 10089 7156 1652
Horizon 126 1579 5163 6434 3276 550 4 115
# of rules 0 37 710 2031 1487 368 9

10000 tuples, 491 attributes, 40 attribute groups
Itemsetsize 1 2 3 4 5 6 7 8 9 Time
Cumulate 124 1683 8232 20381 28286 21507 8062 1060 48 798
# of rules 0 31 871 7329 18009 18547 7729 1058 48
Horizon 124 1652 7360 13041 10148 2738 306 2 312
# of rules 0 31 703 3432 3357 1607 152 4

100000 tuples, 1120 attributes, 90 attribute groups
Itemsetsize 1 2 3 4 5 6 7 8 Time
Cumulate 149 1349 3576 5534 4839 1983 490 54 986
# of rules 0 0 430 2247 3520 1827 489 54
Horizon 149 1349 3146 3280 1291 143 623
# of rules 0 0 430 988 460 26

100000 tuples, 1301 attributes, 90 attribute groups
Itemsetsize 1 2 3 4 5 6 7 8 Time
Cumulate 162 1711 5671 10035 11963 7980 2367 174 2438
# of rules 0 2 431 3520 8824 7506 2367 174
Horizon 162 1709 5239 6505 3130 472 1503
# of rules 0 2 382 1990 2800 683

 
500000 tuples, 1157 attributes, 90 attribute groups
Itemsetsize 1 2 3 4 5 6 7 8 Time
Cumulate 142 1150 1845 2109 1636 670 125 6 1724
# of rules 0 0 94 814 1545 670 125 6
Horizon 142 1150 1751 1294 90 980
# of rules 0 0 94 398 243
Table 5: Tests on no.trust 

500000 tuples, 1211 attributes, 90 attribute groups
Itemsetsize 1 2 3 4 5 6 7 8 Time
Cumulate 173 1502 2762 4553 4778 2554 528 12 4051
# of rules 0 0 71 1107 2946 2341 528 12
Horizon 173 1502 2691 3442 1825 189 2929
# of rules 0 0 71 782 981 300

1000000 tuples, 477 attributes, 40 attribute groups
Itemsetsize 1 2 3 4 5 6 7 Time
Cumulate 96 1016 3187 4311 2487 888 146 6749
# of rules 0 0 5 184 872 698 146
Horizon 96 1016 3182 4127 1615 190 6285
# of rules 0 0 5 156 526 179

We can see that the running time is strongly dependent on the number of rules searched. A data base which contains a high number of associations will take therefore much longer to process.


next up previous
Next: Discussion Up: Policy enforcement in releasing Previous: Feature data structure

Adrian Perrig
Wed Sep 15 14:27:56 PDT 1999