# COMP7103 Data Mining

## Topic 2 Association Rules

A general many-many mapping (association) between two kinds of things

• A large set of items, e.g., things sold in a supermarket
• A large set of baskets, each of which is a small set of the items, e.g., the things one customer buys on one day

### Frequent Itemsets

#### Support

Support of an itemset I, written s(I), is the number of baskets containing all items in I
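
A minimal sketch of this definition in Python; the basket contents are invented for illustration:

```python
def support(baskets, itemset):
    """Number of baskets containing all items in `itemset`."""
    target = set(itemset)
    return sum(1 for basket in baskets if target <= set(basket))

# Example data (made up): three baskets of supermarket items.
baskets = [{"beer", "diapers"}, {"beer", "bread"}, {"beer", "diapers", "milk"}]
print(support(baskets, {"beer", "diapers"}))  # prints 2
```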

Given a support threshold s, sets of items that appear in at least s baskets are called frequent itemsets.

#### Monotonicity

For any set of items I and any set of items J with J ⊆ I, it holds that s(J) ≥ s(I): every basket that contains all items of I also contains all items of J.

### Applications

• Given that many people buy beer and diapers together
• Run a sale on diapers; raise the price of beer
• Only useful if many customers buy diapers & beer
• Items that appear together too often could represent plagiarism
• Unusual words appearing together in a large number of documents

### Association Rules

If-then rules I → j about the contents of baskets, where I is a set of items and j is a single item

• I → j means
• If a basket contains all the items in I, then it is likely to also contain j

#### Confidence

The confidence of the rule I → j is the probability of j given I:

conf(I → j) = s(I ∪ {j}) / s(I)

#### Finding Association Rules

Find all association rules with support ≥ s and confidence ≥ c
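
As a sketch of this task, assuming the data fits in memory, a brute-force miner can enumerate every itemset of every basket; this is illustrative only (exponential in basket size) and all names here are assumptions:

```python
from collections import Counter
from itertools import combinations

def find_rules(baskets, s, c):
    """Brute force: all rules I -> j with support >= s and confidence >= c.
    Feasible only for tiny data; counts every subset of every basket."""
    counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))
        for k in range(1, len(items) + 1):
            for itemset in combinations(items, k):
                counts[itemset] += 1
    rules = []
    for itemset, count in counts.items():
        if count < s or len(itemset) < 2:
            continue
        for j in itemset:
            left = tuple(x for x in itemset if x != j)
            conf = count / counts[left]       # conf(I -> j) = s(I u {j}) / s(I)
            if conf >= c:
                rules.append((left, j, count, conf))
    return rules
```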

#### Computation Model

• Data is kept in raw files rather than in a database system
• Stored on disk
• The true cost of mining disk-resident data is usually the number of disk I/O’s
• In practice, association-rule algorithms read the data in passes: all baskets are read in turn
• We measure the cost by the number of passes an algorithm takes

### Association Rules Algorithms

#### Naïve Algorithm

• Read file once, counting in main memory the occurrences of each pair
• From each basket of n items, generate its n(n−1)/2 pairs by two nested loops
• Fails if (#items)^2 exceeds main memory
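
A minimal sketch of the naïve algorithm; the one-basket-per-line, whitespace-separated file format is an assumption:

```python
from collections import Counter
from itertools import combinations

def naive_pair_counts(path):
    """One pass over the file, counting every pair in main memory.
    Fails when (#items)^2 exceeds available memory."""
    counts = Counter()
    with open(path) as f:
        for line in f:
            items = sorted(set(line.split()))
            for pair in combinations(items, 2):  # the n(n-1)/2 pairs
                counts[pair] += 1
    return counts
```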

#### A-Priori Algorithm

• A two-pass approach called a-priori limits the need for main memory
• Key idea: monotonicity
• If a set of items appears at least s times, so does every subset
• For pairs: if item i does not appear in s baskets, then no pair including i can appear in s baskets
• Process
• Pass 1
• Read baskets and count in main memory the occurrences of each item (Requires only memory proportional to #items)
• Items that appear at least s times are the frequent items
• Pass 2
• Read baskets again and count in main memory only those pairs whose two items were both found to be frequent in pass 1
• To count the item pairs, use a hash function
• Requires memory proportional only to the square of the number of frequent items, plus a list of the frequent items, plus space for hashing
• Beyond pairs, the same scheme extends to larger itemsets: one pass for each k
• Needs room in main memory to count each candidate k -set
• For typical market-basket data and reasonable support (e.g., 1%), k = 2 requires the most memory
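
A sketch of the two passes for frequent pairs; `read_baskets` is an assumed caller-supplied function that returns a fresh iterator over the baskets, standing in for one disk pass:

```python
from collections import Counter
from itertools import combinations

def apriori_pairs(read_baskets, s):
    """Two-pass A-Priori for frequent pairs."""
    # Pass 1: count single items; memory proportional to #items.
    item_counts = Counter()
    for basket in read_baskets():
        item_counts.update(set(basket))
    frequent = {i for i, n in item_counts.items() if n >= s}
    # Pass 2: count only pairs whose items are both frequent (monotonicity).
    pair_counts = Counter()
    for basket in read_baskets():
        kept = sorted(set(basket) & frequent)
        for pair in combinations(kept, 2):
            pair_counts[pair] += 1
    return {p: n for p, n in pair_counts.items() if n >= s}
```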

#### PCY Algorithm

• Main observation: during pass 1 of A-priori, most memory is idle
• Use that memory to keep additional information that reduces the number of pairs to be counted, and hence the memory needed, during pass 2 of A-Priori
• Passes > 2 are the same as in A-Priori
• Process
• Pass 1
• Use a hash function which bucketizes item pairs, that is, maps them to integers in [1,k]
• Each bucket i in [1,k] is associated with a counter c_i
• During pass 1, as we examine a basket (e.g., {m,b,d,o})
• Update the counters of the single items
• Generate all item pairs for that basket, hash each of them, and add 1 to the corresponding counter
• Pass 2
• Count all pairs {i, j } that meet the conditions for being a candidate pair
• Both i and j are frequent items
• The pair {i, j} hashes to a frequent bucket, i.e., a bucket whose counter is at least s
• Ignore all pairs that hash to non-frequent buckets (no counter is kept for them during pass 2)
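
A sketch of PCY under the same assumptions as the A-Priori sketch above. In practice the bucket counters are usually compressed into a bitmap (1 bit per bucket) between passes; a plain boolean list stands in for it here:

```python
from collections import Counter
from itertools import combinations

def pcy_pairs(read_baskets, s, num_buckets):
    """PCY for frequent pairs; num_buckets is k, the number of buckets."""
    def bucket(pair):
        return hash(pair) % num_buckets  # the bucketizing hash function

    # Pass 1: count items, and hash every pair into a bucket counter.
    item_counts = Counter()
    bucket_counts = [0] * num_buckets
    for basket in read_baskets():
        items = sorted(set(basket))
        item_counts.update(items)
        for pair in combinations(items, 2):
            bucket_counts[bucket(pair)] += 1
    frequent_items = {i for i, n in item_counts.items() if n >= s}
    # Boolean list standing in for the between-pass bitmap.
    frequent_bucket = [n >= s for n in bucket_counts]
    # Pass 2: count only candidate pairs (both items frequent, frequent bucket).
    pair_counts = Counter()
    for basket in read_baskets():
        kept = sorted(set(basket) & frequent_items)
        for pair in combinations(kept, 2):
            if frequent_bucket[bucket(pair)]:
                pair_counts[pair] += 1
    return {p: n for p, n in pair_counts.items() if n >= s}
```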

#### Simple Algorithm

• Take a random sample of the market baskets
• Make a full pass over the data and keep each basket in main memory with probability p
• A random sample is the best representative of a dataset
• Keeping only the first baskets could be biased; they might, for example, contain no iPhones at all
• If we cannot take a sample large enough to be reliable, false positives are possible
• Remove false positives with one more pass over the full dataset
• Run A-Priori or one of its improvements on the sample in main memory, so you do not pay for disk I/O each time you make a pass over the data
• Be sure you leave enough space for counts
• Adjust the support threshold s accordingly: since the sample holds a fraction p of the baskets, use p × s
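
A sketch of the sampling approach, using naive in-memory pair counting in place of A-Priori for brevity; the threshold is scaled to p × s as in the last bullet, and all names are assumptions:

```python
import random
from collections import Counter
from itertools import combinations

def sample_frequent_pairs(read_baskets, s, p, seed=0):
    """Keep each basket with probability p, then mine the sample in main
    memory with the scaled threshold p*s. A second full pass (not shown)
    would remove false positives; false negatives may remain."""
    rng = random.Random(seed)
    sample = [set(b) for b in read_baskets() if rng.random() < p]
    counts = Counter()  # naive in-memory counting; A-Priori/PCY also work
    for basket in sample:
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= p * s}
```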

#### SON Algorithm

• Two passes
• No false positives or false negatives: any itemset frequent in the whole dataset must be frequent in at least one chunk, so Pass 1 cannot miss it
• Process
• Pass 1
• Divide the dataset into chunks, where each chunk contains a subset of baskets
• Let p_i be the fraction of the dataset contained in the i-th chunk
• For each chunk i, compute all frequent itemsets with support threshold p_i × s and store them on disk; these are the candidates for the next pass
• Pass 2
• Read all frequent itemsets found in the previous pass (candidates)
• For each candidate, count its occurrences over the whole dataset and output only those with support at least s
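
A sketch of SON for frequent pairs, assuming the chunks are given as in-memory lists (in practice each chunk is read from disk in turn):

```python
from collections import Counter
from itertools import combinations

def son_pairs(chunks, s):
    """SON for frequent pairs; `chunks` is a list of lists of baskets whose
    union is the whole dataset."""
    total = sum(len(chunk) for chunk in chunks)
    # Pass 1: mine each chunk with the proportionally lowered threshold p_i * s.
    candidates = set()
    for chunk in chunks:
        p_i = len(chunk) / total  # fraction of the data in this chunk
        counts = Counter()
        for basket in chunk:
            for pair in combinations(sorted(set(basket)), 2):
                counts[pair] += 1
        candidates |= {pair for pair, n in counts.items() if n >= p_i * s}
    # Pass 2: count each candidate over the full dataset and keep those with
    # support at least s: no false positives or false negatives.
    full = Counter()
    for chunk in chunks:
        for basket in chunk:
            for pair in combinations(sorted(set(basket)), 2):
                if pair in candidates:
                    full[pair] += 1
    return {pair: n for pair, n in full.items() if n >= s}
```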
