Background
Multidimensional Data
 Flat relational tables
 Multimedia feature vectors
 Data warehouse data
 Spatial data
 Text documents
Attribute Types
 Attributes of multidimensional tuples may have variable types
 Ordinal (e.g., age, salary)
 Nominal categorical values (e.g., color, religion)
 Binary (e.g., gender, owns_property)
 Basic queries: range, NN, similarity
Basic Queries
 (Range) selection query
 Returns the records that qualify a (multidimensional) range predicate
 Example:
 Return the employees of age between 45 and 50 and salary above $100,000
 Distance (similarity) query
 Returns the records that are within a distance from a reference record.
 Example:
 Find images with feature vectors of Euclidean distance at most ε with the feature vector of a given image
 Nearest neighbor (similarity) query
 Replaces distance bound by ranking predicate
Topk Search Methods
 Rank aggregation
 Indexbased methods
Topk Query
 Given a set of objects (e.g., relational tuples),
 Returns the k objects with the highest combined score, based on an aggregate function f.
 Example:
 Relational table containing information about restaurants, with attributes(e.g. price, quality, location)
 f: sum(price, quality, dist(location,my_hotel))
 attribute value ranges are usually normalized
 E.g., all values in a (0,1) range
 otherwise some attribute may be favored in f
Topk Query Variants

Apply on single table, or ranked lists of tuples ordered by individual attributes

Ranked inputs in the same or different servers (centralized or distributed data)

Standalone query or operator in a more complex query plan

Incremental retrieval of objects with highest scores (k is not predefined)

Topk joins
SELECT h.id, s.id
FROM House h School s
WHERE h.location=s.location
ORDER BY h.price + 10 ∗ s.tuition
LIMIT 5

Probabilistic/approximate topk retrieval

Random and/or sorted accesses at ranked inputs
Topk Query Evaluation

Most solutions assume distributive, monotone aggregate functions (e.g. f=sum)
 distributive: f(x,y,z,w)= f(f(x,y),f(z,w))
 e.g., A+B+C+D = (A+B) + (C+D)
 monotone: if x<y and z<w, then f(x,z)<f(y,w)
 distributive: f(x,y,z,w)= f(f(x,y),f(z,w))

Solutions based on 1D ordering and merging sorted lists (rank aggregation)

Solutions based on multidimensional indexing
Rank Aggregation
 Solutions based on 1D ordering and merging sorted lists (rank aggregation)
 Assume that there is a total ranking of theobjects for each attributethat can be used in topkqueries
 These sorted inputs canbe accessed sequentiallyand/or by random accesses
Advantages and Drawbacks
 Advantages：
 can be applied on any subset of inputs (arbitrary subspace)
 appropriate for distributed data
 appropriate for topk joins
 easy to understand and implement
 Drawbacks:
 slower than indexbased methods
 require inputs to be sorted
TA: Threshold Algorithm
Introduction
 Iteratively retrieves objects and their atomic scores from the ranked inputs in a roundrobin fashion.
 For each encountered object x, perform random accesses to the inputs where x has not been seen.
 Maintain topk objects seen so far.
 T = f($l_1$, . . . , $l_m$) is the score derived when applying the aggregation function to the last atomic scores seen at each input. If the score of the kth object is no smaller than T, terminate.
Example of TA(k=1,f=sum)
 STEP 1
 top1 is c, with score 2.0
 T=sum(0.9,0.9,0.9)=2.7
 T>top1, we proceed to another round of accesses
 STEP 2
 top1 is b, with score 2.2
 T=sum(0.8,0.8,0.9)=2.5
 T>top1, we proceed to another round of accesses
 STEP 3
 top1 is b, with score 2.2
 T=sum(0.6,0.6,0.8)=2.0
 T≤top1, terminate and output (b,2.2)
Properties of TA
 Used as a standard module for merging ranked lists in many applications
 Usually finds the result quickly
 Depends on random accesses, which can be expensive
 random accesses are impossible in some cases
 e.g., an API allows to access objects incrementally by ranking score, but does not provide the score of a given object
NRA: No Random Accesses
Introduction
 Iteratively retrieves objects and their atomic scores from the ranked inputs in a roundrobin fashion.
 For each object x seen so far at any input maintain:
 f_x_ub: upper bound for x’s aggregate score (f_x)
 f_x_lb: lower bound for x’s aggregate score (f_x)
 W_k = k objects with the largest f^lb.
 If the smallest f^lb in W_k is at least the largest f_x_ub of any object x not in W_k, then terminate and report W_k as topk result.
Example of NRA(k=1,f=sum)
 STEP 1
 STEP 2
 STEP 3
 STEP 4
NRA Properties
 More generic than TA, since it does not depend on random accesses
 Can be cheaper than TA, if random accesses are very expensive
 NRA accesses objects sequentially from all inputs and updates the upper bounds for all objects seen so far unconditionally.
 Cost: O(n) per access (the expected distinct number of objects accessed so far is O(n))
 No input list is pruned until the algorithm terminates
LARA: LAtticebased Rank Aggregation
 LARA: An efficient NRA implementation
 Based on 3 observations about the topk candidates
 Operates differently in the two (growing, shrinking) phases
 Takes its name from the lattice used in the shrinking phase
 Extendable to various topk query variants