Spatial Data Management
Concepts
Spatial Data
 Location data
 Checkin service
 Online Maps
 Locationbased services
 Location tracking
 Traffic Data
Spatial Databases
 PostgreSQL with PostGIS
 Neo4Jspatial
 HadoopGIS
 Ingres
 GeoMesa
Spatial Data Management
 Spatial Database Systems
 Manage large collections of multidimensional objects (2D/3D)
 A spatial object
 Contains (at least) one spatial attributes that describes its location and/or geometry
 A spatial relation
 Is an organized collection of spatial objects of the same entity
Spatial Data
Representation
 Points (Cities in largescale map)
 Extent (rivers, forest, etc.)
 Vector (approximation by geometric objects)
 Raster (A set of pixels in the grid)
Application
 Spatial data
 GIS
 Segemented images
 Components of CAD constructs or VLSI circuit
 Stars on the sky
 …
 Spatial database
 Users of mobile devices
 Geographers, life scientists
 …
Features of spatial
 Dimensionality
 There is no total ordering of objects in the multidimensional space that preserves spatial proximity
 Complex spatial extent
 No standard definitions of spatial operations and algebra
Relationa indexes (like B+ trees) and query processing methods (sortmerge join, hashjoin) are not applicable
Spatial access methods (SAMs) for spatial data have to be defined
 Index spatial objects
 Facilitate efficient processing of simple spatial query types (e.g. range queries)
Spatial Relationships
A spatial relationship associates two objects according to their relative location and extent in space. Sometimes also called “spatial relations”.
Can refer to a database relation which stores spatial objects.
Classification
 Topological relationships
 Distance relationships
 Directional relationships
Topological relationships
Each object is characterized by the space it occupies in the universe (A set of pixels).
A set of relationsips between their boundaries and interiors
 Boundary
 Interior (some may not have, points, line segments, etc.)
A hierarchy of relations
 intersect (or overlaps)
 equals
 inside
 contains
 adjacent
 disjoint
Distance relationships
Associate two objects based on their geometric (Euclidean distance), and it’s usually abstracted into human mind.
Distance relationships are expressed either explicitly or by some abstract distance class.
Directional relationships
Associates two object based on their relative orientation according to a global reference system.
Spatial Queries
Applied on one (or more) spatial relations to retrieve objects staisfying some spatial relationships
 Nearest neighbor query
 Spatial join
 Range query
 Spatial selction
 window query
Spatial Query Processing
Evaluating spatial relationships on geometric data is slow.
A spatial object is approximated by its minimum bounding rectangle (MBR)
Process
 Filter: The MBR is tested against the query predicate
 Refinement: The exact geometry of objects that pass the filter step is tested for qualification
Spatial Access Methods (SAMs)
The problem of indexing spatial data
 No dynamic access method with good theoretical worstcase guarantees for range queries
SAMs aim at the minimization of the expected cost.
 Indexing of multidimensional points
Point access methods
Divide the apce into disjoint partitions and group the points according to the regions they belong
Not effective for extended objects (may need to be clipped into several parts which leads to data redundancy and affects performance negatively).
Object clipping can be avoided if we allow the regions of object to overlap.
Optimization
 Group the objects below into 3 groups of 4 objects each such that the MBRs of the groups have the minimum overlap
 Hard optimization problem
The Rtree
Concept
 Group object MBRs to disk blocks hierarchically
 Each group of object is a leaf of the tree
 The MBRs of the leaf nodes are grouped to form nodes at the next level
 Grouping is recursively applied at each level until a single group (the root) is formed
Elements
 Leaf node entries: <MBR, objectid>, all leaves are in same level
 Nonleaf node entries: <MBR, ptr>, pointing to entries
 Root: have at least two children
 Nonroot node parameters
 M
 m
 m <= M/2
 Usually m = 0.4 M
Range searching using an Rtree
Range_query (query W, Rtree node n)
 If n is not a leaf node
 For each index entry e in n such that e.MBR intersects W
 Visit node n’ pointed by e.ptr
 Range_query (W, n')
 For each index entry e in n such that e.MBR intersects W
 If n is a leaf
 For each index entry e in n such that e.MBR intersects W
 Visit object o pointed by e.objectid
 Test range query against exact geometry of o; If o intersects W, report o
 For each index entry e in n such that e.MBR intersects W
 May follow multiple paths during search
 Different search predicates are used for different realtionships with W
Construction of the Rtree
 Dynamically constructed/maintained
 Insertions/deletions interleave with search operations
 Insertion similiar to B+ Tree, but with special optimization algorithms
 Choose the path where a new MBR is inserted
 Split overflow nodes
 Underflows in deletions
 Deleting the underflow leaf node
 Reinsert the remaining entries
 Insertion similiar to B+ Tree, but with special optimization algorithms
R*tree
Only different in the insertion algorithm (compared to Rtree), aiming at constructing a tree of high quality
A good tree
 nodes with small MBRs
 nodes with small overlap
 nodes that look like squares
 nodes as full as possible
Optimization
 Minimize the area covered by an index rectangle (small area means small dead space)
 Minimize overlap between node MBRs (Minimizes the number of traversed paths)
 Minimize the margins of node MBRs (Squarelike nodes, smaller number of intersections for a random query, better structure)
 Optimize the storage utilization
 Nodes in tree should be filled as much as possible
 Minimizes tree height and potentially decreases dead space
 Insertion heuristics (Select the path)
 Least MBR enlargement after insertion
 Least MBR overlap after insertion
Node Spliting
Determine the split axis
 For each axis (i.e. x and y axis)
 Sum=0;
 sort entries by the lower value, then by upper value
 for each sorting (e.g. lower value)
 for k=m to M+1m
 place first k entries in group A, and the remaining ones in group B

 Sum = Sum + margin(A) + margin(B)
 Choose axis with the minimum Sum
Distribute entries along axis
 Along the split axis, choose the distribution with minimum overlap
 If there are multiple groupings with minimal overlap choose <A,B> such that area(A)+area(B) is minimized
Insertion heuristics: Forced Reinsert
 Forced Reinsert
 When R*tree node n overflows, instead of splitting n immediately, try to see if some entries in n could possibly fit better in another node
 Find the 30% furthest entries from the center of the group
 Reinsert them to the tree (not to be repeated if another overflow occurs)
 Slightly more expensive, but better tree structure:
 less overlap
 more space is utilized (more full nodes)
Bulkloading Rtrees
Given a static set S of rectangles, build an Rtree that indexes S.
 Method 1: iteratively insert rectangles into an initially empty tree
 Feature
 tree reorganization is slow
 tree nodes are not as full as possible: more space occupied for the tree
 Feature
 Method 2 (xsorting): bulkload the rectangles into the tree using some fast (sort or hashbased) process
 sort rectangles using the xcoordinate of their center
 pack M consecutive rectangles in leaf nodes
 build tree bottomup
 Feature
 Rtree is built fast
 good space utilization
 results in leaf nodes that are have long stripes as MBRs
 Method 3 (Hilbert sorting): use a spacefilling curve to order the rectangles
 much better structure, but still the nodes have large overlap
 Method 4 (sorttilerecursive): Sort using one axis first and then groups of sqrt(n) rectangles using the other axis
 Usually the best structure compared to other bulkloading methods
K Nearest Neighbor Search
Given a spatial relation R, a query object q, and a number k <R, find the knearest neighbors of q in R.
We can have more than one kNN sets (with multiple possible equidistant furthest points in them).
Distance measures and MBRs
Distances between MBRs lowerbound the distances between the corresponding objects
dist(MBR(oi),MBR(oj)) ≤ dist(oi, oj)
Distances between Rtree node MBRs lowerbound the distances between the entries in them
The distance between a query object q and an Rtree node MBR lowerbounds the distances between q and the objects indexed under this node
Depthfirst NN search using an Rtree
 Start from the root and visit the node nearest to q
 Continue recursively, until a leaf node nl is visited.
 Find the NN of q in nl.
 Continue visiting other nodes after backtracking as long there are nodes closer to q than the current NN.
 Large space can be pruned by avoiding visiting Rtree nodes and their subtrees
 Should order the entries of a node in increasing distance from q to maximize potential for a good NN found fast
 Can be easily adapted for kNN search
 Requires at most one tree path to be currently in memory – good for small memory buffers
 Characteristic of all depthfirst search algorithms
 Recall that the range search algorithm is also DF
 However, does not visit the least possible number of nodes
 Also, not incremental – more on this later…
1. visit root
dist(q,M1)<dist(q,oNN)
must visit node M1
2. visit M1
dist(q,m1)<dist(q,oNN)
must visit node m1
3. visit m1
check a,b,c
found new NN:
oNN = a, dist(q,oNN) = sqrt(5)
4. backtrack to M1
check m2dist(q,m2) = 3 >= sqrt(5):
No need to visit node m2
check m3dist(q,m3) = sqrt(5) >= sqrt(5):
No need to visit node m3
5. backtrack to root
check M2dist(q,M2) = sqrt(2) < sqrt(5):
must visit node M2
6. visit M2
check m4dist(q,m4) = sqrt(2) < sqrt(5):
must visit node m4
7. visit m4
check i,j,k
found new NN:
oNN = k, dist(q,oNN) = sqrt(2)
8. backtrack to M2
check m5dist(q,m5) >= sqrt(2):
No need to visit node m5
check m6dist(q,m6) >= sqrt(2):
No need to visit node m6
9. backtrack to root
check M3dist(q,M3) >= sqrt(2):
No need to visit node M3
10. backtrack from root
Algorithm terminates
oNN =k with dist(q,oNN)= sqrt(2) found
Bestfirst NN search
Put all entries in a priority queue and always “open” the closest one, independently of the node that contains it.
Thus the best (i.e., closest) entry is always visited first.
 A more efficient algorithm (given large enough memory)
 Optimal in the number of Rtree nodes visited for a given query q
 Uses a priority queue to organize seen entries and prioritize the next node to be visited
 Adaptable for kNN search and incremental NN search
 In the previous example, we have visited fewer nodes compared to DFNN algorithm
 Only nodes whose MBR intersect the disk centered at q with radius the real NN distance are visited (see if you can you prove this)
 The algorithm can be adapted for incremental NN search
 After having found the NN can we easily (incrementally) find the next NN without starting search from the beginning?
 put objects on the heap
 never prune, but wait until an object comes out
 After having found the NN can we easily (incrementally) find the next NN without starting search from the beginning?
 The algorithm can be used for kNN search
 use a second heap to organize the NN found so far (same can be done for DFNN)
 no need if we just use the inc. version of the algorithm
 … but: The heap can grow very large until the algorithm terminates
Step 1: put all entries of root on heap Q
Q = M1(1), M2(sqrt(2)), M3(sqrt(8))
Step 2: get closest entry (top element of Q):
M1(1). Visit node M1. Put all entries of
visited node on heap Q
Q = M2(sqrt(2)), m1(sqrt(5)), M3(sqrt(5)), M3(sqrt(8)), m2(3)
Step 3: get closest entry (top element of Q):
M2(sqrt(2)). Visit node M2. Put all entries of
visited node on heap Q
Q =m4(sqrt(2)), m1(sqrt(5)), M3(sqrt(5)), M3(sqrt(8)), m2(3), m5(sqrt(13)), m5(sqrt(17))
Step 4: get closest entry (top element of Q):
m4(sqrt(2)). Visit node m4. m4 is a leaf node, so update NN if some object in m4 is closer than the current NN:
oNN = k, dist(q,oNN)= sqrt(2)
Q =m1(sqrt(5)), M3(sqrt(5)), M3(sqrt(8)), m2(3), m5(sqrt(13)), m5(sqrt(17))
Step 5: get closest entry (top element of Q):
m1(sqrt(5)). Since sqrt(5) >= dist(q,oNN)= sqrt(2), search stops and oNN is returned as the NN of q
incremental NN search
 Example 1: find the nearest large city (>10,000 residents) to my current position
 Solution 1:
 find all large cities
 apply NN search on the result
 could be slow if many such cities
 also Rtree may not be available for large cities only
 Solution 2:
 incrementally find NN and check if the large city requirement is satisfied; if not get the next NN
 Solution 1:
 Example 2: find the nearest hotel; see if you like it; if not get the next one; see if you like it; …
Spatial Joins
Most algorithms focus on the efficient processing of the filter step.
Most spatial predicates on actual objects reduce to intersection of MBRs in the filter step. Thus all algorithms consider mainly the intersect predicate.
Types
 intersection joins
 Semijoin: Find the cities that intersect a river
 Similarity join: Find pairs of hotels, restaurants close to each other (with distance smaller than 100m)
 Closest pairs: Find the closest pair of hotels, restaurants
 AllNN: For each hotel find the nearest restaurant
 Iceberg distance join: Find hotels close to at least 10 restaurants
Three categories of spatial join algorithms
 Both inputs are indexed (e.g., synchronized tree traversal)
 One input is indexed (e.g., indexed nested loops)
 Neither input is indexed (e.g., spatial hash join)
Rtree (Intersection) Join
Applies on two Rtrees of spatial relations R and S
Node MBRs at the high level of the trees can prune object combinations to be checked
This pseudocode version assumes that the trees have same height
Example
 run for root(RA), root(RB)
 for every intersecting pair there (e.g., A1, B1) run recursively for pointed nodes
 intersecting pairs of leaf nodes are qualifying object MBR pairs
Optimization
space restriction
 If an entry in n1 does not intersect the MBR of n2 it may not intersect any entry in n2.
 Perform two scans in n1 and n2 to prune such entries
plane sweep
 Sort entries in both nodes on their lowerx value (lower bound of xprojection)
 Sweep a line to find fast all entry pairs that qualify xintersection
 for each of them check yintersection
 Worstcase suboptimal. But very effective on the average
 Worstcase optimal algorithms require advanced data structures for yintersection. Large hidden constants, thus high cost for this problem size
 Can be used with other spatial join algorithms
Rtree join
 The most efficient algorithm (assuming that the relations are indexed)
 Cannot be used for nonindexed inputs
 unless we build onthefly Rtrees
 Comes with some I/O scheduling techniques for minimizing the page accesses
Joining nonindexed inputs
Spatial hash join
Partition based spatial merge join
Indexed Nested Loops
Seeded tree join and Bulkload and Match build an onthefly Rtree
Slotindex spatial join applies hashjoin using the entries of a high Rtree level
The refinement step
 Step 1: find MBR pairs that intersect
 Step 2: compare some more detailed approximations to make conclusions (a.k.a. geometric filter)
 conservative approximations
 e.g., convex hull
 progressive approximation
 e.g., maximum enclosed rectangle
 conservative approximations
 Step 3: if still join predicate inconclusive, perform expensive refinement step
 can be processed by computational geometry algorithms
 Multistep processing (Rtree join as example)