Thursday, January 15, 2015

Analysis of Locality Sensitive Hashing

Idea:
Locality Sensitive Hashing (LSH) is a popular indexing algorithm that can be used, for example, for image indexing in image retrieval. The key idea of LSH is to hash the points using several hash functions so as to ensure that, for each function, the probability of collision is much higher for objects that are close to each other than for those that are far apart. One can then find near neighbors by hashing the query point and retrieving the elements stored in the buckets containing it.

Definition: To solve an R-NN problem on a point set S with distance function d, a locality sensitive hash family can be described as follows. A family H = {h: S -> U} of hash functions is called (r1, r2, p1, p2)-sensitive if, for any v, q in S:
If v in B(q, r1), then Pr[h(q) = h(v)] >= p1;
If v not in B(q, r2), then Pr[h(q) = h(v)] <= p2.
Here B(q, r) = {v in S | d(v, q) <= r} and Pr denotes probability; for the family to be useful we need p1 > p2 and r1 < r2.
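A concrete example of such a family (my own illustration, not part of the definition above) is bit sampling for Hamming distance: h picks a random coordinate of a binary vector. Two vectors at Hamming distance d collide with probability exactly 1 - d/dim, which decreases with distance, so for any r1 < r2 the family is (r1, r2, 1 - r1/dim, 1 - r2/dim)-sensitive.

```python
import random

def make_bit_sampling_hash(dim, seed=None):
    """Return an LSH function for Hamming distance: h(v) = v[i] for a random coordinate i."""
    rng = random.Random(seed)
    i = rng.randrange(dim)
    return lambda v: v[i]

# For binary vectors q, v with Hamming distance d(q, v),
# Pr[h(q) = h(v)] = 1 - d(q, v) / dim, so closer points collide more often.
h = make_bit_sampling_hash(dim=8, seed=0)
q = [1, 0, 1, 1, 0, 0, 1, 0]
v = [1, 0, 1, 0, 0, 0, 1, 0]  # Hamming distance 1 from q
print(h(q), h(v))
```

Averaged over many draws of h, the pair above collides roughly 7/8 of the time, matching 1 - 1/8.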

Given parameters (r1, r2, p1, p2), we can amplify the gap between the “high” probability p1 and the “low” probability p2 by concatenating several functions. Define a function family G = {g: S -> U^k} such that g(v) = (h1(v), …, hk(v)), where hi in H. For an integer L, we choose L functions g1, …, gL from G, independently and uniformly at random. During preprocessing, we store each v in the bucket gi(v), for i = 1, …, L. Since the total number of buckets may be large, we retain only the non-empty buckets by resorting to standard hashing.

To process a query q, we search all buckets g1(q), …, gL(q); since it is possible (though unlikely) that the total number of points stored in those buckets is large, we interrupt the search after finding the first 3L points (including duplicates). Let v1, …, vt be the points encountered. For each vj, if vj in B(q, r2), then we return YES and vj; otherwise we return NO.
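The query routine can be sketched as below. It assumes a hypothetical index layout of L tables, each a dict mapping the tuple g_i(v) to a list of point indices (the same shape a bit-sampling preprocessing step would produce), and a caller-supplied distance function d.

```python
def lsh_query(q, gs, tables, points, r2, dist, L):
    """Search buckets g_1(q), ..., g_L(q), stopping after 3L candidates.

    Returns (True, point) for the first candidate within distance r2 of q,
    else (False, None). `dist` is the metric d from the definition;
    `gs[i]` holds the k coordinates defining g_i (a bit-sampling assumption).
    """
    seen = 0
    for g, table in zip(gs, tables):
        for idx in table.get(tuple(q[i] for i in g), ()):
            seen += 1
            if dist(points[idx], q) <= r2:
                return True, points[idx]
            if seen >= 3 * L:      # interrupt after 3L points (incl. duplicates)
                return False, None
    return False, None
```

The 3L cap bounds query time regardless of how skewed the buckets are, at the cost of a small extra failure probability.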

Problems:
  1. How should we select k and L? If k is large, the collision probability is reduced but the computation time grows; if k is small, it is hard to distinguish points that are far apart. Once k has been fixed, it is easy to find the optimal value of L that guarantees the fraction of false negatives is no more than a user-specified threshold.
  2. How can we save memory? In general, for each point in a bucket we would store the k-dimensional hash value and a pointer to the data (e.g., the image), which requires L*(k+1) words of memory per point in total. We can reduce this by not storing the hash value explicitly as the concatenation of k projections, but instead hashing these k values again into a single word (the bucket index); the requirement then drops to 2L words per point.
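The second-level hash in item 2 can be sketched as follows. The universal-hash form (a weighted sum modulo a prime) is my assumption; the post only says the k values are hashed in turn to a single word.

```python
def bucket_index(hash_values, table_size, coeffs):
    """Map the k-tuple (h_1(v), ..., h_k(v)) to a single word.

    Uses a universal hash sum_i a_i * h_i(v) mod P, then mod table_size
    (an illustrative choice). Storing only this one-word index plus a data
    pointer costs 2 words per point per table, i.e. 2L words in total,
    instead of L*(k+1).
    """
    P = (1 << 31) - 1   # Mersenne prime 2^31 - 1
    acc = 0
    for a, h in zip(coeffs, hash_values):
        acc = (acc + a * h) % P
    return acc % table_size
```

The coefficients a_i are drawn once at random and reused, so the same k-tuple always lands in the same bucket.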
