Experiments on Approximate Nearest-Neighbor Search using kd-Trees

(In headings, should I write kd-trees like "Kd-Trees", or "KD-Trees", or "kd-Trees"?)

In this set of experiments, I wanted to determine under what circumstances and parameters the approximate nearest-neighbor search using a kd-tree is significantly faster (in practice) than the exact nearest neighbor search using a kd-tree. As you know, there is always the curse of dimensionality.

If you just want the source code, jump to the bottom of this page.

Motivation

Inspired by the papers

Dickerson, Duncan, Goodrich: K-D Trees are Better when Cut on the Longest Side; ESA 2000 (LNCS 1879)
Arya, Mount, Netanyahu, Silverman, Wu: An Optimal Algorithm for Approximate Nearest Neighbor Searching in Fixed Dimensions; Journal of the ACM, Volume 45 Issue 6, Nov. 1998

I decided to test the performance of the "standard" approximate nearest neighbor (ANN) search algorithm by experiments.

The "(1+ε)-nearest neighbor" of a query point q is a point p^o ∈ S, such that || p^o - q || ≤ (1+ε) || p^* - q ||. Such a point p^o is called an "approximate nearest neighbor" to q (there might be many).

The "standard approximate nearest-neighbor search algorithm" is as follows (in Python):


    def approx_nn( self, query, epsilon ):
        cell_region = self._cell_region                         # root cell region
        queue = []                                              # p-queue of nodes, sorted by distance to query
        node = self
        current_nn = None                                       # current candidate for NN, 
        current_d = sys.float_info.max                          # start with "infinite" point

        while cell_region.dist(query) < current_d/(1.0 + epsilon):

            # descend into closest leaf
            while node._split_axis is not None:
                left_region, right_region = cell_region.split( node._split_axis, node._split_value )
                dl = left_region.dist( query )
                dr = right_region.dist( query )
                if dl < dr:                                     # left child is closer
                    heappush( queue, (dr, node._right, right_region) )
                    node = node._left
                    cell_region = left_region
                else:                                           # right child is closer
                    heappush( queue, (dl, node._left, left_region) )
                    node = node._right
                    cell_region = right_region

            # we are now at a leaf
            d = query.dist( node._point )
            if d < current_d:
                current_nn = node._point
                current_d = d

            if not queue:
                break                                           # last node was processed

            # process next node, which is the closest of all unprocessed yet
            dn, node, cell_region = heappop( queue )
        # end while

        return current_d, current_nn
    # end def approx_nn

(Note: this code is not optimized! It was written for clarity.)

One reason for my experiments was that Dickerson, Duncan, and Goodrich show in their paper that this standard algorithm, when used on "longest side" kd-trees, has a worst-case complexity of O( ¹⁄_ε^d-1 log^d n ). So I wanted to see, how bad the influence of the ε really is, and what other "hidden" constants there might be in the Big-O.

Another reason was that a lot of people seem to use the ANN Library, developed by Mount and Arya. However, the data structure used in that library (AFAIK) is a much more complicated variant of kd-trees, which they call BBD-tree (many people seem to call it "ANN kd-tree", which makes no sense). And since they have provided the results of some experiments with their BBD-trees in their paper, I wanted to see whether or not the standard (i.e., "longest side") kd-trees would perform as well.
It would be interesting to see how many leaves a typical ANN search using BBD-trees visits, but I doubt that they give a substantial performance gain. (Please share any experience you might have with me.)

And yet another reason was that I wanted to see for myself at which point the "curse of dimensionality" begins to exert its influence in the case considered here.

The Experiments

In all experiments, I have generated a number of random points according to some distribution inside the d-dimensional unit cube, built a kd-tree according to the "longest side" splitting rule (this is called "Median_of_rectangle" rule in CGAL), generated 100 random query points inside the unit cube, and performed an exact nearest-neighbor query as well as an approximate nearest neighbor query.

With each query, I have counted the number of leaves that the query algorithm visits. And finally, I have plotted the average number of leaves visited, depending on the dimension of the points.

The three distributions I have used are the following:
Uniform: each coordinate of each point is indipendently and identically distributed; the values are generated using Python's uniform random number generator.
Clustered: the set of points consists of a number of clusters; each cluster consists of 1000 points, each of which is distributed around a cluster center according to a normal distribution (sigma²=0.001); each cluster center is a random point that is generated using a uniform distribution.
Correlated: the set of points is created using an auto-correlation; the first point is randomly picked inside the unit cube; all following points are generated according to
p_i+1 = 0.9 * p_i + 0.1 * w,
where p, w ∈ R^d,
and w is a "noise vector" from a Gaussian Distribution; after all points have been created, the whole set is scaled such that its bounding box equals the unit cube.
You can find more details about the distributions in the source code.

I believe, the results apply to the nearest neighbor search algorithms / kd-trees of CGAL as well.

Results

In the following plots, a "leaves fraction" of 1.0 means that the algorithm visited all leaves of the kd-tree on average.
For ε, a value of 1.0 means that the ANN could be twice as far away from the query than the NN.
All leaf counts (whether fractions or absolute numbers) are averages over 100 experiments, each with a new random query point.
All images are clickable, leading to a larger plot.

Performance = f(n), ε = 0.1, different dimensions, clustered distribution

Performance = f(d), 10k points, ε = 0.1, different distributions

Obviously, with only 10,000 points, the "curse of dimensionalty" begins to hit for dimensions ≥ 15; and with ε = 0.1, there is not really much to be gained by an ANN search.

Performance = f(d), 10k points, ε = 0.3, different distributions

With ε = 0.3 (and 10,000 points), ANN search actually offers a significant performance gain over NN search in the dimension range ~7, ..., ~22.
In these plots, the red curves (fraction of leaves visited for exact NN search) should, of course, be the same as in the plots above with ε = 0.1; but they can slightly deviate due to the point sets being random.

Performance = f(d), 10k points, ε = 1.0, different distributions

With ε = 1.0 (and 10,000 points), the performance gain is quite substantial even for dimensions around 40; but then, the ANN could be twice as far away from the query as the NN.

Performance = f(d), 100k points, ε = 0.1, different distributions

With 100k points, the curse of dimensionality seems to kick in a bit "later", i.e., at dimensions around 10-15, while with 10k points, it started around dimension 5-10. But there seems to be still not much performance gain of ANN over exact NN.

Performance = f(d), 500k points, ε = 0.1, different distributions

Performance = f(ε), 10k points, different dimensions, clustered distribution

In these plots, the red curve (fraction of leaves visited for exact NN search) should actually be constant, but it is not due to random data sets.

Downloads

Here is a tar archive, containing everything you need, in order to reproduce the experiments.

It contains the Python program for building a kd-tree over a set of d-dimensional points and searching the nearest neighbor, and the approximate nearest neighbor. It contains also the same code again, but instrumented with some statistics gathering code. In addition, there is a Python class for d-dimensional vectors. And, finally, it contains all the shell scripts that I wrote for running the experiments (on a Linux cluster), and for plotting the results.

Contact

If you have any comments, questions, or bug reports, I would like to hear from you.
You can send me an email at: zach at informatik.uni-bremen.de.

Gabriel Zachmann
Last modified: Fri Jun 24 16:25:11 MDT 2011