iBet uBet web content aggregator. Adding the entire web to your favor.
iBet uBet web content aggregator. Adding the entire web to your favor.



Link to original content: https://doi.org/10.1145/3375395.3387649
On the I/O Complexity of the k-Nearest Neighbors Problem | Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems skip to main content
10.1145/3375395.3387649acmconferencesArticle/Chapter ViewAbstractPublication PagespodsConference Proceedingsconference-collections
research-article

On the I/O Complexity of the k-Nearest Neighbors Problem

Published: 14 June 2020 Publication History

Abstract

We consider static, external memory indexes for exact and approximate versions of the k-nearest neighbor (k-NN) problem, and show new lower bounds under a standard indivisibility assumption: Polynomial space indexing schemes for high-dimensional k-NN in Hamming space cannot take advantage of block transfers: í(k) block reads are needed to to answer a query. For the l∞ metric the lower bound holds even if we allow c-appoximate nearest neighbors to be returned, for c ∈ (1, 3). The restriction to c < 3 is necessary: For every metric there exists an indexing scheme in the indexability model of Hellerstein et al. using space O(kn), where n is the number of points, that can retrieve k 3-approximate nearest neighbors using optimal ⌈k/B⌉ I/Os, where B is the block size. For specific metrics, data structures with better approximation factors are possible. For k-NN in Hamming space and every approximation factor c>1 there exists a polynomial space data structure that returns k c-approximate nearest neighbors in ⌈k/B⌉ I/Os. To show these lower bounds we develop two new techniques: First, to handle that approximation algorithms have more freedom in deciding which result set to return we develop a relaxed version of the λ-set workload technique of Hellerstein et al. This technique allows us to show lower bounds that hold in d ≥ n dimensions. To extend the lower bounds down to d = O(k log(n/k)) dimensions, we develop a new deterministic dimension reduction technique that may be of independent interest.

References

[1]
Peyman Afshani. 2012. Improved pointer machine and I/O lower bounds for simplex range reporting and related problems. In Proceedings of 28th Symposium on Computational Geometry (SoCG). ACM, 339--346.
[2]
Peyman Afshani, Lars Arge, and Kasper Dalgaard Larsen. 2009. Orthogonal range reporting in three and higher dimensions. In 50th IEEE Symposium on Foundations of Computer Science (FOCS). IEEE, 149--158.
[3]
Alok Aggarwal and Jeffrey S Vitter. 1988. The input/output complexity of sorting and related problems. Commun. ACM, Vol. 31 (1988), 1116--1127. Issue 9.
[4]
Josh Alman and Ryan Williams. 2015. Probabilistic Polynomials and Hamming Nearest Neighbors. In Proceedings of 56th symposium on Foundations of Computer Science (FOCS). 136--150. https://doi.org/10.1109/FOCS.2015.18
[5]
Alexandr Andoni, Piotr Indyk, and Ilya Razenshteyn. 2018. Approximate nearest neighbor search in high dimensions. arXiv preprint 1806.09823 (2018). Also appears in proceedings of ICM 2018.
[6]
Alexandr Andoni, Thijs Laarhoven, Ilya P. Razenshteyn, and Erik Waingarten. 2017. Optimal Hashing-based Time-Space Trade-offs for Approximate Near Neighbors. In Proceedings of 28th ACM-SIAM Symposium on Discrete Algorithms (SODA). 47--66. https://doi.org/10.1137/1.9781611974782.4
[7]
Lars Arge, Vasilis Samoladas, and Ke Yi. 2009. Optimal external memory planar point enclosure. Algorithmica, Vol. 54, 3 (2009), 337--352.
[8]
Artem Babenko and Victor Lempitsky. 2016. Efficient indexing of billion-scale datasets of deep descriptors. In Proceedings of IEEE Conference on Computer Vision and Pattern Recognition. 2055--2063.
[9]
Artem Babenko, Anton Slesarev, Alexandr Chigorin, and Victor Lempitsky. 2014. Neural codes for image retrieval. In European conference on computer vision. Springer, 584--599.
[10]
Bahman Bahmani, Ashish Goel, and Rajendra Shinde. 2012. Efficient distributed locality sensitive hashing. In Proceedings of 21st ACM international conference on information and knowledge management (CIKM). ACM, 2174--2178.
[11]
Omer Barkol and Yuval Rabani. 2002. Tighter Lower Bounds for Nearest Neighbor Search and Related Problems in the Cell Probe Model. J. Comput. Syst. Sci., Vol. 64, 4 (June 2002), 873--896. https://doi.org/10.1006/jcss.2002.1831
[12]
Stefan Berchtold, Christian Böhm, Daniel A. Keim, and Hans-Peter Kriegel. 1997. A Cost Model For Nearest Neighbor Search in High-Dimensional Data Space. In Proceedings of 6th ACM Symposium on Principles of Database Systems (PODS) .
[13]
Allan Borodin, Rafail Ostrovsky, and Yuval Rabani. 1999. Lower bounds for high dimensional nearest neighbor search and related problems. In Proceedings of 31st Symposium on Theory of Computation (STOC). 312--321.
[14]
Amit Chakrabarti and Oded Regev. 2004. An Optimal Randomised Cell Probe Lower Bound for Approximate Nearest Neighbour Searching. In Proceedings of 45th IEEE Symposium on Foundations of Computer Science (FOCS '04). 473--482. https://doi.org/10.1109/FOCS.2004.12
[15]
Aristides Gionis, Piotr Indyk, and Rajeev Motwani. 1999. Similarity Search in High Dimensions via Hashing. In Proceedings of 25th International Conference on Very Large Data Bases (VLDB). Morgan Kaufmann, 518--529. http://www.vldb.org/conf/1999/P49.pdf
[16]
Sariel Har-Peled, Piotr Indyk, and Rajeev Motwani. 2012. Approximate Nearest Neighbor: Towards Removing the Curse of Dimensionality. Theory of Computing, Vol. 8, 1 (2012), 321--350. https://doi.org/10.4086/toc.2012.v008a014
[17]
Joseph Hellerstein, Elias Koutsoupias, Daniel Miranker, Christos Papadimitriou, and Samoladas Vasilis. 2002. On a Model of Indexability and Its Bounds for Range Queries. J. ACM, Vol. 49, 1 (2002), 35--55.
[18]
Joseph Hellerstein, Elias Koutsoupias, and Christos Papadimitriou. 1997. On the Analysis of Indexing Schemes. In Proceedings of 16th ACM Symposium on Principles of Database Systems (PODS). ACM, 249--256.
[19]
Joseph.M. Hellerstein, Jeffrey. F. Naughton, and Avi Pfeffer. 1995. Generalized Search Tree for Database Systems. In Proceedings of 21th International Conference on Very Large Data Bases (VLDB). ACM, 562--573.
[20]
Xiaocheng Hu, Miao Qiao, and Yufei Tao. 2014. Independent range sampling. In Proceedings of 33rd ACM Symposium on Principles of database systems (PODS). ACM, 246--255.
[21]
Xiao Hu, Ke Yi, and Yufei Tao. 2019. Output-Optimal Massively Parallel Algorithms for Similarity Joins. ACM Transactions on Database Systems, Vol. 44, 2 (April 2019), 1--36. https://doi.org/10.1145/3311967
[22]
Piotr Indyk. 2001. On approximate nearest neighbors under $l_infty$ norm. J. Comput. System Sci., Vol. 63, 4 (2001), 627--638.
[23]
Piotr Indyk. 2007. Uncertainty principles, extractors, and explicit embeddings of L2 into L1. In Proceedings of 39th ACM Symposium on Theory of Computing (STOC). 615--620. https://doi.org/10.1145/1250790.1250881
[24]
Michael Kapralov. 2015. Smooth tradeoffs between insert and query complexity in nearest neighbor search. In Proceedings of 34th ACM Symposium on Principles of Database Systems (PODS). ACM, 329--342.
[25]
Eyal Kushilevitz, Rafail Ostrovsky, and Yuval Rabani. 2000. Efficient search for approximate nearest neighbor in high dimensional spaces. SIAM J. Comput., Vol. 30, 2 (2000), 457--474.
[26]
Kasper Green Larsen and Jelani Nelson. 2017. Optimality of the Johnson-Lindenstrauss lemma. In 2017 IEEE 58th Annual Symposium on Foundations of Computer Science (FOCS). IEEE, 633--638.
[27]
Kasper Green Larsen and Freek Van Walderveen. 2013. Near-optimal range reporting structures for categorical data. In Proceedings of 24th ACM-SIAM symposium on discrete algorithms (SODA). SIAM, 265--276.
[28]
Jivri Matousek. 1996. On the distortion required for embedding finite metric spaces into normed spaces. Israel Journal of Mathematics, Vol. 93, 1 (1996), 333--344.
[29]
Marius Muja and David G Lowe. 2014. Scalable nearest neighbor algorithms for high dimensional data. IEEE transactions on pattern analysis and machine intelligence, Vol. 36, 11 (2014), 2227--2240.
[30]
Anna Östlin and Rasmus Pagh. 2002. One-Probe Search. In Proceedings of 29th international colloquium on automata, languages and programming (ICALP). 439--450.
[31]
Rina Panigrahy, Kunal Talwar, and Udi Wieder. 2010. Lower bounds on near neighbor search via metric expansion. In 2010 IEEE 51st Symposium on Foundations of Computer Science. IEEE, 805--814.
[32]
Vladimir Pestov. 2013. Lower bounds on performance of metric tree indexing schemes for exact similarity search in high dimensions. Algorithmica, Vol. 66, 2 (2013), 310--328.
[33]
Vladimir Pestov and Aleksandar Stojmirović. 2006. Indexing schemes for similarity search: An illustrated paradigm. Fundamenta Informaticae, Vol. 70, 4 (2006), 367--385.
[34]
Aviad Rubinstein. 2018. Hardness of approximate nearest neighbor search. In Proceedings of 50th ACM SIGACT Symposium on Theory of Computing, STOC 2018, Los Angeles, CA, USA, June 25--29, 2018, Ilias Diakonikolas, David Kempe, and Monika Henzinger (Eds.). ACM, 1260--1268. https://doi.org/10.1145/3188745.3188916
[35]
Micha Streppel and Ke Yi. 2011. Approximate range searching in external memory. Algorithmica, Vol. 59, 2 (2011), 115--128.
[36]
Yufei Tao, Ke Yi, Cheng Sheng, and Panos Kalnis. 2010. Efficient and accurate nearest neighbor and closest pair search in high-dimensional space. ACM Trans. Database Syst., Vol. 35, 3 (2010), 20:1--20:46. https://doi.org/10.1145/1806907.1806912
[37]
Zhewei Wei, Ke Yi, and Qin Zhang. 2009. Dynamic external hashing: The limit of buffering. In Proceedings of 21st symposium on Parallelism in algorithms and architectures. ACM, 253--259.
[38]
Ke Yi. 2009. Dynamic indexability and lower bounds for dynamic one-dimensional range query indexes. In Proceedings of 28th ACM Symposium on Principles of Database Systems (PODS). ACM, 187--196.

Cited By

View all
  • (2024)Are There Fundamental Limitations in Supporting Vector Data Management in Relational Databases? A Case Study of PostgreSQL2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00280(3640-3653)Online publication date: 13-May-2024

Index Terms

  1. On the I/O Complexity of the k-Nearest Neighbors Problem

                          Recommendations

                          Comments

                          Information & Contributors

                          Information

                          Published In

                          cover image ACM Conferences
                          PODS'20: Proceedings of the 39th ACM SIGMOD-SIGACT-SIGAI Symposium on Principles of Database Systems
                          June 2020
                          480 pages
                          ISBN:9781450371087
                          DOI:10.1145/3375395
                          • General Chair:
                          • Dan Suciu,
                          • Program Chair:
                          • Yufei Tao,
                          • Publications Chair:
                          • Zhewei Wei
                          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

                          Sponsors

                          Publisher

                          Association for Computing Machinery

                          New York, NY, United States

                          Publication History

                          Published: 14 June 2020

                          Permissions

                          Request permissions for this article.

                          Check for updates

                          Author Tags

                          1. I/O complexity
                          2. indexability model
                          3. nearest neighbors

                          Qualifiers

                          • Research-article

                          Funding Sources

                          Conference

                          SIGMOD/PODS '20
                          Sponsor:

                          Acceptance Rates

                          Overall Acceptance Rate 642 of 2,707 submissions, 24%

                          Contributors

                          Other Metrics

                          Bibliometrics & Citations

                          Bibliometrics

                          Article Metrics

                          • Downloads (Last 12 months)12
                          • Downloads (Last 6 weeks)4
                          Reflects downloads up to 05 Nov 2024

                          Other Metrics

                          Citations

                          Cited By

                          View all
                          • (2024)Are There Fundamental Limitations in Supporting Vector Data Management in Relational Databases? A Case Study of PostgreSQL2024 IEEE 40th International Conference on Data Engineering (ICDE)10.1109/ICDE60146.2024.00280(3640-3653)Online publication date: 13-May-2024

                          View Options

                          Get Access

                          Login options

                          View options

                          PDF

                          View or Download as a PDF file.

                          PDF

                          eReader

                          View online with eReader.

                          eReader

                          Media

                          Figures

                          Other

                          Tables

                          Share

                          Share

                          Share this Publication link

                          Share on social media