Abstract
We present a new similarity search library and discuss a variety of design and performance issues related to its development. We adopt a position that engineering is equally important to design of the algorithms and pursue a goal of producing realistic benchmarks. To this end, we pay attention to various performance aspects and utilize modern hardware, which provides a high degree of parallelization. Since we focus on realistic measurements, performance of the methods should not be measured using merely the number of distance computations performed, because other costs, such as computation of a cheaper distance function, which approximates the original one, are oftentimes substantial. The paper includes preliminary experimental results, which support this point of view. Rather than looking for the best method, we want to ensure that the library implements competitive baselines, which can be useful for future work.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Amato, G., Rabitti, F., Savino, P., Zezula, P.: Region proximity in metric spaces and its use for approximate similarity search. ACM Trans. Inf. Syst. 21(2), 192–227 (2003)
Amato, G., Savino, P.: Approximate similarity search in metric spaces using inverted files. In: Proceedings of the 3rd International Conference on Scalable Information Systems, InfoScale 2008, pp. 28:1–28:10. ICST (Institute for Computer Sciences, Social-Informatics and Telecommunications Engineering), Brussels (2008)
Bhattacharya, P., Neamtiu, I.: Assessing programming language impact on development and maintenance: a study on C and C++. In: 33rd International Conference on Software Engineering (ICSE), pp. 171–180 (2011)
Cayton, L.: Fast nearest neighbor retrieval for bregman divergences. In: Proceedings of the 25th International Conference on Machine Learning, ICML 2008, pp. 112–119. ACM, New York (2008)
Chávez, E., Navarro, G., Baeza-Yates, R., Marroquin, J.L.: Searching in metric spaces. ACM Computing Surveys 33(3), 273–321 (2001)
Chávez, E., Navarro, G.: Probabilistic proximity search: Fighting the curse of dimensionality in metric spaces. Information Processing Letters 85(1), 39–46 (2003)
Dong, W., Wang, Z., Josephson, W., Charikar, M., Li, K.: Modeling lsh for performance tuning. In: Proceedings of the 17th ACM Conference on Information and Knowledge Management, CIKM 2008, pp. 669–678. ACM, New York (2008)
Drepper, U.: What every programmer should know about memory (2007), http://www.akkadia.org/drepper/cpumemory.pdf (last checked August 2012)
Eddelbuettel, D., Francois, R.: Rcpp: Seamless R and C++ integration. Journal of Statistical Software 40(8), 1–18 (2011)
Elizarov, R.: Millions quotes per second in pure Java (2013), http://blog.devexperts.com/millions-quotes-per-second-in-pure-java/ (last accessed on May 14, 2013)
Esuli, A.: Use of permutation prefixes for efficient and scalable approximate similarity search. Inf. Process. Manage. 48(5), 889–902 (2012)
Faloutsos, C.: Searching Multimedia Databases by Content. Kluwer Academic Publisher (1996)
Figueroa, K., Navarro, G., Chávez, E.: Metric Spaces Library (2007), http://www.sisap.org/Metric_Space_Library.html
Figueroa, K., Fredriksson, K.: Speeding up permutation based indexing with indexing. In: Proceedings of the 2009 Second International Workshop on Similarity Search and Applications, SISAP 2009, pp. 107–114. IEEE Computer Society, Washington, DC (2009)
Fredriksson, K.: Engineering efficient metric indexes. Pattern Recognition Letters 28(1), 75–84 (2007)
Fulgham, B.: The computer language benchmarks game (2013), http://benchmarksgame.alioth.debian.org/ (last accessed on May 14, 2013)
Gonzalez, E., Figueroa, K., Navarro, G.: Effective proximity retrieval by ordering permutations. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(9), 1647–1658 (2008)
Gonzalez, E.C., Figueroa, K., Navarro, G.: Effective proximity retrieval by ordering permutations. IEEE Transactions on Pattern Analysis and Machine Intelligence 30(9), 1647–1658 (2008)
Hedges, L.V., Vevea, J.L.: Fixed-and random-effects models in meta-analysis. Psychological Methods 3(4), 486–504 (1998)
Hundt, R.: Loop recognition in C++/Java/Go/Scala. In: Proceedings of Scala Days 2011 (2011)
Indyk, P.: Nearest neighbors in high-dimensional spaces. In: Goodman, J.E., O’Rourke, J. (eds.) Handbook of Discrete and Computational Geometry, pp. 877–892. Chapman and Hall/CRC (2004)
Indyk, P., Motwani, R.: Approximate nearest neighbors: towards removing the curse of dimensionality. In: Proceedings of the Thirtieth Annual ACM Symposium on Theory of Computing, STOC 1998, pp. 604–613. ACM, New York (1998)
Jacobs, D., Weinshall, D., Gdalyahu, Y.: Classification with nonmetric distances: Image retrieval and class representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 22(6), 583–600 (2000)
King, G.: How not to lie with statistics: Avoiding common mistakes in quantitative political science. American Journal of Political Science, 666–687 (1986)
King, R.S.: The top 10 programming languages (the data). IEEE Spectrum 48(10), 84–84 (2011)
Kushilevitz, E., Ostrovsky, R., Rabani, Y.: Efficient search for approximate nearest neighbor in high dimensional spaces. In: Proceedings of the 30th Annual ACM Symposium on Theory of Computing, STOC 1998, pp. 614–623. ACM, New York (1998)
Lokoč, J., Hetland, M.L., Skopal, T., Beecks, C.: Ptolemaic indexing of the signature quadratic form distance. In: Proceedings of the Fourth International Conference on SImilarity Search and APplications, SISAP 2011, pp. 9–16. ACM, New York (2011)
Mu, Y., Yan, S.: Non-metric locality-sensitive hashing. In: AAAI (2010)
Novak, D., Kyselak, M., Zezula, P.: On locality-sensitive indexing in generic metric spaces. In: Proceedings of the Third International Conference on SImilarity Search and APplications, SISAP 2010, pp. 59–66. ACM, New York (2010)
Parri, J., Shapiro, D., Bolic, M., Groza, V.: Returning control to the programmer: Simd intrinsics for virtual machines. Commun. ACM 54(4), 38–43 (2011)
Pestov, V.: Indexability, concentration, and VC theory. Journal of Discrete Algorithms 13, 2–18 (2012); Best Papers from the 3rd International Conference on Similarity Search and Applications (SISAP 2010)
Pestov, V.: Is the k-NN classifier in high dimensions affected by the curse of dimensionality? Computers & Mathematics with Applications (2012)
Samet, H.: Foundations of Multidimensional and Metric Data Structures. Morgan Kaufmann Publishers Inc. (2005)
Scott, D.W., Thompson, J.R.: Probability density estimation in higher dimensions. Technical Report, Rice University, Texas Huston (1983)
Shafi, A., Carpenter, B., Baker, M., Hussain, A.: A comparative study of java and c performance in two large-scale parallel applications. Concurrency and Computation: Practice and Experience 21(15), 1882–1906 (2009)
Skopal, T.: Unified framework for fast exact and approximate search in dissimilarity spaces. ACM Trans. Database Syst. 32(4) (November 2007)
Skopal, T., Bustos, B.: On nonmetric similarity search problems in complex domains. ACM Comput. Surv. 43(4), 34:1–34:50 (October 2011)
Uhlmann, J.: Satisfying general proximity similarity queries with metric trees. Information Processing Letters 40, 175–179 (1991)
Vivanco, R.A., Pizzi, N.J.: Scientific computing with Java and C++: a case study using functional magnetic resonance neuroimages. Software: Practice and Experience 35(3), 237–254 (2005)
Weber, R., Schek, H.J., Blott, S.: A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In: Proceedings of the 24th International Conference on Very Large Data Bases, pp. 194–205. Morgan Kaufmann (August 1998)
Zezula, P., Amato, G., Dohnal, V., Batko, M.: Similarity Search: The Metric Space Approach (Advances in Database Systems). Springer-Verlag New York, Inc., Secaucus (2005)
Zezula, P., Savino, P., Amato, G., Rabitti, F.: Approximate similarity retrieval with m-trees. The VLDB Journal 7(4), 275–293 (1998)
Zhang, Z., Ooi, B.C., Parthasarathy, S., Tung, A.K.H.: Similarity search on bregman divergence: towards non-metric indexing. Proc. VLDB Endow. 2(1), 13–24 (2009)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Boytsov, L., Naidan, B. (2013). Engineering Efficient and Effective Non-metric Space Library. In: Brisaboa, N., Pedreira, O., Zezula, P. (eds) Similarity Search and Applications. SISAP 2013. Lecture Notes in Computer Science, vol 8199. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-41062-8_28
Download citation
DOI: https://doi.org/10.1007/978-3-642-41062-8_28
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-41061-1
Online ISBN: 978-3-642-41062-8
eBook Packages: Computer ScienceComputer Science (R0)