Balancing search space partitions by sparse coding for distributed redundant media indexing and retrieval

Mourão, André; Magalhães, João

doi:10.1007/s13735-017-0140-0

Balancing search space partitions by sparse coding for distributed redundant media indexing and retrieval

Regular Paper
Published: 18 November 2017

Volume 7, pages 57–70, (2018)
Cite this article

International Journal of Multimedia Information Retrieval Aims and scope Submit manuscript

231 Accesses
3 Citations
Explore all metrics

Abstract

Effective partitioning multimedia indexes is key for efficient kNN search. But existing algorithms are based on document similarity, without partition size or redundancy constraints. Our goal is to create an index partitioning algorithm that addresses the specific properties of a distributed system: load balancing across nodes, redundancy in node failure and efficient node usage under concurrent querying. We propose the representation of data with overcomplete codebooks. Each document is quantized into a small set of codewords and indexed on per-codeword partitions. Quantization algorithms are designed to fit data as best as possible, leading to a bias toward codewords that fit the principal directions of data in the original space. In this paper, we propose the balanced KSVD (B-KSVD) algorithm: It distributes data uniformly across codewords, according to the distribution in the original space. The comprehensive experiments focused on measuring the effectiveness of partition size balancing and retrieval quality. Results show that B-KSVD better balances partition sizes (i.e., lower SD in partition size distribution), compared to k-means and KSVD baselines. B-KSVD achieves 38% 1-recall by inspecting only 1% of the full index, distributed over 10 partitions. k-means creates partitions with higher size variation and requires either larger codebooks or the inspection of larger portions of the index to achieve similar retrieval performance.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Joint Optimization of Multi-vector Representation with Product Quantization

MX-tree: A Double Hierarchical Metric Index with Overlap Reduction

Indexability-Based Dataset Partitioning

Notes

http://corpus-texmex.irisa.fr/.

References

Aharon M, Elad M, Bruckstein A (2006) K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation. IEEE Trans Signal Process 54(11):4311–4322
Article MATH Google Scholar
Aly M, Munich M, Perona P (2011) Distributed Kd-trees for retrieval from very large image collections. In: Proceedings of BMVC
Andoni A, Indyk P (2008) Near-optimal hashing algorithms for approximate nearest neighbor in high dimensions. ACM Commun 51(1):117–122
Article Google Scholar
Arthur D, Vassilvitskii S (2007) k-means++: the advantages of careful seeding. In: Proceedings of ACM-SIAM SODA, pp 1027–1035
Babenko A, Lempitsky V (2015) The inverted multi-index. IEEE Trans PAMI 37(6):1247–1260. https://doi.org/10.1109/TPAMI.2014.2361319
Article Google Scholar
Babenko A, Lempitsky V (2016) Efficient indexing of billion-scale datasets of deep descriptors. In: Proceedings of IEEE CVPR
Batko M, Falchi F, Lucchese C, Novak D, Perego R, Rabitti F, Sedmidubsky J, Zezula P (2010) Building a web-scale image similarity search system. Multimed Tool Appl 47(3):599–629
Article Google Scholar
Bezdek JC (1981) Pattern recognition with fuzzy objective function algorithms. Kluwer Academic Publishers, Norwell
Book MATH Google Scholar
Borges P, Mourão A, Magalhães J (2015) High-dimensional indexing by sparse approximation. In: ACM ICMR’15, ACM
Büttcher S, Clarke CL, Cormack GV (2010) Information retrieval: implementing and evaluating search engines. MIT Press, Cambridge
MATH Google Scholar
Cherian A, Sra S, Morellas V, Papanikolopoulos N (2014) Efficient nearest neighbors via robust sparse hashing. IEEE TIP 23(8):3646–3655
MathSciNet MATH Google Scholar
Chum O, Philbin J, Zisserman A (2008) Near duplicate image detection: min-hash and tf-idf weighting. In: Proceedings of BMVC
Datar M, Immorlica N, Indyk P, Mirrokni VS (2004) Locality-sensitive hashing scheme based on p-stable distributions. In: Proceedings of SCG, ACM, pp 253–262
Dean J, Ghemawat S (2008) Mapreduce: simplified data processing on large clusters. Commun ACM 51(1):107–113
Article Google Scholar
Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of KDD, pp 226–231
Grauman K, Fergus R (2013) Chap: Learning binary hash codes for large-scale image search. In: Machine learning for computer vision. Springer, Berlin, pp 49–87. https://doi.org/10.1007/978-3-642-28661-2_3
Hinton GE, Salakhutdinov RR (2006) Reducing the dimensionality of data with neural networks. Science 313(5786):504–507. https://doi.org/10.1126/science.1127647
Article MathSciNet MATH Google Scholar
Hoerl AE, Kennard RW (2004) Ridge regression. Wiley, New York. https://doi.org/10.1002/0471667196.ess2280
Book MATH Google Scholar
Jégou H, Douze M, Schmid C (2011) Product quantization for nearest neighbor search. IEEE Trans PAMI 33(1):117–128. https://doi.org/10.1109/TPAMI.2010.57
Article Google Scholar
Jégou H, Tavenard R, Douze M, Amsaleg L (2011) Searching in one billion vectors: re-rank with source coding. arXiv e-prints arXiv:1102.3828
Ji R, Duan LY, Chen J, Xie L, Yao H, Gao W (2013) Learning to distribute vocabulary indexing for scalable visual search. IEEE Trans Multimed 15(1):153–166. https://doi.org/10.1109/TMM.2012.2225035
Article Google Scholar
Kalantidis Y, Avrithis Y (2014) Locally optimized product quantization for approximate nearest neighbor search. In: Proceedings of IEEE CVPR, pp 2329–2336. https://doi.org/10.1109/CVPR.2014.298
Karger D, Lehman E, Leighton T, Panigrahy R, Levine M, Lewin D (1997) Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the world wide web. In: Proceedings of ACM STOC, STOC ’97, pp 654–663
Kulkarni A, Callan J (2010) Document allocation policies for selective searching of distributed indexes. In: Proceedings of ACM CIKM, CIKM ’10, pp 449–458. https://doi.org/10.1145/1871437.1871497
Lewicki MS, Sejnowski TJ (2000) Learning overcomplete representations. Neural Comput 12(2):337–365. https://doi.org/10.1162/089976600300015826
Article Google Scholar
Li Z, Ning H, Cao L, Zhan T, Gong Y, Huang TS (2011) Learning to search efficiently in high dimensions. In: Neural information processing systems
Lloyd S (1982) Least squares quantization in PCM. IEEE Trans Inf Theory 28(2):129–137
Article MathSciNet MATH Google Scholar
Magalhaes J, Rueger S (2007) High-dimensional visual vocabularies for image retrieval. In: ACM SIGIR’07, ACM, New York, NY, USA, pp 815–816. https://doi.org/10.1145/1277741.1277923
Moise D, Shestakov D, Gudmundsson G, Amsaleg L (2013) Indexing and searching 100m images with map-reduce. In: ICMR’13, pp 17–24. https://doi.org/10.1145/2461466.2461470
Mourão A, Magalhães Ja (2015) Scalable multimodal search with distributed indexing by sparse hashing. In: ACM ICMR’15, ACM, New York, NY, USA, pp 283–290. https://doi.org/10.1145/2671188.2749310
Muja M, Lowe DG (2014) Scalable nearest neighbor algorithms for high dimensional data. IEEE Trans PAMI 36:2227–2240
Article Google Scholar
Olshausen BA, Field DJ (1996) Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature 381(6583):607–609
Article Google Scholar
Pati Y, Rezaiifar R, Krishnaprasad P (1993) Orthogonal Matching Pursuit : recursive function approximation with application to wavelet decomposition. In: Asilomar Conference on Signals, Systems and Computer
Raginsky M, Lazebnik S (2009) Locality-sensitive binary codes from shift-invariant kernels. In: NIPS, pp 1509–1517
Tavenard R, Jégou H, Amsaleg L (2011) Balancing clusters to reduce response time variability in large scale image search. In: International workshop on content-based multimedia indexing (CBMI 2011), Madrid, Spain. http://hal.inria.fr/inria-00576886,qUAERO
Thaler DG, Ravishankar CV (1998) Using name-based mappings to increase hit rates. Trans Netw 6(1):1–14. https://doi.org/10.1109/90.663936
Article Google Scholar
Tibshirani R (1994) Regression shrinkage and selection via the lasso. J R Stat Soc B 58:267–288
MathSciNet MATH Google Scholar
Torralba A, Fergus R, Freeman W (2008) 80 million tiny images: a large data set for nonparametric object and scene recognition. IEEE Trans PAMI 30(11):1958–1970. https://doi.org/10.1109/TPAMI.2008.128
Article Google Scholar
Weber R, Schek HJ, Blott S (1998) A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In: Proceedings of VLDB, pp 194–205
Weiss Y, Torralba A, Fergus R (2008) Spectral hashing. NIPS 9(1):6
Google Scholar
Yang Z, Kamata SI, Ahrary A (2009) NIR: content based image retrieval on cloud computing. Proc ICIS 3:556–559
Google Scholar
Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc B 67:301–320
Article MathSciNet MATH Google Scholar

Download references

Acknowledgements

This work has been partially funded by the CMU Portugal research project GoLocal Ref. CMUP-ERI/TIC/0033 /2014, by the H2020 ICT project COGNITUS with the Grant Agreement No. 687605, by the project NOVA LINCS Ref. UID/CEC/04516/2013 and by the Ph.D. scholarship Grant SFRH/BD/95064/2013.

Author information

Authors and Affiliations

NOVA LINCS, Faculdade de Ciências e Tecnologia, Universidade NOVA de Lisboa, Caparica, Portugal
André Mourão & João Magalhães

Authors

André Mourão
View author publications
You can also search for this author in PubMed Google Scholar
João Magalhães
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to André Mourão.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Mourão, A., Magalhães, J. Balancing search space partitions by sparse coding for distributed redundant media indexing and retrieval. Int J Multimed Info Retr 7, 57–70 (2018). https://doi.org/10.1007/s13735-017-0140-0

Download citation

Received: 10 September 2017
Revised: 21 October 2017
Accepted: 08 November 2017
Published: 18 November 2017
Issue Date: March 2018
DOI: https://doi.org/10.1007/s13735-017-0140-0

Keywords

Access this article

Log in via an institution

Subscribe and save

Springer+ Basic

$34.99 /Month

Get 10 units per month
Download Article/Chapter or eBook
1 Unit = 1 Article or 1 Chapter
Cancel anytime

Buy Now

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Balancing search space partitions by sparse coding for distributed redundant media indexing and retrieval

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Joint Optimization of Multi-vector Representation with Product Quantization

MX-tree: A Double Hierarchical Metric Index with Overlap Reduction

Indexability-Based Dataset Partitioning

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Keywords

Subscribe and save

Buy Now

Navigation

Balancing search space partitions by sparse coding for distributed redundant media indexing and retrieval

Abstract

Access this article

Subscribe and save

Buy Now

Similar content being viewed by others

Joint Optimization of Multi-vector Representation with Product Quantization

MX-tree: A Double Hierarchical Metric Index with Overlap Reduction

Indexability-Based Dataset Partitioning

Notes

References

Acknowledgements

Author information

Authors and Affiliations

Corresponding author

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Subscribe and save

Buy Now

Search

Navigation