Abstract
We study the problem of utilizing human intelligence to categorize a large number of objects. In this problem, given a category hierarchy and a set of objects, we can ask humans whether an object belongs to a category, and our goal is to find the most cost-effective strategy for locating the appropriate category in the hierarchy for each object, such that the cost (i.e., the number of questions asked) is minimized. This problem has many important applications, including image classification and product categorization. We develop an online framework in which the category distribution is gradually learned and an effective order of questions is thus adaptively determined. We prove that the problem is computationally intractable even if the true category distribution is known in advance. We develop an approximation algorithm and prove that it achieves an approximation factor of 2. We also show that the problem admits a fully polynomial-time approximation scheme. Furthermore, we propose an online strategy that achieves nearly the same performance guarantee as the offline optimal strategy, even without prior knowledge of the category distribution. We also develop effective techniques to tolerate crowd errors. Experiments on a real crowdsourcing platform demonstrate the effectiveness of our method.
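As a rough illustration of the interaction model described above (not the paper's cost-optimal strategy), categorization can be sketched as a top-down search over the hierarchy in which each crowd question tests membership in a candidate category. All names below (`categorize`, `ask`, the toy hierarchy) are hypothetical, and the first-match descent shown here ignores the category distribution that the paper's algorithms exploit.

```python
# Hypothetical sketch of crowd-aided categorization by top-down search.
# ask(obj, cat) stands in for a crowd question "does obj belong to cat?".

def categorize(obj, root, children, ask):
    """Walk down from the root, descending into the first child category
    the crowd confirms; stop when no child matches."""
    node, questions = root, 0
    while True:
        for child in children.get(node, []):
            questions += 1
            if ask(obj, child):  # one crowd question
                node = child
                break
        else:
            # no child was confirmed: node is the object's category
            return node, questions

# Toy hierarchy: root -> {electronics, clothing}, electronics -> {phones}
children = {"root": ["electronics", "clothing"], "electronics": ["phones"]}
truth = {"iPhone": ["root", "electronics", "phones"]}
ask = lambda obj, cat: cat in truth[obj]
print(categorize("iPhone", "root", children, ask))  # ('phones', 2)
```

The number of questions depends on the order in which children are probed, which is exactly the quantity the paper's adaptive strategies optimize.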
Change history
30 July 2023
This article was revised due to an update in the article title.
Notes
An algorithm A is an FPTAS for an optimization problem P if, for any instance \({\mathcal {I}}\) and any constant \(\epsilon > 0\), A computes a solution S in time polynomial in the problem size n and \(1/\epsilon \), such that \(|\textsf {OPT} - \textsf {val}(S)| \le \epsilon \cdot \textsf {OPT}\), where \(\textsf {OPT}\) is the value of the optimal solution and \(\textsf {val}(S)\) is the value of the solution returned by A.
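A minimal sketch of the guarantee in this definition, for a minimization problem; the function name and the numbers are illustrative assumptions, not part of the paper.

```python
# Hypothetical check of the FPTAS guarantee |OPT - val| <= eps * OPT
# for a minimization problem.

def within_fptas_bound(opt, val, eps):
    """Return True iff the solution value val is within a (1 + eps)
    multiplicative factor of the optimum opt."""
    return abs(opt - val) <= eps * opt

print(within_fptas_bound(opt=100.0, val=104.0, eps=0.05))  # True
print(within_fptas_bound(opt=100.0, val=110.0, eps=0.05))  # False
```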
Throughout the paper, for any category u, we use u to denote both its corresponding node in \(\mathcal {T}\) and the question node asked at u in \(\mathcal {D}\).
Here we only assign non-negative weights to the nodes; these can be normalized to obtain the probabilities.
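The normalization mentioned in this note is the standard one; a one-function sketch with illustrative weights (the node names and values are assumptions):

```python
# Normalize non-negative node weights into a probability distribution,
# as the note suggests.

def normalize(weights):
    """Map each node's weight to weight / total, so the values sum to 1."""
    total = sum(weights.values())
    return {node: w / total for node, w in weights.items()}

probs = normalize({"a": 2.0, "b": 1.0, "c": 1.0})
print(probs)  # {'a': 0.5, 'b': 0.25, 'c': 0.25}
```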
In the decision trees, we omit the leaves with zero weight, since they do not contribute to the costs.
If we can randomly permute all the objects, we can show that the true distribution can indeed be learned by the law of large numbers. However, the objects may arrive in an online fashion, and a random-permutation pre-processing step may not always be possible.
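The law-of-large-numbers argument in this note can be illustrated with a small simulation: when arrivals are drawn in random order, the running category frequencies converge to the true distribution. This is only a simulation under an assumed two-category distribution, not the paper's online strategy.

```python
# Hypothetical simulation: empirical category frequencies from a randomly
# ordered stream converge to the true distribution (law of large numbers).
import random
from collections import Counter

def estimate(stream):
    """Empirical frequency of each category seen so far."""
    counts = Counter(stream)
    n = len(stream)
    return {cat: c / n for cat, c in counts.items()}

random.seed(0)
true_dist = {"a": 0.7, "b": 0.3}  # assumed true category distribution
stream = random.choices(list(true_dist), weights=true_dist.values(), k=10000)
est = estimate(stream)
# est["a"] should be close to 0.7 after many arrivals
```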
This can be trivially extended to the batched algorithm.
Acknowledgements
This paper was supported by NSF of China (61925205, 61632016), Huawei, TAL education, and Beijing National Research Center for Information Science and Technology (BNRist).
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Li, Y., Wu, X., Jin, Y. et al. Adaptive algorithms for crowd-aided categorization. The VLDB Journal 31, 1311–1337 (2022). https://doi.org/10.1007/s00778-021-00685-2