Abstract
In the era of big data, measurements collected from all walks of life play important roles in various data mining applications. Not all data owners (or keepers) can develop feasible learning models for knowledge discovery. Often, the original data must be passed to or shared with researchers or data scientists to obtain better mining insights, especially in the medical, financial, and industrial fields. However, concerns about sensitivity and privacy limit the availability and completeness of the shared data and, in turn, the quality of the mining results. In this paper, we propose a novel Convolutional Bidirectional Generative Adversarial Networks (CB-GAN) framework to generate synthetic data for sensitive datasets. Convolutional Neural Networks are used to capture the feature correlations of the original data, and Generative Adversarial Networks are combined with Autoencoders to synthesize realistically distributed data. To demonstrate the feasibility of the model, we evaluate it from three aspects: how similar the distributions of the synthetic data are to those of the original data, how well the synthetic data support subsequent data mining tasks, and how much sensitive information is hidden. Experimental results show that the proposed method outperforms the state-of-the-art methods it is compared with.
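The first evaluation aspect, distributional similarity between synthetic and original data, can be illustrated with a small sketch. The abstract does not say which similarity measure is used, so the per-feature two-sample Kolmogorov-Smirnov statistic below is an assumption chosen for illustration, not the paper's metric. The function names (`ks_statistic`, `per_feature_ks`) and the toy Gaussian data are hypothetical.

```python
import random


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: the largest gap between
    the empirical CDFs of samples a and b (0 = identical, 1 = disjoint)."""
    a, b = sorted(a), sorted(b)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value equal to x (handles ties).
        while i < len(a) and a[i] == x:
            i += 1
        while j < len(b) and b[j] == x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def per_feature_ks(real_rows, synth_rows):
    """Apply the KS statistic column by column to two tabular datasets."""
    n_features = len(real_rows[0])
    return [
        ks_statistic([r[k] for r in real_rows], [s[k] for s in synth_rows])
        for k in range(n_features)
    ]


# Toy data (assumption): two features drawn from standard Gaussians.
rng = random.Random(0)
real = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(500)]
# A "good" synthetic set matches both features; a "poor" one shifts feature 0.
good = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(500)]
poor = [[rng.gauss(2, 1), rng.gauss(0, 1)] for _ in range(500)]

good_scores = per_feature_ks(real, good)
poor_scores = per_feature_ks(real, poor)
```

Lower scores indicate closer marginal distributions. A per-feature test like this ignores correlations between features, so it covers only part of what a full similarity evaluation would need.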
This research was supported by the Guangdong Natural Science Foundation General Program (Grant No. 2022A1515011713).
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Hu, R., Li, D., Ng, SK., Zheng, Z. (2023). CB-GAN: Generate Sensitive Data with a Convolutional Bidirectional Generative Adversarial Networks. In: Wang, X., et al. Database Systems for Advanced Applications. DASFAA 2023. Lecture Notes in Computer Science, vol 13946. Springer, Cham. https://doi.org/10.1007/978-3-031-30678-5_13
DOI: https://doi.org/10.1007/978-3-031-30678-5_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-30677-8
Online ISBN: 978-3-031-30678-5
eBook Packages: Computer Science, Computer Science (R0)