2016 Volume E99.D Issue 8 Pages 2069-2078
In data sharing privacy has become one of the main concerns particularly when sharing datasets involving individuals contain private sensitive information. A model that is widely used to protect the privacy of individuals in publishing micro-data is k-anonymity. It reduces the linking confidence between private sensitive information and specific individual by generalizing the identifier attributes of each individual into at least k-1 others in dataset. K-anonymity can also be defined as clustering with constrain of minimum k tuples in each group. However, the accuracy of the data in k-anonymous dataset decreases due to huge information loss through generalization and suppression. Also most of the current approaches are designed for numerical continuous attributes and for categorical attributes they do not perform efficiently and depend on attributes hierarchical taxonomies, which often do not exist. In this paper we propose a new model for k-anonymization, which is called Similarity-Based Clustering (SBC). It is based on clustering and it measures similarity and calculates distances between tuples containing numerical and categorical attributes without hierarchical taxonomies. Based on this model a bottom up greedy algorithm is proposed. Our extensive study on two real datasets shows that the proposed algorithm in comparison with existing well-known algorithms offers much higher data utility and reduces the information loss significantly. Data utility is maintained above 80% in a wide range of k values.