Inference From the Data: Protect Privacy When Releasing Results

Publication Date: Spring 2023

Abstract

Many companies collect and analyze user data to understand and improve the performance of their products. However, such data often contain personal information, and when results from an analysis are released, adversaries may infer an individual's information through privacy attacks or threat models. In this project, Su develops a privacy-preserving version of Metropolis-Hastings (MH), a widely used sampling algorithm in statistics, and finds that the privatized algorithm offers formal privacy guarantees while maintaining good data utility on a financial data set and on simulated data sets.
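
As background, the sketch below shows the standard, non-private MH sampler; the privacy-preserving variant developed in the project is not reproduced in this summary. The target density, proposal scale, and sample count are illustrative assumptions.

```python
import numpy as np

def metropolis_hastings(log_target, x0, n_samples, step=1.0, rng=None):
    """Standard (non-private) Metropolis-Hastings with a symmetric
    Gaussian random-walk proposal; all parameters here are illustrative."""
    rng = rng or np.random.default_rng()
    x, samples = x0, []
    for _ in range(n_samples):
        proposal = x + step * rng.standard_normal()
        # With a symmetric proposal, the acceptance ratio reduces to
        # the ratio of target densities (computed in log space).
        if np.log(rng.uniform()) < log_target(proposal) - log_target(x):
            x = proposal  # accept the move; otherwise keep the current state
        samples.append(x)
    return np.array(samples)

# Example: draw samples from a standard normal (log density up to a constant).
draws = metropolis_hastings(lambda x: -0.5 * x**2, x0=0.0, n_samples=5000)
```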

Why Should Industry Care?

Statistical analysis and machine learning on user and customer data can help companies and financial institutions evaluate users' satisfaction with a product or service and provide valuable insights on how to improve its quality. For example, Netflix may improve its recommendation algorithm by analyzing users' comments, and banks can analyze customer data to determine which customers may need a loan or credit card.

Anonymization and pseudonymization, such as removing personal identifiers from a released data set, are not effective approaches to avoiding privacy risk. For example, when Netflix launched a recommendation-algorithm competition in 2006, it published users' data after removing their names. However, attackers could still re-identify individuals by linking the released data with public Internet Movie Database (IMDb) records. In addition, even aggregate results from statistical analysis and machine learning can expose users' personal information upon release.

In summary, randomized algorithms with formal privacy guarantees can provide effective privacy protection when data or analysis results are released, while the perturbed results retain enough utility to remain useful.
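
As a concrete illustration, the Laplace mechanism of differential privacy is the canonical randomized algorithm of this kind: it perturbs a released statistic with noise calibrated to the statistic's sensitivity and the privacy budget. The sketch below is a minimal example under illustrative assumptions (the statistic, data bounds, and parameter values are not from the project).

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release `true_value` with epsilon-differential privacy by adding
    Laplace noise of scale sensitivity/epsilon. Names are illustrative."""
    rng = rng or np.random.default_rng()
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: privately release the mean of n values known to lie in [0, 1];
# the sensitivity of such a mean is 1/n.
data = np.random.default_rng(0).random(1000)
private_mean = laplace_mechanism(data.mean(), sensitivity=1.0 / len(data), epsilon=1.0)
```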