The conundrum of user data deletion from ML models


SOURCE: ANALYTICSINDIAMAG.COM
OCT 17, 2021

Stanford and UC San Diego researchers found a method for rapidly erasing sensitive user data from machine learning models.

Researchers at Stanford and UC San Diego propose a novel approach for rapidly removing sensitive user data from machine learning (ML) models. They address the problem of efficiently eliminating individual data from ML models after training and provide a method for evaluating when models developed from specific user data can no longer be used. For many basic ML models, the only way to erase a person’s data is to completely retrain the model on the remaining data, which is frequently impractical. As a result, researchers are investigating ML algorithms capable of removing data efficiently.

According to James Zou, a professor of biomedical data science at Stanford University and an expert in artificial intelligence, achieving optimal data erasure in real time is difficult. When ML models are trained, bits of data become intricately blended into them, which makes it hard to guarantee that a user has been forgotten without considerably altering the model. The researchers also state that there may be a solution to the data deletion conundrum that is acceptable both to privacy-conscious users and to experts in artificial intelligence. This is known as “approximate deletion.”

Approximate deletion is particularly effective for immediately deleting sensitive information, or attributes unique to a particular individual that could be utilised for later identification, while deferring computationally demanding full model retraining to periods of lower computing load. According to Zou, approximate deletion can even achieve the holy grail of exact deletion of a user’s implicit data from the trained model under certain assumptions.

Data-driven

Machine learning works by sifting through databases and assigning various prediction weights to data features — for example, an online shopper’s age, location, and previous purchase history, or a streamer’s previous viewing history and personal rating of movies watched. The models are no longer limited to commercial applications; they are now frequently employed in radiology, pathology, and other professions that directly impact humans.
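To make this concrete, the following is a minimal, hypothetical sketch of such a model: a small logistic regression that learns prediction weights for a shopper’s age, region and purchase count. The feature names and synthetic data are invented for illustration and do not come from the study.

```python
# Illustrative only: a toy model that learns prediction weights from
# shopper features. Feature names and data are invented for this sketch.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_users = 1_000
age = rng.integers(18, 70, n_users)
region = rng.integers(0, 50, n_users)          # coarse location code
purchases = rng.poisson(5, n_users)            # number of past purchases

X = np.column_stack([age, region, purchases]).astype(float)
y = (purchases + rng.normal(0, 2, n_users) > 5).astype(int)   # "will buy again"

model = LogisticRegression(max_iter=1000).fit(X, y)
for name, weight in zip(["age", "region", "purchases"], model.coef_[0]):
    print(f"{name}: {weight:+.3f}")            # every user's rows shaped these weights
```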

While data in a database is theoretically anonymised, privacy-conscious users fear that they can still be identified from the bits and pieces of information about them embedded in the models, hence the push for “right to be forgotten” regulations.

According to Zachary Izzo, the gold standard in the discipline is to arrive at exactly the same model as if the machine learning had never seen the erased data points. This criterion, dubbed “exact deletion,” is difficult, if not impossible, to meet, particularly with big, complex models such as those used to recommend products or movies to the customers of online retailers and streaming services. Exact data deletion effectively entails completely retraining a model, Izzo explains.
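In code, exact deletion amounts to something like the sketch below, written here for a small logistic model on synthetic data: drop the user’s rows and retrain from scratch, paying the cost of a full training run for every deletion request.

```python
# A minimal sketch of exact deletion on synthetic data: drop the rows that
# belong to one user and retrain the model from scratch on what remains.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 1_000
X = rng.normal(size=(n, 4))                          # anonymised user features
y = (X[:, 0] + rng.normal(scale=0.5, size=n) > 0).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)

user_rows = np.arange(10)                            # hypothetical rows of one user
keep = np.setdiff1d(np.arange(n), user_rows)

# The retrained model is, by construction, one that never saw the user's rows,
# but the price is a complete training run for every deletion request.
model_exact = LogisticRegression(max_iter=1000).fit(X[keep], y[keep])
```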

Understanding Approximate Deletion

As the name suggests, approximate deletion removes most of the implicit data associated with a user from the model. The user is ‘forgotten’ right away, in a way that allows the model to be fully retrained at a more opportune time.

Under certain assumptions, approximate deletion can even achieve exact deletion of a user’s implicit data from the trained model. The researchers approached the deletion challenge differently from prior work in the field.
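To convey the flavour of approximate deletion, the sketch below applies a standard influence-function-style (one-step Newton) update to a regularised logistic model, adjusting the trained weights as if one user’s rows had been removed and comparing the result with full retraining. This is a generic approximation used only for illustration; it is not claimed to be the specific algorithm developed by the Stanford and UC San Diego team.

```python
# Illustrative approximate deletion for a regularised logistic model, via a
# one-step Newton (influence-function-style) update. Hypothetical data and
# parameters; not the paper's specific algorithm.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
n, d = 2000, 5
X = rng.normal(size=(n, d))
y = (X @ rng.normal(size=d) + rng.normal(scale=0.5, size=n) > 0).astype(int)

C = 1.0                                   # sklearn's inverse regularisation strength
lam = 1.0 / C                             # equivalent L2 coefficient per summed loss
clf = LogisticRegression(C=C, fit_intercept=False, max_iter=2000).fit(X, y)
w = clf.coef_.ravel()

delete_idx = np.arange(20)                # rows belonging to the user to be forgotten
keep = np.setdiff1d(np.arange(n), delete_idx)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# One Newton step on the loss of the remaining data, starting from w:
# w_approx ~= w + H_keep(w)^{-1} * (sum of gradients of the deleted rows).
p_keep = sigmoid(X[keep] @ w)
H_keep = (X[keep] * (p_keep * (1 - p_keep))[:, None]).T @ X[keep] + lam * np.eye(d)
grad_deleted = X[delete_idx].T @ (sigmoid(X[delete_idx] @ w) - y[delete_idx])
w_approx = w + np.linalg.solve(H_keep, grad_deleted)

# Sanity check against exact deletion (full retraining on the remaining rows).
w_exact = LogisticRegression(C=C, fit_intercept=False, max_iter=2000).fit(
    X[keep], y[keep]).coef_.ravel()
print("distance to exact deletion:", np.linalg.norm(w_approx - w_exact))
print("distance before deletion :", np.linalg.norm(w - w_exact))
```

Because the update only requires solving one small linear system in the model weights, its cost does not grow with the number of training examples, which is the kind of saving that makes deletion practical between full retraining passes.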

Additionally, the researchers describe a novel approximate deletion technique for linear and logistic models whose computational cost is linear in the feature dimension and independent of the number of training examples. This is a significant improvement over existing methods, which depend superlinearly on the dimension. Moreover, they develop a new feature-injection test to evaluate how thoroughly data has been deleted from ML models.
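The spirit of such a test can be sketched as follows: plant a synthetic feature that is active only in the rows slated for deletion, then check how much weight the model assigns to it before and after deletion. The setup below is an invented, simplified version of that idea, not the exact protocol from the paper.

```python
# Illustrative feature-injection test on synthetic data: the injected feature is
# active only in the rows to be deleted, so a thorough deletion should leave the
# model with (near) zero weight on it. Invented setup, not the paper's protocol.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
n, d = 2000, 5
X = rng.normal(size=(n, d))
y = (X[:, 0] > 0).astype(int)

delete_idx = np.arange(20)                 # rows of the user to be forgotten
injected = np.zeros((n, 1))
injected[delete_idx] = 1.0                 # synthetic feature present only in those rows
y[delete_idx] = 1                          # tie the injected feature to the label
X_inj = np.hstack([X, injected])

before = LogisticRegression(max_iter=2000).fit(X_inj, y)

keep = np.setdiff1d(np.arange(n), delete_idx)
after = LogisticRegression(max_iter=2000).fit(X_inj[keep], y[keep])  # exact deletion

print("injected-feature weight before deletion:", before.coef_[0, -1])
print("injected-feature weight after deletion :", after.coef_[0, -1])
```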
