Can Privacy-Preserving Machine Learning Overcome Data-Sharing Worries?
Privacy-preserving AI techniques could allow researchers to extract insights from sensitive data if cost and complexity barriers can be overcome. But as the concept of privacy-preserving artificial intelligence matures, data volumes and complexity keep growing. This year, the size of the digital universe could hit 44 zettabytes, according to the World Economic Forum. That is 40 times more bytes than there are stars in the observable universe. And by 2025, IDC projects that number could nearly double.
More Data, More Privacy Problems
While the explosion in data volume, together with declining computation costs, has driven interest in artificial intelligence, a significant portion of data poses potential privacy and cybersecurity questions. Regulatory and cybersecurity issues concerning data abound. AI researchers are constrained by data quality and availability. Databases that would enable them, for instance, to shed light on common diseases or stamp out financial fraud — an estimated $5 trillion global problem — are difficult to obtain. Conversely, innocuous datasets like ImageNet have driven machine learning advances because they are freely available.
A traditional strategy to protect sensitive data is to anonymize it, stripping out confidential information. “Most of the privacy regulations have a clause that permits sufficiently anonymizing it instead of deleting data at request,” said Lisa Donchak, associate partner at McKinsey.
But the catch is, the explosion of data makes the task of re-identifying individuals in masked datasets progressively easier. The goal of protecting privacy is getting “harder and harder to solve because there are so many data snippets available,” said Zulfikar Ramzan, chief technology officer at RSA.
The Internet of Things (IoT) complicates the picture. Connected sensors, found in everything from surveillance cameras to industrial plants to fitness trackers, collect troves of sensitive data. With the appropriate privacy protections in place, such data could be a gold mine for AI research. But security and privacy concerns stand in the way.
Addressing such hurdles requires two things. First, a framework providing user controls and rights on the front-end protects data coming into a database. “That includes specifying who has access to my data and for what purpose,” said Casimir Wierzynski, senior director of AI products at Intel. Second, it requires sufficient data protection, including encrypting data while it is at rest or in transit. The latter is arguably a thornier challenge.
Insights Only for Those Who Need Them
Traditionally, machine learning works on unencrypted data in a collaborative process. “In almost all cases with machine learning, you have multiple stakeholders working together,” Wierzynski said. One stakeholder could own a training data set while another could own the machine learning model, and yet another provides an underlying machine learning service. Third-party domain experts could be tapped to help tune a machine learning model. In other scenarios, multiple parties’ datasets could be combined. “The more data you have, the more powerful model you can build,” Wierzynski said. But as the number of parties and datasets increases, so do security risks using conventional machine learning techniques.
Over the years, security professionals have sought to reduce the liabilities of unsecured data by deploying cryptography, biometrics and multifactor authentication. Interest in such techniques has paved the way for privacy-preserving machine learning techniques, according to Rajesh Iyengar, founder and CEO of Lincode Labs. Such techniques, ranging from multiparty computation to homomorphic encryption, can enable “independent data owners [to] collaboratively train the models on datasets without compromising the integrity and privacy of data,” Iyengar said.
Multiparty computation. For decades, researchers have explored the concept of answering questions on behalf of a third party using data that they can’t see. One example is a technique known as secure multiparty computation. “Let’s say, you and I have some data, and we want to somehow do some analysis on our joint data set without each of us sharing our individual data,” Ramzan said. Multiparty computation makes that feat possible. Adoption of the technique is at an early stage, but interest in it is growing.
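One common building block for this kind of joint analysis is additive secret sharing. The sketch below is a minimal single-process simulation, not a networked protocol: it assumes honest parties, non-negative integer inputs, and a total smaller than the modulus. The names `make_shares`, `secure_sum` and `MODULUS` are hypothetical and chosen only for illustration.

```python
import secrets

# Public modulus (an assumption of this sketch); all share arithmetic is done modulo it.
MODULUS = 2**61 - 1


def make_shares(value, n_parties):
    """Split `value` into n_parties random-looking shares that sum to it mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    last = (value - sum(shares)) % MODULUS
    return shares + [last]


def secure_sum(private_values):
    """Simulate parties computing the sum of their inputs without revealing them."""
    n = len(private_values)
    # Party i splits its input into n shares and hands share j to party j.
    distributed = [make_shares(v, n) for v in private_values]
    # Party j adds the shares it received and publishes only that partial sum.
    partial_sums = [
        sum(distributed[i][j] for i in range(n)) % MODULUS for j in range(n)
    ]
    # The published partial sums add up to the true total.
    return sum(partial_sums) % MODULUS


if __name__ == "__main__":
    # Three hypothetical organizations, each holding one private count.
    print(secure_sum([120, 75, 310]))
```

Each individual share is a uniformly random number, so a party that sees only the shares sent to it learns nothing about another party's input. Only the final total is revealed.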
Federated learning. A related concept is federated learning where multiple entities begin with an initial version of a machine learning model. “They use only their local data to make improvements to those models, and then they share all of those improvements with a central entity,” Wierzynski said.
The technique has also gained traction. The University of Pennsylvania and Intel, for instance, are working with 29 international healthcare organizations to enlist federated learning to detect brain tumors. Google has also explored use of the method.
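The federated learning loop described above can be sketched in a few lines. This is a toy illustration under stated assumptions, not the method used in any of the projects mentioned: the model is a single weight fitting y = w * x, the central entity combines updates with a sample-weighted average (often called federated averaging), and the function names, learning rate and data are made up for the example.

```python
def local_update(weight, data, lr=0.01, epochs=5):
    """Improve the model on one client's local (x, y) pairs; the data never leaves."""
    w = weight
    for _ in range(epochs):
        # Gradient of mean squared error for the model y = w * x.
        grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
        w -= lr * grad
    return w


def federated_round(global_weight, client_datasets):
    """One round: each client trains locally, the server averages the results."""
    updates = [local_update(global_weight, d) for d in client_datasets]
    total = sum(len(d) for d in client_datasets)
    # Weight each client's update by how many samples it trained on.
    return sum(w * len(d) for w, d in zip(updates, client_datasets)) / total


def train(client_datasets, rounds=50):
    """Start from an initial model and repeat federated rounds."""
    w = 0.0
    for _ in range(rounds):
        w = federated_round(w, client_datasets)
    return w


if __name__ == "__main__":
    # Hypothetical local datasets, all generated from y = 3 * x.
    clients = [
        [(1.0, 3.0), (2.0, 6.0)],
        [(3.0, 9.0)],
        [(0.5, 1.5), (4.0, 12.0), (2.5, 7.5)],
    ]
    print(train(clients))  # approaches 3.0
```

Only model weights cross the boundary between a client and the server. Real deployments typically add protections on top, since model updates can themselves leak information about the training data.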
Differential privacy. Differential privacy can provide a defined level of privacy in a given analytics operation by adding noise to the data or to query results, making it harder to infer information about any individual. The technique works best with large data sets. “Let’s say you had data on a million patients. Data that averages results from those patients isn’t going to reveal much about anyone in that group,” Wierzynski said. Given a large enough data set, researchers can bound the probability that an attacker could expose information about an individual and add enough noise to obscure that individual’s data while preserving the accuracy of the data at large. “It’s much more powerful than just deleting their names,” Wierzynski added. Differential privacy’s ability to protect confidential data, however, diminishes when an attacker can query the data many times, because each answer leaks a little more information.
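A standard way to add such noise is the Laplace mechanism. The sketch below releases a noisy mean and is illustrative only: it assumes each value is clipped to a known range, that the number of records is public, and that the privacy parameter `epsilon` is chosen by the analyst. The function names are hypothetical.

```python
import random


def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise as the difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_mean(values, lower, upper, epsilon, rng=None):
    """Return an epsilon-differentially-private estimate of the mean."""
    if rng is None:
        rng = random.Random()
    # Clip so that no single record can move the result by more than a known amount.
    clipped = [min(max(v, lower), upper) for v in values]
    n = len(clipped)
    # Changing one record shifts the mean by at most (upper - lower) / n.
    sensitivity = (upper - lower) / n
    return sum(clipped) / n + laplace_noise(sensitivity / epsilon, rng)


def total_epsilon(epsilon_per_query, num_queries):
    """Basic sequential composition: privacy loss adds up across queries."""
    return epsilon_per_query * num_queries


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical readings: the noise shrinks as the number of records grows.
    print(private_mean([50.0] * 10_000, 0.0, 100.0, epsilon=1.0, rng=rng))
    print(private_mean([50.0] * 10, 0.0, 100.0, epsilon=1.0, rng=rng))
    print(total_epsilon(0.5, 4))
```

The noise scale is proportional to 1/n, which is why large data sets lose little accuracy. `total_epsilon` shows why repeated queries weaken the guarantee: under basic composition, the privacy budget spent grows with every answer released.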
Homomorphic encryption. Another related technique is homomorphic encryption, which allows computation directly on encrypted data. The output is itself encrypted, and only the data owner holding the secret key can decrypt it. Interest in the technique is building, including for election security.
In radiology, for instance, the technique could protect privacy during AI-based analysis. A hospital could send an X-ray to a cloud-based service to get the AI equivalent of a second opinion on a diagnosis. It would encrypt the image and send it to a machine learning service that operates on the data without decrypting it. When the encrypted result returns, the recipient holding the secret key can view the diagnosis. “It’s a compelling way of resolving this tension between privacy and the power of AI,” Wierzynski said.
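To make the idea of computing on encrypted data concrete, the sketch below uses a toy version of the Paillier cryptosystem. This is a swapped-in, simpler scheme: Paillier is only additively homomorphic, not the fully homomorphic encryption a real image-analysis service would need, and the tiny hard-coded primes make it completely insecure. The scenario of a server scoring encrypted features with plaintext weights, and all function names, are assumptions made for illustration.

```python
import math
import secrets


def generate_keys(p=1789, q=1861):
    """Build a toy Paillier key pair from two small primes (demo only, insecure)."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p - 1, q - 1)
    g = n + 1
    mu = pow(lam, -1, n)  # modular inverse; this shortcut is valid because g = n + 1
    return (n, g), (lam, mu)


def encrypt(public_key, m):
    """Encrypt an integer 0 <= m < n with fresh randomness."""
    n, g = public_key
    n_sq = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1
        if math.gcd(r, n) == 1:
            break
    return (pow(g, m, n_sq) * pow(r, n, n_sq)) % n_sq


def decrypt(public_key, private_key, c):
    """Recover the plaintext using the secret key."""
    n, _ = public_key
    lam, mu = private_key
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n


def add_encrypted(public_key, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    n, _ = public_key
    return (c1 * c2) % (n * n)


def scale_encrypted(public_key, c, k):
    """Raising a ciphertext to the power k multiplies the plaintext by k."""
    n, _ = public_key
    return pow(c, k, n * n)


def encrypted_score(public_key, encrypted_features, weights):
    """Server side: weighted sum of encrypted features, without any decryption."""
    total = encrypt(public_key, 0)
    for c, w in zip(encrypted_features, weights):
        total = add_encrypted(public_key, total, scale_encrypted(public_key, c, w))
    return total


if __name__ == "__main__":
    public_key, private_key = generate_keys()
    # Client side: encrypt hypothetical integer features before sending them out.
    features = [3, 5, 2]
    encrypted = [encrypt(public_key, f) for f in features]
    # Server side: sees only ciphertexts and its own plaintext weights.
    result = encrypted_score(public_key, encrypted, weights=[2, 1, 4])
    # Client side: only the key holder can read the score (3*2 + 5*1 + 2*4).
    print(decrypt(public_key, private_key, result))
```

The server never holds the secret key, so it learns neither the inputs nor the result. Requires Python 3.8 or later for the modular inverse via `pow`.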
The downside is the technique’s slow speed, though it is improving. In the past, the method could be millions of times slower than unencrypted computation. “Now, it’s probably closer to like a factor of 10 to 100 [times slower] than regular computation,” Wierzynski said. In some cases, the difference doesn’t matter. If AI inference takes 50 milliseconds on an unencrypted image, waiting 5 seconds (100 times longer) for the encrypted equivalent would be acceptable in many cases.
While the concept of fully homomorphic encryption is a hot research topic, the technique remains immature, Ramzan said. “When you’re focusing on a specific domain, you might get things to be efficient enough to be useful in practice,” Ramzan said.
The idea of broadly deploying homomorphic encryption in AI is “maybe the equivalent is getting to Mars,” Ramzan said. While it is possible in the relatively near term, it could easily take several years for it to be feasible.
But while privacy-preserving machine learning techniques will become more practical in the long term, context will likely dictate when they are useful. “Whether [such techniques] will be practical enough that people are willing to pay the cost penalty, that is a bit of an open question in my mind,” Ramzan said.