MPC-based Secure Aggregation in Federated Learning: Overview, Protocols, & Google’s Gboard
As digitalization progresses, it becomes increasingly attractive to learn from as much data as possible in order to continuously improve applications, e.g., predicting the next word, suggesting possible travel routes, providing UI elements in services such as “Digitales Amt”, or improving patient care in hospitals.
In order to train a global ML model on data from many end-user devices while preserving user privacy, Google introduced Federated Learning (FL) in 2016/17. With FL, each user trains the ML model locally and “only” sends the updated ML parameters to a server. However, it has been shown that these ML parameters themselves can be used to reconstruct the underlying input data. For this reason, Secure Aggregation (SecAgg) has been integrated into FL: the server only receives the sum of the updated ML parameters of all users. For the concrete instantiation of SecAgg, the cryptographic building block of secure multi-party computation (MPC) has proven to be practicable. In order to bring MPC-based SecAgg in FL further into practice – and thus enable further privacy-preserving ML applications – general approaches and dedicated protocols need to be analyzed, compared and, where necessary, improved.
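To make the idea of “the server only learns the sum” concrete, the following is a minimal sketch of the pairwise-masking principle behind SecAgg-style protocols: every pair of users agrees on a random mask that one user adds and the other subtracts, so all masks cancel when the server sums the masked updates. The field size, vector length, and helper names are illustrative assumptions, not the actual protocol (which additionally handles key agreement and dropouts).

```python
# Illustrative sketch of pairwise masking: masks cancel in the sum,
# so the server learns only the aggregate of the model updates.
import random

Q = 2**31 - 1          # prime modulus for the illustrative finite field (assumption)
VECTOR_LEN = 4         # length of each flattened model update (assumption)
NUM_USERS = 3

def mask_update(user_id, update, pairwise_seeds):
    """Add +mask towards peers with a higher id and -mask towards lower ids."""
    masked = list(update)
    for peer_id, seed in pairwise_seeds.items():
        rng = random.Random(seed)
        mask = [rng.randrange(Q) for _ in range(VECTOR_LEN)]
        sign = 1 if user_id < peer_id else -1
        masked = [(m + sign * x) % Q for m, x in zip(masked, mask)]
    return masked

# Each pair of users shares one seed (in SecAgg this comes from a key agreement).
seeds = {frozenset({i, j}): random.randrange(2**32)
         for i in range(NUM_USERS) for j in range(i + 1, NUM_USERS)}

updates = [[random.randrange(100) for _ in range(VECTOR_LEN)]
           for _ in range(NUM_USERS)]

masked_updates = []
for uid in range(NUM_USERS):
    peer_seeds = {pid: seeds[frozenset({uid, pid})]
                  for pid in range(NUM_USERS) if pid != uid}
    masked_updates.append(mask_update(uid, updates[uid], peer_seeds))

# The server only sees masked updates; their sum equals the sum of plain updates.
server_sum = [sum(col) % Q for col in zip(*masked_updates)]
plain_sum = [sum(col) % Q for col in zip(*updates)]
assert server_sum == plain_sum
```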
Therefore, this report first presents general methods for privacy-preserving Federated Learning and then MPC-based Secure Aggregation protocols for Federated Learning, comparing them with respect to their computational and communication complexity and their security guarantees. Furthermore, a practical example is shown: Google’s Gboard, which integrates an MPC-based protocol.
FL with appropriate extensions makes it possible to preserve the privacy of participants’ training data. For the training phase, the cryptographic building blocks of Homomorphic Encryption (HE) and MPC have proven to be practically relevant. Although MPC-based SecAgg requires trust in the server or in a subset of the other participants, this method essentially offers more flexibility in practical deployments, especially in scenarios with many participants who take part in the computation using rather low-performance devices. Even within MPC-based SecAgg, numerous protocols offer different trade-offs, for example whether (only) the participants receive the resulting global ML model (e.g. SAFELearn, SCOTCH) or primarily only the aggregation server (e.g. SecAgg, SecAgg+, FastSecAgg, LightSecAgg). For the inference phase – in which the ML model is evaluated on “new input” – the method of Differential Privacy (DP) has proven to be practically relevant. As with almost all methods, however, each of them comes with different trade-offs. Furthermore, (privacy-preserving) FL also has the advantage that it enables specialized ML models, e.g., locally adapted ML models, as shown by Google’s Gboard.
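The MPC building block that such protocols rely on can also be illustrated with additive secret sharing: each client splits its update into random shares, one per aggregator, so that no single aggregator sees an individual update, while the partial sums of all aggregators together reconstruct the aggregated model (which can then be returned to the participants). This is a minimal sketch of the general idea, not any specific protocol; the field size, number of aggregators, and function names are assumptions.

```python
# Illustrative sketch of additive secret sharing for secure aggregation.
import random

Q = 2**31 - 1        # prime modulus for the illustrative finite field (assumption)
NUM_AGGREGATORS = 2  # number of non-colluding aggregators (assumption)
VECTOR_LEN = 4

def add_vectors(vectors):
    """Column-wise sum modulo Q; used both to aggregate and to reconstruct."""
    return [sum(col) % Q for col in zip(*vectors)]

def share(update):
    """Split an update into additive shares modulo Q, one per aggregator."""
    shares = [[random.randrange(Q) for _ in range(VECTOR_LEN)]
              for _ in range(NUM_AGGREGATORS - 1)]
    last = [(x - sum(col)) % Q for x, col in zip(update, zip(*shares))]
    return shares + [last]

client_updates = [[random.randrange(100) for _ in range(VECTOR_LEN)]
                  for _ in range(3)]
all_shares = [share(u) for u in client_updates]

# Each aggregator locally sums (mod Q) the shares it received from all clients ...
aggregator_sums = [add_vectors([all_shares[c][a] for c in range(len(client_updates))])
                   for a in range(NUM_AGGREGATORS)]

# ... and only the combination of all partial sums reveals the aggregated update.
global_update = add_vectors(aggregator_sums)
assert global_update == add_vectors(client_updates)
```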
For the further use of (privacy-preserving) FL, the different trade-offs in the respective application scenarios need to be examined in more detail, for example how high the achieved degree of privacy is depending on the number of participants. In addition, the remaining challenges of FL, such as data heterogeneity across end-user devices, also need to be addressed.