Data Heterogeneity in Federated Learning

Title: Data Heterogeneity in Federated Learning
Summary: Addressing the challenges of data imbalance in Federated Learning
Keywords: non-IID data, Federated Learning
References: Advances and Open Problems in Federated Learning;
FedML: A Research Library and Benchmark for Federated Machine Learning
Supervisor: Amira Soliman, Slawomir Nowaczyk
Level: Master
Status: Open

Federated Learning (FL) has been introduced as a distributed and privacy-friendly alternative to centralized learning. FL allows users to train models locally on their devices using their sensitive data and to communicate only intermediate model updates to a central server, without the data ever being stored centrally. The principal advantage of FL is that global model training is decoupled from direct access to the raw data. Accordingly, FL offers a way to learn from private personal data, such as biometrics, text input, and location coordinates, so that models for many services can be trained in a privacy-preserving manner.
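The train-locally-then-aggregate loop described above is typically realized with Federated Averaging (FedAvg): each client runs a few local optimization steps on its private data, and the server averages the returned weights, weighted by local dataset size. The sketch below is a minimal single-process simulation of that loop under illustrative assumptions (linear-regression clients, hypothetical function names); a real deployment would use a framework such as FedML and keep each client's data on its own device.

```python
import numpy as np

def local_update(weights, X, y, lr=0.1, epochs=5):
    """One client's local training: a few gradient steps of linear
    regression on its private data; only the weights leave the device."""
    w = weights.copy()
    for _ in range(epochs):
        grad = 2 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w

def fedavg_round(global_w, clients):
    """Server side of one round: collect client updates and average
    them, weighted by each client's local dataset size."""
    updates, sizes = [], []
    for X, y in clients:
        updates.append(local_update(global_w, X, y))
        sizes.append(len(y))
    return np.average(updates, axis=0, weights=np.array(sizes, dtype=float))

# Synthetic example: two clients whose private datasets are never pooled.
rng = np.random.default_rng(0)
true_w = np.array([2.0, -1.0])
clients = []
for _ in range(2):
    X = rng.normal(size=(50, 2))
    clients.append((X, X @ true_w + 0.01 * rng.normal(size=50)))

w = np.zeros(2)
for _ in range(20):
    w = fedavg_round(w, clients)
print(np.round(w, 2))
```

Because both clients here draw from the same distribution, the averaged model converges to the shared optimum; the statistical difficulties discussed next arise precisely when this IID assumption fails.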

A common assumption in FL is that each node holds an unbiased sample of the complete data. In reality, though, the models created by different users can be quite different, as the data on each device may originate from different phenomena. For example, two randomly picked users are likely to compute very different updates to a typing prediction model. This creates a statistically challenging situation, and most existing methods make strong assumptions about how skewed the data distributions are. In open and decentralized environments, however, imbalanced data and missing classes are common, and it is imperative that FL methods can deal with them. Considerable work has addressed class imbalance and missing classes in the centralized setting, but providing practical and privacy-preserving learning methods for imbalanced data in FL environments is more challenging.

The objective of this thesis is to create an FL algorithm capable of handling data imbalance among participating devices and to propose a solution that enhances model training under this imbalance.
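To make the non-IID setting concrete, a common benchmark practice (used, e.g., in the FedML literature, though not prescribed by this listing) is to simulate label skew by splitting each class across clients with Dirichlet-distributed proportions: a small concentration parameter alpha yields strongly imbalanced partitions in which some clients miss classes entirely. A sketch, with illustrative names:

```python
import numpy as np

def dirichlet_partition(labels, n_clients, alpha, seed=0):
    """Assign sample indices to clients with label skew: for each class,
    split its samples across clients according to Dirichlet(alpha)
    proportions. Smaller alpha -> more skew and more missing classes."""
    rng = np.random.default_rng(seed)
    client_idx = [[] for _ in range(n_clients)]
    for c in range(labels.max() + 1):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # Fraction of class c that each client receives.
        props = rng.dirichlet(alpha * np.ones(n_clients))
        cuts = (np.cumsum(props)[:-1] * len(idx)).astype(int)
        for k, part in enumerate(np.split(idx, cuts)):
            client_idx[k].extend(part)
    return [np.array(ix, dtype=int) for ix in client_idx]

# 5 balanced classes globally, but heavily skewed per client.
labels = np.repeat(np.arange(5), 100)
parts = dirichlet_partition(labels, n_clients=4, alpha=0.1)
for k, ix in enumerate(parts):
    print(f"client {k}: {len(ix)} samples, classes {np.unique(labels[ix]).tolist()}")
```

Partitions generated this way give a controlled testbed for the algorithm the thesis aims to develop: the global label distribution stays balanced while each client's local view is arbitrarily skewed.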