Publications
2024
- [arXiv 2024] Model-Based Differentially Private Knowledge Transfer for Large Language Models. Zhaomin Wu*, Jizhou Guo*, Junyi Hou, Bingsheng He, Lixin Fan, and Qiang Yang. arXiv preprint, 2024.
As large language models (LLMs) become increasingly prevalent in web services, effectively leveraging domain-specific knowledge while ensuring privacy has become critical. Existing methods, such as retrieval-augmented generation (RAG) and differentially private data synthesis, often compromise either the utility of domain knowledge or the privacy of sensitive data, limiting their applicability in specialized domains. To address these challenges, we propose Llamdex, a novel framework that integrates privacy-preserving, domain-specific models into LLMs. Our approach significantly enhances the accuracy of domain-specific tasks, achieving up to a 26% improvement compared to existing methods under the same differential privacy constraints. Experimental results show that Llamdex not only improves the accuracy of LLM responses but also maintains comparable inference efficiency to the original LLM, highlighting its potential for real-world applications.
- [arXiv 2024] Federated Data-Efficient Instruction Tuning for Large Language Models. Zhen Qin, Zhaomin Wu, Bingsheng He, and Shuiguang Deng. arXiv preprint, 2024.
Instruction tuning improves the responsiveness of pretrained large language models (LLMs) to human instructions, a benefit that stems from diversified instruction data. Federated learning extends the sources of instruction data by exploiting diversified client-side data, making it increasingly popular for tuning LLMs. Existing approaches to federated LLM tuning typically traverse all local data during local training, incurring excessive computation overhead and risking overfitting to local data. Thus, a federated data-efficient instruction tuning approach, which consumes relatively little data from the entire dataset, is needed. In response, this work introduces FedHDS, a federated data-efficient instruction tuning approach for LLMs that uses a representative subset of edge-side data (a coreset) to tune the LLM. It reduces the redundancy of data samples at both the intra-client and inter-client levels through a hierarchical data selection framework that jointly selects a small number of representative samples for local training without sharing the raw data. Extensive experiments across six scenarios with various LLMs, datasets, and data partitions demonstrate that FedHDS significantly reduces the amount of data required for fine-tuning while improving the responsiveness of the instruction-tuned LLMs to unseen tasks.
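A rough sketch of the intra-client selection step, assuming instruction samples have already been embedded into vectors; the clustering choice and names such as select_client_coreset are illustrative assumptions, not taken from the paper:

```python
# Illustrative intra-client coreset selection: cluster local instruction
# embeddings and keep the sample closest to each centroid.
import numpy as np
from sklearn.cluster import KMeans

def select_client_coreset(embeddings: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Return indices of k representative samples for one client."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(embeddings)
    reps = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        reps.append(members[np.argmin(dists)])
    return np.asarray(reps)

# Example: a client with 1,000 samples keeps only 20 representatives for tuning.
local_embeddings = np.random.default_rng(0).normal(size=(1000, 64))
coreset_idx = select_client_coreset(local_embeddings, k=20)
```

An inter-client step would then further deduplicate these representatives across clients without exchanging raw samples.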
- [NeurIPS 2024] Federated Transformer: Multi-Party Vertical Federated Learning on Practical Fuzzily Linked Data. Zhaomin Wu, Junyi Hou, Yiqun Diao, and Bingsheng He. In Advances in Neural Information Processing Systems, 2024.
Federated Learning (FL) is an evolving paradigm that enables multiple parties to collaboratively train models without sharing raw data. Vertical Federated Learning (VFL), which involves multiple parties contributing distinct features of a shared instance group, is prevalent in real-world, cross-organizational collaborations. In such setups, parties are typically linked by fuzzy identifiers, a common practical scenario termed multi-party fuzzy VFL. Existing models generally address either multi-party VFL or fuzzy VFL between two parties. Extending these models to practical multi-party fuzzy VFL typically results in significant performance degradation and increased costs for maintaining privacy. To overcome these limitations, we introduce the Federated Transformer (FeT), a novel framework designed to support multi-party VFL with fuzzy identifiers. FeT encodes identifiers into data representations and conducts training using a transformer architecture distributed across different parties, incorporating three new techniques to enhance performance. Additionally, we have developed a scalable privacy framework that integrates differential privacy with secure multi-party computation, effectively protecting local representations at manageable costs. Experiments show that FeT surpasses baseline models by up to 46 percentage points when scaled to 50 parties. FeT also outperforms cutting-edge models in two-party fuzzy VFL settings while offering improved privacy.
- [ICLR 2024] VertiBench: Advancing Feature Distribution Diversity in Vertical Federated Learning Benchmarks. Zhaomin Wu, Junyi Hou, and Bingsheng He. In The Twelfth International Conference on Learning Representations, 2024.
Vertical Federated Learning (VFL) is a crucial paradigm for training machine learning models on feature-partitioned, distributed data. However, due to privacy restrictions, few public real-world VFL datasets exist for algorithm evaluation, and these represent a limited array of feature distributions. Existing benchmarks often resort to synthetic datasets, derived from arbitrary feature splits from a global set, which only capture a subset of feature distributions, leading to inadequate algorithm performance assessment. This paper addresses these shortcomings by introducing two key factors affecting VFL performance - feature importance and feature correlation - and proposing associated evaluation metrics and dataset splitting methods. Additionally, we introduce a real VFL dataset to address the deficit in image-image VFL scenarios. Our comprehensive evaluation of cutting-edge VFL algorithms provides valuable insights for future research in the field.
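As a hedged illustration of what an importance-aware split might look like (this is not VertiBench's actual algorithm; the Dirichlet-based allocation and function name below are assumptions for exposition), one can assign features to parties so that each party receives a controllable share of the total feature importance:

```python
# Hypothetical importance-based vertical split: allocate features to parties
# so that party importance shares roughly follow a Dirichlet(alpha) draw.
import numpy as np

def split_by_importance(importances: np.ndarray, n_parties: int,
                        alpha: float = 1.0, seed: int = 0) -> list[list[int]]:
    rng = np.random.default_rng(seed)
    shares = rng.dirichlet([alpha] * n_parties)    # target importance share per party
    targets = shares * importances.sum()
    order = np.argsort(importances)[::-1]          # assign important features first
    parties: list[list[int]] = [[] for _ in range(n_parties)]
    collected = np.zeros(n_parties)
    for f in order:
        # give the feature to the party furthest below its target share
        p = int(np.argmin(collected - targets))
        parties[p].append(int(f))
        collected[p] += importances[f]
    return parties

# Example: split 10 features among 3 parties with a skewed (small alpha) allocation.
print(split_by_importance(np.random.default_rng(1).random(10), n_parties=3, alpha=0.2))
```

A smaller alpha yields a more imbalanced split, which is one way to stress-test VFL algorithms across diverse feature distributions.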
2023
- [SIGMOD 2023] DeltaBoost: Gradient Boosting Decision Trees with Efficient Machine Unlearning. Zhaomin Wu, Junhui Zhu, Qinbin Li, and Bingsheng He. Proc. ACM Manag. Data, 2023.
DeltaBoost won an Honorable Mention for the Best Artifact Award at SIGMOD 2023.
As machine learning (ML) has been widely deployed in real-world applications, the privacy of ML models draws increasing concern. In this paper, we study how to forget specific data records from ML models to preserve the privacy of these data. Although some studies propose efficient unlearning algorithms for random forests and extremely randomized trees, Gradient Boosting Decision Trees (GBDT), which are widely used in practice, have not been explored. Efficient unlearning of GBDT faces two major challenges: 1) the training of each tree is deterministic and non-robust; 2) the training of a tree depends on all previous trees. To address the first challenge, we propose a robust GBDT-like ML model, DeltaBoost, that enables efficient and accurate deletion according to our theoretical analysis. For the second challenge, we design a training algorithm for DeltaBoost that minimizes the dependency among trees. Our experiments on five datasets demonstrate that DeltaBoost can remove data records from the trained model efficiently and effectively. Our unlearning approach achieves up to two orders of magnitude speedup compared to retraining GBDT. Besides, DeltaBoost delivers performance competitive with existing decision-tree-based ML models.
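For intuition only, the sketch below shows the general idea behind deletion in histogram-based tree training: if per-bin gradient statistics are cached, a record's contribution can be subtracted instead of retraining from scratch. This is a generic illustration, not DeltaBoost's actual algorithm.

```python
# Generic illustration of record deletion from cached split histograms.
import numpy as np

def delete_record(grad_hist: np.ndarray, hess_hist: np.ndarray,
                  record_bin: int, grad: float, hess: float) -> None:
    """Remove one record's contribution from the cached histograms in O(1)."""
    grad_hist[record_bin] -= grad
    hess_hist[record_bin] -= hess
    # A real system would then re-check whether the affected splits remain valid
    # and rebuild only the nodes whose best split changed.
```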
- [MLSys 2023] FedTree: A Federated Learning System For Trees. Qinbin Li, Zhaomin Wu, Yanzheng Cai, Yuxuan Han, Ching Man Yung, Tianyuan Fu, and Bingsheng He. In Proceedings of Machine Learning and Systems, 2023.
While the quality of machine learning services largely relies on the volume of training data, data regulations such as the General Data Protection Regulation (GDPR) impose stringent requirements on data transfer. Federated learning has emerged as a popular approach for enabling collaborative machine learning without sharing raw data. To facilitate the rapid development of federated learning, efficient and user-friendly federated learning systems are essential. Although many existing federated learning systems are designed for deep learning, tree-based federated learning systems have not been well explored. This paper presents FedTree, a tree-based federated learning system built on a histogram-sharing scheme that supports both horizontal and vertical federated training of GBDTs with configurable privacy protection techniques. Our extensive experiments show that FedTree achieves accuracy competitive with centralized training while incurring much less computational cost than other generic federated learning systems.
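A minimal sketch of the histogram-sharing idea for horizontal GBDT training (illustrative only; FedTree's actual implementation, privacy protections, and vertical mode are more involved): each party bins its local gradients and hessians, and only the merged histograms are used to score candidate splits.

```python
# Illustrative histogram-sharing step for horizontal federated GBDT.
import numpy as np

def local_histograms(x: np.ndarray, g: np.ndarray, h: np.ndarray, bin_edges: np.ndarray):
    """Each party computes per-bin gradient/hessian sums for one feature;
    only these histograms (not raw data) leave the party."""
    bins = np.digitize(x, bin_edges)
    G = np.zeros(len(bin_edges) + 1)
    H = np.zeros(len(bin_edges) + 1)
    np.add.at(G, bins, g)
    np.add.at(H, bins, h)
    return G, H

def best_split(G_parts, H_parts, lam: float = 1.0):
    """Aggregator merges the parties' histograms and scores each split point
    with the standard second-order gain."""
    G, H = np.sum(G_parts, axis=0), np.sum(H_parts, axis=0)
    G_tot, H_tot = G.sum(), H.sum()
    best_gain, best_bin, GL, HL = -np.inf, None, 0.0, 0.0
    for b in range(len(G) - 1):
        GL += G[b]
        HL += H[b]
        GR, HR = G_tot - GL, H_tot - HL
        gain = GL**2 / (HL + lam) + GR**2 / (HR + lam) - G_tot**2 / (H_tot + lam)
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain
```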
- [TKDE 2023] A Survey on Federated Learning Systems: Vision, Hype and Reality for Data Privacy and Protection. Qinbin Li, Zeyi Wen, Zhaomin Wu, Sixu Hu, Naibo Wang, Yuan Li, Xu Liu, and Bingsheng He. IEEE Transactions on Knowledge & Data Engineering, 2023.
As data privacy increasingly becomes a critical societal concern, federated learning has been a hot research topic for enabling the collaborative training of machine learning models among different organizations under privacy restrictions. As researchers try to support more machine learning models with different privacy-preserving approaches, there is a need to develop systems and infrastructures that ease the development of various federated learning algorithms. Similar to deep learning systems such as PyTorch and TensorFlow that boost the development of deep learning, federated learning systems (FLSs) are equally important and face challenges from various aspects such as effectiveness, efficiency, and privacy. In this survey, we conduct a comprehensive review of federated learning systems. To understand the key system components and guide future research, we introduce the definition of federated learning systems and analyze the system components. Moreover, we provide a thorough categorization of federated learning systems according to six aspects: data distribution, machine learning model, privacy mechanism, communication architecture, scale of federation, and motivation of federation. This categorization can help guide the design of federated learning systems, as shown in our case studies. By systematically summarizing the existing federated learning systems, we present the design factors, case studies, and future research opportunities.
2022
- [NeurIPS 2022] A Coupled Design of Exploiting Record Similarity for Practical Vertical Federated Learning. Zhaomin Wu, Qinbin Li, and Bingsheng He. In Advances in Neural Information Processing Systems, 2022.
Federated learning is a learning paradigm that enables collaborative learning across different parties without revealing raw data. Notably, vertical federated learning (VFL), where parties share the same set of samples but each holds only partial features, has a wide range of real-world applications. However, most existing studies in VFL disregard the "record linkage" process. They design algorithms either assuming the data from different parties can be exactly linked or simply linking each record with its most similar neighboring record. These approaches may fail to capture key features from other, less similar records. Moreover, such improper linkage cannot be corrected by training, since existing approaches provide no feedback on linkage during training. In this paper, we design a novel coupled training paradigm, FedSim, that integrates one-to-many linkage into the training process. Besides enabling VFL in many real-world applications with fuzzy identifiers, FedSim also achieves better performance in traditional VFL tasks. Moreover, we theoretically analyze the additional privacy risk incurred by sharing similarities. Our experiments on eight datasets with various similarity metrics show that FedSim outperforms other state-of-the-art baselines. The code of FedSim is available at https://github.com/Xtra-Computing/FedSim.
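A toy sketch of one-to-many soft linkage (a deliberate simplification, not FedSim's actual model, which learns the weighting jointly with training): each record of the label party aggregates the k most similar records of the other party, weighted by a softmax over identifier similarities.

```python
# Toy one-to-many linkage by identifier similarity (illustrative only;
# numeric identifiers and a fixed softmax stand in for learned components).
import numpy as np

def soft_link(query_keys: np.ndarray, other_keys: np.ndarray,
              other_feats: np.ndarray, k: int = 5, temperature: float = 1.0) -> np.ndarray:
    linked = []
    for q in query_keys:
        sims = -np.abs(other_keys - q)          # toy similarity on numeric identifiers
        topk = np.argsort(sims)[-k:]            # k most similar candidate records
        w = np.exp(sims[topk] / temperature)
        w /= w.sum()                            # softmax weights over the candidates
        linked.append(w @ other_feats[topk])    # similarity-weighted feature aggregation
    return np.vstack(linked)
```

Because the aggregation depends on similarity rather than a single hard match, imperfect linkage can be compensated during training instead of being frozen in a preprocessing step.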
- [TBD 2022] Practical Vertical Federated Learning with Unsupervised Representation Learning. Zhaomin Wu, Qinbin Li, and Bingsheng He. IEEE Transactions on Big Data, 2022.
As societal concerns about data privacy increase, we have witnessed data silos among multiple parties in various applications. Federated learning emerges as a new learning paradigm that enables multiple parties to collaboratively train a machine learning model without sharing their raw data. Vertical federated learning, where each party owns different features of the same set of samples and only a single party has the label, is an important and challenging topic in federated learning. Communication costs among different parties have been a major hurdle for practical vertical federated learning systems. In this paper, we propose a novel communication-efficient vertical federated learning algorithm named FedOnce, which requires only one-shot communication among parties. To improve model accuracy and provide a privacy guarantee, FedOnce features unsupervised representation learning in the federated setting and privacy-preserving techniques based on the moments accountant. Comprehensive experiments on 10 datasets demonstrate that FedOnce achieves performance close to state-of-the-art vertical federated learning algorithms at much lower communication cost. Meanwhile, our privacy-preserving technique significantly outperforms state-of-the-art approaches under the same privacy budget.
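A minimal sketch of the one-shot communication pattern, assuming two passive parties and one label party; PCA stands in for the paper's unsupervised representation learning, and the privacy mechanism is omitted:

```python
# Illustrative one-shot vertical training: passive parties send unsupervised
# representations once; the label party trains on them plus its own features.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

def passive_message(X_local: np.ndarray, dim: int = 8) -> np.ndarray:
    """The single message a passive party sends (PCA as a stand-in embedding)."""
    return PCA(n_components=dim).fit_transform(X_local)

def active_train(X_active: np.ndarray, messages: list, y: np.ndarray):
    """Label party concatenates received representations with its raw features."""
    X = np.hstack([X_active, *messages])
    return LogisticRegression(max_iter=1000).fit(X, y)

rng = np.random.default_rng(0)
n = 500
X_a, X_b, X_c = rng.normal(size=(n, 10)), rng.normal(size=(n, 20)), rng.normal(size=(n, 15))
y = (X_a[:, 0] + X_b[:, 0] > 0).astype(int)
model = active_train(X_a, [passive_message(X_b), passive_message(X_c)], y)
```

The key point is that the passive parties communicate exactly once, unlike iterative VFL protocols that exchange intermediate values every training step.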
- [TIST 2022] The OARF Benchmark Suite: Characterization and Implications for Federated Learning Systems. Sixu Hu, Yuan Li, Xu Liu, Qinbin Li, Zhaomin Wu, and Bingsheng He. ACM Trans. Intell. Syst. Technol., 2022.
This article presents and characterizes the Open Application Repository for Federated Learning (OARF), a benchmark suite for federated machine learning systems. Previously available benchmarks for federated learning (FL) have focused mainly on synthetic datasets and use a limited number of applications. OARF mimics more realistic application scenarios by using publicly available datasets as different data silos for image, text, and structured data. Our characterization shows that the benchmark suite is diverse in data size, data distribution, feature distribution, and learning task complexity. We have developed reference implementations and evaluated important aspects of FL, including model accuracy, communication cost, throughput, and convergence time; these evaluations reveal future research opportunities for FL systems. Through them, we also discovered interesting findings, such as that FL can effectively increase end-to-end throughput. The code of OARF is publicly available on GitHub.
2020
- [AAAI 2020] Privacy-Preserving Gradient Boosting Decision Trees. Qinbin Li, Zhaomin Wu, Zeyi Wen, and Bingsheng He. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.
The Gradient Boosting Decision Tree (GBDT) has been a popular machine learning model for various tasks in recent years. In this paper, we study how to improve the model accuracy of GBDT while preserving the strong guarantee of differential privacy. Sensitivity and privacy budget are two key design aspects for the effectiveness of differentially private models. Existing solutions for GBDT with differential privacy suffer from significant accuracy loss due to overly loose sensitivity bounds and ineffective privacy budget allocations (especially across different trees in the GBDT model). Loose sensitivity bounds require more noise to achieve a fixed privacy level. Ineffective privacy budget allocations worsen the accuracy loss, especially when the number of trees is large. Therefore, we propose a new GBDT training algorithm that achieves tighter sensitivity bounds and more effective noise allocations. Specifically, by investigating the properties of gradients and the contribution of each tree in GBDTs, we propose to adaptively control the gradients of the training data in each iteration and to clip leaf node values in order to tighten the sensitivity bounds. Furthermore, we design a novel boosting framework to allocate the privacy budget between trees so that the accuracy loss can be further reduced. Our experiments show that our approach achieves much better model accuracy than other baselines.
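A hedged sketch of the per-leaf noise step (an illustrative simplification with a first-order leaf value and made-up parameter names; the paper's full method also controls per-iteration gradients and allocates the privacy budget across trees):

```python
# Illustrative per-leaf Laplace mechanism for DP-GBDT (not the paper's exact algorithm).
import numpy as np

def dp_leaf_value(gradients: np.ndarray, clip: float = 1.0, lam: float = 1.0,
                  epsilon: float = 0.1, rng=None) -> float:
    """Clip per-sample gradients to bound sensitivity, then add Laplace noise."""
    rng = rng or np.random.default_rng()
    g = np.clip(gradients, -clip, clip)
    leaf = -g.sum() / (len(g) + lam)        # first-order GBDT leaf value
    sensitivity = clip / (len(g) + lam)     # rough bound: one record shifts the sum by at most `clip`
    return leaf + rng.laplace(scale=sensitivity / epsilon)
```

The tighter the gradient bound (clip), the less noise is needed for a given epsilon, which is the intuition behind tightening the sensitivity bounds.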