Publications
2024
- [arXiv 2024] Model-Based Differentially Private Knowledge Transfer for Large Language Models. Zhaomin Wu*, Jizhou Guo*, Junyi Hou, Bingsheng He, Lixin Fan, and Qiang Yang. arXiv preprint, 2024.
As large language models (LLMs) become increasingly prevalent in web services, effectively leveraging domain-specific knowledge while ensuring privacy has become critical. Existing methods, such as retrieval-augmented generation (RAG) and differentially private data synthesis, often compromise either the utility of domain knowledge or the privacy of sensitive data, limiting their applicability in specialized domains. To address these challenges, we propose Llamdex, a novel framework that integrates privacy-preserving, domain-specific models into LLMs. Our approach significantly enhances the accuracy of domain-specific tasks, achieving up to a 26% improvement compared to existing methods under the same differential privacy constraints. Experimental results show that Llamdex not only improves the accuracy of LLM responses but also maintains comparable inference efficiency to the original LLM, highlighting its potential for real-world applications.
- [arXiv 2024] Federated Data-Efficient Instruction Tuning for Large Language Models. Zhen Qin, Zhaomin Wu, Bingsheng He, and Shuiguang Deng. arXiv preprint, 2024.
Instruction tuning improves the responsiveness of pretrained large language models (LLMs) to human instructions, a benefit that stems from diversified instruction data. Federated learning extends the sources of instruction data by exploiting diversified client-side data, making it increasingly popular for tuning LLMs. Existing approaches to federated LLM tuning typically traverse all local data during local training, incurring excessive computation overhead and risking overfitting to local data. Thus, a federated data-efficient instruction tuning approach, which consumes relatively little data from the entire dataset, is needed. In response, this work introduces FedHDS, a federated data-efficient instruction tuning approach for LLMs that uses a representative subset of edge-side data (a coreset) to tune the LLM. It reduces the redundancy of data samples at both the intra-client and inter-client levels through a hierarchical data selection framework that jointly selects a small number of representative samples for local training without sharing the raw data. Extensive experiments across six scenarios with various LLMs, datasets, and data partitions demonstrate that FedHDS significantly reduces the amount of data required for fine-tuning while improving the responsiveness of the instruction-tuned LLMs to unseen tasks.
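A rough sketch of the intra-client selection step, assuming instruction samples have already been embedded into vectors; the clustering choice and names such as select_client_coreset are illustrative assumptions, not taken from the paper:

```python
# Illustrative intra-client coreset selection: cluster local instruction
# embeddings and keep the sample closest to each centroid.
import numpy as np
from sklearn.cluster import KMeans

def select_client_coreset(embeddings: np.ndarray, k: int, seed: int = 0) -> np.ndarray:
    """Return indices of k representative samples for one client."""
    km = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(embeddings)
    reps = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(embeddings[members] - km.cluster_centers_[c], axis=1)
        reps.append(members[np.argmin(dists)])
    return np.asarray(reps)

# Example: a client with 1,000 samples keeps only 20 representatives for tuning.
local_embeddings = np.random.default_rng(0).normal(size=(1000, 64))
coreset_idx = select_client_coreset(local_embeddings, k=20)
```

An inter-client step would then further deduplicate these representatives across clients without exchanging raw samples.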
- [NeurIPS 2024] Federated Transformer: Multi-Party Vertical Federated Learning on Practical Fuzzily Linked Data. Zhaomin Wu, Junyi Hou, Yiqun Diao, and Bingsheng He. In Advances in Neural Information Processing Systems, 2024.
Federated Learning (FL) is an evolving paradigm that enables multiple parties to collaboratively train models without sharing raw data. Vertical Federated Learning (VFL), which involves multiple parties contributing distinct features of a shared instance group, is prevalent in real-world, cross-organizational collaborations. In such setups, parties are typically linked by fuzzy identifiers, a common practical scenario termed multi-party fuzzy VFL. Existing models generally address either multi-party VFL or fuzzy VFL between two parties. Extending these models to practical multi-party fuzzy VFL typically results in significant performance degradation and increased costs for maintaining privacy. To overcome these limitations, we introduce the Federated Transformer (FeT), a novel framework designed to support multi-party VFL with fuzzy identifiers. FeT encodes identifiers into data representations and conducts training using a transformer architecture distributed across different parties, incorporating three new techniques to enhance performance. Additionally, we have developed a scalable privacy framework that integrates differential privacy with secure multi-party computation, effectively protecting local representations at manageable costs. Experiments show that FeT surpasses baseline models by up to 46 percentage points when scaled to 50 parties. FeT also outperforms cutting-edge models in two-party fuzzy VFL settings while offering improved privacy.
- [ICLR 2024] VertiBench: Advancing Feature Distribution Diversity in Vertical Federated Learning Benchmarks. Zhaomin Wu, Junyi Hou, and Bingsheng He. In The Twelfth International Conference on Learning Representations, 2024.
Vertical Federated Learning (VFL) is a crucial paradigm for training machine learning models on feature-partitioned, distributed data. However, due to privacy restrictions, few public real-world VFL datasets exist for algorithm evaluation, and these represent a limited array of feature distributions. Existing benchmarks often resort to synthetic datasets, derived from arbitrary feature splits from a global set, which only capture a subset of feature distributions, leading to inadequate algorithm performance assessment. This paper addresses these shortcomings by introducing two key factors affecting VFL performance - feature importance and feature correlation - and proposing associated evaluation metrics and dataset splitting methods. Additionally, we introduce a real VFL dataset to address the deficit in image-image VFL scenarios. Our comprehensive evaluation of cutting-edge VFL algorithms provides valuable insights for future research in the field.
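As a hedged illustration of what an importance-aware split might look like (this is not VertiBench's actual algorithm; the Dirichlet-based allocation and function name below are assumptions for exposition), one can assign features to parties so that each party receives a controllable share of the total feature importance:

```python
# Hypothetical importance-based vertical split: allocate features to parties
# so that party importance shares roughly follow a Dirichlet(alpha) draw.
import numpy as np

def split_by_importance(importances: np.ndarray, n_parties: int,
                        alpha: float = 1.0, seed: int = 0) -> list[list[int]]:
    rng = np.random.default_rng(seed)
    shares = rng.dirichlet([alpha] * n_parties)    # target importance share per party
    targets = shares * importances.sum()
    order = np.argsort(importances)[::-1]          # assign important features first
    parties: list[list[int]] = [[] for _ in range(n_parties)]
    collected = np.zeros(n_parties)
    for f in order:
        # give the feature to the party furthest below its target share
        p = int(np.argmin(collected - targets))
        parties[p].append(int(f))
        collected[p] += importances[f]
    return parties

# Example: split 10 features among 3 parties with a skewed (small alpha) allocation.
print(split_by_importance(np.random.default_rng(1).random(10), n_parties=3, alpha=0.2))
```

A smaller alpha yields a more imbalanced split, which is one way to stress-test VFL algorithms across diverse feature distributions.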
2023
- [SIGMOD 2023] DeltaBoost: Gradient Boosting Decision Trees with Efficient Machine Unlearning. Zhaomin Wu, Junhui Zhu, Qinbin Li, and Bingsheng He. Proc. ACM Manag. Data, 2023.
DeltaBoost won an Honorable Mention for the Best Artifact Award at SIGMOD 2023.
As machine learning (ML) has been widely deployed in real-world applications, the privacy of ML models draws increasing concern. In this paper, we study how to forget specific data records from ML models to preserve the privacy of these data. Although some studies propose efficient unlearning algorithms for random forests and extremely randomized trees, Gradient Boosting Decision Trees (GBDT), which are widely used in practice, have not been explored. Efficient unlearning of GBDT faces two major challenges: 1) the training of each tree is deterministic and non-robust; 2) the training of a tree depends on all previous trees. To address the first challenge, we propose a robust GBDT-like ML model, DeltaBoost, that enables efficient and accurate deletion according to our theoretical analysis. For the second challenge, we design a training algorithm for DeltaBoost that minimizes the dependency among trees. Our experiments on five datasets demonstrate that DeltaBoost can remove data records from the trained model efficiently and effectively. Our unlearning approach achieves up to two orders of magnitude speedup compared to retraining GBDT. Besides, DeltaBoost delivers performance competitive with existing decision-tree-based ML models.
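For intuition only, the sketch below shows the general idea behind deletion in histogram-based tree training: if per-bin gradient statistics are cached, a record's contribution can be subtracted instead of retraining from scratch. This is a generic illustration, not DeltaBoost's actual algorithm.

```python
# Generic illustration of record deletion from cached split histograms.
import numpy as np

def delete_record(grad_hist: np.ndarray, hess_hist: np.ndarray,
                  record_bin: int, grad: float, hess: float) -> None:
    """Remove one record's contribution from the cached histograms in O(1)."""
    grad_hist[record_bin] -= grad
    hess_hist[record_bin] -= hess
    # A real system would then re-check whether the affected splits remain valid
    # and rebuild only the nodes whose best split changed.
```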
- [MLSys 2023] FedTree: A Federated Learning System For Trees. Qinbin Li, Zhaomin Wu, Yanzheng Cai, Yuxuan Han, Ching Man Yung, Tianyuan Fu, and Bingsheng He. In Proceedings of Machine Learning and Systems, 2023.
While the quality of machine learning services largely relies on the volume of training data, data regulations such as the General Data Protection Regulation (GDPR) impose stringent requirements on data transfer. Federated learning has emerged as a popular approach for enabling collaborative machine learning without sharing raw data. To facilitate the rapid development of federated learning, efficient and user-friendly federated learning systems are essential. Although many existing federated learning systems are designed for deep learning, tree-based federated learning systems have not been well explored. This paper presents FedTree, a tree-based federated learning system built on a histogram-sharing scheme that supports both horizontal and vertical federated training of GBDTs with configurable privacy protection techniques. Our extensive experiments show that FedTree achieves accuracy competitive with centralized training while incurring much less computational cost than other generic federated learning systems.
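A minimal sketch of the histogram-sharing idea for horizontal GBDT training (illustrative only; FedTree's actual implementation, privacy protections, and vertical mode are more involved): each party bins its local gradients and hessians, and only the merged histograms are used to score candidate splits.

```python
# Illustrative histogram-sharing step for horizontal federated GBDT.
import numpy as np

def local_histograms(x: np.ndarray, g: np.ndarray, h: np.ndarray, bin_edges: np.ndarray):
    """Each party computes per-bin gradient/hessian sums for one feature;
    only these histograms (not raw data) leave the party."""
    bins = np.digitize(x, bin_edges)
    G = np.zeros(len(bin_edges) + 1)
    H = np.zeros(len(bin_edges) + 1)
    np.add.at(G, bins, g)
    np.add.at(H, bins, h)
    return G, H

def best_split(G_parts, H_parts, lam: float = 1.0):
    """Aggregator merges the parties' histograms and scores each split point
    with the standard second-order gain."""
    G, H = np.sum(G_parts, axis=0), np.sum(H_parts, axis=0)
    G_tot, H_tot = G.sum(), H.sum()
    best_gain, best_bin, GL, HL = -np.inf, None, 0.0, 0.0
    for b in range(len(G) - 1):
        GL += G[b]
        HL += H[b]
        GR, HR = G_tot - GL, H_tot - HL
        gain = GL**2 / (HL + lam) + GR**2 / (HR + lam) - G_tot**2 / (H_tot + lam)
        if gain > best_gain:
            best_gain, best_bin = gain, b
    return best_bin, best_gain
```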
- [TKDE 2023] A Survey on Federated Learning Systems: Vision, Hype and Reality for Data Privacy and Protection. Qinbin Li, Zeyi Wen, Zhaomin Wu, Sixu Hu, Naibo Wang, Yuan Li, Xu Liu, and Bingsheng He. IEEE Transactions on Knowledge & Data Engineering, 2023.
As data privacy increasingly becomes a critical societal concern, federated learning has been a hot research topic for enabling the collaborative training of machine learning models among different organizations under privacy restrictions. As researchers try to support more machine learning models with different privacy-preserving approaches, there is a need to develop systems and infrastructures that ease the development of various federated learning algorithms. Similar to deep learning systems such as PyTorch and TensorFlow that boost the development of deep learning, federated learning systems (FLSs) are equally important and face challenges from various aspects such as effectiveness, efficiency, and privacy. In this survey, we conduct a comprehensive review of federated learning systems. To understand the key system components and guide future research, we introduce the definition of federated learning systems and analyze the system components. Moreover, we provide a thorough categorization of federated learning systems according to six aspects: data distribution, machine learning model, privacy mechanism, communication architecture, scale of federation, and motivation of federation. This categorization can help guide the design of federated learning systems, as shown in our case studies. By systematically summarizing the existing federated learning systems, we present the design factors, case studies, and future research opportunities.
2022
- [NeurIPS 2022] A Coupled Design of Exploiting Record Similarity for Practical Vertical Federated Learning. Zhaomin Wu, Qinbin Li, and Bingsheng He. In Advances in Neural Information Processing Systems, 2022.
Federated learning is a learning paradigm that enables collaborative learning across different parties without revealing raw data. Notably, vertical federated learning (VFL), where parties share the same set of samples but each holds only partial features, has a wide range of real-world applications. However, most existing studies in VFL disregard the "record linkage" process. They design algorithms either assuming the data from different parties can be exactly linked or simply linking each record with its most similar neighboring record. These approaches may fail to capture key features from other, less similar records. Moreover, such improper linkage cannot be corrected by training, since existing approaches provide no feedback on linkage during training. In this paper, we design a novel coupled training paradigm, FedSim, that integrates one-to-many linkage into the training process. Besides enabling VFL in many real-world applications with fuzzy identifiers, FedSim also achieves better performance in traditional VFL tasks. Moreover, we theoretically analyze the additional privacy risk incurred by sharing similarities. Our experiments on eight datasets with various similarity metrics show that FedSim outperforms other state-of-the-art baselines. The code of FedSim is available at https://github.com/Xtra-Computing/FedSim.
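A toy sketch of one-to-many soft linkage (a deliberate simplification, not FedSim's actual model, which learns the weighting jointly with training): each record of the label party aggregates the k most similar records of the other party, weighted by a softmax over identifier similarities.

```python
# Toy one-to-many linkage by identifier similarity (illustrative only;
# numeric identifiers and a fixed softmax stand in for learned components).
import numpy as np

def soft_link(query_keys: np.ndarray, other_keys: np.ndarray,
              other_feats: np.ndarray, k: int = 5, temperature: float = 1.0) -> np.ndarray:
    linked = []
    for q in query_keys:
        sims = -np.abs(other_keys - q)          # toy similarity on numeric identifiers
        topk = np.argsort(sims)[-k:]            # k most similar candidate records
        w = np.exp(sims[topk] / temperature)
        w /= w.sum()                            # softmax weights over the candidates
        linked.append(w @ other_feats[topk])    # similarity-weighted feature aggregation
    return np.vstack(linked)
```

Because the aggregation depends on similarity rather than a single hard match, imperfect linkage can be compensated during training instead of being frozen in a preprocessing step.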
- [TBD 2022] Practical Vertical Federated Learning with Unsupervised Representation Learning. Zhaomin Wu, Qinbin Li, and Bingsheng He. IEEE Transactions on Big Data, 2022.
As societal concerns about data privacy increase, we have witnessed data silos among multiple parties in various applications. Federated learning emerges as a new learning paradigm that enables multiple parties to collaboratively train a machine learning model without sharing their raw data. Vertical federated learning, where each party owns different features of the same set of samples and only a single party has the label, is an important and challenging topic in federated learning. Communication costs among different parties have been a major hurdle for practical vertical federated learning systems. In this paper, we propose a novel communication-efficient vertical federated learning algorithm named FedOnce, which requires only one-shot communication among parties. To improve model accuracy and provide a privacy guarantee, FedOnce features unsupervised representation learning in the federated setting and privacy-preserving techniques based on the moments accountant. Comprehensive experiments on 10 datasets demonstrate that FedOnce achieves performance close to state-of-the-art vertical federated learning algorithms at much lower communication cost. Meanwhile, our privacy-preserving technique significantly outperforms state-of-the-art approaches under the same privacy budget.
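A minimal sketch of the one-shot communication pattern, assuming two passive parties and one label party; PCA stands in for the paper's unsupervised representation learning, and the privacy mechanism is omitted:

```python
# Illustrative one-shot vertical training: passive parties send unsupervised
# representations once; the label party trains on them plus its own features.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

def passive_message(X_local: np.ndarray, dim: int = 8) -> np.ndarray:
    """The single message a passive party sends (PCA as a stand-in embedding)."""
    return PCA(n_components=dim).fit_transform(X_local)

def active_train(X_active: np.ndarray, messages: list, y: np.ndarray):
    """Label party concatenates received representations with its raw features."""
    X = np.hstack([X_active, *messages])
    return LogisticRegression(max_iter=1000).fit(X, y)

rng = np.random.default_rng(0)
n = 500
X_a, X_b, X_c = rng.normal(size=(n, 10)), rng.normal(size=(n, 20)), rng.normal(size=(n, 15))
y = (X_a[:, 0] + X_b[:, 0] > 0).astype(int)
model = active_train(X_a, [passive_message(X_b), passive_message(X_c)], y)
```

The key point is that the passive parties communicate exactly once, unlike iterative VFL protocols that exchange intermediate values every training step.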
- [TIST 2022] The OARF Benchmark Suite: Characterization and Implications for Federated Learning Systems. Sixu Hu, Yuan Li, Xu Liu, Qinbin Li, Zhaomin Wu, and Bingsheng He. ACM Trans. Intell. Syst. Technol., 2022.
This article presents and characterizes the Open Application Repository for Federated Learning (OARF), a benchmark suite for federated machine learning systems. Previously available benchmarks for federated learning (FL) have focused mainly on synthetic datasets and use a limited number of applications. OARF mimics more realistic application scenarios by using publicly available datasets as different data silos for image, text, and structured data. Our characterization shows that the benchmark suite is diverse in data size, data distribution, feature distribution, and learning task complexity. We have developed reference implementations and evaluated important aspects of FL, including model accuracy, communication cost, throughput, and convergence time; these evaluations reveal future research opportunities for FL systems. Through them, we also discovered interesting findings, such as that FL can effectively increase end-to-end throughput. The code of OARF is publicly available on GitHub.
2020
- [AAAI 2020] Privacy-Preserving Gradient Boosting Decision Trees. Qinbin Li, Zhaomin Wu, Zeyi Wen, and Bingsheng He. In The Thirty-Fourth AAAI Conference on Artificial Intelligence, 2020.
The Gradient Boosting Decision Tree (GBDT) has been a popular machine learning model for various tasks in recent years. In this paper, we study how to improve the model accuracy of GBDT while preserving the strong guarantee of differential privacy. Sensitivity and privacy budget are two key design aspects for the effectiveness of differentially private models. Existing solutions for GBDT with differential privacy suffer from significant accuracy loss due to overly loose sensitivity bounds and ineffective privacy budget allocations (especially across different trees in the GBDT model). Loose sensitivity bounds require more noise to achieve a fixed privacy level. Ineffective privacy budget allocations worsen the accuracy loss, especially when the number of trees is large. Therefore, we propose a new GBDT training algorithm that achieves tighter sensitivity bounds and more effective noise allocations. Specifically, by investigating the properties of gradients and the contribution of each tree in GBDTs, we propose to adaptively control the gradients of the training data in each iteration and to clip leaf node values in order to tighten the sensitivity bounds. Furthermore, we design a novel boosting framework to allocate the privacy budget between trees so that the accuracy loss can be further reduced. Our experiments show that our approach achieves much better model accuracy than other baselines.
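A hedged sketch of the per-leaf noise step (an illustrative simplification with a first-order leaf value and made-up parameter names; the paper's full method also controls per-iteration gradients and allocates the privacy budget across trees):

```python
# Illustrative per-leaf Laplace mechanism for DP-GBDT (not the paper's exact algorithm).
import numpy as np

def dp_leaf_value(gradients: np.ndarray, clip: float = 1.0, lam: float = 1.0,
                  epsilon: float = 0.1, rng=None) -> float:
    """Clip per-sample gradients to bound sensitivity, then add Laplace noise."""
    rng = rng or np.random.default_rng()
    g = np.clip(gradients, -clip, clip)
    leaf = -g.sum() / (len(g) + lam)        # first-order GBDT leaf value
    sensitivity = clip / (len(g) + lam)     # rough bound: one record shifts the sum by at most `clip`
    return leaf + rng.laplace(scale=sensitivity / epsilon)
```

The tighter the gradient bound (clip), the less noise is needed for a given epsilon, which is the intuition behind tightening the sensitivity bounds.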