Exploring Model Fusion with Optimal Transport on Transformers

The field of artificial intelligence now offers a wealth of models that achieve strong performance on tasks such as object detection, image segmentation, and language translation. This raises a natural question: how can we combine the knowledge captured by each individual model to improve overall performance? That’s where model fusion comes in. By combining the knowledge of different parent models, we hope to create a child network that outperforms its parents.
However, combining models comes with its own set of challenges: ensembling incurs the computational overhead of running every parent model at inference time, and jointly retraining a combined model typically requires pooling valuable, often sensitive training data. To address these issues, we’re exploring optimal transport (OT) fusion, a form of weight averaging that first aligns the neurons of the parent models and only then averages their weights. Because fusion operates on the weights alone, it abstracts the training data away from the model-building phase and eliminates the need to share private data.
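To make the alignment step concrete, here is a minimal sketch of OT-based fusion for a single pair of linear layers, assuming uniform support mass over neurons and a squared Euclidean ground cost between incoming weight vectors. It uses the POT library (`ot`); the helper `fuse_linear_layers` is our own illustrative name, not an API from any fusion paper.

```python
import numpy as np
import ot  # POT: Python Optimal Transport (pip install pot)

def fuse_linear_layers(w_a: np.ndarray, w_b: np.ndarray) -> np.ndarray:
    """Align the neurons (rows) of w_b to those of w_a, then average.

    w_a, w_b: weight matrices of shape (n_neurons, n_inputs).
    """
    n = w_a.shape[0]
    # Support mass: here uniform, every neuron carries weight 1/n.
    mass_a = np.full(n, 1.0 / n)
    mass_b = np.full(n, 1.0 / n)
    # Ground cost: squared Euclidean distance between incoming weight vectors.
    cost = ot.dist(w_b, w_a)  # shape (n, n)
    # Exact optimal transport plan between the two sets of neurons.
    plan = ot.emd(mass_b, mass_a, cost)
    # Transport w_b's neurons onto w_a's: each aligned row is a
    # mass-weighted mixture of w_b's rows, rescaled by the target mass.
    w_b_aligned = (plan / mass_a[None, :]).T @ w_b
    # Plain averaging is now meaningful because the neurons are aligned.
    return 0.5 * (w_a + w_b_aligned)

# Toy usage: two random 4-neuron layers with 3 inputs each.
rng = np.random.default_rng(0)
fused = fuse_linear_layers(rng.normal(size=(4, 3)), rng.normal(size=(4, 3)))
print(fused.shape)  # (4, 3)
```

In a full network, the transport plan for one layer would also be propagated to re-order the input dimensions of the next layer; the uniform `mass_a`/`mass_b` vectors are exactly where the non-uniform support mass distributions we discuss below would plug in.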
To test the effectiveness of model fusion, we’ll use different datasets across several tasks. For our first task, we’ll explore parallelizing training by splitting a large dataset into smaller segments and training a separate transformer model on each segment. We’ll then compare the model trained on the whole dataset, the fused model, and the individual models trained on partial data, as sketched below. Beyond parallelization, we’re also interested in a decentralized approach to training in the spirit of federated learning.
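As a rough illustration of this protocol, here is a hypothetical sketch using the HuggingFace `datasets` library; `train_model`, `fuse_models`, and `evaluate` are placeholders for our own training loop, the OT fusion step sketched above, and our evaluation harness.

```python
# Hypothetical experiment protocol: split, train in parallel, fuse, compare.
from datasets import load_dataset

full = load_dataset("imdb", split="train").shuffle(seed=42)
shard_a = full.shard(num_shards=2, index=0)  # first half of the data
shard_b = full.shard(num_shards=2, index=1)  # second half

model_full = train_model(full)       # baseline: whole dataset
model_a = train_model(shard_a)       # parent trained on partial data
model_b = train_model(shard_b)
model_fused = fuse_models(model_a, model_b)  # OT fusion of the parents

for name, model in [("full", model_full), ("partial", model_a),
                    ("fused", model_fused)]:
    print(name, evaluate(model))     # evaluate() is also a placeholder
```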
We’ll also experiment with fusing models trained on bilingual data to see whether the fused model can perform well in both languages. Here we’ll try different support mass distributions (the importance each neuron carries in the transport problem) and test machine translation to see how fusing decoders performs. To evaluate our approach, we’ll use accuracy and the F1-score for sentiment analysis tasks and the BLEU score for machine translation tasks.
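For the metrics themselves we can rely on standard implementations. A small sketch, assuming integer class labels for sentiment and detokenized prediction strings for translation, using scikit-learn and `sacrebleu`:

```python
import sacrebleu
from sklearn.metrics import accuracy_score, f1_score

# Sentiment analysis: accuracy and F1 over integer class labels.
y_true = [1, 0, 1, 1]
y_pred = [1, 0, 0, 1]
print("accuracy:", accuracy_score(y_true, y_pred))
print("F1:", f1_score(y_true, y_pred))

# Machine translation: corpus-level BLEU over detokenized strings.
hypotheses = ["the cat sat on the mat"]
references = [["the cat is sitting on the mat"]]  # one inner list per reference set
bleu = sacrebleu.corpus_bleu(hypotheses, references)
print("BLEU:", bleu.score)
```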
In summary, our goal is to explore the potential of model fusion using weight averaging and optimal transport techniques. By addressing the challenges of combining models, such as the computational overhead and the need for sharing sensitive data, we hope to create a more effective and decentralized approach to training AI models. Through our experiments with different datasets and evaluation metrics, we’ll determine the reliability and effectiveness of our approach.