LLM Hyperparameter Optimization API: My Google Summer of Code Journey with Kubeflow

Link

2024-09-19 ~1 min read

Jump to TL;DR Jump to Summary Open Original ↗

⚡ TL;DR

📝 Summary

Motivation Goal My Contributions to the GSoC Project Lessons Learned Think from the User’s Perspective Don’t Fear Bugs Communication is Important Every Contribution Counts In The End This summer, I had the opportunity to participate in the Google Summer of Code (GSoC) program, where I contributed to Kubeflow, an open-source machine learning toolkit. My project focused on developing a high-level API for optimizing hyperparameters in Large Language Models (LLMs) within Katib, Kubeflow’s automated hyperparameter tuning system. I’d like to share insights from this experience with others interested in Kubeflow, GSoC, or optimizing LLMs. The rapid advancements and rising popularity of LLMs, such as GPT and BERT, have created a growing demand for efficient LLMOps in Kubernetes. To address this, we have developed a train API within the Training Python SDK, simplifying the process of fine-tuning LLMs using distributed PyTorchJob workers. However, hyperparameter optimization remains a crucial yet labor-intensive task for enhancing model performance. Hyperparameter optimization is essential but time-consuming, especially for LLMs with billions of parameters. This API simplifies the process by handling Kubernetes infrastructure, allowing data scientists to focus on model performance rather than system configuration. With this API, users can import pretrained models and datasets from Hugging Face and Amazon S3, define parameters including the hyperparameter search space, optimization objective, and resource configuration. The API then automates the creation of Experiment, which contains multiple Trials with different hyperparameter settings using PyTorch distributed training. It then collects and analyzes the metrics from each Trial to identify the optimal hyperparameter configuration. For detailed instruction on using the API, please refer to this guide My work on the project can be broadly divided into four stages: Stage 1 : Designing the API, drafting the project proposal, and refining it into a Kubeflow Enhancement Proposal (KEP).

Open the original post ↗ https://blog.kubeflow.org/gsoc-2024-project-4/