Optimizing Large Language Models for Resource-Constrained Environments: A Parameter-Efficient Approach Using QLoRA and Prompt Tuning
Keywords:
management, parameter-efficient fine-tuning, large language models, QLoRA, prompt tuning, resource-constrained environments, NLP, memory optimization, deployment cost reduction, text classification, quantization, low-rank adaptation

Abstract
As the deployment of AI solutions continues to grow, particularly in resource-constrained environments, the need for efficient and cost-effective methods becomes increasingly critical. Large Language Models (LLMs) pose significant computational challenges that often make their deployment impractical for many real-world applications. This study evaluates parameter-efficient fine-tuning methods, specifically QLoRA and Prompt Tuning, in combination with DistilBERT, to address these challenges. Our combined approach achieved a 36.2% reduction in memory usage and a 50% reduction in inference costs while maintaining 87.75% accuracy, comparable to baseline models. The results demonstrate that stacking these techniques can yield multiplicative benefits in resource reduction without significant performance degradation, offering practical solutions for resource-constrained deployments.