Introduction
At SETA International Vietnam, I built a QnA bot on Google's Gemini LLM to help engineers search 100,000+ documents across Confluence, Jira, and Zendesk. Deploying and maintaining an LLM in production taught me valuable lessons. Here's what I learned along the way.
The Challenge
Engineers were spending hours searching for documentation across multiple platforms. We needed a solution that could:
- Understand natural language queries
- Search across diverse data sources
- Provide accurate, contextual answers
- Scale to handle hundreds of concurrent users
- Stay within reasonable cost constraints
Architecture Design
Data Ingestion Pipeline
The first step was building a robust data pipeline to ingest documents from various sources:
- Webhooks: Real-time updates for new/modified documents
- Batch Processing: Scheduled full syncs to catch any missed updates
- Data Transformation: Standardizing formats across different source systems
- Vector Embeddings: Converting documents into searchable vector representations
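The transformation step is where most of the pipeline's complexity lived: each source system returns a differently shaped payload, and everything has to be normalized into one schema before embedding. Here's a minimal sketch of that step; the field mappings below are illustrative, not the exact payloads each API returns.

```python
# Normalize source-specific payloads into one shared document schema
# before chunking and embedding. Field names are illustrative.
from dataclasses import dataclass

@dataclass
class Document:
    source: str      # "confluence", "jira", or "zendesk"
    doc_id: str
    title: str
    body: str
    updated_at: str  # ISO-8601 timestamp

def normalize(source: str, raw: dict) -> Document:
    """Map a source-specific payload onto the shared schema."""
    if source == "confluence":
        return Document(source, raw["id"], raw["title"],
                        raw["body"]["storage"]["value"], raw["version"]["when"])
    if source == "jira":
        fields = raw["fields"]
        return Document(source, raw["key"], fields["summary"],
                        fields.get("description") or "", fields["updated"])
    if source == "zendesk":
        return Document(source, str(raw["id"]), raw["subject"],
                        raw["description"], raw["updated_at"])
    raise ValueError(f"unknown source: {source}")
```

With a shared schema in place, the chunking and embedding stages don't need to know which system a document came from.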
Retrieval-Augmented Generation (RAG)
We implemented a RAG architecture to enhance the LLM's responses with relevant context:
- User query is converted to vector embeddings
- Semantic search finds the most relevant documents
- Retrieved documents are included in the LLM prompt
- LLM generates answer based on provided context
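The four steps above can be sketched end to end. In this toy version a bag-of-words counter stands in for the real Embeddings API and the final Gemini call is stubbed out; only the control flow mirrors the production system.

```python
# Minimal RAG flow: embed query -> semantic search -> build prompt.
# A word-count "embedding" stands in for real vector embeddings.
from collections import Counter
import math

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, corpus: list, k: int = 2) -> list:
    """Return the k documents most similar to the query."""
    q = embed(query)
    return sorted(corpus, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def answer(query: str, corpus: list) -> str:
    """Assemble the grounded prompt that would be sent to the LLM."""
    context = "\n".join(retrieve(query, corpus))
    return (f"Context:\n{context}\n\n"
            f"Question: {query}\n"
            f"Answer using only the context above.")
```

In production, `embed` calls the embeddings API, `retrieve` queries the vector store, and the assembled prompt goes to Gemini.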
Prompt Engineering
Effective prompt engineering was crucial for getting accurate, helpful responses. Key strategies:
System Prompts
We defined clear system prompts that instructed the model to:
- Only answer based on provided context
- Cite sources for all information
- Admit when it doesn't know the answer
- Maintain a helpful, professional tone
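A system prompt encoding those four rules might look like the following. The wording is illustrative rather than our exact production prompt, which went through many iterations.

```python
# Illustrative system prompt encoding the four grounding rules.
SYSTEM_PROMPT = """You are a documentation assistant for engineers.
Rules:
1. Answer ONLY from the context provided below; never use outside knowledge.
2. Cite the source document (title and link) for every claim you make.
3. If the context does not contain the answer, say "I don't know" \
rather than guessing.
4. Keep a helpful, professional tone."""
```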
Few-Shot Learning
Including examples of good question-answer pairs in the prompt significantly improved response quality, especially for domain-specific queries.
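In practice this meant prepending a handful of curated question-answer pairs to every prompt. The pairs below are invented for illustration; the real ones were drawn from queries our engineers actually asked.

```python
# Hypothetical few-shot pairs; the real set came from actual user queries.
FEW_SHOT = [
    ("How do I rotate the staging API key?",
     "Per the staging runbook: rotate the key in the secrets manager, "
     "then update the CI secret. [source: Confluence]"),
    ("What is our ticket naming convention?",
     "Per the engineering workflow doc: prefix tickets with the team key, "
     "e.g. ENG-123. [source: Confluence]"),
]

def build_prompt(question: str, context: str) -> str:
    """Prepend few-shot examples, then the retrieved context and question."""
    shots = "\n\n".join(f"Q: {q}\nA: {a}" for q, a in FEW_SHOT)
    return f"{shots}\n\nContext:\n{context}\n\nQ: {question}\nA:"
```

The examples double as a format specification: the model learns to cite sources because every example does.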
Context Management
Managing context windows effectively was one of the biggest challenges:
Chunking Strategy
- Split documents into semantic chunks (paragraphs, sections)
- Maintain overlap between chunks to preserve context
- Store metadata (source, title, date) with each chunk
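A simplified version of that strategy, using a word count in place of a real tokenizer (the sizes and overlap here are illustrative defaults, not our tuned values):

```python
def chunk_document(doc_text: str, meta: dict,
                   size: int = 200, overlap: int = 50) -> list:
    """Split text into overlapping word-based chunks, copying the
    document's metadata (source, title, date) onto each chunk."""
    words = doc_text.split()
    step = size - overlap
    chunks = []
    for start in range(0, max(len(words) - overlap, 1), step):
        chunks.append({**meta,
                       "text": " ".join(words[start:start + size]),
                       "start_word": start})
    return chunks
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, which noticeably reduced answers built on cut-off context.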
Context Ranking
Not all retrieved documents are equally relevant. We implemented a ranking system that considers:
- Semantic similarity score
- Document recency
- Source reliability
- User feedback (thumbs up/down)
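Combining those four signals can be as simple as a weighted sum. The weights and reliability values below are illustrative; ours were tuned against real user feedback.

```python
# Weighted ranking over the four signals; all constants are illustrative.
from datetime import datetime

WEIGHTS = {"similarity": 0.6, "recency": 0.2, "reliability": 0.1, "feedback": 0.1}
SOURCE_RELIABILITY = {"confluence": 1.0, "jira": 0.8, "zendesk": 0.7}

def rank_score(chunk: dict, now: datetime) -> float:
    age_days = (now - chunk["updated_at"]).days
    recency = 1.0 / (1.0 + age_days / 365)           # decays over ~a year
    reliability = SOURCE_RELIABILITY.get(chunk["source"], 0.5)
    ups = chunk.get("thumbs_up", 0)
    downs = chunk.get("thumbs_down", 0)
    feedback = (ups + 1) / (ups + downs + 2)          # Laplace-smoothed ratio
    return (WEIGHTS["similarity"] * chunk["similarity"]
            + WEIGHTS["recency"] * recency
            + WEIGHTS["reliability"] * reliability
            + WEIGHTS["feedback"] * feedback)
```

Smoothing the thumbs-up ratio keeps a single vote from dominating the score for rarely-read documents.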
Cost Optimization
LLM API calls can get expensive quickly. Here's how we managed costs:
Caching Strategy
- Cache responses for identical queries
- Implement semantic caching for similar queries
- Cache embeddings for frequent documents
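The exact-match and semantic layers fit naturally into one cache class. In this sketch a toy bag-of-words embedding stands in for real vectors, and the similarity threshold is an illustrative value rather than our tuned one.

```python
# Two-layer response cache: exact string match, then embedding similarity.
# A word-count "embedding" stands in for real vectors.
import math
from collections import Counter

def embed(text: str) -> Counter:
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.9):
        self.exact = {}        # query string -> response
        self.entries = []      # (query embedding, response)
        self.threshold = threshold

    def get(self, query: str):
        if query in self.exact:                    # layer 1: identical query
            return self.exact[query]
        q = embed(query)                           # layer 2: similar query
        for vec, response in self.entries:
            if cosine(q, vec) >= self.threshold:
                return response
        return None

    def put(self, query: str, response: str) -> None:
        self.exact[query] = response
        self.entries.append((embed(query), response))
```

Every semantic-cache hit saves both an embedding call and a generation call, which is where most of the savings came from.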
Smart Token Management
- Truncate retrieved context to fit within limits
- Prioritize most relevant chunks
- Use summarization for very long documents
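Truncation and prioritization combine into a greedy packing step. Here a word count stands in for a real token count, and the chunks are assumed to arrive already sorted by the ranking described earlier.

```python
def fit_context(chunks: list, budget: int) -> str:
    """Greedily pack the highest-ranked chunks (assumed pre-sorted)
    into a word budget; a word count stands in for real token counting."""
    selected, used = [], 0
    for chunk in chunks:
        n = len(chunk.split())
        if used + n > budget:
            continue  # skip chunks that don't fit; a smaller one may still
        selected.append(chunk)
        used += n
    return "\n\n".join(selected)
```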
Model Selection
We used different models for different tasks:
- Gemini Pro: Main QnA responses
- Smaller models: Query classification and routing
- Embeddings API: Vector generation
Production Challenges
Latency Management
Initial response times were too slow. We brought them down with:
- Parallel vector search and embedding generation
- Streaming responses for better perceived performance
- Pre-computing embeddings for frequently accessed documents
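Fanning searches out to the per-source indexes concurrently, rather than querying them one after another, was one of the cheapest latency wins. A sketch with `asyncio` (the per-source search function here is a stub):

```python
import asyncio

async def search_source(source: str, query: str) -> list:
    """Stub for an async search against one source's index."""
    await asyncio.sleep(0)  # placeholder for a real async I/O call
    return [f"{source}: result for '{query}'"]

async def parallel_search(query: str, sources: list) -> list:
    """Query every source index concurrently and flatten the hits."""
    per_source = await asyncio.gather(
        *(search_source(s, query) for s in sources))
    return [hit for hits in per_source for hit in hits]
```

With three sources, the slowest single index now bounds total search time instead of the sum of all three.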
Error Handling
LLM APIs can fail in various ways. Robust error handling includes:
- Retry logic with exponential backoff
- Fallback to cached responses when available
- Clear error messages for users
- Comprehensive logging for debugging
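The retry-with-fallback pattern looked roughly like this (a sketch, with illustrative defaults for the retry count and delays):

```python
import logging
import random
import time

def call_with_retry(fn, retries: int = 3, base_delay: float = 0.5,
                    fallback=None):
    """Retry fn with exponential backoff and jitter; return the cached
    fallback if every attempt fails, otherwise raise."""
    for attempt in range(retries):
        try:
            return fn()
        except Exception as exc:
            logging.warning("LLM call failed (attempt %d): %s",
                            attempt + 1, exc)
            if attempt < retries - 1:
                time.sleep(base_delay * (2 ** attempt)
                           + random.uniform(0, 0.1))  # backoff + jitter
    if fallback is not None:
        return fallback
    raise RuntimeError("LLM call failed after retries; no cached fallback")
```

The jitter keeps a burst of concurrent users from retrying in lockstep and hammering the API at the same instant.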
Monitoring and Observability
We implemented detailed monitoring to track:
- Response quality (user feedback)
- Latency metrics
- API costs per query
- Error rates
- Most common queries
Results and Impact
After deployment, we saw significant improvements:
- 80% reduction in time spent searching for documentation
- Positive feedback from 85% of users
- Average response time under 3 seconds
- API costs within budget constraints
Key Takeaways
- Context is King: The quality of retrieved context directly impacts response quality
- Iterate on Prompts: Prompt engineering is an iterative process based on real user feedback
- Monitor Everything: Comprehensive monitoring is essential for identifying and fixing issues
- Manage Costs: Implement caching and smart context management from day one
- User Feedback Loop: Continuous improvement requires gathering and acting on user feedback
Conclusion
Deploying LLMs in production is challenging but rewarding. The key is to focus on the fundamentals: good data quality, effective prompt engineering, robust error handling, and continuous monitoring. With these in place, LLMs can provide tremendous value to users and organizations.