Scale Simple Chat on Azure
Use this guide when a working deployment needs more throughput, higher availability, or cleaner scaling rules across App Service, Cosmos DB, Search, and AI services.
Scaling Simple Chat is about coordinating several Azure services, not just adding more web instances. Use this guide when you already have a functioning deployment and need to expand capacity without breaking session behavior, retrieval performance, or downstream service quotas.
Scale App Service deliberately
Vertical and horizontal scaling solve different problems, and scale-out requires session strategy to be correct before you add instances.
Track Cosmos RU pressure
Most data-tier trouble shows up as throttling, query latency, or hot containers before it shows up as a full outage.
Protect retrieval quality
Search replicas, partitions, and index performance all influence how grounded chat behaves at higher load.
Mind downstream AI quotas
Azure OpenAI and other AI services have their own TPM, RPM, and service limits, so app scaling alone does not solve throughput bottlenecks.
Redis comes before App Service scale-out
If you plan to run more than one web instance, configure shared session storage first. Without it, scale-out can create authentication and session consistency problems that look like random app instability.
When to Scale
Monitor these indicators to know when scaling is needed:
Performance Metrics to Watch
- CPU utilization consistently above 70%
- Memory usage consistently above 80%
- Request latency increasing beyond acceptable levels
- Queue lengths growing consistently
- RU consumption approaching limits in Cosmos DB
- Rate limit responses (HTTP 429 errors) from Azure services
Use Azure Monitor and Application Insights to track these metrics.
Scaling Azure App Service
The App Service hosts your Python backend application.
Scale Up (Vertical Scaling)
When to scale up:
- Individual requests are resource-intensive
- Single instances need more power
- Complex document processing operations
How to scale up:
- Go to Azure Portal → your App Service
- Navigate to Scale up (App Service plan)
- Select a higher pricing tier:
- From P0v3 → P1v3, P2v3, P3v3
- Consider Premium v3 tiers for better performance
- Click Apply
Immediate benefits:
- More CPU cores per instance
- Increased RAM per instance
- Better disk I/O performance
Scale Out (Horizontal Scaling)
When to scale out:
- High number of concurrent users
- Need improved availability
- Load balancing across instances
Prerequisites: ⚠️ IMPORTANT: You must configure Azure Cache for Redis before scaling out. Without Redis, user sessions will be tied to specific instances, causing authentication issues.
Setup Redis first:
- Deploy Azure Cache for Redis instance
- Configure Simple Chat to use Redis for session storage
- Update application configuration
- Test with multiple instances
How to scale out:
- Go to Azure Portal → your App Service
- Navigate to Scale out (App Service plan)
- Increase Instance count manually, or
- Configure Autoscale rules:
- Based on CPU percentage (scale out at 70%, scale in at 30%)
- Based on memory usage
- Based on request queue length
Autoscale example configuration:
Scale Out Rule:
- Metric: CPU Percentage
- Operator: Greater than
- Threshold: 70%
- Duration: 5 minutes
- Action: Increase instance count by 1
Scale In Rule:
- Metric: CPU Percentage
- Operator: Less than
- Threshold: 30%
- Duration: 10 minutes
- Action: Decrease instance count by 1
Minimum instances: 2
Maximum instances: 10
Scaling Azure Cosmos DB
Cosmos DB stores conversations, metadata, and settings.
Understanding Request Units (RU/s)
Request Units measure the cost of database operations:
- Read operations: Lower RU cost
- Write operations: Higher RU cost
- Complex queries: Variable RU cost based on complexity
Autoscale Throughput (Recommended)
Benefits:
- Automatic scaling between 10% and 100% of max RU/s
- Pay only for what you use
- Handles traffic spikes automatically
How to configure:
- Go to Azure Portal → your Cosmos DB account
- Navigate to Data Explorer
- For each container, select Scale & Settings
- Choose Autoscale throughput
- Set Maximum RU/s
Recommended Container Scaling
High-traffic containers:
messages container:
- Max RU/s: 4,000 (scales 400-4,000)
- Reason: High read/write volume from chat
documents container:
- Max RU/s: 4,000 (scales 400-4,000)
- Reason: Document metadata operations
group_documents container:
- Max RU/s: 4,000 (scales 400-4,000)
- Reason: Group document sharing
Lower-traffic containers:
settings container:
- Max RU/s: 1,000 (scales 100-1,000)
- Reason: Infrequent configuration changes
feedback container:
- Max RU/s: 1,000 (scales 100-1,000)
- Reason: Periodic feedback submission
archived_conversations container:
- Max RU/s: 1,000 (scales 100-1,000)
- Reason: Archive operations are less frequent
Monitor and Adjust
Key metrics to monitor:
- Request Unit consumption per container
- Throttling events (HTTP 429 errors)
- Query performance and latency
How to monitor:
- Use Azure Monitor metrics for Cosmos DB
- Set up alerts for high RU consumption
- Review query performance in Data Explorer
- Check throttling metrics regularly
When to increase RU/s:
- Consistent throttling (429 errors)
- Query latency increasing
- Application performance degrading
Scaling Azure AI Search
AI Search powers document retrieval and hybrid search.
Search Service Tiers
Basic tier:
- Limited storage and query volume
- Good for development and small deployments
Standard tiers (S1, S2, S3):
- Increased storage capacity
- Higher query per second (QPS) limits
- Better performance for production workloads
Storage Optimized tiers (L1, L2):
- Optimized for large document collections
- Lower cost per GB of storage
When to Scale AI Search
Scale up when you experience:
- Slow search response times
- Query throttling
- Running out of storage space
- Need higher availability (replicas)
How to Scale AI Search
Increase search units:
- Go to Azure Portal → your Search Service
- Navigate to Scale
- Adjust Search units (combination of replicas and partitions)
- Replicas: Improve query performance and availability
- Partitions: Increase storage and indexing capacity
Scaling examples:
Small deployment:
- 1 replica, 1 partition (1 search unit)
- Up to 15 million documents
- Basic query performance
Medium deployment:
- 2 replicas, 2 partitions (4 search units)
- Up to 30 million documents
- Better performance and availability
Large deployment:
- 3 replicas, 3 partitions (9 search units)
- Up to 45 million documents
- High performance and high availability
Monitor AI Search Performance
Key metrics:
- Query latency: Time to execute searches
- Query per second (QPS): Current query load
- Storage utilization: How much space is used
- Indexing performance: Document processing speed
Optimization tips:
- Use semantic search for better relevance
- Implement query caching where appropriate
- Consider index optimization for large datasets
- Monitor suggester performance if using autocomplete
Scaling Azure OpenAI Services
OpenAI services power the chat models and embeddings.
Understand Rate Limits
Each deployment has rate limits measured in:
- Tokens per minute (TPM)
- Requests per minute (RPM)
Scale OpenAI Deployments
Increase quota:
- Go to Azure OpenAI Studio
- Navigate to Deployments
- Select your deployment
- Increase TPM/RPM limits if quota allows
- Request quota increases if needed
Multiple deployments:
- Deploy same model to multiple regions
- Implement load balancing between deployments
- Use API Management for advanced routing
Model selection for scale:
- GPT-4: Higher quality, lower throughput
- GPT-3.5 Turbo: Lower cost, higher throughput
- Consider newer models as they become available
Scaling Other Components
Azure Cache for Redis
When to scale Redis:
- High memory utilization (>80%)
- Increased connection counts
- Performance degradation
How to scale:
- Scale up: Move to higher pricing tier (C1 → C2 → C3, etc.)
- Scale out: Use Redis clustering (Premium tiers)
- Monitor: Memory usage, CPU, network throughput
Azure Storage Account (Enhanced Citations)
When to scale:
- High transaction volumes
- Storage capacity approaching limits
- Performance degradation
How to scale:
- Performance tiers: Standard → Premium storage
- Replication: LRS → ZRS → GRS for availability
- Multiple accounts: Distribute load across accounts
Document Intelligence and Other AI Services
Scaling approach:
- Request rate limits: Similar to OpenAI services
- Multiple regions: Deploy across regions for availability
- Service tiers: Standard tiers offer higher limits
Best Practices for Scaling
Planning
- ✅ Baseline performance: Measure current metrics before scaling
- ✅ Load testing: Test scaling decisions in non-production first
- ✅ Gradual scaling: Make incremental changes
- ✅ Monitor impact: Watch metrics after each scaling change
Cost Optimization
- ✅ Autoscale where possible: Use autoscale for variable workloads
- ✅ Right-sizing: Don’t over-provision resources
- ✅ Reserved instances: Use reservations for predictable workloads
- ✅ Resource cleanup: Remove unused resources
Monitoring and Alerting
- ✅ Comprehensive monitoring: Monitor all components
- ✅ Proactive alerts: Set alerts before hitting limits
- ✅ Regular reviews: Review performance and scaling regularly
- ✅ Scaling runbooks: Document scaling procedures
High Availability
- ✅ Multi-instance App Service: At least 2 instances
- ✅ Redis clustering: Use Premium Redis for HA
- ✅ Multiple search replicas: Ensure search availability
- ✅ Cross-region: Consider multi-region for critical workloads
Troubleshooting Scaling Issues
App Service Scaling Problems
Session issues after scale out:
- Verify Redis configuration
- Check session storage settings
- Test load balancer configuration
Performance not improving:
- Check if bottleneck is elsewhere (database, AI services)
- Verify application is utilizing multiple cores
- Review application performance profiles
Database Scaling Issues
Continued throttling after RU increase:
- Check for hot partitions
- Review query patterns and indexing
- Consider data modeling changes
High costs:
- Review RU consumption patterns
- Optimize queries to reduce RU usage
- Consider data archival strategies
Search Performance Issues
Slow queries after scaling:
- Review index design and field configuration
- Check query complexity and filters
- Consider semantic search optimizations
Scaling Checklist
Before scaling:
- Identify performance bottlenecks
- Review current resource utilization
- Plan scaling approach
- Prepare monitoring and rollback plans
During scaling:
- Enable Redis before App Service scale out
- Configure autoscale rules appropriately
- Set up proper monitoring and alerting
- Test functionality after each change
After scaling:
- Monitor performance improvements
- Verify cost impacts are acceptable
- Document changes and decisions
- Plan next scaling steps if needed
This guide provides the foundation for scaling Simple Chat effectively. Remember that scaling is an iterative process - monitor, adjust, and optimize continuously as your usage grows.