Azure OpenAI Monitoring with Datadog
This document outlines the monitoring setup for Azure OpenAI services using Datadog.
Features
- Real-time Metrics: Track API calls, latency, and token usage
- Error Tracking: Monitor error rates and types
- Rate Limiting: Get alerts for rate limit thresholds
- Cost Monitoring: Track token usage and estimated costs
- Performance Dashboards: Pre-built dashboards for AI operations
Prerequisites
- Datadog Agent v7.27.0+
- Azure CLI access
- Kubernetes cluster with Datadog operator installed
1. Configure Azure AD Application

   ```shell
   # Create Azure AD application for Datadog
   az ad app create --display-name "Datadog Monitoring"

   # Create a service principal for the application so a role can be assigned to it
   az ad sp create --id <app-id>

   # Assign Monitoring Reader role to the application
   az role assignment create \
     --assignee <app-id> \
     --role "Monitoring Reader" \
     --scope /subscriptions/<subscription-id>/resourceGroups/<resource-group>
   ```

2. Deploy Monitoring Configuration

   ```shell
   # Make the script executable
   chmod +x scripts/setup-azure-openai-monitoring.sh

   # Run the setup script
   AZURE_CLIENT_ID="your-client-id" \
   AZURE_CLIENT_SECRET="your-client-secret" \
   AZURE_TENANT_ID="your-tenant-id" \
   AZURE_SUBSCRIPTION_ID="your-subscription-id" \
   AZURE_OPENAI_RESOURCE_GROUP="your-resource-group" \
   DEPLOYMENT_NAME="your-deployment-name" \
   MODEL_NAME="your-model-name" \
   AZURE_REGION="your-region" \
   ./scripts/setup-azure-openai-monitoring.sh
   ```
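Because the setup script takes all of its configuration from environment variables, a quick pre-flight check can catch a missing value before anything is deployed. The following is a minimal sketch, assuming the script needs exactly the eight variables listed in the command above; `check_required_env` is a hypothetical helper and is not part of `setup-azure-openai-monitoring.sh`.

```shell
# Hypothetical pre-flight helper (not part of the setup script):
# prints each named variable that is unset or empty and returns non-zero
# if at least one is missing.
check_required_env() {
  missing=0
  for name in "$@"; do
    # Look up the variable whose name is stored in $name (POSIX-compatible).
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

One way to use it is to export the variables first and then run `check_required_env AZURE_CLIENT_ID AZURE_CLIENT_SECRET AZURE_TENANT_ID AZURE_SUBSCRIPTION_ID AZURE_OPENAI_RESOURCE_GROUP DEPLOYMENT_NAME MODEL_NAME AZURE_REGION && ./scripts/setup-azure-openai-monitoring.sh`, so the script only starts when every value is present.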
Dashboards
1. Azure OpenAI Overview
- API request volume
- Error rates and types
- Token usage and costs
- Latency percentiles
2. Embedding Operations
- Embedding generation latency
- Vector search performance
- Batch processing metrics
3. Rate Limiting
- Requests by status code
- Rate limit utilization
- Throttling events
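Rate limit utilization can be read as the share of the quota that has already been consumed. The source does not say how the dashboard derives it, so the following is only a sketch of the arithmetic, assuming a known limit and a remaining-quota value are available; `rate_limit_utilization` and both inputs are hypothetical.

```shell
# Hypothetical helper: utilization (%) = (limit - remaining) / limit * 100.
rate_limit_utilization() {
  limit=$1 remaining=$2
  LC_ALL=C awk -v l="$limit" -v r="$remaining" 'BEGIN {
    if (l <= 0) { print "limit must be positive" > "/dev/stderr"; exit 1 }
    printf "%.1f\n", (l - r) / l * 100
  }'
}
```

For example, with a limit of 1000 requests and 250 remaining, `rate_limit_utilization 1000 250` prints `75.0`.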
Alerts
| Alert Name | Condition | Severity | Notification Channel |
|---|---|---|---|
| High Error Rate | `avg(last_5m):sum:azure.openai_service.api_errors{*}.as_rate() / sum:azure.openai_service.api_requests{*}.as_rate() > 0.1` | Critical | Slack, PagerDuty |
| High Latency | `avg(last_15m):avg:azure.openai_service.latency{*} > 1000` | Warning | Slack |
| Rate Limited | `sum(last_5m):sum:azure.openai_service.rate_limited_requests{*} > 0` | Critical | PagerDuty |
| High Token Usage | `sum(last_1h):sum:azure.openai_service.tokens_used{*}.as_count() > 1000000` | Warning | Slack |
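The High Error Rate condition divides the error rate by the request rate and fires when the ratio exceeds 0.1, that is, when more than 10% of requests fail. The same arithmetic can be sketched locally to sanity-check a threshold against sample counts. `error_rate_exceeds` is a hypothetical helper, and the counts passed to it are illustrative inputs, not values read from Datadog.

```shell
# Hypothetical helper mirroring the High Error Rate condition:
# succeeds (exit 0) when errors / requests is greater than the threshold.
error_rate_exceeds() {
  errors=$1 requests=$2 threshold=${3:-0.1}
  LC_ALL=C awk -v e="$errors" -v r="$requests" -v t="$threshold" 'BEGIN {
    if (r <= 0) exit 1       # no traffic: the ratio is undefined, so do not alert
    exit !((e / r) > t)      # exit 0 only when the ratio exceeds the threshold
  }'
}
```

With these assumptions, `error_rate_exceeds 20 100` succeeds (20% of requests failed), while `error_rate_exceeds 5 100` does not.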
Custom Metrics
| Metric Name | Type | Description | Tags |
|---|---|---|---|
| `azure.openai.embedding.latency` | Gauge | Latency of embedding generation | deployment, model |
| `azure.openai.embedding.tokens` | Count | Tokens used for embeddings | deployment, model |
| `azure.openai.rate_limit.remaining` | Gauge | Remaining rate limit | deployment |
| `azure.openai.vector_search.duration` | Histogram | Vector search duration | index, dimensions |
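The source does not say how these custom metrics are submitted. One common route is DogStatsD, which accepts plain-text datagrams of the form `name:value|type|#tag:value,...`. The sketch below only formats such a datagram; `dogstatsd_line` is a hypothetical helper, and the tag values in the example are placeholders.

```shell
# Hypothetical helper: formats one DogStatsD datagram.
# Type codes used by the metrics in the table: g = gauge, c = count, h = histogram.
dogstatsd_line() {
  metric=$1 value=$2 type=$3 tags=${4:-}
  if [ -n "$tags" ]; then
    printf '%s:%s|%s|#%s\n' "$metric" "$value" "$type" "$tags"
  else
    printf '%s:%s|%s\n' "$metric" "$value" "$type"
  fi
}
```

For example, `dogstatsd_line azure.openai.embedding.latency 182 g deployment:my-deployment,model:my-model` prints `azure.openai.embedding.latency:182|g|#deployment:my-deployment,model:my-model`. Assuming netcat is installed and the Agent's DogStatsD listener is reachable on its default UDP port 8125, the output could be piped to `nc -u -w1 127.0.0.1 8125`.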
Troubleshooting
Common Issues
1. Missing Metrics
   - Verify that the Azure AD application has the correct permissions
   - Check the Datadog Agent logs for connection errors
   - Ensure resource tags match in Azure and Datadog

2. Authentication Failures

   ```shell
   # Check Datadog agent logs
   kubectl logs -l app=datadog-agent -n datadog | grep -i error

   # Verify Azure AD application credentials
   az ad app show --id <app-id>
   ```

3. High Latency
   - Check Azure OpenAI service health
   - Review Datadog network metrics
   - Verify region alignment between client and service
Best Practices
1. Tagging Strategy
   - Use consistent tags across resources
   - Include environment, team, and service information
   - Add model version and deployment details

2. Alert Thresholds
   - Set thresholds that reflect your workload's normal traffic and latency
   - Notify multiple channels for critical alerts
   - Reduce alert fatigue by tuning or retiring alerts that fire without requiring action

3. Cost Control
   - Set up budget alerts in Azure
   - Monitor token usage patterns
   - Implement rate limiting in your application
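Estimated cost follows directly from token usage: tokens divided by 1,000, multiplied by the price per 1,000 tokens. The sketch below shows only that arithmetic. `estimate_token_cost` is a hypothetical helper, and the price is a placeholder argument, not an actual Azure OpenAI rate; substitute the rate that applies to your model and region.

```shell
# Hypothetical helper: estimated cost = tokens / 1000 * price per 1K tokens.
# The price is supplied by the caller; no real pricing is built in.
estimate_token_cost() {
  tokens=$1 price_per_1k=$2
  LC_ALL=C awk -v n="$tokens" -v p="$price_per_1k" 'BEGIN {
    printf "%.4f\n", n / 1000 * p
  }'
}
```

For instance, at a placeholder price of 0.002 per 1,000 tokens, the 1,000,000-token threshold used by the High Token Usage alert corresponds to `estimate_token_cost 1000000 0.002`, which prints `2.0000`.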