Deploying Azure OpenAI Monitoring
This guide explains how to deploy and configure Azure OpenAI monitoring using Datadog.
Prerequisites
- Terraform >= 1.0
- Datadog API and App keys with appropriate permissions
- Azure CLI installed and configured
- Access to the Azure subscription with OpenAI service
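The tool prerequisites above can be checked before starting. The following is a minimal sketch, not part of the repository: the function names version_ge and check_prereqs are hypothetical, and it assumes that `terraform version` prints a first line of the form `Terraform v1.5.7`. It does not check Datadog key permissions.

```shell
# Succeeds when dotted version $1 is greater than or equal to $2.
# Only the numeric fields are compared; missing fields count as 0.
version_ge() {
  awk -v a="$1" -v b="$2" 'BEGIN {
    n = split(a, x, "."); m = split(b, y, ".")
    len = (n > m) ? n : m
    for (i = 1; i <= len; i++) {
      xi = (i <= n) ? x[i] + 0 : 0
      yi = (i <= m) ? y[i] + 0 : 0
      if (xi > yi) exit 0
      if (xi < yi) exit 1
    }
    exit 0
  }'
}

# Hypothetical pre-flight check for the tools listed under Prerequisites.
check_prereqs() {
  for tool in terraform az; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing tool: $tool" >&2
      return 1
    fi
  done

  # Assumes the first line looks like "Terraform v1.5.7".
  tf_version=$(terraform version | head -n 1 | sed 's/^Terraform v//')
  if ! version_ge "$tf_version" "1.0"; then
    echo "Terraform $tf_version is older than 1.0" >&2
    return 1
  fi

  # "az account show" fails when the Azure CLI has no active login.
  if ! az account show >/dev/null 2>&1; then
    echo "Azure CLI is not logged in (run: az login)" >&2
    return 1
  fi
}
```

Calling check_prereqs before step 1 reports the first missing prerequisite on stderr and returns a non-zero status.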
Deployment Steps
1. Initialize Terraform
    cd infrastructure
    terraform init
2. Configure Environment Variables
Create a terraform.tfvars file with your configuration:
    environment = "production"
    datadog_api_key = "your-datadog-api-key"
    datadog_app_key = "your-datadog-app-key"
    # Optional: Customize alert thresholds
    error_rate_threshold = 5.0 # percentage
    latency_threshold_ms = 1000 # milliseconds
3. Review the Plan
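Before planning, it can be worth confirming that both Datadog keys are actually available. Terraform also reads input variables from environment variables named TF_VAR_<name>, so the keys do not have to be written to terraform.tfvars. The sketch below assumes that approach; the helper name require_tf_vars is hypothetical.

```shell
# Fails if any of the named Terraform variables is missing from the
# environment. Each argument is a variable name without the TF_VAR_ prefix.
require_tf_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of TF_VAR_<name>; empty when the variable is unset.
    eval "value=\${TF_VAR_${name}:-}"
    if [ -z "$value" ]; then
      echo "TF_VAR_${name} is not set" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage (the values are placeholders):
#   export TF_VAR_datadog_api_key="your-datadog-api-key"
#   export TF_VAR_datadog_app_key="your-datadog-app-key"
#   require_tf_vars datadog_api_key datadog_app_key
```

Values in terraform.tfvars take precedence over TF_VAR_ environment variables, so the keys should be removed from the file if they are supplied this way.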
    terraform plan -target=module.azure_openai_monitoring
4. Apply the Configuration
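Applying a saved plan is one way to make sure that what gets applied is exactly what was reviewed, since a bare terraform apply computes a fresh plan. The sketch below is an optional variant of steps 3 and 4, not the repository's own workflow; the helper name plan_and_apply and the default plan file name are assumptions.

```shell
# Hypothetical helper: write the plan to a file, show it, and apply that
# exact plan only after an explicit "y".
plan_and_apply() {
  plan_file="${1:-azure_openai_monitoring.tfplan}"   # file name is an assumption

  terraform plan -target=module.azure_openai_monitoring -out="$plan_file" || return 1
  terraform show "$plan_file" || return 1

  printf 'Apply this plan? [y/N] '
  read -r answer
  if [ "$answer" = "y" ]; then
    terraform apply "$plan_file"
  else
    echo "Plan not applied; it remains in $plan_file"
  fi
}
```

Saved plan files can contain sensitive values such as the Datadog keys, so the file should be treated like a secret and deleted after use.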
    terraform apply -target=module.azure_openai_monitoring
5. Verify Deployment
- Log in to your Datadog dashboard
- Navigate to “Dashboards” and look for “[ENV] Azure OpenAI - Vibecode-AI - Overview”
- Check the “Monitors” section for the newly created alerts
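In addition to the checks in the Datadog UI, the Terraform state can confirm what was created. The sketch below summarizes the output of terraform state list by resource type. It assumes the module creates datadog_dashboard and datadog_monitor resources; the helper name summarize_module_state is hypothetical.

```shell
# Reads resource addresses (one per line, as printed by "terraform state list")
# on stdin and prints "<count> <resource_type>" for each resource type that
# belongs to module.azure_openai_monitoring.
summarize_module_state() {
  awk -F. '
    $1 == "module" && $2 == "azure_openai_monitoring" { count[$3]++ }
    END { for (type in count) print count[type], type }
  ' | sort -k2
}

# Usage, from the infrastructure directory:
#   terraform state list | summarize_module_state
```

Given the dashboard and the two alerts described in this guide, the summary would be expected to show at least one dashboard and two monitors, under the assumption about resource types stated above.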
Configuration Options
Alert Thresholds
Customize alert thresholds in infrastructure/monitoring/azure_openai.tf:
    module "azure_openai_monitoring" {
      # ... existing configuration ...

      # Alert thresholds
      error_rate_threshold = 5.0  # 5% error rate
      latency_threshold_ms = 1000 # 1 second

      # ... rest of the configuration ...
    }
Notification Channels
Update the Slack channel in infrastructure/monitoring/azure_openai.tf:
    slack_channel = var.environment == "production" ? "#alerts-ai" : "#alerts-dev"
Monitoring Dashboard
The dashboard includes the following widgets:
- API Requests: Shows request volume by status code
- Error Rate: Displays the percentage of failed requests
- Token Usage: Tracks token consumption by type
Alerts
Two main alerts are configured:
- High Error Rate: Triggers when the error rate exceeds error_rate_threshold
- High Latency: Triggers when response time exceeds latency_threshold_ms
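The error rate comparison can be illustrated with a small calculation. This is a sketch of the arithmetic only, assuming the error rate is failed requests divided by total requests, expressed as a percentage; the actual monitor query lives in the module and may differ. The function name error_rate_exceeds is hypothetical.

```shell
# Succeeds when the error rate (failed / total, in percent) is strictly
# greater than the threshold. Arguments: failed total threshold_percent
error_rate_exceeds() {
  awk -v failed="$1" -v total="$2" -v threshold="$3" 'BEGIN {
    if (total + 0 <= 0) exit 1        # no traffic, nothing to alert on
    rate = 100 * failed / total
    if (rate > threshold + 0) exit 0
    exit 1
  }'
}

# Usage with error_rate_threshold = 5.0:
#   error_rate_exceeds 6 100 5.0 && echo "would alert"   # 6% is above 5%
#   error_rate_exceeds 5 100 5.0 || echo "no alert"      # 5% does not exceed 5%
```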
Troubleshooting
Missing Metrics
If metrics are not appearing in Datadog:
- Verify the Azure integration is properly configured in Datadog
- Check the Datadog agent logs for connection issues:
    kubectl logs -l app=datadog-agent -n datadog
- Ensure the service name matches in both the module and your application
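For the agent log check above, note that kubectl logs returns only the last 10 lines per pod when a label selector is used, so connection problems are easy to miss. The sketch below widens the window and filters for likely problem lines. The keyword list and the helper name agent_log_problems are assumptions, not a complete list of Datadog agent error patterns.

```shell
# Reads log lines on stdin and keeps those that look like problems.
# The keyword list is an assumption; adjust it to what the agent emits.
agent_log_problems() {
  grep -iE 'error|warn|timeout|refused'
}

# Usage: fetch up to 500 lines per pod from the last hour, prefixed with
# the pod and container name, then filter.
#   kubectl logs -l app=datadog-agent -n datadog --tail=500 --since=1h --prefix \
#     | agent_log_problems
```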
False Positives
If you’re getting false positive alerts:
- Adjust the threshold values in the module configuration
- Modify the evaluation window for alerts
- Add additional filters to the alert queries
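The effect of the evaluation window can be seen with a small calculation. The sketch below assumes the alert compares the average of the samples in the window against the threshold, which is one common aggregation and may not match the module's query. The function name window_avg_exceeds and the sample values are made up for illustration.

```shell
# Succeeds when the average of the samples is strictly greater than the
# threshold. Arguments: threshold sample [sample ...]
window_avg_exceeds() {
  threshold="$1"
  shift
  printf '%s\n' "$@" | awk -v threshold="$threshold" '
    { sum += $1; n++ }
    END {
      if (n == 0) exit 1
      if (sum / n > threshold + 0) exit 0
      exit 1
    }'
}

# Usage with latency_threshold_ms = 1000 and one 3000 ms spike:
#   window_avg_exceeds 1000 3000                     # short window: exceeds
#   window_avg_exceeds 1000 400 450 3000 420 410     # longer window: average 936, does not exceed
```

A longer window smooths out isolated spikes at the cost of reacting more slowly to sustained problems.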
Maintenance
Updating the Module
To update the monitoring configuration:
- Make your changes in the module
- Run terraform plan to review changes
- Apply with terraform apply
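When the update steps above run in a script or CI job, terraform plan can report whether there is anything to apply through its exit status. With the -detailed-exitcode flag, 0 means no changes, 1 means an error, and 2 means changes are present. The sketch below maps those codes to messages; the helper name describe_plan_exit is hypothetical.

```shell
# Translates the exit status of "terraform plan -detailed-exitcode"
# into a short message.
describe_plan_exit() {
  case "$1" in
    0) echo "no changes" ;;
    1) echo "plan failed" ;;
    2) echo "changes pending" ;;
    *) echo "unexpected exit code: $1" ;;
  esac
}

# Usage:
#   terraform plan -detailed-exitcode -target=module.azure_openai_monitoring
#   describe_plan_exit "$?"
```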
Removing the Monitoring
To completely remove the monitoring:
    terraform destroy -target=module.azure_openai_monitoring
Support
For issues with the monitoring setup, contact the AI Engineering team or open an issue in the repository.