Azure OpenAI Monitoring with Datadog

This document outlines the monitoring setup for Azure OpenAI services using Datadog.

Features

  • Real-time Metrics: Track API calls, latency, and token usage
  • Error Tracking: Monitor error rates and types
  • Rate Limiting: Get alerts for rate limit thresholds
  • Cost Monitoring: Track token usage and estimated costs
  • Performance Dashboards: Pre-built dashboards for AI operations

Prerequisites

  • Datadog Agent v7.27.0+
  • Azure CLI access
  • Kubernetes cluster with the Datadog Operator installed

Setup

  1. Configure Azure AD Application

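    Terminal window
    # Create an Azure AD application for Datadog
    az ad app create --display-name "Datadog Monitoring"
    # Create a service principal for the application; role assignments
    # target the service principal, not the app registration itself
    az ad sp create --id <app-id>
    # Assign the Monitoring Reader role to the service principal
    az role assignment create \
      --assignee <app-id> \
      --role "Monitoring Reader" \
      --scope /subscriptions/<subscription-id>/resourceGroups/<resource-group>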
  2. Deploy Monitoring Configuration

    Terminal window
    # Make the script executable
    chmod +x scripts/setup-azure-openai-monitoring.sh
    # Run the setup script
    AZURE_CLIENT_ID="your-client-id" \
    AZURE_CLIENT_SECRET="your-client-secret" \
    AZURE_TENANT_ID="your-tenant-id" \
    AZURE_SUBSCRIPTION_ID="your-subscription-id" \
    AZURE_OPENAI_RESOURCE_GROUP="your-resource-group" \
    DEPLOYMENT_NAME="your-deployment-name" \
    MODEL_NAME="your-model-name" \
    AZURE_REGION="your-region" \
    ./scripts/setup-azure-openai-monitoring.sh
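
Because the setup script depends on eight environment variables, a pre-flight check can catch an empty value before anything is deployed. The sketch below is a hypothetical helper, not part of `setup-azure-openai-monitoring.sh`; the function name `missing_vars` is an assumption, and only the variable names come from the invocation above.

```shell
# Hypothetical pre-flight check (not part of the setup script itself).
# The variable names are the ones passed to the script in step 2.
REQUIRED_VARS="AZURE_CLIENT_ID AZURE_CLIENT_SECRET AZURE_TENANT_ID AZURE_SUBSCRIPTION_ID AZURE_OPENAI_RESOURCE_GROUP DEPLOYMENT_NAME MODEL_NAME AZURE_REGION"

# Print the name of every required variable that is unset or empty.
# Returns 0 when all are set, 1 otherwise.
missing_vars() {
  mv_status=0
  for mv_name in $REQUIRED_VARS; do
    # Look up the value of the variable whose name is stored in $mv_name.
    eval "mv_value=\${$mv_name:-}"
    if [ -z "$mv_value" ]; then
      echo "$mv_name"
      mv_status=1
    fi
  done
  return "$mv_status"
}
```

One way to use it, assuming the variables are exported in the current shell rather than passed inline, is `missing_vars && ./scripts/setup-azure-openai-monitoring.sh`, which runs the script only when nothing is missing.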

Dashboard Metrics

  • API request volume
  • Error rates and types
  • Token usage and costs
  • Latency percentiles
  • Embedding generation latency
  • Vector search performance
  • Batch processing metrics
  • Requests by status code
  • Rate limit utilization
  • Throttling events

Alerts

| Alert Name | Condition | Severity | Notification Channel |
|---|---|---|---|
| High Error Rate | avg(last_5m):sum:azure.openai_service.api_errors{*}.as_rate() / sum:azure.openai_service.api_requests{*}.as_rate() > 0.1 | Critical | Slack, PagerDuty |
| High Latency | avg(last_15m):avg:azure.openai_service.latency{*} > 1000 | Warning | Slack |
| Rate Limited | sum(last_5m):sum:azure.openai_service.rate_limited_requests{*} > 0 | Critical | PagerDuty |
| High Token Usage | sum(last_1h):sum:azure.openai_service.tokens_used{*}.as_count() > 1000000 | Warning | Slack |
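
The alert conditions above are Datadog monitor queries, so each one could also be created through the Datadog monitors API instead of the UI. The following is a minimal sketch under stated assumptions: the helper names `monitor_payload` and `create_monitor`, the `DD_API_KEY`, `DD_APP_KEY`, and `DD_SITE` variables, and the notification handle in the usage note are all illustrative and do not come from the setup script.

```shell
# Build the JSON body for Datadog's "create a monitor" endpoint.
# Arguments: monitor name, query, notification message.
# Assumption: the arguments contain no double quotes or backslashes,
# because this sketch performs no JSON escaping.
monitor_payload() {
  printf '{"name":"%s","type":"query alert","query":"%s","message":"%s"}\n' "$1" "$2" "$3"
}

# Submit the payload. DD_API_KEY and DD_APP_KEY must be set by the caller;
# DD_SITE defaults to datadoghq.com and should match your Datadog site.
create_monitor() {
  curl -sS -X POST "https://api.${DD_SITE:-datadoghq.com}/api/v1/monitor" \
    -H "Content-Type: application/json" \
    -H "DD-API-KEY: ${DD_API_KEY}" \
    -H "DD-APPLICATION-KEY: ${DD_APP_KEY}" \
    -d "$(monitor_payload "$1" "$2" "$3")"
}
```

For example, `create_monitor "High Latency" 'avg(last_15m):avg:azure.openai_service.latency{*} > 1000' "@slack-ai-ops"` would submit the High Latency condition, where `@slack-ai-ops` stands in for whatever notification handle your Slack integration defines.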

Custom Metrics

| Metric Name | Type | Description | Tags |
|---|---|---|---|
| azure.openai.embedding.latency | Gauge | Latency of embedding generation | deployment, model |
| azure.openai.embedding.tokens | Count | Tokens used for embeddings | deployment, model |
| azure.openai.rate_limit.remaining | Gauge | Remaining rate limit | deployment |
| azure.openai.vector_search.duration | Histogram | Vector search duration | index, dimensions |
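
Custom metrics such as these are typically submitted by the application to the Agent's DogStatsD listener. How this project actually emits them is not shown here, so the sketch below only illustrates the DogStatsD datagram format (`name:value|type|#tags`) from the shell. The helper names are hypothetical, and it assumes the Agent's DogStatsD port is reachable from where the command runs and that `nc` is installed.

```shell
# Format one DogStatsD datagram: <name>:<value>|<type>|#<tag>,<tag>
# Arguments: metric name, value, type code, comma-separated tags.
# Type codes: g = gauge, c = count, h = histogram.
dogstatsd_line() {
  printf '%s:%s|%s|#%s' "$1" "$2" "$3" "$4"
}

# Send one datagram to the Agent over UDP.
# DD_AGENT_HOST and DD_DOGSTATSD_PORT default to a local Agent on port 8125.
send_metric() {
  dogstatsd_line "$1" "$2" "$3" "$4" |
    nc -u -w1 "${DD_AGENT_HOST:-127.0.0.1}" "${DD_DOGSTATSD_PORT:-8125}"
}
```

As an illustration, `send_metric azure.openai.embedding.latency 182 g 'deployment:prod,model:text-embedding-ada-002'` would report a gauge sample carrying the two tags listed for that metric; the value and tag values are made-up examples.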

Troubleshooting

  1. Missing Metrics

    • Verify the Azure AD application has the Monitoring Reader role on the target resource group
    • Check the Datadog Agent logs for connection errors
    • Ensure resource tags match in Azure and Datadog
  2. Authentication Failures

    Terminal window
    # Check Datadog agent logs
    kubectl logs -l app=datadog-agent -n datadog | grep -i error
    # Verify Azure AD application credentials
    az ad app show --id <app-id>
  3. High Latency

    • Check Azure OpenAI service health
    • Review Datadog network metrics
    • Verify region alignment between client and service

Best Practices

  1. Tagging Strategy

    • Use consistent tags across resources
    • Include environment, team, and service information
    • Add model version and deployment details
  2. Alert Thresholds

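    • Base thresholds on your workload's observed baseline rather than on fixed defaults
    • Route critical alerts to more than one channel (for example, Slack and PagerDuty)
    • Limit alert fatigue by paging only on critical alerts and using evaluation windows long enough to absorb brief spikes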
  3. Cost Control

    • Set up budget alerts in Azure
    • Monitor token usage patterns
    • Implement rate limiting in your application
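
To relate token usage to spend, a token count can be converted into an approximate cost. The sketch below assumes billing per 1,000 tokens at a single flat price; the function name and the price in the usage note are placeholders, so substitute the actual rates for your model and region.

```shell
# Estimate cost from a token count and a price per 1,000 tokens.
# Arguments: token count, price per 1,000 tokens (placeholder value).
# Prints the estimate rounded to two decimal places.
estimate_cost() {
  awk -v tokens="$1" -v price_per_1k="$2" \
    'BEGIN { printf "%.2f\n", tokens / 1000 * price_per_1k }'
}
```

At the 1,000,000-token threshold used by the High Token Usage alert and a hypothetical price of 0.002 per 1,000 tokens, `estimate_cost 1000000 0.002` prints `2.00`.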