# Monitor and Observe

By the end of this guide, you will have a fully instrumented Maverics Orchestrator with OpenTelemetry-based metrics and traces, structured log output, and health check monitoring for key operational events.

Good observability means you know what your Orchestrator is doing before users tell you something is wrong. The Orchestrator exports metrics and traces via OpenTelemetry (OTLP), emits structured logs, and provides health endpoints -- giving you the raw data you need to build dashboards, set up alerts, and debug issues when they arise.

<Note>
  **Console terminology:** In the Maverics Console, Orchestrator instances and
  configuration delivery are managed through **Deployments**. When working directly
  with YAML, configuration is managed as files delivered via the `-config` flag or
  `MAVERICS_CONFIG` environment variable.
</Note>

## Prerequisites

* **A running Maverics Orchestrator** -- If you have not deployed yet, follow the [Deploy to Production guide](/guides/operations/deploy) first.
* **An OpenTelemetry collector** -- The Orchestrator exports telemetry via OTLP. You need an OpenTelemetry Collector (or compatible endpoint like Grafana Alloy, Datadog Agent, or New Relic) to receive metrics and traces.
* **A log aggregation system** (recommended) -- Elasticsearch, Loki, Splunk, or any system that can ingest structured JSON logs.

## Set Up Observability

<Steps>
  <Step title="Configure telemetry">
    The Orchestrator uses OpenTelemetry to export metrics and traces via the OTLP protocol. Metrics are collected through periodic readers that push to your OTLP endpoint at a configured interval. Traces are exported through simple processors, which send each span as it ends; a batch processor is also available for high-volume environments.

    The Orchestrator can export telemetry data including:

    * **Request metrics** -- Total requests, response status codes, and request duration histograms
    * **Authentication metrics** -- Authentication event data (availability varies by Orchestrator version)
    * **Runtime metrics** -- Process-level metrics such as memory and concurrency data
    * **Distributed traces** -- End-to-end request tracing through authentication and authorization flows

    <Note>The specific metrics and trace data available depend on your Orchestrator version and configuration. Consult your Orchestrator's actual OTLP output to confirm which metrics are exported in your deployment.</Note>

    <Tabs>
      <Tab title="Console UI">
        <Info>
          **Console UI documentation is coming soon.** This section will walk you
          through configuring this component using the Maverics Console's visual
          interface, including step-by-step screenshots and field descriptions.
        </Info>

        <Frame caption="Telemetry configuration in Maverics Console">
          <img src="https://mintcdn.com/strataidentity/yo114yy_clZj7p9v/images/placeholder.svg?fit=max&auto=format&n=yo114yy_clZj7p9v&q=85&s=ea8d2ec72a69d5a8c7955d78abba6a30" alt="Telemetry endpoint configuration in Maverics Console showing OTLP endpoint and interval settings" width="800" height="400" data-path="images/placeholder.svg" />
        </Frame>
      </Tab>

      <Tab title="Configuration">
        Configure OTLP exporters for both metrics and traces:

        ```yaml maverics.yaml theme={null}
        telemetry:
          metrics:
            readers:
              - periodic:
                  exporter:
                    otlp:
                      protocol: "http/protobuf"
                      endpoint: "http://otelcol.example.com:4318/v1/metrics"
                      insecure: true
                      timeout: 5000
                  interval: 5000
          traces:
            processors:
              - simple:
                  exporter:
                    otlp:
                      protocol: "http/protobuf"
                      endpoint: "http://otelcol.example.com:4318/v1/traces"
        ```

        | Field                                               | Description                                        |
        | --------------------------------------------------- | -------------------------------------------------- |
        | `metrics.readers[].periodic.exporter.otlp.protocol` | OTLP transport -- use `"http/protobuf"`            |
        | `metrics.readers[].periodic.exporter.otlp.endpoint` | Your OTLP collector's metrics endpoint             |
        | `metrics.readers[].periodic.exporter.otlp.insecure` | Skip TLS verification for the collector connection |
        | `metrics.readers[].periodic.exporter.otlp.timeout`  | Export timeout in milliseconds                     |
        | `metrics.readers[].periodic.interval`               | Collection interval in milliseconds                |
        | `traces.processors[].simple.exporter.otlp.protocol` | OTLP transport for traces                          |
        | `traces.processors[].simple.exporter.otlp.endpoint` | Your OTLP collector's traces endpoint              |

        <Note>
          The OTLP exporter supports additional production options including TLS certificates, gzip compression, custom headers, and aggregation temporality preferences. For traces, a batch processor is available for high-volume environments. See [Telemetry Reference](/reference/orchestrator/telemetry) for the complete field reference.
        </Note>

        <Note>
          If you manage Orchestrator configuration through the Maverics Console, advanced telemetry settings like batch processing, TLS, compression, and custom headers require the [**config override**](/reference/console/config-publishing#override-config) feature. Config override requires enablement for your organization -- contact your Strata account team or [Strata support](https://strataidentity.my.site.com/support/s/) to enable it.
        </Note>
      </Tab>
    </Tabs>

    <Tip>
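      The Orchestrator exports all telemetry via OTLP. If you use Prometheus, configure your OpenTelemetry Collector to receive OTLP and then either expose the metrics for scraping with the `prometheus` exporter or push them with the `prometheusremotewrite` exporter. The scrape configuration later in this guide assumes the `prometheus` exporter.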
    </Tip>
  </Step>

  <Step title="Configure structured logging">
    The Orchestrator can emit structured logs in JSON format (set `logger.jsonOutput: true`; the default is `false`), which makes them easy to parse, search, and aggregate in any log management system. Structured logs include consistent fields like timestamp, log level, request ID, and component name, so you can filter and correlate events across your deployment.

    Production logging best practices:

    * **Log level** -- Use `info` for normal production operation. Switch to `debug` only when actively troubleshooting -- debug logging is verbose and can impact performance.
    * **Output format** -- JSON format (`jsonOutput: true`) is recommended for production. It integrates cleanly with log aggregation systems like Elasticsearch, Loki, and Splunk.
    * **Output destination** -- Stdout is the standard approach for containerized deployments (Docker and Kubernetes capture stdout automatically). For bare-metal deployments, you can configure file-based output with rotation.
    * **Request ID correlation** -- Each request gets a unique ID that appears in every log entry for that request. Use this to trace a single user's authentication flow across log entries.
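
    As a rough illustration of request ID correlation, the following Python sketch filters JSON log lines down to a single request. The field name `request_id` is an assumption made for this example, not a confirmed Orchestrator field; check your own log output for the actual key and pass it as `field`.

    ```python
    import json
    from typing import Iterable, Iterator


    def entries_for_request(
        lines: Iterable[str], request_id: str, field: str = "request_id"
    ) -> Iterator[dict]:
        """Yield parsed log entries whose `field` equals `request_id`.

        `field` defaults to a hypothetical key name; replace it with the key
        your Orchestrator version actually emits.
        """
        for line in lines:
            line = line.strip()
            if not line:
                continue
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip non-JSON lines, e.g. when jsonOutput is false
            if isinstance(entry, dict) and entry.get(field) == request_id:
                yield entry
    ```

    Feeding it a log file opened with `open(path)` returns only the entries for one request, in the order they were written.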

    <Tabs>
      <Tab title="Console UI">
        <Info>
          **Console UI documentation is coming soon.** This section will walk you
          through configuring this component using the Maverics Console's visual
          interface, including step-by-step screenshots and field descriptions.
        </Info>

        <Frame caption="Logging configuration in Maverics Console">
          <img src="https://mintcdn.com/strataidentity/yo114yy_clZj7p9v/images/placeholder.svg?fit=max&auto=format&n=yo114yy_clZj7p9v&q=85&s=ea8d2ec72a69d5a8c7955d78abba6a30" alt="Structured logging settings in Maverics Console showing log level, format, and output options" width="800" height="400" data-path="images/placeholder.svg" />
        </Frame>
      </Tab>

      <Tab title="Configuration">
        Configure the logger for production:

        ```yaml maverics.yaml theme={null}
        logger:
          level: "info"
          jsonOutput: true
          timeFormat: "RFC3339Nano"
          logSessionID: false
          fieldOrdering:
            enabled: true
        ```

        | Field                          | Default         | Description                                             |
        | ------------------------------ | --------------- | ------------------------------------------------------- |
        | `logger.level`                 | `"info"`        | Log verbosity: `"debug"`, `"info"`, `"warn"`, `"error"` |
        | `logger.jsonOutput`            | `false`         | Output logs in JSON format for structured logging       |
        | `logger.timeFormat`            | `"RFC3339Nano"` | Time format string for log timestamps                   |
        | `logger.logSessionID`          | `false`         | Include the session ID in log entries for correlation   |
        | `logger.fieldOrdering.enabled` | `false`         | Order log fields consistently across entries            |

        <Note>
          The `-verbose` CLI flag or `MAVERICS_DEBUG_MODE=true` environment variable overrides `logger.level` to `"debug"` at startup.
        </Note>

        You can also configure HTTP access logging separately:

        ```yaml maverics.yaml theme={null}
        http:
          accessLog:
            disabled: false
            level: "info"
        ```
      </Tab>
    </Tabs>
  </Step>

  <Step title="Set up health checks">
    The Orchestrator exposes a configurable health endpoint that load balancers and orchestration platforms use to verify operational status. The health endpoint returns a JSON response, and a periodic heartbeat logs system metrics.

    <Tabs>
      <Tab title="Console UI">
        <Info>
          **Console UI documentation is coming soon.** This section will walk you
          through configuring this component using the Maverics Console's visual
          interface, including step-by-step screenshots and field descriptions.
        </Info>

        <Frame caption="Health check configuration in Maverics Console">
          <img src="https://mintcdn.com/strataidentity/yo114yy_clZj7p9v/images/placeholder.svg?fit=max&auto=format&n=yo114yy_clZj7p9v&q=85&s=ea8d2ec72a69d5a8c7955d78abba6a30" alt="Health check settings in Maverics Console showing endpoint path and heartbeat interval" width="800" height="400" data-path="images/placeholder.svg" />
        </Frame>
      </Tab>

      <Tab title="Configuration">
        Configure the health endpoint and heartbeat:

        ```yaml maverics.yaml theme={null}
        health:
          location: "/status"
          heartbeat:
            disabled: false
            logLevel: "info"
            interval: "60s"
        ```

        | Field                       | Default     | Description                             |
        | --------------------------- | ----------- | --------------------------------------- |
        | `health.location`           | `"/status"` | HTTP path for the health check endpoint |
        | `health.heartbeat.disabled` | `false`     | Disable periodic heartbeat logging      |
        | `health.heartbeat.logLevel` | `"info"`    | Heartbeat log level                     |
        | `health.heartbeat.interval` | `"60s"`     | Heartbeat interval (duration string)    |

        Verify the health endpoint:

        ```bash theme={null}
        curl -s https://localhost:9443/status | jq .
        ```

        ```json theme={null}
        {
          "status": "up"
        }
        ```
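
        For scripted checks, a minimal Python sketch of the same probe might look like this. It assumes the `/status` path and the `{"status": "up"}` body shown above; the function names, the timeout, and the choice to treat anything other than `"up"` as unhealthy are illustrative, not documented Orchestrator behavior.

        ```python
        import json
        import urllib.request


        def is_healthy(body: str) -> bool:
            """Return True only when the body is a JSON object with "status": "up"."""
            try:
                payload = json.loads(body)
            except json.JSONDecodeError:
                return False
            return isinstance(payload, dict) and payload.get("status") == "up"


        def probe(url: str, timeout: float = 5.0) -> bool:
            """Fetch the health endpoint and evaluate its body.

            Network, TLS, and HTTP errors (including non-2xx responses) count
            as unhealthy.
            """
            try:
                with urllib.request.urlopen(url, timeout=timeout) as resp:
                    return is_healthy(resp.read().decode("utf-8", errors="replace"))
            except OSError:
                return False
        ```

        For example, `probe("https://localhost:9443/status")` returns `True` only when the endpoint is reachable, its certificate is trusted, and it reports `"up"`.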

        The periodic heartbeat log entry includes: `orchestrator_version`, `config_version`, `cpu_count`, `cpu_usage`, `total_memory`, `memory_usage`, and `active_goroutines`.
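
        To pull these heartbeat entries out of a log stream, something like the following Python sketch could work. It relies only on the field names listed above and assumes JSON output is enabled. Identifying a heartbeat by the presence of `active_goroutines` is a heuristic chosen for this example, and the units of the values are not specified here.

        ```python
        import json
        from typing import Iterable, Iterator

        HEARTBEAT_FIELDS = (
            "orchestrator_version",
            "config_version",
            "cpu_count",
            "cpu_usage",
            "total_memory",
            "memory_usage",
            "active_goroutines",
        )


        def heartbeats(lines: Iterable[str]) -> Iterator[dict]:
            """Yield the heartbeat fields from each line that looks like a heartbeat."""
            for line in lines:
                try:
                    entry = json.loads(line)
                except json.JSONDecodeError:
                    continue  # ignore blank or non-JSON lines
                # Heuristic (assumption): only heartbeat entries carry this key.
                if isinstance(entry, dict) and "active_goroutines" in entry:
                    yield {k: entry[k] for k in HEARTBEAT_FIELDS if k in entry}
        ```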

        <Note>
          Every log entry (including heartbeat entries) includes a [`soid`](/introduction/glossary#soid-secure-orchestrator-id) field for identifying and correlating logs by Orchestrator instance. See [Logging — Deployment Correlation](/reference/orchestrator/telemetry/logging#deployment-correlation) for details.
        </Note>
      </Tab>
    </Tabs>
  </Step>

  <Step title="Set up alerting">
    Metrics and logs are useful for investigation, but alerts are what tell you something needs attention right now. Configure alerts for the conditions that indicate real problems -- not just noise.

    Recommended alerts for a production Orchestrator deployment:

    * **High error rate** -- Alert when the 5xx error rate exceeds a threshold (for example, more than 1% of requests returning 500-series errors over a 5-minute window). This catches upstream failures, misconfigurations, and application errors.
    * **Authentication failure spike** -- Alert when authentication failures increase significantly above the baseline. A sudden spike could indicate an IdP outage, expired credentials, or a misconfigured connector.
    * **Health check failure** -- Alert when the health endpoint reports an unhealthy status for more than 2 consecutive checks. This catches connector failures and startup issues.
    * **High latency** -- Alert when the p95 request latency exceeds your SLA threshold. High latency often indicates network issues, slow upstream services, or resource contention.
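
    To make two of these conditions concrete, the Python sketch below expresses the error-rate threshold as a ratio of 5xx responses to all responses, and the health rule as a run of consecutive failed checks. The function names and inputs are hypothetical; in practice your monitoring platform evaluates these conditions.

    ```python
    from typing import Iterable, Mapping


    def error_ratio(status_counts: Mapping[int, int]) -> float:
        """Fraction of requests in a window that returned a 5xx status."""
        total = sum(status_counts.values())
        if total == 0:
            return 0.0
        errors = sum(n for code, n in status_counts.items() if 500 <= code <= 599)
        return errors / total


    def exceeds_consecutive_failures(checks: Iterable[bool], limit: int = 2) -> bool:
        """True once more than `limit` health checks in a row have failed.

        `checks` is an ordered sequence where True means healthy.
        """
        streak = 0
        for healthy in checks:
            streak = 0 if healthy else streak + 1
            if streak > limit:
                return True
        return False
    ```

    With a 1% threshold, a window of 1,000 requests containing 15 responses in the 500 range has a ratio of 0.015 and would trigger the alert, while a single failed health check between two successful ones would not.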

    <Tabs>
      <Tab title="Console UI">
        <Info>
          **Console UI documentation is coming soon.** This section will walk you
          through configuring this component using the Maverics Console's visual
          interface, including step-by-step screenshots and field descriptions.
        </Info>

        <Frame caption="Alerting configuration in Maverics Console">
          <img src="https://mintcdn.com/strataidentity/yo114yy_clZj7p9v/images/placeholder.svg?fit=max&auto=format&n=yo114yy_clZj7p9v&q=85&s=ea8d2ec72a69d5a8c7955d78abba6a30" alt="Alerting rules configuration in Maverics Console showing threshold settings and notification channels" width="800" height="400" data-path="images/placeholder.svg" />
        </Frame>
      </Tab>

      <Tab title="Configuration">
        Alerting is configured in your monitoring platform (Prometheus, Grafana, Datadog, etc.), not in the Orchestrator YAML. The Orchestrator exports the metrics that your alerting rules evaluate.

        **Example Prometheus alert rules** for an Orchestrator deployment:

        ```yaml theme={null}
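        # prometheus-alerts.yaml
        # Metric and label names are illustrative; confirm them against the
        # metrics your collector actually exposes.
        groups:
          - name: maverics
            rules:
              # `up` reflects the scrape target for job "maverics". If Prometheus
              # scrapes the collector, this fires when the collector is unreachable.
              - alert: MavericsHealthCheckFailed
                expr: up{job="maverics"} == 0
                for: 2m
                labels:
                  severity: critical
                annotations:
                  summary: "Maverics metrics scrape target is down"

              # Ratio of 5xx responses to all responses over a 5-minute window.
              - alert: MavericsHighErrorRate
                expr: |
                  sum(rate(http_server_requests_total{status=~"5.."}[5m]))
                    / sum(rate(http_server_requests_total[5m])) > 0.01
                for: 5m
                labels:
                  severity: warning
                annotations:
                  summary: "Orchestrator error rate above 1%"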
        ```

        **Prometheus scrape configuration** for the Orchestrator's OTLP metrics (after your OpenTelemetry Collector translates to Prometheus format):

        ```yaml theme={null}
        # prometheus.yaml scrape config
        scrape_configs:
          - job_name: "maverics"
            scrape_interval: 15s
            static_configs:
              - targets: ["otelcol.example.com:8889"]
        ```
      </Tab>
    </Tabs>

    <Tip>
      Start with a small number of high-signal alerts and add more as you learn your deployment's baseline behavior. Too many alerts lead to alert fatigue, where important signals get lost in the noise.
    </Tip>
  </Step>

  <Step title="Verify observability">
    With telemetry, logging, health checks, and alerting configured, verify that data is flowing correctly through your entire observability pipeline.

    ```bash theme={null}
    # Verify status endpoint
    curl -s https://your-orchestrator-host:9443/status | jq .
    ```

    Walk through a complete verification:

    1. **Check your OTLP collector** -- Confirm the collector is receiving metrics and traces from the Orchestrator
    2. **Check dashboards** -- If you have Grafana dashboards, verify that charts are rendering with real data
    3. **Check logs** -- Trigger a few requests and confirm the structured log entries appear in your log aggregation system with the correct JSON format and fields
    4. **Test an alert** -- If possible, trigger one of your alert conditions (for example, temporarily set a very low threshold) and confirm the alert fires and notifications are delivered
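
    For the log check, a small Python sketch like this one can flag lines that would not parse downstream. It assumes `jsonOutput: true` and uses the `soid` field, which this guide says appears in every log entry; the function name and the reason strings are made up for illustration.

    ```python
    import json
    from typing import Iterable, List, Tuple


    def find_bad_lines(
        lines: Iterable[str], required: Tuple[str, ...] = ("soid",)
    ) -> List[Tuple[int, str]]:
        """Return (line_number, reason) for each line that is not a JSON object
        or is missing one of the `required` keys. Blank lines are ignored."""
        problems = []
        for number, line in enumerate(lines, start=1):
            if not line.strip():
                continue
            try:
                entry = json.loads(line)
            except json.JSONDecodeError:
                problems.append((number, "not valid JSON"))
                continue
            if not isinstance(entry, dict):
                problems.append((number, "not a JSON object"))
                continue
            missing = [key for key in required if key not in entry]
            if missing:
                problems.append((number, "missing " + ", ".join(missing)))
        return problems
    ```

    An empty result means every non-blank line parsed as a JSON object with the required keys.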

    <Check>
      **Success!** Your Orchestrator deployment is fully instrumented. Metrics and
      traces are being exported via OTLP, structured logs are flowing to your
      aggregation system, and alerts are configured for key operational events.
    </Check>
  </Step>
</Steps>

## Troubleshooting

<AccordionGroup>
  <Accordion title="Telemetry not reaching collector">
    Verify that the OTLP endpoint URL in your `telemetry` configuration is correct
    and reachable from the Orchestrator. Check that the collector is running and
    listening on the expected port. If the collector uses TLS, set `insecure: false`
    and ensure the Orchestrator can verify the collector's certificate. Check
    firewall rules and network policies between the Orchestrator and collector.
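
    One way to test reachability from the Orchestrator host is to send an empty OTLP/HTTP export request to the collector yourself. The Python sketch below does that using the JSON encoding of OTLP. It assumes your collector's OTLP HTTP receiver accepts `application/json` (the OpenTelemetry Collector's receiver does; other OTLP-compatible endpoints may not), and the function names are hypothetical.

    ```python
    import json
    import urllib.request


    def build_probe(endpoint: str) -> urllib.request.Request:
        """Build an OTLP/HTTP request carrying an empty metrics export."""
        body = json.dumps({"resourceMetrics": []}).encode("utf-8")
        return urllib.request.Request(
            endpoint,
            data=body,
            headers={"Content-Type": "application/json"},
            method="POST",
        )


    def collector_reachable(endpoint: str, timeout: float = 5.0) -> bool:
        """Return True if the collector answers the probe with a 2xx status."""
        try:
            with urllib.request.urlopen(build_probe(endpoint), timeout=timeout) as resp:
                return 200 <= resp.status < 300
        except OSError:
            # Connection refused, DNS failure, TLS error, or HTTP error status.
            return False
    ```

    Call it with the same URL you configured, for example `collector_reachable("http://otelcol.example.com:4318/v1/metrics")`. A `False` result points to the network path, the port, or the collector's receiver rather than to the Orchestrator configuration.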
  </Accordion>

  <Accordion title="Log format not parsing correctly">
    If your log aggregation system is not parsing the Orchestrator's logs correctly,
    check that you are using a JSON parser (not a regex-based parser for plain text
    logs). The Orchestrator outputs JSON-formatted logs when `jsonOutput: true` is
    set.

    If you see raw JSON strings instead of parsed fields, your log shipper (Filebeat,
    Fluentd, Vector) may need a JSON parsing filter configured. Check the log
    shipper's documentation for JSON input configuration.
  </Accordion>

  <Accordion title="Alerts not triggering">
    If alerts are not firing when expected, check these common causes:

    * **Threshold too high** -- Your alert threshold may be higher than the actual
      metric values. Check the raw metric values in your monitoring platform to set realistic thresholds.
    * **Wrong metric name** -- Verify the exact metric name matches what
      your alert rule references. Metric names are case-sensitive.
    * **Evaluation interval** -- Alert rules are evaluated only at configured intervals,
      and a rule with a `for` duration must stay true for that long before it fires.
      With a 5-minute evaluation interval, the alert cannot fire before the next
      evaluation after the condition is met.
    * **Notification channel** -- The alert rule may be firing but the notification
      is not reaching you. Check your alerting tool's notification configuration
      (email, Slack, PagerDuty).
  </Accordion>
</AccordionGroup>

## Related Pages

<CardGroup cols={2}>
  <Card title="Operations Overview" icon="compass" href="/guides/operations/overview">
    Back to the Operations guides hub
  </Card>

  <Card title="Telemetry" icon="file-code" href="/reference/orchestrator/telemetry">
    Complete configuration reference for logger, telemetry, access logs, and health check settings
  </Card>

  <Card title="Deploy to Production" icon="rocket" href="/guides/operations/deploy">
    Set up your production deployment before configuring monitoring
  </Card>
</CardGroup>
