Conversation

@amitgalitz (Member)

Description

This PR introduces CreateAnomalyDetectorToolEnhanced, a tool that leverages LLM capabilities to automatically generate, validate, and create anomaly detectors for a given set of indices.

Implementation Overview

The tool implements a step-by-step processing flow that handles batch operations on indices while maintaining robust error handling and validation mechanisms.

Code Flow

  1. Extract & validate indices →
  2. Get index mappings →
  3. LLM generates detector config →
  4. Validate detector structure →
  5. Optimize parameters (suggest API) →
  6. Validate model requirements →
  7. Create & start detector

The implementation uses LLM-based configuration generation and retries at different stages against the validate and suggest APIs to arrive at an optimal configuration that allows the detector to initialize reliably.

The tool processes indices sequentially to prevent LLM throttling and implements retry logic with LLM-assisted error correction. Each index undergoes mapping analysis, LLM configuration generation, detector structure validation, suggest-API-based adjustments, and model validation before final deployment.
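
As an illustration of this flow, here is a minimal, self-contained sketch of the sequential per-index loop with a bounded retry budget. The class name and helper methods (generateConfigWithLLM, fixConfigWithLLM, validateDetector, createAndStart) are hypothetical placeholders, not the actual methods in this PR:

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EnhancedDetectorFlowSketch {
    private static final int MAX_RETRIES = 3; // assumed retry budget, not the PR's constant

    public Map<String, String> run(List<String> indices) {
        Map<String, String> results = new LinkedHashMap<>();
        // Indices are processed sequentially to avoid LLM throttling.
        for (String index : indices) {
            String config = generateConfigWithLLM(index);
            int attempt = 0;
            // Retry with LLM-assisted correction while validation keeps failing.
            while (!validateDetector(config) && attempt++ < MAX_RETRIES) {
                config = fixConfigWithLLM(config);
            }
            results.put(index, validateDetector(config) ? createAndStart(config) : "failed_validation");
        }
        return results;
    }

    // Stubs standing in for the real mapping, LLM, validate/suggest, and create calls.
    private String generateConfigWithLLM(String index) { return "{}"; }
    private String fixConfigWithLLM(String config) { return config; }
    private boolean validateDetector(String config) { return true; }
    private String createAndStart(String config) { return "started"; }
}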

Usage Example

Input:

{
  "indices": ["ecommerce-data", "server-logs"]
}

Output:

{
  "ecommerce-data": {
    "status": "success",
    "detectorId": "abc123",
    "detectorName": "ecommerce-data-detector-xyz",
    "createResponse": "Detector created successfully",
    "startResponse": "Detector started successfully"
  },
  "server-logs": {
    "status": "failed_validation",
    "error": "Insufficient data for model training"
  }
}

Processing Flow

The implementation follows a step-by-step flow for each index: extraction and validation, mapping retrieval, LLM configuration generation, detector structure validation, parameter optimization via the suggest API, model requirement validation, and finally detector creation and activation. The tool maintains state consistency across batch operations and writes the per-index outcome to a detector result object, which is returned as a JSON string so that an LLM can consume it as a tool result.
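
For illustration, a minimal sketch of how the per-index results could be collected and serialized into the string handed back to the LLM. It assumes Gson (already a declared dependency); the DetectorResult shape shown here is hypothetical and may differ from the class in this PR:

import com.google.gson.Gson;
import java.util.LinkedHashMap;
import java.util.Map;

class ResultSerializationSketch {
    // Hypothetical result shape for this sketch only.
    record DetectorResult(String status, String detectorId, String error) {}

    String toToolOutput() {
        Map<String, DetectorResult> resultsByIndex = new LinkedHashMap<>();
        resultsByIndex.put("ecommerce-data", new DetectorResult("success", "abc123", null));
        resultsByIndex.put("server-logs", new DetectorResult("failed_validation", null, "Insufficient data for model training"));
        // The whole map becomes one JSON string so the agent can pass it to the LLM as a tool result.
        return new Gson().toJson(resultsByIndex);
    }
}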

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • New functionality includes testing.
  • Commits are signed per the DCO using --signoff.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Comment on lines +58 to +60
if (index.startsWith(".")) {
throw new IllegalArgumentException("System indices not supported: " + index);
}

Member:
I think we have better detection of System indices than the leading dot. Example usage.
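
For example, something along these lines, where isSystemIndex stands in for whatever helper the linked example exposes (the exact API is an assumption, not confirmed here):

// Hypothetical sketch: delegate to a proper system-index check instead of the leading-dot heuristic.
if (systemIndexChecker.isSystemIndex(index)) {
    throw new IllegalArgumentException("System indices not supported: " + index);
}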

Comment on lines +90 to +97
Set<String> result = new HashSet<>();
for (Map.Entry<String, String> entry : fieldsToType.entrySet()) {
String value = entry.getValue();
if (value.equals("date") || value.equals("date_nanos")) {
result.add(entry.getKey());
}
}
return result;

Member:
My knee-jerk reaction on seeing a loop that can be streamified:

Suggested change
-Set<String> result = new HashSet<>();
-for (Map.Entry<String, String> entry : fieldsToType.entrySet()) {
-    String value = entry.getValue();
-    if (value.equals("date") || value.equals("date_nanos")) {
-        result.add(entry.getKey());
-    }
-}
-return result;
+Set<String> dateTypes = Set.of("date", "date_nanos");
+return fieldsToType.entrySet().stream()
+    .filter(e -> dateTypes.contains(e.getValue()))
+    .map(Entry::getKey)
+    .collect(Collectors.toSet());

But since you're just moving an existing method, it's probably OK to leave as is.

Comment on lines +324 to +350
if (dataAsMap.containsKey("response")) {
    return (String) dataAsMap.get("response");
} else if (dataAsMap.containsKey("output")) {
    // Parse Bedrock format: output.message.content[0].text
    try {
        Map<String, Object> output = (Map<String, Object>) dataAsMap.get("output");
        Map<String, Object> message = (Map<String, Object>) output.get("message");
        List<Map<String, Object>> content = (List<Map<String, Object>>) message.get("content");
        return (String) content.getFirst().get("text");
    } catch (Exception e) {
        log.error("Failed to parse Bedrock response format", e);
        return null;
    }
} else if (dataAsMap.containsKey("content")) {
    // Parse Claude format: content field as array
    try {
        List<Map<String, Object>> content = (List<Map<String, Object>>) dataAsMap.get("content");
        return (String) content.getFirst().get("text");
    } catch (Exception e) {
        log.error("Failed to parse Claude content format", e);
        return null;
    }
} else {
    log.error("Unknown response format. Available keys: {}", dataAsMap.keySet());
    return null;
}
}

Member:
This hard-coded parsing and relying on exception handling seems like it could be improved. At least add a comment showing the expected format, so if this breaks in the future, the next maintainer (or future-you) can quickly identify the format change without walking through the code.
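
For example, a short comment block like the following; the shapes shown are the commonly documented Bedrock Converse and Anthropic Messages response layouts and should be double-checked against the actual connector output:

// Expected response shapes handled below:
//   Plain:   { "response": "..." }
//   Bedrock: { "output": { "message": { "content": [ { "text": "..." } ] } } }
//   Claude:  { "content": [ { "type": "text", "text": "..." } ] }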

Comment on lines +493 to +494
ModelTensors tensors = output.getMlModelOutputs().get(0);
ModelTensor tensor = tensors.getMlModelTensors().get(0);

Member:
No exception handling for empty outputs/tensors.
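
A guard along these lines would avoid an IndexOutOfBoundsException (sketch only; listener refers to whichever ActionListener is in scope at this point):

if (output.getMlModelOutputs().isEmpty() || output.getMlModelOutputs().get(0).getMlModelTensors().isEmpty()) {
    listener.onFailure(new IllegalStateException("LLM returned no model output tensors"));
    return;
}
ModelTensors tensors = output.getMlModelOutputs().get(0);
ModelTensor tensor = tensors.getMlModelTensors().get(0);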

String fixPrompt = createFixPrompt(originalSuggestions, validationError);
String fullPrompt = contextBuilder.toString() + fixPrompt;

callLLM(fullPrompt, tenantId, ActionListener.wrap(fixedResponse -> {

Member:
Lots of duplicated code between retryDetectorValidation() and retryModelValidation, consider a helper method for the common parts.
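
For instance, the shared prompt-building and LLM call could live in a helper like this (method name and signature are hypothetical):

private void retryWithLLMFix(String context, String originalSuggestions, String validationError,
        String tenantId, ActionListener<String> listener) {
    // Common to both retry paths: build the fix prompt from the shared context and call the LLM once.
    String fullPrompt = context + createFixPrompt(originalSuggestions, validationError);
    callLLM(fullPrompt, tenantId, listener);
}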

Comment on lines +135 to +136
private static final String EXTRACT_INFORMATION_REGEX =
"(?s).*\\{category_field=([^|]*)\\|aggregation_field=([^|]*)\\|aggregation_method=([^|]*)\\|interval=([^}]*)}.*";

Member:
This seems to be a really strict regex that would fail on things like whitespace or the keys being in a different order (and cause retries, which lose the benefit of the optimal date field).

Given we're just extracting values between known strings, isn't there a better way: iterate between the | characters, split on =, and trim the key/value pairs into a map?
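
e.g., a tolerant parser roughly like this (sketch; assumes body holds the text between the braces):

Map<String, String> parsed = new HashMap<>();
for (String pair : body.split("\\|")) {
    String[] kv = pair.split("=", 2);
    if (kv.length == 2) {
        // Order- and whitespace-insensitive: just trim each key/value.
        parsed.put(kv[0].trim(), kv[1].trim());
    }
}
String categoryField = parsed.getOrDefault("category_field", "");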

Comment on lines +700 to +708
} else {
log.error("Max detector validation retries reached: {}", errorMessage);
listener
.onFailure(
new RuntimeException(
"Detector validation failed after " + MAX_DETECTOR_VALIDATION_RETRIES + " retries: " + errorMessage
)
);
}

Member:
This conditional (currentRetry >= MAX_DETECTOR_VALIDATION_RETRIES) should be a fast-fail at the top of the validateDetectorPhase method; otherwise you spend a lot of time validating a detector that, even if validation succeeds, you then can't use.
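
i.e., something like this at the top of validateDetectorPhase (sketch; variable names follow the snippet above):

// Fast-fail before doing any expensive validation work.
if (currentRetry >= MAX_DETECTOR_VALIDATION_RETRIES) {
    listener.onFailure(new RuntimeException(
        "Detector validation failed after " + MAX_DETECTOR_VALIDATION_RETRIES + " retries: " + errorMessage));
    return;
}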


if (response.getIssue() != null) {
String issueAspect = response.getIssue().getAspect().toString();
boolean isBlockingError = issueAspect != null && issueAspect.toLowerCase(Locale.ROOT).startsWith("detector");

Member:
Is starting with "detector" always blocking?

Is there some helper method somewhere which defines blocking errors that you can use?

Comment on lines +899 to +903
respondWithError(listener, detector.getIndices().get(0), "validation", errorMessage);
} else {
DetectorResult result = DetectorResult
.failedValidation(detector.getIndices().get(0), "Non-blocking warning: " + errorMessage);
listener.onResponse((T) result.toJson());

Member:
If detector.getIndices().get(0) throws an exception the listener will never be completed. Probably need some exception handling here.
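
For example (sketch):

String indexName;
try {
    indexName = detector.getIndices().get(0);
} catch (Exception e) {
    // Make sure the listener is always completed, even if the detector has no indices.
    listener.onFailure(e);
    return;
}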

Comment on lines +1074 to +1077
&& (validationError.contains("240")
|| validationError.contains("480")
|| validationError.contains("960")
|| validationError.contains("1440"));

Member:
These seem like magic numbers. Looks like minutes in a day?

At a minimum needs a comment.

Perhaps move the strings into a constant:

// Time Series values from IntervalCalculation
private static final Set<String> KNOWN_HIGH_INTERVAL_STRINGS = 
    IntStream.of(4, 8, 12, 24)
        .map(hours -> hours * 60)
        .mapToObj(String::valueOf)
        .collect(Collectors.toUnmodifiableSet());

And then

boolean isUnreasonableInterval = validationError.contains("interval")
    && KNOWN_HIGH_INTERVAL_STRINGS.stream()
        .anyMatch(validationError::contains);

Alternately just parse the interval from the string and compare it >= 240.

Member:
Also, highlighting the fact that magic numbers are generally bad, the INTERVAL_LADDER I linked to includes 720 and not 960. :)

@kaituo (Collaborator) left a comment:
partial review

private void getMappingsAndFilterFields(String indexName, ActionListener<MappingContext> listener) {
GetMappingsRequest getMappingsRequest = new GetMappingsRequest().indices(indexName);

client.admin().indices().getMappings(getMappingsRequest, ActionListener.wrap(response -> {

Collaborator:
Do you support remote indices?

@@ -0,0 +1,4 @@
{
"CLAUDE": "\n\nHuman: Analyze this index and suggest ONE anomaly detector for operational monitoring that generates clear, actionable alerts.\n\nIndex: ${indexInfo.indexName}\nMapping: ${indexInfo.indexMapping}\nAvailable date fields: ${dateFields}\n\nCORE PRINCIPLE:\nCreate detectors that will find anomalies on operational issues teams can immediately investigate and resolve.\nFocus on 1-2 RELATED fields that will be used for defining our features which are the important KPIs that we are looking for anomalies in.\n\nSTEP 1 - IDENTIFY MONITORING PRIORITY:\nOperational impact priority:\n\n1. Service Reliability: error_count, failed_requests, 5xx_status, exceptions, timeout_count, failures\n2. Performance Issues: response_time, latency, processing_time, duration, delay_minutes\n3. Resource Problems: cpu_usage, memory_usage, disk_usage, connection_count\n4. Traffic/Capacity: request_count, bytes_transferred, active_connections, queue_size, throughput\n5. Security Events: blocked_requests, authentication_failures, suspicious_activity\n6. Business Impact: revenue, conversion_rate, transaction_amount, operational_failures\n\nSTEP 2 - FEATURE SELECTION STRATEGY:\nDEFAULT TO 1 FEATURE unless multiple features provide complementary evidence of the SAME operational issue.\n\n✓ EXCELLENT 2-FEATURE COMBINATIONS (investigated together):\n\n* [error_count, timeout_count] → \"Service degradation: errors up 300%, timeouts up 150%\"\n* [response_time, error_rate] → \"Performance issue: response time 2x higher, error rate spiked\"\n* [cpu_usage, memory_usage] → \"Resource exhaustion: CPU at 90%, memory at 85%\"\n* [failed_requests, retry_count] → \"Service instability: failures up 400%, retries up 250%\"\n* [bytes_sent, bytes_received] → \"Network anomaly: traffic pattern changed significantly\"\n* [blocked_requests, failed_auth] → \"Security event: attack pattern detected\"\n* [request_count, error_count] → \"Service stress: high load with increasing failures\"\n\n✓ GOOD SINGLE FEATURES (clear, actionable alerts):\n\n* [error_count] → \"Error spike: 400% increase in errors\"\n* [response_time] → \"Performance issue: response time 3x normal\"\n* [bytes_transferred] → \"Traffic anomaly: data transfer volume unusual\"\n* [cpu_usage] → \"Resource issue: CPU utilization spiked\"\n\nSTEP 3 - AGGREGATION METHOD RULES:\nCRITICAL: OpenSearch Anomaly Detection supports ONLY these 5 aggregation methods:\n\n* avg() - Average value of numeric fields\n* sum() - Sum total of numeric fields\n* min() - Minimum value of numeric fields\n* max() - Maximum value of numeric fields\n* count() - Count of documents (works on any field type)\n\nField Type Constraints:\n\n* Numeric fields (long, integer, double, float): Can use avg, sum, min, max, count\n* Keyword fields: Can ONLY use count ('count' ONLY when meaningful - avoid mixed good/bad values)\n* NEVER use sum/avg/min/max on keyword fields - Will cause errors\n\nOperational Logic Rules:\n\n* Times/Durations/Delays: ALWAYS 'avg' (NEVER 'sum' - summing time is meaningless)\n* Errors/Failures/Counts: 'sum' for totals, 'avg' for rates/percentages\n* Bytes/Size fields: 'sum' for total volume (bytes, object_size, response.bytes)\n* Memory/Resource fields: 'avg' for percentages, 'max' for absolute values\n* Business metrics: 'avg' for per-transaction values, 'sum' for revenue totals\n* Keyword fields for traffic: 'count' (counts specific when there is an error or something specific like bad status codes, not just all status codes)\n\nWhen to AVOID 'count' on keyword fields:\n\n* 
Mixed success/error fields: status_code.keyword (contains 200, 404, 500 - becomes total traffic, not error detection vs error_code.keyword which would be count of docs with error)\n* High-cardinality descriptive: user_agent.keyword, message.keyword (not operational KPIs)\n* Static/descriptive fields: version.keyword, environment.keyword (don't change operationally)\n\nField Pattern Recognition:\n\n* *_count, *_errors, *_failures, *_requests → 'sum' (if numeric) OR 'count' (if keyword)\n* *_bytes, *_size, object_size → 'sum' (if numeric)\n* status_code, method, protocol → 'count' (if keyword)\n\nSTEP 4 - CATEGORY FIELD SELECTION:\nIMPORTANT: Category fields are OPTIONAL. If no field provides actionable segmentation, leave empty.\n\nCRITICAL CONSTRAINT: Category field MUST be a keyword or ip field type from the mapping above.\nCheck the field type - ONLY use fields marked as \"keyword\" or \"ip\".\nIf no keyword/ip fields exist, leave category_field empty.\n\nChoose ONE keyword field that provides actionable segmentation for operations teams:\n\n✓ EXCELLENT choices (actionable alerts):\n\n* service_name, endpoint, api_path → \"Error spike on /checkout endpoint\"\n* host, instance_id, server → \"CPU spike on web-server-01\"\n* region, datacenter, availability_zone → \"Network issues in us-west-2\"\n* status_code, error_type → \"500 errors spiking\"\n* method, protocol → \"POST requests failing\"\n\n✓ GOOD choices (moderate cardinality):\n\n* device_type, browser, user_agent for web analytics\n* database_name, table_name for DB monitoring\n* queue_name, topic for messaging systems\n* payment_method, transaction_type for financial monitoring\n\n✗ AVOID (too specific or not actionable for general monitoring):\n\n* Unique identifiers: transaction_id, session_id, request_id\n* High-cardinality user data: user_id, customer_id\n* Timestamp fields, hash fields, UUIDs\n* Fields ending in _key, _hash, _uuid\n\nCARDINALITY GUIDELINES:\n\n* Ideal: 5-50 unique values (most operational use cases)\n* Acceptable: 50-500 values (if actionable segmentation)\n* No category field: Perfectly fine if no field provides actionable insights\n\nSTEP 5 - DETECTION INTERVAL GUIDELINES:\nMatch interval to data frequency and operational needs:\n\n* Real-time systems: 10-15 minutes (APIs, web services, errors, response times)\n* Infrastructure monitoring: 15-30 minutes (servers, databases, resource usage)\n* Business processes: 30-60 minutes (transactions, conversions)\n* Security logs: 10-30 minutes (access logs, firewalls, authentication)\n* Batch/ETL processes: 60+ minutes (data pipeline monitoring)\n* Sparse data: 60+ minutes (avoid false positives from empty buckets)\n\nSTEP 6 - SPARSE DATA INTELLIGENCE:\nCRITICAL: If validation suggests intervals >120 minutes due to sparse data:\n\n* PREFERRED SOLUTION: Remove category field entirely (better no segmentation than 4+ hour intervals)\n* ALTERNATIVE: Choose lower-cardinality category field\n* NEVER ACCEPT: If intervals are over 24 hours then we should aim for lower interval with different categorical fields.\n* PRINCIPLE: Operational usefulness over perfect segmentation\n\nOUTPUT FORMAT:\nReturn ONLY this structured format, no explanation:\n{category_field=field_name_or_empty|aggregation_field=field1,field2|aggregation_method=method1,method2|interval=minutes}\n\nIf no suitable fields exist, use empty strings for aggregation_field and aggregation_method.\n\nAssistant:\"",

Collaborator:
operational monitoring
the same operational issue

It may not be just operational; for example, UBI data is for business/marketing.

Collaborator:
Mixed success/error fields

What if users want to monitor total traffic instead of just errors?

Collaborator:
e.g., user_agent.keyword, message.keyword

What if I want to monitor the message volume and the number of agents running?

Collaborator:
e.g., version.keyword, environment.keyword

We are not sure whether count is good or not until we know the meaning of environment.

Collaborator:
OpenSearch Anomaly Detection supports ONLY these 5 aggregation methods:

We can add the cardinality aggregation, which is the unique count.

Collaborator:
aggregation_field=field1,field2|aggregation_method=method1,method2

Should you be upfront about what happens for one feature (e.g., should it produce aggregation_method=method1, with a trailing comma, or just aggregation_method=method1)?

Collaborator:
category_field=field_name_or_empty

Should we tell the LLM what happens when there is no categorical field?

Collaborator:
interval=minutes

Should we tell the LLM what happens when no suitable interval is found?

Collaborator:
If validation suggests intervals >120 minutes

Where do you put the validation-suggested interval in the prompt? I only saw it in the error message of the fix prompt, not in the initial default prompt.

Collaborator:
can immediately investigate and resolve

Can you define the criteria by which the team can investigate and resolve?

}
contextBuilder.append("Available date fields: ").append(String.join(", ", mappingContext.dateFields)).append("\n\n");

String fixPrompt = createFixPrompt(originalSuggestions, validationError);

Collaborator:
Does the agent have memory of what you have asked before? If so, where is the previous conversation stored, and how does the agent decide which conversation to give to the LLM?

@@ -0,0 +1,13 @@
{
"name": "Test_create_anomaly_detector_enhanced_flow_agent",
"type": "flow",

Collaborator:
The agent should be equipped with an index search tool as well, right? If so, can you point me to the code?

@@ -0,0 +1,4 @@
{
"CLAUDE": "\n\nHuman: Analyze this index and suggest ONE anomaly detector for operational monitoring that generates clear, actionable alerts.\n\nIndex: ${indexInfo.indexName}\nMapping: ${indexInfo.indexMapping}\nAvailable date fields: ${dateFields}\n\nCORE PRINCIPLE:\nCreate detectors that will find anomalies on operational issues teams can immediately investigate and resolve.\nFocus on 1-2 RELATED fields that will be used for defining our features which are the important KPIs that we are looking for anomalies in.\n\nSTEP 1 - IDENTIFY MONITORING PRIORITY:\nOperational impact priority:\n\n1. Service Reliability: error_count, failed_requests, 5xx_status, exceptions, timeout_count, failures\n2. Performance Issues: response_time, latency, processing_time, duration, delay_minutes\n3. Resource Problems: cpu_usage, memory_usage, disk_usage, connection_count\n4. Traffic/Capacity: request_count, bytes_transferred, active_connections, queue_size, throughput\n5. Security Events: blocked_requests, authentication_failures, suspicious_activity\n6. Business Impact: revenue, conversion_rate, transaction_amount, operational_failures\n\nSTEP 2 - FEATURE SELECTION STRATEGY:\nDEFAULT TO 1 FEATURE unless multiple features provide complementary evidence of the SAME operational issue.\n\n✓ EXCELLENT 2-FEATURE COMBINATIONS (investigated together):\n\n* [error_count, timeout_count] → \"Service degradation: errors up 300%, timeouts up 150%\"\n* [response_time, error_rate] → \"Performance issue: response time 2x higher, error rate spiked\"\n* [cpu_usage, memory_usage] → \"Resource exhaustion: CPU at 90%, memory at 85%\"\n* [failed_requests, retry_count] → \"Service instability: failures up 400%, retries up 250%\"\n* [bytes_sent, bytes_received] → \"Network anomaly: traffic pattern changed significantly\"\n* [blocked_requests, failed_auth] → \"Security event: attack pattern detected\"\n* [request_count, error_count] → \"Service stress: high load with increasing failures\"\n\n✓ GOOD SINGLE FEATURES (clear, actionable alerts):\n\n* [error_count] → \"Error spike: 400% increase in errors\"\n* [response_time] → \"Performance issue: response time 3x normal\"\n* [bytes_transferred] → \"Traffic anomaly: data transfer volume unusual\"\n* [cpu_usage] → \"Resource issue: CPU utilization spiked\"\n\nSTEP 3 - AGGREGATION METHOD RULES:\nCRITICAL: OpenSearch Anomaly Detection supports ONLY these 5 aggregation methods:\n\n* avg() - Average value of numeric fields\n* sum() - Sum total of numeric fields\n* min() - Minimum value of numeric fields\n* max() - Maximum value of numeric fields\n* count() - Count of documents (works on any field type)\n\nField Type Constraints:\n\n* Numeric fields (long, integer, double, float): Can use avg, sum, min, max, count\n* Keyword fields: Can ONLY use count ('count' ONLY when meaningful - avoid mixed good/bad values)\n* NEVER use sum/avg/min/max on keyword fields - Will cause errors\n\nOperational Logic Rules:\n\n* Times/Durations/Delays: ALWAYS 'avg' (NEVER 'sum' - summing time is meaningless)\n* Errors/Failures/Counts: 'sum' for totals, 'avg' for rates/percentages\n* Bytes/Size fields: 'sum' for total volume (bytes, object_size, response.bytes)\n* Memory/Resource fields: 'avg' for percentages, 'max' for absolute values\n* Business metrics: 'avg' for per-transaction values, 'sum' for revenue totals\n* Keyword fields for traffic: 'count' (counts specific when there is an error or something specific like bad status codes, not just all status codes)\n\nWhen to AVOID 'count' on keyword fields:\n\n* 
Mixed success/error fields: status_code.keyword (contains 200, 404, 500 - becomes total traffic, not error detection vs error_code.keyword which would be count of docs with error)\n* High-cardinality descriptive: user_agent.keyword, message.keyword (not operational KPIs)\n* Static/descriptive fields: version.keyword, environment.keyword (don't change operationally)\n\nField Pattern Recognition:\n\n* *_count, *_errors, *_failures, *_requests → 'sum' (if numeric) OR 'count' (if keyword)\n* *_bytes, *_size, object_size → 'sum' (if numeric)\n* status_code, method, protocol → 'count' (if keyword)\n\nSTEP 4 - CATEGORY FIELD SELECTION:\nIMPORTANT: Category fields are OPTIONAL. If no field provides actionable segmentation, leave empty.\n\nCRITICAL CONSTRAINT: Category field MUST be a keyword or ip field type from the mapping above.\nCheck the field type - ONLY use fields marked as \"keyword\" or \"ip\".\nIf no keyword/ip fields exist, leave category_field empty.\n\nChoose ONE keyword field that provides actionable segmentation for operations teams:\n\n✓ EXCELLENT choices (actionable alerts):\n\n* service_name, endpoint, api_path → \"Error spike on /checkout endpoint\"\n* host, instance_id, server → \"CPU spike on web-server-01\"\n* region, datacenter, availability_zone → \"Network issues in us-west-2\"\n* status_code, error_type → \"500 errors spiking\"\n* method, protocol → \"POST requests failing\"\n\n✓ GOOD choices (moderate cardinality):\n\n* device_type, browser, user_agent for web analytics\n* database_name, table_name for DB monitoring\n* queue_name, topic for messaging systems\n* payment_method, transaction_type for financial monitoring\n\n✗ AVOID (too specific or not actionable for general monitoring):\n\n* Unique identifiers: transaction_id, session_id, request_id\n* High-cardinality user data: user_id, customer_id\n* Timestamp fields, hash fields, UUIDs\n* Fields ending in _key, _hash, _uuid\n\nCARDINALITY GUIDELINES:\n\n* Ideal: 5-50 unique values (most operational use cases)\n* Acceptable: 50-500 values (if actionable segmentation)\n* No category field: Perfectly fine if no field provides actionable insights\n\nSTEP 5 - DETECTION INTERVAL GUIDELINES:\nMatch interval to data frequency and operational needs:\n\n* Real-time systems: 10-15 minutes (APIs, web services, errors, response times)\n* Infrastructure monitoring: 15-30 minutes (servers, databases, resource usage)\n* Business processes: 30-60 minutes (transactions, conversions)\n* Security logs: 10-30 minutes (access logs, firewalls, authentication)\n* Batch/ETL processes: 60+ minutes (data pipeline monitoring)\n* Sparse data: 60+ minutes (avoid false positives from empty buckets)\n\nSTEP 6 - SPARSE DATA INTELLIGENCE:\nCRITICAL: If validation suggests intervals >120 minutes due to sparse data:\n\n* PREFERRED SOLUTION: Remove category field entirely (better no segmentation than 4+ hour intervals)\n* ALTERNATIVE: Choose lower-cardinality category field\n* NEVER ACCEPT: If intervals are over 24 hours then we should aim for lower interval with different categorical fields.\n* PRINCIPLE: Operational usefulness over perfect segmentation\n\nOUTPUT FORMAT:\nReturn ONLY this structured format, no explanation:\n{category_field=field_name_or_empty|aggregation_field=field1,field2|aggregation_method=method1,method2|interval=minutes}\n\nIf no suitable fields exist, use empty strings for aggregation_field and aggregation_method.\n\nAssistant:\"",
"OPENAI": "Analyze this index and suggest ONE anomaly detector for operational monitoring that generates clear, actionable alerts.\n\nIndex: ${indexInfo.indexName}\nMapping: ${indexInfo.indexMapping}\nAvailable date fields: ${dateFields}\n\nCORE PRINCIPLE:\nCreate detectors that will find anomalies on operational issues teams can immediately investigate and resolve.\nFocus on 1-2 RELATED fields that will be used for defining our features which are the important KPIs that we are looking for anomalies in.\n\nSTEP 1 - IDENTIFY MONITORING PRIORITY:\nOperational impact priority:\n\n1. Service Reliability: error_count, failed_requests, 5xx_status, exceptions, timeout_count, failures\n2. Performance Issues: response_time, latency, processing_time, duration, delay_minutes\n3. Resource Problems: cpu_usage, memory_usage, disk_usage, connection_count\n4. Traffic/Capacity: request_count, bytes_transferred, active_connections, queue_size, throughput\n5. Security Events: blocked_requests, authentication_failures, suspicious_activity\n6. Business Impact: revenue, conversion_rate, transaction_amount, operational_failures\n\nSTEP 2 - FEATURE SELECTION STRATEGY:\nDEFAULT TO 1 FEATURE unless multiple features provide complementary evidence of the SAME operational issue.\n\n✓ EXCELLENT 2-FEATURE COMBINATIONS (investigated together):\n\n* [error_count, timeout_count] → \"Service degradation: errors up 300%, timeouts up 150%\"\n* [response_time, error_rate] → \"Performance issue: response time 2x higher, error rate spiked\"\n* [cpu_usage, memory_usage] → \"Resource exhaustion: CPU at 90%, memory at 85%\"\n* [failed_requests, retry_count] → \"Service instability: failures up 400%, retries up 250%\"\n* [bytes_sent, bytes_received] → \"Network anomaly: traffic pattern changed significantly\"\n* [blocked_requests, failed_auth] → \"Security event: attack pattern detected\"\n* [request_count, error_count] → \"Service stress: high load with increasing failures\"\n\n✓ GOOD SINGLE FEATURES (clear, actionable alerts):\n\n* [error_count] → \"Error spike: 400% increase in errors\"\n* [response_time] → \"Performance issue: response time 3x normal\"\n* [bytes_transferred] → \"Traffic anomaly: data transfer volume unusual\"\n* [cpu_usage] → \"Resource issue: CPU utilization spiked\"\n\nSTEP 3 - AGGREGATION METHOD RULES:\nCRITICAL: OpenSearch Anomaly Detection supports ONLY these 5 aggregation methods:\n\n* avg() - Average value of numeric fields\n* sum() - Sum total of numeric fields\n* min() - Minimum value of numeric fields\n* max() - Maximum value of numeric fields\n* count() - Count of documents (works on any field type)\n\nField Type Constraints:\n\n* Numeric fields (long, integer, double, float): Can use avg, sum, min, max, count\n* Keyword fields: Can ONLY use count ('count' ONLY when meaningful - avoid mixed good/bad values)\n* NEVER use sum/avg/min/max on keyword fields - Will cause errors\n\nOperational Logic Rules:\n\n* Times/Durations/Delays: ALWAYS 'avg' (NEVER 'sum' - summing time is meaningless)\n* Errors/Failures/Counts: 'sum' for totals, 'avg' for rates/percentages\n* Bytes/Size fields: 'sum' for total volume (bytes, object_size, response.bytes)\n* Memory/Resource fields: 'avg' for percentages, 'max' for absolute values\n* Business metrics: 'avg' for per-transaction values, 'sum' for revenue totals\n* Keyword fields for traffic: 'count' (counts specific when there is an error or something specific like bad status codes, not just all status codes)\n\nWhen to AVOID 'count' on keyword fields:\n\n* Mixed 
success/error fields: status_code.keyword (contains 200, 404, 500 - becomes total traffic, not error detection vs error_code.keyword which would be count of docs with error)\n* High-cardinality descriptive: user_agent.keyword, message.keyword (not operational KPIs)\n* Static/descriptive fields: version.keyword, environment.keyword (don't change operationally)\n\nField Pattern Recognition:\n\n* *_count, *_errors, *_failures, *_requests → 'sum' (if numeric) OR 'count' (if keyword)\n* *_bytes, *_size, object_size → 'sum' (if numeric)\n* status_code, method, protocol → 'count' (if keyword)\n\nSTEP 4 - CATEGORY FIELD SELECTION:\nIMPORTANT: Category fields are OPTIONAL. If no field provides actionable segmentation, leave empty.\n\nCRITICAL CONSTRAINT: Category field MUST be a keyword or ip field type from the mapping above.\nCheck the field type - ONLY use fields marked as \"keyword\" or \"ip\".\nIf no keyword/ip fields exist, leave category_field empty.\n\nChoose ONE keyword field that provides actionable segmentation for operations teams:\n\n✓ EXCELLENT choices (actionable alerts):\n\n* service_name, endpoint, api_path → \"Error spike on /checkout endpoint\"\n* host, instance_id, server → \"CPU spike on web-server-01\"\n* region, datacenter, availability_zone → \"Network issues in us-west-2\"\n* status_code, error_type → \"500 errors spiking\"\n* method, protocol → \"POST requests failing\"\n\n✓ GOOD choices (moderate cardinality):\n\n* device_type, browser, user_agent for web analytics\n* database_name, table_name for DB monitoring\n* queue_name, topic for messaging systems\n* payment_method, transaction_type for financial monitoring\n\n✗ AVOID (too specific or not actionable for general monitoring):\n\n* Unique identifiers: transaction_id, session_id, request_id\n* High-cardinality user data: user_id, customer_id\n* Timestamp fields, hash fields, UUIDs\n* Fields ending in _key, _hash, _uuid\n\nCARDINALITY GUIDELINES:\n\n* Ideal: 5-50 unique values (most operational use cases)\n* Acceptable: 50-500 values (if actionable segmentation)\n* No category field: Perfectly fine if no field provides actionable insights\n\nSTEP 5 - DETECTION INTERVAL GUIDELINES:\nMatch interval to data frequency and operational needs:\n\n* Real-time systems: 10-15 minutes (APIs, web services, errors, response times)\n* Infrastructure monitoring: 15-30 minutes (servers, databases, resource usage)\n* Business processes: 30-60 minutes (transactions, conversions)\n* Security logs: 10-30 minutes (access logs, firewalls, authentication)\n* Batch/ETL processes: 60+ minutes (data pipeline monitoring)\n* Sparse data: 60+ minutes (avoid false positives from empty buckets)\n\nSTEP 6 - SPARSE DATA INTELLIGENCE:\nCRITICAL: If validation suggests intervals >120 minutes due to sparse data:\n\n* PREFERRED SOLUTION: Remove category field entirely (better no segmentation than 4+ hour intervals)\n* ALTERNATIVE: Choose lower-cardinality category field\n* NEVER ACCEPT: If intervals are over 24 hours then we should aim for lower interval with different categorical fields.\n* PRINCIPLE: Operational usefulness over perfect segmentation\n\nOUTPUT FORMAT:\nReturn ONLY this structured format, no explanation:\n{category_field=field_name_or_empty|aggregation_field=field1,field2|aggregation_method=method1,method2|interval=minutes}\n\nIf no suitable fields exist, use empty strings for aggregation_field and aggregation_method"

Collaborator:
It is hard to maintain two prompts with almost identical content. Can you try to generate the Claude one programmatically by adding the extra information?
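
For example, the Claude variant could be derived from the shared base text at load time (sketch; BASE_PROMPT is a placeholder for the OPENAI prompt string):

// Wrap the shared base prompt with the Claude-specific Human/Assistant framing.
String claudePrompt = "\n\nHuman: " + BASE_PROMPT + "\n\nAssistant:";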

}

Map<String, String> indexInfo = ImmutableMap
.of("indexName", indexName, "indexMapping", tableInfoJoiner.toString(), "dateFields", dateFields);

Collaborator:
Users may have huge index mappings. Do you have a max limit?
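
e.g., a simple cap before building the prompt (sketch; the limit is an assumed value, not something defined in this PR):

private static final int MAX_MAPPING_CHARS = 20_000; // assumed cap on mapping text sent to the LLM

String mappingForPrompt = tableInfoJoiner.toString();
if (mappingForPrompt.length() > MAX_MAPPING_CHARS) {
    // Truncate very large mappings so the prompt stays within model limits.
    mappingForPrompt = mappingForPrompt.substring(0, MAX_MAPPING_CHARS) + "... [truncated]";
}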

}

@SuppressWarnings("unchecked")
**/

Collaborator:
@SuppressWarnings should be outside of the comment block. This may cause a CI compile failure.
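
i.e., close the Javadoc first and then put the annotation on the declaration (sketch):

 */
@SuppressWarnings("unchecked")
// ... method declaration follows here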

dependencies {
// 3P dependencies
compileOnly group: 'com.google.code.gson', name: 'gson', version: '2.10.1'
implementation group: 'org.owasp.encoder' , name: 'encoder', version: '1.3.1'

Collaborator:
Where do you use the OWASP encoder?

Comment on lines +800 to +802
if (response.getInterval() != null) {
newInterval = response.getInterval();
}

Collaborator:
Would this override the LLM's interval? If so, is this on purpose? Why do we need the LLM's interval suggestion?

TimeConfiguration newWindowDelay = originalDetector.getWindowDelay();
Integer newHistoryIntervals = originalDetector.getHistoryIntervals();
if (response.getInterval() != null) {
newInterval = response.getInterval();

Collaborator:
Would this override the LLM's interval? If so, is this on purpose? Why do we need the LLM's interval suggestion?
