Conversation

@amitgalitz (Member)

Description

This PR introduces CreateAnomalyDetectorToolEnhanced, a tool that leverages LLM capabilities to automatically generate, validate, and create anomaly detectors for a given set of indices.

Implementation Overview

The tool implements a step-by-step processing flow that handles batch operations on indices while maintaining robust error handling and validation mechanisms.

Code Flow

  1. Extract & validate indices →
  2. Get index mappings →
  3. LLM generates detector config →
  4. Validate detector structure →
  5. Optimize parameters (suggest API) →
  6. Validate model requirements →
  7. Create & start detector

The implementation uses LLM-based configuration generation and retries at different stages against the validate and suggest APIs to arrive at an optimal configuration that allows the detector to initialize reliably.

The tool processes indices sequentially to prevent LLM throttling and implements retry logic with LLM-assisted error correction. Each index undergoes mapping analysis, LLM configuration generation, detector structure validation, suggest-API-based adjustments, and model validation before final deployment.
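
As an illustration of this flow, here is a minimal, self-contained sketch of the sequential per-index loop with a bounded retry budget. The class name and helper methods (generateConfigWithLLM, fixConfigWithLLM, validateDetector, createAndStart) are hypothetical placeholders, not the actual methods in this PR:

import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EnhancedDetectorFlowSketch {
    private static final int MAX_RETRIES = 3; // assumed retry budget, not the PR's constant

    public Map<String, String> run(List<String> indices) {
        Map<String, String> results = new LinkedHashMap<>();
        // Indices are processed sequentially to avoid LLM throttling.
        for (String index : indices) {
            String config = generateConfigWithLLM(index);
            int attempt = 0;
            // Retry with LLM-assisted correction while validation keeps failing.
            while (!validateDetector(config) && attempt++ < MAX_RETRIES) {
                config = fixConfigWithLLM(config);
            }
            results.put(index, validateDetector(config) ? createAndStart(config) : "failed_validation");
        }
        return results;
    }

    // Stubs standing in for the real mapping, LLM, validate/suggest, and create calls.
    private String generateConfigWithLLM(String index) { return "{}"; }
    private String fixConfigWithLLM(String config) { return config; }
    private boolean validateDetector(String config) { return true; }
    private String createAndStart(String config) { return "started"; }
}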

Usage Example

Input:

{
  "indices": ["ecommerce-data", "server-logs"]
}

Output:

{
  "ecommerce-data": {
    "status": "success",
    "detectorId": "abc123",
    "detectorName": "ecommerce-data-detector-xyz",
    "createResponse": "Detector created successfully",
    "startResponse": "Detector started successfully"
  },
  "server-logs": {
    "status": "failed_validation",
    "error": "Insufficient data for model training"
  }
}

Processing Flow

The implementation follows a step-by-step flow for each index: extraction and validation, mapping retrieval, LLM configuration generation, detector structure validation, parameter optimization via the suggest API, model requirement validation, and finally detector creation and activation. The tool maintains state consistency across batch operations and writes the per-index outcome to a detector result object, which is returned as a JSON string so that an LLM can consume it as a tool result.
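
For illustration, a minimal sketch of how the per-index results could be collected and serialized into the string handed back to the LLM. It assumes Gson (already a declared dependency); the DetectorResult shape shown here is hypothetical and may differ from the class in this PR:

import com.google.gson.Gson;
import java.util.LinkedHashMap;
import java.util.Map;

class ResultSerializationSketch {
    // Hypothetical result shape for this sketch only.
    record DetectorResult(String status, String detectorId, String error) {}

    String toToolOutput() {
        Map<String, DetectorResult> resultsByIndex = new LinkedHashMap<>();
        resultsByIndex.put("ecommerce-data", new DetectorResult("success", "abc123", null));
        resultsByIndex.put("server-logs", new DetectorResult("failed_validation", null, "Insufficient data for model training"));
        // The whole map becomes one JSON string so the agent can pass it to the LLM as a tool result.
        return new Gson().toJson(resultsByIndex);
    }
}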

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • New functionality includes testing.
  • Commits are signed per the DCO using --signoff.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Comment on lines +58 to +60
if (index.startsWith(".")) {
throw new IllegalArgumentException("System indices not supported: " + index);
}

Member:
I think we have better detection of System indices than the leading dot. Example usage.
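
For example, something along these lines, where isSystemIndex stands in for whatever helper the linked example exposes (the exact API is an assumption, not confirmed here):

// Hypothetical sketch: delegate to a proper system-index check instead of the leading-dot heuristic.
if (systemIndexChecker.isSystemIndex(index)) {
    throw new IllegalArgumentException("System indices not supported: " + index);
}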

Comment on lines +90 to +97
Set<String> result = new HashSet<>();
for (Map.Entry<String, String> entry : fieldsToType.entrySet()) {
String value = entry.getValue();
if (value.equals("date") || value.equals("date_nanos")) {
result.add(entry.getKey());
}
}
return result;

Member:
My knee-jerk reaction on seeing a loop that can be streamified:

Suggested change
-Set<String> result = new HashSet<>();
-for (Map.Entry<String, String> entry : fieldsToType.entrySet()) {
-    String value = entry.getValue();
-    if (value.equals("date") || value.equals("date_nanos")) {
-        result.add(entry.getKey());
-    }
-}
-return result;
+Set<String> dateTypes = Set.of("date", "date_nanos");
+return fieldsToType.entrySet().stream()
+    .filter(e -> dateTypes.contains(e.getValue()))
+    .map(Entry::getKey)
+    .collect(Collectors.toSet());

But since you're just moving an existing method, it's probably OK to leave as is.

Comment on lines +324 to +350
if (dataAsMap.containsKey("response")) {
    return (String) dataAsMap.get("response");
} else if (dataAsMap.containsKey("output")) {
    // Parse Bedrock format: output.message.content[0].text
    try {
        Map<String, Object> output = (Map<String, Object>) dataAsMap.get("output");
        Map<String, Object> message = (Map<String, Object>) output.get("message");
        List<Map<String, Object>> content = (List<Map<String, Object>>) message.get("content");
        return (String) content.getFirst().get("text");
    } catch (Exception e) {
        log.error("Failed to parse Bedrock response format", e);
        return null;
    }
} else if (dataAsMap.containsKey("content")) {
    // Parse Claude format: content field as array
    try {
        List<Map<String, Object>> content = (List<Map<String, Object>>) dataAsMap.get("content");
        return (String) content.getFirst().get("text");
    } catch (Exception e) {
        log.error("Failed to parse Claude content format", e);
        return null;
    }
} else {
    log.error("Unknown response format. Available keys: {}", dataAsMap.keySet());
    return null;
}
}

Member:
This hard-coded parsing and relying on exception handling seems like it could be improved. At least add a comment showing the expected format, so if this breaks in the future, the next maintainer (or future-you) can quickly identify the format change without walking through the code.
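
For example, a short comment block like the following; the shapes shown are the commonly documented Bedrock Converse and Anthropic Messages response layouts and should be double-checked against the actual connector output:

// Expected response shapes handled below:
//   Plain:   { "response": "..." }
//   Bedrock: { "output": { "message": { "content": [ { "text": "..." } ] } } }
//   Claude:  { "content": [ { "type": "text", "text": "..." } ] }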

Comment on lines +493 to +494
ModelTensors tensors = output.getMlModelOutputs().get(0);
ModelTensor tensor = tensors.getMlModelTensors().get(0);

Member:
No exception handling for empty outputs/tensors.
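
A guard along these lines would avoid an IndexOutOfBoundsException (sketch only; listener refers to whichever ActionListener is in scope at this point):

if (output.getMlModelOutputs().isEmpty() || output.getMlModelOutputs().get(0).getMlModelTensors().isEmpty()) {
    listener.onFailure(new IllegalStateException("LLM returned no model output tensors"));
    return;
}
ModelTensors tensors = output.getMlModelOutputs().get(0);
ModelTensor tensor = tensors.getMlModelTensors().get(0);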

String fixPrompt = createFixPrompt(originalSuggestions, validationError);
String fullPrompt = contextBuilder.toString() + fixPrompt;

callLLM(fullPrompt, tenantId, ActionListener.wrap(fixedResponse -> {

Member:
Lots of duplicated code between retryDetectorValidation() and retryModelValidation, consider a helper method for the common parts.
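
For instance, the shared prompt-building and LLM call could live in a helper like this (method name and signature are hypothetical):

private void retryWithLLMFix(String context, String originalSuggestions, String validationError,
        String tenantId, ActionListener<String> listener) {
    // Common to both retry paths: build the fix prompt from the shared context and call the LLM once.
    String fullPrompt = context + createFixPrompt(originalSuggestions, validationError);
    callLLM(fullPrompt, tenantId, listener);
}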

Comment on lines +135 to +136
private static final String EXTRACT_INFORMATION_REGEX =
"(?s).*\\{category_field=([^|]*)\\|aggregation_field=([^|]*)\\|aggregation_method=([^|]*)\\|interval=([^}]*)}.*";

Member:
This seems to be a really strict regex that would fail on things like whitespace or the keys being in a different order (and cause retries, which lose the benefit of the optimal date field).

Given we're just extracting values between known strings, isn't there a better way: iterate between the | characters, split on =, and trim the key/value pairs into a map?
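
e.g., a tolerant parser roughly like this (sketch; assumes body holds the text between the braces):

Map<String, String> parsed = new HashMap<>();
for (String pair : body.split("\\|")) {
    String[] kv = pair.split("=", 2);
    if (kv.length == 2) {
        // Order- and whitespace-insensitive: just trim each key/value.
        parsed.put(kv[0].trim(), kv[1].trim());
    }
}
String categoryField = parsed.getOrDefault("category_field", "");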

Comment on lines +700 to +708
} else {
log.error("Max detector validation retries reached: {}", errorMessage);
listener
.onFailure(
new RuntimeException(
"Detector validation failed after " + MAX_DETECTOR_VALIDATION_RETRIES + " retries: " + errorMessage
)
);
}

Member:
This conditional (currentRetry >= MAX_DETECTOR_VALIDATION_RETRIES) should be a fast-fail at the top of the validateDetectorPhase method; otherwise you spend a lot of time validating a detector that, even if validation succeeds, you then can't use.
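
i.e., something like this at the top of validateDetectorPhase (sketch; variable names follow the snippet above):

// Fast-fail before doing any expensive validation work.
if (currentRetry >= MAX_DETECTOR_VALIDATION_RETRIES) {
    listener.onFailure(new RuntimeException(
        "Detector validation failed after " + MAX_DETECTOR_VALIDATION_RETRIES + " retries: " + errorMessage));
    return;
}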


if (response.getIssue() != null) {
String issueAspect = response.getIssue().getAspect().toString();
boolean isBlockingError = issueAspect != null && issueAspect.toLowerCase(Locale.ROOT).startsWith("detector");

Member:
Is starting with "detector" always blocking?

Is there some helper method somewhere which defines blocking errors that you can use?

Comment on lines +899 to +903
respondWithError(listener, detector.getIndices().get(0), "validation", errorMessage);
} else {
DetectorResult result = DetectorResult
.failedValidation(detector.getIndices().get(0), "Non-blocking warning: " + errorMessage);
listener.onResponse((T) result.toJson());

Member:
If detector.getIndices().get(0) throws an exception the listener will never be completed. Probably need some exception handling here.
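
For example (sketch):

String indexName;
try {
    indexName = detector.getIndices().get(0);
} catch (Exception e) {
    // Make sure the listener is always completed, even if the detector has no indices.
    listener.onFailure(e);
    return;
}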

Comment on lines +1074 to +1077
&& (validationError.contains("240")
|| validationError.contains("480")
|| validationError.contains("960")
|| validationError.contains("1440"));

Member:
These seem like magic numbers. Looks like minutes in a day?

At a minimum needs a comment.

Perhaps move the strings into a constant:

// Time Series values from IntervalCalculation
private static final Set<String> KNOWN_HIGH_INTERVAL_STRINGS = 
    IntStream.of(4, 8, 12, 24)
        .map(hours -> hours * 60)
        .mapToObj(String::valueOf)
        .collect(Collectors.toUnmodifiableSet());

And then

boolean isUnreasonableInterval = validationError.contains("interval")
    && KNOWN_HIGH_INTERVAL_STRINGS.stream()
        .anyMatch(validationError::contains);

Alternately just parse the interval from the string and compare it >= 240.

Member:
Also, highlighting the fact that magic numbers are generally bad, the INTERVAL_LADDER I linked to includes 720 and not 960. :)

@kaituo (Collaborator) left a comment:
partial review

private void getMappingsAndFilterFields(String indexName, ActionListener<MappingContext> listener) {
GetMappingsRequest getMappingsRequest = new GetMappingsRequest().indices(indexName);

client.admin().indices().getMappings(getMappingsRequest, ActionListener.wrap(response -> {

Collaborator:
Do you support remote indices?

@@ -0,0 +1,4 @@
{
"CLAUDE": "\n\nHuman: Analyze this index and suggest ONE anomaly detector for operational monitoring that generates clear, actionable alerts.\n\nIndex: ${indexInfo.indexName}\nMapping: ${indexInfo.indexMapping}\nAvailable date fields: ${dateFields}\n\nCORE PRINCIPLE:\nCreate detectors that will find anomalies on operational issues teams can immediately investigate and resolve.\nFocus on 1-2 RELATED fields that will be used for defining our features which are the important KPIs that we are looking for anomalies in.\n\nSTEP 1 - IDENTIFY MONITORING PRIORITY:\nOperational impact priority:\n\n1. Service Reliability: error_count, failed_requests, 5xx_status, exceptions, timeout_count, failures\n2. Performance Issues: response_time, latency, processing_time, duration, delay_minutes\n3. Resource Problems: cpu_usage, memory_usage, disk_usage, connection_count\n4. Traffic/Capacity: request_count, bytes_transferred, active_connections, queue_size, throughput\n5. Security Events: blocked_requests, authentication_failures, suspicious_activity\n6. Business Impact: revenue, conversion_rate, transaction_amount, operational_failures\n\nSTEP 2 - FEATURE SELECTION STRATEGY:\nDEFAULT TO 1 FEATURE unless multiple features provide complementary evidence of the SAME operational issue.\n\n✓ EXCELLENT 2-FEATURE COMBINATIONS (investigated together):\n\n* [error_count, timeout_count] → \"Service degradation: errors up 300%, timeouts up 150%\"\n* [response_time, error_rate] → \"Performance issue: response time 2x higher, error rate spiked\"\n* [cpu_usage, memory_usage] → \"Resource exhaustion: CPU at 90%, memory at 85%\"\n* [failed_requests, retry_count] → \"Service instability: failures up 400%, retries up 250%\"\n* [bytes_sent, bytes_received] → \"Network anomaly: traffic pattern changed significantly\"\n* [blocked_requests, failed_auth] → \"Security event: attack pattern detected\"\n* [request_count, error_count] → \"Service stress: high load with increasing failures\"\n\n✓ GOOD SINGLE FEATURES (clear, actionable alerts):\n\n* [error_count] → \"Error spike: 400% increase in errors\"\n* [response_time] → \"Performance issue: response time 3x normal\"\n* [bytes_transferred] → \"Traffic anomaly: data transfer volume unusual\"\n* [cpu_usage] → \"Resource issue: CPU utilization spiked\"\n\nSTEP 3 - AGGREGATION METHOD RULES:\nCRITICAL: OpenSearch Anomaly Detection supports ONLY these 5 aggregation methods:\n\n* avg() - Average value of numeric fields\n* sum() - Sum total of numeric fields\n* min() - Minimum value of numeric fields\n* max() - Maximum value of numeric fields\n* count() - Count of documents (works on any field type)\n\nField Type Constraints:\n\n* Numeric fields (long, integer, double, float): Can use avg, sum, min, max, count\n* Keyword fields: Can ONLY use count ('count' ONLY when meaningful - avoid mixed good/bad values)\n* NEVER use sum/avg/min/max on keyword fields - Will cause errors\n\nOperational Logic Rules:\n\n* Times/Durations/Delays: ALWAYS 'avg' (NEVER 'sum' - summing time is meaningless)\n* Errors/Failures/Counts: 'sum' for totals, 'avg' for rates/percentages\n* Bytes/Size fields: 'sum' for total volume (bytes, object_size, response.bytes)\n* Memory/Resource fields: 'avg' for percentages, 'max' for absolute values\n* Business metrics: 'avg' for per-transaction values, 'sum' for revenue totals\n* Keyword fields for traffic: 'count' (counts specific when there is an error or something specific like bad status codes, not just all status codes)\n\nWhen to AVOID 'count' on keyword fields:\n\n* 
Mixed success/error fields: status_code.keyword (contains 200, 404, 500 - becomes total traffic, not error detection vs error_code.keyword which would be count of docs with error)\n* High-cardinality descriptive: user_agent.keyword, message.keyword (not operational KPIs)\n* Static/descriptive fields: version.keyword, environment.keyword (don't change operationally)\n\nField Pattern Recognition:\n\n* *_count, *_errors, *_failures, *_requests → 'sum' (if numeric) OR 'count' (if keyword)\n* *_bytes, *_size, object_size → 'sum' (if numeric)\n* status_code, method, protocol → 'count' (if keyword)\n\nSTEP 4 - CATEGORY FIELD SELECTION:\nIMPORTANT: Category fields are OPTIONAL. If no field provides actionable segmentation, leave empty.\n\nCRITICAL CONSTRAINT: Category field MUST be a keyword or ip field type from the mapping above.\nCheck the field type - ONLY use fields marked as \"keyword\" or \"ip\".\nIf no keyword/ip fields exist, leave category_field empty.\n\nChoose ONE keyword field that provides actionable segmentation for operations teams:\n\n✓ EXCELLENT choices (actionable alerts):\n\n* service_name, endpoint, api_path → \"Error spike on /checkout endpoint\"\n* host, instance_id, server → \"CPU spike on web-server-01\"\n* region, datacenter, availability_zone → \"Network issues in us-west-2\"\n* status_code, error_type → \"500 errors spiking\"\n* method, protocol → \"POST requests failing\"\n\n✓ GOOD choices (moderate cardinality):\n\n* device_type, browser, user_agent for web analytics\n* database_name, table_name for DB monitoring\n* queue_name, topic for messaging systems\n* payment_method, transaction_type for financial monitoring\n\n✗ AVOID (too specific or not actionable for general monitoring):\n\n* Unique identifiers: transaction_id, session_id, request_id\n* High-cardinality user data: user_id, customer_id\n* Timestamp fields, hash fields, UUIDs\n* Fields ending in _key, _hash, _uuid\n\nCARDINALITY GUIDELINES:\n\n* Ideal: 5-50 unique values (most operational use cases)\n* Acceptable: 50-500 values (if actionable segmentation)\n* No category field: Perfectly fine if no field provides actionable insights\n\nSTEP 5 - DETECTION INTERVAL GUIDELINES:\nMatch interval to data frequency and operational needs:\n\n* Real-time systems: 10-15 minutes (APIs, web services, errors, response times)\n* Infrastructure monitoring: 15-30 minutes (servers, databases, resource usage)\n* Business processes: 30-60 minutes (transactions, conversions)\n* Security logs: 10-30 minutes (access logs, firewalls, authentication)\n* Batch/ETL processes: 60+ minutes (data pipeline monitoring)\n* Sparse data: 60+ minutes (avoid false positives from empty buckets)\n\nSTEP 6 - SPARSE DATA INTELLIGENCE:\nCRITICAL: If validation suggests intervals >120 minutes due to sparse data:\n\n* PREFERRED SOLUTION: Remove category field entirely (better no segmentation than 4+ hour intervals)\n* ALTERNATIVE: Choose lower-cardinality category field\n* NEVER ACCEPT: If intervals are over 24 hours then we should aim for lower interval with different categorical fields.\n* PRINCIPLE: Operational usefulness over perfect segmentation\n\nOUTPUT FORMAT:\nReturn ONLY this structured format, no explanation:\n{category_field=field_name_or_empty|aggregation_field=field1,field2|aggregation_method=method1,method2|interval=minutes}\n\nIf no suitable fields exist, use empty strings for aggregation_field and aggregation_method.\n\nAssistant:\"",

Collaborator:
operational monitoring
the same operational issue

It may not be just operational; for example, UBI data is for business/marketing.

Collaborator:
Mixed success/error fields

What if users want to monitor total traffic instead of just errors?

Collaborator:
e.g., user_agent.keyword, message.keyword

What if I want to monitor the message volume and the number of agents running?

Collaborator:
e.g., version.keyword, environment.keyword

We are not sure whether count is good or not until we know the meaning of environment.

Collaborator:
OpenSearch Anomaly Detection supports ONLY these 5 aggregation methods:

We can add the cardinality aggregation, which is the unique count.

Collaborator:
aggregation_field=field1,field2|aggregation_method=method1,method2

Should you be upfront about what happens for one feature (e.g., should it produce aggregation_method=method1, with a trailing comma, or just aggregation_method=method1)?

Collaborator:
category_field=field_name_or_empty

Should we tell the LLM what happens when there is no categorical field?

Collaborator:
interval=minutes

Should we tell the LLM what happens when no suitable interval is found?

Collaborator:
If validation suggests intervals >120 minutes

Where do you put the validation-suggested interval in the prompt? I only saw it in the error message of the fix prompt, not in the initial default prompt.

Collaborator:
can immediately investigate and resolve

Can you define the criteria by which the team can investigate and resolve?

}
contextBuilder.append("Available date fields: ").append(String.join(", ", mappingContext.dateFields)).append("\n\n");

String fixPrompt = createFixPrompt(originalSuggestions, validationError);

Collaborator:
Does the agent have memory of what you have asked before? If so, where is the previous conversation stored, and how does the agent decide which conversation to give to the LLM?

@@ -0,0 +1,13 @@
{
"name": "Test_create_anomaly_detector_enhanced_flow_agent",
"type": "flow",

Collaborator:
The agent should be equipped with an index search tool as well, right? If so, can you point me to the code?

@@ -0,0 +1,4 @@
{
"CLAUDE": "\n\nHuman: Analyze this index and suggest ONE anomaly detector for operational monitoring that generates clear, actionable alerts.\n\nIndex: ${indexInfo.indexName}\nMapping: ${indexInfo.indexMapping}\nAvailable date fields: ${dateFields}\n\nCORE PRINCIPLE:\nCreate detectors that will find anomalies on operational issues teams can immediately investigate and resolve.\nFocus on 1-2 RELATED fields that will be used for defining our features which are the important KPIs that we are looking for anomalies in.\n\nSTEP 1 - IDENTIFY MONITORING PRIORITY:\nOperational impact priority:\n\n1. Service Reliability: error_count, failed_requests, 5xx_status, exceptions, timeout_count, failures\n2. Performance Issues: response_time, latency, processing_time, duration, delay_minutes\n3. Resource Problems: cpu_usage, memory_usage, disk_usage, connection_count\n4. Traffic/Capacity: request_count, bytes_transferred, active_connections, queue_size, throughput\n5. Security Events: blocked_requests, authentication_failures, suspicious_activity\n6. Business Impact: revenue, conversion_rate, transaction_amount, operational_failures\n\nSTEP 2 - FEATURE SELECTION STRATEGY:\nDEFAULT TO 1 FEATURE unless multiple features provide complementary evidence of the SAME operational issue.\n\n✓ EXCELLENT 2-FEATURE COMBINATIONS (investigated together):\n\n* [error_count, timeout_count] → \"Service degradation: errors up 300%, timeouts up 150%\"\n* [response_time, error_rate] → \"Performance issue: response time 2x higher, error rate spiked\"\n* [cpu_usage, memory_usage] → \"Resource exhaustion: CPU at 90%, memory at 85%\"\n* [failed_requests, retry_count] → \"Service instability: failures up 400%, retries up 250%\"\n* [bytes_sent, bytes_received] → \"Network anomaly: traffic pattern changed significantly\"\n* [blocked_requests, failed_auth] → \"Security event: attack pattern detected\"\n* [request_count, error_count] → \"Service stress: high load with increasing failures\"\n\n✓ GOOD SINGLE FEATURES (clear, actionable alerts):\n\n* [error_count] → \"Error spike: 400% increase in errors\"\n* [response_time] → \"Performance issue: response time 3x normal\"\n* [bytes_transferred] → \"Traffic anomaly: data transfer volume unusual\"\n* [cpu_usage] → \"Resource issue: CPU utilization spiked\"\n\nSTEP 3 - AGGREGATION METHOD RULES:\nCRITICAL: OpenSearch Anomaly Detection supports ONLY these 5 aggregation methods:\n\n* avg() - Average value of numeric fields\n* sum() - Sum total of numeric fields\n* min() - Minimum value of numeric fields\n* max() - Maximum value of numeric fields\n* count() - Count of documents (works on any field type)\n\nField Type Constraints:\n\n* Numeric fields (long, integer, double, float): Can use avg, sum, min, max, count\n* Keyword fields: Can ONLY use count ('count' ONLY when meaningful - avoid mixed good/bad values)\n* NEVER use sum/avg/min/max on keyword fields - Will cause errors\n\nOperational Logic Rules:\n\n* Times/Durations/Delays: ALWAYS 'avg' (NEVER 'sum' - summing time is meaningless)\n* Errors/Failures/Counts: 'sum' for totals, 'avg' for rates/percentages\n* Bytes/Size fields: 'sum' for total volume (bytes, object_size, response.bytes)\n* Memory/Resource fields: 'avg' for percentages, 'max' for absolute values\n* Business metrics: 'avg' for per-transaction values, 'sum' for revenue totals\n* Keyword fields for traffic: 'count' (counts specific when there is an error or something specific like bad status codes, not just all status codes)\n\nWhen to AVOID 'count' on keyword fields:\n\n* 
Mixed success/error fields: status_code.keyword (contains 200, 404, 500 - becomes total traffic, not error detection vs error_code.keyword which would be count of docs with error)\n* High-cardinality descriptive: user_agent.keyword, message.keyword (not operational KPIs)\n* Static/descriptive fields: version.keyword, environment.keyword (don't change operationally)\n\nField Pattern Recognition:\n\n* *_count, *_errors, *_failures, *_requests → 'sum' (if numeric) OR 'count' (if keyword)\n* *_bytes, *_size, object_size → 'sum' (if numeric)\n* status_code, method, protocol → 'count' (if keyword)\n\nSTEP 4 - CATEGORY FIELD SELECTION:\nIMPORTANT: Category fields are OPTIONAL. If no field provides actionable segmentation, leave empty.\n\nCRITICAL CONSTRAINT: Category field MUST be a keyword or ip field type from the mapping above.\nCheck the field type - ONLY use fields marked as \"keyword\" or \"ip\".\nIf no keyword/ip fields exist, leave category_field empty.\n\nChoose ONE keyword field that provides actionable segmentation for operations teams:\n\n✓ EXCELLENT choices (actionable alerts):\n\n* service_name, endpoint, api_path → \"Error spike on /checkout endpoint\"\n* host, instance_id, server → \"CPU spike on web-server-01\"\n* region, datacenter, availability_zone → \"Network issues in us-west-2\"\n* status_code, error_type → \"500 errors spiking\"\n* method, protocol → \"POST requests failing\"\n\n✓ GOOD choices (moderate cardinality):\n\n* device_type, browser, user_agent for web analytics\n* database_name, table_name for DB monitoring\n* queue_name, topic for messaging systems\n* payment_method, transaction_type for financial monitoring\n\n✗ AVOID (too specific or not actionable for general monitoring):\n\n* Unique identifiers: transaction_id, session_id, request_id\n* High-cardinality user data: user_id, customer_id\n* Timestamp fields, hash fields, UUIDs\n* Fields ending in _key, _hash, _uuid\n\nCARDINALITY GUIDELINES:\n\n* Ideal: 5-50 unique values (most operational use cases)\n* Acceptable: 50-500 values (if actionable segmentation)\n* No category field: Perfectly fine if no field provides actionable insights\n\nSTEP 5 - DETECTION INTERVAL GUIDELINES:\nMatch interval to data frequency and operational needs:\n\n* Real-time systems: 10-15 minutes (APIs, web services, errors, response times)\n* Infrastructure monitoring: 15-30 minutes (servers, databases, resource usage)\n* Business processes: 30-60 minutes (transactions, conversions)\n* Security logs: 10-30 minutes (access logs, firewalls, authentication)\n* Batch/ETL processes: 60+ minutes (data pipeline monitoring)\n* Sparse data: 60+ minutes (avoid false positives from empty buckets)\n\nSTEP 6 - SPARSE DATA INTELLIGENCE:\nCRITICAL: If validation suggests intervals >120 minutes due to sparse data:\n\n* PREFERRED SOLUTION: Remove category field entirely (better no segmentation than 4+ hour intervals)\n* ALTERNATIVE: Choose lower-cardinality category field\n* NEVER ACCEPT: If intervals are over 24 hours then we should aim for lower interval with different categorical fields.\n* PRINCIPLE: Operational usefulness over perfect segmentation\n\nOUTPUT FORMAT:\nReturn ONLY this structured format, no explanation:\n{category_field=field_name_or_empty|aggregation_field=field1,field2|aggregation_method=method1,method2|interval=minutes}\n\nIf no suitable fields exist, use empty strings for aggregation_field and aggregation_method.\n\nAssistant:\"",
"OPENAI": "Analyze this index and suggest ONE anomaly detector for operational monitoring that generates clear, actionable alerts.\n\nIndex: ${indexInfo.indexName}\nMapping: ${indexInfo.indexMapping}\nAvailable date fields: ${dateFields}\n\nCORE PRINCIPLE:\nCreate detectors that will find anomalies on operational issues teams can immediately investigate and resolve.\nFocus on 1-2 RELATED fields that will be used for defining our features which are the important KPIs that we are looking for anomalies in.\n\nSTEP 1 - IDENTIFY MONITORING PRIORITY:\nOperational impact priority:\n\n1. Service Reliability: error_count, failed_requests, 5xx_status, exceptions, timeout_count, failures\n2. Performance Issues: response_time, latency, processing_time, duration, delay_minutes\n3. Resource Problems: cpu_usage, memory_usage, disk_usage, connection_count\n4. Traffic/Capacity: request_count, bytes_transferred, active_connections, queue_size, throughput\n5. Security Events: blocked_requests, authentication_failures, suspicious_activity\n6. Business Impact: revenue, conversion_rate, transaction_amount, operational_failures\n\nSTEP 2 - FEATURE SELECTION STRATEGY:\nDEFAULT TO 1 FEATURE unless multiple features provide complementary evidence of the SAME operational issue.\n\n✓ EXCELLENT 2-FEATURE COMBINATIONS (investigated together):\n\n* [error_count, timeout_count] → \"Service degradation: errors up 300%, timeouts up 150%\"\n* [response_time, error_rate] → \"Performance issue: response time 2x higher, error rate spiked\"\n* [cpu_usage, memory_usage] → \"Resource exhaustion: CPU at 90%, memory at 85%\"\n* [failed_requests, retry_count] → \"Service instability: failures up 400%, retries up 250%\"\n* [bytes_sent, bytes_received] → \"Network anomaly: traffic pattern changed significantly\"\n* [blocked_requests, failed_auth] → \"Security event: attack pattern detected\"\n* [request_count, error_count] → \"Service stress: high load with increasing failures\"\n\n✓ GOOD SINGLE FEATURES (clear, actionable alerts):\n\n* [error_count] → \"Error spike: 400% increase in errors\"\n* [response_time] → \"Performance issue: response time 3x normal\"\n* [bytes_transferred] → \"Traffic anomaly: data transfer volume unusual\"\n* [cpu_usage] → \"Resource issue: CPU utilization spiked\"\n\nSTEP 3 - AGGREGATION METHOD RULES:\nCRITICAL: OpenSearch Anomaly Detection supports ONLY these 5 aggregation methods:\n\n* avg() - Average value of numeric fields\n* sum() - Sum total of numeric fields\n* min() - Minimum value of numeric fields\n* max() - Maximum value of numeric fields\n* count() - Count of documents (works on any field type)\n\nField Type Constraints:\n\n* Numeric fields (long, integer, double, float): Can use avg, sum, min, max, count\n* Keyword fields: Can ONLY use count ('count' ONLY when meaningful - avoid mixed good/bad values)\n* NEVER use sum/avg/min/max on keyword fields - Will cause errors\n\nOperational Logic Rules:\n\n* Times/Durations/Delays: ALWAYS 'avg' (NEVER 'sum' - summing time is meaningless)\n* Errors/Failures/Counts: 'sum' for totals, 'avg' for rates/percentages\n* Bytes/Size fields: 'sum' for total volume (bytes, object_size, response.bytes)\n* Memory/Resource fields: 'avg' for percentages, 'max' for absolute values\n* Business metrics: 'avg' for per-transaction values, 'sum' for revenue totals\n* Keyword fields for traffic: 'count' (counts specific when there is an error or something specific like bad status codes, not just all status codes)\n\nWhen to AVOID 'count' on keyword fields:\n\n* Mixed 
success/error fields: status_code.keyword (contains 200, 404, 500 - becomes total traffic, not error detection vs error_code.keyword which would be count of docs with error)\n* High-cardinality descriptive: user_agent.keyword, message.keyword (not operational KPIs)\n* Static/descriptive fields: version.keyword, environment.keyword (don't change operationally)\n\nField Pattern Recognition:\n\n* *_count, *_errors, *_failures, *_requests → 'sum' (if numeric) OR 'count' (if keyword)\n* *_bytes, *_size, object_size → 'sum' (if numeric)\n* status_code, method, protocol → 'count' (if keyword)\n\nSTEP 4 - CATEGORY FIELD SELECTION:\nIMPORTANT: Category fields are OPTIONAL. If no field provides actionable segmentation, leave empty.\n\nCRITICAL CONSTRAINT: Category field MUST be a keyword or ip field type from the mapping above.\nCheck the field type - ONLY use fields marked as \"keyword\" or \"ip\".\nIf no keyword/ip fields exist, leave category_field empty.\n\nChoose ONE keyword field that provides actionable segmentation for operations teams:\n\n✓ EXCELLENT choices (actionable alerts):\n\n* service_name, endpoint, api_path → \"Error spike on /checkout endpoint\"\n* host, instance_id, server → \"CPU spike on web-server-01\"\n* region, datacenter, availability_zone → \"Network issues in us-west-2\"\n* status_code, error_type → \"500 errors spiking\"\n* method, protocol → \"POST requests failing\"\n\n✓ GOOD choices (moderate cardinality):\n\n* device_type, browser, user_agent for web analytics\n* database_name, table_name for DB monitoring\n* queue_name, topic for messaging systems\n* payment_method, transaction_type for financial monitoring\n\n✗ AVOID (too specific or not actionable for general monitoring):\n\n* Unique identifiers: transaction_id, session_id, request_id\n* High-cardinality user data: user_id, customer_id\n* Timestamp fields, hash fields, UUIDs\n* Fields ending in _key, _hash, _uuid\n\nCARDINALITY GUIDELINES:\n\n* Ideal: 5-50 unique values (most operational use cases)\n* Acceptable: 50-500 values (if actionable segmentation)\n* No category field: Perfectly fine if no field provides actionable insights\n\nSTEP 5 - DETECTION INTERVAL GUIDELINES:\nMatch interval to data frequency and operational needs:\n\n* Real-time systems: 10-15 minutes (APIs, web services, errors, response times)\n* Infrastructure monitoring: 15-30 minutes (servers, databases, resource usage)\n* Business processes: 30-60 minutes (transactions, conversions)\n* Security logs: 10-30 minutes (access logs, firewalls, authentication)\n* Batch/ETL processes: 60+ minutes (data pipeline monitoring)\n* Sparse data: 60+ minutes (avoid false positives from empty buckets)\n\nSTEP 6 - SPARSE DATA INTELLIGENCE:\nCRITICAL: If validation suggests intervals >120 minutes due to sparse data:\n\n* PREFERRED SOLUTION: Remove category field entirely (better no segmentation than 4+ hour intervals)\n* ALTERNATIVE: Choose lower-cardinality category field\n* NEVER ACCEPT: If intervals are over 24 hours then we should aim for lower interval with different categorical fields.\n* PRINCIPLE: Operational usefulness over perfect segmentation\n\nOUTPUT FORMAT:\nReturn ONLY this structured format, no explanation:\n{category_field=field_name_or_empty|aggregation_field=field1,field2|aggregation_method=method1,method2|interval=minutes}\n\nIf no suitable fields exist, use empty strings for aggregation_field and aggregation_method"

Collaborator:
It is hard to maintain two prompts with almost identical content. Can you try to generate the Claude one programmatically by adding the extra information?
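
For example, the Claude variant could be derived from the shared base text at load time (sketch; BASE_PROMPT is a placeholder for the OPENAI prompt string):

// Wrap the shared base prompt with the Claude-specific Human/Assistant framing.
String claudePrompt = "\n\nHuman: " + BASE_PROMPT + "\n\nAssistant:";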

}

Map<String, String> indexInfo = ImmutableMap
.of("indexName", indexName, "indexMapping", tableInfoJoiner.toString(), "dateFields", dateFields);

Collaborator:
Users may have huge index mappings. Do you have a max limit?
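
e.g., a simple cap before building the prompt (sketch; the limit is an assumed value, not something defined in this PR):

private static final int MAX_MAPPING_CHARS = 20_000; // assumed cap on mapping text sent to the LLM

String mappingForPrompt = tableInfoJoiner.toString();
if (mappingForPrompt.length() > MAX_MAPPING_CHARS) {
    // Truncate very large mappings so the prompt stays within model limits.
    mappingForPrompt = mappingForPrompt.substring(0, MAX_MAPPING_CHARS) + "... [truncated]";
}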

}

@SuppressWarnings("unchecked")
**/

Collaborator:
@SuppressWarnings should be outside of the comment block. This may cause a CI compile failure.
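
i.e., close the Javadoc first and then put the annotation on the declaration (sketch):

 */
@SuppressWarnings("unchecked")
// ... method declaration follows here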

dependencies {
// 3P dependencies
compileOnly group: 'com.google.code.gson', name: 'gson', version: '2.10.1'
implementation group: 'org.owasp.encoder' , name: 'encoder', version: '1.3.1'

Collaborator:
Where do you use the OWASP encoder?

Comment on lines +800 to +802
if (response.getInterval() != null) {
newInterval = response.getInterval();
}

Collaborator:
Would this override the LLM's interval? If so, is this on purpose? Why do we need the LLM's interval suggestion?

TimeConfiguration newWindowDelay = originalDetector.getWindowDelay();
Integer newHistoryIntervals = originalDetector.getHistoryIntervals();
if (response.getInterval() != null) {
newInterval = response.getInterval();

Collaborator:
Would this override the LLM's interval? If so, is this on purpose? Why do we need the LLM's interval suggestion?
