Skip to content

Conversation

@KartDriver
Copy link
Contributor

This PR addresses several critical bugs that were causing indexing failures.

Problems Solved

1. Race Condition in Index Creation (Critical)

  • Issue: Collections were accessed before indexes were ready, causing "there is no vector index on field: [sparse_vector]" errors
  • Fix: Added waitForIndexReady() method with exponential backoff polling to ensure indexes reach IndexState.Finished before proceeding

2. Hardcoded Embedding Dimension (Critical)

  • Issue: Hardcoded dimension of 128 broke custom embedding models (e.g., mxbai-embed-large uses 1024 dimensions)
  • Fix: Dynamic dimension detection using embeddingProvider.getDimension() with proper handling for custom models

3. Collection Load State Race (High Priority)

  • Issue: Collections were used before reaching LoadStateLoaded, causing operations to fail
  • Fix: Added waitForCollectionLoaded() and enhanced loadCollectionWithRetry() to ensure proper load state

4. Environment Variable Precedence (Medium Priority)

  • Issue: Hybrid search configuration was inconsistent with wrong precedence order
  • Fix: Proper hierarchy: DISABLE_SPARSE_VECTOR > USE_HYBRID_SEARCH > HYBRID_MODE > default (false)
  • Added: Result caching and flexible boolean parsing (true/1/yes/on)

5. Poor Error Handling (Medium Priority)

  • Issue: Transient network errors would abort retry loops, making the system fragile
  • Fix: Custom error classes with proper instanceof checks to distinguish permanent vs transient failures

Technical Changes

Added Methods:

  • waitForIndexReady(): Polls index state with exponential backoff (60s timeout)
  • waitForCollectionLoaded(): Ensures collection reaches LoadStateLoaded
  • parseBoolean(): Flexible boolean parsing helper
  • Custom error classes: IndexCreationFailedError, CollectionNotExistError

Enhanced Methods:

  • loadCollectionWithRetry(): Now waits for LoadStateLoaded
  • ensureLoaded(): Uses enhanced retry logic
  • getIsHybrid(): Proper precedence and caching
  • getDimension(): Better handling for custom models

Logging Improvements:

  • Standardized all logging to match repository conventions
  • Removed emojis for ASCII-only output
  • Applied consistent [Prefix] Message format
  • Adjusted log levels appropriately (log/warn/error/debug)

Testing

  • Tested with OpenAI-compatible embedding server (mxbai-embed-large-v1, 1024 dimensions)
  • Verified index creation completes without race conditions
  • Confirmed proper handling of transient network errors
  • Validated environment variable precedence
  • TypeScript strict mode compliance maintained
  • Successfully indexed and searched multiple codebases

Needs additional testing and review by another developer.

This code should be tested and reviewed by another developer before merging. It should be tested with different embedding model/vector dimensions to make sure that everything works properly.

- Add waitForIndexReady() with exponential backoff for index creation
- Add loadCollectionWithRetry() to handle transient failures
- Fix environment variable precedence for hybrid search mode
- Replace hardcoded dimension (128) with dynamic detection
- Add null safety checks for Milvus client operations
- Cache getIsHybrid() result for 7x performance improvement

Resolves multiple critical issues:
1. Race condition where collections were accessed before indexes ready
2. Environment variables ignored for hybrid search configuration
3. Hardcoded embedding dimension incompatible with custom models

Tested with mxbai-embed-large-v1 (1024 dimensions) and successfully
indexed multiple large codebases without timeouts or errors.
- Make waitForIndexReady resilient to transient errors (continues retry loop)
- Add waitForCollectionLoaded to ensure collections reach LoadStateLoaded
- Improve boolean parsing to accept multiple formats (true/1/yes/on)
- Replace string-based error detection with custom error classes
- Fix docstring to match 60-second timeout implementation

These changes improve reliability when dealing with network instability
and provide more robust error handling throughout the indexing process.
- Remove all emojis from log messages (ASCII-only)
- Add consistent [Prefix] Message format
- Adjust log levels appropriately (log/warn/error/debug)
- Move verbose JSON dumps to debug level
- Align with upstream logging patterns
const isHybridEnv = envManager.get('HYBRID_MODE');
if (isHybridEnv === undefined || isHybridEnv === null) {
return true; // Default to true
// Return cached value if already computed
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why introduce so many configurations like DISABLE_SPARSE_VECTOR and USE_HYBRID_SEARCH to configure the hybrid search mode?
In the current code version, the hybrid search is enabled by default because it brings better results. The configuration can also be overridden using HYBRID_MODE. If this logic is correct, it is not recommended to introduce too many configurations, as it may confuse users. If I miss anything, please feel free to remind me.

@zc277584121
Copy link
Collaborator

@KartDriver Thanks for the contribution, and sorry for the delay. The commits look good. I just left some small comments.

});

// Wait for collection to actually reach LoadStateLoaded state
await this.waitForCollectionLoaded(collectionName);
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a quick question. After await this.client.loadCollection() is completed, is it possible that the state has not reached LoadStateLoaded yet? Is this an issue detected in your real-world scenario test, or a theoretically possible problem, just for reinforcement?


// Debug logging to understand the state value
console.debug(`[Milvus] Index state for '${fieldName}': raw=${indexStateResult.state}, type=${typeof indexStateResult.state}, IndexState.Finished=${IndexState.Finished}`);
console.debug('[Milvus] Full response:', JSON.stringify(indexStateResult));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

during my test, this line, JSON.stringify will throw an error, e.g. Client error for command Unexpected token 'I', "[Index] Pro"... is not valid JSON

…lve collection naming

- core/context:
  - Make HYBRID_MODE the canonical flag (default true); treat USE_HYBRID_SEARCH and DISABLE_SPARSE_VECTOR as deprecated aliases with one-time warnings
  - Add resolveCollectionName() to prefer the current mode’s collection and transparently fall back to the other if present
  - Use resolver in semanticSearch(), hasIndex(), and clearIndex() to avoid “indexed but not indexed” when mode changes
  - Standardize logging prefixes

- core/vectordb/milvus-vectordb:
  - Add jittered backoff and env-configurable timeouts for waitForIndexReady() and waitForCollectionLoaded()
    - INDEX_READY_TIMEOUT_MS, LOAD_READY_TIMEOUT_MS, LOAD_MAX_RETRIES
  - Replace risky JSON.stringify of SDK objects with minimal, safe debug logs
  - Guard JSON.parse of result metadata in hybrid search results

- mcp/handlers:
  - Make dummy create/drop validation optional via ENABLE_DUMMY_CREATE_VALIDATION (default false) with robust cleanup
  - Continue using dynamically detected embedding dimensions

- docs:
  - Document new env vars and deprecations; clarify HYBRID_MODE as the canonical switch

Rationale:
- Prevents race conditions by waiting for index and load readiness with retries and jitter
- Restores documented default (HYBRID_MODE=true) and simplifies config semantics
- Eliminates “not indexed” UX when toggling modes by resolving existing collections
- Ensures debug logging cannot crash application flow

Relates to: zilliztech#145, zilliztech#155
@KartDriver
Copy link
Contributor Author

You are right, hybrid mode should be the default.

I made changes to my branch:

  • Hybrid config clarity: Kept a single, canonical flag HYBRID_MODE (default true). The extra switches (USE_HYBRID_SEARCH, DISABLE_SPARSE_VECTOR) are now treated as deprecated aliases with one time warnings to avoid user confusion.
    Hybrid config clarity: Kept a single, canonical flag HYBRID_MODE (default true). The extra switches (USE_HYBRID_SEARCH, DISABLE_SPARSE_VECTOR) are now treated as deprecated aliases with one time warnings to avoid user confusion.

  • Collection load state (refinement): Kept the explicit readiness waits from the earlier patch. Added jitter to backoff and made timeouts/retries configurable (INDEX_READY_TIMEOUT_MS, LOAD_READY_TIMEOUT_MS, LOAD_MAX_RETRIES). ensureLoaded() continues to use the enhanced load-with-retry + wait.

  • Debug logging safety: Removed JSON.stringify usage on SDK objects in debug logs (which could throw). Logs now only include safe fields (e.g., state) and we guard JSON parsing of metadata in results.

@zc277584121
Copy link
Collaborator

@KartDriver Thank you for your detailed contribution, but it seems that the code in this PR is a bit over-designed, which will make the readability and maintainability worse. I will find a way to accept some of your code to make this contribution reasonable and concise.

@KartDriver
Copy link
Contributor Author

Ya, I got a little carried away hacking with an LLM. Please feel free to pick/chose the improvements instead of merging the whole PR.

I should have spent a little more time on this, but my work is crazy. I just wanted to make it work on my system, contributing back was an after thought ... but I figured it might be helpful.

@zc277584121
Copy link
Collaborator

picked improvements(they have been tested):

discarded code:

  • the use of getIndexState cause it's about to be deprecated
  • other over-designed codes and logics

@KartDriver thanks for this PR, will close it. Currently, I think the latest version of this project works for your case. Please reconnect/retoggle it to try again. If there is still the issue, please reopen it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants