Search optimization and indexing based on datetime (#405)
**Related Issue(s):**
- #401
# Index Management System with Time-based Partitioning
## Description
This PR introduces an index management system that partitions indexes
by date and caps index size by automatically splitting an index once it
exceeds a configurable limit.
## How it works
### System Architecture
The system consists of several main components:
**1. Search Engine Adapters**
- `SearchEngineAdapter` - base class
- `ElasticsearchAdapter` and `OpenSearchAdapter` - implementations for
specific engines
**2. Index Selection Strategies**
- `AsyncDatetimeBasedIndexSelector` / `SyncDatetimeBasedIndexSelector` -
date-based index filtering
- `UnfilteredIndexSelector` - returns all indexes (fallback)
- Cache with TTL (default 1 hour) for performance
**3. Data Insertion Strategies**
- **Simple strategy**: one index per collection (behavior as before)
- **Datetime strategy**: indexes partitioned by date, with automatic
splitting when the size limit is exceeded
### Datetime Strategy - Operation Details
**Index Format:**
```
items_collection-name_2025-01-01-2025-03-31
```
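As a minimal sketch of this naming scheme, the helpers below build and parse such names. The function names, the `IndexRange` type and the default `items_` prefix are illustrative assumptions, not code from this PR.

```python
from datetime import date
from typing import NamedTuple, Optional


class IndexRange(NamedTuple):
    collection: str
    start: date
    end: Optional[date]  # None means the range is still open-ended


def format_index_name(collection: str, start: date,
                      end: Optional[date] = None,
                      prefix: str = "items_") -> str:
    """Build a name such as items_collection-name_2025-01-01-2025-03-31."""
    suffix = start.isoformat()
    if end is not None:
        suffix += "-" + end.isoformat()
    return f"{prefix}{collection}_{suffix}"


def parse_index_name(name: str, prefix: str = "items_") -> IndexRange:
    """Split a name back into collection id, start date and optional end date."""
    if not name.startswith(prefix):
        raise ValueError(f"unexpected index name: {name!r}")
    # The date suffix follows the last underscore; ISO dates are 10 characters.
    collection, _, suffix = name[len(prefix):].rpartition("_")
    start = date.fromisoformat(suffix[:10])
    end = date.fromisoformat(suffix[11:]) if len(suffix) > 10 else None
    return IndexRange(collection, start, end)
```

Splitting on the last underscore keeps collection ids that contain hyphens (such as `collection-name`) intact.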
**Item Insertion Process:**
1. The system reads the item's date (`properties.datetime`)
2. It looks for an existing index that covers this date
3. If none is found, it creates a new index starting from this date
4. It checks the size of the target index
5. If the size exceeds the limit (`DATETIME_INDEX_MAX_SIZE_GB`), it splits the index
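The lookup and size check can be sketched with an in-memory model. `Partition`, `resolve_target` and `needs_split` are hypothetical names used only for illustration; the real implementation talks to the search engine rather than to a Python list.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class Partition:
    start: date
    end: Optional[date] = None  # None: open-ended, still accepts newer items
    size_gb: float = 0.0

    def covers(self, day: date) -> bool:
        return self.start <= day and (self.end is None or day <= self.end)


def resolve_target(partitions: List[Partition], item_date: date) -> Partition:
    """Steps 2-3: reuse the covering partition or open a new one at item_date."""
    for partition in partitions:
        if partition.covers(item_date):
            return partition
    created = Partition(start=item_date)
    partitions.append(created)
    return created


def needs_split(partition: Partition, max_size_gb: float = 25.0) -> bool:
    """Steps 4-5: a partition is split once its size exceeds the limit."""
    return partition.size_gb > max_size_gb
```

In this sketch a partition created for a date earlier than every existing one stays open-ended; bounding it is the job of the early-date handling described next.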
**Early Date Handling:**
If an item's date is earlier than the start of the oldest index:
1. The system creates a new index starting from this earlier date
2. It sets the new index's alias to end one day before the start of the previously oldest index
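Going by Scenario 3 later in this description, the range for the early index ends one day before the start of the existing oldest index. A small sketch of that date arithmetic, with `early_partition_bounds` as a hypothetical helper name:

```python
from datetime import date, timedelta
from typing import Tuple


def early_partition_bounds(item_date: date, oldest_start: date) -> Tuple[date, date]:
    """Return (start, end) for an index that holds items older than the oldest index."""
    if item_date >= oldest_start:
        raise ValueError("item_date is not earlier than the oldest index")
    # The early index stops the day before the existing oldest index begins,
    # so the two date ranges never overlap.
    return item_date, oldest_start - timedelta(days=1)
```

For an item dated `2024-12-15` and an oldest index starting on `2025-02-01`, this yields the range `2024-12-15` to `2025-01-31`, matching Scenario 3.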
**Index Splitting:**
When an index exceeds the size limit:
1. The current index's alias is updated to end on the date of its last item
2. A new index is created starting from the next day
3. New items go to the new index
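The naming side of a split can be sketched as follows. `split_names` is a hypothetical helper, and the name layout follows the examples in this description rather than the actual implementation.

```python
from datetime import date, timedelta
from typing import Tuple


def split_names(collection: str, start: date, last_item: date,
                prefix: str = "items_") -> Tuple[str, str]:
    """Return (closed name for the full index, name for the new index)."""
    # The full index is closed on the date of its last item ...
    closed = f"{prefix}{collection}_{start.isoformat()}-{last_item.isoformat()}"
    # ... and the new index starts on the following day.
    next_day = last_item + timedelta(days=1)
    fresh = f"{prefix}{collection}_{next_day.isoformat()}"
    return closed, fresh
```

With `start=2025-01-01` and `last_item=2025-03-15` this reproduces the pair of names shown in Scenario 2.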
### Cache and Performance
**IndexCacheManager:**
- Stores mapping of collection aliases to index lists
- TTL default 1 hour
- Automatic refresh on expiration
- Manual refresh after index modifications
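A TTL cache with these properties might look roughly like the sketch below. `AliasCache` and its `loader` callback are illustrative assumptions and do not mirror the actual `IndexCacheManager` interface.

```python
import time
from typing import Callable, Dict, List, Optional

AliasMap = Dict[str, List[str]]


class AliasCache:
    """Caches an alias-to-indexes mapping and reloads it after ttl_seconds."""

    def __init__(self, loader: Callable[[], AliasMap],
                 ttl_seconds: float = 3600.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._loader = loader
        self._ttl = ttl_seconds
        self._clock = clock
        self._data: Optional[AliasMap] = None
        self._loaded_at = 0.0

    def get(self) -> AliasMap:
        # Automatic refresh: reload when empty or older than the TTL.
        if self._data is None or self._clock() - self._loaded_at >= self._ttl:
            self.refresh()
        assert self._data is not None
        return self._data

    def refresh(self) -> None:
        # Manual refresh: call this right after creating or splitting an index.
        self._data = self._loader()
        self._loaded_at = self._clock()
```

Injecting the clock keeps the expiry logic testable without sleeping.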
**AsyncIndexAliasLoader / SyncIndexAliasLoader:**
- Load alias information from search engine
- Use cache manager to store results
- Async and sync versions for different usage contexts
## Configuration
**New Environment Variables:**
```bash
# Enable datetime strategy (default false)
ENABLE_DATETIME_INDEX_FILTERING=true
# Maximum index size in GB before splitting (default 25)
DATETIME_INDEX_MAX_SIZE_GB=50
```
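For illustration, an application could read these two variables as shown below. The `load_settings` helper and the set of accepted truthy strings are assumptions; the project's own parsing may differ.

```python
import os
from typing import Any, Dict, Mapping


def _as_bool(value: str) -> bool:
    # Assumption: these spellings count as "enabled".
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str] = os.environ) -> Dict[str, Any]:
    """Read the datetime-index settings, falling back to the documented defaults."""
    return {
        "datetime_filtering": _as_bool(env.get("ENABLE_DATETIME_INDEX_FILTERING", "false")),
        "max_size_gb": float(env.get("DATETIME_INDEX_MAX_SIZE_GB", "25")),
    }
```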
## Usage Examples
### Scenario 1: Adding items to new collection
1. First item with date `2025-01-15` → creates index `items_collection_2025-01-15`
2. Subsequent items with dates covered by that index → go to the same index
### Scenario 2: Size limit exceeded
1. Index `items_collection_2025-01-01` reaches 25 GB
2. New item with date `2025-03-15` → system splits index:
   - Old: `items_collection_2025-01-01-2025-03-15`
   - New: `items_collection_2025-03-16`
### Scenario 3: Item with early date
1. Existing index: `items_collection_2025-02-01`
2. New item with date `2024-12-15` → creates:
   - New: `items_collection_2024-12-15-2025-01-31`
## Search
The system automatically filters indexes during search.
**Query with date range:**
```json
{
  "datetime": {
    "gte": "2025-02-01",
    "lte": "2025-02-28"
  }
}
```
This query searches only the indexes whose date range overlaps the
requested period, instead of every index in the collection.
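The filtering amounts to an interval-overlap test. The sketch below assumes each index is described by a `(start, end)` pair, with `end=None` for the open-ended newest index; `select_indexes` is a hypothetical name, not the selector API added by this PR.

```python
from datetime import date
from typing import Dict, List, Optional, Tuple

DateRange = Tuple[date, Optional[date]]


def select_indexes(ranges: Dict[str, DateRange],
                   gte: Optional[date] = None,
                   lte: Optional[date] = None) -> List[str]:
    """Return the names of indexes whose range overlaps [gte, lte]."""
    selected = []
    for name, (start, end) in ranges.items():
        starts_in_time = lte is None or start <= lte
        ends_late_enough = end is None or gte is None or end >= gte
        if starts_in_time and ends_late_enough:
            selected.append(name)
    return selected
```

With no bounds given, every index is returned, which corresponds to the unfiltered fallback.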
## Factories
**IndexSelectorFactory:**
- Creates appropriate selector based on configuration
- `create_async_selector()` / `create_sync_selector()`
**IndexInsertionFactory:**
- Creates insertion strategy based on configuration
- Automatically detects engine type and creates appropriate adapter
**SearchEngineAdapterFactory:**
- Detects whether you're using Elasticsearch or OpenSearch
- Creates appropriate adapter with engine-specific methods
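The configuration-driven choice these factories make can be sketched as below. `DatetimeSelector`, `UnfilteredSelector` and `make_selector` are stand-ins invented for this example; they are not the classes or signatures added by this PR.

```python
import os
from typing import Mapping, Union


class UnfilteredSelector:
    """Stand-in for a selector that returns every index of a collection."""


class DatetimeSelector:
    """Stand-in for a selector that filters indexes by date range."""


def make_selector(env: Mapping[str, str] = os.environ) -> Union[DatetimeSelector, UnfilteredSelector]:
    """Pick the selector type from ENABLE_DATETIME_INDEX_FILTERING."""
    flag = env.get("ENABLE_DATETIME_INDEX_FILTERING", "false")
    if flag.strip().lower() == "true":
        return DatetimeSelector()
    # Default: behave as before and search all indexes.
    return UnfilteredSelector()
```

Keeping the decision in one place means calling code does not need to know which strategy is active.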
## Backward Compatibility
- When `ENABLE_DATETIME_INDEX_FILTERING=false` → works as before
- Existing indexes remain unchanged
All operations have sync and async versions for different usage contexts
in the application.
**PR Checklist:**
- [x] Code is formatted and linted (run `pre-commit run --all-files`)
- [x] Tests pass (run `make test`)
- [x] Documentation has been updated to reflect changes, if applicable
- [x] Changes are added to the changelog
---------
Co-authored-by: Grzegorz Pustulka <[email protected]>
CHANGELOG.md
Lines changed: 26 additions & 0 deletions
@@ -8,6 +8,32 @@ and this project adheres to [Semantic Versioning](http://semver.org/spec/v2.0.0.

 ## [Unreleased]

+### Added
+
+- Added comprehensive index management system with dynamic selection and insertion strategies for improved performance and scalability [#405](https://github.com/stac-utils/stac-fastapi-elasticsearch-opensearch/pull/405)
+- Added `ENABLE_DATETIME_INDEX_FILTERING` environment variable to enable datetime-based index selection using collection IDs. When enabled, the system creates indexes with UUID-based names and manages them through time-based aliases. Default is `false`. [#405](https://github.com/stac-utils/stac-fastapi-elasticsearch-opensearch/pull/405)
+- Added `DATETIME_INDEX_MAX_SIZE_GB` environment variable to set maximum size limit in GB for datetime-based indexes. When an index exceeds this size, a new time-partitioned index will be created. Note: add +20% to target size due to ES/OS compression. Default is `25` GB. Only applies when `ENABLE_DATETIME_INDEX_FILTERING` is enabled. [#405](https://github.com/stac-utils/stac-fastapi-elasticsearch-opensearch/pull/405)
+- Added index operations system with unified interface for both Elasticsearch and OpenSearch [#405](https://github.com/stac-utils/stac-fastapi-elasticsearch-opensearch/pull/405):
+  - `IndexOperations` class with common index creation and management methods
+  - UUID-based physical index naming: `{prefix}_{collection-id}_{uuid4}`
+  - Alias management: main collection alias, temporal aliases, and closed index aliases
+  - Automatic alias updates when indexes reach size limits
+- Added datetime-based index selection strategies with caching support [#405](https://github.com/stac-utils/stac-fastapi-elasticsearch-opensearch/pull/405):
+  - `DatetimeBasedIndexSelector` for temporal filtering with intelligent caching
+  - `IndexCacheManager` with configurable TTL-based cache expiration (default 1 hour)
+  - `IndexAliasLoader` for alias management and cache refresh
+  - `UnfilteredIndexSelector` as fallback for returning all available indexes
+- Added index insertion strategies with automatic partitioning [#405](https://github.com/stac-utils/stac-fastapi-elasticsearch-opensearch/pull/405):
+  - Simple insertion strategy (`SimpleIndexInserter`) for traditional single-index-per-collection approach
+  - Datetime-based insertion strategy (`DatetimeIndexInserter`) with time-based partitioning
+  - Automatic index size monitoring and splitting when limits exceeded
+  - Handling of chronologically early data and bulk operations
+- Added index management utilities [#405](https://github.com/stac-utils/stac-fastapi-elasticsearch-opensearch/pull/405):
+  - `IndexSizeManager` for size monitoring and overflow handling with compression awareness
+  - `DatetimeIndexManager` for datetime-based index operations and validation
+  - Factory patterns (`IndexInsertionFactory`, `IndexSelectorFactory`) for strategy creation based on configuration
README.md
Lines changed: 75 additions & 1 deletion
@@ -230,6 +230,81 @@ You can customize additional settings in your `.env` file:
 > [!NOTE]
 > The variables `ES_HOST`, `ES_PORT`, `ES_USE_SSL`, `ES_VERIFY_CERTS` and `ES_TIMEOUT` apply to both Elasticsearch and OpenSearch backends, so there is no need to rename the key names to `OS_` even if you're using OpenSearch.

+# Datetime-Based Index Management
+
+## Overview
+
+SFEOS supports two indexing strategies for managing STAC items:
+
+1. **Simple Indexing** (default) - One index per collection
+2. **Datetime-Based Indexing** - Time-partitioned indexes with automatic management
+
+The datetime-based indexing strategy is particularly useful for large temporal datasets. When a user provides a datetime parameter in a query, the system knows exactly which index to search, providing **multiple times faster searches** and significantly **reducing database load**.
+
+## When to Use
+
+**Recommended for:**
+- Systems with large collections containing millions of items
+- Systems requiring high-performance temporal searching
+
+**Pros:**
+- Multiple times faster queries with datetime filter
+- Reduced database load - only relevant indexes are searched
+
+**Cons:**
+- Slightly longer item indexing time (automatic index management)
+- Greater management complexity
+
+## Configuration
+
+### Enabling Datetime-Based Indexing
+
+Enable datetime-based indexing by setting the following environment variable:
+
+```bash
+ENABLE_DATETIME_INDEX_FILTERING=true
+```
+
+### Related Configuration Variables
+
+| Variable | Description | Default | Example |
+|----------|-------------|---------|---------|
+| `ENABLE_DATETIME_INDEX_FILTERING` | Enables time-based index partitioning | `false` | `true` |
+| `DATETIME_INDEX_MAX_SIZE_GB` | Maximum size limit for datetime indexes (GB) - note: add +20% to target size due to ES/OS compression | `25` | `50` |
+| `STAC_ITEMS_INDEX_PREFIX` | Prefix for item indexes | `items_` | `stac_items_` |
+
+## How Datetime-Based Indexing Works
+
+### Index and Alias Naming Convention
+
+The system uses a precise naming convention:
+
+**Physical indexes:**
+```
+{ITEMS_INDEX_PREFIX}{collection-id}_{uuid4}
+```
+
+**Aliases:**
+```
+{ITEMS_INDEX_PREFIX}{collection-id} # Main collection alias
+{ITEMS_INDEX_PREFIX}{collection-id}_{start-datetime} # Temporal alias
+{ITEMS_INDEX_PREFIX}{collection-id}_{start-datetime}_{end-datetime} # Closed index alias
[...]
+- `items_sentinel-2-l2a_2024-01-01` - active alias from January 1, 2024
+- `items_sentinel-2-l2a_2024-01-01_2024-03-15` - closed index alias (reached size limit)
+
+### Index Size Management
+
+**Important - Data Compression:** Elasticsearch and OpenSearch automatically compress data. The configured `DATETIME_INDEX_MAX_SIZE_GB` limit refers to the compressed size on disk. It is recommended to add +20% to the target size to account for compression overhead and metadata.
+
 ## Interacting with the API

 - **Creating a Collection**:
@@ -538,4 +613,3 @@ You can customize additional settings in your `.env` file:
 - Ensures fair resource allocation among all clients

 - **Examples**: Implementation examples are available in the [examples/rate_limit](examples/rate_limit) directory.