
Commit 04ccf6b

DA-1253 update: haystack tutorials

- Add details on Hyperscale and Composite Vector Indexes
- Update model version to GPT-5
- Improve clarity on Couchbase integration
1 parent 9f3d311 commit 04ccf6b

2 files changed: +79, -26 lines changed

tutorial/markdown/python/python-haystack-pdf-chat/query_based/python-haystack-pdf-chat.md

Lines changed: 75 additions & 22 deletions
@@ -26,21 +26,40 @@ length: 45 Mins
 
 Welcome to this comprehensive guide on constructing an AI-enhanced Chat Application using **Couchbase's Hyperscale Vector Index**. We will create a dynamic chat interface capable of delving into PDF documents to extract and provide summaries, key facts, and answers to your queries. By the end of this tutorial, you'll have a powerful tool at your disposal, transforming the way you interact with and utilize the information contained within PDFs.
 
-This tutorial demonstrates the **Hyperscale vector search approach**, which is ideal for:
-- **High-performance vector search at massive scale** (billions of documents)
-- **Pure vector search** optimized for RAG applications
-- **SQL++ queries** for efficient vector retrieval
-- **Couchbase 8.0+** recommended for optimal performance
+### Choosing Between Hyperscale and Composite Vector Indexes
 
-To learn more about the other vector search indexes options avaliable at Couchbase refer to the [docs](https://docs.couchbase.com/cloud/vector-index/use-vector-indexes.html).
+Couchbase supports multiple types of vector indexes for different use cases. This tutorial uses **Hyperscale Vector Index**, but you should be aware of the alternatives:
+
+**Hyperscale Vector Index** (used in this tutorial):
+- Best for **pure vector similarity search** at massive scale
+- Optimized for RAG and chatbot applications
+- Handles billions of documents with sub-second query latency
+- Ideal when you need fast semantic search without metadata filtering
+
+**Composite Vector Index**:
+- Best for **vector search with metadata filtering**
+- Combines vector fields with scalar fields (e.g., date, category, user_id)
+- Enables pre-filtering before vector search for more efficient queries
+- Ideal when you need to filter documents by attributes before semantic search
+- Example use case: "Find similar documents from last month" or "Search within a user's documents"
+
+**Search Vector Index**:
+- Best for **hybrid search** combining keyword, geospatial, and semantic search
+- Flexible full-text search combined with vector similarity
+- Complex filtering using FTS queries
+- Compatible with Couchbase 7.6+
+
+> **For this PDF chat demo, we use Hyperscale Vector Index** for optimal performance in pure RAG applications. If your use case requires filtering by metadata (e.g., searching only in a specific user's documents, or documents from a certain date range), consider using Composite Vector Index instead.
+
+To learn more about choosing the right vector index, refer to the [official Couchbase vector index documentation](https://docs.couchbase.com/cloud/vector-index/use-vector-indexes.html).
 
 This tutorial will demonstrate how to:
 
 - Create a [Couchbase Hyperscale Vector Index](https://docs.couchbase.com/cloud/vector-index/hyperscale-vector-index.html) for high-performance Vector Search.
 - Chunk PDFs into Vectors with [Haystack](https://haystack.deepset.ai/) and use [Couchbase Vector Store](https://haystack.deepset.ai/integrations/couchbase-document-store) to store the vectors into Couchbase.
-- Query large language models via the [RAG framework](https://aws.amazon.com/what-is/retrieval-augmented-generation/) for contextual insights. We will use [OpenAI](https://openai.com) for generating Embeddings and querying the LLM.
-- Automatically create Hyperscale vector indexe after document upload.
-- Craft an elegant UI with Streamlit. All these components come together to create a seamless, AI-powered chat experience.
+- Use large language models via the [RAG framework](https://aws.amazon.com/what-is/retrieval-augmented-generation/) for contextual insights. We will use [OpenAI](https://openai.com) for generating embeddings and querying the LLM.
+- Automatically create a Hyperscale vector index after document upload.
+- Use a Streamlit interface to see it working in action. All these components come together to create a seamless, AI-powered chat experience.
 
 ## Prerequisites
 
@@ -86,7 +105,7 @@ Specifically, you need to do the following:
 
 ### Create Bucket
 
-- For this of this tutorial, we will use a specific bucket, scope, and collection. However, you may use any name of your choice but make sure to update names in all the steps.
+- For this tutorial, we will use a specific bucket, scope, and collection. However, you may use any names of your choice, but make sure to update the names in all the steps.
 - Create a bucket named `sample-bucket`. We will use the `scope` scope and `coll` collection of this bucket.
 
 ### Automatic Hyperscale Vector Index Creation
@@ -119,18 +138,47 @@ The application uses `CouchbaseQueryDocumentStore` which leverages SQL++ queries
 
 > **Note**: If automatic creation fails, the application will attempt to create the index on first query as a fallback. If you prefer manual control, you can create the index yourself using the SQL++ query above after uploading documents.
 
+#### Alternative: Using Composite Vector Index
+
+If your application needs to filter documents by metadata before performing vector search, you can use a **Composite Vector Index** instead. The same code works with minimal changes!
+
+**Creating a Composite Vector Index:**
+
+```sql
+CREATE INDEX idx_<collection_name>_composite
+ON `<bucket_name>`.`<scope_name>`.`<collection_name>`(embedding VECTOR)
+WITH {
+  "dimension": 1536,
+  "similarity": "DOT"
+}
+```
+
+**Key Difference**: Notice the keyword is `INDEX` (not `VECTOR INDEX`). Composite indexes combine vector fields with other scalar fields.
+
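For comparison, the equivalent Hyperscale index creation statement follows the same shape but uses the `VECTOR INDEX` keyword. This is a sketch assembled from the composite example above and the keyword difference just described; check the Hyperscale docs for the full set of `WITH` options:

```sql
CREATE VECTOR INDEX idx_<collection_name>_vector
ON `<bucket_name>`.`<scope_name>`.`<collection_name>`(embedding VECTOR)
WITH {
  "dimension": 1536,
  "similarity": "DOT"
}
```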
+**When to Use Composite Vector Index:**
+- You need to filter by metadata (e.g., `user_id`, `date`, `category`) before vector search
+- Example: "Find similar documents uploaded by user X in the last 30 days"
+- Enables efficient pre-filtering to reduce the search space
+
+**Code Compatibility:**
+The same `CouchbaseQueryDocumentStore` and application code works with both Hyperscale and Composite indexes! The only difference is:
+1. The SQL++ statement used to create the index
+2. How you query it (Composite allows `WHERE` clauses for filtering; see the sketch below)
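To illustrate the second point, a filtered vector query against a composite index might look like the following sketch. The `user_id` field, the `$`-parameters, and the use of `APPROX_VECTOR_DISTANCE` are illustrative assumptions; consult the SQL++ vector search documentation for the exact function names and options:

```sql
SELECT d.content,
       APPROX_VECTOR_DISTANCE(d.embedding, $query_embedding, "DOT") AS score
FROM `<bucket_name>`.`<scope_name>`.`<collection_name>` AS d
WHERE d.user_id = $user_id   -- scalar pre-filter served by the composite index
ORDER BY score
LIMIT 5;
```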
+
+To learn more about Composite Vector Indexes, refer to the [official documentation](https://docs.couchbase.com/cloud/vector-index/composite-vector-index.html).
+
 ### Setup Environment Config
 
 Copy the `secrets.example.toml` file in the `.streamlit` folder, rename it to `secrets.toml`, and replace the placeholders with the actual values for your environment. All configuration for communication with the database is read from the environment variables.
 
 ```bash
-DB_CONN_STR = "<couchbase_cluster_connection_string>"
-DB_USERNAME = "<couchbase_username>"
-DB_PASSWORD = "<couchbase_password>"
-DB_BUCKET = "<bucket_name>"
-DB_SCOPE = "<scope_name>"
-DB_COLLECTION = "<collection_name>"
-OPENAI_API_KEY = "<openai_api_key>"
+DB_CONN_STR = "<couchbase_cluster_connection_string>"
+DB_USERNAME = "<couchbase_username>"
+DB_PASSWORD = "<couchbase_password>"
+DB_BUCKET = "<bucket_name>"
+DB_SCOPE = "<scope_name>"
+DB_COLLECTION = "<collection_name>"
+OPENAI_API_KEY = "<openai_api_key>"
 ```
 
 > An [OpenAI](https://openai.com) API key is required for generating embeddings and querying the LLM.
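For reference, a minimal sketch of how the app can validate and load these values with Streamlit's secrets API (the exact checks in the repo may differ):

```python
import os
import streamlit as st

# Fail fast if any required setting is missing from .streamlit/secrets.toml,
# then mirror the values into environment variables for the rest of the app.
REQUIRED = ("DB_CONN_STR", "DB_USERNAME", "DB_PASSWORD",
            "DB_BUCKET", "DB_SCOPE", "DB_COLLECTION", "OPENAI_API_KEY")
for key in REQUIRED:
    if key not in st.secrets:
        st.error(f"{key} is not set in .streamlit/secrets.toml")
        st.stop()
    os.environ[key] = st.secrets[key]
```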
@@ -213,7 +261,7 @@ In the PDF Chat app, Haystack is used for several tasks:
 - **Vector store integration**: Haystack provides a [CouchbaseDocumentStore](https://haystack.deepset.ai/integrations/couchbase-document-store) class that seamlessly integrates with Couchbase's Vector Search, allowing the app to store and search through the embeddings and their corresponding text.
 - **Pipelines**: Haystack uses [Pipelines](https://docs.haystack.deepset.ai/docs/pipelines) to combine different components for various tasks. In this app, we have an indexing pipeline for processing and storing documents, and a RAG pipeline for retrieval and generation.
 - **Prompt Building**: Haystack's [PromptBuilder](https://docs.haystack.deepset.ai/docs/promptbuilder) component allows you to create custom prompts that guide the language model's behavior and output.
-- **Streaming Output**: LangChain supports [streaming](https://python.langchain.com/docs/expression_language/streaming/), allowing the app to stream the generated answer to the client in real-time.
+- **Streaming Output**: Haystack generators support streaming via a `streaming_callback`, allowing the app to stream the generated answer to the client in real time.
 
 By combining Vector Search with Couchbase, RAG, and Haystack, the PDF Chat app can efficiently ingest PDF documents, convert their content into searchable embeddings, retrieve relevant information based on user queries and conversation context, and generate context-aware and informative responses using large language models.
 
@@ -233,6 +281,8 @@ On the Chat Area, the user can pose questions. These inquiries are processed by
 
 The first step is connecting to Couchbase. Couchbase Hyperscale Vector Search is required for PDF upload as well as during chat (for retrieval). We will use the Haystack **CouchbaseQueryDocumentStore** to connect to the Couchbase cluster with Hyperscale vector search support. The connection is established in the `get_document_store` function.
 
+> **Note**: The same `CouchbaseQueryDocumentStore` configuration works with both **Hyperscale** and **Composite** vector indexes! You only need to change the SQL++ statement when creating the index.
+
 The connection string and credentials are read from the environment variables. We perform some basic checks for environment variables not being set in `secrets.toml`, and then connect to the Couchbase cluster using the [connect](https://docs.couchbase.com/python-sdk/current/hello-world/start-using-sdk.html#connect) method.
 
 ```python
@@ -279,7 +329,7 @@ We will define the bucket, scope, and collection names from [Environment Variabl
 
 ## Initialize Couchbase Vector Store
 
-We will now initialize the CouchbaseDocumentStore which will be used for storing and retrieving document embeddings.
+We will now initialize the CouchbaseQueryDocumentStore, which will be used for storing and retrieving document embeddings.
 ```python
 # Initialize document store
 document_store = get_document_store()
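
# For reference, an editorial sketch of what `get_document_store` plausibly
# looks like. The couchbase-haystack constructor parameter names below are
# assumptions based on the settings described in this tutorial, not the
# repo's exact code.
import os
from haystack.utils import Secret
from couchbase_haystack import CouchbaseQueryDocumentStore, CouchbasePasswordAuthenticator

def get_document_store():
    # Read the connection settings loaded earlier from secrets.toml.
    return CouchbaseQueryDocumentStore(
        cluster_connection_string=Secret.from_token(os.environ["DB_CONN_STR"]),
        authenticator=CouchbasePasswordAuthenticator(
            username=Secret.from_token(os.environ["DB_USERNAME"]),
            password=Secret.from_token(os.environ["DB_PASSWORD"]),
        ),
        bucket=os.environ["DB_BUCKET"],
        scope=os.environ["DB_SCOPE"],
        collection=os.environ["DB_COLLECTION"],
    )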
@@ -334,7 +384,10 @@ The indexing pipeline is created to handle the entire process of ingesting PDFs
 from haystack import Pipeline
 from haystack.components.converters import PyPDFToDocument
 from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
-from haystack.components.embedders import OpenAIDocumentEmbedder
+from haystack.components.embedders import OpenAIDocumentEmbedder, OpenAITextEmbedder
+from haystack.components.generators import OpenAIGenerator
+from haystack.components.builders import PromptBuilder, AnswerBuilder
+from haystack.components.writers import DocumentWriter
 
 indexing_pipeline = Pipeline()
 indexing_pipeline.add_component("converter", PyPDFToDocument())
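For context, an indexing pipeline like this is typically completed by adding and wiring the remaining components. The following is a sketch under assumed component names and splitter settings, not the repo's exact code:

```python
# Clean, chunk, embed, and write the PDF content into Couchbase.
indexing_pipeline.add_component("cleaner", DocumentCleaner())
indexing_pipeline.add_component("splitter", DocumentSplitter(split_by="word", split_length=200))
indexing_pipeline.add_component("embedder", OpenAIDocumentEmbedder())
indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store))

# Each component's output feeds the next stage.
indexing_pipeline.connect("converter", "cleaner")
indexing_pipeline.connect("cleaner", "splitter")
indexing_pipeline.connect("splitter", "embedder")
indexing_pipeline.connect("embedder", "writer")
```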
@@ -362,7 +415,7 @@ We create a RAG (Retrieval-Augmented Generation) pipeline using Haystack compone
 The OpenAIGenerator is a crucial component in our RAG pipeline, responsible for generating human-like responses based on the retrieved context and user questions. Here's a more detailed explanation of its configuration and role:
 
 - API Key: The OpenAIGenerator uses the OPENAI_API_KEY from the environment variables to authenticate with the OpenAI API.
-- Model: It's configured to use the "gpt-4o" model, which is a powerful language model capable of understanding context and generating coherent, relevant responses.
+- Model: It's configured to use the "gpt-5" model, which is a powerful language model capable of understanding context and generating coherent, relevant responses.
 - Role in the Pipeline: The OpenAIGenerator receives a prompt constructed by the PromptBuilder, which includes the user's question and relevant context retrieved from the vector store. It then generates a response based on this input.
 - Integration: The generator's output is connected to the AnswerBuilder component, which formats the final response for display to the user.
 
@@ -388,7 +441,7 @@ rag_pipeline.add_component(
     "llm",
     OpenAIGenerator(
         api_key=OPENAI_API_KEY,
-        model="gpt-4o",
+        model="gpt-5",
     ),
 )
 rag_pipeline.add_component("answer_builder", AnswerBuilder())
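For context, a sketch of how such a RAG pipeline is typically wired and invoked. The `query_embedder`, `retriever`, and `prompt_builder` component names are assumptions (those components are added earlier in the pipeline); the final answer is read from the AnswerBuilder output:

```python
# Wire retrieval into prompting, prompting into generation, generation into the answer.
rag_pipeline.connect("query_embedder.embedding", "retriever.query_embedding")
rag_pipeline.connect("retriever.documents", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder", "llm")
rag_pipeline.connect("llm.replies", "answer_builder.replies")

question = "What is this PDF about?"
result = rag_pipeline.run({
    "query_embedder": {"text": question},
    "prompt_builder": {"question": question},
    "answer_builder": {"query": question},
})
print(result["answer_builder"]["answers"][0].data)
```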

tutorial/markdown/python/python-haystack-pdf-chat/search_based/python-haystack-pdf-chat.md

Lines changed: 4 additions & 4 deletions
@@ -3,7 +3,7 @@
 path: "/tutorial-python-haystack-pdf-chat-with-search-vector-index"
 # title and description do not need to be added to markdown, start with H2 (##)
 title: Build PDF Chat App With Haystack, OpenAI and Couchbase Search Vector Index
-short_title: Build PDF Chat App with Search Vector index
+short_title: Build PDF Chat App with Search Vector Index
 description:
   - Construct a PDF Chat App with Haystack, Couchbase Python SDK, Couchbase Vector Search, and Streamlit.
   - Learn to upload PDFs into Couchbase Vector Store with Haystack.
@@ -113,7 +113,7 @@ You may also create a vector index using Search UI on both [Couchbase Capella](h
 
 Here, we are creating the index on the documents with the following configuration:
 - **Vector field**: `embedding` with 1536 dimensions (matching OpenAI's text-embedding-ada-002 and text-embedding-3-small models)
-- **Text field**: `content` for document text content
+- **Text field**: `content` for document text content
 - **Metadata field**: `meta` with dynamic mapping to account for varying document structures
 - **Similarity metric**: `dot_product` (optimized for OpenAI embeddings)
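An abridged sketch of a Search index definition JSON matching that configuration, assuming the `scope.coll` type mapping used in this tutorial (the real definition generated by the UI contains additional settings):

```json
{
  "type": "fulltext-index",
  "params": {
    "mapping": {
      "types": {
        "scope.coll": {
          "dynamic": false,
          "properties": {
            "embedding": {
              "fields": [{ "name": "embedding", "type": "vector", "dims": 1536, "similarity": "dot_product", "index": true }]
            },
            "content": {
              "fields": [{ "name": "content", "type": "text", "index": true }]
            },
            "meta": { "dynamic": true, "enabled": true }
          }
        }
      }
    }
  }
}
```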

@@ -420,7 +420,7 @@ We create a RAG (Retrieval-Augmented Generation) pipeline using Haystack compone
 The OpenAIGenerator is a crucial component in our RAG pipeline, responsible for generating human-like responses based on the retrieved context and user questions. Here's a more detailed explanation of its configuration and role:
 
 - API Key: The OpenAIGenerator uses the OPENAI_API_KEY from the environment variables to authenticate with the OpenAI API.
-- Model: It's configured to use the "gpt-4o" model, which is a powerful language model capable of understanding context and generating coherent, relevant responses.
+- Model: It's configured to use the "gpt-5" model, which is a powerful language model capable of understanding context and generating coherent, relevant responses.
 - Role in the Pipeline: The OpenAIGenerator receives a prompt constructed by the PromptBuilder, which includes the user's question and relevant context retrieved from the vector store. It then generates a response based on this input.
 - Integration: The generator's output is connected to the AnswerBuilder component, which formats the final response for display to the user.
 
@@ -446,7 +446,7 @@ rag_pipeline.add_component(
     "llm",
     OpenAIGenerator(
         api_key=OPENAI_API_KEY,
-        model="gpt-4o",
+        model="gpt-5",
     ),
 )
 rag_pipeline.add_component("answer_builder", AnswerBuilder())
