`tutorial/markdown/python/python-haystack-pdf-chat/query_based/python-haystack-pdf-chat.md`
Welcome to this comprehensive guide on constructing an AI-enhanced Chat Application using **Couchbase's Hyperscale Vector Index**. We will create a dynamic chat interface capable of delving into PDF documents to extract and provide summaries, key facts, and answers to your queries. By the end of this tutorial, you'll have a powerful tool at your disposal, transforming the way you interact with and utilize the information contained within PDFs.
### Choosing Between Hyperscale and Composite Vector Indexes

Couchbase supports multiple types of vector indexes for different use cases. This tutorial uses **Hyperscale Vector Index**, but you should be aware of the alternatives:

**Hyperscale Vector Index** (used in this tutorial):

- Best for **pure vector similarity search** at massive scale
- Optimized for RAG and chatbot applications
- Handles billions of documents with sub-second query latency
- Ideal when you need fast semantic search without metadata filtering
- Couchbase 8.0+ recommended for optimal performance

**Composite Vector Index**:

- Best for **vector search with metadata filtering**
- Enables pre-filtering before vector search for more efficient queries
- Ideal when you need to filter documents by attributes before semantic search
- Example use case: "Find similar documents from last month" or "Search within user's documents"

**Search Vector Index**:

- Best for **hybrid search** combining keywords, geospatial, and semantic search
- Flexible full-text search combined with vector similarity
- Complex filtering using FTS queries
- Compatible with Couchbase 7.6+

> **For this PDF chat demo, we use Hyperscale Vector Index** for optimal performance in pure RAG applications. If your use case requires filtering by metadata (e.g., searching only in a specific user's documents, or documents from a certain date range), consider using a Composite Vector Index instead.

To learn more about choosing the right vector index, refer to the [official Couchbase vector index documentation](https://docs.couchbase.com/cloud/vector-index/use-vector-indexes.html).

This tutorial will demonstrate how to:

- Create a [Couchbase Hyperscale Vector Index](https://docs.couchbase.com/cloud/vector-index/hyperscale-vector-index.html) for high-performance Vector Search.
- Chunk PDFs into Vectors with [Haystack](https://haystack.deepset.ai/) and use [Couchbase Vector Store](https://haystack.deepset.ai/integrations/couchbase-document-store) to store the vectors into Couchbase.
- Use large language models via the [RAG framework](https://aws.amazon.com/what-is/retrieval-augmented-generation/) for contextual insights. We will use [OpenAI](https://openai.com) for generating Embeddings and querying the LLM.
- Automatically create a Hyperscale vector index after document upload.
- Use a Streamlit interface to see it working in action. All these components come together to create a seamless, AI-powered chat experience.

## Prerequisites
### Create Bucket

- For this tutorial, we will use a specific bucket, scope, and collection. However, you may use any names of your choice; just make sure to update them in all the steps.
- Create a bucket named `sample-bucket`. We will use the `scope` scope and `coll` collection of this bucket.

### Automatic Hyperscale Vector Index Creation

The application uses `CouchbaseQueryDocumentStore`, which leverages SQL++ queries for vector retrieval, and creates the Hyperscale vector index automatically after documents are uploaded.
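The tutorial's exact statement is not reproduced in this excerpt, but a Hyperscale index definition generally follows this shape (the index name, dimension, and `description` quantization settings below are illustrative assumptions, not the tutorial's verbatim SQL++):

```sql
-- Illustrative Hyperscale Vector Index (settings are assumptions; see the Hyperscale docs linked above)
CREATE VECTOR INDEX idx_pdf_embedding
ON `sample-bucket`.`scope`.`coll`(embedding VECTOR)
WITH {"dimension": 1536, "similarity": "dot_product", "description": "IVF,SQ8"};
```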
> **Note**: If automatic creation fails, the application will attempt to create the index on first query as a fallback. If you prefer manual control, you can create the index yourself using the SQL++ query above after uploading documents.

#### Alternative: Using Composite Vector Index

If your application needs to filter documents by metadata before performing vector search, you can use a **Composite Vector Index** instead. The same code works with minimal changes!
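For illustration, a Composite definition pairs scalar fields with the vector field (the field names and settings below are assumptions, not the tutorial's verbatim SQL++):

```sql
-- Illustrative Composite Vector Index: scalar key(s) plus a vector key
CREATE INDEX idx_user_pdf_embedding
ON `sample-bucket`.`scope`.`coll`(user_id, embedding VECTOR)
WITH {"dimension": 1536, "similarity": "dot_product", "description": "IVF,SQ8"};
```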
**Key Difference**: Notice the keyword is `INDEX` (not `VECTOR INDEX`). Composite indexes combine vector fields with other scalar fields.

**When to Use Composite Vector Index:**

- You need to filter by metadata (e.g., `user_id`, `date`, `category`) before vector search
- Example: "Find similar documents uploaded by user X in the last 30 days"
- Enables efficient pre-filtering to reduce the search space

**Code Compatibility:**

The same `CouchbaseQueryDocumentStore` and application code works with both Hyperscale and Composite indexes! The only difference is:

1. The SQL++ statement used to create the index
2. How you query it (Composite allows WHERE clauses for filtering), as sketched below
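As a sketch of point 2, a filtered vector query might look like this (the `APPROX_VECTOR_DISTANCE()` usage follows Couchbase's SQL++ vector search syntax; the field names and parameter below are assumptions):

```sql
-- Illustrative filtered query: the WHERE clause narrows the search space before the vector comparison
SELECT d.content, d.meta
FROM `sample-bucket`.`scope`.`coll` AS d
WHERE d.user_id = "user-123"
ORDER BY APPROX_VECTOR_DISTANCE(d.embedding, $query_embedding)
LIMIT 5;
```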

To learn more about Composite Vector Indexes, refer to the [official documentation](https://docs.couchbase.com/cloud/vector-index/composite-vector-index.html).
### Setup Environment Config

Copy the `secrets.example.toml` file in the `.streamlit` folder, rename it to `secrets.toml`, and replace the placeholders with the actual values for your environment. All configuration for communication with the database is read from the environment variables.

> An [OpenAI](https://openai.com) API key is required for generating embeddings and querying the LLM.
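For reference, a minimal `secrets.toml` might look like the sketch below (the key names here are assumptions; copy the actual ones from `secrets.example.toml`):

```toml
# Illustrative values only; use the key names from secrets.example.toml
DB_CONN_STR = "couchbases://cb.example.cloud.couchbase.com"
DB_USERNAME = "app_user"
DB_PASSWORD = "app_password"
DB_BUCKET = "sample-bucket"
DB_SCOPE = "scope"
DB_COLLECTION = "coll"
OPENAI_API_KEY = "sk-..."
```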

In the PDF Chat app, Haystack is used for several tasks:
- **Vector store integration**: Haystack provides a [CouchbaseDocumentStore](https://haystack.deepset.ai/integrations/couchbase-document-store) class that seamlessly integrates with Couchbase's Vector Search, allowing the app to store and search through the embeddings and their corresponding text.
- **Pipelines**: Haystack uses [Pipelines](https://docs.haystack.deepset.ai/docs/pipelines) to combine different components for various tasks. In this app, we have an indexing pipeline for processing and storing documents, and a RAG pipeline for retrieval and generation.
- **Prompt Building**: Haystack's [PromptBuilder](https://docs.haystack.deepset.ai/docs/promptbuilder) component allows you to create custom prompts that guide the language model's behavior and output (see the sketch after this list).
- **Streaming Output**: Haystack supports streaming the generated answer to the client in real time (for example, via a `streaming_callback` on the generator).
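As a concrete illustration of the PromptBuilder component mentioned above, a pipeline prompt is a Jinja2 template (the template below is a sketch, not the app's actual prompt):

```python
from haystack.components.builders import PromptBuilder

# Jinja2 template; `documents` and `query` are supplied by the pipeline at run time
template = """
Answer the question using only the context below.

Context:
{% for doc in documents %}
{{ doc.content }}
{% endfor %}

Question: {{ query }}
"""
prompt_builder = PromptBuilder(template=template)
```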
By combining Vector Search with Couchbase, RAG, and Haystack, the PDF Chat app can efficiently ingest PDF documents, convert their content into searchable embeddings, retrieve relevant information based on user queries and conversation context, and generate context-aware and informative responses using large language models.

On the Chat Area, the user can pose questions. These inquiries are processed by the RAG pipeline, which retrieves relevant context and generates an answer.

The first step is connecting to Couchbase. Couchbase Hyperscale Vector Search is required for PDF upload as well as during chat (for retrieval). We will use the Haystack **CouchbaseQueryDocumentStore** to connect to the Couchbase cluster with Hyperscale vector search support. The connection is established in the `get_document_store` function.
> **Note**: The same `CouchbaseQueryDocumentStore` configuration works with both **Hyperscale** and **Composite** vector indexes! You only need to change the SQL++ statement when creating the index.

The connection string and credentials are read from the environment variables. We perform some basic checks that the required environment variables are set in `secrets.toml`, and then proceed to connect to the Couchbase cluster using the [connect](https://docs.couchbase.com/python-sdk/current/hello-world/start-using-sdk.html#connect) method.
```python
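# NOTE: a minimal sketch of the connection logic, not the app's verbatim code.
# The secret key names (DB_CONN_STR, DB_USERNAME, DB_PASSWORD) are assumptions.
from datetime import timedelta

import streamlit as st
from couchbase.auth import PasswordAuthenticator
from couchbase.cluster import Cluster
from couchbase.options import ClusterOptions

auth = PasswordAuthenticator(st.secrets["DB_USERNAME"], st.secrets["DB_PASSWORD"])
cluster = Cluster(st.secrets["DB_CONN_STR"], ClusterOptions(auth))
cluster.wait_until_ready(timedelta(seconds=5))  # fail fast if the cluster is unreachable
```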

We will define the bucket, scope, and collection names from environment variables.
## Initialize Couchbase Vector Store

We will now initialize the `CouchbaseQueryDocumentStore`, which will be used for storing and retrieving document embeddings.

```python
# Initialize document store
document_store = get_document_store()
```
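The `get_document_store` helper plausibly wraps the store construction along these lines (a sketch only: the exact `CouchbaseQueryDocumentStore` constructor parameters are assumptions, so consult the [couchbase-haystack integration](https://haystack.deepset.ai/integrations/couchbase-document-store) for the real signature):

```python
import streamlit as st
from couchbase_haystack import CouchbasePasswordAuthenticator, CouchbaseQueryDocumentStore

def get_document_store():
    # Bucket/scope/collection names come from secrets.toml; the constructor
    # parameters below are assumptions, not verbatim library API
    return CouchbaseQueryDocumentStore(
        cluster_connection_string=st.secrets["DB_CONN_STR"],
        authenticator=CouchbasePasswordAuthenticator(
            username=st.secrets["DB_USERNAME"],
            password=st.secrets["DB_PASSWORD"],
        ),
        bucket=st.secrets["DB_BUCKET"],
        scope=st.secrets["DB_SCOPE"],
        collection=st.secrets["DB_COLLECTION"],
    )
```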

The indexing pipeline is created to handle the entire process of ingesting PDFs.

```python
from haystack import Pipeline
from haystack.components.converters import PyPDFToDocument
from haystack.components.preprocessors import DocumentCleaner, DocumentSplitter
from haystack.components.embedders import OpenAIDocumentEmbedder, OpenAITextEmbedder
from haystack.components.generators import OpenAIGenerator
from haystack.components.builders import PromptBuilder, AnswerBuilder
from haystack.components.writers import DocumentWriter
```
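To make the flow concrete, the indexing pipeline wiring looks roughly like this (component names, split sizes, and connections are illustrative, not the app's verbatim code):

```python
indexing_pipeline = Pipeline()
indexing_pipeline.add_component("converter", PyPDFToDocument())
indexing_pipeline.add_component("cleaner", DocumentCleaner())
indexing_pipeline.add_component(
    "splitter", DocumentSplitter(split_by="word", split_length=250, split_overlap=30)
)
indexing_pipeline.add_component("embedder", OpenAIDocumentEmbedder())  # reads OPENAI_API_KEY
indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store))

# PDF -> cleaned text -> chunks -> embeddings -> Couchbase
indexing_pipeline.connect("converter", "cleaner")
indexing_pipeline.connect("cleaner", "splitter")
indexing_pipeline.connect("splitter", "embedder")
indexing_pipeline.connect("embedder", "writer")
```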

We create a RAG (Retrieval-Augmented Generation) pipeline using Haystack components.

The OpenAIGenerator is a crucial component in our RAG pipeline, responsible for generating human-like responses based on the retrieved context and user questions. Here's a more detailed explanation of its configuration and role:
- API Key: The OpenAIGenerator uses the OPENAI_API_KEY from the environment variables to authenticate with the OpenAI API.
- Model: It's configured to use the "gpt-5" model, which is a powerful language model capable of understanding context and generating coherent, relevant responses.
- Role in the Pipeline: The OpenAIGenerator receives a prompt constructed by the PromptBuilder, which includes the user's question and relevant context retrieved from the vector store. It then generates a response based on this input.
- Integration: The generator's output is connected to the AnswerBuilder component, which formats the final response for display to the user.
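Putting these pieces together, the RAG pipeline wiring follows this general shape (a sketch: `retriever` stands in for the Couchbase embedding retriever from the couchbase-haystack integration, and `template` is the prompt template shown earlier; both are assumptions rather than verbatim app code):

```python
rag_pipeline = Pipeline()
rag_pipeline.add_component("text_embedder", OpenAITextEmbedder())  # embeds the user question
rag_pipeline.add_component("retriever", retriever)  # Couchbase embedding retriever (assumed)
rag_pipeline.add_component("prompt_builder", PromptBuilder(template=template))
rag_pipeline.add_component("generator", OpenAIGenerator(model="gpt-5"))
rag_pipeline.add_component("answer_builder", AnswerBuilder())

rag_pipeline.connect("text_embedder.embedding", "retriever.query_embedding")
rag_pipeline.connect("retriever.documents", "prompt_builder.documents")
rag_pipeline.connect("prompt_builder.prompt", "generator.prompt")
rag_pipeline.connect("generator.replies", "answer_builder.replies")
```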

# title and description do not need to be added to markdown, start with H2 (##)
title: Build PDF Chat App With Haystack, OpenAI and Couchbase Search Vector Index
short_title: Build PDF Chat App with Search Vector Index
description:
  - Construct a PDF Chat App with Haystack, Couchbase Python SDK, Couchbase Vector Search, and Streamlit.
  - Learn to upload PDFs into Couchbase Vector Store with Haystack.

Here, we are creating the index on the documents with the following configuration:

- **Vector field**: `embedding` with 1536 dimensions (matching OpenAI's text-embedding-ada-002/003 models)
- **Text field**: `content` for document text content
- **Metadata field**: `meta` with dynamic mapping to account for varying document structures
- **Similarity metric**: `dot_product` (optimized for OpenAI embeddings)