diff --git a/mistralai/fts/.env.sample b/mistralai/fts/.env.sample
new file mode 100644
index 00000000..46818b90
--- /dev/null
+++ b/mistralai/fts/.env.sample
@@ -0,0 +1,7 @@
+MISTRAL_API_KEY=
+CB_HOST=
+CB_USERNAME=
+CB_PASSWORD=
+CB_BUCKET_NAME=
+SCOPE_NAME=
+COLLECTION_NAME=
\ No newline at end of file
diff --git a/mistralai/frontmatter.md b/mistralai/fts/frontmatter.md
similarity index 64%
rename from mistralai/frontmatter.md
rename to mistralai/fts/frontmatter.md
index 13f44aab..a6d2c63a 100644
--- a/mistralai/frontmatter.md
+++ b/mistralai/fts/frontmatter.md
@@ -1,10 +1,10 @@
 ---
 # frontmatter
-path: "/tutorial-mistralai-couchbase-vector-search"
-title: Using Mistral AI Embeddings with Couchbase Vector Search
-short_title: Mistral AI with Couchbase Vector Search
+path: "/tutorial-mistralai-couchbase-vector-search-with-fts"
+title: Using Mistral AI Embeddings with Couchbase Vector Search and the FTS Service
+short_title: Mistral AI with Couchbase Vector Search (FTS)
 description:
-  - Learn how to generate embeddings using Mistral AI and store them in Couchbase.
+  - Learn how to generate embeddings using Mistral AI and store them in Couchbase using the FTS service.
   - This tutorial demonstrates how to use Couchbase's vector search capabilities with Mistral AI embeddings.
   - You'll understand how to perform vector search to find relevant documents based on similarity.
 content_type: tutorial
@@ -14,6 +14,7 @@ technology:
 tags:
   - Artificial Intelligence
   - Mistral AI
+  - FTS
 sdk_language:
   - python
 length: 30 Mins
diff --git a/mistralai/mistralai.ipynb b/mistralai/fts/mistralai.ipynb
similarity index 95%
rename from mistralai/mistralai.ipynb
rename to mistralai/fts/mistralai.ipynb
index 286d5dd1..8d3f418c 100644
--- a/mistralai/mistralai.ipynb
+++ b/mistralai/fts/mistralai.ipynb
@@ -7,6 +7,8 @@ "source": [
     "# Introduction\n",
     "\n",
+    "In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database and [Mistral AI](https://mistral.ai/) as the AI-powered embedding model. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system from scratch. Alternatively, if you want to perform semantic search using a GSI index, please take a look at [this tutorial](https://developer.couchbase.com//tutorial-mistralai-couchbase-vector-search-with-global-secondary-index).\n",
+    "\n",
     "Couchbase is a NoSQL distributed document database (JSON) with many of the best features of a relational DBMS: SQL, distributed ACID transactions, and much more. [Couchbase Capella™](https://cloud.couchbase.com/sign-up) is the easiest way to get started, but you can also download and run [Couchbase Server](http://couchbase.com/downloads) on-premises.\n",
     "\n",
     "Mistral AI is a research lab building the best open source models in the world. La Plateforme enables developers and enterprises to build new products and applications, powered by Mistral’s open source and commercial LLMs.\n",
\n", @@ -380,7 +382,7 @@ ], "metadata": { "kernelspec": { - "display_name": "Python 3 (ipykernel)", + "display_name": ".venv", "language": "python", "name": "python3" }, @@ -394,7 +396,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.13.0" + "version": "3.13.3" } }, "nbformat": 4, diff --git a/mistralai/mistralai_index.json b/mistralai/fts/mistralai_index.json similarity index 100% rename from mistralai/mistralai_index.json rename to mistralai/fts/mistralai_index.json diff --git a/mistralai/gsi/.env.sample b/mistralai/gsi/.env.sample new file mode 100644 index 00000000..46818b90 --- /dev/null +++ b/mistralai/gsi/.env.sample @@ -0,0 +1,7 @@ +MISTRAL_API_KEY= +CB_HOST= +CB_USERNAME= +CB_PASSWORD= +CB_BUCKET_NAME= +SCOPE_NAME= +COLLECTION_NAME= \ No newline at end of file diff --git a/mistralai/gsi/frontmatter.md b/mistralai/gsi/frontmatter.md new file mode 100644 index 00000000..fe55b4ac --- /dev/null +++ b/mistralai/gsi/frontmatter.md @@ -0,0 +1,21 @@ +--- +# frontmatter +path: "/tutorial-mistralai-couchbase-vector-search-with-global-secondary-index" +title: Using Mistral AI Embeddings using GSI Index +short_title: Mistral AI with Couchbase GSI Index +description: + - Learn how to generate embeddings using Mistral AI and store them in Couchbase using GSI. + - This tutorial demonstrates how to use Couchbase's GSI index capabilities with Mistral AI embeddings. + - You'll understand how to perform optimized vector search using Global Secondary Index for better performance. +content_type: tutorial +filter: sdk +technology: + - vector search +tags: + - Artificial Intelligence + - Mistral AI + - GSI +sdk_language: + - python +length: 30 Mins +--- \ No newline at end of file diff --git a/mistralai/gsi/mistralai.ipynb b/mistralai/gsi/mistralai.ipynb new file mode 100644 index 00000000..0e533ff8 --- /dev/null +++ b/mistralai/gsi/mistralai.ipynb @@ -0,0 +1,577 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Introduction\n", + "\n", + "In this guide, we will walk you through building a powerful semantic search engine using Couchbase as the backend database, [Mistral AI](https://mistral.ai/) as the AI-powered embedding Model. Semantic search goes beyond simple keyword matching by understanding the context and meaning behind the words in a query, making it an essential tool for applications that require intelligent information retrieval. This tutorial is designed to be beginner-friendly, with clear, step-by-step instructions that will equip you with the knowledge to create a fully functional semantic search system from scratch. Alternatively, if you want to perform semantic search using the FTS, please take a look at [this.](https://developer.couchbase.com//tutorial-mistralai-couchbase-vector-search-with-fts)\n", + "\n", + "Couchbase is a NoSQL distributed document database (JSON) with many of the best features of a relational DBMS: SQL, distributed ACID transactions, and much more. [Couchbase Capella™](https://cloud.couchbase.com/sign-up) is the easiest way to get started, but you can also download and run [Couchbase Server](http://couchbase.com/downloads) on-premises.\n", + "\n", + "Mistral AI is a research lab building the best open source models in the world. La Plateforme enables developers and enterprises to build new products and applications, powered by Mistral's open source and commercial LLMs. 
\n", + "\n", + "The [Mistral AI APIs](https://console.mistral.ai/) empower LLM applications via:\n", + "\n", + "- [Text generation](https://docs.mistral.ai/capabilities/completion/), enables streaming and provides the ability to display partial model results in real-time\n", + "- [Code generation](https://docs.mistral.ai/capabilities/code_generation/), enpowers code generation tasks, including fill-in-the-middle and code completion\n", + "- [Embeddings](https://docs.mistral.ai/capabilities/embeddings/), useful for RAG where it represents the meaning of text as a list of numbers\n", + "- [Function calling](https://docs.mistral.ai/capabilities/function_calling/), enables Mistral models to connect to external tools\n", + "- [Fine-tuning](https://docs.mistral.ai/capabilities/finetuning/), enables developers to create customized and specilized models\n", + "- [JSON mode](https://docs.mistral.ai/capabilities/json_mode/), enables developers to set the response format to json_object\n", + "- [Guardrailing](https://docs.mistral.ai/capabilities/guardrailing/), enables developers to enforce policies at the system level of Mistral models\n", + "\n", + "This tutorial demonstrates how to use Mistral AI's embedding capabilities with Couchbase's **Global Secondary Index (GSI)** for optimized vector search operations. GSI provides superior performance for vector operations compared to traditional search methods, especially for large-scale applications.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# How to run this tutorial\n", + "\n", + "This tutorial is available as a Jupyter Notebook (`.ipynb` file) that you can run interactively. You can access the original notebook [here](https://github.com/couchbase-examples/vector-search-cookbook/blob/main/mistralai/gsi/mistralai.ipynb).\n", + "\n", + "You can either download the notebook file and run it on [Google Colab](https://colab.research.google.com/) or run it on your system by setting up the Python environment.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Before you start\n", + "\n", + "## Get Credentials for Mistral AI\n", + "\n", + "Please follow the [instructions](https://console.mistral.ai/api-keys/) to generate the Mistral AI credentials.\n", + "\n", + "## Create and Deploy Your Free Tier Operational cluster on Capella\n", + "\n", + "To get started with Couchbase Capella, create an account and use it to deploy a forever free tier operational cluster. 
+    "\n",
+    "To know more, please follow the [instructions](https://docs.couchbase.com/cloud/get-started/create-account.html).\n",
+    "\n",
+    "**Note: To run this tutorial, you will need Capella with Couchbase Server version 8.0 or above, as GSI vector search is supported only from version 8.0.**\n",
+    "\n",
+    "### Couchbase Capella Configuration\n",
+    "\n",
+    "When running Couchbase using [Capella](https://cloud.couchbase.com/sign-in), the following prerequisites need to be met.\n",
+    "\n",
+    "* Create the [database credentials](https://docs.couchbase.com/cloud/clusters/manage-database-users.html) to access the bucket (Read and Write) used in the application.\n",
+    "* [Allow access](https://docs.couchbase.com/cloud/clusters/allow-ip-address.html) to the Cluster from the IP on which the application is running.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Install necessary libraries\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Requirement already satisfied: couchbase==4.4.0 in ./.venv/lib/python3.13/site-packages (4.4.0)\n",
+      "Requirement already satisfied: mistralai==1.9.10 in ./.venv/lib/python3.13/site-packages (1.9.10)\n",
+      "Requirement already satisfied: langchain-couchbase==0.5.0rc1 in ./.venv/lib/python3.13/site-packages (0.5.0rc1)\n",
+      "Collecting langchain-core==0.3.76\n",
+      "  Using cached langchain_core-0.3.76-py3-none-any.whl.metadata (3.7 kB)\n",
+      "Collecting langsmith>=0.3.45 (from langchain-core==0.3.76)\n",
+      "  Using cached langsmith-0.4.30-py3-none-any.whl.metadata (14 kB)\n",
+      "Using cached langchain_core-0.3.76-py3-none-any.whl (447 kB)\n",
+      "Using cached langsmith-0.4.30-py3-none-any.whl (386 kB)\n",
+      "Installing collected packages: langsmith, langchain-core\n",
+      "  Attempting uninstall: langsmith\n",
+      "    Found existing installation: langsmith 0.2.11\n",
+      "    Uninstalling langsmith-0.2.11:\n",
+      "      Successfully uninstalled langsmith-0.2.11\n",
+      "  Attempting uninstall: langchain-core\n",
+      "    Found existing installation: langchain-core 0.3.28\n",
+      "    Uninstalling langchain-core-0.3.28:\n",
+      "      Successfully uninstalled langchain-core-0.3.28\n",
+      "Successfully installed langchain-core-0.3.76 langsmith-0.4.30\n",
+      "Note: you may need to restart the kernel to use updated packages.\n"
+     ]
+    }
+   ],
+   "source": [
+    "%pip install couchbase==4.4.0 mistralai==1.9.10 langchain-couchbase==0.5.0rc1 langchain-core==0.3.76 python-dotenv==1.1.1\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Imports\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from datetime import timedelta\n",
+    "from mistralai import Mistral\n",
+    "from couchbase.auth import PasswordAuthenticator\n",
+    "from couchbase.cluster import Cluster\n",
+    "from couchbase.options import ClusterOptions\n",
+    "from langchain_couchbase.vectorstores import CouchbaseQueryVectorStore\n",
+    "from langchain_couchbase.vectorstores import DistanceStrategy, IndexType\n",
+    "from langchain_core.embeddings import Embeddings\n",
+    "from typing import List\n",
+    "from dotenv import load_dotenv\n",
+    "import os\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Prerequisites\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import getpass\n",
+    "\n",
+    "# Load environment variables from .env file if it exists\n",
+    "load_dotenv()\n",
+    "\n",
+    "# Load from environment variables or prompt for input\n",
+    "couchbase_cluster_url = os.getenv('COUCHBASE_CLUSTER_URL') or input(\"Cluster URL:\")\n",
+    "couchbase_username = os.getenv('COUCHBASE_USERNAME') or input(\"Couchbase username:\")\n",
+    "couchbase_password = os.getenv('COUCHBASE_PASSWORD') or getpass.getpass(\"Couchbase password:\")\n",
+    "couchbase_bucket = os.getenv('COUCHBASE_BUCKET') or input(\"Couchbase bucket:\")\n",
+    "couchbase_scope = os.getenv('COUCHBASE_SCOPE') or input(\"Couchbase scope:\")\n",
+    "couchbase_collection = os.getenv('COUCHBASE_COLLECTION') or input(\"Couchbase collection:\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Couchbase Connection\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "auth = PasswordAuthenticator(\n",
+    "    couchbase_username,\n",
+    "    couchbase_password\n",
+    ")\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "cluster = Cluster(couchbase_cluster_url, ClusterOptions(auth))\n",
+    "cluster.wait_until_ready(timedelta(seconds=5))\n",
+    "\n",
+    "bucket = cluster.bucket(couchbase_bucket)\n",
+    "scope = bucket.scope(couchbase_scope)\n",
+    "collection = scope.collection(couchbase_collection)\n"
+   ]
+  },
+ "# Load from environment variables or prompt for input\n", + "couchbase_cluster_url = os.getenv('COUCHBASE_CLUSTER_URL') or input(\"Cluster URL:\")\n", + "couchbase_username = os.getenv('COUCHBASE_USERNAME') or input(\"Couchbase username:\")\n", + "couchbase_password = os.getenv('COUCHBASE_PASSWORD') or getpass.getpass(\"Couchbase password:\")\n", + "couchbase_bucket = os.getenv('COUCHBASE_BUCKET') or input(\"Couchbase bucket:\")\n", + "couchbase_scope = os.getenv('COUCHBASE_SCOPE') or input(\"Couchbase scope:\")\n", + "couchbase_collection = os.getenv('COUCHBASE_COLLECTION') or input(\"Couchbase collection:\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Couchbase Connection\n" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "auth = PasswordAuthenticator(\n", + " couchbase_username,\n", + " couchbase_password\n", + ")\n" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "cluster = Cluster(couchbase_cluster_url, ClusterOptions(auth))\n", + "cluster.wait_until_ready(timedelta(seconds=5))\n", + "\n", + "bucket = cluster.bucket(couchbase_bucket)\n", + "scope = bucket.scope(couchbase_scope)\n", + "collection = scope.collection(couchbase_collection)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Creating Mistral AI Embeddings Wrapper\n", + "\n", + "Since Mistral AI doesn't have native LangChain integration, we need to create a custom wrapper class that implements the LangChain Embeddings interface. This will allow us to use Mistral AI's embedding model with Couchbase's GSI vector store.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "class MistralAIEmbeddings(Embeddings):\n", + " \"\"\"Custom Mistral AI Embeddings wrapper for LangChain compatibility.\"\"\"\n", + " \n", + " def __init__(self, api_key: str, model: str = \"mistral-embed\"):\n", + " self.client = Mistral(api_key=api_key)\n", + " self.model = model\n", + " \n", + " def embed_documents(self, texts: List[str]) -> List[List[float]]:\n", + " \"\"\"Embed search docs.\"\"\"\n", + " try:\n", + " response = self.client.embeddings.create(\n", + " model=self.model,\n", + " inputs=texts,\n", + " )\n", + " return [embedding.embedding for embedding in response.data]\n", + " except Exception as e:\n", + " raise ValueError(f\"Error generating embeddings: {str(e)}\")\n", + " \n", + " def embed_query(self, text: str) -> List[float]:\n", + " \"\"\"Embed query text.\"\"\"\n", + " try:\n", + " response = self.client.embeddings.create(\n", + " model=self.model,\n", + " inputs=[text],\n", + " )\n", + " return response.data[0].embedding\n", + " except Exception as e:\n", + " raise ValueError(f\"Error generating query embedding: {str(e)}\")\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Mistral Connection\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "MISTRAL_API_KEY = os.getenv('MISTRAL_API_KEY') or getpass.getpass(\"Mistral API Key:\")\n", + "embeddings = MistralAIEmbeddings(api_key=MISTRAL_API_KEY, model=\"mistral-embed\")\n", + "mistral_client = Mistral(api_key=MISTRAL_API_KEY)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Setting Up Couchbase GSI Vector Store\n", + "\n", + "Instead of using FTS (Full-Text Search), we'll use Couchbase's GSI (Global Secondary 
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Setting Up Couchbase GSI Vector Store\n",
+    "\n",
+    "Instead of using FTS (Full-Text Search), we'll use Couchbase's GSI (Global Secondary Index) for vector operations. GSI provides better performance for vector search operations and supports advanced index types like BHIVE and COMPOSITE indexes.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "GSI Vector Store created successfully!\n"
+     ]
+    }
+   ],
+   "source": [
+    "vector_store = CouchbaseQueryVectorStore(\n",
+    "    cluster=cluster,\n",
+    "    bucket_name=couchbase_bucket,\n",
+    "    scope_name=couchbase_scope,\n",
+    "    collection_name=couchbase_collection,\n",
+    "    embedding=embeddings,\n",
+    "    distance_metric=DistanceStrategy.COSINE\n",
+    ")\n",
+    "\n",
+    "print(\"GSI Vector Store created successfully!\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Embedding Documents\n",
+    "\n",
+    "The Mistral client can be used to generate vector embeddings for given text fragments. These embeddings capture the semantic meaning of the corresponding fragments and can be stored in Couchbase for later retrieval. You can also add a custom text of your own to the array of embedded texts when running this code block:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Documents added to GSI vector store successfully!\n"
+     ]
+    }
+   ],
+   "source": [
+    "texts = [\n",
+    "    \"Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON's versatility, with a foundation that is extremely fast and scalable.\",\n",
+    "    \"It's used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.\",\n",
+    "    input(\"custom embedding text\")\n",
+    "]\n",
+    "\n",
+    "# Store documents in the GSI vector store\n",
+    "vector_store.add_texts(texts)\n",
+    "\n",
+    "print(\"Documents added to GSI vector store successfully!\")\n"
+   ]
+  },
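+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "If you plan to filter search results later (for example, with the COMPOSITE index type introduced below), you can attach scalar metadata to each document when storing it. The commented-out sketch uses the standard LangChain `add_texts(texts, metadatas=...)` signature; the `topic` field is purely illustrative:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Optional: store documents together with scalar metadata for later filtered searches.\n",
+    "# docs_with_meta = [\n",
+    "#     \"Couchbase supports SQL++ queries over JSON documents.\",\n",
+    "#     \"Capella is Couchbase's fully managed cloud service.\",\n",
+    "# ]\n",
+    "# metadatas = [{\"topic\": \"query\"}, {\"topic\": \"cloud\"}]  # hypothetical metadata fields\n",
+    "# vector_store.add_texts(docs_with_meta, metadatas=metadatas)\n"
+   ]
+  },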
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Creating GSI Vector Index for Optimal Performance\n",
+    "\n",
+    "GSI supports different types of vector indexes for optimal performance:\n",
+    "\n",
+    "- **BHIVE (Hyperscale Vector Index)**: Best for pure vector searches with high performance and a low memory footprint\n",
+    "- **COMPOSITE**: Best for filtered vector searches that combine vector similarity with scalar filtering\n",
+    "\n",
+    "Let's create a BHIVE index for our use case:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "BHIVE index created successfully!\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Create a BHIVE index for optimal vector search performance\n",
+    "vector_store.create_index(\n",
+    "    index_type=IndexType.BHIVE,\n",
+    "    index_name=\"mistral_bhive_index\",\n",
+    "    index_description=\"IVF,SQ8\"\n",
+    ")\n",
+    "\n",
+    "print(\"BHIVE index created successfully!\")\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Searching For Embeddings with GSI\n",
+    "\n",
+    "Now we can search using GSI vector operations, which provide better performance than traditional FTS methods. The GSI vector store handles the embedding generation and similarity search internally:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "GSI Vector Search completed in 1.3109 seconds\n",
+      "------------------------------------------------------------\n",
+      "Result 1:\n",
+      "Distance: 0.286969\n",
+      "Text: Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational databases such as SQL and ACID transactions with JSON's versatility, with a foundation that is extremely fast and scalable.\n",
+      "------------------------------------------------------------\n",
+      "Result 2:\n",
+      "Distance: 0.348376\n",
+      "Text: It's used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vector search, high-speed caching, and much more.\n",
+      "------------------------------------------------------------\n",
+      "Result 3:\n",
+      "Distance: 0.436688\n",
+      "Text: \n",
+      "------------------------------------------------------------\n"
+     ]
+    }
+   ],
+   "source": [
+    "import time\n",
+    "\n",
+    "# Test query\n",
+    "query = \"name a multipurpose database with distributed capability\"\n",
+    "\n",
+    "# Perform GSI-optimized similarity search\n",
+    "start_time = time.time()\n",
+    "search_results = vector_store.similarity_search_with_score(query, k=3)\n",
+    "search_time = time.time() - start_time\n",
+    "\n",
+    "print(f\"GSI Vector Search completed in {search_time:.4f} seconds\")\n",
+    "print(\"-\" * 60)\n",
+    "\n",
+    "for i, (doc, distance) in enumerate(search_results):\n",
+    "    print(f\"Result {i+1}:\")\n",
+    "    print(f\"Distance: {distance:.6f}\")\n",
+    "    print(f\"Text: {doc.page_content}\")\n",
+    "    print(\"-\" * 60)\n"
+   ]
+  },
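+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Note that `similarity_search_with_score` returns a distance, not a similarity: with `DistanceStrategy.COSINE`, lower values mean closer matches. If you prefer an intuitive similarity score, cosine distance can be converted with `similarity = 1 - distance`. A minimal sketch, assuming the cosine-distance convention used above:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Convert cosine distances into similarity scores (closer to 1.0 = more similar).\n",
+    "for doc, distance in search_results:\n",
+    "    similarity = 1 - distance\n",
+    "    print(f\"Similarity: {similarity:.4f} | {doc.page_content[:80]}\")\n"
+   ]
+  },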
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# GSI Performance Benefits\n",
+    "\n",
+    "The GSI approach provides several advantages over traditional FTS methods:\n",
+    "\n",
+    "1. **Better Performance**: GSI vector operations are optimized for similarity search\n",
+    "2. **Scalability**: BHIVE indexes can handle billions of vectors efficiently\n",
+    "3. **Memory Optimization**: Lower memory footprint compared to FTS\n",
+    "4. **Concurrent Operations**: Supports simultaneous searches and inserts\n",
+    "5. **Advanced Configuration**: Configurable centroids and quantization options\n",
+    "\n",
+    "Let's test with multiple queries to see the performance:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "GSI Vector Search Performance Tests:\n",
+      "============================================================\n",
+      "Query 1: fast and scalable database solution\n",
+      "Search Time: 1.2303 seconds\n",
+      "Best Match (Distance: 0.278119): Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational data...\n",
+      "------------------------------------------------------------\n",
+      "Query 2: JSON document database with SQL support\n",
+      "Search Time: 0.4636 seconds\n",
+      "Best Match (Distance: 0.182554): Couchbase Server is a multipurpose, distributed database that fuses the strengths of relational data...\n",
+      "------------------------------------------------------------\n",
+      "Query 3: high-speed caching for applications\n",
+      "Search Time: 0.4016 seconds\n",
+      "Best Match (Distance: 0.261582): It's used across industries for things like user profiles, dynamic product catalogs, GenAI apps, vec...\n",
+      "------------------------------------------------------------\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Test multiple queries to demonstrate GSI performance\n",
+    "test_queries = [\n",
+    "    \"fast and scalable database solution\",\n",
+    "    \"JSON document database with SQL support\",\n",
+    "    \"high-speed caching for applications\"\n",
+    "]\n",
+    "\n",
+    "print(\"GSI Vector Search Performance Tests:\")\n",
+    "print(\"=\" * 60)\n",
+    "\n",
+    "for i, query in enumerate(test_queries, 1):\n",
+    "    start_time = time.time()\n",
+    "    results = vector_store.similarity_search_with_score(query, k=1)\n",
+    "    search_time = time.time() - start_time\n",
+    "\n",
+    "    print(f\"Query {i}: {query}\")\n",
+    "    print(f\"Search Time: {search_time:.4f} seconds\")\n",
+    "    if results:\n",
+    "        doc, distance = results[0]\n",
+    "        print(f\"Best Match (Distance: {distance:.6f}): {doc.page_content[:100]}...\")\n",
+    "    print(\"-\" * 60)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Additional GSI Index Configuration (Optional)\n",
+    "\n",
+    "For more complex use cases, you can also create a COMPOSITE index, which combines vector search with scalar filtering:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Optional: Create a COMPOSITE index for filtered vector searches\n",
+    "# vector_store.create_index(\n",
+    "#     index_type=IndexType.COMPOSITE,\n",
+    "#     index_name=\"mistral_composite_index\",\n",
+    "#     index_description=\"(type, vector_embedding)\"\n",
+    "# )"
+   ]
+  },
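+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "To verify which indexes now exist on the bucket, you can query the `system:indexes` keyspace with SQL++ through the SDK's `cluster.query()` API. A minimal sketch, assuming the index names created earlier in this notebook:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from couchbase.options import QueryOptions\n",
+    "\n",
+    "# List indexes on this bucket to confirm the BHIVE (and optional COMPOSITE) index exists.\n",
+    "result = cluster.query(\n",
+    "    \"SELECT idx.name, idx.state FROM system:indexes AS idx \"\n",
+    "    \"WHERE idx.bucket_id = $bucket OR idx.keyspace_id = $bucket\",\n",
+    "    QueryOptions(named_parameters={\"bucket\": couchbase_bucket}),\n",
+    ")\n",
+    "for row in result:\n",
+    "    print(row)\n"
+   ]
+  },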
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Conclusion\n",
+    "\n",
+    "This tutorial demonstrated how to use Mistral AI's embedding capabilities with Couchbase's GSI vector search for optimal performance. Key benefits of this approach include:\n",
+    "\n",
+    "1. **GSI Performance**: Faster vector operations compared to traditional search methods\n",
+    "2. **Mistral AI Integration**: A powerful embedding model with a custom LangChain wrapper\n",
+    "3. **Scalability**: BHIVE indexes handle large-scale vector operations efficiently\n",
+    "4. **Flexibility**: Support for both BHIVE and COMPOSITE index types\n",
+    "\n",
+    "The GSI approach provides superior performance for vector search operations, making it ideal for production applications requiring fast semantic search capabilities.\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": ".venv",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.13.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}