diff --git a/nemo/Evaluator/end2end_tutorial.ipynb b/nemo/Evaluator/end2end_tutorial.ipynb new file mode 100644 index 000000000..b68167e10 --- /dev/null +++ b/nemo/Evaluator/end2end_tutorial.ipynb @@ -0,0 +1,4478 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "b9268128-30cd-4f08-a718-1f8e73a980e7", + "metadata": {}, + "source": [ + "# NeMo Evaluator Microservice Tutorial" + ] + }, + { + "cell_type": "markdown", + "id": "d52988a9-7676-44ab-9f53-7c4d6dd0d494", + "metadata": {}, + "source": [ + "## About this tutorial\n", + "\n", + "In this tutorial, we will cover the following evaluation types using NeMo Evaluator:\n", + "\n", + "- **Agentic Evaluation**\n", + "- **LLM Evaluation on Academic Benchmarks**\n", + "- **Custom Evaluations**\n", + " - **Similarity Metrics Evaluation**\n", + " - **LLM-as-Judge Evaluation**\n", + " - **Tool Calling Evaluation**\n", + "- **Retriever Pipeline Evaluation**\n", + "- **RAG Pipeline Evaluation**" + ] + }, + { + "cell_type": "markdown", + "id": "6a052618-c8b6-424b-8d7b-d00c26346296", + "metadata": {}, + "source": [ + "## 1. Prerequisites" + ] + }, + { + "cell_type": "markdown", + "id": "e3541975-e515-4909-af80-9c59a807ccc1", + "metadata": {}, + "source": [ + "### 1.1 Install NeMo Microservices" + ] + }, + { + "cell_type": "markdown", + "id": "e03e7c89-2b52-4eab-b41e-bf1e9b7156d3", + "metadata": {}, + "source": [ + "While this tutorial focuses on NeMo Evaluator, we recommend installing the NeMo Microservices platform using the [NeMo Microservices Platform Helm Chart](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nemo-microservices/helm-charts/nemo-microservices-helm-chart) to avoid manually managing dependencies. For a step-by-step installation guide, refer to the [Demo Cluster Setup Guide](https://docs.nvidia.com/nemo/microservices/latest/get-started/setup/index.html). This tutorial was run on a single-node cluster with two A100 GPUs." 
+ ] + }, + { + "cell_type": "markdown", + "id": "9d728ee2-9ef9-49ac-b992-c5cf9ecf8ae3", + "metadata": {}, + "source": [ + "Check the pods to make sure all required microservices are running before proceeding." + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "278624c3-13ba-4305-9141-945d6ee3b6a1", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "NAME READY STATUS RESTARTS AGE\n", + "model-downloader-meta-llama-3-1-8b-instruct-2-0-28trx 0/1 Completed 0 10m\n", + "model-downloader-meta-llama-3-2-1b-instruct-2-0-b6scn 0/1 Completed 0 10m\n", + "modeldeployment-meta-llama-3-1-8b-instruct-6b64d56fdc-slctn 1/1 Running 0 9m26s\n", + "nemo-argo-workflows-server-655f8d755-svgn2 1/1 Running 0 12m\n", + "nemo-argo-workflows-workflow-controller-8f8877cd4-8t2tf 1/1 Running 0 12m\n", + "nemo-customizer-5d8554fcf6-rhwfp 1/1 Running 2 (11m ago) 12m\n", + "nemo-customizerdb-0 1/1 Running 0 12m\n", + "nemo-data-store-795ccbb97b-nwcf2 1/1 Running 0 12m\n", + "nemo-deployment-management-646cc67c-l67lq 1/1 Running 0 12m\n", + "nemo-entity-store-6bfd7c4bd9-p4s4b 1/1 Running 0 12m\n", + "nemo-entity-storedb-0 1/1 Running 0 12m\n", + "nemo-evaluator-68c67f6798-f4sqh 1/1 Running 0 12m\n", + "nemo-evaluatordb-0 1/1 Running 0 12m\n", + "nemo-guardrails-5985f8c6d4-bkkq2 1/1 Running 0 12m\n", + "nemo-milvus-standalone-57cc6c46fc-5rctk 1/1 Running 0 12m\n", + "nemo-nemo-operator-controller-manager-59c98d97f9-7gfsw 2/2 Running 0 12m\n", + "nemo-nim-operator-765cbf686b-w9n6c 1/1 Running 0 12m\n", + "nemo-nim-proxy-6fbccbfb45-8qvmz 1/1 Running 0 12m\n", + "nemo-opentelemetry-collector-6f46646fd9-5tnrf 1/1 Running 0 12m\n", + "nemo-postgresql-0 1/1 Running 0 12m\n" + ] + } + ], + "source": [ + "! 
kubectl get pods" + ] + }, + { + "cell_type": "markdown", + "id": "58628300-b5c1-4577-ab0c-c2bc92e60611", + "metadata": {}, + "source": [ + "Install `huggingface_hub` which is required to interact with NeMo Data Store:\n", + "\n", + "```bash\n", + "pip install -U \"huggingface_hub[cli]\"\n", + "```\n" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "f49ed4c1-25be-4c52-9fa4-154b88729de8", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "/home/ubuntu/.local/lib/python3.10/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n", + " from .autonotebook import tqdm as notebook_tqdm\n" + ] + } + ], + "source": [ + "import requests\n", + "import json\n", + "import os\n", + "from pprint import pp\n", + "from huggingface_hub import HfApi" + ] + }, + { + "cell_type": "markdown", + "id": "9159cb76-5239-4b31-ba6b-673562d46e9f", + "metadata": {}, + "source": [ + "Specify the Namespace and API endpoints:" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "1ee4a79f-eadf-4c1c-aef0-b59a30a61bf9", + "metadata": {}, + "outputs": [], + "source": [ + "NDS_URL = \"http://data-store.test\" # Data Store\n", + "NEMO_URL = \"http://nemo.test\" # Customizer, Entity Store, Evaluator\n", + "NIM_URL = \"http://nim.test\" # NIM Proxy\n", + "NMS_NAMESPACE = \"nemo-eval-tutorial\"\n", + "\n", + "target_url = f\"{NEMO_URL}/v1/evaluation/targets\"\n", + "config_url = f\"{NEMO_URL}/v1/evaluation/configs\"\n", + "job_url = f\"{NEMO_URL}/v1/evaluation/jobs\"\n", + "llm_chat_completion_url = f\"{NIM_URL}/v1/chat/completions\"" + ] + }, + { + "cell_type": "markdown", + "id": "e3e27807-e8c6-44cf-9183-c77c7d0390cf", + "metadata": {}, + "source": [ + "### 1.2 Deploy NIM for LLMs\n", + "\n", + "This tutorial will use the `Llama-3.1-8b-instruct` model as the LLM to be evaluated. 
You can either deploy a `Llama-3.1-8b-instruct` NIM locally or use a remotely hosted NIM. As part of the NeMo Microservices Platform, the NeMo Deployment Management service provides an API to deploy NIMs on a Kubernetes cluster and manage them through the NIM Operator microservice. The cells below show how to deploy the `Llama-3.1-8b-instruct` NIM and run inference using the NIM Proxy service.\n", + "\n", + "**Note**: If you see a pod named `modeldeployment-meta-llama-3-1-8b-instruct-xxx` in the list of pods above, then a Llama-3.1-8b-instruct NIM has already been deployed and you can skip this step." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "93deb6ef-a033-4243-8824-8bf74830271e", + "metadata": {}, + "outputs": [], + "source": [ + "deployment_url = f\"{NEMO_URL}/v1/deployment/model-deployments\"\n", + "\n", + "payload = {\n", + " \"name\": \"llama-3.1-8b-instruct\",\n", + " \"namespace\": \"meta\",\n", + " \"config\": {\n", + " \"model\": \"meta/llama-3.1-8b-instruct\",\n", + " \"nim_deployment\": {\n", + " \"image_name\": \"nvcr.io/nim/meta/llama-3.1-8b-instruct\",\n", + " \"image_tag\": \"1.8.3\",\n", + " \"pvc_size\": \"25Gi\",\n", + " \"gpu\": 1,\n", + " \"additional_envs\": {\n", + " \"NIM_GUIDED_DECODING_BACKEND\": \"outlines\"\n", + " }\n", + " }\n", + " }\n", + "}\n", + "\n", + "headers = {\n", + " \"Content-Type\": \"application/json\"\n", + "}\n", + "\n", + "resp = requests.post(deployment_url, json=payload, headers=headers)\n", + "pp(resp.json())" + ] + }, + { + "cell_type": "markdown", + "id": "9fa7ceb9-7644-4110-8e76-0e75db6deb50", + "metadata": {}, + "source": [ + "Check the deployment status and make sure the status is 'ready' before proceeding." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "d2e582eb-b90e-4286-b0f5-ff283c4f06ca", + "metadata": {}, + "outputs": [], + "source": [ + "resp = requests.get(f\"{NEMO_URL}/v1/deployment/model-deployments/meta/llama-3.1-8b-instruct\")\n", + "pp(resp.json())" + ] + }, + { + "cell_type": "markdown", + "id": "dd959837-6507-4680-a4ac-4f4dcf53bbc6", + "metadata": {}, + "source": [ + "We can now test LLM inference against the NIM endpoint." + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "61ead8b5-fb62-4c1b-97e4-a8e607746c33", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'role': 'assistant',\n", + " 'content': 'LLM (Large Language Model) token log probabilities are used to '\n", + " \"represent the model's confidence in its predictions, with lower \"\n", + " 'probabilities indicating less likely or less confident '\n", + " 'predictions.'}\n" + ] + } + ], + "source": [ + "payload = {\n", + " \"model\": \"meta/llama-3.1-8b-instruct\",\n", + " \"messages\": [\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": \"What is the purpose of LLM token log probabilities? Answer with a single sentence.\"\n", + " }\n", + " ],\n", + " \"stream\": False,\n", + " \"temperature\": 0.0\n", + "}\n", + "\n", + "headers = {\n", + " \"Content-Type\": \"application/json\",\n", + "}\n", + "\n", + "resp = requests.post(llm_chat_completion_url, json=payload, headers=headers)\n", + "pp(resp.json()['choices'][0]['message'])" + ] + }, + { + "cell_type": "markdown", + "id": "fcffc6e6-2f09-417a-a35b-520a52401685", + "metadata": {}, + "source": [ + "### 1.3 Set Up API Key and Access Token\n", + "\n", + "This tutorial uses a remote LLM hosted on [build.nvidia.com](https://build.nvidia.com/) for evaluations that require an LLM as a judge. For this, we will need to set up the API key to access the models. 
You can generate an NVIDIA API key at [Manage API Keys](https://build.nvidia.com/settings/api-keys)." + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "e355d4d7-6d24-4a3d-99e4-476916696cc0", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Enter your NVIDIA API Key ········\n" + ] + } + ], + "source": [ + "from getpass import getpass\n", + "\n", + "os.environ['NVIDIA_API_KEY'] = getpass(\"Enter your NVIDIA API Key\")" + ] + }, + { + "cell_type": "markdown", + "id": "05d2e5d0-5169-4401-9c82-5bec2c0f8eb3", + "metadata": {}, + "source": [ + "Next we need to set up the Hugging Face Access Token which needs to have access to Meta's Llama-3.1-8B-instruct model." + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "efe54825-6497-4edb-b1ac-3289a8fa9cbf", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Enter your Hugging Face Token ········\n" + ] + } + ], + "source": [ + "os.environ['HF_Token'] = getpass(\"Enter your Hugging Face Token\")" + ] + }, + { + "cell_type": "markdown", + "id": "c67c1f20", + "metadata": {}, + "source": [ + "### 1.4 Running Evaluation Jobs with NeMo Evaluator\n", + "\n", + "Before running evaluations, it is important to understand the typical NeMo Evaluator Workflow:\n", + "\n", + "1.\t(Optional) Upload your custom dataset to the NeMo Data Store if you’re not using a built-in dataset.\n", + "2.\tCreate an evaluation configuration in NeMo Evaluator.\n", + "3.\tDefine an evaluation target (the model to evaluate).\n", + "4.\tSubmit an evaluation job to NeMo Evaluator. The following steps occur automatically:\n", + "\n", + "\ta. NeMo Evaluator retrieves any required custom data from the NeMo Data Store.\n", + "\n", + "\tb. It runs inference using NIM, supporting LLMs, embeddings, and reranking tasks.\n", + "\n", + "\tc. 
Results, including generations, logs, and metrics, are written to the NeMo Data Store.\n", + "\n", + "\td. The results are returned to the user.\n", + "\n", + "5.\tReview evaluation results." + ] + }, + { + "cell_type": "markdown", + "id": "6aa8733b-8f48-4448-bf0d-73cd19a5c19e", + "metadata": {}, + "source": [ + "## 2. Agentic Evaluation\n", + "\n", + "Agentic evaluation uses RAGAS metrics to score agent outputs. RAGAS is a library for evaluating retrieval-augmented generation and agentic workflows using standardized, research-backed metrics.\n", + "\n", + "Each task contains a set of metrics relevant to the agentic evaluation, such as topic adherence, tool call accuracy, agent goal accuracy, or answer accuracy, depending on the metric selected in the job configuration." + ] + }, + { + "cell_type": "markdown", + "id": "65cfece4-764b-415f-a980-be35dc3eb259", + "metadata": {}, + "source": [ + "### 2.1 Upload Custom Data to NeMo Data Store\n", + "\n", + "Before uploading the data, we first need to create a namespace in NeMo Data Store and Entity Store." + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "1e8e44cd-d0fc-408e-bf24-120dab22bf58", + "metadata": {}, + "outputs": [], + "source": [ + "from helpers import create_namespaces, setup_dataset_repo\n", + "\n", + "DATASET_NAME = \"agent_eval\"\n", + "create_namespaces(NEMO_URL, NDS_URL, NMS_NAMESPACE)\n", + "HF_API = HfApi(endpoint=f\"{NDS_URL}/v1/hf\", token=\"\")\n", + "repo_id = setup_dataset_repo(HF_API, NMS_NAMESPACE, DATASET_NAME, NEMO_URL)" + ] + }, + { + "cell_type": "markdown", + "id": "4140a67b-8e54-453c-9baa-1929ea81bbb9", + "metadata": {}, + "source": [ + "Follow this [dataset format](https://docs.nvidia.com/nemo/microservices/latest/evaluate/evaluation-types/agentic.html#options) to prepare the dataset for agentic evaluation. Example datasets are provided at `./eval_dataset/agent_data`. Next, we will upload these example datasets to Data Store." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 14, + "id": "9f48cdb0-1416-4f58-9ebe-2ee12fef6e38", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "agent_tool_call_data.jsonl: 100%|██████████| 3.20k/3.20k [00:00<00:00, 709kB/s]\n" + ] + }, + { + "data": { + "text/plain": [ + "CommitInfo(commit_url='', commit_message='Upload agent_tool_call_data.jsonl with huggingface_hub', commit_description='', oid='659c7516a96482859db4f4707a32b4a8c5b78ba9', pr_url=None, repo_url=RepoUrl('', endpoint='https://huggingface.co', repo_type='model', repo_id=''), pr_revision=None, pr_num=None)" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "HF_API.upload_file(path_or_fileobj=os.path.join(\"./eval_dataset/agent_data\", \"agent_goal_data.jsonl\"),\n", + " path_in_repo=\"agent_goal_data.jsonl\",\n", + " repo_id=repo_id,\n", + " repo_type='dataset',\n", + ")\n", + "\n", + "HF_API.upload_file(path_or_fileobj=os.path.join(\"./eval_dataset/agent_data\", \"agent_tool_call_data.jsonl\"),\n", + " path_in_repo=\"agent_tool_call_data.jsonl\",\n", + " repo_id=repo_id,\n", + " repo_type='dataset',\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "2aaa0731-70b0-4db5-82b0-9169038d4fa8", + "metadata": {}, + "source": [ + "### 2.2 Create Evaluation Targets" + ] + }, + { + "cell_type": "markdown", + "id": "ffa00862-1b7e-49a0-b67e-f9e2417ccac7", + "metadata": {}, + "source": [ + "Currently, agentic evaluation only works with `cached_outputs` targets which point to files that are stored in NeMo Data Store and that contain pre-generated answers.\n", + "\n", + "We will create two evaluation targets for agentic evaluations: one for agent goal accuracy and one for tool calling accuracy." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 15, + "id": "f50ddadc-6ce6-4d60-9e62-b6d17643db77", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'created_at': '2025-07-08T19:18:37.063092',\n", + " 'updated_at': '2025-07-08T19:18:37.063095',\n", + " 'name': 'agent-goal-target',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'type': 'cached_outputs',\n", + " 'cached_outputs': {'files_url': 'hf://datasets/nemo-eval-tutorial/agent_eval/agent_goal_data.jsonl'},\n", + " 'id': 'eval-target-GDJ771rB6W3xaowbpE4Bsh',\n", + " 'custom_fields': {}}\n" + ] + } + ], + "source": [ + "payload = {\n", + " \"type\": \"cached_outputs\",\n", + " \"name\": \"agent-goal-target\",\n", + " \"namespace\": NMS_NAMESPACE,\n", + " \"cached_outputs\": {\n", + " \"files_url\": f\"hf://datasets/{NMS_NAMESPACE}/{DATASET_NAME}/agent_goal_data.jsonl\",\n", + " }\n", + "}\n", + "\n", + "headers = {\n", + " \"Content-Type\": \"application/json\",\n", + "}\n", + "\n", + "\n", + "resp = requests.post(target_url, json=payload, headers=headers)\n", + "pp(resp.json())" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "id": "656576d0-c06d-4d20-8c9f-f6fcd01423ad", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'created_at': '2025-07-08T19:18:38.848137',\n", + " 'updated_at': '2025-07-08T19:18:38.848149',\n", + " 'name': 'agent-tool-call-target',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'type': 'cached_outputs',\n", + " 'cached_outputs': {'files_url': 'hf://datasets/nemo-eval-tutorial/agent_eval/agent_tool_call_data.jsonl'},\n", + " 'id': 'eval-target-HBLMJtD4MhBzV6S3AGor2g',\n", + " 'custom_fields': {}}\n" + ] + } + ], + "source": [ + "payload = {\n", + " \"type\": \"cached_outputs\",\n", + " \"name\": \"agent-tool-call-target\",\n", + " \"namespace\": NMS_NAMESPACE,\n", + " \"cached_outputs\": {\n", + " \"files_url\": 
f\"hf://datasets/{NMS_NAMESPACE}/{DATASET_NAME}/agent_tool_call_data.jsonl\",\n", + " }\n", + "}\n", + "\n", + "resp = requests.post(target_url, json=payload, headers=headers)\n", + "pp(resp.json())" + ] + }, + { + "cell_type": "markdown", + "id": "30cf1d52-6940-43ff-bf76-e11d258fa2e4", + "metadata": {}, + "source": [ + "### 2.3 Create Evaluation Configs\n", + "\n", + "Similarly, we will create two evaluation configs: one for agent goal accuracy and one for tool calling accuracy." + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "id": "49aae7e8-4718-413b-91ae-f7ba6269880e", + "metadata": { + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'created_at': '2025-07-08T19:19:13.374904',\n", + " 'updated_at': '2025-07-08T19:19:13.374906',\n", + " 'name': 'agentic-goal-accuracy',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'type': 'agentic',\n", + " 'tasks': {'goal-accuracy': {'type': 'agent_goal_accuracy_with_reference',\n", + " 'params': {'judge': {'model': {'url': 'https://integrate.api.nvidia.com/v1',\n", + " 'model_id': 'meta/llama-3.3-70b-instruct',\n", + " 'api_key': '******'},\n", + " 'inference_params': {'max_new_tokens': 4024,\n", + " 'max_retries': 10,\n", + " 'request_timeout': 10,\n", + " 'temperature': 0.1}}}}},\n", + " 'id': 'eval-config-MCeuvELVrXYuZQLZWE2tkV',\n", + " 'custom_fields': {}}\n" + ] + } + ], + "source": [ + "payload = {\n", + " \"type\": \"agentic\",\n", + " \"name\": \"agentic-goal-accuracy\",\n", + " \"namespace\": NMS_NAMESPACE,\n", + " \"tasks\": {\n", + " \"goal-accuracy\": {\n", + " \"type\": \"agent_goal_accuracy_with_reference\",\n", + " \"params\": {\n", + " \"judge\": {\n", + " \"model\": {\n", + " \"url\": \"https://integrate.api.nvidia.com/v1\",\n", + " \"model_id\": \"meta/llama-3.3-70b-instruct\",\n", + " \"api_key\": os.environ[\"NVIDIA_API_KEY\"]\n", + " },\n", + " \"inference_params\": {\n", + " 
\"max_new_tokens\": 4024,\n", + " \"max_retries\": 10,\n", + " \"request_timeout\": 10,\n", + " \"temperature\": 0.1\n", + " }\n", + " }\n", + " }\n", + " }\n", + " }\n", + "}\n", + "headers = {\n", + " \"accept\": \"application/json\",\n", + " \"Content-Type\": \"application/json\"\n", + "}\n", + "\n", + "resp = requests.post(config_url, json=payload, headers=headers)\n", + "pp(resp.json())" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "id": "0e6f59d8-be94-4d61-956e-6967c6828c62", + "metadata": { + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'created_at': '2025-07-08T19:19:30.170735',\n", + " 'updated_at': '2025-07-08T19:19:30.170738',\n", + " 'name': 'agentic-tool-call-accuracy',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'type': 'agentic',\n", + " 'tasks': {'tool-call-accuracy': {'type': 'tool_call_accuracy'}},\n", + " 'id': 'eval-config-Jr4uE5uUonn3YyfwUHKKSa',\n", + " 'custom_fields': {}}\n" + ] + } + ], + "source": [ + "payload = {\n", + " \"type\": \"agentic\",\n", + " \"name\": \"agentic-tool-call-accuracy\",\n", + " \"namespace\": NMS_NAMESPACE,\n", + " \"tasks\": {\n", + " \"tool-call-accuracy\": {\n", + " \"type\": \"tool_call_accuracy\",\n", + " }\n", + " }\n", + "}\n", + "\n", + "resp = requests.post(config_url, json=payload, headers=headers)\n", + "pp(resp.json())" + ] + }, + { + "cell_type": "markdown", + "id": "73806bf9-8caf-4eb9-9bfb-77222637869e", + "metadata": {}, + "source": [ + "### 2.4 Submit Evaluation Job\n", + "\n", + "To launch the evaluation job, we simply send a request with the previously created evaluation targets and configs to the `/jobs` API endpoint." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 28, + "id": "783386ae-b5c1-4c29-bfb3-9b94df32bdd4", + "metadata": { + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'created_at': '2025-07-08T19:28:46.640661',\n", + " 'updated_at': '2025-07-08T19:28:46.640663',\n", + " 'id': 'eval-PLUWikEJpyH7WkcYj6sWZb',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'description': None,\n", + " 'target': {'schema_version': '1.0',\n", + " 'id': 'eval-target-GDJ771rB6W3xaowbpE4Bsh',\n", + " 'description': None,\n", + " 'type_prefix': 'eval-target',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'project': None,\n", + " 'created_at': '2025-07-08T19:18:37.063092',\n", + " 'updated_at': '2025-07-08T19:18:37.063095',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'agent-goal-target',\n", + " 'type': 'cached_outputs',\n", + " 'cached_outputs': {'files_url': 'hf://datasets/nemo-eval-tutorial/agent_eval/agent_goal_data.jsonl'},\n", + " 'model': None,\n", + " 'retriever': None,\n", + " 'rag': None,\n", + " 'rows': None,\n", + " 'dataset': None},\n", + " 'config': {'schema_version': '1.0',\n", + " 'id': 'eval-config-MCeuvELVrXYuZQLZWE2tkV',\n", + " 'description': None,\n", + " 'type_prefix': 'eval-config',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'project': None,\n", + " 'created_at': '2025-07-08T19:19:13.374904',\n", + " 'updated_at': '2025-07-08T19:19:13.374906',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'agentic-goal-accuracy',\n", + " 'type': 'agentic',\n", + " 'params': None,\n", + " 'tasks': {'goal-accuracy': {'type': 'agent_goal_accuracy_with_reference',\n", + " 'params': {'judge': {'model': {'url': 'https://integrate.api.nvidia.com/v1',\n", + " 'model_id': 'meta/llama-3.3-70b-instruct',\n", + " 'api_key': '******'},\n", + " 'inference_params': {'max_new_tokens': 4024,\n", + " 'max_retries': 10,\n", + " 
'request_timeout': 10,\n", + " 'temperature': 0.1}}},\n", + " 'metrics': None,\n", + " 'dataset': None}},\n", + " 'groups': None},\n", + " 'result': None,\n", + " 'output_files_url': None,\n", + " 'status_details': {'message': None, 'task_status': {}, 'progress': None},\n", + " 'status': 'created',\n", + " 'project': None,\n", + " 'custom_fields': {},\n", + " 'ownership': None}\n" + ] + } + ], + "source": [ + "payload = {\n", + " \"namespace\": NMS_NAMESPACE,\n", + " \"target\": f\"{NMS_NAMESPACE}/agent-goal-target\",\n", + " \"config\": f\"{NMS_NAMESPACE}/agentic-goal-accuracy\"\n", + "}\n", + "\n", + "resp = requests.post(job_url, json=payload, headers=headers)\n", + "agent_goal_eval_job_id = resp.json()[\"id\"]\n", + "pp(resp.json())" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "id": "48683921-cb41-4668-bba5-485f2df43831", + "metadata": { + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'created_at': '2025-07-08T19:27:35.584648',\n", + " 'updated_at': '2025-07-08T19:27:35.584650',\n", + " 'id': 'eval-N9HsrJbRXjWKefR6ZSi3oo',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'description': None,\n", + " 'target': {'schema_version': '1.0',\n", + " 'id': 'eval-target-HBLMJtD4MhBzV6S3AGor2g',\n", + " 'description': None,\n", + " 'type_prefix': 'eval-target',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'project': None,\n", + " 'created_at': '2025-07-08T19:18:38.848137',\n", + " 'updated_at': '2025-07-08T19:18:38.848149',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'agent-tool-call-target',\n", + " 'type': 'cached_outputs',\n", + " 'cached_outputs': {'files_url': 'hf://datasets/nemo-eval-tutorial/agent_eval/agent_tool_call_data.jsonl'},\n", + " 'model': None,\n", + " 'retriever': None,\n", + " 'rag': None,\n", + " 'rows': None,\n", + " 'dataset': None},\n", + " 'config': {'schema_version': '1.0',\n", + " 'id': 
'eval-config-Jr4uE5uUonn3YyfwUHKKSa',\n", + " 'description': None,\n", + " 'type_prefix': 'eval-config',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'project': None,\n", + " 'created_at': '2025-07-08T19:19:30.170735',\n", + " 'updated_at': '2025-07-08T19:19:30.170738',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'agentic-tool-call-accuracy',\n", + " 'type': 'agentic',\n", + " 'params': None,\n", + " 'tasks': {'tool-call-accuracy': {'type': 'tool_call_accuracy',\n", + " 'params': None,\n", + " 'metrics': None,\n", + " 'dataset': None}},\n", + " 'groups': None},\n", + " 'result': None,\n", + " 'output_files_url': None,\n", + " 'status_details': {'message': None, 'task_status': {}, 'progress': None},\n", + " 'status': 'created',\n", + " 'project': None,\n", + " 'custom_fields': {},\n", + " 'ownership': None}\n" + ] + } + ], + "source": [ + "payload = {\n", + " \"namespace\": NMS_NAMESPACE,\n", + " \"target\": f\"{NMS_NAMESPACE}/agent-tool-call-target\",\n", + " \"config\": f\"{NMS_NAMESPACE}/agentic-tool-call-accuracy\"\n", + "}\n", + "\n", + "resp = requests.post(job_url, json=payload, headers=headers)\n", + "agent_tool_call_eval_job_id = resp.json()[\"id\"]\n", + "pp(resp.json())" + ] + }, + { + "cell_type": "markdown", + "id": "6844ba1d-a491-43c4-8b64-947eb88f72f3", + "metadata": {}, + "source": [ + "### 2.5 Monitor Job Status and Get Evaluation Results\n", + "\n", + "We can monitor the job status and get the evaluation results by sending requests with the job ID to the `/status` and `/results` endpoints, respectively." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 33, + "id": "192c9156-8df3-4611-9687-7356937fe389", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'message': None, 'task_status': {}, 'progress': 100.0}\n" + ] + } + ], + "source": [ + "resp = requests.get(f\"{NEMO_URL}/v1/evaluation/jobs/{agent_goal_eval_job_id}/status\")\n", + "pp(resp.json())" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "id": "20af57a2-7fb6-49db-a3a5-a8fecf75ce37", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'goal-accuracy': {'metrics': {'agent_goal_accuracy': {'scores': {'agent_goal_accuracy': {'value': 1.0}}}}}}\n" + ] + } + ], + "source": [ + "resp = requests.get(f\"{NEMO_URL}/v1/evaluation/jobs/{agent_goal_eval_job_id}/results\")\n", + "pp(resp.json()['tasks'])" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "id": "a942d665-4f3e-4b10-8efe-e4bf02f81b67", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'message': None, 'task_status': {}, 'progress': 100.0}\n" + ] + } + ], + "source": [ + "resp = requests.get(f\"{NEMO_URL}/v1/evaluation/jobs/{agent_tool_call_eval_job_id}/status\")\n", + "pp(resp.json())" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "id": "366eba53-1b82-4d5e-85a2-6fb2c923f113", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'tool-call-accuracy': {'metrics': {'tool_call_accuracy': {'scores': {'tool_call_accuracy': {'value': 1.0}}}}}}\n" + ] + } + ], + "source": [ + "resp = requests.get(f\"{NEMO_URL}/v1/evaluation/jobs/{agent_tool_call_eval_job_id}/results\")\n", + "pp(resp.json()['tasks'])" + ] + }, + { + "cell_type": "markdown", + "id": "87aaef5c-5d97-4107-8339-833faaeaaeef", + "metadata": {}, + "source": [ + "**Download Evaluation Results**: downloads a directory that contains the configuration files, logs, 
and evaluation results for a specific evaluation job." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b5e9abd9-104c-4a8b-ad91-982d0070ec17", + "metadata": {}, + "outputs": [], + "source": [ + "!curl -X \"GET\" \"{NEMO_URL}/v1/evaluation/jobs/{agent_goal_eval_job_id}/download-results\" \\\n", + "-H 'accept: application/json' \\\n", + "-o result.zip" + ] + }, + { + "cell_type": "markdown", + "id": "cb03af9c-d330-4579-9b71-2883c32ad80a", + "metadata": {}, + "source": [ + "## 3. LLM Evaluation on Academic Benchmarks" + ] + }, + { + "cell_type": "markdown", + "id": "0b22d8f7-acb4-4782-8b6b-794a808da8a5", + "metadata": {}, + "source": [ + "**Create Evaluation Config**" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "id": "86a9e0ce-1a6c-4964-8425-ef0427a7ecc2", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'created_at': '2025-07-08T19:39:55.734565',\n", + " 'updated_at': '2025-07-08T19:39:55.734568',\n", + " 'name': 'gsm8k-chat-config',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'type': 'gsm8k',\n", + " 'params': {'max_tokens': 256,\n", + " 'temperature': 1e-05,\n", + " 'top_p': 1e-05,\n", + " 'stop': ['<|eot|>'],\n", + " 'extra': {'num_fewshot': 8,\n", + " 'batch_size': 16,\n", + " 'bootstrap_iters': 100000,\n", + " 'dataset_seed': 42,\n", + " 'use_greedy': True,\n", + " 'top_k': 1,\n", + " 'hf_token': '******',\n", + " 'tokenizer_backend': 'hf',\n", + " 'tokenizer': 'meta-llama/llama-3.1-8B-Instruct',\n", + " 'apply_chat_template': True,\n", + " 'fewshot_as_multiturn': True}},\n", + " 'id': 'eval-config-6pTREGQfYHFQzq7oWHATgt',\n", + " 'custom_fields': {}}\n" + ] + } + ], + "source": [ + "payload = {\n", + " \"type\": \"gsm8k\",\n", + " \"name\": \"gsm8k-chat-config\",\n", + " \"namespace\": NMS_NAMESPACE,\n", + " \"params\": {\n", + " \"temperature\": 0.00001,\n", + " \"top_p\": 0.00001,\n", + " \"max_tokens\": 256,\n", + " \"stop\": [\"<|eot|>\"],\n", + " \"extra\": {\n", 
+ " \"num_fewshot\": 8,\n", + " \"batch_size\": 16,\n", + " \"bootstrap_iters\": 100000,\n", + " \"dataset_seed\": 42,\n", + " \"use_greedy\": True,\n", + " \"top_k\": 1,\n", + " \"hf_token\": os.environ['HF_Token'],\n", + " \"tokenizer_backend\": \"hf\",\n", + " \"tokenizer\": \"meta-llama/llama-3.1-8B-Instruct\",\n", + " \"apply_chat_template\": True,\n", + " \"fewshot_as_multiturn\": True\n", + " }\n", + " }\n", + "}\n", + "headers = {\n", + " \"accept\": \"application/json\",\n", + " \"Content-Type\": \"application/json\"\n", + "}\n", + "\n", + "resp = requests.post(config_url, json=payload, headers=headers)\n", + "pp(resp.json())" + ] + }, + { + "cell_type": "markdown", + "id": "f1c1856a-b816-49f4-900a-7de0b38506e6", + "metadata": {}, + "source": [ + "**Create Evaluation Target**" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "id": "3bd2ace9-dbda-4120-bba9-644a7651b75d", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'created_at': '2025-07-08T19:40:13.061228',\n", + " 'updated_at': '2025-07-08T19:40:13.061229',\n", + " 'name': 'llama-chat-target',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'type': 'model',\n", + " 'model': {'schema_version': '1.0',\n", + " 'id': 'model-EXZQbF8ZXbjoogTvF9UFKr',\n", + " 'type_prefix': 'model',\n", + " 'namespace': 'default',\n", + " 'created_at': '2025-07-08T19:40:13.061012',\n", + " 'updated_at': '2025-07-08T19:40:13.061015',\n", + " 'custom_fields': {},\n", + " 'name': 'model-EXZQbF8ZXbjoogTvF9UFKr',\n", + " 'version_id': 'main',\n", + " 'version_tags': [],\n", + " 'api_endpoint': {'url': 'http://nim.test/v1/chat/completions',\n", + " 'model_id': 'meta/llama-3.1-8b-instruct',\n", + " 'format': 'openai'}},\n", + " 'id': 'eval-target-GaJYgnmDA4Ta2TN23sV9Wo',\n", + " 'custom_fields': {}}\n" + ] + } + ], + "source": [ + "payload = {\n", + " \"type\": \"model\",\n", + " \"name\": \"llama-chat-target\",\n", + " \"namespace\": NMS_NAMESPACE,\n", + " 
\"model\": {\n", + " \"api_endpoint\": {\n", + " \"url\": llm_chat_completion_url ,\n", + " \"model_id\": \"meta/llama-3.1-8b-instruct\",\n", + " \"format\": \"openai\"\n", + " }\n", + " }\n", + "}\n", + "\n", + "resp = requests.post(target_url, json=payload, headers=headers)\n", + "pp(resp.json())" + ] + }, + { + "cell_type": "markdown", + "id": "4f8b84ac-ff4e-4237-aace-3e6756de4b9a", + "metadata": {}, + "source": [ + "**Submit Evaluation Job**" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "id": "2db8db92-4040-455d-abb1-a6be1d6d9ec3", + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'created_at': '2025-07-08T19:40:19.984771',\n", + " 'updated_at': '2025-07-08T19:40:19.984774',\n", + " 'id': 'eval-WF5xprkaQpjTNZLpsNb1Y1',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'description': None,\n", + " 'target': {'schema_version': '1.0',\n", + " 'id': 'eval-target-GaJYgnmDA4Ta2TN23sV9Wo',\n", + " 'description': None,\n", + " 'type_prefix': 'eval-target',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'project': None,\n", + " 'created_at': '2025-07-08T19:40:13.061228',\n", + " 'updated_at': '2025-07-08T19:40:13.061229',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'llama-chat-target',\n", + " 'type': 'model',\n", + " 'cached_outputs': None,\n", + " 'model': {'schema_version': '1.0',\n", + " 'id': 'model-EXZQbF8ZXbjoogTvF9UFKr',\n", + " 'description': None,\n", + " 'type_prefix': 'model',\n", + " 'namespace': 'default',\n", + " 'project': None,\n", + " 'created_at': '2025-07-08T19:40:13.061012',\n", + " 'updated_at': '2025-07-08T19:40:13.061015',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'model-EXZQbF8ZXbjoogTvF9UFKr',\n", + " 'version_id': 'main',\n", + " 'version_tags': [],\n", + " 'spec': None,\n", + " 'artifact': None,\n", + " 'base_model': None,\n", + " 'api_endpoint': {'url': 
'http://nim.test/v1/chat/completions',\n", + " 'model_id': 'meta/llama-3.1-8b-instruct',\n", + " 'api_key': None,\n", + " 'format': 'openai'},\n", + " 'peft': None,\n", + " 'prompt': None,\n", + " 'guardrails': None},\n", + " 'retriever': None,\n", + " 'rag': None,\n", + " 'rows': None,\n", + " 'dataset': None},\n", + " 'config': {'schema_version': '1.0',\n", + " 'id': 'eval-config-6pTREGQfYHFQzq7oWHATgt',\n", + " 'description': None,\n", + " 'type_prefix': 'eval-config',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'project': None,\n", + " 'created_at': '2025-07-08T19:39:55.734565',\n", + " 'updated_at': '2025-07-08T19:39:55.734568',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'gsm8k-chat-config',\n", + " 'type': 'gsm8k',\n", + " 'params': {'parallelism': None,\n", + " 'request_timeout': None,\n", + " 'max_retries': None,\n", + " 'limit_samples': None,\n", + " 'max_tokens': 256,\n", + " 'temperature': 1e-05,\n", + " 'top_p': 1e-05,\n", + " 'stop': ['<|eot|>'],\n", + " 'extra': {'num_fewshot': 8,\n", + " 'batch_size': 16,\n", + " 'bootstrap_iters': 100000,\n", + " 'dataset_seed': 42,\n", + " 'use_greedy': True,\n", + " 'top_k': 1,\n", + " 'hf_token': '******',\n", + " 'tokenizer_backend': 'hf',\n", + " 'tokenizer': 'meta-llama/llama-3.1-8B-Instruct',\n", + " 'apply_chat_template': True,\n", + " 'fewshot_as_multiturn': True}},\n", + " 'tasks': None,\n", + " 'groups': None},\n", + " 'result': None,\n", + " 'output_files_url': None,\n", + " 'status_details': {'message': None, 'task_status': {}, 'progress': None},\n", + " 'status': 'created',\n", + " 'project': None,\n", + " 'custom_fields': {},\n", + " 'ownership': None}\n" + ] + } + ], + "source": [ + "payload = {\n", + " \"namespace\": NMS_NAMESPACE,\n", + " \"target\": f\"{NMS_NAMESPACE}/llama-chat-target\",\n", + " \"config\": f\"{NMS_NAMESPACE}/gsm8k-chat-config\"\n", + "}\n", + "\n", + "resp = requests.post(job_url, json=payload, headers=headers)\n", + "gsm8k_eval_job_id = 
resp.json()[\"id\"]\n", + "pp(resp.json())" + ] + }, + { + "cell_type": "markdown", + "id": "6b27d234-77dd-4505-bae7-d4ed2e2a5348", + "metadata": {}, + "source": [ + "We can check the status of the job using the status API. **Note that the status API is regularly updated only for custom evaluations. For other evaluation types, as long as the status says `running`, the job is still running.**" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "id": "29a416f7-a18a-4c40-b6ea-be351d483f07", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'message': 'Job completed successfully', 'task_status': {}, 'progress': 100.0}\n" + ] + } + ], + "source": [ + "resp = requests.get(f\"{NEMO_URL}/v1/evaluation/jobs/{gsm8k_eval_job_id}/status\")\n", + "pp(resp.json())" + ] + }, + { + "cell_type": "markdown", + "id": "2cfcc3b0-8d3c-4498-9384-4b1213977a7e", + "metadata": {}, + "source": [ + "Once the job is completed, we can check the evaluation results using the results endpoint." + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "id": "49938fe8-bb24-4498-a152-84ccc944bcd2", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'metrics': {'exact_match': {'scores': {'gsm8k-metric_ranking-1': {'value': 0.7664897649734648},\n", + " 'gsm8k-metric_ranking-3': {'value': 0.821076573161486}}}}}\n", + "{'metrics': {'exact_match_stderr': {'scores': {'gsm8k-metric_ranking-2': {'value': 0.011653286808791036},\n", + " 'gsm8k-metric_ranking-4': {'value': 0.010557661392901296}}}}}\n" + ] + } + ], + "source": [ + "resp = requests.get(f\"{NEMO_URL}/v1/evaluation/jobs/{gsm8k_eval_job_id}/results\")\n", + "pp(resp.json()['tasks']['exact_match'])\n", + "pp(resp.json()['tasks']['exact_match_stderr'])" + ] + }, + { + "cell_type": "markdown", + "id": "90262e05-61a9-4dfa-9dee-5a8d579d531e", + "metadata": {}, + "source": [ + "## 4. 
Custom Evaluations" + ] + }, + { + "cell_type": "markdown", + "id": "325349e3-c240-4c6e-9cf6-913039b92735", + "metadata": {}, + "source": [ + "### 4.1 Similarity Metrics Evaluation" + ] + }, + { + "cell_type": "markdown", + "id": "f409363c-99ed-4e2b-b5b2-1d50c2790d14", + "metadata": {}, + "source": [ + "**Upload Custom Data to NeMo Data Store**" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "id": "b4b4bd96-9e35-41b2-8dce-95a88d8eff1a", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "inputs.jsonl: 100%|██████████| 149k/149k [00:00<00:00, 28.9MB/s]\n" + ] + }, + { + "data": { + "text/plain": [ + "CommitInfo(commit_url='', commit_message='Upload similarity_metrics/inputs.jsonl with huggingface_hub', commit_description='', oid='6220905767521b90b36e77360dc7213c3dbbb99e', pr_url=None, repo_url=RepoUrl('', endpoint='https://huggingface.co', repo_type='model', repo_id=''), pr_revision=None, pr_num=None)" + ] + }, + "execution_count": 43, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# set up dataset repo\n", + "DATASET_NAME = \"similarity_eval\"\n", + "HF_API = HfApi(endpoint=f\"{NDS_URL}/v1/hf\", token=\"\")\n", + "repo_id = setup_dataset_repo(HF_API, NMS_NAMESPACE, DATASET_NAME, NEMO_URL)\n", + "# upload dataset\n", + "HF_API.upload_file(path_or_fileobj=os.path.join(\"./eval_dataset/similarity_metrics_data\", \"inputs.jsonl\"),\n", + " path_in_repo=\"similarity_metrics/inputs.jsonl\",\n", + " repo_id=repo_id,\n", + " repo_type='dataset',\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "733118e3-7cdf-4b0c-9d57-51e5c7baf91e", + "metadata": {}, + "source": [ + "**Create Evaluation Config**" + ] + }, + { + "cell_type": "code", + "execution_count": 44, + "id": "7195efda-6785-404a-b910-9d5826176692", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'created_at': '2025-07-08T20:31:57.416755',\n", + " 'updated_at': 
'2025-07-08T20:31:57.416756',\n", + " 'name': 'similarity-configuration',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'type': 'similarity_metrics',\n", + " 'params': {'max_tokens': 200, 'temperature': 0.7, 'extra': {'top_k': 20}},\n", + " 'tasks': {'my-similarity-metrics-task': {'type': 'default',\n", + " 'metrics': {'accuracy': {'type': 'accuracy'},\n", + " 'bleu': {'type': 'bleu'},\n", + " 'rouge': {'type': 'rouge'},\n", + " 'em': {'type': 'em'},\n", + " 'f1': {'type': 'f1'}},\n", + " 'dataset': {'schema_version': '1.0',\n", + " 'id': 'dataset-DALY8efJ78UgkXHcrfACGZ',\n", + " 'namespace': 'default',\n", + " 'created_at': '2025-07-08T20:31:57.416530',\n", + " 'updated_at': '2025-07-08T20:31:57.416532',\n", + " 'custom_fields': {},\n", + " 'name': 'dataset-DALY8efJ78UgkXHcrfACGZ',\n", + " 'version_id': 'main',\n", + " 'version_tags': [],\n", + " 'files_url': 'hf://datasets/nemo-eval-tutorial/similarity_eval/similarity_metrics/inputs.jsonl'}}},\n", + " 'id': 'eval-config-XQpRHrKVucSQzLDXFdZqjq',\n", + " 'custom_fields': {}}\n" + ] + } + ], + "source": [ + "payload = {\n", + " \"type\": \"similarity_metrics\",\n", + " \"name\": \"similarity-configuration\",\n", + " \"namespace\": NMS_NAMESPACE,\n", + " \"params\": {\n", + " \"max_tokens\": 200,\n", + " \"temperature\": 0.7,\n", + " \"extra\": {\n", + " \"top_k\": 20\n", + " }\n", + " },\n", + " \"tasks\": {\n", + " \"my-similarity-metrics-task\": {\n", + " \"type\": \"default\",\n", + " \"dataset\": {\n", + " \"files_url\": f\"hf://datasets/{NMS_NAMESPACE}/{DATASET_NAME}/similarity_metrics/inputs.jsonl\",\n", + " },\n", + " \"metrics\": {\n", + " \"accuracy\": {\"type\": \"accuracy\"},\n", + " \"bleu\": {\"type\": \"bleu\"},\n", + " \"rouge\": {\"type\": \"rouge\"},\n", + " \"em\": {\"type\": \"em\"},\n", + " \"f1\": {\"type\": \"f1\"}\n", + " }\n", + " }\n", + " }\n", + "}\n", + "\n", + "resp = requests.post(config_url, json=payload, headers=headers)\n", + "pp(resp.json())" + ] + }, + { + "cell_type": 
"markdown", + "id": "c1571c3e-fbe5-4374-accf-7bbb1c36b63d", + "metadata": {}, + "source": [ + "**Launch Evaluation Job**" + ] + }, + { + "cell_type": "code", + "execution_count": 45, + "id": "261f2366-8e90-48d4-90ff-08275b5f0ac4", + "metadata": { + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'created_at': '2025-07-08T20:32:03.445393',\n", + " 'updated_at': '2025-07-08T20:32:03.445396',\n", + " 'id': 'eval-TVYBnUrD4XhjDV4UCdXJ6y',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'description': None,\n", + " 'target': {'schema_version': '1.0',\n", + " 'id': 'eval-target-GaJYgnmDA4Ta2TN23sV9Wo',\n", + " 'description': None,\n", + " 'type_prefix': 'eval-target',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'project': None,\n", + " 'created_at': '2025-07-08T19:40:13.061228',\n", + " 'updated_at': '2025-07-08T19:40:13.061229',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'llama-chat-target',\n", + " 'type': 'model',\n", + " 'cached_outputs': None,\n", + " 'model': {'schema_version': '1.0',\n", + " 'id': 'model-EXZQbF8ZXbjoogTvF9UFKr',\n", + " 'description': None,\n", + " 'type_prefix': 'model',\n", + " 'namespace': 'default',\n", + " 'project': None,\n", + " 'created_at': '2025-07-08T19:40:13.061012',\n", + " 'updated_at': '2025-07-08T19:40:13.061015',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'model-EXZQbF8ZXbjoogTvF9UFKr',\n", + " 'version_id': 'main',\n", + " 'version_tags': [],\n", + " 'spec': None,\n", + " 'artifact': None,\n", + " 'base_model': None,\n", + " 'api_endpoint': {'url': 'http://nim.test/v1/chat/completions',\n", + " 'model_id': 'meta/llama-3.1-8b-instruct',\n", + " 'api_key': None,\n", + " 'format': 'openai'},\n", + " 'peft': None,\n", + " 'prompt': None,\n", + " 'guardrails': None},\n", + " 'retriever': None,\n", + " 'rag': None,\n", + " 'rows': None,\n", + " 'dataset': None},\n", + " 
'config': {'schema_version': '1.0',\n", + " 'id': 'eval-config-XQpRHrKVucSQzLDXFdZqjq',\n", + " 'description': None,\n", + " 'type_prefix': 'eval-config',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'project': None,\n", + " 'created_at': '2025-07-08T20:31:57.416755',\n", + " 'updated_at': '2025-07-08T20:31:57.416756',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'similarity-configuration',\n", + " 'type': 'similarity_metrics',\n", + " 'params': {'parallelism': None,\n", + " 'request_timeout': None,\n", + " 'max_retries': None,\n", + " 'limit_samples': None,\n", + " 'max_tokens': 200,\n", + " 'temperature': 0.7,\n", + " 'top_p': None,\n", + " 'stop': None,\n", + " 'extra': {'top_k': 20}},\n", + " 'tasks': {'my-similarity-metrics-task': {'type': 'default',\n", + " 'params': None,\n", + " 'metrics': {'accuracy': {'type': 'accuracy',\n", + " 'params': None},\n", + " 'bleu': {'type': 'bleu',\n", + " 'params': None},\n", + " 'rouge': {'type': 'rouge',\n", + " 'params': None},\n", + " 'em': {'type': 'em',\n", + " 'params': None},\n", + " 'f1': {'type': 'f1',\n", + " 'params': None}},\n", + " 'dataset': {'schema_version': '1.0',\n", + " 'id': 'dataset-DALY8efJ78UgkXHcrfACGZ',\n", + " 'description': None,\n", + " 'type_prefix': None,\n", + " 'namespace': 'default',\n", + " 'project': None,\n", + " 'created_at': '2025-07-08T20:31:57.416530',\n", + " 'updated_at': '2025-07-08T20:31:57.416532',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'dataset-DALY8efJ78UgkXHcrfACGZ',\n", + " 'version_id': 'main',\n", + " 'version_tags': [],\n", + " 'format': None,\n", + " 'files_url': 'hf://datasets/nemo-eval-tutorial/similarity_eval/similarity_metrics/inputs.jsonl',\n", + " 'hf_endpoint': None,\n", + " 'split': None,\n", + " 'limit': None}}},\n", + " 'groups': None},\n", + " 'result': None,\n", + " 'output_files_url': None,\n", + " 'status_details': {'message': None, 'task_status': {}, 'progress': None},\n", + " 'status': 
'created',\n", + " 'project': None,\n", + " 'custom_fields': {},\n", + " 'ownership': None}\n" + ] + } + ], + "source": [ + "payload = {\n", + " \"namespace\": NMS_NAMESPACE,\n", + " \"target\": f\"{NMS_NAMESPACE}/llama-chat-target\",\n", + " \"config\": f\"{NMS_NAMESPACE}/similarity-configuration\"\n", + "}\n", + "headers = {\n", + " \"accept\": \"application/json\",\n", + " \"Content-Type\": \"application/json\"\n", + "}\n", + "\n", + "resp = requests.post(job_url, json=payload, headers=headers)\n", + "similarity_eval_job_id = resp.json()[\"id\"]\n", + "pp(resp.json())" + ] + }, + { + "cell_type": "markdown", + "id": "803ab48f-cba7-4285-9267-ab513b2862e2", + "metadata": {}, + "source": [ + "**Monitoring job status and results**" + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "id": "78c6b851-4c6f-4a11-9c21-8c4a26da0950", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'message': 'Job completed successfully',\n", + " 'task_status': {'my-similarity-metrics-task': 'completed'},\n", + " 'progress': 100.0}\n" + ] + } + ], + "source": [ + "resp = requests.get(f\"{NEMO_URL}/v1/evaluation/jobs/{similarity_eval_job_id}/status\")\n", + "pp(resp.json())" + ] + }, + { + "cell_type": "code", + "execution_count": 51, + "id": "83676ffb-2ef1-493a-a2a1-3dc1b41a72d0", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'metrics': {'accuracy': {'scores': {'accuracy': {'value': 0.0}}},\n", + " 'bleu': {'scores': {'bleu_score': {'value': 0.015511131876432806}}},\n", + " 'em': {'scores': {'em': {'value': 0.0}}},\n", + " 'f1': {'scores': {'f1': {'value': 0.10128911130270025}}},\n", + " 'rouge': {'scores': {'rouge_1_score': {'value': 0.1166731565559731},\n", + " 'rouge_2_score': {'value': 0.03311328362331498},\n", + " 'rouge_3_score': {'value': 0.01193015710048473},\n", + " 'rouge_L_score': {'value': 0.09285834070191781}}}}}\n" + ] + } + ], + "source": [ + "resp = 
requests.get(f\"{NEMO_URL}/v1/evaluation/jobs/{similarity_eval_job_id}/results\")\n", + "pp(resp.json()['tasks']['my-similarity-metrics-task'])" + ] + }, + { + "cell_type": "markdown", + "id": "b38981ff-5cae-4510-af60-f3458f2fc933", + "metadata": {}, + "source": [ + "### 4.2 LLM-as-Judge Evaluation" + ] + }, + { + "cell_type": "markdown", + "id": "6dc49476-b5a9-48a7-8453-c3f2cf0c4942", + "metadata": {}, + "source": [ + "**Upload Custom Dataset to Data Store**" + ] + }, + { + "cell_type": "code", + "execution_count": 52, + "id": "39f1d4b2-e93e-433c-95fa-3648c3139229", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "math_dataset.csv: 100%|██████████| 449/449 [00:00<00:00, 108kB/s]\n" + ] + }, + { + "data": { + "text/plain": [ + "CommitInfo(commit_url='', commit_message='Upload llm_as_judge/math_dataset.csv with huggingface_hub', commit_description='', oid='32e22ec06ea372c6cfddf52cb733e86fd1b135ed', pr_url=None, repo_url=RepoUrl('', endpoint='https://huggingface.co', repo_type='model', repo_id=''), pr_revision=None, pr_num=None)" + ] + }, + "execution_count": 52, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "DATASET_NAME = \"llm_as_judge_data\"\n", + "repo_id = setup_dataset_repo(HF_API, NMS_NAMESPACE, DATASET_NAME, NEMO_URL)\n", + "HF_API.upload_file(\n", + " path_or_fileobj='./eval_dataset/llm_judge_data/math_dataset.csv',\n", + " path_in_repo=\"llm_as_judge/math_dataset.csv\",\n", + " repo_id=repo_id,\n", + " repo_type='dataset',\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "2868d978-17a8-4b94-bc46-007613d74ff4", + "metadata": {}, + "source": [ + "Templates in a custom evaluation config can reference two objects:\n", + "\n", + "- **Item**: Represents the current item from the dataset.\n", + "- **Sample**: Contains data related to the output from the model. 
The `sample.output_text` property holds the completion text for completion models and the content of the first message for chat models.\n", + "\n", + "The properties on the `item` object are derived from the dataset's column names (for CSV files) or keys (for JSON files).\n", + "The following rules apply to these properties:\n", + "\n", + "- All non-alphanumeric characters are replaced with underscores.\n", + "- Column names are converted to lowercase.\n", + "- In case of conflicts, suffixes (`_1`, `_2`, and so on) are appended to the property names.\n", + "\n", + "\n", + "**Templates for Chat Models**\n", + "\n", + "Prompt templates structure the requests sent to the model under evaluation and follow the NIM/OpenAI format for chat-completion tasks. Templates use the Jinja2 templating syntax. Variables are written in double curly braces, for example, `{{item.review}}`." + ] + }, + { + "cell_type": "markdown", + "id": "d0455449-058b-4c8b-b675-23cf5aed1c4d", + "metadata": {}, + "source": [ + "**Create Eval Config**" + ] + }, + { + "cell_type": "markdown", + "id": "0724c9e9-1271-4601-811b-e027183ec6ac", + "metadata": {}, + "source": [ + "We will use `llama-3.3-70b-instruct` hosted on build.nvidia.com as the judge model." + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "id": "edd8f1d8-eb6a-483f-b133-3663a1bf8977", + "metadata": { + "collapsed": true, + "jupyter": { + "outputs_hidden": true + }, + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'created_at': '2025-07-08T20:37:58.674883',\n", + " 'updated_at': '2025-07-08T20:37:58.674884',\n", + " 'name': 'custom_llm_as_judge_config',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'type': 'custom',\n", + " 'tasks': {'qa': {'type': 'completion',\n", + " 'params': {'template': {'messages': [{'role': 'system',\n", + " 'content': 'You are a '\n", + " 'helpful, '\n", + " 'respectful '\n", + " 'and honest '\n", + " 'assistant. 
\\n'\n", + " 'Answers '\n", + " 'the '\n", + " 'following '\n", + " 'question '\n", + " 'as briefly '\n", + " 'as you '\n", + " 'can.\\n'\n", + " '.'},\n", + " {'role': 'user',\n", + " 'content': 'Answer '\n", + " 'very '\n", + " 'briefly '\n", + " '(no '\n", + " 'explanation) '\n", + " 'this '\n", + " 'question: '\n", + " '{{item.question}}'}]}},\n", + " 'metrics': {'accuracy': {'type': 'string-check',\n", + " 'params': {'check': ['{{sample.output_text}}',\n", + " 'contains',\n", + " '{{item.answer}}']}},\n", + " 'bleu': {'type': 'bleu',\n", + " 'params': {'references': ['{{item.reference_answer}}']}},\n", + " 'accuracy-llm-judge': {'type': 'llm-judge',\n", + " 'params': {'model': {'api_endpoint': {'url': 'https://integrate.api.nvidia.com/v1/chat/completions',\n", + " 'model_id': 'meta/llama-3.3-70b-instruct',\n", + " 'api_key': '******'}},\n", + " 'template': {'messages': [{'role': 'system',\n", + " 'content': 'Your '\n", + " 'task '\n", + " 'is '\n", + " 'to '\n", + " 'evaluate '\n", + " 'the '\n", + " 'semantic '\n", + " 'similarity '\n", + " 'between '\n", + " 'two '\n", + " 'responses.'},\n", + " {'role': 'user',\n", + " 'content': 'Respond '\n", + " 'in '\n", + " 'the '\n", + " 'following '\n", + " 'format '\n", + " 'SIMILARITY: '\n", + " '4. 
'\n", + " 'The '\n", + " 'similarity '\n", + " 'should '\n", + " 'be '\n", + " 'a '\n", + " 'score '\n", + " 'between '\n", + " '0 '\n", + " 'and '\n", + " '10.\\n'\n", + " '\\n'\n", + " 'RESPONSE '\n", + " '1: '\n", + " '{{item.reference_answer}}\\n'\n", + " '\\n'\n", + " 'RESPONSE '\n", + " '2: '\n", + " '{{sample.output_text}}.\\n'\n", + " '\\n'}]},\n", + " 'scores': {'similarity': {'type': 'int',\n", + " 'parser': {'type': 'regex',\n", + " 'pattern': 'SIMILARITY: '\n", + " '(\\\\d)'}}}}}},\n", + " 'dataset': {'schema_version': '1.0',\n", + " 'id': 'dataset-4cBEc7vpZh8aBPpuGpNoFD',\n", + " 'namespace': 'default',\n", + " 'created_at': '2025-07-08T20:37:58.674690',\n", + " 'updated_at': '2025-07-08T20:37:58.674693',\n", + " 'custom_fields': {},\n", + " 'name': 'dataset-4cBEc7vpZh8aBPpuGpNoFD',\n", + " 'version_id': 'main',\n", + " 'version_tags': [],\n", + " 'files_url': 'hf://datasets/nemo-eval-tutorial/llm_as_judge_data/llm_as_judge/math_dataset.csv'}}},\n", + " 'id': 'eval-config-J9aERJok5jacrTdhYcyM1H',\n", + " 'custom_fields': {}}\n" + ] + } + ], + "source": [ + "payload = {\n", + " \"type\": \"custom\",\n", + " \"namespace\": NMS_NAMESPACE,\n", + " \"name\": \"custom_llm_as_judge_config\",\n", + " \"tasks\": {\n", + " \"qa\": {\n", + " \"type\": \"completion\",\n", + " \"params\": {\n", + " \"template\": {\n", + " \"messages\": [{\n", + " \"role\": \"system\",\n", + " \"content\": \"You are a helpful, respectful and honest assistant. 
\\nAnswers the following question as briefly as you can.\\n.\"\n", + " }, \n", + " { \n", + " \"role\": \"user\",\n", + " \"content\": \"Answer very briefly (no explanation) this question: {{item.question}}\"\n", + " }]\n", + " }\n", + " },\n", + " \"metrics\": {\n", + " \"accuracy\": {\n", + " \"type\": \"string-check\",\n", + " \"params\": {\n", + " \"check\": [\n", + " \"{{sample.output_text}}\",\n", + " \"contains\",\n", + " \"{{item.answer}}\"\n", + " ]\n", + " }\n", + " },\n", + " \"bleu\": {\n", + " \"type\": \"bleu\",\n", + " \"params\": {\n", + " \"references\": [\n", + " \"{{item.reference_answer}}\"\n", + " ]\n", + " }\n", + " },\n", + " \"accuracy-llm-judge\": {\n", + " \"type\": \"llm-judge\",\n", + " \"params\": {\n", + " \"model\": {\n", + " \"api_endpoint\": {\n", + " \"url\": \"https://integrate.api.nvidia.com/v1/chat/completions\",\n", + " \"model_id\": \"meta/llama-3.3-70b-instruct\",\n", + " \"api_key\": os.environ[\"NVIDIA_API_KEY\"]\n", + " }\n", + " },\n", + " \"template\": {\n", + " \"messages\": [\n", + " {\n", + " \"role\": \"system\",\n", + " \"content\": \"Your task is to evaluate the semantic similarity between two responses.\"\n", + " },\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": (\n", + " \"Respond in the following format SIMILARITY: 4. 
\"\n", + " \"The similarity should be a score between 0 and 10.\\n\\n\"\n", + " \"RESPONSE 1: {{item.reference_answer}}\\n\\n\"\n", + " \"RESPONSE 2: {{sample.output_text}}.\\n\\n\"\n", + " )\n", + " }\n", + " ]\n", + " },\n", + " \"scores\": {\n", + " \"similarity\": {\n", + " \"type\": \"int\",\n", + " \"parser\": {\n", + " \"type\": \"regex\",\n", + " \"pattern\": \"SIMILARITY: (\\\\d)\"\n", + " }\n", + " }\n", + " }\n", + " }\n", + " }\n", + " },\n", + " \"dataset\": {\n", + " \"files_url\": f\"hf://datasets/{NMS_NAMESPACE}/{DATASET_NAME}/llm_as_judge/math_dataset.csv\"\n", + " }\n", + " }\n", + " }\n", + "}\n", + "headers = {\n", + " \"accept\": \"application/json\",\n", + " \"Content-Type\": \"application/json\"\n", + "}\n", + "\n", + "resp = requests.post(config_url, json=payload, headers=headers)\n", + "pp(resp.json())" + ] + }, + { + "cell_type": "markdown", + "id": "f8b7b997-0a5b-489d-94b0-531be7d6270d", + "metadata": {}, + "source": [ + "**Launch Eval Job**" + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "id": "696c6ff9-f725-4990-81c1-775b6b662fa0", + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'created_at': '2025-07-08T20:38:04.820220',\n", + " 'updated_at': '2025-07-08T20:38:04.820223',\n", + " 'id': 'eval-MoXUyarWP3jBR8h6LqyhWU',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'description': None,\n", + " 'target': {'schema_version': '1.0',\n", + " 'id': 'eval-target-GaJYgnmDA4Ta2TN23sV9Wo',\n", + " 'description': None,\n", + " 'type_prefix': 'eval-target',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'project': None,\n", + " 'created_at': '2025-07-08T19:40:13.061228',\n", + " 'updated_at': '2025-07-08T19:40:13.061229',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'llama-chat-target',\n", + " 'type': 'model',\n", + " 'cached_outputs': None,\n", + " 'model': {'schema_version': '1.0',\n", + " 'id': 
'model-EXZQbF8ZXbjoogTvF9UFKr',\n", + " 'description': None,\n", + " 'type_prefix': 'model',\n", + " 'namespace': 'default',\n", + " 'project': None,\n", + " 'created_at': '2025-07-08T19:40:13.061012',\n", + " 'updated_at': '2025-07-08T19:40:13.061015',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'model-EXZQbF8ZXbjoogTvF9UFKr',\n", + " 'version_id': 'main',\n", + " 'version_tags': [],\n", + " 'spec': None,\n", + " 'artifact': None,\n", + " 'base_model': None,\n", + " 'api_endpoint': {'url': 'http://nim.test/v1/chat/completions',\n", + " 'model_id': 'meta/llama-3.1-8b-instruct',\n", + " 'api_key': None,\n", + " 'format': 'openai'},\n", + " 'peft': None,\n", + " 'prompt': None,\n", + " 'guardrails': None},\n", + " 'retriever': None,\n", + " 'rag': None,\n", + " 'rows': None,\n", + " 'dataset': None},\n", + " 'config': {'schema_version': '1.0',\n", + " 'id': 'eval-config-J9aERJok5jacrTdhYcyM1H',\n", + " 'description': None,\n", + " 'type_prefix': 'eval-config',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'project': None,\n", + " 'created_at': '2025-07-08T20:37:58.674883',\n", + " 'updated_at': '2025-07-08T20:37:58.674884',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'custom_llm_as_judge_config',\n", + " 'type': 'custom',\n", + " 'params': None,\n", + " 'tasks': {'qa': {'type': 'completion',\n", + " 'params': {'template': {'messages': [{'role': 'system',\n", + " 'content': 'You '\n", + " 'are '\n", + " 'a '\n", + " 'helpful, '\n", + " 'respectful '\n", + " 'and '\n", + " 'honest '\n", + " 'assistant. 
\\n'\n", + " 'Answers '\n", + " 'the '\n", + " 'following '\n", + " 'question '\n", + " 'as '\n", + " 'briefly '\n", + " 'as '\n", + " 'you '\n", + " 'can.\\n'\n", + " '.'},\n", + " {'role': 'user',\n", + " 'content': 'Answer '\n", + " 'very '\n", + " 'briefly '\n", + " '(no '\n", + " 'explanation) '\n", + " 'this '\n", + " 'question: '\n", + " '{{item.question}}'}]}},\n", + " 'metrics': {'accuracy': {'type': 'string-check',\n", + " 'params': {'check': ['{{sample.output_text}}',\n", + " 'contains',\n", + " '{{item.answer}}']}},\n", + " 'bleu': {'type': 'bleu',\n", + " 'params': {'references': ['{{item.reference_answer}}']}},\n", + " 'accuracy-llm-judge': {'type': 'llm-judge',\n", + " 'params': {'model': {'api_endpoint': {'url': 'https://integrate.api.nvidia.com/v1/chat/completions',\n", + " 'model_id': 'meta/llama-3.3-70b-instruct',\n", + " 'api_key': '******'}},\n", + " 'template': {'messages': [{'role': 'system',\n", + " 'content': 'Your '\n", + " 'task '\n", + " 'is '\n", + " 'to '\n", + " 'evaluate '\n", + " 'the '\n", + " 'semantic '\n", + " 'similarity '\n", + " 'between '\n", + " 'two '\n", + " 'responses.'},\n", + " {'role': 'user',\n", + " 'content': 'Respond '\n", + " 'in '\n", + " 'the '\n", + " 'following '\n", + " 'format '\n", + " 'SIMILARITY: '\n", + " '4. 
'\n", + " 'The '\n", + " 'similarity '\n", + " 'should '\n", + " 'be '\n", + " 'a '\n", + " 'score '\n", + " 'between '\n", + " '0 '\n", + " 'and '\n", + " '10.\\n'\n", + " '\\n'\n", + " 'RESPONSE '\n", + " '1: '\n", + " '{{item.reference_answer}}\\n'\n", + " '\\n'\n", + " 'RESPONSE '\n", + " '2: '\n", + " '{{sample.output_text}}.\\n'\n", + " '\\n'}]},\n", + " 'scores': {'similarity': {'type': 'int',\n", + " 'parser': {'type': 'regex',\n", + " 'pattern': 'SIMILARITY: '\n", + " '(\\\\d)'}}}}}},\n", + " 'dataset': {'schema_version': '1.0',\n", + " 'id': 'dataset-4cBEc7vpZh8aBPpuGpNoFD',\n", + " 'description': None,\n", + " 'type_prefix': None,\n", + " 'namespace': 'default',\n", + " 'project': None,\n", + " 'created_at': '2025-07-08T20:37:58.674690',\n", + " 'updated_at': '2025-07-08T20:37:58.674693',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'dataset-4cBEc7vpZh8aBPpuGpNoFD',\n", + " 'version_id': 'main',\n", + " 'version_tags': [],\n", + " 'format': None,\n", + " 'files_url': 'hf://datasets/nemo-eval-tutorial/llm_as_judge_data/llm_as_judge/math_dataset.csv',\n", + " 'hf_endpoint': None,\n", + " 'split': None,\n", + " 'limit': None}}},\n", + " 'groups': None},\n", + " 'result': None,\n", + " 'output_files_url': None,\n", + " 'status_details': {'message': None, 'task_status': {}, 'progress': None},\n", + " 'status': 'created',\n", + " 'project': None,\n", + " 'custom_fields': {},\n", + " 'ownership': None}\n" + ] + } + ], + "source": [ + "payload = {\n", + " \"namespace\": NMS_NAMESPACE,\n", + " \"target\": f\"{NMS_NAMESPACE}/llama-chat-target\",\n", + " \"config\": f\"{NMS_NAMESPACE}/custom_llm_as_judge_config\"\n", + "}\n", + "headers = {\n", + " \"accept\": \"application/json\",\n", + " \"Content-Type\": \"application/json\"\n", + "}\n", + "\n", + "resp = requests.post(job_url, json=payload, headers=headers)\n", + "llm_judge_eval_job_id = resp.json()[\"id\"]\n", + "pp(resp.json())" + ] + }, + { + "cell_type": "markdown", + "id": 
"6a53bbd6-0878-4a54-833b-50f928930943", + "metadata": {}, + "source": [ + "**Monitoring job status and results**" + ] + }, + { + "cell_type": "code", + "execution_count": 61, + "id": "8c91d6a2-3bf8-46ca-a169-4aad5289bd0a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'message': 'Job completed successfully.',\n", + " 'task_status': {'qa': 'completed'},\n", + " 'progress': 100.0}\n" + ] + } + ], + "source": [ + "resp = requests.get(f\"{NEMO_URL}/v1/evaluation/jobs/{llm_judge_eval_job_id}/status\")\n", + "pp(resp.json())" + ] + }, + { + "cell_type": "code", + "execution_count": 62, + "id": "210dff5c-53e0-4650-96b1-aa0f509cc8da", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'metrics': {'accuracy': {'scores': {'string-check': {'value': 1.0,\n", + " 'stats': {'count': 10,\n", + " 'sum': 10.0,\n", + " 'mean': 1.0}}}},\n", + " 'bleu': {'scores': {'sentence': {'value': 9.135501080023044,\n", + " 'stats': {'count': 10,\n", + " 'sum': 91.35501080023045,\n", + " 'mean': 9.135501080023044}},\n", + " 'corpus': {'value': 0.0}}},\n", + " 'accuracy-llm-judge': {'scores': {'similarity': {'value': 1.7,\n", + " 'stats': {'count': 10,\n", + " 'sum': 17.0,\n", + " 'mean': 1.7}}}}}}\n" + ] + } + ], + "source": [ + "resp = requests.get(f\"{NEMO_URL}/v1/evaluation/jobs/{llm_judge_eval_job_id}/results\")\n", + "pp(resp.json()['tasks']['qa'])" + ] + }, + { + "cell_type": "markdown", + "id": "c56efd95-df06-4ab1-b5ed-482d91587bca", + "metadata": {}, + "source": [ + "### 4.3 Tool Calling Evaluation" + ] + }, + { + "cell_type": "markdown", + "id": "12881c51-76a7-45c3-b634-91c112f480d1", + "metadata": { + "jupyter": { + "source_hidden": true + } + }, + "source": [ + "Required dataset format for a custom tool calling evaluation:\n", + "\n", + "```json\n", + "[\n", + " {\n", + " \"messages\": [\n", + " {\n", + " \"role\": \"user\",\n", + " \"content\": \"Find the area of a triangle with a 
base of 10 units and height of 5 units.\"\n", + " }\n", + " ],\n", + " \"tools\": [\n", + " {\n", + " \"type\": \"function\",\n", + " \"function\": {\n", + " \"name\": \"calculate_triangle_area\",\n", + " \"description\": \"Calculate the area of a triangle given its base and height.\",\n", + " \"parameters\": {\n", + " \"type\": \"object\",\n", + " \"properties\": {\n", + " \"base\": {\n", + " \"type\": \"integer\",\n", + " \"description\": \"The base of the triangle.\"\n", + " },\n", + " \"height\": {\n", + " \"type\": \"integer\",\n", + " \"description\": \"The height of the triangle.\"\n", + " },\n", + " \"unit\": {\n", + " \"type\": \"string\",\n", + " \"description\": \"The unit of measure (defaults to \\\"units\\\" if not specified)\"\n", + " }\n", + " },\n", + " \"required\": [\n", + " \"base\",\n", + " \"height\"\n", + " ]\n", + " }\n", + " }\n", + " }\n", + " ],\n", + " \"tool_calls\": [\n", + " {\n", + " \"function\": {\n", + " \"name\": \"calculate_triangle_area\",\n", + " \"arguments\": {\n", + " \"base\": 10,\n", + " \"height\": 5,\n", + " \"unit\": \"units\"\n", + " }\n", + " }\n", + " }\n", + " ]\n", + " }\n", + "]\n", + "```\n" + ] + }, + { + "cell_type": "markdown", + "id": "92321a60-1fc7-4cde-b0cb-4e35e5d9400e", + "metadata": {}, + "source": [ + "**Upload Custom Tool Calling Dataset to Data Store**" + ] + }, + { + "cell_type": "code", + "execution_count": 95, + "id": "125aee37-6df4-4504-984a-25be17e70f67", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "CommitInfo(commit_url='', commit_message='Upload aiva_tool_call.jsonl with huggingface_hub', commit_description='', oid='8ba99b4e0b4b2224b059c6cacda3b86d724c406e', pr_url=None, repo_url=RepoUrl('', endpoint='https://huggingface.co', repo_type='model', repo_id=''), pr_revision=None, pr_num=None)" + ] + }, + "execution_count": 95, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "DATASET_NAME = \"tool_call_data\"\n", + "repo_id = 
setup_dataset_repo(HF_API, NMS_NAMESPACE, DATASET_NAME, NEMO_URL)\n", + "HF_API.upload_file(\n", + " path_or_fileobj='./eval_dataset/tool_call_data/aiva_tool_call.jsonl',\n", + " path_in_repo=\"aiva_tool_call.jsonl\",\n", + " repo_id=repo_id,\n", + " repo_type='dataset',\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "2414cd57-1ad8-4722-b8e2-ff15186db0dc", + "metadata": {}, + "source": [ + "**Create Eval Config**" + ] + }, + { + "cell_type": "code", + "execution_count": 96, + "id": "16ce4448-b3c6-4b4f-9cb2-7811ce9533f9", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'created_at': '2025-07-09T02:18:55.955723',\n", + " 'updated_at': '2025-07-09T02:18:55.955724',\n", + " 'name': 'tool-call-eval-config',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'type': 'custom',\n", + " 'tasks': {'custom-tool-calling': {'type': 'chat-completion',\n", + " 'params': {'template': {'messages': '{{ '\n", + " 'item.messages '\n", + " '| '\n", + " 'tojson}}',\n", + " 'tools': '{{ '\n", + " 'item.tools '\n", + " '| tojson '\n", + " '}}',\n", + " 'tool_choice': 'auto'}},\n", + " 'metrics': {'tool-calling-accuracy': {'type': 'tool-calling',\n", + " 'params': {'tool_calls_ground_truth': '{{ '\n", + " 'item.tool_calls '\n", + " '| '\n", + " 'tojson '\n", + " '}}'}}},\n", + " 'dataset': {'schema_version': '1.0',\n", + " 'id': 'dataset-FHLbB4tDWWXwkeQShawtGg',\n", + " 'namespace': 'default',\n", + " 'created_at': '2025-07-09T02:18:55.955557',\n", + " 'updated_at': '2025-07-09T02:18:55.955560',\n", + " 'custom_fields': {},\n", + " 'name': 'dataset-FHLbB4tDWWXwkeQShawtGg',\n", + " 'version_id': 'main',\n", + " 'version_tags': [],\n", + " 'files_url': 'hf://datasets/nemo-eval-tutorial/tool_call_data/aiva_tool_call.jsonl'}}},\n", + " 'id': 'eval-config-N6PkTGVcZNP8ho4LwHYxqg',\n", + " 'custom_fields': {}}\n" + ] + } + ], + "source": [ + "config_payload = {\n", + " \"type\": \"custom\",\n", + " \"namespace\": NMS_NAMESPACE,\n", + 
" \"name\": \"tool-call-eval-config\",\n", + " \"tasks\": {\n", + " \"custom-tool-calling\": {\n", + " \"type\": \"chat-completion\",\n", + " \"dataset\": {\n", + " \"files_url\": f\"hf://datasets/{NMS_NAMESPACE}/{DATASET_NAME}/aiva_tool_call.jsonl\",\n", + " },\n", + " \"params\": {\n", + " \"template\": {\n", + " \"messages\": \"{{ item.messages | tojson}}\",\n", + " \"tools\": \"{{ item.tools | tojson }}\",\n", + " \"tool_choice\": \"auto\"\n", + " }\n", + " },\n", + " \"metrics\": {\n", + " \"tool-calling-accuracy\": {\n", + " \"type\": \"tool-calling\",\n", + " \"params\": {\"tool_calls_ground_truth\": \"{{ item.tool_calls | tojson }}\"}\n", + " }\n", + " }\n", + " }\n", + " }\n", + "}\n", + "resp = requests.post(config_url, json=config_payload, headers=headers)\n", + "pp(resp.json())" + ] + }, + { + "cell_type": "markdown", + "id": "de5b228c-7502-48eb-a652-0c98f905b88f", + "metadata": {}, + "source": [ + "**Launch Eval Job**" + ] + }, + { + "cell_type": "code", + "execution_count": 97, + "id": "d8db6b64-9a2e-4da8-bcc1-c7fe9f4651a3", + "metadata": { + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'created_at': '2025-07-09T02:19:10.833300',\n", + " 'updated_at': '2025-07-09T02:19:10.833303',\n", + " 'id': 'eval-P6pSKQ2MhjQGFdT85yBBqm',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'description': None,\n", + " 'target': {'schema_version': '1.0',\n", + " 'id': 'eval-target-GaJYgnmDA4Ta2TN23sV9Wo',\n", + " 'description': None,\n", + " 'type_prefix': 'eval-target',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'project': None,\n", + " 'created_at': '2025-07-08T19:40:13.061228',\n", + " 'updated_at': '2025-07-08T19:40:13.061229',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'llama-chat-target',\n", + " 'type': 'model',\n", + " 'cached_outputs': None,\n", + " 'model': {'schema_version': '1.0',\n", + " 'id': 
'model-EXZQbF8ZXbjoogTvF9UFKr',\n", + " 'description': None,\n", + " 'type_prefix': 'model',\n", + " 'namespace': 'default',\n", + " 'project': None,\n", + " 'created_at': '2025-07-08T19:40:13.061012',\n", + " 'updated_at': '2025-07-08T19:40:13.061015',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'model-EXZQbF8ZXbjoogTvF9UFKr',\n", + " 'version_id': 'main',\n", + " 'version_tags': [],\n", + " 'spec': None,\n", + " 'artifact': None,\n", + " 'base_model': None,\n", + " 'api_endpoint': {'url': 'http://nim.test/v1/chat/completions',\n", + " 'model_id': 'meta/llama-3.1-8b-instruct',\n", + " 'api_key': None,\n", + " 'format': 'openai'},\n", + " 'peft': None,\n", + " 'prompt': None,\n", + " 'guardrails': None},\n", + " 'retriever': None,\n", + " 'rag': None,\n", + " 'rows': None,\n", + " 'dataset': None},\n", + " 'config': {'schema_version': '1.0',\n", + " 'id': 'eval-config-N6PkTGVcZNP8ho4LwHYxqg',\n", + " 'description': None,\n", + " 'type_prefix': 'eval-config',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'project': None,\n", + " 'created_at': '2025-07-09T02:18:55.955723',\n", + " 'updated_at': '2025-07-09T02:18:55.955724',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'tool-call-eval-config',\n", + " 'type': 'custom',\n", + " 'params': None,\n", + " 'tasks': {'custom-tool-calling': {'type': 'chat-completion',\n", + " 'params': {'template': {'messages': '{{ '\n", + " 'item.messages '\n", + " '| '\n", + " 'tojson}}',\n", + " 'tools': '{{ '\n", + " 'item.tools '\n", + " '| '\n", + " 'tojson '\n", + " '}}',\n", + " 'tool_choice': 'auto'}},\n", + " 'metrics': {'tool-calling-accuracy': {'type': 'tool-calling',\n", + " 'params': {'tool_calls_ground_truth': '{{ '\n", + " 'item.tool_calls '\n", + " '| '\n", + " 'tojson '\n", + " '}}'}}},\n", + " 'dataset': {'schema_version': '1.0',\n", + " 'id': 'dataset-FHLbB4tDWWXwkeQShawtGg',\n", + " 'description': None,\n", + " 'type_prefix': None,\n", + " 'namespace': 
'default',\n", + " 'project': None,\n", + " 'created_at': '2025-07-09T02:18:55.955557',\n", + " 'updated_at': '2025-07-09T02:18:55.955560',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'dataset-FHLbB4tDWWXwkeQShawtGg',\n", + " 'version_id': 'main',\n", + " 'version_tags': [],\n", + " 'format': None,\n", + " 'files_url': 'hf://datasets/nemo-eval-tutorial/tool_call_data/aiva_tool_call.jsonl',\n", + " 'hf_endpoint': None,\n", + " 'split': None,\n", + " 'limit': None}}},\n", + " 'groups': None},\n", + " 'result': None,\n", + " 'output_files_url': None,\n", + " 'status_details': {'message': None, 'task_status': {}, 'progress': None},\n", + " 'status': 'created',\n", + " 'project': None,\n", + " 'custom_fields': {},\n", + " 'ownership': None}\n" + ] + } + ], + "source": [ + "payload = {\n", + " \"namespace\": NMS_NAMESPACE,\n", + " \"target\": f\"{NMS_NAMESPACE}/llama-chat-target\",\n", + " \"config\": f\"{NMS_NAMESPACE}/tool-call-eval-config\"\n", + "}\n", + "headers = {\n", + " \"accept\": \"application/json\",\n", + " \"Content-Type\": \"application/json\"\n", + "}\n", + "\n", + "resp = requests.post(job_url, json=payload, headers=headers)\n", + "tool_call_eval_job_id = resp.json()[\"id\"]\n", + "pp(resp.json())" + ] + }, + { + "cell_type": "markdown", + "id": "e023fe25-407f-4f9d-a699-7c28bbcb7324", + "metadata": {}, + "source": [ + "**Monitoring job status and results**" + ] + }, + { + "cell_type": "code", + "execution_count": 99, + "id": "b59180c3-a1ec-4a9c-abc0-ca73b1a5150d", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'message': 'Job completed successfully.',\n", + " 'task_status': {'custom-tool-calling': 'completed'},\n", + " 'progress': 100.0}\n" + ] + } + ], + "source": [ + "resp = requests.get(f\"{NEMO_URL}/v1/evaluation/jobs/{tool_call_eval_job_id}/status\")\n", + "pp(resp.json())" + ] + }, + { + "cell_type": "code", + "execution_count": 100, + "id": 
"35e63334-6702-4ec0-9188-45b6bf722694", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'tool-calling-accuracy': {'scores': {'function_name_accuracy': {'value': 0.9,\n", + " 'stats': {'count': 10,\n", + " 'sum': 9.0,\n", + " 'mean': 0.9}},\n", + " 'function_name_and_args_accuracy': {'value': 0.0,\n", + " 'stats': {'count': 10,\n", + " 'sum': 0.0,\n", + " 'mean': 0.0}}}}}\n" + ] + } + ], + "source": [ + "resp = requests.get(f\"{NEMO_URL}/v1/evaluation/jobs/{tool_call_eval_job_id}/results\")\n", + "pp(resp.json()['tasks']['custom-tool-calling']['metrics'])" + ] + }, + { + "cell_type": "markdown", + "id": "feb170f5-3c4e-42a1-9262-878ad7d8788d", + "metadata": {}, + "source": [ + "## 5. Retriever Pipeline Evaluation" + ] + }, + { + "cell_type": "markdown", + "id": "4d19c48e-9da1-45c0-a2b1-2b97e1c515cc", + "metadata": {}, + "source": [ + "### 5.1 Deploy Retriever Models" + ] + }, + { + "cell_type": "markdown", + "id": "61950b31-112d-404c-a0f1-3a795d71f3da", + "metadata": {}, + "source": [ + "To evaluate retriever pipelines, retriever models must be deployed locally. For this tutorial, we will create a retriever pipeline with both embedding and reranking models. Specifically, we will deploy two retriever NIMs for `llama-3.2-nv-embedqa-1b-v2` and `llama-3.2-nv-rerankqa-1b-v2`, respectively, using Docker." + ] + }, + { + "cell_type": "markdown", + "id": "f073bbc8-756b-472c-b19c-dadb870f668a", + "metadata": {}, + "source": [ + "First, let's identify a free GPU to deploy the retriever models." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "ae46e61b-ebf8-48b8-ab2e-365b7c346167", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Wed Jul 9 06:27:12 2025 \n", + "+---------------------------------------------------------------------------------------+\n", + "| NVIDIA-SMI 535.183.06 Driver Version: 535.183.06 CUDA Version: 12.2 |\n", + "|-----------------------------------------+----------------------+----------------------+\n", + "| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |\n", + "| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |\n", + "| | | MIG M. |\n", + "|=========================================+======================+======================|\n", + "| 0 NVIDIA A100 80GB PCIe On | 00000002:00:01.0 Off | 0 |\n", + "| N/A 34C P0 43W / 300W | 0MiB / 81920MiB | 0% Default |\n", + "| | | Disabled |\n", + "+-----------------------------------------+----------------------+----------------------+\n", + "| 1 NVIDIA A100 80GB PCIe On | 00000002:00:02.0 Off | 0 |\n", + "| N/A 36C P0 62W / 300W | 70969MiB / 81920MiB | 0% Default |\n", + "| | | Disabled |\n", + "+-----------------------------------------+----------------------+----------------------+\n", + " \n", + "+---------------------------------------------------------------------------------------+\n", + "| Processes: |\n", + "| GPU GI CI PID Type Process name GPU Memory |\n", + "| ID ID Usage |\n", + "|=======================================================================================|\n", + "| 1 N/A N/A 79075 C /opt/nim/llm/.venv/bin/python3 70958MiB |\n", + "+---------------------------------------------------------------------------------------+\n" + ] + } + ], + "source": [ + "!nvidia-smi" + ] + }, + { + "cell_type": "markdown", + "id": "4c8fed34-e457-4c76-ba20-a6a55d7334c5", + "metadata": {}, + "source": [ + "Then, we can deploy the `llama-3.2-nv-embedqa-1b-v2` embedding NIM (replace `` with 
your NGC API Key below):" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ec69ee62-b42f-4d12-97a6-0c88ad0cfd2a", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "%%bash\n", + "export NGC_API_KEY=\n", + "echo \"$NGC_API_KEY\" | docker login nvcr.io --username '$oauthtoken' --password-stdin\n", + "export NIM_MODEL_NAME=nvidia/llama-3.2-nv-embedqa-1b-v2\n", + "export CONTAINER_NAME=$(basename $NIM_MODEL_NAME)\n", + "\n", + "# Choose a NIM Image from NGC\n", + "export IMG_NAME=\"nvcr.io/nim/$NIM_MODEL_NAME:1.5.0\"\n", + "\n", + "# Choose a path on your system to cache the downloaded models\n", + "export LOCAL_NIM_CACHE=~/.cache/nim\n", + "mkdir -p \"$LOCAL_NIM_CACHE\"\n", + "\n", + "# Start the NIM\n", + "docker run -d --rm --name=$CONTAINER_NAME \\\n", + " --runtime=nvidia \\\n", + " --gpus '\"device=0\"' \\\n", + " --shm-size=16GB \\\n", + " -e NGC_API_KEY \\\n", + " -v \"$LOCAL_NIM_CACHE:/opt/nim/.cache\" \\\n", + " -u $(id -u) \\\n", + " -p 8000:8000 \\\n", + " $IMG_NAME" + ] + }, + { + "cell_type": "markdown", + "id": "48cda92d-4b0b-4994-a962-95d89e1caa47", + "metadata": {}, + "source": [ + "Next, we will deploy the `llama-3.2-nv-rerankqa-1b-v2` reranking NIM:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "212ef05a-1446-4e7b-bbeb-2ac96c3f3d54", + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "%%bash\n", + "export NIM_MODEL_NAME=nvidia/llama-3.2-nv-rerankqa-1b-v2\n", + "export CONTAINER_NAME=$(basename $NIM_MODEL_NAME)\n", + "\n", + "# Choose a NIM Image from NGC\n", + "export IMG_NAME=\"nvcr.io/nim/$NIM_MODEL_NAME:1.3.0\"\n", + "\n", + "# Choose a path on your system to cache the downloaded models\n", + "export LOCAL_NIM_CACHE=~/.cache/nim\n", + "mkdir -p \"$LOCAL_NIM_CACHE\"\n", + "\n", + "# Start the NIM\n", + "docker run -d --rm --name=$CONTAINER_NAME \\\n", + " --runtime=nvidia \\\n", + " --gpus '\"device=0\"' \\\n", + " --shm-size=16GB \\\n", + " -e NGC_API_KEY 
\\n", + " -v \"$LOCAL_NIM_CACHE:/opt/nim/.cache\" \\\n", + " -u $(id -u) \\\n", + " -p 8001:8000 \\\n", + " $IMG_NAME" + ] + }, + { + "cell_type": "markdown", + "id": "e783806e-beac-4133-8027-6e78266c440b", + "metadata": {}, + "source": [ + "Get the host IP address, then specify the API endpoints for the embedding and reranking NIMs:" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "a717b839-9db9-4090-8478-6b1dff70bf94", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1.0.0.0 via 172.27.16.1 dev ens3 src 172.27.20.120 uid 1000 \n" + ] + } + ], + "source": [ + "! ip route get 1 | grep src" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "7710cc1a-b377-4f78-9d47-30be94361d65", + "metadata": {}, + "outputs": [], + "source": [ + "embed_url = \"http://172.27.20.120:8000/v1/embeddings\"\n", + "embed_model_name = \"nvidia/llama-3.2-nv-embedqa-1b-v2\"\n", + "rerank_url = \"http://172.27.20.120:8001/v1/ranking\"\n", + "rerank_model_name = \"nvidia/llama-3.2-nv-rerankqa-1b-v2\"" + ] + }, + { + "cell_type": "markdown", + "id": "7aaa6fcf-9941-4e72-8e32-a88480036bb8", + "metadata": {}, + "source": [ + "### 5.2 Set up Milvus Vector Database" + ] + }, + { + "cell_type": "markdown", + "id": "504ddb82-b4eb-4041-b95c-52d3981dbbd5", + "metadata": {}, + "source": [ + "To run retriever or RAG evaluations, you must first enable the Milvus document store by setting `milvus.enabled: true` in `values.yaml`, and then upgrade your Helm deployment to apply the change." 
+ ] + }, + { + "cell_type": "markdown", + "id": "3833f8ae-acc4-4e5c-a468-65bdd579db95", + "metadata": {}, + "source": [ + "```yaml\n", + "evaluator:\n", + " enabled: true\n", + " milvus:\n", + " enabled: true\n", + "```" + ] + }, + { + "cell_type": "markdown", + "id": "e4624b0a-8478-4ca9-8f9e-48295b517916", + "metadata": {}, + "source": [ + "After the upgrade, you should see a Milvus pod running:" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "id": "a47709ee-858c-4feb-83c8-a28f15a741cd", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "NAME READY STATUS RESTARTS AGE\n", + "model-downloader-meta-llama-3-1-8b-instruct-2-0-28trx 0/1 Completed 0 29m\n", + "model-downloader-meta-llama-3-2-1b-instruct-2-0-b6scn 0/1 Completed 0 29m\n", + "modeldeployment-meta-llama-3-1-8b-instruct-6b64d56fdc-slctn 1/1 Running 0 28m\n", + "nemo-argo-workflows-server-655f8d755-svgn2 1/1 Running 0 30m\n", + "nemo-argo-workflows-workflow-controller-8f8877cd4-8t2tf 1/1 Running 0 30m\n", + "nemo-customizer-5d8554fcf6-rhwfp 1/1 Running 2 (29m ago) 30m\n", + "nemo-customizerdb-0 1/1 Running 0 30m\n", + "nemo-data-store-795ccbb97b-nwcf2 1/1 Running 0 30m\n", + "nemo-deployment-management-646cc67c-l67lq 1/1 Running 0 30m\n", + "nemo-entity-store-6bfd7c4bd9-p4s4b 1/1 Running 0 30m\n", + "nemo-entity-storedb-0 1/1 Running 0 30m\n", + "nemo-evaluator-68c67f6798-f4sqh 1/1 Running 0 30m\n", + "nemo-evaluatordb-0 1/1 Running 0 30m\n", + "nemo-guardrails-5985f8c6d4-bkkq2 1/1 Running 0 30m\n", + "nemo-milvus-standalone-57cc6c46fc-5rctk 1/1 Running 0 30m\n", + "nemo-nemo-operator-controller-manager-59c98d97f9-7gfsw 2/2 Running 0 30m\n", + "nemo-nim-operator-765cbf686b-w9n6c 1/1 Running 0 30m\n", + "nemo-nim-proxy-6fbccbfb45-8qvmz 1/1 Running 0 30m\n", + "nemo-opentelemetry-collector-6f46646fd9-5tnrf 1/1 Running 0 30m\n", + "nemo-postgresql-0 1/1 Running 0 30m\n" + ] + } + ], + "source": [ + "! 
kubectl get pod" + ] + }, + { + "cell_type": "markdown", + "id": "94301926-71c9-4590-b5fd-1208f0917422", + "metadata": {}, + "source": [ + "### 5.3 Evaluate Embedding Pipeline on FIQA Dataset" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "id": "3fb1b610-d5be-4f1f-8536-0aa540646096", + "metadata": {}, + "outputs": [], + "source": [ + "target_payload = {\n", + " \"type\": \"retriever\",\n", + " \"name\": \"embed-target\",\n", + " \"namespace\": NMS_NAMESPACE,\n", + " \"retriever\": {\n", + " \"pipeline\": {\n", + " \"query_embedding_model\": {\n", + " \"api_endpoint\": {\n", + " \"url\": embed_url,\n", + " \"model_id\": embed_model_name,\n", + " }\n", + " },\n", + " \"index_embedding_model\": {\n", + " \"api_endpoint\": {\n", + " \"url\": embed_url,\n", + " \"model_id\": embed_model_name,\n", + " }\n", + " },\n", + " \"top_k\": 10\n", + " }\n", + " }\n", + "}\n", + "\n", + "config_payload = {\n", + " \"type\": \"retriever\",\n", + " \"name\": \"fiqa-config\",\n", + " \"namespace\": NMS_NAMESPACE,\n", + " \"tasks\": {\n", + " \"my-beir-task\": {\n", + " \"type\": \"beir\",\n", + " \"dataset\": {\n", + " \"files_url\": \"file://fiqa/\"\n", + " },\n", + " \"metrics\": {\n", + " \"recall_5\": {\"type\": \"recall_5\"},\n", + " \"ndcg_cut_5\": {\"type\": \"ndcg_cut_5\"},\n", + " \"recall_10\": {\"type\": \"recall_10\"},\n", + " \"ndcg_cut_10\": {\"type\": \"ndcg_cut_10\"}\n", + " }\n", + " }\n", + " }\n", + "}\n", + "\n", + "resp1 = requests.post(target_url, json=target_payload, headers=headers)\n", + "resp2 = requests.post(config_url, json=config_payload, headers=headers)" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "id": "7fc2e8c6-abc0-4197-a1c6-ff9b53c06f98", + "metadata": { + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'created_at': '2025-07-09T06:45:27.786401',\n", + " 'updated_at': '2025-07-09T06:45:27.786403',\n", + " 'id': 
'eval-5icjZHn7q6YERCtUCA19bW',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'description': None,\n", + " 'target': {'schema_version': '1.0',\n", + " 'id': 'eval-target-5Upwjn3JL3NPV8sGacQyWW',\n", + " 'description': None,\n", + " 'type_prefix': 'eval-target',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'project': None,\n", + " 'created_at': '2025-07-09T06:44:21.108017',\n", + " 'updated_at': '2025-07-09T06:44:21.108018',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'embed-target',\n", + " 'type': 'retriever',\n", + " 'cached_outputs': None,\n", + " 'model': None,\n", + " 'retriever': {'pipeline': {'query_embedding_model': {'schema_version': '1.0',\n", + " 'id': 'model-JsfNnspcixm9UusYGSD1iU',\n", + " 'description': None,\n", + " 'type_prefix': 'model',\n", + " 'namespace': 'default',\n", + " 'project': None,\n", + " 'created_at': '2025-07-09T06:44:21.107756',\n", + " 'updated_at': '2025-07-09T06:44:21.107758',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'model-JsfNnspcixm9UusYGSD1iU',\n", + " 'version_id': 'main',\n", + " 'version_tags': [],\n", + " 'spec': None,\n", + " 'artifact': None,\n", + " 'base_model': None,\n", + " 'api_endpoint': {'url': 'http://172.27.20.120:8000/v1/embeddings',\n", + " 'model_id': 'nvidia/llama-3.2-nv-embedqa-1b-v2',\n", + " 'api_key': None,\n", + " 'format': 'nim'},\n", + " 'peft': None,\n", + " 'prompt': None,\n", + " 'guardrails': None},\n", + " 'index_embedding_model': {'schema_version': '1.0',\n", + " 'id': 'model-HKNVJY2a6K6nwJAaZ6VnzR',\n", + " 'description': None,\n", + " 'type_prefix': 'model',\n", + " 'namespace': 'default',\n", + " 'project': None,\n", + " 'created_at': '2025-07-09T06:44:21.107825',\n", + " 'updated_at': '2025-07-09T06:44:21.107825',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'model-HKNVJY2a6K6nwJAaZ6VnzR',\n", + " 'version_id': 'main',\n", + " 'version_tags': [],\n", + " 'spec': None,\n", + " 'artifact': 
None,\n", + " 'base_model': None,\n", + " 'api_endpoint': {'url': 'http://172.27.20.120:8000/v1/embeddings',\n", + " 'model_id': 'nvidia/llama-3.2-nv-embedqa-1b-v2',\n", + " 'api_key': None,\n", + " 'format': 'nim'},\n", + " 'peft': None,\n", + " 'prompt': None,\n", + " 'guardrails': None},\n", + " 'reranker_model': None,\n", + " 'top_k': 10},\n", + " 'cached_outputs': None},\n", + " 'rag': None,\n", + " 'rows': None,\n", + " 'dataset': None},\n", + " 'config': {'schema_version': '1.0',\n", + " 'id': 'eval-config-KsKTD5g5GbS4SGYS3VeGJJ',\n", + " 'description': None,\n", + " 'type_prefix': 'eval-config',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'project': None,\n", + " 'created_at': '2025-07-09T06:45:23.013022',\n", + " 'updated_at': '2025-07-09T06:45:23.013023',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'fiqa-config',\n", + " 'type': 'retriever',\n", + " 'params': None,\n", + " 'tasks': {'my-beir-task': {'type': 'beir',\n", + " 'params': None,\n", + " 'metrics': {'recall_5': {'type': 'recall_5',\n", + " 'params': None},\n", + " 'ndcg_cut_5': {'type': 'ndcg_cut_5',\n", + " 'params': None},\n", + " 'recall_10': {'type': 'recall_10',\n", + " 'params': None},\n", + " 'ndcg_cut_10': {'type': 'ndcg_cut_10',\n", + " 'params': None}},\n", + " 'dataset': {'schema_version': '1.0',\n", + " 'id': 'dataset-WueQfPh9cr3WpU2ySmpf1k',\n", + " 'description': None,\n", + " 'type_prefix': None,\n", + " 'namespace': 'default',\n", + " 'project': None,\n", + " 'created_at': '2025-07-09T06:45:23.012836',\n", + " 'updated_at': '2025-07-09T06:45:23.012837',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'dataset-WueQfPh9cr3WpU2ySmpf1k',\n", + " 'version_id': 'main',\n", + " 'version_tags': [],\n", + " 'format': None,\n", + " 'files_url': 'file://fiqa/',\n", + " 'hf_endpoint': None,\n", + " 'split': None,\n", + " 'limit': None}}},\n", + " 'groups': None},\n", + " 'result': None,\n", + " 'output_files_url': None,\n", + " 
'status_details': {'message': None, 'task_status': {}, 'progress': None},\n", + " 'status': 'created',\n", + " 'project': None,\n", + " 'custom_fields': {},\n", + " 'ownership': None}\n" + ] + } + ], + "source": [ + "payload = {\n", + " \"namespace\": NMS_NAMESPACE,\n", + " \"target\": f\"{NMS_NAMESPACE}/embed-target\",\n", + " \"config\": f\"{NMS_NAMESPACE}/fiqa-config\"\n", + "}\n", + "resp = requests.post(job_url, json=payload, headers=headers)\n", + "embed_eval_job_id = resp.json()[\"id\"]\n", + "pp(resp.json())" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "id": "40676500-2842-4646-aa96-a10d59fd0183", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'message': 'Job completed successfully',\n", + " 'task_status': {'my-beir-task': 'completed'},\n", + " 'progress': 100.0}\n" + ] + } + ], + "source": [ + "resp = requests.get(f\"{NEMO_URL}/v1/evaluation/jobs/{embed_eval_job_id}/status\")\n", + "pp(resp.json())" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "id": "1cef2186-f3b7-4a97-ac39-8fa42441fa39", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'evaluation': {'scores': {'recall_10': {'value': 0.5984280594234299},\n", + " 'ndcg_cut_10': {'value': 0.5280203494315917},\n", + " 'ndcg_cut_5': {'value': 0.5054691655963462},\n", + " 'recall_5': {'value': 0.5225860130952724}}}}\n" + ] + } + ], + "source": [ + "resp = requests.get(f\"{NEMO_URL}/v1/evaluation/jobs/{embed_eval_job_id}/results\")\n", + "pp(resp.json()['groups']['evaluation']['metrics'])" + ] + }, + { + "cell_type": "markdown", + "id": "a06bb86f-7bed-4eb4-af73-d1d5c487a53d", + "metadata": {}, + "source": [ + "### 5.4 Evaluate Embedding + Reranking Pipeline on FIQA Dataset" + ] + }, + { + "cell_type": "markdown", + "id": "971b72ae-b8d1-473f-8e9f-13778b1958cf", + "metadata": {}, + "source": [ + "First, let's create an evaluation target for the Embedding + Reranking pipeline:" + ] + }, + { 
+ "cell_type": "code", + "execution_count": 35, + "id": "1ac9bde2-ca1c-49a9-9b85-d373a72e2c97", + "metadata": {}, + "outputs": [], + "source": [ + "target_payload = {\n", + " \"type\": \"retriever\",\n", + " \"name\": \"embed-rerank-target\",\n", + " \"namespace\": NMS_NAMESPACE,\n", + " \"retriever\": {\n", + " \"pipeline\": {\n", + " \"query_embedding_model\": {\n", + " \"api_endpoint\": {\n", + " \"url\": embed_url,\n", + " \"model_id\": embed_model_name,\n", + " }\n", + " },\n", + " \"index_embedding_model\": {\n", + " \"api_endpoint\": {\n", + " \"url\": embed_url,\n", + " \"model_id\": embed_model_name,\n", + " }\n", + " },\n", + " \"reranker_model\": {\n", + " \"api_endpoint\": {\n", + " \"url\": rerank_url,\n", + " \"model_id\":rerank_model_name,\n", + " }\n", + " },\n", + " \"top_k\": 10\n", + " }\n", + " }\n", + "}\n", + "resp = requests.post(target_url, json=target_payload, headers=headers)" + ] + }, + { + "cell_type": "markdown", + "id": "d872a20b-6531-44f2-bf75-b148976f0eb9", + "metadata": {}, + "source": [ + "Then we can launch the eval job for the embed + rerank pipeline on FIQA data:" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "id": "5db765cf-1f83-46e1-b9e3-c33f0cc3de18", + "metadata": { + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'created_at': '2025-07-09T07:06:34.280669',\n", + " 'updated_at': '2025-07-09T07:06:34.280671',\n", + " 'id': 'eval-2FJVarertHH2SNA3mYaqmD',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'description': None,\n", + " 'target': {'schema_version': '1.0',\n", + " 'id': 'eval-target-T4fCAFx1yhgGP3kBfEBzxb',\n", + " 'description': None,\n", + " 'type_prefix': 'eval-target',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'project': None,\n", + " 'created_at': '2025-07-09T07:06:31.817357',\n", + " 'updated_at': '2025-07-09T07:06:31.817358',\n", + " 'custom_fields': {},\n", + " 'ownership': 
None,\n", + " 'name': 'embed-rerank-target',\n", + " 'type': 'retriever',\n", + " 'cached_outputs': None,\n", + " 'model': None,\n", + " 'retriever': {'pipeline': {'query_embedding_model': {'schema_version': '1.0',\n", + " 'id': 'model-AEyhZfkQZdQUw938n7Gw4q',\n", + " 'description': None,\n", + " 'type_prefix': 'model',\n", + " 'namespace': 'default',\n", + " 'project': None,\n", + " 'created_at': '2025-07-09T07:06:31.816947',\n", + " 'updated_at': '2025-07-09T07:06:31.816949',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'model-AEyhZfkQZdQUw938n7Gw4q',\n", + " 'version_id': 'main',\n", + " 'version_tags': [],\n", + " 'spec': None,\n", + " 'artifact': None,\n", + " 'base_model': None,\n", + " 'api_endpoint': {'url': 'http://172.27.20.120:8000/v1/embeddings',\n", + " 'model_id': 'nvidia/llama-3.2-nv-embedqa-1b-v2',\n", + " 'api_key': None,\n", + " 'format': 'nim'},\n", + " 'peft': None,\n", + " 'prompt': None,\n", + " 'guardrails': None},\n", + " 'index_embedding_model': {'schema_version': '1.0',\n", + " 'id': 'model-C5tmcE6tZECDRHN3PuvEfk',\n", + " 'description': None,\n", + " 'type_prefix': 'model',\n", + " 'namespace': 'default',\n", + " 'project': None,\n", + " 'created_at': '2025-07-09T07:06:31.816996',\n", + " 'updated_at': '2025-07-09T07:06:31.816996',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'model-C5tmcE6tZECDRHN3PuvEfk',\n", + " 'version_id': 'main',\n", + " 'version_tags': [],\n", + " 'spec': None,\n", + " 'artifact': None,\n", + " 'base_model': None,\n", + " 'api_endpoint': {'url': 'http://172.27.20.120:8000/v1/embeddings',\n", + " 'model_id': 'nvidia/llama-3.2-nv-embedqa-1b-v2',\n", + " 'api_key': None,\n", + " 'format': 'nim'},\n", + " 'peft': None,\n", + " 'prompt': None,\n", + " 'guardrails': None},\n", + " 'reranker_model': {'schema_version': '1.0',\n", + " 'id': 'model-LzMLhQFk6bzv4vFAdRNMok',\n", + " 'description': None,\n", + " 'type_prefix': 'model',\n", + " 'namespace': 'default',\n", + 
" 'project': None,\n", + " 'created_at': '2025-07-09T07:06:31.817016',\n", + " 'updated_at': '2025-07-09T07:06:31.817016',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'model-LzMLhQFk6bzv4vFAdRNMok',\n", + " 'version_id': 'main',\n", + " 'version_tags': [],\n", + " 'spec': None,\n", + " 'artifact': None,\n", + " 'base_model': None,\n", + " 'api_endpoint': {'url': 'http://172.27.20.120:8001/v1/ranking',\n", + " 'model_id': 'nvidia/llama-3.2-nv-rerankqa-1b-v2',\n", + " 'api_key': None,\n", + " 'format': 'nim'},\n", + " 'peft': None,\n", + " 'prompt': None,\n", + " 'guardrails': None},\n", + " 'top_k': 10},\n", + " 'cached_outputs': None},\n", + " 'rag': None,\n", + " 'rows': None,\n", + " 'dataset': None},\n", + " 'config': {'schema_version': '1.0',\n", + " 'id': 'eval-config-KsKTD5g5GbS4SGYS3VeGJJ',\n", + " 'description': None,\n", + " 'type_prefix': 'eval-config',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'project': None,\n", + " 'created_at': '2025-07-09T06:45:23.013022',\n", + " 'updated_at': '2025-07-09T06:45:23.013023',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'fiqa-config',\n", + " 'type': 'retriever',\n", + " 'params': None,\n", + " 'tasks': {'my-beir-task': {'type': 'beir',\n", + " 'params': None,\n", + " 'metrics': {'recall_5': {'type': 'recall_5',\n", + " 'params': None},\n", + " 'ndcg_cut_5': {'type': 'ndcg_cut_5',\n", + " 'params': None},\n", + " 'recall_10': {'type': 'recall_10',\n", + " 'params': None},\n", + " 'ndcg_cut_10': {'type': 'ndcg_cut_10',\n", + " 'params': None}},\n", + " 'dataset': {'schema_version': '1.0',\n", + " 'id': 'dataset-WueQfPh9cr3WpU2ySmpf1k',\n", + " 'description': None,\n", + " 'type_prefix': None,\n", + " 'namespace': 'default',\n", + " 'project': None,\n", + " 'created_at': '2025-07-09T06:45:23.012836',\n", + " 'updated_at': '2025-07-09T06:45:23.012837',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 
'dataset-WueQfPh9cr3WpU2ySmpf1k',\n", + " 'version_id': 'main',\n", + " 'version_tags': [],\n", + " 'format': None,\n", + " 'files_url': 'file://fiqa/',\n", + " 'hf_endpoint': None,\n", + " 'split': None,\n", + " 'limit': None}}},\n", + " 'groups': None},\n", + " 'result': None,\n", + " 'output_files_url': None,\n", + " 'status_details': {'message': None, 'task_status': {}, 'progress': None},\n", + " 'status': 'created',\n", + " 'project': None,\n", + " 'custom_fields': {},\n", + " 'ownership': None}\n" + ] + } + ], + "source": [ + "payload = {\n", + " \"namespace\": NMS_NAMESPACE,\n", + " \"target\": f\"{NMS_NAMESPACE}/embed-rerank-target\",\n", + " \"config\": f\"{NMS_NAMESPACE}/fiqa-config\"\n", + "}\n", + "resp = requests.post(job_url, json=payload, headers=headers)\n", + "embed_rerank_eval_job_id = resp.json()[\"id\"]\n", + "pp(resp.json())" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "id": "e136f734-a435-4739-812f-5c4f129ca800", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'message': 'Job completed successfully',\n", + " 'task_status': {'my-beir-task': 'completed'},\n", + " 'progress': 100.0}\n" + ] + } + ], + "source": [ + "resp = requests.get(f\"{NEMO_URL}/v1/evaluation/jobs/{embed_rerank_eval_job_id}/status\")\n", + "pp(resp.json())" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "id": "886d5ee8-148b-4b74-9f18-d541929829e4", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'evaluation': {'scores': {'recall_10': {'value': 0.5678767905619758},\n", + " 'ndcg_cut_10': {'value': 0.5134364317202811},\n", + " 'recall_5': {'value': 0.5151632575243686},\n", + " 'ndcg_cut_5': {'value': 0.5011430756367373}}}}\n" + ] + } + ], + "source": [ + "resp = requests.get(f\"{NEMO_URL}/v1/evaluation/jobs/{embed_rerank_eval_job_id}/results\")\n", + "pp(resp.json()['groups']['evaluation']['metrics'])" + ] + }, + { + "cell_type": "markdown", 
+ "id": "d4d8c9f4-7800-4aa2-8cae-034521487348", + "metadata": {}, + "source": [ + "### 5.5 Evaluate Embedding + Reranking Pipeline on Custom Dataset" + ] + }, + { + "cell_type": "markdown", + "id": "fedb58f2-c57f-4e01-8c35-acb63f33a61a", + "metadata": {}, + "source": [ + "Upload the custom dataset:" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "id": "00910388-e963-4185-b2f2-234cda7af55c", + "metadata": {}, + "outputs": [], + "source": [ + "DATASET_NAME = \"rag_custom_data\"\n", + "repo_id = setup_dataset_repo(HF_API, NMS_NAMESPACE, DATASET_NAME, NEMO_URL)" + ] + }, + { + "cell_type": "code", + "execution_count": 41, + "id": "95295acc-3106-4ad6-a1e6-7e150f501c8a", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "queries.jsonl: 100%|██████████| 16.7k/16.7k [00:00<00:00, 3.90MB/s]\n", + "corpus.jsonl: 100%|██████████| 11.0k/11.0k [00:00<00:00, 3.04MB/s]\n", + "test.tsv: 100%|██████████| 7.28k/7.28k [00:00<00:00, 1.79MB/s]\n" + ] + }, + { + "data": { + "text/plain": [ + "CommitInfo(commit_url='', commit_message='Upload rag_data/qrels/test.tsv with huggingface_hub', commit_description='', oid='c901d2bb70290aa2d898d3f8102d732f774baaec', pr_url=None, repo_url=RepoUrl('', endpoint='https://huggingface.co', repo_type='model', repo_id=''), pr_revision=None, pr_num=None)" + ] + }, + "execution_count": 41, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "HF_API.upload_file(\n", + " path_or_fileobj=\"./eval_dataset/retriever_and_rag/queries.jsonl\",\n", + " path_in_repo=\"rag_data/queries.jsonl\",\n", + " repo_id=repo_id,\n", + " repo_type='dataset',\n", + ")\n", + "\n", + "HF_API.upload_file(\n", + " path_or_fileobj=\"./eval_dataset/retriever_and_rag/corpus.jsonl\",\n", + " path_in_repo=\"rag_data/corpus.jsonl\",\n", + " repo_id=repo_id,\n", + " repo_type='dataset',\n", + ")\n", + "\n", + "HF_API.upload_file(\n", + " 
path_or_fileobj=\"./eval_dataset/retriever_and_rag/qrels/test.tsv\",\n", + " path_in_repo=\"rag_data/qrels/test.tsv\",\n", + " repo_id=repo_id,\n", + " repo_type='dataset',\n", + ")" + ] + }, + { + "cell_type": "markdown", + "id": "d68add56-cdf4-4a49-85ef-84c1ee75cc31", + "metadata": {}, + "source": [ + "Create a retriever evaluation config for the custom dataset:" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "id": "fa0dabef-bd41-4d5b-a75f-3170880a2c2e", + "metadata": {}, + "outputs": [], + "source": [ + "config_payload = {\n", + " \"type\": \"retriever\",\n", + " \"name\": \"custom-retriever-config\",\n", + " \"namespace\": NMS_NAMESPACE,\n", + " \"tasks\": {\n", + " \"my-beir-task\": {\n", + " \"type\": \"beir\",\n", + " \"dataset\": {\n", + " \"files_url\": f\"hf://datasets/{NMS_NAMESPACE}/{DATASET_NAME}/rag_data\"\n", + " },\n", + " \"metrics\": {\n", + " \"recall_5\": {\"type\": \"recall_5\"},\n", + " \"ndcg_cut_5\": {\"type\": \"ndcg_cut_5\"},\n", + " \"recall_10\": {\"type\": \"recall_10\"},\n", + " \"ndcg_cut_10\": {\"type\": \"ndcg_cut_10\"}\n", + " }\n", + " }\n", + " }\n", + "}\n", + "resp = requests.post(config_url, json=config_payload, headers=headers)" + ] + }, + { + "cell_type": "markdown", + "id": "44efecd5-573e-43d3-9308-1ef14f50890d", + "metadata": {}, + "source": [ + "Launch the evaluation job on the embedding + reranking pipeline:" + ] + }, + { + "cell_type": "code", + "execution_count": 43, + "id": "4548459b-d9de-42c2-b32a-8e4ab116cbe8", + "metadata": { + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'created_at': '2025-07-09T16:16:07.734774',\n", + " 'updated_at': '2025-07-09T16:16:07.734776',\n", + " 'id': 'eval-PeDSNf6jrxTyJnovR9ybd7',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'description': None,\n", + " 'target': {'schema_version': '1.0',\n", + " 'id': 'eval-target-T4fCAFx1yhgGP3kBfEBzxb',\n", + " 'description': None,\n", + " 
'type_prefix': 'eval-target',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'project': None,\n", + " 'created_at': '2025-07-09T07:06:31.817357',\n", + " 'updated_at': '2025-07-09T07:06:31.817358',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'embed-rerank-target',\n", + " 'type': 'retriever',\n", + " 'cached_outputs': None,\n", + " 'model': None,\n", + " 'retriever': {'pipeline': {'query_embedding_model': {'schema_version': '1.0',\n", + " 'id': 'model-AEyhZfkQZdQUw938n7Gw4q',\n", + " 'description': None,\n", + " 'type_prefix': 'model',\n", + " 'namespace': 'default',\n", + " 'project': None,\n", + " 'created_at': '2025-07-09T07:06:31.816947',\n", + " 'updated_at': '2025-07-09T07:06:31.816949',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'model-AEyhZfkQZdQUw938n7Gw4q',\n", + " 'version_id': 'main',\n", + " 'version_tags': [],\n", + " 'spec': None,\n", + " 'artifact': None,\n", + " 'base_model': None,\n", + " 'api_endpoint': {'url': 'http://172.27.20.120:8000/v1/embeddings',\n", + " 'model_id': 'nvidia/llama-3.2-nv-embedqa-1b-v2',\n", + " 'api_key': None,\n", + " 'format': 'nim'},\n", + " 'peft': None,\n", + " 'prompt': None,\n", + " 'guardrails': None},\n", + " 'index_embedding_model': {'schema_version': '1.0',\n", + " 'id': 'model-C5tmcE6tZECDRHN3PuvEfk',\n", + " 'description': None,\n", + " 'type_prefix': 'model',\n", + " 'namespace': 'default',\n", + " 'project': None,\n", + " 'created_at': '2025-07-09T07:06:31.816996',\n", + " 'updated_at': '2025-07-09T07:06:31.816996',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'model-C5tmcE6tZECDRHN3PuvEfk',\n", + " 'version_id': 'main',\n", + " 'version_tags': [],\n", + " 'spec': None,\n", + " 'artifact': None,\n", + " 'base_model': None,\n", + " 'api_endpoint': {'url': 'http://172.27.20.120:8000/v1/embeddings',\n", + " 'model_id': 'nvidia/llama-3.2-nv-embedqa-1b-v2',\n", + " 'api_key': None,\n", + " 'format': 'nim'},\n", + " 'peft': 
None,\n", + " 'prompt': None,\n", + " 'guardrails': None},\n", + " 'reranker_model': {'schema_version': '1.0',\n", + " 'id': 'model-LzMLhQFk6bzv4vFAdRNMok',\n", + " 'description': None,\n", + " 'type_prefix': 'model',\n", + " 'namespace': 'default',\n", + " 'project': None,\n", + " 'created_at': '2025-07-09T07:06:31.817016',\n", + " 'updated_at': '2025-07-09T07:06:31.817016',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'model-LzMLhQFk6bzv4vFAdRNMok',\n", + " 'version_id': 'main',\n", + " 'version_tags': [],\n", + " 'spec': None,\n", + " 'artifact': None,\n", + " 'base_model': None,\n", + " 'api_endpoint': {'url': 'http://172.27.20.120:8001/v1/ranking',\n", + " 'model_id': 'nvidia/llama-3.2-nv-rerankqa-1b-v2',\n", + " 'api_key': None,\n", + " 'format': 'nim'},\n", + " 'peft': None,\n", + " 'prompt': None,\n", + " 'guardrails': None},\n", + " 'top_k': 10},\n", + " 'cached_outputs': None},\n", + " 'rag': None,\n", + " 'rows': None,\n", + " 'dataset': None},\n", + " 'config': {'schema_version': '1.0',\n", + " 'id': 'eval-config-NeZxti79VXY5KfXeFRsoB5',\n", + " 'description': None,\n", + " 'type_prefix': 'eval-config',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'project': None,\n", + " 'created_at': '2025-07-09T16:15:48.205922',\n", + " 'updated_at': '2025-07-09T16:15:48.205923',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'custom-retriever-config',\n", + " 'type': 'retriever',\n", + " 'params': None,\n", + " 'tasks': {'my-beir-task': {'type': 'beir',\n", + " 'params': None,\n", + " 'metrics': {'recall_5': {'type': 'recall_5',\n", + " 'params': None},\n", + " 'ndcg_cut_5': {'type': 'ndcg_cut_5',\n", + " 'params': None},\n", + " 'recall_10': {'type': 'recall_10',\n", + " 'params': None},\n", + " 'ndcg_cut_10': {'type': 'ndcg_cut_10',\n", + " 'params': None}},\n", + " 'dataset': {'schema_version': '1.0',\n", + " 'id': 'dataset-RddzBeXffRmVKTxTvRWkjF',\n", + " 'description': None,\n", + " 'type_prefix': 
None,\n", + " 'namespace': 'default',\n", + " 'project': None,\n", + " 'created_at': '2025-07-09T16:15:48.205720',\n", + " 'updated_at': '2025-07-09T16:15:48.205722',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'dataset-RddzBeXffRmVKTxTvRWkjF',\n", + " 'version_id': 'main',\n", + " 'version_tags': [],\n", + " 'format': None,\n", + " 'files_url': 'hf://datasets/nemo-eval-tutorial/rag_custom_data/rag_data',\n", + " 'hf_endpoint': None,\n", + " 'split': None,\n", + " 'limit': None}}},\n", + " 'groups': None},\n", + " 'result': None,\n", + " 'output_files_url': None,\n", + " 'status_details': {'message': None, 'task_status': {}, 'progress': None},\n", + " 'status': 'created',\n", + " 'project': None,\n", + " 'custom_fields': {},\n", + " 'ownership': None}\n" + ] + } + ], + "source": [ + "payload = {\n", + " \"namespace\": NMS_NAMESPACE,\n", + " \"target\": f\"{NMS_NAMESPACE}/embed-rerank-target\",\n", + " \"config\": f\"{NMS_NAMESPACE}/custom-retriever-config\"\n", + "}\n", + "resp = requests.post(job_url, json=payload, headers=headers)\n", + "embed_eval_job_id = resp.json()[\"id\"]\n", + "pp(resp.json())" + ] + }, + { + "cell_type": "code", + "execution_count": 46, + "id": "01dec2a3-9834-4454-892a-fe76bd45bcac", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'message': 'Job completed successfully',\n", + " 'task_status': {'my-beir-task': 'completed'},\n", + " 'progress': 100.0}\n" + ] + } + ], + "source": [ + "resp = requests.get(f\"{NEMO_URL}/v1/evaluation/jobs/{embed_eval_job_id}/status\")\n", + "pp(resp.json())" + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "id": "6ac12f45-6730-41e2-8ba8-914a58d2f0d1", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'evaluation': {'scores': {'ndcg_cut_5': {'value': 1.0},\n", + " 'recall_10': {'value': 1.0},\n", + " 'recall_5': {'value': 1.0},\n", + " 'ndcg_cut_10': {'value': 
1.0}}}}\n" + ] + } + ], + "source": [ + "resp = requests.get(f\"{NEMO_URL}/v1/evaluation/jobs/{embed_eval_job_id}/results\")\n", + "pp(resp.json()['groups']['evaluation']['metrics'])" + ] + }, + { + "cell_type": "markdown", + "id": "8f03f72f-d8a2-4b5d-8f24-b2518e65c90a", + "metadata": {}, + "source": [ + "## 6. RAG Pipeline Evaluation" + ] + }, + { + "cell_type": "markdown", + "id": "118064cf-10db-4054-9014-bf7698de2531", + "metadata": {}, + "source": [ + "### 6.1 Evaluate RAG Pipeline on NFCorpus Dataset" + ] + }, + { + "cell_type": "code", + "execution_count": 48, + "id": "e8b6f569-e8a8-410f-9fad-45b60e072dc9", + "metadata": {}, + "outputs": [], + "source": [ + "target_payload = {\n", + " \"type\": \"rag\",\n", + " \"name\": \"rag-target\",\n", + " \"namespace\": NMS_NAMESPACE,\n", + " \"rag\": {\n", + " \"pipeline\": {\n", + " \"retriever\": {\n", + " \"pipeline\": {\n", + " \"query_embedding_model\": {\n", + " \"api_endpoint\": {\n", + " \"url\": embed_url,\n", + " \"model_id\": embed_model_name\n", + " }\n", + " },\n", + " \"index_embedding_model\": {\n", + " \"api_endpoint\": {\n", + " \"url\": embed_url,\n", + " \"model_id\": embed_model_name\n", + " }\n", + " },\n", + " \"reranker_model\": {\n", + " \"api_endpoint\": {\n", + " \"url\": rerank_url,\n", + " \"model_id\": rerank_model_name,\n", + " }\n", + " },\n", + " \"top_k\": 3\n", + " }\n", + " },\n", + " \"model\": {\n", + " \"api_endpoint\": {\n", + " \"url\": llm_chat_completion_url,\n", + " \"model_id\": \"meta/llama-3.1-8b-instruct\"\n", + " }\n", + " }\n", + " }\n", + " }\n", + "}\n", + "resp = requests.post(target_url, json=target_payload, headers=headers)" + ] + }, + { + "cell_type": "code", + "execution_count": 49, + "id": "404c3826-3388-4c59-b1a7-c284accdeedf", + "metadata": { + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'created_at': '2025-07-09T16:30:31.105191',\n", + " 'updated_at': 
'2025-07-09T16:30:31.105192',\n", + " 'name': 'rag-nfcorpus-config',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'type': 'rag',\n", + " 'tasks': {'my-beir-task': {'type': 'beir',\n", + " 'params': {'judge_llm': {'api_endpoint': {'url': 'https://integrate.api.nvidia.com/v1/chat/completions',\n", + " 'model_id': 'meta/llama-3.3-70b-instruct',\n", + " 'api_key': '******'}},\n", + " 'judge_embeddings': {'api_endpoint': {'url': 'https://integrate.api.nvidia.com/v1/embeddings',\n", + " 'model_id': 'nvidia/nv-embedqa-e5-v5',\n", + " 'api_key': '******'}},\n", + " 'judge_timeout': 300,\n", + " 'judge_max_retries': 5,\n", + " 'judge_max_workers': 16},\n", + " 'metrics': {'recall_5': {'type': 'recall_5'},\n", + " 'ndcg_cut_5': {'type': 'ndcg_cut_5'},\n", + " 'recall_10': {'type': 'recall_10'},\n", + " 'ndcg_cut_10': {'type': 'ndcg_cut_10'},\n", + " 'faithfulness': {'type': 'faithfulness'},\n", + " 'answer_relevancy': {'type': 'answer_relevancy'}},\n", + " 'dataset': {'schema_version': '1.0',\n", + " 'id': 'dataset-3Tgpmmzu7u4AUUrfppFhsN',\n", + " 'namespace': 'default',\n", + " 'created_at': '2025-07-09T16:30:31.105030',\n", + " 'updated_at': '2025-07-09T16:30:31.105032',\n", + " 'custom_fields': {},\n", + " 'name': 'dataset-3Tgpmmzu7u4AUUrfppFhsN',\n", + " 'version_id': 'main',\n", + " 'version_tags': [],\n", + " 'files_url': 'file://nfcorpus/'}}},\n", + " 'id': 'eval-config-Lsvsipxj1cEntDWsDbzJxL',\n", + " 'custom_fields': {}}\n" + ] + } + ], + "source": [ + "config_payload = {\n", + " \"type\": \"rag\",\n", + " \"name\": \"rag-nfcorpus-config\",\n", + " \"namespace\": NMS_NAMESPACE,\n", + " \"tasks\": {\n", + " \"my-beir-task\": {\n", + " \"type\": \"beir\",\n", + " \"dataset\": {\n", + " \"files_url\": \"file://nfcorpus/\"\n", + " },\n", + " \"params\": {\n", + " \"judge_llm\": {\n", + " \"api_endpoint\": {\n", + " \"url\": \"https://integrate.api.nvidia.com/v1/chat/completions\",\n", + " \"model_id\": \"meta/llama-3.3-70b-instruct\",\n", + " \"api_key\": 
os.environ['NVIDIA_API_KEY'],\n", + " }\n", + " },\n", + " \"judge_embeddings\": {\n", + " \"api_endpoint\": {\n", + " \"url\": \"https://integrate.api.nvidia.com/v1/embeddings\",\n", + " \"model_id\": \"nvidia/nv-embedqa-e5-v5\",\n", + " \"api_key\": os.environ['NVIDIA_API_KEY'],\n", + " }\n", + " },\n", + " \"judge_timeout\": 300,\n", + " \"judge_max_retries\": 5,\n", + " \"judge_max_workers\": 16\n", + " },\n", + " \"metrics\": {\n", + " \"recall_5\": {\"type\": \"recall_5\"},\n", + " \"ndcg_cut_5\": {\"type\": \"ndcg_cut_5\"},\n", + " \"recall_10\": {\"type\": \"recall_10\"},\n", + " \"ndcg_cut_10\": {\"type\": \"ndcg_cut_10\"},\n", + " \"faithfulness\": {\"type\": \"faithfulness\"},\n", + " \"answer_relevancy\": {\"type\": \"answer_relevancy\"}\n", + " }\n", + " }\n", + " }\n", + "}\n", + "resp = requests.post(config_url, json=config_payload, headers=headers)\n", + "pp(resp.json())" + ] + }, + { + "cell_type": "code", + "execution_count": 50, + "id": "12783fe8-93d3-4d62-b802-cd859398d4b2", + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'created_at': '2025-07-09T16:30:43.855021',\n", + " 'updated_at': '2025-07-09T16:30:43.855024',\n", + " 'id': 'eval-VrFzxRNEP1hhjrnUtP8Ug1',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'description': None,\n", + " 'target': {'schema_version': '1.0',\n", + " 'id': 'eval-target-VTApJbSpoJBj8xQhgzxB4',\n", + " 'description': None,\n", + " 'type_prefix': 'eval-target',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'project': None,\n", + " 'created_at': '2025-07-09T16:29:23.468547',\n", + " 'updated_at': '2025-07-09T16:29:23.468547',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'rag-target',\n", + " 'type': 'rag',\n", + " 'cached_outputs': None,\n", + " 'model': None,\n", + " 'retriever': None,\n", + " 'rag': {'pipeline': {'retriever': {'pipeline': {'query_embedding_model': {'schema_version': '1.0',\n", + " 'id': 
'model-FbFHZ6sEf7pBhmSZj2eTa5',\n", + " 'description': None,\n", + " 'type_prefix': 'model',\n", + " 'namespace': 'default',\n", + " 'project': None,\n", + " 'created_at': '2025-07-09T16:29:23.468212',\n", + " 'updated_at': '2025-07-09T16:29:23.468213',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'model-FbFHZ6sEf7pBhmSZj2eTa5',\n", + " 'version_id': 'main',\n", + " 'version_tags': [],\n", + " 'spec': None,\n", + " 'artifact': None,\n", + " 'base_model': None,\n", + " 'api_endpoint': {'url': 'http://172.27.20.120:8000/v1/embeddings',\n", + " 'model_id': 'nvidia/llama-3.2-nv-embedqa-1b-v2',\n", + " 'api_key': None,\n", + " 'format': 'nim'},\n", + " 'peft': None,\n", + " 'prompt': None,\n", + " 'guardrails': None},\n", + " 'index_embedding_model': {'schema_version': '1.0',\n", + " 'id': 'model-BMxFNMbexdzBy1PgDyMB2f',\n", + " 'description': None,\n", + " 'type_prefix': 'model',\n", + " 'namespace': 'default',\n", + " 'project': None,\n", + " 'created_at': '2025-07-09T16:29:23.468261',\n", + " 'updated_at': '2025-07-09T16:29:23.468261',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'model-BMxFNMbexdzBy1PgDyMB2f',\n", + " 'version_id': 'main',\n", + " 'version_tags': [],\n", + " 'spec': None,\n", + " 'artifact': None,\n", + " 'base_model': None,\n", + " 'api_endpoint': {'url': 'http://172.27.20.120:8000/v1/embeddings',\n", + " 'model_id': 'nvidia/llama-3.2-nv-embedqa-1b-v2',\n", + " 'api_key': None,\n", + " 'format': 'nim'},\n", + " 'peft': None,\n", + " 'prompt': None,\n", + " 'guardrails': None},\n", + " 'reranker_model': {'schema_version': '1.0',\n", + " 'id': 'model-JVYRG5Nxwu24cxXevSD1P',\n", + " 'description': None,\n", + " 'type_prefix': 'model',\n", + " 'namespace': 'default',\n", + " 'project': None,\n", + " 'created_at': '2025-07-09T16:29:23.468282',\n", + " 'updated_at': '2025-07-09T16:29:23.468282',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'model-JVYRG5Nxwu24cxXevSD1P',\n", 
+ " 'version_id': 'main',\n", + " 'version_tags': [],\n", + " 'spec': None,\n", + " 'artifact': None,\n", + " 'base_model': None,\n", + " 'api_endpoint': {'url': 'http://172.27.20.120:8001/v1/ranking',\n", + " 'model_id': 'nvidia/llama-3.2-nv-rerankqa-1b-v2',\n", + " 'api_key': None,\n", + " 'format': 'nim'},\n", + " 'peft': None,\n", + " 'prompt': None,\n", + " 'guardrails': None},\n", + " 'top_k': 3},\n", + " 'cached_outputs': None},\n", + " 'model': {'schema_version': '1.0',\n", + " 'id': 'model-fWQ8215bWrwq9sdmSGnno',\n", + " 'description': None,\n", + " 'type_prefix': 'model',\n", + " 'namespace': 'default',\n", + " 'project': None,\n", + " 'created_at': '2025-07-09T16:29:23.468307',\n", + " 'updated_at': '2025-07-09T16:29:23.468307',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'model-fWQ8215bWrwq9sdmSGnno',\n", + " 'version_id': 'main',\n", + " 'version_tags': [],\n", + " 'spec': None,\n", + " 'artifact': None,\n", + " 'base_model': None,\n", + " 'api_endpoint': {'url': 'http://nim.test/v1/chat/completions',\n", + " 'model_id': 'meta/llama-3.1-8b-instruct',\n", + " 'api_key': None,\n", + " 'format': 'nim'},\n", + " 'peft': None,\n", + " 'prompt': None,\n", + " 'guardrails': None},\n", + " 'context_ordering': 'desc'},\n", + " 'cached_outputs': None},\n", + " 'rows': None,\n", + " 'dataset': None},\n", + " 'config': {'schema_version': '1.0',\n", + " 'id': 'eval-config-Lsvsipxj1cEntDWsDbzJxL',\n", + " 'description': None,\n", + " 'type_prefix': 'eval-config',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'project': None,\n", + " 'created_at': '2025-07-09T16:30:31.105191',\n", + " 'updated_at': '2025-07-09T16:30:31.105192',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'rag-nfcorpus-config',\n", + " 'type': 'rag',\n", + " 'params': None,\n", + " 'tasks': {'my-beir-task': {'type': 'beir',\n", + " 'params': {'judge_llm': {'api_endpoint': {'url': 'https://integrate.api.nvidia.com/v1/chat/completions',\n", + 
" 'model_id': 'meta/llama-3.3-70b-instruct',\n", + " 'api_key': '******'}},\n", + " 'judge_embeddings': {'api_endpoint': {'url': 'https://integrate.api.nvidia.com/v1/embeddings',\n", + " 'model_id': 'nvidia/nv-embedqa-e5-v5',\n", + " 'api_key': '******'}},\n", + " 'judge_timeout': 300,\n", + " 'judge_max_retries': 5,\n", + " 'judge_max_workers': 16},\n", + " 'metrics': {'recall_5': {'type': 'recall_5',\n", + " 'params': None},\n", + " 'ndcg_cut_5': {'type': 'ndcg_cut_5',\n", + " 'params': None},\n", + " 'recall_10': {'type': 'recall_10',\n", + " 'params': None},\n", + " 'ndcg_cut_10': {'type': 'ndcg_cut_10',\n", + " 'params': None},\n", + " 'faithfulness': {'type': 'faithfulness',\n", + " 'params': None},\n", + " 'answer_relevancy': {'type': 'answer_relevancy',\n", + " 'params': None}},\n", + " 'dataset': {'schema_version': '1.0',\n", + " 'id': 'dataset-3Tgpmmzu7u4AUUrfppFhsN',\n", + " 'description': None,\n", + " 'type_prefix': None,\n", + " 'namespace': 'default',\n", + " 'project': None,\n", + " 'created_at': '2025-07-09T16:30:31.105030',\n", + " 'updated_at': '2025-07-09T16:30:31.105032',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'dataset-3Tgpmmzu7u4AUUrfppFhsN',\n", + " 'version_id': 'main',\n", + " 'version_tags': [],\n", + " 'format': None,\n", + " 'files_url': 'file://nfcorpus/',\n", + " 'hf_endpoint': None,\n", + " 'split': None,\n", + " 'limit': None}}},\n", + " 'groups': None},\n", + " 'result': None,\n", + " 'output_files_url': None,\n", + " 'status_details': {'message': None, 'task_status': {}, 'progress': None},\n", + " 'status': 'created',\n", + " 'project': None,\n", + " 'custom_fields': {},\n", + " 'ownership': None}\n" + ] + } + ], + "source": [ + "payload = {\n", + " \"namespace\": NMS_NAMESPACE,\n", + " \"target\": f\"{NMS_NAMESPACE}/rag-target\",\n", + " \"config\": f\"{NMS_NAMESPACE}/rag-nfcorpus-config\"\n", + "}\n", + "resp = requests.post(job_url, json=payload, headers=headers)\n", + "rag_nfcorpus_eval_job_id = 
resp.json()[\"id\"]\n", + "pp(resp.json())" + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "id": "ed65bf45-83b7-447d-96da-998906b54c47", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'message': 'Job completed successfully',\n", + " 'task_status': {'my-beir-task': 'completed'},\n", + " 'progress': 100.0}\n" + ] + } + ], + "source": [ + "resp = requests.get(f\"{NEMO_URL}/v1/evaluation/jobs/{rag_nfcorpus_eval_job_id}/status\")\n", + "pp(resp.json())" + ] + }, + { + "cell_type": "code", + "execution_count": 55, + "id": "a97a457a-2739-4903-a696-5a1ad66938bc", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'evaluation': {'scores': {'ndcg_cut_10': {'value': 0.2679049189207276},\n", + " 'recall_5': {'value': 0.11898776243879489},\n", + " 'recall_10': {'value': 0.11898776243879489},\n", + " 'ndcg_cut_5': {'value': 0.35399013412606123},\n", + " 'faithfulness': {'value': 0.798750415118147},\n", + " 'answer_relevancy': {'value': 0.38691755209414236}}}}\n" + ] + } + ], + "source": [ + "resp = requests.get(f\"{NEMO_URL}/v1/evaluation/jobs/{rag_nfcorpus_eval_job_id}/results\")\n", + "pp(resp.json()['groups']['evaluation']['metrics'])" + ] + }, + { + "cell_type": "markdown", + "id": "f179b501-79e0-4261-af70-532b22b0b0d5", + "metadata": {}, + "source": [ + "### 6.2 Evaluate RAG Pipeline on Custom Dataset" + ] + }, + { + "cell_type": "markdown", + "id": "075d2471-1cfc-41ba-a43a-affe2844b407", + "metadata": {}, + "source": [ + "Create an evaluation config for RAG on the custom dataset. We will use a remote judge LLM and a remote judge embedding model." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 56, + "id": "a6a9d68c-3cdd-4faa-ac01-460ea37cdad0", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'created_at': '2025-07-09T17:35:11.033333',\n", + " 'updated_at': '2025-07-09T17:35:11.033334',\n", + " 'name': 'custom-rag-config',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'type': 'rag',\n", + " 'tasks': {'rag-beir-task': {'type': 'beir',\n", + " 'params': {'judge_llm': {'api_endpoint': {'url': 'https://integrate.api.nvidia.com/v1/chat/completions',\n", + " 'model_id': 'meta/llama-3.3-70b-instruct',\n", + " 'api_key': '******'}},\n", + " 'judge_embeddings': {'api_endpoint': {'url': 'https://integrate.api.nvidia.com/v1/embeddings',\n", + " 'model_id': 'nvidia/nv-embedqa-e5-v5',\n", + " 'api_key': '******'}},\n", + " 'judge_timeout': 300,\n", + " 'judge_max_retries': 5,\n", + " 'judge_max_workers': 16},\n", + " 'metrics': {'recall_5': {'type': 'recall_5'},\n", + " 'ndcg_cut_5': {'type': 'ndcg_cut_5'},\n", + " 'recall_10': {'type': 'recall_10'},\n", + " 'ndcg_cut_10': {'type': 'ndcg_cut_10'},\n", + " 'faithfulness': {'type': 'faithfulness'},\n", + " 'answer_relevancy': {'type': 'answer_relevancy'}},\n", + " 'dataset': {'schema_version': '1.0',\n", + " 'id': 'dataset-4LzRVXqd494p5rSG5BeULA',\n", + " 'namespace': 'default',\n", + " 'created_at': '2025-07-09T17:35:11.033156',\n", + " 'updated_at': '2025-07-09T17:35:11.033158',\n", + " 'custom_fields': {},\n", + " 'name': 'dataset-4LzRVXqd494p5rSG5BeULA',\n", + " 'version_id': 'main',\n", + " 'version_tags': [],\n", + " 'files_url': 'hf://datasets/nemo-eval-tutorial/rag_custom_data/rag_data'}}},\n", + " 'id': 'eval-config-7NBFBYfrYxGgcdqo5M7agU',\n", + " 'custom_fields': {}}\n" + ] + } + ], + "source": [ + "config_payload = {\n", + " \"type\": \"rag\",\n", + " \"name\": \"custom-rag-config\",\n", + " \"namespace\": NMS_NAMESPACE,\n", + " \"tasks\": {\n", + " \"rag-beir-task\": {\n", + " \"type\": 
\"beir\",\n", + " \"dataset\": {\n", + " \"files_url\": f\"hf://datasets/{NMS_NAMESPACE}/{DATASET_NAME}/rag_data\"\n", + " },\n", + " \"params\": {\n", + " \"judge_llm\": {\n", + " \"api_endpoint\": {\n", + " \"url\": \"https://integrate.api.nvidia.com/v1/chat/completions\",\n", + " \"model_id\": \"meta/llama-3.3-70b-instruct\",\n", + " \"api_key\": os.environ['NVIDIA_API_KEY'],\n", + " }\n", + " },\n", + " \"judge_embeddings\": {\n", + " \"api_endpoint\": {\n", + " \"url\": \"https://integrate.api.nvidia.com/v1/embeddings\",\n", + " \"model_id\": \"nvidia/nv-embedqa-e5-v5\",\n", + " \"api_key\": os.environ['NVIDIA_API_KEY'],\n", + " }\n", + " },\n", + " \"judge_timeout\": 300,\n", + " \"judge_max_retries\": 5,\n", + " \"judge_max_workers\": 16\n", + " },\n", + " \"metrics\": {\n", + " \"recall_5\": {\"type\": \"recall_5\"},\n", + " \"ndcg_cut_5\": {\"type\": \"ndcg_cut_5\"},\n", + " \"recall_10\": {\"type\": \"recall_10\"},\n", + " \"ndcg_cut_10\": {\"type\": \"ndcg_cut_10\"},\n", + " \"faithfulness\": {\"type\": \"faithfulness\"},\n", + " \"answer_relevancy\": {\"type\": \"answer_relevancy\"}\n", + " }\n", + " }\n", + " }\n", + "}\n", + "\n", + "resp = requests.post(config_url, json=config_payload, headers=headers)\n", + "pp(resp.json())" + ] + }, + { + "cell_type": "code", + "execution_count": 57, + "id": "8eab031f-155c-4ad5-a44f-a1369fe00f28", + "metadata": { + "collapsed": true, + "jupyter": { + "outputs_hidden": true + } + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'created_at': '2025-07-09T17:35:43.792368',\n", + " 'updated_at': '2025-07-09T17:35:43.792370',\n", + " 'id': 'eval-NYudxKxqnNzbRk5k7CpJNr',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'description': None,\n", + " 'target': {'schema_version': '1.0',\n", + " 'id': 'eval-target-VTApJbSpoJBj8xQhgzxB4',\n", + " 'description': None,\n", + " 'type_prefix': 'eval-target',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'project': None,\n", + " 
'created_at': '2025-07-09T16:29:23.468547',\n", + " 'updated_at': '2025-07-09T16:29:23.468547',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'rag-target',\n", + " 'type': 'rag',\n", + " 'cached_outputs': None,\n", + " 'model': None,\n", + " 'retriever': None,\n", + " 'rag': {'pipeline': {'retriever': {'pipeline': {'query_embedding_model': {'schema_version': '1.0',\n", + " 'id': 'model-FbFHZ6sEf7pBhmSZj2eTa5',\n", + " 'description': None,\n", + " 'type_prefix': 'model',\n", + " 'namespace': 'default',\n", + " 'project': None,\n", + " 'created_at': '2025-07-09T16:29:23.468212',\n", + " 'updated_at': '2025-07-09T16:29:23.468213',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'model-FbFHZ6sEf7pBhmSZj2eTa5',\n", + " 'version_id': 'main',\n", + " 'version_tags': [],\n", + " 'spec': None,\n", + " 'artifact': None,\n", + " 'base_model': None,\n", + " 'api_endpoint': {'url': 'http://172.27.20.120:8000/v1/embeddings',\n", + " 'model_id': 'nvidia/llama-3.2-nv-embedqa-1b-v2',\n", + " 'api_key': None,\n", + " 'format': 'nim'},\n", + " 'peft': None,\n", + " 'prompt': None,\n", + " 'guardrails': None},\n", + " 'index_embedding_model': {'schema_version': '1.0',\n", + " 'id': 'model-BMxFNMbexdzBy1PgDyMB2f',\n", + " 'description': None,\n", + " 'type_prefix': 'model',\n", + " 'namespace': 'default',\n", + " 'project': None,\n", + " 'created_at': '2025-07-09T16:29:23.468261',\n", + " 'updated_at': '2025-07-09T16:29:23.468261',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'model-BMxFNMbexdzBy1PgDyMB2f',\n", + " 'version_id': 'main',\n", + " 'version_tags': [],\n", + " 'spec': None,\n", + " 'artifact': None,\n", + " 'base_model': None,\n", + " 'api_endpoint': {'url': 'http://172.27.20.120:8000/v1/embeddings',\n", + " 'model_id': 'nvidia/llama-3.2-nv-embedqa-1b-v2',\n", + " 'api_key': None,\n", + " 'format': 'nim'},\n", + " 'peft': None,\n", + " 'prompt': None,\n", + " 'guardrails': None},\n", + " 
'reranker_model': {'schema_version': '1.0',\n", + " 'id': 'model-JVYRG5Nxwu24cxXevSD1P',\n", + " 'description': None,\n", + " 'type_prefix': 'model',\n", + " 'namespace': 'default',\n", + " 'project': None,\n", + " 'created_at': '2025-07-09T16:29:23.468282',\n", + " 'updated_at': '2025-07-09T16:29:23.468282',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'model-JVYRG5Nxwu24cxXevSD1P',\n", + " 'version_id': 'main',\n", + " 'version_tags': [],\n", + " 'spec': None,\n", + " 'artifact': None,\n", + " 'base_model': None,\n", + " 'api_endpoint': {'url': 'http://172.27.20.120:8001/v1/ranking',\n", + " 'model_id': 'nvidia/llama-3.2-nv-rerankqa-1b-v2',\n", + " 'api_key': None,\n", + " 'format': 'nim'},\n", + " 'peft': None,\n", + " 'prompt': None,\n", + " 'guardrails': None},\n", + " 'top_k': 3},\n", + " 'cached_outputs': None},\n", + " 'model': {'schema_version': '1.0',\n", + " 'id': 'model-fWQ8215bWrwq9sdmSGnno',\n", + " 'description': None,\n", + " 'type_prefix': 'model',\n", + " 'namespace': 'default',\n", + " 'project': None,\n", + " 'created_at': '2025-07-09T16:29:23.468307',\n", + " 'updated_at': '2025-07-09T16:29:23.468307',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'model-fWQ8215bWrwq9sdmSGnno',\n", + " 'version_id': 'main',\n", + " 'version_tags': [],\n", + " 'spec': None,\n", + " 'artifact': None,\n", + " 'base_model': None,\n", + " 'api_endpoint': {'url': 'http://nim.test/v1/chat/completions',\n", + " 'model_id': 'meta/llama-3.1-8b-instruct',\n", + " 'api_key': None,\n", + " 'format': 'nim'},\n", + " 'peft': None,\n", + " 'prompt': None,\n", + " 'guardrails': None},\n", + " 'context_ordering': 'desc'},\n", + " 'cached_outputs': None},\n", + " 'rows': None,\n", + " 'dataset': None},\n", + " 'config': {'schema_version': '1.0',\n", + " 'id': 'eval-config-7NBFBYfrYxGgcdqo5M7agU',\n", + " 'description': None,\n", + " 'type_prefix': 'eval-config',\n", + " 'namespace': 'nemo-eval-tutorial',\n", + " 'project': 
None,\n", + " 'created_at': '2025-07-09T17:35:11.033333',\n", + " 'updated_at': '2025-07-09T17:35:11.033334',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'custom-rag-config',\n", + " 'type': 'rag',\n", + " 'params': None,\n", + " 'tasks': {'rag-beir-task': {'type': 'beir',\n", + " 'params': {'judge_llm': {'api_endpoint': {'url': 'https://integrate.api.nvidia.com/v1/chat/completions',\n", + " 'model_id': 'meta/llama-3.3-70b-instruct',\n", + " 'api_key': '******'}},\n", + " 'judge_embeddings': {'api_endpoint': {'url': 'https://integrate.api.nvidia.com/v1/embeddings',\n", + " 'model_id': 'nvidia/nv-embedqa-e5-v5',\n", + " 'api_key': '******'}},\n", + " 'judge_timeout': 300,\n", + " 'judge_max_retries': 5,\n", + " 'judge_max_workers': 16},\n", + " 'metrics': {'recall_5': {'type': 'recall_5',\n", + " 'params': None},\n", + " 'ndcg_cut_5': {'type': 'ndcg_cut_5',\n", + " 'params': None},\n", + " 'recall_10': {'type': 'recall_10',\n", + " 'params': None},\n", + " 'ndcg_cut_10': {'type': 'ndcg_cut_10',\n", + " 'params': None},\n", + " 'faithfulness': {'type': 'faithfulness',\n", + " 'params': None},\n", + " 'answer_relevancy': {'type': 'answer_relevancy',\n", + " 'params': None}},\n", + " 'dataset': {'schema_version': '1.0',\n", + " 'id': 'dataset-4LzRVXqd494p5rSG5BeULA',\n", + " 'description': None,\n", + " 'type_prefix': None,\n", + " 'namespace': 'default',\n", + " 'project': None,\n", + " 'created_at': '2025-07-09T17:35:11.033156',\n", + " 'updated_at': '2025-07-09T17:35:11.033158',\n", + " 'custom_fields': {},\n", + " 'ownership': None,\n", + " 'name': 'dataset-4LzRVXqd494p5rSG5BeULA',\n", + " 'version_id': 'main',\n", + " 'version_tags': [],\n", + " 'format': None,\n", + " 'files_url': 'hf://datasets/nemo-eval-tutorial/rag_custom_data/rag_data',\n", + " 'hf_endpoint': None,\n", + " 'split': None,\n", + " 'limit': None}}},\n", + " 'groups': None},\n", + " 'result': None,\n", + " 'output_files_url': None,\n", + " 'status_details': {'message': 
None, 'task_status': {}, 'progress': None},\n", + " 'status': 'created',\n", + " 'project': None,\n", + " 'custom_fields': {},\n", + " 'ownership': None}\n" + ] + } + ], + "source": [ + "payload = {\n", + " \"namespace\": NMS_NAMESPACE,\n", + " \"target\": f\"{NMS_NAMESPACE}/rag-target\",\n", + " \"config\": f\"{NMS_NAMESPACE}/custom-rag-config\"\n", + "}\n", + "resp = requests.post(job_url, json=payload, headers=headers)\n", + "custom_rag_eval_job_id = resp.json()[\"id\"]\n", + "pp(resp.json())" + ] + }, + { + "cell_type": "code", + "execution_count": 60, + "id": "bbff7c0c-b4a1-4255-bd75-0344c993c902", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'message': 'Job completed successfully',\n", + " 'task_status': {'rag-beir-task': 'completed'},\n", + " 'progress': 100.0}\n" + ] + } + ], + "source": [ + "resp = requests.get(f\"{NEMO_URL}/v1/evaluation/jobs/{custom_rag_eval_job_id}/status\")\n", + "pp(resp.json())" + ] + }, + { + "cell_type": "code", + "execution_count": 61, + "id": "93df7226-f474-4b05-8203-b886db01197a", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "{'evaluation': {'scores': {'ndcg_cut_10': {'value': 1.0},\n", + " 'recall_10': {'value': 1.0},\n", + " 'ndcg_cut_5': {'value': 1.0},\n", + " 'recall_5': {'value': 1.0},\n", + " 'faithfulness': {'value': 0.804586038961039},\n", + " 'answer_relevancy': {'value': 0.5430273571543547}}}}\n" + ] + } + ], + "source": [ + "resp = requests.get(f\"{NEMO_URL}/v1/evaluation/jobs/{custom_rag_eval_job_id}/results\")\n", + "pp(resp.json()['groups']['evaluation']['metrics'])" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b8ba9c56-5187-4b33-9b9b-284c9ddfcfa0", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + 
"name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.12" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/nemo/Evaluator/eval_dataset/agent_data/agent_goal_data.jsonl b/nemo/Evaluator/eval_dataset/agent_data/agent_goal_data.jsonl new file mode 100644 index 000000000..f1c8d88dc --- /dev/null +++ b/nemo/Evaluator/eval_dataset/agent_data/agent_goal_data.jsonl @@ -0,0 +1,2 @@ +{"user_input":[{"content":"Hey, book a table at the nearest best Chinese restaurant for 8:00pm","role":"user"},{"content":"Sure, let me find the best options for you.","role":"assistant","tool_calls":[{"name":"restaurant_search","args":{"cuisine":"Chinese","time":"8:00pm"}}]},{"content":"Found a few options: 1. Golden Dragon, 2. Jade Palace","role":"tool"},{"content":"I found some great options: Golden Dragon and Jade Palace. Which one would you prefer?","role":"assistant"},{"content":"Let's go with Golden Dragon.","role":"user"},{"content":"Great choice! I'll book a table for 8:00pm at Golden Dragon.","role":"assistant","tool_calls":[{"name":"restaurant_book","args":{"name":"Golden Dragon","time":"8:00pm"}}]},{"content":"Table booked at Golden Dragon for 8:00pm.","role":"tool"},{"content":"Your table at Golden Dragon is booked for 8:00pm. Enjoy your meal!","role":"assistant"},{"content":"thanks","role":"user"}],"reference":"Table booked at one of the chinese restaurants at 8 pm"} +{"user_input":[{"content":"Hey, book a table at the nearest Indian restaurant for 8:00pm","role":"user"},{"content":"Sure, let me find the best options for you.","role":"assistant","tool_calls":[{"name":"restaurant_search","args":{"cuisine":"Indian","time":"8:00pm"}}]},{"content":"Found a few options: 1. Icca, 2. Sagum","role":"tool"},{"content":"I found some great options: Icca and Sagum. 
Which one would you prefer?","role":"assistant"},{"content":"Let's go with Icca.","role":"user"},{"content":"Great choice! I'll book a table for 8:00pm at Icca.","role":"assistant","tool_calls":[{"name":"restaurant_book","args":{"name":"Icca","time":"8:00pm"}}]},{"content":"Table booked at Icca for 8:00pm.","role":"tool"},{"content":"Your table at Icca is booked for 8:00pm. Enjoy your meal!","role":"assistant"},{"content":"thanks","role":"user"}],"reference":"Table booked at one of the Indian restaurants at 8 pm"} \ No newline at end of file diff --git a/nemo/Evaluator/eval_dataset/agent_data/agent_tool_call_data.jsonl b/nemo/Evaluator/eval_dataset/agent_data/agent_tool_call_data.jsonl new file mode 100644 index 000000000..9d450cd83 --- /dev/null +++ b/nemo/Evaluator/eval_dataset/agent_data/agent_tool_call_data.jsonl @@ -0,0 +1,2 @@ +{"user_input":[{"content":"What's the weather like in New York right now?","type":"human"},{"content":"The current temperature in New York is 75°F and it's partly cloudy.","type":"ai","tool_calls":[{"name":"weather_check","args":{"location":"New York"}}]},{"content":"Can you translate that to Celsius?","type":"human"},{"content":"Let me convert that to Celsius for you.","type":"ai","tool_calls":[{"name":"temperature_conversion","args":{"temperature_fahrenheit":75}}]},{"content":"75°F is approximately 23.9°C.","type":"tool"},{"content":"75°F is approximately 23.9°C.","type":"ai"}],"reference_tool_calls":[{"name":"weather_check","args":{"location":"New York"}},{"name":"temperature_conversion","args":{"temperature_fahrenheit":75}}]} +{"user_input":[{"content":"What's the weather like in Chicago right now?","type":"human"},{"content":"Let me check the current weather in Chicago.","type":"ai","tool_calls":[{"name":"weather_check","args":{"location":"Chicago"}}]},{"content":"Yes, it's currently raining in Chicago with a temperature of 60°F.","type":"tool"},{"content":"It's currently raining in Chicago with a temperature of 
60°F.","type":"ai"},{"content":"How much is that in Celsius?","type":"human"},{"content":"Converting 60°F to Celsius...","type":"ai","tool_calls":[{"name":"temperature_conversion","args":{"temperature_fahrenheit":60}}]},{"content":"60°F is about 15.6°C.","type":"tool"},{"content":"60°F is about 15.6°C.","type":"ai"}],"reference_tool_calls":[{"name":"weather_check","args":{"location":"Chicago"}},{"name":"temperature_conversion","args":{"temperature_fahrenheit":60}}]} \ No newline at end of file diff --git a/nemo/Evaluator/eval_dataset/llm_judge_data/math_dataset.csv b/nemo/Evaluator/eval_dataset/llm_judge_data/math_dataset.csv new file mode 100644 index 000000000..4e4f69a21 --- /dev/null +++ b/nemo/Evaluator/eval_dataset/llm_judge_data/math_dataset.csv @@ -0,0 +1,5 @@ +id,question,answer,reference_answer +0,What is 4+8?,12,The answer is 12 +1,Square root of 144?,12,The answer is 12 +2,Cube of 2?,8,The answer is 8 +3,What is 6*7?,42,The answer is 42 \ No newline at end of file diff --git a/nemo/Evaluator/eval_dataset/retriever_and_rag/corpus.jsonl b/nemo/Evaluator/eval_dataset/retriever_and_rag/corpus.jsonl new file mode 100644 index 000000000..e43cba7f5 --- /dev/null +++ b/nemo/Evaluator/eval_dataset/retriever_and_rag/corpus.jsonl @@ -0,0 +1,20 @@ +{"_id": "doc1", "title": "Eiffel Tower - A French Icon", "text": "The Eiffel Tower is an iconic landmark of Paris, France. 
It was designed by the engineer Gustave Eiffel and built for the 1889 Exposition Universelle (World's Fair) to celebrate the 100th anniversary of the French Revolution.", "metadata": {}} +{"_id": "doc2", "title": "The Great Wall of China - Ancient Protection", "text": "The Great Wall of China is a series of fortifications made of stone, brick, tamped earth, wood, and other materials, generally built along an east-to-west line across the historical northern borders of China to protect the Chinese states and empires against raids and invasions from various nomadic groups.", "metadata": {}} +{"_id": "doc3", "title": "Taj Mahal - A Symbol of Love", "text": "The Taj Mahal is an ivory-white marble mausoleum on the south bank of the Yamuna river in the Indian city of Agra. It was commissioned in 1632 by the Mughal emperor, Shah Jahan (reigned 1628\u20131658), to house the tomb of his favourite wife, Mumtaz Mahal.", "metadata": {}} +{"_id": "doc4", "title": "Machu Picchu - Lost City of the Incas", "text": "Machu Picchu is a 15th-century Inca citadel situated on a mountain ridge above the Sacred Valley in Peru. It is the most famous icon of Inca civilization, known for its sophisticated dry-stone walls that fuse huge blocks without the use of mortar.", "metadata": {}} +{"_id": "doc5", "title": "The Colosseum - Ancient Roman Architecture", "text": "The Colosseum, also known as the Flavian Amphitheatre, is an oval amphitheatre located in the centre of the city of Rome, Italy. Built of concrete and sand, it is the largest amphitheatre ever built and is considered one of the greatest works of Roman architecture and engineering.", "metadata": {}} +{"_id": "doc6", "title": "Petra - Ancient City of Jordan", "text": "Petra is a historical and archaeological city in southern Jordan. The city is famous for its rock-cut architecture and water conduit system. 
Another name for Petra is the Rose City due to the color of the stone out of which it is carved.", "metadata": {}} +{"_id": "doc7", "title": "Chichen Itza - Ancient Mayan City", "text": "The Chichen Itza is a large pre-Columbian city built by the Maya civilization. The ruins of Chichen Itza are located in the northern center of the Yucatan Peninsula, in the Mexican state of Yucatan.", "metadata": {}} +{"_id": "doc8", "title": "Pyramids of Giza - Ancient Egyptian Wonders", "text": "The Pyramids of Giza are the largest Egyptian pyramids, located on the Giza plateau on the outskirts of Cairo, Egypt. The pyramids, built as tombs for Pharaohs, are the only surviving structures of the Seven Wonders of the Ancient World.", "metadata": {}} +{"_id": "doc9", "title": "Sistine Chapel - Renaissance Masterpiece", "text": "The Sistine Chapel is a chapel in the Apostolic Palace, the official residence of the Pope, in Vatican City. It is famous for its architecture and its decoration that was painted by various most famous artists of the High Renaissance, including Michelangelo.", "metadata": {}} +{"_id": "doc10", "title": "Grand Canyon - Natural Wonder of the World", "text": "The Grand Canyon is a steep-sided canyon carved by the Colorado River in Arizona, United States. The Grand Canyon is 277 miles (446 km) long, up to 18 miles (29 km) wide and over a mile (1.6 km) deep.", "metadata": {}} +{"_id": "doc11", "title": "Christopher Lloyd", "text": "Christopher Allen Lloyd (born October 22, 1938)[1] is an American actor, voice actor, and comedian. 
He is best known for his roles as Emmett \"Doc\" Brown in the Back to the Future trilogy, Judge Doom in Who Framed Roger Rabbit (1988), Merlock the Magician in DuckTales the Movie: Treasure of the Lost Lamp (1990), Uncle Fester in The Addams Family (1991) and its sequel Addams Family Values (1993), and Grigori Rasputin in Anastasia (1997).", "metadata": {}} +{"_id": "doc12", "title": "Tobacco advertising", "text": "After World War II, cigarette companies advertised frequently on television programs. To combat this move by the cigarette companies, the Federal Communications Commission required television stations to air anti-smoking advertisements at no cost to the organizations providing such advertisements. In 1970, Congress took their anti-smoking initiative one step further and passed the Public Health Cigarette Smoking Act, banning the advertising of cigarettes on television and radio starting on January 2, 1971. After the television ban, most cigarette advertising took place in magazines, newspapers, and on billboards. However, in 1999 all cigarette billboard advertisements were replaced with anti-smoking messages, with some of these anti-smoking messages playing parodies of cigarette companies advertising figures and slogans. Since 1984, cigarette companies have also been forced to place Surgeon's General warnings on all cigarette packs and advertisements because of the passing of the Federal Cigarette Labeling and Advertising Act.[8] Restrictions on cigarette companies became even tighter in 2010 with the Family Smoking Prevention and Tobacco Control Act. 
The act prohibits tobacco companies from sponsoring sports, music, and other cultural events and also prevents the display of their logos or products on T-shirts, hats, or other apparel.[9] The constitutionality of both this act and the Food and Drug Administration's new graphic cigarette warning labels are being questioned under cigarette companies' first amendment rights.[10]", "metadata": {}} +{"_id": "doc13", "title": "Beti Bachao, Beti Padhao Yojana", "text": "In 26 August 2016, Olympics 2016 bronze medallist Sakshi Malik was made brand ambassador for BBBP.[7]", "metadata": {}} +{"_id": "doc14", "title": "List of Prison Break episodes", "text": "The first five seasons of Prison Break have been released on DVD and Blu-ray in Regions 1, 2, and 4. Each DVD boxed set includes all of the broadcast episodes from that season, the associated special episode, commentary from cast and crew, and profiles of various parts of Prison Break, such as Fox River State Penitentiary or the tattoo.[4][5][6][7] Prison Break is also available online, including iTunes, Amazon Video, and Netflix.[8][9] After the premiere of the second season of Prison Break, Fox began online streaming of the prior week's episode, though it originally restricted viewing to the United States.[10]", "metadata": {}} +{"_id": "doc15", "title": "Atomic bombings of Hiroshima and Nagasaki", "text": "By August 1945, the Allies' Manhattan Project had produced two types of atomic bomb, and the 509th Composite Group of the United States Army Air Forces (USAAF) was equipped with the specialized Silverplate version of the Boeing B-29 Superfortress that could deliver them from Tinian in the Mariana Islands. Orders for atomic bombs to be used on four Japanese cities were issued on July 25. On August 6, one of its B-29s dropped a Little Boy uranium gun-type bomb on Hiroshima. Three days later, on August 9, a Fat Man plutonium implosion-type bomb was dropped by another B-29 on Nagasaki. 
The bombs immediately devastated their targets. Over the next two to four months, the acute effects of the atomic bombings killed 90,000\u2013146,000 people in Hiroshima and 39,000\u201380,000 people in Nagasaki; roughly half of the deaths in each city occurred on the first day. Large numbers of people continued to die from the effects of burns, radiation sickness, and other injuries, compounded by illness and malnutrition, for many months afterward. In both cities, most of the dead were civilians, although Hiroshima had a sizable military garrison.", "metadata": {}} +{"_id": "doc16", "title": "Two-party system", "text": "There is general agreement that the United States has a two-party system; historically, there have been few instances in which third party candidates won an election. In the First Party System, only Alexander Hamilton's Federalist Party and Thomas Jefferson's Democratic-Republican Party were significant political parties. Toward the end of the First Party System, the Republicans dominated a one-party system (primarily under the Presidency of James Monroe). Under the Second Party System, the Democratic-Republican Party split during the election of 1824 into Adams' Men and Jackson's Men. In 1828, the modern Democratic Party formed in support of Andrew Jackson. The National Republicans were formed in support of John Quincy Adams. After the National Republicans collapsed, the Whig Party and the Free Soil Party quickly formed and collapsed. In 1854, the modern Republican Party formed from a loose coalition of former Whigs, Free Soilers and other anti-slavery activists. Abraham Lincoln became the first Republican president in 1860.", "metadata": {}} +{"_id": "doc17", "title": "Equilateral triangle", "text": "An equilateral triangle is the most symmetrical triangle, having 3 lines of reflection and rotational symmetry of order 3 about its center. 
Its symmetry group is the dihedral group of order 6 D3.", "metadata": {}} +{"_id": "doc18", "title": "The O.C. (season 4)", "text": "The fourth and final season of The O.C., an American teen drama television series, aired in the United States from November 2, 2006 to February 22, 2007 and consisted of sixteen episodes. The O.C's final season aired Thursdays at 9:00\u00a0p.m. ET in the United States on Fox, a terrestrial television network.[1] Fox tested a new timeslot by airing the second episode on Wednesday at 9:00\u00a0p.m. ET against ABC's Lost and CBS's Criminal Minds.[2] With Lost about to enter a three-month hiatus, Fox hoped that changing the timeslot for The O.C. would attract back viewers it had lost since the end of the previous season; however, the move was unsuccessful and the show returned to its Thursday timeslot.[3]", "metadata": {}} +{"_id": "doc19", "title": "Crossing the Bar", "text": "The extended metaphor of \"crossing of bar\" represents traveling serenely and securely from life through death. The Pilot is a metaphor for God, whom the speaker hopes to meet face to face. Tennyson explained, \"The Pilot has been on board all the while, but in the dark I have not seen him\u2026[He is] that Divine and Unseen Who is always guiding us.\"[1]", "metadata": {}} +{"_id": "doc20", "title": "Protectionism", "text": "Protectionism is the economic policy of restricting imports from other countries through methods such as tariffs on imported goods, import quotas, and a variety of other government regulations. Proponents claim that protectionist policies shield the producers, businesses, and workers of the import-competing sector in the country from foreign competitors. 
However, they also reduce trade and adversely affect consumers in general (by raising the cost of imported goods), and harm the producers and workers in export sectors, both in the country implementing protectionist policies, and in the countries protected against.", "metadata": {}} \ No newline at end of file diff --git a/nemo/Evaluator/eval_dataset/retriever_and_rag/qrels/test.tsv b/nemo/Evaluator/eval_dataset/retriever_and_rag/qrels/test.tsv new file mode 100644 index 000000000..5be0bac96 --- /dev/null +++ b/nemo/Evaluator/eval_dataset/retriever_and_rag/qrels/test.tsv @@ -0,0 +1,101 @@ +query-id corpus-id score +d4791e95319ed5799fe2e42e52bc947608626cf54828725cc98e19a490e31bb8 doc1 1 +c6329f64c45dceca6c4bd99dcd9d12e41dfe1cd8f654616ab319e9e9e31b8e99 doc1 1 +4d7474ae36a0a7a8555dd42c863e51e2f2c25614391876a65695cbfcfa36637c doc1 1 +ba9889ab0567d3a7f477d12370efaaced2f77282fb321198eb0f4eadb13f1b27 doc1 1 +9e4a4c17c17223eae9b10749486319eca51fe04364409454c147a5613e2c3e98 doc1 1 +9988a93f1d281f8a4f109a63f08f4678ec4625375250901e14761bd394e5381a doc2 1 +239dec3fb00a7d424e21b55a3b55b5877951c7ef6450d2d4de8fdc5cf39424fd doc2 1 +672ec0ce4c048890428b405cf5a5cf2fb62c37da11eb32a5a5666786c9848e2f doc2 1 +bc62643829abe4184f204299034d98ae936a3711b8fbb93f607f7eb44e4962b8 doc2 1 +f91cd8f9315b982471907638dc60f846bd54fbdc34db9ae6750395aedae24352 doc2 1 +4544bce08ed71d23edebf19ef59a86ce3db295cb9546291e262495944aedf616 doc3 1 +611bb43ce3321561149d1f513cca46d8a7bdb8c11c32e3cbdbaa6ad373592ebf doc3 1 +9d476a2f512e7b758fafe9504dfcf2bc6e6794ee85b6fca2c530a6a9c60f0832 doc3 1 +801e8214b018bd8e0aac7e8631d7a82face81d483fbe3d6009d9278a946d7f68 doc3 1 +ea27f5a45db1ba616513dc69654d1975fcd60831eb87629b068c2333b2d4170a doc3 1 +719cb39d8508cecd912286496cacf63c4e40b92fb19514738b591dcfc00de93f doc4 1 +c301a19a6e44e7dbcf4199b795e268c12cb03980ba7dd0c1b0987e4124273fc8 doc4 1 +861cbf8c481828629aece860b032d73b15fa24ec32c7ca56f50463ed361fc226 doc4 1 
+a8beced7938da062f56f99cb5aa0f591653f5604a45460978d79a3c7c4c4d1af doc4 1 +8439305d8e992d0382bfd844ec1112f8e6fa74a06c6675a3a5df8ec283768558 doc4 1 +463d8b1341734939d23575b487076168f9d3189a3c8caf71d2102b3b260c2f35 doc5 1 +98968d1bf294786eb1b16249ae84e9c5bd61f8b2d198ffaece756c040c6fb920 doc5 1 +cef3453c377c4212cfe324f88c03d10a0f9adf39ef5414c9010e3fee616d2e16 doc5 1 +b054567594dc05b28e7a9285874a42b34d4975b3faa11831576223fa7acdeda8 doc5 1 +8fa174b629b8c6e50f3a76d8fdfcc7902f575ffa12624be813d5bfbd9f767831 doc5 1 +100507878129d6b20eaa8150a72bce36d4a60d08d5fc1aedf9d59054bce20da6 doc6 1 +903e62c8d08b3f58252ccda641d90ccb80bea056c56489bf8f0bb8e800fd13d0 doc6 1 +abd53764de3136706f539beb77d0d39c69e6e1617affd6822d1d61e5cb7ccc69 doc6 1 +482d31f3d23498a496719d5d461f86407a93ed344d922d045afb327dee0e2800 doc6 1 +bc29052d954a0821ab6221dae7cdadb471d2f02f5672aca529c5e077e4620857 doc6 1 +1f2cac557fd69325838a58f0402d7668092c444a2c5afef49e169db144668739 doc7 1 +32fe21d4640076008527e1318c3760e30317ba2d6f5196a00a738c7ddaaf637e doc7 1 +4d951731a1df580f54afbb4ec2f9095c0e4133d490b061630e8c82a6baa457f2 doc7 1 +63697d4c54f7c5c6e2aa9eefbf561354f28cf842287550ce7e6b80a23b44c508 doc7 1 +383641b56f501cf1b4bca7512dc78e9df4b6208c776bb64cd910acaaa05a03fb doc7 1 +4ab9b1148552aa354a95712e3bb78fa7625151e9727c22a9716eee55c8cd1ca5 doc8 1 +8a0f32e2bdda9140025f64fb77d82b8ed2c67b971cb4c44432ca1847a3e33ebc doc8 1 +f3bf2d4d988d35bcd34440f4df27d81e705aa1e4250a44e1573f2426e4c672c9 doc8 1 +5f87ea96710eeadd6b9fce4b8fa0a31c4e4ad0c4deefa1d4f03c003b812ca35c doc8 1 +4006d3300b1e5e882bde97c7a38cd57f11e3bea840c5d02c5106b2877eada5af doc8 1 +0492bed59f5f2f477031319746855c36d341367dd0981cd1def258d4f6f233b2 doc9 1 +e24a3dc0487600b38b9224ff5eab454a3434cab46be17557967c63bf1d71bc41 doc9 1 +1bfc6624fca71593956762639a30f954c022e45161a59d7aa2c387451fb2628d doc9 1 +1ca5ffac6744e0320a1d7d7d5ab6c106321a2a5ca950072faa330ea8849c5323 doc9 1 +5c4bc852ee60333f3dbe85c4a973da86767e84f3a696aba84ce093ef7ae1db07 doc9 1 
+f8744a94668898ea7a9af97d04c870d73f9718a569f5c4ba0a0b46387c132924 doc10 1 +668a0071474dae0a9a07cc97322c8c43666c3d1dbda400527842d5313f85f81d doc10 1 +212976ea90e81e91a3eb1032dbad007ae21ab3e36d3a197f470191651d8554a1 doc10 1 +ab31a5f5da3eb9b30e4597c8f2eeded3f4403c1372b6f8a74f7fd14f7b4ecd61 doc10 1 +e5dbb3daaf04133c91aee4ae5c23ad3eb1757a09bb476813ba1b8d744ec86853 doc10 1 +d6006420dc385e72a2e8eda25542df449150578a282fda6663269b021b66098c doc11 1 +5a73841ef94dfaf35f5ac1b3fb6ba819cff3adeba1f45c9e1e3bdb683858577c doc11 1 +8b861544082204ba43b1a68d3e1fc9d59ad248b728574706d74afbd40192f040 doc11 1 +c4b0c28cc818b06c14fd1135df72379ce5cf2631a490fcfa219935191ba08b84 doc11 1 +42a3f8e0d64feee3603b181022a106a97d71344720b923bc1f90af78964dd493 doc11 1 +e2836ccb6afb375121364f988119e252fae23780ee137f145de6da21ab53dafb doc12 1 +d1e3c12606ef7743f57751e47e8e404ff2763f290121e393758d71f4a3aa9ca7 doc12 1 +1e5c00b6dc0b3efd1bcf172c4e310142ea3bdc44adf9855c7e6016c5e290fbdf doc12 1 +b753b1f7764f19fb207c491283bfeb26c47d3c55ec946428cf76772ae36506d2 doc12 1 +af1aae944e5f4ccf2a95101c2cdf2ac638a72bda7fe88872cd5955c56996c48e doc12 1 +c15597cc9a8f95e6e1353309e69590cc2d9ebcab1a999ce2b4e7e0a2493d38bf doc13 1 +9ba6e043d95e03e67bfbf5099c7de9e6c4089679d3a72ae025c895e84b7ce45d doc13 1 +5e546550dbec16a3f4cbe2ac034e5a71cbe80dad360091456d0b344ecc2d3153 doc13 1 +4dcbc0c9666769faf634cb3c5a8539f09d09774a702f70662b72ee84260c6a9a doc13 1 +9e97131fc58c473f7ac7d3e0ed0f25212e7e5d773d02641159eab064b46a209f doc13 1 +3512d9923d2f81059aa64d569c272fe33ced6f931c8c01fad46fd098a54c0e09 doc14 1 +6d7af57205637b8fa8cb7665619f5d36af09f8f56b06515aceee0813187404f7 doc14 1 +7e111e169cfef0f5c99cbabb522d72df2ed55134641aed68bbe9198a017a3f67 doc14 1 +28c42b4f42ca39358042db96a01aec297fa886e2d9a36cc84870fda089a6870b doc14 1 +77ec725f54837f1d86a1dd2c52f433c2b83290aa4339593b7c775c131e6c5b06 doc14 1 +a95b549912c99a328763a45f0511b4344fc6549e8e90caf6bd9f5210e5f0661f doc15 1 +f2b36e091bb21ef5903d17a80534859061406e40af0de87db33e13ec2387b832 doc15 1 
+bfd509121ad8a90e0533026c08d452e1bd56f83db167e8157f221f29c152ee10 doc15 1 +500924f95b3ae6011ae7cf060eb40e93b9d2d62944760e7dbd407d6231c67523 doc15 1 +a4b087bdf981463fb0c9339414fbcb1af5a0962cb300a20bd58c945cc3e34dec doc15 1 +62c6de75339d64b3c5b4d5f434f6b4e9d8b2625d88a39508d9aedfc3a1b39147 doc16 1 +a13a16a9168d8f18c6c8e636649797aa6067d1c870f6f7b57e9ea01ef7b7ec85 doc16 1 +57d13fd5c2ced7da0b4a93b52e0dc238564dacf567b8016238f484dc628e3fbb doc16 1 +4fb124be5d39851dc01336dde383ed082ca2a2e0e333cd92e857921c7022f2b3 doc16 1 +e79e0a2a9de22158daf2c5037fcbc4ab3ca71bcddcb4818980e3d7ae9989bfa4 doc16 1 +9ce5f6e8874f507ee6c4df3a7b56ad620e3403b74b0e5bada827f5a79a2905b9 doc17 1 +555c5b1a2e9f6549d0fcdfb90b044d3760ac7f52e29ca6a23d071549b24cbaea doc17 1 +b7e4e3a74de3a4ebd3cc5274dad55b64c304425212d9069d3ccaa24d8f30fdb0 doc17 1 +9499207380debdffa8c33f2ce6fa1a8189ff578707f62dfd916dc7d1b283eaa1 doc17 1 +90d4377aa058d6ef0124fbc61edde6ea9edb0efb7af5758322f9720e45434f12 doc17 1 +bb01d6a1a44ecc88fc81b47963eb585ce41d8cf3d9130b7af87c0fa16afc58fc doc18 1 +d10bec3ed0b1de2d9e2d34d5f1b3f9a064f0c55ad94f614d2ea017386b9a170d doc18 1 +1b16d1d166643ba00ec3e7a52ae5dfd17d7fd266817e0559f1f8e7ec1d91eefc doc18 1 +12ddbc559c471313e736b6a136d2b40239200b97169979613aa5e13bd93e49a3 doc18 1 +227372a9d23586b741e7c969719d3e46a3845438ba9c71bab6777b1623c95c6d doc18 1 +c78d6e1e3b459f5e484a61bfb174f482f5106aedabd41eab2321521ff98f5901 doc19 1 +e8eee05c41c2b11ca21e911a6efe892e4fb2423a888d31292a3db7d0ac4e0b5f doc19 1 +5ddd182273afaad1041cce9ac0acd222b7e3f4e5f6d32a53dbc8dcb10006174e doc19 1 +f77123640adc18f3c20d4545ca4efd137ff488d3c00e23edaeaf6047477d0454 doc19 1 +4934a02194e60bef81ae48e268589cb25a5a1d77a945fa8ab5db484f7901ed68 doc19 1 +a6cf02c4aac83a7ace43a93d3d65269666ec4d78a1e47f043b711f289e2b5a0a doc20 1 +05114cb50594dadce7fffb81f84b82736cd1b9a945fbf1d0206ba4d6f9383b5f doc20 1 +0ffed93e17e7077c9e7c31ed78cfbd9b49b3a015e81daccb239bc9f76f95131f doc20 1 +74d7b7cbea4794a880cbfbf9f0d212b7c3e3f9f00657bac9ce6f363ac6947560 doc20 1 
+e9973a6255565aee321cdc29ceea47a84aa5c392a49c017cad5c704f44d1ecc0 doc20 1 diff --git a/nemo/Evaluator/eval_dataset/retriever_and_rag/queries.jsonl b/nemo/Evaluator/eval_dataset/retriever_and_rag/queries.jsonl new file mode 100644 index 000000000..66709ffbd --- /dev/null +++ b/nemo/Evaluator/eval_dataset/retriever_and_rag/queries.jsonl @@ -0,0 +1,100 @@ +{"_id": "d4791e95319ed5799fe2e42e52bc947608626cf54828725cc98e19a490e31bb8", "text": "Who was the designer of the Eiffel Tower?", "metadata": {}} +{"_id": "c6329f64c45dceca6c4bd99dcd9d12e41dfe1cd8f654616ab319e9e9e31b8e99", "text": "When was the Eiffel Tower constructed and for what purpose?", "metadata": {}} +{"_id": "4d7474ae36a0a7a8555dd42c863e51e2f2c25614391876a65695cbfcfa36637c", "text": "In what year did the Eiffel Tower celebrate its 100th anniversary?", "metadata": {}} +{"_id": "ba9889ab0567d3a7f477d12370efaaced2f77282fb321198eb0f4eadb13f1b27", "text": "Which event was the Eiffel Tower built for?", "metadata": {}} +{"_id": "9e4a4c17c17223eae9b10749486319eca51fe04364409454c147a5613e2c3e98", "text": "Who is the engineer behind the Eiffel Tower?", "metadata": {}} +{"_id": "9988a93f1d281f8a4f109a63f08f4678ec4625375250901e14761bd394e5381a", "text": "What are the primary materials used in the construction of the Great Wall of China?", "metadata": {}} +{"_id": "239dec3fb00a7d424e21b55a3b55b5877951c7ef6450d2d4de8fdc5cf39424fd", "text": "What was the main purpose of building the Great Wall of China?", "metadata": {}} +{"_id": "672ec0ce4c048890428b405cf5a5cf2fb62c37da11eb32a5a5666786c9848e2f", "text": "Where is the Great Wall of China located?", "metadata": {}} +{"_id": "bc62643829abe4184f204299034d98ae936a3711b8fbb93f607f7eb44e4962b8", "text": "What kind of structures are the Great Wall of China?", "metadata": {}} +{"_id": "f91cd8f9315b982471907638dc60f846bd54fbdc34db9ae6750395aedae24352", "text": "What historical periods does the Great Wall of China cover?", "metadata": {}} +{"_id": 
"4544bce08ed71d23edebf19ef59a86ce3db295cb9546291e262495944aedf616", "text": "Who was the Taj Mahal built for?", "metadata": {}} +{"_id": "611bb43ce3321561149d1f513cca46d8a7bdb8c11c32e3cbdbaa6ad373592ebf", "text": "When did Shah Jahan commission the building of the Taj Mahal?", "metadata": {}} +{"_id": "9d476a2f512e7b758fafe9504dfcf2bc6e6794ee85b6fca2c530a6a9c60f0832", "text": "What type of building is the Taj Mahal?", "metadata": {}} +{"_id": "801e8214b018bd8e0aac7e8631d7a82face81d483fbe3d6009d9278a946d7f68", "text": "On which river is the Taj Mahal located?", "metadata": {}} +{"_id": "ea27f5a45db1ba616513dc69654d1975fcd60831eb87629b068c2333b2d4170a", "text": "Which emperor commissioned the construction of the Taj Mahal?", "metadata": {}} +{"_id": "719cb39d8508cecd912286496cacf63c4e40b92fb19514738b591dcfc00de93f", "text": "What type of structure is Machu Picchu?", "metadata": {}} +{"_id": "c301a19a6e44e7dbcf4199b795e268c12cb03980ba7dd0c1b0987e4124273fc8", "text": "On which mountain ridge is Machu Picchu located?", "metadata": {}} +{"_id": "861cbf8c481828629aece860b032d73b15fa24ec32c7ca56f50463ed361fc226", "text": "What is special about the walls of Machu Picchu?", "metadata": {}} +{"_id": "a8beced7938da062f56f99cb5aa0f591653f5604a45460978d79a3c7c4c4d1af", "text": "When was Machu Picchu built?", "metadata": {}} +{"_id": "8439305d8e992d0382bfd844ec1112f8e6fa74a06c6675a3a5df8ec283768558", "text": "Who built Machu Picchu?", "metadata": {}} +{"_id": "463d8b1341734939d23575b487076168f9d3189a3c8caf71d2102b3b260c2f35", "text": "What is the alternative name for the famous amphitheater located in Rome, Italy?", "metadata": {}} +{"_id": "98968d1bf294786eb1b16249ae84e9c5bd61f8b2d198ffaece756c040c6fb920", "text": "In which city is the largest amphitheater ever built located?", "metadata": {}} +{"_id": "cef3453c377c4212cfe324f88c03d10a0f9adf39ef5414c9010e3fee616d2e16", "text": "What is the Colosseum built of?", "metadata": {}} +{"_id": 
"b054567594dc05b28e7a9285874a42b34d4975b3faa11831576223fa7acdeda8", "text": "Why is the Colosseum considered one of the greatest works of Roman architecture and engineering?", "metadata": {}} +{"_id": "8fa174b629b8c6e50f3a76d8fdfcc7902f575ffa12624be813d5bfbd9f767831", "text": "What type of structure is the Colosseum?", "metadata": {}} +{"_id": "100507878129d6b20eaa8150a72bce36d4a60d08d5fc1aedf9d59054bce20da6", "text": "What is another name for the historical city known for its rock-cut architecture?", "metadata": {}} +{"_id": "903e62c8d08b3f58252ccda641d90ccb80bea056c56489bf8f0bb8e800fd13d0", "text": "What is Petra famous for, in terms of its physical structures?", "metadata": {}} +{"_id": "abd53764de3136706f539beb77d0d39c69e6e1617affd6822d1d61e5cb7ccc69", "text": "In what country is the archaeological site of Petra located?", "metadata": {}} +{"_id": "482d31f3d23498a496719d5d461f86407a93ed344d922d045afb327dee0e2800", "text": "Which factor gives Petra its other name, the Rose City?", "metadata": {}} +{"_id": "bc29052d954a0821ab6221dae7cdadb471d2f02f5672aca529c5e077e4620857", "text": "What is the primary reason for Petra's historical significance?", "metadata": {}} +{"_id": "1f2cac557fd69325838a58f0402d7668092c444a2c5afef49e169db144668739", "text": "What civilization is responsible for building the Chichen Itza?", "metadata": {}} +{"_id": "32fe21d4640076008527e1318c3760e30317ba2d6f5196a00a738c7ddaaf637e", "text": "In which Mexican state are the ruins of Chichen Itza located?", "metadata": {}} +{"_id": "4d951731a1df580f54afbb4ec2f9095c0e4133d490b061630e8c82a6baa457f2", "text": "Is Chichen Itza a modern city?", "metadata": {}} +{"_id": "63697d4c54f7c5c6e2aa9eefbf561354f28cf842287550ce7e6b80a23b44c508", "text": "Where is the exact location of the Chichen Itza ruins?", "metadata": {}} +{"_id": "383641b56f501cf1b4bca7512dc78e9df4b6208c776bb64cd910acaaa05a03fb", "text": "Who currently maintains the Chichen Itza ruins?", "metadata": {}} +{"_id": 
"4ab9b1148552aa354a95712e3bb78fa7625151e9727c22a9716eee55c8cd1ca5", "text": "What is the significance of the Pyramids of Giza in the context of the Seven Wonders of the Ancient World?", "metadata": {}} +{"_id": "8a0f32e2bdda9140025f64fb77d82b8ed2c67b971cb4c44432ca1847a3e33ebc", "text": "Where are the Pyramids of Giza located?", "metadata": {}} +{"_id": "f3bf2d4d988d35bcd34440f4df27d81e705aa1e4250a44e1573f2426e4c672c9", "text": "What is the purpose of the Pyramids of Giza?", "metadata": {}} +{"_id": "5f87ea96710eeadd6b9fce4b8fa0a31c4e4ad0c4deefa1d4f03c003b812ca35c", "text": "How many of the Seven Wonders of the Ancient World are still standing?", "metadata": {}} +{"_id": "4006d3300b1e5e882bde97c7a38cd57f11e3bea840c5d02c5106b2877eada5af", "text": "What is the historical significance of the Pyramids of Giza?", "metadata": {}} +{"_id": "0492bed59f5f2f477031319746855c36d341367dd0981cd1def258d4f6f233b2", "text": "Who are some of the artists that contributed to the decoration of the Sistine Chapel?", "metadata": {}} +{"_id": "e24a3dc0487600b38b9224ff5eab454a3434cab46be17557967c63bf1d71bc41", "text": "In which city is the Sistine Chapel located?", "metadata": {}} +{"_id": "1bfc6624fca71593956762639a30f954c022e45161a59d7aa2c387451fb2628d", "text": "What is the significance of the Sistine Chapel?", "metadata": {}} +{"_id": "1ca5ffac6744e0320a1d7d7d5ab6c106321a2a5ca950072faa330ea8849c5323", "text": "Who is the residence of the Pope?", "metadata": {}} +{"_id": "5c4bc852ee60333f3dbe85c4a973da86767e84f3a696aba84ce093ef7ae1db07", "text": "What is the style of the art in the Sistine Chapel?", "metadata": {}} +{"_id": "f8744a94668898ea7a9af97d04c870d73f9718a569f5c4ba0a0b46387c132924", "text": "How long is the Grand Canyon in miles?", "metadata": {}} +{"_id": "668a0071474dae0a9a07cc97322c8c43666c3d1dbda400527842d5313f85f81d", "text": "What has carved the Grand Canyon?", "metadata": {}} +{"_id": "212976ea90e81e91a3eb1032dbad007ae21ab3e36d3a197f470191651d8554a1", "text": "How wide is 
the Grand Canyon at its widest point?", "metadata": {}} +{"_id": "ab31a5f5da3eb9b30e4597c8f2eeded3f4403c1372b6f8a74f7fd14f7b4ecd61", "text": "How deep is the Grand Canyon?", "metadata": {}} +{"_id": "e5dbb3daaf04133c91aee4ae5c23ad3eb1757a09bb476813ba1b8d744ec86853", "text": "In which state is the Grand Canyon located?", "metadata": {}} +{"_id": "d6006420dc385e72a2e8eda25542df449150578a282fda6663269b021b66098c", "text": "In which well-known film did Christopher Lloyd play the role of a villainous toon?", "metadata": {}} +{"_id": "5a73841ef94dfaf35f5ac1b3fb6ba819cff3adeba1f45c9e1e3bdb683858577c", "text": "What type of character did Christopher Lloyd voice in DuckTales the Movie", "metadata": {}} +{"_id": "8b861544082204ba43b1a68d3e1fc9d59ad248b728574706d74afbd40192f040", "text": "What role did Christopher Lloyd play in the Back to the Future trilogy?", "metadata": {}} +{"_id": "c4b0c28cc818b06c14fd1135df72379ce5cf2631a490fcfa219935191ba08b84", "text": "In which two Addams Family movies did Christopher Lloyd play the role of Uncle Fester?", "metadata": {}} +{"_id": "42a3f8e0d64feee3603b181022a106a97d71344720b923bc1f90af78964dd493", "text": "In which animated film did Christopher Lloyd voice a character based on Grigori Rasputin?", "metadata": {}} +{"_id": "e2836ccb6afb375121364f988119e252fae23780ee137f145de6da21ab53dafb", "text": "What legislation was passed in 1970 to further combat cigarette advertising?", "metadata": {}} +{"_id": "d1e3c12606ef7743f57751e47e8e404ff2763f290121e393758d71f4a3aa9ca7", "text": "What was the purpose of the Federal Communications Commission's requirement for anti-smoking advertisements on television?", "metadata": {}} +{"_id": "1e5c00b6dc0b3efd1bcf172c4e310142ea3bdc44adf9855c7e6016c5e290fbdf", "text": "When were all cigarette billboard advertisements replaced with anti-smoking messages?", "metadata": {}} +{"_id": "b753b1f7764f19fb207c491283bfeb26c47d3c55ec946428cf76772ae36506d2", "text": "What restrictions were placed on cigarette 
companies through the Family Smoking Prevention and Tobacco Control Act in 2010?", "metadata": {}} +{"_id": "af1aae944e5f4ccf2a95101c2cdf2ac638a72bda7fe88872cd5955c56996c48e", "text": "What are the constitutionality of the Family Smoking Prevention and Tobacco Control Act and new graphic cigarette warning labels being questioned under?", "metadata": {}} +{"_id": "c15597cc9a8f95e6e1353309e69590cc2d9ebcab1a999ce2b4e7e0a2493d38bf", "text": "Who was appointed as the brand ambassador for BBBP in August 26, 2016?", "metadata": {}} +{"_id": "9ba6e043d95e03e67bfbf5099c7de9e6c4089679d3a72ae025c895e84b7ce45d", "text": "What is the significance of Sakshi Malik's appointment as brand ambassador for BBBP in August 26, 2016?", "metadata": {}} +{"_id": "5e546550dbec16a3f4cbe2ac034e5a71cbe80dad360091456d0b344ecc2d3153", "text": "What is the full form of BBBP?", "metadata": {}} +{"_id": "4dcbc0c9666769faf634cb3c5a8539f09d09774a702f70662b72ee84260c6a9a", "text": "When was Sakshi Malik appointed as the brand ambassador for BBBP?", "metadata": {}} +{"_id": "9e97131fc58c473f7ac7d3e0ed0f25212e7e5d773d02641159eab064b46a209f", "text": "What is the role of Sakshi Malik as the brand ambassador for BBBP?", "metadata": {}} +{"_id": "3512d9923d2f81059aa64d569c272fe33ced6f931c8c01fad46fd098a54c0e09", "text": "In which regions is the DVD boxed set of Prison Break available?", "metadata": {}} +{"_id": "6d7af57205637b8fa8cb7665619f5d36af09f8f56b06515aceee0813187404f7", "text": "What streaming services offer Prison Break for online viewing?", "metadata": {}} +{"_id": "7e111e169cfef0f5c99cbabb522d72df2ed55134641aed68bbe9198a017a3f67", "text": "What additional features are included in the Prison Break DVD boxed sets?", "metadata": {}} +{"_id": "28c42b4f42ca39358042db96a01aec297fa886e2d9a36cc84870fda089a6870b", "text": "When did Fox begin streaming prior week's episodes of Prison Break online?", "metadata": {}} +{"_id": "77ec725f54837f1d86a1dd2c52f433c2b83290aa4339593b7c775c131e6c5b06", "text": "Is 
access to online streaming of Prison Break restricted to the United States?", "metadata": {}} +{"_id": "a95b549912c99a328763a45f0511b4344fc6549e8e90caf6bd9f5210e5f0661f", "text": "What were the types of atomic bombs produced by the Manhattan Project?", "metadata": {}} +{"_id": "f2b36e091bb21ef5903d17a80534859061406e40af0de87db33e13ec2387b832", "text": "Which group of the USAAF dropped the atomic bombs on Hiroshima and Nagasaki?", "metadata": {}} +{"_id": "bfd509121ad8a90e0533026c08d452e1bd56f83db167e8157f221f29c152ee10", "text": "When and where was the first atomic bomb dropped during World War II?", "metadata": {}} +{"_id": "500924f95b3ae6011ae7cf060eb40e93b9d2d62944760e7dbd407d6231c67523", "text": "What was the immediate impact of the atomic bombings on Hiroshima and Nagasaki?", "metadata": {}} +{"_id": "a4b087bdf981463fb0c9339414fbcb1af5a0962cb300a20bd58c945cc3e34dec", "text": "How many people died as a result of the atomic bombings in Hiroshima and Nagasaki?", "metadata": {}} +{"_id": "62c6de75339d64b3c5b4d5f434f6b4e9d8b2625d88a39508d9aedfc3a1b39147", "text": "What political parties were significant during the First Party System in the United States?", "metadata": {}} +{"_id": "a13a16a9168d8f18c6c8e636649797aa6067d1c870f6f7b57e9ea01ef7b7ec85", "text": "When did the modern Democratic Party form in the United States?", "metadata": {}} +{"_id": "57d13fd5c2ced7da0b4a93b52e0dc238564dacf567b8016238f484dc628e3fbb", "text": "What were the two factions that the Democratic-Republican Party split into during the election of 1824?", "metadata": {}} +{"_id": "4fb124be5d39851dc01336dde383ed082ca2a2e0e333cd92e857921c7022f2b3", "text": "Who was the first Republican president in the United States?", "metadata": {}} +{"_id": "e79e0a2a9de22158daf2c5037fcbc4ab3ca71bcddcb4818980e3d7ae9989bfa4", "text": "What parties formed after the National Republicans collapsed in the United States?", "metadata": {}} +{"_id": "9ce5f6e8874f507ee6c4df3a7b56ad620e3403b74b0e5bada827f5a79a2905b9", 
"text": "What is the order of rotational symmetry of an equilateral triangle?", "metadata": {}} +{"_id": "555c5b1a2e9f6549d0fcdfb90b044d3760ac7f52e29ca6a23d071549b24cbaea", "text": "How many lines of reflection does an equilateral triangle have?", "metadata": {}} +{"_id": "b7e4e3a74de3a4ebd3cc5274dad55b64c304425212d9069d3ccaa24d8f30fdb0", "text": "What is the symmetry group of an equilateral triangle?", "metadata": {}} +{"_id": "9499207380debdffa8c33f2ce6fa1a8189ff578707f62dfd916dc7d1b283eaa1", "text": "What kind of symmetry does an equilateral triangle have?", "metadata": {}} +{"_id": "90d4377aa058d6ef0124fbc61edde6ea9edb0efb7af5758322f9720e45434f12", "text": "What is the symmetry group of a regular triangle?", "metadata": {}} +{"_id": "bb01d6a1a44ecc88fc81b47963eb585ce41d8cf3d9130b7af87c0fa16afc58fc", "text": "In what timeslot did The O.C. originally air during its final season?", "metadata": {}} +{"_id": "d10bec3ed0b1de2d9e2d34d5f1b3f9a064f0c55ad94f614d2ea017386b9a170d", "text": "On what day and time did Fox test airing The O.C.'s second episode during its final season?", "metadata": {}} +{"_id": "1b16d1d166643ba00ec3e7a52ae5dfd17d7fd266817e0559f1f8e7ec1d91eefc", "text": "How many episodes did The O.C.'s final season consist of?", "metadata": {}} +{"_id": "12ddbc559c471313e736b6a136d2b40239200b97169979613aa5e13bd93e49a3", "text": "Against which shows did Fox air The O.C.'s second episode during its final season?", "metadata": {}} +{"_id": "227372a9d23586b741e7c969719d3e46a3845438ba9c71bab6777b1623c95c6d", "text": "When did The O.C.'s final season air in the United States?", "metadata": {}} +{"_id": "c78d6e1e3b459f5e484a61bfb174f482f5106aedabd41eab2321521ff98f5901", "text": "Who is the Pilot metaphorically representing in this poem?", "metadata": {}} +{"_id": "e8eee05c41c2b11ca21e911a6efe892e4fb2423a888d31292a3db7d0ac4e0b5f", "text": "What does the \"crossing of bar\" symbolize in the poem?", "metadata": {}} +{"_id": 
"5ddd182273afaad1041cce9ac0acd222b7e3f4e5f6d32a53dbc8dcb10006174e", "text": "Who does Tennyson hope to meet face to face according to the poem?", "metadata": {}} +{"_id": "f77123640adc18f3c20d4545ca4efd137ff488d3c00e23edaeaf6047477d0454", "text": "Who is always guiding us according to Tennyson's explanation of the poem?", "metadata": {}} +{"_id": "4934a02194e60bef81ae48e268589cb25a5a1d77a945fa8ab5db484f7901ed68", "text": "What is the role of the Pilot in the poem's metaphor?", "metadata": {}} +{"_id": "a6cf02c4aac83a7ace43a93d3d65269666ec4d78a1e47f043b711f289e2b5a0a", "text": "What are some methods used in protectionist policies?", "metadata": {}} +{"_id": "05114cb50594dadce7fffb81f84b82736cd1b9a945fbf1d0206ba4d6f9383b5f", "text": "Who do protectionist policies shield according to its proponents?", "metadata": {}} +{"_id": "0ffed93e17e7077c9e7c31ed78cfbd9b49b3a015e81daccb239bc9f76f95131f", "text": "What are the negative effects of protectionist policies on consumers?", "metadata": {}} +{"_id": "74d7b7cbea4794a880cbfbf9f0d212b7c3e3f9f00657bac9ce6f363ac6947560", "text": "Which sectors are harmed by protectionist policies?", "metadata": {}} +{"_id": "e9973a6255565aee321cdc29ceea47a84aa5c392a49c017cad5c704f44d1ecc0", "text": "What is the general economic policy of protectionism?", "metadata": {}} \ No newline at end of file diff --git a/nemo/Evaluator/eval_dataset/similarity_metrics_data/inputs.jsonl b/nemo/Evaluator/eval_dataset/similarity_metrics_data/inputs.jsonl new file mode 100644 index 000000000..89a79eb9f --- /dev/null +++ b/nemo/Evaluator/eval_dataset/similarity_metrics_data/inputs.jsonl @@ -0,0 +1,128 @@ +{"prompt": "Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \nQUESTION: In order to be sued in a particular jurisdiction, say New York, a company must have a minimal business presence in the jurisdiction. 
What constitutes such a presence? Suppose the company engaged a New York-based Plaintiff, and its representatives signed the contract with the Plaintiff in New York City. Does this satisfy the minimum presence rule? Suppose, instead, the plaintiff and contract signing were in New Jersey, but the company hired a law firm with offices in New York City. Does this qualify? \nTITLE: ", "ideal_response": "What constitutes \"doing business in a jurisdiction?\"", "category": "summarization"} +{"prompt": "Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \nQUESTION: I am currently watching the Depp/Heard trial, and there seems to be a lot of stuff that cannot be shown, or said, and a lot of people i would expect to make a statement are absent. My guess is that there are rulings in place that preclude certain witnesses from being called, certain evidence from being shown, and certain topics from being broached - but how, and why? And are those constraints known to the jury? For instance, there was a last minute witness that somehow offered herself up to testify, a very crucial character witness, that i would have thought the plaintiff would spare no expense in finding. There was a photo expert that was somehow prevented from talking about colors, there is a lot of confusion about the existence or non-existence of a wall mounted phone, but both parties have refrained from showing images of the wall that the phone should have hung on - although the existence of such photos is a near certainty, etc. So my questions: How are such no-show-no-tell boundaries established, does the jury get to know them, and is this also in the public record somehow? 
\nTITLE: ", "ideal_response": "What are the constraints on a civil suit (witnesses, evidence, topics) - and are they public", "category": "summarization"} +{"prompt": "Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \nQUESTION: For example in medicine, you have requirements in many states for what constitutes a valid patient / client relationship and that this requires some in-person visits etc. does something exist like this in the legal profession? \nTITLE: ", "ideal_response": "Are there any legal restrictions to practicing law over video conference like there is with medicine?", "category": "summarization"} +{"prompt": "Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \nQUESTION: Suppose I am not a professional lawyer but I have standing and want to bring a case to court pro se. To make this widely applicable and more readily answerable: Suppose I want to file a 42 USC 1983 complaint in U.S. federal court for some violation of my fourth-amendment rights by a municipal police officer. I believe the first thing I would want to do is find and read similar cases that have been adjudicated on that law using similar claims in my district. In fact, if I'm reasonably literate I assume I could construct all necessary filings and arguments using examples from prior cases, in addition to getting a sense of my odds of success in court. But a person who is not a professional lawyer typically lacks ready access to Lexis, WestLaw, and other such professional resources for researching recent and applicable case law. (Let us assume that the Bar does not exist in principle as a barrier to citizens seeking redress of grievances through the judicial system. 
I.e., \"That's what lawyers are for, so pay up if you want justice\" may be the practical answer. But I want to know how practical it is for one to seek justice pro se.) How can a pro se litigant to find applicable case law and bootstrap his way through the judicial process? \nTITLE: ", "ideal_response": "What resources are available to a pro se litigant?", "category": "summarization"} +{"prompt": "Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \nQUESTION: Yesterday I had a discussion with my colleague. This year I am traveling rather large distances for work. My colleague asked me how I can bear the costs of travel. I told her it is ok, because the government allows me to deduct a large part of my travel costs from my taxes. My colleague warned me that she has heard that the government will revoke these rules, making it no longer possible to deduct as much of my travel costs from my taxes. Her argument was that since my taxes for 2019 will be done in april of 2020, the government could, at any point before that date, change the rules so that I may no longer be able to deduct my travel expenses. I argued that rules could only apply to travels/work that occurred after said rules were implemented, and could not apply retroactively. Norway has an \"ex post facto\" clause in its constitution against retroactive laws. Is my reasoning correct? \nTITLE: ", "ideal_response": "Would \"ex post facto\" apply to tax deduction laws?", "category": "summarization"} +{"prompt": "Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. 
\nQUESTION: It's Halloween, and some municipalities are creating curfew laws intended to prevent people under 18 from being in public during the late night hours, for example South Brunswick, NJ. It seems like this apparently common sort of ordinance could be considered unconstitutional on multiple grounds, in that it is restricting free movement as well as targeting only a specific demographic. Here is the actual text of one such ordinance. So here are my questions: Are this and similar ordinances constitutionally valid? What defenses could a government make if challenged? Further, based on precedent, would they likely be successful? \nTITLE: ", "ideal_response": "What is the justification for curfew laws?", "category": "summarization"} +{"prompt": "Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \nQUESTION: If a tenant buys the house they were renting, does the landlord have to refund the security deposit to them? Alice is renting a house from Bob on a six-month lease. She then enters an agreement to purchase the house. At closing is Bob responsible to pay back her security deposit? I am in Oregon and could not find the answer easily. I would assume this is a basic, common-sense \"yes\", but I'd like to know the statutory basis, if possible. Thanks. \nTITLE: ", "ideal_response": "If a tenant buys the house they were renting, does the landlord have to refund the security deposit to them?", "category": "summarization"} +{"prompt": "Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \nQUESTION: The companies which operate coworking facilities are in a position to learn a lot about a person's business. 
Every facility I've seen is blanketed with cameras, potentially including microphones, the internet connection is provided and monitored by them, and their staff greets every visitor for a meeting. I've reviewed the contracts from a few of these companies and was surprised to discover they do not explicitly state they will limit their use of information they learn about their customers' businesses merely to the realm of providing the service to the customers. One would expect at least a simple and clear statement that customer trade secrets and intellectual property will not be used by nor shared by the coworking company. But no such statement is provided. Is there something I am not understanding here? For example, is there overarching legal precedent (which wouldn't be included in the contract) that protects a small business if they use one of these facilities? To be clear, I am not talking about protections from other users of the coworking facility. I am talking about protection from the owner of the coworking facility itself and its employees. \nTITLE: ", "ideal_response": "Are coworking companies bound by legal precedent not to use the intellectual property nor trade secrets of its customers?", "category": "summarization"} +{"prompt": "Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \nQUESTION: In India, religious figures have been given the status of legal person, and from the case law, I've read there doesn't seem to be a clear rationale for this. Corporations and organizations are also given the status of a legal person. Does this mean that if a significant amount of people associate with something it can gain the status of a legal person? By this logic, can say fictional characters get the status of a legal person as well? and would it effect copyright law in any way ? 
\nTITLE: ", "ideal_response": "Why are religious figures given the status of \"legal person\" in India?", "category": "summarization"} +{"prompt": "Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \nQUESTION: The Armenian assassin of the ex-Ottoman Grand Vizier Talat Pasha, Soghomon Tehlirian was acquitted of murder in his trial. His defence was that he was killing Talat Pasha in retaliation for his crimes in orchestrating the Ottoman Armenian Genocide. Tehlirian was tried for murder, but was eventually acquitted by the twelve-man jury. His trial was a rather sensationalized event at the time, taking place shortly after the establishment of the Weimar Republic, with Tehlirian being represented by three German defense attorneys, including Dr. Theodor Niemeyer, professor of law at Kiel University. Priest and Armenian Genocide survivor Grigoris Balakian, German activist Johannes Lepsius, and German commander of the Ottoman armed forces during the war General Liman von Sanders were among several of the prominent individuals called as witnesses to the trial. The trial examined not only Tehlirian\u2019s actions but also Tehlirian's conviction that Tal\u00e2t was the main author of the Armenian deportation and mass killings. The defense attorneys made no attempt to deny the fact that Tehlirian had killed a man, and instead focused on the influence of the Armenian Genocide on Tehlirian's mental state. Tehlirian claimed during the trial that he had been present in Erzincan in 1915 and had been deported along with his family and personally witnessed their murder. When asked by the judge if he felt any sort of guilt, Tehlirian remarked, \"I do not consider myself guilty because my conscience is clear\u2026I have killed a man. But I am not a murderer. 
However, murder in retaliation for crimes the victim committed is often not considered a legal defence to the act of murder. For example, this 2013 Baltimore case resulted in the sentencing of the killer to 30 years in jail. Similarly, the Lillehammer affair involves Israeli Mossad agents sent to assassinate a (mistakenly identified) mastermind of the Munich attacks on Israelis. The agents were arrested and convicted by Norwegian authorities. A final example was the case of Vitaly Kaloyev, who stabbed to death the air traffic controller responsible for the \u00dcberlingen mid-air collision in which his family was killed. He was also convicted for murder in that case and sentenced to prison. What differences in law in that time in Germany led to the acquittal of Soghomon Tehlirian for murder? \nTITLE: ", "ideal_response": "Why was the Armenian assassin of Talat Pasha acquitted for murder?", "category": "summarization"} +{"prompt": "Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \nQUESTION: Before 1964 the New Hampshire Constitution had two articles (12 and 13) in its Bill of Rights permitting citizens to pay money in lieu of compulsory military service. In World War 2 and the Korean War did the men of New Hampshire have the ability to pay to avoid being drafted? If not, what the legal pretext used to void this right? \nTITLE: ", "ideal_response": "Did citizens of New Hampshire have the ability to buy their way out of military service in WW2?", "category": "summarization"} +{"prompt": "Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \nQUESTION: I live in Germany and have a yearly contract for the Adobe Creative Cloud. 
This contract can only be canceled 30 days before the one year period ends. They say, that a termination notice prior to those 30 days cannot be noted in their system. Are companies allowed to restrict the period in which I can submit a termination request to cancel a contract? \nTITLE: ", "ideal_response": "Can a company restrict the period in which a contract termination can be submitted?", "category": "summarization"} +{"prompt": "Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \nQUESTION: Article II, Section 2, of the US Constitution (emphasis added) states: [The President] by and with the advice and consent of the Senate, shall appoint ambassadors, other public ministers and consuls, judges of the Supreme Court, and all other officers of the United States, whose appointments are not herein otherwise provided for, and which shall be established by law[.] Suppose the Senate were to pass a resolution advising the President to appoint a strict \"originalist\" who has been a long-time member and avid supporter of the Republican Party. To what extent must the President heed this advice? Taken further, does this mean the Senate can specifically identify who an appointee shall be? \nTITLE: ", "ideal_response": "How much does the President have to take the Senate's appointment \"advice?\"", "category": "summarization"} +{"prompt": "Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \nQUESTION: This comes from an English Language Lerner's Stack Exchange question, the current answers to which are lacking in historical/legal analysis of the term. What is the origin of the legal phrase \"grand theft auto\"? 
In particular, is there any legal or historical rationale for the ordering of the terms? (E.g. why isn't it \"grand theft of auto\" or \"auto grand theft\" or even \"grand auto theft\".) There are some claims that it's simply an elision of a comma (\"grand theft, auto\") implying that it's a simplification of something like an entry in an (alphabetized) list of crimes. Others have indicated that there might be some Law French or Latin influence in putting the descriptor last (along the lines of \"courts martial\"). Are either of these accurate with the historical origin of the term, or is there some other reason? \nTITLE: ", "ideal_response": "What is the origin and history of the term \"grand theft auto\"?", "category": "summarization"} +{"prompt": "Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \nQUESTION: As we await the Rittenhouse verdict it is clear that this is a polarizing case and the governor has mobilized 500 National Guardsmen in anticipation of violence. The verdict is expected to attract polarized crowds that officials perceive as having a proclivity for violence / destruction. Pursuant to preventing violence: What tools are available to the Wisconson governor, besides having Guardsmen to strictly enforce a curfew? Ideally, tools would not infringe on any rights of assembly. \nTITLE: ", "ideal_response": "Anticipation of Violent Gatherings", "category": "summarization"} +{"prompt": "Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \nQUESTION: HIPAA seems comparable but only applies to medical data, but is there anything more general than that? How does privacy act compare to data subject access rights under gdpr? 
Is it that it only applies to government entities but not private sector data controllers? \nTITLE: ", "ideal_response": "What is the most comparable American law to GDPR right to subject access? Privacy Act?", "category": "summarization"} +{"prompt": "Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \nQUESTION: Can a contractor refuse to pay a subcontractor until the contractor receives his payment from the company/client? Does it matter if it's in a written contract? Can the buck be passed like that? \"Payment is contingent on when Client pay Contractor. In the case where a Client uses net-30 payment schedules, the Contractor can expect to be paid up to 30 days after the first day of the month after Subcontractor submitted invoice.\" They slipped this one by me. Hypothetically if the Client doesn't pay Contractor does that mean I wouldn't get paid and there would be nothing I could do about it? Is there even a deadline? \nTITLE: ", "ideal_response": "Can a contractor refuse to pay a subcontractor if he doesn't get paid?", "category": "summarization"} +{"prompt": "Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \nQUESTION: The internet allows for some level of anonymity and occasionally you will find posts that give detailed descriptions of how to perform an activity that is illegal. Examples may include how to covertly grow marijuana, or swindle money. If I live in the US and make posts or provide answers to questions that explain how to perform an activity that is illegal in the US, can I be prosecuted? This question is about sharing the \"how to\" information only, not actually performing any illegal activity. 
\nTITLE: ", "ideal_response": "If I live in the US can I be prosecuted for answers describing illegal activities?", "category": "summarization"} +{"prompt": "Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \nQUESTION: I'm an engineer and I'm studying for the law and ethics exam to become a registered engineer in Canada. I'm studying a chapter on a book about Bonds where I see: A bond is a special form of contract, whereby one party, the surety, guarantees the performance by another party, the principal, of certain obligations. The party to whom the obligations are owed is called the obligee. Reading this link I find: Insurance: When a claim is paid the insurance company usually doesn\u2019t expect to be repaid by the insured. Surety Bond: A surety bond is a form of credit, so the principal is responsible to pay any claims. Typically, the surety requires the principal and its major shareholders to indemnify the surety against any loss. In fact, the indemnity given by the principal is one of the things that distinguishes a bond from an insurance contract My doubt is: If as a contractor(principal) you will be required to repay to the surety for any losses that the surety had to pay to the owner(obligee). What are the advantages of a bond over liability insurance where the insurance company pays and the contractor doesn't have to repay anything else that the usual monthly premiums? Why creating the figure of the surety if the contractor will end up having to pay the surety to compensate for what the surety paid to the obligee anyway? Why not using liability insurance for this? I don't understand very well the notion of face value of a bond which is \"the maximum potential liability of the surety\". Does the principal have to pay up to this to the surety ? 
\nTITLE: ", "ideal_response": "Performance bonds vs insurance", "category": "summarization"} +{"prompt": "Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \nQUESTION: I\u2019m not well versed in economics or law by any stretch of the imagination. But I am confused with certain measures taken thus far by the government. There is currently a temporary shutdown of vast sectors of the economy now and workers affected are not receiving any income. To me, logic would dictate that here should also be a temporary cessation of payments for rental properties (for small businesses as well as individuals) and, to be fair, a temporary cessation of payments for loans for owners of those properties. In turn, banks would be given temporary cessation payments for interest to its customers since the banks would not receive any payments on loans including payments on mortgages from homeowners. That would essentially close the loop on money flow which again would be temporary and the impact to the greater economy would be limited. If undue burden is placed on any one of these points in the cycle, and right now, workers and small businesses are taking that brunt, then there could be long-standing damage to the economy from bankruptcies, foreclosures, etc. The Fed is already addressing the banking side of things by keeping rates close to 0. Does the government have the authority to push for a standstill in mortgage/interest/rent payments? I feel this temporary measure would alleviate economic hardship and prevent long-term damage to the economy. \nTITLE: ", "ideal_response": "Could government legally temporatiliy halt rent/mortgage/interest payments for the benefit of the greater economy?", "category": "summarization"} +{"prompt": "Generate a concise, engaging title for the following legal question on an internet forum. 
The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \nQUESTION: I just came across some US tax forms for foreign entities like W-8BEN-E and 1040NR which made me ask this question. Suppose a company called Cool Stuff Ltd. runs some sort of online business serving customers all over the world. This could be paid content subscriptions, dating, posting ads/classifieds, hosting, VPN \u2014 all sorts of stuff that is performed by software accessible by customers over the Internet. The company is not present in the US in any way: no incorporation, no offices, no employees, no web hosting, the shareholders and directors are not related to the US at all. People in the US can access the company's website/app and buy services in the company's home country currency. The company may not even care or take note of where they are from. Technically, if any single John Doe from the US decides to buy a subscription, he will make the company \"receive\" \"income\" \"from U.S. sources that consists of\" \"Compensation for, or in expectation of, services performed\" (W-8BEN-E): Does that really trigger the company's obligation to fill US tax forms and pay tax in the US? If so, is there any threshold i.e. minimum income that triggers the obligation? \nTITLE: ", "ideal_response": "Do non-US companies have to pay US tax just because they have US customers?", "category": "summarization"} +{"prompt": "Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \nQUESTION: I am relying on the ratings on Avvo to determine who I should choose. Is it reliable? \nTITLE: ", "ideal_response": "Are the attorney ratings on the website avvo.com reliable?", "category": "summarization"} +{"prompt": "Generate a concise, engaging title for the following legal question on an internet forum. 
The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \nQUESTION: I'm developing an application for mobile/web/desktop that will (in very watered down basic terms) store free text strings they've self-contributed and access them freely. As it stands, these applications can run without processing data and storing everything locally on the device. However, if I want to save the users data to a server in order for them to load the same information on multiple devices, I will be storing and processing data. The reason I'm hitting a brick wall in trying to track down the legal issues involved here are twofold: If the developer is based in the EU and the server is based in the US, which jurisdiction does this fall under? If I'm storing just an e-mail address and unique identifier key in the database for login, would this in itself, be identifiable information? Would a user storing the text string \"Mop floor tonight\" be considered personal data? I ask because as this is free text, I'm aware a user could essentially type \"My name is Dave Smith and I live at 742 Evergreen Terrace\" and suddenly that's very personal data. I'm just trying to get my head around the risks/efforts involved in having a simple \"sync to all devices\" command that can be manageable by a 1 person developer team. Are there perhaps exemptions for small businesses whose IP is the code itself and not the data (i.e., I'm just collecting a username/email for login, no other data, not writing to the e-mail or using that data to generate revenue etc), and the self-supplied data is essentially just there to enable a feature? Or is this a legal framework I'll have no choice but to consider globally for all users? \nTITLE: ", "ideal_response": "What is the law regarding web/mobile applications and \"personal\" data?", "category": "summarization"} +{"prompt": "Generate a concise, engaging title for the following legal question on an internet forum. 
The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \nQUESTION: Applications for search warrants are typically ex parte. In such a proceeding the person who owns the property to be searched and/or seized cannot challenge the unconstitutional laws that form the basis of an affidavit justifying that probable cause exists for a search. Does there exist a basis to challenge the admissibility of evidence obtained following this search? Particularly if such a search is used as pretext to find items that violate laws that are not unconstitutional. Would agents that engage in such a search be able to assert qualified immunity in a 1983 proceeding even though the statute was blatantly unconstitutional? \nTITLE: ", "ideal_response": "If a law is blatantly unconstitutional, is a search warrant based on that law valid?", "category": "summarization"} +{"prompt": "Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \nQUESTION: If an external company's logo includes a registered symbol, are you obligated to reproduce it (in a link to their website, for example) along with this symbol? What if the usage guidelines dictate the use of (R) -- can you omit it? \nTITLE: ", "ideal_response": "Are you legally obliged to show the registered symbol with a logo?", "category": "summarization"} +{"prompt": "Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \nQUESTION: If a person invented something new but didn't file a patent: Is it legal for other people to file a patent for it? Is it possible for other people to be granted a patent for it if they are not the original inventor? 
If other people file a patent and claim they invented it independently at about the same time, what evidence do they need to show? If the original inventor has clear evidence to prove his/her originality, is there a way to stop other people from patenting this invention? \nTITLE: ", "ideal_response": "Is it possible for one's own invention to be patented by others?", "category": "summarization"} +{"prompt": "Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \nQUESTION: As per the title, I am (as a non-lawyer) helping someone in a dispute where fraud has occurred. If I'm helping the person write a letter to one of the involved parties based on my advice, if I help her produce a letter which says she has \"taken advice\" (and I am the one advising her), would this be construed as a claim she has received advice specifically from a lawyer? Put another way, if someone claims to have \"taken advice\", and the advice they received was from a non-lawyer, is the claim misleading? \nTITLE: ", "ideal_response": "Do the words \"take advice\" / \"taken advice\" imply the advisor is a lawyer?", "category": "summarization"} +{"prompt": "Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \nQUESTION: I would like to request a copy of my contract from a previous employer, where most of the staff is juridically uneducated people. I'm afraid that if I introduce myself, HR would realize that I am a former employee and would decline to send the copy, stating that I should have my own version (which I have lost). There might be a legal debate, depending whether I signed a certain page where it says \"overtime hours are not paid\" or not. 
I know I can hire a lawyer who can demand the contract copy, but then if I really signed the page (it was a long time ago, I don't remember), the job of the lawyer is done and I still pay for their service. To save money I have considered this trick: Register a new email address with something like Person@FloridaLawGroup.com Call the HR office and impersonate an abstract lawyer who is defending an interest of the former employee. Request a copy of the contract be sent to the aforementioned email address. Is any part of this plan illegal? \nTITLE: ", "ideal_response": "Is it legal to impersonate an abstract lawyer?", "category": "summarization"} +{"prompt": "Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \nQUESTION: The Supremacy Clause of the U.S. Constitution states that the Constitution, laws and treaties are the supreme law of the land. Obviously, the Constitution supersedes both laws and treaties, but which is given more importance, laws or treaties? For example, if the Congress passed a law in violation of a treaty, would that law be valid? If the Senate ratified a treaty in violation of federal law, would that treaty be valid? \nTITLE: ", "ideal_response": "Which supersedes the other, laws or treaties?", "category": "summarization"} +{"prompt": "Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \nQUESTION: Is it a GPLv3 violation for a project to charge a fee for downloading binaries? Concretely I am wondering about this case: http://moodeaudio.org/ The project is clearly licenced under GPLv3 as stated on the website and uses open source software in it. However it is not distributed freely. A fee of 10$ is charged for downloading. 
Furthermore I can not find any links to the source code anywhere on the website. \nTITLE: ", "ideal_response": "Is it a GPLv3 violation for a project to charge a fee for downloading binaries?", "category": "summarization"} +{"prompt": "Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \nQUESTION: I have a judgment against a debtor with only their name and date of birth. How do I obtain their credit report? Just write to Experian with a court order? First question- is it legal to do this. Second question- do I need their ssn, and if so, how can I legally obtain it. \nTITLE: ", "ideal_response": "Pulling debtor credit report without SSN", "category": "summarization"} +{"prompt": "Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \nQUESTION: In the US, it is typical for patients to sign contracts with a medical office assigning whatever benefits are applicable from their health insurance to the doctor's office and leaving the patient financially responsible for what is left unpaid. Is there a minimum standard for doctor's offices to actually complete the insurance process before passing on the rest of the bill, and if so, what kind of remedy is available if they don't? Example: A doctor's office sends to the insurance company details of a procedure to be paid, but left off one fact that the insurance company needed to complete the claim. The insurance company informs the doctor's office that in order to finish the claim, they have to complete the missing bit of paperwork. Instead of doing so, the doctor's office leaves benefits the client is entitled to unclaimed and proceeds to bill the patient for the unpaid amount. Is this a valid bill under that standard kind of contract? 
Is there some kind of minimum due diligence required on the doctor's part, and if so, is there some area of tort law that is available for redress? \nTITLE: ", "ideal_response": "What is the duty of a medical office to properly bill one's insurance? (USA)", "category": "summarization"} +{"prompt": "Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \nQUESTION: 8 USC \u00a71324b prohibits employers from discriminating against employees or prospective employees on the basis of immigration status, if the individual in question is a citizen or national of the United States, a lawful permanent resident (with a proviso that I won't discuss here), or an asylee or refugee. I'm wondering whether the existence of 8 USC \u00a71324b rules out the possibility of any stronger state law. For example, suppose a state had a law that says employers can't discriminate against people with Temporary Protected Status (TPS). This doesn't directly conflict with federal law, since TPS people have an immigration status that allows them to work for any employer. However, I am not sure whether the concept of field preemption would apply here. \nTITLE: ", "ideal_response": "Can states and localities have stronger immigration status-based employment discrimination laws than the federal law?", "category": "summarization"} +{"prompt": "Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \nQUESTION: Does NJ law prohibit alcohol consumption in public parks? \nTITLE: ", "ideal_response": "Does NJ law prohibit alcohol consumption in public parks?", "category": "summarization"} +{"prompt": "Generate a concise, engaging title for the following legal question on an internet forum. 
The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \nQUESTION: My friends and I want to start selling merchandise of pictures of two celebrities. The pictures are of two celebrities. We want to make sure that we won't have a lawsuit on our hands if we start selling. The pictures we want to use are available and posted publicly by them. The pictures are not copyrighted. Is this legal? \nTITLE: ", "ideal_response": "Can I legally sell merchandise with the face of a celebrity without their permission?", "category": "summarization"} +{"prompt": "Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \nQUESTION: Kyle Rittenhouse has been found not guilty of Wisconsin state criminal charges. Is there any federal law that he violated? \nTITLE: ", "ideal_response": "Is Kyle Rittenhouse \"out of the woods\" or could he face Federal charges?", "category": "summarization"} +{"prompt": "Generate a concise, engaging title for the following legal question on an internet forum. The title should be legally relevant, capture key aspects of the issue, and entice readers to learn more. \nQUESTION: I would like to know if it is legal to download the content of a website and display it in my own? There is one way to include a website inside another: