Using ZeroEntropy for Semantic Search #8

Open · wants to merge 1 commit into `main`
2 changes: 2 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
@@ -21,6 +21,8 @@ As of now, the guides in this cookbook are written in Python, but the same conce
Learn how to use ZeroEntropy as an Agent's search tool to access a knowledge base when responding to user queries.
8. **[Use LlamaParse in combination with ZeroEntropy to search and rerank PDFs](guides/rerank_llamaparsed_pdfs)**
Learn how to use ZeroEntropy and LlamaParse to parse, search, and rerank complex PDF documents.
9. **[Use ZeroEntropy for Semantic Search over articles (French Gossip Media)](guides/semantic_search_over_articles)**
   Learn how to use ZeroEntropy in a semantic search RAG to scrape, index, search, and rerank media articles.

*(More guides coming soon...)*

7 changes: 7 additions & 0 deletions guides/semantic_search_over_articles/.env.example
@@ -0,0 +1,7 @@
# AI Provider Configuration
OPENAI_API_KEY=""
ZEROENTROPY_API_KEY=""

# Environment
ENVIRONMENT=development
90 changes: 90 additions & 0 deletions guides/semantic_search_over_articles/.gitignore
@@ -0,0 +1,90 @@
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class

# C extensions
*.so

# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST

# PyInstaller
*.manifest
*.spec

# Installer logs
pip-log.txt
pip-delete-this-directory.txt

# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/

# Sphinx documentation
docs/_build/

# PyBuilder
.pybuilder/
target/

# Jupyter Notebook
.ipynb_checkpoints

# IPython
profile_default/
ipython_config.py

# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
__pypackages__/

# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/

# Pyre type checker
.pyre/

# pytype static type analyzer
.pytype/

# Cython debug symbols
cython_debug/

# PyPI configuration file
.pypirc
47 changes: 47 additions & 0 deletions guides/semantic_search_over_articles/Dockerfile
@@ -0,0 +1,47 @@
# Set python version
FROM python:3.10-slim

# Set environment variables
ENV PYTHONUNBUFFERED=1
ENV PYTHONDONTWRITEBYTECODE=1
ENV PYTHONPATH=/app

# Set working directory
WORKDIR /app

# Install system dependencies
RUN apt-get update && apt-get install -y \
build-essential \
git \
curl \
&& rm -rf /var/lib/apt/lists/*

# Copy requirements file
COPY requirements.txt .

# Install Python dependencies using uv
RUN pip install --no-cache-dir uv
RUN uv pip install --system --no-cache -r requirements.txt

# Copy application code
COPY backend/ ./backend/
COPY frontend/ ./frontend/

# Copy environment files (if any)
COPY .env* ./

# Create necessary directories
RUN mkdir -p /app/data /app/logs

# Set proper permissions
RUN chmod -R 755 /app

# Expose Streamlit port
EXPOSE 8501

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
CMD curl -f http://localhost:8501/_stcore/health || exit 1

# Set the entrypoint to run Streamlit
CMD ["streamlit", "run", "frontend/streamlit_app.py", "--server.port=8501", "--server.address=0.0.0.0", "--server.headless=true"]
112 changes: 112 additions & 0 deletions guides/semantic_search_over_articles/README.md
@@ -0,0 +1,112 @@
# French Gossip Semantic Search with ZeroEntropy

This guide builds a production-ready semantic search pipeline for French gossip articles from **vsd.fr** and **public.fr** using ZeroEntropy.

[![Python](https://img.shields.io/badge/python-3.10+-blue.svg)](https://www.python.org/downloads/)
[![ZeroEntropy](https://img.shields.io/badge/zeroentropy-latest-purple.svg)](https://zeroentropy.dev/)
[![Streamlit](https://img.shields.io/badge/streamlit-1.30+-green.svg)](https://streamlit.io/)
[![Docker](https://img.shields.io/badge/docker-ready-blue.svg)](https://docker.com/)

## Features

- **Advanced AI Retrieval**: Powered by ZeroEntropy's state-of-the-art search & reranking
- **Multiple Search Types**: Documents, snippets, pages, and advanced reranked results
- **Real-time RSS Scraping**: Automatically indexes articles from gossip websites
- **Interactive Web UI**: Beautiful Streamlit interface with advanced filtering
- **Smart Reranking**: Uses `zerank-1-small` model for improved relevance

## Quick Start

### 1. Setup Environment
```bash
# Clone repository
git clone https://github.com/zeroentropy-ai/zcookbook
cd zcookbook/guides/semantic_search_over_articles

# Install dependencies
pip install -r requirements.txt

# Copy the environment template, then add your ZeroEntropy credentials
cp .env.example .env
```

### 2. Index Articles
```bash
# Scrape RSS feeds and index articles
python backend scrape --collection my_articles
```
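The scrape step reads the sites' RSS feeds before indexing. As a rough illustration of that stage (the real parsing lives in `backend/indexer_ze.py` and may use a dedicated feed library; the feed below is a made-up RSS 2.0 example), articles can be extracted with the standard library alone:

```python
# Minimal sketch of RSS parsing using only the standard library.
import xml.etree.ElementTree as ET

FEED = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Exemple</title>
  <item><title>Article un</title><link>https://example.fr/1</link></item>
  <item><title>Article deux</title><link>https://example.fr/2</link></item>
</channel></rss>"""

def parse_feed(xml_text: str) -> list[dict]:
    """Return one {title, link} dict per <item> in an RSS 2.0 feed."""
    root = ET.fromstring(xml_text)
    return [
        {"title": item.findtext("title"), "link": item.findtext("link")}
        for item in root.iter("item")
    ]

articles = parse_feed(FEED)
print(articles[0])
```

Each parsed article would then be uploaded to the ZeroEntropy collection named by `--collection`.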

### 3. Search Articles
```bash
# Search for articles (CLI)
python backend search "TPMP" --k 5 --collection my_articles
python backend search "famille royale" --search-type snippets
python backend search "célébrités" --search-type advanced --k 10
```
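Under the hood, each `--search-type` presumably maps onto one of ZeroEntropy's query endpoints. The sketch below shows one way that dispatch could look; the endpoint names and parameters are assumptions based on ZeroEntropy's public Python SDK, not code taken from this repository's `backend/search_ze.py`:

```python
# Hypothetical dispatch from --search-type to ZeroEntropy query
# endpoints. Endpoint names are assumptions, not from search_ze.py.
SEARCH_TYPES = {"documents", "snippets", "advanced"}

def search(query: str, search_type: str = "documents", k: int = 5,
           collection: str = "my_articles"):
    if search_type not in SEARCH_TYPES:
        raise ValueError(f"unknown search type: {search_type!r}")
    # Imported lazily so the module loads without credentials; the
    # client reads ZEROENTROPY_API_KEY from the environment.
    from zeroentropy import ZeroEntropy
    client = ZeroEntropy()
    if search_type == "snippets":
        return client.queries.top_snippets(
            collection_name=collection, query=query, k=k)
    # "documents" and "advanced" both start from a document-level query;
    # "advanced" additionally reranks the results (see utils_ze.py).
    return client.queries.top_documents(
        collection_name=collection, query=query, k=k)
```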

### 4. Web Interface
```bash
# Launch Streamlit app
streamlit run frontend/streamlit_app.py
```
Access at: `http://localhost:8501`

## Docker Deployment

```bash
# Build and run with Docker Compose
docker-compose up --build

# Or build the image, start a container, and run commands inside it
docker build -t gossip-search .
docker run -d --name gossip-search-container -p 8501:8501 gossip-search
docker exec -it gossip-search-container python backend scrape --collection my_articles
docker exec -it gossip-search-container python backend search "TPMP" --k 5
```

## Project Structure

```
├── backend/
│ ├── main.py # Main CLI interface
│ ├── indexer_ze.py # RSS scraping & indexing
│ ├── search_ze.py # Search functionality
│ ├── utils_ze.py # Advanced utilities & reranking
│ └── logger.py # Logging configuration
├── frontend/
│ └── streamlit_app.py # Web interface
├── demo_notebook.ipynb # Interactive demo
├── docker-compose.yml # Container orchestration
├── requirements.txt # Dependencies
└── README.md # This file
```
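The `python backend …` commands above suggest the package's entry point wires up subcommands. A minimal argparse sketch of that shape, for orientation only: the flag names mirror this README, but the actual parser in `backend/main.py` may be structured differently:

```python
# Sketch of a CLI with the subcommands shown in this README.
import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="backend")
    sub = parser.add_subparsers(dest="command", required=True)

    scrape = sub.add_parser("scrape", help="scrape RSS feeds and index")
    scrape.add_argument("--collection", default="my_articles")

    search = sub.add_parser("search", help="query the collection")
    search.add_argument("query")
    search.add_argument("--k", type=int, default=5)
    search.add_argument("--collection", default="my_articles")
    search.add_argument("--search-type", default="documents",
                        choices=["documents", "snippets", "pages", "advanced"])
    search.add_argument("--reranker", default=None)
    search.add_argument("--filter-creator", default=None)

    manage = sub.add_parser("manage", help="collection management")
    manage.add_argument("action", choices=["list", "status"])
    manage.add_argument("--collection", default=None)
    return parser

args = build_parser().parse_args(
    ["search", "famille royale", "--search-type", "snippets"])
print(args.command, args.search_type, args.k)
```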

## Usage Examples

### CLI Commands
```bash
# Collection management
python backend manage list
python backend manage status --collection my_articles

# Advanced search with filters
python backend search "mode" --filter-creator "Public" --reranker zerank-1-small

# Different search types
python backend search "actualité" --search-type documents --k 10
python backend search "télé" --search-type snippets --k 5
python backend search "people" --search-type advanced --k 8
```
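The `advanced` search type layers reranking on top of first-pass retrieval: take the top-k candidates from the retriever, score each (query, document) pair with a reranker such as `zerank-1-small`, and reorder by the reranker's score. A generic illustration of that pattern (the document paths and scores are invented; the real helpers live in `utils_ze.py` and call the hosted model):

```python
# Generic retrieve-then-rerank pattern behind the "advanced" search type.
def rerank(candidates: list[dict], reranker_scores: list[float],
           top_n: int) -> list[dict]:
    """Reorder first-pass candidates by reranker score, keep top_n."""
    scored = sorted(zip(candidates, reranker_scores),
                    key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in scored[:top_n]]

candidates = [
    {"path": "vsd/article-a"},     # retrieval rank 1
    {"path": "public/article-b"},  # retrieval rank 2
    {"path": "vsd/article-c"},     # retrieval rank 3
]
# Hypothetical zerank-1-small relevance scores for the same query:
scores = [0.41, 0.93, 0.67]

print(rerank(candidates, scores, top_n=2))
# the reranker promotes article-b ahead of article-a
```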


## Demo Notebook

A Jupyter notebook is also available to explore the code and run it step by step:
```bash
jupyter notebook demo_notebook.ipynb
```

---
## Author & Contribution

**Created by [Naoufal Acharki](https://github.com/nacharki)**: This project demonstrates ZeroEntropy in a RAG pipeline for French gossip content.