Add article summary feature with OpenAI integration #32

Open
wants to merge 2 commits into
base: master
66 changes: 54 additions & 12 deletions README.md
@@ -6,23 +6,65 @@

### Configuration

| Command line | Environment | Default | Description |
|--------------|-----------------|----------------|-------------------------------------------------------|
| address | UKEEPER_ADDRESS | all interfaces | web server listening address |
| port | UKEEPER_PORT | `8080` | web server port |
| mongo-uri | MONGO_URI | none | MongoDB connection string, _required_ |
| frontend-dir | FRONTEND_DIR | `/srv/web` | directory with frontend files |
| token | TOKEN | none | token for /content/v1/parser endpoint auth |
| mongo-delay | MONGO_DELAY | `0` | mongo initial delay |
| mongo-db | MONGO_DB | `ureadability` | mongo database name |
| creds | CREDS | none | credentials for protected calls (POST, DELETE /rules) |
| dbg | DEBUG | `false` | debug mode |
| Command line | Environment | Default | Description |
|----------------|-----------------|----------------|-------------------------------------------------------|
| --address | UKEEPER_ADDRESS | all interfaces | web server listening address |
| --port | UKEEPER_PORT | `8080` | web server port |
| --mongo-uri | MONGO_URI | none | MongoDB connection string, _required_ |
| --frontend-dir | FRONTEND_DIR | `/srv/web` | directory with frontend templates and static assets |
| --token | UKEEPER_TOKEN | none | token for /content/v1/parser endpoint auth |
| --mongo-delay | MONGO_DELAY | `0` | mongo initial delay |
| --mongo-db | MONGO_DB | `ureadability` | mongo database name |
| --creds | CREDS | none | credentials for protected calls (POST, DELETE /rules) |
| --dbg | DEBUG | `false` | debug mode |

OpenAI Configuration:

| Command line | Environment | Default | Description |
|------------------------------|----------------------------|---------------|----------------------------------------------------------------------|
| --openai.disable-summaries | OPENAI_DISABLE_SUMMARIES | `false` | disable summary generation with OpenAI |
| --openai.api-key | OPENAI_API_KEY | none | OpenAI API key for summary generation |
| --openai.model-type | OPENAI_MODEL_TYPE | `gpt-4o-mini` | OpenAI model name for summary generation (e.g., gpt-4o, gpt-4o-mini) |
| --openai.summary-prompt | OPENAI_SUMMARY_PROMPT | *see code* | custom prompt for summary generation |
| --openai.max-content-length | OPENAI_MAX_CONTENT_LENGTH | `10000` | maximum content length to send to OpenAI API |
| --openai.requests-per-minute | OPENAI_REQUESTS_PER_MINUTE | `10` | maximum number of OpenAI API requests per minute |
| --openai.cleanup-interval | OPENAI_CLEANUP_INTERVAL | `24h` | interval for cleaning up expired summaries |
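
To illustrate what `--openai.max-content-length` implies, here is a minimal sketch of trimming article text before it is sent to the API. It is not the code from this PR: the helper name `truncateContent` is hypothetical, and counting characters as runes (rather than bytes) and treating a non-positive limit as "no limit" are both assumptions.

```go
package main

import "fmt"

// truncateContent is a hypothetical helper (not part of this PR) that limits
// text to at most maxLen characters. It counts runes rather than bytes so a
// multi-byte UTF-8 character is never cut in half; whether the service counts
// bytes or runes is an assumption.
func truncateContent(content string, maxLen int) string {
	if maxLen <= 0 {
		return content // assumption: a non-positive limit means "no limit"
	}
	runes := []rune(content)
	if len(runes) <= maxLen {
		return content
	}
	return string(runes[:maxLen])
}

func main() {
	fmt.Println(truncateContent("héllo wörld", 5)) // prints "héllo"
}
```

Truncating by runes keeps the output valid UTF-8, which matters when the trimmed text is embedded in a JSON request body.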

### API

GET /api/content/v1/parser?token=secret&url=http://aa.com/blah - extract content (emulate Readability API parse call)
GET /api/content/v1/parser?token=secret&summary=true&url=http://aa.com/blah - extract content and include a summary (emulate Readability API parse call); `summary` is optional and requires both an OpenAI API key and a server token to be configured
POST /api/v1/extract {url: http://aa.com/blah} - extract content
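
The GET calls above can be assembled programmatically. The sketch below builds the parser URL with Go's `net/url` so that the nested article URL is escaped correctly. The function name `buildParserURL` and the base address `http://localhost:8080` are assumptions for illustration; only the path and the query parameter names come from the list above.

```go
package main

import (
	"fmt"
	"net/url"
)

// buildParserURL is a hypothetical helper that assembles a request URL for
// the /api/content/v1/parser endpoint. The summary parameter is added only
// when withSummary is true.
func buildParserURL(base, token, article string, withSummary bool) (string, error) {
	u, err := url.Parse(base)
	if err != nil {
		return "", fmt.Errorf("invalid base URL: %w", err)
	}
	u.Path = "/api/content/v1/parser"

	q := url.Values{}
	q.Set("token", token)
	q.Set("url", article) // Encode percent-escapes the nested URL
	if withSummary {
		q.Set("summary", "true")
	}
	u.RawQuery = q.Encode() // keys are emitted in sorted order
	return u.String(), nil
}

func main() {
	s, err := buildParserURL("http://localhost:8080", "secret", "http://aa.com/blah", true)
	if err != nil {
		panic(err)
	}
	fmt.Println(s)
}
```

Because `url.Values.Encode` sorts keys, the parameters appear as `summary`, `token`, `url`; the order does not affect how the server reads them.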

### Article Summary Feature

The application can generate concise summaries of article content using OpenAI's GPT models:

1. **Configuration**:
   - Set `--openai.api-key` to your OpenAI API key
   - Summaries are enabled by default; use `--openai.disable-summaries` to turn the feature off
   - Optionally set `--openai.model-type` to choose the model (e.g., `gpt-4o`, `gpt-4o-mini`)
     - Defaults to `gpt-4o-mini` if not specified
   - A server token (`--token`) must be configured for security reasons
   - Customize rate limiting with `--openai.requests-per-minute` (default: 10)
   - Limit the content sent to OpenAI with `--openai.max-content-length` (default: 10000 characters)
   - Configure the cleanup interval with `--openai.cleanup-interval` (default: 24h)

2. **Usage**:
   - Add the `summary=true` parameter to the `/api/content/v1/parser` endpoint
   - Example: `/api/content/v1/parser?token=secret&summary=true&url=http://example.com/article`

3. **Features**:
   - Summaries are cached in MongoDB to reduce API costs and improve performance
   - The cache stores:
     - Content hash (to identify articles)
     - Summary text
     - Model used for generation
     - Creation and update timestamps
     - Expiration time (defaults to 1 month)
   - If the same content is requested again, the cached summary is returned
   - The preview page automatically shows summaries when available
   - Expired summaries are automatically cleaned up based on the configured interval

## Development

### Running tests
12 changes: 10 additions & 2 deletions backend/datastore/mongo.go
@@ -39,7 +39,8 @@ func New(connectionURI, dbName string, delay time.Duration) (*MongoServer, error

// Stores contains all DAO instances
type Stores struct {
Rules RulesDAO
Rules RulesDAO
Summaries SummariesDAO
}

// GetStores initialize collections and make indexes
@@ -50,8 +51,15 @@ func (m *MongoServer) GetStores() Stores {
{Keys: bson.D{{Key: "domain", Value: 1}, {Key: "match_urls", Value: 1}}},
}

sIndexes := []mongo.IndexModel{
{Keys: bson.D{{Key: "created_at", Value: 1}}},
{Keys: bson.D{{Key: "model", Value: 1}}},
{Keys: bson.D{{Key: "expires_at", Value: 1}}}, // index for cleaning up expired summaries
}

return Stores{
Rules: RulesDAO{Collection: m.collection("rules", rIndexes)},
Rules: RulesDAO{Collection: m.collection("rules", rIndexes)},
Summaries: SummariesDAO{Collection: m.collection("summaries", sIndexes)},
}
}

109 changes: 109 additions & 0 deletions backend/datastore/summaries.go
@@ -0,0 +1,109 @@
// Package datastore provides a MongoDB-backed store for keeping and accessing summaries
package datastore

import (
"context"
"crypto/sha256"
"encoding/hex"
"fmt"
"time"

log "github.com/go-pkgz/lgr"
"go.mongodb.org/mongo-driver/bson"
"go.mongodb.org/mongo-driver/mongo"
"go.mongodb.org/mongo-driver/mongo/options"
)

// Summary contains information about a cached summary
type Summary struct {
ID string `bson:"_id"` // SHA256 hash of the content
Content string `bson:"content"` // original content that was summarized (could be truncated for storage efficiency)
Summary string `bson:"summary"` // generated summary
Model string `bson:"model"` // OpenAI model used for summarization
CreatedAt time.Time `bson:"created_at"`
UpdatedAt time.Time `bson:"updated_at"`
ExpiresAt time.Time `bson:"expires_at"` // when this summary expires
}

// SummariesDAO handles database operations for article summaries
type SummariesDAO struct {
Collection *mongo.Collection
}

// Get returns the cached summary for the given content, looked up by the SHA-256 hash of that content
func (s SummariesDAO) Get(ctx context.Context, content string) (Summary, bool) {
contentHash := GenerateContentHash(content)
res := s.Collection.FindOne(ctx, bson.M{"_id": contentHash})
if res.Err() != nil {
if res.Err() == mongo.ErrNoDocuments {
return Summary{}, false
}
log.Printf("[WARN] can't get summary for hash %s: %v", contentHash, res.Err())
return Summary{}, false
}

summary := Summary{}
if err := res.Decode(&summary); err != nil {
log.Printf("[WARN] can't decode summary document for hash %s: %v", contentHash, err)
return Summary{}, false
}

return summary, true
}

// Save creates or updates summary in the database
func (s SummariesDAO) Save(ctx context.Context, summary Summary) error {
if summary.ID == "" {
summary.ID = GenerateContentHash(summary.Content)
}

if summary.CreatedAt.IsZero() {
summary.CreatedAt = time.Now()
}
summary.UpdatedAt = time.Now()

// set default expiration of 1 month if not specified
if summary.ExpiresAt.IsZero() {
summary.ExpiresAt = time.Now().AddDate(0, 1, 0)
}

opts := options.Update().SetUpsert(true)
_, err := s.Collection.UpdateOne(
ctx,
bson.M{"_id": summary.ID},
bson.M{"$set": summary},
opts,
)
if err != nil {
return fmt.Errorf("failed to save summary: %w", err)
}
return nil
}

// Delete removes summary from the database
func (s SummariesDAO) Delete(ctx context.Context, contentHash string) error {
_, err := s.Collection.DeleteOne(ctx, bson.M{"_id": contentHash})
if err != nil {
return fmt.Errorf("failed to delete summary: %w", err)
}
return nil
}

// CleanupExpired removes all summaries that have expired
func (s SummariesDAO) CleanupExpired(ctx context.Context) (int64, error) {
now := time.Now()
result, err := s.Collection.DeleteMany(
ctx,
bson.M{"expires_at": bson.M{"$lt": now}},
)
if err != nil {
return 0, fmt.Errorf("failed to cleanup expired summaries: %w", err)
}
return result.DeletedCount, nil
}

// GenerateContentHash creates a hash for the content to use as an ID
func GenerateContentHash(content string) string {
hash := sha256.Sum256([]byte(content))
return hex.EncodeToString(hash[:])
}
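
Since the hash doubles as the document `_id`, any difference in the content, even trailing whitespace, produces a different cache key. The standalone sketch below repeats the same hashing logic under a lowercase name (`generateContentHash`, copied here only so the example runs on its own) to show this.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// generateContentHash repeats the hashing logic from summaries.go so that
// this example is self-contained.
func generateContentHash(content string) string {
	hash := sha256.Sum256([]byte(content))
	return hex.EncodeToString(hash[:])
}

func main() {
	a := generateContentHash("abc")
	b := generateContentHash("abc ") // same text plus a trailing space

	fmt.Println(a)              // prints ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
	fmt.Println(len(a), a == b) // prints 64 false
}
```

If extracted content can vary in whitespace between requests, normalizing it before hashing would raise the cache hit rate; whether that is needed here depends on how stable the extractor output is.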
165 changes: 165 additions & 0 deletions backend/datastore/summaries_test.go
@@ -0,0 +1,165 @@
package datastore

import (
"context"
"os"
"testing"
"time"

"github.com/stretchr/testify/assert"
"github.com/stretchr/testify/require"
"go.mongodb.org/mongo-driver/bson"
"go.mongodb.org/mongo-driver/mongo"
)

func TestSummariesDAO_SaveAndGet(t *testing.T) {
if _, ok := os.LookupEnv("ENABLE_MONGO_TESTS"); !ok {
t.Skip("ENABLE_MONGO_TESTS env variable is not set")
}

mdb, err := New("mongodb://localhost:27017", "test_ureadability", 0)
require.NoError(t, err)

// create a unique collection for this test to avoid conflicts
collection := mdb.client.Database(mdb.dbName).Collection("summaries_test")
defer func() {
_ = collection.Drop(context.Background())
}()

// create an index on the expires_at field
_, err = collection.Indexes().CreateOne(context.Background(),
mongo.IndexModel{
Keys: bson.D{{Key: "expires_at", Value: 1}},
})
require.NoError(t, err)

dao := SummariesDAO{Collection: collection}

content := "This is a test article content. It should generate a unique hash."
summary := Summary{
Content: content,
Summary: "This is a test summary of the article.",
Model: "gpt-4o-mini",
CreatedAt: time.Now(),
}

// test saving a summary
err = dao.Save(context.Background(), summary)
require.NoError(t, err)

// test getting the summary
foundSummary, found := dao.Get(context.Background(), content)
assert.True(t, found)
assert.Equal(t, summary.Summary, foundSummary.Summary)
assert.Equal(t, summary.Model, foundSummary.Model)
assert.NotEmpty(t, foundSummary.ID)

// test getting a non-existent summary
_, found = dao.Get(context.Background(), "non-existent content")
assert.False(t, found)

// test updating an existing summary
updatedSummary := Summary{
ID: foundSummary.ID,
Content: content,
Summary: "This is an updated summary.",
Model: "gpt-4o-mini",
CreatedAt: foundSummary.CreatedAt,
}

err = dao.Save(context.Background(), updatedSummary)
require.NoError(t, err)

foundSummary, found = dao.Get(context.Background(), content)
assert.True(t, found)
assert.Equal(t, "This is an updated summary.", foundSummary.Summary)
assert.Equal(t, updatedSummary.CreatedAt, foundSummary.CreatedAt)
assert.NotEqual(t, updatedSummary.UpdatedAt, foundSummary.UpdatedAt) // UpdatedAt should be set by the DAO

// test deleting a summary
err = dao.Delete(context.Background(), foundSummary.ID)
require.NoError(t, err)

_, found = dao.Get(context.Background(), content)
assert.False(t, found)
}

func TestGenerateContentHash(t *testing.T) {
content1 := "This is a test content."
content2 := "This is a different test content."

hash1 := GenerateContentHash(content1)
hash2 := GenerateContentHash(content2)

assert.NotEqual(t, hash1, hash2)
assert.Equal(t, hash1, GenerateContentHash(content1)) // same content should produce same hash
assert.Equal(t, 64, len(hash1)) // SHA-256 produces 64 character hex string
}

func TestSummariesDAO_CleanupExpired(t *testing.T) {
if _, ok := os.LookupEnv("ENABLE_MONGO_TESTS"); !ok {
t.Skip("ENABLE_MONGO_TESTS env variable is not set")
}

mdb, err := New("mongodb://localhost:27017", "test_ureadability", 0)
require.NoError(t, err)

// create a unique collection for this test to avoid conflicts
collection := mdb.client.Database(mdb.dbName).Collection("summaries_expired_test")
defer func() {
_ = collection.Drop(context.Background())
}()

// create an index on the expires_at field
_, err = collection.Indexes().CreateOne(context.Background(),
mongo.IndexModel{
Keys: bson.D{{Key: "expires_at", Value: 1}},
})
require.NoError(t, err)

dao := SummariesDAO{Collection: collection}
ctx := context.Background()

// add expired summary
expiredSummary := Summary{
Content: "This is an expired summary",
Summary: "Expired content",
Model: "gpt-4o-mini",
CreatedAt: time.Now().Add(-48 * time.Hour),
UpdatedAt: time.Now().Add(-48 * time.Hour),
ExpiresAt: time.Now().Add(-24 * time.Hour), // expired 24 hours ago
}
err = dao.Save(ctx, expiredSummary)
require.NoError(t, err)

// add valid summary
validSummary := Summary{
Content: "This is a valid summary",
Summary: "Valid content",
Model: "gpt-4o-mini",
CreatedAt: time.Now(),
UpdatedAt: time.Now(),
ExpiresAt: time.Now().Add(24 * time.Hour), // expires in 24 hours
}
err = dao.Save(ctx, validSummary)
require.NoError(t, err)

// verify both summaries exist
_, foundExpired := dao.Get(ctx, expiredSummary.Content)
assert.True(t, foundExpired, "Expected to find expired summary before cleanup")

_, foundValid := dao.Get(ctx, validSummary.Content)
assert.True(t, foundValid, "Expected to find valid summary before cleanup")

// run cleanup
count, err := dao.CleanupExpired(ctx)
require.NoError(t, err)
assert.Equal(t, int64(1), count, "Expected to clean up exactly one record")

// verify expired summary is gone but valid remains
_, foundExpired = dao.Get(ctx, expiredSummary.Content)
assert.False(t, foundExpired, "Expected expired summary to be deleted")

_, foundValid = dao.Get(ctx, validSummary.Content)
assert.True(t, foundValid, "Expected valid summary to still exist")
}