Support S3 without CDN #688
base: master
Conversation
Noticed these: should they be updated to use the new storage abstraction or left as-is?
Also, I tested locally using MinIO. I don't have access to AWS S3 + CloudFront, but I plan to test it in my work environment (which uses VK Object Storage).
You are right, those two functions are not used anymore. I will remove them; no updates related to them are needed.
For documents/nodes, the Papermerge REST API returns a URL pointing to where the client should fetch the file from. In simple cases (i.e. without S3), that URL is something like:
You can see this setup in action here: https://demo.papermerge.com (username/password: demo/demo). The reason I mention this is that in the scenario where I am using S3, Papermerge does not serve files: it just gives the client back the correct URL, and for that the REST API server does not need S3 credentials.

@bl1nkker, in your setup, who is serving files? Is it Papermerge? Or the S3 server?
```diff
@@ -37,6 +37,13 @@ class Settings(BaseSettings):
     papermerge__ocr__automatic: bool = False
     papermerge__search__url: str | None = None

+    papermerge__s3__provider: str = "aws"
+    aws_access_key_id: str | None = None
```
As I mentioned in the comments: so far, when using S3 storage, Papermerge does not serve files. Thus, there is no need for `aws_access_key_id` etc. Just keep this in mind. I assume that in your S3 setup you want Papermerge to serve files as well, which means you will need to add code for downloading from S3?
Yes, you are right: in the default Papermerge + CloudFront + S3 setup those variables are not needed. However, in my use case (and in this PR) I'm implementing a setup without a CDN. In this scenario Papermerge uses boto3 to generate signed URLs for direct access to files in the object storage, so `aws_access_key_id` and the other variables are required.
At the moment, in production, files are served by Papermerge and stored on local storage. However, I'm facing a new requirement to offload files to external object storage, since more than 2 million pages will be uploaded soon. That's why I'm working on integrating Papermerge with VK Cloud.
Regarding the variables, I don't have much experience with …
I'm sorry, I was wrong: some of the params are actually not strictly required. According to the boto3 docs, the client can work without explicitly passing credentials. However, I think for non-AWS providers like MinIO or VK Cloud, specifying at least the endpoint URL explicitly is necessary. I'll update the PR once I finish testing.
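The per-provider requirement can be sketched like this (a minimal illustration, not the PR's actual code; all names here are hypothetical): boto3 resolves AWS credentials from its credential chain (env vars, IAM roles), but a non-AWS provider has no default endpoint, so it must be supplied explicitly.

```python
# Hypothetical sketch: which S3 settings must be set explicitly per provider.
# boto3's credential chain covers "aws"; MinIO / VK Cloud need an endpoint URL.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class S3Settings:
    provider: str = "aws"  # "aws", "minio" or "vk"
    aws_access_key_id: Optional[str] = None
    aws_secret_access_key: Optional[str] = None
    aws_endpoint_url: Optional[str] = None


def missing_required(settings: S3Settings) -> List[str]:
    """Return names of settings that must be provided for this provider."""
    missing = []
    if settings.provider != "aws" and settings.aws_endpoint_url is None:
        # non-AWS providers have no default endpoint to fall back to
        missing.append("aws_endpoint_url")
    return missing


# AWS itself: nothing strictly required, boto3 falls back to its credential chain
print(missing_required(S3Settings(provider="aws")))    # []
# VK Cloud / MinIO: the endpoint must be explicit
print(missing_required(S3Settings(provider="vk")))     # ['aws_endpoint_url']
```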
Tested Papermerge with another S3 provider. After a few small adjustments (I'll add them to the pull request a little later), everything works as expected.

Note: my setup does not use CloudFront or any CDN.
Note 2: please don't review the code just yet. While testing I noticed that there's already a …

This is my docker-compose file:

```yaml
services:
  webapp:
    image: blinkker/papermerge:0.0.9-dev
    environment:
      PAPERMERGE__SECURITY__SECRET_KEY: 12345
      PAPERMERGE__AUTH__USERNAME: admin
      PAPERMERGE__AUTH__PASSWORD: admin
      PAPERMERGE__DATABASE__URL: postgresql://postgres:[email protected]:5432/pmgdb
      PAPERMERGE__MAIN__MEDIA_ROOT: /var/media/pmg
      PAPERMERGE__REDIS__URL: redis://host.docker.internal:6379/0
      PAPERMERGE__OCR__LANG_CODES: "deu,eng,kaz,rus"
      PAPERMERGE__OCR__DEFAULT_LANG_CODE: "deu"
      AWS_ACCESS_KEY_ID: <aws-access-key>
      AWS_SECRET_ACCESS_KEY: <aws-secret-key>
      AWS_ENDPOINT_URL: <aws-endpoint-url>
      AWS_REGION_NAME: us-east-1
      PAPERMERGE__S3__BUCKET_NAME: <bucket-name>
      PAPERMERGE__MAIN__FILE_SERVER: s3
      # options are: vk, minio, aws
      PAPERMERGE__S3__PROVIDER: vk
    volumes:
      - media_root:/var/media/pmg
    ports:
      - "12000:80"
  s3worker:
    image: blinkker/papermerge-s3-worker:0.0.2
    command: worker
    environment:
      PAPERMERGE__DATABASE__URL: postgresql://postgres:[email protected]:5432/pmgdb
      PAPERMERGE__REDIS__URL: redis://host.docker.internal:6379/0
      PAPERMERGE__MAIN__MEDIA_ROOT: /var/media/pmg
      S3_WORKER_ARGS: "-Q s3 -c 2"
      PAPERMERGE__S3__BUCKET_NAME: <bucket-name>
      AWS_ACCESS_KEY_ID: <aws-access-key>
      AWS_SECRET_ACCESS_KEY: <aws-secret-key>
      AWS_ENDPOINT_URL: <aws-endpoint-url>
      AWS_REGION_NAME: us-east-1
    volumes:
      - media_root:/var/media/pmg

volumes:
  media_root:
```
The environment variables I added are actually needed for signing S3 URLs directly. In my case I want Papermerge to work only with the object storage, letting S3 serve all files directly. Also, I don't want Papermerge to store files locally at all, but according to the documentation it seems like local storage is still used even when S3 is configured. Is there a way to fully switch Papermerge to use only S3 for storing and serving documents, without saving anything locally?
I don't really understand what you mean by "only S3". Also, what exactly do you mean by "locally"? Locally for whom: for the REST API server, or for the S3 worker? The docker compose is just an example; on a real production deployment there can be any number of REST API servers (1, 2, 3, 4, ..., N), and each of them has its own "local" storage, which they don't share. Same for s3-workers. So what exactly do you mean by "locally"? Maybe explain your production setup here in detail (e.g. do you plan to deploy in k8s? do you plan to have only one REST API server? etc.) so that I can further help you.

PS: "production" for me means a k8s cluster with N (N >= 3) REST API server instances, each with ephemeral storage (i.e. storage which can be replaced at any time), and with N s3-workers (each worker has access to the same storage as its peer REST API server, so that it can upload files to S3).
Right now my setup is very simple: I have a single Ubuntu 22.04 server with 8GB RAM and a 500GB disk. I'm running only one container with the Papermerge REST API (via Docker), and currently it stores files in a volume mounted on the host (no s3worker in my setup). I also don't plan to migrate to k8s; currently my deployment is based on docker compose. Once I complete the integration with VK Cloud, I plan to move towards Docker Swarm, as I intend to provision a dedicated server for the database.

What I want to achieve is the following: I'd like to run an additional s3worker container and make Papermerge store all documents exclusively in S3 object storage (in my case VK Cloud), without consuming any local disk space on the server. So instead of saving documents locally first and then offloading them to S3 later, I want Papermerge to write directly to S3 and read from there as well, skipping local volumes altogether. And yes, this means that S3 itself provides access to files. With my current setup, Papermerge retrieves the files directly from my storage (I'm running MinIO and the Papermerge API locally). For example, here's a generated pre-signed URL that the frontend uses to access a document:
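For reference, a SigV4 pre-signed URL of this kind generally has the following shape (host, bucket, key, dates and signature below are placeholders, not the actual URL from this setup):

```
http://localhost:9000/papermerge-bucket/docs/<uuid>.pdf
    ?X-Amz-Algorithm=AWS4-HMAC-SHA256
    &X-Amz-Credential=<access-key>%2F20240101%2Fus-east-1%2Fs3%2Faws4_request
    &X-Amz-Date=20240101T000000Z
    &X-Amz-Expires=600
    &X-Amz-SignedHeaders=host
    &X-Amz-Signature=<hex-signature>
```

The signature lives entirely in the query string, so no Authorization header is needed to fetch the object.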
Right. Currently uploaded files are saved on the disk first. But in general you can write another application which will take over …
@ciur Apologies for not working on this pull request for a while. I've had a busy period at work, and since my current solution worked well enough for our use case, I had to temporarily pause work on the PR (same for the s3-worker pull request). However, I'd be happy to continue contributing.
Also, I'd like to reiterate the approach I took for supporting alternative object storages: Papermerge generates a pre-signed URL for downloading the file from the object storage and sends it to the client (the files themselves are stored on the main Papermerge node and in object storage, as was originally the case).
@bl1nkker don't forget to rebase. I changed master recently (all sync went async). Also, you may want to have a look at this: https://docs.papermerge.io/3.5/developer-manual/architecture/
@ciur I've mostly finished the work, but I still have some questions.

```typescript
// frontend/apps/ui/src/features/document/utils.ts
export async function getDocLastVersion(docID: UUID): Promise<ClientReturn> {
  // . . .
  resp = await client.get(docVer.download_url, {
    responseType: "blob",
    headers: { Authorization: undefined }
  });
```

I didn't include this in the pull request, as it's an obvious workaround (I've also considered handling it with nginx, configuring it as a reverse proxy to S3 and stripping the extra Authorization header before forwarding the request to S3). Is there a better or more recommended way to handle this issue?
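The nginx variant mentioned above could look roughly like this (a sketch only, with a placeholder storage endpoint; not a tested config). Since a pre-signed URL carries its signature in the query string, dropping the app's Authorization header does not break the S3 signature check:

```nginx
# Hypothetical sketch: proxy /s3/ to the object storage and drop the
# Authorization header that the frontend client attaches to every request.
location /s3/ {
    proxy_pass https://storage.example.com/;    # placeholder object-storage endpoint
    proxy_set_header Authorization "";          # strip the app's bearer token
    proxy_set_header Host storage.example.com;  # match the host the URL was signed for
}
```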
Also, having reviewed the changes, I realized the PR would be more accurately titled "Support S3 without CDN" instead of the current one.
S3 will work perfectly without a CDN (e.g. CloudFront). From a technical point of view it is the same thing. The difference is only from the user's point of view: a CDN serves content closer to the user (say, the app runs in Europe but the user is in Asia; then the CDN will serve a copy of the document from Asia, closer to the end user, while S3 will serve the copy from Europe). In other words, a CDN brings content closer to the user. Now I think I see what you are trying to do: you are trying to upload docs from the client (e.g. browser) directly to S3, which may well work; however, for me that is completely new territory. Papermerge is designed with the concept in mind that the client will push docs to the backend first, and then the backend will take care of uploading documents to S3.
Well, if you have only one node and one storage, then that is not necessary. The necessity comes into the picture when you have a multi-node deployment (each node with separate storage), i.e. when the app container is stateless: it does not depend on storage, as storage may come and go.
Yes, that's exactly right: an object storage can work without a CDN. However, in the context of what I wrote above, I meant the implementation of Papermerge's object storage support. In the current Papermerge implementation, the backend (when S3 is used) signs URLs using the following `sign_url` function:

```python
def sign_url(url: str, valid_for: int = 600):
    """
    :type url: str
    :param url: The URL of the protected object
    :type valid_for: int
    :param valid_for: number of seconds the url will be valid for, defaults
        to 600 (i.e. 10 minutes)
    """
    key_id = settings.papermerge__main__cf_sign_url_key_id
    tz = pytz.timezone(
        settings.papermerge__main__timezone
    )
    if key_id is None:
        raise ValueError(
            "CF_SIGN_URL_KEY_ID is empty"
        )
    cf_signer = CloudFrontSigner(key_id, rsa_signer)
    date_less_than = datetime.now(tz) + timedelta(seconds=valid_for)
    signed_url = cf_signer.generate_presigned_url(
        url,
        date_less_than=date_less_than
    )
    return signed_url
```

For this approach, no additional environment variables are required. But if we want the client browser to fetch content directly from object storage, the backend must also be able to generate signed URLs without relying on CloudFront. For that, additional environment variables are needed so that we can use `boto3` directly:

```python
def generate_s3_signed_url(path: str):
    client = boto3.client(
        "s3",
        aws_access_key_id=settings.aws_access_key_id,
        aws_secret_access_key=settings.aws_secret_access_key,
        region_name=settings.aws_region_name,
        endpoint_url=settings.aws_endpoint_url,
        config=Config(signature_version="s3v4"),
    )
    return client.generate_presigned_url(
        ClientMethod="get_object",
        Params={"Bucket": settings.papermerge__s3__bucket_name, "Key": path},
        ExpiresIn=VALID_FOR_SECONDS,
    )
```
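The `sign_url` function quoted in this thread references an `rsa_signer` callable that is not shown here. For context, a typical implementation (following the botocore `CloudFrontSigner` documentation; the key path below is a placeholder, not from the PR) looks like this:

```python
# Sketch only: CloudFront URL signing uses RSA with SHA-1 and PKCS#1 v1.5
# padding. "cf_private_key.pem" is a placeholder path for the CloudFront
# key pair's private key, not a path taken from the PR.
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding


def rsa_signer(message: bytes) -> bytes:
    with open("cf_private_key.pem", "rb") as f:
        private_key = serialization.load_pem_private_key(
            f.read(), password=None
        )
    # CloudFrontSigner calls this with the policy document to sign
    return private_key.sign(message, padding.PKCS1v15(), hashes.SHA1())
```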
That's partially correct. I'm actually fine with the current design where the backend uploads documents to S3 (I don't want to bypass it). The real issue is that in my setup I don't have a CDN, and even if I add one later, it will most likely not be CloudFront. Because of the way Papermerge currently signs URLs (tied specifically to CloudFront), I don't have a way to let users fetch documents directly from object storage.
@ciur I just want to clarify this once more, because I feel like we might still be talking a bit past each other. Right now, since URL signing in Papermerge is done via CloudFront, I'm not sure it will actually work in a setup without CloudFront.
That's why I suggested adding support for pre-signed S3 URLs, so it's not tied only to CloudFront. But it's really no problem if we see this differently; I just wanted to explain my perspective more clearly. In my setup I've already made it work the way I need, so I'm good either way 👍 That said, if you think the idea makes sense for Papermerge, I'd be more than happy to work on these changes.
This PR adds support for generating signed URLs with AWS-S3-compatible storage providers.

Added settings:
- `papermerge__s3__provider`
- `aws_access_key_id`
- `aws_secret_access_key`
- `aws_region_name`
- `aws_endpoint_url`
- `papermerge__s3__bucket_name`

Added classes:
- `AWSS3Storage` – generates CloudFront signed URLs
- `GenericS3Storage` – uses boto3 for S3-compatible storage

Added factory method `get_storage()`.

Updated `resource_sign_url` to use the storage abstraction.

refs #14
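The factory described above can be sketched as follows. Only the names (`AWSS3Storage`, `GenericS3Storage`, `get_storage`, the provider setting) come from the PR summary; the class bodies are hypothetical stubs, not the PR's actual code:

```python
# Hypothetical sketch of the storage abstraction summarized in this PR.
# Real implementations would sign via CloudFrontSigner / boto3 presigning.

class S3Storage:
    def sign_url(self, path: str, valid_for: int = 600) -> str:
        raise NotImplementedError


class AWSS3Storage(S3Storage):
    """Signs URLs via CloudFront (the pre-existing behavior)."""

    def sign_url(self, path: str, valid_for: int = 600) -> str:
        return f"cloudfront-signed:{path}"  # stub


class GenericS3Storage(S3Storage):
    """Signs URLs with boto3 presigning, for MinIO / VK Cloud / plain S3."""

    def sign_url(self, path: str, valid_for: int = 600) -> str:
        return f"s3-presigned:{path}"  # stub


def get_storage(provider: str = "aws") -> S3Storage:
    # "aws" keeps the CloudFront path; other providers get boto3 presigning
    if provider == "aws":
        return AWSS3Storage()
    return GenericS3Storage()


# usage: resource_sign_url delegates to whichever backend is configured
storage = get_storage(provider="vk")
url = storage.sign_url("docs/sample.pdf")
```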