Support S3 without CDN #688
base: master
Conversation
Noticed these: should they be updated to use the new storage abstraction or left as-is?
Also, I tested locally using MinIO. I don't have access to AWS S3 + CloudFront, but I plan to test it in my work environment (which uses VK Object Storage).
You are right, those two functions are not used anymore. I will remove them; no updates related to them are needed.
For documents/nodes, the Papermerge REST API returns a URL pointing to where the client should fetch the file from. In simple cases (i.e. without S3), that URL is something like:
You can see this setup in action here: https://demo.papermerge.com (username/password: demo/demo). The reason I mention this is that in the scenario where I am using S3, Papermerge does not serve files: it just gives the client back the correct URL, and for that the REST API server does not need S3 credentials.

@bl1nkker, in your setup, who is serving files? Is it Papermerge? Or the S3 server?
```diff
@@ -37,6 +37,13 @@ class Settings(BaseSettings):
     papermerge__ocr__automatic: bool = False
     papermerge__search__url: str | None = None

+    papermerge__s3__provider: str = "aws"
+    aws_access_key_id: str | None = None
```
As I mentioned in the comments: so far, when using S3 storage, Papermerge does not serve files. Thus, there is no need for `aws_access_key_id` etc. Just keep this in mind. I assume that in your S3 setup you want Papermerge to serve files as well, which means you will need to add code for downloading from S3?
Yes, you are right: in the default Papermerge + CloudFront + S3 setup those variables are not needed. However, in my use case (and in this PR) I'm implementing a setup without a CDN. In this scenario Papermerge uses boto3 to generate signed URLs for direct access to files in the object storage, so `aws_access_key_id` and the other variables are required.
At the moment, in production, files are served by Papermerge and stored on local storage. However, I'm facing a new requirement to offload files to external object storage, since more than 2 million pages will be uploaded soon. That's why I'm working on integrating Papermerge with VK Cloud.
Regarding the variables, I don't have much experience with …
I'm sorry, I was wrong: some of the params are actually not strictly required. According to the boto3 docs, the client can work without explicitly passing credentials. However, I think for non-AWS providers like MinIO or VK Cloud, specifying at least the endpoint URL explicitly is necessary. I'll update the PR once I finish testing.
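The per-provider requirement can be sketched like this (a minimal illustration, not the PR's actual code; all names here are hypothetical): boto3 resolves AWS credentials from its credential chain (env vars, IAM roles), but a non-AWS provider has no default endpoint, so it must be supplied explicitly.

```python
# Hypothetical sketch: which S3 settings must be set explicitly per provider.
# boto3's credential chain covers "aws"; MinIO / VK Cloud need an endpoint URL.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class S3Settings:
    provider: str = "aws"  # "aws", "minio" or "vk"
    aws_access_key_id: Optional[str] = None
    aws_secret_access_key: Optional[str] = None
    aws_endpoint_url: Optional[str] = None


def missing_required(settings: S3Settings) -> List[str]:
    """Return names of settings that must be provided for this provider."""
    missing = []
    if settings.provider != "aws" and settings.aws_endpoint_url is None:
        # non-AWS providers have no default endpoint to fall back to
        missing.append("aws_endpoint_url")
    return missing


# AWS itself: nothing strictly required, boto3 falls back to its credential chain
print(missing_required(S3Settings(provider="aws")))    # []
# VK Cloud / MinIO: the endpoint must be explicit
print(missing_required(S3Settings(provider="vk")))     # ['aws_endpoint_url']
```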
Tested Papermerge with another S3 provider. After a few small adjustments (I'll add them to the pull request a little later), everything works as expected.

Note: my setup does not use CloudFront or any CDN.
Note 2: please don't review the code just yet. While testing I noticed that there's already a …

This is my docker-compose file:

```yaml
services:
  webapp:
    image: blinkker/papermerge:0.0.9-dev
    environment:
      PAPERMERGE__SECURITY__SECRET_KEY: 12345
      PAPERMERGE__AUTH__USERNAME: admin
      PAPERMERGE__AUTH__PASSWORD: admin
      PAPERMERGE__DATABASE__URL: postgresql://postgres:[email protected]:5432/pmgdb
      PAPERMERGE__MAIN__MEDIA_ROOT: /var/media/pmg
      PAPERMERGE__REDIS__URL: redis://host.docker.internal:6379/0
      PAPERMERGE__OCR__LANG_CODES: "deu,eng,kaz,rus"
      PAPERMERGE__OCR__DEFAULT_LANG_CODE: "deu"
      AWS_ACCESS_KEY_ID: <aws-access-key>
      AWS_SECRET_ACCESS_KEY: <aws-secret-key>
      AWS_ENDPOINT_URL: <aws-endpoint-url>
      AWS_REGION_NAME: us-east-1
      PAPERMERGE__S3__BUCKET_NAME: <bucket-name>
      PAPERMERGE__MAIN__FILE_SERVER: s3
      # options are: vk, minio, aws
      PAPERMERGE__S3__PROVIDER: vk
    volumes:
      - media_root:/var/media/pmg
    ports:
      - "12000:80"
  s3worker:
    image: blinkker/papermerge-s3-worker:0.0.2
    command: worker
    environment:
      PAPERMERGE__DATABASE__URL: postgresql://postgres:[email protected]:5432/pmgdb
      PAPERMERGE__REDIS__URL: redis://host.docker.internal:6379/0
      PAPERMERGE__MAIN__MEDIA_ROOT: /var/media/pmg
      S3_WORKER_ARGS: "-Q s3 -c 2"
      PAPERMERGE__S3__BUCKET_NAME: <bucket-name>
      AWS_ACCESS_KEY_ID: <aws-access-key>
      AWS_SECRET_ACCESS_KEY: <aws-secret-key>
      AWS_ENDPOINT_URL: <aws-endpoint-url>
      AWS_REGION_NAME: us-east-1
    volumes:
      - media_root:/var/media/pmg

volumes:
  media_root:
```
The environment variables I added are actually needed for signing S3 URLs directly. In my case I want Papermerge to work only with the object storage, letting S3 serve all files directly. Also, I don't want Papermerge to store files locally at all, but according to the documentation it seems like local storage is still used even when S3 is configured. Is there a way to fully switch Papermerge to use only S3 for storing and serving documents, without saving anything locally?
I don't really understand what you mean by "only S3". Also, what exactly do you mean by "locally"? Locally for whom: for the REST API server, or for the S3 worker? The docker compose is just an example; on a real production deployment there can be any number of REST API servers (1, 2, 3, 4, ..., N), and each of them has its own "local" storage, which they don't share. Same for s3-workers. So what exactly do you mean by "locally"? Maybe explain your production setup here in detail (e.g. do you plan to deploy in k8s? do you plan to have only one REST API server? etc.) so that I can further help you.

PS: "production" for me means a k8s cluster with N (N >= 3) REST API server instances, each with ephemeral storage (i.e. storage which can be replaced at any time), and with N s3-workers (each worker has access to the same storage as its peer REST API server, so that it can upload files to S3).
Right now my setup is very simple: I have a single Ubuntu 22.04 server with 8GB RAM and a 500GB disk. I'm running only one container with the Papermerge REST API (via Docker), and currently it stores files in a volume mounted on the host (no s3worker in my setup). I also don't plan to migrate to k8s; currently my deployment is based on docker compose. Once I complete the integration with VK Cloud, I plan to move towards Docker Swarm, as I intend to provision a dedicated server for the database.

What I want to achieve is the following: I'd like to run an additional s3worker container and make Papermerge store all documents exclusively in S3 object storage (in my case VK Cloud), without consuming any local disk space on the server. So instead of saving documents locally first and then offloading them to S3 later, I want Papermerge to write directly to S3 and read from there as well, skipping local volumes altogether. And yes, this means that S3 itself provides access to files. With my current setup, Papermerge retrieves the files directly from my storage (I'm running MinIO and the Papermerge API locally). For example, here's a generated pre-signed URL that the frontend uses to access a document:
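For reference, a SigV4 pre-signed URL of this kind generally has the following shape (host, bucket, key, dates and signature below are placeholders, not the actual URL from this setup):

```
http://localhost:9000/papermerge-bucket/docs/<uuid>.pdf
    ?X-Amz-Algorithm=AWS4-HMAC-SHA256
    &X-Amz-Credential=<access-key>%2F20240101%2Fus-east-1%2Fs3%2Faws4_request
    &X-Amz-Date=20240101T000000Z
    &X-Amz-Expires=600
    &X-Amz-SignedHeaders=host
    &X-Amz-Signature=<hex-signature>
```

The signature lives entirely in the query string, so no Authorization header is needed to fetch the object.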
Right. Currently uploaded files are saved on the disk first. But in general you can write another application which will take over …
@ciur Apologies for not working on this pull request for a while. I've had a busy period at work, and since my current solution worked well enough for our use case, I had to temporarily pause work on the PR (same for the s3-worker pull request). However, I'd be happy to continue contributing.
Also, I'd like to reiterate the approach I took for supporting alternative object storages: Papermerge generates a pre-signed URL for downloading the file from the object storage and sends it to the client (the files themselves are stored on the main Papermerge node and in object storage, as was originally the case).
@bl1nkker don't forget to rebase. I changed master recently (all sync went async). Also, you may want to have a look at this: https://docs.papermerge.io/3.5/developer-manual/architecture/
@ciur I've mostly finished the work, but I still have some questions.

```typescript
// frontend/apps/ui/src/features/document/utils.ts
export async function getDocLastVersion(docID: UUID): Promise<ClientReturn> {
  // . . .
  resp = await client.get(docVer.download_url, {
    responseType: "blob",
    headers: { Authorization: undefined }
  });
```

I didn't include this in the pull request, as it's an obvious workaround (I've also considered handling it with nginx, configuring it as a reverse proxy to S3 and stripping the extra Authorization header before forwarding the request to S3). Is there a better or more recommended way to handle this issue?
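The nginx variant mentioned above could look roughly like this (a sketch only, with a placeholder storage endpoint; not a tested config). Since a pre-signed URL carries its signature in the query string, dropping the app's Authorization header does not break the S3 signature check:

```nginx
# Hypothetical sketch: proxy /s3/ to the object storage and drop the
# Authorization header that the frontend client attaches to every request.
location /s3/ {
    proxy_pass https://storage.example.com/;    # placeholder object-storage endpoint
    proxy_set_header Authorization "";          # strip the app's bearer token
    proxy_set_header Host storage.example.com;  # match the host the URL was signed for
}
```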
Also, having reviewed the changes, I realized the PR would be more accurately titled "Support S3 without CDN" instead of the current one.
S3 will work perfectly without a CDN (e.g. CloudFront). From a technical point of view it is the same thing. The difference is only from the user's point of view: a CDN serves content closer to the user (say, the app runs in Europe but the user is in Asia; then the CDN will serve a copy of the document from Asia, closer to the end user, while S3 will serve the copy from Europe). In other words, a CDN brings content closer to the user. Now I think I see what you are trying to do: you are trying to upload docs from the client (e.g. browser) directly to S3, which may well work; however, for me that is completely new territory. Papermerge is designed with the concept in mind that the client will push docs to the backend first, and then the backend will take care of uploading documents to S3.
Well, if you have only one node and one storage, then that is not necessary. The necessity comes into the picture when you have a multi-node deployment (each node with separate storage), i.e. when the app container is stateless: it does not depend on storage, as storage may come and go.
Yes, that's exactly right: an object storage can work without a CDN. However, in the context of what I wrote above, I meant the implementation of Papermerge's object storage support. In the current Papermerge implementation, the backend (when S3 is used) signs URLs using the following `sign_url` function:

```python
def sign_url(url: str, valid_for: int = 600):
    """
    :type url: str
    :param url: The URL of the protected object
    :type valid_for: int
    :param valid_for: number of seconds the url will be valid for, defaults
        to 600 (i.e. 10 minutes)
    """
    key_id = settings.papermerge__main__cf_sign_url_key_id
    tz = pytz.timezone(
        settings.papermerge__main__timezone
    )
    if key_id is None:
        raise ValueError(
            "CF_SIGN_URL_KEY_ID is empty"
        )
    cf_signer = CloudFrontSigner(key_id, rsa_signer)
    date_less_than = datetime.now(tz) + timedelta(seconds=valid_for)
    signed_url = cf_signer.generate_presigned_url(
        url,
        date_less_than=date_less_than
    )
    return signed_url
```

For this approach, no additional environment variables are required. But if we want the client browser to fetch content directly from object storage, the backend must also be able to generate signed URLs without relying on CloudFront. For that, additional environment variables are needed so that we can use `boto3` directly:

```python
def generate_s3_signed_url(path: str):
    client = boto3.client(
        "s3",
        aws_access_key_id=settings.aws_access_key_id,
        aws_secret_access_key=settings.aws_secret_access_key,
        region_name=settings.aws_region_name,
        endpoint_url=settings.aws_endpoint_url,
        config=Config(signature_version="s3v4"),
    )
    return client.generate_presigned_url(
        ClientMethod="get_object",
        Params={"Bucket": settings.papermerge__s3__bucket_name, "Key": path},
        ExpiresIn=VALID_FOR_SECONDS,
    )
```
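The `sign_url` function quoted in this thread references an `rsa_signer` callable that is not shown here. For context, a typical implementation (following the botocore `CloudFrontSigner` documentation; the key path below is a placeholder, not from the PR) looks like this:

```python
# Sketch only: CloudFront URL signing uses RSA with SHA-1 and PKCS#1 v1.5
# padding. "cf_private_key.pem" is a placeholder path for the CloudFront
# key pair's private key, not a path taken from the PR.
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding


def rsa_signer(message: bytes) -> bytes:
    with open("cf_private_key.pem", "rb") as f:
        private_key = serialization.load_pem_private_key(
            f.read(), password=None
        )
    # CloudFrontSigner calls this with the policy document to sign
    return private_key.sign(message, padding.PKCS1v15(), hashes.SHA1())
```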
That's partially correct. I'm actually fine with the current design where the backend uploads documents to S3 (I don't want to bypass it). The real issue is that in my setup I don't have a CDN, and even if I add one later, it will most likely not be CloudFront. Because of the way Papermerge currently signs URLs (tied specifically to CloudFront), I don't have a way to let users fetch documents directly from object storage.
@ciur I just want to clarify this once more, because I feel like we might still be talking a bit past each other. Right now, since URL signing in Papermerge is done via CloudFront, I'm not sure it will actually work in a setup without CloudFront.
That's why I suggested adding support for pre-signed S3 URLs, so it's not tied only to CloudFront. But it's really no problem if we see this differently; I just wanted to explain my perspective more clearly. In my setup I've already made it work the way I need, so I'm good either way 👍 That said, if you think the idea makes sense for Papermerge, I'd be more than happy to work on these changes.
This PR adds support for generating signed URLs with AWS-S3-compatible storage providers.

Added settings:
- `papermerge__s3__provider`
- `aws_access_key_id`
- `aws_secret_access_key`
- `aws_region_name`
- `aws_endpoint_url`
- `papermerge__s3__bucket_name`

Added classes:
- `AWSS3Storage` – generates CloudFront signed URLs
- `GenericS3Storage` – uses boto3 for S3-compatible storage

Added factory method `get_storage()`.

Updated `resource_sign_url` to use the storage abstraction.

refs #14
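The factory described above can be sketched as follows. Only the names (`AWSS3Storage`, `GenericS3Storage`, `get_storage`, the provider setting) come from the PR summary; the class bodies are hypothetical stubs, not the PR's actual code:

```python
# Hypothetical sketch of the storage abstraction summarized in this PR.
# Real implementations would sign via CloudFrontSigner / boto3 presigning.

class S3Storage:
    def sign_url(self, path: str, valid_for: int = 600) -> str:
        raise NotImplementedError


class AWSS3Storage(S3Storage):
    """Signs URLs via CloudFront (the pre-existing behavior)."""

    def sign_url(self, path: str, valid_for: int = 600) -> str:
        return f"cloudfront-signed:{path}"  # stub


class GenericS3Storage(S3Storage):
    """Signs URLs with boto3 presigning, for MinIO / VK Cloud / plain S3."""

    def sign_url(self, path: str, valid_for: int = 600) -> str:
        return f"s3-presigned:{path}"  # stub


def get_storage(provider: str = "aws") -> S3Storage:
    # "aws" keeps the CloudFront path; other providers get boto3 presigning
    if provider == "aws":
        return AWSS3Storage()
    return GenericS3Storage()


# usage: resource_sign_url delegates to whichever backend is configured
storage = get_storage(provider="vk")
url = storage.sign_url("docs/sample.pdf")
```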