Simpler csv dump #5227
This bash script replaces the use of Sling for CSV export.
Why?
1. Sling introduces upstream security concerns that we must constantly
respond to.
2. The Sling Python library does not import well in our multi-platform
(Mac, Linux, Windows) Docker stack.
3. The process is *slow*. Many hours?
This runs in 15m locally; ymmv in the cloud.
It takes a list of endpoints from an API, and does the following query
against each:
    local query="\COPY "
    query+="( SELECT * FROM ${API_VERSION}.${endpoint} "
    query+=" WHERE report_id in ( "
    query+=" SELECT report_id from ${API_VERSION}.general "
    query+=" WHERE fac_accepted_date >= '${start_date}' "
    query+=" AND fac_accepted_date <= '${end_date}' "
    query+=")) "
    query+="TO '${ROOT}/${endpoint}.csv' "
    query+="WITH (FORMAT CSV, HEADER, DELIMITER ',');"
This creates a CSV file in the filesystem that we then use the aws-cli
to ship over to S3. By putting the file in the correct place, the static
site can pick up the CSVs.
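
As a rough illustration of the flow described above, the sketch below wraps the query in a function and loops over a few endpoints. It is not the script from this PR: the function names (`build_copy_query`, `dump_endpoint`), the `DRY_RUN` switch, the `DATABASE_URL` and `S3_BUCKET` variables, the default `API_VERSION` and `ROOT` values, the S3 key layout, the date format, and the example endpoint list are all assumptions made for illustration. With `DRY_RUN=1` (the default here) it only prints what it would run.

```bash
set -eu

# Assumed defaults for illustration; the real script's values are not shown in this PR.
API_VERSION="${API_VERSION:-api_v1_1_0}"
ROOT="${ROOT:-/tmp/csv-dump}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only, 0 = run psql and aws

# Build the one-line \COPY meta-command for a single endpoint and date range.
build_copy_query() {
  local endpoint="$1" start_date="$2" end_date="$3"
  local query='\COPY '
  query="${query}( SELECT * FROM ${API_VERSION}.${endpoint} "
  query="${query} WHERE report_id in ( "
  query="${query} SELECT report_id from ${API_VERSION}.general "
  query="${query} WHERE fac_accepted_date >= '${start_date}' "
  query="${query} AND fac_accepted_date <= '${end_date}' "
  query="${query})) "
  query="${query}TO '${ROOT}/${endpoint}.csv' "
  query="${query}WITH (FORMAT CSV, HEADER, DELIMITER ',');"
  printf '%s\n' "$query"
}

# Export one endpoint to ${ROOT}/<endpoint>.csv, then copy that file to S3.
dump_endpoint() {
  local endpoint="$1" start_date="$2" end_date="$3"
  local query
  query="$(build_copy_query "$endpoint" "$start_date" "$end_date")"
  if [ "$DRY_RUN" = "1" ]; then
    printf 'psql -c "%s"\n' "$query"
    printf 'aws s3 cp %s s3://%s/%s.csv\n' \
      "${ROOT}/${endpoint}.csv" "${S3_BUCKET:-example-bucket}" "$endpoint"
    return 0
  fi
  mkdir -p "$ROOT"
  # ON_ERROR_STOP makes psql exit non-zero if the export fails.
  psql "${DATABASE_URL:?set DATABASE_URL}" -v ON_ERROR_STOP=1 -c "$query"
  aws s3 cp "${ROOT}/${endpoint}.csv" \
    "s3://${S3_BUCKET:?set S3_BUCKET}/${endpoint}.csv"
}

# Example endpoint names; per the PR, the real list comes from an API.
for endpoint in general federal_awards; do
  dump_endpoint "$endpoint" "2024-01-01" "2024-12-31"
done
```

Because `\COPY` is a psql client-side command, the CSV lands on the machine running psql, which is what lets a follow-up `aws s3 cp` pick it up from `${ROOT}`.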
Initial checks suggest this is exporting data in a manner consistent
with the previous tooling.
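
One way such a consistency check could be done is sketched below; how the checks were actually run is not stated in this PR. The helper name `csv_equivalent` is hypothetical. It treats two exports as equivalent when their header lines match and their data lines are the same set regardless of order. It compares physical lines, so it is only a heuristic for files with quoted fields that contain newlines.

```bash
set -eu

# Succeeds (exit 0) if both CSVs have the same header line and the same
# data lines, ignoring row order; fails (exit 1) otherwise.
csv_equivalent() {
  local old="$1" new="$2"
  local tmp_old tmp_new status=0
  # Headers must match exactly.
  [ "$(head -n 1 "$old")" = "$(head -n 1 "$new")" ] || return 1
  tmp_old="$(mktemp)"
  tmp_new="$(mktemp)"
  # Sort everything after the header with a fixed locale so ordering is stable.
  tail -n +2 "$old" | LC_ALL=C sort > "$tmp_old"
  tail -n +2 "$new" | LC_ALL=C sort > "$tmp_new"
  cmp -s "$tmp_old" "$tmp_new" || status=1
  rm -f "$tmp_old" "$tmp_new"
  return "$status"
}
```

For example, `csv_equivalent sling/general.csv bash/general.csv && echo "general matches"` (paths are placeholders) would report whether the two tools produced the same rows for one endpoint.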
TODO:
- [ ] Test in `staging`
- [ ] Remove the sling tooling
- [ ] Modify the GH action to use this
Terraform plan for meta
No changes. Your infrastructure matches the configuration.
📝 Plan generated in Pull Request Checks #161
Terraform plan for dev
Plan: 3 to add, 5 to change, 3 to destroy.
Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
!~ update in-place
-/+ destroy and then create replacement
Terraform will perform the following actions:
# module.dev.cloudfoundry_network_policy.app-network-policy will be updated in-place
!~ resource "cloudfoundry_network_policy" "app-network-policy" {
!~ policies = [
!~ {
!~ destination_app = "830bfc16-6865-4644-a3f3-5a1a69d6ec5f" -> (known after apply)
# (3 unchanged attributes hidden)
},
!~ {
!~ destination_app = "830bfc16-6865-4644-a3f3-5a1a69d6ec5f" -> (known after apply)
# (3 unchanged attributes hidden)
},
# (1 unchanged element hidden)
]
}
# module.dev.cloudfoundry_network_policy.clamav-network-policy will be updated in-place
!~ resource "cloudfoundry_network_policy" "clamav-network-policy" {
!~ policies = [
!~ {
!~ source_app = "830bfc16-6865-4644-a3f3-5a1a69d6ec5f" -> (known after apply)
# (3 unchanged attributes hidden)
},
!~ {
!~ source_app = "d9fa2027-96c9-4f98-bdb8-13809c6f569a" -> (known after apply)
# (3 unchanged attributes hidden)
},
]
}
# module.dev.cloudfoundry_network_policy.scanner-network-policy will be updated in-place
!~ resource "cloudfoundry_network_policy" "scanner-network-policy" {
!~ policies = [
!~ {
!~ destination_app = "d9fa2027-96c9-4f98-bdb8-13809c6f569a" -> (known after apply)
# (3 unchanged attributes hidden)
},
# (1 unchanged element hidden)
]
}
# module.dev.module.clamav.cloudfoundry_app.clamav_api must be replaced
-/+ resource "cloudfoundry_app" "clamav_api" {
+ buildpacks = (known after apply)
!~ created_at = "2025-08-12T17:37:02Z" -> (known after apply)
!~ docker_image = "ghcr.io/gsa-tts/fac/clamav@sha256:f0b490065d736a7c4151744e04683760418fe7a43ba9e14102b162305d94966a" -> "ghcr.io/gsa-tts/fac/clamav@sha256:3c7acdf614fba2604a5aaf4a015c803a02a6bb79cd68f1636577557a0a9384bf"
!~ enable_ssh = false -> (known after apply)
!~ health_check_type = "port" -> (known after apply)
!~ id = "************************************" -> (known after apply)
!~ log_rate_limit_per_second = "-1" -> (known after apply)
name = "fac-av-dev"
!~ readiness_health_check_type = "process" -> (known after apply)
!~ routes = [
- {
- protocol = "http1" -> null
- route = "fac-av-dev.apps.internal" -> null
},
] -> (known after apply)
+ service_bindings = (known after apply) # forces replacement
!~ stack = null -> (known after apply)
!~ updated_at = "2025-08-12T17:37:09Z" -> (known after apply)
# (8 unchanged attributes hidden)
}
# module.dev.module.cors.null_resource.cors_header must be replaced
-/+ resource "null_resource" "cors_header" {
!~ id = "*******************" -> (known after apply)
!~ triggers = { # forces replacement
!~ "always_run" = "2025-08-22T18:58:30Z" -> (known after apply)
}
}
# module.dev.module.file_scanner_clamav.cloudfoundry_app.clamav_api must be replaced
-/+ resource "cloudfoundry_app" "clamav_api" {
+ buildpacks = (known after apply)
!~ created_at = "2025-08-12T17:39:11Z" -> (known after apply)
!~ docker_image = "ghcr.io/gsa-tts/fac/clamav@sha256:f0b490065d736a7c4151744e04683760418fe7a43ba9e14102b162305d94966a" -> "ghcr.io/gsa-tts/fac/clamav@sha256:3c7acdf614fba2604a5aaf4a015c803a02a6bb79cd68f1636577557a0a9384bf"
!~ enable_ssh = false -> (known after apply)
!~ health_check_type = "port" -> (known after apply)
!~ id = "************************************" -> (known after apply)
!~ log_rate_limit_per_second = "-1" -> (known after apply)
name = "fac-av-dev-fs"
!~ readiness_health_check_type = "process" -> (known after apply)
!~ routes = [
- {
- protocol = "http1" -> null
- route = "fac-av-dev-fs.apps.internal" -> null
},
] -> (known after apply)
+ service_bindings = (known after apply) # forces replacement
!~ stack = null -> (known after apply)
!~ updated_at = "2025-08-12T17:39:17Z" -> (known after apply)
# (8 unchanged attributes hidden)
}
# module.dev.module.clamav.module.route.cloudfoundry_route.app_route will be updated in-place
!~ resource "cloudfoundry_route" "app_route" {
!~ destinations = [
- {
- app_id = "830bfc16-6865-4644-a3f3-5a1a69d6ec5f" -> null
- app_process_type = "web" -> null
- id = "e72b846c-1652-4ba8-ab3e-03ed29621a63" -> null
- port = 8080 -> null
- protocol = "http1" -> null
},
+ {
+ app_id = (known after apply)
+ app_process_type = (known after apply)
+ id = (known after apply)
+ port = (known after apply)
+ protocol = (known after apply)
},
]
id = "877bbc1f-e036-4a96-b7a3-70cff8c35c3c"
!~ updated_at = "2025-08-12T17:37:10Z" -> (known after apply)
# (6 unchanged attributes hidden)
}
# module.dev.module.file_scanner_clamav.module.route.cloudfoundry_route.app_route will be updated in-place
!~ resource "cloudfoundry_route" "app_route" {
!~ destinations = [
- {
- app_id = "d9fa2027-96c9-4f98-bdb8-13809c6f569a" -> null
- app_process_type = "web" -> null
- id = "ca09f24f-42c2-4f01-b2ce-d9dcec58286d" -> null
- port = 8080 -> null
- protocol = "http1" -> null
},
+ {
+ app_id = (known after apply)
+ app_process_type = (known after apply)
+ id = (known after apply)
+ port = (known after apply)
+ protocol = (known after apply)
},
]
id = "18f0f97d-c1a1-4628-8700-70fe47c19114"
!~ updated_at = "2025-08-22T18:59:03Z" -> (known after apply)
# (6 unchanged attributes hidden)
}
Plan: 3 to add, 5 to change, 3 to destroy.
📝 Plan generated in Pull Request Checks #161
asteel-gsa
left a comment
Approved.
Tested this on call in both preview and staging. ~15 min run time compared to current csv export (7.5h)
Before this ships, if it is not too difficult, it would be good to include a dump of This would be valuable for upstream consumers of the data. We have some Federal partners who are pulling the CSVs as an import, and having to compute the It will require an update to the public pages (so we link to the combined CSV), but it should be a straightforward add.
This replaces the sling-based CSV dumper with a bash-based dumper.
This runs in ~15 minutes in production-like environments (compared to a few hours for the sling dumper).
This code will not run yet: it is not "wired into" any GH actions.
Future TODO
Testing
This can be tested locally. It was tested w/ @asteel-gsa and @jperson1 as reviewers in
`preview` and `staging`, and we see full E2E for dump->S3, and links route correctly from static.
PR Checklist: Submitter
`main` into your branch shortly before creating the PR. (You should also be merging `main` into your branch regularly during development.)
`git status | grep migrations`. If there are any results, you probably need to add them to the branch for the PR. Your PR should have only one new migration file for each of the component apps, except in rare circumstances; you may need to delete some and re-run `python manage.py makemigrations` to reduce the number to one. (Also, unless in exceptional circumstances, your PR should not delete any migration files.)
PR Checklist: Reviewer
`make docker-clean; make docker-first-run && docker compose up`; then run `docker compose exec web /bin/bash -c "python manage.py test"`
The larger the PR, the stricter we should be about these points.
Pre Merge Checklist: Merger
`-/+ resource "null_resource" "cors_header"` should be destroying and recreating itself, and `~ resource "cloudfoundry_app" "clamav_api"` might be updating its `sha256` for the `fac-file-scanner` and `fac-av-${ENV}` by default.
`main`.