
@jadudm jadudm commented Aug 26, 2025

This replaces the sling-based CSV dumper with a bash-based dumper.

This runs in ~15 minutes in production-like environments (compared to several hours for the sling dumper; about 7.5 hours in the run timed during review).

This code will not run automatically yet: it is not "wired into" any GH Actions.

Future TODO

  • Look at the GH Actions, and uncomment the bash script line for the weekly CSV export
  • Get rid of the matrix runner (the bash script handles all the years it needs to)
  • Decide if you want to dump more often (see the invoker cron)
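
A minimal sketch of how one script can cover every year itself, which is what makes the matrix runner unnecessary. The function name `year_ranges`, the calendar-year boundaries, and the example years are assumptions for illustration; the script in this PR may slice dates differently.

```bash
#!/usr/bin/env bash
set -eu

# Print one "start_date end_date" pair per calendar year, inclusive.
# Both arguments are hypothetical parameters, not values taken from the PR.
year_ranges() {
  local first="$1" last="$2"
  local year="$first"
  while [ "$year" -le "$last" ]; do
    printf '%s-01-01 %s-12-31\n' "$year" "$year"
    year=$((year + 1))
  done
}

# Walk the ranges the way a dumper might: one export pass per year.
year_ranges 2023 2025 | while read -r start_date end_date; do
  echo "would export ${start_date} .. ${end_date}"
done
```

With a loop like this inside the script, the GH Action only needs to invoke the script once instead of fanning out one job per year.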

Testing

This can be tested locally. It was tested in preview and staging with @asteel-gsa and @jperson1 as reviewers; we saw the full end-to-end path (dump -> S3) work, and the links from the static site route correctly.

PR Checklist: Submitter

  • Link to an issue if possible. If there’s no issue, describe what your branch does. Even if there is an issue, a brief description in the PR is still useful.
  • List any special steps reviewers have to follow to test the PR. For example, adding a local environment variable, creating a local test file, etc.
  • For extra credit, submit a screen recording like this one.
  • Make sure you’ve merged main into your branch shortly before creating the PR. (You should also be merging main into your branch regularly during development.)
  • Make sure you’ve accounted for any migrations. When you’re about to create the PR, bring up the application locally and then run git status | grep migrations. If there are any results, you probably need to add them to the branch for the PR. Your PR should have only one new migration file for each of the component apps, except in rare circumstances; you may need to delete some and re-run python manage.py makemigrations to reduce the number to one. (Also, unless in exceptional circumstances, your PR should not delete any migration files.)
  • Make sure that whatever feature you’re adding has tests that cover the feature. This includes test coverage to make sure that the previous workflow still works, if applicable.
  • Make sure the full-submission.cy.js Cypress test passes, if applicable.
  • Do manual testing locally. Our tests are not good enough yet to allow us to skip this step. If that’s not applicable for some reason, check this box.
  • Verify that no Git surgery was necessary, or, if it was necessary at any point, repeat the testing after it’s finished.
  • Once a PR is merged, keep an eye on it until it’s deployed to dev, and do enough testing on dev to verify that it deployed successfully, the feature works as expected, and the happy path for the broad feature area (such as submission) still works.
  • Ensure that prior to merging, the working branch is up to date with main and the terraform plan is what you expect.

PR Checklist: Reviewer

  • Pull the branch to your local environment and run make docker-clean; make docker-first-run && docker compose up; then run docker compose exec web /bin/bash -c "python manage.py test"
  • Manually test out the changes locally, or check this box to verify that it wasn’t applicable in this case.
  • Check that the PR has appropriate tests. Look out for changes in HTML/JS/JSON Schema logic that may need to be captured in Python tests even though the logic isn’t in Python.
  • Verify that no Git surgery is necessary at any point (such as during a merge party), or, if it was, repeat the testing after it’s finished.

The larger the PR, the stricter we should be about these points.

Pre Merge Checklist: Merger

  • Ensure that prior to approving, the terraform plan is what we expect it to be. By default, -/+ resource "null_resource" "cors_header" should be destroying and recreating itself, and ~ resource "cloudfoundry_app" "clamav_api" might be updating its sha256 for the fac-file-scanner and fac-av-${ENV}.
  • Ensure that the branch is up to date with main.
  • Ensure that a terraform plan has been recently generated for the pull request.

Matt Jadud added 3 commits August 23, 2025 07:33
This bash script replaces the use of Sling for CSV export.

Why?

1. Sling introduces upstream security concerns that we must constantly
   respond to.
2. The sling python library does not import well in our multi-platform
   (Mac, Linux, Windows) Docker stack.
3. The process is *slow*. Many hours?

This runs in 15m locally; ymmv in the cloud.

It takes a list of endpoints from an API, and does the following query
against each:

  local query="\COPY "
  query+="( SELECT * FROM ${API_VERSION}.${endpoint} "
  query+="  WHERE report_id in ( "
  query+="    SELECT report_id from ${API_VERSION}.general "
  query+="    WHERE fac_accepted_date >= '${start_date}' "
  query+="    AND fac_accepted_date <= '${end_date}' "
  query+=")) "
  query+="TO '${ROOT}/${endpoint}.csv' "
  query+="WITH (FORMAT CSV, HEADER, DELIMITER ',');"

This creates a CSV file in the filesystem that we then use the aws-cli
to ship over to S3. By putting the file in the correct place, the static
site can pick up the CSVs.
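
A minimal sketch of that export-then-upload flow for a single endpoint, assuming a dry-run default so nothing touches a database or S3 unless asked. The function names, `DRY_RUN`, `DATABASE_URL`, `BUCKET`, `PREFIX`, and all default values are hypothetical; only the query shape comes from the snippet above.

```bash
#!/usr/bin/env bash
set -eu

# Placeholder defaults for illustration only; the real values are not shown in this PR.
API_VERSION="${API_VERSION:-api_v1}"
ROOT="${ROOT:-/tmp/csv-export}"
BUCKET="${BUCKET:-example-bucket}"
PREFIX="${PREFIX:-csv}"

# Build the \COPY statement for one endpoint and date range.
# psql meta-commands must fit on a single line, so the pieces are concatenated.
build_copy_query() {
  local endpoint="$1" start_date="$2" end_date="$3"
  printf '%s' \
    "\\COPY ( SELECT * FROM ${API_VERSION}.${endpoint}" \
    " WHERE report_id IN (" \
    " SELECT report_id FROM ${API_VERSION}.general" \
    " WHERE fac_accepted_date >= '${start_date}'" \
    " AND fac_accepted_date <= '${end_date}'" \
    " )) TO '${ROOT}/${endpoint}.csv'" \
    " WITH (FORMAT CSV, HEADER, DELIMITER ',');"
}

# Export one endpoint to a local CSV, then copy the file to S3.
# With DRY_RUN=1 (the default here) the commands are only printed.
export_endpoint() {
  local endpoint="$1" start_date="$2" end_date="$3"
  local query
  query="$(build_copy_query "$endpoint" "$start_date" "$end_date")"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "psql: ${query}"
    echo "aws s3 cp ${ROOT}/${endpoint}.csv s3://${BUCKET}/${PREFIX}/${endpoint}.csv"
    return 0
  fi
  mkdir -p "$ROOT"
  psql "$DATABASE_URL" -v ON_ERROR_STOP=1 -c "$query"
  aws s3 cp "${ROOT}/${endpoint}.csv" "s3://${BUCKET}/${PREFIX}/${endpoint}.csv"
}

export_endpoint general 2024-01-01 2024-12-31
```

Because `\COPY` runs client-side, the CSV lands on the machine running `psql`, which is why a plain `aws s3 cp` afterwards is enough to publish it.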

Initial checks suggest this is exporting data in a manner consistent
with the previous tooling.

TODO:

- [ ] Test in `staging`
- [ ] Remove the sling tooling
- [ ] Modify the GH action to use this
@github-actions

Terraform plan for meta

No changes. Your infrastructure matches the configuration.

Terraform has compared your real infrastructure against your configuration
and found no differences, so no changes are needed.

📝 Plan generated in Pull Request Checks #161

@github-actions

Terraform plan for dev

Plan: 3 to add, 5 to change, 3 to destroy.
Terraform used the selected providers to generate the following execution
plan. Resource actions are indicated with the following symbols:
!~  update in-place
-/+ destroy and then create replacement

Terraform will perform the following actions:

  # module.dev.cloudfoundry_network_policy.app-network-policy will be updated in-place
!~  resource "cloudfoundry_network_policy" "app-network-policy" {
!~      policies = [
!~          {
!~              destination_app = "830bfc16-6865-4644-a3f3-5a1a69d6ec5f" -> (known after apply)
#                (3 unchanged attributes hidden)
            },
!~          {
!~              destination_app = "830bfc16-6865-4644-a3f3-5a1a69d6ec5f" -> (known after apply)
#                (3 unchanged attributes hidden)
            },
#            (1 unchanged element hidden)
        ]
    }

  # module.dev.cloudfoundry_network_policy.clamav-network-policy will be updated in-place
!~  resource "cloudfoundry_network_policy" "clamav-network-policy" {
!~      policies = [
!~          {
!~              source_app      = "830bfc16-6865-4644-a3f3-5a1a69d6ec5f" -> (known after apply)
#                (3 unchanged attributes hidden)
            },
!~          {
!~              source_app      = "d9fa2027-96c9-4f98-bdb8-13809c6f569a" -> (known after apply)
#                (3 unchanged attributes hidden)
            },
        ]
    }

  # module.dev.cloudfoundry_network_policy.scanner-network-policy will be updated in-place
!~  resource "cloudfoundry_network_policy" "scanner-network-policy" {
!~      policies = [
!~          {
!~              destination_app = "d9fa2027-96c9-4f98-bdb8-13809c6f569a" -> (known after apply)
#                (3 unchanged attributes hidden)
            },
#            (1 unchanged element hidden)
        ]
    }

  # module.dev.module.clamav.cloudfoundry_app.clamav_api must be replaced
-/+ resource "cloudfoundry_app" "clamav_api" {
+       buildpacks                      = (known after apply)
!~      created_at                      = "2025-08-12T17:37:02Z" -> (known after apply)
!~      docker_image                    = "ghcr.io/gsa-tts/fac/clamav@sha256:f0b490065d736a7c4151744e04683760418fe7a43ba9e14102b162305d94966a" -> "ghcr.io/gsa-tts/fac/clamav@sha256:3c7acdf614fba2604a5aaf4a015c803a02a6bb79cd68f1636577557a0a9384bf"
!~      enable_ssh                      = false -> (known after apply)
!~      health_check_type               = "port" -> (known after apply)
!~      id                              = "************************************" -> (known after apply)
!~      log_rate_limit_per_second       = "-1" -> (known after apply)
        name                            = "fac-av-dev"
!~      readiness_health_check_type     = "process" -> (known after apply)
!~      routes                          = [
-           {
-               protocol = "http1" -> null
-               route    = "fac-av-dev.apps.internal" -> null
            },
        ] -> (known after apply)
+       service_bindings                = (known after apply) # forces replacement
!~      stack                           = null -> (known after apply)
!~      updated_at                      = "2025-08-12T17:37:09Z" -> (known after apply)
#        (8 unchanged attributes hidden)
    }

  # module.dev.module.cors.null_resource.cors_header must be replaced
-/+ resource "null_resource" "cors_header" {
!~      id       = "*******************" -> (known after apply)
!~      triggers = { # forces replacement
!~          "always_run" = "2025-08-22T18:58:30Z" -> (known after apply)
        }
    }

  # module.dev.module.file_scanner_clamav.cloudfoundry_app.clamav_api must be replaced
-/+ resource "cloudfoundry_app" "clamav_api" {
+       buildpacks                      = (known after apply)
!~      created_at                      = "2025-08-12T17:39:11Z" -> (known after apply)
!~      docker_image                    = "ghcr.io/gsa-tts/fac/clamav@sha256:f0b490065d736a7c4151744e04683760418fe7a43ba9e14102b162305d94966a" -> "ghcr.io/gsa-tts/fac/clamav@sha256:3c7acdf614fba2604a5aaf4a015c803a02a6bb79cd68f1636577557a0a9384bf"
!~      enable_ssh                      = false -> (known after apply)
!~      health_check_type               = "port" -> (known after apply)
!~      id                              = "************************************" -> (known after apply)
!~      log_rate_limit_per_second       = "-1" -> (known after apply)
        name                            = "fac-av-dev-fs"
!~      readiness_health_check_type     = "process" -> (known after apply)
!~      routes                          = [
-           {
-               protocol = "http1" -> null
-               route    = "fac-av-dev-fs.apps.internal" -> null
            },
        ] -> (known after apply)
+       service_bindings                = (known after apply) # forces replacement
!~      stack                           = null -> (known after apply)
!~      updated_at                      = "2025-08-12T17:39:17Z" -> (known after apply)
#        (8 unchanged attributes hidden)
    }

  # module.dev.module.clamav.module.route.cloudfoundry_route.app_route will be updated in-place
!~  resource "cloudfoundry_route" "app_route" {
!~      destinations = [
-           {
-               app_id           = "830bfc16-6865-4644-a3f3-5a1a69d6ec5f" -> null
-               app_process_type = "web" -> null
-               id               = "e72b846c-1652-4ba8-ab3e-03ed29621a63" -> null
-               port             = 8080 -> null
-               protocol         = "http1" -> null
            },
+           {
+               app_id           = (known after apply)
+               app_process_type = (known after apply)
+               id               = (known after apply)
+               port             = (known after apply)
+               protocol         = (known after apply)
            },
        ]
        id           = "877bbc1f-e036-4a96-b7a3-70cff8c35c3c"
!~      updated_at   = "2025-08-12T17:37:10Z" -> (known after apply)
#        (6 unchanged attributes hidden)
    }

  # module.dev.module.file_scanner_clamav.module.route.cloudfoundry_route.app_route will be updated in-place
!~  resource "cloudfoundry_route" "app_route" {
!~      destinations = [
-           {
-               app_id           = "d9fa2027-96c9-4f98-bdb8-13809c6f569a" -> null
-               app_process_type = "web" -> null
-               id               = "ca09f24f-42c2-4f01-b2ce-d9dcec58286d" -> null
-               port             = 8080 -> null
-               protocol         = "http1" -> null
            },
+           {
+               app_id           = (known after apply)
+               app_process_type = (known after apply)
+               id               = (known after apply)
+               port             = (known after apply)
+               protocol         = (known after apply)
            },
        ]
        id           = "18f0f97d-c1a1-4628-8700-70fe47c19114"
!~      updated_at   = "2025-08-22T18:59:03Z" -> (known after apply)
#        (6 unchanged attributes hidden)
    }

Plan: 3 to add, 5 to change, 3 to destroy.

📝 Plan generated in Pull Request Checks #161


@asteel-gsa asteel-gsa left a comment


Approved.

Tested this on a call in both preview and staging. ~15 min run time, compared to 7.5h for the current CSV export.

@github-actions

Code Coverage

| Package | Line Rate | Branch Rate | Health |
| --- | --- | --- | --- |
| . | 100% | 100% | |
| api | 98% | 86% | |
| api.serializers | 97% | 88% | |
| api.views | 91% | 100% | |
| audit | 95% | 80% | |
| audit.cross_validation | 97% | 86% | |
| audit.fixtures | 84% | 50% | |
| audit.formlib | 36% | 0% | |
| audit.intakelib | 89% | 83% | |
| audit.intakelib.checks | 92% | 85% | |
| audit.intakelib.common | 98% | 82% | |
| audit.intakelib.transforms | 100% | 95% | |
| audit.management.commands | 78% | 17% | |
| audit.migrations | 100% | 100% | |
| audit.models | 91% | 68% | |
| audit.templatetags | 100% | 100% | |
| audit.views | 72% | 49% | |
| census_historical_migration | 96% | 65% | |
| census_historical_migration.migrations | 100% | 100% | |
| census_historical_migration.sac_general_lib | 92% | 84% | |
| census_historical_migration.transforms | 95% | 90% | |
| census_historical_migration.workbooklib | 68% | 69% | |
| config | 78% | 37% | |
| curation | 98% | 100% | |
| curation.curationlib | 88% | 72% | |
| curation.migrations | 100% | 100% | |
| dissemination | 89% | 69% | |
| dissemination.analytics | 27% | 0% | |
| dissemination.forms | 80% | 30% | |
| dissemination.migrations | 97% | 25% | |
| dissemination.models | 100% | 100% | |
| dissemination.report_generation | 21% | 0% | |
| dissemination.report_generation.excel | 32% | 0% | |
| dissemination.searchlib | 61% | 44% | |
| dissemination.templatetags | 48% | 0% | |
| dissemination.views | 67% | 44% | |
| djangooidc | 53% | 38% | |
| djangooidc.tests | 100% | 94% | |
| report_submission | 100% | 96% | |
| report_submission.migrations | 100% | 100% | |
| report_submission.templatetags | 74% | 100% | |
| report_submission.views | 78% | 61% | |
| support | 94% | 75% | |
| support.migrations | 100% | 100% | |
| support.models | 90% | 50% | |
| tools | 98% | 50% | |
| users | 95% | 86% | |
| users.fixtures | 100% | 83% | |
| users.management | 100% | 100% | |
| users.management.commands | 100% | 100% | |
| users.migrations | 100% | 100% | |
| Summary | 88% (21772 / 24649) | 69% (2662 / 3878) | |


jadudm commented Sep 12, 2025

Before this ships, if it is not too difficult, it would be good to include a dump of combined. Given that it is an MV, and it is a JOIN across tables that are guaranteed to be public (double check), it might be possible to do a SELECT * against the dissemination_combined table and export it wholesale as well.

This would be valuable for upstream consumers of the data. We have some Federal partners who are pulling the CSVs as an import, and having to compute the JOINs on their end is not fun. Given that we have already computed the join, if we export the table as part of this simpler dump process, it's a win for everyone.

It will require an update to the public pages (so we link to the combined CSV), but it should be a straight-forward add.
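
A rough sketch of what the wholesale export could look like, assuming the same `\COPY` approach as the per-endpoint dump but without the date filter. The function name `build_combined_query`, the output filename `combined.csv`, and the output directory are hypothetical; whether every column in `dissemination_combined` is safe to publish still needs the double check mentioned above.

```bash
#!/usr/bin/env bash
set -eu

# Build a single-line \COPY statement that exports an entire table or view.
# The table name and output directory are passed in; neither is fixed by this PR.
build_combined_query() {
  local table="$1" out_dir="$2"
  printf '%s' \
    "\\COPY ( SELECT * FROM ${table} )" \
    " TO '${out_dir}/combined.csv'" \
    " WITH (FORMAT CSV, HEADER, DELIMITER ',');"
}

# Print the statement so it can be reviewed before being handed to psql.
build_combined_query dissemination_combined /tmp/csv-export
echo
```

The resulting file could then be shipped to S3 alongside the per-endpoint CSVs, with the public pages linking to it.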
