This repository was archived by the owner on Nov 5, 2018. It is now read-only.

Conversation

@klown

@klown klown commented Jul 10, 2018

@cindyli Here is a pull request for the changes to data loading that

  • runs a Docker image in which the latest version of the universal snapsets is used,
  • runs a script that deletes the current snapsets and their keys from the database,
  • loads the latest snapsets and their keys into the database.

This requires my GPII-3138 branch of universal and its associated pull request, which I will now go create. I'll add a comment here when it's ready.

Here's the link to the universal pull request: GPII/universal#626

klown added 10 commits June 29, 2018 16:18
… reload them.

First checkin:  Replaced loadData.sh with deleteAndLoadSnapsets.js.  This is a
partial reimplementation of the load script and currently does only:
1. Find the current snapset Prefs Safes in the database.
2. Find the Prefs Safes' associated GPII Keys.
3. Delete the above records from the database.

Still to come:  loading the current snapsets as before (and polishing
of code).
Fixed an errant true that should not have been a string.
Added more logging to see why doBatchDelete() is called before all of
the GPII Keys have been added to the 'docsToRemove' array.  The bug is
that only the snapset Prefs Safes are deleted.  Their GPII Keys are
added to the 'docsToRemove' array after that deletion.  I'm not sure
why, but suspect it has to do with some kind of asynchrony in the http
responses.  Also, I have not found the condition to test for when all
the GPII Key records have been added.
This version uses a DELETE URL to delete each GPII Key as it is
fetched using the findGpiiKeysByPrefsSafeId view with the Prefs Safe
ID.  Then the snapset Prefs Safes are deleted in bulk.
Changed name of script as it only deletes the old snapset Prefs
Safes and their associated GPII Key records.  The bash script
still loads the latest snapsets into the database.
- Cleaned up logging in deleteSnapsets.js,
- Modified Dockerfile re:  new name of script "deleteAndLoadSnapsets.sh",
- Modified deleteAndLoadSnapsets.sh regarding NODE_PATH for node to use
Moved the delete script to universal.
Modified the way that universal is set up within the Docker image
in order that the latest version of universal is used.  This is
done by cloning from github when the image is run, not when
the image is built.  Also, this means that the preferences are
converted to Prefs Safes and GPII Keys at "run-time", not
"build-time".
@cindyli
Contributor

cindyli commented Jul 10, 2018

Please update README.

httpTests.js probably can be removed if it's a test script.

log "Working directory: `pwd`"

# Set up universal
git clone https://github.com/GPII/universal.git
Contributor


Thanks for making this change by dynamically pulling the universal repo at every run.

Author


Thanks @cindyli.

You're right, httpTests.js was just an investigative script to see how the http module worked, especially with CouchDB URLs. I've removed it.

I've also updated the README.

# Set up universal
git clone https://github.com/GPII/universal.git
cd universal
rm -f package-lock.json
Contributor


This line is no longer necessary as package-lock.json has been removed from the universal repo.

Author


Right; however, the subsequent npm install commands create a package-lock.json. I don't think it matters since the point is to set up a clone of universal and use it to load and modify the database. That is, I don't think package-lock.json affects the database operations. But, I'm not 100% sure, so, for the time being I've moved the rm command to after the npm install commands.
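In other words, the setup now looks roughly like this (a sketch; the exact npm install steps may differ):

```
# Set up universal
git clone https://github.com/GPII/universal.git
cd universal
npm install                # note: (re)creates package-lock.json
rm -f package-lock.json    # remove it after the installs, not before
```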


STATIC_DATA_DIR=${STATIC_DATA_DIR:-/home/node/universal/testData/dbData}
BUILD_DATA_DIR=${BUILD_DATA_DIR:-/home/node/universal/build/dbData}
NODE_PATH=${NODE_PATH:-/home/node/universal}
Contributor


I am wondering whether a var name like UNIVERSAL_PATH would be better than NODE_PATH, since this path is defined for accessing scripts/ in the universal repo. What do you think?

Contributor

@mrtyler mrtyler Jul 12, 2018


I agree that NODE_PATH isn't very clear.

It's a little ambiguous because "universal" can mean so many things, but I like UNIVERSAL_DIR or maybe UNIVERSAL_ROOT_DIR (_PATH isn't consistent with how the other variables are named).

Author


There is a way to eliminate the NODE_PATH variable entirely. Related: I discovered that since the deleteSnapsets.js script uses infusion, an npm install infusion is required here. Simply cloning universal does not provide an instance of infusion within its node_modules folder. If infusion is installed that way, and a cd universal is executed after the clone (already done), that's enough for the node command to find infusion. No need for NODE_PATH.
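A sketch of that setup (assuming infusion is the only extra module the script needs):

```
# Set up universal (sketch)
git clone --depth 1 https://github.com/GPII/universal.git
cd universal
npm install infusion    # provides node_modules/infusion for deleteSnapsets.js
# With the working directory inside the clone, node resolves
# require("infusion") against ./node_modules -- no NODE_PATH needed.
```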

README.md Outdated

```
$ docker run --name dataloader --link couchdb -v /home/vagrant/sync/universal/testData/dbData:/static_data -e STATIC_DATA_DIR=/static_data -v /home/vagrant/sync/universal/build/dbData:/build_data -e BUILD_DATA_DIR=/build_data -e COUCHDB_URL=http://couchdb:5984/gpii -e CLEAR_INDEX=1 gpii/gpii-dataloader
$ docker run --name dataloader --link couchdb -v /home/vagrant/sync/universal/testData/dbData:/static_data -e STATIC_DATA_DIR=/static_data -v /home/vagrant/sync/universal/build/dbData:/build_data -e BUILD_DATA_DIR=/build_data -e COUCHDB_URL=http://couchdb:5984/gpii [-e CLEAR_INDEX=1] -v /home/vagrant/sync/universal:/universal -e NODE_PATH=/universal gpii/gpii-dataloader
```
Contributor


I think [-e CLEAR_INDEX=1] is meant to indicate that this parameter is optional (hence the square brackets). Was this the intent?

I think this is going to confuse users, because when I copy/paste the line I get this unhelpful error:

docker: invalid reference format.
See 'docker run --help'.

I suggest removing the [-e CLEAR_INDEX=1] clause from this command and mentioning it in a bullet point afterward, or listing the complete command twice -- once with and once without the [-e CLEAR_INDEX=1].

Author


Good idea @mrtyler. I've taken the second approach, showing the two ways to run the image: one with CLEAR_INDEX set, and another with it not set.
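For reference, the two forms look roughly like this (abbreviated here; the volume mounts are as in the full README example above):

```
# Flush the database first, then load:
$ docker run --name dataloader --link couchdb -e COUCHDB_URL=http://couchdb:5984/gpii -e CLEAR_INDEX=1 gpii/gpii-dataloader

# Load into the existing database:
$ docker run --name dataloader --link couchdb -e COUCHDB_URL=http://couchdb:5984/gpii gpii/gpii-dataloader
```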

log "Working directory: `pwd`"

# Set up universal
git clone https://github.com/GPII/universal.git
Contributor


I realize you are just moving what was there before (and I probably should have caught it when reviewing the change when Cindy made it ;)), but please add --depth 1. It will make the clone run faster, and will result in a (slightly) smaller Docker image.
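That is, something like:

```
git clone --depth 1 https://github.com/GPII/universal.git
```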

Author


Sure. Done.

Modified in response to Cindy and Tyler's comments.
@mrtyler
Contributor

mrtyler commented Jul 16, 2018

LGTM.

I'd like to chat about the deployment plan for this and GPII/universal#626 before we merge.

klown added 2 commits July 20, 2018 14:00
Removed old log message regarding NODE_PATH.
Modified bash script in light of changes to universal's "deleteAndLoadSnapsets.js"
node script.
@klown
Author

klown commented Jul 23, 2018

@cindyli @mrtyler
I noticed that the convertPrefs.js script does not use the BUILD_DATA_DIR variable. In fact, it never has. Instead, the literal "build/dbData" is/was passed.

That suggests using build/dbData and testData/dbData when calling deleteAndLoadSnapsets.js, and that neither STATIC_DATA_DIR nor BUILD_DATA_DIR is needed by deleteAndLoadSnapsets.sh. Either that, or convertPrefs.js should be using them.

If STATIC_DATA_DIR and BUILD_DATA_DIR are unnecessary here, then their use by the gpii/gpii-dataloader Docker image within universal's vagrantCloudBasedContainers.sh is unnecessary. I've checked and they are only used in vagrantCloudBasedContainers.sh when running the gpii-dataloader Docker image. What do you think about eliminating these environment variables for gpii-dataloader? It removes the issue of their not being properly set, if they aren't actually used.

Aside: it looks like two copies of universal are in use by vagrantCloudBasedContainers.sh:

Should there be two copies of universal? I'm too much of a newbie with respect to Docker to judge. Perhaps this is how it's supposed to work. It does make gpii-dataloader self-contained, using its own clone of universal.

@cindyli
Contributor

cindyli commented Jul 23, 2018

@klown, the convenience of having STATIC_DATA_DIR and BUILD_DATA_DIR is that, at the development stage, when developers want to load data from a local in-dev universal branch, they can simply supply these environment vars, as you pointed out in vagrantCloudBasedContainers.sh, without re-building the gpii-dataloader docker image. In the case these env vars are not set, gpii-dataloader will use the default converted prefs data it carries. It would be nice if gpii-dataloader could continue to support these env vars. The loading logic would be:

  1. If env vars are set, read converted prefs data from them;
  2. If env vars are set incorrectly, for example, converted prefs data cannot be found in these dirs, report an error;
  3. If env vars are not set, run this section in deleteAndLoadSnapsets.sh
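Something like this, perhaps (a rough sketch; the checks are illustrative):

```
# Sketch of the proposed loading logic (illustrative only)
if [ -n "$STATIC_DATA_DIR" ] && [ -n "$BUILD_DATA_DIR" ]; then
    if [ ! -d "$STATIC_DATA_DIR" ] || [ ! -d "$BUILD_DATA_DIR" ]; then
        echo "ERROR: converted prefs data not found in the supplied dirs" >&2
        exit 1
    fi
    # ... load the converted prefs data from the supplied dirs ...
else
    # env vars not set: clone universal and use its default data
    git clone --depth 1 https://github.com/GPII/universal.git
fi
```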

What do you think?

Yes, as you found out, we do have two copies of universal used by two docker images in vagrantCloudBasedContainers.sh. It would be great if you could find a solution to consolidate them.

@mrtyler
Contributor

mrtyler commented Jul 24, 2018

Re: *_DATA_DIR variables: The pattern Cindy described -- if $FOO is set in the environment, use that value; otherwise use this default -- is common and useful.

Re: dueling universals: I agree it is not awesome and has the potential to cause confusion ("why isn't dataloader picking up my local changes?"). I'm pretty sure Cindy and I discussed alternatives when designing the current strategy and didn't come up with anything we loved, but things have changed a bit so maybe we should revisit the conversation.

@klown
Author

klown commented Jul 24, 2018

Re: *_DATA_DIR variables: The pattern Cindy described -- if $FOO is set in the environment, use that value; otherwise use this default -- is common and useful.

Fair enough. Then why doesn't (didn't) the convertPrefs.js script use them? Why does it use hard-coded paths instead?

More generally, I think the use of universal in this docker image is (possibly) twofold. The primary one is to have access to the latest codebase -- the scripts run here are in "universal/scripts", after all. A way to get at the scripts, node modules, etc. is to put a clone of universal in the docker image, and use the code therein.

A second use of universal is to have access to the latest preferences, snapsets, and credentials, or "data". However, I can see that a developer might want to provide that data from somewhere else; hence using environment variables as an override of the default values. Thanks for clarifying that @cindyli.

Regarding this loading logic:

  1. If env vars are set, read converted prefs data from them;
  2. If env vars are set incorrectly, for example, converted prefs data cannot be found in these dirs, report an error;
  3. If env vars are not set, run this section in deleteAndLoadSnapsets.sh

I get the first two, and will adjust the script as appropriate. But the third, "If env vars are not set", is puzzling. The way the variables are declared at the top of the script entails that they must be set. They are either set in the environment, or they are set to default values. They are never not set.
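For example, the declarations at the top of the script use bash default expansion:

```
STATIC_DATA_DIR=${STATIC_DATA_DIR:-/home/node/universal/testData/dbData}
# After this line, STATIC_DATA_DIR is always non-empty: either the value
# from the environment, or the default path.
```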

Also, running the section that clones and sets up universal is still needed, as I noted above, for the code. That is independent of where the data is. Which, oddly, leads back to "dueling universals"...

Re: dueling universals: ...

If the docker image had access to the latest universal code outside of the image, then it wouldn't need a clone of universal for the code. But before we go down this potential rabbit hole, is solving the dueling universals issue a priority for the Aug release?

klown added 2 commits July 24, 2018 13:42
Bash script now checks that the static and build data folders
exist and contain data.  If not, the script exits with an error
message.
Modified in light of changes to universal where both "user" and
"snapset" Prefs Safes are created.  The gpii-dataloader docker image
updates only "snapset" Prefs Safes.
Contributor

@mrtyler mrtyler left a comment


Quick reminder that I'd like to discuss deployment before we merge this work.

fi

log "Starting"
log "CouchDB: $COUCHDB_URL"
Contributor


Please do not log (unsanitized) COUCHDB_URL as it may contain credentials.

(I see that the code was doing so before, but I'd appreciate if you could fix it while you're in here.)

I see two obvious fixes:

  1. Remove these log lines.

  2. Sanitize COUCHDB_URL into COUCHDB_URL_SANITIZED and use the latter for logging. I think the following sed is pretty bulletproof:

$ echo 'http://user:pass@couchdb:5984.couchdb.badger' | sed -e 's,\(://\)[^/]*\(@\),\1<SENSITIVE>\2,g'
http://<SENSITIVE>@couchdb:5984.couchdb.badger

Author


Thanks for catching that. I sanitized a similar log message in the related node script (in universal) at an earlier suggestion of yours, but didn't likewise here. I'll use your sed command as you suggest.
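Something along these lines in the bash script (a sketch using the suggested sed):

```
# Sanitize the CouchDB URL before logging (strips user:pass credentials)
COUCHDB_URL_SANITIZED=$(echo "$COUCHDB_URL" | sed -e 's,\(://\)[^/]*\(@\),\1<SENSITIVE>\2,g')
log "CouchDB: $COUCHDB_URL_SANITIZED"
```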


# Initialize (possibly clear) database
if [ ! -z "$CLEAR_INDEX" ]; then
log "Deleting database at $COUCHDB_URL"
Contributor


Ditto.

fi
fi

log "Creating database at $COUCHDB_URL"
Contributor


Ditto.

Removed sensitive information from CouchDB url when logged to the
console.
@klown
Author

klown commented Jul 31, 2018

@mrtyler

Quick reminder that I'd like to discuss deployment before we merge this work.

Is GPII-3192 part of this discussion? The gpii-infra repository uses gpii-dataloader where a "chart" template always sets gpii-dataloader's CLEAR_INDEX parameter to true. I don't know enough about what this use of gpii-dataloader is doing to determine:

  1. if CLEAR_INDEX should always be true, and
  2. if not, under what conditions it is set to true or false.

@mrtyler
Contributor

mrtyler commented Jul 31, 2018

The sed change LGTM. Thanks.

The deployment strategy for dataloader was based on its old behavior, which you have changed (read: improved :)) drastically. The need for CLEAR_INDEX is one of the things that has changed -- that flag should no longer be used during (ordinary) deploys to production. So yes, GPII-3192 is part of the discussion :).

klown added 2 commits July 31, 2018 16:37
Updated to handle demo 'user' Prefs Safes.
Updated README.md based on changes due to demo user preferences.
@mrtyler
Contributor

mrtyler commented Aug 1, 2018

LGTM

klown added 3 commits August 5, 2018 15:51
Modified to not bail if the provided static, build, or demo user
preferences directories are non-existent or do not contain data.
Each directory is checked individually and if invalid, the script
uses the version of the directory within the cloned universal
repository.
Modified to use curl command within bash to load just the static
data.  Loading static data using the "deleteAndLoadSnapsets.js"
node script results in a permissions error, whereas using bash/curl
does not.  But, also, the node script is for modifying snapsets,
not "updating" the static data.
Followup to previous commit:  removed STATIC_DATA_DIR argument from
call to deleteAndLoadSnapsets.js node script.
Merged changes from upstream master GPII branch.
Moved loading/updating of static data to universal's deleteAndLoadSnapsets.js
script.
Bash script loads universal from klown's GPII-3138 branch.
NB:  This needs to change to gpii master universal branch when
merged.
Removed lines that had to do with the 20 empty user preferences,
and "Roy" and "Quint", as they are handled elsewhere.
@mrtyler
Contributor

mrtyler commented Sep 12, 2018

Note to self: last review here. (Joseph, everything still looks fine.)

Fixed a case of logging sensitive information from the CouchDB URL
(username and password).
Modified to exit early with a non-zero status if the
deleteAndLoadSnapsets.js script fails.
Modified to exit early with a non-zero status if the
deleteAndLoadSnapsets.js script fails.
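A sketch of that pattern (hypothetical invocation; the actual arguments to the node script may differ):

```
# Run the snapset update; bail out if it fails so the job reports an error.
node deleteAndLoadSnapsets.js "$COUCHDB_URL" "$BUILD_DATA_DIR"
if [ "$?" -ne 0 ]; then
    log "ERROR: deleteAndLoadSnapsets.js failed"
    exit 1
fi
```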
@mrtyler
Contributor

mrtyler commented Sep 21, 2018

Reviewed to here.

Updated the README.md to remove references to uploading "user"
preferences (PrefsSafes and their GPII Keys) as the dataloader
is no longer responsible for that.

@stepanstipl stepanstipl left a comment


@klown I did have a quick look and I think I see one problem with this approach: every time the dataloader job runs, it will have to clone the github repository and install the npm dependencies. This will be slow, and prone to issues - what happens if GitHub is unreachable, or if one of the npm dependencies can't be installed? Also, what about versioning - if we clone master, it will be different each time, depending on when the dataloader was run.

I would say that everything from cloning the repo to installing the npm modules (included) should happen when we build the docker image.

I'm happy to help with this.

Also, since we're cloning the universal repo here, and basically the only content of this repo is the deleteAndLoadSnapsets.sh script, what would you think about moving that script to universal and reusing the universal container? I'm not familiar with the history and/or reasons behind this split, but it seems like we could save ourselves some duplication of effort.

Update: I've looked into the universal Dockerfile and it looks like the image already has all the npm dependencies installed here, so that should work imho.

@mrtyler @cindyli I saw the duplication issue was briefly discussed above, but I didn't see any conclusion. What do you think about moving the deleteAndLoadSnapsets.sh script to universal and reusing the universal image (it already has everything)?

log "Working directory: `pwd`"

# Set up universal
git clone --depth 1 --branch GPII-3138 https://github.com/klown/universal.git


This is pointing to the .../klown/.. repository; it should be changed before merging.

Author


@stepanstipl One of the requirements of the new data loader is that it run whenever the snapset preferences inputs change or the '_design/views' input changes. That happens when either of those changes is committed to universal (master). If universal were cloned only when the dataloader docker image is built, then it would have a fixed version of the preferences and views, and it wouldn't always update CouchDB with the latest. That was the rationale for moving the clone operation inside the image so it happens every time the dataloader runs. That guarantees that the latest snapsets and views will be used.

However, that may not be the only way. One way would be to move cloning to the docker image build step. If the dataloader's docker image was re-built (and rerun) every time a snapset/views change is committed, that would be okay. But that isn't much of an improvement. It suffers the same slowness and github and npm issues you mentioned.

Another thought: Is the universal Docker image re-built every time there is a commit to universal? If so, that may be when the dataloader updates CouchDB. Also, the code here could be moved into universal's codebase.

The earlier discussion about duplicating universal was not resolved because we were trying to get the new dataloader ready for the August demo. So, we punted the duplication issue for the time being. Looks like it's time to reconsider.

Thanks for pointing out that the clone points to the klown repository. I said something along those lines at last week's architecture meeting, but it's good to repeat that point. There are analogous issues in my universal pull request, where vagrantCloudBasedContainers.sh references my docker repository, not gpii's.


@klown Thank you Joseph for explaining the state of things. On the infra side of things, currently the data loader is triggered only if there's a new image (that's how it works), so it would have to be rebuilt every time anyway (or we would have to change the mechanism).

I think moving this code to universal solves all of these issues:

  • we get rid of the duplication
  • we get rid of cloning the repo and installing npm modules when the container is executed
  • data loader will always have the new universal code (as it will be running in the universal image)
  • data loader will get triggered every time there's a change to universal (as the universal image will get updated)

What would you say to moving deleteAndLoadSnapsets.sh into universal, probably under /scripts, as part of GPII/universal#626 (and also including the README bits somewhere you see fit)? I would then try to modify and test the infrastructure side of things.

Author

@klown klown Oct 17, 2018


@stepanstipl I have no objection to moving everything into universal, but I would like @cindyli's and @mrtyler's input. Also, I'm wrestling with some questions/issues:

  • what code is moved into universal? Is it just the deleteAndLoadSnapsets.sh script? What happens to the Dockerfile and docker container? Do they just vanish? Is its functionality replaced by running something inside the universal image?
  • the vagrantCloudBasedContainers.sh script in universal will change if the dataloader docker image is no more. It's not clear what takes its place. Maybe just execute deleteAndLoadSnapsets.sh from within vagrantCloudBasedContainers.sh?
  • If the dataloader docker container vanishes, does pull 83 vanish as well? Its purpose is to set CLEAR_INDEX so the dataloader does not flush the database when it runs. If pull 83 is no longer needed, what takes its place -- what sets CLEAR_INDEX?
  • in the kubernetes cluster, the current dataloader docker image runs to completion and then terminates. If the status code is zero on termination, it stays terminated. However, if its status code is non-zero, it restarts and runs again. This repeats until the dataloader succeeds. This is different compared to, say, the flowmanager which is deployed and keeps on going. A reason why the dataloader fails is a race condition where CouchDB is also starting up and isn't ready for any request. The dataloader fails, but starts again, and eventually succeeds when CouchDB is ready.

I'm thinking about these issues and if I have any insights, I'll write more here. Thanks.


@stepanstipl stepanstipl Oct 17, 2018


Thanks @klown, to answer your questions:

what code is moved into universal? Is it just the deleteAndLoadSnapsets.sh script? What happens to the Dockerfile and docker container? Do they just vanish? Is its functionality replaced by running something inside the universal image?

Just deleteAndLoadSnapsets.sh and whatever portions of the README are relevant. The Dockerfile is not needed, as there's already one for universal that seems to contain all the needed dependencies.

the vagrantCloudBasedContainers.sh script in universal will change if the dataloader docker image is no more. It's not clear what takes its place. Maybe just execute deleteAndLoadSnapsets.sh from within vagrantCloudBasedContainers.sh?

Yes, calling the deleteAndLoadSnapsets.sh directly sounds like the correct solution since they'll all be in the same place already.

If the dataloader docker container vanishes, does pull 83 vanish as well? Its purpose it to set CLEAR_INDEX so the dataloader does not flush the database when it runs. If pull 83 is no longer needed, what takes its place -- what sets CLEAR_INDEX?

https://github.com/gpii-ops/gpii-infra/pull/83/files will not vanish; we will still set the env variable. Just the image used in the infra part (shared/charts/gpii-dataloader/templates/job.yaml) will be the universal one, and we will call deleteAndLoadSnapsets.sh as a command. I'll update that PR once we agree on the move. We will keep the dataloader batch Job, it'll just use the universal image.

in the kubernetes cluster, the current dataloader docker image runs to completion and then terminates. If the status code is zero on termination, it stays terminated. However, if its status code is non-zero, it restarts and runs again. This repeats until the dataloader succeeds. This is different compared to, say, the flowmanager which is deployed and keeps on going. A reason why the dataloader fails is a race condition where CouchDB is also starting up and isn't ready for any request. The dataloader fails, but starts again, and eventually succeeds when CouchDB is ready.

This behaviour will stay. We will still have the dataloader batch Job, just this job will use the universal image.

Since universal dataloader pull was merged into universal master,
updated to reference that repository.
@stepanstipl

stepanstipl commented Oct 19, 2018

@klown I took the liberty of doing a quick first take on the dataloader move: the first commit imports deleteAndLoadSnapsets.sh and README.md from this PR, in the state they were left here, into universal (as scripts/deleteAndLoadSnapsets.sh and documentation/DataLoader.md respectively).

Please have a look @klown @cindyli and let's continue the discussion there -> GPII/universal#696
