DRIVERS-2917 - Standardized Performance Testing of ODMs and Integrations #1828
@@ -0,0 +1 @@
{"field1":"miNVpaKW","field2":"CS5VwrwN","field3":"Oq5Csk1w","field4":"ZPm57dhu","field5":"gxUpzIjg","field6":"Smo9whci","field7":"TW34kfzq","field8":55336395,"field9":41992681,"field10":72188733,"field11":46660880,"field12":3527055,"field13":74094448}
format
### Benchmark Server

The MongoDB ODM Performance Benchmark must be run against a standalone MongoDB server running the latest stable database
I think we can open up this to be a standalone or a replica set with a size of 1. (This is because some ODMs leverage transactions)
Using a replica set of size 1 makes more sense here, agreed.
### Benchmark placement and scheduling

The MongoDB ODM Performance Benchmark should be placed within the ODM's test directory as an independent test suite. Due
I still think we should leave an option for folks to create their own benchmarking repo if that helps out. I'm open to others' take on this one, seeing as I worry about maintainers not wanting a benchmark repo.
We don't agree that they should be in the tests directory but haven't ruled out including them in the ODM. For the purposes of getting the spec done, I wonder if requiring the ODM to document the location of the test suite is enough. If not, I would definitely remove the "test directory" requirement and make it "should be placed within the ODM". I think that is enough to make it clear that the goal is to have the perf tests included in the ODM.
I don't think a separate benchmark repo is a good choice here. We could reach out to existing maintainers and see if they want to weigh in, but I imagine having a separate repo for benchmarking is more trouble than it's worth.
I worry that ODMs might not be receptive to a large addition of performance tests to their repository. The ticket makes it sound like we (DBX) planned to run these tests ourselves, probably in a CI we maintain:
> The testing criteria would be documented in a human readable form (such as either a docs page or a markdown file), and once benchmarks have been developed we would run these against each new notable release of the client library. Providing well documented benchmarks will hopefully also encourage the developer community to contribute additional tests to further improve coverage.
I don't see any mention of where these tests will live in the scope, either.
Why do we plan on contributing spec tests to ODM repos, instead of creating a pipeline similar to the AI testing pipeline? Or just integrating perf testing within drivers' existing performance testing? We already have the test runners and infrastructure to run these ourselves. And to @ajcvickers's point, we already have dedicated performance test hosts in evergreen that are stable and isolated from other hosts in CI.
I don't believe there was any concrete plan one way or the other at the time the ticket and scope were created.
In my view, there are a few fundamental differences between the libraries being tested here versus for AI integrations.
- Many ODMs are or are planned to be first-party libraries rather than contributions to third-party AI frameworks.
- The AI space moves extremely rapidly and broken CI/CD or testing suites are extremely common. Both factors were significant motivators in the creation of our AI testing pipeline. Those motivations don't seem to exist here.
- AI frameworks tend to have several to dozens of integrations all housed within a single repo, each with their own dependencies and tests. Third-party ODMs are more often standalone repos with far less complexity in this manner, so adding a single additional test file for performance testing is much less significant.
What would integrating perf testing within the existing drivers perf testing look like? Would all of the ODM benchmarks live in a separate repo, with each driver cloning and using the specific subdirectory that contains the ODMs they want to test?
Using the same skeleton of test runners and infrastructure for the ODM testing makes it very easy to get these tests up and running without polluting the existing drivers tests.
Django, Entity Framework, and Hibernate are all first-party ODMs either in development or recently released.
We also have Spring Data MongoDB, Laravel, and Doctrine.
First party meaning "we own the repo" or "we contributed the code"? I was going with the former.
Spring Data MongoDB and Doctrine we don't own the repo, true. Still, I expect the number of first-party ODMs to continue to grow.
Okay, that's more evenly split between "we control the changes that go in" and "others control the changes that go in" than I originally thought.
I still am not sure that putting the tests in ODM repos makes sense, unless the goal is to integrate testing into the ODM's processes (even for repos we do not control). My primary concern here is that I anticipate difficulties getting buy-in from external stakeholders for these tests (speaking from experience contributing to integrations, both AI and ODMs, in the JS ecosystem). Also, I'd like to understand the goal here, because if it's for our own understanding of performance of ODMs, putting the tests in a repo we do not maintain or one that doesn't use evergreen raises a lot of unnecessary questions:
What do we do if a maintainer doesn't want / pushes back on these tests? What happens if they break? When should the tests be run? Who handles regressions? What is the triage process for potential regressions and how are flakiness/false positives handled? How do maintainers / we ensure builds are stable without dedicated perf testing hosts in CI? How do we expect each repository to set up the dedicated testing cluster (Mongoose pushed back against drivers-evergreen-tools)? etc.
Here's the scope doc, which covers the motivations of this work: https://docs.google.com/document/d/1GCle2vTQLdoSaDJJXyXeXYqtcAfymr8pM5oyV4gSI4A/edit?tab=t.0#heading=h.b1os3ai9s8t3.
Integrating testing into ODM processes is preferable for both visibility and maintenance reasons. Users will likely be more comfortable using a library with very public and integrated performance tests, and having all testing for an ODM live within a single repo streamlines maintenance work. Having the performance tests be integrated also shows a higher level of accountability and transparency, especially if we end up adding performance tests that directly compare against Postgres or other SQL databases.
That said, I agree that maintainers refusing to let us add the perf test suite to a third-party repo puts us in a difficult spot. One option would be a split approach: first-party ODMs have performance tests within their own repos, third-party ODMs have theirs in an `odm-testing-pipeline` repo explicitly for that purpose. Then if maintainers tell us that they'd actually prefer to have the performance tests inside the ODM repo directly, we can migrate that suite out of the `odm-testing-pipeline` repo.
Do any third-party ODMs already have robust performance testing that we would be competing with? What are the most common reasons we've gotten for pushback against similar work being contributed in the past?
to the relatively long runtime of the benchmarks, including them as part of an automated suite that runs against every
PR is not recommended. Instead, scheduling benchmark runs on a regular cadence is the recommended method of automating
this suite of tests.
Per your suggestion earlier, we should include some new information about testing mainline use cases.
As discussed earlier in this document, ODM feature sets vary significantly across libraries. Many ODMs have features
unique to them or their niche in the wider ecosystem, which makes specifying concrete benchmark test cases for every
possible API unfeasible. Instead, ODM authors should determine what mainline use cases of their library are not covered
by the benchmarks specified above and expand this testing suite with additional benchmarks to cover those areas.
This section is attempting to specify that ODMs should implement additional benchmark tests to cover mainline use cases that do not fall into those included in this specification. One example would be the use of Django's `in` filter operator: `Model.objects.filter(field__in=["some_val"])`.
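For illustration, here's a minimal sketch (not part of the spec) of what such an ODM-specific benchmark task could look like in Django. The `SmallDoc` model, its fields, and the module path are assumptions; the timing harness is omitted.

```python
# Hypothetical Django benchmark task exercising the `in` filter operator.
from myapp.models import SmallDoc  # assumed model mapped to the SMALL_DOC dataset


def benchmark_filter_in():
    """One iteration: query documents whose field1 matches a fixed set of values."""
    values = ["miNVpaKW", "CS5VwrwN", "Oq5Csk1w"]
    # list() forces the queryset to execute so the full query cost is measured.
    return list(SmallDoc.objects.filter(field1__in=values))
```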
### Benchmark Server

The MongoDB ODM Performance Benchmark must be run against a MongoDB replica set of size 1 running the latest stable
database version without authentication or SSL enabled.
Are we concerned at all about accounting for performance variation due to server performance differences? In the drivers, we keep the server version patch-pinned and upgrade rarely and intentionally via independent commits in order to ensure that our performance testing results are meaningful and are only reflective of the changes in the system under test (the driver, or, in this case, the ODM). If the goal is only to track the performance of ODMs relative to each other and relative to the corresponding drivers, is the intention to have the drivers also implement these tests against the latest server so that we could get that apples-to-apples comparison?
> Are we concerned at all about accounting for performance variation due to server performance differences?

From the Django implementation:

> This is NOT intended to be a comprehensive test suite for every operation, only the most common and widely applicable

@NoahStapp and @Jibola are working on this project for DBX Python (although I am reviewing the implementation PR), so this is just a drive-by comment from me, but my impression is that the spec is at least initially focused on getting all the ODMs to agree on what to test.

> In the drivers, we keep the server version patch-pinned and upgrade rarely and intentionally via independent commits in order to ensure that our performance testing results are meaningful and are only reflective of the changes in the system under test (the driver, or, in this case, the ODM). If the goal is only to track the performance of ODMs relative to each other and relative to the corresponding drivers, is the intention to have the drivers also implement these tests against the latest server so that we could get that apples-to-apples comparison?

One more drive-by comment: I'd expect each ODM to "perform well" under similar server circumstances (testing the driver is a good call out!) but I'm not sure apples-to-apples is the goal. If other ODMs test their performance using the spec and can demonstrate "good performance" and/or catch performance issues they would otherwise have missed, that would indicate some measure of success to me in the spec design.
I chose `latest stable` server version here for the following reason: we've made server performance an explicit company-wide goal. When users experience performance issues on older server versions, one of the first things we recommend is that they upgrade to a newer version. At least in the Python driver, we only run performance tests against 8.0. Using the latest stable version ensures that our performance tests always take advantage of any server improvements and isolate performance issues in the ODM or underlying driver.
Implementing these same tests in the driver for a direct apples-to-apples comparison is a significant amount of work. Several of the tests here use similar datasets as the driver tests for easier comparison, so using the same version of the server as the driver tests to reduce differences could be useful.
> Using the latest stable version ensures that our performance tests always take advantage of any server improvements and isolate performance issues in the ODM or underlying driver.
I think we should be careful about our goals here: if it is to take advantage of any server improvements and track performance explicitly relative to the most current server performance, then this approach is fine. However, this approach will not isolate performance issues in the ODM or driver because: 1) server performance is not guaranteed to always improve in every release for every feature: the overall trends of the server performance for most features will hopefully keep moving up, but between releases there may be "acceptable" regressions to certain features that are considered a tradeoff to an improvement in another area, and 2) server performance improvements could mask ODM regressions that happen concurrently with the server upgrade. We should be explicit about accepting both of these risks if we are going to move forward with this approach (i.e., note this somewhere in the spec text).
Good callouts. What if we test the benchmarks against both the latest stable version as well as the latest major release? Currently that would be 8.1 and 8.0, for example. That would give us a yearly cadence of upgrading that should allow us to catch server regressions without blindly masking ODM regressions.
Good point about the stability. If we see a perf regression (or improvement), we then have to consider whether we actually made things worse (or better) or if we happened to run on a newer server version that had different perf characteristics. We have correctness tests against different server versions. I don't think there is value in testing the server's performance in our ODM tests. Thus I would suggest we choose 8.0.13 (latest stable as of today) and make an explicit choice to update it on an annual cadence.
The main advantage of testing against rapid server versions is query performance improvements. Since ODMs necessarily construct database queries for the user, they don't have any control over what's actually sent to the server barring a feature like `raw_aggregate` that allows them to specify the actual query itself. With the server improving query performance and optimization (for example, `$in` inside `$expr` using indexes starting in 8.1: SERVER-32549), it's possible we run into situations where the best way to improve performance is for a user to upgrade their server version. Some of these, such as using `$expr` where it's not necessary, can be fixed with ODM code improvements, but that isn't a guarantee. Being able to tell users that upgrading to the latest rapid release will improve performance for their use case could be helpful, but I can see the downside of testing an additional server version besides latest stable.
Overall looking good. The most pressing concerns are around the percentile calculation and picking a stable server version to test against.
- Sort the array into ascending order (i.e. shortest time first)
- Let the index i for percentile p in the range [1,100] be defined as: `i = int(N * p / 100) - 1`

*N.B. This is the [Nearest Rank](https://en.wikipedia.org/wiki/Percentile#The_Nearest_Rank_method) algorithm, chosen for
The `#The_Nearest_Rank_method` anchor should be `#The_nearest-rank_method`.

- Given a 0-indexed array A of N iteration wall clock times
- Sort the array into ascending order (i.e. shortest time first)
- Let the index i for percentile p in the range [1,100] be defined as: `i = int(N * p / 100) - 1`
Given that the maximum iteration count is 10 (see line 109 above), the 90th, 95th, 98th, and 99th percentiles will all be A[8] since `int(float)` truncates the float.
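A quick sketch of the nearest-rank calculation as specified, demonstrating the collapse with only 10 samples (the sample times below are made up):

```python
# Nearest-rank percentile exactly as written in the spec: i = int(N * p / 100) - 1.
def nearest_rank(times, p):
    """Return the p-th percentile (p in [1, 100]) of a list of iteration times."""
    data = sorted(times)  # ascending: shortest time first
    return data[int(len(data) * p / 100) - 1]


sample = [0.5, 0.6, 0.7, 0.8, 0.9, 1.0, 1.1, 1.2, 1.3, 1.4]  # N = 10
# 90th, 95th, 98th, and 99th percentiles all resolve to index 8 due to truncation.
assert all(nearest_rank(sample, p) == sample[8] for p in (90, 95, 98, 99))
```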
As noted in the Wikipedia article, fewer than 100 measurements will result in the same value being reported for multiple percentiles.
Good points--this whole section is copied from the existing driver benchmark spec for consistency, which raises the question of whether we should (as a separate ticket) update that spec as well. I would say yes, to keep both benchmarking specs as consistent in behavior and design as possible.
The data size, shape, and specific operation of a benchmark are the limiting factors for how many iterations are ultimately run. We expect most of the tests to run more than 100 iterations in the allotted time, but the more expensive ones don't. Have we historically actually used these percentiles, or do we plan to in the future? From my experience, at least the Python team primarily uses the MB/s metric to identify regressions. If this is a consistent pattern across teams and continues to be, recording this additional data doesn't seem useful.
Unless otherwise specified, the number of iterations to measure per task is variable:

- iterations should loop for at least 30 seconds cumulative execution time
- iterations should stop after 10 iterations or 1 minute cumulative execution time, whichever is shorter
Those two conditions seem to be working at cross purposes. The measurement should loop for at least 30 seconds but not more than 60, but stop after 10 iterations. This caps the number of iterations at 10, possibly fewer if each iteration takes longer than 6 seconds.
This is confusing on my part (also taken from the driver benchmarking spec).
The intent is to have a 30 second minimum execution time with a 120 second execution time cap. Once the minimum time is reached, we stop executing the benchmark once it reaches 120 seconds of execution time or once at least 10 iterations have completed.
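Here is a minimal sketch of one reading of that intent, assuming "cumulative execution time" means the sum of iteration wall-clock times; the function and parameter names are illustrative, not from the spec:

```python
import time


def run_iterations(task, min_time=30.0, max_time=120.0, iteration_target=10):
    """Keep iterating until at least `min_time` seconds of cumulative execution
    time has accumulated; after that, stop once `iteration_target` iterations
    have completed or the `max_time` cap is reached, whichever comes first."""
    durations = []
    while True:
        start = time.monotonic()
        task()
        durations.append(time.monotonic() - start)
        elapsed = sum(durations)  # cumulative execution time across iterations
        if elapsed >= min_time and (len(durations) >= iteration_target or elapsed >= max_time):
            break
    return durations
```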
The data will be stored as strict JSON with no extended types. These JSON representations must be converted into
equivalent models as part of each benchmark task.

Flat model benchmark tasks include:s
Extraneous `s` at the end of the line.

| Phase | Description |
| ----- | ----------- |
| Setup | Load the SMALL_DOC dataset into memory as an ODM-appropriate model object. Insert 10,000 instances into the database, saving the inserted `id` field for each into a list. |
Is this the `_id`?
Yes. I'll update the wording to clarify since ODM naming conventions for the document `_id` will vary.
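As a rough sketch of that setup phase in a Django-style ODM (the `SmallDoc` model, the `small_doc.json` filename, and the `pk`-to-`_id` mapping are assumptions, not spec requirements):

```python
# Hypothetical flat-model Setup phase: parse SMALL_DOC from strict JSON, insert
# 10,000 model instances, and record each inserted document's _id.
import json

from myapp.models import SmallDoc  # assumed model for the SMALL_DOC dataset


def setup_flat_model(path="small_doc.json", count=10_000):
    with open(path) as f:
        fields = json.load(f)  # strict JSON, no extended types
    inserted_ids = []
    for _ in range(count):
        doc = SmallDoc(**fields)
        doc.save()
        inserted_ids.append(doc.pk)  # assumed to map to the document _id
    return inserted_ids
```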
Summary: This benchmark tests ODM performance creating a single large model.

Dataset: The dataset (LARGE_DOC) is contained within `large_doc.json` and consists of a sample document stored as strict
JSON with an encoded length of approximately 8,000 bytes.
8,000 bytes is still relatively small. Do we want to have perf tests for huge documents close to the 16MB limit? While we may not recommend large models, customers will run into these scenarios especially if their models contain large arrays of subdocuments.
Close to the 16MB limit seems excessive and would increase execution time significantly. Increasing the size here to be a few MB, similar to what the driver benchmarks use for their large document tests, would likely result in similar performance characteristics without as large of a latency hit. The downside to increasing the size of documents here is that we need to define the data's structure carefully to not significantly complicate the process of model creation for implementing ODMs, which is not a concern for the driver benchmarks.
- Nested models -- reading and writing nested models of various sizes, to explore basic operation efficiency for complex
  data

The suite is intentionally kept small for several reasons. One, ODM feature sets vary significantly across libraries.
May prefer a bulleted list here, e.g.:
The suite is intentionally kept small for the following reasons:
- ODM feature sets vary …
- Several popular MongoDB ODMs are maintained by third-parties …
LGTM pending clarification of whether "in the repo" can replace "in the repo's test dir".

We expect substantial performance differences between ODMs based on both their language families (e.g. static vs.
dynamic or compiled vs. virtual-machine-based) as well as their inherent design (e.g. web frameworks such as Django vs.
application-agnostic such as Mongoose). However we still expect "vertical" comparison within families of ODMs to expose
I don't think it is worthwhile to compare different ODMs to each other. The performance of ODMs doing different types of things varies widely based on the approach taken by the ODM, as opposed to anything the provider/adapter for Mongo is doing.
I do think it could be valuable to compare a given ODM with Mongo to that same ODM but with a similar (e.g. Cosmos) database, and a different (e.g. PostgreSQL) database. Whether or not this will show differences in the client is dependent on many things. For example, in .NET making the PostgreSQL provider faster is measurable because the data transfer and server can keep up. On the other hand, making the SQL Server provider faster makes no difference, because the wire protocol and server blocking is already the limiting factor.
It may also be useful to test raw driver to ODM perf, especially since customers often ask about this. However, in most cases the performance overhead will come from the core ODM code, rather than anything we are doing, so I doubt that there will be much actionable to come out of this.
Comparing ODMs to each other could be useful in identifying potential design or implementation issues. For example, if one ODM implements embedded document querying in an inefficient way, comparing its performance on a benchmark to a similar ODM with much better performance could unlock improvements that would be difficult to identify otherwise. Outside of that specific case, I agree that ODM comparisons are not very useful.
Comparing performance across databases is an interesting idea. Django did apples-to-apples comparisons with benchmarks against both MongoDB and Postgres and got a lot of useful data out of that. ODMs make doing so relatively easy as only the backend configuration and models (for differences like embedded documents and relational links) need to change. We'd need to be careful to isolate performance differences to the database alone as much as possible, due to all the factors you state.
Comparing raw driver to ODM perf is part of the stated goals of this project. Determining exactly which benchmarks should be directly compared is still under consideration, for both maintainability and overhead concerns.
to the relatively long runtime of the benchmarks, including them as part of an automated suite that runs against every
PR is not recommended. Instead, scheduling benchmark runs on a regular cadence is the recommended method of automating
this suite of tests.
Do we have a dedicated, isolated perf lab, with machines that won't get changes unless we know about it? My experience with perf testing over many years is that unless you have such a system, then the noise makes it very difficult to see when things really change. For example, OS updates, platform/language changes, virus checking, auto-updates kicking in mid run, and so on, all make the data hard to interpret.
How do you currently handle driver perf test machines? Can you point me to charts, or even raw data I guess, that show variation/noise over time? Also, how often do they run? Is there only a single change between each run so that it's feasible to trace back a perf difference to a single change, be that external or a code change?
Here's an example of what the Python driver perf tests output. The driver perf tests have experienced all of the issues you've stated, but still provide useful metrics that let us catch regressions and identify places for improvement. Running on Evergreen doesn't allow us (AFAIK) to have our own dedicated set of machines.
The Python driver perf tests run weekly.
Drivers do have access to a special host in evergreen to run dedicated performance tasks on to ensure stability and consistency (`rhel90-dbx-perf-large`).
Python Django implementation: mongodb/django-mongodb-backend#366.