forked from tskit-dev/tskit
-
Notifications
You must be signed in to change notification settings - Fork 0
Two locus optimizations #6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Closed
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This PR implements the C and Python API for computing two-way two-locus statistics. The algorithm is identical to the python version, except during testing I uncovered a small issue with normalisation. We need to handle the case where sample sets are of different sizes. The fix for this was to average the normalisation factor for each sample set. Test coverage has been added to cover C, low-level python and some high-level tests.
This PR implements two-way LD statistics, specified between sample sets. During the development of this functionality, a number of issues with the designation of state_dims/result_dims were discovered. These have been resolved, testing clean for existing code and providing the proper behavior for this new code. The mechanism by which users will specify a multi-population (or two-way) statistic is by providing the `index` argument. This helps us avoid creating another `ld_matrix` method for the TreeSequence object. In other words, for a one-way statistic, a user would specify: ```python ts.ld_matrix(stat="D2", sample_sets=[[ss1, ss2]]) ``` Which would output a 3D ndarray containing one LD matrix per sample set. ```python ts.ld_matrix(stat="D2", sample_sets=[[ss1, ss2]], indexes=[(0, 1)]) ``` Which would output a 2D ndarray containing one LD matrix for the index pair. This would use our `D2_ij_summary_func`, instead of the `D2_summary_func`. Finally, if a user provided ```python ts.ld_matrix(stat="D2", sample_sets=[[ss1, ss2]], indexes=[(0, 1), (1, 1)]) ``` We would output a 3D ndarray containing one LD matrix _per_ index pair provided. Since these are two-way statistics, the indexes must be length 2. We plan on enabling users to implement k-way via a "general_stat" api. We did not implement anything more than two-way statistics here because of the combinatoric explosion of logic required for indexes > 2. I added some basic tests to demonstrate that things were working properly. If we compute two-way statistics on identical sample sets, they should be equal to the one-way statistics. Unfortunately, this does not apply to unbiased statistics, which I've validated manually. I've also cleaned up the docstrings a bit and fixed a bug with the D_prime statistic, which should not be weighted by haplotype frequency.
I had to make tsk_treeseq_two_locus_count_stat public to test it. This is a precursor to creating the general two locus stat api.
…intermediate data
This PR implements the C and Python API for computing two-way two-locus statistics. The algorithm is identical to the python version, except during testing I uncovered a small issue with normalisation. We need to handle the case where sample sets are of different sizes. The fix for this was to average the normalisation factor for each sample set. Test coverage has been added to cover C, low-level python and some high-level tests.
This PR implements the C and Python API for computing two-way two-locus statistics. The algorithm is identical to the python version, except during testing I uncovered a small issue with normalisation. We need to handle the case where sample sets are of different sizes. The fix for this was to average the normalisation factor for each sample set. Test coverage has been added to cover C, low-level python and some high-level tests.
Compute A/B counts upfront for each sample set. We were computing them redundantly each for each site pair. This is performed in the new function get_mutation_sample_sets, which takes the samples for each site and computes the sample for each site for each sample set. During this operation, we compute the number of samples containing the given allele for each site. Add a non-normalized version of compute_general_two_site_stat_result for situations where we're computing stats from biallelic loci. For site statistics, we do not convert sample sets to bitsets, opting to defer that action to get_mutation_sample_sets. We should consider doing this for branch stats as well, but it will be trickier. Finally, produce an optimized version of r2_summary_function to reduce the number of divisions we're doing. Ultimately, the issue we were running into was memory latency when accessing elements within the state_row. This has been mitigated by providing the entire expression to the compiler, allowing instruction level parallelism to compute values as the memory we depend on becomes available. The expression we've chosen also reduces the amount of memory dependencies, reducing the latency of these computations. This all works in the case of a single sample set, tests are not passing for multiple sample sets. I'll need to streamline the computation of site offsets to handle this, I first wanted to see how much of an improvement we would see from these optimizations.
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Final optimizations result. Let's see how it builds on windows, etc.