Two locus optimizations #6

lkirk · 2025-08-09T15:56:20Z

Final result of the optimizations. Let's see how it builds on Windows, etc.

This PR implements the C and Python API for computing two-way two-locus statistics. The algorithm is identical to the Python version, except that during testing I uncovered a small issue with normalisation: we need to handle the case where sample sets are of different sizes. The fix was to average the normalisation factor for each sample set. Test coverage has been added for the C code, the low-level Python code, and some high-level tests.
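The description above does not spell out the averaging formula, so the following is only a sketch of one plausible reading: a haplotype-weighted factor pooled across the two sample sets. The function names and the pooled form are assumptions for illustration, not the code in this PR.

```python
from fractions import Fraction


def one_way_hap_weight(n_ab, n):
    """Haplotype weight for a single sample set: carriers of AB over set size."""
    return Fraction(n_ab, n)


def pooled_hap_weight(n_ab_i, n_i, n_ab_j, n_j):
    """ASSUMED two-way form: average the weight over both sample sets by
    pooling their AB counts and sizes, so unequal sizes are handled."""
    return Fraction(n_ab_i + n_ab_j, n_i + n_j)


# Identical sample sets: the two-way factor collapses to the one-way factor.
same = pooled_hap_weight(2, 6, 2, 6)
# Different sizes: the factor sits between the two per-set weights.
mixed = pooled_hap_weight(2, 6, 3, 4)
```

With identical sample sets the pooled factor reduces to the one-way factor, which is the property the tests in this PR rely on for the non-unbiased statistics.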

This PR implements two-way LD statistics, specified between sample sets. During the development of this functionality, a number of issues with the designation of state_dims/result_dims were discovered. These have been resolved: existing code tests clean and the new code behaves properly.

Users specify a multi-population (or two-way) statistic by providing the `indexes` argument. This helps us avoid creating another `ld_matrix` method for the TreeSequence object. In other words, for a one-way statistic, a user would specify:

```python
ts.ld_matrix(stat="D2", sample_sets=[[ss1, ss2]])
```

which would output a 3D ndarray containing one LD matrix per sample set. For a two-way statistic, a user would specify:

```python
ts.ld_matrix(stat="D2", sample_sets=[[ss1, ss2]], indexes=[(0, 1)])
```

which would output a 2D ndarray containing one LD matrix for the index pair. This would use our `D2_ij_summary_func` instead of the `D2_summary_func`. Finally, if a user provided

```python
ts.ld_matrix(stat="D2", sample_sets=[[ss1, ss2]], indexes=[(0, 1), (1, 1)])
```

we would output a 3D ndarray containing one LD matrix _per_ index pair provided.

Since these are two-way statistics, each index tuple must be of length 2. We plan on enabling users to implement k-way statistics via a "general_stat" API. We did not implement anything beyond two-way statistics here because of the combinatorial explosion of logic required for index tuples longer than 2.

I added some basic tests to demonstrate that things are working properly. If we compute two-way statistics on identical sample sets, they should be equal to the one-way statistics. Unfortunately, this does not apply to unbiased statistics, which I've validated manually.

I've also cleaned up the docstrings a bit and fixed a bug with the D_prime statistic, which should not be weighted by haplotype frequency.
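The identical-sample-set check described above can be sketched without tskit. The snippet assumes that the two-way D2 is the product of the per-set D values (D_i * D_j) and that sites are biallelic; the helper names are hypothetical and do not mirror `D2_summary_func` or `D2_ij_summary_func`.

```python
from fractions import Fraction


def d_stat(haplotypes):
    """D = p_AB - p_A * p_B for a list of (a, b) 0/1 allele pairs."""
    n = len(haplotypes)
    p_ab = Fraction(sum(1 for a, b in haplotypes if a and b), n)
    p_a = Fraction(sum(a for a, _ in haplotypes), n)
    p_b = Fraction(sum(b for _, b in haplotypes), n)
    return p_ab - p_a * p_b


def d2_one_way(haplotypes):
    """One-way D2 within a single sample set."""
    return d_stat(haplotypes) ** 2


def d2_two_way(haplotypes_i, haplotypes_j):
    """ASSUMED two-way D2: product of D in set i and D in set j."""
    return d_stat(haplotypes_i) * d_stat(haplotypes_j)


# Toy haplotypes at a pair of sites, one (a, b) tuple per sample.
ss1 = [(1, 1), (1, 0), (0, 1), (0, 0), (1, 1), (0, 0)]
ss2 = [(1, 1), (1, 1), (0, 0), (1, 0)]

same_sets = d2_two_way(ss1, ss1)   # index pair (0, 0)
cross_sets = d2_two_way(ss1, ss2)  # index pair (0, 1)
```

Exact fractions are used so the equality between the two-way and one-way values can be asserted without floating-point tolerance.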

I had to make tsk_treeseq_two_locus_count_stat public to test it. This is a precursor to creating the general two locus stat api.

Compute A/B counts upfront for each sample set. We were computing them redundantly for each site pair. This is performed in the new function get_mutation_sample_sets, which takes the samples for each site and computes the samples for each site within each sample set. During this operation, we also compute the number of samples containing the given allele for each site.

Add a non-normalized version of compute_general_two_site_stat_result for situations where we're computing stats from biallelic loci.

For site statistics, we do not convert sample sets to bitsets, opting to defer that action to get_mutation_sample_sets. We should consider doing this for branch stats as well, but it will be trickier.

Finally, produce an optimized version of r2_summary_function to reduce the number of divisions we're doing. Ultimately, the issue we were running into was memory latency when accessing elements within the state_row. This has been mitigated by providing the entire expression to the compiler, allowing instruction-level parallelism to compute values as the memory we depend on becomes available. The expression we've chosen also reduces the number of memory dependencies, reducing the latency of these computations.

This all works in the case of a single sample set, but tests are not passing for multiple sample sets. I'll need to streamline the computation of site offsets to handle this. I first wanted to see how much of an improvement we would see from these optimizations.
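As a rough illustration of the two ideas above, the sketch below precomputes per-site carrier masks and counts once per sample set, then evaluates r2 for a biallelic site pair with a single division. Python integers stand in for bitsets, all names are hypothetical, and the algebraic form shown is one way to reduce divisions, not necessarily the expression used in the C code.

```python
from fractions import Fraction


def to_mask(sample_ids):
    """Pack sample ids into an integer used as a bitset."""
    mask = 0
    for s in sample_ids:
        mask |= 1 << s
    return mask


def popcount(mask):
    """Number of set bits in the mask."""
    return bin(mask).count("1")


def precompute_site_counts(site_samples, sample_sets):
    """Intersect every site with every sample set once, up front.

    Returns masks[k][s] (carriers of site s inside sample set k) and
    counts[k][s] (how many carriers that is).
    """
    site_masks = [to_mask(s) for s in site_samples]
    set_masks = [to_mask(ss) for ss in sample_sets]
    masks = [[sm & ss for sm in site_masks] for ss in set_masks]
    counts = [[popcount(m) for m in row] for row in masks]
    return masks, counts


def r2_single_division(n, n_ab, n_a, n_b):
    """r2 = (n*n_AB - n_A*n_B)^2 / (n_A (n - n_A) n_B (n - n_B))."""
    denom = n_a * (n - n_a) * n_b * (n - n_b)
    if denom == 0:
        return None  # undefined when a site is monomorphic in the set
    num = n * n_ab - n_a * n_b
    return Fraction(num * num, denom)


def r2_naive(n, n_ab, n_a, n_b):
    """Textbook form with one division per frequency, for comparison."""
    p_ab, p_a, p_b = Fraction(n_ab, n), Fraction(n_a, n), Fraction(n_b, n)
    d = p_ab - p_a * p_b
    return d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))


site_samples = [[0, 1, 2], [0, 1, 2, 3]]  # carriers of each site's allele
sample_sets = [[0, 1, 2, 3, 4, 5]]
masks, counts = precompute_site_counts(site_samples, sample_sets)
n_ab = popcount(masks[0][0] & masks[0][1])  # only the pair count is per-pair
r2 = r2_single_division(6, n_ab, counts[0][0], counts[0][1])
```

The single-site counts are reused for every pair that involves the site, so only the intersection count has to be computed inside the pairwise loop.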

lkirk added 30 commits August 8, 2025 14:26

initial implementation; need to work on output dims/scalar sample set…

96525ee

…s and indexes

get rid of disjoint check

bef9b8c

fix normalisation and a small indexing bug. dimension tests

b3f61b9

lints

0ad3e67

fix logic error for mode testing

698a1f7

fix norm_hap_weighted_ij. needs to get wAB instead of using indices

13c9398

Create tests with pragmatic test coverage

5788f70

I had to make tsk_treeseq_two_locus_count_stat public to test it. This is a precursor to creating the general two locus stat api.

trivial cleanup

0e14329

small lint grr

48b9b36

lowlevel test coverage

8a47521

concat is not compatible with numpy<2

5502e9d

remove unreachable code path (we switch on the presence of indexes)

e3bb8f5

just error early if node stat requested

bdcdeaf

better docstring for norm func

5328faa

remove todo

a39dc70

remove todo

8f6f89b

remove todo

d31ca45

move allocations out of the hot path; create a structure to hold the …

87d92aa

…intermediate data

fix ruff formatting

01498e2

oops forgot to resolve all conflicts

4da9492

Fix another ruff formatting change that snuck by

7d084be

a few more things that snuck by merge conflict resolution

944334f

fix norm_hap_weighted_ij. needs to get wAB instead of using indices

ab60e50

concat is not compatible with numpy<2

81aef93

remove unreachable code path (we switch on the presence of indexes)

0dac45d

better docstring for norm func

f918f80

remove todo

0c16992

lkirk added 28 commits August 8, 2025 14:40

fix norm_hap_weighted_ij. needs to get wAB instead of using indices

ca5cca0

concat is not compatible with numpy<2

87ac1f7

remove unreachable code path (we switch on the presence of indexes)

9291f9d

better docstring for norm func

7b60d4b

remove todo

d57ad92

remove todo

f125781

fix normalisation and a small indexing bug. dimension tests

029e620

k should be unsigned

ac8646d

remove todo

0c56d6d

first pass at bit array refactor

ec28f2f

initial implementation; need to work on output dims/scalar sample set…

88b6a88

…s and indexes

fix normalisation and a small indexing bug. dimension tests

eb55592

k should be unsigned

da3b0ab

fix norm_hap_weighted_ij. needs to get wAB instead of using indices

41aa9b7

concat is not compatible with numpy<2

6c29c80

better docstring for norm func

3fb5523

remove todo

58b866c

remove todo

43d38b6

first pass at bit array refactor

73671bf

clean up unused routines

6502758

fix stats for >1 sample set; need a dup sample check

05c89e5

add dup sample check; now all tests pass

89b75fe

merge cleanup

bd4c9ff

zero out whole row, not just one element

20bf634

fix sample dup checking; wrong tmp row; not using index map

a8a6c22

clean commit to remove noise; will get on PR

9bb79fe

lkirk closed this Sep 12, 2025

lkirk deleted the two-locus-optimizations branch October 2, 2025 06:44
