Skip to content

Conversation

@lkirk
Copy link
Owner

@lkirk lkirk commented Aug 9, 2025

Final optimizations result. Let's see how it builds on windows, etc.

lkirk added 30 commits August 8, 2025 14:26
This PR implements the C and Python API for computing two-way two-locus
statistics. The algorithm is identical to the python version, except
during testing I uncovered a small issue with normalisation. We need to
handle the case where sample sets are of different sizes. The fix for
this was to average the normalisation factor for each sample set.

Test coverage has been added to cover C, low-level python and some
high-level tests.
This PR implements two-way LD statistics, specified between sample sets.
During the development of this functionality, a number of issues with
the designation of state_dims/result_dims were discovered. These have
been resolved, testing clean for existing code and providing the proper
behavior for this new code.

The mechanism by which users will specify a multi-population (or
two-way) statistic is by providing the `index` argument. This helps us
avoid creating another `ld_matrix` method for the TreeSequence object.

In other words, for a one-way statistic, a user would specify:
```python
ts.ld_matrix(stat="D2", sample_sets=[[ss1, ss2]])
```
Which would output a 3D ndarray containing one LD matrix per sample set.

```python
ts.ld_matrix(stat="D2", sample_sets=[[ss1, ss2]], indexes=[(0, 1)])
```
Which would output a 2D ndarray containing one LD matrix for the index
pair. This would use our `D2_ij_summary_func`, instead of the
`D2_summary_func`. Finally, if a user provided
```python
ts.ld_matrix(stat="D2", sample_sets=[[ss1, ss2]], indexes=[(0, 1), (1, 1)])
```
We would output a 3D ndarray containing one LD matrix _per_ index pair
provided.

Since these are two-way statistics, the indexes must be length 2. We
plan on enabling users to implement k-way via a "general_stat" api. We did
not implement anything more than two-way statistics here because of the
combinatoric explosion of logic required for indexes > 2.

I added some basic tests to demonstrate that things were working
properly. If we compute two-way statistics on identical sample sets,
they should be equal to the one-way statistics. Unfortunately, this does
not apply to unbiased statistics, which I've validated manually.

I've also cleaned up the docstrings a bit and fixed a bug with the
D_prime statistic, which should not be weighted by haplotype frequency.
I had to make tsk_treeseq_two_locus_count_stat public to test it. This
is a precursor to creating the general two locus stat api.
This PR implements the C and Python API for computing two-way two-locus
statistics. The algorithm is identical to the python version, except
during testing I uncovered a small issue with normalisation. We need to
handle the case where sample sets are of different sizes. The fix for
this was to average the normalisation factor for each sample set.

Test coverage has been added to cover C, low-level python and some
high-level tests.
lkirk added 28 commits August 8, 2025 14:40
This PR implements the C and Python API for computing two-way two-locus
statistics. The algorithm is identical to the python version, except
during testing I uncovered a small issue with normalisation. We need to
handle the case where sample sets are of different sizes. The fix for
this was to average the normalisation factor for each sample set.

Test coverage has been added to cover C, low-level python and some
high-level tests.
Compute A/B counts upfront for each sample set. We were computing them
redundantly each for each site pair. This is performed in the new
function get_mutation_sample_sets, which takes the samples for each site
and computes the sample for each site for each sample set. During this
operation, we compute the number of samples containing the given allele
for each site. Add a non-normalized version of
compute_general_two_site_stat_result for situations where we're
computing stats from biallelic loci. For site statistics, we do not
convert sample sets to bitsets, opting to defer that action to
get_mutation_sample_sets. We should consider doing this for branch stats
as well, but it will be trickier. Finally, produce an optimized version
of r2_summary_function to reduce the number of divisions we're doing.
Ultimately, the issue we were running into was memory latency when
accessing elements within the state_row. This has been mitigated by
providing the entire expression to the compiler, allowing instruction
level parallelism to compute values as the memory we depend on becomes
available. The expression we've chosen also reduces the amount of
memory dependencies, reducing the latency of these computations. This
all works in the case of a single sample set, tests are not passing for
multiple sample sets. I'll need to streamline the computation of site
offsets to handle this, I first wanted to see how much of an improvement
we would see from these optimizations.
@lkirk lkirk closed this Sep 12, 2025
@lkirk lkirk deleted the two-locus-optimizations branch October 2, 2025 06:44
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants