Add H5 GenomicFeatures support for more flexible target datatypes (#200)
* selene changes for methylation model
* remove f1 metric
* bugfix and some temporary profiling
* adjustments to loss breakdown and logging
* adjust return type for get_features_data
* bugfix for get_feature_data
* remove event object init
* initial changes to h5 dataloader
* methylation sampler excl
* add type specification to unpackbits
* change loss
* experimenting with loss
* add pearsonr
* fix metrics NaN removal bug
* changes to sampler for positives-only hdf5
* loss weighting
* adjust methylation perf metric output checking
* revert multisampler and add spearmanr to training log
* explicit metric fn checking, refine later
* minor adjustment to dataloader
* non strand specific temp changes
* positives only sampler
* revamp dataloader, seq length flexibility
* attempted a nonstrandspecific utils module, not being used
* evaluation classes
* revert non strand specific module
* trying to figure out unet dataloader changes
* trying to figure out unet dataloader changes - add tgt shift
* remove print debug statements
* address memory issue in eval
* comment
* troubleshooting dataloader shift
* add strand arg to _retrieve
* shift testing
* adjust casting of targets in dataloader
* minor changes to eval and sampling
* remove unused code in nonstrandspecific module
* remove indicators from dataloader, clean up methylation performance metrics
* remove unused files from version control, adjust multi sampler get dataset batches function
* remove commented out code in shift sections
* remove ind commented out code in multisampler
* add excl chr optional arg
* make adjustments to nonstrandspecific, performancemetrics, for what can be generalized from the methylation specific code
* remove files that we have merged the functionality into existing classes
* clean up indicator code that is no longer used
* integrate changes for methylation prediction in training and metrics
* incorporate changes from previous PR on config yaml and model file saving
* update pytorch version constraint
* variable name fix for copying model file / directory
* remove train methylation model class
* fix bug in random positions sampler with exclude_chrs, overload targets_path in both sampler classes
* remove unused method in genomicfeaturesh5
* adjust strand vs feature (target) column ordering assumptions in tabix-indexed BED file
* adjust descriptions for target classes
* minor adjustments to docstrings
* tuple output handling for non strand specific
* line breaks for formatting in performance metrics file
* adjust CLI config parsing so that a copy of the input config file is made and saved to the output dir
* update versioning
* add new dependency
* adjustment in
* update release notes for 0.6.0
* adjustment for sampling at the end of N_samples
* adjust expected BED file format when strand is included for seqweaver script
* adjust method in seqweaver script to accept lr input and new config loading function
RELEASE_NOTES.md: 20 additions & 0 deletions
@@ -2,6 +2,26 @@
This is a document describing new functionality, bug fixes, breaking changes, etc. associated with Selene version releases from v0.5.0 onwards.
## Version 0.6.0
- `config_utils.py`: Selene now saves additional information on each run. Specifically, it records the version of Selene that the latest run used, makes a copy of the input configuration file, and saves this along with the model architecture file in the output directory. This adds a new dependency to Selene, the package `ruamel.yaml`.
- `H5Dataloader` and `_H5Dataset`: Previously, `H5Dataloader` took a number of arguments that it used to initialize `_H5Dataset` internally. One major change in this version is that users must now initialize `_H5Dataset` explicitly and pass it to `H5Dataloader` as an argument. This makes the two classes consistent with the PyTorch specifications for the `Dataset` and `DataLoader` classes, so they are compatible with the data parallelization configurations supported by PyTorch and the PyTorch Lightning framework.
- `_H5Dataset` class initialization optional arguments:
  - `unpackbits` can now be specified separately for sequences and targets by way of `unpackbits_seq` and `unpackbits_tgt`.
  - `use_seq_len` subsets each sequence in the dataset to its center `use_seq_len` bases.
  - `shift` (particularly when paired with `use_seq_len`) retrieves sequences shifted from the center position by `shift` bases. Note that `shift` currently only shifts in one direction, determined by whether you pass in a positive or negative integer.
- `GenomicFeaturesH5`: A new targets class for continuous-valued targets that are stored in an HDF5 file and retrieved by genomic coordinate. As before, genomic regions are stored in a tabix-indexed .bed file; the main change is that the BED file now specifies, for each genomic region, the index of the row in the HDF5 matrix that contains all the target values to predict. If multiple target rows are returned for a query region, the average of those rows is returned.
- `RandomPositionsSampler`:
  - `exclude_chrs`: A new optional argument. The default, `exclude_chrs=['_']`, excludes all nonstandard chromosomes by ignoring every chromosome with an underscore in its name. Pass in a list of chromosome names or substrings to exclude. When loading possible sampling positions, the class iterates through the `exclude_chrs` list, checks for each substring `s` whether `s in chrom`, and if so skips that chromosome entirely.
  - The internal function `_retrieve` now takes an optional argument `strand` (default `None`) to enable explicit retrieval of the sequence at `chrom, position` for a specific strand. The default behavior of the `RandomPositionsSampler` class remains the same, with the strand randomly selected for each genomic position sampled.
- `PerformanceMetrics`:
  - Now supports `spearmanr` and `pearsonr` from `scipy.stats`. There is room to generalize this class further in the future.
  - The `update` function now has an optional argument `scores`, a `list(str)` naming the subset of metrics to compute.
- `TrainModel`:
  - `self.step` starts from `self._start_step`, which is non-zero if the model was loaded from a Selene-saved checkpoint.
  - Removed the call to `self._test_metrics.visualize` in `evaluate`, since the `visualize` method does not generalize well.
- `NonStrandSpecific`: Can now handle a model that returns two outputs from `forward`, by taking either the mean or the max of the forward and reverse predictions for each of the two outputs. A custom `NonStrandSpecific` class is recommended for more specific cases.
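
As a rough illustration of the `use_seq_len` and `shift` options for `_H5Dataset` described above, the sketch below computes which slice of a stored sequence would be kept. The helper name `center_window` and the sign convention for `shift` (a positive value moves the window toward the end of the sequence) are assumptions made for illustration, not Selene's actual implementation.

```python
def center_window(seq_len, use_seq_len, shift=0):
    """Return (start, end) of a window of length use_seq_len.

    Hypothetical helper: the window is centered in a sequence of length
    seq_len and then moved by `shift` bases (sign convention assumed).
    """
    if use_seq_len > seq_len:
        raise ValueError("use_seq_len cannot exceed the stored sequence length")
    start = (seq_len - use_seq_len) // 2 + shift
    end = start + use_seq_len
    if start < 0 or end > seq_len:
        raise ValueError("shift moves the window outside the stored sequence")
    return start, end


sequence = "ACGTACGTAC"  # toy stored sequence of length 10

# Center 4 bases, no shift.
start, end = center_window(len(sequence), use_seq_len=4)
centered = sequence[start:end]

# Same window length, moved 2 bases in the (assumed) positive direction.
start, end = center_window(len(sequence), use_seq_len=4, shift=2)
shifted = sequence[start:end]
```

Storing sequences longer than the model input is what gives a shifted window room to stay inside the stored sequence.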
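
The lookup described for `GenomicFeaturesH5` (BED regions pointing to rows of an HDF5 matrix, with multiple hits averaged) can be sketched with plain Python containers. Here a list of tuples stands in for the tabix-indexed BED file and a list of lists stands in for the HDF5 matrix; the function name `query_targets`, the overlap rule, and the toy data are all hypothetical and are not Selene's API.

```python
# Stand-in for the tabix-indexed BED file: (chrom, start, end, row_index).
REGIONS = [
    ("chr1", 100, 200, 0),
    ("chr1", 150, 250, 1),
    ("chr2", 100, 200, 2),
]

# Stand-in for the HDF5 matrix: one row of continuous target values per region.
TARGETS = [
    [0.2, 0.8],
    [0.4, 0.6],
    [1.0, 0.0],
]


def query_targets(chrom, start, end):
    """Average the target rows of all regions overlapping [start, end)."""
    rows = [
        TARGETS[row_index]
        for (r_chrom, r_start, r_end, row_index) in REGIONS
        if r_chrom == chrom and r_start < end and start < r_end
    ]
    if not rows:
        return None  # no region overlaps the query
    n_targets = len(rows[0])
    return [sum(row[i] for row in rows) / len(rows) for i in range(n_targets)]


single = query_targets("chr1", 100, 120)    # overlaps only the first region
averaged = query_targets("chr1", 160, 190)  # overlaps the first two regions
```

Keeping only a row index in the BED file means the region index stays small while the target values live in the HDF5 matrix.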
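
The `exclude_chrs` check described for `RandomPositionsSampler` is a substring test applied to each chromosome name. A minimal sketch of that behavior follows, assuming a hypothetical standalone function `filter_chromosomes` (the real check happens inside the sampler when it loads possible sampling positions).

```python
def filter_chromosomes(chroms, exclude_chrs=("_",)):
    """Keep chromosomes whose names contain none of the excluded substrings."""
    kept = []
    for chrom in chroms:
        if any(s in chrom for s in exclude_chrs):
            continue  # skip this chromosome entirely
        kept.append(chrom)
    return kept


chroms = [
    "chr1", "chr2", "chrX", "chrM",
    "chr1_KI270706v1_random", "chrUn_GL000195v1",
]

# Default: drop every name containing an underscore.
default_kept = filter_chromosomes(chroms)

# Custom list: full names also work, since a name is a substring of itself.
custom_kept = filter_chromosomes(chroms, exclude_chrs=["_", "chrM", "chrX"])
```

Because the test is a substring match, an entry such as `'chr1'` would also exclude `chr10` through `chr19`, so short entries should be chosen with care.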
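
For the `NonStrandSpecific` change, one way to picture the handling of a model that returns two outputs is to combine the forward-strand and reverse-strand predictions separately for each output. The sketch below uses plain lists in place of tensors; `combine_strands` and its `mode` argument are hypothetical names, and any re-alignment of the reverse-strand predictions is left out.

```python
def combine_strands(forward_out, reverse_out, mode="mean"):
    """Combine forward and reverse predictions elementwise.

    Accepts either one list of predictions per strand or a tuple of
    lists (a model with several outputs); tuples are combined per output.
    """
    if mode not in ("mean", "max"):
        raise ValueError("mode must be 'mean' or 'max'")
    if isinstance(forward_out, tuple):
        return tuple(
            combine_strands(f, r, mode) for f, r in zip(forward_out, reverse_out)
        )
    if mode == "mean":
        return [(f + r) / 2 for f, r in zip(forward_out, reverse_out)]
    return [max(f, r) for f, r in zip(forward_out, reverse_out)]


forward = ([0.2, 0.9], [1.0, 3.0])  # two outputs for the forward strand
reverse = ([0.4, 0.5], [2.0, 1.0])  # two outputs for the reverse complement

mean_out = combine_strands(forward, reverse, mode="mean")
max_out = combine_strands(forward, reverse, mode="max")
```

Each output is combined independently, which is the reason a custom class is the better fit when the two outputs need different treatment.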