Commit 983244b

Add NEWS update and README.large_positions.md file
1 parent f84bba1 commit 983244b

2 files changed, 243 additions and 0 deletions

NEWS

Lines changed: 12 additions & 0 deletions
@@ -4,6 +4,18 @@ Noteworthy changes in release a.b
 * Incompatible changes: Several functions and data types have been changed
   in this release, and the shared library soversion has been bumped.

+   - HTSlib now supports 64 bit reference positions. This means several
+     structures, function parameters, and return values have been made bigger
+     to allow larger values to be stored. While most code that uses
+     HTSlib interfaces should still build after this change, some alterations
+     may be needed - notably to printf() formats where the values of structure
+     members are being printed.
+
+     Due to file format limitations, large positions are only supported
+     when reading and writing SAM and VCF files.
+
+     See README.large_positions.md for more information.
+
    - An extra field has been added to the kbitset_t struct so bitsets can
      be made smaller (and later enlarged) without involving memory allocation.

README.large_positions.md

Lines changed: 231 additions & 0 deletions
@@ -0,0 +1,231 @@
# HTSlib 64-bit reference positions

HTSlib versions 1.10 onwards internally use 64-bit reference positions. This
is to support analysis of species like axolotl, tulip and marbled lungfish
which have, or are expected to have, chromosomes longer than two gigabases.

# File format support

Currently 64-bit positions can only be stored in SAM and VCF format files.
Binary BAM, CRAM and BCF cannot be used due to limitations in the formats
themselves. As SAM and VCF are text formats, they have no limit on the
size of numeric values.

# Compatibility issues to check

Various data structure members, function parameters, and return values have
been expanded from 32 to 64 bits. As a result, some changes may be needed to
code that uses the library, even if it does not support long references.

## Variadic functions taking format strings

The types of various structure members (e.g. `bam1_core_t::pos`) and return
values from some functions (e.g. `bam_cigar2rlen()`) have been changed to
`hts_pos_t`, which is a 64-bit signed integer. Using these in 32-bit
code will generally work (as long as the stored positions are within range);
however, care needs to be taken when these values are passed directly
to functions like `printf()` which take a variable-length argument list and
a format string.

The header file `htslib/hts.h` defines the macro `PRIhts_pos`, which can be
used in `printf()` format strings to get the correct format specifier for
an `hts_pos_t` value. Code that needs to print positions should be
changed from:

```c
printf("Position is %d\n", bam->core.pos);
```

to:

```c
printf("Position is %" PRIhts_pos "\n", bam->core.pos);
```

If for some reason compatibility with older versions of HTSlib (which do
not have `hts_pos_t` or `PRIhts_pos`) is needed, the value can be cast to
`int64_t` and printed as an explicitly 64-bit value:

```c
#include <inttypes.h> // For PRId64 and int64_t

printf("Position is %" PRId64 "\n", (int64_t) bam->core.pos);
```
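
As a self-contained illustration of the cast-and-`PRId64` approach, the sketch below formats a position into a caller-supplied buffer. The helper name `format_position()` is hypothetical and not part of HTSlib; it takes a plain `int64_t` so that it builds without the HTSlib headers.

```c
#include <inttypes.h>  /* PRId64 */
#include <stddef.h>    /* size_t */
#include <stdint.h>    /* int64_t */
#include <stdio.h>     /* snprintf */

/*
 * Hypothetical helper (not an HTSlib function): write "Position is <pos>"
 * into buf.  Returns 0 on success, or -1 if the text did not fit or
 * snprintf() reported an error.
 */
static int format_position(char *buf, size_t size, int64_t pos)
{
    int n = snprintf(buf, size, "Position is %" PRId64, pos);
    return (n >= 0 && (size_t) n < size) ? 0 : -1;
}
```

A caller would pass `(int64_t) bam->core.pos`, which is a valid conversion whether the member is 32 or 64 bits wide in the installed HTSlib.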

Passing incorrect types to variadic functions like `printf()` can lead
to incorrect behaviour and security risks, so it is important to track down
and fix all of the places where this may happen. Modern C compilers like
gcc (version 3.0 onwards) and clang can check `printf()` and `scanf()`
parameter types for compatibility against the format string. To
enable this, build code with `-Wall` or `-Wformat` and fix all the
reported warnings.

Where functions that take `printf`-style format strings are implemented,
they should use the appropriate gcc attributes to enable format string
checking. `htslib/hts_defs.h` includes macros `HTS_FORMAT` and
`HTS_PRINTF_FMT` which can be used to provide the attribute declaration
in a portable way. For example, `test/sam.c` uses them for a function
that prints error messages:

```c
void HTS_FORMAT(HTS_PRINTF_FMT, 1, 2) fail(const char *fmt, ...) { /* ... */ }
```
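
A minimal sketch of the same idea outside HTSlib follows. The macro `DEMO_PRINTF_CHECK` and the function `format_message()` are hypothetical names used only for illustration; the macro expands to the gcc/clang `format` attribute when one of those compilers is detected and to nothing otherwise.

```c
#include <stdarg.h>  /* va_list, va_start, va_end */
#include <stddef.h>  /* size_t */
#include <stdio.h>   /* vsnprintf */

/* Hypothetical macro: request printf-style checking on gcc and clang. */
#if defined(__GNUC__) || defined(__clang__)
#define DEMO_PRINTF_CHECK(fmt_idx, first_arg) \
    __attribute__((format(printf, fmt_idx, first_arg)))
#else
#define DEMO_PRINTF_CHECK(fmt_idx, first_arg)
#endif

/*
 * Hypothetical function: format a message into buf.  Argument 3 is the
 * format string and the variadic arguments start at position 4.
 */
static int format_message(char *buf, size_t size, const char *fmt, ...)
    DEMO_PRINTF_CHECK(3, 4);

/* Returns whatever vsnprintf() returns for the formatted text. */
static int format_message(char *buf, size_t size, const char *fmt, ...)
{
    va_list args;
    va_start(args, fmt);
    int n = vsnprintf(buf, size, fmt, args);
    va_end(args);
    return n;
}
```

With `-Wformat` enabled, a call that pairs `%d` with a 64-bit argument should then be reported at compile time rather than misbehaving at run time.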

## Implicit type conversions

Conversion of signed `int` or `int32_t` to `hts_pos_t` will always work.

Conversion of `hts_pos_t` to `int` or `int32_t` will work as long as the value
converted is within the range that can be stored in the destination.

Code that casts unsigned `uint32_t` values to signed with the expectation
that the result may be negative will no longer work as `hts_pos_t` can store
values over `UINT32_MAX`. Such code should be changed to use signed values.

Functions `hts_parse_region()` and `hts_parse_reg64()` return the special value
`HTS_POS_MAX` for regions which extend to the end of the reference.
This value is slightly smaller than `INT64_MAX`, but should be larger than
any reference that is likely to be used. When cast to `int32_t` the
result should be `INT32_MAX`.
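
The first three conversion rules above can be illustrated with a small self-contained sketch. It uses `int64_t` as a stand-in for `hts_pos_t` so that it builds without HTSlib, and the helper names `pos_to_int32()` and `widen_u32()` are hypothetical.

```c
#include <stdint.h>  /* int32_t, int64_t, uint32_t, INT32_MIN, INT32_MAX */

/*
 * Hypothetical helper: narrow a 64-bit position to int32_t only when it
 * fits.  Returns 0 and stores the value on success, -1 if out of range.
 */
static int pos_to_int32(int64_t pos, int32_t *out)
{
    if (pos < INT32_MIN || pos > INT32_MAX)
        return -1;
    *out = (int32_t) pos;
    return 0;
}

/*
 * Widening an unsigned 32-bit value keeps its numeric value, so a
 * uint32_t holding 0xFFFFFFFF becomes 4294967295, not -1.
 */
static int64_t widen_u32(uint32_t value)
{
    return (int64_t) value;
}
```

Checking the range before narrowing lets the caller decide how to handle a position that legacy 32-bit code cannot represent, instead of silently storing a wrong value.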

# Upgrading code to work with 64-bit positions

Variables used to store reference positions should be changed to
type `hts_pos_t`. Use `PRIhts_pos` in format strings when printing them.

When converting positions stored in strings, use `strtoll()` in place of
`atoi()` or `strtol()` (which produces a 32-bit value on 64-bit Windows and
all 32-bit platforms).
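
One way to apply this is sketched below. The helper `parse_position()` is a hypothetical name, and it stores into `int64_t` as a stand-in for `hts_pos_t` so that the example builds without HTSlib. Like `strtoll()` itself, it accepts leading whitespace and an optional sign.

```c
#include <errno.h>   /* errno, ERANGE */
#include <stdint.h>  /* int64_t */
#include <stdlib.h>  /* strtoll */

/*
 * Hypothetical helper: parse a decimal position from str into *pos.
 * Returns 0 on success, or -1 for empty input, trailing characters,
 * or a value that does not fit in long long.
 */
static int parse_position(const char *str, int64_t *pos)
{
    char *end = NULL;
    errno = 0;
    long long value = strtoll(str, &end, 10);
    if (end == str || *end != '\0' || errno == ERANGE)
        return -1;
    *pos = (int64_t) value;
    return 0;
}
```

Unlike `atoi()`, this reports malformed or out-of-range input to the caller instead of returning an unspecified result.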

Programs which need to look up a reference sequence length from a `sam_hdr_t`
structure should use `sam_hdr_tid2len()` instead of the old
`sam_hdr_t::target_len` array (which is left as 32-bit for reasons of
compatibility). `sam_hdr_tid2len()` returns `hts_pos_t`, so works correctly
for large references.

Various functions which take pointer arguments have new versions which
support `hts_pos_t *` arguments. Code supporting 64-bit positions should
use the new versions. These are:

Original function  | 64-bit version
------------------ | --------------------
fai_fetch()        | fai_fetch64()
fai_fetchqual()    | fai_fetchqual64()
faidx_fetch_seq()  | faidx_fetch_seq64()
faidx_fetch_qual() | faidx_fetch_qual64()
hts_parse_reg()    | hts_parse_reg64() or hts_parse_region()
bam_plp_auto()     | bam_plp64_auto()
bam_plp_next()     | bam_plp64_next()
bam_mplp_auto()    | bam_mplp64_auto()

Limited support has been added for 64-bit INFO values in VCF files, for large
values in structural variant END tags. New functions `bcf_update_info_int64()`
and `bcf_get_info_int64()` can be used to set and fetch 64-bit INFO values.
They both take arrays of `int64_t`. `bcf_int64_missing` and
`bcf_int64_vector_end` can be used to set missing and vector end values in
these arrays. The INFO data is stored in the minimum size needed, so there
is no harm in using these functions to store smaller integer values.

# Structure members that have changed size

```
File htslib/hts.h:
   hts_pair32_t::begin
   hts_pair32_t::end

   (typedef hts_pair_pos_t is provided as a better-named replacement for hts_pair32_t)

   hts_reglist_t::min_beg
   hts_reglist_t::max_end

   hts_itr_t::beg
   hts_itr_t::end
   hts_itr_t::curr_beg
   hts_itr_t::curr_end

File htslib/regidx.h:
   reg_t::start
   reg_t::end

File htslib/sam.h:
   bam1_core_t::pos
   bam1_core_t::mpos
   bam1_core_t::isize

File htslib/synced_bcf_reader.h:
   bcf_sr_regions_t::start
   bcf_sr_regions_t::end
   bcf_sr_regions_t::prev_start

File htslib/vcf.h:
   bcf_idinfo_t::info

   bcf_info_t::v1::i

   bcf1_t::pos
   bcf1_t::rlen
```

# Functions where parameters or the return value have changed size

Functions are annotated as follows:

* `[new]` The function has been added since version 1.9
* `[parameters]` Function parameters have changed size
* `[return]` Function return value has changed size

```
File htslib/faidx.h:

   [new]         fai_fetch64()
   [new]         fai_fetchqual64()
   [new]         faidx_fetch_seq64()
   [new]         faidx_fetch_qual64()
   [new]         fai_parse_region()

File htslib/hts.h:

   [parameters]  hts_idx_push()
   [new]         hts_parse_reg64()
   [parameters]  hts_itr_query()
   [parameters]  hts_reg2bin()

File htslib/kstring.h:

   [new]         kputll()

File htslib/regidx.h:

   [parameters]  regidx_overlap()

File htslib/sam.h:

   [new]         sam_hdr_tid2len()
   [return]      bam_cigar2rlen()
   [return]      bam_endpos()
   [parameters]  bam_itr_queryi()
   [parameters]  sam_itr_queryi()
   [new]         bam_plp64_next()
   [new]         bam_plp64_auto()
   [new]         bam_mplp64_auto()
   [parameters]  sam_cap_mapq()
   [parameters]  sam_prob_realn()

File htslib/synced_bcf_reader.h:

   [parameters]  bcf_sr_seek()
   [parameters]  bcf_sr_regions_overlap()

File htslib/tbx.h:

   [parameters]  tbx_readrec()

File htslib/vcf.h:

   [parameters]  bcf_readrec()
   [new]         bcf_update_info_int64()
   [new]         bcf_get_info_int64()
   [return]      bcf_dec_int1()
   [return]      bcf_dec_typed_int1()

```
