Commit 7fc833d

Merge 64-bit reference positions (PR samtools#709)
Bumped TWO_TO_THREE_TRANSITION_COUNT due to ABI changes.
2 parents 813a0d9 + 983244b commit 7fc833d

53 files changed, 2175 insertions(+), 457 deletions(-)

.gitignore

Lines changed: 1 addition & 0 deletions
@@ -37,6 +37,7 @@ lib*.so.*
 /test/fieldarith
 /test/hfile
 /test/hts_endian
+/test/longrefs/*.tmp.*
 /test/pileup
 /test/sam
 /test/tabix/*.tmp.*

Makefile

Lines changed: 2 additions & 2 deletions
@@ -105,7 +105,7 @@ PACKAGE_VERSION := $(shell ./version.sh)
 
 # Increment this for each ABI breaking change until ABI version 3 becomes
 # stable
-TWO_TO_THREE_TRANSITION_COUNT = 11
+TWO_TO_THREE_TRANSITION_COUNT = 12
 LIBHTS_SOVERSION = 2to3part$(TWO_TO_THREE_TRANSITION_COUNT)
 MACH_O_COMPATIBILITY_VERSION = 2.$(TWO_TO_THREE_TRANSITION_COUNT)
 
@@ -532,7 +532,7 @@ htslib-uninstalled.pc: htslib.pc.tmp
 
 
 testclean:
-	-rm -f test/*.tmp test/*.tmp.* test/tabix/*.tmp.* test/tabix/FAIL*
+	-rm -f test/*.tmp test/*.tmp.* test/longrefs/*.tmp.* test/tabix/*.tmp.* test/tabix/FAIL*
 
 mostlyclean: testclean
 	-rm -f *.o *.pico cram/*.o cram/*.pico test/*.o test/*.dSYM version.h

NEWS

Lines changed: 12 additions & 0 deletions
@@ -4,6 +4,18 @@ Noteworthy changes in release a.b
 * Incompatible changes: Several functions and data types have been changed
   in this release, and the shared library soversion has been bumped.
 
+  - HTSlib now supports 64 bit reference positions. This means several
+    structures, function parameters, and return values have been made bigger
+    to allow larger values to be stored. While most code that uses
+    HTSlib interfaces should still build after this change, some alterations
+    may be needed - notably to printf() formats where the values of structure
+    members are being printed.
+
+    Due to file format limitations, large positions are only supported
+    when reading and writing SAM and VCF files.
+
+    See README.large_positions.md for more information.
+
   - An extra field has been added to the kbitset_t struct so bitsets can
     be made smaller (and later enlarged) without involving memory allocation.
 

README.large_positions.md

Lines changed: 231 additions & 0 deletions
@@ -0,0 +1,231 @@
+# HTSlib 64 bit reference positions
+
+HTSlib version 1.10 onwards internally uses 64 bit reference positions. This
+is to support analysis of species like axolotl, tulip and marbled lungfish
+which have, or are expected to have, chromosomes longer than two gigabases.
+
+# File format support
+
+Currently 64 bit positions can only be stored in SAM and VCF format files.
+Binary BAM, CRAM and BCF cannot be used due to limitations in the formats
+themselves. As SAM and VCF are text formats, they have no limit on the
+size of numeric values.
+
+# Compatibility issues to check
+
+Various data structure members, function parameters, and return values have
+been expanded from 32 to 64 bits. As a result, some changes may be needed to
+code that uses the library, even if it does not support long references.
+
+## Variadic functions taking format strings
+
+The type of various structure members (e.g. `bam1_core_t::pos`) and return
+values from some functions (e.g. `bam_cigar2rlen()`) have been changed to
+`hts_pos_t`, which is a 64-bit signed integer. Using these in 32-bit
+code will generally work (as long as the stored positions are within range),
+however care needs to be taken when these values are passed directly
+to functions like `printf()` which take a variable-length argument list and
+a format string.
+
+Header file `htslib/hts.h` defines macro `PRIhts_pos` which can be
+used in `printf()` format strings to get the correct format specifier for
+an `hts_pos_t` value. Code that needs to print positions should be
+changed from:
+
+```c
+printf("Position is %d\n", bam->core.pos);
+```
+
+to:
+
+```c
+printf("Position is %"PRIhts_pos"\n", bam->core.pos);
+```
+
+If for some reason compatibility with older versions of HTSlib (which do
+not have `hts_pos_t` or `PRIhts_pos`) is needed, the value can be cast to
+`int64_t` and printed as an explicitly 64-bit value:
+
+```c
+#include <inttypes.h> // For PRId64 and int64_t
+
+printf("Position is %" PRId64 "\n", (int64_t) bam->core.pos);
+```
+
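Note on the fallback described in the hunk above: code that has to build against both old and new HTSlib could choose the format macro at compile time. The following is only a sketch under stated assumptions. It does not include any HTSlib header, `pos_t`, `PRIpos` and `format_pos()` are hypothetical names used for illustration, and the `#ifdef` relies on `PRIhts_pos` being a preprocessor macro, as the README text says.

```c
#include <assert.h>
#include <inttypes.h>   /* PRId64, int64_t */
#include <stdio.h>      /* snprintf, size_t */

/* With a new enough HTSlib, PRIhts_pos would come from htslib/hts.h.
 * No HTSlib header is included here, so the fallback branch is used. */
#ifdef PRIhts_pos
#  define PRIpos PRIhts_pos
#else
#  define PRIpos PRId64
#endif

typedef int64_t pos_t;  /* hypothetical stand-in for hts_pos_t */

/* Write "Position is <pos>" into buf; returns snprintf()'s result. */
int format_pos(char *buf, size_t len, pos_t pos)
{
    assert(buf != NULL && len > 0);
    /* The explicit cast keeps the argument type in step with the
     * 64-bit format specifier whichever branch was selected above. */
    return snprintf(buf, len, "Position is %" PRIpos, (int64_t) pos);
}
```

Routing all position printing through one helper like this keeps the version check in a single place, so it can be deleted once support for older HTSlib releases is dropped.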
+Passing incorrect types to variadic functions like `printf()` can lead
+to incorrect behaviour and security risks, so it is important to track down
+and fix all of the places where this may happen. Modern C compilers like
+gcc (version 3.0 onwards) and clang can check `printf()` and `scanf()`
+parameter types for compatibility against the format string. To
+enable this, build code with `-Wall` or `-Wformat` and fix all the
+reported warnings.
+
+Where functions that take `printf`-style format strings are implemented,
+they should use the appropriate gcc attributes to enable format string
+checking. `htslib/hts_defs.h` includes macros `HTS_FORMAT` and
+`HTS_PRINTF_FMT` which can be used to provide the attribute declaration
+in a portable way. For example, `test/sam.c` uses them for a function
+that prints error messages:
+
+```
+void HTS_FORMAT(HTS_PRINTF_FMT, 1, 2) fail(const char *fmt, ...) { /* ... */ }
+```
+
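Note on format checking for user-defined functions: the sketch below shows the same idea without depending on HTSlib. `MY_PRINTF_CHECK` and `format_msg()` are hypothetical names. The macro is assumed to play the role that `HTS_FORMAT(HTS_PRINTF_FMT, ...)` plays in the README, expanding to the gcc/clang `format` attribute where available and to nothing elsewhere.

```c
#include <assert.h>
#include <stdarg.h>
#include <stdio.h>

/* Hypothetical portability macro: enables printf-style argument checking
 * on gcc and clang, and compiles away on other compilers. */
#if defined(__GNUC__) || defined(__clang__)
#  define MY_PRINTF_CHECK(fmt_idx, first_arg) \
       __attribute__((format(printf, fmt_idx, first_arg)))
#else
#  define MY_PRINTF_CHECK(fmt_idx, first_arg)
#endif

/* Format a message into buf. The format string is parameter 3 and the
 * variadic arguments start at parameter 4, hence (3, 4). */
int MY_PRINTF_CHECK(3, 4)
format_msg(char *buf, size_t len, const char *fmt, ...)
{
    va_list args;
    int n;

    assert(buf != NULL && fmt != NULL);
    va_start(args, fmt);
    n = vsnprintf(buf, len, fmt, args);
    va_end(args);
    return n;
}
```

With the attribute in place, a call such as `format_msg(buf, len, "%d", some_64_bit_position)` should be reported by `-Wformat`, which is the class of mistake the README asks callers to track down.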
+## Implicit type conversions
+
+Conversion of signed `int` or `int32_t` to `hts_pos_t` will always work.
+
+Conversion of `hts_pos_t` to `int` or `int32_t` will work as long as the value
+converted is within the range that can be stored in the destination.
+
+Code that casts unsigned `uint32_t` values to signed with the expectation
+that the result may be negative will no longer work as `hts_pos_t` can store
+values over UINT32_MAX. Such code should be changed to use signed values.
+
+Functions hts_parse_region() and hts_parse_reg64() return special value
+`HTS_POS_MAX` for regions which extend to the end of the reference.
+This value is slightly smaller than INT64_MAX, but should be larger than
+any reference that is likely to be used. When cast to `int32_t` the
+result should be `INT32_MAX`.
+
+# Upgrading code to work with 64 bit positions
+
+Variables used to store reference positions should be changed to
+type `hts_pos_t`. Use `PRIhts_pos` in format strings when printing them.
+
+When converting positions stored in strings, use `strtoll()` in place of
+`atoi()` or `strtol()` (which produces a 32 bit value on 64-bit Windows and
+all 32-bit platforms).
+
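Note on string conversion: a minimal sketch of parsing a position with `strtoll()` and basic error checking. It is self-contained and does not include HTSlib. `pos_t` and `parse_pos()` are hypothetical names, with `pos_t` standing in for `hts_pos_t`.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

typedef int64_t pos_t;  /* hypothetical stand-in for hts_pos_t */

/* Parse a decimal position from str.
 * Returns 0 and stores the value on success, -1 on any error. */
int parse_pos(const char *str, pos_t *out)
{
    char *end;
    long long val;

    assert(str != NULL && out != NULL);
    errno = 0;
    val = strtoll(str, &end, 10);
    if (end == str || *end != '\0')  /* no digits, or trailing characters */
        return -1;
    if (errno == ERANGE)             /* value does not fit in long long */
        return -1;
    *out = (pos_t) val;
    return 0;
}
```

Unlike `atoi()`, this reports malformed or out-of-range input instead of silently returning a truncated or undefined value.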
+Programs which need to look up a reference sequence length from a `sam_hdr_t`
+structure should use `sam_hdr_tid2len()` instead of the old
+`sam_hdr_t::target_len` array (which is left as 32-bit for reasons of
+compatibility). `sam_hdr_tid2len()` returns `hts_pos_t`, so works correctly
+for large references.
+
+Various functions which take pointer arguments have new versions which
+support `hts_pos_t *` arguments. Code supporting 64-bit positions should
+use the new versions. These are:
+
+Original function | 64-bit version
+------------------ | --------------------
+fai_fetch() | fai_fetch64()
+fai_fetchqual() | fai_fetchqual64()
+faidx_fetch_seq() | faidx_fetch_seq64()
+faidx_fetch_qual() | faidx_fetch_qual64()
+hts_parse_reg() | hts_parse_reg64() or hts_parse_region()
+bam_plp_auto() | bam_plp64_auto()
+bam_plp_next() | bam_plp64_next()
+bam_mplp_auto() | bam_mplp64_auto()
+
+Limited support has been added for 64-bit INFO values in VCF files, for large
+values in structural variant END tags. New functions `bcf_update_info_int64()`
+and `bcf_get_info_int64()` can be used to set and fetch 64-bit INFO values.
+They both take arrays of `int64_t`. `bcf_int64_missing` and
+`bcf_int64_vector_end` can be used to set missing and vector end values in
+these arrays. The INFO data is stored in the minimum size needed, so there
+is no harm in using these functions to store smaller integer values.
+
+# Structure members that have changed size
+
+```
+File htslib/hts.h:
+   hts_pair32_t::begin
+   hts_pair32_t::end
+
+   (typedef hts_pair_pos_t is provided as a better-named replacement for hts_pair32_t)
+
+   hts_reglist_t::min_beg
+   hts_reglist_t::max_end
+
+   hts_itr_t::beg
+   hts_itr_t::end
+   hts_itr_t::curr_beg
+   hts_itr_t::curr_end
+
+File htslib/regidx.h:
+   reg_t::start
+   reg_t::end
+
+File htslib/sam.h:
+   bam1_core_t::pos
+   bam1_core_t::mpos
+   bam1_core_t::isize
+
+File htslib/synced_bcf_reader.h:
+   bcf_sr_regions_t::start
+   bcf_sr_regions_t::end
+   bcf_sr_regions_t::prev_start
+
+File htslib/vcf.h:
+   bcf_idinfo_t::info
+
+   bcf_info_t::v1::i
+
+   bcf1_t::pos
+   bcf1_t::rlen
+```
+
+# Functions where parameters or the return value have changed size
+
+Functions are annotated as follows:
+
+* `[new]` The function has been added since version 1.9
+* `[parameters]` Function parameters have changed size
+* `[return]` Function return value has changed size
+
+```
+File htslib/faidx.h:
+
+   [new]         fai_fetch64()
+   [new]         fai_fetchqual64()
+   [new]         faidx_fetch_seq64()
+   [new]         faidx_fetch_qual64()
+   [new]         fai_parse_region()
+
+File htslib/hts.h:
+
+   [parameters]  hts_idx_push()
+   [new]         hts_parse_reg64()
+   [parameters]  hts_itr_query()
+   [parameters]  hts_reg2bin()
+
+File htslib/kstring.h:
+
+   [new]         kputll()
+
+File htslib/regidx.h:
+
+   [parameters]  regidx_overlap()
+
+File htslib/sam.h:
+
+   [new]         sam_hdr_tid2len()
+   [return]      bam_cigar2rlen()
+   [return]      bam_endpos()
+   [parameters]  bam_itr_queryi()
+   [parameters]  sam_itr_queryi()
+   [new]         bam_plp64_next()
+   [new]         bam_plp64_auto()
+   [new]         bam_mplp64_auto()
+   [parameters]  sam_cap_mapq()
+   [parameters]  sam_prob_realn()
+
+File htslib/synced_bcf_reader.h:
+
+   [parameters]  bcf_sr_seek()
+   [parameters]  bcf_sr_regions_overlap()
+
+File htslib/tbx.h:
+
+   [parameters]  tbx_readrec()
+
+File htslib/vcf.h:
+
+   [parameters]  bcf_readrec()
+   [new]         bcf_update_info_int64()
+   [new]         bcf_get_info_int64()
+   [return]      bcf_dec_int1()
+   [return]      bcf_dec_typed_int1()
+
+```

bcf_sr_sort.c

Lines changed: 3 additions & 3 deletions
@@ -288,7 +288,7 @@ void debug_vbuf(sr_sort_t *srt)
         for (i=0; i<srt->sr->nreaders; i++)
         {
             vcf_buf_t *buf = &srt->vcf_buf[i];
-            fprintf(stderr,"\t%d", buf->rec[j] ? buf->rec[j]->pos+1 : 0);
+            fprintf(stderr,"\t%"PRIhts_pos, buf->rec[j] ? buf->rec[j]->pos+1 : 0);
         }
         fprintf(stderr,"\n");
     }
@@ -330,7 +330,7 @@ int bcf_sr_sort_add_active(sr_sort_t *srt, int idx)
     srt->active[srt->nactive - 1] = idx;
     return 0; // FIXME: check for errs in this function
 }
-static int bcf_sr_sort_set(bcf_srs_t *readers, sr_sort_t *srt, const char *chr, int min_pos)
+static int bcf_sr_sort_set(bcf_srs_t *readers, sr_sort_t *srt, const char *chr, hts_pos_t min_pos)
 {
     if ( !srt->grp_str2int )
     {
@@ -556,7 +556,7 @@ static int bcf_sr_sort_set(bcf_srs_t *readers, sr_sort_t *srt, const char *chr,
     return 0; // FIXME: check for errs in this function
 }
 
-int bcf_sr_sort_next(bcf_srs_t *readers, sr_sort_t *srt, const char *chr, int min_pos)
+int bcf_sr_sort_next(bcf_srs_t *readers, sr_sort_t *srt, const char *chr, hts_pos_t min_pos)
 {
     int i,j;
     assert( srt->nactive>0 );

bcf_sr_sort.h

Lines changed: 3 additions & 2 deletions
@@ -90,15 +90,16 @@ typedef struct
     int moff, noff, *off, mcharp;
     char **charp;
     const char *chr;
-    int pos, nsr, msr;
+    hts_pos_t pos;
+    int nsr, msr;
     int pair;
     int nactive, mactive, *active; // list of readers with lines at the current pos
 }
 sr_sort_t;
 
 sr_sort_t *bcf_sr_sort_init(sr_sort_t *srt);
 void bcf_sr_sort_reset(sr_sort_t *srt);
-int bcf_sr_sort_next(bcf_srs_t *readers, sr_sort_t *srt, const char *chr, int pos);
+int bcf_sr_sort_next(bcf_srs_t *readers, sr_sort_t *srt, const char *chr, hts_pos_t pos);
 int bcf_sr_sort_set_active(sr_sort_t *srt, int i);
 int bcf_sr_sort_add_active(sr_sort_t *srt, int i);
 void bcf_sr_sort_destroy(sr_sort_t *srt);

bgzf.c

Lines changed: 3 additions & 2 deletions
@@ -110,7 +110,8 @@ enum mtaux_cmd {
 // When multi-threaded bgzf_tell won't work, so we delay the hts_idx_push
 // until we've written the last block.
 typedef struct {
-    int tid, beg, end, is_mapped; // args for hts_idx_push
+    hts_pos_t beg, end;
+    int tid, is_mapped; // args for hts_idx_push
     uint64_t offset, block_number;
 } hts_idx_cache_entry;
 
@@ -183,7 +184,7 @@ struct __bgzidx_t
  * Returns 0 on success,
  *        -1 on failure
  */
-int bgzf_idx_push(BGZF *fp, hts_idx_t *hidx, int tid, int beg, int end, uint64_t offset, int is_mapped) {
+int bgzf_idx_push(BGZF *fp, hts_idx_t *hidx, int tid, hts_pos_t beg, hts_pos_t end, uint64_t offset, int is_mapped) {
     hts_idx_cache_entry *e;
     mtaux_t *mt = fp->mt;
 