Release v0.9.0
Highlights
In this release, we offer a new schema for output BigQuery tables. The new schema uses BigQuery's integer range partitioning, which significantly reduces query costs. We also allow users to store BigQuery tables that are highly optimized for sample lookup queries, such as:
- Find all variants of Patient X
Note: this release contains backward-incompatible changes. Please see the details below.
New Features / Improvements
- By default, one BigQuery table per chromosome is created; each table is integer range partitioned.
  - Output tables have suffixes such as `__chr1`, `__chr2`, …
  - Output tables can be changed by modifying the sharding config file.
- `call.name` is replaced with `call.sample_id`, where `sample_id` is the hash of the sample name.
  - In cases where multiple VCF files have the same sample name, the file path can be included in the hash value to distinguish between samples.
- An extra BQ table with the `__sample_info` suffix is created. This table maps `sample_id` to `sample_name` and `vcf_file_path`.
  - We also include an `ingestion_datetime` column in the sample info table to record the ingestion datetime of each VCF file.
- 1-based coordinates are used by default for genomic indexing to make BigQuery tables more compatible with VCF files.
- If `--append` is set, we ensure all expected output tables already exist before appending to them.
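To make the `sample_id` change concrete, here is a minimal sketch of how the two `--sample_name_encoding` modes could behave. These notes do not say which hash function Variant Transforms uses, so the SHA-256 truncation below is a stand-in assumption, and `sketch_sample_id` is a hypothetical helper. IDs it produces will not match those in real output tables.

```python
import hashlib

WITHOUT_FILE_PATH = "WITHOUT_FILE_PATH"
WITH_FILE_PATH = "WITH_FILE_PATH"


def sketch_sample_id(sample_name, vcf_file_path, encoding=WITHOUT_FILE_PATH):
    """Return an illustrative, non-negative 63-bit ID for a sample.

    Hypothetical helper: the real hash function is not specified here.
    """
    if encoding == WITH_FILE_PATH:
        # Hash [vcf_file_path, sample_name], as the flag description states.
        parts = [vcf_file_path, sample_name]
    elif encoding == WITHOUT_FILE_PATH:
        # Hash the sample name only.
        parts = [sample_name]
    else:
        raise ValueError(f"unknown encoding: {encoding!r}")
    # Stand-in hash (assumption): SHA-256 truncated to its first 8 bytes.
    digest = hashlib.sha256("\x00".join(parts).encode("utf-8")).digest()
    # Drop one bit so the value would fit a signed 64-bit integer.
    return int.from_bytes(digest[:8], "big") >> 1
```

With the default encoding, the same sample name in two different files yields the same ID; with `WITH_FILE_PATH`, the two files yield distinct IDs.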
New flags
`vcf_to_bq`:
- `--sample_lookup_optimized_output_table`: to store a second copy of variants in BigQuery tables that are optimized for sample lookup queries. This feature is particularly useful when the input VCF file contains joint genotyped samples.
- `--keep_intermediate_avro_files`: to store intermediate Avro files in your temp directory on GCS bucket.
- `--use_1_based_coordinate`: By default start position will be 1-based, and end position will be inclusive. You can set this flag to False to use 0-based coordinate.
- `--sample_name_encoding`: determines the way `sample_id` is hashed. Default value is `WITHOUT_FILE_PATH`. If set to `WITH_FILE_PATH`, then `sample_id` will be a hash of `[vcf_file_path, sample_name]`.
- `--sharding_config_path`: replaces `--partition_config_path`.

`bq_to_vcf`:
- `--bq_uses_1_based_coordinate`: set to False, if `--use_1_based_coordinate` was set to False when generating the BQ tables, and hence, start positions are 0-based.
- `--sample_names`: replaces `--call_names`.
- `--preserve_sample_order`: replaces `--preserve_call_names_order`.

`docker run` flags:
- All the following flags are required:
  - `--project`
  - `--regions`
  - `--temp_location`
- If you need to run Variant Transforms in a subnetwork using private IP addresses:
  - `--subnetwork ${CUSTOM_SUBNETWORK}`
  - `--use_public_ips false`
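The `--use_1_based_coordinate` and `--bq_uses_1_based_coordinate` flags switch between two position conventions. The sketch below converts between them, assuming the 0-based convention uses an exclusive end position (the notes only state that the 1-based end is inclusive). The function names are hypothetical and not part of Variant Transforms.

```python
from typing import Tuple


def to_one_based(start: int, end: int) -> Tuple[int, int]:
    """Convert a 0-based, end-exclusive range to 1-based, end-inclusive.

    Assumption: the 0-based tables store a half-open [start, end) range.
    """
    # Shifting the start by one changes the base; the exclusive end of a
    # 0-based range equals the inclusive end of the 1-based range.
    return start + 1, end


def to_zero_based(start: int, end: int) -> Tuple[int, int]:
    """Convert a 1-based, end-inclusive range to 0-based, end-exclusive."""
    return start - 1, end


# A single-base variant at VCF POS 100:
#   0-based, end-exclusive: (99, 100)
#   1-based, end-inclusive: (100, 100)
snp_one_based = to_one_based(99, 100)
```

Under this assumption only the start position differs between the two conventions, which is why queries that filter on `start_position`-style columns are the ones most likely to need updating after the default changed.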
 
Deprecated flags
The following flags are deprecated and will be removed in the next release:
- `--optimize_for_large_inputs`: because sharding is done by default for all inputs.
- `--num_bigquery_write_shards`: because we are using the Avro sink in the Dataflow pipeline.
- `--output_avro_path`: replaced with `--keep_intermediate_avro_files`.
- `--reference_names`: you can achieve the same goal by modifying the default sharding config file.
Underlying improvements
- Switched our default VCF parser from PyVCF to pysam.
- Updated to Beam 2.22.
- Changed the launcher VM to g1-small to reduce the overall cost of running Variant Transforms.
 
Breaking Changes
- By default, 1-based coordinates are used for genomic indexing. `bq_to_vcf` uses the same default value, so `VCF -> BigQuery -> VCF` with default flags should work.
- `call.name` is replaced with `call.sample_id`.
- `--partition_config_path` is replaced with `--sharding_config_path`.
- The sharding config YAML format has changed.
- Output table names cannot contain `__` because we reserve this string for separating the table base name from the suffixes that we read from the sharding config file.
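The `__` naming rule can be illustrated with a small sketch that composes output table names from a base name and suffixes. The helper names and the example base name are hypothetical, and the `chr1`/`chr2` suffixes simply mirror the examples in these notes; real suffixes come from the sharding config file.

```python
from typing import List, Sequence

SEPARATOR = "__"  # reserved between the base table name and its suffix


def output_table_name(base_name: str, suffix: str) -> str:
    """Join a base table name and a suffix, rejecting reserved separators."""
    if SEPARATOR in base_name:
        raise ValueError(
            f"base table name must not contain {SEPARATOR!r}: {base_name!r}"
        )
    return f"{base_name}{SEPARATOR}{suffix}"


def expected_tables(base_name: str, shard_suffixes: Sequence[str]) -> List[str]:
    """List the per-shard tables followed by the sample info table."""
    names = [output_table_name(base_name, s) for s in shard_suffixes]
    names.append(output_table_name(base_name, "sample_info"))
    return names


tables = expected_tables("my_dataset.variants", ["chr1", "chr2"])
# tables == ["my_dataset.variants__chr1",
#            "my_dataset.variants__chr2",
#            "my_dataset.variants__sample_info"]
```

Rejecting `__` in the base name keeps the split between base name and suffix unambiguous, which is the reason the notes give for reserving the string.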
The following flags have been removed in this release:
- `--vcf_parser`
- `--partition_config_path`: replaced with `--sharding_config_path`
- `--call_names`: replaced with `--sample_names`
- `--preserve_call_names_order`: replaced with `--preserve_sample_order`
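For scripts that still pass the removed flags, a small argument rewriter can apply the renames listed above. This is a sketch, not a tool shipped with Variant Transforms: `migrate_args` is a hypothetical helper, and it only handles the one-to-one renames. `--vcf_parser` has no replacement, so it is reported as an error rather than silently dropped.

```python
from typing import List, Sequence

# One-to-one renames taken from the list of removed flags.
RENAMED_FLAGS = {
    "--partition_config_path": "--sharding_config_path",
    "--call_names": "--sample_names",
    "--preserve_call_names_order": "--preserve_sample_order",
}
# Removed without a replacement.
REMOVED_FLAGS = {"--vcf_parser"}


def migrate_args(args: Sequence[str]) -> List[str]:
    """Rewrite pre-v0.9.0 flag names; works for --flag=value and --flag value."""
    migrated = []
    for arg in args:
        # Split only at the first '=' so values containing '=' survive intact.
        name, sep, value = arg.partition("=")
        if name in REMOVED_FLAGS:
            raise ValueError(f"{name} was removed and has no replacement")
        migrated.append(RENAMED_FLAGS.get(name, name) + sep + value)
    return migrated


new_args = migrate_args(
    ["--partition_config_path=gs://bucket/config.yaml", "--call_names", "s1,s2"]
)
```

Renaming `--partition_config_path` is not sufficient on its own, because the sharding config YAML format has also changed and the file itself needs updating. The deprecated `--output_avro_path` is deliberately left out of the mapping: its replacement, `--keep_intermediate_avro_files`, appears to take a different kind of value, so that change needs a manual edit.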