Hi Mike,
I'm a PhD student in Laura Hug's lab, currently working on a side project on constructing universal trees of life (ToLs). I was following your ToL example using the universal marker gene set and noticed a big discrepancy when I ran the same code.
I downloaded your code and original files (https://figshare.com/articles/dataset/GToTree_ToL_example_data/19372322) and ran it myself, so it should be using the exact same accessions that you fetched in 2018 (1698 genomes). However, my run excluded more organisms for having too few hits (Genomes_removed_for_too_few_hits.tsv): your run removed 7, mine removed 17.
ToL_test_MC_Aug2024.zip
Taking a closer look at the NCBI_genomes_summary_info.tsv file, when I look up a eukaryotic accession listed in Genomes_removed_for_too_few_hits.tsv, many of its columns actually show counts >1. From what I can tell, GToTree treats multiple hits for a gene as no hit. When I reran with -G 0 instead of -G 0.4 (I think this is the minimum fraction of genes required for inclusion?), those genomes were included in the tree again.
I understand you have a disclaimer about working with eukaryotes in GToTree in terms of completion and redundancy, but currently even reference eukaryotic genomes cannot be placed, which is problematic for building universal trees.
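A minimal sketch of the suspected behaviour, on toy data only. The table layout (an `assembly_id` column followed by one hit-count column per gene), the gene names, and the genome names are all hypothetical and are not the actual GToTree output format. It compares the fraction of genes with exactly one hit against the fraction with at least one hit, and applies the 0.4 cutoff to the former.

```python
import csv
import io

def hit_fractions(tsv_text, id_column="assembly_id"):
    """Return {genome: (fraction with exactly 1 hit, fraction with >= 1 hit)}.

    Assumes a hypothetical layout: one id column, then one integer
    hit-count column per marker gene.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    result = {}
    for row in reader:
        genome = row.pop(id_column)
        counts = [int(value) for value in row.values()]
        total = len(counts)
        single = sum(1 for c in counts if c == 1)
        any_hit = sum(1 for c in counts if c >= 1)
        result[genome] = (single / total, any_hit / total)
    return result

# Toy table (made-up values): the eukaryote has hits for 4 of 5 genes,
# but 3 of those genes have more than one hit.
TOY_TABLE = (
    "assembly_id\tgeneA\tgeneB\tgeneC\tgeneD\tgeneE\n"
    "toy_euk\t2\t3\t1\t2\t0\n"
    "toy_bac\t1\t1\t1\t1\t0\n"
)

MIN_FRACTION = 0.4  # same value as the -G 0.4 setting discussed above

fractions = hit_fractions(TOY_TABLE)
for genome, (single, any_hit) in fractions.items():
    status = "kept" if single >= MIN_FRACTION else "removed"
    print(f"{genome}: single-copy={single:.2f} any-hit={any_hit:.2f} -> {status}")
```

If multi-copy genes are counted as missing, the toy eukaryote falls below the cutoff even though most of its marker genes were found, which matches what the summary table seems to show for the removed genomes.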
As a side note, an additional issue to consider when working with eukaryotic genomes is the possibility that mitochondrial or chloroplast genes could be mistaken for marker genes, potentially leading to misplacement or conflicting data. In cases of multiple hits, how does GToTree decide which one to use? If it is based on best match/lowest E value, then it would prioritize mitochondrial and chloroplast genes over eukaryotic versions. I ran into this a few times in my own project where it would place a big group of plants and photosynthetic eukaryotes next to the cyanobacteria.
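The concern can be illustrated with a small sketch on made-up hits. This is not how GToTree is known to pick hits (that is the open question here); the `Hit` record, the `origin` label, and the E-values are hypothetical. It only shows that a pure lowest-E-value rule would select an organellar copy whenever that copy matches the model better, and what a compartment-aware alternative could look like if the origin of each sequence were known.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Hit:
    gene: str
    seq_id: str
    evalue: float
    origin: str  # hypothetical label: "nuclear", "chloroplast", "mitochondrion"

def best_by_evalue(hits: List[Hit]) -> Hit:
    """Pick the hit with the lowest E-value, ignoring where it came from."""
    return min(hits, key=lambda h: h.evalue)

def best_nuclear(hits: List[Hit]) -> Optional[Hit]:
    """Pick the lowest-E-value hit among nuclear-encoded copies only."""
    nuclear = [h for h in hits if h.origin == "nuclear"]
    return min(nuclear, key=lambda h: h.evalue) if nuclear else None

# Toy hits for one marker gene in one plant genome (values are invented).
toy_hits = [
    Hit("geneA", "toy_chloroplast_copy", 1e-80, "chloroplast"),
    Hit("geneA", "toy_nuclear_copy", 1e-35, "nuclear"),
]

naive_choice = best_by_evalue(toy_hits)
filtered_choice = best_nuclear(toy_hits)
print("lowest E-value:", naive_choice.seq_id)
print("nuclear only:  ", filtered_choice.seq_id)
```

With the naive rule the chloroplast copy wins, which would be consistent with plants being pulled toward the cyanobacteria in the tree.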
I am implementing a solution right now by building a separate HMM for eukaryotic ribosomal genes and running that separately for eukaryotic genomes. This does require sorting the genomes beforehand, but perhaps you could incorporate it into GToTree as an option for dealing with genomes with multiple SCG hits.
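As a rough sketch of the pre-sorting step, assuming a domain label is already available for each accession (the accession names, the domain lookup, and the HMM set names below are all hypothetical):

```python
from collections import defaultdict

def split_by_domain(domain_of, default_set="universal"):
    """Group accessions by which HMM set they would be searched with.

    Eukaryotes go to a separate, eukaryote-specific set; everything
    else stays with the default set.
    """
    groups = defaultdict(list)
    for accession, domain in domain_of.items():
        hmm_set = "eukaryote_specific" if domain == "Eukaryota" else default_set
        groups[hmm_set].append(accession)
    return dict(groups)

# Toy lookup table (invented names).
toy_domains = {
    "toy_plant": "Eukaryota",
    "toy_cyano": "Bacteria",
    "toy_archaeon": "Archaea",
}

groups = split_by_domain(toy_domains)
for hmm_set, accessions in groups.items():
    print(hmm_set, accessions)
```

Each group could then be run against its own HMM set and the resulting alignments combined afterwards.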
Let me know if you want more info on anything I've mentioned here or would like to discuss further.