Universal ToL and Eukaryotes #98

@molly-kholodova

Hi Mike,

I'm a PhD student in Laura Hug's lab, currently working on a side project related to constructing universal ToLs. I was following your ToL example using the universal marker gene set and noticed a big discrepancy when I ran the same code.

I downloaded your code and original files (https://figshare.com/articles/dataset/GToTree_ToL_example_data/19372322) and ran it myself, so it should be using the exact same accessions that you fetched in 2018 (1698 genomes). However, my run excluded more genomes for having too few hits (Genomes_removed_for_too_few_hits.tsv): your run removed 7, mine removed 17.
ToL_test_MC_Aug2024.zip

Taking a closer look at the NCBI_genomes_summary_info.tsv file: if I look up a eukaryotic accession from Genomes_removed_for_too_few_hits.tsv, it actually has values >1 in many of the columns. From what I can tell, GToTree treats multiple hits for a gene as no hits. When I ran it again using -G 0 instead of -G 0.4 (I think this is the minimum fraction of genes for inclusion?), those genomes were included in the tree again.
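
To make the suspected behavior concrete, here is a minimal sketch of a filter that counts only single-copy hits toward the inclusion threshold. This is not GToTree's code; the function names, the `count_multi` switch, and the per-gene hit counts are all hypothetical, and the assumption that multi-copy hits are discarded before the `-G` check comes only from the observation above.

```python
def fraction_single_copy(hit_counts):
    """Fraction of target genes with exactly one hit."""
    return sum(1 for c in hit_counts if c == 1) / len(hit_counts)


def fraction_present(hit_counts):
    """Fraction of target genes with at least one hit."""
    return sum(1 for c in hit_counts if c >= 1) / len(hit_counts)


def passes_filter(hit_counts, min_fraction, count_multi=False):
    """Return True if the genome meets the minimum-fraction threshold.

    count_multi=False mimics the suspected behavior: genes with more
    than one hit are treated the same as genes with no hit.
    """
    if count_multi:
        fraction = fraction_present(hit_counts)
    else:
        fraction = fraction_single_copy(hit_counts)
    return fraction >= min_fraction


# Made-up per-gene hit counts for one eukaryotic genome (illustration only).
euk_counts = [2, 3, 1, 2, 0, 2, 1, 4, 2, 2, 1, 2, 3, 2, 1, 2]

print(passes_filter(euk_counts, 0.4))                    # single-copy only
print(passes_filter(euk_counts, 0.0))                    # threshold disabled
print(passes_filter(euk_counts, 0.4, count_multi=True))  # multi-copy counted
```

In this toy case the genome has a hit for 15 of 16 genes but only 4 single-copy hits, so it fails a 0.4 threshold unless multi-copy genes are counted or the threshold is set to 0, which matches what I see when switching from -G 0.4 to -G 0.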

I understand you have a disclaimer about working with eukaryotes in GToTree in terms of completion and redundancy, but currently even reference eukaryotic genomes cannot be placed, which is problematic for creating universal trees.

As a side note, an additional issue to consider when working with eukaryotic genomes is that mitochondrial or chloroplast genes could be mistaken for marker genes, potentially leading to misplacement or conflicting data. In cases of multiple hits, how does GToTree decide which one to use? If it is based on best match/lowest E-value, then it would prioritize mitochondrial and chloroplast genes over the eukaryotic versions. I ran into this a few times in my own project, where a big group of plants and photosynthetic eukaryotes was placed next to the cyanobacteria.
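
The concern about best-hit selection can be sketched as follows. Whether GToTree actually picks hits this way is the open question here, so this is only an illustration: the `Hit` class, the `compartment` annotation, the sequence IDs, and the E-values are all invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    seq_id: str
    evalue: float
    compartment: str  # hypothetical annotation: "nuclear", "chloroplast", ...


def best_hit(hits):
    """Pick the hit with the lowest E-value, ignoring its origin."""
    return min(hits, key=lambda h: h.evalue)


def best_hit_excluding(hits, excluded=("mitochondrion", "chloroplast")):
    """Pick the lowest E-value hit after dropping organellar copies."""
    kept = [h for h in hits if h.compartment not in excluded]
    return min(kept, key=lambda h: h.evalue) if kept else None


# Invented hits for one ribosomal protein in a plant genome.
hits = [
    Hit("plant_chl_copy", 1e-60, "chloroplast"),
    Hit("plant_mito_copy", 1e-45, "mitochondrion"),
    Hit("plant_nuc_copy", 1e-20, "nuclear"),
]

print(best_hit(hits).seq_id)            # organellar copy wins on E-value
print(best_hit_excluding(hits).seq_id)  # nuclear copy once organelles are dropped
```

If the organellar copy scores best against the marker HMM, a pure lowest-E-value rule would pull the plant toward the cyanobacteria, which is the pattern described above. Filtering by origin requires knowing which contigs are organellar, which is a separate problem this sketch does not address.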

I am implementing a solution right now by building a separate HMM for eukaryotic ribosomal genes and running that separately for eukaryotic genomes. This does require sorting the genomes beforehand, but perhaps you could incorporate it into GToTree as an option for dealing with genomes with multiple SCG hits.

Let me know if you want more info on anything I've mentioned here or would like to discuss further.

Labels: enhancement