get-ncbi-genomes: Fetch entire genomes and associated taxonomies and metadata using NCBI Datasets.

Citations
  • Karen Clark, Ilene Karsch-Mizrachi, David J. Lipman, James Ostell, and Eric W. Sayers. GenBank. Nucleic Acids Research, 44(D1):D67–72, January 2016. doi:10.1093/nar/gkv1276.

  • Nuala A. O'Leary, Mathew W. Wright, J. Rodney Brister, Stacy Ciufo, Diana Haddad, Rich McVeigh, Bhanu Rajput, Barbara Robbertse, Brian Smith-White, Danso Ako-Adjei, Alexander Astashyn, Azat Badretdin, Yiming Bao, Olga Blinkova, Vyacheslav Brover, Vyacheslav Chetvernin, Jinna Choi, Eric Cox, Olga Ermolaeva, Catherine M. Farrell, Tamara Goldfarb, Tripti Gupta, Daniel Haft, Eneida Hatcher, Wratko Hlavina, Vinita S. Joardar, Vamsi K. Kodali, Wenjun Li, Donna Maglott, Patrick Masterson, Kelly M. McGarvey, Michael R. Murphy, Kathleen O'Neill, Shashikant Pujar, Sanjida H. Rangwala, Daniel Rausch, Lillian D. Riddick, Conrad Schoch, Andrei Shkeda, Susan S. Storz, Hanzhen Sun, Francoise Thibaud-Nissen, Igor Tolstoy, Raymond E. Tully, Anjana R. Vatsan, Craig Wallin, David Webb, Wendy Wu, Melissa J. Landrum, Avi Kimchi, Tatiana Tatusova, Michael DiCuccio, Paul Kitts, Terence D. Murphy, and Kim D. Pruitt. Reference sequence (RefSeq) database at NCBI: current status, taxonomic expansion, and functional annotation. Nucleic Acids Research, 44(D1):D733–745, January 2016. doi:10.1093/nar/gkv1189.

  • Conrad L Schoch, Stacy Ciufo, Mikhail Domrachev, Carol L Hotton, Sivakumar Kannan, Rogneda Khovanskaya, Detlef Leipe, Richard Mcveigh, Kathleen O'Neill, Barbara Robbertse, Shobha Sharma, Vladimir Soussov, John P Sullivan, Lu Sun, Seán Turner, and Ilene Karsch-Mizrachi. NCBI Taxonomy: a comprehensive update on curation, resources and tools. Database: The Journal of Biological Databases and Curation, 2020:baaa062, August 2020. doi:10.1093/database/baaa062.

Docstring:

Usage: qiime rescript get-ncbi-genomes [OPTIONS]

  Uses NCBI Datasets to fetch genomes for indicated taxa. Nucleotide sequences
  and protein/gene annotations will be fetched and supplemented with full
  taxonomy of every sequence.

Parameters:
  --p-taxon TEXT         NCBI Taxonomy ID or name (common or scientific) at
                         any taxonomic rank.                        [required]
  --p-assembly-source TEXT Choices('refseq', 'genbank', 'all')
                         Fetch only RefSeq or GenBank genome assemblies.
                                                           [default: 'refseq']
  --p-assembly-levels TEXT... Choices('complete_genome', 'chromosome',
    'scaffold', 'contig')
                         Fetch only genome assemblies that are one of the
                         specified assembly levels.
                                                [default: ['complete_genome']]
  --p-only-reference / --p-no-only-reference
                         Fetch only reference and representative genome
                         assemblies.                           [default: True]
  --p-only-genomic / --p-no-only-genomic
                         Exclude plasmid, mitochondrial and chloroplast
                         molecules from the final results (i.e., keep only
                         genomic DNA).                        [default: False]
  --p-tax-exact-match / --p-no-tax-exact-match
                         If true, only return assemblies with the given NCBI
                         Taxonomy ID or name. Otherwise, assemblies from the
                         taxonomy subtree are also included.  [default: False]
  --p-page-size INTEGER Range(20, 1000, inclusive_end=True)
                         The maximum number of genome assemblies to return
                         per request. If the number of genomes to fetch is
                         higher than this number, requests will be repeated
                         until all assemblies are fetched.       [default: 20]
  --p-ranks TEXT... Choices('domain', 'superkingdom', 'kingdom',
    'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum',
    'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder',
    'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family',
    'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group',
    'species subgroup', 'species', 'subspecies', 'forma')
                         List of taxonomic ranks for building a taxonomy from
                         the NCBI Taxonomy database.
[default: ['kingdom', 'phylum', 'class', 'order', 'family', 'genus', 'species']]
  --p-rank-propagation / --p-no-rank-propagation
                         If a rank has no taxonomy associated with it, the
                         taxonomy from the upper-level rank of that lineage
                         will be propagated downward. For example, if we are
                         missing the genus label for 'f__Pasteurellaceae;
                         g__', then the 'f__' rank will be propagated to
                         become: 'f__Pasteurellaceae; g__Pasteurellaceae'.
                                                               [default: True]
Outputs:
  --o-genome-assemblies ARTIFACT FeatureData[Sequence]
                         Nucleotide sequences of requested genomes. [required]
  --o-loci ARTIFACT      Loci features of requested genomes.
    GenomeData[Loci]                                                [required]
  --o-proteins ARTIFACT GenomeData[Proteins]
                         Protein sequences originating from requested
                         genomes.                                   [required]
  --o-taxonomies ARTIFACT FeatureData[Taxonomy]
                         Taxonomies of requested genomes.           [required]
Miscellaneous:
  --output-dir PATH      Output unspecified results to a directory
  --verbose / --quiet    Display verbose output to stdout and/or stderr
                         during execution of this action. Or silence output if
                         execution is successful (silence is golden).
  --example-data PATH    Write example data and exit.
  --citations            Show citations and exit.
  --use-cache DIRECTORY  Specify the cache to be used for the intermediate
                         work of this action. If not provided, the default
                         cache under $TMP/qiime2/ will be used.
                         IMPORTANT FOR HPC USERS: If you are on an HPC system
                         and are using parallel execution it is important to
                         set this to a location that is globally accessible to
                         all nodes in the cluster.
  --help                 Show this message and exit.
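
The options above can be assembled into a full command line. The following is a minimal sketch, not part of RESCRIPt: the helper name build_get_ncbi_genomes_command, the output file names, and the ncbi-genomes directory are all hypothetical, and the taxon is borrowed from the rank-propagation example. It assumes that multi-value options shown as TEXT... accept several values after a single flag.

```python
import shlex


def build_get_ncbi_genomes_command(taxon, out_dir,
                                   assembly_levels=("complete_genome",),
                                   only_genomic=False, page_size=20):
    """Return an argv list for `qiime rescript get-ncbi-genomes`.

    Hypothetical helper for illustration; only the flags themselves
    come from the help text above.
    """
    # The help text documents Range(20, 1000, inclusive_end=True).
    if not 20 <= page_size <= 1000:
        raise ValueError("page_size must be between 20 and 1000 (inclusive)")
    return [
        "qiime", "rescript", "get-ncbi-genomes",
        "--p-taxon", taxon,
        "--p-assembly-source", "refseq",
        # Assumption: several values may follow one multi-value flag.
        "--p-assembly-levels", *assembly_levels,
        "--p-page-size", str(page_size),
        "--p-only-genomic" if only_genomic else "--p-no-only-genomic",
        # Output file names are placeholders chosen for this sketch.
        "--o-genome-assemblies", f"{out_dir}/genome-assemblies.qza",
        "--o-loci", f"{out_dir}/loci.qza",
        "--o-proteins", f"{out_dir}/proteins.qza",
        "--o-taxonomies", f"{out_dir}/taxonomies.qza",
    ]


cmd = build_get_ncbi_genomes_command("Pasteurellaceae", "ncbi-genomes",
                                     only_genomic=True)
print(shlex.join(cmd))
# To execute it (needs a QIIME 2 environment with RESCRIPt and network
# access, and an existing output directory):
#   import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems with taxon names that contain spaces.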

Import:

from qiime2.plugins.rescript.methods import get_ncbi_genomes

Docstring:

Fetch entire genomes and associated taxonomies and metadata using NCBI
Datasets.

Uses NCBI Datasets to fetch genomes for indicated taxa. Nucleotide
sequences and protein/gene annotations will be fetched and supplemented
with full taxonomy of every sequence.

Parameters
----------
taxon : Str
    NCBI Taxonomy ID or name (common or scientific) at any taxonomic rank.
assembly_source : Str % Choices('refseq', 'genbank', 'all'), optional
    Fetch only RefSeq or GenBank genome assemblies.
assembly_levels : List[Str % Choices('complete_genome', 'chromosome', 'scaffold', 'contig')], optional
    Fetch only genome assemblies that are one of the specified assembly
    levels.
only_reference : Bool, optional
    Fetch only reference and representative genome assemblies.
only_genomic : Bool, optional
    Exclude plasmid, mitochondrial and chloroplast molecules from the final
    results (i.e., keep only genomic DNA).
tax_exact_match : Bool, optional
    If true, only return assemblies with the given NCBI Taxonomy ID or
    name. Otherwise, assemblies from the taxonomy subtree are also included.
page_size : Int % Range(20, 1000, inclusive_end=True), optional
    The maximum number of genome assemblies to return per request. If the
    number of genomes to fetch is higher than this number, requests will be
    repeated until all assemblies are fetched.
ranks : List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')], optional
    List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy
    database.
rank_propagation : Bool, optional
    If a rank has no taxonomy associated with it, the taxonomy from the
    upper-level rank of that lineage will be propagated downward. For
    example, if we are missing the genus label for 'f__Pasteurellaceae;
    g__', then the 'f__' rank will be propagated to become:
    'f__Pasteurellaceae; g__Pasteurellaceae'.
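
The rank-propagation behaviour described above can be illustrated with a few lines of plain Python. This is only a sketch of the idea, not RESCRIPt's actual implementation; the function name propagate_ranks and its arguments are hypothetical.

```python
def propagate_ranks(labels, prefixes):
    """Fill empty ranks with the nearest non-empty upper-level label.

    `labels` holds one label per rank, ordered from the highest rank to
    the lowest, with "" marking a missing label. `prefixes` holds the
    matching rank prefixes (for example "f" and "g").
    """
    filled = []
    last_seen = ""  # most recent non-empty label from an upper rank
    for prefix, label in zip(prefixes, labels):
        if label:
            last_seen = label
        filled.append(f"{prefix}__{last_seen}")
    return "; ".join(filled)


print(propagate_ranks(["Pasteurellaceae", ""], ["f", "g"]))
# f__Pasteurellaceae; g__Pasteurellaceae
```

In this sketch a missing label at the very top of the lineage stays empty, because there is no upper-level rank to copy from.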

Returns
-------
genome_assemblies : FeatureData[Sequence]
    Nucleotide sequences of requested genomes.
loci : GenomeData[Loci]
    Loci features of requested genomes.
proteins : GenomeData[Proteins]
    Protein sequences originating from requested genomes.
taxonomies : FeatureData[Taxonomy]
    Taxonomies of requested genomes.
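
A call through the Python API might look like the sketch below. The helpers make_kwargs and fetch_and_save, the output file names, and the example taxon are hypothetical. Only get_ncbi_genomes, its parameter names, and the four output names come from the docstring above. The QIIME 2 import is done lazily so that the parameter checks can be read and run without QIIME 2 installed.

```python
# Choices and range copied from the docstring above.
DOCUMENTED_SOURCES = {"refseq", "genbank", "all"}
DOCUMENTED_LEVELS = {"complete_genome", "chromosome", "scaffold", "contig"}


def make_kwargs(taxon, assembly_source="refseq",
                assembly_levels=("complete_genome",), page_size=20):
    """Check arguments against the documented choices (hypothetical helper)."""
    if assembly_source not in DOCUMENTED_SOURCES:
        raise ValueError(f"unknown assembly_source: {assembly_source!r}")
    unknown = set(assembly_levels) - DOCUMENTED_LEVELS
    if unknown:
        raise ValueError(f"unknown assembly_levels: {sorted(unknown)}")
    if not 20 <= page_size <= 1000:
        raise ValueError("page_size must be between 20 and 1000 (inclusive)")
    return {
        "taxon": taxon,
        "assembly_source": assembly_source,
        "assembly_levels": list(assembly_levels),
        "page_size": page_size,
    }


def fetch_and_save(kwargs, out_dir):
    """Run the action and save each output as a .qza file in out_dir."""
    # Lazy import: requires a QIIME 2 environment with RESCRIPt.
    from qiime2.plugins.rescript.methods import get_ncbi_genomes

    results = get_ncbi_genomes(**kwargs)
    # Output names follow the Returns section of the docstring.
    for name in ("genome_assemblies", "loci", "proteins", "taxonomies"):
        getattr(results, name).save(f"{out_dir}/{name}.qza")
    return results


kwargs = make_kwargs("Pasteurellaceae",
                     assembly_levels=["complete_genome", "chromosome"])
# fetch_and_save(kwargs, ".")  # needs QIIME 2, RESCRIPt and network access
```

The final call is left commented out because it contacts NCBI and may download a large amount of data, depending on the taxon.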