get-ncbi-data: Download, parse, and import NCBI sequences and taxonomies

Citations
  • Dennis A Benson, Mark Cavanaugh, Karen Clark, Ilene Karsch-Mizrachi, David J Lipman, James Ostell, and Eric W Sayers. GenBank. Nucleic Acids Research, 41(D1):D36–D42, 2012.

  • NCBI Resource Coordinators. Database resources of the National Center for Biotechnology Information. Nucleic Acids Research, 46(D1):D8–D13, 2018. URL: https://doi.org/10.1093/nar/gkx1095, doi:10.1093/nar/gkx1095.

Docstring:

Usage: qiime rescript get-ncbi-data [OPTIONS]

  Download and import sequences from the NCBI Nucleotide database and
  download, parse, and import the corresponding taxonomies from the NCBI
  Taxonomy database.

  Please be aware of the NCBI Disclaimer and Copyright notice
  (https://www.ncbi.nlm.nih.gov/home/about/policies/), particularly "run
  retrieval scripts on weekends or between 9 pm and 5 am Eastern Time weekdays
  for any series of more than 100 requests". As a rough guide, if you are
  downloading more than 125,000 sequences, only run this method at those
  times.

  The NCBI servers can be capricious but reward polite persistence. If the
  download fails and gives you a message that contains the words "Last
  exception was ReadTimeout", you should probably try again, maybe with more
  connections. If it fails for any other reason, please create an issue at
  https://github.com/bokulich-lab/RESCRIPt.

Parameters:
  --p-query TEXT          Query on the NCBI Nucleotide database     [optional]
  --m-accession-ids-file METADATA...
    (multiple arguments   List of accession ids for sequences in the NCBI
     will be merged)      Nucleotide database.                      [optional]
  --p-ranks TEXT... Choices('domain', 'superkingdom', 'kingdom',
    'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum',
    'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder',
    'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family',
    'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group',
    'species subgroup', 'species', 'subspecies', 'forma')
                          List of taxonomic ranks for building a taxonomy
                          from the NCBI Taxonomy database. [default:
                          'kingdom', 'phylum', 'class', 'order', 'family',
                          'genus', 'species']                       [optional]
  --p-rank-propagation / --p-no-rank-propagation
                          Propagate known ranks to missing ranks if true
                                                               [default: True]
  --p-logging-level TEXT Choices('DEBUG', 'INFO', 'WARNING', 'ERROR',
    'CRITICAL')           Logging level, set to INFO for download progress or
                          DEBUG for copious verbosity               [optional]
  --p-n-jobs INTEGER      Number of concurrent download connections. More is
    Range(1, None)        faster until you run out of bandwidth.  [default: 1]
Outputs:
  --o-sequences ARTIFACT FeatureData[Sequence]
                          Sequences from the NCBI Nucleotide database
                                                                    [required]
  --o-taxonomy ARTIFACT FeatureData[Taxonomy]
                          Taxonomies from the NCBI Taxonomy database
                                                                    [required]
Miscellaneous:
  --output-dir PATH       Output unspecified results to a directory
  --verbose / --quiet     Display verbose output to stdout and/or stderr
                          during execution of this action. Or silence output
                          if execution is successful (silence is golden).
  --example-data PATH     Write example data and exit.
  --citations             Show citations and exit.
  --use-cache DIRECTORY   Specify the cache to be used for the intermediate
                          work of this action. If not provided, the default
                          cache under $TMP/qiime2/ will be used.
                          IMPORTANT FOR HPC USERS: If you are on an HPC system
                          and are using parallel execution it is important to
                          set this to a location that is globally accessible
                          to all nodes in the cluster.
  --help                  Show this message and exit.
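
The options above can be assembled programmatically before being handed to a shell or to `subprocess.run`. The following is a minimal sketch, not part of RESCRIPt: the helper name `build_get_ncbi_data_command`, the example query, and the output file names are all illustrative assumptions. Only the flags themselves come from the help text above.

```python
import shlex


def build_get_ncbi_data_command(
    query,
    sequences_path,
    taxonomy_path,
    ranks=None,
    rank_propagation=True,
    logging_level=None,
    n_jobs=1,
):
    """Return the argv list for one `qiime rescript get-ncbi-data` call.

    Hypothetical helper: it only builds the argument list and never
    runs the command or contacts NCBI.
    """
    if n_jobs < 1:
        # The help text documents Range(1, None) for --p-n-jobs.
        raise ValueError("n_jobs must be at least 1")
    cmd = ["qiime", "rescript", "get-ncbi-data", "--p-query", query]
    if ranks:
        # --p-ranks takes several values after a single flag.
        cmd += ["--p-ranks", *ranks]
    cmd.append(
        "--p-rank-propagation" if rank_propagation else "--p-no-rank-propagation"
    )
    if logging_level is not None:
        cmd += ["--p-logging-level", logging_level]
    cmd += ["--p-n-jobs", str(n_jobs)]
    cmd += ["--o-sequences", sequences_path, "--o-taxonomy", taxonomy_path]
    return cmd


command = build_get_ncbi_data_command(
    query="Bacillus[ORGN] AND 16S[Title]",  # illustrative query only
    sequences_path="ncbi-seqs.qza",         # assumed output name
    taxonomy_path="ncbi-tax.qza",           # assumed output name
    ranks=["phylum", "genus", "species"],
    logging_level="INFO",
    n_jobs=2,
)
# shlex.join quotes arguments that contain spaces or brackets.
print(shlex.join(command))
```

Passing the list form to `subprocess.run(command, check=True)` avoids shell quoting issues with queries that contain spaces or brackets.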

Import:

from qiime2.plugins.rescript.methods import get_ncbi_data

Docstring:

Download, parse, and import NCBI sequences and taxonomies

Download and import sequences from the NCBI Nucleotide database and
download, parse, and import the corresponding taxonomies from the NCBI
Taxonomy database.

Please be aware of the NCBI Disclaimer and Copyright notice
(https://www.ncbi.nlm.nih.gov/home/about/policies/), particularly "run
retrieval scripts on weekends or between 9 pm and 5 am Eastern Time
weekdays for any series of more than 100 requests". As a rough guide, if
you are downloading more than 125,000 sequences, only run this method at
those times.

The NCBI servers can be capricious but reward polite persistence. If the
download fails and gives you a message that contains the words "Last
exception was ReadTimeout", you should probably try again, maybe with more
connections. If it fails for any other reason, please create an issue at
https://github.com/bokulich-lab/RESCRIPt.
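
The time window quoted from the NCBI notice can be checked before starting a large download. The sketch below is one possible reading of that rule (weekends at any hour, or 9 pm to 5 am Eastern on weekdays). The function names `is_off_peak` and `now_is_off_peak` are hypothetical and not part of RESCRIPt or any NCBI tool.

```python
from datetime import datetime
from zoneinfo import ZoneInfo  # Python 3.9+


def is_off_peak(eastern_time: datetime) -> bool:
    """Return True if a US Eastern wall-clock time is in the quoted window.

    Assumption: `eastern_time` already represents Eastern Time; this
    function does no time zone conversion itself.
    """
    if eastern_time.weekday() >= 5:  # 5 = Saturday, 6 = Sunday
        return True
    # Weekdays: from 21:00 up to, but not including, 05:00.
    return eastern_time.hour >= 21 or eastern_time.hour < 5


def now_is_off_peak() -> bool:
    """Check the current moment, converting to Eastern Time first."""
    return is_off_peak(datetime.now(ZoneInfo("America/New_York")))


# 3 January 2024 was a Wednesday, 6 January 2024 a Saturday.
print(is_off_peak(datetime(2024, 1, 3, 14, 0)))   # weekday afternoon
print(is_off_peak(datetime(2024, 1, 3, 22, 30)))  # weekday night
print(is_off_peak(datetime(2024, 1, 6, 12, 0)))   # weekend midday
```

A script could call `now_is_off_peak()` and postpone a request for more than 125,000 sequences when it returns `False`.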

Parameters
----------
query : Str, optional
    Query on the NCBI Nucleotide database
accession_ids : Metadata, optional
    List of accession ids for sequences in the NCBI Nucleotide database.
ranks : List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')], optional
    List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy
    database. [default: 'kingdom', 'phylum', 'class', 'order', 'family',
    'genus', 'species']
rank_propagation : Bool, optional
    Propagate known ranks to missing ranks if true. [default: True]
logging_level : Str % Choices('DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'), optional
    Logging level, set to INFO for download progress or DEBUG for copious
    verbosity
n_jobs : Int % Range(1, None), optional
    Number of concurrent download connections. More is faster until you run
    out of bandwidth. [default: 1]

Returns
-------
sequences : FeatureData[Sequence]
    Sequences from the NCBI Nucleotide database
taxonomy : FeatureData[Taxonomy]
    Taxonomies from the NCBI Taxonomy database
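
The advice to retry after a "Last exception was ReadTimeout" failure can be sketched as a small wrapper around the method. This is an illustration under stated assumptions, not RESCRIPt's own behaviour: it assumes the failure surfaces as a Python exception whose message contains that phrase, and the names `run_with_retries` and `fetch_ncbi`, the attempt count, and the wait times are all hypothetical. Only `get_ncbi_data` and its `query` and `n_jobs` parameters come from the documentation above.

```python
import time

RETRY_MARKER = "Last exception was ReadTimeout"


def run_with_retries(action, max_attempts=3, wait_seconds=60.0, sleep=time.sleep):
    """Call `action()` and retry only on failures that mention RETRY_MARKER.

    Any other error is re-raised immediately, since the documentation
    suggests reporting those as issues rather than retrying.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return action()
        except Exception as exc:
            retryable = RETRY_MARKER in str(exc)
            if not retryable or attempt == max_attempts:
                raise
            # Wait longer after each failed attempt to stay polite.
            sleep(wait_seconds * attempt)


def fetch_ncbi(query, n_jobs=1):
    """Download sequences and taxonomy for `query`, retrying on timeouts.

    Requires a QIIME 2 environment with RESCRIPt installed; the import
    is deferred so the retry helper can be used without it.
    """
    from qiime2.plugins.rescript.methods import get_ncbi_data

    results = run_with_retries(lambda: get_ncbi_data(query=query, n_jobs=n_jobs))
    return results.sequences, results.taxonomy
```

The `sleep` parameter exists so the waiting behaviour can be replaced in tests; in normal use the default `time.sleep` applies.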