Docstring:
Usage: qiime rescript get-ncbi-data-protein [OPTIONS]
Download and import sequences from the NCBI Protein database and download,
parse, and import the corresponding taxonomies from the NCBI Taxonomy
database.
Please be aware of the NCBI Disclaimer and Copyright notice
(https://www.ncbi.nlm.nih.gov/home/about/policies/), particularly "run
retrieval scripts on weekends or between 9 pm and 5 am Eastern Time weekdays
for any series of more than 100 requests". As a rough guide, if you are
downloading more than 125,000 sequences, only run this method at those
times.
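The timing rule quoted above can be checked before starting a large download. The following is a minimal sketch, not part of RESCRIPt: the names `EASTERN` and `is_ncbi_off_peak` are hypothetical, and it assumes "Eastern Time" corresponds to the `America/New_York` time zone.

```python
from datetime import datetime
from zoneinfo import ZoneInfo

# Assumption: NCBI's "Eastern Time" is the America/New_York zone.
EASTERN = ZoneInfo("America/New_York")


def is_ncbi_off_peak(moment: datetime) -> bool:
    """Return True on weekends, or between 9 pm and 5 am Eastern on weekdays.

    Hypothetical helper for illustration; RESCRIPt does not provide it.
    """
    if moment.tzinfo is None:
        raise ValueError("moment must be timezone-aware")
    local = moment.astimezone(EASTERN)
    if local.weekday() >= 5:  # Monday is 0, so 5 and 6 are Saturday and Sunday
        return True
    return local.hour >= 21 or local.hour < 5
```

For example, `is_ncbi_off_peak(datetime.now(tz=EASTERN))` reports whether a large request would fall inside the recommended window right now.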
The NCBI servers can be capricious but reward polite persistence. If the
download fails and gives you a message that contains the words "Last
exception was ReadTimeout", you should probably try again, maybe with more
connections. If it fails for any other reason, please create an issue at
https://github.com/bokulich-lab/RESCRIPt.
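The retry advice above could be automated along these lines. This is only a sketch under stated assumptions: `run_with_retries` and its parameters are hypothetical, `download` stands for any callable you write that starts the download with a given number of connections, and the only thing taken from the help text is the message fragment "Last exception was ReadTimeout".

```python
import time

# The message fragment the help text says is safe to retry on.
RETRYABLE_MARKER = "Last exception was ReadTimeout"


def run_with_retries(download, n_jobs=1, max_attempts=3,
                     extra_jobs=1, wait_seconds=60.0):
    """Call download(n_jobs), retrying only on the ReadTimeout message.

    Hypothetical helper: each retry waits, then adds `extra_jobs`
    connections ("maybe with more connections"). Any other failure,
    or the final failed attempt, is re-raised unchanged.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return download(n_jobs)
        except Exception as error:
            retryable = RETRYABLE_MARKER in str(error)
            if not retryable or attempt == max_attempts:
                raise
            n_jobs += extra_jobs
            time.sleep(wait_seconds)
```

Failures that do not contain the marker are deliberately not retried, since the help text asks for those to be reported as issues instead.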
Parameters:
--p-query TEXT Query on the NCBI Protein database [optional]
--m-accession-ids-file METADATA... (multiple arguments will be merged)
List of accession ids for sequences in the NCBI Protein database.
[optional]
--p-ranks TEXT... Choices('domain', 'superkingdom', 'kingdom',
'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum',
'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder',
'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family',
'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group',
'species subgroup', 'species', 'subspecies', 'forma')
List of taxonomic ranks for building a taxonomy
from the NCBI Taxonomy database. [default:
'kingdom', 'phylum', 'class', 'order', 'family',
'genus', 'species'] [optional]
--p-rank-propagation / --p-no-rank-propagation
Propagate known ranks to missing ranks if true
[default: True]
--p-logging-level TEXT Choices('DEBUG', 'INFO', 'WARNING', 'ERROR',
'CRITICAL')
Logging level, set to INFO for download progress or DEBUG for copious
verbosity [optional]
--p-n-jobs INTEGER Range(1, None)
Number of concurrent download connections. More is faster until you run
out of bandwidth. [default: 1]
Outputs:
--o-sequences ARTIFACT FeatureData[ProteinSequence]
Sequences from the NCBI Protein database [required]
--o-taxonomy ARTIFACT FeatureData[Taxonomy]
Taxonomies from the NCBI Taxonomy database
[required]
Miscellaneous:
--output-dir PATH Output unspecified results to a directory
--verbose / --quiet Display verbose output to stdout and/or stderr
during execution of this action. Or silence output
if execution is successful (silence is golden).
--example-data PATH Write example data and exit.
--citations Show citations and exit.
--use-cache DIRECTORY Specify the cache to be used for the intermediate
work of this action. If not provided, the default
cache under $TMP/qiime2/ will be used.
IMPORTANT FOR HPC USERS: If you are on an HPC system
and are using parallel execution it is important to
set this to a location that is globally accessible
to all nodes in the cluster.
--help Show this message and exit.
Import:
from qiime2.plugins.rescript.methods import get_ncbi_data_protein
Docstring:
Download, parse, and import NCBI protein sequences and taxonomies
Download and import sequences from the NCBI Protein database and download,
parse, and import the corresponding taxonomies from the NCBI Taxonomy
database. Please be aware of the NCBI Disclaimer and Copyright notice
(https://www.ncbi.nlm.nih.gov/home/about/policies/), particularly "run
retrieval scripts on weekends or between 9 pm and 5 am Eastern Time
weekdays for any series of more than 100 requests". As a rough guide, if
you are downloading more than 125,000 sequences, only run this method at
those times. The NCBI servers can be capricious but reward polite
persistence. If the download fails and gives you a message that contains
the words "Last exception was ReadTimeout", you should probably try again,
maybe with more connections. If it fails for any other reason, please
create an issue at https://github.com/bokulich-lab/RESCRIPt.
Parameters
----------
query : Str, optional
Query on the NCBI Protein database
accession_ids : Metadata, optional
List of accession ids for sequences in the NCBI Protein database.
ranks : List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')], optional
List of taxonomic ranks for building a taxonomy from the NCBI Taxonomy
database. [default: 'kingdom', 'phylum', 'class', 'order', 'family',
'genus', 'species']
rank_propagation : Bool, optional
Propagate known ranks to missing ranks if true
logging_level : Str % Choices('DEBUG', 'INFO', 'WARNING', 'ERROR', 'CRITICAL'), optional
Logging level, set to INFO for download progress or DEBUG for copious
verbosity
n_jobs : Int % Range(1, None), optional
Number of concurrent download connections. More is faster until you run
out of bandwidth.
Returns
-------
sequences : FeatureData[ProteinSequence]
Sequences from the NCBI Protein database
taxonomy : FeatureData[Taxonomy]
Taxonomies from the NCBI Taxonomy database
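A call through the Python API might look like the sketch below. The import path, parameter names, rank choices, default ranks, and output names come from the docstring above. The helper names `check_ranks` and `fetch_proteins` are hypothetical, and the sketch assumes QIIME 2 with the RESCRIPt plugin is installed in the active environment.

```python
# Rank choices and default ranks copied from the docstring above.
ALLOWED_RANKS = (
    "domain", "superkingdom", "kingdom", "subkingdom", "superphylum",
    "phylum", "subphylum", "infraphylum", "superclass", "class",
    "subclass", "infraclass", "cohort", "superorder", "order", "suborder",
    "infraorder", "parvorder", "superfamily", "family", "subfamily",
    "tribe", "subtribe", "genus", "subgenus", "species group",
    "species subgroup", "species", "subspecies", "forma",
)
DEFAULT_RANKS = ["kingdom", "phylum", "class", "order", "family",
                 "genus", "species"]


def check_ranks(ranks):
    """Reject rank names that are not among the documented choices."""
    unknown = [rank for rank in ranks if rank not in ALLOWED_RANKS]
    if unknown:
        raise ValueError(f"Unknown ranks: {unknown}")
    return list(ranks)


def fetch_proteins(query, ranks=None, n_jobs=1, logging_level="INFO"):
    """Hypothetical wrapper: download sequences and taxonomy for a query."""
    # Imported lazily so this sketch can be loaded without QIIME 2 installed.
    from qiime2.plugins.rescript.methods import get_ncbi_data_protein

    results = get_ncbi_data_protein(
        query=query,
        ranks=check_ranks(ranks or DEFAULT_RANKS),
        rank_propagation=True,
        logging_level=logging_level,
        n_jobs=n_jobs,
    )
    # Outputs are named as in the Returns section above.
    return results.sequences, results.taxonomy

# Usage (requires QIIME 2 and network access; the query is a placeholder):
# sequences, taxonomy = fetch_proteins("<your NCBI Protein query>")
# sequences.save("sequences.qza")
# taxonomy.save("taxonomy.qza")
```

Validating ranks locally is optional, since the type declaration `Str % Choices(...)` suggests QIIME 2 checks them as well, but it fails earlier and before any network request is made.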