Docstring:
Usage: qiime rescript dereplicate [OPTIONS]
Dereplicate FASTA format sequences and taxonomies wherever sequences and
taxonomies match. When duplicated sequences carry different taxonomies, the
"mode" parameter determines the outcome: retain all sequences that have
unique taxonomic annotations even if the sequences are duplicates (uniq); or
return only dereplicated sequences labeled by the least common ancestor
(lca), by the most common taxonomic label associated with sequences in that
cluster (majority), or by an LCA consensus that prefers majority and more
specific labels (super). Note: all taxonomy strings will be coerced to
semicolon delimiters without any leading or trailing spaces. If this is not
desired, please use 'rescript edit-taxonomy' to make any changes.
Inputs:
--i-sequences ARTIFACT FeatureData[Sequence]
Sequences to be dereplicated [required]
--i-taxa ARTIFACT FeatureData[Taxonomy]
Taxonomic classifications of sequences to be
dereplicated [required]
Parameters:
--p-mode TEXT Choices('uniq', 'lca', 'majority', 'super')
How to handle dereplication when sequences map to
distinct taxonomies. "uniq" will retain all
sequences with unique taxonomic affiliations. "lca"
will find the least common ancestor among all taxa
sharing a sequence. "majority" will find the most
common taxonomic label associated with that
sequence; note that in the event of a tie,
"majority" will pick the winner arbitrarily. "super"
finds the LCA consensus while giving preference to
majority labels and collapsing substrings into
superstrings. For example, when a more specific
taxonomy does not contradict a less specific
taxonomy, the more specific is chosen. That is,
"g__Faecalibacterium; s__prausnitzii" will be
preferred over "g__Faecalibacterium; s__".
[default: 'uniq']
--p-perc-identity PROPORTION Range(0, 1, inclusive_start=False, inclusive_end=True)
The percent identity, expressed as a proportion
greater than 0 and at most 1, at which clustering
should be performed. This parameter maps to
vsearch's --id parameter. [default: 1.0]
--p-threads INTEGER Range(1, 256)
Number of computation threads to use (1 to 256). The
number of threads should be less than or equal to
the number of available CPU cores. [default: 1]
--p-rank-handles VALUES... List[Str % Choices('disable')] | List[Str %
Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum',
'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass',
'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder',
'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe',
'genus', 'subgenus', 'species group', 'species subgroup', 'species',
'subspecies', 'forma')]
Specifies the set of rank handles used to backfill
missing ranks in the resulting dereplicated
taxonomy. Use 'disable' to prevent applying
'rank-handles'.
[default: ['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']]
--p-derep-prefix / --p-no-derep-prefix
Merge sequences with identical prefixes. If a
sequence is identical to the prefix of two or more
longer sequences, it is clustered with the shortest
of them. If they are equally long, it is clustered
with the most abundant. [default: False]
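The prefix rule described for --p-derep-prefix can be illustrated with a small pure-Python sketch. This is not vsearch's or RESCRIPt's implementation; the function name `assign_prefix_targets` and the abundance dictionary are hypothetical, introduced only to show the "shortest longer sequence wins, ties go to the most abundant" rule in a single pass.

```python
def assign_prefix_targets(abundances):
    """Map each sequence to the longer sequence it would merge into.

    `abundances` maps a sequence string to its count. A sequence that is
    not a prefix of any longer sequence maps to itself.
    """
    targets = {}
    for seq in abundances:
        # Candidates: strictly longer sequences that start with `seq`.
        candidates = [
            other for other in abundances
            if len(other) > len(seq) and other.startswith(seq)
        ]
        if not candidates:
            targets[seq] = seq
            continue
        # Shortest candidate wins; equal lengths go to the most abundant.
        targets[seq] = min(
            candidates, key=lambda other: (len(other), -abundances[other])
        )
    return targets


# Hypothetical toy data: sequence -> abundance.
example = {"ACGT": 1, "ACGTAA": 2, "ACGTCC": 5, "ACGTAAGG": 9, "TTTT": 3}
for seq, target in assign_prefix_targets(example).items():
    print(seq, "->", target)
```

In this toy data "ACGT" is a prefix of two equally long candidates, so it goes to the more abundant one, "ACGTCC".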
Outputs:
--o-dereplicated-sequences ARTIFACT FeatureData[Sequence]
[required]
--o-dereplicated-taxa ARTIFACT FeatureData[Taxonomy]
[required]
Miscellaneous:
--output-dir PATH Output unspecified results to a directory
--verbose / --quiet Display verbose output to stdout and/or stderr
during execution of this action. Or silence output
if execution is successful (silence is golden).
--example-data PATH Write example data and exit.
--citations Show citations and exit.
--use-cache DIRECTORY Specify the cache to be used for the intermediate
work of this action. If not provided, the default
cache under $TMP/qiime2/ will be used.
IMPORTANT FOR HPC USERS: If you are on an HPC system
and are using parallel execution it is important to
set this to a location that is globally accessible
to all nodes in the cluster.
--help Show this message and exit.
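As a convenience for scripted runs, the options listed above can be assembled into an argument list from Python. The following is a minimal sketch: the helper name `build_dereplicate_command` and the file names are hypothetical, and only flags shown in this help text are used.

```python
import shlex


def build_dereplicate_command(sequences, taxa, out_sequences, out_taxa,
                              mode="uniq", perc_identity=1.0, threads=1,
                              derep_prefix=False):
    """Return the argument list for `qiime rescript dereplicate`."""
    return [
        "qiime", "rescript", "dereplicate",
        "--i-sequences", sequences,
        "--i-taxa", taxa,
        "--p-mode", mode,
        "--p-perc-identity", str(perc_identity),
        "--p-threads", str(threads),
        # Boolean parameters are passed as a flag / no-flag pair.
        "--p-derep-prefix" if derep_prefix else "--p-no-derep-prefix",
        "--o-dereplicated-sequences", out_sequences,
        "--o-dereplicated-taxa", out_taxa,
    ]


# Hypothetical file names, for illustration only.
command = build_dereplicate_command(
    "seqs.qza", "taxa.qza", "derep-seqs.qza", "derep-taxa.qza",
    mode="lca", threads=4,
)
print(shlex.join(command))
```

The resulting list can be handed to `subprocess.run(command, check=True)` in an environment where QIIME 2 and RESCRIPt are installed.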
Import:
from qiime2.plugins.rescript.methods import dereplicate
Docstring:
Dereplicate features with matching sequences and taxonomies.
Dereplicate FASTA format sequences and taxonomies wherever sequences and
taxonomies match. When duplicated sequences carry different taxonomies, the
"mode" parameter determines the outcome: retain all sequences that have
unique taxonomic annotations even if the sequences are duplicates (uniq); or
return only dereplicated sequences labeled by the least common ancestor
(lca), by the most common taxonomic label associated with sequences in that
cluster (majority), or by an LCA consensus that prefers majority and more
specific labels (super). Note: all taxonomy strings will be coerced to
semicolon delimiters without any leading or trailing spaces. If this is not
desired, please use 'rescript edit-taxonomy' to make any changes.
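The note about taxonomy strings can be read as the following normalization. This is a sketch of one plausible interpretation (ranks joined by a bare semicolon, with whitespace stripped from each rank), not a copy of RESCRIPt's code; `normalize_taxonomy` is a hypothetical name.

```python
def normalize_taxonomy(taxon):
    """Strip whitespace around each rank and rejoin with ';'."""
    return ";".join(rank.strip() for rank in taxon.split(";"))


print(normalize_taxonomy("k__Bacteria; p__Firmicutes ; g__Faecalibacterium"))
# -> k__Bacteria;p__Firmicutes;g__Faecalibacterium
```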
Parameters
----------
sequences : FeatureData[Sequence]
Sequences to be dereplicated
taxa : FeatureData[Taxonomy]
Taxonomic classifications of sequences to be dereplicated
mode : Str % Choices('uniq', 'lca', 'majority', 'super'), optional
How to handle dereplication when sequences map to distinct taxonomies.
"uniq" will retain all sequences with unique taxonomic affiliations.
"lca" will find the least common ancestor among all taxa sharing a
sequence. "majority" will find the most common taxonomic label
associated with that sequence; note that in the event of a tie,
"majority" will pick the winner arbitrarily. "super" finds the LCA
consensus while giving preference to majority labels and collapsing
substrings into superstrings. For example, when a more specific
taxonomy does not contradict a less specific taxonomy, the more
specific is chosen. That is, "g__Faecalibacterium; s__prausnitzii"
will be preferred over "g__Faecalibacterium; s__".
perc_identity : Float % Range(0, 1, inclusive_start=False, inclusive_end=True), optional
The percent identity, expressed as a proportion greater than 0 and at
most 1, at which clustering should be performed. This parameter maps to
vsearch's --id parameter.
threads : Int % Range(1, 256), optional
Number of computation threads to use (1 to 256). The number of threads
should be less than or equal to the number of available CPU cores.
rank_handles : List[Str % Choices('disable')] | List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')], optional
Specifies the set of rank handles used to backfill missing ranks in the
resulting dereplicated taxonomy. Use 'disable' to prevent applying
'rank_handles'.
derep_prefix : Bool, optional
Merge sequences with identical prefixes. If a sequence is identical to
the prefix of two or more longer sequences, it is clustered with the
shortest of them. If they are equally long, it is clustered with the
most abundant.
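To make the difference between the "uniq", "lca", and "majority" modes concrete, here is a small pure-Python sketch that resolves the taxonomy labels attached to one group of identical sequences. It is an illustration under simplifying assumptions, not RESCRIPt's implementation: the function names are hypothetical, labels are compared rank by rank after splitting on semicolons, and the "super" mode is not covered.

```python
from collections import Counter


def split_ranks(taxon):
    """Split a taxonomy string into stripped rank labels."""
    return [rank.strip() for rank in taxon.split(";")]


def resolve_uniq(taxa):
    """Keep each distinct taxonomy once, in first-seen order."""
    return list(dict.fromkeys(";".join(split_ranks(t)) for t in taxa))


def resolve_lca(taxa):
    """Keep the leading ranks on which every label agrees."""
    shared = []
    for level in zip(*(split_ranks(t) for t in taxa)):
        if len(set(level)) != 1:
            break
        shared.append(level[0])
    return ";".join(shared)


def resolve_majority(taxa):
    """Return the most frequent label (ties are not handled specially)."""
    counts = Counter(";".join(split_ranks(t)) for t in taxa)
    return counts.most_common(1)[0][0]


# Labels attached to three copies of the same sequence (toy data).
labels = [
    "g__Faecalibacterium; s__prausnitzii",
    "g__Faecalibacterium; s__prausnitzii",
    "g__Faecalibacterium; s__",
]
print(resolve_uniq(labels))
print(resolve_lca(labels))
print(resolve_majority(labels))
```

With these labels, the uniq sketch keeps both distinct taxonomies, the lca sketch keeps only the shared genus, and the majority sketch keeps the species-level label that occurs twice.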
Returns
-------
dereplicated_sequences : FeatureData[Sequence]
dereplicated_taxa : FeatureData[Taxonomy]
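A minimal sketch of calling the action through the Python API follows. The wrapper `run_dereplicate` and the file names in the usage note are hypothetical; the parameter names and the two result names come from the docstring above. The QIIME 2 imports are deferred into the function so the argument checks can run without QIIME 2 installed.

```python
MODES = ("uniq", "lca", "majority", "super")


def run_dereplicate(sequences_path, taxa_path, mode="uniq",
                    perc_identity=1.0, threads=1, derep_prefix=False):
    """Load two artifacts, dereplicate them, and return both results."""
    # Mirror the documented choices and range before touching QIIME 2.
    if mode not in MODES:
        raise ValueError(f"mode must be one of {MODES}, got {mode!r}")
    if not 0 < perc_identity <= 1:
        raise ValueError("perc_identity must be greater than 0 and at most 1")

    # Deferred imports: these require a QIIME 2 environment with RESCRIPt.
    from qiime2 import Artifact
    from qiime2.plugins.rescript.methods import dereplicate

    results = dereplicate(
        sequences=Artifact.load(sequences_path),
        taxa=Artifact.load(taxa_path),
        mode=mode,
        perc_identity=perc_identity,
        threads=threads,
        derep_prefix=derep_prefix,
    )
    return results.dereplicated_sequences, results.dereplicated_taxa
```

In a QIIME 2 environment, a call such as `seqs, taxa = run_dereplicate("seqs.qza", "taxa.qza", mode="lca")` returns two artifacts, which can then be written out with `seqs.save("derep-seqs.qza")` and `taxa.save("derep-taxa.qza")`.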