Warning

This site has been replaced by the new QIIME 2 “amplicon distribution” documentation, as of the 2025.4 release of QIIME 2. You can still access the content from the “old docs” here for the QIIME 2 2024.10 and earlier releases, but we recommend that you transition to the new documentation at https://amplicon-docs.qiime2.org. Content on this site is no longer updated and may be out of date.


dereplicate: Dereplicate features with matching sequences and taxonomies.

Citations: Torbjørn Rognes, Tomáš Flouri, Ben Nichols, Christopher Quince, and Frédéric Mahé. VSEARCH: a versatile open source tool for metagenomics. PeerJ, 4:e2584, 2016. doi:10.7717/peerj.2584.

Command line interface

Docstring:

Usage: qiime rescript dereplicate [OPTIONS]

  Dereplicate FASTA format sequences and taxonomies wherever sequences and
  taxonomies match; duplicated sequences and taxonomies are dereplicated using
  the "mode" parameter to either: retain all sequences that have unique
  taxonomic annotations even if the sequences are duplicates (uniq); or return
  only dereplicated sequences labeled by either the least common ancestor
  (lca) or the most common taxonomic label associated with sequences in that
  cluster (majority). Note: all taxonomy strings will be coerced to semicolon
  delimiters without any leading or trailing spaces. If this is not desired,
  please use 'rescript edit-taxonomy' to make any changes.

Inputs:
  --i-sequences ARTIFACT FeatureData[Sequence]
                          Sequences to be dereplicated              [required]
  --i-taxa ARTIFACT FeatureData[Taxonomy]
                          Taxonomic classifications of sequences to be
                          dereplicated                              [required]
Parameters:
  --p-mode TEXT Choices('uniq', 'lca', 'majority', 'super')
                          How to handle dereplication when sequences map to
                          distinct taxonomies. "uniq" will retain all
                          sequences with unique taxonomic affiliations. "lca"
                          will find the least common ancestor among all taxa
                          sharing a sequence. "majority" will find the most
                          common taxonomic label associated with that
                          sequence; note that in the event of a tie,
                          "majority" will pick the winner arbitrarily. "super"
                          finds the LCA consensus while giving preference to
                          majority labels and collapsing substrings into
                          superstrings. For example, when a more specific
                          taxonomy does not contradict a less specific
                          taxonomy, the more specific is chosen. That is,
                          "g__Faecalibacterium; s__prausnitzii" will be
                          preferred over "g__Faecalibacterium; s__".
                                                             [default: 'uniq']
  --p-perc-identity PROPORTION Range(0, 1, inclusive_start=False,
    inclusive_end=True)   The percent identity at which clustering should be
                          performed. This parameter maps to vsearch's --id
                          parameter.                            [default: 1.0]
  --p-threads INTEGER     Number of computation threads to use (1 to 256).
    Range(1, 256)         The number of threads should be less than or equal
                          to the number of available CPU cores.   [default: 1]
  --p-rank-handles VALUES... List[Str % Choices('disable')] | List[Str %
    Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum',
    'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass',
    'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder',
    'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe',
    'genus', 'subgenus', 'species group', 'species subgroup', 'species',
    'subspecies', 'forma')]
                          Specifies the set of rank handles used to backfill
                          missing ranks in the resulting dereplicated
                          taxonomy. Use 'disable' to prevent applying
                          'rank-handles'.
[default: ['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']]
  --p-derep-prefix / --p-no-derep-prefix
                          Merge sequences with identical prefixes. If a
                          sequence is identical to the prefix of two or more
                          longer sequences, it is clustered with the shortest
                          of them. If they are equally long, it is clustered
                          with the most abundant.             [default: False]
Outputs:
  --o-dereplicated-sequences ARTIFACT FeatureData[Sequence]
                                                                    [required]
  --o-dereplicated-taxa ARTIFACT FeatureData[Taxonomy]
                                                                    [required]
Miscellaneous:
  --output-dir PATH       Output unspecified results to a directory
  --verbose / --quiet     Display verbose output to stdout and/or stderr
                          during execution of this action. Or silence output
                          if execution is successful (silence is golden).
  --example-data PATH     Write example data and exit.
  --citations             Show citations and exit.
  --use-cache DIRECTORY   Specify the cache to be used for the intermediate
                          work of this action. If not provided, the default
                          cache under $TMP/qiime2/ will be used.
                          IMPORTANT FOR HPC USERS: If you are on an HPC system
                          and are using parallel execution it is important to
                          set this to a location that is globally accessible
                          to all nodes in the cluster.
  --help                  Show this message and exit.
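
The mode descriptions above can be made concrete with a toy sketch. This is not RESCRIPt's implementation: it groups only exactly identical sequences (no vsearch clustering), leaves out the "super" mode and rank-handle backfilling, and every function name and record in it is hypothetical. Ties in the majority vote are broken by first occurrence here, whereas the tool documents its tie-breaking as arbitrary.

```python
from collections import Counter, defaultdict


def normalize(taxonomy):
    """Join ranks with ';' and strip surrounding spaces from each rank."""
    return ";".join(rank.strip() for rank in taxonomy.split(";"))


def lca(taxonomies):
    """Return the longest leading run of ranks shared by every taxonomy."""
    shared = []
    for ranks in zip(*(t.split(";") for t in taxonomies)):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return ";".join(shared)


def majority(taxonomies):
    """Return the most frequent label (first seen wins on a tie)."""
    return Counter(taxonomies).most_common(1)[0][0]


def toy_dereplicate(records, mode="uniq"):
    """Dereplicate (sequence, taxonomy) pairs that share an identical sequence."""
    groups = defaultdict(list)
    for seq, tax in records:
        groups[seq].append(normalize(tax))

    result = []
    for seq, taxa in groups.items():
        if mode == "uniq":
            # Keep one copy per distinct taxonomy, preserving input order.
            result.extend((seq, tax) for tax in dict.fromkeys(taxa))
        elif mode == "lca":
            result.append((seq, lca(taxa)))
        elif mode == "majority":
            result.append((seq, majority(taxa)))
        else:
            raise ValueError(f"unsupported mode in this sketch: {mode!r}")
    return result


# Toy records: three copies of one sequence with two distinct labels.
records = [
    ("ACGT", "g__Faecalibacterium; s__prausnitzii"),
    ("ACGT", "g__Faecalibacterium; s__prausnitzii"),
    ("ACGT", "g__Faecalibacterium; s__"),
    ("TTGA", "g__Blautia; s__obeum"),
]

for mode in ("uniq", "lca", "majority"):
    print(mode, toy_dereplicate(records, mode))
```

In this toy input, "uniq" keeps both labels for the duplicated sequence, "lca" truncates the label to the shared genus, and "majority" keeps the label that occurs twice.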

Artifact API

Import:

from qiime2.plugins.rescript.methods import dereplicate

Docstring:

Dereplicate features with matching sequences and taxonomies.

Dereplicate FASTA format sequences and taxonomies wherever sequences and
taxonomies match; duplicated sequences and taxonomies are dereplicated
using the "mode" parameter to either: retain all sequences that have unique
taxonomic annotations even if the sequences are duplicates (uniq); or
return only dereplicated sequences labeled by either the least common
ancestor (lca) or the most common taxonomic label associated with sequences
in that cluster (majority). Note: all taxonomy strings will be coerced to
semicolon delimiters without any leading or trailing spaces. If this is not
desired, please use 'rescript edit-taxonomy' to make any changes.

Parameters
----------
sequences : FeatureData[Sequence]
    Sequences to be dereplicated
taxa : FeatureData[Taxonomy]
    Taxonomic classifications of sequences to be dereplicated
mode : Str % Choices('uniq', 'lca', 'majority', 'super'), optional
    How to handle dereplication when sequences map to distinct taxonomies.
    "uniq" will retain all sequences with unique taxonomic affiliations.
    "lca" will find the least common ancestor among all taxa sharing a
    sequence. "majority" will find the most common taxonomic label
    associated with that sequence; note that in the event of a tie,
    "majority" will pick the winner arbitrarily. "super" finds the LCA
    consensus while giving preference to majority labels and collapsing
    substrings into superstrings. For example, when a more specific
    taxonomy does not contradict a less specific taxonomy, the more
    specific is chosen. That is, "g__Faecalibacterium; s__prausnitzii"
    will be preferred over "g__Faecalibacterium; s__".
perc_identity : Float % Range(0, 1, inclusive_start=False, inclusive_end=True), optional
    The percent identity at which clustering should be performed. This
    parameter maps to vsearch's --id parameter.
threads : Int % Range(1, 256), optional
    Number of computation threads to use (1 to 256). The number of threads
    should be less than or equal to the number of available CPU cores.
rank_handles : List[Str % Choices('disable')] | List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')], optional
    Specifies the set of rank handles used to backfill missing ranks in the
    resulting dereplicated taxonomy. Use 'disable' to prevent applying
    'rank_handles'.
derep_prefix : Bool, optional
    Merge sequences with identical prefixes. If a sequence is identical to
    the prefix of two or more longer sequences, it is clustered with the
    shortest of them. If they are equally long, it is clustered with the
    most abundant.

Returns
-------
dereplicated_sequences : FeatureData[Sequence]
dereplicated_taxa : FeatureData[Taxonomy]
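
As a rough illustration of how the action might be called from Python, the sketch below loads two artifacts, runs `dereplicate`, and saves both outputs. The helper names `check_params` and `run_dereplicate` and all file names are hypothetical. The range checks only mirror the constraints documented above and are not part of RESCRIPt. QIIME 2 is imported inside the function so the module can load without it installed.

```python
VALID_MODES = ("uniq", "lca", "majority", "super")


def check_params(mode="uniq", perc_identity=1.0, threads=1):
    """Validate values against the documented choices and ranges."""
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    if not 0 < perc_identity <= 1:
        raise ValueError("perc_identity must be greater than 0 and at most 1")
    if not 1 <= threads <= 256:
        raise ValueError("threads must be between 1 and 256")
    return {"mode": mode, "perc_identity": perc_identity, "threads": threads}


def run_dereplicate(seqs_path, taxa_path, out_seqs_path, out_taxa_path, **params):
    """Load inputs, call rescript's dereplicate, and save both outputs."""
    # Lazy imports: these need a QIIME 2 environment with RESCRIPt installed.
    from qiime2 import Artifact
    from qiime2.plugins.rescript.methods import dereplicate

    kwargs = check_params(**params)
    sequences = Artifact.load(seqs_path)  # FeatureData[Sequence]
    taxa = Artifact.load(taxa_path)       # FeatureData[Taxonomy]

    results = dereplicate(sequences=sequences, taxa=taxa, **kwargs)
    results.dereplicated_sequences.save(out_seqs_path)
    results.dereplicated_taxa.save(out_taxa_path)
    return results
```

A call could then look like `run_dereplicate("seqs.qza", "taxa.qza", "derep-seqs.qza", "derep-taxa.qza", mode="lca", threads=4)`, where the four paths are placeholders.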
