Warning
This site has been replaced by the new QIIME 2 “amplicon distribution” documentation, as of the 2025.4 release of QIIME 2. You can still access the content from the “old docs” here for the QIIME 2 2024.10 and earlier releases, but we recommend that you transition to the new documentation at https://amplicon-docs.qiime2.org. Content on this site is no longer updated and may be out of date.
dereplicate: Dereplicate features with matching sequences and taxonomies.
Docstring:
Usage: qiime rescript dereplicate [OPTIONS]

  Dereplicate FASTA format sequences and taxonomies wherever sequences and
  taxonomies match; duplicated sequences and taxonomies are dereplicated
  using the "mode" parameter to either: retain all sequences that have
  unique taxonomic annotations even if the sequences are duplicates (uniq);
  or return only dereplicated sequences labeled by either the least common
  ancestor (lca) or the most common taxonomic label associated with
  sequences in that cluster (majority). Note: all taxonomy strings will be
  coerced to semicolon delimiters without any leading or trailing spaces.
  If this is not desired, please use 'rescript edit-taxonomy' to make any
  changes.

Inputs:
  --i-sequences ARTIFACT FeatureData[Sequence]
      Sequences to be dereplicated  [required]
  --i-taxa ARTIFACT FeatureData[Taxonomy]
      Taxonomic classifications of sequences to be dereplicated  [required]

Parameters:
  --p-mode TEXT Choices('uniq', 'lca', 'majority', 'super')
      How to handle dereplication when sequences map to distinct
      taxonomies. "uniq" will retain all sequences with unique taxonomic
      affiliations. "lca" will find the least common ancestor among all
      taxa sharing a sequence. "majority" will find the most common
      taxonomic label associated with that sequence; note that in the event
      of a tie, "majority" will pick the winner arbitrarily. "super" finds
      the LCA consensus while giving preference to majority labels and
      collapsing substrings into superstrings. For example, when a more
      specific taxonomy does not contradict a less specific taxonomy, the
      more specific is chosen. That is, "g__Faecalibacterium;
      s__prausnitzii", will be preferred over "g__Faecalibacterium; s__"
      [default: 'uniq']
  --p-perc-identity PROPORTION Range(0, 1, inclusive_start=False, inclusive_end=True)
      The percent identity at which clustering should be performed. This
      parameter maps to vsearch's --id parameter.  [default: 1.0]
  --p-threads INTEGER Range(1, 256)
      Number of computation threads to use (1 to 256). The number of
      threads should be lesser or equal to the number of available CPU
      cores.  [default: 1]
  --p-rank-handles VALUES...
      List[Str % Choices('disable')] | List[Str % Choices('domain',
      'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum',
      'subphylum', 'infraphylum', 'superclass', 'class', 'subclass',
      'infraclass', 'cohort', 'superorder', 'order', 'suborder',
      'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily',
      'tribe', 'subtribe', 'genus', 'subgenus', 'species group',
      'species subgroup', 'species', 'subspecies', 'forma')]
      Specifies the set of rank handles used to backfill missing ranks in
      the resulting dereplicated taxonomy. Use 'disable' to prevent
      applying 'rank-handles'.  [default: ['domain', 'phylum', 'class',
      'order', 'family', 'genus', 'species']]
  --p-derep-prefix / --p-no-derep-prefix
      Merge sequences with identical prefixes. If a sequence is identical
      to the prefix of two or more longer sequences, it is clustered with
      the shortest of them. If they are equally long, it is clustered with
      the most abundant.  [default: False]

Outputs:
  --o-dereplicated-sequences ARTIFACT FeatureData[Sequence]
      [required]
  --o-dereplicated-taxa ARTIFACT FeatureData[Taxonomy]
      [required]

Miscellaneous:
  --output-dir PATH
      Output unspecified results to a directory
  --verbose / --quiet
      Display verbose output to stdout and/or stderr during execution of
      this action. Or silence output if execution is successful (silence
      is golden).
  --example-data PATH
      Write example data and exit.
  --citations
      Show citations and exit.
  --use-cache DIRECTORY
      Specify the cache to be used for the intermediate work of this
      action. If not provided, the default cache under $TMP/qiime2/ will
      be used. IMPORTANT FOR HPC USERS: If you are on an HPC system and
      are using parallel execution it is important to set this to a
      location that is globally accessible to all nodes in the cluster.
  --help
      Show this message and exit.
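The difference between the "uniq", "lca", and "majority" modes can be easier to see on a toy example. The following is a minimal pure-Python sketch of those three behaviors as the help text describes them, applied to the taxonomy labels attached to one set of identical sequences. It is not RESCRIPt's implementation: the helper names (`normalize`, `uniq`, `lca`, `majority`) are hypothetical, the tie-breaking rule is only one possible "arbitrary" choice, and the "super" mode is not covered.

```python
from collections import Counter


def normalize(taxonomy):
    """Coerce a taxonomy string to semicolon delimiters with no
    leading or trailing spaces around each rank."""
    return ";".join(part.strip() for part in taxonomy.split(";"))


def uniq(taxonomies):
    """Keep every distinct taxonomy label (illustrative 'uniq' behavior)."""
    return sorted(set(taxonomies))


def lca(taxonomies):
    """Return the longest rank prefix shared by all labels
    (illustrative 'lca' behavior)."""
    shared = []
    for ranks in zip(*(t.split(";") for t in taxonomies)):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return ";".join(shared)


def majority(taxonomies):
    """Return the most common label; on a tie, Counter keeps the first
    one seen, which stands in for an arbitrary pick."""
    return Counter(taxonomies).most_common(1)[0][0]


# Hypothetical labels attached to three identical sequences.
labels = [normalize(t) for t in [
    "g__Faecalibacterium; s__prausnitzii",
    "g__Faecalibacterium;s__prausnitzii",
    "g__Faecalibacterium; s__",
]]

uniq_result = uniq(labels)
lca_result = lca(labels)
majority_result = majority(labels)
```

Under these assumptions, "uniq" keeps both distinct labels, "lca" falls back to the genus because the species annotations disagree, and "majority" returns the label that two of the three records carry.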
Import:
from qiime2.plugins.rescript.methods import dereplicate
Docstring:
Dereplicate features with matching sequences and taxonomies.

Dereplicate FASTA format sequences and taxonomies wherever sequences and
taxonomies match; duplicated sequences and taxonomies are dereplicated
using the "mode" parameter to either: retain all sequences that have unique
taxonomic annotations even if the sequences are duplicates (uniq); or
return only dereplicated sequences labeled by either the least common
ancestor (lca) or the most common taxonomic label associated with sequences
in that cluster (majority). Note: all taxonomy strings will be coerced to
semicolon delimiters without any leading or trailing spaces. If this is not
desired, please use 'rescript edit-taxonomy' to make any changes.

Parameters
----------
sequences : FeatureData[Sequence]
    Sequences to be dereplicated
taxa : FeatureData[Taxonomy]
    Taxonomic classifications of sequences to be dereplicated
mode : Str % Choices('uniq', 'lca', 'majority', 'super'), optional
    How to handle dereplication when sequences map to distinct taxonomies.
    "uniq" will retain all sequences with unique taxonomic affiliations.
    "lca" will find the least common ancestor among all taxa sharing a
    sequence. "majority" will find the most common taxonomic label
    associated with that sequence; note that in the event of a tie,
    "majority" will pick the winner arbitrarily. "super" finds the LCA
    consensus while giving preference to majority labels and collapsing
    substrings into superstrings. For example, when a more specific
    taxonomy does not contradict a less specific taxonomy, the more
    specific is chosen. That is, "g__Faecalibacterium; s__prausnitzii",
    will be preferred over "g__Faecalibacterium; s__"
perc_identity : Float % Range(0, 1, inclusive_start=False, inclusive_end=True), optional
    The percent identity at which clustering should be performed. This
    parameter maps to vsearch's --id parameter.
threads : Int % Range(1, 256), optional
    Number of computation threads to use (1 to 256). The number of threads
    should be lesser or equal to the number of available CPU cores.
rank_handles : List[Str % Choices('disable')] | List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')], optional
    Specifies the set of rank handles used to backfill missing ranks in the
    resulting dereplicated taxonomy. Use 'disable' to prevent applying
    'rank_handles'.
derep_prefix : Bool, optional
    Merge sequences with identical prefixes. If a sequence is identical to
    the prefix of two or more longer sequences, it is clustered with the
    shortest of them. If they are equally long, it is clustered with the
    most abundant.

Returns
-------
dereplicated_sequences : FeatureData[Sequence]
dereplicated_taxa : FeatureData[Taxonomy]
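As a rough illustration of how the Python API documented above might be called, the sketch below wraps `dereplicate` in a small helper. The function name `run_dereplicate` and all file paths are hypothetical; the keyword arguments and result attribute names are taken from the docstring above, and the explicit values mirror the defaults listed in the CLI help. The QIIME 2 imports sit inside the function so that the module can be loaded without QIIME 2 installed; actually calling the helper assumes an environment with QIIME 2 and the RESCRIPt plugin.

```python
def run_dereplicate(seqs_path, taxa_path, out_seqs_path, out_taxa_path,
                    mode="uniq"):
    """Load two artifacts, dereplicate them, and save both outputs.

    All paths are caller-supplied placeholders for .qza files.
    """
    # Imported lazily: these require a QIIME 2 environment with RESCRIPt.
    from qiime2 import Artifact
    from qiime2.plugins.rescript.methods import dereplicate

    sequences = Artifact.load(seqs_path)  # FeatureData[Sequence]
    taxa = Artifact.load(taxa_path)       # FeatureData[Taxonomy]

    results = dereplicate(
        sequences=sequences,
        taxa=taxa,
        mode=mode,            # 'uniq', 'lca', 'majority', or 'super'
        perc_identity=1.0,    # maps to vsearch's --id parameter
        threads=1,
        derep_prefix=False,
    )

    # Output names follow the "Returns" section of the docstring.
    results.dereplicated_sequences.save(out_seqs_path)
    results.dereplicated_taxa.save(out_taxa_path)
    return results
```

A call such as `run_dereplicate("seqs.qza", "taxa.qza", "derep-seqs.qza", "derep-taxa.qza", mode="lca")` would then write the two dereplicated artifacts, assuming the input files exist.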