
dereplicate: Dereplicate features with matching sequences and taxonomies.

Citations
  • Torbjørn Rognes, Tomáš Flouri, Ben Nichols, Christopher Quince, and Frédéric Mahé. Vsearch: a versatile open source tool for metagenomics. PeerJ, 4:e2584, 2016. doi:10.7717/peerj.2584.

Docstring:

Usage: qiime rescript dereplicate [OPTIONS]

  Dereplicate FASTA format sequences and taxonomies wherever sequences and
  taxonomies match. The "mode" parameter controls how duplicated sequences
  with differing taxonomies are handled: retain all sequences that have
  unique taxonomic annotations even if the sequences are duplicates (uniq);
  or return only dereplicated sequences labeled by the least common ancestor
  (lca), by the most common taxonomic label associated with sequences in
  that cluster (majority), or by an LCA consensus that prefers majority
  labels and more specific taxonomies (super). Note: all taxonomy strings
  will be coerced to semicolon delimiters without any leading or trailing
  spaces. If this is not desired, please use 'rescript edit-taxonomy' to
  make any changes.

Inputs:
  --i-sequences ARTIFACT FeatureData[Sequence]
                          Sequences to be dereplicated              [required]
  --i-taxa ARTIFACT FeatureData[Taxonomy]
                          Taxonomic classifications of sequences to be
                          dereplicated                              [required]
Parameters:
  --p-mode TEXT Choices('uniq', 'lca', 'majority', 'super')
                          How to handle dereplication when sequences map to
                          distinct taxonomies. "uniq" will retain all
                          sequences with unique taxonomic affiliations. "lca"
                          will find the least common ancestor among all taxa
                          sharing a sequence. "majority" will find the most
                          common taxonomic label associated with that
                          sequence; note that in the event of a tie,
                          "majority" will pick the winner arbitrarily. "super"
                          finds the LCA consensus while giving preference to
                          majority labels and collapsing substrings into
                          superstrings. For example, when a more specific
                          taxonomy does not contradict a less specific
                          taxonomy, the more specific is chosen. That is,
                          "g__Faecalibacterium; s__prausnitzii" will be
                          preferred over "g__Faecalibacterium; s__".
                                                             [default: 'uniq']
  --p-perc-identity PROPORTION Range(0, 1, inclusive_start=False,
    inclusive_end=True)   The percent identity at which clustering should be
                          performed. This parameter maps to vsearch's --id
                          parameter.                            [default: 1.0]
  --p-threads INTEGER     Number of computation threads to use (1 to 256).
    Range(1, 256)         The number of threads should be less than or equal
                          to the number of available CPU cores.   [default: 1]
  --p-rank-handles VALUES... List[Str % Choices('disable')] | List[Str %
    Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum',
    'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass',
    'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder',
    'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe',
    'genus', 'subgenus', 'species group', 'species subgroup', 'species',
    'subspecies', 'forma')]
                          Specifies the set of rank handles used to backfill
                          missing ranks in the resulting dereplicated
                          taxonomy. Use 'disable' to prevent applying
                          'rank-handles'.
[default: ['domain', 'phylum', 'class', 'order', 'family', 'genus', 'species']]
  --p-derep-prefix / --p-no-derep-prefix
                          Merge sequences with identical prefixes. If a
                          sequence is identical to the prefix of two or more
                          longer sequences, it is clustered with the shortest
                          of them. If they are equally long, it is clustered
                          with the most abundant.             [default: False]
Outputs:
  --o-dereplicated-sequences ARTIFACT FeatureData[Sequence]
                                                                    [required]
  --o-dereplicated-taxa ARTIFACT FeatureData[Taxonomy]
                                                                    [required]
Miscellaneous:
  --output-dir PATH       Output unspecified results to a directory
  --verbose / --quiet     Display verbose output to stdout and/or stderr
                          during execution of this action. Or silence output
                          if execution is successful (silence is golden).
  --example-data PATH     Write example data and exit.
  --citations             Show citations and exit.
  --use-cache DIRECTORY   Specify the cache to be used for the intermediate
                          work of this action. If not provided, the default
                          cache under $TMP/qiime2/ will be used.
                          IMPORTANT FOR HPC USERS: If you are on an HPC system
                          and are using parallel execution it is important to
                          set this to a location that is globally accessible
                          to all nodes in the cluster.
  --help                  Show this message and exit.
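
The options above can be assembled programmatically when the command is launched from a script. The following is a minimal sketch, not part of RESCRIPt: the helper name `build_dereplicate_command` and the `.qza` file names are hypothetical, and only flags listed in the help text above are used. The check on `perc_identity` mirrors the documented `Range(0, 1, inclusive_start=False, inclusive_end=True)`.

```python
import shlex


def build_dereplicate_command(sequences, taxa, out_sequences, out_taxa,
                              mode="uniq", perc_identity=1.0, threads=1,
                              derep_prefix=False):
    """Return the argument list for `qiime rescript dereplicate`.

    Hypothetical helper: it only arranges flags shown in the help text.
    """
    if mode not in ("uniq", "lca", "majority", "super"):
        raise ValueError(f"unsupported mode: {mode!r}")
    # Documented range excludes 0 and includes 1.
    if not 0 < perc_identity <= 1:
        raise ValueError("perc_identity must be in (0, 1]")
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return [
        "qiime", "rescript", "dereplicate",
        "--i-sequences", sequences,
        "--i-taxa", taxa,
        "--p-mode", mode,
        "--p-perc-identity", str(perc_identity),
        "--p-threads", str(threads),
        "--p-derep-prefix" if derep_prefix else "--p-no-derep-prefix",
        "--o-dereplicated-sequences", out_sequences,
        "--o-dereplicated-taxa", out_taxa,
    ]


# File names below are placeholders, not files shipped with the plugin.
cmd = build_dereplicate_command(
    "seqs.qza", "taxa.qza", "derep-seqs.qza", "derep-taxa.qza",
    mode="lca", threads=4,
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` in an environment where QIIME 2 and RESCRIPt are installed.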

Import:

from qiime2.plugins.rescript.methods import dereplicate

Docstring:

Dereplicate features with matching sequences and taxonomies.

Dereplicate FASTA format sequences and taxonomies wherever sequences and
taxonomies match. The "mode" parameter controls how duplicated sequences
with differing taxonomies are handled: retain all sequences that have
unique taxonomic annotations even if the sequences are duplicates (uniq);
or return only dereplicated sequences labeled by the least common ancestor
(lca), by the most common taxonomic label associated with sequences in that
cluster (majority), or by an LCA consensus that prefers majority labels and
more specific taxonomies (super). Note: all taxonomy strings will be
coerced to semicolon delimiters without any leading or trailing spaces. If
this is not desired, please use 'rescript edit-taxonomy' to make any
changes.
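
To make the uniq, lca, and majority behaviors described above concrete, here is a toy sketch in plain Python. It is not RESCRIPt's implementation: it groups only exactly identical sequences (as with a percent identity of 1.0, without vsearch), the function names and records are illustrative assumptions, and the delimiter handling is one reading of "semicolon delimiters without any leading or trailing spaces".

```python
from collections import Counter, defaultdict


def normalize(taxonomy):
    """Strip spaces around each rank and rejoin with bare semicolons."""
    return ";".join(rank.strip() for rank in taxonomy.split(";"))


def lca(taxonomies):
    """Return the longest leading run of ranks shared by all taxonomies."""
    shared = []
    for ranks in zip(*(t.split(";") for t in taxonomies)):
        if len(set(ranks)) != 1:
            break
        shared.append(ranks[0])
    return ";".join(shared)


def majority(taxonomies):
    """Return the most common label (ties: first one encountered here)."""
    return Counter(taxonomies).most_common(1)[0][0]


def dereplicate_toy(records, mode="uniq"):
    """records: iterable of (sequence, taxonomy) pairs."""
    clusters = defaultdict(list)
    for seq, tax in records:
        clusters[seq].append(normalize(tax))
    result = []
    for seq, taxa in clusters.items():
        if mode == "uniq":
            # Keep one copy of the sequence per distinct taxonomy.
            result.extend((seq, t) for t in dict.fromkeys(taxa))
        elif mode == "lca":
            result.append((seq, lca(taxa)))
        elif mode == "majority":
            result.append((seq, majority(taxa)))
        else:
            raise ValueError(f"mode not covered by this sketch: {mode!r}")
    return result


# Made-up records for illustration only.
records = [
    ("ACGT", "g__Faecalibacterium; s__prausnitzii"),
    ("ACGT", "g__Faecalibacterium; s__prausnitzii"),
    ("ACGT", "g__Faecalibacterium; s__"),
    ("TTGA", "g__Blautia; s__obeum"),
]
for mode in ("uniq", "lca", "majority"):
    print(mode, dereplicate_toy(records, mode))
```

With these records, "uniq" keeps two entries for the duplicated sequence, "lca" collapses them to the shared genus, and "majority" keeps the label seen twice.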

Parameters
----------
sequences : FeatureData[Sequence]
    Sequences to be dereplicated
taxa : FeatureData[Taxonomy]
    Taxonomic classifications of sequences to be dereplicated
mode : Str % Choices('uniq', 'lca', 'majority', 'super'), optional
    How to handle dereplication when sequences map to distinct taxonomies.
    "uniq" will retain all sequences with unique taxonomic affiliations.
    "lca" will find the least common ancestor among all taxa sharing a
    sequence. "majority" will find the most common taxonomic label
    associated with that sequence; note that in the event of a tie,
    "majority" will pick the winner arbitrarily. "super" finds the LCA
    consensus while giving preference to majority labels and collapsing
    substrings into superstrings. For example, when a more specific
    taxonomy does not contradict a less specific taxonomy, the more
    specific is chosen. That is, "g__Faecalibacterium; s__prausnitzii"
    will be preferred over "g__Faecalibacterium; s__".
perc_identity : Float % Range(0, 1, inclusive_start=False, inclusive_end=True), optional
    The percent identity at which clustering should be performed. This
    parameter maps to vsearch's --id parameter.
threads : Int % Range(1, 256), optional
    Number of computation threads to use (1 to 256). The number of threads
    should be less than or equal to the number of available CPU cores.
rank_handles : List[Str % Choices('disable')] | List[Str % Choices('domain', 'superkingdom', 'kingdom', 'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum', 'superclass', 'class', 'subclass', 'infraclass', 'cohort', 'superorder', 'order', 'suborder', 'infraorder', 'parvorder', 'superfamily', 'family', 'subfamily', 'tribe', 'subtribe', 'genus', 'subgenus', 'species group', 'species subgroup', 'species', 'subspecies', 'forma')], optional
    Specifies the set of rank handles used to backfill missing ranks in the
    resulting dereplicated taxonomy. Use 'disable' to prevent applying
    'rank_handles'.
derep_prefix : Bool, optional
    Merge sequences with identical prefixes. If a sequence is identical to
    the prefix of two or more longer sequences, it is clustered with the
    shortest of them. If they are equally long, it is clustered with the
    most abundant.
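
The rank_handles parameter above is described as backfilling missing ranks. One plausible reading, sketched below, is that a taxonomy truncated by LCA is padded with empty handles for the ranks it no longer reaches. The prefix strings (`d__`, `p__`, and so on), the positional padding, and the function name `backfill` are assumptions made for illustration; the actual plugin may map and fill ranks differently.

```python
# Assumed mapping from rank name to handle prefix (not taken from the docs).
RANK_PREFIXES = {
    "domain": "d__", "phylum": "p__", "class": "c__", "order": "o__",
    "family": "f__", "genus": "g__", "species": "s__",
}
# Matches the documented default for rank_handles.
DEFAULT_RANKS = ("domain", "phylum", "class", "order",
                 "family", "genus", "species")


def backfill(taxonomy, rank_handles=DEFAULT_RANKS):
    """Pad a truncated taxonomy with empty handles for the trailing ranks."""
    if list(rank_handles) == ["disable"]:
        return taxonomy  # 'disable' leaves the taxonomy untouched
    ranks = [r for r in taxonomy.split(";") if r]
    missing = rank_handles[len(ranks):]
    return ";".join(ranks + [RANK_PREFIXES[name] for name in missing])


print(backfill("d__Bacteria;p__Firmicutes"))
# prints: d__Bacteria;p__Firmicutes;c__;o__;f__;g__;s__
print(backfill("d__Bacteria;p__Firmicutes", ["disable"]))
# prints: d__Bacteria;p__Firmicutes
```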

Returns
-------
dereplicated_sequences : FeatureData[Sequence]
dereplicated_taxa : FeatureData[Taxonomy]
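
The derep_prefix rule documented above (cluster a sequence with the shortest longer sequence it is a prefix of, breaking length ties by abundance) can be illustrated with a small sketch. This is a toy model of the stated rule only, not vsearch's algorithm: the function name `assign_prefix_clusters` and the abundance table are hypothetical, and chained merges (a representative that is itself a prefix of something longer) are not followed.

```python
def assign_prefix_clusters(abundances):
    """Map each sequence to the sequence it would be clustered with.

    abundances: dict of sequence -> observed count.
    """
    assignment = {}
    for seq in abundances:
        # Strictly longer sequences that start with this sequence.
        candidates = [
            other for other in abundances
            if len(other) > len(seq) and other.startswith(seq)
        ]
        if not candidates:
            assignment[seq] = seq  # not a prefix of anything longer
            continue
        # Shortest candidate wins; equal lengths go to the most abundant.
        assignment[seq] = min(
            candidates, key=lambda s: (len(s), -abundances[s])
        )
    return assignment


# Made-up abundance table for illustration only.
abundances = {
    "ACG": 1, "ACGTA": 5, "ACGTT": 9,
    "GGC": 3, "GGCA": 1, "GGCTTT": 7,
}
for seq, rep in assign_prefix_clusters(abundances).items():
    print(seq, "->", rep)
```

Here "ACG" is a prefix of two equally long sequences and goes to the more abundant "ACGTT", while "GGC" goes to "GGCA" because it is the shortest match, even though "GGCTTT" is more abundant.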