cluster-features-open-reference: Open-reference clustering of features.

Citations
  • Jai Ram Rideout, Yan He, Jose A. Navas-Molina, William A. Walters, Luke K. Ursell, Sean M. Gibbons, John Chase, Daniel McDonald, Antonio Gonzalez, Adam Robbins-Pianka, Jose C. Clemente, Jack A. Gilbert, Susan M. Huse, Hong-Wei Zhou, Rob Knight, and J. Gregory Caporaso. Subsampled open-reference clustering creates consistent, comprehensive OTU definitions and scales to billions of sequences. PeerJ, 2:e545, 2014. doi:10.7717/peerj.545.

Docstring:

Usage: qiime vsearch cluster-features-open-reference [OPTIONS]

  Given a feature table and the associated feature sequences, cluster the
  features against a reference database based on a user-specified percent
  identity threshold of their sequences. Any sequences that don't match are
  then clustered de novo. This is not a general-purpose clustering method, but
  rather is intended to be used for clustering the results of quality-
  filtering/dereplication methods, such as DADA2, or for re-clustering a
  FeatureTable at a lower percent identity than it was originally clustered
  at. When a group of features in the input table are clustered into a single
  feature, the frequency of that single feature in a given sample is the sum
  of the frequencies of the features that were clustered in that sample.
  Feature identifiers will be inherited from the centroid feature of each
  cluster. For features that match a reference sequence, the centroid feature
  is that reference sequence, so its identifier will become the feature
  identifier. The clustered_sequences result will contain one representative
  sequence, drawn from the input sequences, for each feature in
  clustered_table. Each representative is the most abundant sequence in its
  cluster. The new_reference_sequences result will contain the entire
  reference database, plus feature representative sequences for any de novo
  features. This is intended to be used as a reference database in subsequent
  iterations of cluster_features_open_reference, if applicable. See the
  vsearch documentation for details on how sequence clustering is performed.

Inputs:
  --i-sequences ARTIFACT FeatureData[Sequence]
                          The sequences corresponding to the features in
                          table.                                    [required]
  --i-table ARTIFACT FeatureTable[Frequency]
                          The feature table to be clustered.        [required]
  --i-reference-sequences ARTIFACT FeatureData[Sequence]
                          The sequences to use as cluster centroids.
                                                                    [required]
Parameters:
  --p-perc-identity PROPORTION Range(0, 1, inclusive_start=False,
    inclusive_end=True)   The percent identity at which clustering should be
                          performed. This parameter maps to vsearch's --id
                          parameter.                                [required]
  --p-strand TEXT Choices('plus', 'both')
                          Search plus (i.e., forward) or both (i.e., forward
                          and reverse complement) strands.   [default: 'plus']
  --p-threads NTHREADS    The number of threads to use for computation.
                          Passing 0 will launch one thread per CPU core.
                                                                  [default: 1]
Outputs:
  --o-clustered-table ARTIFACT FeatureTable[Frequency]
                          The table following clustering of features.
                                                                    [required]
  --o-clustered-sequences ARTIFACT FeatureData[Sequence]
                          Sequences representing clustered features.
                                                                    [required]
  --o-new-reference-sequences ARTIFACT FeatureData[Sequence]
                          The new reference sequences. This can be used for
                          subsequent runs of open-reference clustering for
                          consistent definitions of features across
                          open-reference feature tables.            [required]
Miscellaneous:
  --output-dir PATH       Output unspecified results to a directory
  --verbose / --quiet     Display verbose output to stdout and/or stderr
                          during execution of this action. Or silence output
                          if execution is successful (silence is golden).
  --recycle-pool TEXT     Use a cache pool for pipeline resumption. QIIME 2
                          will cache your results in this pool for reuse by
                          future invocations. These pools are retained until
                          deleted by the user. If not provided, QIIME 2 will
                          create a pool which is automatically reused by
                          invocations of the same action and removed if the
                          action is successful. Note: these pools are local to
                          the cache you are using.
  --no-recycle            Do not recycle results from a previous failed
                          pipeline run or save the results from this run for
                          future recycling.
  --parallel              Execute your action in parallel. This flag will use
                          your default parallel config.
  --parallel-config FILE  Execute your action in parallel using a config at
                          the indicated path.
  --use-cache DIRECTORY   Specify the cache to be used for the intermediate
                          work of this pipeline. If not provided, the default
                          cache under $TMP/qiime2/ will be used.
                          IMPORTANT FOR HPC USERS: If you are on an HPC system
                          and are using parallel execution it is important to
                          set this to a location that is globally accessible
                          to all nodes in the cluster.
  --example-data PATH     Write example data and exit.
  --citations             Show citations and exit.
  --help                  Show this message and exit.
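
As a minimal sketch of how the options above fit together, the Python helper below assembles the command line as an argument list. The helper name and the file names are hypothetical; only the flags are taken from the help text above. The range check mirrors Range(0, 1, inclusive_start=False, inclusive_end=True), that is, 0 < perc_identity <= 1.

```python
import shlex


def build_open_reference_command(sequences, table, reference_sequences,
                                 perc_identity, output_dir,
                                 strand="plus", threads=1):
    """Return the argument list for the CLI call documented above.

    This helper is illustrative and not part of QIIME 2.
    """
    # The documented range excludes 0 and includes 1.
    if not 0 < perc_identity <= 1:
        raise ValueError("perc_identity must be in the interval (0, 1]")
    if strand not in ("plus", "both"):
        raise ValueError("strand must be 'plus' or 'both'")
    return [
        "qiime", "vsearch", "cluster-features-open-reference",
        "--i-sequences", sequences,
        "--i-table", table,
        "--i-reference-sequences", reference_sequences,
        "--p-perc-identity", str(perc_identity),
        "--p-strand", strand,
        "--p-threads", str(threads),
        # With no explicit --o-* options, results go to this directory.
        "--output-dir", output_dir,
    ]


# Hypothetical file names, for illustration only.
command = build_open_reference_command(
    sequences="rep-seqs.qza",
    table="table.qza",
    reference_sequences="reference-seqs.qza",
    perc_identity=0.97,
    output_dir="open-ref-results",
)
print(shlex.join(command))
```

On a machine with QIIME 2 installed, the list could be passed to subprocess.run(command, check=True).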

Import:

from qiime2.plugins.vsearch.pipelines import cluster_features_open_reference

Docstring:

Open-reference clustering of features.

Given a feature table and the associated feature sequences, cluster the
features against a reference database based on a user-specified percent
identity threshold of their sequences. Any sequences that don't match are
then clustered de novo. This is not a general-purpose clustering method,
but rather is intended to be used for clustering the results of quality-
filtering/dereplication methods, such as DADA2, or for re-clustering a
FeatureTable at a lower percent identity than it was originally clustered
at. When a group of features in the input table are clustered into a single
feature, the frequency of that single feature in a given sample is the sum
of the frequencies of the features that were clustered in that sample.
Feature identifiers will be inherited from the centroid feature of each
cluster. For features that match a reference sequence, the centroid feature
is that reference sequence, so its identifier will become the feature
identifier. The clustered_sequences result will contain one representative
sequence, drawn from the input sequences, for each feature in
clustered_table. Each representative is the most abundant sequence in its
cluster. The new_reference_sequences result will contain the entire
reference database, plus feature representative sequences for any de novo
features. This is intended to be used as a reference database in subsequent
iterations of cluster_features_open_reference, if applicable. See the
vsearch documentation for details on how sequence clustering is performed.
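
The bookkeeping described above can be illustrated with a toy example in plain Python. This is a sketch under stated assumptions, not the plugin's implementation: the cluster assignments are supplied by hand (in practice vsearch determines them), and all identifiers, sequences, and the collapse_table helper are hypothetical.

```python
from collections import defaultdict


def collapse_table(table, cluster_of):
    """Sum per-sample frequencies of features that share a centroid.

    table: {sample_id: {feature_id: frequency}}
    cluster_of: {feature_id: centroid_id}
    """
    clustered = {}
    for sample_id, frequencies in table.items():
        summed = defaultdict(int)
        for feature_id, frequency in frequencies.items():
            summed[cluster_of[feature_id]] += frequency
        clustered[sample_id] = dict(summed)
    return clustered


# Toy inputs (hypothetical identifiers and sequences).
table = {
    "sample1": {"f1": 10, "f2": 5, "f3": 2},
    "sample2": {"f1": 0, "f2": 7, "f3": 4},
}
reference = {"ref1": "ACGT", "ref2": "GGCC"}
sequences = {"f1": "ACGA", "f2": "ACGT", "f3": "TTTT"}

# Assumed outcome of clustering: f1 and f2 matched reference sequence ref1,
# so they inherit its identifier; f3 matched nothing and became its own
# de novo centroid.
cluster_of = {"f1": "ref1", "f2": "ref1", "f3": "f3"}

clustered_table = collapse_table(table, cluster_of)
# sample1: ref1 = 10 + 5 = 15, f3 = 2
# sample2: ref1 = 0 + 7 = 7,  f3 = 4

# New reference: the entire reference database plus de novo representatives.
de_novo = {fid: sequences[fid]
           for fid in sorted(set(cluster_of.values()))
           if fid not in reference}
new_reference = {**reference, **de_novo}

print(clustered_table)
print(sorted(new_reference))
```

Note that ref2 stays in new_reference even though no feature matched it, which is what makes the result reusable as the reference database for a later run.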

Parameters
----------
sequences : FeatureData[Sequence]
    The sequences corresponding to the features in table.
table : FeatureTable[Frequency]
    The feature table to be clustered.
reference_sequences : FeatureData[Sequence]
    The sequences to use as cluster centroids.
perc_identity : Float % Range(0, 1, inclusive_start=False, inclusive_end=True)
    The percent identity at which clustering should be performed. This
    parameter maps to vsearch's --id parameter.
strand : Str % Choices('plus', 'both'), optional
    Search plus (i.e., forward) or both (i.e., forward and reverse
    complement) strands.
threads : Threads, optional
    The number of threads to use for computation. Passing 0 will launch one
    thread per CPU core.

Returns
-------
clustered_table : FeatureTable[Frequency]
    The table following clustering of features.
clustered_sequences : FeatureData[Sequence]
    Sequences representing clustered features.
new_reference_sequences : FeatureData[Sequence]
    The new reference sequences. This can be used for subsequent runs of
    open-reference clustering for consistent definitions of features across
    open-reference feature tables.
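
A minimal sketch of calling the pipeline from Python, assuming QIIME 2 and the vsearch plugin are installed. The wrapper function and its path arguments are hypothetical; the keyword arguments and result names are those documented above. The QIIME 2 imports sit inside the function so that the sketch can be loaded on a machine without QIIME 2.

```python
def run_open_reference(sequences_path, table_path, reference_path,
                       perc_identity, strand="plus", threads=1):
    """Load three artifacts, run the pipeline, and return its three results.

    Illustrative wrapper; not part of QIIME 2.
    """
    # Imported lazily so the module loads even where QIIME 2 is absent.
    from qiime2 import Artifact
    from qiime2.plugins.vsearch.pipelines import (
        cluster_features_open_reference,
    )

    results = cluster_features_open_reference(
        sequences=Artifact.load(sequences_path),
        table=Artifact.load(table_path),
        reference_sequences=Artifact.load(reference_path),
        perc_identity=perc_identity,
        strand=strand,
        threads=threads,
    )
    return (
        results.clustered_table,
        results.clustered_sequences,
        results.new_reference_sequences,
    )
```

For a later batch of samples, the returned new_reference_sequences artifact could be saved with its save method and passed as reference_path in the next call, which keeps feature definitions consistent across tables.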