# cluster-features-open-reference: Open-reference clustering of features.

[vsearch:cluster-features-open-reference:RHNM+14] Jai Ram Rideout, Yan He, Jose A. Navas-Molina, William A. Walters, Luke K. Ursell, Sean M. Gibbons, John Chase, Daniel McDonald, Antonio Gonzalez, Adam Robbins-Pianka, Jose C. Clemente, Jack A. Gilbert, Susan M. Huse, Hong-Wei Zhou, Rob Knight, and J. Gregory Caporaso. Subsampled open-reference clustering creates consistent, comprehensive OTU definitions and scales to billions of sequences. PeerJ, 2:e545, 2014. doi:10.7717/peerj.545.

Usage: qiime vsearch cluster-features-open-reference [OPTIONS]

Given a feature table and the associated feature sequences, cluster the
features against a reference database at a user-specified percent identity
threshold. Any sequences that don't match a reference sequence are then
clustered de novo. This is not a general-purpose clustering method; it is
intended for clustering the results of quality-filtering/dereplication
methods, such as DADA2, or for re-clustering a FeatureTable at a lower
percent identity than it was originally clustered at.

When a group of features in the input table is clustered into a single
feature, the frequency of that single feature in a given sample is the sum
of the frequencies, in that sample, of the features that were clustered
together. Feature identifiers are inherited from the centroid feature of
each cluster. For features that match a reference sequence, the centroid
feature is that reference sequence, so its identifier becomes the feature
identifier.

The clustered_sequences result contains one representative sequence for
each feature in clustered_table, drawn from the input sequences. The
representative is always the most abundant sequence in the cluster. The
new_reference_sequences result contains the entire reference database plus
the representative sequences of any de novo features; it is intended to be
used as the reference database in subsequent iterations of
cluster_features_open_reference, if applicable. See the vsearch
documentation for details on how sequence clustering is performed.
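
The two-phase logic described above can be illustrated with a toy sketch. This is not how vsearch computes identity or picks centroids: the `identity` function below is a simplifying assumption (position-by-position comparison of equal-length strings, no alignment), and all names and sequences are hypothetical. It only shows the order of operations: match against the reference first, then cluster the leftovers de novo with the most abundant sequence as centroid.

```python
def identity(a, b):
    """Toy identity: fraction of matching positions (assumes no alignment)."""
    if len(a) != len(b):
        return 0.0
    return sum(x == y for x, y in zip(a, b)) / len(a)


def open_reference_cluster(features, reference, perc_identity):
    """features: {feature_id: (sequence, total_frequency)}
    reference: {reference_id: sequence}
    Returns (assignment, new_reference)."""
    assignment = {}
    unmatched = []

    # Phase 1: closed-reference step. A feature that hits a reference
    # sequence takes that reference sequence's identifier.
    for fid, (seq, freq) in features.items():
        best_id, best_score = None, 0.0
        for rid, rseq in reference.items():
            score = identity(seq, rseq)
            if score >= perc_identity and score > best_score:
                best_id, best_score = rid, score
        if best_id is None:
            unmatched.append((fid, seq, freq))
        else:
            assignment[fid] = best_id

    # Phase 2: de novo step on the leftovers, most abundant first, so each
    # de novo centroid is the most abundant member of its cluster.
    centroids = {}
    for fid, seq, _ in sorted(unmatched, key=lambda item: -item[2]):
        for cid, cseq in centroids.items():
            if identity(seq, cseq) >= perc_identity:
                assignment[fid] = cid
                break
        else:
            centroids[fid] = seq
            assignment[fid] = fid

    # New reference = original reference plus the de novo centroids.
    new_reference = {**reference, **centroids}
    return assignment, new_reference


reference = {"ref1": "ACGTACGTAC"}
features = {
    "f1": ("ACGTACGTAC", 10),
    "f2": ("ACGTACGTAA", 5),
    "f3": ("TTTTTTTTTT", 8),
    "f4": ("TTTTTTTTTA", 2),
}
assignment, new_reference = open_reference_cluster(features, reference, 0.9)
print(assignment)
print(sorted(new_reference))
```

In this toy input, f1 and f2 fall under the reference identifier, while f3 and f4 form a de novo cluster whose centroid (f3) is appended to the reference.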

Options:
--i-sequences ARTIFACT PATH FeatureData[Sequence]
  The sequences corresponding to the features in table.  [required]
--i-table ARTIFACT PATH FeatureTable[Frequency]
  The feature table to be clustered.  [required]
--i-reference-sequences ARTIFACT PATH FeatureData[Sequence]
  The sequences to use as cluster centroids.  [required]
--p-perc-identity FLOAT
  The percent identity at which clustering should be performed. This
  parameter maps to vsearch's --id parameter.  [required]
--p-strand [both|plus]
  Search plus (i.e., forward) or both (i.e., forward and reverse
  complement) strands.  [default: plus]
--p-threads INTEGER RANGE
  The number of threads to use for computation. Passing 0 will launch one
  thread per CPU core.  [default: 1]
--o-clustered-table ARTIFACT PATH FeatureTable[Frequency]
  The table following clustering of features.
  [required if not passing --output-dir]
--o-clustered-sequences ARTIFACT PATH FeatureData[Sequence]
  Sequences representing clustered features.
  [required if not passing --output-dir]
--o-new-reference-sequences ARTIFACT PATH FeatureData[Sequence]
  The new reference sequences. This can be used for subsequent runs of
  open-reference clustering for consistent definitions of features across
  open-reference feature tables.  [required if not passing --output-dir]
--output-dir DIRECTORY
  Output unspecified results to a directory
--cmd-config FILE
  Use config file for command options
--verbose
  Display verbose output to stdout and/or stderr during execution of this
  action.  [default: False]
--quiet
  Silence output if execution is successful (silence is golden).
  [default: False]
--citations
  Show citations and exit.
--help
  Show this message and exit.
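
As an illustration of how the options above combine, the following sketch assembles a command line from Python. The helper `build_command` and the file names (`rep-seqs.qza`, `table.qza`, `ref-seqs.qza`, `clustered`) are hypothetical, and 0.97 is only an example threshold; the flags themselves are the ones listed above.

```python
import shlex


def build_command(sequences, table, reference, perc_identity, output_dir,
                  strand="plus", threads=1):
    """Return the argument list for one open-reference clustering run."""
    return [
        "qiime", "vsearch", "cluster-features-open-reference",
        "--i-sequences", sequences,
        "--i-table", table,
        "--i-reference-sequences", reference,
        "--p-perc-identity", str(perc_identity),
        "--p-strand", strand,
        "--p-threads", str(threads),
        # With --output-dir, the three --o-* options may be omitted.
        "--output-dir", output_dir,
    ]


cmd = build_command("rep-seqs.qza", "table.qza", "ref-seqs.qza", 0.97,
                    "clustered")
print(shlex.join(cmd))
# To actually run it (requires a QIIME 2 environment):
# import subprocess; subprocess.run(cmd, check=True)
```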

#### Import:

from qiime2.plugins.vsearch.pipelines import cluster_features_open_reference


Open-reference clustering of features.

Given a feature table and the associated feature sequences, cluster the
features against a reference database at a user-specified percent identity
threshold. Any sequences that don't match a reference sequence are then
clustered de novo. This is not a general-purpose clustering method; it is
intended for clustering the results of quality-filtering/dereplication
methods, such as DADA2, or for re-clustering a FeatureTable at a lower
percent identity than it was originally clustered at.

When a group of features in the input table is clustered into a single
feature, the frequency of that single feature in a given sample is the sum
of the frequencies, in that sample, of the features that were clustered
together. Feature identifiers are inherited from the centroid feature of
each cluster. For features that match a reference sequence, the centroid
feature is that reference sequence, so its identifier becomes the feature
identifier.

The clustered_sequences result contains one representative sequence for
each feature in clustered_table, drawn from the input sequences. The
representative is always the most abundant sequence in the cluster. The
new_reference_sequences result contains the entire reference database plus
the representative sequences of any de novo features; it is intended to be
used as the reference database in subsequent iterations of
cluster_features_open_reference, if applicable. See the vsearch
documentation for details on how sequence clustering is performed.
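
The frequency rule in the description can be made concrete with a small sketch. It uses plain dictionaries rather than a real FeatureTable, and the sample IDs, feature IDs, and the `collapse_table` helper are all hypothetical; it only demonstrates that per-sample frequencies of merged features are summed under the cluster's identifier.

```python
from collections import defaultdict


def collapse_table(table, assignment):
    """table: {sample_id: {feature_id: frequency}}
    assignment: {feature_id: cluster_id}
    Returns a table keyed by cluster_id with summed frequencies."""
    clustered = {}
    for sample_id, freqs in table.items():
        summed = defaultdict(int)
        for feature_id, freq in freqs.items():
            # Every member feature contributes to its cluster's total.
            summed[assignment[feature_id]] += freq
        clustered[sample_id] = dict(summed)
    return clustered


table = {
    "s1": {"f1": 4, "f2": 1, "f3": 6},
    "s2": {"f1": 6, "f2": 4, "f4": 2},
}
# f1 and f2 matched reference sequence ref1; f4 joined the de novo
# cluster whose centroid is f3.
assignment = {"f1": "ref1", "f2": "ref1", "f3": "f3", "f4": "f3"}
clustered = collapse_table(table, assignment)
print(clustered)
```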

Parameters
----------
sequences : FeatureData[Sequence]
    The sequences corresponding to the features in table.
table : FeatureTable[Frequency]
    The feature table to be clustered.
reference_sequences : FeatureData[Sequence]
    The sequences to use as cluster centroids.
perc_identity : Float % Range(0, 1, inclusive_start=False, inclusive_end=True)
    The percent identity at which clustering should be performed. This
    parameter maps to vsearch's --id parameter.
strand : Str % Choices({'both', 'plus'}), optional
    Search plus (i.e., forward) or both (i.e., forward and reverse
    complement) strands.
threads : Int % Range(0, 256, inclusive_end=True), optional
    The number of threads to use for computation. Passing 0 will launch one
    thread per CPU core.
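
The type expressions above define the accepted values. QIIME 2 enforces them itself, so the following is only an illustrative sketch of what the ranges mean. The `check_parameters` helper is hypothetical, and its defaults mirror the CLI defaults (`plus`, `1`). Note in particular that `perc_identity` is a proportion in (0, 1], so 97% is written `0.97`, not `97`.

```python
def check_parameters(perc_identity, strand="plus", threads=1):
    """Mirror the documented ranges; raise ValueError when outside them."""
    # Range(0, 1, inclusive_start=False, inclusive_end=True)
    if not (0 < perc_identity <= 1):
        raise ValueError("perc_identity must be in (0, 1], e.g. 0.97")
    # Choices({'both', 'plus'})
    if strand not in {"both", "plus"}:
        raise ValueError("strand must be 'both' or 'plus'")
    # Range(0, 256, inclusive_end=True); 0 means one thread per CPU core
    if not (0 <= threads <= 256):
        raise ValueError("threads must be between 0 and 256")
    return {"perc_identity": perc_identity, "strand": strand,
            "threads": threads}


params = check_parameters(0.97)
print(params)
```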

Returns
-------
clustered_table : FeatureTable[Frequency]
    The table following clustering of features.
clustered_sequences : FeatureData[Sequence]
    Sequences representing clustered features.
new_reference_sequences : FeatureData[Sequence]
    The new reference sequences. This can be used for subsequent runs of
    open-reference clustering for consistent definitions of features across
    open-reference feature tables.
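
A minimal sketch of calling the pipeline through the Artifact API might look like the following. The wrapper `cluster_open_reference` and all file paths are hypothetical; the parameter and result names are the ones documented above. The QIIME 2 imports sit inside the function so the definition can be loaded without a QIIME 2 environment, although calling it requires one.

```python
def cluster_open_reference(table_path, sequences_path, reference_path,
                           perc_identity, strand="plus", threads=1):
    """Load three artifacts, run the pipeline, and return its three results."""
    from qiime2 import Artifact
    from qiime2.plugins.vsearch.pipelines import (
        cluster_features_open_reference,
    )

    results = cluster_features_open_reference(
        sequences=Artifact.load(sequences_path),
        table=Artifact.load(table_path),
        reference_sequences=Artifact.load(reference_path),
        perc_identity=perc_identity,
        strand=strand,
        threads=threads,
    )
    # Results are accessible by the names listed under "Returns".
    return (results.clustered_table,
            results.clustered_sequences,
            results.new_reference_sequences)


# Example call (hypothetical paths; requires QIIME 2 and q2-vsearch):
# table, seqs, new_ref = cluster_open_reference(
#     "table.qza", "rep-seqs.qza", "ref-seqs.qza", perc_identity=0.97)
# new_ref.save("new-ref-seqs.qza")  # reuse as the reference in a later run
```

Saving `new_reference_sequences` and passing it as `reference_sequences` in a later run is how the description's "subsequent iterations" would be chained.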