Warning
This site has been replaced by the new QIIME 2 “amplicon distribution” documentation, as of the 2025.4 release of QIIME 2. You can still access the content from the “old docs” here for the QIIME 2 2024.10 and earlier releases, but we recommend that you transition to the new documentation at https://amplicon-docs.qiime2.org. Content on this site is no longer updated and may be out of date.
cluster-features-open-reference: Open-reference clustering of features.
Docstring:
Usage: qiime vsearch cluster-features-open-reference [OPTIONS]

  Given a feature table and the associated feature sequences, cluster the
  features against a reference database based on user-specified percent
  identity threshold of their sequences. Any sequences that don't match are
  then clustered de novo. This is not a general-purpose clustering method,
  but rather is intended to be used for clustering the results of
  quality-filtering/dereplication methods, such as DADA2, or for
  re-clustering a FeatureTable at a lower percent identity than it was
  originally clustered at.

  When a group of features in the input table are clustered into a single
  feature, the frequency of that single feature in a given sample is the sum
  of the frequencies of the features that were clustered in that sample.
  Feature identifiers will be inherited from the centroid feature of each
  cluster. For features that match a reference sequence, the centroid
  feature is that reference sequence, so its identifier will become the
  feature identifier.

  The clustered_sequences result will contain feature representative
  sequences that are derived from the sequences input for all features in
  clustered_table. This will always be the most abundant sequence in the
  cluster. The new_reference_sequences result will contain the entire
  reference database, plus feature representative sequences for any de novo
  features. This is intended to be used as a reference database in
  subsequent iterations of cluster_features_open_reference, if applicable.

  See the vsearch documentation for details on how sequence clustering is
  performed.

Inputs:
  --i-sequences ARTIFACT FeatureData[Sequence]
      The sequences corresponding to the features in table. [required]
  --i-table ARTIFACT FeatureTable[Frequency]
      The feature table to be clustered. [required]
  --i-reference-sequences ARTIFACT FeatureData[Sequence]
      The sequences to use as cluster centroids. [required]

Parameters:
  --p-perc-identity PROPORTION Range(0, 1, inclusive_start=False, inclusive_end=True)
      The percent identity at which clustering should be performed. This
      parameter maps to vsearch's --id parameter. [required]
  --p-strand TEXT Choices('plus', 'both')
      Search plus (i.e., forward) or both (i.e., forward and reverse
      complement) strands. [default: 'plus']
  --p-threads NTHREADS
      The number of threads to use for computation. Passing 0 will launch
      one thread per CPU core. [default: 1]

Outputs:
  --o-clustered-table ARTIFACT FeatureTable[Frequency]
      The table following clustering of features. [required]
  --o-clustered-sequences ARTIFACT FeatureData[Sequence]
      Sequences representing clustered features. [required]
  --o-new-reference-sequences ARTIFACT FeatureData[Sequence]
      The new reference sequences. This can be used for subsequent runs of
      open-reference clustering for consistent definitions of features
      across open-reference feature tables. [required]

Miscellaneous:
  --output-dir PATH
      Output unspecified results to a directory
  --verbose / --quiet
      Display verbose output to stdout and/or stderr during execution of
      this action. Or silence output if execution is successful (silence is
      golden).
  --recycle-pool TEXT
      Use a cache pool for pipeline resumption. QIIME 2 will cache your
      results in this pool for reuse by future invocations. These pools are
      retained until deleted by the user. If not provided, QIIME 2 will
      create a pool which is automatically reused by invocations of the
      same action and removed if the action is successful. Note: these
      pools are local to the cache you are using.
  --no-recycle
      Do not recycle results from a previous failed pipeline run or save
      the results from this run for future recycling.
  --parallel
      Execute your action in parallel. This flag will use your default
      parallel config.
  --parallel-config FILE
      Execute your action in parallel using a config at the indicated path.
  --example-data PATH
      Write example data and exit.
  --citations
      Show citations and exit.
  --use-cache DIRECTORY
      Specify the cache to be used for the intermediate work of this
      action. If not provided, the default cache under $TMP/qiime2/ will be
      used. IMPORTANT FOR HPC USERS: If you are on an HPC system and are
      using parallel execution it is important to set this to a location
      that is globally accessible to all nodes in the cluster.
  --help
      Show this message and exit.
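The frequency rule in the description above (a clustered feature's count in a sample is the sum of the counts of its member features in that sample) can be illustrated with a small pure-Python sketch. This is not how the plugin is implemented. The table layout, the `collapse_table` helper, and the feature and sample identifiers are all hypothetical and chosen only for illustration. The feature-to-centroid mapping is supplied by hand here, whereas the real action derives it from vsearch.

```python
from collections import defaultdict


def collapse_table(table, feature_to_centroid):
    """Sum per-sample frequencies of features that share a centroid.

    table: {feature_id: {sample_id: count}}      (hypothetical layout)
    feature_to_centroid: {feature_id: centroid_id}
    """
    clustered = defaultdict(lambda: defaultdict(int))
    for feature_id, sample_counts in table.items():
        centroid_id = feature_to_centroid[feature_id]
        for sample_id, count in sample_counts.items():
            clustered[centroid_id][sample_id] += count
    # Convert nested defaultdicts to plain dicts for easier inspection.
    return {cid: dict(samples) for cid, samples in clustered.items()}


# Toy input: three features observed in two samples (made-up numbers).
table = {
    "asv1": {"sampleA": 10, "sampleB": 0},
    "asv2": {"sampleA": 5, "sampleB": 3},
    "asv3": {"sampleA": 0, "sampleB": 7},
}

# asv1 and asv2 both match the reference sequence "ref42", so the cluster
# takes the reference identifier. asv3 matches nothing and stays as its own
# de novo centroid.
feature_to_centroid = {"asv1": "ref42", "asv2": "ref42", "asv3": "asv3"}

clustered = collapse_table(table, feature_to_centroid)
print(clustered)
```

In this toy case the two features that hit `ref42` are merged into one row labelled with the reference identifier, while the unmatched feature keeps its own identifier.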
Import:
from qiime2.plugins.vsearch.pipelines import cluster_features_open_reference
Docstring:
Open-reference clustering of features.

Given a feature table and the associated feature sequences, cluster the
features against a reference database based on user-specified percent
identity threshold of their sequences. Any sequences that don't match are
then clustered de novo. This is not a general-purpose clustering method, but
rather is intended to be used for clustering the results of
quality-filtering/dereplication methods, such as DADA2, or for re-clustering
a FeatureTable at a lower percent identity than it was originally clustered
at.

When a group of features in the input table are clustered into a single
feature, the frequency of that single feature in a given sample is the sum
of the frequencies of the features that were clustered in that sample.
Feature identifiers will be inherited from the centroid feature of each
cluster. For features that match a reference sequence, the centroid feature
is that reference sequence, so its identifier will become the feature
identifier.

The clustered_sequences result will contain feature representative sequences
that are derived from the sequences input for all features in
clustered_table. This will always be the most abundant sequence in the
cluster. The new_reference_sequences result will contain the entire
reference database, plus feature representative sequences for any de novo
features. This is intended to be used as a reference database in subsequent
iterations of cluster_features_open_reference, if applicable.

See the vsearch documentation for details on how sequence clustering is
performed.

Parameters
----------
sequences : FeatureData[Sequence]
    The sequences corresponding to the features in table.
table : FeatureTable[Frequency]
    The feature table to be clustered.
reference_sequences : FeatureData[Sequence]
    The sequences to use as cluster centroids.
perc_identity : Float % Range(0, 1, inclusive_start=False, inclusive_end=True)
    The percent identity at which clustering should be performed. This
    parameter maps to vsearch's --id parameter.
strand : Str % Choices('plus', 'both'), optional
    Search plus (i.e., forward) or both (i.e., forward and reverse
    complement) strands.
threads : Threads, optional
    The number of threads to use for computation. Passing 0 will launch one
    thread per CPU core.

Returns
-------
clustered_table : FeatureTable[Frequency]
    The table following clustering of features.
clustered_sequences : FeatureData[Sequence]
    Sequences representing clustered features.
new_reference_sequences : FeatureData[Sequence]
    The new reference sequences. This can be used for subsequent runs of
    open-reference clustering for consistent definitions of features across
    open-reference feature tables.
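As a rough sketch of how the pipeline might be called through the Artifact API, the wrapper below loads three artifacts, checks the arguments against the ranges documented above, and returns the three results. The wrapper name `cluster_open_reference`, its argument names, and the 0.97 default are assumptions made for illustration and are not part of the plugin. It relies on `qiime2.Artifact.load` and on the result fields being named after the documented outputs. It requires a working QIIME 2 environment with q2-vsearch installed.

```python
def cluster_open_reference(sequences_path, table_path, reference_path,
                           perc_identity=0.97, strand="plus", threads=1):
    """Hypothetical convenience wrapper around the documented pipeline."""
    # Mirror the documented constraints so mistakes fail early:
    # Range(0, 1, inclusive_start=False, inclusive_end=True)
    if not 0 < perc_identity <= 1:
        raise ValueError("perc_identity must be in the interval (0, 1]")
    # Choices('plus', 'both')
    if strand not in ("plus", "both"):
        raise ValueError("strand must be 'plus' or 'both'")

    # Imported here so the argument checks above work even where QIIME 2
    # is not installed.
    from qiime2 import Artifact
    from qiime2.plugins.vsearch.pipelines import (
        cluster_features_open_reference,
    )

    results = cluster_features_open_reference(
        sequences=Artifact.load(sequences_path),
        table=Artifact.load(table_path),
        reference_sequences=Artifact.load(reference_path),
        perc_identity=perc_identity,
        strand=strand,
        threads=threads,
    )
    # Result fields are named after the documented outputs.
    return (
        results.clustered_table,
        results.clustered_sequences,
        results.new_reference_sequences,
    )
```

A call such as `cluster_open_reference("rep-seqs.qza", "table.qza", "ref-seqs.qza")` (placeholder file names) would return three artifacts that can each be written out with `.save(...)`. For a later batch of samples, the saved new reference sequences could be passed as `reference_path` so that feature definitions stay consistent across runs, as the docstring suggests.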