Clustering sequences into OTUs using q2-vsearch

Warning

This site has been replaced by the new QIIME 2 “amplicon distribution” documentation, as of the 2025.4 release of QIIME 2. You can still access the content from the “old docs” here for the QIIME 2 2024.10 and earlier releases, but we recommend that you transition to the new documentation at https://amplicon-docs.qiime2.org. Content on this site is no longer updated and may be out of date.


De novo, closed-reference, and open-reference clustering are currently supported in QIIME 2.

OTU clustering with vsearch can currently start from either of two inputs: demultiplexed, quality-controlled sequence data (i.e., a SampleData[Sequences] artifact), or dereplicated, quality-controlled data in a feature table and feature representative sequences (i.e., FeatureTable[Frequency] and FeatureData[Sequence] artifacts, which could be generated using the qiime dada2 denoise-* or qiime deblur denoise-* commands). The first option currently takes two steps, dereplication followed by clustering, but will likely be accessible through a single command in the future for convenience. The second option takes one step.

QIIME 1 Users

Demultiplexed, quality-filtered sequence data corresponds to the seqs.fna file generated by the QIIME 1 split_libraries*.py commands.

After working through this tutorial, you will know how to run de novo, closed-reference, and open-reference clustering. This will be illustrated beginning with a QIIME 1 seqs.fna file that will be read into a SampleData[Sequences] artifact. If you already have FeatureTable[Frequency] and FeatureData[Sequence] artifacts that you’d like to cluster, you can skip ahead to the Clustering of FeatureTable[Frequency] and FeatureData[Sequence] section of this tutorial.

Obtain the data

Start by creating a directory to work in.

mkdir qiime2-otu-clustering-tutorial
cd qiime2-otu-clustering-tutorial

Next, download the necessary files:

Use whichever download option is most appropriate for your environment.

Download URL: https://data.qiime2.org/2024.10/tutorials/otu-clustering/seqs.fna

Save as: seqs.fna

Using wget:

wget \
  -O "seqs.fna" \
  "https://data.qiime2.org/2024.10/tutorials/otu-clustering/seqs.fna"

Using curl:

curl -sL \
  "https://data.qiime2.org/2024.10/tutorials/otu-clustering/seqs.fna" > \
  "seqs.fna"

Use whichever download option is most appropriate for your environment.

Download URL: https://data.qiime2.org/2024.10/tutorials/otu-clustering/85_otus.qza

Save as: 85_otus.qza

Using wget:

wget \
  -O "85_otus.qza" \
  "https://data.qiime2.org/2024.10/tutorials/otu-clustering/85_otus.qza"

Using curl:

curl -sL \
  "https://data.qiime2.org/2024.10/tutorials/otu-clustering/85_otus.qza" > \
  "85_otus.qza"

Dereplicating a `SampleData[Sequences]` artifact

If you are beginning your analysis with demultiplexed, quality-controlled sequences, such as those in a QIIME 1 seqs.fna file, your first step is to import that data into a QIIME 2 artifact. The semantic type used here is SampleData[Sequences], indicating that the data represents collections of sequences associated with one or more samples.

qiime tools import \
  --input-path seqs.fna \
  --output-path seqs.qza \
  --type 'SampleData[Sequences]'
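
Because the import treats the data as sequences associated with samples, it can help to confirm how many reads each sample contributes before importing. The following is a minimal sketch, assuming the QIIME 1 header convention in which each FASTA label is the sample ID, an underscore, and a sequence number (for example `>S1_1`). The toy file and the `count_reads_per_sample` helper are hypothetical and exist only for illustration.

```shell
# Toy stand-in for seqs.fna (hypothetical data). Each header is assumed to be
# ">" + sample ID + "_" + sequence number, optionally followed by a description.
tmpdir=$(mktemp -d)
cat > "$tmpdir/toy-seqs.fna" <<'EOF'
>S1_1 read-a
ACGTACGT
>S1_2 read-b
ACGTACGA
>S2_1 read-c
ACGTACGT
EOF

# Print "<sample-id><TAB><read count>" for every sample in a FASTA file.
count_reads_per_sample() {
  awk '/^>/ {
         label = substr($1, 2)        # drop the leading ">"
         sub(/_[^_]*$/, "", label)    # drop the trailing "_<number>"
         n[label]++
       }
       END { for (s in n) printf "%s\t%d\n", s, n[s] }' "$1" | LC_ALL=C sort
}

per_sample=$(count_reads_per_sample "$tmpdir/toy-seqs.fna")
printf '%s\n' "$per_sample"
```

Running `count_reads_per_sample seqs.fna` on the downloaded file should give the same kind of per-sample summary, provided its headers follow the assumed convention. Only the last underscore-delimited field is stripped, so sample IDs that themselves contain underscores are kept intact.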

Output artifacts:

85_otus.qza
seqs.qza

After importing data, you can dereplicate it with the dereplicate-sequences command.

qiime vsearch dereplicate-sequences \
  --i-sequences seqs.qza \
  --o-dereplicated-table table.qza \
  --o-dereplicated-sequences rep-seqs.qza

Output artifacts:

rep-seqs.qza
table.qza

The outputs from dereplicate-sequences are a FeatureTable[Frequency] artifact and a FeatureData[Sequence] artifact. The FeatureTable[Frequency] artifact is the feature table indicating the number of times each amplicon sequence variant (ASV) is observed in each of your samples. The FeatureData[Sequence] artifact contains the mapping of each feature identifier to the sequence variant that defines that feature. These files are analogous to those generated by qiime dada2 denoise-* and qiime deblur denoise-*, except that no denoising, chimera removal, or other quality control has been applied in the dereplication process. (In this example, the only quality control of these data is what was applied outside of QIIME 2, before the import step.)
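
To make the idea of dereplication concrete, here is a toy sketch of what the step computes: identical sequences are collapsed into one feature, and the number of observations of that feature is recorded per sample. It is not how q2-vsearch is implemented. The data, the `dereplicate_toy` helper, and the use of the sequence itself as the feature identifier are assumptions made for illustration, and the sketch assumes each record has a single unwrapped sequence line.

```shell
# Hypothetical reads: three copies of one sequence across two samples,
# plus one distinct sequence.
tmpdir=$(mktemp -d)
cat > "$tmpdir/toy-seqs.fna" <<'EOF'
>S1_1
ACGTACGT
>S1_2
ACGTACGT
>S1_3
TTGTACGT
>S2_1
ACGTACGT
EOF

# Emit "<sequence><TAB><sample-id><TAB><count>": a long-format feature table
# in which each unique sequence is one feature.
dereplicate_toy() {
  awk '/^>/ {
         sample = substr($1, 2)        # header without ">"
         sub(/_[^_]*$/, "", sample)    # keep only the sample ID
         next
       }
       { count[$0 "\t" sample]++ }     # sequence line: tally per sample
       END { for (k in count) printf "%s\t%d\n", k, count[k] }' "$1" | LC_ALL=C sort
}

derep=$(dereplicate_toy "$tmpdir/toy-seqs.fna")
printf '%s\n' "$derep"
```

Four reads go in and two unique sequences come out, with the per-sample counts playing the role of the feature table and the list of unique sequences playing the role of the representative sequences.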

Clustering of `FeatureTable[Frequency]` and `FeatureData[Sequence]`

OTU clustering in QIIME 2 is currently applied to a FeatureTable[Frequency] artifact and a FeatureData[Sequence] artifact. These artifacts can come from a variety of analysis pipelines, including qiime vsearch dereplicate-sequences (illustrated above), qiime dada2 denoise-*, qiime deblur denoise-*, or one of the clustering processes illustrated below (for example, to recluster data at a lower percent identity).

The sequences in the FeatureData[Sequence] artifact are clustered against one another (in de novo clustering) or a reference database (in closed-reference clustering), and then features in the FeatureTable are collapsed, resulting in new features that are clusters of the input features.
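
The collapsing step described above can be pictured with a small sketch: given a mapping from each input feature to the cluster it joined, the counts of all features in the same cluster are summed within each sample. The long-format table, the cluster map, and the `collapse_table` helper are all hypothetical; in QIIME 2 this happens inside the clustering commands.

```shell
tmpdir=$(mktemp -d)
# feature-id <TAB> sample-id <TAB> count  (hypothetical long-format feature table)
printf 'f1\tS1\t5\nf2\tS1\t3\nf2\tS2\t4\nf3\tS2\t1\n' > "$tmpdir/table.tsv"
# feature-id <TAB> cluster-id  (hypothetical result: f1 and f2 share a cluster)
printf 'f1\totuA\nf2\totuA\nf3\totuB\n' > "$tmpdir/clusters.tsv"

# $1: cluster map, $2: long-format table.
# Prints "<cluster-id><TAB><sample-id><TAB><summed count>".
collapse_table() {
  awk -F '\t' '
    NR == FNR { cluster[$1] = $2; next }                  # first file: load the map
    ($1 in cluster) { total[cluster[$1] "\t" $2] += $3 }  # second file: sum by cluster
    END { for (k in total) printf "%s\t%d\n", k, total[k] }
  ' "$1" "$2" | LC_ALL=C sort
}

collapsed=$(collapse_table "$tmpdir/clusters.tsv" "$tmpdir/table.tsv")
printf '%s\n' "$collapsed"
```

In this toy case three input features become two clusters, and the total count (13) is unchanged because every feature was assigned to a cluster.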

De novo clustering

De novo clustering of a feature table can be performed as follows. In this example, clustering is performed at 99% identity to create 99% OTUs.

qiime vsearch cluster-features-de-novo \
  --i-table table.qza \
  --i-sequences rep-seqs.qza \
  --p-perc-identity 0.99 \
  --o-clustered-table table-dn-99.qza \
  --o-clustered-sequences rep-seqs-dn-99.qza

Output artifacts:

table-dn-99.qza
rep-seqs-dn-99.qza

The outputs from this process are a FeatureTable[Frequency] artifact and a FeatureData[Sequence] artifact. The FeatureData[Sequence] artifact will contain the centroid sequence defining each OTU cluster.
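
For intuition about what a centroid is, the following toy sketch performs a greedy, centroid-based clustering. It is a simplification, not vsearch's algorithm: it assumes features are visited in order of decreasing abundance, it only compares sequences of equal length, and it defines identity as the fraction of matching positions, whereas vsearch computes identity from pairwise alignments. The data and the `cluster_toy` helper are hypothetical.

```shell
tmpdir=$(mktemp -d)
# feature-id <TAB> total count <TAB> sequence  (hypothetical, all 10 nt long)
printf 'f1\t10\tAAAAAAAAAA\nf2\t6\tAAAAAAAAAT\nf3\t4\tCCCCCCCCCC\nf4\t1\tCCCCCCCCCG\n' \
  > "$tmpdir/features.tsv"

# $1: identity threshold between 0 and 1, $2: features file.
# Prints "<feature-id><TAB><centroid-id>" for every feature.
cluster_toy() {
  LC_ALL=C sort -t "$(printf '\t')" -k2,2nr "$2" | awk -F '\t' -v min_id="$1" '
    # Toy identity: fraction of matching positions; 0 if lengths differ.
    function identity(a, b,    i, same) {
      if (length(a) != length(b)) return 0
      same = 0
      for (i = 1; i <= length(a); i++)
        if (substr(a, i, 1) == substr(b, i, 1)) same++
      return same / length(a)
    }
    {
      assigned = ""
      # Join the first existing centroid that is similar enough.
      for (c = 1; c <= ncent; c++)
        if (identity($3, cent_seq[c]) >= min_id) { assigned = cent_id[c]; break }
      # Otherwise this feature starts a new cluster and becomes its centroid.
      if (assigned == "") {
        ncent++; cent_id[ncent] = $1; cent_seq[ncent] = $3; assigned = $1
      }
      printf "%s\t%s\n", $1, assigned
    }'
}

map_085=$(cluster_toy 0.85 "$tmpdir/features.tsv")
printf '%s\n' "$map_085"
```

At a threshold of 0.85, f2 joins the cluster centered on f1 and f4 joins the cluster centered on f3. Raising the threshold to 0.99 leaves every feature as its own centroid, which illustrates why a higher percent identity yields more, smaller OTUs.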

Closed-reference clustering

Closed-reference clustering of a feature table can be performed as follows. In this example, clustering is performed at 85% identity against the Greengenes 13_8 85% OTUs reference database. The reference database is provided as a FeatureData[Sequence] artifact.

Note

Closed-reference OTU clustering is generally performed at a higher percent identity, but 85% is used here so users of this tutorial don’t have to download a larger reference database. Typically, clustering at a given percent identity is performed against a reference database clustered at the same percent identity, but this has not been properly benchmarked to determine whether it is the optimal way to perform closed-reference clustering.

qiime vsearch cluster-features-closed-reference \
  --i-table table.qza \
  --i-sequences rep-seqs.qza \
  --i-reference-sequences 85_otus.qza \
  --p-perc-identity 0.85 \
  --o-clustered-table table-cr-85.qza \
  --o-clustered-sequences rep-seqs-cr-85.qza \
  --o-unmatched-sequences unmatched-cr-85.qza

Output artifacts:

table-cr-85.qza
unmatched-cr-85.qza
rep-seqs-cr-85.qza

The outputs from cluster-features-closed-reference are a FeatureTable[Frequency] artifact and two FeatureData[Sequence] artifacts. One FeatureData[Sequence] artifact contains the sequences defining the features in the FeatureTable (rep-seqs-cr-85.qza in the command block above). The other contains the feature IDs and sequences that didn’t match the reference database at 85% identity (unmatched-cr-85.qza). In closed-reference OTU picking, the reference sequences provided as input should be used as the sequences defining the features in the FeatureTable.
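
The split between matched and unmatched features can be sketched as follows. The hits file is hypothetical: it stands in for the result of searching each input feature against the reference at the chosen identity, with `-` used here as a made-up marker for "no hit". Features with a hit are represented by the reference ID they matched, and features without a hit are set aside as unmatched.

```shell
tmpdir=$(mktemp -d)
# feature-id <TAB> reference hit, or "-" for no hit  (hypothetical search result)
printf 'f1\tref7\nf2\tref7\nf3\t-\nf4\tref2\n' > "$tmpdir/hits.tsv"

# Reference IDs that become the features of the clustered table.
matched_refs=$(awk -F '\t' '$2 != "-" { print $2 }' "$tmpdir/hits.tsv" | LC_ALL=C sort -u)

# Input features that did not match the reference.
unmatched=$(awk -F '\t' '$2 == "-" { print $1 }' "$tmpdir/hits.tsv" | LC_ALL=C sort)

printf 'clustered features:\n%s\n' "$matched_refs"
printf 'unmatched features:\n%s\n' "$unmatched"
```

Here f1 and f2 collapse into the single feature ref7, f4 becomes ref2, and f3 is reported as unmatched.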

Open-reference clustering

Like the closed-reference clustering example above, open-reference clustering can be performed using the qiime vsearch cluster-features-open-reference command.

Note

Open-reference OTU clustering is generally performed at a higher percent identity, but 85% is used here so users of this tutorial don’t have to download a larger reference database. Typically, clustering at a given percent identity is performed against a reference database clustered at the same percent identity, but this has not been properly benchmarked to determine whether it is the optimal way to perform open-reference clustering.

qiime vsearch cluster-features-open-reference \
  --i-table table.qza \
  --i-sequences rep-seqs.qza \
  --i-reference-sequences 85_otus.qza \
  --p-perc-identity 0.85 \
  --o-clustered-table table-or-85.qza \
  --o-clustered-sequences rep-seqs-or-85.qza \
  --o-new-reference-sequences new-ref-seqs-or-85.qza

Output artifacts:

new-ref-seqs-or-85.qza
rep-seqs-or-85.qza
table-or-85.qza

The outputs from cluster-features-open-reference are a FeatureTable[Frequency] artifact and two FeatureData[Sequence] artifacts. One of the FeatureData[Sequence] artifacts represents the clustered sequences, while the other artifact represents the new reference sequences, composed of the reference sequences used for input, as well as the sequences clustered as part of the internal de novo clustering step.
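
The composition of the new reference sequences can be illustrated with a small sketch that works on IDs only. Both input lists are hypothetical: one stands in for the IDs in the input reference, the other for the centroids produced when the sequences that missed the reference are clustered de novo. The new reference is modeled here as the union of the two.

```shell
tmpdir=$(mktemp -d)
# IDs in the input reference database (hypothetical)
printf 'ref1\nref2\nref7\n' > "$tmpdir/reference-ids.txt"
# Centroid IDs from de novo clustering of sequences that missed the reference (hypothetical)
printf 'f3\nf9\n' > "$tmpdir/denovo-centroid-ids.txt"

# New reference = input reference plus de novo centroids, without duplicates.
new_reference=$(cat "$tmpdir/reference-ids.txt" "$tmpdir/denovo-centroid-ids.txt" | LC_ALL=C sort -u)
new_ref_count=$(printf '%s\n' "$new_reference" | wc -l | tr -d ' ')

printf '%s\n' "$new_reference"
printf 'new reference size: %s\n' "$new_ref_count"
```

In this toy case the three reference IDs and two de novo centroids give a new reference of five sequences.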
