Clustering sequences into OTUs using q2-vsearch

De novo, closed-reference, and open-reference clustering are currently supported in QIIME 2.

Clustering of sequences or features into OTUs using vsearch is currently possible from demultiplexed, quality-controlled sequence data (i.e., a SampleData[Sequences] artifact), or from dereplicated, quality-controlled data in the form of a feature table and feature representative sequences (i.e., the FeatureTable[Frequency] and FeatureData[Sequence] artifacts, which could be generated using the qiime dada2 denoise-* or qiime deblur denoise-* commands). The first option is currently performed in two steps, dereplication followed by clustering (but will likely be accessible through a single command in the future for convenience). The second option is performed in one step.

QIIME 1 Users

Demultiplexed, quality-filtered sequence data is synonymous with the seqs.fna file, generated by the QIIME 1 split_libraries*.py commands.
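As a quick sanity check before importing, it can help to confirm which samples a seqs.fna file contains. The sketch below assumes the usual QIIME 1 header convention, in which each header begins with a label of the form SampleID_SequenceNumber. The toy file, its sample IDs, and its reads are hypothetical stand-ins for illustration, not the tutorial data.

```shell
# Work in a scratch directory so nothing in the tutorial directory is touched.
workdir=$(mktemp -d)

# Toy stand-in for a QIIME 1 seqs.fna file (hypothetical sample IDs and reads).
cat > "$workdir/toy-seqs.fna" <<'EOF'
>S1_0 read-a
ACGTACGT
>S1_1 read-b
ACGTACGT
>S2_2 read-c
TTGGCCAA
EOF

# Take the first word of each header, strip the leading ">" and the trailing
# "_<number>", then count how many sequences each sample contributed.
per_sample_counts=$(
  grep '^>' "$workdir/toy-seqs.fna" \
    | awk '{ id = substr($1, 2); sub(/_[0-9]+$/, "", id); print id }' \
    | LC_ALL=C sort \
    | uniq -c \
    | awk '{ print $2 "\t" $1 }'
)
printf '%s\n' "$per_sample_counts"
```

Pointing the same pipeline at a real seqs.fna file should list one line per sample, which makes it easy to spot unexpected or missing sample IDs before the import step.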

After working through this tutorial, you will know how to run de novo, closed-reference, and open-reference clustering. This will be illustrated beginning with a QIIME 1 seqs.fna file that will be read into a SampleData[Sequences] artifact. If you already have FeatureTable[Frequency] and FeatureData[Sequence] artifacts that you’d like to cluster, you can skip ahead to the Clustering of FeatureTable[Frequency] and FeatureData[Sequence] section of this tutorial.

Obtain the data

Start by creating a directory to work in.

mkdir qiime2-otu-clustering-tutorial
cd qiime2-otu-clustering-tutorial

Next, download the necessary files:

Use whichever of wget or curl is available in your environment. You only need to run one of the two sets of commands.

Using wget:

wget \
  -O "seqs.fna" \
  "https://data.qiime2.org/2024.10/tutorials/otu-clustering/seqs.fna"
wget \
  -O "85_otus.qza" \
  "https://data.qiime2.org/2024.10/tutorials/otu-clustering/85_otus.qza"

Using curl:

curl -sL \
  "https://data.qiime2.org/2024.10/tutorials/otu-clustering/seqs.fna" > \
  "seqs.fna"
curl -sL \
  "https://data.qiime2.org/2024.10/tutorials/otu-clustering/85_otus.qza" > \
  "85_otus.qza"

Dereplicating a SampleData[Sequences] artifact

If you are beginning your analysis with demultiplexed, quality-controlled sequences, such as those in a QIIME 1 seqs.fna file, your first step is to import that data into a QIIME 2 artifact. The semantic type used here is SampleData[Sequences], indicating that the data represents collections of sequences associated with one or more samples.

qiime tools import \
  --input-path seqs.fna \
  --output-path seqs.qza \
  --type 'SampleData[Sequences]'

Output artifacts: seqs.qza

After importing data, you can dereplicate it with the dereplicate-sequences command.

qiime vsearch dereplicate-sequences \
  --i-sequences seqs.qza \
  --o-dereplicated-table table.qza \
  --o-dereplicated-sequences rep-seqs.qza

Output artifacts: table.qza, rep-seqs.qza

The outputs from dereplicate-sequences are a FeatureTable[Frequency] artifact and a FeatureData[Sequence] artifact. The FeatureTable[Frequency] artifact is the feature table indicating the number of times each amplicon sequence variant (ASV) is observed in each of your samples. The FeatureData[Sequence] contains the mapping of each feature identifier to the sequence variant that defines that feature. These files are analogous to those generated by qiime dada2 denoise-* and qiime deblur denoise-*, except that no denoising, chimera removal, or other quality control has been applied in the dereplication process. (In this example, the only quality control of these data is what was applied outside of QIIME 2, before the import step.)
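To make the idea of dereplication concrete, here is a minimal sketch in plain awk that tallies how often each unique sequence occurs in each sample. It is only an illustration of the concept, not how q2-vsearch implements it. It assumes a hypothetical toy FASTA file with QIIME 1-style SampleID_SequenceNumber headers and one line per sequence.

```shell
# Work in a scratch directory so nothing in the tutorial directory is touched.
workdir=$(mktemp -d)

# Hypothetical toy input: four reads from two samples, two distinct sequences.
cat > "$workdir/toy-seqs.fna" <<'EOF'
>S1_0
ACGTACGT
>S1_1
ACGTACGT
>S2_2
ACGTACGT
>S2_3
TTGGCCAA
EOF

# Remember the sample ID from each header line, then count every identical
# (sequence, sample) pair. Output columns: sequence, sample ID, count.
derep_table=$(
  awk '
    /^>/ { id = substr($1, 2); sub(/_[0-9]+$/, "", id); next }  # header line
    { count[$0 "\t" id]++ }                                       # sequence line
    END { for (key in count) print key "\t" count[key] }
  ' "$workdir/toy-seqs.fna" | LC_ALL=C sort
)
printf '%s\n' "$derep_table"
```

Each distinct sequence plays the role of a feature, and the per-sample counts are the entries a feature table would hold for it. No reads are discarded or corrected along the way, which mirrors the point that dereplication alone applies no quality control.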

Clustering of FeatureTable[Frequency] and FeatureData[Sequence]

OTU clustering in QIIME 2 is currently applied to a FeatureTable[Frequency] artifact and a FeatureData[Sequence] artifact. These artifacts can come from a variety of analysis pipelines, including qiime vsearch dereplicate-sequences (illustrated above), qiime dada2 denoise-*, qiime deblur denoise-*, or one of the clustering processes illustrated below (for example, to recluster data at a lower percent identity).

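The sequences in the FeatureData[Sequence] artifact are clustered against one another (in de novo clustering), against a reference database (in closed-reference clustering), or against a reference database followed by de novo clustering of the sequences that did not match it (in open-reference clustering). The features in the FeatureTable[Frequency] artifact are then collapsed, resulting in new features that are clusters of the input features.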
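The collapsing step can be pictured as summing the counts of all input features assigned to the same cluster, sample by sample. The sketch below shows that arithmetic with awk on two hypothetical tab-separated files: a feature-to-OTU assignment and a feature table in long form. The file names, IDs, and counts are made up for illustration and are not formats that QIIME 2 reads or writes.

```shell
# Work in a scratch directory so nothing in the tutorial directory is touched.
workdir=$(mktemp -d)

# Hypothetical cluster assignments: feature ID, OTU ID.
printf 'f1\totuA\nf2\totuA\nf3\totuB\n' > "$workdir/clusters.tsv"

# Hypothetical feature table in long form: feature ID, sample ID, count.
printf 'f1\tS1\t5\nf2\tS1\t2\nf2\tS2\t4\nf3\tS2\t1\n' > "$workdir/table.tsv"

# Load the assignments, then add each feature count to the running total of
# its OTU within the same sample. Output columns: OTU ID, sample ID, count.
collapsed_table=$(
  awk -F '\t' -v OFS='\t' '
    NR == FNR { otu[$1] = $2; next }   # first file: feature to OTU map
    { total[otu[$1] OFS $2] += $3 }    # second file: sum counts per OTU and sample
    END { for (key in total) print key, total[key] }
  ' "$workdir/clusters.tsv" "$workdir/table.tsv" | LC_ALL=C sort
)
printf '%s\n' "$collapsed_table"
```

In this toy case f1 and f2 merge into otuA, so sample S1 ends up with 5 + 2 = 7 counts for that OTU, while the total count in each sample is unchanged.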

De novo clustering

De novo clustering of a feature table can be performed as follows. In this example, clustering is performed at 99% identity to create 99% OTUs.

qiime vsearch cluster-features-de-novo \
  --i-table table.qza \
  --i-sequences rep-seqs.qza \
  --p-perc-identity 0.99 \
  --o-clustered-table table-dn-99.qza \
  --o-clustered-sequences rep-seqs-dn-99.qza

Output artifacts: table-dn-99.qza, rep-seqs-dn-99.qza

The outputs from this process are a FeatureTable[Frequency] artifact and a FeatureData[Sequence] artifact. The FeatureData[Sequence] artifact will contain the centroid sequence defining each OTU cluster.

Closed-reference clustering

Closed-reference clustering of a feature table can be performed as follows. In this example, clustering is performed at 85% identity against the Greengenes 13_8 85% OTUs reference database. The reference database is provided as a FeatureData[Sequence] artifact.

Note

Closed-reference OTU clustering is generally performed at a higher percent identity, but 85% is used here so users of this tutorial don’t have to download a larger reference database. Typically clustering at some percent identity is performed against a reference database clustered at the same percent identity, but this has not been properly benchmarked to determine if it is the optimal way to perform closed-reference clustering.

qiime vsearch cluster-features-closed-reference \
  --i-table table.qza \
  --i-sequences rep-seqs.qza \
  --i-reference-sequences 85_otus.qza \
  --p-perc-identity 0.85 \
  --o-clustered-table table-cr-85.qza \
  --o-clustered-sequences rep-seqs-cr-85.qza \
  --o-unmatched-sequences unmatched-cr-85.qza

Output artifacts: table-cr-85.qza, rep-seqs-cr-85.qza, unmatched-cr-85.qza

The outputs from cluster-features-closed-reference are a FeatureTable[Frequency] artifact and two FeatureData[Sequence] artifacts. The FeatureData[Sequence] artifacts consist of the sequences defining the features in the FeatureTable (rep-seqs-cr-85.qza from the command block above) as well as the collection of feature ids and their sequences that didn’t match the reference database at 85% identity. The reference sequences provided as input should be used as sequences defining the features in the FeatureTable in closed-reference OTU picking.

Open-reference clustering

Open-reference clustering can be performed using the qiime vsearch cluster-features-open-reference command. As in the closed-reference example above, clustering is performed here at 85% identity against the Greengenes 13_8 85% OTUs reference database.

Note

Open-reference OTU clustering is generally performed at a higher percent identity, but 85% is used here so users of this tutorial don’t have to download a larger reference database. Typically clustering at some percent identity is performed against a reference database clustered at the same percent identity, but this has not been properly benchmarked to determine if it is the optimal way to perform open-reference clustering.

qiime vsearch cluster-features-open-reference \
  --i-table table.qza \
  --i-sequences rep-seqs.qza \
  --i-reference-sequences 85_otus.qza \
  --p-perc-identity 0.85 \
  --o-clustered-table table-or-85.qza \
  --o-clustered-sequences rep-seqs-or-85.qza \
  --o-new-reference-sequences new-ref-seqs-or-85.qza

Output artifacts: table-or-85.qza, rep-seqs-or-85.qza, new-ref-seqs-or-85.qza

The outputs from cluster-features-open-reference are a FeatureTable[Frequency] artifact and two FeatureData[Sequence] artifacts. One of the FeatureData[Sequence] artifacts represents the clustered sequences, while the other artifact represents the new reference sequences, composed of the reference sequences used for input, as well as the sequences clustered as part of the internal de novo clustering step.