Clustering sequences into OTUs using q2-vsearch¶
De novo, closed-reference, and open-reference clustering are currently supported in QIIME 2.
Clustering of sequences or features into OTUs using vsearch is currently
possible from demultiplexed, quality-controlled sequence data (i.e., a
SampleData[Sequences] artifact), or from dereplicated, quality-controlled
data in a feature table and feature representative sequences (i.e.,
FeatureTable[Frequency] and FeatureData[Sequence] artifacts, which could be
generated using the qiime dada2 denoise-* or qiime deblur denoise-* commands).
The first option is currently performed in two steps (but will likely be
accessible through a single command in the future for convenience). The second
option is performed in one step.
QIIME 1 Users
Demultiplexed, quality-filtered sequence data corresponds to the seqs.fna
file generated by the QIIME 1 split_libraries*.py commands.
After working through this tutorial, you will know how to run de novo,
closed-reference, and open-reference clustering. This will be illustrated
beginning with a QIIME 1 seqs.fna file that will be read into a
SampleData[Sequences] artifact. If you already have FeatureTable[Frequency]
and FeatureData[Sequence] artifacts that you’d like to cluster, you can skip
ahead to the Clustering of FeatureTable[Frequency] and FeatureData[Sequence]
section of this tutorial.
Obtain the data¶
Start by creating a directory to work in.
mkdir qiime2-otu-clustering-tutorial
cd qiime2-otu-clustering-tutorial
Next, download the necessary files:
Download URL: https://data.qiime2.org/2024.10/tutorials/otu-clustering/seqs.fna
Save as: seqs.fna
Using wget:
wget \
-O "seqs.fna" \
"https://data.qiime2.org/2024.10/tutorials/otu-clustering/seqs.fna"
Or, using curl:
curl -sL \
"https://data.qiime2.org/2024.10/tutorials/otu-clustering/seqs.fna" > \
"seqs.fna"
Download URL: https://data.qiime2.org/2024.10/tutorials/otu-clustering/85_otus.qza
Save as: 85_otus.qza
Using wget:
wget \
-O "85_otus.qza" \
"https://data.qiime2.org/2024.10/tutorials/otu-clustering/85_otus.qza"
Or, using curl:
curl -sL \
"https://data.qiime2.org/2024.10/tutorials/otu-clustering/85_otus.qza" > \
"85_otus.qza"
Dereplicating a SampleData[Sequences] artifact¶
If you are beginning your analysis with demultiplexed, quality-controlled
sequences, such as those in a QIIME 1 seqs.fna file, your first step is to
import that data into a QIIME 2 artifact. The semantic type used here is
SampleData[Sequences], indicating that the data represents collections of
sequences associated with one or more samples.
qiime tools import \
--input-path seqs.fna \
--output-path seqs.qza \
--type 'SampleData[Sequences]'
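Before importing, it can be useful to confirm that the file follows the QIIME 1 header convention and to see how many sequences each sample contributes. The sketch below assumes each header has the form >SampleID_N, optionally followed by comments. The helper name count_seqs_per_sample and the toy file are hypothetical and exist only for illustration.

```shell
# Hypothetical helper: count sequences per sample in a QIIME 1-style FASTA file.
# Assumes each header looks like ">SampleID_N ..." where N is an integer.
count_seqs_per_sample() {
  awk '/^>/ {
         id = substr($1, 2)        # drop the leading ">"
         sub(/_[0-9]+$/, "", id)   # strip the trailing _N to get the sample ID
         n[id]++
       }
       END { for (s in n) print s "\t" n[s] }' "$1" | LC_ALL=C sort
}

# Toy input standing in for seqs.fna (made-up sample IDs and sequences).
toy_dir_a=$(mktemp -d)
cat > "$toy_dir_a/toy.fna" <<'EOF'
>S1_0 orig_bc=AAAA
ACGTACGT
>S1_1 orig_bc=AAAA
ACGTACGA
>S2_2 orig_bc=CCCC
ACGTACGT
EOF

count_seqs_per_sample "$toy_dir_a/toy.fna"
```

To check the tutorial data, the same function can be pointed at seqs.fna instead of the toy file.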
After importing data, you can dereplicate it with the dereplicate-sequences
command.
qiime vsearch dereplicate-sequences \
--i-sequences seqs.qza \
--o-dereplicated-table table.qza \
--o-dereplicated-sequences rep-seqs.qza
The outputs from dereplicate-sequences are a FeatureTable[Frequency] artifact
and a FeatureData[Sequence] artifact. The FeatureTable[Frequency] artifact is
the feature table indicating the number of times each amplicon sequence
variant (ASV) is observed in each of your samples. The FeatureData[Sequence]
artifact contains the mapping of each feature identifier to the sequence
variant that defines that feature. These files are analogous to those
generated by qiime dada2 denoise-* and qiime deblur denoise-*, except that no
denoising, chimera removal, or other quality control has been applied in the
dereplication process. (In this example, the only quality control of these
data is what was applied outside of QIIME 2, before the import step.)
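As a rough illustration of what dereplication does conceptually, the sketch below collapses identical sequences in a toy FASTA file and counts how often each one occurs in each sample. It is not how q2-vsearch is implemented: it assumes single-line sequences and QIIME 1-style >SampleID_N headers, and the helper name dereplicate_toy and the toy data are made up.

```shell
# Hypothetical toy dereplication: one output row per (sequence, sample) pair.
# Assumes single-line sequences and ">SampleID_N" headers.
dereplicate_toy() {
  awk '/^>/ { sample = substr($1, 2); sub(/_[0-9]+$/, "", sample); next }
       NF   { count[$0 "\t" sample]++ }   # key: sequence + sample
       END  { for (k in count) print k "\t" count[k] }' "$1" | LC_ALL=C sort
}

# Toy input (made-up data).
toy_dir_b=$(mktemp -d)
cat > "$toy_dir_b/toy.fna" <<'EOF'
>S1_0
ACGT
>S1_1
ACGT
>S2_2
ACGT
>S2_3
TTTT
EOF

# Columns: sequence, sample, number of observations.
dereplicate_toy "$toy_dir_b/toy.fna"
```

The long-format rows correspond to the non-zero cells of a feature table, and the distinct values in the first column correspond to the representative sequences.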
Clustering of FeatureTable[Frequency] and FeatureData[Sequence]¶
OTU clustering in QIIME 2 is currently applied to a FeatureTable[Frequency]
artifact and a FeatureData[Sequence] artifact. These artifacts can come from a
variety of analysis pipelines, including qiime vsearch dereplicate-sequences
(illustrated above), qiime dada2 denoise-*, qiime deblur denoise-*, or one of
the clustering processes illustrated below (for example, to recluster data at
a lower percent identity).
The sequences in the FeatureData[Sequence] artifact are clustered against one
another (in de novo clustering) or against a reference database (in
closed-reference clustering), and then the features in the FeatureTable are
collapsed, resulting in new features that are clusters of the input features.
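The collapsing step can be pictured as summing the counts of all input features assigned to the same cluster, sample by sample. The following is a minimal sketch on toy data, assuming a tab-separated feature-to-cluster map and a long-format table; the file layouts, the helper name collapse_table, and the IDs are hypothetical and are not formats that QIIME 2 reads or writes.

```shell
# Hypothetical helper: sum per-sample counts of features that share a cluster.
# $1: feature-to-cluster map (feature<TAB>cluster)
# $2: long-format table (feature<TAB>sample<TAB>count)
collapse_table() {
  awk -F'\t' 'NR == FNR       { cluster[$1] = $2; next }
              ($1 in cluster) { sum[cluster[$1] "\t" $2] += $3 }
              END             { for (k in sum) print k "\t" sum[k] }' "$1" "$2" |
    LC_ALL=C sort
}

# Toy data: f1 and f2 fall into the same cluster, f3 into its own.
toy_dir_c=$(mktemp -d)
printf 'f1\totu1\nf2\totu1\nf3\totu2\n' > "$toy_dir_c/map.tsv"
printf 'f1\tS1\t5\nf2\tS1\t3\nf2\tS2\t1\nf3\tS2\t4\n' > "$toy_dir_c/table.tsv"

# Columns: cluster, sample, summed count.
collapse_table "$toy_dir_c/map.tsv" "$toy_dir_c/table.tsv"
```

In this toy case, otu1 ends up with 5 + 3 = 8 observations in sample S1.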
De novo clustering¶
De novo clustering of a feature table can be performed as follows. In this example, clustering is performed at 99% identity to create 99% OTUs.
qiime vsearch cluster-features-de-novo \
--i-table table.qza \
--i-sequences rep-seqs.qza \
--p-perc-identity 0.99 \
--o-clustered-table table-dn-99.qza \
--o-clustered-sequences rep-seqs-dn-99.qza
The outputs from this process are a FeatureTable[Frequency] artifact and a
FeatureData[Sequence] artifact. The FeatureData[Sequence] artifact will
contain the centroid sequence defining each OTU cluster.
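Because the clustered outputs are themselves FeatureTable[Frequency] and FeatureData[Sequence] artifacts, they can be fed back into the same command to recluster at a lower percent identity. The sketch below wraps that call in a hypothetical helper, recluster_de_novo, which assumes the table-<name>.qza and rep-seqs-<name>.qza naming used in this tutorial. The 0.97 threshold and the dn-97 output names are only examples. By default it prints the command rather than running it; set DRY_RUN=0 to execute.

```shell
# Hypothetical wrapper around the de novo clustering command shown above.
# Usage: recluster_de_novo <input-name> <identity> <output-name>
# Assumes files are named table-<name>.qza and rep-seqs-<name>.qza.
recluster_de_novo() {
  in_name=$1
  identity=$2
  out_name=$3
  # Rebuild the argument list as the full qiime command line.
  set -- qiime vsearch cluster-features-de-novo \
    --i-table "table-${in_name}.qza" \
    --i-sequences "rep-seqs-${in_name}.qza" \
    --p-perc-identity "$identity" \
    --o-clustered-table "table-${out_name}.qza" \
    --o-clustered-sequences "rep-seqs-${out_name}.qza"
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"   # dry run: only print the command
  else
    "$@"        # run qiime (requires an activated QIIME 2 environment)
  fi
}

# Example: recluster the 99% OTUs from above at 97% identity (dry run).
recluster_de_novo dn-99 0.97 dn-97
```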
Closed-reference clustering¶
Closed-reference clustering of a feature table can be performed as follows. In
this example, clustering is performed at 85% identity against the Greengenes
13_8 85% OTUs reference database. The reference database is provided as a
FeatureData[Sequence] artifact.
Note
Closed-reference OTU clustering is generally performed at a higher percent identity, but 85% is used here so users of this tutorial don’t have to download a larger reference database. Typically clustering at some percent identity is performed against a reference database clustered at the same percent identity, but this has not been properly benchmarked to determine if it is the optimal way to perform closed-reference clustering.
qiime vsearch cluster-features-closed-reference \
--i-table table.qza \
--i-sequences rep-seqs.qza \
--i-reference-sequences 85_otus.qza \
--p-perc-identity 0.85 \
--o-clustered-table table-cr-85.qza \
--o-clustered-sequences rep-seqs-cr-85.qza \
--o-unmatched-sequences unmatched-cr-85.qza
The outputs from cluster-features-closed-reference are a
FeatureTable[Frequency] artifact and two FeatureData[Sequence] artifacts. One
FeatureData[Sequence] artifact contains the sequences defining the features in
the FeatureTable (rep-seqs-cr-85.qza from the command block above). The other
(unmatched-cr-85.qza) contains the feature ids and sequences that didn’t match
the reference database at 85% identity. In closed-reference OTU picking, the
reference sequences provided as input should be used as the sequences defining
the features in the FeatureTable.
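One quick sanity check after closed-reference clustering is the share of input features that failed to match the reference. The sketch below computes that fraction from two FASTA files. It assumes the artifacts have first been exported with qiime tools export, and that each export directory contains a file named dna-sequences.fasta (an assumption worth verifying on your installation). The helper names and the toy files are hypothetical.

```shell
# Assumed export step (not run here; requires QIIME 2):
#   qiime tools export --input-path rep-seqs.qza --output-path rep-seqs-export
#   qiime tools export --input-path unmatched-cr-85.qza --output-path unmatched-export

# Hypothetical helper: number of records in a FASTA file.
count_fasta_records() {
  # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
  grep -c '^>' "$1" || true
}

# Hypothetical helper: fraction of input features left unmatched.
# $1: FASTA of all input features, $2: FASTA of unmatched features.
unmatched_fraction() {
  total=$(count_fasta_records "$1")
  unmatched=$(count_fasta_records "$2")
  awk -v u="$unmatched" -v t="$total" \
    'BEGIN { if (t > 0) printf "%.2f\n", u / t; else print "NA" }'
}

# Toy files standing in for the exported FASTA files (made-up data).
toy_dir_e=$(mktemp -d)
printf '>f1\nACGT\n>f2\nACGA\n>f3\nTTTT\n>f4\nGGGG\n' > "$toy_dir_e/all.fasta"
printf '>f4\nGGGG\n' > "$toy_dir_e/unmatched.fasta"

unmatched_fraction "$toy_dir_e/all.fasta" "$toy_dir_e/unmatched.fasta"
```

Note that this counts features (unique sequences), not reads, so a few highly abundant unmatched features can account for a larger share of the data than the fraction suggests.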
Open-reference clustering¶
Like the closed-reference clustering example above, open-reference clustering
can be performed using the qiime vsearch cluster-features-open-reference
command.
Note
Open-reference OTU clustering is generally performed at a higher percent identity, but 85% is used here so users of this tutorial don’t have to download a larger reference database. Typically clustering at some percent identity is performed against a reference database clustered at the same percent identity, but this has not been properly benchmarked to determine if it is the optimal way to perform open-reference clustering.
qiime vsearch cluster-features-open-reference \
--i-table table.qza \
--i-sequences rep-seqs.qza \
--i-reference-sequences 85_otus.qza \
--p-perc-identity 0.85 \
--o-clustered-table table-or-85.qza \
--o-clustered-sequences rep-seqs-or-85.qza \
--o-new-reference-sequences new-ref-seqs-or-85.qza
The outputs from cluster-features-open-reference are a
FeatureTable[Frequency] artifact and two FeatureData[Sequence] artifacts. One
of the FeatureData[Sequence] artifacts represents the clustered sequences,
while the other artifact represents the new reference sequences, composed of
the reference sequences used for input, as well as the sequences clustered as
part of the internal de novo clustering step.
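The general idea of open-reference clustering can be sketched as a two-way split: features that match a reference sequence are assigned to it, and the remaining features are passed to a de novo clustering step. The toy example below shows only that split. It assumes a made-up tab-separated hits file in which "*" marks a feature with no reference match; the helper name split_by_reference_hit and all IDs are hypothetical, and this is not the internal logic of q2-vsearch.

```shell
# Hypothetical helper: route each feature to the reference or de novo branch.
# $1: hits file (feature<TAB>reference ID, or "*" when nothing matched)
split_by_reference_hit() {
  awk -F'\t' '$2 == "*" { print "de-novo\t" $1; next }
              { print "reference\t" $1 "\t" $2 }' "$1"
}

# Toy hits: f1 and f3 match the same reference sequence, f2 matches nothing.
toy_dir_f=$(mktemp -d)
printf 'f1\tref42\nf2\t*\nf3\tref42\n' > "$toy_dir_f/hits.tsv"

split_by_reference_hit "$toy_dir_f/hits.tsv"
```

In this picture, the features routed to the de novo branch are the ones whose cluster centroids are added to the input reference sequences to form the new reference.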