Docstring:
Usage: qiime feature-classifier classify-hybrid-vsearch-sklearn
[OPTIONS]
NOTE: THIS PIPELINE IS AN ALPHA RELEASE. Please report bugs to
https://forum.qiime2.org! Assign taxonomy to query sequences using a hybrid
classifier. First, performs a rough positive filter to remove artifact and
low-coverage sequences (use the "prefilter" parameter to toggle this step on
or off). Second, performs a VSEARCH search between query and reference_reads
to find exact matches, followed by least common ancestor consensus taxonomy
assignment from among the maxaccepts top hits, at least min_consensus of
which must share that taxonomic assignment. Query sequences without an exact
match are then classified with a pre-trained sklearn taxonomy classifier to
predict the most likely taxonomic lineage.
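The two-stage flow described above can be pictured with a toy model. This is only a sketch under simplifying assumptions: it omits the prefilter and the consensus step, treats "exact match" as a dictionary lookup, and the names hybrid_classify, exact_lookup and fallback are hypothetical and not part of the plugin.

```python
from typing import Callable, Dict


def hybrid_classify(
    queries: Dict[str, str],
    exact_lookup: Dict[str, str],
    fallback: Callable[[str], str],
) -> Dict[str, str]:
    """Toy two-stage flow: exact match first, fallback classifier second."""
    results = {}
    for seq_id, seq in queries.items():
        if seq in exact_lookup:
            # Stage 1: an exact reference match supplies the taxonomy directly.
            results[seq_id] = exact_lookup[seq]
        else:
            # Stage 2: sequences without an exact match go to the classifier.
            results[seq_id] = fallback(seq)
    return results


# Hypothetical toy data, for illustration only.
reference = {"ACGT": "k__Bacteria; p__Firmicutes"}
queries = {"q1": "ACGT", "q2": "TTTT"}
assignments = hybrid_classify(queries, reference, lambda seq: "Unassigned")
```

Here q1 is resolved by the lookup and q2 falls through to the stand-in classifier, which mirrors the order of operations in the description but none of its internals.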
Inputs:
--i-query ARTIFACT FeatureData[Sequence]
Query Sequences. [required]
--i-reference-reads ARTIFACT FeatureData[Sequence]
Reference sequences. [required]
--i-reference-taxonomy ARTIFACT FeatureData[Taxonomy]
Reference taxonomy labels. [required]
--i-classifier ARTIFACT TaxonomicClassifier
Pre-trained sklearn taxonomic classifier for
classifying the reads. [required]
Parameters:
--p-maxaccepts VALUE Int % Range(1, None) | Str % Choices('all')
Maximum number of hits to keep for each query. Set to
"all" to keep all hits above the perc-identity
similarity threshold. Note that if strand=both,
maxaccepts will keep N hits for each direction (if
searches in the opposite direction yield results that
exceed the minimum perc-identity). In those cases use
maxhits to control the total number of hits returned.
This option works in tandem with maxrejects. The search
process sorts target sequences by decreasing number of
k-mers they have in common with the query sequence,
using that information as a proxy for sequence
similarity. After pairwise alignment, if the first
target sequence passes the acceptance criteria, it is
accepted as the best hit and the search process stops
for that query. If maxaccepts is set to a higher value,
more hits are accepted. If maxaccepts and maxrejects
are both set to "all", the complete database is
searched.
[default: 10]
--p-perc-identity PROPORTION Range(0.0, 1.0, inclusive_end=True)
Percent sequence similarity to use for PREFILTER.
Reject match if percent identity to query is lower.
Set to a lower value to perform a rough pre-filter.
This parameter is ignored if `prefilter` is disabled.
[default: 0.5]
--p-query-cov PROPORTION Range(0.0, 1.0, inclusive_end=True)
Query coverage threshold to use for PREFILTER. Reject
match if query alignment coverage per high-scoring
pair is lower. Set to a lower value to perform a rough
pre-filter. This parameter is ignored if `prefilter`
is disabled. [default: 0.8]
--p-strand TEXT Choices('both', 'plus')
Align against reference sequences in forward ("plus")
or both directions ("both"). [default: 'both']
--p-min-consensus NUMBER Range(0.5, 1.0, inclusive_start=False,
inclusive_end=True) Minimum fraction of assignments that must match the top
hit to be accepted as the consensus assignment. [default: 0.51]
--p-maxhits VALUE Int % Range(1, None) | Str % Choices('all')
[default: 'all']
--p-maxrejects VALUE Int % Range(1, None) | Str % Choices('all')
[default: 'all']
--p-reads-per-batch VALUE Int % Range(1, None) | Str % Choices('auto')
Number of reads to process in each batch for sklearn
classification. If "auto", this parameter is
autoscaled to min(number of query sequences / threads,
20000). [default: 'auto']
--p-confidence VALUE Float % Range(0, 1, inclusive_end=True) | Str %
Choices('disable') Confidence threshold for limiting taxonomic depth.
Set to "disable" to disable confidence calculation, or
0 to calculate confidence but not apply it to limit
the taxonomic depth of the assignments. [default: 0.7]
--p-read-orientation TEXT Choices('same', 'reverse-complement', 'auto')
Direction of reads with respect to reference
sequences in the pre-trained sklearn classifier. "same"
will cause reads to be classified unchanged;
"reverse-complement" will cause reads to be reversed
and complemented prior to classification. "auto" will
autodetect orientation based on the confidence
estimates for the first 100 reads. [default: 'auto']
--p-threads INTEGER Range(1, None)
Number of threads to use for job parallelization.
[default: 1]
--p-prefilter / --p-no-prefilter
Toggle positive filter of query sequences on or off.
[default: True]
--p-sample-size INTEGER Range(1, None)
Randomly extract the given number of sequences from
the reference database to use for prefiltering. This
parameter is ignored if `prefilter` is disabled.
[default: 1000]
--p-randseed INTEGER Range(0, None)
Use integer as a seed for the pseudo-random generator
used during prefiltering. A given seed always produces
the same output, which is useful for replicability.
Set to 0 to use a pseudo-random seed. This parameter
is ignored if `prefilter` is disabled. [default: 0]
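To make the role of min-consensus concrete, the following is a simplified sketch of a consensus step, not the plugin's own implementation. It assumes taxonomy strings are semicolon-delimited lineages and returns the deepest lineage prefix shared by at least min_consensus of the hits. The function name consensus_taxonomy and the "Unassigned" fallback label are assumptions made for illustration.

```python
from collections import Counter
from typing import List


def consensus_taxonomy(hits: List[str], min_consensus: float = 0.51) -> str:
    """Deepest lineage prefix shared by at least min_consensus of the hits."""
    if not hits:
        return "Unassigned"
    lineages = [[rank.strip() for rank in hit.split(";")] for hit in hits]
    consensus: List[str] = []
    depth = 0
    while True:
        # Count lineage prefixes of length depth + 1 among hits that reach it.
        prefixes = Counter(
            tuple(lineage[: depth + 1])
            for lineage in lineages
            if len(lineage) > depth
        )
        if not prefixes:
            break
        best, count = prefixes.most_common(1)[0]
        # Stop descending once agreement drops below the threshold.
        if count / len(hits) < min_consensus:
            break
        consensus = list(best)
        depth += 1
    return "; ".join(consensus) if consensus else "Unassigned"


# Hypothetical hits for one query sequence.
example_hits = ["k__A; p__B; g__X", "k__A; p__B; g__Y", "k__A; p__B; g__X"]
loose = consensus_taxonomy(example_hits, min_consensus=0.51)
strict = consensus_taxonomy(example_hits, min_consensus=0.7)
```

With the default threshold, two of three hits agreeing on g__X is enough to keep the genus; raising the threshold to 0.7 stops the assignment at p__B.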
Outputs:
--o-classification ARTIFACT FeatureData[Taxonomy]
Taxonomy classifications of query sequences.
[required]
Miscellaneous:
--output-dir PATH Output unspecified results to a directory
--verbose / --quiet Display verbose output to stdout and/or stderr during
execution of this action. Or silence output if
execution is successful (silence is golden).
--example-data PATH Write example data and exit.
--citations Show citations and exit.
--help Show this message and exit.
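As an illustration of how the options above combine, this sketch assembles an argument list for the command without running it. The helper build_command and all .qza file names are hypothetical; only the flags themselves come from the help text above.

```python
import shlex
from typing import List


def build_command(
    query: str,
    reference_reads: str,
    reference_taxonomy: str,
    classifier: str,
    output: str,
    threads: int = 1,
    prefilter: bool = True,
) -> List[str]:
    """Assemble the argument list for the hybrid classifier command."""
    return [
        "qiime", "feature-classifier", "classify-hybrid-vsearch-sklearn",
        "--i-query", query,
        "--i-reference-reads", reference_reads,
        "--i-reference-taxonomy", reference_taxonomy,
        "--i-classifier", classifier,
        "--p-threads", str(threads),
        # Boolean options are expressed as a flag pair rather than a value.
        "--p-prefilter" if prefilter else "--p-no-prefilter",
        "--o-classification", output,
    ]


# Hypothetical artifact paths, for illustration only.
command = build_command(
    "rep-seqs.qza", "ref-seqs.qza", "ref-taxonomy.qza",
    "classifier.qza", "taxonomy.qza", threads=4,
)
print(shlex.join(command))
```

The resulting list could be passed to subprocess.run in an environment where QIIME 2 is installed.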
Import:
from qiime2.plugins.feature_classifier.pipelines import classify_hybrid_vsearch_sklearn
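A minimal sketch of calling the pipeline through the Python API, assuming a working QIIME 2 environment. The wrapper run_hybrid_classification and its path arguments are hypothetical; the keyword names follow the parameter list documented below, and the QIIME 2 imports are deferred so the function can be defined even where QIIME 2 is not installed.

```python
def run_hybrid_classification(
    query_path,
    reference_reads_path,
    reference_taxonomy_path,
    classifier_path,
    output_path,
    threads=1,
):
    """Load the input artifacts, run the pipeline, and save the result."""
    # Imported lazily: these imports require a QIIME 2 installation.
    from qiime2 import Artifact
    from qiime2.plugins.feature_classifier.pipelines import (
        classify_hybrid_vsearch_sklearn,
    )

    results = classify_hybrid_vsearch_sklearn(
        query=Artifact.load(query_path),
        reference_reads=Artifact.load(reference_reads_path),
        reference_taxonomy=Artifact.load(reference_taxonomy_path),
        classifier=Artifact.load(classifier_path),
        threads=threads,
    )
    # The single output is exposed under its documented name.
    results.classification.save(output_path)
    return results.classification
```

All other parameters are left at their defaults here; any of them can be passed as additional keyword arguments.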
Docstring:
ALPHA Hybrid classifier: VSEARCH exact match + sklearn classifier
NOTE: THIS PIPELINE IS AN ALPHA RELEASE. Please report bugs to
https://forum.qiime2.org! Assign taxonomy to query sequences using a hybrid
classifier. First, performs a rough positive filter to remove artifact and
low-coverage sequences (use the "prefilter" parameter to toggle this step on
or off). Second, performs a VSEARCH search between query and
reference_reads to find exact matches, followed by least common ancestor
consensus taxonomy assignment from among the maxaccepts top hits, at least
min_consensus of which must share that taxonomic assignment. Query sequences
without an exact match are then classified with a pre-trained sklearn
taxonomy classifier to predict the most likely taxonomic lineage.
Parameters
----------
query : FeatureData[Sequence]
Query Sequences.
reference_reads : FeatureData[Sequence]
Reference sequences.
reference_taxonomy : FeatureData[Taxonomy]
Reference taxonomy labels.
classifier : TaxonomicClassifier
Pre-trained sklearn taxonomic classifier for classifying the reads.
maxaccepts : Int % Range(1, None) | Str % Choices('all'), optional
Maximum number of hits to keep for each query. Set to "all" to keep all
hits above the perc_identity similarity threshold. Note that if
strand=both, maxaccepts will keep N hits for each direction (if searches
in the opposite direction yield results that exceed the minimum
perc_identity). In those cases use maxhits to control the total number of
hits returned. This option works in tandem with maxrejects. The search
process sorts target sequences by decreasing number of k-mers they have in
common with the query sequence, using that information as a proxy for
sequence similarity. After pairwise alignment, if the first target
sequence passes the acceptance criteria, it is accepted as the best hit
and the search process stops for that query. If maxaccepts is set to a
higher value, more hits are accepted. If maxaccepts and maxrejects are
both set to "all", the complete database is searched.
perc_identity : Float % Range(0.0, 1.0, inclusive_end=True), optional
Percent sequence similarity to use for PREFILTER. Reject match if
percent identity to query is lower. Set to a lower value to perform a
rough pre-filter. This parameter is ignored if `prefilter` is disabled.
query_cov : Float % Range(0.0, 1.0, inclusive_end=True), optional
Query coverage threshold to use for PREFILTER. Reject match if query
alignment coverage per high-scoring pair is lower. Set to a lower value
to perform a rough pre-filter. This parameter is ignored if `prefilter`
is disabled.
strand : Str % Choices('both', 'plus'), optional
Align against reference sequences in forward ("plus") or both
directions ("both").
min_consensus : Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True), optional
Minimum fraction of assignments that must match the top hit to be
accepted as the consensus assignment.
maxhits : Int % Range(1, None) | Str % Choices('all'), optional
maxrejects : Int % Range(1, None) | Str % Choices('all'), optional
reads_per_batch : Int % Range(1, None) | Str % Choices('auto'), optional
Number of reads to process in each batch for sklearn classification. If
"auto", this parameter is autoscaled to min(number of query sequences /
threads, 20000).
confidence : Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable'), optional
Confidence threshold for limiting taxonomic depth. Set to "disable" to
disable confidence calculation, or 0 to calculate confidence but not
apply it to limit the taxonomic depth of the assignments.
read_orientation : Str % Choices('same', 'reverse-complement', 'auto'), optional
Direction of reads with respect to reference sequences in the pre-trained
sklearn classifier. "same" will cause reads to be classified unchanged;
"reverse-complement" will cause reads to be reversed and complemented
prior to classification. "auto" will autodetect orientation based on
the confidence estimates for the first 100 reads.
threads : Int % Range(1, None), optional
Number of threads to use for job parallelization.
prefilter : Bool, optional
Toggle positive filter of query sequences on or off.
sample_size : Int % Range(1, None), optional
Randomly extract the given number of sequences from the reference
database to use for prefiltering. This parameter is ignored if
`prefilter` is disabled.
randseed : Int % Range(0, None), optional
Use integer as a seed for the pseudo-random generator used during
prefiltering. A given seed always produces the same output, which is
useful for replicability. Set to 0 to use a pseudo-random seed. This
parameter is ignored if `prefilter` is disabled.
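The "auto" rule for reads_per_batch, min(number of query sequences / threads, 20000), can be worked through numerically. This is a sketch only: the helper name auto_reads_per_batch is hypothetical, and rounding up and enforcing a minimum of 1 are assumptions, since the description above does not say how fractional results are handled.

```python
import math


def auto_reads_per_batch(n_queries: int, threads: int, cap: int = 20000) -> int:
    """Batch size under the documented 'auto' rule, with assumed rounding."""
    # Assumption: fractional batch sizes are rounded up.
    per_thread = math.ceil(n_queries / threads)
    # Assumption: the batch size never drops below 1.
    return max(1, min(per_thread, cap))


small_job = auto_reads_per_batch(1000, threads=4)
large_job = auto_reads_per_batch(1_000_000, threads=4)
```

For 1000 queries on 4 threads each batch holds 250 reads, while a million queries hit the 20000 cap.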
Returns
-------
classification : FeatureData[Taxonomy]
Taxonomy classifications of query sequences.
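The effect of the confidence parameter on taxonomic depth can be illustrated with a simplified sketch. It assumes each taxonomic level of a prediction carries its own confidence estimate and that the lineage is cut at the first level falling below the threshold. This is not the classifier's actual implementation, and the function name truncate_by_confidence, the example confidences, and the "Unassigned" label are hypothetical.

```python
from typing import List, Tuple


def truncate_by_confidence(
    levels: List[Tuple[str, float]], threshold: float
) -> str:
    """Keep lineage levels until one falls below the confidence threshold."""
    kept = []
    for label, level_confidence in levels:
        if level_confidence < threshold:
            # Deeper levels are dropped once confidence is insufficient.
            break
        kept.append(label)
    return "; ".join(kept) if kept else "Unassigned"


# Hypothetical per-level confidence estimates for one read.
prediction = [("k__Bacteria", 0.99), ("p__Firmicutes", 0.85), ("g__Bacillus", 0.6)]
at_default = truncate_by_confidence(prediction, 0.7)
at_zero = truncate_by_confidence(prediction, 0.0)
```

A threshold of 0 keeps the full lineage, which matches the documented behavior of calculating confidence without using it to limit depth.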