classify-hybrid-vsearch-sklearn: ALPHA Hybrid classifier: VSEARCH exact match + sklearn classifier

Docstring:

Usage: qiime feature-classifier classify-hybrid-vsearch-sklearn
           [OPTIONS]

  NOTE: THIS PIPELINE IS AN ALPHA RELEASE. Please report bugs to
  https://forum.qiime2.org! Assign taxonomy to query sequences using a hybrid
  classifier. First, performs a rough positive filter to remove artifact and
  low-coverage sequences (use the "prefilter" parameter to toggle this step on
  or off). Second, performs a VSEARCH exact-match search between query and
  reference_reads, followed by least common ancestor consensus taxonomy
  assignment from among the maxaccepts top hits, at least min_consensus of
  which must share that taxonomic assignment. Query sequences without an exact
  match are then classified with a pre-trained sklearn taxonomy classifier to
  predict the most likely taxonomic lineage.

Inputs:
  --i-query ARTIFACT FeatureData[Sequence]
                          Query Sequences.                          [required]
  --i-reference-reads ARTIFACT FeatureData[Sequence]
                          Reference sequences.                      [required]
  --i-reference-taxonomy ARTIFACT FeatureData[Taxonomy]
                          Reference taxonomy labels.                [required]
  --i-classifier ARTIFACT Pre-trained sklearn taxonomic classifier for
    TaxonomicClassifier   classifying the reads.                    [required]
Parameters:
  --p-maxaccepts VALUE Int % Range(1, None) | Str % Choices('all')
                          Maximum number of hits to keep for each query. Set
                          to "all" to keep all hits > perc-identity
                          similarity. Note that if strand=both, maxaccepts
                          will keep N hits for each direction (if searches in
                          the opposite direction yield results that exceed the
                          minimum perc-identity). In those cases use maxhits
                          to control the total number of hits returned. This
                          option works in tandem with maxrejects. The search
                          process sorts target sequences by decreasing number
                          of k-mers they have in common with the query
                          sequence, using that information as a proxy for
                          sequence similarity. After pairwise alignments, if
                          the first target sequence passes the acceptance
                          criteria, it is accepted as best hit and the search
                          process stops for that query. If maxaccepts is set
                          to a higher value, more hits are accepted. If
                          maxaccepts and maxrejects are both set to "all", the
                          complete database is searched.         [default: 10]
  --p-perc-identity PROPORTION Range(0.0, 1.0, inclusive_end=True)
                          Percent sequence similarity to use for PREFILTER.
                          Reject match if percent identity to query is lower.
                          Set to a lower value to perform a rough pre-filter.
                          This parameter is ignored if `prefilter` is
                          disabled.                             [default: 0.5]
  --p-query-cov PROPORTION Range(0.0, 1.0, inclusive_end=True)
                          Query coverage threshold to use for PREFILTER.
                          Reject match if query alignment coverage per
                          high-scoring pair is lower. Set to a lower value to
                          perform a rough pre-filter. This parameter is
                          ignored if `prefilter` is disabled.   [default: 0.8]
  --p-strand TEXT Choices('both', 'plus')
                          Align against reference sequences in forward
                          ("plus") or both directions ("both").
                                                             [default: 'both']
  --p-min-consensus NUMBER Range(0.5, 1.0, inclusive_start=False,
    inclusive_end=True)   Minimum fraction of assignments that must match the
                          top hit to be accepted as the consensus assignment.
                                                               [default: 0.51]
  --p-maxhits VALUE Int % Range(1, None) | Str % Choices('all')
                                                              [default: 'all']
  --p-maxrejects VALUE Int % Range(1, None) | Str % Choices('all')
                                                              [default: 'all']
  --p-reads-per-batch VALUE Int % Range(1, None) | Str % Choices('auto')
                          Number of reads to process in each batch for
                          sklearn classification. If "auto", this parameter is
                          autoscaled to min(number of query sequences /
                          threads, 20000).                   [default: 'auto']
  --p-confidence VALUE Float % Range(0, 1, inclusive_end=True) | Str %
    Choices('disable')    Confidence threshold for limiting taxonomic depth.
                          Set to "disable" to disable confidence calculation,
                          or 0 to calculate confidence but not apply it to
                          limit the taxonomic depth of the assignments.
                                                                [default: 0.7]
  --p-read-orientation TEXT Choices('same', 'reverse-complement', 'auto')
                          Direction of reads with respect to reference
                          sequences in pre-trained sklearn classifier. "same"
                          will cause reads to be classified unchanged;
                          "reverse-complement" will cause reads to be reversed
                          and complemented prior to classification. "auto"
                          will autodetect orientation based on the confidence
                          estimates for the first 100 reads. [default: 'auto']
  --p-threads NTHREADS    Number of threads to use for job parallelization.
                          Pass 0 to use one per available CPU.    [default: 1]
  --p-prefilter / --p-no-prefilter
                          Toggle positive filter of query sequences on or
                          off.                                 [default: True]
  --p-sample-size INTEGER Randomly extract the given number of sequences from
    Range(1, None)        the reference database to use for prefiltering. This
                          parameter is ignored if `prefilter` is disabled.
                                                               [default: 1000]
  --p-randseed INTEGER    Use integer as a seed for the pseudo-random
    Range(0, None)        generator used during prefiltering. A given seed
                          always produces the same output, which is useful for
                          replicability. Set to 0 to use a pseudo-random seed.
                          This parameter is ignored if `prefilter` is
                          disabled.                               [default: 0]
Outputs:
  --o-classification ARTIFACT FeatureData[Taxonomy]
                          Taxonomy classifications of query sequences.
                                                                    [required]
Miscellaneous:
  --output-dir PATH       Output unspecified results to a directory
  --verbose / --quiet     Display verbose output to stdout and/or stderr
                          during execution of this action. Or silence output
                          if execution is successful (silence is golden).
  --recycle-pool TEXT     Use a cache pool for pipeline resumption. QIIME 2
                          will cache your results in this pool for reuse by
                          future invocations. These pools are retained until
                          deleted by the user. If not provided, QIIME 2 will
                          create a pool which is automatically reused by
                          invocations of the same action and removed if the
                          action is successful. Note: these pools are local to
                          the cache you are using.
  --no-recycle            Do not recycle results from a previous failed
                          pipeline run or save the results from this run for
                          future recycling.
  --parallel              Execute your action in parallel. This flag will use
                          your default parallel config.
  --parallel-config FILE  Execute your action in parallel using a config at
                          the indicated path.
  --use-cache DIRECTORY   Specify the cache to be used for the intermediate
                          work of this pipeline. If not provided, the default
                          cache under $TMP/qiime2/ will be used.
                          IMPORTANT FOR HPC USERS: If you are on an HPC system
                          and are using parallel execution it is important to
                          set this to a location that is globally accessible
                          to all nodes in the cluster.
  --example-data PATH     Write example data and exit.
  --citations             Show citations and exit.
  --help                  Show this message and exit.
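
The consensus step described above (maxaccepts top hits, of which at least
min_consensus must agree) can be pictured with a small sketch. This is an
illustration under stated assumptions, not the plugin's actual implementation:
the function name consensus_taxonomy, the "; "-separated lineage strings, and
the "Unassigned" fallback label are all hypothetical choices made for this
example.

```python
from collections import Counter


def consensus_taxonomy(hit_lineages, min_consensus=0.51):
    """Illustrative consensus over the taxonomy strings of one query's hits.

    Returns (lineage, support): the deepest lineage prefix shared by at least
    `min_consensus` of the hits, and the fraction of hits that share it.
    """
    if not 0.5 < min_consensus <= 1.0:
        raise ValueError("min_consensus must be in the range (0.5, 1.0]")
    if not hit_lineages:
        # Assumed fallback label for queries with no hits.
        return "Unassigned", 0.0

    # Split "k__A; p__B; ..." strings into lists of rank labels.
    split = [[rank.strip() for rank in lineage.split(";")]
             for lineage in hit_lineages]
    total = len(split)
    consensus, support = [], 0.0

    for depth in range(1, max(len(ranks) for ranks in split) + 1):
        # Hits whose lineage is shorter than `depth` cannot support this rank.
        prefixes = Counter(tuple(ranks[:depth]) for ranks in split
                           if len(ranks) >= depth)
        prefix, count = prefixes.most_common(1)[0]
        fraction = count / total
        if fraction < min_consensus:
            break  # Too little agreement: truncate the lineage here.
        consensus, support = list(prefix), fraction

    if not consensus:
        return "Unassigned", 0.0
    return "; ".join(consensus), support
```

With three hits of which two agree down to class level, a min_consensus of
0.51 keeps the class-level label, while 0.7 truncates the assignment to the
phylum shared by all three hits. Because min_consensus must exceed 0.5, at
most one lineage can satisfy the threshold at any rank.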

Import:

from qiime2.plugins.feature_classifier.pipelines import classify_hybrid_vsearch_sklearn
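
A minimal sketch of calling the pipeline through the Artifact API follows. The
file names (rep-seqs.qza and so on) and the helper name classify are
hypothetical placeholders; only the pipeline import, Artifact.load, the
classification output, and save come from QIIME 2 itself. The QIIME 2 imports
are placed inside the function so the snippet can be loaded on a machine
without QIIME 2 installed.

```python
# Hypothetical input files, keyed by the pipeline's input names.
INPUTS = {
    "query": "rep-seqs.qza",
    "reference_reads": "ref-seqs.qza",
    "reference_taxonomy": "ref-taxonomy.qza",
    "classifier": "classifier.qza",
}


def classify(input_paths, output_path, **params):
    """Load the input artifacts, run the pipeline, and save the result."""
    # Imported lazily; these imports require a QIIME 2 environment.
    from qiime2 import Artifact
    from qiime2.plugins.feature_classifier.pipelines import (
        classify_hybrid_vsearch_sklearn,
    )

    artifacts = {name: Artifact.load(path)
                 for name, path in input_paths.items()}
    results = classify_hybrid_vsearch_sklearn(**artifacts, **params)
    # The single output is named `classification` (FeatureData[Taxonomy]).
    results.classification.save(output_path)
    return results.classification


if __name__ == "__main__":
    # Parameter values below are examples, not recommendations.
    classify(INPUTS, "taxonomy.qza", threads=4, prefilter=True)
```

Any optional parameter documented below can be passed as a keyword argument
in the same way.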

Docstring:

ALPHA Hybrid classifier: VSEARCH exact match + sklearn classifier

NOTE: THIS PIPELINE IS AN ALPHA RELEASE. Please report bugs to
https://forum.qiime2.org! Assign taxonomy to query sequences using a hybrid
classifier. First, performs a rough positive filter to remove artifact and
low-coverage sequences (use the "prefilter" parameter to toggle this step on
or off). Second, performs a VSEARCH exact-match search between query and
reference_reads, followed by least common ancestor consensus taxonomy
assignment from among the maxaccepts top hits, at least min_consensus of
which must share that taxonomic assignment. Query sequences without an exact
match are then classified with a pre-trained sklearn taxonomy classifier to
predict the most likely taxonomic lineage.

Parameters
----------
query : FeatureData[Sequence]
    Query Sequences.
reference_reads : FeatureData[Sequence]
    Reference sequences.
reference_taxonomy : FeatureData[Taxonomy]
    Reference taxonomy labels.
classifier : TaxonomicClassifier
    Pre-trained sklearn taxonomic classifier for classifying the reads.
maxaccepts : Int % Range(1, None) | Str % Choices('all'), optional
    Maximum number of hits to keep for each query. Set to "all" to keep all
    hits > perc_identity similarity. Note that if strand=both, maxaccepts
    will keep N hits for each direction (if searches in the opposite
    direction yield results that exceed the minimum perc_identity). In
    those cases use maxhits to control the total number of hits returned.
    This option works in tandem with maxrejects. The search process sorts
    target sequences by decreasing number of k-mers they have in common
    with the query sequence, using that information as a proxy for sequence
    similarity. After pairwise alignments, if the first target sequence
    passes the acceptance criteria, it is accepted as best hit and the
    search process stops for that query. If maxaccepts is set to a higher
    value, more hits are accepted. If maxaccepts and maxrejects are both
    set to "all", the complete database is searched.
perc_identity : Float % Range(0.0, 1.0, inclusive_end=True), optional
    Percent sequence similarity to use for PREFILTER. Reject match if
    percent identity to query is lower. Set to a lower value to perform a
    rough pre-filter. This parameter is ignored if `prefilter` is disabled.
query_cov : Float % Range(0.0, 1.0, inclusive_end=True), optional
    Query coverage threshold to use for PREFILTER. Reject match if query
    alignment coverage per high-scoring pair is lower. Set to a lower value
    to perform a rough pre-filter. This parameter is ignored if `prefilter`
    is disabled.
strand : Str % Choices('both', 'plus'), optional
    Align against reference sequences in forward ("plus") or both
    directions ("both").
min_consensus : Float % Range(0.5, 1.0, inclusive_start=False, inclusive_end=True), optional
    Minimum fraction of assignments that must match the top hit to be
    accepted as the consensus assignment.
maxhits : Int % Range(1, None) | Str % Choices('all'), optional
maxrejects : Int % Range(1, None) | Str % Choices('all'), optional
reads_per_batch : Int % Range(1, None) | Str % Choices('auto'), optional
    Number of reads to process in each batch for sklearn classification. If
    "auto", this parameter is autoscaled to min(number of query sequences /
    threads, 20000).
confidence : Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable'), optional
    Confidence threshold for limiting taxonomic depth. Set to "disable" to
    disable confidence calculation, or 0 to calculate confidence but not
    apply it to limit the taxonomic depth of the assignments.
read_orientation : Str % Choices('same', 'reverse-complement', 'auto'), optional
    Direction of reads with respect to reference sequences in pre-trained
    sklearn classifier. "same" will cause reads to be classified unchanged;
    "reverse-complement" will cause reads to be reversed and complemented
    prior to classification. "auto" will autodetect orientation based on
    the confidence estimates for the first 100 reads.
threads : Threads, optional
    Number of threads to use for job parallelization. Pass 0 to use one per
    available CPU.
prefilter : Bool, optional
    Toggle positive filter of query sequences on or off.
sample_size : Int % Range(1, None), optional
    Randomly extract the given number of sequences from the reference
    database to use for prefiltering. This parameter is ignored if
    `prefilter` is disabled.
randseed : Int % Range(0, None), optional
    Use integer as a seed for the pseudo-random generator used during
    prefiltering. A given seed always produces the same output, which is
    useful for replicability. Set to 0 to use a pseudo-random seed. This
    parameter is ignored if `prefilter` is disabled.

Returns
-------
classification : FeatureData[Taxonomy]
    Taxonomy classifications of query sequences.
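
The "auto" setting of reads_per_batch is documented above as
min(number of query sequences / threads, 20000). The arithmetic can be
sketched as follows. The function name auto_reads_per_batch is hypothetical,
and rounding the quotient up to a whole read is an assumption of this sketch;
the documentation does not state how the plugin rounds. The sketch also
expects an already resolved thread count of 1 or more, not the special value 0.

```python
import math


def auto_reads_per_batch(n_queries, threads, cap=20000):
    """Batch size for the documented 'auto' rule, rounded up (an assumption)."""
    if n_queries < 1 or threads < 1:
        raise ValueError("n_queries and threads must both be at least 1")
    # Split the reads evenly across threads, but never exceed the cap.
    return min(math.ceil(n_queries / threads), cap)


print(auto_reads_per_batch(100000, 4))  # 25000 per thread, capped at 20000
print(auto_reads_per_batch(1000, 4))    # 250
```

For small inputs the batch size follows the per-thread share of reads; for
large inputs the 20000 cap takes over.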