
evaluate-fit-classifier: Evaluate and train naive Bayes classifier on reference sequences.

Citations
  • Nicholas A. Bokulich, Benjamin D. Kaehler, Jai Ram Rideout, Matthew Dillon, Evan Bolyen, Rob Knight, Gavin A. Huttley, and J. Gregory Caporaso. Optimizing taxonomic classification of marker-gene amplicon sequences with QIIME 2's q2-feature-classifier plugin. Microbiome, 6(1):90, 2018. URL: https://doi.org/10.1186/s40168-018-0470-z, doi:10.1186/s40168-018-0470-z.

Docstring:

Usage: qiime rescript evaluate-fit-classifier [OPTIONS]

  Train a naive Bayes classifier on a set of reference sequences, then test
  classification accuracy on that same set of sequences. This results in a
  "perfect" classifier that "knows" the correct identity of each input
  sequence. Such a leaky classifier indicates the upper limit of
  classification accuracy based on sequence information alone, as
  misclassifications are an indication of unresolvable kmer profiles. This
  test simulates the case where all query sequences are present in a fully
  comprehensive reference database. To simulate more realistic conditions, see
  `evaluate_cross_validate`. THE CLASSIFIER OUTPUT BY THIS PIPELINE IS
  PRODUCTION-READY and can be re-used for classification of other sequences
  (provided the reference data are viable), hence THIS PIPELINE IS USEFUL FOR
  TRAINING FEATURE CLASSIFIERS AND THEN EVALUATING THEM ON-THE-FLY.

Inputs:
  --i-sequences ARTIFACT FeatureData[Sequence]
                          Reference sequences to use for classifier
                          training/testing.                         [required]
  --i-taxonomy ARTIFACT FeatureData[Taxonomy]
                          Reference taxonomy to use for classifier
                          training/testing.                         [required]
Parameters:
  --p-reads-per-batch VALUE Int % Range(1, None) | Str % Choices('auto')
                          Number of reads to process in each batch. If
                          "auto", this parameter is autoscaled to min(number
                          of query sequences / n-jobs, 20000).
                                                             [default: 'auto']
  --p-n-jobs NTHREADS     The maximum number of concurrent worker processes.
                          If 0 all CPUs are used. If 1 is given, no parallel
                          computing code is used at all, which is useful for
                          debugging.                              [default: 1]
  --p-confidence VALUE Float % Range(0, 1, inclusive_end=True) | Str %
    Choices('disable')    Confidence threshold for limiting taxonomic depth.
                          Set to "disable" to disable confidence calculation,
                          or 0 to calculate confidence but not apply it to
                          limit the taxonomic depth of the assignments.
                                                                [default: 0.7]
Outputs:
  --o-classifier ARTIFACT Trained naive Bayes taxonomic classifier.
    TaxonomicClassifier                                             [required]
  --o-evaluation VISUALIZATION
                          Visualization of classification accuracy results.
                                                                    [required]
  --o-observed-taxonomy ARTIFACT FeatureData[Taxonomy]
                          Observed taxonomic label for each input sequence,
                          predicted by the trained classifier.      [required]
Miscellaneous:
  --output-dir PATH       Output unspecified results to a directory
  --verbose / --quiet     Display verbose output to stdout and/or stderr
                          during execution of this action. Or silence output
                          if execution is successful (silence is golden).
  --recycle-pool TEXT     Use a cache pool for pipeline resumption. QIIME 2
                          will cache your results in this pool for reuse by
                          future invocations. These pools are retained until
                          deleted by the user. If not provided, QIIME 2 will
                          create a pool which is automatically reused by
                          invocations of the same action and removed if the
                          action is successful. Note: these pools are local to
                          the cache you are using.
  --no-recycle            Do not recycle results from a previous failed
                          pipeline run or save the results from this run for
                          future recycling.
  --parallel              Execute your action in parallel. This flag will use
                          your default parallel config.
  --parallel-config FILE  Execute your action in parallel using a config at
                          the indicated path.
  --use-cache DIRECTORY   Specify the cache to be used for the intermediate
                          work of this pipeline. If not provided, the default
                          cache under $TMP/qiime2/ will be used.
                          IMPORTANT FOR HPC USERS: If you are on an HPC system
                          and are using parallel execution it is important to
                          set this to a location that is globally accessible
                          to all nodes in the cluster.
  --example-data PATH     Write example data and exit.
  --citations             Show citations and exit.
  --help                  Show this message and exit.
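
The "auto" setting of `--p-reads-per-batch` described above can be sketched in a few lines of Python. This is only an illustration of the stated formula, not the plugin's code: the function name `autoscale_reads_per_batch` is hypothetical, rounding up to a whole number of reads is an assumption, and `n_jobs` is assumed to be already resolved to a positive worker count (the "0 means all CPUs" case is not handled here).

```python
from math import ceil


def autoscale_reads_per_batch(n_queries, n_jobs, cap=20000):
    """Approximate the documented rule: min(n_queries / n_jobs, 20000).

    Hypothetical helper. Rounding up and the floor of 1 read per batch
    are assumptions; the help text does not specify either.
    """
    if n_queries < 1 or n_jobs < 1:
        raise ValueError("n_queries and n_jobs must both be >= 1")
    per_worker = ceil(n_queries / n_jobs)  # share the queries across workers
    return max(1, min(per_worker, cap))    # never exceed the documented cap


if __name__ == "__main__":
    for n_queries, n_jobs in [(100_000, 4), (1_000, 4), (3, 8)]:
        print(n_queries, n_jobs, autoscale_reads_per_batch(n_queries, n_jobs))
```

Under this reading, large reference sets hit the 20000-read cap regardless of worker count, while small sets are divided evenly so that every worker receives a batch.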

Import:

from qiime2.plugins.rescript.pipelines import evaluate_fit_classifier

Docstring:

Evaluate and train naive Bayes classifier on reference sequences.

Train a naive Bayes classifier on a set of reference sequences, then test
classification accuracy on that same set of sequences. This results in a
"perfect" classifier that "knows" the correct identity of each input
sequence. Such a leaky classifier indicates the upper limit of
classification accuracy based on sequence information alone, as
misclassifications are an indication of unresolvable kmer profiles. This
test simulates the case where all query sequences are present in a fully
comprehensive reference database. To simulate more realistic conditions,
see `evaluate_cross_validate`. THE CLASSIFIER OUTPUT BY THIS PIPELINE IS
PRODUCTION-READY and can be re-used for classification of other sequences
(provided the reference data are viable), hence THIS PIPELINE IS USEFUL FOR
TRAINING FEATURE CLASSIFIERS AND THEN EVALUATING THEM ON-THE-FLY.
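
The idea that misclassifications reflect unresolvable kmer profiles can be illustrated with a small sketch. It is not how RESCRIPt or q2-feature-classifier computes accuracy: the helper names, the exact-match grouping of k-mer count profiles, and the value of `k` are all assumptions made for illustration. The sketch groups sequences whose k-mer counts are identical and assumes that, within each group, at most the most common label can be predicted correctly.

```python
from collections import Counter, defaultdict


def kmer_profile(seq, k):
    """Return a hashable k-mer count profile for one sequence."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return frozenset(counts.items())


def leaky_accuracy_ceiling(sequences, taxonomy, k=7):
    """Estimate the best accuracy achievable when testing on the training set.

    Sequences that share a k-mer profile are indistinguishable to a
    k-mer based classifier, so only one label per profile can be right.
    `k=7` is an illustrative value, not a documented default.
    """
    groups = defaultdict(Counter)
    for seq_id, seq in sequences.items():
        groups[kmer_profile(seq, k)][taxonomy[seq_id]] += 1
    # Within each profile group, credit only the most frequent label.
    correct = sum(max(labels.values()) for labels in groups.values())
    return correct / len(sequences)


if __name__ == "__main__":
    seqs = {"s1": "ACGTACGTAC", "s2": "ACGTACGTAC", "s3": "TTTTGGGGCC"}
    taxa = {"s1": "g__A; s__x", "s2": "g__A; s__y", "s3": "g__B; s__z"}
    print(leaky_accuracy_ceiling(seqs, taxa, k=4))
```

In the toy data, `s1` and `s2` are identical sequences with different species labels, so even a classifier trained on all three sequences can label at most two of them correctly.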

Parameters
----------
sequences : FeatureData[Sequence]
    Reference sequences to use for classifier training/testing.
taxonomy : FeatureData[Taxonomy]
    Reference taxonomy to use for classifier training/testing.
reads_per_batch : Int % Range(1, None) | Str % Choices('auto'), optional
    Number of reads to process in each batch. If "auto", this parameter is
    autoscaled to min(number of query sequences / n_jobs, 20000).
n_jobs : Threads, optional
    The maximum number of concurrent worker processes. If 0 all CPUs are
    used. If 1 is given, no parallel computing code is used at all, which
    is useful for debugging.
confidence : Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable'), optional
    Confidence threshold for limiting taxonomic depth. Set to "disable" to
    disable confidence calculation, or 0 to calculate confidence but not
    apply it to limit the taxonomic depth of the assignments.
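
The three modes of `confidence` described above (a threshold, 0, or "disable") can be sketched as follows. This is a simplified illustration, not the classifier's implementation: the function `truncate_taxonomy` is hypothetical, the per-rank confidences are assumed to be supplied from root to leaf, and returning "Unassigned" with no confidence when even the top rank fails the threshold is an assumption.

```python
def truncate_taxonomy(ranks, confidences, threshold=0.7):
    """Trim a lineage to the deepest rank meeting the confidence threshold.

    ranks:        labels ordered from root to leaf.
    confidences:  one confidence per rank, in the same order.
    threshold:    float in [0, 1], or "disable" to skip confidence entirely.
    Returns (label, confidence); confidence is None when not calculated.
    """
    if threshold == "disable":
        # No confidence is calculated; the full lineage is reported.
        return "; ".join(ranks), None
    kept, kept_conf = [], None
    for label, conf in zip(ranks, confidences):
        if conf < threshold:
            break  # stop at the first rank that is not confident enough
        kept.append(label)
        kept_conf = conf
    if not kept:
        return "Unassigned", None  # assumption: placeholder for no kept rank
    return "; ".join(kept), kept_conf


if __name__ == "__main__":
    ranks = ["k__Bacteria", "p__Firmicutes", "c__Bacilli"]
    confs = [0.99, 0.85, 0.60]
    for threshold in (0.7, 0, "disable"):
        print(threshold, truncate_taxonomy(ranks, confs, threshold))
```

With a threshold of 0 every rank passes, so the full lineage is kept while its confidence is still reported, which matches the documented difference between 0 and "disable".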

Returns
-------
classifier : TaxonomicClassifier
    Trained naive Bayes taxonomic classifier.
evaluation : Visualization
    Visualization of classification accuracy results.
observed_taxonomy : FeatureData[Taxonomy]
    Observed taxonomic label for each input sequence, predicted by the
    trained classifier.
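
A minimal sketch of calling the pipeline through the Python API, using the parameters and return values documented above. The wrapper function `train_and_evaluate`, the file paths, and the output file names are hypothetical. The QIIME 2 imports are deferred into the function body so that the module can be loaded on a machine without QIIME 2; running the function requires an environment with QIIME 2 and RESCRIPt installed.

```python
def train_and_evaluate(seqs_path, taxonomy_path, out_prefix="fit"):
    """Train and evaluate a classifier, then save the three outputs.

    seqs_path and taxonomy_path are placeholder paths to .qza artifacts of
    type FeatureData[Sequence] and FeatureData[Taxonomy] respectively.
    """
    # Deferred imports: only needed when the function is actually called.
    from qiime2 import Artifact
    from qiime2.plugins.rescript.pipelines import evaluate_fit_classifier

    sequences = Artifact.load(seqs_path)
    taxonomy = Artifact.load(taxonomy_path)

    # Values shown are the documented defaults, written out for clarity.
    results = evaluate_fit_classifier(
        sequences=sequences,
        taxonomy=taxonomy,
        reads_per_batch="auto",
        n_jobs=1,
        confidence=0.7,
    )

    # Output names follow the "Returns" section of the docstring.
    results.classifier.save(f"{out_prefix}-classifier.qza")
    results.evaluation.save(f"{out_prefix}-evaluation.qzv")
    results.observed_taxonomy.save(f"{out_prefix}-observed-taxonomy.qza")
    return results
```

A call such as `train_and_evaluate("ref-seqs.qza", "ref-taxonomy.qza")` (placeholder file names) would write the classifier, the evaluation visualization, and the observed taxonomy next to the working directory.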