
evaluate-cross-validate: Evaluate DNA sequence reference database via cross-validated taxonomic classification.

Citations
  • Nicholas A. Bokulich, Benjamin D. Kaehler, Jai Ram Rideout, Matthew Dillon, Evan Bolyen, Rob Knight, Gavin A. Huttley, and J. Gregory Caporaso. Optimizing taxonomic classification of marker-gene amplicon sequences with QIIME 2's q2-feature-classifier plugin. Microbiome, 6(1):90, 2018. URL: https://doi.org/10.1186/s40168-018-0470-z, doi:10.1186/s40168-018-0470-z.

Docstring:

Usage: qiime rescript evaluate-cross-validate [OPTIONS]

  Evaluate DNA sequence reference database via cross-validated taxonomic
  classification. Unique taxonomic labels are truncated to enable appropriate
  label stratification. See the cited reference (Bokulich et al. 2018) for
  more details.

Inputs:
  --i-sequences ARTIFACT FeatureData[Sequence]
                          Reference sequences to use for classifier
                          training/testing.                         [required]
  --i-taxonomy ARTIFACT FeatureData[Taxonomy]
                          Reference taxonomy to use for classifier
                          training/testing.                         [required]
Parameters:
  --p-k INTEGER           Number of stratified folds.
    Range(2, None)                                                [default: 3]
  --p-random-state INTEGER
    Range(0, None)        Seed used by the random number generator.
                                                                  [default: 0]
  --p-reads-per-batch VALUE Int % Range(1, None) | Str % Choices('auto')
                          Number of reads to process in each batch. If
                          "auto", this parameter is autoscaled to min(number
                          of query sequences / n-jobs, 20000).
                                                             [default: 'auto']
  --p-n-jobs NTHREADS     The maximum number of concurrent worker processes.
                          If 0, all CPUs are used. If 1 is given, no parallel
                          computing code is used at all, which is useful for
                          debugging.                              [default: 1]
  --p-confidence VALUE Float % Range(0, 1, inclusive_end=True) | Str %
    Choices('disable')    Confidence threshold for limiting taxonomic depth.
                          Set to "disable" to disable confidence calculation,
                          or 0 to calculate confidence but not apply it to
                          limit the taxonomic depth of the assignments.
                                                                [default: 0.7]
Outputs:
  --o-expected-taxonomy ARTIFACT FeatureData[Taxonomy]
                          Expected taxonomic label for each input sequence.
                          Taxonomic labels may be truncated due to k-fold CV
                          and stratification.                       [required]
  --o-observed-taxonomy ARTIFACT FeatureData[Taxonomy]
                          Observed taxonomic label for each input sequence,
                          predicted by cross-validation.            [required]
  --o-evaluation VISUALIZATION
                          Visualization of cross-validated accuracy results.
                                                                    [required]
Miscellaneous:
  --output-dir PATH       Output unspecified results to a directory
  --verbose / --quiet     Display verbose output to stdout and/or stderr
                          during execution of this action, or silence output
                          if execution is successful (silence is golden).
  --recycle-pool TEXT     Use a cache pool for pipeline resumption. QIIME 2
                          will cache your results in this pool for reuse by
                          future invocations. These pools are retained until
                          deleted by the user. If not provided, QIIME 2 will
                          create a pool which is automatically reused by
                          invocations of the same action and removed if the
                          action is successful. Note: these pools are local to
                          the cache you are using.
  --no-recycle            Do not recycle results from a previous failed
                          pipeline run or save the results from this run for
                          future recycling.
  --parallel              Execute your action in parallel. This flag will use
                          your default parallel config.
  --parallel-config FILE  Execute your action in parallel using a config at
                          the indicated path.
  --example-data PATH     Write example data and exit.
  --citations             Show citations and exit.
  --use-cache DIRECTORY   Specify the cache to be used for the intermediate
                          work of this action. If not provided, the default
                          cache under $TMP/qiime2/ will be used.
                          IMPORTANT FOR HPC USERS: If you are on an HPC system
                          and are using parallel execution it is important to
                          set this to a location that is globally accessible
                          to all nodes in the cluster.
  --help                  Show this message and exit.
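
The options above can be combined into a single invocation. The following is a minimal sketch that assembles such a command in Python, using only flags listed in the help text. The helper name `build_command` and all file names (`ref-seqs.qza`, `ref-tax.qza`, the `cv` output prefix) are hypothetical placeholders, not part of the plugin.

```python
import shlex


def build_command(sequences, taxonomy, out_prefix, k=3, n_jobs=1):
    """Assemble an argv list for `qiime rescript evaluate-cross-validate`.

    Hypothetical helper: only the flags come from the help text above;
    the output file naming scheme is an assumption for illustration.
    """
    return [
        "qiime", "rescript", "evaluate-cross-validate",
        "--i-sequences", sequences,
        "--i-taxonomy", taxonomy,
        "--p-k", str(k),
        "--p-n-jobs", str(n_jobs),
        "--o-expected-taxonomy", f"{out_prefix}-expected-taxonomy.qza",
        "--o-observed-taxonomy", f"{out_prefix}-observed-taxonomy.qza",
        "--o-evaluation", f"{out_prefix}-evaluation.qzv",
    ]


# Placeholder inputs; substitute your own artifacts.
cmd = build_command("ref-seqs.qza", "ref-tax.qza", "cv", k=5, n_jobs=4)

# Print a shell-safe rendering of the command for inspection.
print(shlex.join(cmd))

# To actually run it (requires a QIIME 2 environment with RESCRIPt):
#   import subprocess; subprocess.run(cmd, check=True)
```

Building the command as a list rather than a single string avoids shell-quoting problems when paths contain spaces.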

Import:

from qiime2.plugins.rescript.pipelines import evaluate_cross_validate

Docstring:

Evaluate DNA sequence reference database via cross-validated taxonomic
classification.

Evaluate DNA sequence reference database via cross-validated taxonomic
classification. Unique taxonomic labels are truncated to enable appropriate
label stratification. See the cited reference (Bokulich et al. 2018) for
more details.
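
To illustrate why labels may need truncating before stratification, here is a simplified sketch. It is not RESCRIPt's implementation; it assumes that a label occurring fewer than `k` times cannot be spread across `k` stratified folds and is therefore trimmed one rank at a time until it is frequent enough. The function name, the `;` separator, and the toy taxonomy are all assumptions made for illustration, and the sketch counts exact label matches only.

```python
from collections import Counter


def truncate_for_stratification(labels, k, sep=";"):
    """Trim rare labels rank by rank until each occurs at least k times.

    labels: dict mapping sequence ID -> taxonomy string.
    Labels already at a single rank are left alone even if still rare.
    """
    current = {sid: label.split(sep) for sid, label in labels.items()}
    while True:
        counts = Counter(tuple(ranks) for ranks in current.values())
        rare = [
            sid for sid, ranks in current.items()
            if counts[tuple(ranks)] < k and len(ranks) > 1
        ]
        if not rare:
            break
        for sid in rare:
            current[sid] = current[sid][:-1]  # drop the deepest rank
    return {sid: sep.join(ranks) for sid, ranks in current.items()}


# Toy taxonomy (hypothetical data).
toy = {
    "s1": "k__Bacteria;p__Firmicutes;g__Bacillus",
    "s2": "k__Bacteria;p__Firmicutes;g__Bacillus",
    "s3": "k__Bacteria;p__Firmicutes;g__Listeria",
    "s4": "k__Bacteria;p__Proteobacteria;g__Escherichia",
}
truncated = truncate_for_stratification(toy, k=2)
for sid in sorted(truncated):
    print(sid, truncated[sid])
```

With `k=2`, `s1` and `s2` keep their full labels, while `s3` and `s4` are trimmed back to `k__Bacteria`, the first level at which their labels occur twice. This mirrors the note that expected taxonomy labels may be shorter than the labels in the input reference.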

Parameters
----------
sequences : FeatureData[Sequence]
    Reference sequences to use for classifier training/testing.
taxonomy : FeatureData[Taxonomy]
    Reference taxonomy to use for classifier training/testing.
k : Int % Range(2, None), optional
    Number of stratified folds.
random_state : Int % Range(0, None), optional
    Seed used by the random number generator.
reads_per_batch : Int % Range(1, None) | Str % Choices('auto'), optional
    Number of reads to process in each batch. If "auto", this parameter is
    autoscaled to min(number of query sequences / n_jobs, 20000).
n_jobs : Threads, optional
    The maximum number of concurrent worker processes. If 0, all CPUs are
    used. If 1 is given, no parallel computing code is used at all, which
    is useful for debugging.
confidence : Float % Range(0, 1, inclusive_end=True) | Str % Choices('disable'), optional
    Confidence threshold for limiting taxonomic depth. Set to "disable" to
    disable confidence calculation, or 0 to calculate confidence but not
    apply it to limit the taxonomic depth of the assignments.
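
The "auto" rule for `reads_per_batch` can be worked through with a short sketch. This only restates the formula from the parameter description, min(number of query sequences / n_jobs, 20000). Rounding up and the lower bound of 1 are assumptions, as is the helper name `resolve_reads_per_batch`. The plugin's own rounding may differ.

```python
import math

MAX_READS_PER_BATCH = 20000  # cap stated in the parameter description


def resolve_reads_per_batch(reads_per_batch, n_query, n_jobs):
    """Return the batch size implied by the documented 'auto' rule.

    Assumptions (not confirmed by the docstring): the quotient is rounded
    up, the result is at least 1, and n_jobs has already been resolved to
    a positive worker count (n_jobs=0 means "all CPUs" in the plugin).
    """
    if reads_per_batch != "auto":
        if reads_per_batch < 1:
            raise ValueError("reads_per_batch must be >= 1 or 'auto'")
        return int(reads_per_batch)
    if n_jobs < 1:
        raise ValueError("resolve n_jobs to a positive worker count first")
    return max(1, min(math.ceil(n_query / n_jobs), MAX_READS_PER_BATCH))


print(resolve_reads_per_batch("auto", 50_000, 4))     # 12500
print(resolve_reads_per_batch("auto", 1_000_000, 4))  # 20000 (capped)
print(resolve_reads_per_batch(500, 1_000_000, 4))     # 500 (explicit value)
```

In other words, small query sets are split roughly evenly across workers, while large ones are processed in batches of at most 20000 reads.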

Returns
-------
expected_taxonomy : FeatureData[Taxonomy]
    Expected taxonomic label for each input sequence. Taxonomic labels may
    be truncated due to k-fold CV and stratification.
observed_taxonomy : FeatureData[Taxonomy]
    Observed taxonomic label for each input sequence, predicted by
    cross-validation.
evaluation : Visualization
    Visualization of cross-validated accuracy results.
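
A minimal sketch of calling the pipeline through the Python API follows. It assumes a QIIME 2 environment with RESCRIPt installed. The wrapper name `run_cross_validation`, the file paths, and the output naming scheme are hypothetical. The parameter values passed are simply the documented CLI defaults.

```python
def run_cross_validation(seqs_path, tax_path, out_prefix, k=3, n_jobs=1):
    """Load reference artifacts, run the pipeline, and save its outputs.

    Hypothetical wrapper for illustration. QIIME 2 is imported lazily so
    that defining this function does not require QIIME 2 to be installed.
    """
    from qiime2 import Artifact
    from qiime2.plugins.rescript.pipelines import evaluate_cross_validate

    sequences = Artifact.load(seqs_path)  # FeatureData[Sequence]
    taxonomy = Artifact.load(tax_path)    # FeatureData[Taxonomy]

    results = evaluate_cross_validate(
        sequences=sequences,
        taxonomy=taxonomy,
        k=k,
        random_state=0,
        reads_per_batch="auto",
        n_jobs=n_jobs,
        confidence=0.7,
    )

    # The three outputs are named as in the Returns section.
    results.expected_taxonomy.save(f"{out_prefix}-expected-taxonomy.qza")
    results.observed_taxonomy.save(f"{out_prefix}-observed-taxonomy.qza")
    results.evaluation.save(f"{out_prefix}-evaluation.qzv")
    return results


# Example call with placeholder paths (requires QIIME 2 and RESCRIPt):
#   run_cross_validation("ref-seqs.qza", "ref-tax.qza", "cv", k=5, n_jobs=4)
```

Fixing `random_state` keeps the fold assignment reproducible between runs, which helps when comparing evaluations of different reference databases.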