denoise-ccs: Denoise and dereplicate single-end PacBio CCS

Docstring:

Usage: qiime dada2 denoise-ccs [OPTIONS]

  This method denoises single-end PacBio CCS sequences, dereplicates them, and
  filters chimeras. Tutorial and workflow:
  https://github.com/benjjneb/LRASManuscript

Inputs:
  --i-demultiplexed-seqs ARTIFACT SampleData[SequencesWithQuality]
                          The single-end demultiplexed PacBio CCS sequences
                          to be denoised.                           [required]
Parameters:
  --p-front TEXT          Sequence of an adapter ligated to the 5' end. The
                          adapter and any preceding bases are trimmed. Can
                          contain IUPAC ambiguous nucleotide codes. Note,
                          primer direction is 5' to 3'. Primers are removed
                          before trim and filter step. Reads that do not
                          contain the primer are discarded. Each read is
                          re-oriented if the reverse complement of the read is
                          a better match to the provided primer sequence. This
                          is recommended for PacBio CCS reads, which come in a
                          random mix of forward and reverse-complement
                          orientations.                             [required]
  --p-adapter TEXT        Sequence of an adapter ligated to the 3' end. The
                          adapter and any preceding bases are trimmed. Can
                          contain IUPAC ambiguous nucleotide codes. Note,
                          primer direction is 5' to 3'. Primers are removed
                          before trim and filter step. Reads that do not
                          contain the primer are discarded.         [optional]
  --p-max-mismatch INTEGER
                          The number of mismatches to tolerate when matching
                          reads to primer sequences - see
                          http://benjjneb.github.io/dada2/ for complete
                          details.                                [default: 2]
  --p-indels / --p-no-indels
                          Allow insertions or deletions of bases when
                          matching adapters. Note that primer matching is
                          significantly slower with indels allowed, currently
                          about 4x slower.                    [default: False]
  --p-trunc-len INTEGER   Position at which sequences should be truncated due
                          to decrease in quality. This truncates the 3' end of
                          the input sequences, which will be the bases that
                          were sequenced in the last cycles. Reads that are
                          shorter than this value will be discarded. If 0 is
                          provided, no truncation or length filtering will be
                          performed. Note: Since PacBio CCS sequences normally
                          have very high quality scores, there is no need to
                          truncate them.                          [default: 0]
  --p-trim-left INTEGER   Position at which sequences should be trimmed due
                          to low quality. This trims the 5' end of the input
                          sequences, which will be the bases that were
                          sequenced in the first cycles.          [default: 0]
  --p-max-ee NUMBER       Reads with number of expected errors higher than
                          this value will be discarded.         [default: 2.0]
  --p-trunc-q INTEGER     Reads are truncated at the first instance of a
                          quality score less than or equal to this value. If
                          the resulting read is then shorter than `trunc-len`,
                          it is discarded.                        [default: 2]
  --p-min-len INTEGER     Remove reads shorter than this value. The minimum
                          length is enforced after trimming and truncation.
                          For 16S PacBio CCS, suggest 1000.      [default: 20]
  --p-max-len INTEGER     Remove reads prior to trimming or truncation which
                          are longer than this value. If 0 is provided no
                          reads will be removed based on length. For 16S
                          PacBio CCS, suggest 1600.               [default: 0]
  --p-pooling-method TEXT Choices('independent', 'pseudo')
                          The method used to pool samples for denoising.
                          "independent": Samples are denoised independently.
                          "pseudo": The pseudo-pooling method is used to
                          approximate pooling of samples. In short, samples
                          are denoised independently once, ASVs detected in at
                          least 2 samples are recorded, and samples are
                          denoised independently a second time, but this time
                          with prior knowledge of the recorded ASVs and thus
                          higher sensitivity to those ASVs.
                                                      [default: 'independent']
  --p-chimera-method TEXT Choices('consensus', 'none', 'pooled')
                          The method used to remove chimeras. "none": No
                          chimera removal is performed. "pooled": All reads
                          are pooled prior to chimera detection. "consensus":
                          Chimeras are detected in samples individually, and
                          sequences found chimeric in a sufficient fraction of
                          samples are removed.          [default: 'consensus']
  --p-min-fold-parent-over-abundance NUMBER
                          The minimum abundance of potential parents of a
                          sequence being tested as chimeric, expressed as a
                          fold-change versus the abundance of the sequence
                          being tested. Values should be greater than or equal
                          to 1 (i.e. parents should be more abundant than the
                          sequence being tested). Suggest 3.5. This parameter
                          has no effect if chimera-method is "none".
                                                                [default: 3.5]
  --p-allow-one-off / --p-no-allow-one-off
                          Bimeras that are one-off from exact are also
                          identified if the `allow-one-off` argument is True.
                          If True, a sequence will be identified as bimera if
                          it is one mismatch or indel away from an exact
                          bimera.                             [default: False]
  --p-n-threads NTHREADS  The number of threads to use for multithreaded
                          processing. If 0 is provided, all available cores
                          will be used.                           [default: 1]
  --p-n-reads-learn INTEGER
                          The number of reads to use when training the error
                          model. Smaller numbers will result in a shorter run
                          time but a less reliable error model.
                                                            [default: 1000000]
  --p-hashed-feature-ids / --p-no-hashed-feature-ids
                          If true, the feature ids in the resulting table
                          will be presented as hashes of the sequences
                          defining each feature. The hash will always be the
                          same for the same sequence so this allows feature
                          tables to be merged across runs of this method. You
                          should only merge tables if the exact same
                          parameters are used for each run.    [default: True]
  --p-retain-all-samples / --p-no-retain-all-samples
                          If True, all samples input to dada2 will be
                          retained in the output of dada2. If False, samples
                          with zero total frequency are removed from the
                          table.                               [default: True]
Outputs:
  --o-table ARTIFACT FeatureTable[Frequency]
                          The resulting feature table.              [required]
  --o-representative-sequences ARTIFACT FeatureData[Sequence]
                          The resulting feature sequences. Each feature in
                          the feature table will be represented by exactly one
                          sequence.                                 [required]
  --o-denoising-stats ARTIFACT SampleData[DADA2Stats]
                                                                    [required]
Miscellaneous:
  --output-dir PATH       Output unspecified results to a directory
  --verbose / --quiet     Display verbose output to stdout and/or stderr
                          during execution of this action. Or silence output
                          if execution is successful (silence is golden).
  --example-data PATH     Write example data and exit.
  --citations             Show citations and exit.
  --use-cache DIRECTORY   Specify the cache to be used for the intermediate
                          work of this action. If not provided, the default
                          cache under $TMP/qiime2/ will be used.
                          IMPORTANT FOR HPC USERS: If you are on an HPC system
                          and are using parallel execution it is important to
                          set this to a location that is globally accessible
                          to all nodes in the cluster.
  --help                  Show this message and exit.
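
The options above can be assembled into a full invocation. The following is a minimal Python sketch, not an official recipe: the file names are hypothetical, the primer sequences are placeholders to replace with the primers from your own library preparation, and the length bounds are the 16S suggestions from the help text.

```python
import shlex

# Illustrative values only. File names are hypothetical and the primer
# sequences are placeholders; substitute the ones used for your library.
params = {
    "--i-demultiplexed-seqs": "demux.qza",
    "--p-front": "AGRGTTYGATYMTGGCTCAG",
    "--p-adapter": "RGYTACCTTGTTACGACTT",
    "--p-min-len": 1000,   # suggested for 16S PacBio CCS in the help text
    "--p-max-len": 1600,   # suggested for 16S PacBio CCS in the help text
    "--p-n-threads": 0,    # 0 means use all available cores
    "--o-table": "table.qza",
    "--o-representative-sequences": "rep-seqs.qza",
    "--o-denoising-stats": "denoising-stats.qza",
}


def build_command(options):
    """Flatten an option mapping into an argument list for subprocess.run."""
    argv = ["qiime", "dada2", "denoise-ccs"]
    for flag, value in options.items():
        argv += [flag, str(value)]
    return argv


command = build_command(params)
print(shlex.join(command))

# To execute (requires a QIIME 2 environment with the dada2 plugin):
# import subprocess
# subprocess.run(command, check=True)
```

Passing the arguments as a list avoids shell quoting problems; `shlex.join` is used only to print a copy-pasteable command line.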

Import:

from qiime2.plugins.dada2.methods import denoise_ccs
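
A call through the Python API might look like the sketch below. It assumes a QIIME 2 environment with the dada2 plugin installed; the helper name `run_denoise_ccs`, the artifact path, and the primer sequences are hypothetical placeholders, while the keyword names follow the parameters documented for this action.

```python
# Keyword arguments mirror the documented parameter names.
# The primer sequences are placeholders, not recommendations.
denoise_kwargs = {
    "front": "AGRGTTYGATYMTGGCTCAG",
    "adapter": "RGYTACCTTGTTACGACTT",
    "min_len": 1000,   # suggested for 16S PacBio CCS
    "max_len": 1600,   # suggested for 16S PacBio CCS
    "n_threads": 0,    # 0 means use all available cores
}


def run_denoise_ccs(demux_path, **kwargs):
    """Load a demultiplexed artifact and denoise it (hypothetical helper)."""
    # Imported lazily so the sketch can be parsed without QIIME 2 installed.
    from qiime2 import Artifact
    from qiime2.plugins.dada2.methods import denoise_ccs

    demux = Artifact.load(demux_path)
    results = denoise_ccs(demultiplexed_seqs=demux, **kwargs)
    return (
        results.table,
        results.representative_sequences,
        results.denoising_stats,
    )


# Example call (hypothetical path):
# table, rep_seqs, stats = run_denoise_ccs("demux.qza", **denoise_kwargs)
```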

Docstring:

Denoise and dereplicate single-end PacBio CCS

This method denoises single-end PacBio CCS sequences, dereplicates them,
and filters chimeras. Tutorial and workflow:
https://github.com/benjjneb/LRASManuscript

Parameters
----------
demultiplexed_seqs : SampleData[SequencesWithQuality]
    The single-end demultiplexed PacBio CCS sequences to be denoised.
front : Str
    Sequence of an adapter ligated to the 5' end. The adapter and any
    preceding bases are trimmed. Can contain IUPAC ambiguous nucleotide
    codes. Note, primer direction is 5' to 3'. Primers are removed before
    trim and filter step. Reads that do not contain the primer are
    discarded. Each read is re-oriented if the reverse complement of the
    read is a better match to the provided primer sequence. This is
    recommended for PacBio CCS reads, which come in a random mix of forward
    and reverse-complement orientations.
adapter : Str, optional
    Sequence of an adapter ligated to the 3' end. The adapter and any
    preceding bases are trimmed. Can contain IUPAC ambiguous nucleotide
    codes. Note, primer direction is 5' to 3'. Primers are removed before
    trim and filter step. Reads that do not contain the primer are
    discarded.
max_mismatch : Int, optional
    The number of mismatches to tolerate when matching reads to primer
    sequences - see http://benjjneb.github.io/dada2/ for complete details.
indels : Bool, optional
    Allow insertions or deletions of bases when matching adapters. Note
    that primer matching is significantly slower with indels allowed,
    currently about 4x slower.
trunc_len : Int, optional
    Position at which sequences should be truncated due to decrease in
    quality. This truncates the 3' end of the input sequences, which will
    be the bases that were sequenced in the last cycles. Reads that are
    shorter than this value will be discarded. If 0 is provided, no
    truncation or length filtering will be performed. Note: Since PacBio
    CCS sequences normally have very high quality scores, there is no need
    to truncate them.
trim_left : Int, optional
    Position at which sequences should be trimmed due to low quality. This
    trims the 5' end of the input sequences, which will be the bases that
    were sequenced in the first cycles.
max_ee : Float, optional
    Reads with number of expected errors higher than this value will be
    discarded.
trunc_q : Int, optional
    Reads are truncated at the first instance of a quality score less than
    or equal to this value. If the resulting read is then shorter than
    `trunc_len`, it is discarded.
min_len : Int, optional
    Remove reads shorter than this value. The minimum length is enforced
    after trimming and truncation. For 16S PacBio CCS, suggest 1000.
max_len : Int, optional
    Remove reads prior to trimming or truncation which are longer than this
    value. If 0 is provided no reads will be removed based on length. For
    16S PacBio CCS, suggest 1600.
pooling_method : Str % Choices('independent', 'pseudo'), optional
    The method used to pool samples for denoising. "independent": Samples
    are denoised independently. "pseudo": The pseudo-pooling method is used
    to approximate pooling of samples. In short, samples are denoised
    independently once, ASVs detected in at least 2 samples are recorded,
    and samples are denoised independently a second time, but this time
    with prior knowledge of the recorded ASVs and thus higher sensitivity
    to those ASVs.
chimera_method : Str % Choices('consensus', 'none', 'pooled'), optional
    The method used to remove chimeras. "none": No chimera removal is
    performed. "pooled": All reads are pooled prior to chimera detection.
    "consensus": Chimeras are detected in samples individually, and
    sequences found chimeric in a sufficient fraction of samples are
    removed.
min_fold_parent_over_abundance : Float, optional
    The minimum abundance of potential parents of a sequence being tested
    as chimeric, expressed as a fold-change versus the abundance of the
    sequence being tested. Values should be greater than or equal to 1
    (i.e. parents should be more abundant than the sequence being tested).
    Suggest 3.5. This parameter has no effect if chimera_method is "none".
allow_one_off : Bool, optional
    Bimeras that are one-off from exact are also identified if the
    `allow_one_off` argument is True. If True, a sequence will be
    identified as bimera if it is one mismatch or indel away from an exact
    bimera.
n_threads : Threads, optional
    The number of threads to use for multithreaded processing. If 0 is
    provided, all available cores will be used.
n_reads_learn : Int, optional
    The number of reads to use when training the error model. Smaller
    numbers will result in a shorter run time but a less reliable error
    model.
hashed_feature_ids : Bool, optional
    If true, the feature ids in the resulting table will be presented as
    hashes of the sequences defining each feature. The hash will always be
    the same for the same sequence so this allows feature tables to be
    merged across runs of this method. You should only merge tables if the
    exact same parameters are used for each run.
retain_all_samples : Bool, optional
    If True, all samples input to dada2 will be retained in the output of
    dada2. If False, samples with zero total frequency are removed from the
    table.
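
To make the read-filtering parameters concrete, here is a minimal sketch of how `max_len`, `trunc_q`, `min_len`, and `max_ee` interact on a single read. It is an illustration under stated assumptions, not the DADA2 implementation: the function names are hypothetical, `trunc_len` and `trim_left` are left at their default of 0 and omitted, and expected errors are computed with the standard Phred relation, where a quality score Q corresponds to an error probability of 10^(-Q/10).

```python
def expected_errors(quals):
    """Sum of per-base error probabilities, p = 10 ** (-Q / 10)."""
    return sum(10 ** (-q / 10) for q in quals)


def filter_read(seq, quals, max_ee=2.0, trunc_q=2, min_len=20, max_len=0):
    """Return the (possibly truncated) read, or None if it is discarded.

    Hypothetical helper; quals is a list of integer Phred scores.
    """
    # max_len is checked before any truncation; 0 disables the check.
    if max_len and len(seq) > max_len:
        return None
    # Truncate at the first base whose quality is <= trunc_q.
    for i, q in enumerate(quals):
        if q <= trunc_q:
            seq, quals = seq[:i], quals[:i]
            break
    # min_len is enforced after truncation.
    if len(seq) < min_len:
        return None
    # Discard reads whose expected error count exceeds max_ee.
    if expected_errors(quals) > max_ee:
        return None
    return seq, quals


good = filter_read("ACGT" * 10, [40] * 40)    # 40 bases at Q40
noisy = filter_read("ACGT" * 10, [10] * 40)   # 40 bases at Q10
print(good is not None, noisy is None)
```

Forty bases at Q40 give about 0.004 expected errors and pass the default `max_ee` of 2.0, while forty bases at Q10 give about 4 expected errors and are discarded.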

Returns
-------
table : FeatureTable[Frequency]
    The resulting feature table.
representative_sequences : FeatureData[Sequence]
    The resulting feature sequences. Each feature in the feature table will
    be represented by exactly one sequence.
denoising_stats : SampleData[DADA2Stats]
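
The `hashed_feature_ids` description explains why tables from separate runs can be merged: the ID depends only on the sequence. The sketch below illustrates that property. It assumes MD5 as the hash function, which the text above does not state, and it uses plain `Counter` objects as stand-ins for feature tables; `feature_id` is a hypothetical helper.

```python
import hashlib
from collections import Counter


def feature_id(sequence):
    """Hex MD5 digest of a sequence (MD5 is an assumption for illustration)."""
    return hashlib.md5(sequence.encode("utf-8")).hexdigest()


# Toy per-run counts keyed by hashed ID; stand-ins for real feature tables.
run_a = Counter({feature_id("ACGTACGT"): 10, feature_id("TTGACCA"): 3})
run_b = Counter({feature_id("ACGTACGT"): 5})

# The same sequence hashes to the same ID in both runs, so counts line up.
merged = run_a + run_b
print(merged[feature_id("ACGTACGT")])
```

Without hashed IDs, two runs could label the same sequence differently and the merge would treat them as separate features.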