Human sequence read filtering from fastq files#

This notebook applies a human read filter to a directory of fastq files and outputs new fastq files that exclude any sequences that had matches to the human genome. This was developed to prepare the data that is imported in the Upstream tutorial and is not extensively tested. It should therefore be treated only as an example - not production-ready software. The development of this workflow was informed by Qiita’s host filtering workflow, and the bowtie2, samtools, and bedtools documentation.

The data that was used as input to this workflow is not publicly available. Of the approximately 1,850,000 paired-end sequences that were provided as input, about 20 of them had hits to the human genome. About half of these were to mitochondrial sequences, which is expected since the mitochondrial SSU rRNA is a 16S (not 18S) sequence. The others were hits to chromosomal DNA.

The following is run in an environment built as follows:

conda create -n samtools -c conda-forge -c bioconda samtools bedtools bowtie2
import glob
import os.path
import os
import tempfile
import collections

input_dir = 'fastq'
output_dir = 'filtered-fastq'
log_dir = 'logs'

fps = glob.glob(os.path.join(input_dir, '*fastq.gz'))
def sort_paired_end_read_files(files):
    if '_R1_' in files[0]:
        return files[0], files[1]
    elif '_R2_' in files[0]:
        return files[1], files[0]
    else:
        raise ValueError("Can't sort files:\n %s\n %s" % (files[0], files[1]))

sample_id_to_filenames = collections.defaultdict(list)

for fp in fps:
    fn = os.path.split(fp)[1]
    sample_id = fn.split('_')[0]
    sample_id_to_filenames[sample_id].append(fn) 

for sample_id, filenames in sample_id_to_filenames.items():
    n_files = len(filenames)
    if n_files != 2:
        raise ValueError('Incorrect number of files available (n=%d) for sample %s.' % (n_files, sample_id))

for sample_id, filenames in sample_id_to_filenames.items():
    forward_read_fn, reverse_read_fn = sort_paired_end_read_files(filenames)
    forward_read_fp = os.path.join(input_dir, forward_read_fn)
    reverse_read_fp = os.path.join(input_dir, reverse_read_fn)
    forward_read_output_fp = os.path.join(output_dir, forward_read_fn).replace('.gz', '')
    reverse_read_output_fp = os.path.join(output_dir, reverse_read_fn).replace('.gz', '')
    sam_f_name = os.path.join(log_dir, '%s.sam' % sample_id)
    bam_f_name = os.path.join(log_dir, '%s.bam' % sample_id)
    
    !bowtie2 -p 1 -x hg19/hg19 --very-sensitive -1 $forward_read_fp -2 $reverse_read_fp -S $sam_f_name 2> /dev/null
    !samtools view -f 4 -F 256 -o $bam_f_name -b $sam_f_name 2> /dev/null
    !bedtools bamtofastq -i $bam_f_name -fq $forward_read_output_fp -fq2 $reverse_read_output_fp
    !gzip -f $forward_read_output_fp
    !gzip -f $reverse_read_output_fp