
Importing data

Note

This tutorial assumes you have installed QIIME 2 using one of the procedures in the install documents.

In order to use QIIME 2, your input data must be stored in QIIME 2 artifacts (i.e. .qza files). This is what enables distributed and automatic provenance tracking, as well as semantic type validation and transformations between data formats (see the core concepts page for more details about QIIME 2 artifacts). This tutorial demonstrates how to import various data formats into QIIME 2 artifacts for use with QIIME 2.

Note

This tutorial does not describe all data formats that are currently supported in QIIME 2. It is a work-in-progress that describes some of the most commonly used data formats that are available. We are also actively working on supporting additional data formats. If you need to import data in a format that is not covered here, please post to the QIIME 2 Forum for help.

Importing will typically happen with your initial data (e.g. sequences obtained from a sequencing facility), but importing can be performed at any step in your analysis pipeline. For example, if a collaborator provides you with a feature table in .biom format, you can import it into a QIIME 2 artifact to perform “downstream” statistical analyses that operate on a feature table.

Importing can be accomplished using any of the QIIME 2 interfaces. This tutorial will focus on using the QIIME 2 command-line interface (q2cli) to import data with the qiime tools import method. Each section below briefly describes a data format, provides commands to download example data, and illustrates how to import the data into a QIIME 2 artifact.

You may want to begin by creating a directory to work in.

mkdir qiime2-importing-tutorial
cd qiime2-importing-tutorial

Sequence data with sequence quality information (i.e. FASTQ)

QIIME 2 can import several kinds of FASTQ data:

  1. FASTQ data with the EMP Protocol format
  2. FASTQ data with the Casava 1.8 demultiplexed format
  3. Any other kind of FASTQ data

“EMP protocol” multiplexed single-end fastq

Format description

Single-end “Earth Microbiome Project (EMP) protocol” formatted reads should have two fastq.gz files total:

  1. one fastq.gz file that contains the single-end reads,
  2. and another that contains the associated barcode reads

In this format, sequence data is still multiplexed (i.e. you have only one fastq.gz file containing raw data for all of your samples).

The order of the records in the two fastq.gz files defines the association between a sequence read and its barcode read (i.e. the first barcode read corresponds to the first sequence read, the second barcode to the second read, and so on).
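Because the pairing relies purely on record order, it can be worth confirming that both files hold the same number of records before importing. The following is a minimal sketch and is not part of QIIME 2: the function names count_fastq_records and check_emp_single_end are hypothetical, and the code assumes each FASTQ record spans exactly four lines.

```shell
# Count the FASTQ records in a gzip-compressed file.
# Assumes four lines per record (no wrapped sequence lines).
count_fastq_records() {
  lines=$(gzip -dc "$1" | wc -l)
  echo $((lines / 4))
}

# Compare the record counts of the two files in an EMP single-end directory.
check_emp_single_end() {
  seqs=$(count_fastq_records "$1/sequences.fastq.gz")
  barcodes=$(count_fastq_records "$1/barcodes.fastq.gz")
  if [ "$seqs" -eq "$barcodes" ]; then
    echo "ok: $seqs records in each file"
  else
    echo "mismatch: $seqs sequence reads, $barcodes barcode reads"
    return 1
  fi
}
```

For example, check_emp_single_end emp-single-end-sequences would check the example data downloaded below. Equal counts do not prove that the records are in the same order, only that no reads were dropped from one file.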

Obtaining example data
mkdir emp-single-end-sequences
Please select a download option that is most appropriate for your environment:
wget \
  -O "emp-single-end-sequences/barcodes.fastq.gz" \
  "https://data.qiime2.org/2018.11/tutorials/moving-pictures/emp-single-end-sequences/barcodes.fastq.gz"
curl -sL \
  "https://data.qiime2.org/2018.11/tutorials/moving-pictures/emp-single-end-sequences/barcodes.fastq.gz" > \
  "emp-single-end-sequences/barcodes.fastq.gz"
Please select a download option that is most appropriate for your environment:
wget \
  -O "emp-single-end-sequences/sequences.fastq.gz" \
  "https://data.qiime2.org/2018.11/tutorials/moving-pictures/emp-single-end-sequences/sequences.fastq.gz"
curl -sL \
  "https://data.qiime2.org/2018.11/tutorials/moving-pictures/emp-single-end-sequences/sequences.fastq.gz" > \
  "emp-single-end-sequences/sequences.fastq.gz"

Importing data

qiime tools import \
  --type EMPSingleEndSequences \
  --input-path emp-single-end-sequences \
  --output-path emp-single-end-sequences.qza

Output artifacts: emp-single-end-sequences.qza

“EMP protocol” multiplexed paired-end fastq

Format description

Paired-end “Earth Microbiome Project (EMP) protocol” formatted reads should have three fastq.gz files total:

  1. one fastq.gz file that contains the forward sequence reads,
  2. one fastq.gz file that contains the reverse sequence reads,
  3. and a third that contains the associated barcode reads

In this format, sequence data is still multiplexed (i.e. you have only one forward and one reverse fastq.gz file containing raw data for all of your samples).

The order of the records in the fastq.gz files defines the association between a sequence read and its barcode read (i.e. the first barcode read corresponds to the first sequence read, the second barcode to the second read, and so on).

Obtaining example data
mkdir emp-paired-end-sequences
Please select a download option that is most appropriate for your environment:

Download URL: https://data.qiime2.org/2018.11/tutorials/atacama-soils/1p/forward.fastq.gz

Save as: emp-paired-end-sequences/forward.fastq.gz

wget \
  -O "emp-paired-end-sequences/forward.fastq.gz" \
  "https://data.qiime2.org/2018.11/tutorials/atacama-soils/1p/forward.fastq.gz"
curl -sL \
  "https://data.qiime2.org/2018.11/tutorials/atacama-soils/1p/forward.fastq.gz" > \
  "emp-paired-end-sequences/forward.fastq.gz"
Please select a download option that is most appropriate for your environment:

Download URL: https://data.qiime2.org/2018.11/tutorials/atacama-soils/1p/reverse.fastq.gz

Save as: emp-paired-end-sequences/reverse.fastq.gz

wget \
  -O "emp-paired-end-sequences/reverse.fastq.gz" \
  "https://data.qiime2.org/2018.11/tutorials/atacama-soils/1p/reverse.fastq.gz"
curl -sL \
  "https://data.qiime2.org/2018.11/tutorials/atacama-soils/1p/reverse.fastq.gz" > \
  "emp-paired-end-sequences/reverse.fastq.gz"
Please select a download option that is most appropriate for your environment:

Download URL: https://data.qiime2.org/2018.11/tutorials/atacama-soils/1p/barcodes.fastq.gz

Save as: emp-paired-end-sequences/barcodes.fastq.gz

wget \
  -O "emp-paired-end-sequences/barcodes.fastq.gz" \
  "https://data.qiime2.org/2018.11/tutorials/atacama-soils/1p/barcodes.fastq.gz"
curl -sL \
  "https://data.qiime2.org/2018.11/tutorials/atacama-soils/1p/barcodes.fastq.gz" > \
  "emp-paired-end-sequences/barcodes.fastq.gz"

Importing data

qiime tools import \
  --type EMPPairedEndSequences \
  --input-path emp-paired-end-sequences \
  --output-path emp-paired-end-sequences.qza

Output artifacts: emp-paired-end-sequences.qza

Casava 1.8 single-end demultiplexed fastq

Format description

In the Casava 1.8 demultiplexed (single-end) format, there is one fastq.gz file for each sample in the study which contains the single-end reads for that sample. The file name includes the sample identifier and should look like L2S357_15_L001_R1_001.fastq.gz. The underscore-separated fields in this file name are:

  1. the sample identifier,
  2. the barcode sequence or a barcode identifier,
  3. the lane number,
  4. the direction of the read (i.e. only R1, because these are single-end reads), and
  5. the set number.
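
The five fields listed above can be pulled apart with a few lines of shell. This is only an illustrative sketch: parse_casava_name is a hypothetical helper, and it assumes the sample identifier itself contains no underscores.

```shell
# Split a Casava 1.8 file name into its five underscore-separated fields.
# Assumes the sample identifier contains no underscores of its own.
parse_casava_name() {
  basename "$1" .fastq.gz | {
    IFS=_ read -r sample barcode lane direction set_number
    echo "sample=$sample barcode=$barcode lane=$lane read=$direction set=$set_number"
  }
}
```

Calling parse_casava_name L2S357_15_L001_R1_001.fastq.gz prints the sample identifier L2S357, barcode identifier 15, lane L001, read direction R1, and set number 001.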
Obtaining example data
Please select a download option that is most appropriate for your environment:

Download URL: https://data.qiime2.org/2018.11/tutorials/importing/casava-18-single-end-demultiplexed.zip

Save as: casava-18-single-end-demultiplexed.zip

wget \
  -O "casava-18-single-end-demultiplexed.zip" \
  "https://data.qiime2.org/2018.11/tutorials/importing/casava-18-single-end-demultiplexed.zip"
curl -sL \
  "https://data.qiime2.org/2018.11/tutorials/importing/casava-18-single-end-demultiplexed.zip" > \
  "casava-18-single-end-demultiplexed.zip"
unzip -q casava-18-single-end-demultiplexed.zip

Importing data

qiime tools import \
  --type 'SampleData[SequencesWithQuality]' \
  --input-path casava-18-single-end-demultiplexed \
  --input-format CasavaOneEightSingleLanePerSampleDirFmt \
  --output-path demux-single-end.qza

Output artifacts: demux-single-end.qza

Casava 1.8 paired-end demultiplexed fastq

Format description

In Casava 1.8 demultiplexed (paired-end) format, there are two fastq.gz files for each sample in the study, each containing the forward or reverse reads for that sample. The file name includes the sample identifier. The forward and reverse read file names for a single sample might look like L2S357_15_L001_R1_001.fastq.gz and L2S357_15_L001_R2_001.fastq.gz, respectively. The underscore-separated fields in this file name are:

  1. the sample identifier,
  2. the barcode sequence or a barcode identifier,
  3. the lane number,
  4. the direction of the read (i.e. R1 or R2), and
  5. the set number.
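
Since each sample needs both an R1 and an R2 file, a quick check for missing partners before importing may save a failed run. The sketch below is an assumption-laden helper, not a QIIME 2 feature: find_unpaired_forward_reads is a hypothetical name, and it assumes the two files of a pair differ only in the R1/R2 field.

```shell
# List forward-read files in a directory that lack a matching reverse-read file.
# Assumes Casava 1.8 names, where only the R1/R2 field differs within a pair.
find_unpaired_forward_reads() {
  for fwd in "$1"/*_R1_*.fastq.gz; do
    [ -e "$fwd" ] || continue            # no forward-read files at all
    name=$(basename "$fwd")
    prefix=${name%_R1_*}                 # everything before the read field
    suffix=${name##*_R1_}                # everything after the read field
    [ -e "$1/${prefix}_R2_${suffix}" ] || echo "$name"
  done
}
```

The function prints nothing when every forward-read file has a partner.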
Obtaining example data
Please select a download option that is most appropriate for your environment:

Download URL: https://data.qiime2.org/2018.11/tutorials/importing/casava-18-paired-end-demultiplexed.zip

Save as: casava-18-paired-end-demultiplexed.zip

wget \
  -O "casava-18-paired-end-demultiplexed.zip" \
  "https://data.qiime2.org/2018.11/tutorials/importing/casava-18-paired-end-demultiplexed.zip"
curl -sL \
  "https://data.qiime2.org/2018.11/tutorials/importing/casava-18-paired-end-demultiplexed.zip" > \
  "casava-18-paired-end-demultiplexed.zip"
unzip -q casava-18-paired-end-demultiplexed.zip

Importing data

qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path casava-18-paired-end-demultiplexed \
  --input-format CasavaOneEightSingleLanePerSampleDirFmt \
  --output-path demux-paired-end.qza

Output artifacts: demux-paired-end.qza

“Fastq manifest” formats

If your data are in neither the EMP nor the Casava format, you need to import them into QIIME 2 manually by first creating a “manifest file” and then running the qiime tools import command with different options than those used in the EMP or Casava import commands.

Format description

First, you’ll create a text file called a “manifest file”, which maps sample identifiers to the absolute filepaths of the fastq.gz or fastq files that contain the sequence and quality data for each sample. The manifest file also indicates the direction of the reads in each fastq.gz or fastq file. You will generally create the manifest file yourself. It is designed to be a simple format that places no restrictions on the naming of the demultiplexed fastq.gz / fastq files, since there is no broadly used naming convention for these files. You can call the manifest file whatever you want.

The manifest file is a comma-separated (i.e., .csv) text file. The first field on each line is the sample identifier that QIIME 2 should use, the second field is the absolute filepath, and the third field is the read direction. Lines beginning with # and blank lines are ignored. The first line in the file that does not begin with a # and is not blank must be the header line: sample-id,absolute-filepath,direction. With the exception of the header line, the order of lines in this file is not important.

For single-end reads, there must be exactly one line per sample ID, corresponding to the forward reads (or the reverse reads, if you’re processing these separately as single-end reads). For paired-end reads, there must be exactly two lines per sample ID, corresponding to the forward and the reverse reads. The direction field on each line can only contain the text forward or reverse.

The fastq.gz absolute filepaths may contain environment variables (e.g., $HOME or $PWD). The following example illustrates a simple fastq manifest file for paired-end read data for two samples.

sample-id,absolute-filepath,direction
# Lines starting with '#' are ignored and can be used to create
# "comments" or even "comment out" entries
sample-1,$PWD/some/filepath/sample1_R1.fastq.gz,forward
sample-2,$PWD/some/filepath/sample2_R1.fastq.gz,forward
sample-1,$PWD/some/filepath/sample1_R2.fastq.gz,reverse
sample-2,$PWD/some/filepath/sample2_R2.fastq.gz,reverse
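
Writing a manifest by hand becomes tedious for many samples, so it can be generated instead. The following sketch is one possible approach under stated assumptions: make_paired_manifest is a hypothetical helper, and the _R1.fastq.gz / _R2.fastq.gz suffixes are an assumed naming convention that you would adapt to your own files.

```shell
# Print a paired-end fastq manifest for every *_R1.fastq.gz / *_R2.fastq.gz
# pair in a directory. The _R1/_R2 suffixes are an assumed naming convention.
make_paired_manifest() {
  dir=$(cd "$1" && pwd)                  # resolve to an absolute path
  echo "sample-id,absolute-filepath,direction"
  for fwd in "$dir"/*_R1.fastq.gz; do
    [ -e "$fwd" ] || continue
    sample=$(basename "$fwd" _R1.fastq.gz)
    rev="$dir/${sample}_R2.fastq.gz"
    [ -e "$rev" ] || continue            # skip samples without reverse reads
    echo "$sample,$fwd,forward"
    echo "$sample,$rev,reverse"
  done
}
```

Redirecting the output to a file, for example make_paired_manifest some/filepath > manifest.csv, produces a manifest whose sample identifiers are the file names with the suffix removed.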

Just like with fastq.gz, the absolute filepaths in the manifest for any fastq files must be accurate. The following example illustrates a simple fastq manifest file for fastq single-end read data for one sample.

sample-id,absolute-filepath,direction
sample-1,$PWD/some/filepath/sample1_R1.fastq,forward

There are four variants of the fastq manifest format. You must tell QIIME 2 which one you are using when importing; each is defined in the following sections.

SingleEndFastqManifestPhred33

In this variant of the fastq manifest format, the read directions must either all be forward or all be reverse. This format assumes that the PHRED offset used for the positional quality scores in all of the fastq.gz / fastq files is 33.

SingleEndFastqManifestPhred64

In this variant of the fastq manifest format, the read directions must either all be forward or all be reverse. This format assumes that the PHRED offset used for the positional quality scores in all of the fastq.gz / fastq files is 64. During import, QIIME 2 will convert the PHRED 64 encoded quality scores to PHRED 33 encoded quality scores. This conversion will be slow, but will only happen one time.

PairedEndFastqManifestPhred33

In this variant of the fastq manifest format, there must be forward and reverse read fastq.gz / fastq files for each sample id. As a result, each sample id is represented twice in the manifest file: once for its forward reads, and once for its reverse reads. This format assumes that the PHRED offset used for the positional quality scores in all of the fastq.gz / fastq files is 33.

PairedEndFastqManifestPhred64

In this variant of the fastq manifest format, there must be forward and reverse read fastq.gz / fastq files for each sample id. As a result, each sample id is represented twice in the manifest file: once for its forward reads, and once for its reverse reads. This format assumes that the PHRED offset used for the positional quality scores in all of the fastq.gz / fastq files is 64. During import, QIIME 2 will convert the PHRED 64 encoded quality scores to PHRED 33 encoded quality scores. This conversion will be slow, but will only happen one time.
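
If you are unsure which offset your files use, the lowest quality character in the data gives a hint. The sketch below is a common heuristic and not a QIIME 2 feature: guess_phred_offset is a hypothetical helper, and it relies on the fact that characters below ASCII 59 cannot occur in offset-64 encodings. A file containing only high-quality offset-33 scores could still be reported as 64, so treat the answer as a guess.

```shell
# Guess the PHRED offset from the lowest quality character among the first
# 1000 records of uncompressed FASTQ data on standard input (a heuristic).
guess_phred_offset() {
  head -n 4000 | awk '
    BEGIN {
      # Map each printable ASCII character to its numeric code.
      for (i = 33; i < 127; i++) ord[sprintf("%c", i)] = i
      min = 127
    }
    NR % 4 == 0 {   # every fourth line holds the quality string
      n = length($0)
      for (j = 1; j <= n; j++) {
        ch = substr($0, j, 1)
        if ((ch in ord) && ord[ch] < min) min = ord[ch]
      }
    }
    END {
      if (min == 127) print "unknown"
      else if (min < 59) print 33
      else print 64
    }'
}
```

For a compressed file, decompress into the function, for example gzip -dc some/filepath/sample1_R1.fastq.gz | guess_phred_offset.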

Obtaining example data

Since importing data in these four formats is very similar, we’ll only provide examples for two of the variants: SingleEndFastqManifestPhred33 and PairedEndFastqManifestPhred64.

Please select a download option that is most appropriate for your environment:
wget \
  -O "se-33.zip" \
  "https://data.qiime2.org/2018.11/tutorials/importing/se-33.zip"
curl -sL \
  "https://data.qiime2.org/2018.11/tutorials/importing/se-33.zip" > \
  "se-33.zip"
Please select a download option that is most appropriate for your environment:
wget \
  -O "se-33-manifest" \
  "https://data.qiime2.org/2018.11/tutorials/importing/se-33-manifest"
curl -sL \
  "https://data.qiime2.org/2018.11/tutorials/importing/se-33-manifest" > \
  "se-33-manifest"
Please select a download option that is most appropriate for your environment:
wget \
  -O "pe-64.zip" \
  "https://data.qiime2.org/2018.11/tutorials/importing/pe-64.zip"
curl -sL \
  "https://data.qiime2.org/2018.11/tutorials/importing/pe-64.zip" > \
  "pe-64.zip"
Please select a download option that is most appropriate for your environment:
wget \
  -O "pe-64-manifest" \
  "https://data.qiime2.org/2018.11/tutorials/importing/pe-64-manifest"
curl -sL \
  "https://data.qiime2.org/2018.11/tutorials/importing/pe-64-manifest" > \
  "pe-64-manifest"
unzip -q se-33.zip
unzip -q pe-64.zip

Importing data

qiime tools import \
  --type 'SampleData[SequencesWithQuality]' \
  --input-path se-33-manifest \
  --output-path single-end-demux.qza \
  --input-format SingleEndFastqManifestPhred33

Output artifacts: single-end-demux.qza

qiime tools import \
  --type 'SampleData[PairedEndSequencesWithQuality]' \
  --input-path pe-64-manifest \
  --output-path paired-end-demux.qza \
  --input-format PairedEndFastqManifestPhred64

Output artifacts: paired-end-demux.qza

Sequences without quality information (i.e. FASTA)

QIIME 2 currently supports importing the QIIME 1 seqs.fna file format, which consists of a single FASTA file with exactly two lines per record: header and sequence. Each sequence must span exactly one line and cannot be split across multiple lines. The ID in each header must follow the format <sample-id>_<seq-id>. <sample-id> is the identifier of the sample the sequence belongs to, and <seq-id> is an identifier for the sequence within its sample.
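
The layout rules above can be checked before importing. This is a rough sketch rather than the validation QIIME 2 itself performs: check_seqs_fna is a hypothetical helper that only tests that lines alternate between header and sequence and that the first word of each header contains an underscore with text on both sides.

```shell
# Report the first line that breaks the two-lines-per-record layout or the
# <sample-id>_<seq-id> header pattern in a QIIME 1 style seqs.fna file.
check_seqs_fna() {
  awk '
    NR % 2 == 1 && $1 !~ /^>.+_.+$/ {
      print "line " NR ": expected a header such as >sample_1"; bad = 1; exit
    }
    NR % 2 == 0 && /^>/ {
      print "line " NR ": expected a sequence, found a header"; bad = 1; exit
    }
    END {
      if (bad) exit 1
      if (NR % 2 == 1) { print "line " NR ": header without a sequence"; exit 1 }
      print "ok"
    }
  ' "$1"
}
```

A sequence wrapped over two lines is reported at its second line, because that line sits where a header is expected.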

An example of importing and dereplicating this kind of data can be found in the OTU Clustering tutorial.

Other FASTA formats, such as FASTA files with differently formatted sequence headers or per-sample demultiplexed FASTA files (i.e. one FASTA file per sample), are not currently supported.

Per-feature unaligned sequence data (i.e., representative FASTA sequences)

Format description

Unaligned sequence data is imported from a FASTA formatted file containing DNA sequences that are not aligned (i.e., do not contain - or . characters). The sequences may contain degenerate nucleotide characters, such as N, but some QIIME 2 actions may not support these characters. See the scikit-bio FASTA format description for more information about the FASTA format.

Obtaining example data

Please select a download option that is most appropriate for your environment:
wget \
  -O "sequences.fna" \
  "https://data.qiime2.org/2018.11/tutorials/importing/sequences.fna"
curl -sL \
  "https://data.qiime2.org/2018.11/tutorials/importing/sequences.fna" > \
  "sequences.fna"

Importing data

qiime tools import \
  --input-path sequences.fna \
  --output-path sequences.qza \
  --type 'FeatureData[Sequence]'

Output artifacts: sequences.qza

Per-feature aligned sequence data (i.e., aligned representative FASTA sequences)

Format description

Aligned sequence data is imported from a FASTA formatted file containing DNA sequences that are aligned to one another. All aligned sequences must be exactly the same length. The sequences may contain degenerate nucleotide characters, such as N, but some QIIME 2 actions may not support these characters. See the scikit-bio FASTA format description for more information about the FASTA format.
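
The equal-length requirement is easy to verify up front. The sketch below uses a hypothetical helper, alignment_lengths, which prints each distinct sequence length it finds; a valid alignment should produce a single number. It also copes with sequences wrapped over several lines.

```shell
# Print the distinct sequence lengths found in a FASTA file.
# An alignment should yield exactly one length.
alignment_lengths() {
  awk '
    /^>/ { if (seen) print len; seen = 1; len = 0; next }
    { len += length($0) }
    END { if (seen) print len }
  ' "$1" | sort -n | uniq
}
```

Running alignment_lengths aligned-sequences.fna on a well-formed alignment prints one line; more than one line means at least one sequence has a different length.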

Obtaining example data

Please select a download option that is most appropriate for your environment:
wget \
  -O "aligned-sequences.fna" \
  "https://data.qiime2.org/2018.11/tutorials/importing/aligned-sequences.fna"
curl -sL \
  "https://data.qiime2.org/2018.11/tutorials/importing/aligned-sequences.fna" > \
  "aligned-sequences.fna"

Importing data

qiime tools import \
  --input-path aligned-sequences.fna \
  --output-path aligned-sequences.qza \
  --type 'FeatureData[AlignedSequence]'

Output artifacts: aligned-sequences.qza

Feature table data

You can also import pre-processed feature tables into QIIME 2.

BIOM v1.0.0

Format description

See the BIOM v1.0.0 format specification for details.

Obtaining example data
Please select a download option that is most appropriate for your environment:
wget \
  -O "feature-table-v100.biom" \
  "https://data.qiime2.org/2018.11/tutorials/importing/feature-table-v100.biom"
curl -sL \
  "https://data.qiime2.org/2018.11/tutorials/importing/feature-table-v100.biom" > \
  "feature-table-v100.biom"

Importing data

qiime tools import \
  --input-path feature-table-v100.biom \
  --type 'FeatureTable[Frequency]' \
  --input-format BIOMV100Format \
  --output-path feature-table-1.qza

Output artifacts: feature-table-1.qza

BIOM v2.1.0

Format description

See the BIOM v2.1.0 format specification for details.
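
If a collaborator hands you a .biom file without saying which version it is, the first bytes usually tell the two apart: BIOM v1.0.0 is JSON text, while BIOM v2.1.0 is stored as HDF5, whose files normally begin with a signature containing the letters HDF. The following sketch is a heuristic only, and guess_biom_format is a hypothetical helper name.

```shell
# Suggest an --input-format value from the first bytes of a .biom file.
# JSON text suggests BIOM v1.0.0; an HDF5 signature suggests BIOM v2.1.0.
guess_biom_format() {
  if [ "$(head -c 4 "$1" | tail -c 3)" = "HDF" ]; then
    echo "BIOMV210Format"
  elif [ "$(tr -d ' \t\r\n' < "$1" | head -c 1)" = "{" ]; then
    echo "BIOMV100Format"
  else
    echo "unknown"
  fi
}
```

The printed name can be passed to qiime tools import as the --input-format value.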

Obtaining example data
Please select a download option that is most appropriate for your environment:
wget \
  -O "feature-table-v210.biom" \
  "https://data.qiime2.org/2018.11/tutorials/importing/feature-table-v210.biom"
curl -sL \
  "https://data.qiime2.org/2018.11/tutorials/importing/feature-table-v210.biom" > \
  "feature-table-v210.biom"

Importing data

qiime tools import \
  --input-path feature-table-v210.biom \
  --type 'FeatureTable[Frequency]' \
  --input-format BIOMV210Format \
  --output-path feature-table-2.qza

Output artifacts: feature-table-2.qza

Phylogenetic trees

Format description

Phylogenetic trees are imported from Newick-formatted files. See the scikit-bio Newick format description for more information about the Newick format.

Obtaining example data

Please select a download option that is most appropriate for your environment:
wget \
  -O "unrooted-tree.tre" \
  "https://data.qiime2.org/2018.11/tutorials/importing/unrooted-tree.tre"
curl -sL \
  "https://data.qiime2.org/2018.11/tutorials/importing/unrooted-tree.tre" > \
  "unrooted-tree.tre"

Importing data

qiime tools import \
  --input-path unrooted-tree.tre \
  --output-path unrooted-tree.qza \
  --type 'Phylogeny[Unrooted]'

Output artifacts: unrooted-tree.qza

If you have a rooted tree, you can use --type 'Phylogeny[Rooted]' instead.

Other data types

QIIME 2 can import many other data types not covered in this tutorial. You can see which formats of input data are importable with the following command:

qiime tools import \
  --show-importable-formats

And which QIIME 2 types you can import these formats as:

qiime tools import \
  --show-importable-types

Unfortunately, there isn’t currently documentation detailing which data formats can be imported as which QIIME 2 data types, but the names of these formats and types should hopefully be self-explanatory enough to work this out. If you have any questions, please post to the QIIME 2 Forum for help!