Exploring the tutorial metadata#
Access the study metadata#
In this chapter we’ll begin our work with QIIME 2 and the tutorial data. We’ll start by downloading the metadata, generating a summary of it, and exploring that summary.
First, download the metadata.
from qiime2 import Metadata
from urllib import request
url = 'https://docs.qiime2.org/jupyterbooks/cancer-microbiome-intervention-tutorial/data/020-tutorial-upstream/020-metadata/sample-metadata.tsv'
fn = 'sample-metadata.tsv'
request.urlretrieve(url, fn)
sample_metadata_md = Metadata.load(fn)
library(reticulate)
Metadata <- import("qiime2")$Metadata
request <- import("urllib")$request
url <- 'https://docs.qiime2.org/jupyterbooks/cancer-microbiome-intervention-tutorial/data/020-tutorial-upstream/020-metadata/sample-metadata.tsv'
fn <- 'sample-metadata.tsv'
request$urlretrieve(url, fn)
sample_metadata_md <- Metadata$load(fn)
wget \
-O 'sample-metadata.tsv' \
'https://docs.qiime2.org/jupyterbooks/cancer-microbiome-intervention-tutorial/data/020-tutorial-upstream/020-metadata/sample-metadata.tsv'
def partial_metadata_factory():
## This function is identical to the filter.md metadata_factory function - should
## be able to call that once issue#1 is addressed.
import tempfile
import requests
import pandas as pd
import numpy as np
import qiime2
sample_metadata_url = 'https://data.qiime2.org/2024.5/tutorials/liao/sample-metadata.tsv'
data = requests.get(sample_metadata_url)
with tempfile.NamedTemporaryFile() as f:
f.write(data.content)
sample_metadata = pd.read_csv(f.name, index_col='SampleID', sep='\t')
patient_sample_counts = sample_metadata['PatientID'].value_counts()
sample_metadata['patient-sample-counts'] = \
patient_sample_counts[sample_metadata['PatientID']].values
transplant_metadata_url = 'https://data.qiime2.org/2024.5/tutorials/liao/transplant-metadata.tsv'
data = requests.get(transplant_metadata_url)
with tempfile.NamedTemporaryFile() as f:
f.write(data.content)
transplant_metadata = pd.read_csv(f.name, sep='\t')
# If a patient received multiple HCTs, keep data only on the most recent.
# This is useful for simplifying downstream workflows.
transplant_metadata = transplant_metadata.sort_values('TimepointOfTransplant')
most_recent_transplant_metadata = transplant_metadata.drop_duplicates(subset=['PatientID'], keep='last')
most_recent_transplant_metadata = most_recent_transplant_metadata.set_index('PatientID')
autoFmtControlIds = {'C%d' % i for i in range(1,12)}
autoFmtTreatmentIds = {'T%d' % i for i in range(1,15)}
new_column = {}
for ptid, aptid in most_recent_transplant_metadata['autoFmtPatientId'].items():
if aptid in autoFmtControlIds:
value = 'control'
elif aptid in autoFmtTreatmentIds:
value = 'treatment'
else:
value = np.nan
new_column[ptid] = value
most_recent_transplant_metadata['autoFmtGroup'] = pd.Series(new_column)
sample_metadata = sample_metadata.join(most_recent_transplant_metadata, on='PatientID')
sample_metadata['categorical-time-relative-to-hct'] = \
pd.cut(sample_metadata['DayRelativeToNearestHCT'],
[-1000, -1, 5, 1000],
labels=['pre', 'peri', 'post'])
sample_metadata['week-relative-to-hct'] = \
pd.cut(sample_metadata['DayRelativeToNearestHCT'],
[-1000, -14, -7, 0, 7, 14, 21, 28, 35, 42, 1000],
labels=[-3, -2, -1, 0, 1, 2, 3, 4, 5, 6])
sample_metadata = sample_metadata.astype(
{'categorical-time-relative-to-hct': object,
'week-relative-to-hct': float})
return qiime2.Metadata(sample_metadata)
def fmt_metadata_factory():
# obtain the metadata that we've been using so far
sample_metadata = partial_metadata_factory().to_dataframe()
# obtain the autoFMT-specific metadata
import tempfile
import requests
import pandas as pd
import numpy as np
import qiime2
fmt_metadata_url = 'https://data.qiime2.org/2024.5/tutorials/liao/fmt-metadata.tsv'
data = requests.get(fmt_metadata_url)
with tempfile.NamedTemporaryFile() as f:
f.write(data.content)
f.flush()
fmt_metadata = pd.read_csv(f.name, sep='\t')
fmt_metadata = fmt_metadata.set_index('PatientID')
# join the two metadata collections, dropping duplicate columns
sample_metadata = sample_metadata.join(fmt_metadata.drop(['autoFmtPatientId'], axis=1), on='PatientID')
# Create new column relating all samples to day relative to FMT treatment.
# For patients in the "control" group, this will be the day they were
# assigned to that group, which is when they would have received the FMT
# (or one to two days before that) if they had been assigned to the
# "treatment" group.
fmt_day = sample_metadata['FMTDayRelativeToNearestHCT'].fillna(
sample_metadata['RandomizationDayRelativeToNearestHCT'])
day_relative_to_fmt = sample_metadata['DayRelativeToNearestHCT'] - fmt_day
sample_metadata.insert(0, 'day-relative-to-fmt', day_relative_to_fmt, True)
sample_metadata['week-relative-to-fmt'] = \
pd.cut(sample_metadata['day-relative-to-fmt'],
[-1000, -14, -7, 0, 7, 14, 21, 28, 35, 42, 1000],
labels=[-3, -2, -1, 0, 1, 2, 3, 4, 5, 6])
sample_metadata['categorical-time-relative-to-fmt'] = \
pd.cut(sample_metadata['day-relative-to-fmt'],
[-1000, -1, 5, 1000],
labels=['pre', 'peri', 'post'])
sample_metadata = sample_metadata.astype(
{'categorical-time-relative-to-fmt': object,
'week-relative-to-fmt': float})
return qiime2.Metadata(sample_metadata)
sample_metadata = use.init_metadata('sample_metadata', fmt_metadata_factory)
- Using the
Upload Data
tool: On the first tab (Regular), press the
Paste/Fetch
data button at the bottom.Set “Name” (first text-field) to:
sample-metadata.tsv
In the larger text-area, copy-and-paste: https://docs.qiime2.org/jupyterbooks/cancer-microbiome-intervention-tutorial/data/020-tutorial-upstream/020-metadata/sample-metadata.tsv
(“Type”, “Genome”, and “Settings” can be ignored)
Press the
Start
button at the bottom.
sample-metadata.tsv
| download
View the metadata#
Next, we’ll get a view of the study metadata as QIIME 2 sees it. This
will allow you to assess whether the metadata that QIIME 2 is using is as you
expect. You can do this using the tabulate
action in QIIME 2’s q2-metadata
plugin as follows.
import qiime2.plugins.metadata.actions as metadata_actions
metadata_summ_1_viz, = metadata_actions.tabulate(
input=sample_metadata_md,
)
metadata_actions <- import("qiime2.plugins.metadata.actions")
action_results <- metadata_actions$tabulate(
input=sample_metadata_md,
)
metadata_summ_1_viz <- action_results$visualization
qiime metadata tabulate \
--m-input-file sample-metadata.tsv \
--o-visualization metadata-summ-1.qzv
use.action(
use.UsageAction(plugin_id='metadata', action_id='tabulate'),
use.UsageInputs(input=sample_metadata),
use.UsageOutputNames(visualization='metadata_summ_1')
)
- Using the
qiime2 metadata tabulate
tool: For “input”:
Perform the following steps.
Leave as
Metadata from TSV
Set “Metadata Source” to
sample-metadata.tsv
Press the
Execute
button.
- Once completed, for the new entry in your history, use the
Edit
button to set the name as follows: (Renaming is optional, but it will make any subsequent steps easier to complete.)
History Name
“Name” to set (be sure to press
Save
)#: qiime2 metadata tabulate [...] : visualization.qzv
metadata-summ-1.qzv
Spend a few minutes now exploring the Galaxy environment on your own, and exploring the metadata that we’ll use in this tutorial. If you have questions about how to use Galaxy or QIIME 2 View, this is a great time to ask those questions.