Access and explore figshare sample and patient metadata


Author: @ebolyen

import pandas as pd
import qiime2

Create Sample Metadata

import tempfile
import requests

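# Download the per-sample table as raw bytes.
data = requests.get("https://www.dropbox.com/s/aojvmbuxp5jst1q/tblASVsamples.csv?dl=1")
data.raise_for_status()  # stop on an HTTP error instead of parsing an error page

# Write the bytes to a temporary file so pandas can read it by name.
with tempfile.NamedTemporaryFile() as f:
    f.write(data.content)
    f.flush()
    pd_metadata_samples = pd.read_csv(f.name, index_col='SampleID')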

pd_metadata_samples
SampleID   PatientID  Timepoint  Consistency  Accession    BioProject   DayRelativeToNearestHCT  AccessionShotgun
1000A      1000       0          formed       SRR11414397  PRJNA545312  -9.0                     NaN
1000B      1000       5          liquid       SRR11414992  PRJNA545312  -4.0                     NaN
1000C      1000       15         liquid       SRR11414991  PRJNA545312  6.0                      NaN
1000D      1000       18         semi-formed  SRR11414990  PRJNA545312  9.0                      NaN
1000E      1000       22         formed       SRR11414989  PRJNA545312  13.0                     NaN
...        ...        ...        ...          ...          ...          ...                      ...
FMT.0251G  FMT.0251   8          semi-formed  SRR9270380   PRJNA548153  7.0                      NaN
FMT.0251H  FMT.0251   7          semi-formed  SRR11396690  PRJNA607574  6.0                      NaN
FMT.0251I  FMT.0251   12         semi-formed  SRR9270379   PRJNA548153  11.0                     NaN
FMT.0251J  FMT.0251   13         semi-formed  SRR9270382   PRJNA548153  12.0                     NaN
FMT.0251L  FMT.0251   15         semi-formed  SRR9270381   PRJNA548153  14.0                     NaN

12546 rows × 7 columns
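
The same download-and-parse steps recur for every table in this notebook, so they could be wrapped in a small helper. The sketch below is one way to do that and is not part of the original notebook: `frame_from_csv_bytes` and `fetch_table` are hypothetical names, and an in-memory `io.BytesIO` buffer stands in for the temporary file. The demonstration at the bottom uses a few columns from the first two rows shown above so that it runs without network access.

```python
import io

import pandas as pd
import requests


def frame_from_csv_bytes(content, index_col, **read_csv_kwargs):
    """Parse raw CSV bytes into a DataFrame indexed by ``index_col``."""
    # BytesIO gives pandas a file-like object, so no temporary file is needed.
    return pd.read_csv(io.BytesIO(content), index_col=index_col, **read_csv_kwargs)


def fetch_table(url, index_col, **read_csv_kwargs):
    """Download a CSV from ``url`` and return it as a DataFrame (hypothetical helper)."""
    response = requests.get(url, timeout=60)
    response.raise_for_status()  # fail on HTTP errors rather than parsing an error page
    return frame_from_csv_bytes(response.content, index_col, **read_csv_kwargs)


# Offline check with a trimmed copy of the first two sample rows.
example = (
    b"SampleID,PatientID,Timepoint,Consistency\n"
    b"1000A,1000,0,formed\n"
    b"1000B,1000,5,liquid\n"
)
samples_preview = frame_from_csv_bytes(example, index_col="SampleID")
print(samples_preview)
```

With such a helper, `fetch_table(url, index_col='SampleID')` would replace the `requests.get` / `NamedTemporaryFile` / `pd.read_csv` sequence, and extra keyword arguments such as `low_memory=False` pass straight through to `pd.read_csv`.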

q2_metadata = qiime2.Metadata(pd_metadata_samples)
q2_metadata
Metadata
--------
12546 IDs x 7 columns
PatientID:               ColumnProperties(type='categorical')
Timepoint:               ColumnProperties(type='numeric')
Consistency:             ColumnProperties(type='categorical')
Accession:               ColumnProperties(type='categorical')
BioProject:              ColumnProperties(type='categorical')
DayRelativeToNearestHCT: ColumnProperties(type='numeric')
AccessionShotgun:        ColumnProperties(type='categorical')

Call to_dataframe() for a tabular representation.
q2_metadata.save('sample_metadata_simple.tsv')
'tutorial_out/sample_metadata_simple.tsv'
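
The column types reported above appear to follow the pandas dtypes of the DataFrame: columns with a numeric dtype come out as numeric, and text columns come out as categorical. The sketch below uses pandas only (it makes no QIIME 2 call) and two trimmed rows from the sample table to show why `PatientID` ends up as text even though many IDs look like numbers.

```python
import io

import pandas as pd

# Two rows from the sample table above, trimmed to two data columns.
csv_text = (
    "SampleID,PatientID,Timepoint\n"
    "1000A,1000,0\n"
    "FMT.0251G,FMT.0251,8\n"
)
preview = pd.read_csv(io.StringIO(csv_text), index_col="SampleID")

# PatientID mixes "1000" with "FMT.0251", so pandas cannot parse it as a number.
patient_is_numeric = pd.api.types.is_numeric_dtype(preview["PatientID"])
# Timepoint holds only integers, so it keeps a numeric dtype.
timepoint_is_numeric = pd.api.types.is_numeric_dtype(preview["Timepoint"])

print(preview.dtypes)
print(patient_is_numeric, timepoint_is_numeric)
```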

Patient Metadata

This metadata represents events that are not tied to individual samples. Take care to identify the relevant events and encode them sample-wise in a consistent way.

Additional description of the columns has been provided below each table.
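
One possible sample-wise encoding is a per-sample flag that records whether a patient-level event fell within a window after the sample was taken. The sketch below uses hypothetical toy rows (patients `P1` and `P2`) shaped like the sample and infection tables; the 7-day window and the column name `InfectionWithin7Days` are assumptions made only for illustration.

```python
import pandas as pd

# Hypothetical toy rows in the same shape as the tables in this notebook.
samples = pd.DataFrame(
    {"PatientID": ["P1", "P1", "P2"], "Timepoint": [0, 10, 3]},
    index=pd.Index(["P1A", "P1B", "P2A"], name="SampleID"),
)
infections = pd.DataFrame({"PatientID": ["P1", "P1"], "Timepoint": [12, 40]})

WINDOW_DAYS = 7  # assumed look-ahead window, chosen only for illustration

# Pair every sample with every infection event of the same patient.
pairs = samples.reset_index().merge(
    infections, on="PatientID", how="left", suffixes=("", "_infection")
)
# Days from sample collection to the infection (NaN if the patient has none).
pairs["DaysToInfection"] = pairs["Timepoint_infection"] - pairs["Timepoint"]
pairs["InWindow"] = pairs["DaysToInfection"].between(0, WINDOW_DAYS)

# Collapse back to exactly one value per sample.
flag = pairs.groupby("SampleID")["InWindow"].any()
samples["InfectionWithin7Days"] = flag.reindex(samples.index).map(
    {True: "yes", False: "no"}
)
print(samples)
```

The result keeps one row per sample, which is the shape a sample metadata file needs; the many-rows-per-patient event table is reduced to a single categorical column.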

data = requests.get("https://www.dropbox.com/s/yxv2x00z9fi0w2l/tblInfectionsCidPapers.csv?dl=1")

with tempfile.NamedTemporaryFile() as f:
    f.write(data.content)
    f.flush()
    pd_metadata_infections = pd.read_csv(f.name, index_col='PatientID')

pd_metadata_infections
PatientID                 Timepoint  InfectiousAgent                            DayRelativeToNearestHCT
1000                      213        Enterococcus_Faecium                       204.0
1003                      1046       Enterococcus_Faecium_Vancomycin_Resistant  1049.0
1010                      -58        Escherichia                                -61.0
1015                      19         Enterococcus_Faecium_Vancomycin_Resistant  16.0
1015                      20         Enterococcus_Faecium_Vancomycin_Resistant  17.0
...                       ...        ...                                        ...
pt_with_samples_2019_760  -1         Klebsiella_Pneumoniae                      -58.0
pt_with_samples_2019_760  5          Enterococcus_Faecium                       -52.0
pt_with_samples_2021_897  0          Escherichia                                -134.0
pt_with_samples_2021_897  0          Klebsiella_Pneumoniae                      -134.0
pt_with_samples_642_643   43         Klebsiella_Pneumoniae                      83.0

1231 rows × 3 columns

Day of positive blood cultures for 426 patients. Only microbes analyzed in the following two studies are included: (1) Taur, Y., et al. 2012. Intestinal domination and the risk of bacteremia in patients undergoing allogeneic hematopoietic stem cell transplantation. Clinical Infectious Diseases, 55(7), pp.905-914; (2) Stoma, I., et al. 2020. Compositional flux within the intestinal microbiota and risk for bloodstream infection with gram-negative bacteria. Clinical Infectious Diseases.

PatientID: deidentified identifier of patients

Timepoint: deidentified day of infection

InfectiousAgent: the bacterium that caused the infection

DayRelativeToNearestHCT: day of infection relative to the nearest day of bone marrow transplant
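
Because a patient can have several infection events, `PatientID` repeats in this table and cannot serve as a unique identifier (which, as far as I know, `qiime2.Metadata` requires of its IDs). The sketch below copies three rows from the table above and shows one possible per-patient summary; the summary column names `InfectionEvents` and `FirstInfectionTimepoint` are hypothetical.

```python
import pandas as pd

# Rows copied from the infections table displayed above.
infections = pd.DataFrame(
    {
        "Timepoint": [213, 19, 20],
        "InfectiousAgent": [
            "Enterococcus_Faecium",
            "Enterococcus_Faecium_Vancomycin_Resistant",
            "Enterococcus_Faecium_Vancomycin_Resistant",
        ],
    },
    index=pd.Index(["1000", "1015", "1015"], name="PatientID"),
)

# Patient 1015 appears twice, so the index is not unique.
print(infections.index.is_unique)

# One possible per-patient summary: number of events and day of the first one.
per_patient = infections.groupby(level="PatientID").agg(
    InfectionEvents=("Timepoint", "size"),
    FirstInfectionTimepoint=("Timepoint", "min"),
)
print(per_patient)
```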


data = requests.get("https://www.dropbox.com/s/nfb1h7kkkx8sqp1/tbltemperature.csv?dl=1")

with tempfile.NamedTemporaryFile() as f:
    f.write(data.content)
    f.flush()
    pd_metadata_temps = pd.read_csv(f.name, index_col='PatientID', low_memory=False)


pd_metadata_temps
PatientID                Timepoint  MaxTemperature  DayRelativeToNearestHCT
1000                     -462       98.4            -471.0
1000                     -427       98.4            -436.0
1000                     -399       98.0            -408.0
1000                     -371       98.0            -380.0
1000                     -343       98.0            -352.0
...                      ...        ...             ...
pt_with_samples_833_883  1534       98.2            1331.0
pt_with_samples_833_883  1548       99.0            1345.0
pt_with_samples_833_883  1576       97.9            1373.0
pt_with_samples_833_883  1604       99.0            1401.0
pt_with_samples_833_883  1632       97.9            1429.0

202579 rows × 3 columns

Temperatures for 1,249 patients

PatientID: deidentified identifier of patients

Timepoint: deidentified day when patient temperature was measured

MaxTemperature: Maximum temperature (unit: Fahrenheit) recorded on that day for that patient

DayRelativeToNearestHCT: day of temperature measurement relative to the nearest day of bone marrow transplant
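
Since `MaxTemperature` is in Fahrenheit, a Celsius column or a fever flag may be convenient downstream. The sketch below is an illustration under stated assumptions: the 100.4 °F (38.0 °C) cutoff is a commonly used fever threshold and is not defined by the dataset, the derived column names are hypothetical, and the third row is an invented febrile reading added only to exercise the flag.

```python
import pandas as pd

# Two rows copied from the temperature table above, plus one hypothetical
# febrile reading (101.3 at Timepoint -398) that is NOT in the dataset.
temps = pd.DataFrame(
    {
        "PatientID": ["1000", "1000", "1000"],
        "Timepoint": [-462, -399, -398],
        "MaxTemperature": [98.4, 98.0, 101.3],
    }
)

FEVER_F = 100.4  # assumed threshold (38.0 C); not part of the dataset

# Standard Fahrenheit-to-Celsius conversion, rounded to one decimal place.
temps["MaxTemperatureC"] = ((temps["MaxTemperature"] - 32) * 5 / 9).round(1)
temps["Fever"] = temps["MaxTemperature"] >= FEVER_F
print(temps)
```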


data = requests.get("https://www.dropbox.com/s/j277dv6lrqz7hfv/tblVanA.csv?dl=1")

with tempfile.NamedTemporaryFile() as f:
    f.write(data.content)
    f.flush()
    pd_metadata_van_a = pd.read_csv(f.name, index_col='SampleID')
    
pd_metadata_van_a
SampleID   VanA
1015P      0
1015Q      0
1015T      0
1015U      1
1015V      1
...        ...
FMT.0251G  0
FMT.0251H  0
FMT.0251I  0
FMT.0251J  0
FMT.0251L  0

7547 rows × 1 columns

PCR detection results for the vanA gene in 7,547 samples

SampleID: stool sample identifier

VanA: whether the vanA gene was detected in the sample
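
This table is already keyed by `SampleID`, so it can be attached to the sample metadata with a left join. It has 7,547 rows against 12,546 samples, so some samples will have no PCR result after the join. The sketch below uses IDs copied from the tables above, cut down to a single column; reading 1 as "detected" and 0 as "not detected" is an assumption, as are the label strings.

```python
import pandas as pd

# Sample IDs copied from the tables above; the sample table is cut to one column.
samples = pd.DataFrame(
    {"PatientID": ["1000", "FMT.0251", "FMT.0251"]},
    index=pd.Index(["1000A", "FMT.0251G", "FMT.0251H"], name="SampleID"),
)
van_a = pd.DataFrame(
    {"VanA": [1, 0, 0]},
    index=pd.Index(["1015U", "FMT.0251G", "FMT.0251H"], name="SampleID"),
)

# Recode 0/1 as labels so the column reads as categorical rather than numeric.
# Assumption: 1 means detected, 0 means not detected.
van_a_labels = van_a["VanA"].map({0: "not_detected", 1: "detected"})

# Left join keeps every sample; samples without a PCR result get NaN.
merged = samples.join(van_a_labels, how="left")
print(merged)
```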


data = requests.get("https://www.dropbox.com/s/066lxgvx16wsmqf/tbldrug.csv?dl=1")

with tempfile.NamedTemporaryFile() as f:
    f.write(data.content)
    f.flush()
    pd_metadata_drug = pd.read_csv(f.name, index_col='PatientID', low_memory=False)
pd_metadata_drug
PatientID                StartTimepoint  StopTimepoint  Factor         Category                   Route        StartDayRelativeToNearestHCT  StopDayRelativeToNearestHCT
1000                     -160            -160           ciprofloxacin  quinolones                 intravenous  -169                          -169
1000                     -160            -160           fluconazole    antifungals                intravenous  -169                          -169
1000                     -151            -151           aztreonam      miscellaneous antibiotics  intravenous  -160                          -160
1000                     -151            -151           vancomycin     glycopeptide antibiotics   intravenous  -160                          -160
1000                     -150            -150           aztreonam      miscellaneous antibiotics  intravenous  -159                          -159
...                      ...             ...            ...            ...                        ...          ...                           ...
pt_with_samples_833_883  1336            1339           azithromycin   macrolide derivatives      intravenous  1133                          1136
pt_with_samples_833_883  1336            1342           posaconazole   antifungals                oral         1133                          1139
pt_with_samples_833_883  1337            1337           vancomycin     glycopeptide antibiotics   intravenous  1134                          1134
pt_with_samples_833_883  1338            1338           vancomycin     glycopeptide antibiotics   intravenous  1135                          1135
pt_with_samples_833_883  1339            1340           vancomycin     glycopeptide antibiotics   intravenous  1136                          1137

80731 rows × 7 columns

Timing and route of drug administration for 1,279 patients

PatientID: deidentified identifier of patients

StartTimepoint: deidentified day when drug administration started

StopTimepoint: deidentified day when drug administration stopped (inclusive: the drug was still given on this day)

Factor: name of the drug

Category: category of the drug

Route: route of drug administration

StartDayRelativeToNearestHCT/StopDayRelativeToNearestHCT: start/stop day of the drug administration relative to the nearest day of bone marrow transplant
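
Because the stop day is included in the course, interval arithmetic on this table needs inclusive bounds. The sketch below copies three rows from the table above and shows the two places where that matters: course length and testing whether a given day falls inside a course. `DurationDays` and `on_drug` are hypothetical names introduced for illustration.

```python
import pandas as pd

# Rows copied from the drug table displayed above.
drug = pd.DataFrame(
    {
        "PatientID": ["pt_with_samples_833_883"] * 3,
        "StartTimepoint": [1336, 1336, 1337],
        "StopTimepoint": [1339, 1342, 1337],
        "Factor": ["azithromycin", "posaconazole", "vancomycin"],
    }
)

# StopTimepoint is inclusive, so a same-day course lasts 1 day, not 0.
drug["DurationDays"] = drug["StopTimepoint"] - drug["StartTimepoint"] + 1


def on_drug(day, factor):
    """Return True if any course of ``factor`` covers ``day`` (inclusive bounds)."""
    courses = drug[drug["Factor"] == factor]
    covered = (courses["StartTimepoint"] <= day) & (day <= courses["StopTimepoint"])
    return bool(covered.any())


print(drug)
print(on_drug(1339, "azithromycin"), on_drug(1340, "azithromycin"))
```

The same inclusive test could be applied to a sample's `Timepoint` to derive a per-sample exposure flag; with real data the lookup would also need to filter on `PatientID`.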


data = requests.get("https://www.dropbox.com/s/ksee4q7x1c1oq99/tblhctmeta.csv?dl=1")

with tempfile.NamedTemporaryFile() as f:
    f.write(data.content)
    f.flush()
    pd_metadata_transplant = pd.read_csv(f.name, index_col='PatientID')
pd_metadata_transplant
PatientID                            TimepointOfTransplant  HCTSource        Disease                 wbcPatientId            autoFmtPatientId  nejmPatientId
FMT.0161                             30                     TCD              Multiple Myeloma        000f9b9617d476abf1f143  NaN               1
667                                  8                      PBSC_unmodified  Leukemia                001f938eeec58c18a4604a  491               1
1277                                 5                      PBSC_unmodified  Leukemia                0079d6c0a49b8b6c3daf83  748               1
464                                  8                      BM_unmodified    Leukemia                00a7221374f597b954d09f  342               1
420                                  6                      cord             Non-Hodgkin's Lymphoma  00d7a5d77e1a5f7a9d3f3c  218               1
...                                  ...                    ...              ...                     ...                     ...               ...
pt_with_samples_1105_1106_1107_1108  6                      cord             Leukemia                ff493eac17bd83d1a2c57c  NaN               1
1759                                 8                      cord             Leukemia                ff650b2e1faab2a3fbcfc9  NaN               0
559                                  -5                     TCD              Leukemia                ffcf59d52566ac7729521f  458               1
140                                  -236                   PBSC_unmodified  Hodgkin's Disease       ffe9adf3d8b0ab843ff29e  NaN               1
1965                                 3                      cord             Leukemia                ffe9f50b4d9ae12e3ae671  NaN               0

1346 rows × 6 columns

Day and source of hematopoietic cell transplant (HCT) for 1,278 patients

PatientID: deidentified identifier of patients

TimepointOfTransplant: deidentified day of HCT

HCTSource: hematopoietic cell sources for HCT patients (BM_unmodified: bone marrow; PBSC_unmodified: peripheral blood stem cells; TCD: T-cell depleted; cord: cord blood)

Disease: disease of patients

wbcPatientId, autoFmtPatientId, nejmPatientId: identifiers for the same patients if they were also included in another previous study (wbcPatientId: Schluter, J. et al. 2019. The gut microbiota influences circulatory immune cell dynamics in humans. BioRxiv; autoFmtPatientId: Taur, Y. et al. 2018. Reconstitution of the gut microbiota of antibiotic-treated patients by autologous fecal microbiota transplant. Science Translational Medicine 10(460); nejmPatientId: Peled, J.U. et al. 2020. Microbiota as Predictor of Mortality in Allogeneic Hematopoietic-Cell Transplantation. New England Journal of Medicine, 382(9), pp.822-834.)
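
In the rows displayed in this notebook, `DayRelativeToNearestHCT` differs from `Timepoint` by a constant per patient (for patient 1000 it is `Timepoint - 9` in the sample, infection, temperature, and drug tables), which is consistent with subtracting `TimepointOfTransplant`. The sketch below shows how such a column might be recomputed when a patient has more than one transplant. It uses a hypothetical patient `P1`, and it assumes "nearest" means smallest absolute difference in days, which the source does not spell out.

```python
import pandas as pd

# Hypothetical patient with two transplants; values chosen only for illustration.
transplants = pd.DataFrame(
    {"PatientID": ["P1", "P1"], "TimepointOfTransplant": [9, 400]}
)
events = pd.DataFrame(
    {"PatientID": ["P1", "P1", "P1"], "Timepoint": [0, 213, 390]},
    index=pd.Index(["e0", "e1", "e2"], name="EventID"),
)

# Pair each event with every transplant of the same patient.
pairs = events.reset_index().merge(transplants, on="PatientID")
pairs["Offset"] = pairs["Timepoint"] - pairs["TimepointOfTransplant"]

# For each event, find the row whose transplant is closest in absolute days
# (ties go to the first matching row).
closest_rows = pairs["Offset"].abs().groupby(pairs["EventID"]).idxmin()
nearest = pairs.loc[closest_rows].set_index("EventID")

# Assignment aligns on EventID, so each event gets its own signed offset.
events["DayRelativeToNearestHCT"] = nearest["Offset"]
print(events)
```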


data = requests.get("https://www.dropbox.com/s/wo5c6i4kp79nob8/tblwbc.csv?dl=1")

with tempfile.NamedTemporaryFile() as f:
    f.write(data.content)
    f.flush()
    pd_metadata_wbc = pd.read_csv(f.name, index_col='PatientID', low_memory=False)
pd_metadata_wbc
PatientID                  Timepoint  BloodCellType  Value  DayRelativeToNearestHCT
1001                       7          WBCtotal       <0.1   4
1001                       8          WBCtotal       <0.1   5
1001                       9          WBCtotal       <0.1   6
1001                       11         WBCtotal       0      8
1001                       12         WBCtotal       <0.1   9
...                        ...        ...            ...    ...
pt_with_samples_1933_1993  15         Lymphocytes    NaN    7
pt_with_samples_1933_1993  16         Lymphocytes    NaN    8
pt_with_samples_1933_1993  17         Lymphocytes    NaN    9
2070                       9          Lymphocytes    NaN    10
2070                       10         Lymphocytes    NaN    11

220835 rows × 4 columns

PatientID: deidentified ID of patients

Timepoint: deidentified day of blood cell measurement

BloodCellType: type of blood cell measured: lymphocytes (Lymphocytes), neutrophils (Neutrophils), or total white blood cells (WBCtotal)

Value: blood cell counts (unit: 1,000 cells/uL)

DayRelativeToNearestHCT: day of blood cell measurement relative to the nearest day of bone marrow transplant
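
The `Value` column mixes plain numbers with censored readings such as `<0.1`, so pandas reads it as text and it needs cleaning before numeric use. The sketch below copies three values from the table above and shows one possible cleanup. Replacing `<0.1` with its upper bound 0.1 is an assumption (other conventions exist, such as half the limit), and `BelowLimit` and `ValueNumeric` are hypothetical column names.

```python
import pandas as pd

# Values copied from the blood-count table above: censored, exact, and missing.
wbc = pd.DataFrame(
    {
        "PatientID": ["1001", "1001", "2070"],
        "Timepoint": [7, 11, 9],
        "Value": ["<0.1", "0", None],
    }
)

# Remember which readings were reported as below a limit; missing counts as False.
wbc["BelowLimit"] = wbc["Value"].str.startswith("<", na=False)

# Drop the "<" prefix and parse; anything unparseable becomes NaN instead of raising.
# Assumption: a censored reading is replaced by its upper bound.
wbc["ValueNumeric"] = pd.to_numeric(wbc["Value"].str.lstrip("<"), errors="coerce")
print(wbc)
```

Keeping the `BelowLimit` flag alongside the numeric value means the substitution can be revisited later without going back to the raw file.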