01_load_data

In principle, FEATHER can accept the output from any HX/MS software.

There are two types of input files:

  1. Peptide pools with centroid deuteration values

  2. Raw mass spectra (deconvoluted)

Read the centroid data

  • Table: The peptide pool.

  • Range List: A file that defines the peptides to include or exclude.

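  • n_fastamides: The number of N-terminal residues of each peptide to exclude from the uptake calculation. In an HDX experiment, the first two residues at the N-terminus of a peptide do not contribute to the measured deuterium uptake because of rapid back exchange.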

  • Saturation: The fraction of deuterium in the D2O buffer (for example, 0.9 for 90%).
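To illustrate how n_fastamides and saturation together limit the uptake a peptide can show, here is a minimal sketch that counts exchangeable backbone amides. It assumes the common HDX convention that proline residues have no backbone amide hydrogen. The helper name max_deuterium_uptake is hypothetical and is not part of pigeon_feather, whose internal calculation may differ.

```python
def max_deuterium_uptake(peptide, n_fastamides=2, saturation=0.9):
    """Estimate the maximum deuterium uptake of a peptide (hypothetical helper).

    Assumptions: the first `n_fastamides` residues are ignored, prolines
    carry no exchangeable backbone amide, and `saturation` is a fraction.
    """
    exchangeable = [res for res in peptide[n_fastamides:] if res != "P"]
    return len(exchangeable) * saturation


# Example: a 10-residue stretch of the tutorial sequence.
# After dropping the first 2 residues, 8 non-proline residues remain.
print(max_deuterium_uptake("ISLIAALAVD"))
```

With n_fastamides=2 and saturation=0.9, this peptide can take up at most 8 × 0.9 deuterons under these assumptions.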

[6]:
from pigeon_feather.data import *
from pigeon_feather.plot import *
from pigeon_feather.hxio import *
from pigeon_feather.spectra import *


import numpy as np
import pandas as pd

import datetime
import os
import pickle
[2]:
tables = ['./data/ecDHFR_tutorial.csv']

ranges = ['./data/rangeslist.csv']


raw_spectra_paths = [
    "./data/SpecExport/",
]

protein_sequence = "MTGHHHHHHENLYFQSISLIAALAVDRVIGMENAMPWNLPADLAWFKRNTLDKPVIMGRHTWESIGRPLPGRKNIILSSQPGTDDRVTWVKSVDEAIAACGDVPEIMVIGGGRVYEQFLPKAQKLYLTHIDAEVEGDTHFPDYEPDDWESVFSEFHDADAQNSHSYCFEILERR"

# load the data, one table and one range list at a time
hdxms_data_list = []
for table, range_list in zip(tables, ranges):
    print(table)

    # read and clean the table; exclude=False treats the range list as peptides to include
    cleaned = read_hdx_tables([table], [range_list], exclude=False, states_subset=['APO','TRI'])

    # convert the cleaned data to hdxms data object
    hdxms_data = load_dataframe_to_hdxmsdata(
        cleaned,
        n_fastamides=2,
        protein_sequence=protein_sequence,
        fulld_approx=False,
        saturation=0.9,
    )

    hdxms_data_list.append(hdxms_data)


./data/ecDHFR_tutorial.csv
rangeslist included !

Check the basic statistics of the hdxms_data_list

[3]:
from pigeon_feather.hxio import get_all_statics_info

get_all_statics_info(hdxms_data_list)
============================================================
                    HDX-MS Data Statistics
============================================================
States names: ['APO', 'TRI']
Time course (s): [46.0, 373.5, 572.5, 2011.0, 7772.0, 30811.5, 43292.0]
Number of time points: 7
Protein sequence length: 174
Average coverage: 0.97
Number of unique peptides: 261
Average peptide length: 9.8
Redundancy (based on average coverage): 14.7
Average peptide length to redundancy ratio: 0.7
Backexchange average, IQR: 0.27, 0.26
============================================================
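The summary values are related by simple arithmetic. The sketch below computes coverage, average peptide length, and redundancy from a list of peptide (start, end) positions. The definitions used here (coverage as the fraction of residues covered by at least one peptide, redundancy as total peptide residues divided by sequence length) are assumptions chosen for illustration, so get_all_statics_info may define them differently. The helper coverage_stats and the peptide ranges are hypothetical.

```python
def coverage_stats(peptide_ranges, sequence_length):
    """Return (coverage, average peptide length, redundancy).

    `peptide_ranges` holds 1-based, inclusive (start, end) tuples.
    """
    covered = set()
    total_residues = 0
    for start, end in peptide_ranges:
        covered.update(range(start, end + 1))
        total_residues += end - start + 1
    coverage = len(covered) / sequence_length
    avg_length = total_residues / len(peptide_ranges)
    redundancy = total_residues / sequence_length
    return coverage, avg_length, redundancy


# Hypothetical peptides on a 20-residue protein
coverage, avg_length, redundancy = coverage_stats([(1, 10), (5, 14), (11, 18)], 20)
print(f"coverage={coverage:.2f}, avg length={avg_length:.1f}, redundancy={redundancy:.2f}")
```

For the statistics reported above, the final ratio follows directly from the two preceding values: 9.8 / 14.7 ≈ 0.67, which rounds to 0.7.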

Load the raw spectrum

[4]:
# load the raw spectra into each hdxms_data object
for i in range(len(tables)):
    load_raw_ms_to_hdxms_data(
        hdxms_data_list[i],
        raw_spectra_paths[i],
    )
Removed 0 peptides from state APO due to missing raw MS data.
Removed 70 peptides from state APO due to high back exchange.
Removed 2 peptides from state TRI due to missing raw MS data.
Removed 70 peptides from state TRI due to high back exchange.
Done loading raw MS data.

Note: A common error is that the expected spectra files cannot be found. Make sure that each protein_state.state_name matches the files in the spectrum folder, and that those files cover the correct time points and charge states.
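When spectra cannot be found, a quick look at the spectrum folder can help. The sketch below lists every entry whose relative path contains a given state name. The exact folder layout and file naming that load_raw_ms_to_hdxms_data expects are not described here, so the substring match is only an assumption, and find_state_entries is a hypothetical helper rather than part of pigeon_feather.

```python
from pathlib import Path


def find_state_entries(spectra_dir, state_name):
    """List entries under `spectra_dir` whose relative path contains `state_name`."""
    root = Path(spectra_dir)
    if not root.is_dir():
        raise FileNotFoundError(f"Spectra folder not found: {root}")
    return sorted(
        str(path.relative_to(root))
        for path in root.rglob("*")
        if state_name in str(path.relative_to(root))
    )


# Example usage with the tutorial's folder and state names:
# for state in ["APO", "TRI"]:
#     entries = find_state_entries("./data/SpecExport/", state)
#     print(state, len(entries), "matching entries")
```

An empty result for a state suggests a mismatch between the state name in the table and the names used in the spectrum folder.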

[7]:
# save the raw data as a pickle file
import pickle

today = datetime.date.today().strftime("%Y%m%d")
# overrides the current date with a fixed string; remove this line to use today's date
today = "20240722"

with open(f"./data/hdxms_data_raw_{today}.pkl", "wb") as f:
    pickle.dump(hdxms_data_list, f)

# with open(f"./data/hdxms_data_raw_{today}.pkl", "rb") as f:
#     hdxms_data_list = pickle.load(f)