{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 01_load_data"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "In principle, FEATHER can accept the output from any HX/MS software.\n",
    "\n",
    "There are two types of input files:\n",
    "1. Peptide pools with centroid deuteration values\n",
    "2. Raw mass spectra (deconvoluted)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Read the centroid data "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* **Table:** The peptide pool.\n",
    "* **Range List:** A file that defines the peptides to include or exclude.\n",
    "* **n_fastamides:** In an HDX experiment, the first two residues of a peptide at the N-terminus do not contribute to deuterium uptake due to rapid back exchange.\n",
    "* **Saturation:** The percentage of deuterium in the D2O buffer.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "metadata": {},
   "outputs": [],
   "source": [
    "from pigeon_feather.data import *\n",
    "from pigeon_feather.plot import *\n",
    "from pigeon_feather.hxio import *\n",
    "from pigeon_feather.spectra import *\n",
    "\n",
    "\n",
    "import numpy as np\n",
    "import pandas as pd\n",
    "\n",
    "import datetime\n",
    "import os\n",
    "import pickle\n",
    "import datetime"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "./data/ecDHFR_tutorial.csv\n",
      "rangeslist included !\n"
     ]
    }
   ],
   "source": [
    "tables = ['./data/ecDHFR_tutorial.csv']\n",
    "\n",
    "ranges = ['./data/rangeslist.csv']\n",
    "\n",
    "\n",
    "raw_spectra_paths = [\n",
    "    f\"./data/SpecExport/\",\n",
    "]\n",
    "\n",
    "protein_sequence = \"MTGHHHHHHENLYFQSISLIAALAVDRVIGMENAMPWNLPADLAWFKRNTLDKPVIMGRHTWESIGRPLPGRKNIILSSQPGTDDRVTWVKSVDEAIAACGDVPEIMVIGGGRVYEQFLPKAQKLYLTHIDAEVEGDTHFPDYEPDDWESVFSEFHDADAQNSHSYCFEILERR\"\n",
    "\n",
    "# load the data\n",
    "hdxms_data_list = []\n",
    "for i in range(len(tables)):\n",
    "    # for i in [4]:\n",
    "    print(tables[i])\n",
    "\n",
    "    # read the data and clean it\n",
    "    cleaned = read_hdx_tables([tables[i]], [ranges[i]], exclude=False, states_subset=['APO','TRI'])\n",
    "    \n",
    "    # convert the cleaned data to hdxms data object\n",
    "    hdxms_data = load_dataframe_to_hdxmsdata(\n",
    "        cleaned,\n",
    "        n_fastamides=2,\n",
    "        protein_sequence=protein_sequence,\n",
    "        fulld_approx=False,\n",
    "        saturation=0.9,\n",
    "    )\n",
    "\n",
    "    hdxms_data_list.append(hdxms_data)\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "check the basic statics_info of the hdxms_data_list"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "============================================================\n",
      "                    HDX-MS Data Statistics\n",
      "============================================================\n",
      "States names: ['APO', 'TRI']\n",
      "Time course (s): [46.0, 373.5, 572.5, 2011.0, 7772.0, 30811.5, 43292.0]\n",
      "Number of time points: 7\n",
      "Protein sequence length: 174\n",
      "Average coverage: 0.97\n",
      "Number of unique peptides: 261\n",
      "Average peptide length: 9.8\n",
      "Redundancy (based on average coverage): 14.7\n",
      "Average peptide length to redundancy ratio: 0.7\n",
      "Backexchange average, IQR: 0.27, 0.26\n",
      "============================================================\n"
     ]
    }
   ],
   "source": [
    "from pigeon_feather.hxio import get_all_statics_info\n",
    "\n",
    "get_all_statics_info(hdxms_data_list)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Load the raw spectrum"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Removed 0 peptides from state APO due to missing raw MS data.\n",
      "Removed 70 peptides from state APO due to high back exchange.\n",
      "Removed 2 peptides from state TRI due to missing raw MS data.\n",
      "Removed 70 peptides from state TRI due to high back exchange.\n",
      "Done loading raw MS data.\n"
     ]
    }
   ],
   "source": [
    "# spectrum could be easily loaded to the hdxms_data object\n",
    "for i in range(len(tables)):\n",
    "    load_raw_ms_to_hdxms_data(\n",
    "        hdxms_data,\n",
    "        raw_spectra_paths[i],\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**Note:** One common error is that the correct spectra file cannot be found. Please ensure that the `protein_state.state_name` corresponds to the files in the spectrum folder, with the correct time points and charge states."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "metadata": {},
   "outputs": [],
   "source": [
    "# save the raw data as a pickle file\n",
    "import pickle\n",
    "\n",
    "today = datetime.date.today().strftime(\"%Y%m%d\")\n",
    "today = \"20240722\"\n",
    "\n",
    "with open(f\"./data/hdxms_data_raw_{today}.pkl\", \"wb\") as f:\n",
    "    pickle.dump(hdxms_data_list, f)\n",
    "\n",
    "# with open(f\"./data/hdxms_data_raw_{today}.pkl\", \"rb\") as f:\n",
    "#     hdxms_data_list = pickle.load(f)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.8"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}