Interfacing with ML Libraries
=============================

Overview
--------

Machine learning (ML) is the field of study that gives computers the ability to
learn from data without being explicitly programmed, and it is well known for its
power to recognize patterns and signals. Recently, the mass spectrometry community
has embraced ML techniques for large-scale data analysis. Predicting accurate
retention times (RT) has been shown to improve identification rates in bottom-up
proteomics. In this tutorial we will predict RT from amino acid sequence data
using simple machine learning methods.

First, we import all necessary libraries for this tutorial.

.. code-block:: ipython3
    :linenos:

    !pip install seaborn
    !pip install xgboost

.. code-block:: python
    :linenos:

    import pandas as pd
    import seaborn as sns
    import matplotlib.pyplot as plt
    import string
    from collections import Counter
    import numpy as np
    from urllib.request import urlretrieve

Once we have imported all libraries successfully, we download the dataset and load
it into a pandas DataFrame.

.. code-block:: python
    :linenos:

    gh = "https://raw.githubusercontent.com/OpenMS/pyopenms-docs/master"
    urlretrieve(gh + "/src/data/pyOpenMS_ML_Tutorial.tsv", "data.tsv")

    tsv_data = pd.read_csv("data.tsv", sep="\t", skiprows=17)

Here, we have prepared a ``tsv`` file that contains three columns: ``sequence``,
``RT`` and ``charge``. Note that this table could also easily be created from
identification data as produced in previous chapters. Before we move forward,
let's try to understand our data a bit better:

a. Sequence - chains of amino acids form peptides or proteins. The arrangement of
   amino acids is referred to as the amino acid sequence. The composition and
   order of amino acids affect the physicochemical properties of the peptide and
   lead to different retention behavior in the column.

b. Retention time (RT) - the time taken for an analyte to pass through a
   chromatography column.

From the amino acid sequence we can derive additional properties (machine learning
features) used to train our model.

We can check the shape of the data using the ``tsv_data.shape`` attribute, which
returns the size of the dataset.

.. code-block:: python
    :linenos:

    print(tsv_data.shape)

.. code-block:: output

    (15896, 3)

Explore the top five rows of the dataset using the ``head()`` method of the pandas
DataFrame.

.. code-block:: python
    :linenos:

    tsv_data.head()

.. code-block:: output

             sequence          RT  charge
    0         EEETVAK  399.677766       2
    1  EQEEQQQQEGHNNK  624.555300       3
    2      SHGGHTVISK  625.797960       3
    3        SGTHNMYK  625.982520       2
    4       AARPTRPDK  626.073300       3

As the ``RT`` column is our response variable, we store it separately as
``Y1_test``.

.. code-block:: python
    :linenos:

    Y1_test = tsv_data["RT"]

Preprocessing
-------------

Cleaning the data before applying a machine learning method keeps the relevant
information in a potentially massive amount of data. Here we apply some simple
preprocessing to extract machine learning features from the amino acid sequences.
Some of the features that can be derived are (see the sketch after this list for a
single-sequence example):

1. ``{Alphabet}_count`` = the count of each amino acid in the sequence.
2. ``{Alphabet}_freq`` = the count of each amino acid divided by the total length
   of the sequence.
3. ``length`` = the total number of amino acids in the sequence.
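To make these definitions concrete, here is a minimal sketch of the count and
frequency features for a single sequence; the peptide ``EEETVAK`` from the table
above is used purely for illustration:

.. code-block:: python
    :linenos:

    from collections import Counter

    example = "EEETVAK"
    counts = Counter(example)  # Counter({'E': 3, 'T': 1, 'V': 1, 'A': 1, 'K': 1})
    length = len(example)  # length feature: 7

    # count and frequency feature for one letter, here glutamic acid (E)
    e_count = counts["E"]  # E_count = 3
    e_freq = counts["E"] / length  # E_freq = 3/7, roughly 0.43

    print(e_count, round(e_freq, 2), length)

The following code blocks create the corresponding feature columns for all
sequences in the table at once.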
.. code-block:: python
    :linenos:

    alphabet_list = list(string.ascii_uppercase)

    column_headers = (
        ["sequence"]
        + [i + "_count" for i in alphabet_list]
        + [i + "_freq" for i in alphabet_list]
        + ["charge", "length"]
    )
    types = (
        ["object"]
        + ["int64" for i in alphabet_list]
        + ["float64" for i in alphabet_list]
        + ["int64", "int64"]
    )
    pdcols = dict(zip(column_headers, types))

As we have all the column names, we can now start populating the feature table.

.. code-block:: python
    :linenos:

    df = pd.DataFrame(
        np.zeros((len(tsv_data.index), len(column_headers))), columns=column_headers
    )
    df["sequence"] = tsv_data["sequence"]
    df["charge"] = tsv_data["charge"]

    # For populating the length column
    df["length"] = df["sequence"].str.len()

    df = df.astype(dtype=pdcols)

    # For populating the {alphabet}_count columns
    def count(row):
        counts = Counter(row["sequence"])
        for aa in counts:
            row[aa + "_count"] = int(counts[aa])
        return row

    df = df.apply(lambda row: count(row), axis=1)
    df.head()

.. code-block:: output

             sequence  A_count  B_count  C_count  D_count  E_count  F_count  G_count  H_count  I_count  ...  S_freq  T_freq  U_freq  V_freq  W_freq  X_freq  Y_freq  Z_freq  charge  length
    0         EEETVAK        1        0        0        0        3        0        0        0        0  ...     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0       2       7
    1  EQEEQQQQEGHNNK        0        0        0        0        4        0        1        1        0  ...     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0       3      14
    2      SHGGHTVISK        0        0        0        0        0        0        2        2        1  ...     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0       3      10
    3        SGTHNMYK        0        0        0        0        0        0        1        1        0  ...     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0       2       8
    4       AARPTRPDK        2        0        0        1        0        0        0        0        0  ...     0.0     0.0     0.0     0.0     0.0     0.0     0.0     0.0       3       9

Now we have completed all the data preprocessing steps. We have derived a good
amount of information from the amino acid sequences that might influence the
retention time in the column, and we can now proceed to building the machine
learning model.

Modelling
---------

.. code-block:: python
    :linenos:

    from sklearn.model_selection import train_test_split, ShuffleSplit
    from sklearn.metrics import mean_squared_error
    from xgboost import XGBRegressor

.. code-block:: python
    :linenos:

    test_df = df.copy()
    # the sequence string itself is not a numerical feature, so it is dropped
    test_df = test_df.drop("sequence", axis=1)

Now we create the train and test set using the ``train_test_split`` function from
sklearn's ``model_selection`` module, with a test size equal to 30% of the data.
To keep the results reproducible, a ``random_state`` is also assigned.

.. code-block:: python
    :linenos:

    # Splitting the data into a training and a test set
    X_train, X_test, Y_train, Y_test = train_test_split(
        test_df, Y1_test, test_size=0.3, random_state=3
    )

We will be using the ``XGBRegressor()`` class because this is clearly a regression
problem: the response variable (retention time) is continuous.

.. code-block:: python
    :linenos:

    xg_reg = XGBRegressor(
        n_estimators=300,
        random_state=3,
        max_leaves=5,
        colsample_bytree=0.7,
        max_depth=7,
    )

Fit the regressor to the training set and make predictions on the test set using
the familiar ``.fit()`` and ``.predict()`` methods.

.. code-block:: python
    :linenos:

    xg_reg.fit(X_train, Y_train)

    Y_pred = xg_reg.predict(X_test)

Compute the root mean squared error (RMSE) using the ``mean_squared_error``
function from sklearn's metrics module.

.. code-block:: python
    :linenos:

    rmse = np.sqrt(mean_squared_error(Y_test, Y_pred))
    print("RMSE: %f" % (rmse))

.. code-block:: output

    RMSE: 437.017290
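Beyond a single error number, it can be instructive to look at which of the
derived features the model relies on most. The following is only a minimal sketch,
assuming the fitted ``xg_reg`` and the feature columns of ``X_train`` from above:

.. code-block:: python
    :linenos:

    # Rank the features by the importance scores of the fitted model
    importances = pd.Series(xg_reg.feature_importances_, index=X_train.columns)
    print(importances.sort_values(ascending=False).head(10))

xgboost also provides a ``plot_importance()`` helper for a quick bar chart of
feature importances.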
Store the **Observed** vs. **Predicted** values in a pandas DataFrame and print
them.

.. code-block:: python
    :linenos:

    k = pd.DataFrame(
        {"Observed": Y_test.values.flatten(), "Predicted": Y_pred.flatten()}
    )
    print(k)

.. code-block:: output

            Observed    Predicted
    0     3652.28442  3927.141846
    1     4244.80320  4290.294434
    2     3065.19054  3703.156982
    3      909.50610   762.218567
    4     1982.80902  2628.958740
    ...          ...          ...
    4764  5527.23804  5599.530762
    4765  3388.76430  3272.557617
    4766  3101.35566  3346.364990
    4767  5515.94682  5491.597168
    4768  2257.63092  2258.312988

We will now generate an **Observed** vs. **Predicted** plot that gives a
high-level overview of the model performance. We can clearly see that there are
only a few outliers and that most points lie close to the diagonal. This means
that the prediction actually works and that observed and predicted values do not
differ too much.

.. code-block:: python
    :linenos:

    sns.lmplot(
        x="Observed", y="Predicted", data=k, scatter_kws={"alpha": 0.2, "s": 5}
    )

.. image:: img/ml_tutorial_predicted_vs_observed.png

A kernel density estimate of the residuals (observed minus predicted retention
time) shows how the prediction errors are distributed:

.. code-block:: python
    :linenos:

    p = sns.kdeplot(data=k["Observed"] - k["Predicted"], fill=True)
    p.set(xlabel="Observed-Predicted (s)")

.. image:: img/ml_tutorial_kdplot.png

In order to build more robust models, it is common to perform a k-fold
cross-validation, in which all entries of the original training dataset are used
for both training and validation and each entry is used for validation exactly
once. XGBoost supports k-fold cross-validation directly via its ``cv()`` function,
where the ``nfold`` parameter specifies the number of cross-validation sets to
build. Here, we instead construct the splits ourselves with scikit-learn's
``ShuffleSplit``, which draws five independent random train/validation splits from
the training data.

.. code-block:: python
    :linenos:

    # Performing cross-validation on five random train/validation splits
    ss = ShuffleSplit(n_splits=5, test_size=0.25, random_state=0)

    counter = 0
    for train_index, test_index in ss.split(X_train, Y_train):
        counter += 1
        X_train_Kfold, X_test_Kfold = (
            X_train[X_train.index.isin(train_index)].to_numpy(),
            X_train[X_train.index.isin(test_index)].to_numpy(),
        )
        y_train_Kfold, y_test_Kfold = (
            Y_train[Y_train.index.isin(train_index)].to_numpy().flatten(),
            Y_train[Y_train.index.isin(test_index)].to_numpy().flatten(),
        )

        # train a fresh regressor on this split and predict the held-out part
        Regressor = XGBRegressor()
        Regressor.fit(X_train_Kfold, y_train_Kfold)
        predictions = Regressor.predict(X_test_Kfold)

        fold_df = pd.DataFrame(
            {"Observed": y_test_Kfold.flatten(), "Predicted": predictions.flatten()}
        )
        print("Fold-" + str(counter))
        print("---------------------")
        print(fold_df)

.. code-block:: output

    Fold-1
    ---------------------
            Observed    Predicted
    0     1845.17346  2051.894043
    1     1155.68124  1911.122192
    2     2847.94272  2753.223145
    3     2370.70494  2670.160889
    4     4111.31718  3961.675049
    ...          ...          ...
    1935  3880.18458  3454.832031
    1936  4125.82776  4068.806152
    1937  4586.33838  3829.927002
    1938  2261.99454  3225.578613
    1939  4342.82430  3943.912354

    [1940 rows x 2 columns]

    Fold-2
    ---------------------
            Observed    Predicted
    0     3476.56062  4075.536377
    1     4009.78704  4022.654785
    2     2847.94272  2779.675293
    3     3669.33108  4026.944824
    4     3997.12632  3566.471436
    ...          ...          ...
    1907  2916.91818  2744.992676
    1908  3569.64318  3862.661621
    1909  2118.25278  2221.599854
    1910  1787.61012  1839.471802
    1911  3583.44846  3210.243164

    [1912 rows x 2 columns]

    Fold-3
    ---------------------
            Observed    Predicted
    0     2052.18066  2237.868896
    1     4336.45050  3622.901367
    2     2317.39104  2496.773438
    3     3356.40018  3291.187988
    4     1778.73198  2034.299683
    ...          ...          ...
    1934  3795.23424  2968.955322
    1935  3622.34358  3203.385742
    1936  2261.99454  3115.011475
    1937  4112.62578  3743.435791
    1938  4342.82430  3721.162842

    [1939 rows x 2 columns]

    Fold-4
    ---------------------
            Observed    Predicted
    0     1762.89840  1691.997803
    1     1292.39622  1418.658325
    2     1914.00468  1779.962769
    3     4571.86566  4618.782715
    4     2317.39104  2417.823242
    ...          ...          ...
    1985  2779.37664  2702.244385
    1986  4335.23442  3733.191162
    1987  2916.91818  2609.322021
    1988  4125.82776  3947.512939
    1989  3429.54294  3550.206787

    [1990 rows x 2 columns]

    Fold-5
    ---------------------
            Observed    Predicted
    0     2790.00414  3010.381592
    1     3476.56062  3972.215820
    2     1845.17346  1901.611572
    3     4009.78704  3884.857178
    4     3578.05344  2993.831787
    ...          ...          ...
    1975  3778.69704  4209.392090
    1976  1494.22332  1612.613281
    1977  4125.82776  3902.622559
    1978  4701.03624  4372.867676
    1979  1888.41552  2342.040771

    [1980 rows x 2 columns]

That's it, we have trained a simple machine learning model to predict peptide
retention times from peptide data. Sophisticated machine learning models integrate
retention time data from many experiments and add additional properties of
peptides (or even learn them from the data) to achieve lower prediction errors.
One such additional property is sketched below.
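As a small illustration of that last point, one extra peptide property, for
example the monoisotopic peptide mass computed with pyOpenMS, could be appended to
the feature table and the model retrained. This is only a minimal sketch: it
assumes that ``pyopenms`` is installed and that all sequences are plain,
unmodified amino acid strings, and it reuses the objects defined above; the
``mono_mass`` column and the ``xg_reg2`` model are illustrative names.

.. code-block:: python
    :linenos:

    import pyopenms as oms

    # Hypothetical extra feature: monoisotopic mass of each peptide sequence
    test_df["mono_mass"] = [
        oms.AASequence.fromString(seq).getMonoWeight()
        for seq in tsv_data["sequence"]
    ]

    # Re-split and retrain with the same settings as before to compare the RMSE
    X_train2, X_test2, Y_train2, Y_test2 = train_test_split(
        test_df, Y1_test, test_size=0.3, random_state=3
    )
    xg_reg2 = XGBRegressor(
        n_estimators=300,
        random_state=3,
        max_leaves=5,
        colsample_bytree=0.7,
        max_depth=7,
    )
    xg_reg2.fit(X_train2, Y_train2)
    print("RMSE: %f" % np.sqrt(mean_squared_error(Y_test2, xg_reg2.predict(X_test2))))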