Peptides and Proteins ===================== Sequences of amino acids form peptides and proteins. Amino Acid Sequences ******************** The :py:class:`~.AASequence` class handles amino acid sequences in OpenMS. A string of amino acid residues can be turned into a instance of :py:class:`~.AASequence` to provide some commonly used operations like chemical formula, weight, and isotope distribution calculations. The example below shows how amino acid sequences can be created and how basic mass calculations are conducted. .. code-block:: python :linenos: import pyopenms as oms seq = oms.AASequence.fromString( "DFPIANGER" ) # create AASequence object from string representation prefix = seq.getPrefix(4) # extract prefix of length 4 suffix = seq.getSuffix(5) # extract suffix of length 5 concat = seq + seq # concatenate two sequences # print string representation of sequences print("Sequence:", seq) print("Prefix:", prefix) print("Suffix:", suffix) print("Concatenated:", concat) # some mass calculations mfull = seq.getMonoWeight() # weight of M mprecursor = seq.getMonoWeight(oms.Residue.ResidueType.Full, 2) # weight of M+2H # we can calculate mass-over-charge manually mz = seq.getMonoWeight(oms.Residue.ResidueType.Full, 2) / 2.0 # m/z of [M+2H]2+ # or simply by: mz = seq.getMZ(2) # same as above print() print("Monoisotopic mass of peptide [M] is", mfull) print("Monoisotopic mass of peptide precursor [M+2H]2+ is", mprecursor) print("Monoisotopic m/z of [M+2H]2+ is", mz) Which prints the amino acid sequence as well as the result of concatenating two sequences or taking the suffix of a sequence. We then compute the mass of the full peptide (:chem:`[M]`), the mass of the peptide precursor (:chem:`[M+2H]2+`) and m/z value of the peptide precursor (:chem:`[M+2H]2+`). Note that, the mass of the peptide precursor is shifted by two protons that are now attached to the molecules as charge carriers. (Detail: the proton mass of :math:`1.007276\ u` is slightly different from the mass of an uncharged hydrogen atom at :math:`1.007825\ u`). We can easily calculate the charged weight of a :chem:`(M+2H)2+` ion and compute m/z simply dividing by the charge. .. code-block:: output Sequence: DFPIANGER Prefix: DFPI Suffix: ANGER Concatenated: DFPIANGERDFPIANGER Monoisotopic mass of peptide [M] is 1017.4879641373001 Monoisotopic mass of peptide precursor [M+2H]2+ is 1019.5025170708421 Monoisotopic m/z of [M+2H]2+ is 509.7512585354211 The :py:class:`~.AASequence` object also allows iterations directly in Python: .. code-block:: python :linenos: seq = oms.AASequence.fromString("DFPIANGER") print("The peptide", str(seq), "consists of the following amino acids:") for aa in seq: print(aa.getName(), ":", aa.getMonoWeight()) Which will print .. code-block:: output The peptide DFPIANGER consists of the following amino acids: Aspartate : 133.0375092233 Phenylalanine : 165.0789793509 Proline : 115.0633292871 Isoleucine : 131.0946294147 Alanine : 89.04767922330001 Asparagine : 132.0534932552 Glycine : 75.0320291595 Glutamate : 147.05315928710002 Arginine : 174.1116764466 The N- and C-Terminus as well as the residues themself can be modified. The example below shows how to check for such modifications. .. code-block:: python :linenos: seq = oms.AASequence.fromString("C[143]PKCK(Label:13C(6)15N(2))CR") # check if AASequence has a N- or C-terminal modification if seq.hasNTerminalModification(): print("N-Term Modification: ", seq.getNTerminalModification().getFullId()) if seq.hasCTerminalModification(): print("C-Term Modification: ", seq.getCTerminalModification().getFullId()) # iterate over all residues and look for modifications for aa in seq: if aa.isModified(): print( aa.getName(), ":", aa.getMonoWeight(), ":", aa.getModificationName() ) else: print(aa.getName(), ":", aa.getMonoWeight()) Which will print: .. code-block:: output N-Term Modification: Pyro-carbamidomethyl (N-term C) Cysteine : 121.01974995329999 Proline : 115.06332928709999 Lysine : 146.1055284466 Cysteine : 121.01974995329999 Lysine : 154.11972844660002 : Label:13C(6)15N(2) Cysteine : 121.01974995329999 Arginine : 174.1116764466 Molecular Formula ~~~~~~~~~~~~~~~~~ We can now combine our knowledge of :py:class:`~.AASequence` with what we learned in about :py:class:`~.EmpiricalFormula` to get accurate mass and isotope distributions from the amino acid sequence. But first, let's get the formula of peptide: .. code-block:: python :linenos: seq = oms.AASequence.fromString("DFPIANGER") seq_formula = seq.getFormula() print("Peptide", seq, "has molecular formula", seq_formula) Isotope Patterns ~~~~~~~~~~~~~~~~ We now want to print the coarse (e.g., peaks only at nominal masses) distribution. .. code-block:: python :linenos: # print coarse isotope distribution coarse_isotopes = seq_formula.getIsotopeDistribution( oms.CoarseIsotopePatternGenerator(6) ) for iso in coarse_isotopes.getContainer(): print( "Isotope", iso.getMZ(), "has abundance", iso.getIntensity() * 100, "%" ) For most applications in computational proteomics, the coarse isotope distribution is sufficient. But if we deal with very high resolution instruments, we still might want to calculate the isotopic fine structure. We use the :py:class:`~.FineIsotopePatternGenerator` in OpenMS to reveal these additional peaks: .. code-block:: python :linenos: # print fine structure of isotope distribution fine_isotopes = seq_formula.getIsotopeDistribution( oms.FineIsotopePatternGenerator(0.01) ) # max 0.01 unexplained probability for iso in fine_isotopes.getContainer(): print( "Isotope", iso.getMZ(), "has abundance", iso.getIntensity() * 100, "%" ) And plot the very similar looking distributions using standard ``matplotlib`` functionality: .. code-block:: python :linenos: import math from matplotlib import pyplot as plt def plotIsotopeDistribution(isotope_distribution, title="Isotope distribution"): plt.title(title) distribution = {"mass": [], "abundance": []} for iso in isotope_distribution.getContainer(): distribution["mass"].append(iso.getMZ()) distribution["abundance"].append(iso.getIntensity() * 100) bars = plt.bar( distribution["mass"], distribution["abundance"], width=0.01, snap=False ) # snap ensures that all bars are rendered plt.ylim([0, 110]) plt.xticks( range( math.ceil(distribution["mass"][0]) - 2, math.ceil(distribution["mass"][-1]) + 2, ) ) plt.xlabel("Atomic mass (u)") plt.ylabel("Relative abundance (%)") plt.figure(figsize=(10, 7)) plt.subplot(1, 2, 1) plotIsotopeDistribution(coarse_isotopes, "Isotope distribution - coarse") plt.subplot(1, 2, 2) plotIsotopeDistribution(fine_isotopes, "Isotope distribution - fine structure") plt.show() .. image:: img/DFPIANGER_isoDistribution.png Fragment Ions ~~~~~~~~~~~~~ We can easily calculate different ion types for amino acid sequences: .. code-block:: python :linenos: suffix = seq.getSuffix(3) # y3 ion "GER" print("=" * 35) print("y3 ion sequence:", suffix) y3_formula = suffix.getFormula(oms.Residue.ResidueType.YIon, 2) # y3++ ion suffix.getMonoWeight(oms.Residue.ResidueType.YIon, 2) / 2.0 # CORRECT suffix.getMonoWeight(oms.Residue.ResidueType.XIon, 2) / 2.0 # CORRECT suffix.getMonoWeight(oms.Residue.ResidueType.BIon, 2) / 2.0 # INCORRECT print("y3 mz:", suffix.getMonoWeight(oms.Residue.ResidueType.YIon, 2) / 2.0) print("y3 molecular formula:", y3_formula) Which will produce .. code-block:: output =================================== y3 ion sequence: GER y3 mz: 181.09514385 y3 molecular formula: C13H24N6O6 Easy, isn't it? To generate full theoretical spectra watch out for the more specialized (and faster) :py:class:`~.TheoreticalSpectrumGenerator` which we will take a look at later. Modified Sequences ****************** The :py:class:`~.AASequence` class can also handle modifications, modifications are specified using a unique string identifier present in the :py:class:`~.ModificationsDB` in round brackets after the modified amino acid or by providing the mass of the residue in square brackets. For example ``AASequence.fromString(".DFPIAM(Oxidation)GER.")`` creates an instance of the peptide "DFPIAMGER" with an oxidized methionine. There are multiple ways to specify modifications, and ``AASequence.fromString("DFPIAM(UniMod:35)GER")``, ``AASequence.fromString("DFPIAM[+16]GER")`` and ``AASequence.fromString("DFPIAM[147]GER")`` are all equivalent). .. code-block:: python :linenos: seq = oms.AASequence.fromString("PEPTIDESEKUEM(Oxidation)CER") print(seq.toUnmodifiedString()) print(seq.toString()) print(seq.toUniModString()) print(seq.toBracketString()) print(seq.toBracketString(False)) print(oms.AASequence.fromString("DFPIAM(UniMod:35)GER")) print(oms.AASequence.fromString("DFPIAM[+16]GER")) print(oms.AASequence.fromString("DFPIAM[+15.99]GER")) print(oms.AASequence.fromString("DFPIAM[147]GER")) print(oms.AASequence.fromString("DFPIAM[147.035405]GER")) The above code outputs: .. code-block:: output PEPTIDESEKUEMCER PEPTIDESEKUEM(Oxidation)CER PEPTIDESEKUEM(UniMod:35)CER PEPTIDESEKUEM[147]CER PEPTIDESEKUEM[147.0354000171]CER DFPIAM(Oxidation)GER DFPIAM(Oxidation)GER DFPIAM(Oxidation)GER DFPIAM(Oxidation)GER DFPIAM(Oxidation)GER Note there is a subtle difference between ``AASequence.fromString(".DFPIAM[+16]GER.")`` and ``AASequence.fromString(".DFPIAM[+15.9949]GER.")`` - while the former will try to find the first modification matching to a mass difference of :math:`16 \pm 0.5`, the latter will try to find the closest matching modification to the exact mass. The exact mass approach usually gives the intended results while the first approach may or may not. In all instances, it is better to use an exact description of the desired modification, such as UniMod, instead of mass differences. N- and C-terminal modifications are represented by brackets to the right of the dots terminating the sequence. For example, ``".(Dimethyl)DFPIAMGER."`` and ``".DFPIAMGER.(Label:18O(2))"`` represent the labelling of the N- and C-terminus respectively, but ``".DFPIAMGER(Phospho)."`` will be interpreted as a phosphorylation of the last arginine at its side chain: .. code-block:: python :linenos: s = oms.AASequence.fromString(".(Dimethyl)DFPIAMGER.") print(s, s.hasNTerminalModification()) s = oms.AASequence.fromString(".DFPIAMGER.(Label:18O(2))") print(s, s.hasCTerminalModification()) s = oms.AASequence.fromString(".DFPIAMGER(Phospho).") print(s, s.hasCTerminalModification()) Arbitrary / unknown amino acids (usually due to an unknown modification) can be specified using tags preceded by :chem:`X`: :chem:`X[weight]`. This indicates a new amino acid (":chem:`X`") with the specified weight, e.g. :chem:`RX[148.5]T`. Note that this tag does not alter the amino acids to the left (:chem:`R`) or right (:chem:`T`). Rather, :chem:`X` represents an amino acid on its own. Be careful when converting such :py:class:`~.AASequence` objects to an EmpiricalFormula using :py:meth:`~.AASequence.getFormula`, as tags will not be considered in this case (there exists no formula for them). However, they have an influence on :py:meth:`~.AASequence.getMonoWeight` and :py:meth:`~.AASequence.getAverageWeight`! Applying Fixed or Variable Modifications to Sequences ***************************************************** In this tutorial, we will cover a step-by-step guide on how to use the pyopenms library to generate modified peptides from a given amino acid sequence. .. code-block:: python :linenos: import pyopenms as poms # Create an amino acid sequence using the fromString() method of the AASequence class. # In this example, we will use the amino acid sequence "TESTMTECSTMTESTR" sequence = poms.AASequence.fromString("TESTMTECSTMTESTR") # We use the names "Oxidation (M)" and "Carbamidomethyl (C)" for the variable and fixed modifications, respectively. variable_mod_names = [b"Oxidation (M)"] fixed_mod_names = [b"Carbamidomethyl (C)"] # We then use the getModifications() method of the ModifiedPeptideGenerator class to get the modifications for these names. variable_modifications = poms.ModifiedPeptideGenerator.getModifications(variable_mod_names) fixed_modifications = poms.ModifiedPeptideGenerator.getModifications(fixed_mod_names) # Apply the fixed modifications to the amino acid sequence poms.ModifiedPeptideGenerator.applyFixedModifications(fixed_modifications, sequence) # Define the maximum number of variable modifications allowed max_variable_mods = 1 # Generate the modified peptides peptides_with_variable_modifications = [] keep_unmodified_in_result = False poms.ModifiedPeptideGenerator.applyVariableModifications(variable_modifications, sequence, max_variable_mods, peptides_with_variable_modifications, keep_unmodified_in_result) # Print the modified peptides generated using Fixed modifications and their mono-isotopic mass. print("Fixed:", sequence.toString()) print("Mono-isotopic mass:", sequence.getMonoWeight()) # Print the modified peptides generated using variable modifications and their mono-isotopic mass. for peptide in peptides_with_variable_modifications: print("Variable:", peptide.toString()) print("Mono-isotopic mass:", peptide.getMonoWeight()) The above code outputs: .. code-block:: output Fixed: TESTMTEC(Carbamidomethyl)STMTESTR Mono-isotopic mass: 1850.7332409542007 Variable: TESTMTEC(Carbamidomethyl)STM(Oxidation)TESTR Mono-isotopic mass: 1866.7281559542005 Variable: TESTM(Oxidation)TEC(Carbamidomethyl)STMTESTR Mono-isotopic mass: 1866.7281559542005 Proteins and :term:`FASTA` Files ******************************** Protein sequences, can be loaded from and stored in :term:`FASTA` protein databases using :py:class:`~.FASTAFile`. The example below shows how protein sequences can be stored in :term:`FASTA` files and loaded back in pyOpenMS: .. code-block:: python :linenos: bsa = oms.FASTAEntry() # one entry in a FASTA file bsa.sequence = "MKWVTFISLLLLFSSAYSRGVFRRDTHKSEIAHRFKDLGE" bsa.description = "BSA Bovine Albumin (partial sequence)" bsa.identifier = "BSA" alb = oms.FASTAEntry() alb.sequence = "MKWVTFISLLFLFSSAYSRGVFRRDAHKSEVAHRFKDLGE" alb.description = "ALB Human Albumin (partial sequence)" alb.identifier = "ALB" entries = [bsa, alb] f = oms.FASTAFile() f.store("example.fasta", entries) Afterwards, the ``example.fasta`` file can be read again from the disk: .. code-block:: python :linenos: entries = [] f = oms.FASTAFile() f.load("example.fasta", entries) print(len(entries)) for e in entries: print(e.identifier, e.sequence)