# Peptides and Proteins

Sequences of amino acids form peptides and proteins.

## Amino Acid Sequences

The AASequence class handles amino acid sequences in OpenMS. A string of amino acid residues can be turned into a instance of AASequence to provide some commonly used operations like chemical formula, weight, and isotope distribution calculations.

The example below shows how amino acid sequences can be created and how basic mass calculations are conducted.

 1from pyopenms import *
2
3seq = AASequence.fromString(
4    "DFPIANGER"
5)  # create AASequence object from string representation
6prefix = seq.getPrefix(4)  # extract prefix of length 4
7suffix = seq.getSuffix(5)  # extract suffix of length 5
8concat = seq + seq  # concatenate two sequences
9
10# print string representation of sequences
11print("Sequence:", seq)
12print("Prefix:", prefix)
13print("Suffix:", suffix)
14print("Concatenated:", concat)
15
16# some mass calculations
17mfull = seq.getMonoWeight()  # weight of M
18mprecursor = seq.getMonoWeight(Residue.ResidueType.Full, 2)  # weight of M+2H
19
20# we can calculate mass-over-charge manually
21mz = seq.getMonoWeight(Residue.ResidueType.Full, 2) / 2.0  # m/z of [M+2H]2+
22# or simply by:
23mz = seq.getMZ(2)  # same as above
24
25print()
26print("Monoisotopic mass of peptide [M] is", mfull)
27print("Monoisotopic mass of peptide precursor [M+2H]2+ is", mprecursor)
28print("Monoisotopic m/z of [M+2H]2+ is", mz)


Which prints the amino acid sequence as well as the result of concatenating two sequences or taking the suffix of a sequence. We then compute the mass of the full peptide ($$\ce{[M]}$$), the mass of the peptide precursor ($$\ce{[M+2H]2+}$$) and m/z value of the peptide precursor ($$\ce{[M+2H]2+}$$). Note that, the mass of the peptide precursor is shifted by two protons that are now attached to the molecules as charge carriers. (Detail: the proton mass of $$1.007276\ u$$ is slightly different from the mass of an uncharged hydrogen atom at $$1.007825\ u$$). We can easily calculate the charged weight of a $$\ce{(M+2H)2+}$$ ion and compute m/z simply dividing by the charge.

Sequence: DFPIANGER
Prefix: DFPI
Suffix: ANGER
Concatenated: DFPIANGERDFPIANGER

Monoisotopic mass of peptide [M] is 1017.4879641373001
Monoisotopic mass of peptide precursor [M+2H]2+ is 1019.5025170708421
Monoisotopic m/z of [M+2H]2+ is 509.7512585354211


The AASequence object also allows iterations directly in Python:

1seq = AASequence.fromString("DFPIANGER")
2
3print("The peptide", str(seq), "consists of the following amino acids:")
4for aa in seq:
5    print(aa.getName(), ":", aa.getMonoWeight())


Which will print

The peptide DFPIANGER consists of the following amino acids:
Aspartate : 133.0375092233
Phenylalanine : 165.0789793509
Proline : 115.0633292871
Isoleucine : 131.0946294147
Alanine : 89.04767922330001
Asparagine : 132.0534932552
Glycine : 75.0320291595
Glutamate : 147.05315928710002
Arginine : 174.1116764466


The N- and C-Terminus as well as the residues themself can be modified. The example below shows how to check for such modifications.

 1seq = AASequence.fromString("C[143]PKCK(Label:13C(6)15N(2))CR")
2
3# check if AASequence has a N- or C-terminal modification
4if seq.hasNTerminalModification():
5    print("N-Term Modification: ", seq.getNTerminalModification().getFullId())
6if seq.hasCTerminalModification():
7    print("C-Term Modification: ", seq.getCTerminalModification().getFullId())
8# iterate over all residues and look for modifications
9for aa in seq:
10    if aa.isModified():
11        print(
12            aa.getName(), ":", aa.getMonoWeight(), ":", aa.getModificationName()
13        )
14    else:
15        print(aa.getName(), ":", aa.getMonoWeight())


Which will print:

N-Term Modification:  Pyro-carbamidomethyl (N-term C)
Cysteine : 121.01974995329999
Proline : 115.06332928709999
Lysine : 146.1055284466
Cysteine : 121.01974995329999
Lysine : 154.11972844660002 : Label:13C(6)15N(2)
Cysteine : 121.01974995329999
Arginine : 174.1116764466


### Molecular Formula

We can now combine our knowledge of AASequence with what we learned in about EmpiricalFormula to get accurate mass and isotope distributions from the amino acid sequence. But first, let’s get the formula of peptide:

1seq = AASequence.fromString("DFPIANGER")
2seq_formula = seq.getFormula()
3print("Peptide", seq, "has molecular formula", seq_formula)


### Isotope Patterns

We now want to print the coarse (e.g., peaks only at nominal masses) distribution.

1# print coarse isotope distribution
2coarse_isotopes = seq_formula.getIsotopeDistribution(
3    CoarseIsotopePatternGenerator(6)
4)
5for iso in coarse_isotopes.getContainer():
6    print(
7        "Isotope", iso.getMZ(), "has abundance", iso.getIntensity() * 100, "%"
8    )


For most applications in computational proteomics, the coarse isotope distribution is sufficient. But if we deal with very high resolution instruments, we still might want to calculate the isotopic fine structure. We use the FineIsotopePatternGenerator in OpenMS to reveal these additional peaks:

1# print fine structure of isotope distribution
2fine_isotopes = seq_formula.getIsotopeDistribution(
3    FineIsotopePatternGenerator(0.01)
4)  # max 0.01 unexplained probability
5for iso in fine_isotopes.getContainer():
6    print(
7        "Isotope", iso.getMZ(), "has abundance", iso.getIntensity() * 100, "%"
8    )


And plot the very similar looking distributions using standard matplotlib functionality:

 1import math
2from matplotlib import pyplot as plt
3
4
5def plotIsotopeDistribution(isotope_distribution, title="Isotope distribution"):
6    plt.title(title)
7    distribution = {"mass": [], "abundance": []}
8    for iso in isotope_distribution.getContainer():
9        distribution["mass"].append(iso.getMZ())
10        distribution["abundance"].append(iso.getIntensity() * 100)
11
12    bars = plt.bar(
13        distribution["mass"], distribution["abundance"], width=0.01, snap=False
14    )  # snap ensures that all bars are rendered
15
16    plt.ylim([0, 110])
17    plt.xticks(
18        range(
19            math.ceil(distribution["mass"][0]) - 2,
20            math.ceil(distribution["mass"][-1]) + 2,
21        )
22    )
23    plt.xlabel("Atomic mass (u)")
24    plt.ylabel("Relative abundance (%)")
25
26
27plt.figure(figsize=(10, 7))
28plt.subplot(1, 2, 1)
29plotIsotopeDistribution(coarse_isotopes, "Isotope distribution - coarse")
30plt.subplot(1, 2, 2)
31plotIsotopeDistribution(fine_isotopes, "Isotope distribution - fine structure")
32plt.show()


### Fragment Ions

We can easily calculate different ion types for amino acid sequences:

 1suffix = seq.getSuffix(3)  # y3 ion "GER"
2print("=" * 35)
3print("y3 ion sequence:", suffix)
4y3_formula = suffix.getFormula(Residue.ResidueType.YIon, 2)  # y3++ ion
5suffix.getMonoWeight(Residue.ResidueType.YIon, 2) / 2.0  # CORRECT
6suffix.getMonoWeight(Residue.ResidueType.XIon, 2) / 2.0  # CORRECT
7suffix.getMonoWeight(Residue.ResidueType.BIon, 2) / 2.0  # INCORRECT
8
9print("y3 mz:", suffix.getMonoWeight(Residue.ResidueType.YIon, 2) / 2.0)
10print("y3 molecular formula:", y3_formula)


Which will produce

===================================
y3 ion sequence: GER
y3 mz: 181.09514385
y3 molecular formula: C13H24N6O6


Easy, isn’t it? To generate full theoretical spectra watch out for the more specialized (and faster) TheoreticalSpectrumGenerator which we will take a look at later.

## Modified Sequences

The AASequence class can also handle modifications, modifications are specified using a unique string identifier present in the ModificationsDB in round brackets after the modified amino acid or by providing the mass of the residue in square brackets. For example AASequence.fromString(".DFPIAM(Oxidation)GER.") creates an instance of the peptide “DFPIAMGER” with an oxidized methionine. There are multiple ways to specify modifications, and AASequence.fromString("DFPIAM(UniMod:35)GER"), AASequence.fromString("DFPIAM[+16]GER") and AASequence.fromString("DFPIAM[147]GER") are all equivalent).

 1    seq = AASequence.fromString("PEPTIDESEKUEM(Oxidation)CER")
2    print(seq.toUnmodifiedString())
3    print(seq.toString())
4    print(seq.toUniModString())
5    print(seq.toBracketString())
6    print(seq.toBracketString(False))
7
8    print(AASequence.fromString("DFPIAM(UniMod:35)GER"))
9    print(AASequence.fromString("DFPIAM[+16]GER"))
10    print(AASequence.fromString("DFPIAM[+15.99]GER"))
11    print(AASequence.fromString("DFPIAM[147]GER"))
12    print(AASequence.fromString("DFPIAM[147.035405]GER"))


The above code outputs:

PEPTIDESEKUEMCER
PEPTIDESEKUEM(Oxidation)CER
PEPTIDESEKUEM(UniMod:35)CER
PEPTIDESEKUEM[147]CER
PEPTIDESEKUEM[147.0354000171]CER

DFPIAM(Oxidation)GER
DFPIAM(Oxidation)GER
DFPIAM(Oxidation)GER
DFPIAM(Oxidation)GER
DFPIAM(Oxidation)GER


Note there is a subtle difference between AASequence.fromString(".DFPIAM[+16]GER.") and AASequence.fromString(".DFPIAM[+15.9949]GER.") - while the former will try to find the first modification matching to a mass difference of $$16 \pm 0.5$$, the latter will try to find the closest matching modification to the exact mass. The exact mass approach usually gives the intended results while the first approach may or may not. In all instances, it is better to use an exact description of the desired modification, such as UniMod, instead of mass differences.

N- and C-terminal modifications are represented by brackets to the right of the dots terminating the sequence. For example, ".(Dimethyl)DFPIAMGER." and ".DFPIAMGER.(Label:18O(2))" represent the labelling of the N- and C-terminus respectively, but ".DFPIAMGER(Phospho)." will be interpreted as a phosphorylation of the last arginine at its side chain:

1    s = AASequence.fromString(".(Dimethyl)DFPIAMGER.")
2    print(s, s.hasNTerminalModification())
3    s = AASequence.fromString(".DFPIAMGER.(Label:18O(2))")
4    print(s, s.hasCTerminalModification())
5    s = AASequence.fromString(".DFPIAMGER(Phospho).")
6    print(s, s.hasCTerminalModification())


Arbitrary / unknown amino acids (usually due to an unknown modification) can be specified using tags preceded by $$\ce{X}$$: $$\ce{X[weight]}$$. This indicates a new amino acid (”$$\ce{X}$$”) with the specified weight, e.g. $$\ce{RX[148.5]T}$$. Note that this tag does not alter the amino acids to the left ($$\ce{R}$$) or right ($$\ce{T}$$). Rather, $$\ce{X}$$ represents an amino acid on its own. Be careful when converting such AASequence objects to an EmpiricalFormula using getFormula(), as tags will not be considered in this case (there exists no formula for them). However, they have an influence on getMonoWeight() and getAverageWeight()!

## Proteins and FASTA Files

Protein sequences, can be loaded from and stored in FASTA protein databases using FASTAFile. The example below shows how protein sequences can be stored in FASTA files and loaded back in pyOpenMS:

 1    bsa = FASTAEntry()  # one entry in a FASTA file
2    bsa.sequence = "MKWVTFISLLLLFSSAYSRGVFRRDTHKSEIAHRFKDLGE"
3    bsa.description = "BSA Bovine Albumin (partial sequence)"
4    bsa.identifier = "BSA"
5    alb = FASTAEntry()
6    alb.sequence = "MKWVTFISLLFLFSSAYSRGVFRRDAHKSEVAHRFKDLGE"
7    alb.description = "ALB Human Albumin (partial sequence)"
8    alb.identifier = "ALB"
9
10    entries = [bsa, alb]
11
12    f = FASTAFile()
13    f.store("example.fasta", entries)


Afterwards, the example.fasta file can be read again from the disk:

1    entries = []
2    f = FASTAFile()