Identification Data#
In OpenMS, identifications of peptides, proteins, and small molecules are stored
in dedicated data structures. These data structures are typically written to disk
as idXML or mzIdentML files. The highest-level structure is ProteinIdentification.
It stores all identified proteins of an identification run as ProteinHit objects,
plus additional metadata (search parameters, etc.). Each ProteinHit contains the
actual protein accession, an associated score, and (optionally) the protein sequence.
A PeptideIdentification object stores the data corresponding to a single
identified spectrum or feature. It has members for the retention time, m/z, and
a vector of PeptideHit objects. Each PeptideHit stores the information of a
specific peptide-spectrum match, or PSM (e.g., the score and the peptide
sequence). Each PeptideHit also contains a vector of PeptideEvidence objects,
each of which stores a reference to a protein and the position of the peptide
within it. A hit carries more than one PeptideEvidence when the peptide maps to
multiple proteins.
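To make the idea of a peptide's position within a protein concrete, the following plain-Python sketch computes the kind of information a PeptideEvidence holds (start, end, and flanking residues) by searching for a peptide in a protein sequence. It does not use the pyOpenMS API. The helper name `locate_peptide`, the zero-based inclusive positions, and the `[` / `]` placeholders for missing flanking residues are assumptions made for this illustration.

```python
def locate_peptide(protein_sequence, peptide_sequence):
    """Return (start, end, aa_before, aa_after) for the first occurrence of the
    peptide in the protein, or None if the peptide does not occur.

    Positions are zero-based and inclusive (an assumption of this sketch).
    """
    start = protein_sequence.find(peptide_sequence)
    if start == -1:
        return None
    end = start + len(peptide_sequence) - 1
    # Placeholders for a peptide at the very start or end of the protein
    aa_before = protein_sequence[start - 1] if start > 0 else "["
    aa_after = protein_sequence[end + 1] if end + 1 < len(protein_sequence) else "]"
    return start, end, aa_before, aa_after


protein = "PEPTIDERDLQMTQSPSSLSVSVGDRPEPTIDE"
peptide = "DLQMTQSPSSLSVSVGDR"  # unmodified peptide sequence

evidence = locate_peptide(protein, peptide)
print(evidence)  # (8, 25, 'R', 'P')
```

In this toy protein the peptide is preceded by R and followed by P, which is the kind of flanking-residue information a PeptideEvidence records alongside the protein accession.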
Note
Protein IDs are linked to peptide IDs by a common identifier (e.g., a unique string built from the time and date of the search).
The identifier can be set with the setIdentifier() method of both ProteinIdentification and PeptideIdentification.
Similarly, getIdentifier() can be used to check the link between them.
Through this link, one can retrieve the search metadata (which is stored at the protein level) for individual peptide IDs.
Protein Identifier - IdentificationRun1
Peptide Identifier - IdentificationRun1
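The identifier link can be followed with a simple lookup table. The sketch below groups peptide-level objects under their protein-level run, relying only on a getIdentifier() method. So that it runs without pyOpenMS, it uses a hypothetical `StandIn` class in place of ProteinIdentification and PeptideIdentification; the helper name `group_by_run` and the `label` attribute are likewise invented for this illustration.

```python
from collections import defaultdict


def group_by_run(protein_ids, peptide_ids):
    """Map each run identifier to its protein-level object and its peptide IDs.

    Only getIdentifier() is called on the objects passed in.
    """
    runs = {prot.getIdentifier(): prot for prot in protein_ids}
    peptides_per_run = defaultdict(list)
    for pep in peptide_ids:
        identifier = pep.getIdentifier()
        if identifier not in runs:
            raise KeyError(f"no protein-level run for identifier {identifier!r}")
        peptides_per_run[identifier].append(pep)
    return runs, dict(peptides_per_run)


class StandIn:
    """Hypothetical placeholder for an identification object (not pyOpenMS)."""

    def __init__(self, identifier, label):
        self._identifier = identifier
        self.label = label

    def getIdentifier(self):
        return self._identifier


protein_ids = [
    StandIn("IdentificationRun1", "run 1"),
    StandIn("IdentificationRun2", "run 2"),
]
peptide_ids = [
    StandIn("IdentificationRun1", "spectrum A"),
    StandIn("IdentificationRun2", "spectrum B"),
    StandIn("IdentificationRun1", "spectrum C"),
]

runs, peptides_per_run = group_by_run(protein_ids, peptide_ids)
for identifier, peptides in peptides_per_run.items():
    print(identifier, "->", [p.label for p in peptides])
```

With real identification objects, the entry in `runs` for a given identifier would be the place to read run-level metadata such as the search parameters.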
Protein Identification#
We can create an object of type ProteinIdentification and populate it with
ProteinHit objects as follows:
import pyopenms as oms

# Create new protein identification object corresponding to a single search
protein_id = oms.ProteinIdentification()
protein_id.setIdentifier("IdentificationRun1")

# Each ProteinIdentification object stores a vector of protein hits
protein_hit = oms.ProteinHit()
protein_hit.setAccession("sp|MyAccession")
protein_hit.setSequence("PEPTIDERDLQMTQSPSSLSVSVGDRPEPTIDE")
protein_hit.setScore(1.0)
protein_hit.setMetaValue("target_decoy", b"target")  # it's a target protein

protein_id.setHits([protein_hit])
We have now added a single ProteinHit with the accession sp|MyAccession to
the ProteinIdentification object (note how the final line passes a list of
size 1 directly to setHits()). We can continue to add metadata for the whole
identification run (such as search parameters):
now = oms.DateTime.now()
date_string = now.getDate()
protein_id.setDateTime(now)

# Example of possible search parameters
search_parameters = oms.SearchParameters()  # ProteinIdentification::SearchParameters
search_parameters.db = "database"
search_parameters.charges = "+2"
protein_id.setSearchParameters(search_parameters)

# Some search engine meta data
protein_id.setSearchEngineVersion("v1.0.0")
protein_id.setSearchEngine("SearchEngine")
protein_id.setScoreType("HyperScore")

# Iterate over all protein hits
for hit in protein_id.getHits():
    print("Protein hit accession:", hit.getAccession())
    print("Protein hit sequence:", hit.getSequence())
    print("Protein hit score:", hit.getScore())
PeptideIdentification#
Next, we can also create a PeptideIdentification object and add
corresponding PeptideHit objects:
peptide_id = oms.PeptideIdentification()

peptide_id.setRT(1243.56)
peptide_id.setMZ(440.0)
peptide_id.setScoreType("ScoreType")
# The best hit below has the highest score, so higher scores are better here
peptide_id.setHigherScoreBetter(True)
peptide_id.setIdentifier("IdentificationRun1")

# define additional meta value for the peptide identification
peptide_id.setMetaValue("AdditionalMetaValue", "Value")

# create a new PeptideHit (best PSM, best score)
peptide_hit = oms.PeptideHit()
peptide_hit.setScore(1.0)
peptide_hit.setRank(1)
peptide_hit.setCharge(2)
peptide_hit.setSequence(oms.AASequence.fromString("DLQM(Oxidation)TQSPSSLSVSVGDR"))

ev = oms.PeptideEvidence()
ev.setProteinAccession("sp|MyAccession")
ev.setAABefore(b"R")
ev.setAAAfter(b"P")
ev.setStart(123)  # start and end position in the protein
ev.setEnd(141)
peptide_hit.setPeptideEvidences([ev])

# create a new PeptideHit (second best PSM, lower score)
peptide_hit2 = oms.PeptideHit()
peptide_hit2.setScore(0.5)
peptide_hit2.setRank(2)
peptide_hit2.setCharge(2)
peptide_hit2.setSequence(oms.AASequence.fromString("QDLMTQSPSSLSVSVGDR"))
peptide_hit2.setPeptideEvidences([ev])

# add both PeptideHits to the PeptideIdentification
peptide_id.setHits([peptide_hit, peptide_hit2])
This allows us to represent a single spectrum (a PeptideIdentification at m/z
\(440.0\) and RT \(1243.56\)) with possible identifications that are ranked by score.
In this case, two candidate peptides match the spectrum; they differ in the
order of their first three amino acids (“DLQ” vs. “QDL”).
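Which hit counts as the best one depends on whether a higher or a lower score is better for the score type in use. The plain-Python sketch below shows how that flag changes the ranking of the two example peptides. It is not pyOpenMS's own ranking code; the helper `rank_hits` and the representation of hits as (sequence, score) tuples are assumptions made for illustration.

```python
def rank_hits(hits, higher_score_better):
    """Return (rank, sequence, score) tuples, with rank 1 for the best hit.

    `hits` is a list of (sequence, score) tuples.
    """
    ordered = sorted(hits, key=lambda hit: hit[1], reverse=higher_score_better)
    return [(rank, seq, score) for rank, (seq, score) in enumerate(ordered, start=1)]


hits = [
    ("QDLMTQSPSSLSVSVGDR", 0.5),
    ("DLQM(Oxidation)TQSPSSLSVSVGDR", 1.0),
]

# Higher score is better: the hit with score 1.0 is ranked first
ranked_high = rank_hits(hits, higher_score_better=True)
# Lower score is better (e.g., an e-value-like score): score 0.5 is ranked first
ranked_low = rank_hits(hits, higher_score_better=False)

for rank, seq, score in ranked_high:
    print(rank, seq, score)
```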
We can now display the peptides we just stored:
# Iterate over PeptideIdentification
peptide_ids = [peptide_id]
for peptide_id in peptide_ids:
    # Peptide identification values
    print("Peptide ID m/z:", peptide_id.getMZ())
    print("Peptide ID rt:", peptide_id.getRT())
    print("Peptide ID score type:", peptide_id.getScoreType())
    # PeptideHits
    for hit in peptide_id.getHits():
        print(" - Peptide hit rank:", hit.getRank())
        print(" - Peptide hit sequence:", hit.getSequence())
        print(" - Peptide hit score:", hit.getScore())
        print(
            " - Mapping to proteins:",
            [ev.getProteinAccession() for ev in hit.getPeptideEvidences()],
        )
Storage on Disk#
Finally, we can store the peptide and protein identification data in an
idXML file (an OpenMS-internal file format, which we have discussed previously)
using the IdXMLFile class, as follows:
# Store the identification data in an idXML file
oms.IdXMLFile().store("out.idXML", [protein_id], peptide_ids)
# and load it back into memory
prot_ids = []
pep_ids = []
oms.IdXMLFile().load("out.idXML", prot_ids, pep_ids)

# Iterate over all protein hits
for protein_id in prot_ids:
    for hit in protein_id.getHits():
        print("Protein hit accession:", hit.getAccession())
        print("Protein hit sequence:", hit.getSequence())
        print("Protein hit score:", hit.getScore())
        print("Protein hit target/decoy:", hit.getMetaValue("target_decoy"))

# Iterate over PeptideIdentification
for peptide_id in pep_ids:
    # Peptide identification values
    print("Peptide ID m/z:", peptide_id.getMZ())
    print("Peptide ID rt:", peptide_id.getRT())
    print("Peptide ID score type:", peptide_id.getScoreType())
    # PeptideHits
    for hit in peptide_id.getHits():
        print(" - Peptide hit rank:", hit.getRank())
        print(" - Peptide hit sequence:", hit.getSequence())
        print(" - Peptide hit score:", hit.getScore())
        print(
            " - Mapping to proteins:",
            [ev.getProteinAccession() for ev in hit.getPeptideEvidences()],
        )
You can inspect the out.idXML XML file produced here, and you will find a
ProteinHit entry for the protein that we stored and two PeptideHit entries for
the two peptides stored on disk.
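One way to check this without opening the file by hand is to count the hit elements with Python's standard XML parser. The sketch below runs on a small hand-written fragment that only mimics the ProteinHit and PeptideHit element names mentioned above. The surrounding elements and all attribute names in the fragment are assumptions and do not reproduce the full idXML schema; the helper `count_hits` is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# Reduced, hand-written stand-in for an idXML file (not real idXML output)
FRAGMENT = """<IdXML>
  <IdentificationRun>
    <ProteinIdentification>
      <ProteinHit accession="sp|MyAccession" score="1.0"/>
    </ProteinIdentification>
    <PeptideIdentification>
      <PeptideHit sequence="DLQM(Oxidation)TQSPSSLSVSVGDR" score="1.0"/>
      <PeptideHit sequence="QDLMTQSPSSLSVSVGDR" score="0.5"/>
    </PeptideIdentification>
  </IdentificationRun>
</IdXML>"""


def count_hits(root):
    """Count ProteinHit and PeptideHit elements anywhere below the root."""
    return {
        tag: sum(1 for _ in root.iter(tag))
        for tag in ("ProteinHit", "PeptideHit")
    }


counts = count_hits(ET.fromstring(FRAGMENT))
print(counts)  # {'ProteinHit': 1, 'PeptideHit': 2}
```

To apply the same check to the file written above, the root could be obtained with `ET.parse("out.idXML").getroot()`, assuming the elements in the real file carry no XML namespace prefix.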