Peptide Search ============== In MS-based proteomics, fragment ion mass spectra (:term:`MS2`) are often interpreted by comparing them against a theoretical set of mass spectra generated from a :term:`FASTA` database. OpenMS contains a (simple) implementation of such a "search engine" that compares experimental against theoretical mass spectra generated from an enzymatic or chemical digest of a proteome (e.g. tryptic digest). .. image:: img/database_search_illustration.png SimpleSearch ************ In most proteomics applications, a dedicated search engine (such as Comet, Crux, Mascot, MSGFPlus, MSFragger, Myrimatch, OMSSA, SpectraST or XTandem; all of which are supported by pyOpenMS) will be used to search data. Here, we will use the internal :py:class:`~.SimpleSearchEngineAlgorithm` from OpenMS used for teaching purposes. This makes it very easy to search an (experimental) :term:`mzML` file against a fasta database of protein sequences: .. code-block:: python :linenos: from urllib.request import urlretrieve import pyopenms as oms gh = "https://raw.githubusercontent.com/OpenMS/pyopenms-docs/master" urlretrieve(gh + "/src/data/SimpleSearchEngine_1.mzML", "searchfile.mzML") urlretrieve(gh + "/src/data/SimpleSearchEngine_1.fasta", "search.fasta") protein_ids = [] peptide_ids = [] oms.SimpleSearchEngineAlgorithm().search( "searchfile.mzML", "search.fasta", protein_ids, peptide_ids ) This will print search engine output including the number of peptides and proteins in the database and how many mass spectra were matched to peptides and proteins: .. code-block:: console Peptide statistics unmatched : 0 (0 %) target/decoy: match to target DB only: 3 (100 %) match to decoy DB only : 0 (0 %) match to both : 0 (0 %) :term:`Peptide-Spectrum Match` (:term:`PSM`) Inspection ******************************************************************************* We can now investigate the individual hits as we have done before in the `identification tutorial `_. .. code-block:: python :linenos: for peptide_id in peptide_ids: # Peptide identification values print(35 * "=") print("Peptide ID m/z:", peptide_id.getMZ()) print("Peptide ID rt:", peptide_id.getRT()) print("Peptide scan index:", peptide_id.getMetaValue("scan_index")) print("Peptide scan name:", peptide_id.getMetaValue("scan_index")) print("Peptide ID score type:", peptide_id.getScoreType()) # PeptideHits for hit in peptide_id.getHits(): print(" - Peptide hit rank:", hit.getRank()) print(" - Peptide hit charge:", hit.getCharge()) print(" - Peptide hit sequence:", hit.getSequence()) mz = ( hit.getSequence().getMonoWeight( oms.Residue.ResidueType.Full, hit.getCharge() ) / hit.getCharge() ) print(" - Peptide hit monoisotopic m/z:", mz) print( " - Peptide ppm error:", abs(mz - peptide_id.getMZ()) / mz * 10**6 ) print(" - Peptide hit score:", hit.getScore()) We notice that the second :term:`PSM` was found for the third term:`mass spectrum` in the file for a precursor at :math:`775.38` m/z for the sequence ``RPGADSDIGGFGGLFDLAQAGFR``. .. code-block:: python :linenos: tsg = oms.TheoreticalSpectrumGenerator() thspec = oms.MSSpectrum() p = oms.Param() p.setValue("add_metainfo", "true") tsg.setParameters(p) peptide = oms.AASequence.fromString("RPGADSDIGGFGGLFDLAQAGFR") tsg.getSpectrum(thspec, peptide, 1, 1) # Iterate over annotated ions and their masses for ion, peak in zip(thspec.getStringDataArrays()[0], thspec): print(ion, peak.getMZ()) e = oms.MSExperiment() oms.MzMLFile().load("searchfile.mzML", e) print("Spectrum native id", e[2].getNativeID()) mz, i = e[2].get_peaks() peaks = [(mz, i) for mz, i in zip(mz, i) if i > 1500 and mz > 300] for peak in peaks: print(peak[0], "mz", peak[1], "int") Comparing the theoretical and the experimental mass spectrum for ``RPGADSDIGGFGGLFDLAQAGFR`` we can easily see that the most abundant ions in are :chem:`y8` (:chem:`877.452` m/z), :chem:`b10` (:math:`926.432`), :chem:`y9` (:math:`1024.522` m/z) and :chem:`b13` (:math:`1187.544` m/z). Visualization ************* When loading the ``searchfile.mzML`` into the OpenMS visualization software :term:`TOPPView`, we can convince ourselves that the observed mass spectrum indeed was generated by the peptide ``RPGADSDIGGFGGLFDLAQAGFR`` by loading the corresponding theoretical mass spectrum into the viewer using "Tools"->"Generate theoretical spectrum": .. image:: img/psm.png From our output above, we notice that the second :term:`PSM` at :math:`775.38` m/z for sequence ``RPGADSDIGGFGGLFDLAQAGFR`` was found with an error tolerance of :math:`2.25\ ppm`, therefore if we set the precursor mass tolerance to :math:`4\ ppm\ (\pm 2\ ppm)`, we expect that we will not find the hit at :math:`775.38` m/z any more: .. code-block:: python :linenos: salgo = oms.SimpleSearchEngineAlgorithm() p = salgo.getDefaults() print(p.items()) p[b"precursor:mass_tolerance"] = 4.0 salgo.setParameters(p) protein_ids = [] peptide_ids = [] salgo.search("searchfile.mzML", "search.fasta", protein_ids, peptide_ids) print("Found", len(peptide_ids), "peptides") As we can see, using a smaller precursor mass tolerance leads the algorithm to find only one hit instead of two. Similarly, if we use the wrong enzyme for the digestion (e.g. ``p[b'enzyme'] = "Formic_acid"``), we find no results. More detailed example ********************* Now include some additional decoy database generation step as well as subsequent FDR filtering. .. code-block:: python :linenos: from urllib.request import urlretrieve searchfile = "../../../src/data/BSA1.mzML" searchdb = "../../../src/data/18Protein_SoCe_Tr_detergents_trace.fasta" # generate a protein database with additional decoy sequenes targets = list() decoys = list() oms.FASTAFile().load( searchdb, targets ) # read FASTA file into a list of FASTAEntrys decoy_generator = oms.DecoyGenerator() for entry in targets: rev_entry = oms.FASTAEntry(entry) # copy entry rev_entry.identifier = "DECOY_" + rev_entry.identifier # mark as decoy aas = oms.AASequence().fromString( rev_entry.sequence ) # convert string into amino acid sequence rev_entry.sequence = decoy_generator.reverseProtein( aas ).toString() # reverse decoys.append(rev_entry) target_decoy_database = "search_td.fasta" oms.FASTAFile().store( target_decoy_database, targets + decoys ) # store the database with appended decoy sequences # Run SimpleSearchAlgorithm, store protein and peptide ids protein_ids = [] peptide_ids = [] # set some custom search parameters simplesearch = oms.SimpleSearchEngineAlgorithm() params = simplesearch.getDefaults() score_annot = [b"fragment_mz_error_median_ppm", b"precursor_mz_error_ppm"] params.setValue(b"annotate:PSM", score_annot) params.setValue(b"peptide:max_size", 30) simplesearch.setParameters(params) simplesearch.search(searchfile, target_decoy_database, protein_ids, peptide_ids) # Annotate q-value oms.FalseDiscoveryRate().apply(peptide_ids) # Filter by 1% PSM FDR (q-value < 0.01) idfilter = oms.IDFilter() idfilter.filterHitsByScore(peptide_ids, 0.01) idfilter.removeDecoyHits(peptide_ids) # store PSM-FDR filtered oms.IdXMLFile().store( "searchfile_results_1perc_FDR.idXML", protein_ids, peptide_ids ) However, usually researchers are interested in the most confidently identified proteins. This so called *protein inference* problem is a difficult problem because of often occurring shared/ambiguous peptides. To be able to calculate a target/decoy-based FDR on the protein level, we need to assign scores to proteins first (e.g. based on their observed peptides). This is done by applying one of the available protein inference algorithms on the peptide and protein IDs. .. code-block:: python :linenos: protein_ids = [] peptide_ids = [] # Re-run search since we need to keep decoy hits for inference simplesearch.search(searchfile, target_decoy_database, protein_ids, peptide_ids) # Run inference bpia = oms.BasicProteinInferenceAlgorithm() params = bpia.getDefaults() # FDR with groups currently not supported in pyopenms params.setValue("annotate_indistinguishable_groups", "false") bpia.setParameters(params) bpia.run(peptide_ids, protein_ids) # Annotate q-value on protein level # Removes decoys in default settings oms.FalseDiscoveryRate().apply(protein_ids) # Filter targets by 1% protein FDR (q-value < 0.01) idfilter = oms.IDFilter() idfilter.filterHitsByScore(protein_ids, 0.01) # Restore valid references into the proteins remove_peptides_without_reference = True idfilter.updateProteinReferences( peptide_ids, protein_ids, remove_peptides_without_reference ) # store protein-FDR filtered oms.IdXMLFile().store( "searchfile_results_1perc_protFDR.idXML", protein_ids, peptide_ids )