Scoring Spectra
===============

In the chapter on spectrum alignment we showed how to determine matching peaks between theoretical and experimental spectra.
For many use cases we might actually not be interested in obtaining the list of matched peaks but would like to have
a simple, single score that indicates how "well" the two spectra matched.
The :py:class:`~.HyperScore` is a method to assign a score to :term:`peptide-spectrum matches<peptide-spectrum match>`.


Background
**********

The HyperScore is a scoring scheme first used in the peptide database search engine X!Tandem (Craig,R. and Beavis,R.C. (2003) Rapid Commun. Mass Spectrom., 17, 2310–2316.) to evaluate the quality of peptide-spectrum matches (PSMs). It was later adopted in other peptide search engines or even outside the field of proteomics.

The HyperScore is composed of two parts:

    1. the sum of the log factorials of matched b- and y-ions. Under assumption of a hypergeometric distribution (thus the name HyperScore) it refers to the probability of random matches of fragment ions.
    2. the sum of matched intensities (often refered more generally as the dot product of observed and theoretical intensities). This smaller term is mainly used as a tie-braker added to the log factorials to discriminate between potential many PSMs with same number of matched b and y ions.
    
In OpenMS the :py:class:`~.HyperScore` expects a maximal fragment mass tolerance window, the error unit (m/z or ppm), the observed and the theoretical spectrum (as generated by e.g., :py:class:`~.TheoreticalSpectrumGenerator`). The function then calculates and returns the HyperScore.

.. code-block:: python

    from urllib.request import urlretrieve
    import pyopenms as oms

    gh = "https://raw.githubusercontent.com/OpenMS/pyopenms-docs/master"
    urlretrieve(gh + "/src/data/SimpleSearchEngine_1.mzML", "searchfile.mzML")

Generate a Theoretical Spectrum
*******************************


We now use the :py:class:`~.TheoreticalSpectrumGenerator` to generate a theoretical spectrum for the sequence we are interested in,
``RPGADSDIGGFGGLFDLAQAGFR``, and compare the peaks to a spectra from our file.


.. code-block:: python

    tsg = oms.TheoreticalSpectrumGenerator()
    thspec = oms.MSSpectrum()
    p = oms.Param()
    p.setValue("add_metainfo", "true")
    tsg.setParameters(p)
    peptide = oms.AASequence.fromString("RPGADSDIGGFGGLFDLAQAGFR")
    tsg.getSpectrum(thspec, peptide, 1, 1)
    # Iterate over annotated ions and their masses
    for ion, peak in zip(thspec.getStringDataArrays()[0], thspec):
        print(ion, peak.getMZ())

    e = oms.MSExperiment()
    oms.MzMLFile().load("searchfile.mzML", e)
    spectrum_of_interest = e[2]
    print("Spectrum native id", spectrum_of_interest.getNativeID())
    mz, i = spectrum_of_interest.get_peaks()
    peaks = [(mz, i) for mz, i in zip(mz, i) if i > 1500 and mz > 300]
    for peak in peaks:
        print(peak[0], "mz", peak[1], "int")

Comparing the spectrum and the experimental spectrum for
``RPGADSDIGGFGGLFDLAQAGFR`` we can easily see that the most abundant ions in the
spectrum are :chem:`y8` (:math:`877.452` m/z), :chem:`b10` (:math:`926.432`), :chem:`y9`
(:math:`1024.522` m/z) and :chem:`b13` (:math:`1187.544` m/z).

Getting a Score
***************

We now run :py:class:`~.HyperScore` to compute the similarity of the theoretical spectrum
and the experimental spectrum and print the result

.. code-block:: python

    hscore = oms.HyperScore()
    fragment_mass_tolerance = 5.0
    is_tol_in_ppm = True
    result = hscore.compute(
        fragment_mass_tolerance, is_tol_in_ppm, spectrum_of_interest, thspec
    )
    result

If we didn't know ahead of time which spectrum was a match we can loop through all the spectra from our file,
 calculate scores for all of them, and print the result:

.. code-block:: python

    for f in e:
        score = hscore.compute(fragment_mass_tolerance, is_tol_in_ppm, f, thspec)
        print(f.getNativeID() + ":" + str(score))


.. note::
    In the original publication, an E-value is calculated based on the score distribution p(x), which is derived from a frequency histogram of PSMs per HyperScore bin, denoted as f(x). The total number of PSMs is represented by N. The formula for calculating the score distribution is: :math:`p(x)=\\frac{f(x)}{N}`

    For a discrete stochastic score probability distribution p(x), the so-called survival function represents the probability of having a greater value than x by random matches in a database. The formula for the survival function is:

    :math:`s(x)=P(X>x)=\\sum{X>xp(x)}`

    To estimate the number of PSMs expected to have scores of x or better, one can calculate an E-value :math:`e(x)=ns(x)`
    Here, n represents the number of sequences.

    By ranking each PSM in the output according to its E-value, the significance of individual hits are taken into account.
    This functionality is currently not implemented in OpenMS.