Export to pandas DataFrame#
NOTE: This feature is available only if using a version of pyOpenMS >= 3.0, at the time of writing this means using the nightly builds as described in the Installation Instructions.
In pyOpenMS some data structures can be converted to a tabular format as a pandas.DataFrame
.
This allows convenient access to data and meta values of spectra, features and identifications.
Required imports for the examples:
1import pyopenms as oms
2import pandas as pd
3from urllib.request import urlretrieve
4
5url = "https://raw.githubusercontent.com/OpenMS/pyopenms-docs/master/src/data/"
MSExperiment#
- pyopenms.MSExperiment.get_df( long=False )
Generates a pandas DataFrame with all peaks in the MSExperiment
Parameters:
long : default False
set to True if you want to have a long/expanded/melted dataframe with one row per peak. Faster but replicated RT information. If False, returns rows in the style: rt, np.array(mz), np.array(int)
Returns:
pandas.DataFrame
peak map information stored in a DataFrame
Examples:
1urlretrieve(url + "BSA1.mzML", "BSA1.mzML")
2exp = oms.MSExperiment()
3oms.MzMLFile().load("BSA1.mzML", exp)
4
5df = exp.get_df() # default: long = False
6df.head(2)
RT |
mzarray |
intarray |
|
---|---|---|---|
0 |
1501.41394 |
[300.0897645621494, 300.18132740129533, 300.20… |
[3431.0261, 1181.809, 1516.1746, 1719.8547, 11… |
1 |
1503.03125 |
[300.06577092599525, 300.08932376441896, 300.2… |
[914.79034, 1842.2311, 2395.1025, 851.4738, 16… |
1df = exp.get_df(long=True)
2df.head(2)
RT |
mz |
inty |
|
---|---|---|---|
0 |
1501.41394 |
300.089752 |
3431.026123 |
1 |
1501.41394 |
300.181335 |
1181.808960 |
PeptideIdentification#
- pyopenms.peptide_identifications_to_df( peps, decode_ontology=True, default_missing_values={bool: False, int: -9999, float: np.nan, str: ‘’}, export_unidentified=True )
Generates a pandas DataFrame with all peaks in the MSExperiment
Parameters:
peps :
list of PeptideIdentification objects
decode_ontology : default True
if meta values contain CV identifer (e.g., from PSI-MS) they will be automatically decoded into the human readable CV term name.
default_missing_values : default {bool: False, int: -9999, float: np.nan, str: ‘’}
default value for missing values for each data type
export_unidentified : default True
export PeptideIdentifications without PeptideHit
Returns:
pandas.DataFrame
peptide identifications in a DataFrame
Example:
1urlretrieve(url + "small.idXML", "small.idXML")
2prot_ids = []
3pep_ids = []
4oms.IdXMLFile().load("small.idXML", prot_ids, pep_ids)
5
6df = oms.peptide_identifications_to_df(pep_ids)
7df.head(2)
id |
RT |
mz |
q-value |
charge |
protein_accession |
start |
end |
NuXL:z2 mass |
NuXL:z3 mass |
… |
isotope_error |
NuXL:peptide_mass_z0 |
NuXL:XL_U |
NuXL:sequence_score |
|
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 |
OpenNuXL_2019-12-04T16:39:43_1021782429466859437 |
900.425415 |
414.730865 |
0.368649 |
4 |
DECOY_sp|Q86UQ0|ZN589_HUMAN |
255 |
267 |
828.458069 |
552.641113 |
… |
0 |
1654.901611 |
0 |
0.173912 |
1 |
OpenNuXL_2019-12-04T16:39:43_7293634134684008928 |
903.565186 |
506.259521 |
0.422779 |
2 |
sp|P61313|RL15_HUMAN |
179 |
187 |
0.0 |
0.0 |
… |
0 |
1010.504639 |
0 |
0.290786 |
FeatureMap#
- pyopenms.FeatureMap.get_df( meta_values = None )
Generates a pandas DataFrame with information contained in the FeatureMap.
Optionally the feature meta values and information for the assigned PeptideHit can be exported.
Parameters:
meta_values : default None
meta values to include (None, [custom list of meta value names] or ‘all’)
export_peptide_identifications (bool): default True
Export sequence and score for best PeptideHit assigned to a feature. Additionally the ID_filename (file name of the corresponding ProteinIdentification) and the ID_native_id (spectrum ID of the corresponding Feature) are exported. They are also annotated as meta values when collecting all assigned PeptideIdentifications from a FeatureMap with FeatureMap.get_assigned_peptide_identifications(). A DataFrame from the assigned peptides generated with peptide_identifications_to_df(assigned_peptides) can be merged with the FeatureMap DataFrame with: merged_df = pd.merge(feature_df, assigned_peptide_df, on=[‘feature_id’, ‘ID_native_id’, ‘ID_filename’])
Returns:
pandas.DataFrame
feature information stored in a DataFrame
Examples:
1urlretrieve(url + "BSA1_F1_idmapped.featureXML", "BSA1_F1_idmapped.featureXML")
2feature_map = oms.FeatureMap()
3oms.FeatureXMLFile().load("BSA1_F1_idmapped.featureXML", feature_map)
4
5df = feature_map.get_df() # default: meta_values = None
6df.head(2)
id |
peptide_sequence |
peptide_score |
ID_filename |
ID_native_id |
charge |
RT |
mz |
RTstart |
RTend |
mzstart |
mzend |
quality |
intensity |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9650885788371886430 |
LVTDLTK |
0.000000 |
unknown |
spectrum=1270 |
2 |
1942.600083 |
395.239277 |
1932.484009 |
1950.834351 |
395.239199 |
397.245758 |
0.808494 |
157572000.0 |
18416216708636999474 |
DDSPDLPK |
0.034483 |
unknown |
spectrum=1167 |
2 |
1749.138335 |
443.711224 |
1735.693115 |
1763.343506 |
443.711122 |
445.717531 |
0.893553 |
54069300.0 |
1df = feature_map.get_df(meta_values="all", export_peptide_identifications=False)
2df.head(2)
id |
charge |
RT |
mz |
RTstart |
RTend |
mzstart |
mzend |
quality |
intensity |
FWHM |
spectrum_index |
spectrum_native_id |
label |
score_correlation |
score_fit |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9650885788371886430 |
2 |
1942.600083 |
395.239277 |
1932.484009 |
1950.834351 |
395.239199 |
397.245758 |
0.808494 |
157572000.0 |
10.061090 |
259 |
spectrum=1270 |
168 |
0.989969 |
0.660286 |
18416216708636999474 |
2 |
1749.138335 |
443.711224 |
1735.693115 |
1763.343506 |
443.71112 |
445.717531 |
0.893553 |
54069300.0 |
14.156094 |
156 |
spectrum=1167 |
169 |
0.999002 |
0.799234 |
1df = feature_map.get_df(meta_values=[b"FWHM", b"label"])
2df.head(2)
id |
charge |
RT |
mz |
RTstart |
RTend |
mzstart |
mzend |
quality |
intensity |
FWHM |
label |
---|---|---|---|---|---|---|---|---|---|---|---|
9650885788371886430 |
2 |
1942.600083 |
395.239277 |
1932.484009 |
1950.834351 |
395.239199 |
397.245758 |
0.808494 |
157572000.0 |
10.061090 |
168 |
18416216708636999474 |
2 |
1749.138335 |
443.711224 |
1735.693115 |
1763.343506 |
443.71112 |
445.717531 |
0.893553 |
54069300.0 |
14.156094 |
169 |
Extract assigned peptide identifications from a feature map
Peptide identifications can be mapped to their corresponding features in a FeatureMap
. It is possible to extract them using the function
pyopenms.FeatureMap.get_assigned_peptide_identifications()
returning a list of PeptideIdentification
objects.
- pyopenms.FeatureMap.get_assigned_peptide_identifications()
Generates a list with peptide identifications assigned to a feature.
Adds ‘ID_native_id’ (feature spectrum id), ‘ID_filename’ (primary MS run path of corresponding ProteinIdentification) and ‘feature_id’ (unique ID of corresponding Feature) as meta values to the peptide hits. A DataFrame from the assigned peptides generated with peptide_identifications_to_df(assigned_peptides) can be merged with the FeatureMap DataFrame with: merged_df = pd.merge(feature_df, assigned_peptide_df, on=[‘feature_id’, ‘ID_native_id’, ‘ID_filename’])
Returns:
[PeptideIdentification]
list of PeptideIdentification objects
A DataFrame
can be created on the resulting list of PeptideIdentification objects using
pyopenms.peptide_identifications_to_df(assigned_peptides)
.
Feature map and peptide data frames contain columns, on which they can be merged together to contain the complete
information for peptides and features in a single data frame.
The columns for unambiguously merging the data frames:
feature_id
: the unique feature identifierID_native_id
: the feature spectrum native identifierID_filename
: the filename (primary MS run path) of the corresponding ProteinIdentification
Example:
1feature_df = feature_map.get_df()
2assigned_peptides = feature_map.get_assigned_peptide_identifications()
3assigned_peptide_df = oms.peptide_identifications_to_df(assigned_peptides)
4
5merged_df = pd.merge(
6 feature_df,
7 assigned_peptide_df,
8 on=["feature_id", "ID_native_id", "ID_filename"],
9)
10merged_df.head(2)
feature_id |
peptide_sequence |
peptide_score |
ID_filename |
ID_native_id |
charge_x |
RT_x |
mz_x |
RTstart |
RTend |
… |
id |
RT_y |
mz_y |
q-value |
charge_y |
protein_accession |
start |
end |
OMSSA_score |
target_decoy |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
9650885788371886430 |
LVTDLTK |
0.000000 |
unknown |
spectrum=1270 |
2 |
1942.600083 |
395.239277 |
1932.484009 |
1950.834351 |
… |
OMSSA_2009-11-17T11:11:11_4731105163044641872 |
1933.405151 |
395.239349 |
0.000000 |
2 |
P02769|ALBU_BOVIN |
-1 |
-1 |
0.001084 |
True |
18416216708636999474 |
DDSPDLPK |
0.034483 |
unknown |
spectrum=1167 |
2 |
1749.138335 |
443.711224 |
1735.693115 |
1763.343506 |
… |
OMSSA_2009-11-17T11:11:11_4731105163044641872 |
1738.033447 |
443.711243 |
0.034483 |
2 |
P02769|ALBU_BOVIN |
-1 |
-1 |
0.003951 |
True |
ConsensusMap#
- pyopenms.ConsensusMap.get_df()
Generates a pandas DataFrame with both consensus feature meta data and intensities from each sample.
Returns:
pandas.DataFrame
consensus map meta data and intensity stored in pandas DataFrame
- pyopenms.ConsensusMap.get_intensity_df()
Generates a pandas DataFrame with feature intensities from each sample in long format (over files).
For labelled analyses channel intensities will be in one row, therefore resulting in a semi-long/block format. Resulting DataFrame can be joined with result from get_metadata_df by their index ‘id’.
Returns:
pandas.DataFrame
intensity DataFrame
- pyopenms.ConsensusMap.get_metadata_df()
Generates a pandas DataFrame with feature meta data (sequence, charge, mz, RT, quality).
Resulting DataFrame can be joined with result from get_intensity_df by their index ‘id’.
Returns:
pandas.DataFrame
DataFrame with metadata for each feature (such as: best identified sequence, charge, centroid RT/mz, fitting quality)
Examples:
1urlretrieve(
2 url + "ProteomicsLFQ_1_out.consensusXML", "ProteomicsLFQ_1_out.consensusXML"
3)
4consensus_map = oms.ConsensusMap()
5oms.ConsensusXMLFile().load("ProteomicsLFQ_1_out.consensusXML", consensus_map)
6
7df = consensus_map.get_df()
8df.head(2)
id |
sequence |
charge |
RT |
mz |
quality |
BSA1_F1.mzML |
… |
BSA1_F2.mzML |
---|---|---|---|---|---|---|---|---|
2935923263525422257 |
DGDIEAEISR |
3 |
1523.370634 |
368.843773 |
0.000000 |
0.0 |
… |
0.0 |
10409195546240342212 |
SHC(Carbamidomethyl)IAEVEK |
3 |
1552.032973 |
358.174576 |
0.491247 |
1358151.0 |
… |
0.0 |
1df = consensus_map.get_intensity_df()
2df.head(2)
id |
BSA1_F1.mzML |
… |
BSA1_F2.mzML |
---|---|---|---|
2935923263525422257 |
0.0 |
… |
0.0 |
10409195546240342212 |
1358151.0 |
… |
0.0 |
1df = consensus_map.get_metadata_df()
2df.head(2)
id |
sequence |
charge |
RT |
mz |
quality |
---|---|---|---|---|---|
2935923263525422257 |
DGDIEAEISR |
3 |
1523.370634 |
368.843773 |
0.000000 |
10409195546240342212 |
SHC(Carbamidomethyl)IAEVEK |
3 |
1552.032973 |
358.174576 |
0.491247 |