Export to pandas DataFrame#

NOTE: This feature is available only if using a version of pyOpenMS >= 3.0, at the time of writing this means using the nightly builds as described in the Installation Instructions.

In pyOpenMS some data structures can be converted to a tabular format as a pandas.DataFrame. This allows convenient access to data and meta values of spectra, features and identifications.

Required imports for the examples:

1import pyopenms as oms
2import pandas as pd
3from urllib.request import urlretrieve
4
5url = "https://raw.githubusercontent.com/OpenMS/pyopenms-docs/master/src/data/"

MSExperiment#

pyopenms.MSExperiment.get_df( long=False )

Generates a pandas DataFrame with all peaks in the MSExperiment

Parameters:

long : default False

set to True if you want to have a long/expanded/melted dataframe with one row per peak. Faster but replicated RT information. If False, returns rows in the style: rt, np.array(mz), np.array(int)

Returns:

pandas.DataFrame

peak map information stored in a DataFrame

Examples:

1urlretrieve(url + "BSA1.mzML", "BSA1.mzML")
2exp = oms.MSExperiment()
3oms.MzMLFile().load("BSA1.mzML", exp)
4
5df = exp.get_df()  # default: long = False
6df.head(2)
exp.get_df()#

RT

mzarray

intarray

0

1501.41394

[300.0897645621494, 300.18132740129533, 300.20…

[3431.0261, 1181.809, 1516.1746, 1719.8547, 11…

1

1503.03125

[300.06577092599525, 300.08932376441896, 300.2…

[914.79034, 1842.2311, 2395.1025, 851.4738, 16…

1df = exp.get_df(long=True)
2df.head(2)
exp.get_df(long=True)#

RT

mz

inty

0

1501.41394

300.089752

3431.026123

1

1501.41394

300.181335

1181.808960

PeptideIdentification#

pyopenms.peptide_identifications_to_df( peps, decode_ontology=True, default_missing_values={bool: False, int: -9999, float: np.nan, str: ‘’}, export_unidentified=True )

Generates a pandas DataFrame with all peaks in the MSExperiment

Parameters:

peps :

list of PeptideIdentification objects

decode_ontology : default True

if meta values contain CV identifer (e.g., from PSI-MS) they will be automatically decoded into the human readable CV term name.

default_missing_values : default {bool: False, int: -9999, float: np.nan, str: ‘’}

default value for missing values for each data type

export_unidentified : default True

export PeptideIdentifications without PeptideHit

Returns:

pandas.DataFrame

peptide identifications in a DataFrame

Example:

1urlretrieve(url + "small.idXML", "small.idXML")
2prot_ids = []
3pep_ids = []
4oms.IdXMLFile().load("small.idXML", prot_ids, pep_ids)
5
6df = oms.peptide_identifications_to_df(pep_ids)
7df.head(2)
peptide_identifications_to_df(pep_ids)#

id

RT

mz

q-value

charge

protein_accession

start

end

NuXL:z2 mass

NuXL:z3 mass

isotope_error

NuXL:peptide_mass_z0

NuXL:XL_U

NuXL:sequence_score

0

OpenNuXL_2019-12-04T16:39:43_1021782429466859437

900.425415

414.730865

0.368649

4

DECOY_sp|Q86UQ0|ZN589_HUMAN

255

267

828.458069

552.641113

0

1654.901611

0

0.173912

1

OpenNuXL_2019-12-04T16:39:43_7293634134684008928

903.565186

506.259521

0.422779

2

sp|P61313|RL15_HUMAN

179

187

0.0

0.0

0

1010.504639

0

0.290786

FeatureMap#

pyopenms.FeatureMap.get_df( meta_values = None )

Generates a pandas DataFrame with information contained in the FeatureMap.

Optionally the feature meta values and information for the assigned PeptideHit can be exported.

Parameters:

meta_values : default None

meta values to include (None, [custom list of meta value names] or ‘all’)

export_peptide_identifications (bool): default True

Export sequence and score for best PeptideHit assigned to a feature. Additionally the ID_filename (file name of the corresponding ProteinIdentification) and the ID_native_id (spectrum ID of the corresponding Feature) are exported. They are also annotated as meta values when collecting all assigned PeptideIdentifications from a FeatureMap with FeatureMap.get_assigned_peptide_identifications(). A DataFrame from the assigned peptides generated with peptide_identifications_to_df(assigned_peptides) can be merged with the FeatureMap DataFrame with: merged_df = pd.merge(feature_df, assigned_peptide_df, on=[‘feature_id’, ‘ID_native_id’, ‘ID_filename’])

Returns:

pandas.DataFrame

feature information stored in a DataFrame

Examples:

1urlretrieve(url + "BSA1_F1_idmapped.featureXML", "BSA1_F1_idmapped.featureXML")
2feature_map = oms.FeatureMap()
3oms.FeatureXMLFile().load("BSA1_F1_idmapped.featureXML", feature_map)
4
5df = feature_map.get_df()  # default: meta_values = None
6df.head(2)
feature_map.get_df()#

id

peptide_sequence

peptide_score

ID_filename

ID_native_id

charge

RT

mz

RTstart

RTend

mzstart

mzend

quality

intensity

9650885788371886430

LVTDLTK

0.000000

unknown

spectrum=1270

2

1942.600083

395.239277

1932.484009

1950.834351

395.239199

397.245758

0.808494

157572000.0

18416216708636999474

DDSPDLPK

0.034483

unknown

spectrum=1167

2

1749.138335

443.711224

1735.693115

1763.343506

443.711122

445.717531

0.893553

54069300.0

1df = feature_map.get_df(meta_values="all", export_peptide_identifications=False)
2df.head(2)
feature_map.get_df(meta_values = ‘all’, export_peptide_identifications = False)#

id

charge

RT

mz

RTstart

RTend

mzstart

mzend

quality

intensity

FWHM

spectrum_index

spectrum_native_id

label

score_correlation

score_fit

9650885788371886430

2

1942.600083

395.239277

1932.484009

1950.834351

395.239199

397.245758

0.808494

157572000.0

10.061090

259

spectrum=1270

168

0.989969

0.660286

18416216708636999474

2

1749.138335

443.711224

1735.693115

1763.343506

443.71112

445.717531

0.893553

54069300.0

14.156094

156

spectrum=1167

169

0.999002

0.799234

1df = feature_map.get_df(meta_values=[b"FWHM", b"label"])
2df.head(2)
feature_map.get_df(meta_values = [b’FWHM’, b’label’])#

id

charge

RT

mz

RTstart

RTend

mzstart

mzend

quality

intensity

FWHM

label

9650885788371886430

2

1942.600083

395.239277

1932.484009

1950.834351

395.239199

397.245758

0.808494

157572000.0

10.061090

168

18416216708636999474

2

1749.138335

443.711224

1735.693115

1763.343506

443.71112

445.717531

0.893553

54069300.0

14.156094

169

Extract assigned peptide identifications from a feature map

Peptide identifications can be mapped to their corresponding features in a FeatureMap. It is possible to extract them using the function pyopenms.FeatureMap.get_assigned_peptide_identifications() returning a list of PeptideIdentification objects.

pyopenms.FeatureMap.get_assigned_peptide_identifications()

Generates a list with peptide identifications assigned to a feature.

Adds ‘ID_native_id’ (feature spectrum id), ‘ID_filename’ (primary MS run path of corresponding ProteinIdentification) and ‘feature_id’ (unique ID of corresponding Feature) as meta values to the peptide hits. A DataFrame from the assigned peptides generated with peptide_identifications_to_df(assigned_peptides) can be merged with the FeatureMap DataFrame with: merged_df = pd.merge(feature_df, assigned_peptide_df, on=[‘feature_id’, ‘ID_native_id’, ‘ID_filename’])

Returns:

[PeptideIdentification]

list of PeptideIdentification objects

A DataFrame can be created on the resulting list of PeptideIdentification objects using pyopenms.peptide_identifications_to_df(assigned_peptides). Feature map and peptide data frames contain columns, on which they can be merged together to contain the complete information for peptides and features in a single data frame.

The columns for unambiguously merging the data frames:

  • feature_id: the unique feature identifier

  • ID_native_id: the feature spectrum native identifier

  • ID_filename: the filename (primary MS run path) of the corresponding ProteinIdentification

Example:

 1feature_df = feature_map.get_df()
 2assigned_peptides = feature_map.get_assigned_peptide_identifications()
 3assigned_peptide_df = oms.peptide_identifications_to_df(assigned_peptides)
 4
 5merged_df = pd.merge(
 6    feature_df,
 7    assigned_peptide_df,
 8    on=["feature_id", "ID_native_id", "ID_filename"],
 9)
10merged_df.head(2)
consensus_map.get_df()#

feature_id

peptide_sequence

peptide_score

ID_filename

ID_native_id

charge_x

RT_x

mz_x

RTstart

RTend

id

RT_y

mz_y

q-value

charge_y

protein_accession

start

end

OMSSA_score

target_decoy

9650885788371886430

LVTDLTK

0.000000

unknown

spectrum=1270

2

1942.600083

395.239277

1932.484009

1950.834351

OMSSA_2009-11-17T11:11:11_4731105163044641872

1933.405151

395.239349

0.000000

2

P02769|ALBU_BOVIN

-1

-1

0.001084

True

18416216708636999474

DDSPDLPK

0.034483

unknown

spectrum=1167

2

1749.138335

443.711224

1735.693115

1763.343506

OMSSA_2009-11-17T11:11:11_4731105163044641872

1738.033447

443.711243

0.034483

2

P02769|ALBU_BOVIN

-1

-1

0.003951

True

ConsensusMap#

pyopenms.ConsensusMap.get_df()

Generates a pandas DataFrame with both consensus feature meta data and intensities from each sample.

Returns:

pandas.DataFrame

consensus map meta data and intensity stored in pandas DataFrame

pyopenms.ConsensusMap.get_intensity_df()

Generates a pandas DataFrame with feature intensities from each sample in long format (over files).

For labelled analyses channel intensities will be in one row, therefore resulting in a semi-long/block format. Resulting DataFrame can be joined with result from get_metadata_df by their index ‘id’.

Returns:

pandas.DataFrame

intensity DataFrame

pyopenms.ConsensusMap.get_metadata_df()

Generates a pandas DataFrame with feature meta data (sequence, charge, mz, RT, quality).

Resulting DataFrame can be joined with result from get_intensity_df by their index ‘id’.

Returns:

pandas.DataFrame

DataFrame with metadata for each feature (such as: best identified sequence, charge, centroid RT/mz, fitting quality)

Examples:

1urlretrieve(
2    url + "ProteomicsLFQ_1_out.consensusXML", "ProteomicsLFQ_1_out.consensusXML"
3)
4consensus_map = oms.ConsensusMap()
5oms.ConsensusXMLFile().load("ProteomicsLFQ_1_out.consensusXML", consensus_map)
6
7df = consensus_map.get_df()
8df.head(2)
consensus_map.get_df()#

id

sequence

charge

RT

mz

quality

BSA1_F1.mzML

BSA1_F2.mzML

2935923263525422257

DGDIEAEISR

3

1523.370634

368.843773

0.000000

0.0

0.0

10409195546240342212

SHC(Carbamidomethyl)IAEVEK

3

1552.032973

358.174576

0.491247

1358151.0

0.0

1df = consensus_map.get_intensity_df()
2df.head(2)
consensus_map.get_intensity_df()#

id

BSA1_F1.mzML

BSA1_F2.mzML

2935923263525422257

0.0

0.0

10409195546240342212

1358151.0

0.0

1df = consensus_map.get_metadata_df()
2df.head(2)
consensus_map.get_metadata_df()#

id

sequence

charge

RT

mz

quality

2935923263525422257

DGDIEAEISR

3

1523.370634

368.843773

0.000000

10409195546240342212

SHC(Carbamidomethyl)IAEVEK

3

1552.032973

358.174576

0.491247