Filtering Crosslink-Spectrum-Matches and Crosslinks

from pyXLMS import __version__
print(f"Installed pyXLMS version: {__version__}")
✓
Installed pyXLMS version: 1.8.7
from pyXLMS import parser
from pyXLMS import transform

All data transformation functionality - including all filters - is available via the transform submodule. We also import the parser submodule here for reading result files.

parser_result = parser.read(
    "../../data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1.pdResult",
    engine="MS Annika",
    crosslinker="DSS",
)
✓
Reading MS Annika CSMs...: 100%|█████████████████████████████████████████████████████████████████████████████████| 826/826 [00:00<00:00, 8436.90it/s]
Reading MS Annika crosslinks...: 100%|██████████████████████████████████████████████████████████████████████████| 300/300 [00:00<00:00, 15229.12it/s]

We read crosslink-spectrum-matches and crosslinks from a single .pdResult file using the generic parser.

csms = parser_result["crosslink-spectrum-matches"]
xls = parser_result["crosslinks"]

For easier access we assign our crosslink-spectrum-matches to the variable csms and our crosslinks to the variable xls.

csms_by_crosslink_type = transform.filter_crosslink_type(csms)

We can filter crosslink-spectrum-matches by their crosslink type (intra-links and inter-links) by calling transform.filter_crosslink_type() and passing the crosslink-spectrum-matches as the first argument. The function returns a dictionary containing the keys "Intra" and "Inter" with their associated values being lists of intra-link CSMs and inter-link CSMs respectively. You can read more about the filter_crosslink_type() function and all its parameters here: docs.

print(f"Number of intra-link CSMs: {len(csms_by_crosslink_type['Intra'])}")
print(f"Number of inter-link CSMs: {len(csms_by_crosslink_type['Inter'])}")
✓
Number of intra-link CSMs: 803
Number of inter-link CSMs: 23

Our example file contains 803 intra-link CSMs and 23 inter-link CSMs.

Important

Please note that any CSMs without associated protein accessions would be considered inter-links!
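
To make the note above concrete, here is a minimal sketch of how such a grouping could behave. It is not the pyXLMS implementation: it works on plain dictionaries, the helper name split_by_crosslink_type is hypothetical, and it assumes that a match counts as an intra-link only when both peptides share at least one protein accession.

```python
def split_by_crosslink_type(matches):
    """Group CSM-like dicts into intra- and inter-links (illustrative only)."""
    groups = {"Intra": [], "Inter": []}
    for match in matches:
        # Missing or empty accession lists are treated as "no proteins known".
        alpha = set(match.get("alpha_proteins") or [])
        beta = set(match.get("beta_proteins") or [])
        # Assumption: a shared accession makes the match an intra-link.
        key = "Intra" if alpha & beta else "Inter"
        groups[key].append(match)
    return groups


example = [
    {"alpha_proteins": ["Cas9"], "beta_proteins": ["Cas9"]},
    {"alpha_proteins": ["Cas9"], "beta_proteins": ["sp"]},
    {"alpha_proteins": None, "beta_proteins": None},  # no accessions at all
]
groups = split_by_crosslink_type(example)
print({name: len(items) for name, items in groups.items()})
```

Under these assumptions the match without any accessions ends up in "Inter", because two empty sets can never share an accession.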

xls_by_crosslink_type = transform.filter_crosslink_type(xls)

Similarly, we can filter crosslinks by their crosslink type (intra-links and inter-links) by calling transform.filter_crosslink_type() and passing the crosslinks as the first argument. The function returns a dictionary containing the keys "Intra" and "Inter" with their associated values being lists of intra-link crosslinks and inter-link crosslinks respectively. You can read more about the filter_crosslink_type() function and all its parameters here: docs.

print(f"Number of intra-link XLs: {len(xls_by_crosslink_type['Intra'])}")
print(f"Number of inter-link XLs: {len(xls_by_crosslink_type['Inter'])}")
✓
Number of intra-link XLs: 279
Number of inter-link XLs: 21

Our example file contains 279 intra-link crosslinks and 21 inter-link crosslinks.

Important

Please note that any crosslinks without associated protein accessions would be considered inter-links!

Filtering by Peptide Pair

csms_by_peptide_pair = transform.filter_peptide_pair_distribution(csms)

We can filter crosslink-spectrum-matches by their associated peptide pairs by calling transform.filter_peptide_pair_distribution() and passing the crosslink-spectrum-matches as the first argument. The function returns a dictionary containing the peptide pairs as keys with their associated values being lists of the corresponding CSMs. You can read more about the filter_peptide_pair_distribution() function and all its parameters here: docs.

list(csms_by_peptide_pair.keys())[:5]
✓
['GQKNSR:3-GQKNSR:3', 'GQKNSR:3-DECOY_GSQKDR:4', 'SDKNR:3-SDKNR:3', 'DKQSGK:2-DKQSGK:2', 'DKQSGK:2-HSIKK:4']

Here are the first five peptide pairs that were encountered in our crosslink-spectrum-matches. The numbers after the colons denote the positions of the crosslink in the peptide sequence. Also note that decoy peptides are prefixed with a DECOY_ string. You can disable this by calling transform.filter_peptide_pair_distribution(csms, prefix_decoys=False).
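
If you need the individual parts of such a key again, it can be split back into sequences, positions, and decoy flags. The following is a small sketch that relies only on the key format shown in the output above; parse_peptide_pair is a hypothetical helper and not part of pyXLMS.

```python
def parse_peptide_pair(key, decoy_prefix="DECOY_"):
    """Split a key such as 'GQKNSR:3-DECOY_GSQKDR:4' into its two peptides."""
    peptides = []
    for part in key.split("-"):
        # The crosslink position follows the last colon of each part.
        sequence, position = part.rsplit(":", 1)
        is_decoy = sequence.startswith(decoy_prefix)
        if is_decoy:
            sequence = sequence[len(decoy_prefix):]
        peptides.append(
            {"sequence": sequence, "position": int(position), "decoy": is_decoy}
        )
    return peptides


parsed = parse_peptide_pair("GQKNSR:3-DECOY_GSQKDR:4")
for peptide in parsed:
    print(peptide)
```

Splitting on the hyphen is safe here because peptide sequences do not contain hyphens. The same approach would be fragile for keys built from protein accessions that may contain hyphens themselves.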

len(csms_by_peptide_pair["LSKSR:3-MKNYWR:2"])
✓
9

For the peptide pair LSKSR:3-MKNYWR:2 we found 9 crosslink-spectrum-matches in our result…

import random

random_csm = random.choice(csms_by_peptide_pair["LSKSR:3-MKNYWR:2"])
transform.display(random_csm)
✓
Data Type: crosslink-spectrum-match
Completeness: full
Alpha Peptide: LSKSR
Alpha Modifications: {3: ('DSS', 138.06808)}
Alpha Peptide Crosslink Position: 3
Alpha Proteins: ['Cas9']
Alpha Proteins Crosslink Positions: [222]
Alpha Proteins Peptide Positions: [220]
Alpha Peptide Score: 118.07661741329055
Alpha Decoy: False
Beta Peptide: MKNYWR
Beta Modifications: {2: ('DSS', 138.06808)}
Beta Peptide Crosslink Position: 2
Beta Proteins: ['Cas9']
Beta Proteins Crosslink Positions: [884]
Beta Proteins Peptide Positions: [883]
Beta Peptide Score: 109.20912671549812
Beta Decoy: False
Crosslink Type: intra
CSM Score: 109.20912671549812
Spectrum File: XLpeplib_Beveridge_QEx-HFX_DSS_R1.raw
Scan Number: 9072
Precursor Charge: 4
Retention Time: 2832.95832
Ion Mobility/FAIMS CV: 0.0

…and here is a random LSKSR:3-MKNYWR:2 crosslink-spectrum-match.

Filtering by Residue Pair

csms_by_residue_pair = transform.filter_residue_pair_distribution(csms)

We can filter crosslink-spectrum-matches by their associated residue pairs by calling transform.filter_residue_pair_distribution() and passing the crosslink-spectrum-matches as the first argument. The function returns a dictionary containing the residue pairs as keys with their associated values being lists of the corresponding CSMs. You can read more about the filter_residue_pair_distribution() function and all its parameters here: docs.

Important

Please note that this filter requires that alpha_proteins, beta_proteins, alpha_proteins_crosslink_positions, and beta_proteins_crosslink_positions fields are set for all crosslink-spectrum-matches.
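
Before calling the filter it can be useful to check that these four fields are actually populated. The sketch below assumes that each crosslink-spectrum-match can be accessed like a dictionary using the field names from the note above; missing_residue_pair_fields is a hypothetical helper, not a pyXLMS function.

```python
REQUIRED_FIELDS = (
    "alpha_proteins",
    "beta_proteins",
    "alpha_proteins_crosslink_positions",
    "beta_proteins_crosslink_positions",
)


def missing_residue_pair_fields(csm):
    """Return the required fields that are absent, None, or empty."""
    return [field for field in REQUIRED_FIELDS if not csm.get(field)]


complete = {
    "alpha_proteins": ["Cas9"],
    "beta_proteins": ["Cas9"],
    "alpha_proteins_crosslink_positions": [884],
    "beta_proteins_crosslink_positions": [1122],
}
incomplete = {"alpha_proteins": ["Cas9"], "beta_proteins": None}

print(missing_residue_pair_fields(complete))
print(missing_residue_pair_fields(incomplete))
```

Note that this check also reports empty lists as missing, which is a deliberate choice of the sketch and may be stricter than what the library requires.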

list(csms_by_residue_pair.keys())[:5]
✓
['Cas9:779-Cas9:779', 'Cas9:779-DECOY_Cas9:696', 'Cas9:866-Cas9:866', 'Cas9:677-Cas9:677', 'Cas9:48-Cas9:677']

Here are the first five residue pairs that were encountered in our crosslink-spectrum-matches. The numbers after the colons denote the positions of the crosslink in the protein sequence. Also note that decoy proteins are prefixed with a DECOY_ string. You can disable this by calling transform.filter_residue_pair_distribution(csms, prefix_decoys=False).

len(csms_by_residue_pair["Cas9:1122-Cas9:884"])
✓
22

For the residue pair Cas9:1122-Cas9:884 we found 22 crosslink-spectrum-matches in our result…

random_csm = random.choice(csms_by_residue_pair["Cas9:1122-Cas9:884"])
transform.display(random_csm)
✓
Data Type: crosslink-spectrum-match
Completeness: full
Alpha Peptide: MKNYWR
Alpha Modifications: {1: ('Oxidation', 15.994915), 2: ('DSS', 138.06808)}
Alpha Peptide Crosslink Position: 2
Alpha Proteins: ['Cas9']
Alpha Proteins Crosslink Positions: [884]
Alpha Proteins Peptide Positions: [883]
Alpha Peptide Score: 198.27409728492975
Alpha Decoy: False
Beta Peptide: NSDKLIAR
Beta Modifications: {4: ('DSS', 138.06808)}
Beta Peptide Crosslink Position: 4
Beta Proteins: ['Cas9']
Beta Proteins Crosslink Positions: [1122]
Beta Proteins Peptide Positions: [1119]
Beta Peptide Score: 251.1469294734371
Beta Decoy: False
Crosslink Type: intra
CSM Score: 198.27409728492975
Spectrum File: XLpeplib_Beveridge_QEx-HFX_DSS_R1.raw
Scan Number: 11323
Precursor Charge: 4
Retention Time: 3410.44692
Ion Mobility/FAIMS CV: 0.0

…and here is a random Cas9:1122-Cas9:884 crosslink-spectrum-match.
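
Because the returned dictionary maps every residue pair to a list, the number of matches per pair can be tabulated in one pass. This sketch uses a mock dictionary with placeholder keys and placeholder counts instead of real results; count_per_key is a hypothetical helper.

```python
def count_per_key(grouped):
    """Return (key, number of matches) tuples, largest group first."""
    counts = {key: len(items) for key, items in grouped.items()}
    return sorted(counts.items(), key=lambda item: item[1], reverse=True)


# Placeholder data: the keys and list lengths are made up for illustration.
mock_distribution = {
    "P1:10-P1:20": ["csm"] * 3,
    "P1:5-P2:7": ["csm"] * 1,
    "P2:1-P2:9": ["csm"] * 2,
}
ranked = count_per_key(mock_distribution)
for residue_pair, n_matches in ranked:
    print(f"{residue_pair}: {n_matches}")
```

The same helper would work on the dictionaries returned by the peptide pair and protein filters, since they share the key-to-list layout described above.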

Filtering by Protein

csms_by_protein = transform.filter_protein_distribution(csms)

We can filter crosslink-spectrum-matches by their associated proteins by calling transform.filter_protein_distribution() and passing the crosslink-spectrum-matches as the first argument. The function returns a dictionary containing the proteins as keys with their associated values being lists of the corresponding CSMs. You can read more about the filter_protein_distribution() function and all its parameters here: docs.

list(csms_by_protein.keys())
✓
['Cas9', 'sp']

In our example we have crosslink-spectrum-matches from two different proteins (or rather one protein "Cas9" and one protein group "sp" which denotes contaminants).

len(csms_by_protein["Cas9"])
✓
821

In total we have 821 crosslink-spectrum-matches where at least one of the two crosslinked peptides is from Cas9.
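
The "at least one of the two peptides" behaviour can be pictured with a small sketch on plain dictionaries. It is only an illustration under the assumption that a match is listed once under every protein that either peptide maps to; group_by_protein is a hypothetical helper and not the pyXLMS implementation.

```python
from collections import defaultdict


def group_by_protein(matches):
    """List each match once under every protein of its alpha or beta peptide."""
    grouped = defaultdict(list)
    for match in matches:
        proteins = set(match.get("alpha_proteins") or [])
        proteins |= set(match.get("beta_proteins") or [])
        for protein in sorted(proteins):
            grouped[protein].append(match)
    return dict(grouped)


example = [
    {"alpha_proteins": ["Cas9"], "beta_proteins": ["Cas9"]},
    {"alpha_proteins": ["Cas9"], "beta_proteins": ["sp"]},
    {"alpha_proteins": ["sp"], "beta_proteins": ["sp"]},
]
by_protein = group_by_protein(example)
print({protein: len(items) for protein, items in by_protein.items()})
```

A consequence of this grouping is that the per-protein counts can add up to more than the total number of matches, because a match between two different proteins is counted under both.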

xls_by_protein = transform.filter_protein_distribution(xls)

Similarly, we can filter crosslinks by their associated proteins by calling transform.filter_protein_distribution() and passing the crosslinks as the first argument. The function returns a dictionary containing the proteins as keys with their associated values being lists of the corresponding crosslinks. You can read more about the filter_protein_distribution() function and all its parameters here: docs.

list(xls_by_protein.keys())
✓
['Cas9', 'sp']

We also have crosslinks from two different proteins (or rather one protein "Cas9" and one protein group "sp" which denotes contaminants).

len(xls_by_protein["Cas9"])
✓
295

In total we have 295 crosslinks where at least one of the two crosslinked peptides is from Cas9.

csms_cas9 = transform.filter_proteins(csms, proteins=["Cas9"])

If we are only interested in crosslink-spectrum-matches of a specific protein (or set of proteins), we can call the transform.filter_proteins() function and pass the crosslink-spectrum-matches as the first argument. The second argument, proteins, should be a list or set of protein accessions that we are interested in. In this example we are only interested in a single protein, namely "Cas9". You can read more about the filter_proteins() function and all its parameters here: docs.

list(csms_cas9.keys())
✓
['Proteins', 'Both', 'One']

The function returns a dictionary with keys "Proteins", "Both", and "One":

  • "Proteins" allows you to access your original list of proteins that was used for filtering (i.e., what was passed via the proteins parameter).
  • "Both" contains all crosslink-spectrum-matches where both peptides are from one of the specified proteins; in our case both peptides are from Cas9.
  • "One" contains all crosslink-spectrum-matches where only one of the two crosslinked peptides is from one of the specified proteins; in our case from Cas9.

csms_cas9["Proteins"]
✓
['Cas9']

Via "Proteins" we can access our original list of proteins that was used for filtering.
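
The meaning of the three keys can be sketched on plain dictionaries as follows. This is an illustration of the partitioning described in the list above, not the library's code; split_by_proteins_of_interest is a hypothetical helper.

```python
def split_by_proteins_of_interest(matches, proteins):
    """Partition matches by how many peptides map to the given proteins."""
    wanted = set(proteins)
    result = {"Proteins": list(proteins), "Both": [], "One": []}
    for match in matches:
        alpha_hit = bool(wanted & set(match.get("alpha_proteins") or []))
        beta_hit = bool(wanted & set(match.get("beta_proteins") or []))
        if alpha_hit and beta_hit:
            result["Both"].append(match)
        elif alpha_hit or beta_hit:
            result["One"].append(match)
        # Matches without any protein of interest are dropped.
    return result


example = [
    {"alpha_proteins": ["Cas9"], "beta_proteins": ["Cas9"]},
    {"alpha_proteins": ["Cas9"], "beta_proteins": ["sp"]},
    {"alpha_proteins": ["sp"], "beta_proteins": ["sp"]},
]
split = split_by_proteins_of_interest(example, ["Cas9"])
print(split["Proteins"], len(split["Both"]), len(split["One"]))
```

In this sketch "Both" and "One" are disjoint, so their sizes add up to the number of matches that involve a protein of interest at all.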

len(csms_cas9["Both"])
✓
798

Via "Both" we get all crosslink-spectrum-matches where both peptides are of one of the specified proteins of interest (in our case there was only one protein of interest: Cas9).

random_csm = random.choice(csms_cas9["Both"])
transform.display(random_csm)
✓
Data Type: crosslink-spectrum-match
Completeness: full
Alpha Peptide: KDWDPK
Alpha Modifications: {1: ('DSS', 138.06808)}
Alpha Peptide Crosslink Position: 1
Alpha Proteins: ['Cas9']
Alpha Proteins Crosslink Positions: [1128]
Alpha Proteins Peptide Positions: [1128]
Alpha Peptide Score: 28.719759495027432
Alpha Decoy: False
Beta Peptide: MTNFDKNLPNEK
Beta Modifications: {6: ('DSS', 138.06808)}
Beta Peptide Crosslink Position: 6
Beta Proteins: ['Cas9']
Beta Proteins Crosslink Positions: [504]
Beta Proteins Peptide Positions: [499]
Beta Peptide Score: 64.4694591126988
Beta Decoy: False
Crosslink Type: intra
CSM Score: 28.719759495027432
Spectrum File: XLpeplib_Beveridge_QEx-HFX_DSS_R1.raw
Scan Number: 13306
Precursor Charge: 3
Retention Time: 3827.9838
Ion Mobility/FAIMS CV: 0.0

Here is a random example crosslink-spectrum-match from "Both"; as you can see, both peptides are from Cas9.

len(csms_cas9["One"])
✓
23

Via "One" we get all crosslink-spectrum-matches where only one of the two crosslinked peptides was of the specified proteins of interest (in our case there was only one protein of interest: Cas9).

random_csm = random.choice(csms_cas9["One"])
transform.display(random_csm)
✓
Data Type: crosslink-spectrum-match
Completeness: full
Alpha Peptide: KVTVK
Alpha Modifications: {1: ('DSS', 138.06808)}
Alpha Peptide Crosslink Position: 1
Alpha Proteins: ['Cas9']
Alpha Proteins Crosslink Positions: [562]
Alpha Proteins Peptide Positions: [562]
Alpha Peptide Score: 59.078359485843485
Alpha Decoy: False
Beta Peptide: TKIAQLADLVK
Beta Modifications: {2: ('DSS', 138.06808)}
Beta Peptide Crosslink Position: 2
Beta Proteins: ['sp']
Beta Proteins Crosslink Positions: [93]
Beta Proteins Peptide Positions: [92]
Beta Peptide Score: 23.04566358838272
Beta Decoy: True
Crosslink Type: inter
CSM Score: 23.04566358838272
Spectrum File: XLpeplib_Beveridge_QEx-HFX_DSS_R1.raw
Scan Number: 15849
Precursor Charge: 3
Retention Time: 4356.45162
Ion Mobility/FAIMS CV: 0.0

Here is a random example crosslink-spectrum-match from "One"; as you can see, only one of the two peptides is from Cas9.

xls_cas9 = transform.filter_proteins(xls, proteins=["Cas9"])

Similarly, if we are only interested in crosslinks of a specific protein (or set of proteins), we can call the transform.filter_proteins() function and pass the crosslinks as the first argument. The second argument, proteins, should be a list or set of protein accessions that we are interested in. In this example we are only interested in a single protein, namely "Cas9". You can read more about the filter_proteins() function and all its parameters here: docs.

list(xls_cas9.keys())
✓
['Proteins', 'Both', 'One']

The function returns a dictionary with keys "Proteins", "Both", and "One":

  • "Proteins" allows you to access your original list of proteins that was used for filtering (i.e., what was passed via the proteins parameter).
  • "Both" contains all crosslinks where both peptides are from one of the specified proteins; in our case both peptides are from Cas9.
  • "One" contains all crosslinks where only one of the two crosslinked peptides is from one of the specified proteins; in our case from Cas9.

xls_cas9["Proteins"]
✓
['Cas9']

Via "Proteins" we can access our original list of proteins that was used for filtering.

len(xls_cas9["Both"])
✓
274

Via "Both" we get all crosslinks where both peptides are of one of the specified proteins of interest (in our case there was only one protein of interest: Cas9).

random_xl = random.choice(xls_cas9["Both"])
transform.display(random_xl)
✓
Data Type: crosslink
Completeness: full
Alpha Peptide: VLSAYNKHR
Alpha Peptide Crosslink Position: 7
Alpha Proteins: ['Cas9']
Alpha Proteins Crosslink Positions: [1300]
Alpha Decoy: True
Beta Peptide: YKEPLQQR
Beta Peptide Crosslink Position: 2
Beta Proteins: ['Cas9']
Beta Proteins Crosslink Positions: [1023]
Beta Decoy: True
Crosslink Type: intra
Crosslink Score: 12.803290867950926

Here is a random example crosslink from "Both"; as you can see, both peptides are from Cas9.

len(xls_cas9["One"])
✓
21

Via "One" we get all crosslinks where only one of the two crosslinked peptides was of the specified proteins of interest (in our case there was only one protein of interest: Cas9).

random_xl = random.choice(xls_cas9["One"])
transform.display(random_xl)
✓
Data Type: crosslink
Completeness: full
Alpha Peptide: NSDKLIAR
Alpha Peptide Crosslink Position: 4
Alpha Proteins: ['Cas9']
Alpha Proteins Crosslink Positions: [1122]
Alpha Decoy: False
Beta Peptide: TKYNALKTPDK
Beta Peptide Crosslink Position: 2
Beta Proteins: ['sp']
Beta Proteins Crosslink Positions: [145]
Beta Decoy: False
Crosslink Type: inter
Crosslink Score: 30.31260863093563

Here is a random example crosslink from "One"; as you can see, only one of the two peptides is from Cas9.

Filtering by Target-Decoy Type

csms_td = transform.filter_target_decoy(csms)

We can filter crosslink-spectrum-matches by their target-decoy type by calling transform.filter_target_decoy() and passing the crosslink-spectrum-matches as the first argument. The function returns a dictionary containing the keys "Target-Target", "Target-Decoy", and "Decoy-Decoy" with their associated values being lists of the corresponding target-target, target-decoy, and decoy-decoy CSMs respectively. You can read more about the filter_target_decoy() function and all its parameters here: docs.

len(csms_td["Target-Target"])
✓
786

Via "Target-Target" we can access all crosslink-spectrum-matches where both peptides are from the target database.

len(csms_td["Target-Decoy"])
✓
39

Via "Target-Decoy" we can access all crosslink-spectrum-matches where one peptide is from the target database and one peptide is from the decoy database. Therefore both target-decoy and decoy-target matches are contained in "Target-Decoy".

len(csms_td["Decoy-Decoy"])
✓
1

Via "Decoy-Decoy" we can access all crosslink-spectrum-matches where both peptides are from the decoy database.

xls_td = transform.filter_target_decoy(xls)

Similarly, we can filter crosslinks by their target-decoy type by calling transform.filter_target_decoy() and passing the crosslinks as the first argument. The function returns a dictionary containing the keys "Target-Target", "Target-Decoy", and "Decoy-Decoy" with their associated values being lists of the corresponding target-target, target-decoy, and decoy-decoy crosslinks respectively. You can read more about the filter_target_decoy() function and all its parameters here: docs.

len(xls_td["Target-Target"])
✓
265

Via "Target-Target" we can access all crosslinks where both peptides are from the target database.

len(xls_td["Target-Decoy"])
✓
0

Via "Target-Decoy" we can access all crosslinks where one peptide is from the target database and one peptide is from the decoy database. Therefore both target-decoy and decoy-target matches are contained in "Target-Decoy". As you can see, the number of "Target-Decoy" matches is zero for our MS Annika crosslink results, because on the crosslink level MS Annika reports any target-decoy and decoy-target matches as full decoy-decoy matches.

len(xls_td["Decoy-Decoy"])
✓
35

Via "Decoy-Decoy" we can access all crosslinks where both peptides are from the decoy database.

Important

Please note that any crosslink-spectrum-matches or crosslinks with missing target-decoy labels will be filtered out by this function!
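
The three groups and the handling of missing labels can be sketched on plain dictionaries. This is an illustration only: the field names alpha_decoy and beta_decoy are assumptions modelled on the "Alpha Decoy" and "Beta Decoy" entries in the displayed output, and split_by_target_decoy is a hypothetical helper rather than the pyXLMS implementation.

```python
def split_by_target_decoy(matches):
    """Group matches into target-target, target-decoy, and decoy-decoy."""
    groups = {"Target-Target": [], "Target-Decoy": [], "Decoy-Decoy": []}
    for match in matches:
        alpha = match.get("alpha_decoy")
        beta = match.get("beta_decoy")
        if alpha is None or beta is None:
            continue  # a missing label means the match is filtered out
        if alpha and beta:
            groups["Decoy-Decoy"].append(match)
        elif alpha or beta:
            # Covers both target-decoy and decoy-target.
            groups["Target-Decoy"].append(match)
        else:
            groups["Target-Target"].append(match)
    return groups


example = [
    {"alpha_decoy": False, "beta_decoy": False},
    {"alpha_decoy": False, "beta_decoy": True},
    {"alpha_decoy": True, "beta_decoy": False},
    {"alpha_decoy": True, "beta_decoy": True},
    {"alpha_decoy": None, "beta_decoy": False},  # dropped
]
td_groups = split_by_target_decoy(example)
print({name: len(items) for name, items in td_groups.items()})
```

Because unlabeled matches are skipped, the three group sizes can add up to less than the size of the input list.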

Filtering Target Matches Only

Because we are often only interested in target-target matches, there is a shorthand function called transform.targets_only() that returns only target-target matches. In contrast to all previous filter functions, targets_only() accepts a list of crosslink-spectrum-matches, a list of crosslinks, or a parser_result as input (see data type documentation here: docs). The return type will be the same as the input type. You can read more about the targets_only() function and all its parameters here: docs.

csms = transform.targets_only(csms)
print(f"Nr. of TT CSMs: {len(csms)}")
✓
Nr. of TT CSMs: 786

Here is an example of calling targets_only() on a list of crosslink-spectrum-matches: a list of crosslink-spectrum-matches containing only target-target matches is returned.

xls = transform.targets_only(xls)
print(f"Nr. of TT crosslinks: {len(xls)}")
✓
Nr. of TT crosslinks: 265

Here is an example of calling targets_only() on a list of crosslinks: a list of crosslinks containing only target-target matches is returned.

parser_result = transform.targets_only(parser_result)
print(f"Nr. of TT CSMs: {len(parser_result['crosslink-spectrum-matches'])}")
print(f"Nr. of TT crosslinks: {len(parser_result['crosslinks'])}")
✓
Nr. of TT CSMs: 786
Nr. of TT crosslinks: 265

Here is an example of calling targets_only() on a parser_result: a parser_result containing only target-target matches is returned.

Important

Please note that any crosslink-spectrum-matches or crosslinks with missing target-decoy labels will be filtered out by this function!
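
The "same type in, same type out" behaviour can be pictured with a short sketch on plain Python objects. It is not the library's code: keep_targets and is_target_target are hypothetical helpers, the decoy field names are assumptions, and a parser_result is modelled as a dictionary with the two keys used earlier on this page.

```python
def is_target_target(match):
    """True only if both decoy labels are present and False."""
    return match.get("alpha_decoy") is False and match.get("beta_decoy") is False


def keep_targets(data):
    """Filter a list of matches or a parser_result-like dict, keeping its type."""
    if isinstance(data, dict):
        filtered = dict(data)  # shallow copy, the input stays untouched
        for key in ("crosslink-spectrum-matches", "crosslinks"):
            if filtered.get(key) is not None:
                filtered[key] = [m for m in filtered[key] if is_target_target(m)]
        return filtered
    return [m for m in data if is_target_target(m)]


matches = [
    {"alpha_decoy": False, "beta_decoy": False},
    {"alpha_decoy": False, "beta_decoy": True},
    {"alpha_decoy": None, "beta_decoy": False},  # missing label, dropped
]
result_like = {"crosslink-spectrum-matches": matches, "crosslinks": None}

print(len(keep_targets(matches)))
print(len(keep_targets(result_like)["crosslink-spectrum-matches"]))
```

Checking the labels with "is False" rather than plain truthiness is what drops matches with missing labels in this sketch, mirroring the note above.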
