Filtering Crosslink-Spectrum-Matches and Crosslinks
from pyXLMS import __version__
print(f"Installed pyXLMS version: {__version__}")
Installed pyXLMS version: 1.8.7
from pyXLMS import parser
from pyXLMS import transform
All data transformation functionality, including all filters, is available via the transform submodule. We also import the parser submodule here for reading result files.
parser_result = parser.read(
"../../data/ms_annika/XLpeplib_Beveridge_QEx-HFX_DSS_R1.pdResult",
engine="MS Annika",
crosslinker="DSS",
)
Reading MS Annika CSMs...: 100%|██████████| 826/826 [00:00<00:00, 8436.90it/s]
Reading MS Annika crosslinks...: 100%|██████████| 300/300 [00:00<00:00, 15229.12it/s]
We read crosslink-spectrum-matches and crosslinks using the generic parser from a single .pdResult file.
csms = parser_result["crosslink-spectrum-matches"]
xls = parser_result["crosslinks"]
For easier access, we assign our crosslink-spectrum-matches to the variable csms and our crosslinks to the variable xls.
Filtering by Crosslink Type
csms_by_crosslink_type = transform.filter_crosslink_type(csms)
We can filter crosslink-spectrum-matches by their crosslink type (intra-links and inter-links) by calling transform.filter_crosslink_type() and passing the crosslink-spectrum-matches as the first argument. The function returns a dictionary with the keys "Intra" and "Inter", whose values are lists of intra-link CSMs and inter-link CSMs respectively. You can read more about the filter_crosslink_type() function and all its parameters here: docs.
print(f"Number of intra-link CSMs: {len(csms_by_crosslink_type['Intra'])}")
print(f"Number of inter-link CSMs: {len(csms_by_crosslink_type['Inter'])}")
Number of intra-link CSMs: 803
Number of inter-link CSMs: 23
Our example file contains 803 intra-link CSMs and 23 inter-link CSMs.
Please note that any CSMs without associated protein accessions would be considered inter-links!
xls_by_crosslink_type = transform.filter_crosslink_type(xls)
Similarly, we can filter crosslinks by their crosslink type (intra-links and inter-links) by calling transform.filter_crosslink_type() and passing the crosslinks as the first argument. The function returns a dictionary with the keys "Intra" and "Inter", whose values are lists of intra-link crosslinks and inter-link crosslinks respectively. You can read more about the filter_crosslink_type() function and all its parameters here: docs.
print(f"Number of intra-link XLs: {len(xls_by_crosslink_type['Intra'])}")
print(f"Number of inter-link XLs: {len(xls_by_crosslink_type['Inter'])}")
Number of intra-link XLs: 279
Number of inter-link XLs: 21
Our example file contains 279 intra-link crosslinks and 21 inter-link crosslinks.
Please note that any crosslinks without associated protein accessions would be considered inter-links!
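Since every filter on this page returns a dictionary that maps group names to lists, a small helper can print the group sizes in one go. The following is a minimal sketch on toy data: summarize_groups() and toy_result are hypothetical names introduced here for illustration and are not part of pyXLMS.

```python
from typing import Any, Dict, List


def summarize_groups(groups: Dict[str, List[Any]]) -> Dict[str, int]:
    """Return the number of items stored under each key of a filter result."""
    return {key: len(items) for key, items in groups.items()}


# Toy stand-in for a dictionary such as the one returned by a filter function.
# The items are placeholders, not real crosslink-spectrum-matches.
toy_result = {
    "Intra": [{"id": 1}, {"id": 2}, {"id": 3}],
    "Inter": [{"id": 4}],
}

summary = summarize_groups(toy_result)
print(summary)
```

With real data the same helper could be called on csms_by_crosslink_type or xls_by_crosslink_type instead of toy_result.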
Filtering by Peptide Pair
csms_by_peptide_pair = transform.filter_peptide_pair_distribution(csms)
We can filter crosslink-spectrum-matches by their associated peptide pairs by calling transform.filter_peptide_pair_distribution() and passing the crosslink-spectrum-matches as the first argument. The function returns a dictionary with the peptide pairs as keys and lists of the corresponding CSMs as values. You can read more about the filter_peptide_pair_distribution() function and all its parameters here: docs.
list(csms_by_peptide_pair.keys())[:5]
['GQKNSR:3-GQKNSR:3',
'GQKNSR:3-DECOY_GSQKDR:4',
'SDKNR:3-SDKNR:3',
'DKQSGK:2-DKQSGK:2',
'DKQSGK:2-HSIKK:4']
Here are the first five peptide pairs that were encountered in our crosslink-spectrum-matches. The numbers after the colons denote the positions of the crosslink in the peptide sequence. Also note that decoy peptides are prefixed with a DECOY_ string; you can disable this by calling transform.filter_peptide_pair_distribution(csms, prefix_decoys=False).
len(csms_by_peptide_pair["LSKSR:3-MKNYWR:2"])
9
For the peptide pair LSKSR:3-MKNYWR:2 we found 9 crosslink-spectrum-matches in our result…
import random
random_csm = random.choice(csms_by_peptide_pair["LSKSR:3-MKNYWR:2"])
transform.display(random_csm)
Data Type: crosslink-spectrum-match
Completeness: full
Alpha Peptide: LSKSR
Alpha Modifications: {3: ('DSS', 138.06808)}
Alpha Peptide Crosslink Position: 3
Alpha Proteins: ['Cas9']
Alpha Proteins Crosslink Positions: [222]
Alpha Proteins Peptide Positions: [220]
Alpha Peptide Score: 118.07661741329055
Alpha Decoy: False
Beta Peptide: MKNYWR
Beta Modifications: {2: ('DSS', 138.06808)}
Beta Peptide Crosslink Position: 2
Beta Proteins: ['Cas9']
Beta Proteins Crosslink Positions: [884]
Beta Proteins Peptide Positions: [883]
Beta Peptide Score: 109.20912671549812
Beta Decoy: False
Crosslink Type: intra
CSM Score: 109.20912671549812
Spectrum File: XLpeplib_Beveridge_QEx-HFX_DSS_R1.raw
Scan Number: 9072
Precursor Charge: 4
Retention Time: 2832.95832
Ion Mobility/FAIMS CV: 0.0
…and here is a random LSKSR:3-MKNYWR:2 crosslink-spectrum-match.
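Because the peptide pair dictionary maps each pair to a list of CSMs, the most frequently identified pairs can be found by sorting the keys by list length. This is a minimal sketch on toy data: largest_groups() and the peptide pair keys below are made up for illustration and are not part of pyXLMS or the example file.

```python
from typing import Any, Dict, List, Tuple


def largest_groups(groups: Dict[str, List[Any]], n: int = 3) -> List[Tuple[str, int]]:
    """Return the n keys with the most items, together with their counts."""
    counts = [(key, len(items)) for key, items in groups.items()]
    # sorted() is stable, so keys with equal counts keep their original order
    return sorted(counts, key=lambda pair: pair[1], reverse=True)[:n]


# Toy stand-in for a peptide pair distribution (hypothetical keys and items)
toy_pairs = {
    "AAKR:3-CCKR:3": [{"scan": 1}, {"scan": 2}],
    "AAKR:3-DDKR:3": [{"scan": 3}, {"scan": 4}, {"scan": 5}, {"scan": 6}, {"scan": 7}],
    "CCKR:3-CCKR:3": [{"scan": 8}],
}

top = largest_groups(toy_pairs, n=2)
print(top)
```

With real data, passing csms_by_peptide_pair instead of toy_pairs should give the peptide pairs supported by the most CSMs.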
Filtering by Residue Pair
csms_by_residue_pair = transform.filter_residue_pair_distribution(csms)
We can filter crosslink-spectrum-matches by their associated residue pairs by calling transform.filter_residue_pair_distribution() and passing the crosslink-spectrum-matches as the first argument. The function returns a dictionary with the residue pairs as keys and lists of the corresponding CSMs as values. You can read more about the filter_residue_pair_distribution() function and all its parameters here: docs.
Please note that this filter requires that alpha_proteins, beta_proteins, alpha_proteins_crosslink_positions, and beta_proteins_crosslink_positions fields are set for all crosslink-spectrum-matches.
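One way to check this requirement before filtering is to look for unset fields in each crosslink-spectrum-match. The sketch below is an assumption-laden illustration: it assumes that CSMs behave like dictionaries and that unset fields are stored as None, and missing_residue_pair_fields() is a hypothetical helper, not a pyXLMS function. Only the four field names are taken from the note above.

```python
from typing import Any, Dict, List

# Field names as listed in the note above
REQUIRED_FIELDS = (
    "alpha_proteins",
    "beta_proteins",
    "alpha_proteins_crosslink_positions",
    "beta_proteins_crosslink_positions",
)


def missing_residue_pair_fields(csm: Dict[str, Any]) -> List[str]:
    """Return the required fields that are absent or None (assumed to mean 'unset')."""
    return [field for field in REQUIRED_FIELDS if csm.get(field) is None]


# Toy records with only the fields relevant here (not real pyXLMS objects)
complete = {
    "alpha_proteins": ["Cas9"],
    "beta_proteins": ["Cas9"],
    "alpha_proteins_crosslink_positions": [884],
    "beta_proteins_crosslink_positions": [1122],
}
incomplete = {
    "alpha_proteins": ["Cas9"],
    "beta_proteins": None,
    "alpha_proteins_crosslink_positions": [884],
    "beta_proteins_crosslink_positions": None,
}

print(missing_residue_pair_fields(complete))
print(missing_residue_pair_fields(incomplete))
```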
list(csms_by_residue_pair.keys())[:5]
['Cas9:779-Cas9:779',
'Cas9:779-DECOY_Cas9:696',
'Cas9:866-Cas9:866',
'Cas9:677-Cas9:677',
'Cas9:48-Cas9:677']
Here are the first five residue pairs that were encountered in our crosslink-spectrum-matches. The numbers after the colons denote the positions of the crosslink in the protein sequence. Also note that decoy proteins are prefixed with a DECOY_ string; you can disable this by calling transform.filter_residue_pair_distribution(csms, prefix_decoys=False).
len(csms_by_residue_pair["Cas9:1122-Cas9:884"])
22
For the residue pair Cas9:1122-Cas9:884 we found 22 crosslink-spectrum-matches in our result…
random_csm = random.choice(csms_by_residue_pair["Cas9:1122-Cas9:884"])
transform.display(random_csm)
Data Type: crosslink-spectrum-match
Completeness: full
Alpha Peptide: MKNYWR
Alpha Modifications: {1: ('Oxidation', 15.994915), 2: ('DSS', 138.06808)}
Alpha Peptide Crosslink Position: 2
Alpha Proteins: ['Cas9']
Alpha Proteins Crosslink Positions: [884]
Alpha Proteins Peptide Positions: [883]
Alpha Peptide Score: 198.27409728492975
Alpha Decoy: False
Beta Peptide: NSDKLIAR
Beta Modifications: {4: ('DSS', 138.06808)}
Beta Peptide Crosslink Position: 4
Beta Proteins: ['Cas9']
Beta Proteins Crosslink Positions: [1122]
Beta Proteins Peptide Positions: [1119]
Beta Peptide Score: 251.1469294734371
Beta Decoy: False
Crosslink Type: intra
CSM Score: 198.27409728492975
Spectrum File: XLpeplib_Beveridge_QEx-HFX_DSS_R1.raw
Scan Number: 11323
Precursor Charge: 4
Retention Time: 3410.44692
Ion Mobility/FAIMS CV: 0.0
…and here is a random Cas9:1122-Cas9:884 crosslink-spectrum-match.
Filtering by Protein
csms_by_protein = transform.filter_protein_distribution(csms)
We can filter crosslink-spectrum-matches by their associated proteins by calling transform.filter_protein_distribution() and passing the crosslink-spectrum-matches as the first argument. The function returns a dictionary with the proteins as keys and lists of the corresponding CSMs as values. You can read more about the filter_protein_distribution() function and all its parameters here: docs.
list(csms_by_protein.keys())
['Cas9', 'sp']
In our example we have crosslink-spectrum-matches from two different proteins (or rather one protein "Cas9" and one protein group "sp" which denotes contaminants).
len(csms_by_protein["Cas9"])
821
In total we have 821 crosslink-spectrum-matches where at least one of the two crosslinked peptides is from Cas9.
xls_by_protein = transform.filter_protein_distribution(xls)
Similarly, we can filter crosslinks by their associated proteins by calling transform.filter_protein_distribution() and passing the crosslinks as the first argument. The function returns a dictionary with the proteins as keys and lists of the corresponding crosslinks as values. You can read more about the filter_protein_distribution() function and all its parameters here: docs.
list(xls_by_protein.keys())
['Cas9', 'sp']
We also have crosslinks from two different proteins (or rather one protein "Cas9" and one protein group "sp" which denotes contaminants).
len(xls_by_protein["Cas9"])
295
In total we have 295 crosslinks where at least one of the two crosslinked peptides is from Cas9.
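Because a group holds every match in which at least one peptide is from that protein, a match between two different proteins presumably appears under both keys, so the group sizes need not add up to the total number of matches. The sketch below illustrates this on toy data: count_distinct() and the toy records are hypothetical and not part of pyXLMS.

```python
from typing import Any, Dict, List


def count_distinct(groups: Dict[str, List[Any]]) -> int:
    """Count distinct objects across all groups (by identity, not by value)."""
    return len({id(item) for items in groups.values() for item in items})


# Toy records: b links a Cas9 peptide to an sp peptide and is listed twice
a = {"alpha_proteins": ["Cas9"], "beta_proteins": ["Cas9"]}
b = {"alpha_proteins": ["Cas9"], "beta_proteins": ["sp"]}
c = {"alpha_proteins": ["sp"], "beta_proteins": ["sp"]}
toy_by_protein = {"Cas9": [a, b], "sp": [b, c]}

summed = sum(len(items) for items in toy_by_protein.values())
distinct = count_distinct(toy_by_protein)
print(f"Sum of group sizes: {summed}, distinct matches: {distinct}")
```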
Getting Crosslink-Spectrum-Matches or Crosslinks of Specific Proteins
csms_cas9 = transform.filter_proteins(csms, proteins=["Cas9"])
If we are only interested in crosslink-spectrum-matches of a specific protein (or set of proteins), we can investigate them further by calling the transform.filter_proteins() function and passing the crosslink-spectrum-matches as the first argument. The second argument proteins should be a list or set of the protein accessions that we are interested in. In this example we are only interested in a single protein, namely "Cas9". You can read more about the filter_proteins() function and all its parameters here: docs.
list(csms_cas9.keys())
['Proteins', 'Both', 'One']
The function returns a dictionary with keys "Proteins", "Both", and "One":
"Proteins" allows you to access your original list of proteins that was used for filtering (i.e., what was passed via the proteins parameter).
"Both" contains all crosslink-spectrum-matches where both peptides were of one of the specified proteins, in our case both peptides are from Cas9.
"One" contains all crosslink-spectrum-matches where only one of the two crosslinked peptides was of the specified proteins, in our case from Cas9.
csms_cas9["Proteins"]
['Cas9']
Via "Proteins" we can access our original list of proteins that was used for filtering.
len(csms_cas9["Both"])
798
Via "Both" we get all crosslink-spectrum-matches where both peptides are of one of the specified proteins of interest (in our case there was only one protein of interest: Cas9).
random_csm = random.choice(csms_cas9["Both"])
transform.display(random_csm)
Data Type: crosslink-spectrum-match
Completeness: full
Alpha Peptide: KDWDPK
Alpha Modifications: {1: ('DSS', 138.06808)}
Alpha Peptide Crosslink Position: 1
Alpha Proteins: ['Cas9']
Alpha Proteins Crosslink Positions: [1128]
Alpha Proteins Peptide Positions: [1128]
Alpha Peptide Score: 28.719759495027432
Alpha Decoy: False
Beta Peptide: MTNFDKNLPNEK
Beta Modifications: {6: ('DSS', 138.06808)}
Beta Peptide Crosslink Position: 6
Beta Proteins: ['Cas9']
Beta Proteins Crosslink Positions: [504]
Beta Proteins Peptide Positions: [499]
Beta Peptide Score: 64.4694591126988
Beta Decoy: False
Crosslink Type: intra
CSM Score: 28.719759495027432
Spectrum File: XLpeplib_Beveridge_QEx-HFX_DSS_R1.raw
Scan Number: 13306
Precursor Charge: 3
Retention Time: 3827.9838
Ion Mobility/FAIMS CV: 0.0
Here is a random example crosslink-spectrum-match from "Both"; as you can see, both peptides are from Cas9.
len(csms_cas9["One"])
23
Via "One" we get all crosslink-spectrum-matches where only one of the two crosslinked peptides was of the specified proteins of interest (in our case there was only one protein of interest: Cas9).
random_csm = random.choice(csms_cas9["One"])
transform.display(random_csm)
Data Type: crosslink-spectrum-match
Completeness: full
Alpha Peptide: KVTVK
Alpha Modifications: {1: ('DSS', 138.06808)}
Alpha Peptide Crosslink Position: 1
Alpha Proteins: ['Cas9']
Alpha Proteins Crosslink Positions: [562]
Alpha Proteins Peptide Positions: [562]
Alpha Peptide Score: 59.078359485843485
Alpha Decoy: False
Beta Peptide: TKIAQLADLVK
Beta Modifications: {2: ('DSS', 138.06808)}
Beta Peptide Crosslink Position: 2
Beta Proteins: ['sp']
Beta Proteins Crosslink Positions: [93]
Beta Proteins Peptide Positions: [92]
Beta Peptide Score: 23.04566358838272
Beta Decoy: True
Crosslink Type: inter
CSM Score: 23.04566358838272
Spectrum File: XLpeplib_Beveridge_QEx-HFX_DSS_R1.raw
Scan Number: 15849
Precursor Charge: 3
Retention Time: 4356.45162
Ion Mobility/FAIMS CV: 0.0
Here is a random example crosslink-spectrum-match from "One"; as you can see, only one of the two peptides is from Cas9.
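The difference between "Both" and "One" can be illustrated with a small re-implementation on toy data. This is only a sketch of the behaviour described above, not the pyXLMS implementation: split_by_proteins() is a hypothetical function, and the records are assumed to be dictionaries with alpha_proteins and beta_proteins lists.

```python
from typing import Any, Dict, Iterable, List


def split_by_proteins(items: List[Dict[str, Any]], proteins: Iterable[str]) -> Dict[str, Any]:
    """Group items by whether both, or exactly one, of the peptides match the proteins."""
    wanted = set(proteins)
    result: Dict[str, Any] = {"Proteins": sorted(wanted), "Both": [], "One": []}
    for item in items:
        alpha_hit = bool(wanted & set(item["alpha_proteins"]))
        beta_hit = bool(wanted & set(item["beta_proteins"]))
        if alpha_hit and beta_hit:
            result["Both"].append(item)
        elif alpha_hit or beta_hit:
            result["One"].append(item)
        # items where neither peptide matches are dropped
    return result


# Toy records (not real pyXLMS objects)
toy_items = [
    {"alpha_proteins": ["Cas9"], "beta_proteins": ["Cas9"]},
    {"alpha_proteins": ["Cas9"], "beta_proteins": ["sp"]},
    {"alpha_proteins": ["sp"], "beta_proteins": ["sp"]},
]

toy_split = split_by_proteins(toy_items, ["Cas9"])
print(len(toy_split["Both"]), len(toy_split["One"]))
```

In the example file the two groups contain 798 and 23 CSMs, which together give the 821 CSMs listed under "Cas9" in the protein distribution.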
xls_cas9 = transform.filter_proteins(xls, proteins=["Cas9"])
Similarly, if we are only interested in crosslinks of a specific protein (or set of proteins), we can investigate them further by calling the transform.filter_proteins() function and passing the crosslinks as the first argument. The second argument proteins should be a list or set of the protein accessions that we are interested in. In this example we are only interested in a single protein, namely "Cas9". You can read more about the filter_proteins() function and all its parameters here: docs.
list(xls_cas9.keys())
['Proteins', 'Both', 'One']
The function returns a dictionary with keys "Proteins", "Both", and "One":
"Proteins" allows you to access your original list of proteins that was used for filtering (i.e., what was passed via the proteins parameter).
"Both" contains all crosslinks where both peptides were of one of the specified proteins, in our case both peptides are from Cas9.
"One" contains all crosslinks where only one of the two crosslinked peptides was of the specified proteins, in our case from Cas9.
xls_cas9["Proteins"]
['Cas9']
Via "Proteins" we can access our original list of proteins that was used for filtering.
len(xls_cas9["Both"])
274
Via "Both" we get all crosslinks where both peptides are of one of the specified proteins of interest (in our case there was only one protein of interest: Cas9).
random_xl = random.choice(xls_cas9["Both"])
transform.display(random_xl)
Data Type: crosslink
Completeness: full
Alpha Peptide: VLSAYNKHR
Alpha Peptide Crosslink Position: 7
Alpha Proteins: ['Cas9']
Alpha Proteins Crosslink Positions: [1300]
Alpha Decoy: True
Beta Peptide: YKEPLQQR
Beta Peptide Crosslink Position: 2
Beta Proteins: ['Cas9']
Beta Proteins Crosslink Positions: [1023]
Beta Decoy: True
Crosslink Type: intra
Crosslink Score: 12.803290867950926
Here is a random example crosslink from "Both"; as you can see, both peptides are from Cas9.
len(xls_cas9["One"])
21
Via "One" we get all crosslinks where only one of the two crosslinked peptides was of the specified proteins of interest (in our case there was only one protein of interest: Cas9).
random_xl = random.choice(xls_cas9["One"])
transform.display(random_xl)
Data Type: crosslink
Completeness: full
Alpha Peptide: NSDKLIAR
Alpha Peptide Crosslink Position: 4
Alpha Proteins: ['Cas9']
Alpha Proteins Crosslink Positions: [1122]
Alpha Decoy: False
Beta Peptide: TKYNALKTPDK
Beta Peptide Crosslink Position: 2
Beta Proteins: ['sp']
Beta Proteins Crosslink Positions: [145]
Beta Decoy: False
Crosslink Type: inter
Crosslink Score: 30.31260863093563
Here is a random example crosslink from "One"; as you can see, only one of the two peptides is from Cas9.
Filtering by Target-Decoy Type
csms_td = transform.filter_target_decoy(csms)
We can filter crosslink-spectrum-matches by their target-decoy type by calling transform.filter_target_decoy() and passing the crosslink-spectrum-matches as the first argument. The function returns a dictionary with the keys "Target-Target", "Target-Decoy", and "Decoy-Decoy", whose values are lists of the corresponding target-target, target-decoy, and decoy-decoy CSMs respectively. You can read more about the filter_target_decoy() function and all its parameters here: docs.
len(csms_td["Target-Target"])
786
Via "Target-Target" we can access all crosslink-spectrum-matches where both peptides are from the target database.
len(csms_td["Target-Decoy"])
39
Via "Target-Decoy" we can access all crosslink-spectrum-matches where one peptide is from the target database and one peptide is from the decoy database. Therefore both target-decoy and decoy-target matches are contained in "Target-Decoy".
len(csms_td["Decoy-Decoy"])
1
Via "Decoy-Decoy" we can access all crosslink-spectrum-matches where both peptides are from the decoy database.
xls_td = transform.filter_target_decoy(xls)
Similarly, we can filter crosslinks by their target-decoy type by calling transform.filter_target_decoy() and passing the crosslinks as the first argument. The function returns a dictionary with the keys "Target-Target", "Target-Decoy", and "Decoy-Decoy", whose values are lists of the corresponding target-target, target-decoy, and decoy-decoy crosslinks respectively. You can read more about the filter_target_decoy() function and all its parameters here: docs.
len(xls_td["Target-Target"])
265
Via "Target-Target" we can access all crosslinks where both peptides are from the target database.
len(xls_td["Target-Decoy"])
0
Via "Target-Decoy" we can access all crosslinks where one peptide is from the target database and one peptide is from the decoy database. Therefore both target-decoy and decoy-target matches are contained in "Target-Decoy". As you can see here, the number of "Target-Decoy" matches is zero for our MS Annika crosslink results, because on the crosslink level MS Annika reports any target-decoy and decoy-target matches as full decoy-decoy matches.
len(xls_td["Decoy-Decoy"])
35
Via "Decoy-Decoy" we can access all crosslinks where both peptides are from the decoy database.
Please note that any crosslink-spectrum-matches or crosslinks with missing target-decoy labels will be filtered out by this function!
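The grouping described above, including the removal of matches without target-decoy labels, can be sketched on toy data as follows. This is an illustrative re-implementation, not the pyXLMS code: split_target_decoy() is hypothetical, and the field names alpha_decoy and beta_decoy as well as the use of None for a missing label are assumptions.

```python
from typing import Any, Dict, List


def split_target_decoy(items: List[Dict[str, Any]]) -> Dict[str, List[Dict[str, Any]]]:
    """Group items by target-decoy type; items with a missing label are skipped."""
    result: Dict[str, List[Dict[str, Any]]] = {
        "Target-Target": [],
        "Target-Decoy": [],
        "Decoy-Decoy": [],
    }
    for item in items:
        alpha = item.get("alpha_decoy")
        beta = item.get("beta_decoy")
        if alpha is None or beta is None:
            continue  # assumed representation of a missing target-decoy label
        if alpha and beta:
            result["Decoy-Decoy"].append(item)
        elif alpha or beta:
            # covers both target-decoy and decoy-target
            result["Target-Decoy"].append(item)
        else:
            result["Target-Target"].append(item)
    return result


# Toy records (not real pyXLMS objects)
toy_items = [
    {"alpha_decoy": False, "beta_decoy": False},
    {"alpha_decoy": False, "beta_decoy": True},
    {"alpha_decoy": True, "beta_decoy": False},
    {"alpha_decoy": True, "beta_decoy": True},
    {"alpha_decoy": None, "beta_decoy": False},
]

toy_td = split_target_decoy(toy_items)
print({key: len(value) for key, value in toy_td.items()})
```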
Filtering Target Matches Only
Because we are often only interested in target-target matches, there is a shorthand function called transform.targets_only() that returns only target-target matches. In contrast to all previous filter functions, targets_only() accepts a list of crosslink-spectrum-matches, a list of crosslinks, or a parser_result as input (see data type documentation here: docs). The return type will be the same as the input type. You can read more about the targets_only() function and all its parameters here: docs.
csms = transform.targets_only(csms)
print(f"Nr. of TT CSMs: {len(csms)}")
Nr. of TT CSMs: 786
Here is an example of calling targets_only() on a list of crosslink-spectrum-matches: a list of crosslink-spectrum-matches containing only target-target matches is returned.
xls = transform.targets_only(xls)
print(f"Nr. of TT crosslinks: {len(xls)}")
Nr. of TT crosslinks: 265
Here is an example of calling targets_only() on a list of crosslinks: a list of crosslinks containing only target-target matches is returned.
parser_result = transform.targets_only(parser_result)
print(f"Nr. of TT CSMs: {len(parser_result['crosslink-spectrum-matches'])}")
print(f"Nr. of TT crosslinks: {len(parser_result['crosslinks'])}")
Nr. of TT CSMs: 786
Nr. of TT crosslinks: 265
Here is an example of calling targets_only() on a parser_result: a parser_result containing only target-target matches is returned.
Please note that any crosslink-spectrum-matches or crosslinks with missing target-decoy labels will be filtered out by this function!
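The idea that the return type follows the input type can be sketched with a small stand-in function on toy data. keep_targets() is a hypothetical illustration, not the pyXLMS implementation; the field names alpha_decoy and beta_decoy are assumptions, while the two dictionary keys are the ones used for parser_result earlier on this page.

```python
from typing import Any, Dict, List, Union

Item = Dict[str, Any]


def is_target_target(item: Item) -> bool:
    """True only if both labels are present and False (missing labels are rejected)."""
    return item.get("alpha_decoy") is False and item.get("beta_decoy") is False


def keep_targets(data: Union[List[Item], Dict[str, Any]]) -> Union[List[Item], Dict[str, Any]]:
    """Return target-target matches only, in the same container type as the input."""
    if isinstance(data, list):
        return [item for item in data if is_target_target(item)]
    if isinstance(data, dict):
        filtered = dict(data)  # shallow copy, so the input is left untouched
        for key in ("crosslink-spectrum-matches", "crosslinks"):
            if filtered.get(key) is not None:
                filtered[key] = [item for item in filtered[key] if is_target_target(item)]
        return filtered
    raise TypeError("expected a list of matches or a parser result dictionary")


# Toy data (not real pyXLMS objects)
tt = {"alpha_decoy": False, "beta_decoy": False}
td = {"alpha_decoy": False, "beta_decoy": True}
unlabeled = {"alpha_decoy": None, "beta_decoy": None}
toy_parser_result = {"crosslink-spectrum-matches": [tt, td, unlabeled], "crosslinks": [tt, tt, td]}

filtered_list = keep_targets([tt, td, unlabeled])
filtered_result = keep_targets(toy_parser_result)
print(len(filtered_list), len(filtered_result["crosslinks"]))
```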