A Bootstrapping Based Frequent Itemset Mining Approach To The Identification Of Protein Phosphorylation Motifs
PHOSIDA contains a frequent itemset mining approach to identify protein sequence motifs from large scale datasets on the basis of bootstrap statistics. Sequences surrounding non-phosphorylated serines, threonines and tyrosines are randomly selected from protein databases in iterative steps. They form species-specific bootstrap distributions which reflect the frequencies of amino acids at certain positions relative to the site. Significantly overrepresented phosphorylation motifs are identified by comparing the position specific amino acid frequencies in the sequences surrounding phosphorylation sites (positive set) with the corresponding calculated bootstrapping distributions (background set). Our motif-finding algorithm uses the candidate-pruning principle and outperforms existing motif-finding methods in speed due to consistent database management of the background sets. We also derive a score that describes the significance of each motif. Identified protein sequence motifs are then scanned for matches with annotated kinase motifs. Application of our method to eukaryotic phosphoproteomes recovers both known and novel phosphorylation motifs.
The Frequent Itemset Mining Algorithm and Computation of the Significance Score
One fundamental step of our algorithm is the generation and testing of candidate motifs. In the first run of the algorithm, all possible motifs, which contain the phosphosite and one additional amino acid on any of six positions to the N- and C-terminus next to it, are tested for significance. We selected this sequence frame on the basis of experimentally validated kinase motifs. Furthermore previous studies on phosphorylation site prediction also found out that this sequence frame (+/- 6 positions around the phosphosite) provides the optimal input sets for training. This observation was also confirmed on the basis of evolutionary analyses, as the highest conservation was found for the six residues surrounding phosphosites to both termini. Consequently, 240 (12 positions × 20 amino acids) candidate motifs are generated in the first run and the frequencies in the phosphoset (positive set) are compared with the ones of the corresponding bootstrap distributions created from background sets (see next method section for the creation of bootstrap distributions). For each candidate motif, the yielded score reflects the difference between the frequency of the given amino acid on the specified position and the mean of the corresponding background bootstrap distribution measured in number of standard deviations of the bootstrap distribution (f(aa, position): amino acid frequency on a specified position; N: number of bootstrap sets; xi: mean position specific amino acid frequency in the bootstrapset i):
Notably, the resulting score is negative, if the given frequency is lower than the mean of the corresponding background distribution. Candidate motifs from the first run are kept as significantly overrepresented sequence patterns, if their positive scores satisfy the specified score cutoffs and if their proportional frequencies are also higher than the user defined minimum frequency. Next, those motifs that are classified to be significantly frequent in the phosphoset from the first run are joined to form new candidate motifs that are more specific. Each combined candidate motif consequently contains two amino acids in addition to the phosphosite and is tested for statistical significance. For the second step of the algorithm 57360 (240 × 239) species specific background bootstrap distributions are stored in the database. Candidate motifs that satisfy the specified cutoffs in the second run are additional significant protein phosphorylation motifs. The algorithm could be continued by joining those motifs to generate new candidate motifs in a third run. However, experience has shown that annotated kinase motifs usually contain one or two residues besides the phosphosite. This is also reasonable, as an entire peptide sequence, for example, would not reflect a real motif.
Generation of Bootstrap Distributions
We created bootstrap distributions for a given species from 1,000 sets of randomly selected sequences. Each random set contained 5,000 sequences and each sequence contained two times six residues surrounding a serine, threonine or tyrosine to the N- and to the C-terminus respectively. We created random sets for each amino acid (S/T/Y) separately. For each random set we calculated the relative amino acid frequency on each position within the sequence and the histogram of these 1,000 proportions provided the bootstrap distribution. For the first run of the algorithm (for the identification of sequence motifs that contain a single amino acid besides the site), 240 (20 × 12) bootstrap distributions were created because of the distribution of 20 different amino acids on each of the 12 sequence positions. For the second run, 57360 (240 × 239) bootstrap distributions were created.
Generation of Sequence Logos
Previous studies have demonstrated that the creation of sequence logos is very useful to visualize the amino acid frequency in the given set (Schneider and Stephens, 1990). Our method generates sequence logos that reflect the amino acid frequencies of the total set, which provides an overall reflection of the phosphoset. Amino acids are displayed in the sequence logo, if their frequency on a given position is higher than the mean of the corresponding background distribution. The height of the amino acid letter is relative to the highest score from the first run.