Prediction of GPI-Anchored proteins with pointer neural networks

GPI-anchors constitute an important post-translational modification, linking many proteins to the outer face of the plasma membrane in eukaryotic cells. Since experimental validation of GPI-anchors is slow and costly, computational approaches for predicting them from amino acid sequences are needed. However, the most recent GPI predictor is more than a decade old, and considerable progress has been made in machine learning since then. We present a new dataset and a novel method, NetGPI, for GPI prediction. The predictor is based on recurrent neural networks, incorporating an attention mechanism that simultaneously detects GPI-anchors and points out the location of their ω-sites. The performance of NetGPI is superior to existing methods with regard to discrimination between GPI-anchors and other proteins and approximate (±1 position) placement of the ω-site. NetGPI is available at: https://services.healthtech.dtu.dk/service.php?NetGPI-1.0.


| INTRODUCTION
Some of the proteins that follow the secretory pathway are attached to the membrane of eukaryotic cells by specific mechanisms. One of these mechanisms is a post-translational modification where a glycosylphosphatidylinositol (GPI) anchor is attached to the protein. The identification of proteins that undergo this modification is of high interest due to the diversity of functions that they perform. GPI-anchored proteins are essential in the development of fungi and animal cells [1,2]. They are also involved in certain diseases such as paroxysmal nocturnal haemoglobinuria, an acquired haematopoietic stem-cell disorder [3], and in the defense mechanisms of various protozoan parasites such as Leishmania and Trypanosoma [4]. Consequently, computational tools that are able to detect proteins with this modification are highly valuable for research in eukaryotic cell biology [5].
GPI-anchored proteins have two signals in their primary sequence: an N-terminal sequence for endoplasmic reticulum targeting (signal peptide) and a C-terminal signal sequence directing the attachment of the GPI-anchor. This attachment is carried out by a GPI transamidase which recognizes the C-terminal signal sequence and cleaves the peptide bond at the GPI-anchor attachment site, known as the ω-site. This cleavage creates a link between the GPI and the C-terminus of the cleaved protein, allowing the protein to remain tethered to the membrane. C-terminal signal sequences are generally composed of five regions, which are defined by the amino acids before the ω-site (ω-minus) and after it (ω-plus). The five regions are: a stretch of polar amino acids that form a flexible linker region (ω − 10 to ω − 1); the ω-site amino acid; the ω + 2 amino acid, a restrictive position with mostly G, A or S; a spacer region of moderately charged amino acids (ω + 3 to ω + 9 or more); and a stretch of hydrophobic amino acids starting approximately at ω + 10 [6].
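The five regions can be made concrete with a small sketch that slices a sequence around a given ω-site. This is only an illustration: the function name `gpi_signal_regions` and the toy sequence are hypothetical, the ω-site is assumed to be known in advance, and the fixed boundaries simplify the approximate ranges given above (the spacer may be longer than ω + 9 in real proteins).

```python
from typing import Dict


def gpi_signal_regions(seq: str, omega: int) -> Dict[str, str]:
    """Slice a sequence into the regions around a given omega-site.

    `omega` is the 0-based index of the omega-site residue in `seq`.
    Boundaries are fixed here for simplicity (an assumption).
    """
    return {
        "linker": seq[max(0, omega - 10):omega],   # omega-10 .. omega-1
        "omega_site": seq[omega],                  # attachment residue
        "omega_plus_2": seq[omega + 2:omega + 3],  # restrictive position
        "spacer": seq[omega + 3:omega + 10],       # omega+3 .. omega+9
        "hydrophobic_tail": seq[omega + 10:],      # from about omega+10
    }


# Made-up toy sequence, not a real protein.
toy = "T" * 10 + "S" + "A" + "G" + "KDKDKDK" + "L" * 10
regions = gpi_signal_regions(toy, omega=10)
```

Note that the ω + 1 residue is not part of any of the five regions, so it does not appear in the returned dictionary.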
Detecting proteins that carry this signal requires experimental assays. Such experiments are generally low throughput and costly, which has resulted in a small number of experimentally annotated GPI-anchored proteins. To overcome this limitation, fast computational methods that can approximate the experimentally validated process are needed. Machine learning methods for predicting GPI-anchors already exist [7,8,9]. Nonetheless, these methods were developed more than a decade ago and utilize neither recent progress in machine learning nor newer data sources. Deep learning methods, such as the Recurrent Neural Network (RNN) [10], have recently proven effective at protein prediction tasks [11]. However, deep learning requires large numbers of annotated examples to generalize well [12].
In this paper we present a new tool for detecting GPI-anchored proteins and determining the position of the ω-site using recurrent neural networks. To overcome the scarcity of experimentally validated data we build a new training set from manually annotated, predicted GPI-anchored proteins, mostly reserving the experimentally verified data for a held-out test set. Nevertheless, our method achieves state-of-the-art performance on the GPI-anchor prediction task. Moreover, we show that the model learns biologically meaningful characteristics.

| Related works
Initial work on predicting the presence of GPI-anchors and the ω-site was published by Eisenhaber et al. [13]. This work, known as the Big-Π Predictor, details a method that evaluates amino acid type preferences at positions near a supposed ω-site as well as the concordance with general physical properties encoded in multi-residue correlation within the motif sequence [13]. Big-Π provides kingdom-specific predictions as it was trained on metazoan, protozoan, fungi [14] and plant [15] proteins separately.
Fankhauser and Mäser [8] presented a neural network based prediction tool called KohGPI/GPI-SOM. GPI-SOM utilizes a Kohonen Self-Organizing Map structure which takes as input the average position of a given amino acid relative to the C-terminus, the hydrophobicity of the amino acid at 22 C-terminal positions, and 2 units representing the quality of the presumed ω-site and its position. Both GPI-SOM and Big-Π utilize an external signal peptide predictor known as SignalP [16] to preselect proteins. A genome-wide study by the authors indicated that the percentage of GPI-anchored proteins in a given proteome was of the same order of magnitude as their reported error rate, not accounting for the error rate of the version of SignalP used.
In 2008 Pierleoni, Martelli & Casadio published PredGPI, a GPI-anchor predictor utilizing a Hidden Markov Model (HMM) for the prediction of the position of the ω-site and a Support Vector Machine (SVM) for the presence of a GPI-anchor [9]. The HMM has 46 states, with varying probabilities for amino acids and the potential ω-site assigned to the 26th state. The SVM takes as input the negative log-likelihood computed by the HMM as well as 82 features intended to describe the overall composition of the sequence, the features of the N-terminal regions comprising the signal peptide, and the features of the C-terminal regions containing the cleaved GPI-anchor signal. PredGPI supplies two variants: a model where the potential ω-site is restricted to be one of Cysteine, Aspartic acid, Glycine, Asparagine, and Serine, which they refer to as the conservative model, and a model with no such restriction. Unlike the two other methods, PredGPI does not rely on an external signal peptide predictor such as SignalP.

| Dataset
All data used in this project are extracted from the UniProt database [17]. The dataset construction follows two main steps: data gathering and homology partitioning. First, we select all the eukaryotic proteins with experimental evidence

| Objective
The objective of GPI prediction is to decide whether a GPI signal is present and, if present, to determine the position of the ω-site in a protein sequence. We combine these two tasks by reducing them to the single task of maximizing the probability of a position in a sequence. To achieve this, we add a placeholder to the end of the protein sequence which serves to indicate that the protein is non GPI-anchored. Thus we formally define the objective as maximizing the probability of a position in $\hat{D}$, the sequence extended with the placeholder, which is known as pointing [19].

TABLE 1 Dataset composition for the training/validation set and held out test set.

Where $D \in \Sigma^T$ is an amino acid sequence and $\Sigma$ is a dictionary of the twenty common amino acids as well as the token X, which represents any encountered amino acid other than these twenty. We only consider the last 100 amino acids in the protein sequence, such that the length $T \le 100$. $C_i$ corresponds to a position in $\hat{D}$, the sequence extended with the placeholder.
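The input preparation described here (truncation to the last 100 residues, the X token, and the placeholder appended to the end) could be sketched as follows. The function name, the constant names and the use of the character `z` for the placeholder token are illustrative assumptions, not the actual NetGPI code.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the twenty common amino acids
UNKNOWN = "X"      # token for any other residue
PLACEHOLDER = "z"  # appended position meaning "non GPI-anchored" (assumed symbol)
MAX_LEN = 100      # only the C-terminal 100 residues are kept


def prepare_sequence(protein: str, max_len: int = MAX_LEN) -> List[str]:
    """Tokenize the C-terminal tail and append the placeholder position."""
    tail = protein[-max_len:]  # shorter proteins are kept whole
    tokens = [aa if aa in AMINO_ACIDS else UNKNOWN for aa in tail]
    return tokens + [PLACEHOLDER]


short_example = prepare_sequence("MKU")      # U is not among the twenty
long_example = prepare_sequence("A" * 250)   # truncated to 100 + placeholder
```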
If the sequence does not contain an ω-site we maximize the probability of the protein being non GPI-anchored. Inspired by work in natural language processing [20,21], we represent the lack of an ω-site by maximizing the probability of the placeholder position known as the sentinel, $z$, at the end of the amino acid sequence. This results in the extended sequence $\hat{D}$, that is, $D$ followed by $z$. To parameterize the conditional probability distribution $P_\theta$ we use a neural network architecture known as the Long Short-Term Memory (LSTM) cell [22] and distributed representations of the amino acids [23], as shown in equation 3.
Where embedding: $\Sigma \to \mathbb{R}^{d_e}$ turns each amino acid into a distributed representation of real numbers using a linear trainable weight of size $d_e$, and $i, j \in \mathbb{N}_{\le T}$ are indexes of the protein sequence including the sentinel position. The LSTM is a non-linear transformation of a sequence of real values. It uses trainable recurrent units to distribute sequential information across the protein sequence, LSTM: $\mathbb{R}^{T \times d_e} \to \mathbb{R}^{T \times d_h}$, where $d_h$ is the output size of the LSTM. As we use a bidirectional LSTM [24] we end up with two hidden representations of size $d_h$. To get the probability over the sequence we project the output of every position to a logit, $g_i V \in \mathbb{R}$, followed by a softmax: $\mathbb{R}^T \to [0, 1]^T$ that normalizes the logits into a probability distribution over the sequence. To create the logits we use a two-layer feed-forward neural network on top of the LSTM hidden states, $h \in \mathbb{R}^{T \times 2d_h}$, with a tanh activation function, $W \in \mathbb{R}^{2d_h \times d_a}$, and $V \in \mathbb{R}^{d_a}$.
This use of a softmax over the sequence length is a modification of attention known as a pointer network [19]; the interaction size $d_a$ of $gV$ is the attention hidden representation size.
The embedding, LSTM, W , and V are all trainable with stochastic gradient descent using back-propagation through time [25]. We have visualized our model in Figure 1.
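The pointing step on top of the hidden states can be illustrated with a minimal pure-Python sketch. It assumes the logits are computed as tanh(hW)V followed by a softmax over positions, with the last position acting as the sentinel; the hidden states and weights below are hand-picked toy values rather than outputs of a trained bidirectional LSTM, and both function names are hypothetical.

```python
import math
from typing import List, Optional


def pointer_distribution(h: List[List[float]],
                         W: List[List[float]],
                         V: List[float]) -> List[float]:
    """Softmax over positions of the logits tanh(h W) V."""
    logits = []
    for h_i in h:
        # g_i = tanh(h_i W): attention hidden representation of position i
        g_i = [math.tanh(sum(h_i[k] * W[k][j] for k in range(len(h_i))))
               for j in range(len(V))]
        # g_i V: one scalar logit per position
        logits.append(sum(g * v for g, v in zip(g_i, V)))
    # Numerically stable softmax across the sequence positions
    top = max(logits)
    exps = [math.exp(x - top) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predicted_omega_site(probs: List[float]) -> Optional[int]:
    """Return the pointed-to position, or None if the sentinel (last) wins."""
    best = max(range(len(probs)), key=lambda i: probs[i])
    return None if best == len(probs) - 1 else best


# Toy example: two residues plus the sentinel, hidden size 2.
W = [[1.0, 0.0], [0.0, 1.0]]
V = [1.0, 0.0]
anchored = pointer_distribution([[0.0, 0.0], [2.0, 0.0], [0.0, 0.0]], W, V)
not_anchored = pointer_distribution([[0.0, 0.0], [0.0, 0.0], [2.0, 0.0]], W, V)
```

Because a single softmax covers both the residues and the sentinel, detection and localization come from the same distribution, which is the design choice the pointing objective relies on.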
FIGURE 1 Diagram of the model, illustrating how the model points to a position in a sequence, in this case, the entry with UniProt accession number P15693. The sequence is truncated to the last 100 amino acids and the sentinel, $z$, is appended. The predicted ω-site is an Asparagine (N). If the position with highest likelihood had been the sentinel position, then the protein would have been predicted as non GPI-anchored.

| Quantitative evaluation criteria
To evaluate the discrimination between GPI-anchored and non GPI-anchored proteins we use the Matthews Correlation Coefficient (MCC), and to evaluate ω-site prediction we use the F1 score [26]. Due to the dual nature of the problem, as well as the lack of experimental ω-site evidence in the training set, we devise a simple heuristic that combines the two evaluation measures. The F1 score is calculated with a tolerance of two positions from the annotated ω-site. We allow for this flexibility because the training set contains only non-experimentally verified ω-site samples, which are not as reliable as the experimentally verified ones. The MCC is weighted twice as heavily as the F1 score, as we want to emphasize GPI-anchoring discrimination over ω-site prediction performance. For each fold, the model with the combination of hyperparameters that gives the best heuristic value on the validation partition is chosen. This heuristic also serves as an early stopping criterion, controlling when the model's parameters are stored. The loss minimized during training is the cross-entropy loss.
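A sketch of such a selection heuristic is given below. The MCC and F1 formulas are the standard ones; how the two are combined is not spelled out in the text, so the weighted mean with weights 2 and 1 in `selection_score` is an assumption, as are all function names.

```python
import math


def mcc(tp: int, tn: int, fp: int, fn: int) -> float:
    """Matthews Correlation Coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def within_tolerance(predicted: int, annotated: int, tolerance: int = 2) -> bool:
    """An omega-site prediction counts as a hit within +/- `tolerance` positions."""
    return abs(predicted - annotated) <= tolerance


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 as the harmonic mean of precision and recall."""
    denom = 2 * tp + fp + fn
    return 0.0 if denom == 0 else 2 * tp / denom


def selection_score(mcc_value: float, f1_value: float) -> float:
    """Assumed combination: weighted mean with MCC counted twice."""
    return (2.0 * mcc_value + f1_value) / 3.0
```

With this weighting, a model with MCC 0.9 and F1 0.6 scores 0.8, so a drop in discrimination costs twice as much as the same drop in position accuracy.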

| Model Details
We train the neural network with a batch size of 64 for up to 30 epochs. We set the embedding size $d_e = 22$. We use a validation set to find the optimal values for the size of the bidirectional LSTM cell hidden representation $d_h$, the attention hidden representation size $d_a$, the number of LSTM layers, the dropout between LSTM layers, the optimizer's weight decay, and the learning rate. Dropout between LSTM layers forces each hidden unit in subsequent layers to work with a randomly chosen set of hidden units from the previous layer [27].
To make better use of the data we do a five-fold split of the training set and optimize the neural network hyperparameters individually for each split. The best performing model from each split is used in an ensemble for the test set. The predictions of each model in the ensemble are transformed with a logarithm before being averaged. This is done to emphasize confident model predictions.
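The log-then-average ensembling can be sketched as follows, assuming each model outputs a probability distribution over the same positions. The small `eps` added before the logarithm and the absence of any renormalization afterwards are choices made for this illustration, not details taken from the text, and the two model outputs are toy values.

```python
import math
from typing import List


def ensemble_log_average(model_probs: List[List[float]],
                         eps: float = 1e-12) -> List[float]:
    """Average the per-position log-probabilities of several models."""
    n_models = len(model_probs)
    n_positions = len(model_probs[0])
    return [
        sum(math.log(probs[i] + eps) for probs in model_probs) / n_models
        for i in range(n_positions)
    ]


# Two toy models scoring three positions (the last could be the sentinel).
model_a = [0.7, 0.2, 0.1]
model_b = [0.6, 0.1, 0.3]
scores = ensemble_log_average([model_a, model_b])
best_position = max(range(len(scores)), key=lambda i: scores[i])
```

Averaging in log space corresponds to a geometric rather than arithmetic mean of the probabilities, so a position that any single model considers very unlikely is penalized strongly.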
We evaluate 192 different hyperparameter settings on the validation set for each fold. The hyperparameters we

TABLE 2 The combination of hyperparameters with the best validation performance for each partition.

| Qualitative evaluation methods
To better understand the decisions the model makes we performed a feature importance analysis using the Local Interpretable Model-agnostic Explanations (LIME) package [30]. This analysis is performed on the held-out test set. In the LIME analysis, amino acids contributing to a GPI-anchored prediction will have a positive importance, while amino acids contributing to the non GPI-anchored prediction will have a negative importance. The larger the weight the larger the contribution to the prediction.
Furthermore, we investigate the sequence composition around the ω-site to uncover possible model biases.

| Quantitative results
To benchmark the performance of the current tools, the held-out test set was submitted to the three tools currently available: Big-Π, GPI-SOM, and PredGPI. In the case of Big-Π we separated the held-out test set according to kingdom and submitted each part to the corresponding version of the tool. Big-Π annotates its predictions according to likelihood. Predictions with high likelihood are labeled as P, twilight zone predictions are labeled as S, and non-potentially GPI-anchored proteins are labeled as N. We regarded any protein predicted as potentially GPI-anchored (P or S) as a GPI-anchored prediction.
PredGPI ranks and classifies predictions according to specificity. Predictions are labeled as highly probable, probable, weakly probable, or not GPI-anchored. We measure the performance for two settings of PredGPI: designating weakly probable predictions either as GPI-anchored or as non GPI-anchored. Treating weakly probable predictions as negative gives the best performance according to MCC.
For predicting the presence of GPI-anchors, NetGPI achieves the highest MCC of 0.962. If we regard PredGPI's weakly probable predictions as negative, the second highest MCC is achieved by PredGPI; otherwise the second highest is Big-Π. NetGPI also attains the highest true positive rate (TPR), 0.975, the second highest being GPI-SOM. NetGPI achieves the second [...], see Table 3. Notably, Big-Π uses kingdom information in its predictor. We tried a similar approach, but found no improvement in our performance using kingdom features during hyperparameter optimization, which is why we did not include it in our predictor.

TABLE 3 Comparison of the GPI-anchor presence prediction performance of NetGPI and benchmarked methods.
We find that Big-Π has at least 59 overlapping samples with our positive test set and an unknown overlap with our negative test set. This might cause the performance of Big-Π to be overestimated. We filter the dataset of test samples

* PredGPI provides two options; this is their conservative option.
** This is the result for PredGPI when weakly probable predictions are regarded as positive.
*** For the position prediction we use the experimentally tested sequences with known ω-sites. The precision is calculated w.r.t. the experimentally tested sequences with known ω-sites as well as all negative samples in the test set.

| Qualitative results
In the qualitative analysis we investigate the importance of biological features when NetGPI predicts GPI-anchor presence and the ω-site. In addition, we do a statistical analysis of the ω-site composition to understand the neighborhood of true and predicted ω-site positions. Lastly, we investigate model likelihood of the predictions, and how it relates to model correctness, on the held-out test set. Figure 2 illustrates the results of the LIME analysis for both positive (see Figure 2a) and negative (see Figure 2b) samples.

| Feature Importance Analysis
We observe that the presence of a hydrophobic tail contributes the most towards a positive prediction. This is consistent with the literature [6], which places a hydrophobic region from position ω + 10 onwards. From that position the feature importance is much higher than for the rest of the sequence, which means that the main feature driving positive predictions of NetGPI is the presence of the hydrophobic region. Regarding the negative predictions, we observe that the amino acids contributing the most are charged and polar amino acids. This indicates that, when making a negative prediction, the model attributes higher importance to non-hydrophobic amino acids, which signal the lack of a hydrophobic tail.

| ω-site composition
Of the 50 experimentally verified ω-sites 54% are Serine, while the other amino acids observed are Asparagine, Glycine, Aspartic acid, Cysteine and Alanine, in decreasing order of frequency. All Glycine and Cysteine ω-sites are correctly predicted, one of the Asparagine ω-sites is off by 2 positions, and both of the Alanine sites are off by one, where the preceding Serine is predicted instead, see Table 5. Positioning errors made by NetGPI are mostly specific to Aspartic acid. Of the experimentally verified ω-sites, 8% are Aspartic acid; however, we predict it in 14% of the 50 experimentally verified ω-sites. NetGPI has 9 off-by-one errors, 7 of which are actually Serine ω-sites, all belonging to the species Arabidopsis thaliana. Of those, 6 are predicted as an Aspartic acid where the actual ω-site is the preceding Serine, and together they belong to the 4-mer PTSD, an ω-site motif that does not occur in our training set. Both Big-Π and NetGPI are unable to position 3 out of 4 Aspartic acid ω-sites, see Table 5. This may be related to the ω + 2 position, as these 3

TABLE 5 NetGPI's and Big-Π's ω-site position prediction performance for the 50 true ω-site amino acids in the test set. We see that both models only predict 1 out of 4 Aspartic acid ω-sites correctly. NetGPI has 9 one-off errors, 7 of which are actually Serine ω-sites.
It is worth mentioning that out of the 50 experimentally verified ω-sites, 13 belong to the species Arabidopsis thaliana and 14 to Homo sapiens. Only one of the 50 proteins with an experimentally verified ω-site is predicted as non GPI-anchored by NetGPI. This example has a very unusual ω + 2 amino acid, namely Lysine (K).

| Likelihood and correctness
In addition to the classification of the sequence and the most likely position of the ω-site, NetGPI reports the likelihood of the chosen position. For positive predictions this is the predicted ω-site, while for negative predictions it is the sentinel.
As our model is trained with cross entropy, it is penalized by the negative logarithm of the likelihood it assigns to the correct position. If we predict incorrectly, with a very low likelihood for the correct position, the loss can be immense. We should thus expect that answers with a high likelihood are more credible.
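A small numeric illustration of this penalty, using made-up three-position distributions and a hypothetical helper `cross_entropy`, shows how quickly the loss grows when the correct position receives very little probability:

```python
import math
from typing import List


def cross_entropy(probs: List[float], target: int) -> float:
    """Negative log-likelihood of the correct position."""
    return -math.log(probs[target])


# Correct position gets probability 0.9: small loss.
confident_right = cross_entropy([0.05, 0.9, 0.05], target=1)
# Correct position gets probability 0.001: loss of ln(1000), about 6.9.
confident_wrong = cross_entropy([0.05, 0.949, 0.001], target=2)
```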
In Figure 3 we display the likelihood distribution of the predictions on the held-out test set. We observe differences in the likelihood of correct and incorrect predictions, implying a correlation between likelihood and correctness. Furthermore, we observe higher likelihood in negative predictions than in positive ones. This is expected, as the probability distribution covers the last 100 amino acids as well as the added sentinel: only the sentinel position denotes a negative prediction, while a positive prediction is spread across the 100 amino acid positions. This means that the positive prediction likelihood has to cover all potential ω-site positions, while the negative prediction likelihood is limited to one position. Therefore, ranking by likelihood should be done separately for negative and positive results.

FIGURE 2 Results of the LIME analysis for positive and negative samples. The positive set (2a) is aligned to the predicted ω-site, while the negative set (2b) is aligned to the C-terminus. Positive feature importance contributes to a positive prediction whereas a negative feature importance contributes to a negative one. We see that the presence of a hydrophobic tail contributes the most towards a positive prediction, whereas charged and polar amino acids contribute the most towards a negative prediction.
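Ranking by likelihood separately for positive and negative predictions could look like the sketch below. The tuple layout (identifier, predicted ω-site index or `None` for the sentinel, likelihood) and the identifiers are hypothetical and do not reflect the output format of the NetGPI web server.

```python
from typing import List, Optional, Tuple

# (identifier, predicted omega-site index or None for the sentinel, likelihood)
Prediction = Tuple[str, Optional[int], float]


def rank_separately(
    preds: List[Prediction],
) -> Tuple[List[Prediction], List[Prediction]]:
    """Sort positive and negative predictions by likelihood within each group."""
    positives = sorted((p for p in preds if p[1] is not None),
                       key=lambda p: p[2], reverse=True)
    negatives = sorted((p for p in preds if p[1] is None),
                       key=lambda p: p[2], reverse=True)
    return positives, negatives


# Made-up predictions for illustration only.
example = [
    ("protA", 87, 0.62),
    ("protB", None, 0.99),
    ("protC", 91, 0.35),
    ("protD", None, 0.80),
]
positives, negatives = rank_separately(example)
```

Keeping the two lists apart avoids comparing a likelihood concentrated on the single sentinel position with one that is spread over many candidate ω-sites.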

| CONCLUSION
We have shown that GPI-anchor prediction can be improved using recurrent neural networks and up-to-date datasets, achieving state-of-the-art performance. Comparison with previous methods is challenging as there exists no standard dataset for training and testing predictive methods. Given progress in protein annotation, we publish a new homology partitioned training and test set, using experimentally verified proteins for testing and manually annotated predicted proteins for training. However, due to the new dataset definition, the performance of current methods could be overestimated as their training sets overlap with our test set.
Our results show that proteins manually annotated by prediction methods or sequence similarity are useful for training a GPI-anchor predictor that performs well when evaluated on experimentally verified ω-sites. However, using these data comes with a caveat: ω-site predictions are sometimes off by one position. We believe that this limitation is necessary in order to obtain a larger training set and create a completely independent test set of experimentally verified GPI-anchors. If we were to use only the experimentally verified GPI-anchors to train and test the predictor, we would not have enough training samples to train a deep neural network classifier, and the resulting test set would be too small to be representative.
A web server implementing NetGPI is available at https://services.healthtech.dtu.dk/service.php?NetGPI-1.0, and our training and testing data set can be downloaded from the same site.