Xiaoqing Ru, Xiucai Ye, Tetsuya Sakurai, Quan Zou, Application of learning to rank in bioinformatics tasks, Briefings in Bioinformatics, Volume 22, Issue 5, September 2021, bbaa394, https://doi.org/10.1093/bib/bbaa394
Abstract
Over the past decades, learning to rank (LTR) algorithms have been gradually applied to bioinformatics. Such methods have shown significant advantages in multiple research tasks in this field. Therefore, it is necessary to summarize and discuss the application of these algorithms so that they can be used more conveniently and contribute further to bioinformatics. In this paper, the characteristics of LTR algorithms and their strengths over other types of algorithms are analyzed from the perspective of multiple applications in bioinformatics. Finally, the paper discusses the shortcomings of LTR algorithms, ways to make better use of them and some open problems that currently exist.
Introduction
Learning to rank (LTR) algorithms were originally applied in the information retrieval field [1]. Over time, the amount of information of all kinds has exploded, and retrieving the required information from these data has become a challenging task. Completing this work by manual labor alone is expensive and too slow for current needs. Therefore, machine learning has been introduced into information retrieval, and LTR algorithms are an effective solution to the ranking problems in this field. The idea of LTR is simple and easy to understand; it is similar to the process of searching for information on the World Wide Web: when users enter queries in search engines, a series of results is recalled [2] and ranked according to their correlation with the queries, with strongly correlated results at the front and weakly correlated results at the back [3].
At present, LTR has become the main technology of modern web search and has been applied in many fields, such as natural language processing and data mining [4]. It has also been introduced into bioinformatics, where it has been used to address problems including Medical Subject Headings (MeSH) indexing, protein homology detection, protein structure and function prediction and several tasks related to drug research and development.
LTR is indeed an effective class of algorithms that can be applied to bioinformatics. Therefore, the comprehensive investigation of LTR has a certain positive impact on the future development of this field. In addition, to solve more problems with LTR, it is necessary to understand the basic principle, application range and unique attributes. This paper summarizes the basic knowledge, existing open source software and specific applications of these algorithms in bioinformatics. The specific applications and the basic steps to solve problems with LTR are shown in Figure 1. By comparing the methods or models in many bioinformatics studies, the advantages, disadvantages and future development prospects of these LTR algorithms are summarized. Through a previous investigation, it was found that no review of such work has been reported so far. Therefore, this paper has certain significance and value.
LTR
Introduction to LTR
The process of constructing the LTR framework is basically consistent with the steps of constructing a conventional classification model, namely constructing the dataset, extracting the data features, training the model and testing the model. However, the composition of the input dataset, the file format and the output are different.
It is not necessary to specifically collect a negative dataset when constructing the LTR framework. The data need to be integrated and processed in a fixed format: |$\Big\{{q}_i\ {F}^j\ {R}_i^j\Big\}$|, where |${q}_i$| represents a certain query, |${F}^j$| represents all features of sample |$j$| and |${R}_i^j$| represents the degree of correlation between sample |$j$| and query |${q}_i$| [5]. This is similar to the file format used by libsvm (i.e. a software package commonly used in classification), except that a query identifier is added. Each query in LTR corresponds to more than one document. The output is not a category but the correlation with the query. Under the same query, the output results are sorted in descending order of correlation; the scores are relative rather than exact correlation values. Finally, unlike classification algorithms, LTR has its own performance evaluation criteria, the most commonly used of which is Normalized Discounted Cumulative Gain (NDCG) [6].
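The input format and the NDCG metric described above can be illustrated with a minimal sketch. The row layout follows the `<relevance> qid:<query> <index>:<value>` convention used by SVMrank-style files; the gain `2^r - 1` with a `log2` position discount is one common definition of DCG and is an assumption here, as are all toy values and function names.

```python
import math

def format_row(relevance, qid, features):
    """Render one sample as '<relevance> qid:<qid> 1:<f1> 2:<f2> ...'."""
    feats = " ".join(f"{i}:{v}" for i, v in enumerate(features, start=1))
    return f"{relevance} qid:{qid} {feats}"

def dcg(relevances, k):
    """Discounted cumulative gain of the first k relevance grades."""
    return sum((2 ** r - 1) / math.log2(pos + 2)
               for pos, r in enumerate(relevances[:k]))

def ndcg(relevances_in_predicted_order, k):
    """DCG of the predicted order divided by the DCG of the ideal order."""
    ideal = dcg(sorted(relevances_in_predicted_order, reverse=True), k)
    return dcg(relevances_in_predicted_order, k) / ideal if ideal > 0 else 0.0

# Toy data: (query id, feature vector, graded relevance); values are invented.
samples = [
    (1, [0.9, 0.1], 2),
    (1, [0.4, 0.3], 0),
    (1, [0.7, 0.8], 1),
    (2, [0.2, 0.5], 1),
    (2, [0.6, 0.6], 0),
]
rows = [format_row(r, q, f) for q, f, r in samples]
print("\n".join(rows))

# Hypothetical model scores for the three documents of query 1.
scores_q1 = [0.3, 0.5, 0.8]
rels_q1 = [2, 0, 1]
order = sorted(range(len(scores_q1)), key=lambda i: scores_q1[i], reverse=True)
ranked_rels = [rels_q1[i] for i in order]   # relevance grades in predicted order
ndcg_q1 = ndcg(ranked_rels, k=3)
print("NDCG@3 for query 1:", round(ndcg_q1, 4))
```

Because the score only determines the order within a query, any monotone rescaling of `scores_q1` leaves the NDCG value unchanged, which is what "relative rather than exact" scores means in practice.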
Based on the number of documents treated as one object under the same query, LTR algorithms can be divided into three types: pointwise, pairwise and listwise [3, 4]. The pointwise method [7, 8], which is computationally fast and less complex, treats one document as an object. Therefore, the relationship between documents is ignored, which may reduce its effectiveness. The pairwise method [9, 10] models the relative relationship between two documents, which addresses the shortcomings of the pointwise method to a certain extent. However, when the number of documents per query varies greatly, the pairwise combination of documents further magnifies the difference. In addition, this method ignores the position of the document in the whole list. The listwise method [11] takes all documents under the same query as the object and has a high computational complexity. This type of method compensates for the drawbacks of the pointwise method (i.e. ignoring the relationship between documents) and the pairwise method (i.e. ignoring the position of documents in the whole list). Therefore, researchers need to select the appropriate algorithm according to their specific needs.
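The difference between the three types can be made concrete by comparing one representative loss of each kind on the same toy query. The choices below (squared error for pointwise, a hinge loss over ordered pairs for pairwise, and a ListNet-style top-one cross entropy for listwise) are illustrative assumptions, not the only losses used in practice.

```python
import math

def pointwise_loss(scores, labels):
    """Mean squared error: each document is judged on its own."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores)

def pairwise_loss(scores, labels, margin=1.0):
    """Mean hinge loss over all pairs (i, j) where i is more relevant than j."""
    losses = []
    for i in range(len(scores)):
        for j in range(len(scores)):
            if labels[i] > labels[j]:
                losses.append(max(0.0, margin - (scores[i] - scores[j])))
    return sum(losses) / len(losses) if losses else 0.0

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def listwise_loss(scores, labels):
    """Cross entropy between top-one distributions of labels and scores."""
    p_true = softmax([float(y) for y in labels])
    p_pred = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(p_true, p_pred))

labels = [2, 1, 0]        # graded relevance of three documents for one query
good = [3.0, 2.0, 1.0]    # scores that preserve the correct order
bad = [1.0, 2.0, 3.0]     # scores that reverse the order

for name, loss in [("pointwise", pointwise_loss),
                   ("pairwise", pairwise_loss),
                   ("listwise", listwise_loss)]:
    print(name, round(loss(good, labels), 4), round(loss(bad, labels), 4))
```

Note that `good` orders the documents perfectly yet still has a nonzero pointwise loss, because the pointwise view penalizes the distance to each label rather than the order; the pairwise loss for `good` is zero because every ordered pair is separated by the margin.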
Toolkits for LTR
Here, two open source toolkits, i.e. Ranklib and |${\mathrm{SVM}}^{\mathrm{rank}}$|, are introduced. The existence of these two toolkits makes the use of LTR more convenient.
|${\mathrm{SVM}}^{\mathrm{rank}}$|(https://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html) only covers the Ranking SVM algorithm, which is widely used in bioinformatics. This algorithm belongs to the pairwise type and was proposed by Herbrich et al. [10] in 2000. Using the Ranking SVM algorithm and the data from user click through, Joachims et al. [12] optimized the search engine’s search quality. |${\mathrm{SVM}}^{\mathrm{rank}}$| is an open source toolkit implemented by Joachims et al., which is easy to use and includes one learning module and one prediction module. The training dataset trains the model through the learning module, and then in the prediction module, the prediction result is output based on the trained model and the test set.
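The two-module workflow can be sketched as a thin Python wrapper around the `svm_rank_learn` and `svm_rank_classify` executables shipped with the toolkit. The file names are hypothetical placeholders, the value of the regularization option `-c` is arbitrary, and the commands are only executed if the binaries and input files are actually present.

```python
import os
import shutil
import subprocess

def svm_rank_commands(train_file, test_file, model_file, prediction_file, c=3.0):
    """Build the argument lists for the learning and prediction modules."""
    learn = ["svm_rank_learn", "-c", str(c), train_file, model_file]
    classify = ["svm_rank_classify", test_file, model_file, prediction_file]
    return learn, classify

# File names below are hypothetical placeholders.
learn_cmd, classify_cmd = svm_rank_commands(
    "train.dat", "test.dat", "model.dat", "predictions.txt")

binaries_found = all(shutil.which(cmd[0]) for cmd in (learn_cmd, classify_cmd))
inputs_found = os.path.exists("train.dat") and os.path.exists("test.dat")

if binaries_found and inputs_found:
    subprocess.run(learn_cmd, check=True)     # learning module: writes the model
    subprocess.run(classify_cmd, check=True)  # prediction module: one score per test row
else:
    print("Would run:", " ".join(learn_cmd))
    print("Then run: ", " ".join(classify_cmd))
```

The prediction file contains one score per line of the test file; sorting the rows of each query by that score gives the predicted ranking.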
On the other hand, Ranklib (https://sourceforge.net/p/lemur/wiki/RankLib/) integrates a variety of LTR algorithms, including MART, RankNet, RankBoost, AdaRank, LambdaMART and ListNet. In addition to building and testing models, this tool supports cross-validation with either sequential or random partitioning of datasets. It can also compare the performance of different models and reorder the data.
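Cross-validation for ranking data is usually done at the level of queries, so that all documents of one query stay in the same fold. The sketch below illustrates sequential and random partitioning under that assumption; it is a generic illustration with hypothetical names, not a reproduction of how Ranklib implements its partitioning.

```python
import random

def kfold_by_query(query_ids, k, shuffle=False, seed=0):
    """Split the distinct query ids into k folds of near-equal size.

    shuffle=False keeps the order of first appearance (sequential partitioning);
    shuffle=True permutes the queries first (random partitioning).
    """
    unique = list(dict.fromkeys(query_ids))   # distinct ids, first-seen order
    if shuffle:
        random.Random(seed).shuffle(unique)
    size, extra = divmod(len(unique), k)
    folds, start = [], 0
    for i in range(k):
        end = start + size + (1 if i < extra else 0)
        folds.append(unique[start:end])
        start = end
    return folds

# One query id per data row; several rows share the same query.
query_ids = [1, 1, 2, 2, 2, 3, 4, 4, 5]

sequential = kfold_by_query(query_ids, k=2)
randomized = kfold_by_query(query_ids, k=2, shuffle=True, seed=42)
print("sequential:", sequential)
print("random:    ", randomized)

# Rows used for testing when the first sequential fold is held out.
test_rows = [i for i, q in enumerate(query_ids) if q in set(sequential[0])]
print("test row indices:", test_rows)
```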
Applications of LTR in bioinformatics
MeSH indexing
MeSH is developed and maintained by the National Library of Medicine (NLM), and it is a tool used in medical information retrieval that catalogs documents, books and audiovisual materials recorded in the NLM. Assigning the appropriate MeSH main headings (MHs) to biomedical literature can help researchers discover new knowledge and propose new scientific hypotheses while improving their productivity. It costs approximately $9.4 to manually review an article and assign it the appropriate MHs [13]. With the rapid growth of the literature in the database, manual indexing consumes a great deal of time and money. The emergence of automatic indexing technology has alleviated the shortcomings of manual annotation to some extent. The initial approaches relied on three technologies: the K-nearest neighbor algorithm [14], machine learning methods or probability models linking file content with MeSH terms [15, 16] and domain-specific knowledge resources (such as MetaMap [17] and trigram [18]). For example, the Medical Text Indexer (MTI) developed by NLM combines the K-nearest neighbor algorithm with domain-specific knowledge resources [19]. Unlike previous automatic MeSH indexing methods, Huang et al. [20] treated the task as a ranking problem. They first used the K-nearest neighbor algorithm to find articles similar to the target article, collected the MeSH terms assigned to these articles as an initial candidate list and then applied the ListNet algorithm to reorder these candidate terms based on the learned association between the document text and each MeSH term. On two representative datasets, the superior performance of LTR was verified by comparison with MTI, reflective random indexing and two strong baseline methods (i.e. neighborhood frequency and neighborhood similarity). Thus, LTR is an effective means of solving the indexing problem.
Since then, multiple LTR frameworks have been developed. Mao et al. [21] extended Huang et al.’s method by introducing the output of MTIFL, a new version of MTI, and by setting an automatic cut-off with the aim of returning only a fixed number of MeSH terms that are most relevant to the target article. In addition, several LTR algorithms, including MART, RankNet, Coordinate Ascent, LambdaMART and AdaRank, were compared, and the authors concluded that MART was the most suitable algorithm for their research.
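The candidate-generation step of this kind of pipeline can be sketched as follows: MeSH terms are collected from the nearest neighbors of a target article and described by neighborhood frequency and neighborhood similarity features. The neighbor list, the similarity values and the fixed linear weights are invented for illustration; in the approach described above, a trained ListNet model would take the place of the hand-set weights.

```python
from collections import defaultdict

def candidate_features(neighbors):
    """Compute per-term features from the k nearest neighbor articles.

    neighbors: list of (similarity to target article, set of MeSH terms).
    Returns {term: (neighborhood frequency, neighborhood similarity)}.
    """
    freq = defaultdict(int)
    sim = defaultdict(float)
    for similarity, terms in neighbors:
        for term in terms:
            freq[term] += 1
            sim[term] += similarity
    k = len(neighbors)
    total_sim = sum(similarity for similarity, _ in neighbors)
    return {t: (freq[t] / k, sim[t] / total_sim) for t in freq}

def rank_terms(features, weights=(0.5, 0.5)):
    """Order candidate terms by a linear score (stand-in for a learned ranker)."""
    def score(term):
        return sum(w * x for w, x in zip(weights, features[term]))
    return sorted(features, key=score, reverse=True)

# Hypothetical nearest neighbors of one target article.
neighbors = [
    (0.9, {"Humans", "Neoplasms"}),
    (0.7, {"Humans", "Mice"}),
    (0.4, {"Humans", "Neoplasms", "Algorithms"}),
]
features = candidate_features(neighbors)
ranked = rank_terms(features)
print(ranked)
```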
The current status and challenges of automatic MeSH indexing are as follows: (1) uneven distribution of MHs, that is, some MHs have appeared in millions of citations, while some MHs have only appeared in dozens of citations; (2) the number of MHs in citations varies greatly, as some citations may have 30 MHs, while some have only 5 MHs; (3) semantic and context-dependent information is not well obtained and (4) indexing is not explored based on the full text content.
In LTR, the number of relevant documents corresponding to each query is not fixed; therefore, the above problems can be addressed based on this characteristic and on complementary evidence extracted from multiple perspectives. MeSHLabeler, DeepMeSH, MeSH Now [22] and FullMeSH are models constructed using LTR algorithms and based on various types of evidence. For example, MeSHLabeler [23], which was built on the LambdaMART algorithm, integrated five types of evidence: global information, local information, subject word dependence, pattern recognition and MTI. There are large numbers of domain phrases, concepts and abbreviations in the biomedical literature. One concept can be represented by different words, and one abbreviation may have different meanings; the simple bag-of-words (BOW) method cannot capture such complex semantic information in biomedical documents, which limits the performance of the model. DeepMeSH [24] used dense semantic representations such as Word2Vec (W2V), Word2Phrase (W2P) and Document2Vec (D2V) to capture the semantic and contextual information of the text, which provided opportunities to improve the performance of automatic MeSH indexing. Many methods only use the title and abstract, so important information that appears only in the body text is ignored. FullMeSH [25] took the entire article as the research object, divided the full text into sections, obtained many types of evidence with deep learning and other traditional models and finally integrated this information into the LTR algorithm.
The research mentioned above shows that selecting appropriate LTR algorithms and capturing effective semantic information are effective means to improve the performance of MeSH indexing. But a challenging problem existing in MeSH indexing is that MeSH vocabulary and indexing principles continue to change over time, which requires researchers to develop more robust models.
Exploration of protein homology
SCOP is one of the commonly used databases for protein remote homology detection, and it divides proteins in a hierarchical manner based on their structural and evolutionary relationships [26]. The structural and functional properties of a novel protein can be inferred from its relationship to a known group. According to the SCOP database, proteins belonging to the same superfamily can be regarded as homologous, and proteins belonging to the same superfamily but not to the same family are remote homologs [27].
Homology can be judged by sequence similarity, and subtle sequence similarity often implies structural, functional and evolutionary relationships of proteins [28]. BLAST [29] and PSI-BLAST [30] are two common tools to explore the sequence similarity [31]. Through these two methods, the related sequences will be output in the order of strong to weak similarity with the given protein sequence. Therefore, exploring protein homology can be treated as a ranking task. Weston et al. [32] used Rankprop algorithm and the entire network structure of similarity relationships among proteins to rank proteins. Rankprop can propagate in densely connected regions of protein similarity networks to detect subtle sequence relationships missed by methods such as PSI-BLAST. Subsequently, Melvin et al. [33] extended the Rankprop algorithm by two steps: propagating proteins closely related to the query to improve accuracy and mapping empirical values to improve the interpretability of scores produced by Rankprop.
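The idea of propagating scores through a similarity network can be illustrated with a deliberately simplified sketch. It is written in the spirit of Rankprop but is not the published update rule: the iteration, the damping factor `alpha` and the toy similarity matrix are all assumptions chosen to show how a protein with no direct similarity to the query can still receive a score through an intermediate neighbor.

```python
def propagate(similarity, query, alpha=0.5, iterations=50):
    """Spread activation from the query through a protein similarity network.

    similarity: symmetric matrix (list of lists) of pairwise similarities.
    Each node keeps its direct similarity to the query and repeatedly adds a
    damped share of the scores of its other neighbors.
    """
    n = len(similarity)
    base = [similarity[query][j] for j in range(n)]
    y = base[:]
    for _ in range(iterations):
        new_y = []
        for j in range(n):
            spread = sum(similarity[i][j] * y[i]
                         for i in range(n) if i != j and i != query)
            new_y.append(base[j] + alpha * spread)
        y = new_y
    return y

# Toy network: protein 0 is the query, 1 resembles 0, 2 resembles 1 only,
# and 3 is unrelated to all others.
S = [
    [1.0, 0.8, 0.0, 0.0],
    [0.8, 1.0, 0.7, 0.0],
    [0.0, 0.7, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
]
scores = propagate(S, query=0)
ranking = sorted(range(1, len(S)), key=lambda j: scores[j], reverse=True)
print([round(s, 3) for s in scores], ranking)
```

With direct similarity alone, proteins 2 and 3 would be tied at zero; after propagation, protein 2 is ranked above protein 3 because it is reachable through protein 1.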
The structure and function of proteins are more conserved than sequences during natural evolution. Protein remote homology detection is one of the basic problems in computational biology, and its purpose is to find remote homologous proteins with low sequence similarity but similar structure and function. Many methods have been developed for protein remote homology detection, and these can be classified into three types: alignment methods, classification methods and ranking methods [27, 34]. The alignment method can be regarded as the basis of the classification and ranking method. The classification method is a transition of the alignment and ranking method. The ranking method can combine the information of the alignment and classification method, maximize the inheritance of their advantages and overcome their disadvantages to a certain extent. ProtDec-LTR [34] is a ranking model integrating the output results of three alignment algorithms (i.e. PSI-BLAST, HHblits and ProtEmbed) into an LTR algorithm. It was verified experimentally that ProtDec-LTR performs better than three independent alignment algorithms. Pseudo-amino acid representation takes residue mutations, deletions and insertions in protein sequences into account and can find conserved domains in structures [35, 36]. Pseudo-amino acid representation facilitates remote homology detection of proteins [37–40]. ProtDec-LTR2.0 [41] first converted all proteins in the query and benchmark databases into pseudo proteins and then conducted follow-up research based on the principle of the ProtDec-LTR. Eventually, ProtDec-LTR2.0 was significantly superior to ProtDec-LTR in terms of ROC1. Profile-based features have shown strong discrimination in classification methods for identifying remote homologous proteins. ProtDec-LTR3.0 [42] integrated ACC and top-gram features processed by feature mapping strategies into an LTR algorithm. 
The features extracted by ACC and top-gram not only reflect the evolutionary information of proteins but also contain the features of protein sequences. On this basis, the accuracy of protein remote homology detection was further improved by the PageRank and HITS algorithms. From this, ProtDec-LTR3.0 achieved better performance than ProtDec-LTR, ProtDec-LTR2.0 and nine other advanced predictors.
Prediction of protein structure and function
Proteins are the main bearers of life activities. Research on protein is mainly divided into two aspects: structural proteomics and functional proteomics.
The structure of a protein is more conserved than its sequence, and exploring protein structure is important for biology, medicine and pharmacy. In protein tertiary structure prediction, for many known protein sequences no similar sequence with a known structure can be found; that is, there is neither a homologous protein nor a remote homologous protein with a known structure. The main purpose of de novo protein structure prediction is to predict the protein 3D structure from the sequence. However, because of the insurmountable dimensionality of the unrestrained search space, de novo protein structure prediction has had limited success and is not universally suitable for large-scale, high-accuracy prediction of protein structures [43]. Protein residue–residue contact prediction can effectively constrain the conformational search space: protein contact constraint methods first predict the residue–residue contacts from residue sequences and then use these predicted contacts as constraints to predict the tertiary structure of proteins [44]. In recent years, there have been many applications of fusion methods in residue–residue association prediction. RRCRank [45] is a fusion method for contact prediction based on a ranking strategy. This model used the SVMRank algorithm, and the proposed ranking strategy showed better performance for all three contact types (short, medium and long).
An increasing number of structure models are produced by protein structure prediction technologies, which predict multiple decoy models for one sequence. To determine the reliability of the predicted models, it is necessary to evaluate their quality [46]. The methods used to predict the quality of protein models can be roughly divided into three categories: single methods [47], quasi-single methods [48] and cluster (or consensus) methods [49]. MQAPRank [50] used both a single method and a quasi-single method and ranked the decoy models based on the similarity of each model to the corresponding native structure. First, a single method based on the LTR algorithm was used to rank the decoy models to indicate their relative quality for target proteins. Then, the top five decoy models were used as references, and the predicted quality of each other model was the average GDT_TS score between that model and the five reference models. On the CASP11 and 3DRobot datasets, MQAPRank outperformed other leading protein model quality assessment methods. It participated in CASP12 under the group name FDUBio and achieved state-of-the-art performance. Three reasons were given for the effectiveness of MQAPRank, one of which is that the LTR framework provides a reasonable ranking of protein decoy models for target proteins. Therefore, LTR is an effective approach to protein model quality assessment.
The study of protein function helps to explore disease mechanisms. Protein function can be predicted based on the similarity of protein sequences [51]. Similarity-based methods, such as BLAST and PSI-BLAST, are competitive [52, 53]. However, such methods are less effective when the sequence identity is less than 60%. In addition, the advent of GO [54] has brought convenience to protein function prediction in bioinformatics. LTR can be applied to automated function prediction by treating GO terms as documents and proteins as queries. GOLabeler [55] constructed an LTR model based on sequence homology information and various deep-rooted evidence information extracted from the sequence information. The challenges are similar to those in the MeSH indexing task: there are multiple GO terms per protein, and the number of GO terms per protein varies greatly. GOLabeler effectively integrated various types of information by using the LambdaMART algorithm and addressed these challenges to some extent. Sequence information is only a part of protein information, so integrating other effective information becomes a key step in improving protein function prediction. NetGO [56] further improved the performance of protein function prediction by incorporating a large amount of protein network information. LTR is also used in the functional prediction of enzymes. Stock et al. [57] used four different cavity-based similarity measures and a sequence alignment-based measure as input to RankRLS to identify functionally related enzymes.
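The two-stage scoring described for MQAPRank can be sketched as follows: a first-stage score ranks the decoys, the top-ranked decoys become references, and each model is then scored by its average similarity to the references. The similarity function below is a toy stand-in for GDT_TS, the model names and scores are invented, and excluding a reference from its own average is an assumption made here for illustration.

```python
def quasi_single_scores(single_scores, similarity, n_refs=5):
    """Score each decoy by its mean similarity to the top-ranked reference decoys.

    single_scores: {model name: first-stage ranking score}.
    similarity: function (model_a, model_b) -> structural similarity,
                standing in for a pairwise GDT_TS computation.
    """
    ranked = sorted(single_scores, key=single_scores.get, reverse=True)
    refs = ranked[:n_refs]
    final = {}
    for model in single_scores:
        others = [r for r in refs if r != model]   # assumption: skip self
        final[model] = sum(similarity(model, r) for r in others) / len(others)
    return refs, final

# Toy decoys placed on a line; closeness on the line mimics structural similarity.
coords = {"m1": 0.9, "m2": 0.8, "m3": 0.5, "m4": 0.1}
single_scores = {"m1": 0.95, "m2": 0.85, "m3": 0.40, "m4": 0.20}

def toy_similarity(a, b):
    """Hypothetical similarity in [0, 1]; not a real GDT_TS implementation."""
    return 1.0 - abs(coords[a] - coords[b])

refs, final = quasi_single_scores(single_scores, toy_similarity, n_refs=2)
print(refs, {m: round(s, 3) for m, s in final.items()})
```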
Relevant process of drug development
Selecting promising compounds (candidate drugs), exploring drug–target interactions and exploring drug–cell line relationships are beneficial to drug development. The methods used for these studies can be broadly divided into two types: similarity-based and feature-based. Similarity-based methods compare compounds (or proteins) of unknown function with compounds (or proteins) of known function [58]. Feature-based methods extract features from compound chemical structures, gene (or protein) sequences or structural information and carry out the subsequent work based on these features [59–61]. LTR is a powerful technique among the feature-based methods; it can integrate various types of features, including those extracted by similarity-based methods, but it is not yet widely used in drug development-related research. Applications of this technique in different research tasks are described below.
The purpose of virtual screening is to identify a group of candidate compounds, and it is an important step in the early process of drug discovery. Agarwal et al. [62] first introduced LTR into ligand-based virtual screening and demonstrated the effectiveness and superiority of the SVM ranking algorithm (i.e. RankSVM) by comparing it with SVM classification and SVM regression. The virtual screening process should focus on the top related compounds with ranking significance. Rathke et al. [63] focused on the topmost ranks by optimizing the rank loss NDCG. Ohue et al. [64] further improved the accuracy of virtual screening by ignoring the ranking between compounds with similar activity and the ranking between inactive compounds. At present, LTR is limited by sparse valuable data, and Liu et al. [65] used additional bioassay and compound information that can provide effective information to solve the problem of data sparseness.
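The idea of dropping uninformative comparisons when building pairwise training data can be sketched as a pair filter: pairs of compounds with nearly equal activity and pairs in which both compounds are inactive are skipped. The activity scale, the two thresholds and the compound names are hypothetical and are not taken from the cited study.

```python
def training_pairs(activities, inactive_below=5.0, min_gap=0.5):
    """Generate (more active, less active) compound pairs for pairwise LTR.

    Skips pairs whose activities differ by less than min_gap and pairs in
    which both compounds fall below the inactivity threshold.
    """
    pairs = []
    names = list(activities)
    for a in names:
        for b in names:
            ya, yb = activities[a], activities[b]
            if ya <= yb:
                continue                  # keep only pairs where a outranks b
            if ya < inactive_below:
                continue                  # both inactive, since yb < ya
            if ya - yb < min_gap:
                continue                  # activities too similar to order
            pairs.append((a, b))
    return pairs

# Hypothetical activities on a pIC50-like scale (higher means more active).
activities = {"c1": 8.0, "c2": 7.8, "c3": 6.0, "c4": 4.0, "c5": 3.0}
pairs = training_pairs(activities)
print(len(pairs), pairs)
```

Of the ten possible ordered pairs, the near-tie between `c1` and `c2` and the inactive pair `c4`, `c5` are removed, so the ranker is trained only on comparisons that matter for the top of the list.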
Many problems faced in compound screening can also be solved by LTR: (1) Interactions between drugs and targets may be measured on different platforms or with different affinity criteria. Zhang et al. [66] used LTR to simultaneously learn from experimental information obtained under different experimental conditions and for different targets, and the results demonstrated that LTR is an efficient computational strategy for virtual drug screening, especially in cross-target virtual screening and heterogeneous data integration. The direct combination of target and compound features can only represent limited information. Therefore, the features used in their study were processed by the tensor product, which yielded a feature set with a very high dimension (4704 dimensions). PKRank [67] generalizes the method proposed by Zhang et al.: it no longer uses the tensor product to process features but instead generates the Gram matrix of the pairwise kernel K. (2) Especially in the treatment of complex diseases, a ligand is expected to interact with multiple targets. In LTR, the relationship between query and document is usually one-to-many, so LTR can be used to solve the multi-target problem. Dorr et al. [68] simultaneously learned compounds with different activity profiles and priorities based on the SVMRank algorithm. To this end, a specific labeling of each compound was devised to infer a virtual screening model for multiple targets. (3) Two key properties that successful drugs need to display are the selectivity and the biological activity of compounds. Selectivity refers to the phenomenon that different organs and tissues of the body differ significantly in their sensitivity to a compound (drug). That is, a compound has a particularly strong effect on a certain organ or tissue but a very weak or even no effect on other tissues.
dCPPP [69] learned the compound-first model and ranked the active compounds effectively. At the same time, a higher ranking of selective compounds by a bidirectional selective push strategy was preferred. In dCPPP, both activity ranking and selectivity prioritization were addressed in a differential optimization model.
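The relationship between the tensor-product features and the pairwise kernel mentioned under point (1) above can be checked numerically. For linear base kernels, the inner product of two tensor-product feature vectors equals the product of the target kernel and the compound kernel, so the high-dimensional features never need to be built explicitly. The product form of the pairwise kernel and the toy vectors are assumptions made for this sketch.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def tensor_product(target, compound):
    """Explicit tensor-product features: one entry per (target, compound) feature pair."""
    return [t * c for t in target for c in compound]

def pairwise_kernel(t1, c1, t2, c2):
    """Product of linear kernels on targets and on compounds."""
    return dot(t1, t2) * dot(c1, c2)

# Toy target and compound descriptors (integers keep the arithmetic exact).
t1, c1 = [1, 2], [3, 0, 1]
t2, c2 = [0, 1], [2, 2, 2]

explicit = dot(tensor_product(t1, c1), tensor_product(t2, c2))
implicit = pairwise_kernel(t1, c1, t2, c2)
print(len(tensor_product(t1, c1)), explicit, implicit)

# Gram matrix over a small set of (target, compound) pairs.
pairs = [(t1, c1), (t2, c2)]
gram = [[pairwise_kernel(ta, ca, tb, cb) for tb, cb in pairs] for ta, ca in pairs]
print(gram)
```

The explicit feature vector has length `len(target) * len(compound)`, which is why tensor-product features grow so quickly, whereas the Gram matrix only grows with the number of training pairs.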
Selecting appropriate drugs for cancer patients is an important part of precision medicine. Rahangdale et al. [70] used LTR for screening and prioritizing cancer-specific drugs, and this study not only maintained the order between sensitive and insensitive drugs but also maintained the order between sensitive drugs. In addition, pLETORg [71] used LTR for three specific application scenarios to screen drugs: selecting sensitive drugs from new drugs for each known cell line; selecting sensitive drugs from all available drugs, including new drugs and known drugs for each known cell line; and selecting sensitive drugs from all available drugs for new cell lines.
LTR has also been applied to explore drug–target correlation. DrugE-Rank [72] introduced drug general descriptors, target composition, transformation and distribution and the output results of six similarity-based methods into the LTR algorithm, which improved the prediction of drug–target correlation for new drug candidates or targets. In addition, our previous work [73] explored the compound ranking and potential target problems by using two different types of datasets with LTR and achieved good results.
Ranking of genes associated with disease
Identifying genes related to specific diseases is one of the greatest challenges in medical research, and finding key genes affecting specific diseases can not only help understand the disease mechanism but also help health care workers develop treatment plans faster and better. Shivani et al. [74] first proposed the use of LTR in the search for promising genes. Based on RankBoost algorithm, their model identified some genes that were not identified by previous methods in a microarray dataset of leukemia and colon cancer and verified that the vast majority of the top 25 genes had known or potential links to the corresponding diseases and only a very small number had no links to the corresponding diseases. This study showed that LTR can be a powerful tool for mining gene-related data sources. Lee et al. [75] appropriately optimized RankBoost algorithm by adding a weighting function and adjusting parameters. In terms of average accuracy, receiver operating characteristic (ROC) and AUC measurements, their method outperformed the four gene prioritization methods in the ToppGene suite for ranking the 13 known genes.
Many methods to solve the gene prioritization problem have been summarized by Raj et al. [76] and are roughly divided into four types: text mining-based, network-based, machine learning-based and mixed modes. At present, there are few applications of LTR algorithms to gene prioritization, although LTR is an effective method for this problem. Therefore, it could be used more widely in such studies in the future.
Other applications
In addition to the types of studies introduced above, LTR has been used from many other perspectives, such as identification of promising peptides targeting target proteins [77], standardization of disease names [78], biomedical document retrieval [79], gene summary extraction [80] and protein folding energy design [81]. Since there is less work applying LTR from these angles, they will not be elaborated upon here.
Discussion
LTR has been gradually applied to multiple research tasks in bioinformatics. To summarize these applications, the algorithms and algorithm types used by these studies and some of the features that are input into the LTR framework in each study are listed in Table 1. This section summarizes the advantages and disadvantages of the LTR framework and proposes suggestions for better use of LTR to solve bioinformatics tasks in the future.
| Task | References | Year | LTR | LTR type | Features | Feature processing |
|---|---|---|---|---|---|---|
| Assignment of MeSH | Huang et al. [20] | 2011 | ListNet | List | Domain-specific knowledge | × |
| | Mao et al. [21] | 2013 | MART | Point | MTIFL | × |
| | MeSHLabeler [23] | 2015 | LambdaMART | List | BOW | × |
| | DeepMeSH [24] | 2016 | LambdaMART | List | D2V-TFIDF | × |
| | MeSH Now [22] | 2017 | LambdaMART | List | – | × |
| | FullMeSH [25] | 2019 | XGBoost | Pair | AttentionCNN | × |
| Exploration of protein homology | Weston et al. [32] | 2004 | Rankprop | Point | Protein similarity network | × |
| | Melvin et al. [33] | 2009 | Rankprop | Point | – | × |
| | ProtDec-LTR [34] | 2015 | LambdaMART | List | – | × |
| | ProtDec-LTR2.0 [41] | 2017 | – | | Pseudo protein representation | × |
| | ProtDec-LTR3.0 [42] | 2019 | – | | Profile-based | × |
| Prediction of protein structure and function | Stock et al. [57] | 2014 | RankRLS | | Cavity-based similarity measures | × |
| | MQAPRank [50] | 2017 | SVMRank | Pair | Knowledge-based potentials; evaluation scores | × |
| | RRCRank [45] | 2017 | SVMRank | Pair | Correlated mutations | × |
| | GOLabeler [55] | 2018 | LambdaMART | List | GO term frequency, protein families, domains and motifs | × |
| | NetGO [56] | 2019 | LambdaMART | List | Network-based | × |
| Relevant process of drug development | Agarwal et al. [62] | 2010 | RankSVM | Pair | Molprint2D fingerprint, FP2 fingerprint | × |
| | Rathke et al. [63] | 2011 | StructRank | Pair | Dragon | × |
| | Zhang et al. [66] | 2015 | SVMRank | Pair | General descriptor | Tensor product |
| | Dorr et al. [68] | 2015 | SVMRank | Pair | Extended-connectivity fingerprint | × |
| | DrugE-Rank [72] | 2016 | LambdaMART | List | Laplacian regularized least squares | × |
| | dCPPP [69] | 2017 | SVMRank | Pair | Tanimoto matrix | × |
| | Liu et al. [65] | 2017 | SVMRank | Pair | – | × |
| | PKRank [67] | 2017 | RankSVM | Pair | – | Pairwise kernel |
| | Rahangdale et al. [70] | 2018 | – | – | – | × |
| | Ohue et al. [64] | 2019 | RankSVM | Pair | | × |
| | pLETORg [71] | 2020 | – | Pair | Cosine similarities, Spearman rank correlation coefficient | × |
| | Our previous work [73] | 2020 | RFRanker | – | DT | PCA |
| Ranking of genes associated with disease | Shivani et al. [74] | 2009 | RankBoost | Pair | – | × |
| | Lee et al. [75] | 2013 | RankBoost | Pair | – | × |
| Identification of promising peptides | PeptideRank [77] | 2014 | LambdaMART | List | Peptide properties | Feature selection [88] |
| Standardization of disease names | DNorm [78] | 2013 | – | Pair | Term frequency-inverse document frequency | × |
| Biomedical document retrieval | Wu et al. [79] | 2014 | Coordinate ascent | List | – | × |
| Gene summary extraction | Shang et al. [80] | 2014 | ListNet | List | Gene ontology relevance, topic relevance, TextRank | × |
| Protein folding energy design | Guan et al. [81] | 2011 | Ranking SVM | Pair | – | × |
Note: The ‘Features’ column of this table lists only some of the more distinctive features in the corresponding research, not all of them. ‘–’ means that the corresponding information is not listed in detail. ‘×’ means no feature processing.
Advantages and future applications of LTR
Classification and regression tasks require the construction of negative datasets, which sometimes contain unvalidated samples, that is, samples whose status as positive or negative has not been determined. Such datasets can mislead the subsequent modeling steps. Ranking models circumvent this problem because they do not require a negative dataset to be constructed. LTR also addresses the problems of data heterogeneity and cross-targets, which have not been properly handled in conventional classification and regression tasks.
The ranking tasks in bioinformatics resemble the process by which users query information on the World Wide Web. This means that other tasks that follow a similar principle can also be solved by LTR. The emergence of COVID-19 is a serious threat to human health and has caused a large number of deaths worldwide. At the time of writing, there is no effective drug for this disease, so finding new uses for old drugs may become a means of treating it. LTR is well suited to such drug redirection problems: in this case, the currently known drugs can be used as LTR queries. In addition, LTR has already been applied to rank disease-related genes, which suggests that the ranking of disease-related microRNAs and the prioritization of protein complexes related to human diseases can also be solved by LTR.
Inherent flaws of LTR
In ranking tasks, researchers expect not only that the results are ranked in descending order of relevance but also that the information they contain is useful and has low redundancy. An LTR algorithm therefore has to attend to both the whole ranking list and its local structure. The pointwise and pairwise methods do not meet this requirement. The pointwise method scores each document independently and ignores the connections between documents. The pairwise method considers the relationship between two documents, which partly compensates for the weakness of the pointwise method, but it ignores the position of each document in the whole ranking list. The listwise method takes both the whole and the local information into account, but its high computational complexity means that it is often not efficient enough in practice. As can be observed from Table 1, pairwise and listwise methods are currently the most widely applied. In the future, a new ranking algorithm with good performance and fast computation could take the listwise principle as its core and use the pointwise and pairwise principles as supplements.
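The difference between the three families can be made concrete with a small sketch. The code below is an illustration only, not the loss of any specific method in Table 1: it assumes a squared-error loss for the pointwise case, a logistic loss over ordered pairs for the pairwise case and a ListNet-style top-one cross entropy for the listwise case. All function names are hypothetical.

```python
import math


def pointwise_loss(scores, labels):
    """Squared error: every document is judged on its own."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores)


def pairwise_loss(scores, labels):
    """Logistic loss over all pairs (i, j) where i should rank above j."""
    losses = [
        math.log1p(math.exp(-(scores[i] - scores[j])))
        for i in range(len(scores))
        for j in range(len(scores))
        if labels[i] > labels[j]
    ]
    return sum(losses) / len(losses) if losses else 0.0


def softmax(values):
    """Turn a list of numbers into a probability distribution."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def listwise_loss(scores, labels):
    """Cross entropy between the label and score distributions of one list."""
    p_true = softmax(labels)
    p_pred = softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(p_true, p_pred))


# One query with three documents; a higher label means more relevant.
labels = [2.0, 1.0, 0.0]
right_order = [3.0, 2.0, 1.0]   # correct ranking, but shifted score values
wrong_order = [1.0, 2.0, 3.0]   # reversed ranking

for name, loss in [("pointwise", pointwise_loss),
                   ("pairwise", pairwise_loss),
                   ("listwise", listwise_loss)]:
    print(name, round(loss(right_order, labels), 3),
          round(loss(wrong_order, labels), 3))
```

In this toy case the correctly ordered scores still incur a pointwise loss because their values differ from the labels, whereas the pairwise and listwise losses depend only on how the documents compare with each other within the list.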
Problems and optimization methods for the LTR model
By analyzing all the LTR frameworks mentioned in the Applications of LTR in bioinformatics section, we conclude that the poor performance of an LTR framework can stem from several sources.
(i) The samples used for the study: the samples may not be representative, may be redundant, may be distributed very unevenly across queries, and may be too similar between the training set and the test set. Data retrieved from online databases inevitably contain redundancy, and using such data directly prolongs the experimental cycle. If the samples are not selected properly and cannot effectively represent a certain type of information, studies based on them are of little significance. Therefore, the first step after obtaining the original data is to remove redundant data and then select representative samples. The experimental results will be biased toward queries with more samples, so the number of samples under each query should be as equal as possible. The test set is used to verify the generalization ability of the model. If the samples in the test set are highly similar to those in the training set, the performance will be overestimated and will not reflect how well the model generalizes. Therefore, the test set and the training set should be compared before the experiment to ensure that the subsequent results are meaningful.
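The sample-preparation steps above can be sketched as follows. This is a simplified illustration under stated assumptions: redundancy is reduced to exact duplicate (query, item) pairs, whereas real studies usually apply a sequence or structure similarity threshold; the record layout and the function names are hypothetical. Splitting by query keeps any query from appearing in both sets, but it does not by itself detect similar samples under different queries.

```python
import random
from collections import defaultdict


def remove_duplicates(samples):
    """Keep the first occurrence of each (query, item) pair."""
    seen, unique = set(), []
    for sample in samples:
        key = (sample["query"], sample["item"])
        if key not in seen:
            seen.add(key)
            unique.append(sample)
    return unique


def cap_per_query(samples, max_per_query, seed=0):
    """Subsample large queries so that no query dominates training."""
    by_query = defaultdict(list)
    for sample in samples:
        by_query[sample["query"]].append(sample)
    rng = random.Random(seed)
    capped = []
    for query in sorted(by_query):
        group = by_query[query]
        if len(group) > max_per_query:
            group = rng.sample(group, max_per_query)
        capped.extend(group)
    return capped


def split_by_query(samples, test_fraction=0.25, seed=0):
    """Hold out whole queries so train and test share no query."""
    queries = sorted({sample["query"] for sample in samples})
    rng = random.Random(seed)
    rng.shuffle(queries)
    n_test = max(1, round(len(queries) * test_fraction))
    test_queries = set(queries[:n_test])
    train = [s for s in samples if s["query"] not in test_queries]
    test = [s for s in samples if s["query"] in test_queries]
    return train, test


# Hypothetical records: a protein as the query, a compound as the item.
raw = [
    {"query": "P1", "item": "d1"},
    {"query": "P1", "item": "d1"},  # exact duplicate
    {"query": "P1", "item": "d2"},
    {"query": "P1", "item": "d3"},
    {"query": "P2", "item": "d1"},
    {"query": "P3", "item": "d4"},
    {"query": "P4", "item": "d5"},
]
clean = remove_duplicates(raw)
balanced = cap_per_query(clean, max_per_query=2)
train, test = split_by_query(balanced, test_fraction=0.25)
```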
(ii) Features input into LTR: the studies mentioned in the Applications of LTR in bioinformatics section show that the features can substantially change the performance of the final model. In protein remote homology detection, ProtDec-LTR [34], ProtDec-LTR2.0 [41] and ProtDec-LTR3.0 [42] extract features from different angles of information. However, the features input into an LTR framework are often one-sided, that is, only partial or shallow information is considered and deeper feature information is not mined.
The features that are now input into LTR frameworks fall broadly into two types: those extracted directly from protein (gene) sequences or compound information, and those that are the output of a baseline learner. Currently, there are many methods and angles for extracting features from protein (gene) and compound information. For proteins, there are amino acid composition-based, pseudo amino acid composition-based and evolutionary information-based methods. For compounds, there are 2D descriptors, chemical fingerprints and introversion of drug fingerprint values. Discrete Fourier transform [59] and wavelet transform [61] can transform drug structure information. Many toolkits have been developed to support this research, such as iFeature [82], Pse-in-One [83], OpenBabel (http://openbabel.org/wiki/Main_Page) and RDKit [84]. These tools can be fully utilized to extract more comprehensive sequence information and compound structure information. Performance can also be improved by using a feature mapping strategy to cross the information of proteins and compounds. For example, Zhang et al. [66] used a tensor product and PKRank [67] used a pairwise kernel. The output of a baseline learner can also be input into the LTR framework as features. In that case, a base learner with good performance should be constructed so that its output serves as strong evidence for the LTR model. To obtain information that is as comprehensive and accurate as possible, these sources can be integrated.
Protein–compound association features may play a role when both proteins and compounds are present in a sample. TargetGDrug [61] used a new post-processing procedure based on the drug correlation matrix to reduce potential false positives and false negatives in the initial prediction. This work can inspire future efforts to explore and exploit protein–compound association characteristics as fully as possible.
These three types of features exert different effects on different research tasks. These features can be comprehensively considered by assigning weights or adjusting parameters to obtain LTR models with better performance [85]. In addition, LTR can be combined with deep learning [86, 87] in the future to better solve tasks in bioinformatics.
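As a minimal sketch of two ideas discussed under (ii), the code below computes an amino acid composition vector for a protein sequence and crosses it with a compound vector through a flattened outer (tensor) product. It is an illustration under assumptions, not the exact feature construction of Zhang et al. [66] or of the toolkits named above: the function names, the example sequence and the four-bit fingerprint are hypothetical, and a real fingerprint would come from a cheminformatics toolkit.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def amino_acid_composition(sequence):
    """Fraction of each standard amino acid in a protein sequence."""
    sequence = sequence.upper()
    counts = [sequence.count(residue) for residue in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [count / total for count in counts]


def tensor_product(protein_features, compound_features):
    """Flattened outer product: one feature per (protein, compound) pair."""
    return [p * c for p in protein_features for c in compound_features]


protein = amino_acid_composition("MKTAYIAKQR")  # hypothetical sequence
fingerprint = [1, 0, 1, 1]                      # hypothetical 4-bit fingerprint
pair_features = tensor_product(protein, fingerprint)
print(len(pair_features))  # 20 protein features x 4 compound bits = 80
```

The crossed vector grows as the product of the two input lengths, which is one reason why the dimensionality problem discussed under (iii) arises quickly in protein–compound ranking tasks.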
(iii) In classification work, a high-dimensional feature set leads to overfitting and a long experimental period. This problem also plagues ranking tasks. In our previous study [73], a model built on features processed by PCA was verified to perform better. The feature selection algorithms used in classification consider not only the redundancy between features but also the correlation between features and labels. The feature selection algorithms used in LTR should likewise consider factors from multiple angles.
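One way to weigh both factors is a greedy relevance-minus-redundancy heuristic, sketched below. This is an assumption made for illustration, in the spirit of minimum-redundancy maximum-relevance selection; it is not the PCA procedure used in [73] nor the feature selection of any method in Table 1. Relevance is taken as the absolute Pearson correlation between a feature and the labels, and redundancy as the mean absolute correlation with the features already selected. The function names and the toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def select_features(columns, labels, k):
    """Greedily pick k feature indices by relevance minus redundancy."""
    relevance = [abs(pearson(column, labels)) for column in columns]
    selected = []
    remaining = list(range(len(columns)))

    def score(j):
        if not selected:
            return relevance[j]
        redundancy = sum(
            abs(pearson(columns[j], columns[s])) for s in selected
        ) / len(selected)
        return relevance[j] - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: feature 1 is a rescaled copy of feature 0, and feature 2
# carries weaker but independent information about the labels.
feature_a = [0, 0, 1, 1, 2, 2]
feature_b = [0, 0, 2, 2, 4, 4]
feature_c = [0, 1, 0, 1, 0, 1]
labels = [0, 1, 2, 3, 4, 5]

chosen = select_features([feature_a, feature_b, feature_c], labels, k=2)
print(chosen)
```

In this toy case the second pick is the weaker but independent feature rather than the rescaled copy, because the copy's relevance is cancelled by its redundancy with the feature already chosen.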
(iv) Finally, in several tasks related to new drug research and development, compounds can be ranked according to activity and selectivity. The ranking can also be performed according to other characteristics of the compound, such as toxicity. Therefore, the characteristic that defines relevance should be chosen to suit the ranking task. This point applies not only to new drug research and development but also to other tasks: appropriate correlation indicators must be selected to better capture the correlation between the query and the output results.
Summary
This paper summarizes specific applications of LTR in bioinformatics, analyzes the existing problems of these LTR frameworks and puts forward brief suggestions for better applying LTR in this field. This review can help researchers gain a preliminary understanding of LTR algorithms and guide them in using these algorithms for meaningful work, such as drug redirection.
Funding
New Energy and Industrial Technology Development Organization (NEDO) and the Japan Society for the Promotion of Science (JSPS), Grants-in-Aid for Scientific Research (grant no. 18H03250); National Natural Science Foundation of China (grant nos. 61922020, 61771331 and 91935302).
Conflicts of interest
The authors declare that they have no conflicts of interest.
Xiaoqing Ru is a PhD student at the University of Tsukuba. Her research interest is learning to rank.
Xiucai Ye is currently an assistant professor in the Department of Computer Science and Center for Artificial Intelligence Research (C-AIR), University of Tsukuba. Her current research interests include feature selection, clustering, machine learning and bioinformatics.
Tetsuya Sakurai is currently a professor in the Department of Computer Science and is the director of the C-AIR, University of Tsukuba. His research interests include high performance algorithms for large-scale simulations, data and image analysis and deep neural network computations.
Quan Zou is a professor at the University of Electronic Science and Technology of China. He is a senior member of IEEE and ACM. His research interests include bioinformatics, machine learning and algorithms.