Abstract

Over the past decades, learning to rank (LTR) algorithms have been gradually applied to bioinformatics, where they have shown significant advantages in multiple research tasks. It is therefore necessary to summarize and discuss these applications so that LTR algorithms can be used more conveniently and contribute further to bioinformatics. In this paper, the characteristics of LTR algorithms and their strengths over other types of algorithms are analyzed from the perspective of multiple applications in bioinformatics. Finally, the paper discusses the shortcomings of LTR algorithms, ways to make better use of them and some open problems that currently exist.

Introduction

Learning to rank (LTR) algorithms were originally applied in the field of information retrieval [1]. Over time, the amount of available information has exploded, and retrieving the required information from these data has become a challenging task. Completing this work by manual labor alone is expensive and too slow for current needs. Therefore, machine learning has been introduced into information retrieval, and LTR algorithms are an effective solution to the ranking problems in this field. The idea of LTR is simple and easy to understand. It is similar to searching for information on the World Wide Web: when users enter a query in a search engine, a series of results is recalled [2] and ranked according to relevance to the query, with highly relevant results at the front and weakly relevant results at the back [3].

At present, LTR has become a main technology of modern web search and has been applied in many fields, such as natural language processing and data mining [4]. It has also been introduced into bioinformatics, where it has been used to address problems including Medical Subject Headings (MeSH) indexing, protein homology detection, protein structure and function prediction and several tasks related to drug research and development.

LTR is indeed an effective class of algorithms that can be applied to bioinformatics, and a comprehensive survey of LTR should benefit the future development of this field. In addition, to solve more problems with LTR, it is necessary to understand its basic principles, range of application and distinctive properties. This paper summarizes the basic knowledge, existing open source software and specific applications of these algorithms in bioinformatics. The specific applications and the basic steps for solving problems with LTR are shown in Figure 1. By comparing the methods and models used in many bioinformatics studies, the advantages, disadvantages and future prospects of LTR algorithms are summarized. To the best of our knowledge, no review of this kind has been reported so far. Therefore, this paper has practical significance and value.

Figure 1

The specific applications of LTR in bioinformatics and the basic steps for solving problems with LTR. The tasks solved by LTR can be summarized as (A)–(D). Ranking of genes associated with disease can be classified under (C). The middle of the figure shows the steps for using LTR to process these tasks. Note: The structure diagram of the compound in this figure comes from the ChEMBL database (https://www.ebi.ac.uk/chembl/).


LTR

Introduction to LTR

The process of constructing an LTR framework is basically consistent with the steps for constructing a conventional classification model, namely constructing the dataset, extracting features, training the model and testing the model. However, the composition of the input dataset, the file format and the output are different.

It is not necessary to specifically collect a negative dataset when constructing the LTR framework. The data need to be integrated and processed in a fixed format, $\{q_i\ F^j\ R_i^j\}$, where $q_i$ represents a certain query, $F^j$ represents all features of sample $j$ and $R_i^j$ represents the degree of relevance (correlation) of sample $j$ to query $q_i$ [5]. This is similar to the file format used by libsvm (a software package commonly used for classification), except that a query identifier is added. Each query in LTR corresponds to more than one sample (document). The output is not a category but a score reflecting relevance to the query. Under the same query, the results are sorted in descending order of this score, and the scores are meaningful relative to each other rather than as exact relevance values. Finally, unlike classification algorithms, LTR has its own performance evaluation criteria, the most commonly used of which is Normalized Discounted Cumulative Gain (NDCG) [6].
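To make the evaluation criterion concrete, the following is a minimal sketch of NDCG for one query. It assumes the exponential gain $2^{rel}-1$ and a base-2 logarithmic discount, which is one common variant; other gain and discount choices exist, and the relevance labels are hypothetical.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions of a ranked list."""
    return sum((2 ** rel - 1) / math.log2(pos + 2)
               for pos, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """DCG of the given ordering divided by the DCG of the ideal ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Hypothetical relevance labels, listed in the order the model ranked the documents.
ranked_labels = [3, 2, 3, 0, 1, 2]
print(round(ndcg_at_k(ranked_labels, k=6), 4))
```

A value of 1 means the predicted order matches the ideal order for that query; in practice the per-query values are averaged over all queries.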

Based on the number of documents treated as one object under the same query, LTR algorithms can be divided into three types: pointwise, pairwise and listwise [3, 4]. The pointwise method [7, 8], which is computationally fast and less complex, treats one document as an object. Therefore, the relationship between documents is ignored, which may reduce its effectiveness. The pairwise method [9, 10] models the relative relationship between two documents, which addresses the shortcomings of the pointwise method to a certain extent. However, when the number of documents per query varies greatly, the pairwise combination of documents further magnifies the difference. In addition, this method ignores the position of a document in the whole list. The listwise method [11] takes all documents under the same query as the object and has a high computational complexity. This type of method compensates for the drawbacks of the pointwise method (ignoring the relationship between documents) and the pairwise method (ignoring the position of documents in the whole list). Therefore, researchers need to select the appropriate algorithm according to their specific needs.
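The difference between the three types can be sketched through the loss each one might minimize for a single query. The losses below (squared error, a Ranking SVM style hinge loss and a ListNet style top-one cross-entropy) are illustrative choices made for this sketch; the cited algorithms may use other formulations, and the scores and labels are hypothetical.

```python
import math

def pointwise_loss(scores, labels):
    """Squared error between each score and its label, one document at a time."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(scores)

def pairwise_hinge_loss(scores, labels, margin=1.0):
    """Hinge loss over every document pair whose labels differ."""
    losses = [max(0.0, margin - (scores[i] - scores[j]))
              for i in range(len(scores)) for j in range(len(scores))
              if labels[i] > labels[j]]
    return sum(losses) / len(losses) if losses else 0.0

def softmax(values):
    """Numerically stable softmax over a list of numbers."""
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def listwise_loss(scores, labels):
    """Cross-entropy between top-one probabilities of labels and scores."""
    target, predicted = softmax(labels), softmax(scores)
    return -sum(t * math.log(p) for t, p in zip(target, predicted))

labels = [2.0, 1.0, 0.0]        # graded relevance for three documents of one query
good = [3.0, 1.5, -1.0]         # scores that agree with the label order
bad = [-1.0, 1.5, 3.0]          # scores that reverse the label order
for name, loss in [("pointwise", pointwise_loss),
                   ("pairwise", pairwise_hinge_loss),
                   ("listwise", listwise_loss)]:
    print(name, round(loss(good, labels), 3), round(loss(bad, labels), 3))
```

Only the pairwise and listwise losses depend on how the documents of a query relate to each other; the pointwise loss would be unchanged if the documents came from different queries.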

Toolkits for LTR

Here, two open source toolkits, RankLib and $\mathrm{SVM}^{\mathrm{rank}}$, are introduced. These two toolkits make LTR more convenient to use.

$\mathrm{SVM}^{\mathrm{rank}}$ (https://www.cs.cornell.edu/people/tj/svm_light/svm_rank.html) covers only the Ranking SVM algorithm, which is widely used in bioinformatics. This algorithm belongs to the pairwise type and was proposed by Herbrich et al. [10] in 2000. Using the Ranking SVM algorithm and user click-through data, Joachims et al. [12] optimized the retrieval quality of a search engine. $\mathrm{SVM}^{\mathrm{rank}}$ is an open source toolkit implemented by Joachims et al. It is easy to use and includes one learning module and one prediction module. The learning module trains a model on the training dataset, and the prediction module then outputs predictions for the test set based on the trained model.
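As an illustration of the input file that such a toolkit reads, the sketch below formats samples as `<relevance> qid:<query id> <feature index>:<value> ...`, the libsvm-like layout with an added query identifier described earlier. The helper name `to_ranking_line` and all sample values are hypothetical, and the toolkit documentation should be consulted for the exact requirements.

```python
def to_ranking_line(relevance, query_id, features, comment=""):
    """Format one sample as '<relevance> qid:<id> 1:<v1> 2:<v2> ...'."""
    feats = " ".join(f"{index}:{value:g}"
                     for index, value in enumerate(features, start=1))
    line = f"{relevance} qid:{query_id} {feats}"
    return f"{line} # {comment}" if comment else line

# Hypothetical samples: (relevance, query id, feature vector, free-text note).
samples = [
    (3, 1, [1.0, 0.2, 0.0], "1A"),
    (1, 1, [0.1, 0.9, 0.3], "1B"),
    (2, 2, [0.5, 0.5, 1.0], "2A"),
    (1, 2, [0.0, 0.1, 0.7], "2B"),
]

# Keep all samples of one query together, ordered by query id.
for relevance, query_id, features, note in sorted(samples, key=lambda s: s[1]):
    print(to_ranking_line(relevance, query_id, features, note))
```

A file of such lines would be passed to the learning module (the `svm_rank_learn` program) to train a model, and the prediction module (`svm_rank_classify`) would then score a test file in the same format.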

On the other hand, RankLib (https://sourceforge.net/p/lemur/wiki/RankLib/) integrates a variety of LTR algorithms, including MART, RankNet, RankBoost, AdaRank, LambdaMART and ListNet. In addition to building and testing models, this tool supports cross-validation with sequential or random partitioning of the dataset. It can also compare the performance of different models and re-rank data.
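The two partitioning modes can be sketched as follows. This is not RankLib's implementation; it is a minimal sketch that assumes folds are formed from whole queries, so that all documents of one query stay in the same fold. The function name `query_folds` and the query ids are hypothetical.

```python
import random

def query_folds(query_ids, k, shuffle=False, seed=0):
    """Split the distinct query ids into k folds and return (train, test) pairs.

    shuffle=False gives sequential partitioning (queries kept in file order);
    shuffle=True gives random partitioning.
    """
    unique = list(dict.fromkeys(query_ids))  # distinct ids in first-seen order
    if shuffle:
        random.Random(seed).shuffle(unique)
    size, extra = divmod(len(unique), k)
    folds, start = [], 0
    for i in range(k):
        end = start + size + (1 if i < extra else 0)
        folds.append(unique[start:end])
        start = end
    return [([q for fold in folds[:i] + folds[i + 1:] for q in fold], folds[i])
            for i in range(k)]

# One query id per sample line of a hypothetical training file.
qids = [1, 1, 2, 2, 3, 4, 5]
for train, test in query_folds(qids, k=2):
    print("train queries:", train, "test queries:", test)
```

Splitting by query rather than by sample matters because ranking metrics such as NDCG are computed per query, so a query divided between training and test sets would leak information.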

Applications of LTR in bioinformatics

MeSH indexing

MeSH is developed and maintained by the National Library of Medicine (NLM). It is a tool for medical information retrieval that is used to catalog the documents, books and audiovisual materials recorded in the NLM. Assigning appropriate MeSH main headings (MHs) to biomedical literature can help researchers discover new knowledge and propose new scientific hypotheses while improving their productivity. Manually reviewing an article and assigning it the appropriate MHs costs approximately $9.4 [13]. With the rapid growth of the number of articles in the database, manual indexing consumes a great deal of time and money. Automatic indexing technology has alleviated the shortcomings of manual annotation to some extent. Early approaches relied on three technologies: the K-nearest neighbor algorithm [14], machine learning methods or probabilistic models linking document content with MeSH terms [15, 16] and domain-specific knowledge resources (such as MetaMap [17] and trigram [18]). For example, the Medical Text Indexer (MTI) developed by the NLM combines the K-nearest neighbor algorithm with domain-specific knowledge resources [19]. Unlike previous automatic MeSH indexing methods, Huang et al. [20] treated the task as a ranking problem. They first used the K-nearest neighbor algorithm to find articles similar to the target article, collected the MeSH terms assigned to these articles as an initial candidate list and then applied the ListNet algorithm to re-rank the candidate terms based on the learned association between the document text and each MeSH term. On two representative datasets, the superior performance of LTR was verified by comparative experiments against MTI, reflective random indexing and two strong baseline methods (i.e. neighborhood frequency and neighborhood similarity). Thus, LTR is indeed an effective means of solving the indexing problem.

Since then, multiple LTR frameworks have been developed. Mao et al. [21] extended the method of Huang et al. by introducing the output of MTIFL, a newer version of MTI, and by setting an automatic cut-off so that only a fixed number of the MeSH terms most relevant to the target article are returned. In addition, several LTR algorithms, including MART, RankNet, Coordinate Ascent, LambdaMART and AdaRank, were compared, and the authors concluded that MART was the most suitable algorithm for their study.
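The candidate generation step described above can be sketched as follows. The two features are simplified stand-ins for the neighborhood frequency and neighborhood similarity baselines; their exact definitions in the cited work may differ, and the neighbor articles, similarity values and function name `candidate_terms` are hypothetical.

```python
from collections import defaultdict

def candidate_terms(neighbors):
    """Collect MeSH terms from the k nearest articles with two simple features.

    neighbors: list of (similarity_to_target, [MeSH terms]) pairs.
    Returns {term: (neighborhood_frequency, neighborhood_similarity)}, where the
    frequency is the fraction of neighbors carrying the term and the similarity
    is the summed similarity of those neighbors.
    """
    count = defaultdict(int)
    similarity_sum = defaultdict(float)
    for similarity, terms in neighbors:
        for term in set(terms):
            count[term] += 1
            similarity_sum[term] += similarity
    k = len(neighbors)
    return {term: (count[term] / k, similarity_sum[term]) for term in count}

# Hypothetical nearest neighbors of one target article.
neighbors = [
    (0.9, ["Humans", "Neoplasms", "Mice"]),
    (0.7, ["Humans", "Neoplasms"]),
    (0.4, ["Humans", "Rats"]),
]
features = candidate_terms(neighbors)

# Baseline ordering by the two features; an LTR model would re-rank this list.
ranked = sorted(features, key=lambda term: features[term], reverse=True)
print(ranked)
```

In an LTR setting, the target article plays the role of the query, each candidate term is a document and features such as these form its feature vector.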

The current status and challenges of automatic MeSH indexing are as follows: (1) the distribution of MHs is uneven, that is, some MHs have appeared in millions of citations, while others have appeared in only dozens; (2) the number of MHs per citation varies greatly, as some citations may have 30 MHs, while others have only 5; (3) semantic and context-dependent information is not well captured and (4) indexing based on the full text has not been explored.

In LTR, the number of relevant documents corresponding to each query is not fixed. The above problems can therefore be addressed by combining this characteristic with complementary evidence extracted from multiple perspectives. MeSHLabeler, DeepMeSH, MeSH Now [22] and FullMeSH are models built with LTR algorithms on various types of evidence. For example, MeSHLabeler [23], which is based on the LambdaMART algorithm, integrates five types of information: global information, local information, dependence between subject headings, pattern recognition and MTI. The biomedical literature contains large numbers of domain phrases, concepts and abbreviations. One concept can be expressed by different words and one abbreviation may have different meanings, so the simple bag-of-words (BOW) method cannot capture the complex semantic information of biomedical documents, which limits model performance. DeepMeSH [24] introduced dense semantic representations, such as Word2Vec (W2V), Word2Phrase (W2P) and Document2Vec (D2V), to capture the semantic and contextual information of the text, which provided opportunities to improve the performance of automatic MeSH indexing. Many methods use only the title and abstract, so important information that appears only in the body text is ignored. FullMeSH [25] took the entire article as the research object, segmented the full text, obtained many types of evidence with deep learning and other traditional models and finally integrated this evidence into the LTR algorithm.

The research mentioned above shows that selecting appropriate LTR algorithms and capturing useful semantic information are effective means of improving the performance of MeSH indexing. However, a remaining challenge is that the MeSH vocabulary and indexing principles continue to change over time, which requires researchers to develop more robust models.

Exploration of protein homology

SCOP is one of the commonly used databases for protein remote homology detection, and it classifies proteins hierarchically based on their structural and evolutionary relationships [26]. The structural and functional properties of a novel protein can be inferred from its relationship to known groups. According to the SCOP database, proteins belonging to the same superfamily can be regarded as homologous, and proteins belonging to the same superfamily but not to the same family are regarded as remote homologs [27].

Homology can be judged by sequence similarity, and subtle sequence similarity often implies structural, functional and evolutionary relationships between proteins [28]. BLAST [29] and PSI-BLAST [30] are two common tools for exploring sequence similarity [31]. With these two methods, the related sequences are output in order of decreasing similarity to the given protein sequence. Therefore, exploring protein homology can be treated as a ranking task. Weston et al. [32] used the Rankprop algorithm and the entire network structure of similarity relationships among proteins to rank proteins. Rankprop can propagate through densely connected regions of protein similarity networks to detect subtle sequence relationships missed by methods such as PSI-BLAST. Subsequently, Melvin et al. [33] extended the Rankprop algorithm in two ways: propagating from proteins closely related to the query to improve accuracy and mapping scores to empirical values to improve the interpretability of the scores produced by Rankprop.

The structure and function of proteins are more conserved than their sequences during natural evolution. Protein remote homology detection is one of the basic problems in computational biology, and its purpose is to find remote homologous proteins with low sequence similarity but similar structure and function. Many methods have been developed for this task, and they can be classified into three types: alignment methods, classification methods and ranking methods [27, 34]. Alignment methods can be regarded as the basis of classification and ranking methods, and classification methods are a transition between alignment and ranking methods. Ranking methods can combine the information of alignment and classification methods, inheriting their advantages as far as possible and overcoming their disadvantages to a certain extent. ProtDec-LTR [34] is a ranking model that integrates the outputs of three alignment algorithms (i.e. PSI-BLAST, HHblits and ProtEmbed) into an LTR algorithm. Experiments verified that ProtDec-LTR performs better than each of the three alignment algorithms used alone. Pseudo-amino acid representation takes residue mutations, deletions and insertions in protein sequences into account and can find conserved domains in structures [35, 36], which facilitates remote homology detection of proteins [37–40]. ProtDec-LTR2.0 [41] first converted all proteins in the query and benchmark databases into pseudo proteins and then followed the procedure of ProtDec-LTR. As a result, ProtDec-LTR2.0 was significantly superior to ProtDec-LTR in terms of ROC1. Profile-based features have shown strong discriminative power in classification methods for identifying remote homologous proteins. ProtDec-LTR3.0 [42] integrated ACC and top-gram features processed by feature mapping strategies into an LTR algorithm. The features extracted by ACC and top-gram not only reflect the evolutionary information of proteins but also contain sequence features. On this basis, the accuracy of protein remote homology detection was further improved by the PageRank and HITS algorithms. As a result, ProtDec-LTR3.0 achieved better performance than ProtDec-LTR, ProtDec-LTR2.0 and nine other advanced predictors.

Prediction of protein structure and function

Proteins are the main bearers of life activities. Research on proteins is mainly divided into two areas: structural proteomics and functional proteomics.

Protein structure is more conserved than sequence, and exploring protein structure is important for biology, medicine and pharmacy. In protein tertiary structure prediction, for many known protein sequences no similar sequence with a known structure can be found; that is, there is neither a homologous protein nor a remote homologous protein with a known structure. The main purpose of de novo protein structure prediction is to predict the 3D structure of a protein from its sequence. However, because of the very high dimensionality of the unrestrained search space, de novo protein structure prediction has had limited success and is not generally suitable for large-scale, high-accuracy prediction of protein structures [43]. Protein residue–residue contact prediction can effectively constrain the conformational search space: contact-based methods first predict the residue–residue contacts from the sequence and then use these predicted contacts as constraints to predict the tertiary structure [44]. In recent years, fusion methods have been widely applied to residue–residue contact prediction. RRCRank [45] is a fusion method based on a ranking strategy for predicting residue–residue contacts. The model used the SVMRank algorithm, and the proposed ranking strategy showed better performance for all three contact types (short, medium and long).

An increasing number of structure models are being produced by protein structure prediction methods, which typically generate multiple decoy models for one sequence. To determine the reliability of the predicted models, it is necessary to evaluate their quality [46]. The methods used to assess the quality of protein models can be roughly divided into three categories: single methods [47], quasi-single methods [48] and cluster (or consensus) methods [49]. MQAPRank [50] combined a single method and a quasi-single method and ranked the decoy models based on their similarity to the corresponding native structure. First, a single method based on the LTR algorithm was used to rank the decoy models of each target protein by relative quality. Then, the top five decoy models were used as references, and the predicted quality of each model was the average GDT_TS score between that model and the five reference models. On the CASP11 and 3DRobot datasets, MQAPRank outperformed other leading protein model quality assessment methods. It participated in CASP12 under the group name FDUBio and achieved state-of-the-art performance. One of the three reasons given for the effectiveness of MQAPRank is that the LTR framework provides a reasonable ranking of protein decoy models for target proteins. Therefore, LTR is indeed an effective means of protein model quality assessment.

The study of protein function is conducive to exploring disease mechanisms. Protein function can be predicted based on the similarity of protein sequences [51], and similarity-based methods such as BLAST and PSI-BLAST are indeed competitive [52, 53]. However, such methods are less effective when the sequence identity is less than 60%. In addition, the advent of the Gene Ontology (GO) [54] has brought convenience to protein function prediction in bioinformatics. LTR can be applied to automated function prediction by treating GO terms as documents and proteins as queries. GOLabeler [55] constructed an LTR model based on sequence homology information and various types of evidence extracted from sequence information. The challenges are similar to those in the MeSH indexing task: each protein has multiple GO terms, and the number of GO terms per protein varies greatly. GOLabeler effectively integrated various types of information by using the LambdaMART algorithm and addressed these challenges to some extent. Sequence information is only part of the information available for a protein, so integrating other useful information is a key step toward improving protein function prediction. NetGO [56] further improved the performance of protein function prediction by incorporating a large amount of protein network information. LTR has also been used to predict enzyme function. Stock et al. [57] used four different cavity-based similarity measures and a sequence alignment-based measure as input to RankRLS to identify functionally related enzymes.

Relevant process of drug development

Selecting promising compounds (candidate drugs), exploring drug–target interactions and exploring drug–cell line relationships all benefit drug development. The methods used for these studies can be broadly divided into two types: similarity-based and feature-based. Similarity-based methods compare compounds (or proteins) of unknown function with compounds (or proteins) of known function [58]. Feature-based methods extract features from compound chemical structures, gene (or protein) sequences or structural information and build subsequent analyses on these features [59–61]. LTR is a powerful feature-based technique. It can integrate various types of features, including those derived from similarity-based methods, but it is not yet widely used in drug development research. Its applications in different research tasks are described in detail below.

Virtual screening aims to identify a group of candidate compounds and is an important step in the early stage of drug discovery. Agarwal et al. [62] first introduced LTR into ligand-based virtual screening and demonstrated the effectiveness and superiority of the SVM ranking algorithm (i.e. RankSVM) by comparing it with SVM classification and SVM regression. Virtual screening should focus on the top-ranked compounds, where the ranking matters most. Rathke et al. [63] focused on the topmost ranks by optimizing an NDCG-based ranking loss. Ohue et al. [64] further improved the accuracy of virtual screening by ignoring the ranking between compounds with similar activity and the ranking between inactive compounds. At present, LTR is limited by the sparsity of valuable data, and Liu et al. [65] used additional bioassay and compound information to alleviate this data sparseness.
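The idea of ignoring uninformative pairs can be sketched for a pairwise learner as below. This is not the implementation of the cited study; the thresholds `inactive_below` and `min_gap`, the activity scale and the function name `training_pairs` are assumptions made only for illustration.

```python
from itertools import combinations

def training_pairs(activities, inactive_below=5.0, min_gap=0.5):
    """Return (better, worse) index pairs, skipping uninformative pairs.

    A pair is skipped when both compounds are inactive or when their
    activities differ by less than min_gap.
    """
    pairs = []
    for i, j in combinations(range(len(activities)), 2):
        a, b = activities[i], activities[j]
        if a < inactive_below and b < inactive_below:
            continue  # both inactive: their relative order is ignored
        if abs(a - b) < min_gap:
            continue  # similar activity: their relative order is ignored
        pairs.append((i, j) if a > b else (j, i))
    return pairs

# Hypothetical activity values for five compounds (higher means more active).
activities = [8.0, 7.8, 6.0, 4.0, 3.0]
print(training_pairs(activities))
```

With these values, the pair of the two most active compounds (too similar) and the pair of the two inactive compounds are dropped, so the learner spends its capacity on distinctions that matter for screening.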

Many problems faced in compound screening can also be addressed by LTR. (1) Interactions between drugs and targets may be measured on different platforms or with different affinity criteria. Zhang et al. [66] used LTR to learn simultaneously from experimental data obtained under different experimental conditions and for different targets, and the results demonstrated that LTR is an efficient computational strategy for virtual drug screening, particularly in its new use for cross-target virtual screening and heterogeneous data integration. The direct combination of target and compound features can represent only limited information. Therefore, the features used in their study were combined by the tensor product, which yielded a feature set of very high dimension (4704 dimensions). PKRank [67] generalizes the method of Zhang et al.: it no longer uses the tensor product to process features and instead generates the Gram matrix of a pairwise kernel K. (2) Especially in the treatment of complex diseases, a ligand is expected to interact with multiple targets. In LTR, the relationship between a query and its documents is usually one-to-many, so LTR can be used to solve the multi-target problem. Dorr et al. [68] simultaneously learned from compounds with different activity profiles and priorities based on the SVMRank algorithm; the labeling of each compound was specifically designed so that a virtual screening model for multiple targets could be inferred. (3) Key properties that successful drugs must exhibit are selectivity and biological activity. Selectivity refers to the phenomenon in which different organs and tissues of the body differ significantly in their sensitivity to a compound (drug); that is, a compound has a particularly strong effect on a certain organ or tissue but a very weak effect, or none, on other tissues. dCPPP [69] learned a compound-first model and ranked the active compounds effectively. At the same time, selective compounds were pushed higher in the ranking by a bidirectional selectivity push strategy. In dCPPP, both activity ranking and selectivity prioritization were addressed in a differential optimization model.
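The relationship between tensor product features and a pairwise kernel, mentioned in item (1), can be illustrated with a small sketch. It assumes linear kernels and the product form $K((t,c),(t',c')) = K_t(t,t')\,K_c(c,c')$, which is one common pairwise kernel; whether the cited methods use exactly this form is not stated here, and the feature values are hypothetical.

```python
def tensor_product(target_features, compound_features):
    """Flattened outer product: one feature per (target, compound) feature pair."""
    return [t * c for t in target_features for c in compound_features]

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def product_kernel(pair_a, pair_b):
    """Linear pairwise kernel: target kernel multiplied by compound kernel."""
    (target_a, compound_a), (target_b, compound_b) = pair_a, pair_b
    return dot(target_a, target_b) * dot(compound_a, compound_b)

# Hypothetical target and compound feature vectors.
pair_a = ([1, 2], [3, 4, 5])
pair_b = ([0, 1], [1, 1, 2])

explicit_a = tensor_product(*pair_a)
explicit_b = tensor_product(*pair_b)
print(len(explicit_a))                   # dimension is the product of both sizes
print(dot(explicit_a, explicit_b))       # inner product in the explicit space
print(product_kernel(pair_a, pair_b))    # same value without building the features
```

Because the kernel value equals the inner product of the explicit tensor product features, a kernel-based ranker can work with the Gram matrix alone and avoid constructing the high-dimensional feature vectors.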

Selecting appropriate drugs for cancer patients is an important part of precision medicine. Rahangdale et al. [70] used LTR for screening and prioritizing cancer-specific drugs, and this study not only maintained the order between sensitive and insensitive drugs but also maintained the order between sensitive drugs. In addition, pLETORg [71] used LTR for three specific application scenarios to screen drugs: selecting sensitive drugs from new drugs for each known cell line; selecting sensitive drugs from all available drugs, including new drugs and known drugs for each known cell line; and selecting sensitive drugs from all available drugs for new cell lines.

LTR has also been applied to explore drug–target interactions. DrugE-Rank [72] fed general drug descriptors, the composition, transition and distribution descriptors of targets and the outputs of six similarity-based methods into an LTR algorithm, which improved the prediction of drug–target interactions for new drug candidates or targets. In addition, our previous work [73] addressed compound ranking and potential target identification by applying LTR to two different types of datasets and achieved good results.

Ranking of genes associated with disease

Identifying genes related to specific diseases is one of the greatest challenges in medical research. Finding the key genes affecting a specific disease can not only help in understanding the disease mechanism but also help health care workers develop treatment plans faster and better. Shivani et al. [74] first proposed the use of LTR in the search for promising genes. Based on the RankBoost algorithm, their model identified some genes in microarray data for leukemia and colon cancer that had not been identified by previous methods. The vast majority of the top 25 genes had known or potential links to the corresponding diseases, and only a very small number had no such links. This study showed that LTR can be a powerful tool for mining gene-related data sources. Lee et al. [75] optimized the RankBoost algorithm by adding a weighting function and adjusting parameters. In terms of average accuracy, receiver operating characteristic (ROC) and area under the curve (AUC) measurements, their method outperformed the four gene prioritization methods in the ToppGene suite for ranking the 13 known genes.

Many methods for solving the gene prioritization problem have been summarized by Raj et al. [76] and are roughly divided into four types: text mining-based, network-based, machine learning-based and mixed modes. At present, there are few applications of LTR algorithms to this problem, although LTR is indeed an effective method for gene prioritization. It could therefore be used more widely in such studies in the future.

Other applications

In addition to the types of studies introduced above, LTR has been applied to other tasks, such as identification of promising peptides targeting target proteins [77], standardization of disease names [78], biomedical document retrieval [79], gene summary extraction [80] and protein folding energy design [81]. Since fewer studies have applied LTR to these tasks, they are not elaborated upon here.

Discussion

LTR has been gradually applied to multiple research tasks in bioinformatics. To summarize these applications, Table 1 lists the algorithms and algorithm types used by these studies, together with some of the features fed into the LTR framework in each study. This section summarizes the advantages and disadvantages of the LTR framework and offers suggestions for making better use of LTR to solve bioinformatics tasks in the future.

Table 1

Summary of studies using LTR

| Task | Reference | Year | LTR algorithm | LTR type | Features | Feature processing |
| --- | --- | --- | --- | --- | --- | --- |
| Assignment of MeSH | Huang et al. [20] | 2011 | ListNet | List | Domain-specific knowledge | × |
| | Mao et al. [21] | 2013 | MART | Point | MTIFL | × |
| | MeSHLabeler [23] | 2015 | LambdaMART | List | BOW | × |
| | DeepMeSH [24] | 2016 | LambdaMART | List | D2V-TFIDF | × |
| | MeSH Now [22] | 2017 | LambdaMART | List | – | × |
| | FullMeSH [25] | 2019 | XGBoost | Pair | AttentionCNN | × |
| Exploration of protein homology | Weston et al. [32] | 2004 | Rankprop | Point | Protein similarity network | × |
| | Melvin et al. [33] | 2009 | Rankprop | Point | – | × |
| | ProtDec-LTR [34] | 2015 | LambdaMART | List | – | × |
| | ProtDec-LTR2.0 [41] | 2017 | – | – | Pseudo protein representation | × |
| | ProtDec-LTR3.0 [42] | 2019 | – | – | Profile-based | × |
| Prediction of protein structure and function | Stock et al. [57] | 2014 | RankRLS | – | Cavity-based similarity measures | × |
| | MQAPRank [50] | 2017 | SVMRank | Pair | Knowledge-based potentials; evaluation scores | × |
| | RRCRank [45] | 2017 | SVMRank | Pair | Correlated mutations | × |
| | GOLabeler [55] | 2018 | LambdaMART | List | GO term frequency, protein families, domains and motifs | × |
| | NetGO [56] | 2019 | LambdaMART | List | Network-based | × |
| Relevant process of drug development | Agarwal et al. [62] | 2010 | RankSVM | Pair | Molprint2D fingerprint, FP2 fingerprint | × |
| | Rathke et al. [63] | 2011 | StructRank | Pair | Dragon | × |
| | Zhang et al. [66] | 2015 | SVMRank | Pair | General descriptor | Tensor product |
| | Dorr et al. [68] | 2015 | SVMRank | Pair | Extended-connectivity fingerprint | × |
| | DrugE-Rank [72] | 2016 | LambdaMART | List | Laplacian regularized least squares | × |
| | dCPPP [69] | 2017 | SVMRank | Pair | Tanimoto matrix | × |
| | Liu et al. [65] | 2017 | SVMRank | Pair | – | × |
| | PKRank [67] | 2017 | RankSVM | Pair | – | Pairwise kernel |
| | Rahangdale et al. [70] | 2018 | – | – | – | × |
| | Ohue et al. [64] | 2019 | RankSVM | Pair | – | × |
| | pLETORg [71] | 2020 | – | Pair | Cosine similarities, Spearman rank correlation coefficient | × |
| | Our previous work [73] | 2020 | RFRanker | – | DT | PCA |
| Ranking of genes associated with disease | Shivani et al. [74] | 2009 | RankBoost | Pair | – | × |
| | Lee et al. [75] | 2013 | RankBoost | Pair | – | × |
| Identification of promising peptides | PeptideRank [77] | 2014 | LambdaMART | List | Peptide properties | Feature selection [88] |
| Standardization of disease names | DNorm [78] | 2013 | – | Pair | Term frequency-inverse document frequency | × |
| Biomedical document retrieval | Wu et al. [79] | 2014 | Coordinate ascent | List | – | × |
| Gene summary extraction | Shang et al. [80] | 2014 | ListNet | List | Gene ontology relevance, topic relevance, TextRank | × |
| Protein folding energy designing | Guan et al. [81] | 2011 | Ranking SVM | Pair | – | × |

Note: The ‘Features’ column of this table only lists some of the more distinctive features in the corresponding research, not all the features. ‘–’ means that the corresponding information is not be listed in detail. ‘×’ means no feature processing.

Table 1

Summary of studies using LTR

| Task | References | Year | LTR | LTR type | Features | Feature processing |
|---|---|---|---|---|---|---|
| Assignment of MeSH | Huang et al. [20] | 2011 | ListNet | List | Domain-specific knowledge | × |
| | Mao et al. [21] | 2013 | MART | Point | MTIFL | × |
| | MeSHLabeler [23] | 2015 | Lambda MART | List | BOW | × |
| | DeepMeSH [24] | 2016 | Lambda MART | List | D2V-TFIDF | × |
| | MeSH Now [22] | 2017 | Lambda MART | List | – | × |
| | FullMeSH [25] | 2019 | XGBoost | Pair | AttentionCNN | × |
| Exploration of protein homology | Weston et al. [32] | 2004 | Rankprop | Point | Protein similarity network | × |
| | Melvin et al. [33] | 2009 | Rankprop | Point | – | × |
| | ProtDec-LTR [34] | 2015 | Lambda MART | List | – | × |
| | ProtDec-LTR2.0 [41] | 2017 | – | – | Pseudo protein representation | × |
| | ProtDec-LTR3.0 [42] | 2019 | – | – | Profile-based | × |
| Prediction of protein structure and function | Stock et al. [57] | 2014 | RankRLS | – | Cavity-based similarity measures | × |
| | MQAPRank [50] | 2017 | SVMRank | Pair | Knowledge-based potentials; evaluation scores | × |
| | RRCRank [45] | 2017 | SVMRank | Pair | Correlated mutations | × |
| | GOLabeler [55] | 2018 | Lambda MART | List | GO term frequency, protein families, domains and motifs | × |
| | NetGO [56] | 2019 | Lambda MART | List | Network-based | × |
| Relevant process of drug development | Agarwal et al. [62] | 2010 | RankSVM | Pair | Molprint2D fingerprint, FP2 fingerprint | × |
| | Rathke et al. [63] | 2011 | StructRank | Pair | Dragon | × |
| | Zhang et al. [66] | 2015 | SVMRank | Pair | General descriptor | Tensor product |
| | Dorr et al. [68] | 2015 | SVMRank | Pair | Extended-connectivity fingerprint | × |
| | DrugE-Rank [72] | 2016 | Lambda MART | List | Laplacian regularized least squares | × |
| | dCPPP [69] | 2017 | SVMRank | Pair | Tanimoto matrix | × |
| | Liu et al. [65] | 2017 | SVMRank | Pair | – | × |
| | PKRank [67] | 2017 | RankSVM | Pair | – | Pairwise kernel |
| | Rahangdale et al. [70] | 2018 | – | – | – | × |
| | Ohue et al. [64] | 2019 | RankSVM | Pair | – | × |
| | pLETORg [71] | 2020 | – | Pair | Cosine similarities, Spearman rank correlation coefficient | × |
| | Our previous work [73] | 2020 | RFRanker | – | DT | PCA |
| Ranking of genes associated with disease | Shivani et al. [74] | 2009 | RankBoost | Pair | – | × |
| | Lee et al. [75] | 2013 | RankBoost | Pair | – | × |
| Identification of promising peptides | PeptideRank [77] | 2014 | Lambda MART | List | Peptide properties | Feature selection [88] |
| Standardization of disease names | DNorm [78] | 2013 | – | Pair | Term frequency-inverse document frequency | × |
| Biomedical document retrieval | Wu et al. [79] | 2014 | Coordinate ascent | List | – | × |
| Gene summary extraction | Shang et al. [80] | 2014 | ListNet | List | Gene ontology relevance, topic relevance, TextRank | × |
| Protein folding energy designing | Guan et al. [81] | 2011 | Ranking SVM | Pair | – | × |

Note: The ‘Features’ column of this table lists only some of the more distinctive features used in the corresponding research, not all of them. ‘–’ means that the corresponding information is not listed in detail. ‘×’ means no feature processing.

Advantages and future applications of LTR

Classification and regression tasks require the construction of negative datasets, which sometimes contain unvalidated samples, that is, samples whose positive or negative status has not been determined. Such datasets undermine the reliability of subsequent steps. Ranking models circumvent this problem because they do not require negative datasets to be constructed. LTR also addresses the problems of data heterogeneity and cross-targets, which have not been properly handled in conventional classification and regression tasks.

The process of these ranking tasks in bioinformatics is similar to the process by which users query information on the World Wide Web. This means that other tasks that follow a similar principle can also be solved by LTR. The emergence of COVID-19 is a serious threat to human health and has caused a large number of deaths worldwide. There is no effective drug for this disease, so finding new uses for old drugs may become a means of treating it. LTR is well suited to such drug redirection problems; in this case, the currently known drugs can be used as LTR queries. In addition, LTR has been applied to rank disease-related genes. By the same reasoning, the ranking of disease-related microRNAs and the prioritization of protein complexes related to human diseases can also be approached with LTR.
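As a rough illustration of how such a task might be cast in query and document form, the sketch below writes candidate lists in the `label qid:N index:value` text layout read by tools such as SVMrank and RankLib. The drug names, target names, feature values and relevance grades are invented placeholders, and `to_letor_lines` is a hypothetical helper rather than part of either tool.

```python
def to_letor_lines(records):
    """Format (query, candidate, relevance, features) tuples as LETOR-style lines.

    Each distinct query receives an integer qid in order of first appearance,
    and the output lines are grouped by qid.
    """
    qids = {}
    rows = []
    for query, candidate, relevance, features in records:
        # Reuse the qid of a known query, otherwise assign the next integer.
        qid = qids.setdefault(query, len(qids) + 1)
        feats = " ".join(f"{i}:{value:g}" for i, value in enumerate(features, start=1))
        rows.append((qid, f"{relevance} qid:{qid} {feats} # {query}|{candidate}"))
    rows.sort(key=lambda row: row[0])  # stable sort keeps input order within a query
    return [line for _, line in rows]


# Hypothetical data: known drugs act as queries, candidate targets as documents.
records = [
    ("drug_A", "target_1", 2, [0.81, 0.10]),
    ("drug_B", "target_1", 0, [0.05, 0.40]),
    ("drug_A", "target_2", 1, [0.33, 0.72]),
]
for line in to_letor_lines(records):
    print(line)
```

Grouping by query is what distinguishes this input from a flat classification table: relevance grades are only compared among candidates that share a `qid`.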

Inherent flaws of LTR

In ranking tasks, researchers expect not only that the results are ranked in descending order of relevance but also that the information they contain is effective and has low redundancy. This requires attention to both the whole list and its local structure when designing an LTR algorithm. The pointwise and pairwise methods do not meet this requirement. The pointwise method ignores the connections between documents. The pairwise method considers the relationship between two documents, which compensates for this shortcoming of the pointwise method to a certain extent, but it ignores the position of documents in the whole ranking list. The listwise method takes both global and local information into account, but its high computational complexity limits its efficiency. As can be observed from Table 1, pairwise and listwise methods are currently the most widely applied. In the future, to design a new ranking algorithm with good performance and fast computation, the listwise principle can serve as the core, supplemented by the pointwise and pairwise principles.
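The difference between the three families can be made concrete with one representative loss for each. The sketch below is a minimal illustration, assuming NumPy, a squared-error pointwise loss, a hinge pairwise loss and a ListNet-style top-one cross-entropy as the listwise loss; these are common textbook choices, not the exact objectives of any study in Table 1.

```python
import numpy as np


def pointwise_loss(scores, labels):
    """Squared error per document; each document is treated independently."""
    return float(np.mean((scores - labels) ** 2))


def pairwise_loss(scores, labels):
    """Hinge loss over all pairs (i, j) in which document i is more relevant than j."""
    label_diff = labels[:, None] - labels[None, :]
    score_diff = scores[:, None] - scores[None, :]
    mask = label_diff > 0  # keep only correctly ordered reference pairs
    if not mask.any():
        return 0.0
    return float(np.mean(np.maximum(0.0, 1.0 - score_diff[mask])))


def listwise_loss(scores, labels):
    """Top-one cross-entropy between softmax(labels) and softmax(scores)."""
    def softmax(x):
        e = np.exp(x - np.max(x))
        return e / e.sum()

    p_true = softmax(labels.astype(float))
    p_pred = softmax(scores.astype(float))
    return float(-np.sum(p_true * np.log(p_pred)))


labels = np.array([2.0, 1.0, 0.0])       # graded relevance for one query
well_ordered = np.array([3.0, 2.0, 1.0])  # correct order, shifted scores
reversed_order = np.array([1.0, 2.0, 3.0])

for name, fn in [("pointwise", pointwise_loss),
                 ("pairwise", pairwise_loss),
                 ("listwise", listwise_loss)]:
    print(name, fn(well_ordered, labels), fn(reversed_order, labels))
```

In this toy case the pointwise loss still penalizes `well_ordered` although its ranking is perfect, because it compares raw values rather than order, while the pairwise term enumerates every document pair and the listwise term looks at the whole list at once.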

Problems and optimization methods to the LTR model

An analysis of the LTR frameworks described in the Applications of LTR in bioinformatics section suggests that poor performance of an LTR framework can stem from several sources.

  • (i) The samples used for the study: common problems are that the samples are not representative, that there is redundancy between samples, that the number of samples per query varies greatly and that the training and test sets are too similar. Data retrieved from online databases inevitably contain redundancy, and using such data directly prolongs the experimental cycle. If the samples are not selected properly and cannot effectively represent a certain type of information, studies based on them are of little significance. Therefore, the first step after obtaining the original data should be to remove redundant samples and then select representative ones. Experimental results are biased toward queries with more samples; therefore, the number of samples under each query should be as equal as possible. The test set is used to verify the model’s generalization ability. If the test samples are highly similar to the training samples, performance will be overestimated and will not reflect the model’s generalization ability. Therefore, the test set and the training set should be compared before the experiment to ensure that the subsequent results are meaningful.

  • (ii) Features input into LTR: the studies mentioned in the Applications of LTR in bioinformatics section show that the choice of features can change the performance of the final model. In protein remote homology detection, ProtDec-LTR [34], ProtDec-LTR2.0 [41] and ProtDec-LTR3.0 [42] extract features from different angles of information. However, the features input into an LTR framework are often one-sided, that is, only partial or shallow information is considered, and deeper feature information is not mined.

The features currently input into LTR frameworks fall broadly into two types: those extracted directly from protein (gene) sequences or compound information, and the outputs of baseline learners. Many methods and perspectives exist for extracting features from protein (gene) and compound information. For proteins, there are amino acid composition-based, pseudo amino acid composition-based and evolutionary information-based methods. For compounds, there are 2D descriptors, chemical fingerprints and introversion of drug fingerprint values. Discrete Fourier transform [59] and wavelet transform [61] can be used to transform drug structure information. Many toolkits have been developed to assist this research, such as iFeature [82], Pse-in-One [83], OpenBabel (http://openbabel.org/wiki/Main_Page) and RDKit [84]. These tools can be fully utilized to extract more comprehensive sequence and compound structure information. Performance can also be improved by using a feature mapping strategy to cross protein and compound information. For example, Zhang et al. [66] used a tensor product and PKRank [67] used a pairwise kernel. The outputs of a baseline learner can also be input into the LTR framework as features. A base learner with good performance should be constructed so that its output provides strong evidence for the LTR model. Therefore, to obtain information that is as comprehensive and accurate as possible, an ensemble (integration) strategy can be used.
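The general idea behind crossing protein and compound information can be sketched as follows. This is only an illustration of the tensor-product and pairwise-kernel principle, assuming NumPy and invented feature vectors; it is not claimed to reproduce the exact formulations of [66] or [67], and the function names are hypothetical.

```python
import numpy as np


def tensor_product_features(protein_vec, compound_vec):
    """Cross every protein feature with every compound feature.

    For 1-D inputs of length m and n, np.kron returns the flattened
    outer product, a vector of length m * n.
    """
    return np.kron(protein_vec, compound_vec)


def pairwise_kernel(p1, c1, p2, c2):
    """Linear pairwise kernel: product of the protein and compound inner products."""
    return float(np.dot(p1, p2) * np.dot(c1, c2))


# Invented toy descriptors for two protein-compound pairs.
p1, c1 = np.array([0.2, 0.7, 0.1]), np.array([1.0, 0.0])
p2, c2 = np.array([0.5, 0.1, 0.4]), np.array([0.3, 0.9])

explicit = float(np.dot(tensor_product_features(p1, c1),
                        tensor_product_features(p2, c2)))
implicit = pairwise_kernel(p1, c1, p2, c2)
print(explicit, implicit)  # the two values agree up to floating-point error
```

The inner product of two tensor-product vectors equals the product of the separate inner products, so a kernel method can use the crossed representation without materializing the m × n features.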

Protein–compound association features may play a role when both proteins and compounds are present in the sample. TargetGDrug [61] used a new post-processing procedure based on the drug correlation matrix to reduce potential false positives and false negatives in the initial prediction. This work can serve as inspiration for exploring and exploiting protein–compound association features in future work.

These three types of features exert different effects on different research tasks. These features can be comprehensively considered by assigning weights or adjusting parameters to obtain LTR models with better performance [85]. In addition, LTR can be combined with deep learning [86, 87] in the future to better solve tasks in bioinformatics.

  • (iii) In classification work, a high-dimensional feature set leads to overfitting and long experimental periods. This problem also plagues ranking tasks. In our previous study [73], a model built on features processed by PCA was verified to have better performance. Feature selection algorithms used in classification consider not only the redundancy between features but also the correlation between features and labels. Feature selection algorithms used in LTR should likewise consider factors from multiple angles.

  • (iv) Finally, in several tasks related to new drug research and development, compounds can be ranked according to activity and selectivity. The ranking can also be performed according to other characteristics of the compound (such as toxicity). Therefore, appropriate characteristics should be selected for the ranking task. This applies not only to new drug research and development but also to other tasks: appropriate relevance indicators must be selected to better capture the correlation between the query and the output results.
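The sample-preparation advice in item (i) can be sketched in a few lines. The example below is a simplified illustration using only the Python standard library: it treats redundancy as exact duplicates of a (query, item) pair, whereas real bioinformatics pipelines usually rely on sequence or structure similarity thresholds. The function names, the default cap of 50 samples per query and the 20% test fraction are assumptions made for illustration.

```python
import random
from collections import defaultdict


def prepare_query_split(samples, max_per_query=50, test_fraction=0.2, seed=0):
    """Deduplicate, balance and split (query_id, item_id, label) samples.

    Returns (train, test) such that exact duplicates are removed, no query
    keeps more than max_per_query samples, and no query appears in both sets.
    """
    rng = random.Random(seed)
    by_query = defaultdict(list)
    seen = set()
    for query_id, item_id, label in samples:
        if (query_id, item_id) in seen:  # drop exact duplicates
            continue
        seen.add((query_id, item_id))
        by_query[query_id].append((query_id, item_id, label))

    # Cap the number of samples per query so that large queries do not dominate.
    capped = {
        q: (rng.sample(rows, max_per_query) if len(rows) > max_per_query else rows)
        for q, rows in by_query.items()
    }

    # Split at the query level so that the test queries are unseen in training.
    queries = sorted(capped)
    rng.shuffle(queries)
    n_test = max(1, round(len(queries) * test_fraction))
    test_queries = set(queries[:n_test])
    train = [row for q in queries if q not in test_queries for row in capped[q]]
    test = [row for q in queries if q in test_queries for row in capped[q]]
    return train, test


def shared_items(train, test):
    """Item identifiers that occur in both sets, to be inspected before training."""
    return {item for _, item, _ in train} & {item for _, item, _ in test}
```

A non-empty result from `shared_items` does not necessarily invalidate an experiment, but it signals that the measured performance may overstate how well the model generalizes.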

Summary

This paper summarizes the specific applications of LTR in bioinformatics, analyzes the existing problems of these LTR frameworks and puts forward brief suggestions for better applying LTR in this field. This review can help researchers gain a preliminary understanding of LTR algorithms and guide them in using these algorithms for meaningful work, such as drug redirection.

Key Points
  • LTR has been applied in many fields, such as information retrieval, natural language processing and data mining. LTR algorithms have also been introduced into bioinformatics, where they have been used to solve many tasks.

  • The composition of the input dataset, the file format and the output of an LTR framework differ from those of a conventional classification model. Under the same LTR query, LTR algorithms can be divided into three types according to the number of object documents processed.

  • The LTR frameworks in many bioinformatics tasks, such as assignment of MeSH, protein homology detection, protein structure and function prediction and several related tasks of drug research and development, are introduced and discussed.

  • This paper also summarizes and discusses the advantages and disadvantages of the LTR frameworks and proposes suggestions for better use of LTR to solve bioinformatics tasks in the future.

Funding

New Energy and Industrial Technology Development Organization (NEDO) and the Japan Society for the Promotion of Science (JSPS), Grants-in-Aid for Scientific Research (grant no. 18H03250); National Natural Science Foundation of China (grant nos. 61922020, 61771331 and 91935302).

Conflicts of interest

The authors declare that they have no conflicts of interest.

Xiaoqing Ru is a PhD student in the University of Tsukuba. Her research interest is learning to rank.

Xiucai Ye is currently an assistant professor in the Department of Computer Science and Center for Artificial Intelligence Research (C-AIR), University of Tsukuba. Her current research interests include feature selection, clustering, machine learning and bioinformatics.

Tetsuya Sakurai is currently a professor in the Department of Computer Science and is the director of the C-AIR, University of Tsukuba. His research interests include high performance algorithms for large-scale simulations, data and image analysis and deep neural network computations.

Quan Zou is a professor at the University of Electronic Science and Technology of China. He is a senior member of IEEE and ACM. He specializes in bioinformatics, machine learning and algorithms.

References

1. Goutte C. Learning to rank for information retrieval and natural language processing. Comput Linguist 2012;38(2):459–9.

2. Wang JY, Sun Y, Gao X. Sparse structure regularized ranking. Multimed Tools Appl 2015;74(2):635–54.

3. He C, Wang C, Zhong Y, et al. A survey on learning to rank. In: International Conference on Machine Learning and Cybernetics, 2008, 1734–9. IEEE. Kunming, People’s Republic of China.

4. Li H. A short introduction to learning to rank. IEICE T Inf Syst 2011;94(10):1854–62.

5. Xu B, Lin HF, Lin Y, et al. Learning to rank for biomedical information retrieval. In: Proceedings of 2015 IEEE International Conference on Bioinformatics and Biomedicine, 2015, 464–9. IEEE. Washington, DC.

6. Jarvelin K, Kekalainen J. Cumulated gain-based evaluation of IR techniques. ACM Trans Inf Syst 2002;20(4):422–46. MIT Press. Vancouver, Canada.

7. Crammer K, Singer Y. Pranking with ranking. In: Advances in Neural Information Processing Systems, 2001, 641–7.

8. Caruana R, Baluja S, Mitchell TM, et al. Using the future to ‘sort out’ the present: Rankprop and multitask learning for medical risk evaluation. Adv Neural Inf Process Syst 1999;8:959–65.

9. Burges CJ, Ragno RJ, Le QV. Learning to rank with nonsmooth cost functions. In: International Conference on Neural Information Processing Systems, 2006, 193–200. MIT Press. Vancouver, Canada.

10. Herbrich R, Graepel T, Obermayer K. Large margin rank boundaries for ordinal regression. Adv Neural Inf Process Syst 2000;88:115–132.

11. Cao Z, Qin T, Liu T, et al. Learning to rank: from pairwise approach to listwise approach. In: International Conference on Machine Learning, 2007, 129–36. Association for Computing Machinery, New York, NY, United States. Corvallis, Oregon.

12. Joachims T. Optimizing search engines using clickthrough data. In: Knowledge Discovery and Data Mining, 2002, 133–42. Association for Computing Machinery, New York, NY, United States. Edmonton, Alberta, Canada.

13. Mork J, Jimenoyepes A, Aronson A, et al. The NLM medical text indexer system for indexing biomedical literature. In: Proceedings of BioASQ CLEF. CEUR Workshop Proceedings. Valencia, Spain. 2013(1094).

14. Trieschnigg D, Pezik P, Lee V, et al. MeSH up: effective MeSH text classification for improved document retrieval. Bioinformatics 2009;25(11):1412–8.

15. Sohn S, Kim W, Comeau DC, et al. Optimal training sets for Bayesian prediction of MeSH (R) assignment. J Am Med Inform Assoc 2008;15(4):546–53.

16. Ruch P. Automatic assignment of biomedical categories: toward a generic approach. Bioinformatics 2006;22(6):658–64.

17. Aronson AR. Effective mapping of biomedical text to the UMLS metathesaurus: the MetaMap program. J Am Med Inform Assoc 2001:17–21.

18. Kim W, Aronson AR, Wilbur WJ. Automatic MeSH term assignment and quality assessment. J Am Med Inform Assoc 2001;8(1):319–23.

19. Aronson AR, Mork JG, Gay CW, et al. The NLM indexing initiative’s medical text indexer. In: Medinfo 2004: Proceedings of the 11th World Congress on Medical Informatics, 2004, 268–72. IOS Press, Nieuwe Hemweg 6B, 1013 BG Amsterdam, Netherlands. San Francisco, CA.

20. Huang M, Neveol A, Lu Z. Recommending MeSH terms for annotating biomedical articles. J Am Med Inform Assoc 2011;18(5):660–7.

21. Mao Y, Lu Z. NCBI at the 2013 BioASQ challenge task: learning to rank for automatic MeSH indexing. Technical Report, 2013.

22. Mao Y, Lu Z. MeSH now: automatic MeSH indexing at PubMed scale via learning to rank. J Biomed Semantics 2017;8(1):15.

23. Liu K, Peng S, Wu J, et al. MeSHLabeler: improving the accuracy of large-scale MeSH indexing by integrating diverse evidence. Bioinformatics 2015;31(12):i339–47.

24. Peng S, You R, Wang H, et al. DeepMeSH: deep semantic representation for improving large-scale MeSH indexing. Bioinformatics 2016;32(12):i70–9.

25. Dai S, You R, Lu Z, et al. FullMeSH: improving large-scale MeSH indexing with full text. Bioinformatics 2020;36(5):1533–41.

26. Murzin AG, Brenner SE, Hubbard T, et al. SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995;247(4):536–40.

27. Chen J, Guo M, Wang X, et al. A comprehensive review and comparison of different computational methods for protein remote homology detection. Brief Bioinform 2018;19(2):231–44.

28. Wang JY, Bensmail H, Gao X. Multiple graph regularized protein domain ranking. BMC Bioinformatics 2012;13(1):307.

29. Altschul SF, Gish W, Miller W, et al. Basic local alignment search tool. J Mol Biol 1990;215(3):403–10.

30. Altschul SF, Madden TL, Schaffer AA, et al. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997;25(17):3389–402.

31. Kuang R, Weston J, Noble WS, et al. Motif-based protein ranking by network propagation. Bioinformatics 2005;21(19):3711–8.

32. Weston J, Elisseeff A, Zhou DY, et al. Protein ranking: from local to global structure in the protein similarity network. Proc Natl Acad Sci U S A 2004;101(17):6559–63.

33. Melvin I, Weston J, Leslie C, et al. RANKPROP: a web server for protein remote homology detection. Bioinformatics 2009;25(1):121–2.

34. Liu B, Chen J, Wang X. Application of learning to rank to protein remote homology detection. Bioinformatics 2015;31(21):3492–8.

35. Liu B, Xu JH, Zou Q, et al. Using distances between top-n-gram and residue pairs for protein remote homology detection. BMC Bioinformatics 2014;15(2):1–10.

36. Liu B, Zhang DY, Xu RF, et al. Combining evolutionary information extracted from frequency profiles with sequence-based kernels for protein remote homology detection. Bioinformatics 2014;30(4):472–9.

37. Liu B, Chen JJ, Wang XL. Protein remote homology detection by combining Chou’s distance-pair pseudo amino acid composition and principal component analysis. Mol Genet Genomics 2015;290(5):1919–31.

38. Chen JJ, Liu BQ, Huang D. Protein remote homology detection based on an ensemble learning approach. Biomed Res Int 2016:5813645–5.

39. Liu B, Chen JJ, Wang SY. Protein remote homology detection by combining pseudo dimer composition with an ensemble learning method. Curr Proteomics 2016;13(2):86–91.

40. Chen JJ, Long R, Wang XL, et al. dRHP-PseRA: detecting remote homology proteins using profile-based pseudo protein sequence and rank aggregation. Sci Rep 2016;6(1):32333–3.

41. Chen J, Guo M, Li S, et al. ProtDec-LTR2.0: an improved method for protein remote homology detection by combining pseudo protein and supervised learning to rank. Bioinformatics 2017;33(21):3473–6.

42. Liu B, Zhu Y. ProtDec-LTR3.0: protein remote homology detection by incorporating profile-based features into learning to rank. IEEE Access 2019;7:102499–507.

43. Piana S, Klepeis JL, Shaw DE. Assessing the accuracy of physical models used in protein-folding simulations: quantitative evidence from long molecular dynamics simulations. Curr Opin Struct Biol 2014;24:98–105.

44. Marks DS, Hopf TA, Sander C. Protein structure prediction from sequence variation. Nat Biotechnol 2012;30(11):1072–80.

45. Jing X, Dong Q, Lu R. RRCRank: a fusion method using rank strategy for residue-residue contact prediction. BMC Bioinformatics 2017;18(1):390.

46. Zhang Y. Protein structure prediction: when is it useful? Curr Opin Struct Biol 2009;19(2):145–55.

47. Ghosh S, Vishveshwara S. Ranking the quality of protein structure models using sidechain based network properties. F1000Research 2014;3:17–7.

48. Pawlowski M, Kozlowski L, Kloczkowski A. MQAPsingle: a quasi single-model approach for estimation of the quality of individual protein structure models. Proteins 2016;84(8):1021–8.

49. Wang QG, Shang C, Xu D, et al. New Mds and clustering based algorithms for protein model quality assessment and selection. Int J Art Intell Tools 2013;22(5):1360006–6.

50. Jing X, Dong Q. MQAPRank: improved global protein model quality assessment by learning-to-rank. BMC Bioinformatics 2017;18(1):1–8.

51. Sleator RD, Walsh P. An overview of in silico protein function prediction. Arch Microbiol 2010;192(3):151–5.

52. Hamp T, Kassner R, Seemayer S, et al. Homology-based inference sets the bar high for protein function prediction. BMC Bioinformatics 2013;14(3):1–10.

53. Gillis J, Pavlidis P. Characterizing the state of the art in the computational assignment of gene function: lessons from the first critical assessment of functional annotation (CAFA). BMC Bioinformatics 2013;14(3):1–12.

54. Ashburner M, Ball CA, Blake JA, et al. Gene ontology: tool for the unification of biology. The gene ontology consortium. Nat Genet 2000;25(1):25–9.

55. You R, Zhang Z, Xiong Y, et al. GOLabeler: improving sequence-based large-scale protein function prediction by learning to rank. Bioinformatics 2018;34(14):2465–73.

56. You R, Yao S, Xiong Y, et al. NetGO: improving large-scale protein function prediction with massive network information. Nucleic Acids Res 2019;47(1):379–W387.

57. Stock M, Fober T, Hullermeier E, et al. Identification of functionally related enzymes by learning-to-rank methods. IEEE/ACM Trans Comput Biol Bioinform 2014;11(6):1157–69.

58. Ding H, Takigawa I, Mamitsuka H, et al. Similarity-based machine learning methods for predicting drug-target interactions: a brief review. Brief Bioinform 2014;15(5):734–47.

59. Xiao X, Min JL, Wang P, et al. iGPCR-drug: a web server for predicting interaction between GPCRs and drugs in cellular networking. PLoS One 2013;8(8):e72234.

60. Luo YA, Zhao XB, Zhou JT, et al. A network integration approach for drug-target interaction prediction and computational drug repositioning from heterogeneous information. Nat Commun 2017;8(1):573–3.

61. Hu J, Li Y, Yang JY, et al. GPCR-drug interactions prediction using random forest with drug-association-matrix-based post-processing procedure. Comput Biol Chem 2016;60:59–71.

62. Agarwal S, Dugar D, Sengupta S. Ranking chemical structures for drug discovery: a new machine learning approach. J Chem Inf Model 2010;50(5):716–31.

63. Rathke F, Hansen K, Brefeld U, et al. StructRank: a new approach for ligand-based virtual screening. J Chem Inf Model 2011;51(1):83–92.

64. Ohue M, Suzuki SD, Akiyama Y. Learning-to-rank technique based on ignoring meaningless ranking orders between compounds. J Mol Graph Model 2019;92:192–200.

65. Liu J, Ning X. Multi-assay-based compound prioritization via assistance utilization: a machine learning framework. J Chem Inf Model 2017;57(3):484–98.

66. Zhang W, Ji L, Chen Y, et al. When drug discovery meets web search: learning to rank for ligand-based virtual screening. J Chem 2015;7(1):5–5.

67. Suzuki SD, Ohue M, Akiyama Y. PKRank: a novel learning-to-rank method for ligand-based virtual screening using pairwise kernel and RankSVM. Artif Life Robotics 2018;23(2):205–12.

68. Dorr A, Rosenbaum L, Zell A. A ranking method for the concurrent learning of compounds with various activity profiles. J Chem 2015;7(1):2–2.

69. Liu J, Ning X. Differential compound prioritization via bidirectional selectivity push with power. J Chem Inf Model 2017;57(12):2958–75.

70. Rahangdale A, Raut S. Gene-expression based predictor for drug selection and prioritization using learning-to-rank. In: International Conference on Bioinformatics, 2018. IEEE, 345 E 47th st, New York, NY 10017 USA. Allahabad, India.

71. He Y, Liu J, Ning X. Drug selection via joint push and learning to rank. IEEE/ACM Trans Comput Biol Bioinform 2020;17(1):110–23.

72. Yuan Q, Gao J, Wu D, et al. DrugE-rank: improving drug-target interaction prediction of new candidate drugs or targets by ensemble learning to rank. Bioinformatics 2016;32(12):i18–27.

73. Ru X, Wang L, Li L, et al. Exploration of the correlation between GPCRs and drugs based on a learning to rank algorithm. Comput Biol Med 2020;119:103660.

74. Shivani A, Shiladitya S. Ranking genes by relevance to a disease. In: Proceedings of the 8th International Conference on Computational Systems Bioinformatics, 2009, Vol. 8, 37–46. CBS 2009 On-line Proceedings. San Francisco, United States.

75. Lee PF, Soo VW. An ensemble rank learning approach for gene prioritization. In: Conference Proceedings: Annual International Conference of the IEEE Engineering in Medicine and Biology Society, 2013, 3507–10. IEEE, 345 E 47th st, New York, NY 10017 USA. Osaka, Japan.

76. Raj MR, Sreeja A. Analysis of computational gene prioritization approaches. Procedia Comput Sci 2018;143:395–410.

77. Qeli E, Omasits U, Goetze S, et al. Improved prediction of peptide detectability for targeted proteomics using a rank-based algorithm and organism-specific data. J Proteomics 2014;108:269–83.

78. Leaman R, Islamaj Dogan R, Lu Z. DNorm: disease name normalization with pairwise learning to rank. Bioinformatics 2013;29(22):2909–17.

79. Wu JJ, Huang JX, Ye Z. Learning to rank diversified results for biomedical information retrieval from multiple features. Biomed Eng Online 2014;13(2):1–10.

80. Shang Y, Hao HH, Wu JJ, et al. Learning to rank-based gene summary extraction. BMC Bioinformatics 2014;15(12):1–8.

81. Guan W, Ozakin A, Gray A, et al. Learning protein folding energy functions. In: International Conference on Data Mining, 2011. 1062–7. IEEE Computer Society. Vancouver, Canada.

82. Chen Z, Zhao P, Li FY, et al. iFeature: a python package and web server for features extraction and selection from protein and peptide sequences. Bioinformatics 2018;34(14):2499–502.

83. Liu B, Liu FL, Wang XL, et al. Pse-in-one: a web server for generating various modes of pseudo components of DNA, RNA, and protein sequences. Nucleic Acids Res 2015;43(W1):W65–71.

84. Lovric M, Molero JM, Kern R. PySpark and RDKit: moving towards big data in cheminformatics. Mol Inform 2019;38(6):4.

85. Wang JY, Cui X, Yu G, et al. When sparse coding meets ranking: a joint framework for learning sparse codes and ranking scores. Neural Comput Appl 2019;31(3):701–10.

86. Li Y, Kuwahara H, Yang P, et al. PGCN: disease gene prioritization by disease and gene embedding through graph convolutional neural networks. bioRxiv 2019;00:532226.

87. Han P, Yang P, Zhao PL, et al. GCN-MF: Disease-Gene Association Identification by Graph Convolutional Networks and Matrix Factorization. New York: Assoc Computing Machinery, 2019.

88. Geng X, Liu T, Qin T, et al. Feature selection for ranking. In: International ACM SIGIR Conference on Research and Development in Information Retrieval, 2007, 407–14. Association for Computing Machinery, New York, NY, United States. Amsterdam, The Netherlands.

This article is published and distributed under the terms of the Oxford University Press, Standard Journals Publication Model (https://academic.oup.com/journals/pages/open_access/funder_policies/chorus/standard_publication_model)