Volume 12 - Issue 2

Mini Review, Biomedical Science and Research (CC BY)

Multi-classification and Variable Selection Techniques in Cancer Genomic Data Research

*Corresponding author: Nan Li, Department of Epidemiology and Cancer Control, St. Jude Children’s Research Hospital, US.

Received: February 02, 2021; Published: March 03, 2021

DOI: 10.34297/AJBSR.2021.12.001725

Abstract

Keywords: Cancer Classification, LASSO, Logistic Regression, Neural Network, SVM, Variable Selection

Introduction

In the past two decades, huge amounts of high-throughput -omics data, such as genomics, transcriptomics, metabolomics, and proteomics, have been generated regarding variations in DNA, RNA, or protein features for many cancers. The tremendous volume and complexity of these data bring significant challenges for biostatisticians, biologists, and clinicians. One of the central goals of analyzing these data is disease classification, which is fundamental for exploring knowledge, formulating diagnoses, and developing personalized treatment. Here, we review the statistical and machine learning techniques studied in cancer classification, along with the process of, and difficulties in, categorizing cancer subtypes from their genomic features.

Traditionally, cancer was classified by organ location and then further stratified by cell type, patient age, or histological grade [1]. More recently, the dramatic wave of genomic data has accelerated the trend of classifying cancer subtypes by clinical outcome or treatment option. Exploration of multi-classification problems is essential for successful application in precision medicine.

Method

Multi-classification

A few notes on terminology are introduced at the beginning. Multi-classification is a kind of supervised learning, which aims to predict the value of a class outcome using input variables from a training set of samples with known class labels [2]. Another very popular machine learning technique, clustering, falls into the category of unsupervised learning, which does not need outcome labels but aims to describe the associations and patterns among a set of input variables [2]. We will only discuss classification in this review. Since binary classification is a special case of multi-classification, we focus on the multi-class setting here.

In the literature, there are two main types of classification: soft classification and hard classification. Soft classification rules first estimate the conditional outcome class probabilities and then predict the class label based on the maximum probability. Among them are linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and logistic regression [3]. On the other hand, hard classification rules directly target the discriminant function without estimating conditional class probabilities, such as the support vector machine (SVM, [4]).
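The following is a minimal sketch of this distinction, using scikit-learn (one possible toolkit, not referenced in the original article) on simulated data standing in for an expression matrix: a multinomial logistic regression (soft rule, returning class probabilities) versus a linear SVM (hard rule, returning only discriminant scores).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC

# toy stand-in for an expression matrix: 150 samples, 20 features, 3 subtypes
X, y = make_classification(n_samples=150, n_features=20, n_informative=8,
                           n_classes=3, random_state=0)

# soft rule: estimate conditional class probabilities, then take the argmax
soft = LogisticRegression(max_iter=1000).fit(X, y)
print(soft.predict_proba(X[:2]))      # estimated class probabilities
print(soft.predict(X[:2]))            # label = class with maximum probability

# hard rule: work directly with the discriminant function
hard = LinearSVC(max_iter=5000).fit(X, y)
print(hard.decision_function(X[:2]))  # discriminant scores, no probabilities
print(hard.predict(X[:2]))
```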

Most binary classifiers, such as LDA, QDA, and logistic regression, can be extended to multi-classification naturally. However, applying SVM to multi-category problems is not straightforward. One intuitive approach is to reduce a multi-category problem into a series of binary problems through "one vs. one" or "one vs. rest" strategies. In a "one vs. one" reduction, one trains all pairwise binary classifiers for a K-class problem; for each test point, the predicted class is the one that wins the most pairwise contests. In the "one vs. rest" strategy, a K-class problem is divided into K "one-vs-rest" problems, and each is addressed by a different class-specific binary classifier (e.g., "class 1" vs. "not class 1"); a new sample then takes the class of the classifier with the largest real-valued output, such as a confidence score. These indirect approaches suffer from unbalanced or reduced sample sizes and fail to capture correlations among the different classes [5].
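As a brief illustration of the two reduction strategies (a sketch only, using scikit-learn's generic wrappers around a binary SVC rather than any method from the cited works):

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=200, n_features=30, n_informative=10,
                           n_classes=4, random_state=0)

# "one vs. one": K(K-1)/2 pairwise classifiers; predict by majority vote
ovo = OneVsOneClassifier(SVC(kernel="linear")).fit(X, y)

# "one vs. rest": K class-vs-others classifiers; predict by largest score
ovr = OneVsRestClassifier(SVC(kernel="linear")).fit(X, y)

print(ovo.predict(X[:5]), ovr.predict(X[:5]))
```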

To overcome these shortcomings, direct simultaneous multi-classification methods (global models) were proposed to inherit and extend the optimality properties of the binary SVM to the multi-category case, such as the multicategory SVM (MSVM, [6]) and the multiclass proximal SVM (MPSVM, [7]). Another stream of ensemble methods, such as boosting and random forests, is also very popular due to their high accuracy and strong generalization. Lastly, the well-known neural network was developed separately in the field of artificial intelligence. Neural networks essentially extract linear combinations of the inputs as derived features and then model the target as a nonlinear function of those features [2]. They are especially effective for complex input data that are hard to interpret, and they are among the most effective general-purpose supervised learning methods currently known.
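A minimal sketch of the ensemble and neural-network alternatives, again on simulated data with scikit-learn (the hidden-layer sizes and numbers of trees below are arbitrary illustrative choices, not recommendations from the cited works):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=300, n_features=50, n_informative=15,
                           n_classes=3, random_state=1)

# ensemble method: random forest of 500 trees
forest = RandomForestClassifier(n_estimators=500, random_state=1)

# neural network: derived features are nonlinear functions of
# linear combinations of the inputs
net = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=2000, random_state=1)

print("random forest CV accuracy:", cross_val_score(forest, X, y, cv=5).mean())
print("neural network CV accuracy:", cross_val_score(net, X, y, cv=5).mean())
```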

Variable Selection

In cancer genomic studies, such as microarray or RNA-seq experiments, the overwhelming number of variables far exceeds the size of the training sample, even though the underlying model is naturally sparse. Therefore, it is essential to identify important variables in order to achieve classifiers with higher prediction accuracy and better model interpretability. Variable selection in multi-classification is much more challenging than in binary classification or regression, since one needs to consider which variables are important for each individual discriminant function separately as well as for the whole set of functions. In regression or binary classification, modern penalized methods, such as the LASSO [8], adaptive LASSO [9], or group LASSO [10], outperform the traditional forward/backward/stepwise selection methods due to their continuous selection process, which yields smaller variation and better predictive power. Especially in high-dimensional genomic analysis, the sample size is far smaller than the number of available genomic features, which makes practical application of traditional subset selection methods impossible. In multi-classification, some existing works are the L1-MSVM [11], group L1 multinomial logistic regression [12], supnorm MSVM [13], and supSCAD multinomial logistic regression/MSVM [14].
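To make the penalized-selection idea concrete, here is a hedged sketch of an L1 (LASSO-type) multinomial logistic regression in a simulated p >> n setting; features whose coefficients are shrunk exactly to zero in every class are dropped. The data-generating scheme and penalty strength C are illustrative assumptions, not a reproduction of any cited method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, p = 100, 2000                           # far more features than samples
X = rng.normal(size=(n, p))

# sparse truth: each of 3 classes driven by 5 class-specific "genes"
B = np.zeros((p, 3))
B[:5, 0], B[5:10, 1], B[10:15, 2] = 3.0, 3.0, 3.0
y = (X @ B + rng.normal(scale=0.5, size=(n, 3))).argmax(axis=1)

# L1-penalized multinomial logistic regression (smaller C = stronger penalty)
fit = LogisticRegression(penalty="l1", solver="saga", C=0.1,
                         max_iter=5000).fit(X, y)

selected = np.unique(np.nonzero(fit.coef_)[1])   # features kept in any class
print("number of selected features:", selected.size)
```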

Alternatively, ranking individual genes by discriminant power [15] or filtering through relevance and correlation [16] can achieve good performance at comparatively low computational cost. Iterative wrapper schemes, such as recursive feature elimination, can be embedded into different multi-classification machines to achieve variable selection in stepwise fashion. The nearest shrunken centroids classifier [17] was proposed for multiple cancer classification with gene expression data and has shown good empirical performance. A hierarchical ensemble model with Error-Correcting Output Codes was studied in [18] on multi-class microarray data.
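A brief sketch of two of these lighter-weight approaches, using scikit-learn analogues on simulated data (the thresholds and feature counts below are arbitrary assumptions): recursive feature elimination wrapped around a linear SVM, and a nearest-centroid classifier whose shrinkage threshold zeroes out genes that do not separate the classes.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.neighbors import NearestCentroid
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=200, n_features=500, n_informative=20,
                           n_classes=3, random_state=2)

# recursive feature elimination: iteratively drop the lowest-weight features
rfe = RFE(LinearSVC(max_iter=5000), n_features_to_select=20, step=0.2).fit(X, y)
print("features retained by RFE:", rfe.support_.sum())

# nearest shrunken centroids: shrink_threshold removes uninformative genes
nsc = NearestCentroid(shrink_threshold=0.5).fit(X, y)
print("shrunken-centroid training accuracy:", nsc.score(X, y))
```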

Application

A variety of applications of the previously reviewed methods have been studied on different kinds of -omics data for cancer classification. SVM with recursive feature elimination was applied to multi-class cancer classification [19], and its performance was compared between microRNA and mRNA expression profiles [20]. Group L1 multinomial regression was applied to a three-class acute leukemia gene expression dataset [21]. An adaptive version of the group L1 multinomial regression, designed to select informative gene groups as well as important genes within each group, was then developed and applied to lung cancer classification [22]. A deep learning model was proposed to classify multiple cancer subtypes using RNA-seq gene expression [23].
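In practice, such applications typically combine gene selection and the multi-class classifier in a single cross-validated pipeline, so that selection is refit within each fold and does not leak information into the evaluation. The sketch below assumes a hypothetical expression file and column layout purely for illustration; it is not the setup of any cited study.

```python
import pandas as pd
from sklearn.feature_selection import RFE
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

# hypothetical layout: rows = samples, gene columns plus a "subtype" label
data = pd.read_csv("expression_matrix.csv")
X = data.drop(columns="subtype").values
y = data["subtype"].values

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", RFE(LinearSVC(max_iter=5000), n_features_to_select=50, step=0.1)),
    ("clf", LinearSVC(max_iter=5000)),
])

# selection is refit inside each fold, avoiding optimistic bias
print("5-fold CV accuracy:", cross_val_score(pipe, X, y, cv=5).mean())
```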

Conclusion

Multi-classification and variable selection have been studied extensively in the statistics and machine learning communities. Various penalized classifiers have been proposed and examined to achieve good finite-sample performance as well as sound asymptotic properties at manageable computing cost. In parallel, ensemble methods, neural networks, and other deep learning techniques have been applied to process different -omics data and to develop complex models and classifiers.

Acknowledgement

The author thanks the Editor and reviewers for their constructive comments, careful reading, and guidance on the presentation of the paper.

References
