A study by García, Sánchez, Cleofas-Sánchez, Ochoa-Domínguez, and López-Orozco (2017) analyzed the effect of high-dimensional data on the classification of gene expression datasets. Gain ratio and ReliefF were used as gene ranking methods with six classifiers on four biomedical datasets. The results showed that, regardless of the gene ranking algorithm and classifier used, the highest classification performance was achieved with a very small number of genes (less than one fifth of the total number of genes).
A dynamic relevance-based gene selection method (DRGS) was introduced by Sun et al. (2013) to identify a gene subset from high-dimensional gene expression Microarray data for cancer classification and diagnosis. This method aimed to use a target-based scheme for relevance, interdependence and redundancy analysis to retain the useful functional gene groups. This is done by updating the relevance between each gene and the target dynamically when a new gene is selected. The proposed method was validated against Information Gain, mRMR, ReliefF, and Significance Analysis of Microarrays (SAM) on six gene expression Microarray datasets. The results showed that, compared to the other selectors, DRGS selected fewer genes with higher classification accuracy.

Table 1
Dataset type   Number of variables   Dataset function                         Sample type   Number of samples
TCGA gene      17,815 genes          46 samples for 5-fold cross validation   Normal        21
                                     training and 200 samples for testing     Cancer        225
GEO gene       17,815 genes          174 samples for Independent test         Normal        19
                                     cross validation training                Cancer        234
In this study, an ensemble feature selection approach based on a Nested Genetic Algorithm is proposed to select the optimal Microarray gene subset that represents the biomarker genes of one cancer type by combining the information from two types of Microarray data: gene expression data and DNA Methylation data. The Nested Genetic Algorithm (Nested-GA) utilizes both Filter and Wrapper feature selection methods. For filter feature selection, a t-test is used as a preprocessing step. Then, a Nested Genetic Algorithm composed of two genetic algorithms, one with a Support Vector Machine (SVM) and the other with a Neural Network, is used as the Wrapper feature selection technique. Incremental Feature Selection (IFS) is then used as an ensemble approach to present the biomarker genes as its outcome.
2. Materials and methods
The results presented in this paper are based on the colon cancer gene expression data downloaded from The Cancer Genome Atlas (TCGA) https://tcga-data.nci.nih.gov/tcga/ and the TCGA DNA Methylation dataset based on the IHM-27k platform, both of which were used to run the Nested-GA algorithm. The colon cancer gene expression data from the Gene Expression Omnibus (GEO) at NCBI was used as a dataset for independent testing. Table 1 shows more details of the datasets used.
2.2. The proposed algorithm
The pipeline of the proposed method, as shown in Fig. 1, starts by preprocessing both the Gene Expression and the DNA Methylation datasets before applying feature selection. After that, feature filtering is applied using a t-test to select a subset of the top-ranked genes and CpG sites from the Gene Expression and DNA Methylation data. The filtered gene subset is fed as input to the OGA-SVM with an SVM fitness function, while the filtered CpG sites subset is fed as input to the IGA-NNW with an N-Net fitness function. Finding the relation between genes and CpG sites is an important step that is used in the initialization stage of both the IGA-NNW and the OGA-SVM. After a set number of OGA-SVM runs, we obtain N solutions. We rank the genes in the N solutions in descending order based on their frequency. Next, we incrementally append genes with high rank, producing M subsets of top-ranked genes (models). SVM is used to evaluate the M models to get the optimal model. At the end, the optimal model's genes are validated.
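The filtering, frequency-ranking, and incremental-selection steps of this pipeline can be illustrated with a minimal sketch. It is not the authors' implementation. The function names (`welch_t`, `t_test_filter`, `rank_by_frequency`, `incremental_feature_selection`) and the `score_fn` callback are hypothetical. Welch's unequal-variance t statistic is assumed, since the text does not say which t-test variant is used. The SVM evaluation is replaced by a caller-supplied scoring function, and the genetic algorithms themselves are omitted.

```python
from collections import Counter
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two samples (assumed variant of the t-test)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se > 0 else 0.0


def t_test_filter(normal, cancer, top_k):
    """Keep the top_k features with the largest |t| between the two classes.

    normal, cancer: dicts mapping a feature name to its per-sample values.
    """
    scores = {g: abs(welch_t(normal[g], cancer[g])) for g in normal}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


def rank_by_frequency(solutions):
    """Order genes by how many of the N solutions contain them (descending)."""
    counts = Counter(g for sol in solutions for g in set(sol))
    # Ties are broken by gene name so the ranking is deterministic.
    return sorted(counts, key=lambda g: (-counts[g], g))


def incremental_feature_selection(ranked, score_fn):
    """Evaluate the nested subsets ranked[:1], ranked[:2], ... and keep the best.

    score_fn is a hypothetical callback standing in for the SVM evaluation,
    e.g. cross-validated accuracy on the given gene subset.
    """
    best_subset, best_score = [], float("-inf")
    for m in range(1, len(ranked) + 1):
        subset = ranked[:m]
        score = score_fn(subset)
        if score > best_score:  # strict '>' prefers the smaller subset on ties
            best_subset, best_score = subset, score
    return best_subset, best_score


# Toy usage: three GA solutions and a made-up scoring function.
solutions = [["G1", "G2", "G3"], ["G1", "G3"], ["G1", "G4"]]
ranked = rank_by_frequency(solutions)  # ['G1', 'G3', 'G2', 'G4']
informative = {"G1", "G3"}
best, score = incremental_feature_selection(
    ranked, lambda s: len(informative & set(s)) - 0.1 * len(s)
)
```

In this toy run the M candidate models are the four nested prefixes of the ranked list. With a real classifier, `score_fn` would train and evaluate an SVM on each prefix instead of using the made-up score.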