Discipline: Technology and Engineering
Subcategory: Electrical Engineering
Saul B. Henderson - University of the District of Columbia
Co-Author(s): Keenan Leatham, Nian Zhang, Lara Thompson
Although semi-supervised learning has achieved great success in many machine learning and data mining applications, its importance under imbalanced data sets has received very limited attention in the community. Classification becomes very difficult because of the unbounded size and imbalanced nature of the data. Minority samples are those that rarely occur but are extremely important, and misclassifying them incurs an overwhelming cost. It is therefore critical to develop a highly efficient algorithm that alleviates class overlap in a low-dimensional mapping so as to improve classification accuracy. Inspired by the semi-supervised learning mode, the proposed clustering algorithms utilize external information, or side information, from context classes (i.e., backgrounds or confounders) in addition to intrinsic information from the object class (i.e., targets to be recognized) to partition data into clusters. This intra-class clustering (ICC) approach partitions each class into sub-classes in order to minimize overlap across clusters from different classes. The new semi-supervised-learning-based kernel density clustering algorithm consists of a principal component analysis (PCA) step, a difference-of-density estimation step, and a gradient ascent step. First, the data are projected into a lower-dimensional space using PCA, which ensures that the subsequent kernel density estimation can be computed efficiently rather than in the original high-dimensional space. The difference-of-density step then performs Gaussian kernel density estimation on both the object class and the context class and computes the difference between the two density estimates. On this basis, a density clustering algorithm has been developed: it uses the density function as a map and assigns examples on the same "mountain" (local peak) to the same cluster.
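The PCA projection and difference-of-density steps described above can be sketched as follows. This is a minimal illustration in Python/NumPy, not the authors' MATLAB implementation; all function names, the bandwidth value, and the fixed-bandwidth Gaussian kernel are assumptions made here for clarity.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project data onto its top principal components via centered SVD."""
    Xc = X - X.mean(axis=0)
    # Right singular vectors of the centered data are the principal directions.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def gaussian_kde(points, grid, bandwidth=0.5):
    """Evaluate a fixed-bandwidth Gaussian kernel density estimate at grid points."""
    # Pairwise squared distances between evaluation points and data points.
    d2 = ((grid[:, None, :] - points[None, :, :]) ** 2).sum(axis=-1)
    kernels = np.exp(-d2 / (2 * bandwidth ** 2))
    # Normalize so each kernel integrates to 1 in d dimensions.
    norm = points.shape[0] * (np.sqrt(2 * np.pi) * bandwidth) ** points.shape[1]
    return kernels.sum(axis=1) / norm

def difference_of_density(obj, ctx, grid, bandwidth=0.5):
    """Object-class density minus context-class density on the grid."""
    return gaussian_kde(obj, grid, bandwidth) - gaussian_kde(ctx, grid, bandwidth)
```

On this map, regions where the object class dominates have positive values and regions dominated by the context class have negative values, so local peaks of the difference naturally separate object sub-classes from confounders.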
Once the density clustering step is complete, the data in each new cluster share one local maximum on the difference-of-density map. To search for the local maximum of each sample in the object class, the gradient of the difference-of-density map is calculated through finite differences. A modified Levenberg-Marquardt algorithm is applied to the gradient to iteratively find the position of the local maximum. Once a local maximum for each example is found, all examples sharing the same local maximum are assigned the same subclass label. A kernel-based least-squares support vector machine (LS-SVM) is designed as the classifier, and its performance is compared with a traditional quadratic classifier on both real-world photo-thermal infrared (IR) imaging spectroscopy (PT-IRIS) data and an olfactory database. Experimental results show that the proposed algorithm can not only separate an arbitrary data distribution into non-overlapping unimodal clusters, but can also utilize intervening context data distributions to further separate the clusters. The findings demonstrate that the proposed approach can perform efficiently in applications where class-conditional densities are significantly non-Gaussian or multi-modal. [This study was supported by a grant from the University of the District of Columbia (NSF/HBCU-UP/HRD #1505509, HRD #1533479, and NSF/DUE #1654474), Washington, D.C. 20008]
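The mode-seeking and subclass-assignment step can be sketched as follows. For simplicity, this illustration uses plain gradient ascent with central finite-difference gradients in place of the modified Levenberg-Marquardt update, and all names and step-size values are assumptions, not the authors' implementation.

```python
import numpy as np

def finite_diff_grad(f, x, h=1e-4):
    """Central finite-difference gradient of a scalar field f at point x."""
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

def ascend_to_peak(f, x0, step=0.1, max_iter=500, tol=1e-6):
    """Follow the finite-difference gradient of f uphill to a local maximum."""
    x = x0.astype(float).copy()
    for _ in range(max_iter):
        g = finite_diff_grad(f, x)
        if np.linalg.norm(g) < tol:
            break
        x = x + step * g
    return x

def assign_subclasses(f, samples, round_decimals=1):
    """Samples whose ascent converges to the same peak share a subclass label."""
    peaks = {}
    labels = []
    for x in samples:
        # Round the converged position so nearby peaks hash to the same key.
        peak = tuple(np.round(ascend_to_peak(f, x), round_decimals))
        labels.append(peaks.setdefault(peak, len(peaks)))
    return np.array(labels)
```

In the full algorithm, `f` would be the difference-of-density map, so each resulting subclass corresponds to one unimodal "mountain" of the object class.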
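For the final classification stage, an LS-SVM differs from a standard SVM in that training reduces to solving one linear system in the dual variables. A minimal sketch, assuming an RBF kernel and hypothetical parameter values (the abstract does not specify the kernel or hyperparameters):

```python
import numpy as np

def rbf_kernel(A, B, sigma=1.0):
    """Gaussian RBF kernel matrix between row vectors of A and B."""
    d2 = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def lssvm_train(X, y, gamma=10.0, sigma=1.0):
    """Solve the LS-SVM dual linear system for bias b and weights alpha.

    The system is [[0, 1^T], [1, K + I/gamma]] @ [b; alpha] = [0; y].
    """
    n = X.shape[0]
    K = rbf_kernel(X, X, sigma)
    A = np.zeros((n + 1, n + 1))
    A[0, 1:] = 1.0
    A[1:, 0] = 1.0
    A[1:, 1:] = K + np.eye(n) / gamma
    rhs = np.concatenate(([0.0], y))
    sol = np.linalg.solve(A, rhs)
    return sol[0], sol[1:]

def lssvm_predict(X_train, alpha, b, X_new, sigma=1.0):
    """Sign of the kernel expansion gives the predicted class label."""
    return np.sign(rbf_kernel(X_new, X_train, sigma) @ alpha + b)
```

Because every training example contributes a dual weight, the LS-SVM trades the sparsity of the standard SVM for a closed-form solve, which keeps the classifier simple once the subclass labels are in place.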
Funder Acknowledgement(s): National Science Foundation (NSF/HBCU-UP/ HRD #1505509, HRD #1533479, and NSF/DUE #1654474)
Faculty Advisor: Nian Zhang, nzhang@udc.edu
Role: I contributed to the MATLAB implementation of the principal component analysis (PCA) step, the difference-of-density estimation step, and the gradient ascent step.