Discipline: Technology and Engineering
Subcategory: STEM Research
Nian Zhang - University of the District of Columbia
The goal of this project is to analyze a noise-free but highly overlapped and imbalanced data set of trace explosives. The objective is to develop algorithms that reveal the underlying mechanisms affecting clustering performance across different combinations of principal components and different numbers of features. We explored the principal components in the feature space to observe which spectral bands contribute the most contrast, or data spread. We also compared the principal component analysis (PCA) results and the corresponding k-means clustering results obtained with lower-ranked PCs (PC4 and PC5) against those obtained with the top PCs (PC1 and PC2). Specifically, we projected the data into PC1-PC2, PC1-PC3, PC2-PC3, and PC4-PC5 space. We then used the k-means clustering algorithm to group the samples into six classes: TNT, DNT, PE, PC, RDX, and copper/steel. We developed an algorithm that automatically determines which cluster corresponds to which analyte. To make visual comparison easier, the k-means plots use the same color code as the PCA plots: red represents TNT, green represents DNT, yellow represents PE, cyan represents PC, blue represents RDX, and black represents copper/steel. In this way, the PCA and k-means clustering results can be compared directly. We also evaluated clustering performance by calculating the probability of detection (POD), false alarm rate (FAR), accuracy, precision, and recall. This process was facilitated by an automatic algorithm that determines the true positives (TP), false negatives (FN), false positives (FP), and true negatives (TN), which are the components of these performance metrics.
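The cluster-to-analyte assignment and the metric calculation described above can be sketched as follows. This is only an illustration under stated assumptions, not the project's actual algorithm: the abstract does not say how clusters are matched to analytes, so a majority vote over known labels is assumed here, FAR is assumed to mean FP / (FP + TN), and the function names and toy data are hypothetical.

```python
from collections import Counter


def map_clusters_to_labels(cluster_ids, true_labels):
    """Assumed rule: each cluster takes the most common true label among its members."""
    members = {}
    for cid, label in zip(cluster_ids, true_labels):
        members.setdefault(cid, Counter())[label] += 1
    return {cid: counts.most_common(1)[0][0] for cid, counts in members.items()}


def one_vs_rest_counts(true_labels, predicted_labels, target):
    """Count TP, FN, FP, TN for one analyte treated as the positive class."""
    tp = fn = fp = tn = 0
    for t, p in zip(true_labels, predicted_labels):
        if t == target and p == target:
            tp += 1
        elif t == target:
            fn += 1
        elif p == target:
            fp += 1
        else:
            tn += 1
    return tp, fn, fp, tn


def metrics(tp, fn, fp, tn):
    """Derive the five metrics; a zero denominator yields 0.0."""
    def ratio(num, den):
        return num / den if den else 0.0

    return {
        "pod": ratio(tp, tp + fn),        # probability of detection
        "far": ratio(fp, fp + tn),        # assumed definition of false alarm rate
        "accuracy": ratio(tp + tn, tp + fn + fp + tn),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),     # identical to POD under this definition
    }


# Toy data invented for illustration only (not from the study).
true_labels = ["TNT", "TNT", "TNT", "DNT", "DNT", "RDX", "RDX", "RDX"]
cluster_ids = [0, 0, 1, 1, 1, 2, 2, 0]

mapping = map_clusters_to_labels(cluster_ids, true_labels)
predicted = [mapping[c] for c in cluster_ids]
tnt_counts = one_vs_rest_counts(true_labels, predicted, "TNT")
tnt_metrics = metrics(*tnt_counts)
```

Repeating the one-vs-rest count for each of the six classes gives per-analyte metrics, which matters for an imbalanced data set where overall accuracy alone can hide poor detection of a rare class.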
In addition, we compared the clustering ability of different combinations of principal components. We also investigated the false alarms to see whether they are always the same samples and, if so, which ones. Furthermore, we investigated the effect of using the first 28 features on the PCA and k-means results for different combinations of principal components. The experimental results demonstrated that the top principal components (PCs) yield higher clustering accuracy than the lower-ranked PCs.
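One way to check whether false alarms recur on the same samples is to collect the false-alarm sample indices for each PC combination and intersect them. The sketch below is a hypothetical illustration: the function names, the predicted labels, and the choice of TNT as the target class are all assumptions made for the example, not results from the study.

```python
def false_alarm_indices(true_labels, predicted_labels, target):
    """Indices of samples predicted as `target` whose true label is something else."""
    return {
        i
        for i, (t, p) in enumerate(zip(true_labels, predicted_labels))
        if p == target and t != target
    }


def recurring_false_alarms(true_labels, runs, target):
    """Return per-run false alarms and the samples that are false alarms in every run."""
    per_run = {
        name: false_alarm_indices(true_labels, preds, target)
        for name, preds in runs.items()
    }
    always = set.intersection(*per_run.values()) if per_run else set()
    return per_run, always


# Invented labels for illustration only; each run stands for one PC combination.
true_labels = ["TNT", "DNT", "RDX", "PE", "DNT", "RDX"]
runs = {
    "PC1-PC2": ["TNT", "TNT", "RDX", "PE", "DNT", "TNT"],
    "PC1-PC3": ["TNT", "TNT", "RDX", "PE", "TNT", "RDX"],
    "PC4-PC5": ["DNT", "TNT", "TNT", "TNT", "DNT", "TNT"],
}

per_run, always = recurring_false_alarms(true_labels, runs, "TNT")
```

In this toy example, sample 1 is a false alarm in every run, while the other false alarms vary with the PC combination.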
Funder Acknowledgement(s): This work was supported in part by the National Science Foundation (NSF) under Grant HRD #1505509.
Faculty Advisor: None Listed