Discipline: Computer Sciences and Information Management
Subcategory: Computer Science & Information Systems
Desmond Hill - Grambling State University
Co-Author(s): Busby Sanders, Yenumula B. Reddy, and Jaruwan Mesit, Grambling State University
Big data refers to large volumes of structured, semi-structured, and unstructured data that are difficult to manage and costly to store. Applying exploratory analysis techniques to understand such raw data, while carefully balancing the benefits of storage and retrieval techniques, is an essential part of Big Data. The goal of the current research is to analyze Big Data using MapReduce techniques and to identify a required document from a stream of documents.
The research was conducted using Hadoop 2.6.0, JDK 7, and Python 3.4 on a Dell Precision T5500 running Ubuntu 14.04 in the Department of Computer Science Research Lab at Grambling State University during spring and summer 2015.
The process included the following steps:
1. Two computers were used to set up single-node Hadoop clusters;
2. Install Ubuntu 14.04 and create a Hadoop user account;
3. Configure Secure Shell (SSH): generate a key for the Hadoop account and enable SSH access to the local machine. Test the SSH setup by connecting to the local machine with the Hadoop account; and
4. Update the required Hadoop core configuration files as directed in the documentation so that the single-node cluster runs. Use the jps command to verify the Hadoop single-node cluster.
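Once the cluster is running, the first stage of the MapReduce job, counting word occurrences, can be sketched as a Hadoop Streaming-style mapper and reducer in Python. This is a minimal illustrative sketch, not the authors' code: the abstract does not state that Hadoop Streaming was the Python-Hadoop interface, and the function names are assumptions.

```python
import itertools

def mapper(lines):
    """Map phase: emit a (word, 1) pair for every word in the input lines."""
    for line in lines:
        for word in line.strip().lower().split():
            yield word, 1

def reducer(pairs):
    """Reduce phase: sum the counts for each word. The pairs must arrive
    sorted by word, as Hadoop's shuffle/sort phase guarantees."""
    for word, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

if __name__ == "__main__":
    # Small in-memory stand-in for a text file split across mappers.
    sample = ["medicine and surgery", "medicine is a profession"]
    counts = dict(reducer(sorted(mapper(sample))))
    print(counts["medicine"])  # 2
```

In a real Hadoop Streaming job, the mapper and reducer would instead read lines from standard input and print tab-separated pairs, with Hadoop handling the sort between the two phases.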
To select a required document, we provide the keywords and their importance factors, which vary between 0 and 1. We then multiply each keyword's importance factor by the number of times the keyword appears and sum the results over all keywords. If the sum is greater than or equal to a threshold, we conclude that the document is required. The algorithm was coded in Python in two stages: the first stage counts the repetitions of the words, and the second stage applies the importance factors and selects the document. We processed six text files to compare the results for the current experiment. The keywords and importance factors provided are: medicine (0.02), profession (0.025), disease (0.02), surgery (0.02), mythology (0.02), and cure (0.05). The size of each file in words, the time taken in seconds to process it, and the resulting impact factor, respectively, are: (128,729; 0.1539; 5.39), (128,805; 0.1496; 0.62), (266,017; 0.13887; 0), (277,478; 0.1692; 6.02), (330,582; 0.1725; 7.93), and (409,113; 0.2032; 18.87). The threshold was set to 10.0; therefore, the file with impact factor 18.87 is selected as the required file. If we lowered the threshold to 5.0, two more files, with impact factors 6.02 and 7.93, would also become required files.
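The selection criterion above can be sketched in Python. The keyword weights and the 10.0 threshold come from the abstract; the function names, the tokenization by whitespace, and the sample document are illustrative assumptions.

```python
# Keyword importance factors from the experiment.
WEIGHTS = {"medicine": 0.02, "profession": 0.025, "disease": 0.02,
           "surgery": 0.02, "mythology": 0.02, "cure": 0.05}

def impact_factor(text, weights=WEIGHTS):
    """Sum of (importance factor x keyword count) over all keywords."""
    counts = {}
    for word in text.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return sum(w * counts.get(kw, 0) for kw, w in weights.items())

def is_required(text, threshold=10.0):
    """A document is 'required' when its impact factor meets the threshold."""
    return impact_factor(text) >= threshold

if __name__ == "__main__":
    # Hypothetical document: 100 occurrences of "cure", 300 of "medicine".
    doc = "cure " * 100 + "medicine " * 300
    print(round(impact_factor(doc), 2))  # 0.05*100 + 0.02*300 = 11.0
    print(is_required(doc))              # True at threshold 10.0
```

With the threshold of 10.0, only documents whose weighted keyword counts sum to at least 10.0 are selected, matching the selection of the file with impact factor 18.87 above.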
The current research discusses the implementation of the Hadoop Distributed File System (HDFS) and the selection of a required document from a stream of documents (unstructured data). Setting up the single-node Hadoop cluster on two machines and selecting the required document from the stream of text documents were completed successfully. The process did not involve any data models or SQL; it used a simple Hadoop cluster, a MapReduce algorithm, and the required keywords with their importance factors.
Funder Acknowledgement(s): The research work was supported by the AFRL Collaboration Program: Sensors Directorate, Air Force Contract FA8650-13-C-5800, through subcontract number GRAM 13-S7700-02-C2.
Faculty Advisor: Yenumula B. Reddy