ERN: Emerging Researchers National Conference in STEM

Retrieving a Required Document Using Hadoop Technology

Undergraduate #58
Discipline: Computer Sciences and Information Management
Subcategory: Computer Science & Information Systems

Desmond Hill - Grambling State University
Co-Author(s): Busby Sanders, Yenumula B. Reddy, and Jaruwan Mesit, Grambling State University

Big data refers to large volumes of structured, unstructured, and semi-structured data that are difficult to manage and costly to store. Using exploratory analysis techniques to understand such raw data, while carefully balancing the benefits of storage and retrieval techniques, is an essential part of working with big data. The goal of the current research is to analyze big data using MapReduce techniques and to identify a required document from a stream of documents.

The research was conducted using Hadoop 2.6.0, JDK 7, and Python 3.4 on a Dell Precision T5500 running Ubuntu 14.04 in the Department of Computer Science Research Lab at Grambling State University during the spring and summer of 2015.

The process includes the following steps:
1. Set up a single-node Hadoop cluster on each of two computers;
2. Install Ubuntu 14.04 and create a dedicated Hadoop user account;
3. Configure Secure Shell (SSH), generate a key for the Hadoop account, and enable SSH access to the local machine. Test the SSH setup by connecting to the local machine with the Hadoop account; and
4. Update the required Hadoop configuration files as directed in the documentation so that the single-node cluster runs. Use the jps command to verify that the Hadoop daemons of the single-node cluster are running.
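With the single-node cluster running, the word counts used in the selection step below can be produced with a Hadoop Streaming job written in Python. The following mapper and reducer are a minimal sketch of that first stage, not the exact scripts used in this study; the file names mapper.py and reducer.py are illustrative.

#!/usr/bin/env python
# mapper.py -- emit "word<TAB>1" for every word read from standard input
import sys

for line in sys.stdin:
    for word in line.strip().lower().split():
        print('%s\t%d' % (word, 1))

#!/usr/bin/env python
# reducer.py -- sum the counts per word (Hadoop delivers mapper output sorted by key)
import sys

current_word, current_count = None, 0
for line in sys.stdin:
    word, count = line.rstrip('\n').split('\t', 1)
    if word == current_word:
        current_count += int(count)
    else:
        if current_word is not None:
            print('%s\t%d' % (current_word, current_count))
        current_word, current_count = word, int(count)
if current_word is not None:
    print('%s\t%d' % (current_word, current_count))

Scripts like these are typically launched through Hadoop Streaming's -mapper and -reducer options; the exact invocation depends on the local Hadoop installation.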

To select a required document, we provide keywords and their importance factors, which vary between 0 and 1. For each keyword, the importance factor is multiplied by the number of times the keyword appears in the document, and the results are summed over all keywords. If the sum is greater than or equal to a threshold, we conclude that the document is required. The algorithm was coded in Python in two stages: the first stage counts the repetitions of the words, and the second stage applies the importance factors and selects the document. We processed six text files to compare the results for the current experiment. The keywords and their importance factors are: medicine (0.02), profession (0.025), disease (0.02), surgery (0.02), mythology (0.02), and cure (0.05). The size of each file in words, the processing time in seconds, and the resulting impact factor are, respectively: (128,729; 0.1539; 5.39), (128,805; 0.1496; 0.62), (266,017; 0.13887; 0), (277,478; 0.1692; 6.02), (330,582; 0.1725; 7.93), and (409,113; 0.2032; 18.87). The threshold was set to 10.0; therefore, the file with impact factor 18.87 is selected as the required file. If we lowered the threshold to 5.0, another two files, with impact factors 6.02 and 7.93, would become required files.
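The second stage reduces to a weighted sum compared against a threshold. The sketch below is illustrative and assumes the word counts from the first stage are available as a Python dictionary; the names keywords, impact_factor, and is_required are ours, not taken from the study's code.

# Importance factors from the experiment above.
keywords = {'medicine': 0.02, 'profession': 0.025, 'disease': 0.02,
            'surgery': 0.02, 'mythology': 0.02, 'cure': 0.05}

def impact_factor(word_counts, keywords):
    # Sum of (importance factor * number of occurrences) over all keywords.
    return sum(weight * word_counts.get(word, 0)
               for word, weight in keywords.items())

def is_required(word_counts, keywords, threshold=10.0):
    # A document is required when its impact factor reaches the threshold.
    return impact_factor(word_counts, keywords) >= threshold

# Hypothetical counts: if 'cure' appears 200 times and 'disease' 300 times,
# the impact factor is 0.05*200 + 0.02*300 = 16.0 >= 10.0, so the document is required.

For example, the file with impact factor 18.87 satisfies the threshold of 10.0, which matches the selection reported above.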

The current research discusses the implementation of the Hadoop Distributed File System and the selection of a required document from a stream of documents (unstructured data). Setting up a Hadoop single-node cluster on two machines and selecting the required document from a stream of text documents were completed successfully. The process did not involve any data models or SQL; it used only a simple Hadoop cluster, a MapReduce algorithm, the required keywords, and their importance factors.

Funder Acknowledgement(s): The research work was supported by the AFRL Collaboration Program: Sensors Directorate, Air Force Contract FA8650-13-C-5800, through subcontract number GRAM 13-S7700-02-C2.

Faculty Advisor: Yenumula B. Reddy
