ERN: Emerging Researchers National Conference in STEM

Large-scale Workload Characterization in Apache Spark Framework

Undergraduate #42
Discipline: Computer Sciences and Information Management
Subcategory: Computer Science & Information Systems

Sebastian Cousins - Winston-Salem State University
Co-Author(s): Debzani Deb, Winston-Salem State University, Winston-Salem, NC



The Apache Spark platform [1] has recently grown in both popularity and complexity and has been adopted for big data analytics across a wide spectrum of domains, many of which deploy their applications in cloud environments. To use available cloud resources efficiently while still meeting workload demands, Spark users need to understand how the various configuration settings affect the execution of their applications. Spark relies on users to specify many configuration parameters, and understanding the effect of these choices on execution time and resource utilization is not an easy task. The goal of this study is to characterize the performance of Spark applications by running them under various configuration settings and to gain deeper insight into resource provisioning and workload management.

Toward this goal, we deployed a Spark-on-YARN cluster on the Chameleon Cloud [2], a large-scale, open cloud research platform funded by the NSF that provides computational power beyond what we have in house. Our cluster consists of a master and 8 slave nodes (each with 48 cores and 128 GB of memory) running Ubuntu 16.04 with Hadoop 2.7.4, Spark 2.2.0, and Scala 2.11.8. We gathered our experimental data from the logs available through Spark's history server and from Ganglia, a cluster-wide monitoring tool that provides important insight into overall cluster resource utilization (such as CPU and memory). We used two representative Spark-SQL applications from the TPC-H benchmark [3] and executed them with 50 GB and 100 GB datasets. We chose Query 1 (Q1) and Query 5 (Q5) because of their contrasting characteristics: Q1 performs minimal joins (and therefore shuffles the least data), which makes it CPU bound, whereas Q5 is an I/O-bound job that contains multiple joins and shuffles significant data between stages.

Our results show that the two applications exhibit different resource consumption and that their execution times are significantly impacted by configuration parameters such as the size of the RDD cache, the number of concurrent tasks in each executor, and the executor configuration (cores and memory per executor). Configuring a remote Spark-on-YARN cluster from scratch with all of the above tools is a significantly challenging task, and we learned a great deal throughout the process. We are excited to present our experiences and experimental results at the ERN 2018 conference and hope to receive feedback that advances our ultimate research goal: an elastic Spark in which resource provisioning for executors in a multi-tenant environment can occur dynamically and in an informed manner.
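To make the tuning surface concrete, here is a minimal sketch, assuming Spark 2.2's unified memory manager, of how the parameters named above map onto Spark configuration keys. The application name and all values are illustrative, not the settings swept in the study.

```scala
// Minimal sketch: the knobs discussed above, expressed as Spark 2.2 config keys.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("spark-workload-characterization")    // hypothetical application name
  .master("yarn")                                // Spark on YARN, as in the study
  .config("spark.executor.instances", "8")       // e.g., one executor per slave node
  .config("spark.executor.cores", "12")          // upper bound on concurrent tasks per executor
  .config("spark.executor.memory", "24g")        // memory per executor
  .config("spark.memory.fraction", "0.6")        // unified execution + storage pool
  .config("spark.memory.storageFraction", "0.5") // share of that pool reserved for the RDD cache
  .getOrCreate()
```

On YARN, the same knobs can also be passed on the command line through spark-submit's --num-executors, --executor-cores, and --executor-memory flags.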
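Similarly, the sketch below shows how one of the two workloads can be issued as a Spark-SQL job. The query text is TPC-H Q1 from the public specification; the assumption that a lineitem view has already been registered over the generated data is ours, and Q5 would be run the same way with its own SQL text.

```scala
// Hedged sketch: TPC-H Q1 (a scan-heavy aggregation with no joins, hence CPU
// bound) issued through Spark SQL. Assumes a `lineitem` view was registered
// earlier from the 50 GB or 100 GB dbgen output.
val q1 = spark.sql("""
  SELECT l_returnflag,
         l_linestatus,
         sum(l_quantity)                                       AS sum_qty,
         sum(l_extendedprice)                                  AS sum_base_price,
         sum(l_extendedprice * (1 - l_discount))               AS sum_disc_price,
         sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) AS sum_charge,
         avg(l_quantity)                                       AS avg_qty,
         avg(l_extendedprice)                                  AS avg_price,
         avg(l_discount)                                       AS avg_disc,
         count(*)                                              AS count_order
  FROM lineitem
  WHERE l_shipdate <= date '1998-09-02'  -- spec delta of 90 days before 1998-12-01
  GROUP BY l_returnflag, l_linestatus
  ORDER BY l_returnflag, l_linestatus
""")
q1.show()  // forces execution, so the run is captured by the history server
```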

References

1. Apache Spark, https://spark.apache.org

2. Chameleon Cloud, https://www.chameleoncloud.org

3. TPC-H Decision Support Benchmark, http://www.tpc.org/tpch/

Funder Acknowledgement(s): This study is supported by NSF HBCU-UP grant #1600864, awarded to Debzani Deb, Associate Professor, Winston-Salem State University.

Faculty Advisor: Debzani Deb, Winston-Salem State University, debd@wssu.edu

Role: I was involved in study planning; set up the cluster and each of its nodes on the Chameleon Cloud; generated the benchmark data using the TPC-H tools; ran the tests using that data with queries Q1 and Q5; gathered the results and graphed the data, assembling them into meaningful information; and participated in the discussion and analysis of the results and conclusions.
