Project 2: Large-Scale Data Analysis using Hadoop MapReduce

  < Previous

Objectives

The goal of this project is to provide hands-on experience with distributed data processing using Hadoop MapReduce. You get to analyze and extract insights from a real-world, large-scale dataset (at least a gigabytes in size) by designing and implementing efficient MapReduce tasks. This project emphasizes understanding the scalability, and performance characteristics of the Hadoop ecosystem.

Description

In this project, you get to select a large dataset — such as web logs, social media streams, sensor data, or public datasets (e.g., Common Crawl, OpenStreetMap, Enron emails, Kaggle, etc.) — and perform an end-to-end data analysis workflow using Hadoop MapReduce.

You may complete this individually or in a group of at most 3 students in the class.

You get to:

Possible analysis examples:

Deliverables

Code Repository

Well-documented code with clear instructions for execution.

Report

Describe your problem selection, any data preprocessing, your system design, implementation, results, and performance evaluation.

Presentation

Create a video presentation about your analysis. Include background information about the problem that you chose. Describe the dataset(s) that you used. Briefly explain the set-up of your Hadoop system. Demonstrate your analysis actually running. Summarize the results produced / key findings. Also include implementation challenges and lessons learned.

Rubric:

Project 2 rubric thumbnail

Submissions

CougarVIEW Assignment

Submit the following to the CougarVIEW Assignment:

CougarVIEW Discussion

Submit your video presentation to the CougarVIEW Discussion. Review at least 2 other presentations and comment on something you liked about their approach or analysis that you could use in the future.