Project 2: Large-Scale Data Analysis using Hadoop MapReduce
The goal of this project is to provide hands-on experience with distributed data processing using Hadoop MapReduce. You get to analyze and extract insights from a real-world, large-scale dataset (at least a gigabyte in size) by designing and implementing efficient MapReduce jobs. This project emphasizes understanding the scalability and performance characteristics of the Hadoop ecosystem.
In this project, you get to select a large dataset — such as web logs, social media streams, sensor data, or public datasets (e.g., Common Crawl, OpenStreetMap, Enron emails, Kaggle, etc.) — and perform an end-to-end data analysis workflow using Hadoop MapReduce.
You may complete this project individually or in a group of at most 3 students from the class.
You get to:
- Select a problem and a large-scale dataset suited to it.
- Preprocess the data as needed and load it into your Hadoop system.
- Design and implement MapReduce jobs for your analysis (a minimal sketch follows this list).
- Run the jobs and collect the results.
- Evaluate the performance and scalability of your solution.
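To illustrate the structure of a MapReduce job, here is a minimal word-count sketch in Java, assuming a plain-text input. The class and method names are hypothetical; your own analysis will substitute its own keys, values, and logic.

```java
// WordCount.java -- hypothetical sketch with the mapper and reducer as nested classes.
import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Emit (token, 1) for each whitespace-separated token in the line.
            StringTokenizer tokens = new StringTokenizer(line.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken().toLowerCase());
                context.write(word, ONE);
            }
        }
    }

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            // Sum the per-token counts across the whole dataset.
            int sum = 0;
            for (IntWritable count : counts) {
                sum += count.get();
            }
            context.write(word, new IntWritable(sum));
        }
    }
}
```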
Possible analysis examples:
- Traffic, error, or access-pattern analysis over web server logs.
- Term-frequency or trend analysis over social media streams or the Enron email corpus.
- Time-windowed aggregation of sensor readings.
- Page, link, or geographic-feature statistics over Common Crawl or OpenStreetMap extracts.
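Once the map and reduce logic exists, a short driver configures and submits the job. The sketch below is again hypothetical and pairs with the WordCount class above; it also sets a combiner, one of the simplest tunings to try when you evaluate performance on a multi-gigabyte input.

```java
// WordCountDriver.java -- hypothetical driver for the WordCount sketch above.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCountDriver.class);
        job.setMapperClass(WordCount.TokenizerMapper.class);
        // Reusing the reducer as a combiner pre-aggregates counts on each map
        // node, shrinking the data shuffled across the network. This is safe
        // here because summing is associative and commutative.
        job.setCombinerClass(WordCount.IntSumReducer.class);
        job.setReducerClass(WordCount.IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Input and output are HDFS paths given on the command line, e.g.:
        //   hadoop jar wordcount.jar WordCountDriver /data/in /data/out
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```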
Submit the following to the CougarVIEW Assignment:
- Well-documented code with clear instructions for execution.
- A report describing your problem selection, any data preprocessing, your system design, implementation, results, and performance evaluation.
- A video presentation about your analysis. Include background information about the problem that you chose. Describe the dataset(s) that you used. Briefly explain the set-up of your Hadoop system. Demonstrate your analysis actually running. Summarize the results produced / key findings. Also include implementation challenges and lessons learned.
Submit your video presentation to the CougarVIEW Discussion. Review at least 2 other presentations and comment on something you liked about their approach or analysis that you could use in the future.