Course Description
Introduction
The term "big data" is now commonly used to mean that data is growing in volume, velocity, variety and veracity at such an unprecedented scale that traditional database management systems can no longer handle it properly. Take Walmart, the world's biggest retailer with over 20,000 stores in 28 countries, as an example: it is building the world's biggest private cloud to process 2.5 petabytes of data every hour. Facebook, the world's most popular social media network, needs to process data from more than 2 billion monthly active users worldwide. Every 60 seconds, 136,000 photos are uploaded, 510,000 comments are posted, and 293,000 statuses are updated. That amounts to 1000+ terabytes of data generated per day. Within the Large Hadron Collider (LHC) at CERN, particles collide approximately 600 million times per second; merely recording these events would take up 500 EB (1 EB = 1024 PB) of storage per day, let alone analyzing them. We therefore need new technologies (big data processing) and new tools (big data systems) for these jobs. This is an introductory course on big data concepts, processing, analytics and systems. You will learn about the latest developments in big data technologies and gain hands-on experience with popular open source big data systems such as Hadoop, HBase, Spark, Hama, etc.
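To give a taste of the programming model behind systems like Hadoop, here is a minimal sketch of a word count in the map/shuffle/reduce style. This is plain Python rather than actual Hadoop code, and all function names are illustrative; a real framework would distribute each phase across a cluster.

```python
from itertools import groupby
from operator import itemgetter

# Map phase: emit (word, 1) pairs from each input line.
def map_phase(lines):
    for line in lines:
        for word in line.split():
            yield (word.lower(), 1)

# Shuffle phase: group the pairs by key, as the framework
# would do between the map and reduce stages.
def shuffle(pairs):
    ordered = sorted(pairs, key=itemgetter(0))
    for key, group in groupby(ordered, key=itemgetter(0)):
        yield key, [count for _, count in group]

# Reduce phase: sum the counts for each word.
def reduce_phase(grouped):
    return {word: sum(counts) for word, counts in grouped}

lines = ["big data needs big systems", "big data"]
counts = reduce_phase(shuffle(map_phase(lines)))
print(counts["big"])  # 3
```

The same three-phase structure underlies many of the systems covered in this course, which differ mainly in how they store data, schedule work and recover from failures.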
The objectives of this course can be summarized as follows.
- Understand big data concepts, challenges and trends.
- Learn the technological foundations of big data science & engineering.
- Learn the principles and practices behind popular open source big data systems.
- Get hands-on experience using open source big data systems to solve big data problems.
This is a lecture-oriented course. The systems part of the course will be exercised through in-class example discussions, homework assignments and a term project. Due to time constraints, the lectures will focus mostly on the technological innovations of each system rather than on how to use them. Beyond brief introductions to the basic operations of the various big data systems, students are expected to learn to use them on their own.
The topics to be covered in the lectures are listed below (**: topics to be covered depending on time and progress):
- Introduction
  - Data -> Knowledge -> Intelligence
  - What/Why/When of Big Data
  - The opportunities and challenges for Big Data
- General purpose big data systems
- Big data storage
  - Distributed filesystems and Google GFS
  - Apache HDFS
- Big structured data processing
  - Google BigTable system
  - NoSQL and Apache HBase
  - Cassandra and MongoDB
  - Data Warehousing, Google BigQuery and Apache Hive
- Big graph processing
  - Pregel family of systems
  - GraphLab family of systems
- Big stream processing
  - Apache Flink
  - Apache Storm
  - Spark Streaming
- Other topics: Google Dremel, Apache Drill, Apache Impala, Google Knowledge Graph, Open Data, Beyond Hadoop
Visit the syllabus page for detailed information about the lecture schedule.
Administrative Information