NOSQL - Managing Data (almost) without a Database System WS 10/11
Background:
- current wave of NOSQL data management systems and start-ups
- e.g.: MapReduce, Hadoop, Hadoop++, HBase, BigTable, Hypertable, CouchDB, MongoDB, Cassandra, SimpleDB, PNUTS, neo4j, voldemort, ...
Goals:
- understand motivation for not using existing DBMS
- understand technology behind those systems
- understand when to use which system
- understand to what degree these systems are reinventing the wheel
Requirements:
- sound understanding of relational DBMS
- i.e. at least a good grade in the Informationssysteme lecture or a comparable lecture
Administrative issues:
- Time: Thursdays, 10:15 to 12:00
- Place: E1.3, HS III
- Type: advanced lecture
Slides
- Admin, Introduction, Motivation
- MapReduce:
- accompanying slides
- GFS original paper:
Sanjay Ghemawat, Howard Gobioff, Shun-Tak Leung: The Google file system. SOSP 2003:29-43 pdf
- MapReduce original paper:
Jeffrey Dean, Sanjay Ghemawat: MapReduce: Simplified Data Processing on Large Clusters. OSDI 2004:137-150 pdf
- Hadoop++ paper with detailed execution plan and relational mappings in Section 2 and Appendix B.1ff:
Jens Dittrich, Jorge-Arnulfo Quiane-Ruiz, Alekh Jindal, Yagiz Kargin, Vinay Setty, Joerg Schad: Hadoop++: Making a Yellow Elephant Run Like a Cheetah (Without It Even Noticing). PVLDB 3(1):518-529 (2010) pdf
- open source implementation Hadoop with HDFS and Hadoop MapReduce
- 39% of all companies interested in NOSQL! ;-) link
- MapReduce and Hadoop (continued), Nephele/PACTs
- outline (continued from last week: special case of a SQL join/groupby query that can (surprisingly) be translated to a single MR job using only map()/reduce() or the ten UDFs, failover strategies of MapReduce including task failure, master failure, skipping bad records, stragglers, backup tasks, improvements, RAFT, local checkpointing, remote checkpointing, query metadata logging, PACTs, m()-functions, single-input contracts, how to model the MapReduce programming paradigm with PACTs, partitioning units)
- RAFT tech report
- Nephele/PACTs paper
- Big Data and NoSQL March to the Enterprise
- Nephele/PACTs (continued), BigTable and HBase
- accompanying slides
- outline (PACTs: multi-input contracts, association functions, applications; BigTable: data model, column families, representation versus physical storage, pivot, impact of physical layout, tablets, partitioning problem, query processing, two-layer architecture, homework: find out why two layers?)
- BigTable paper
- HBase
- HBase storage architecture
- HBase vs. BigTable comparison
- BigTable and HBase (continued)
- accompanying slides
- outline (TabletServer organization, log-structured writes, LSM tree, partitioned exponential file, indexing in BigTable, Chubby, detecting failed tablet servers, how to compare two index structures, compression, bitmaps)
- The partitioned exponential file
- BigTable (continued) and Pig Latin
- Apache Pig
- outline (agenda, BigTable optimizations, Bloom filter, locality groups, Pig Latin, Pig, data model, programs, reordering, DUMP)
- Pig Latin Paper, SIGMOD 2008
- Pig Latin Paper, VLDB 2009
- Pig Latin (continued) and HiveQL
- outline (HBase versus BigTable, Pig Latin demo, describe, foreach generate, filter, join, illustrate, flatten, stream, UDFs, physical execution, joins: replicated, skewed, merge, what versus how, parallel; Hive and HiveQL, semijoins, differences from HBase)
- Hive
- HiveQL Language Manual
- BerkeleyDB
- MongoDB
- Storage and OctopusDB
- Hadoop++
- CIDR 2011, Dataspaces
- RDF, Martin Theobald
- Percolator and Incremental MapReduce, Rodrigo Rodrigues
Exercises and Groups
- Time/Location
- Fridays, 10:15 and 14:15
- MPII, room 0.23
- Assistants
- Assignments
- Exercise 1:
- Exercise 2:
- exercise
- Note: we have not yet covered PACTs multi-input contracts. You may therefore hand in Exercise 2.3.2 (TPC-H, Query 3) together with Exercise 3.
- modified lineitem table
- Exercise 3:
- exercise
- Note: you may hand in Exercise 3, Part 3 (HBase with Hadoop MapReduce) together with Exercise 4. For future exercises, all implementation tasks may be handed in after two weeks.
- Exercise 4:
- Exercise 5:
- Exercise 6:
- exercise
- Note: we have not yet covered BerkeleyDB. You may therefore hand in Exercise 6.2 (BerkeleyDB Concepts) until Dec 16, noon.
- Exercise 7:
- Exercise 8:
- Exercise 9:
- Rules
- you need to reach 50% of the points in the assignments to participate in the final exam
- you need to pass either the final exam or the repetition exam (the better exam counts)
- Exams
- (06-12-10) Please register in HISPOS by January 27, 2011 (at the latest!).
- Mini Midterm in December (counts for 20% of your final grade)
- Final Exam on February 10, 2011, 10:15 am to 12:00
- Repetition Exam in March
- Sample Solutions
- Solutions for exercises 1 to 6
- Solutions for exercises 7 to 9
- Sample solutions for the assignments will be provided as PDF two weeks before the final exam.
- Why "sample"? For many assignments there is not just a single right solution but several.
- TBA