Data Intensive Apps - book notes

1. thinking about data systems
2. reliability
3. scalability
4. maintainability
1. relational vs document models
2. many-to-one, many-to-many relations
3. are document DBs repeating history?
4. relational vs document DBs today
5. query languages
6. mapreduce queries
7. graph data models
8. cypher - a query language
9. sql graph queries
10. triple-stores & sparql
11. the foundation - datalog
1. data structures
2. hash indexes
3. sstables & lsm-trees
4. b-trees
5. b-trees vs lsm-trees
6. other indexing structures
7. transaction processing, or analytics?
8. data warehousing
9. stars & snowflakes - analytics schemas
10. column-based storage
11. column compression
12. column storage - sort order
13. column storage - writes
14. aggregation - data cubes & materialized views
1. encoding formats
2. thrift & protocol buffers
3. apache avro (encoding format)
4. schemas - the merits
5. dataflows - thru databases
6. dataflows - thru svcs (REST, RPC)
7. dataflows - message passing
1. leaders & followers
2. sync vs async replication
3. new followers
4. node outages
5. implementation of replication logs
6. replication lag - problems
7. multi-leader replication
8. leaderless replication
1. partitioning & replication
2. key-value data partitions
3. secondary indexes
4. rebalancing
5. request routing
1. the (slippery) concept of a transaction
2. ACID - atomicity, consistency, isolation, durability
3. single- & multi-object ops
4. weak isolation levels
5. preventing lost updates
6. write skew, and phantoms
7. serializability
8. 2-phase locking (2PL)
9. serializable snapshot isolation (SSI)
1. faults & partial failures
2. cloud computing & supercomputing
3. unreliable networks
4. unreliable clocks
5. knowledge, truth & lies
6. system models vs reality
1. consistency guarantees
2. linearizability
3. ordering guarantees
4. distributed transactions & consensus
5. membership & coordination services
1. batch processing with unix tools
2. unix philosophy
3. mapreduce & distributed filesystems
4. reduce-side joins & grouping
5. map-side joins
6. batch workflow outputs
7. hadoop vs distributed databases
8. beyond mapreduce
9. graphs & iterative processing
10. high-level APIs & languages
1. transmitting event streams
2. messaging systems
3. partitioned logs
4. databases and streams
5. event sourcing
6. streams, states & immutability
7. processing streams
8. reasoning about time
9. stream joins
10. fault tolerance
1. data integration
2. batch & stream processing
3. unbundling databases
4. designing apps around dataflows
5. observing derived states
6. aiming for correctness
7. enforcing constraints
8. timeliness & integrity
9. trust, but verify
10. doing the right thing