Grad Advanced Operating Systems & Distrib. Systems Notes

Professor Dave Andersen

2016-Oct-14

Dynamo Paper

Eventual Consistency -> Vector Clocks (Bayou) / Red Blue Consistency
Consistent Hashing / Chord
Gossip Protocols
SLOs & 99.9% Percentile Latency
Merkle Trees

2016-Sept-30

The Design and Implementation of a Log-Structured File System

How many disks are at $x -> estimate data per second ingestion / disk storage

Context Raid: (IBM / Disk Size)
Balance
Spindles vs $/tb vs trut?
- RAID Levels
Reliability / MTTR / MTBF / MTTF etc.
Schroeder
RAID6 vs Cross-rack/de
Section Remapping is eww (inserts seeks into reads).

Reliability

Raid vs Tape Storage vs zone replication (buddy data centers!)
http://firstround.com/review/the-three-infrastructure-mistakes-your-company-must-not-make/
Monitor everything
https://en.wikipedia.org/wiki/Jerry_Saltzer has an undergrad book & class on relability eng

RAID

0- striping, 1- mirror, 2- ecc, 3- single group, 4/5 sector.
https://community.netapp.com/fukiw75442/attachments/fukiw75442/data-ontap-discussions/2334/1/WAFL.pdf
Hard drives are not as good as promised
Failures correlated
MTTRs are increasing (fast + bigger disks)
Failures increase (esp during heavy use!)

“When you’re operating at googlebooksoft scale” every server cannot be a special snowflake

2016-Sept-26

Paper: AFS Howard 88

AFS:

# of users > 5-10k
Efficient servers by whole-file caching
Server crash -> pause
No Net? Maybe local cache if lease
Typically 1 user prer file or read only
Evaluation?:
- Benchmark (Modified Andrew Benchmark)
- Ran commmands - You can see amdahls law’s and don’t have interactions b/w file ops
- Experience: - Much slower than local
  - Better than 1 server
  - Killed Server

Dropbox:

500M ~ users.
work offline?
- Conflict handling / merges
- Human / Piss someone off

Lustre, etc: / Panasas

Storage: Sharded in Parallel - 10 Comps. Performance!!!
GFS Looks similar -> pushed requirements to apps -> append semantics

NFS

More cluster-y
Survive crashes
Transparent
Caching! -> Validate on use. (Block Based)
Didn’t have consistent access.
3 second attr cache

How does local ops do things?

Read(fd ..) -> Get a file handle (vol, inode, gen #) -> Make server stateless and idempotent.

Q: How make generation #?
Q: Why is TCP Better > UDP For Services? Because Firewalls like TCP Better generally

2016-Sept-23

Paper: Using Model Checking to Find Serious File System Errors Yang 04

Fuzzing is helpful
- Link against ASAM (address sanitizer) / LibFuzz / AFL - American Fuzzy Lop
- Industry “State of Art” / Adaptive Fuzzing
Filesystems are ugly to fuzz
- People test general not edge cases
- Crash/Reboot => long
  - Sometimes true that disks do what you tell it - write [w1, fsync, w2] - you should have w1 if w2, … - disks sometimes consistent - if you do certain things
  - Untested Code in Drivers: Error States
What do filesystems look like?
- Well known superblock -> links to ‘/’ directory or similar, - ‘/’ then has names -> inodes with inodes pointing to blocks. Blocks can point to blocks. - Important Consistent Things - If delete dog (file), you can just delete entry in directory. Bad issue is if name deleted but lost state about ops in file system. You have pointer gone but inodes/blocks are allocated. - How do we know if this happens? FSCK
  - What if we delete inodes, and then crash (dont delete name chunk)? We start overwriting other things if we try to write to it.
  - When deleting you always delete & unlink then do unfree.
  - When you make a new file - first allocate & then link & then names and things.
  - If you do these in order you should have consistent. But large performance penalty for fsync.
- ext2 so fast vs BSD Fast File System, but loses data - last 30 sec sometimes - wrong order of ops can throw your FS away.
- JFS - Journaling FS. / ReiserFS also a thing
- BTRFS is the new cool thing
- Checksums -> Stop instead of corruption failures
Imagine building a soda dispenser

Coin -(Insert)-> Select State Select -Water-> Dispense Select -Soda-> Select

What is correctness look like?

State always valid, you have to pay, you always get what you ask for

Back to Model Checking

State Compaction (heuristic) e.g. ignore filenames
- Hashbased comparisons
Explore Efficiently
Find Impl Bugs (vs Model Checking usually being correctness)
It’s hard to specify correctness (cool paper: SQCK: a declarative file system checker)
Fsck was the basis for their correctness?
CMC + Stubs. They forwarded syscalls to their state machine

SSDs - New FS’s try to do large sequential writes, Disk drivers make Log Structured File Systems, optimize small things - “Turf War”. (So they work with legacy FS).

Disks not predictable - There’s a paper by Eric Brewer.

Question: DSLs for specifying equality relations i.e. things that are the same in a filesystem?
Question: Is disk control algorithms a research area still? Question: No code gen for formal models?

2016-Sept-12

Admin: {Wed: Time, Clocks; Fri: No Class}

Paper: Using Threads in Interactive Systems: A Case Study Hauser 93 SIGOPS

Precursor

multithread individual programs to get crisp response.
when you have 8+ cores, you need to think how to extract more parallelism.
in a modern multisocket there is two cpu’s with a dram and a PCIe each (he calls them sockets b/c cpu’s go in a sockets), connected by a interconnect. Usually ~16 cores/cpu now. L1/L2 cache on core. L3 Cache per socket. (NUMA CPU Design).
bigger = slower means segregated memory hierachy (Physics) -> Time to activate a line and length of buses.
DRAM/CPU are different fab processes - expand at different rates, you dont want the mini metal interconnects to break as they expand
HBM (high bandwidth memory) - stack of dram with a high bandwidth interconnect
Most cores can execute 2 threads at a time (hyperthreading)
PCIe - peripherical connect interface -> attach flash / NIC / High speed peripheral interconnects to attach additional components.
1000’s of threads to keep cores 100% utilized
more energy efficient to buy 2x the cpu’s but things like huffman encoding because not parallelizable

Process vs Thread (in context of paper)

Process has Seperate Address space vs Shared (Thread)
Process have own resources & permissions whereas Threads share those e.g. open file descriptors
Threads are cheaper (not cheap!). ~ 100kb in this, Processes are (~10s of MB).
- Fibers/Coroutines: “Fiber is smaller than a thread”. Within a process cooperatively process & multischedule. can have 10/100k of these.
Preemptive vs Cooperative Threads (os steps in vs thread runs till it gives up control)
A process switch requires interrupts, and then run a OS scheduler, and then returning control potentially to a different process: the most expensive thing is to invalidate the TLB :( .
Threads require interrupt but doesnt invalidate TLB
Application Schedulers on the other hand - Function call, table lookup, function call -> much much cheaper.
- e.g. Fiber schedulers are app level schedulers.
“Rob Pike from the Go team” has a great talk on using gothreads to write a compiler

Paper

Mark Weiser (an author): one of fathers of ubiquitous computing
Kinds of Threads:
- Transient: - Defer: Another thread to do long tasks
  - Pump: Input transformer
  - Sleeper (time/periodic)
  - Slack (e.g. batching=less work/more efficient). But slack process /w wrong thread priority / not many things to do is degrading of performance. Also possibility for priority inversion. - Ways to attempt to reduce deadlock: Deadlock Checker Thread / Have a thread give up all locks and try to get all of them from scratch. Also partial unrolling / failures within nested calls is painful
  - Task Rejuvenation: If something going wrong just kill it and have another process do the work -> Masks failure. Controversial. - Martin Renard’s Group (MIT): Solution is that anytime you might read from bad data / e.g. segfault -> just return random data and stuff works!-> no one noticed the corruption (in the case of web apps).
    - Micro-reboot Idea (Stanford - Candea / Armando Fox).

Q: Is DRAM always next to core on board?

2016-Sept-9 - Lecture 2

Paper: Implementing Remote Procedure Calls Birrell 1984

RPC
- Have RPC look like local calls
- synchronous calls
- marshal args & copy to foreign server
- SunRPC popularized RPC.
- Encoding as JSON/BSON/XML
- What if we do a remote call with f(a,b) where b is 10gb? > Design defined.
- Decoupled Marshal from RPC (e.g. ZeroMQ with Protobuf serialization)
How do we bind in modern day?
- Zookeeper
Failures
- Exactly Once is impossible unless you have a reliable component e.g. hard drive - but not possible with just
- We focus on async communication because real networks are not synch and “life is hard”
RPC System seems very interested in low level
- Bypass Lower Layer OS network stack
- RDMA
- Cross-Layer Optimization
RPC / General Message Passing used commonly now
Distributed Transactional Memory possible over LAN not so much WAN.

Written on September 9, 2016