Evolution of Google FS Talk

Google - Larry Greenfield (lfg@google?)

Evolution of Google FS

  • Storage runs in datacenters
  • petabytes of free space?
  • gfs
    • 2002
    • location-independent namespace
    • 100s of TB, scaled to 10s of PB
    • userspace, no POSIX semantics (no atomic operations)
      • reading and writing the same file at the same time is not a great idea
    • simple design, good for large batch work
    • GFS Masters hold the state of the filesystem (see the sketch below)
      • 3-way chunk replication
      • Files <-> chunk mapping
      • Clients talk to the GFS master first for metadata, then to chunkservers for data
      • Only one GFS master (primary), plus backups
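A minimal sketch of what the master's in-memory state might look like, based only on the bullets above (the struct layout and names are my assumptions, and the 64 MB chunk size comes from the GFS paper rather than the talk): files map to ordered lists of chunk handles, each handle maps to the chunkservers holding its replicas, and a client resolves an offset once before talking to chunkservers directly.

```go
package main

import "fmt"

// ChunkHandle identifies a single (e.g. 64 MB) chunk of a file.
type ChunkHandle uint64

// Master holds all filesystem metadata in memory (hypothetical layout).
type Master struct {
	// File name -> ordered list of chunk handles.
	files map[string][]ChunkHandle
	// Chunk handle -> chunkserver addresses holding a replica (3-way).
	locations map[ChunkHandle][]string
}

// Lookup resolves a byte offset in a file to a chunk and its replicas.
// Clients call this once, cache the result, then read from chunkservers directly.
func (m *Master) Lookup(file string, offset int64) (ChunkHandle, []string, error) {
	const chunkSize = 64 << 20 // 64 MB chunks, as in the GFS paper
	chunks, ok := m.files[file]
	if !ok {
		return 0, nil, fmt.Errorf("no such file: %s", file)
	}
	idx := int(offset / chunkSize)
	if idx >= len(chunks) {
		return 0, nil, fmt.Errorf("offset %d past end of %s", offset, file)
	}
	h := chunks[idx]
	return h, m.locations[h], nil
}

func main() {
	m := &Master{
		files:     map[string][]ChunkHandle{"/logs/part-0": {42, 43}},
		locations: map[ChunkHandle][]string{42: {"cs1", "cs7", "cs9"}, 43: {"cs2", "cs5", "cs8"}},
	}
	h, replicas, _ := m.Lookup("/logs/part-0", 70<<20) // offset lands in the second chunk
	fmt.Println(h, replicas)                           // 43 [cs2 cs5 cs8]
}
```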
    • GFS Write:
      • Open returns the location of the first/last chunk
      • FindLockHolder if additional chunks are needed
      • Buffer pushes the data to the chunkservers
      • Write commits the buffered data to disk; the chunkserver forwards the data on to the other replicas
        • This daisy chaining is great for network utilization (sketch below)
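A rough sketch of the two-phase write implied above, with hypothetical function names: the data push daisy-chains from one chunkserver to the next so each machine's outbound link carries the bytes exactly once, and then a separate commit applies the write at every replica (the primary-orders-the-commit detail is from the GFS paper, not the talk).

```go
package main

import "fmt"

// pushChain sketches the GFS data push: the client sends the buffer to the
// first chunkserver only; each chunkserver forwards it to the next replica
// in the chain while buffering it locally. Each machine's uplink carries the
// data exactly once, which is why daisy-chaining is kind to the network.
func pushChain(data []byte, replicas []string) {
	for i, cs := range replicas {
		fmt.Printf("chunkserver %s: buffered %d bytes", cs, len(data))
		if i+1 < len(replicas) {
			fmt.Printf(", forwarding to %s", replicas[i+1])
		}
		fmt.Println()
	}
}

// commit sketches the second phase: once all replicas have buffered the data,
// the client asks the primary to commit, and the primary tells the secondaries
// to apply the same mutation in the same order.
func commit(primary string, secondaries []string) {
	fmt.Printf("primary %s: write committed, applied at %v\n", primary, secondaries)
}

func main() {
	replicas := []string{"cs2", "cs5", "cs8"}
	pushChain([]byte("record data"), replicas)
	commit(replicas[0], replicas[1:])
}
```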
    • Accelerations
      • shadow masters: read-only, slightly lagging replicas of the master
      • multiple GFS cells per chunkserver (scales metadata via manual sharding)
      • automatic master election / consistent replication
      • archival Reed-Solomon encoding (overhead comparison below)
        • data must first be written replicated, then re-encoded
        • reads might hit long pauses
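The appeal of Reed-Solomon for archival data is the storage overhead compared to 3-way replication. A quick back-of-the-envelope comparison; the RS(6,3) parameters are an illustrative choice, not numbers from the talk.

```go
package main

import "fmt"

func main() {
	// 3-way replication: every byte is stored three times.
	replicationOverhead := 3.0

	// Reed-Solomon with n data shards and k parity shards: any n of the n+k
	// shards can reconstruct the data, so the overhead is (n+k)/n.
	// RS(6,3) is just an example; the talk didn't give the real parameters.
	n, k := 6.0, 3.0
	rsOverhead := (n + k) / n

	fmt.Printf("3-way replication: %.1fx raw storage\n", replicationOverhead) // 3.0x
	fmt.Printf("RS(%.0f,%.0f):           %.1fx raw storage\n", n, k, rsOverhead) // 1.5x
	// The catch from the talk: data must first land as plain replicas and be
	// re-encoded later, and a read that needs reconstruction can stall.
}
```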
    • From internal GFS 101 slides: what is GFS bad for?
      • Predictable performance (no guarantee on latency, no operation timeouts)
      • Structured data (GFS is a filename -> blob data store)
      • Random writes (optimized for parallel appends)
      • High availability (fault-tolerant, not HA)
    • architectural problems
      • GFS Master:
        • one machine is not large enough for a large FS
        • single bottleneck for metadata ops (rough sizing below)
      • GFS Semantics
        • unclear state of files when replicas lack consensus? (missed slide)
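A rough sizing exercise for why the single master becomes the limit. The 64 MB chunk size and ~64 bytes of metadata per chunk are ballpark figures from the GFS paper, not from the talk; the point is that the RAM footprint is only barely manageable, and every metadata operation for the whole cell still goes through one machine.

```go
package main

import "fmt"

func main() {
	// Rough sizing of master metadata (assumptions: 64 MB chunks and ~64 bytes
	// of in-memory metadata per chunk, roughly the GFS paper's figures).
	const (
		chunkSize       = 64 << 20 // 64 MB
		bytesPerChunkMD = 64       // metadata per chunk held in master RAM
		totalData       = 10 << 50 // 10 PB of user data
	)
	chunks := int64(totalData / chunkSize)
	metadata := chunks * bytesPerChunkMD
	fmt.Printf("%d chunks -> ~%d GB of master metadata\n", chunks, metadata>>30)
	// ~10 GB of metadata for 10 PB still fits in RAM, but scale that to
	// hundreds of PB, or to lots of small files, and one machine runs out of
	// memory; and every metadata op for the cell already funnels through it.
}
```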
    • Some GFS v2 Goals
      • Bigger, faster, predictable performance, predictable tail latency
      • GFS Master replaced by Colossus
      • GFS chunkserver replaced by D
    • Solve an easier problem: optimize for Bigtable
      • a file system for Bigtable
      • append only
      • single writer
      • rename only used to indicate a finished file (write-pattern sketch below)
      • no snapshots
      • directories unnecessary
      • Where to put metadata?
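Those constraints describe a very narrow file API. A sketch of what writing a Bigtable SSTable against such an API might look like; the interface, the in-memory stand-in, and all names here are hypothetical, not the real Bigtable or Colossus API: append records to a temp file with a single writer, close it, then rename it into place so readers only ever see finished files.

```go
package main

import "fmt"

// AppendFS is a hypothetical sketch of the narrow interface Bigtable needs
// from its filesystem: create, append, close, plus an atomic rename that
// marks a file as finished. No random writes, no snapshots, single writer.
type AppendFS interface {
	Create(tmpName string) (Appender, error)
	Rename(tmpName, finalName string) error // rename signals "this file is done"
}

type Appender interface {
	Append(record []byte) error
	Close() error
}

// writeSSTable shows the usage pattern: stream records into a temp file,
// close it, then rename it into place so readers only ever see finished files.
func writeSSTable(fs AppendFS, finalName string, records [][]byte) error {
	tmp := finalName + ".tmp"
	w, err := fs.Create(tmp)
	if err != nil {
		return err
	}
	for _, rec := range records {
		if err := w.Append(rec); err != nil {
			return err
		}
	}
	if err := w.Close(); err != nil {
		return err
	}
	return fs.Rename(tmp, finalName)
}

// memFS is a toy in-memory implementation, just to make the sketch runnable.
type memFS struct{ files map[string][]byte }

type memFile struct {
	fs   *memFS
	name string
	buf  []byte
}

func (f *memFS) Create(name string) (Appender, error) { return &memFile{fs: f, name: name}, nil }
func (f *memFS) Rename(from, to string) error {
	f.files[to] = f.files[from]
	delete(f.files, from)
	return nil
}
func (w *memFile) Append(rec []byte) error { w.buf = append(w.buf, rec...); return nil }
func (w *memFile) Close() error            { w.fs.files[w.name] = w.buf; return nil }

func main() {
	fs := &memFS{files: map[string][]byte{}}
	_ = writeSSTable(fs, "/bigtable/tablet-0/sstable-000001", [][]byte{[]byte("k1=v1;"), []byte("k2=v2;")})
	fmt.Printf("%q\n", fs.files) // map["/bigtable/tablet-0/sstable-000001":"k1=v1;k2=v2;"]
}
```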
    • Storage options back then
      • GFS
      • Bigtable (a sorted KV store on GFS)
      • Sharded MySQL with local disk & replication
        • the ads DB
      • Local KV store with Paxos replication
        • authentication DB
        • Chubby
      • Metadata in Bigtable (!?)
  • GFS master -> CFS
    • CFS Curators are Bigtable coprocessors
    • a Bigtable row corresponds to a single file (schema sketch below)
    • stripes are the replication unit, with states open, closed, and finalized
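A guess at what "one Bigtable row per file, with stripes as the replication unit" might look like as a schema. This is purely illustrative; the talk did not describe the actual CFS layout.

```go
package main

import "fmt"

// StripeState mirrors the three states mentioned in the talk.
type StripeState string

const (
	StripeOpen      StripeState = "open"      // still being appended to
	StripeClosed    StripeState = "closed"    // no more appends
	StripeFinalized StripeState = "finalized" // fully replicated / durable
)

// FileRow is a guess at what "one Bigtable row per file" might hold:
// the row key is the file name and the stripes are the replication groups.
type FileRow struct {
	RowKey  string // file name
	Stripes []Stripe
}

type Stripe struct {
	State    StripeState
	Replicas []string // D servers holding this stripe
}

func main() {
	row := FileRow{
		RowKey: "/cfs/cell-a/bigtable/tablet-7/sstable-42",
		Stripes: []Stripe{
			{State: StripeFinalized, Replicas: []string{"d-03", "d-11", "d-19"}},
			{State: StripeOpen, Replicas: []string{"d-05", "d-12", "d-20"}},
		},
	}
	fmt.Printf("%+v\n", row)
}
```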

And then I had to leave for an interview :(

Written on November 23, 2015