Evolution of Google FS Talk

Google - Larry Greenfield (lfg@google?)

Evolution of Google FS

  • Storage runs in datacenters
  • petabytes of free space?
  • gfs
    • 2002
    • location-independent namespace
    • 100s of TB, scaled to 10s of PB
    • userspace, no POSIX semantics (no atomic operations)
      • reading and writing the same file at the same time is not a great idea
    • simple design, good for large batch work
    • GFS Masters hold the state of the filesystem (see the sketch below)
      • 3-way chunk replication
      • Files <-> chunk mapping
      • Clients talk to the GFS master first for metadata, then to chunkservers for data
      • Only one GFS master (primary), plus backups
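A minimal sketch of what the master's in-memory state might look like, based only on the bullets above (the struct layout and names are my assumptions, and the 64 MB chunk size comes from the GFS paper rather than the talk): files map to ordered lists of chunk handles, each handle maps to the chunkservers holding its replicas, and a client resolves an offset once before talking to chunkservers directly.

```go
package main

import "fmt"

// ChunkHandle identifies a single (e.g. 64 MB) chunk of a file.
type ChunkHandle uint64

// Master holds all filesystem metadata in memory (hypothetical layout).
type Master struct {
	// File name -> ordered list of chunk handles.
	files map[string][]ChunkHandle
	// Chunk handle -> chunkserver addresses holding a replica (3-way).
	locations map[ChunkHandle][]string
}

// Lookup resolves a byte offset in a file to a chunk and its replicas.
// Clients call this once, cache the result, then read from chunkservers directly.
func (m *Master) Lookup(file string, offset int64) (ChunkHandle, []string, error) {
	const chunkSize = 64 << 20 // 64 MB chunks, as in the GFS paper
	chunks, ok := m.files[file]
	if !ok {
		return 0, nil, fmt.Errorf("no such file: %s", file)
	}
	idx := int(offset / chunkSize)
	if idx >= len(chunks) {
		return 0, nil, fmt.Errorf("offset %d past end of %s", offset, file)
	}
	h := chunks[idx]
	return h, m.locations[h], nil
}

func main() {
	m := &Master{
		files:     map[string][]ChunkHandle{"/logs/part-0": {42, 43}},
		locations: map[ChunkHandle][]string{42: {"cs1", "cs7", "cs9"}, 43: {"cs2", "cs5", "cs8"}},
	}
	h, replicas, _ := m.Lookup("/logs/part-0", 70<<20) // offset lands in the second chunk
	fmt.Println(h, replicas)                           // 43 [cs2 cs5 cs8]
}
```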
    • GFS Write:
      • Open returns the location of the first/last chunk
      • FindLockHolder if additional chunks are needed
      • Buffer pushes the data to the chunkservers
      • Write commits the buffered data to disk; the chunkserver forwards the data on to the other replicas
        • This daisy chaining is great for network utilization (sketch below)
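A rough sketch of the two-phase write implied above, with hypothetical function names: the data push daisy-chains from one chunkserver to the next so each machine's outbound link carries the bytes exactly once, and then a separate commit applies the write at every replica (the primary-orders-the-commit detail is from the GFS paper, not the talk).

```go
package main

import "fmt"

// pushChain sketches the GFS data push: the client sends the buffer to the
// first chunkserver only; each chunkserver forwards it to the next replica
// in the chain while buffering it locally. Each machine's uplink carries the
// data exactly once, which is why daisy-chaining is kind to the network.
func pushChain(data []byte, replicas []string) {
	for i, cs := range replicas {
		fmt.Printf("chunkserver %s: buffered %d bytes", cs, len(data))
		if i+1 < len(replicas) {
			fmt.Printf(", forwarding to %s", replicas[i+1])
		}
		fmt.Println()
	}
}

// commit sketches the second phase: once all replicas have buffered the data,
// the client asks the primary to commit, and the primary tells the secondaries
// to apply the same mutation in the same order.
func commit(primary string, secondaries []string) {
	fmt.Printf("primary %s: write committed, applied at %v\n", primary, secondaries)
}

func main() {
	replicas := []string{"cs2", "cs5", "cs8"}
	pushChain([]byte("record data"), replicas)
	commit(replicas[0], replicas[1:])
}
```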
    • Accelerations
      • shadow masters: read-only, slightly lagging replicas of the master
      • multiple GFS cells per chunkserver (scales metadata via manual sharding)
      • automatic master election / consistent replication
      • archival Reed-Solomon encoding (overhead comparison below)
        • data must first be written replicated, then re-encoded
        • reads might hit long pauses
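The appeal of Reed-Solomon for archival data is the storage overhead compared to 3-way replication. A quick back-of-the-envelope comparison; the RS(6,3) parameters are an illustrative choice, not numbers from the talk.

```go
package main

import "fmt"

func main() {
	// 3-way replication: every byte is stored three times.
	replicationOverhead := 3.0

	// Reed-Solomon with n data shards and k parity shards: any n of the n+k
	// shards can reconstruct the data, so the overhead is (n+k)/n.
	// RS(6,3) is just an example; the talk didn't give the real parameters.
	n, k := 6.0, 3.0
	rsOverhead := (n + k) / n

	fmt.Printf("3-way replication: %.1fx raw storage\n", replicationOverhead) // 3.0x
	fmt.Printf("RS(%.0f,%.0f):           %.1fx raw storage\n", n, k, rsOverhead) // 1.5x
	// The catch from the talk: data must first land as plain replicas and be
	// re-encoded later, and a read that needs reconstruction can stall.
}
```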
    • From internal GFS 101 slides: what is GFS bad for?
      • Predictable performance (no guarantee on latency, no operation timeouts)
      • Structured data (GFS is a filename -> blob data store)
      • Random writes (optimized for parallel appends)
      • High availability (fault-tolerant, not HA)
    • architectural problems
      • GFS Master:
        • one machine is not large enough for a large FS
        • single bottleneck for metadata ops (rough sizing below)
      • GFS Semantics
        • unclear state of files when replicas lack consensus? (missed slide)
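A rough sizing exercise for why the single master becomes the limit. The 64 MB chunk size and ~64 bytes of metadata per chunk are ballpark figures from the GFS paper, not from the talk; the point is that the RAM footprint is only barely manageable, and every metadata operation for the whole cell still goes through one machine.

```go
package main

import "fmt"

func main() {
	// Rough sizing of master metadata (assumptions: 64 MB chunks and ~64 bytes
	// of in-memory metadata per chunk, roughly the GFS paper's figures).
	const (
		chunkSize       = 64 << 20 // 64 MB
		bytesPerChunkMD = 64       // metadata per chunk held in master RAM
		totalData       = 10 << 50 // 10 PB of user data
	)
	chunks := int64(totalData / chunkSize)
	metadata := chunks * bytesPerChunkMD
	fmt.Printf("%d chunks -> ~%d GB of master metadata\n", chunks, metadata>>30)
	// ~10 GB of metadata for 10 PB still fits in RAM, but scale that to
	// hundreds of PB, or to lots of small files, and one machine runs out of
	// memory; and every metadata op for the cell already funnels through it.
}
```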
    • Some GFS v2 Goals
      • Bigger, faster, predictable performance, predictable tail latency
      • GFS Master replaced by Colossus
      • GFS chunkserver replaced by D
    • Solve an easier problem: optimize for Bigtable
      • a file system for Bigtable
      • append only
      • single writer
      • rename only used to indicate a finished file (write-pattern sketch below)
      • no snapshots
      • directories unnecessary
      • Where to put metadata?
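Those constraints describe a very narrow file API. A sketch of what writing a Bigtable SSTable against such an API might look like; the interface, the in-memory stand-in, and all names here are hypothetical, not the real Bigtable or Colossus API: append records to a temp file with a single writer, close it, then rename it into place so readers only ever see finished files.

```go
package main

import "fmt"

// AppendFS is a hypothetical sketch of the narrow interface Bigtable needs
// from its filesystem: create, append, close, plus an atomic rename that
// marks a file as finished. No random writes, no snapshots, single writer.
type AppendFS interface {
	Create(tmpName string) (Appender, error)
	Rename(tmpName, finalName string) error // rename signals "this file is done"
}

type Appender interface {
	Append(record []byte) error
	Close() error
}

// writeSSTable shows the usage pattern: stream records into a temp file,
// close it, then rename it into place so readers only ever see finished files.
func writeSSTable(fs AppendFS, finalName string, records [][]byte) error {
	tmp := finalName + ".tmp"
	w, err := fs.Create(tmp)
	if err != nil {
		return err
	}
	for _, rec := range records {
		if err := w.Append(rec); err != nil {
			return err
		}
	}
	if err := w.Close(); err != nil {
		return err
	}
	return fs.Rename(tmp, finalName)
}

// memFS is a toy in-memory implementation, just to make the sketch runnable.
type memFS struct{ files map[string][]byte }

type memFile struct {
	fs   *memFS
	name string
	buf  []byte
}

func (f *memFS) Create(name string) (Appender, error) { return &memFile{fs: f, name: name}, nil }
func (f *memFS) Rename(from, to string) error {
	f.files[to] = f.files[from]
	delete(f.files, from)
	return nil
}
func (w *memFile) Append(rec []byte) error { w.buf = append(w.buf, rec...); return nil }
func (w *memFile) Close() error            { w.fs.files[w.name] = w.buf; return nil }

func main() {
	fs := &memFS{files: map[string][]byte{}}
	_ = writeSSTable(fs, "/bigtable/tablet-0/sstable-000001", [][]byte{[]byte("k1=v1;"), []byte("k2=v2;")})
	fmt.Printf("%q\n", fs.files) // map["/bigtable/tablet-0/sstable-000001":"k1=v1;k2=v2;"]
}
```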
    • Storage options back then
      • GFS
      • Bigtable (a sorted KV store on GFS)
      • Sharded MySQL with local disk & replication
        • the ads DB
      • Local KV store with Paxos replication
        • authentication DB
        • Chubby
      • Metadata in Bigtable (!?)
  • GFS master -> CFS
    • CFS Curators are Bigtable coprocessors
    • a Bigtable row corresponds to a single file (schema sketch below)
    • stripes are the replication unit, with states open, closed, and finalized
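A guess at what "one Bigtable row per file, with stripes as the replication unit" might look like as a schema. This is purely illustrative; the talk did not describe the actual CFS layout.

```go
package main

import "fmt"

// StripeState mirrors the three states mentioned in the talk.
type StripeState string

const (
	StripeOpen      StripeState = "open"      // still being appended to
	StripeClosed    StripeState = "closed"    // no more appends
	StripeFinalized StripeState = "finalized" // fully replicated / durable
)

// FileRow is a guess at what "one Bigtable row per file" might hold:
// the row key is the file name and the stripes are the replication groups.
type FileRow struct {
	RowKey  string // file name
	Stripes []Stripe
}

type Stripe struct {
	State    StripeState
	Replicas []string // D servers holding this stripe
}

func main() {
	row := FileRow{
		RowKey: "/cfs/cell-a/bigtable/tablet-7/sstable-42",
		Stripes: []Stripe{
			{State: StripeFinalized, Replicas: []string{"d-03", "d-11", "d-19"}},
			{State: StripeOpen, Replicas: []string{"d-05", "d-12", "d-20"}},
		},
	}
	fmt.Printf("%+v\n", row)
}
```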

And then I had to leave for an interview :(

Written on November 23, 2015