The Numbers That Matter

L1 Cache:    4 cycles     (~1ns)      32KB
L2 Cache:    12 cycles    (~3ns)      256KB
L3 Cache:    40 cycles    (~10ns)     8MB
RAM:         200+ cycles  (~60ns)     32GB

Cache line size: 64 bytes (on x86_64)

Reading from RAM is roughly 60x slower than reading from L1: one cache miss costs about as much as 60 cache hits. This is why cache-friendly code can run significantly faster, often 5-10x in specific scenarios.
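
To see the gap on your own machine, here is a minimal benchmark sketch (in a _test.go file, using math/rand): chasing a shuffled permutation makes every load depend on the previous one, so neither the cache nor the hardware prefetcher can help.

// Each load depends on the previous one, so the CPU cannot overlap or
// prefetch accesses. With a working set far larger than L3, most steps
// pay close to full RAM latency.
func BenchmarkPointerChase(b *testing.B) {
    const n = 1 << 22 // 32 MB of int indices, well beyond a typical L3
    next := rand.Perm(n)
    b.ResetTimer()
    idx := 0
    for i := 0; i < b.N; i++ {
        idx = next[idx]
    }
    _ = idx
}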

False Sharing: The Silent Killer

False sharing occurs when multiple CPU cores modify different variables that happen to share the same cache line. This forces cache line invalidation across cores, causing significant performance degradation.

The problem is subtle: your variables might be logically independent, but if they’re physically adjacent in memory (within 64 bytes), updating one invalidates the cache for all others on that line.

In our metrics collection system, we saw a 10x slowdown under high concurrency. The culprit was multiple goroutines updating different counters packed into the same cache line.

Detection requires careful benchmarking with concurrent access patterns: the drop is invisible in single-threaded tests and shows up only under parallel load (see BenchmarkFalseSharing in the Testing Strategy section below).

// BROKEN: False sharing destroys performance
type CountersUnpadded struct {
    requests uint64 // 8 bytes
    errors   uint64 // 8 bytes - SAME cache line!
    latency  uint64 // 8 bytes - SAME cache line!
    timeouts uint64 // 8 bytes - SAME cache line!
}
// Result: 45ns/op under contention

// FIXED: Padding gives each counter its own 64-byte cache line
type CountersPadded struct {
    requests uint64
    _        [56]byte // Padding to 64 bytes (one cache line)
    errors   uint64
    _        [56]byte
    latency  uint64
    _        [56]byte
    timeouts uint64
    _        [56]byte
}

// Result: 7ns/op - 6.4x faster!

Measure Cache Misses (Linux)

# Profile cache misses
perf stat -e cache-misses,cache-references ./myapp

# Detailed cache analysis
perf record -e cache-misses ./myapp
perf report

# Go benchmark with perf attached for a bounded window
go test -bench=. -benchtime=10s &
pid=$!
perf stat -p $pid -e L1-dcache-load-misses,L1-dcache-loads sleep 10

Data-Oriented Design: Array of Structs vs Struct of Arrays

// Array of Structs (AoS) - Cache UNFRIENDLY
type Entity struct {
    ID       uint64    // 8 bytes
    X, Y, Z  float64   // 24 bytes
    Velocity float64   // 8 bytes
    Health   int       // 8 bytes
    Type     string    // 16 bytes
    // Total: 64+ bytes per entity
}

type World struct {
    Entities []Entity // Each entity in different cache lines
}

func (w *World) UpdatePositions(dt float64) {
    for i := range w.Entities {
        // Loads 64+ bytes but only uses 32 bytes (X,Y,Z,Velocity)
        // Wastes 50% of cache!
        w.Entities[i].X += w.Entities[i].Velocity * dt
    }
}

// Benchmark: 85ns per entity

// Struct of Arrays (SoA) - Cache FRIENDLY
type World struct {
    IDs        []uint64
    Positions  [][3]float64 // X,Y,Z together
    Velocities []float64
    Healths    []int
    Types      []string
}

func (w *World) UpdatePositions(dt float64) {
    // Positions and Velocities are contiguous: each 64-byte line holds
    // ~2.6 positions (24 bytes each) and 8 velocities, and the linear
    // walk keeps the hardware prefetcher fed
    for i := range w.Positions {
        w.Positions[i][0] += w.Velocities[i] * dt
    }
}

// Benchmark: 12ns per entity - 7x faster!

Prefetching: Help the CPU Predict

// Linear access - CPU prefetcher works great
func SumLinear(data []int) int {
    sum := 0
    for i := 0; i < len(data); i++ {
        sum += data[i] // CPU prefetches data[i+1], data[i+2]...
    }
    return sum
}
// 2.1ns per element

// Random access - Prefetcher can't help
func SumRandom(data []int, indices []int) int {
    sum := 0
    for _, idx := range indices {
        sum += data[idx] // Cache miss likely
    }
    return sum
}
// 45ns per element - 20x slower!

// Manual prefetching for known patterns
func ProcessWithPrefetch(nodes []*Node) {
    for i := range nodes {
        // Touch a future node so it is already in cache when we reach it
        if i+4 < len(nodes) {
            _ = nodes[i+4].Value // trigger the load early
        }
        // Process current node (prefetched 4 iterations ago)
        expensive(nodes[i])
    }
}

Hot/Cold Data Splitting

// BROKEN: Hot and cold data mixed
type User struct {
    ID       uint64    // HOT: accessed frequently
    Score    int       // HOT: accessed frequently
    Name     string    // COLD: rarely accessed
    Email    string    // COLD: rarely accessed
    Address  string    // COLD: rarely accessed
    Bio      string    // COLD: rarely accessed
    // One User = 100+ bytes, but we often only need 16 bytes
}

func TopUsers(users []User) []uint64 {
    // The sort loads 100+ bytes per user but only needs Score
    sort.Slice(users, func(i, j int) bool {
        return users[i].Score > users[j].Score // Cache thrashing!
    })
    ids := make([]uint64, len(users))
    for i := range users {
        ids[i] = users[i].ID
    }
    return ids
}

// FIXED: Separate hot and cold data
type UserHot struct {
    ID    uint64
    Score int
    Cold  *UserCold // Pointer to cold data
}

type UserCold struct {
    Name    string
    Email   string
    Address string
    Bio     string
}

// Now sorting only touches 24 bytes per user
// 4x fewer cache misses!
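
For comparison with the broken version, a sketch of the same sort over the split layout (TopUsersHot is a hypothetical name, not part of the original code):

func TopUsersHot(users []UserHot) []uint64 {
    // The sort walks only 24-byte records; Name, Email, Address, and
    // Bio stay behind the Cold pointer and never enter the cache
    sort.Slice(users, func(i, j int) bool {
        return users[i].Score > users[j].Score
    })
    ids := make([]uint64, len(users))
    for i := range users {
        ids[i] = users[i].ID
    }
    return ids
}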

NUMA-Aware Allocation

// Check NUMA topology
// $ numactl --hardware
// node 0 cpus: 0-23
// node 1 cpus: 24-47

// Pin the calling goroutine's OS thread to a specific CPU
// (uses golang.org/x/sys/unix)
func PinToCPU(cpuID int) error {
    runtime.LockOSThread()

    var cpuSet unix.CPUSet
    cpuSet.Zero()
    cpuSet.Set(cpuID)

    return unix.SchedSetaffinity(unix.Gettid(), &cpuSet)
}

// NUMA-aware worker pool
type NUMAPool struct {
    workers [][]chan Task // workers[numa_node][worker_id]
}

func (p *NUMAPool) Submit(task Task) {
    // Hash the task to a NUMA node for data locality; hash stands in
    // for any cheap integer hash (e.g. FNV-1a over the key)
    node := hash(task.Key) % len(p.workers)
    worker := rand.Intn(len(p.workers[node]))
    p.workers[node][worker] <- task
}

func (p *NUMAPool) StartWorker(numaNode, workerID int) {
    // Pin to CPU in specific NUMA node
    cpuID := numaNode*24 + workerID
    PinToCPU(cpuID)
    
    for task := range p.workers[numaNode][workerID] {
        processTask(task)
    }
}
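
A hypothetical constructor sketch showing how these pieces wire together; the channel capacity is an arbitrary choice, not from the original:

func NewNUMAPool(nodes, workersPerNode int) *NUMAPool {
    p := &NUMAPool{workers: make([][]chan Task, nodes)}
    for n := 0; n < nodes; n++ {
        p.workers[n] = make([]chan Task, workersPerNode)
        for w := 0; w < workersPerNode; w++ {
            p.workers[n][w] = make(chan Task, 128)
            go p.StartWorker(n, w) // each worker pins itself on startup
        }
    }
    return p
}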

Branch Prediction Friendly Code

// BROKEN: Unpredictable branches
func CountCondition(data []int) int {
    count := 0
    for _, v := range data {
        if v > 128 { // Random data = 50% misprediction
            count++
        }
    }
    return count
}
// With random data: 8.2ns per element

// FIXED: Sort first to make branches predictable
func CountConditionSorted(data []int) int {
    sort.Ints(data) // sorts in place (mutates the caller's slice); branches become predictable
    
    count := 0
    for _, v := range data {
        if v > 128 { // First all false, then all true
            count++
        }
    }
    return count
}
// Even with sort overhead: 3.1ns per element

// BETTER: Branchless version
func CountConditionBranchless(data []int) int {
    count := 0
    for _, v := range data {
        // No branch: the sign bit of (128 - v) is 1 exactly when
        // v > 128 (assumes v stays well inside the int64 range)
        count += int(uint64(int64(128)-int64(v)) >> 63)
    }
    return count
}
// Always fast: 2.3ns per element

Cache-Conscious Hash Table

// Standard map - random memory access
m := make(map[uint64]uint64)
// Cache misses everywhere

// Cache-friendly Robin Hood hashing
type RobinHoodMap struct {
    buckets []bucket
    mask    uint64
}

type bucket struct {
    key      uint64
    value    uint64
    distance uint8 // Distance from ideal position
    occupied bool
}

func (m *RobinHoodMap) Get(key uint64) (uint64, bool) {
    idx := key & m.mask
    distance := uint8(0)
    
    // Linear probing = cache-friendly access pattern
    for {
        b := &m.buckets[idx]
        
        if !b.occupied {
            return 0, false
        }
        
        if b.key == key {
            return b.value, true
        }
        
        // Robin Hood: poor keys don't travel far
        if b.distance < distance {
            return 0, false
        }
        
        idx = (idx + 1) & m.mask
        distance++
    }
}

// Benchmarks (1M lookups):
// map[uint64]uint64:    95ns/op
// RobinHoodMap:         32ns/op - 3x faster!
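
Get needs a matching insert. A minimal Put sketch under the same layout; it omits resizing, so it assumes the table never fills:

func (m *RobinHoodMap) Put(key, value uint64) {
    idx := key & m.mask
    distance := uint8(0)

    for {
        b := &m.buckets[idx]

        if !b.occupied {
            *b = bucket{key: key, value: value, distance: distance, occupied: true}
            return
        }

        if b.key == key {
            b.value = value
            return
        }

        // Robin Hood: displace a resident that sits closer to its
        // ideal slot than the incoming entry does
        if b.distance < distance {
            key, b.key = b.key, key
            value, b.value = b.value, value
            distance, b.distance = b.distance, distance
        }

        idx = (idx + 1) & m.mask
        distance++
    }
}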

Real Production Example: Analytics Pipeline

// BEFORE: Object-oriented design
type Event struct {
    Timestamp time.Time
    UserID    uint64
    Action    string
    Value     float64
    Tags      map[string]string
}

func ProcessEvents(events []Event) {
    for _, e := range events {
        // Each event access = potential cache miss
        if e.Action == "purchase" {
            updateRevenue(e.Value)
        }
    }
}

// AFTER: Data-oriented design
type EventBatch struct {
    Timestamps []int64   // Unix timestamps
    UserIDs    []uint64
    Actions    []uint8   // Enum instead of string
    Values     []float64
    
    // Tags stored separately (cold data)
    TagIndices []uint32
    TagKeys    []string
    TagValues  []string
}

// ActionPurchase is one value of the uint8 action enum
const ActionPurchase uint8 = 1

func ProcessEventsBatch(batch *EventBatch) {
    // Process actions in a cache-friendly way: Actions is a dense
    // []uint8, so a single 64-byte line holds 64 of them
    for i, action := range batch.Actions {
        if action == ActionPurchase {
            updateRevenue(batch.Values[i])
        }
    }
}

// Results:
// Before: 450ns per event, 78% cache misses
// After:  31ns per event, 12% cache misses
// 14.5x speedup!

SIMD-Friendly Layouts

// Enable SIMD processing with a uniform 16-byte element stride
type Vec3 struct {
    X, Y, Z float32
    _       float32 // Padding to a 16-byte stride (one SSE lane group)
}

// Process vectors with the SIMD-friendly stride
func AddVectors(a, b []Vec3, result []Vec3) {
    // Note: the standard gc compiler does little auto-vectorization;
    // this layout mainly helps gccgo/LLVM builds and hand-written assembly
    for i := 0; i < len(a); i++ {
        result[i].X = a[i].X + b[i].X
        result[i].Y = a[i].Y + b[i].Y
        result[i].Z = a[i].Z + b[i].Z
    }
}

// Go guarantees only natural alignment (8 bytes for float64) and has
// no portable way to demand 64-byte alignment. The usual workaround is
// to over-allocate and slice from the first aligned element (needs the
// unsafe package).
func alignedFloat64s(n int) []float64 {
    const line = 64
    buf := make([]float64, n+line/8)
    off := 0
    for uintptr(unsafe.Pointer(&buf[off]))%line != 0 {
        off++
    }
    return buf[off : off+n]
}

Benchmarking Cache Performance

func BenchmarkCacheLineSize(b *testing.B) {
    // Detect cache line size experimentally
    data := make([]int64, 1024*1024)
    
    for stride := 1; stride <= 128; stride *= 2 {
        b.Run(fmt.Sprintf("stride_%d", stride), func(b *testing.B) {
            for i := 0; i < b.N; i++ {
                // Touch memory at stride intervals
                for j := 0; j < len(data); j += stride {
                    data[j]++
                }
            }
        })
    }
    // Time per pass stays roughly flat up to stride 8 (one int64 per
    // 64-byte line) even though the work halves with each doubling;
    // beyond stride 8 it drops sharply, revealing the cache line size
}

Rules for Cache-Friendly Go

  1. Pack hot data together: Frequently accessed fields in same cache line
  2. Pad for concurrent access: Separate goroutine data by cache lines
  3. Prefer arrays over linked lists: Sequential > random access
  4. Use smaller types: int32 vs int64 if the range allows (see the sizing sketch after this list)
  5. Sort before processing: Predictable branches and access patterns
  6. Pool allocations: Reuse memory = hot caches
  7. Profile, don’t guess: Use perf, pprof, and benchmarks
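
A minimal sketch of rules 1 and 4 in action. Go never reorders struct fields, so ordering them largest-to-smallest removes hidden padding and lets more records fit per cache line; sizes assume a 64-bit platform:

type Loose struct {
    a bool  // 1 byte + 7 bytes padding
    b int64 // 8 bytes
    c bool  // 1 byte + 3 bytes padding
    d int32 // 4 bytes
} // unsafe.Sizeof(Loose{}) == 24

type Packed struct {
    b int64 // 8 bytes
    d int32 // 4 bytes
    a bool  // 1 byte
    c bool  // 1 byte + 2 bytes padding
} // unsafe.Sizeof(Packed{}) == 16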

The Performance Optimization Recipe

1. Profile: Find cache misses with perf
2. Restructure: AoS → SoA for hot paths
3. Pad: Eliminate false sharing
4. Pack: Group frequently accessed data
5. Prefetch: Linear access patterns
6. Measure: Verify with benchmarks

Real results from production (specific workloads):
- Analytics pipeline: Up to 14x faster for batch processing
- Game physics engine: 8x faster for collision detection
- Database indexing: 11x faster for sequential scans
- These gains are workload-specific and vary by hardware

Remember: Modern CPUs are fast. Memory is slow. The gap grows every year. Cache-friendly code isn’t premature optimization—it’s necessary optimization.

Testing Strategy

Testing Cache Optimizations:

func TestCacheFriendlyStructure(t *testing.T) {
    // Thresholds are illustrative and machine-dependent; hard wall-clock
    // assertions are flaky in CI and are better expressed as benchmarks
    tests := []struct {
        name     string
        size     int
        expected time.Duration
    }{
        {"small_data", 1000, 100 * time.Microsecond},
        {"medium_data", 100000, 10 * time.Millisecond},
        {"large_data", 10000000, 1 * time.Second},
    }
    
    for _, tt := range tests {
        t.Run(tt.name, func(t *testing.T) {
            data := generateTestData(tt.size)
            
            start := time.Now()
            processCacheFriendly(data)
            elapsed := time.Since(start)
            
            if elapsed > tt.expected {
                t.Errorf("Processing took %v, expected < %v", elapsed, tt.expected)
            }
            
            // Verify correctness without pulling in testify
            if result := verifyCacheFriendlyResult(data); !result.IsValid() {
                t.Error("cache-friendly processing produced an invalid result")
            }
        })
    }
}

func BenchmarkFalseSharing(b *testing.B) {
    // Each worker increments its own counter; the only difference is
    // whether neighboring counters share a cache line
    b.Run("with_false_sharing", func(b *testing.B) {
        // Unpadded structs are 32 bytes, so two workers share each line
        counters := make([]CountersUnpadded, runtime.GOMAXPROCS(0))
        var next uint64
        b.RunParallel(func(pb *testing.PB) {
            c := &counters[int(atomic.AddUint64(&next, 1))%len(counters)]
            for pb.Next() {
                atomic.AddUint64(&c.requests, 1)
            }
        })
    })

    b.Run("with_padding", func(b *testing.B) {
        // Padded structs give every hot field its own 64-byte line
        counters := make([]CountersPadded, runtime.GOMAXPROCS(0))
        var next uint64
        b.RunParallel(func(pb *testing.PB) {
            c := &counters[int(atomic.AddUint64(&next, 1))%len(counters)]
            for pb.Next() {
                atomic.AddUint64(&c.requests, 1)
            }
        })
    })

    // The padded version should be significantly faster under contention
}

Optimization Technique                 Typical Improvement   Best For              Complexity
False Sharing Elimination              2-10x                 Concurrent counters   Low
Data Structure Packing                 1.5-3x                Hot data paths        Medium
Array of Structs → Struct of Arrays    3-15x                 Batch processing      High
Prefetching                            1.2-2x                Predictable access    Low
Branch Prediction                      2-5x                  Conditional logic     Medium

Language Comparison

These optimizations aren’t unique to Go, but Go’s runtime characteristics affect their impact:

Optimization        Go Impact             C/C++ Impact            Java Impact              Rust Impact
False Sharing       High (GC scanning)    Medium                  High (GC pressure)       Medium
Data Packing        High (GC overhead)    Low (manual memory)     High (object overhead)   Low (zero-cost)
Branch Prediction   Medium (similar CPU)  High (direct control)   Low (JIT optimizes)      High (LLVM)

Go’s garbage collector adds overhead to memory access patterns. Cache-friendly code reduces GC pressure by improving locality, which makes the collector’s job easier. This double benefit is why these optimizations often pay off more in Go than in manually memory-managed languages.
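
A rough sketch of that GC angle: a pointer-free slice is skipped wholesale by the mark phase, while a slice of pointers must be traced element by element. Timings vary by machine; the shape of the gap is the point.

func BenchmarkGCScan(b *testing.B) {
    b.Run("pointer_heavy", func(b *testing.B) {
        ptrs := make([]*int64, 1<<20) // a million pointers to trace
        for i := range ptrs {
            ptrs[i] = new(int64)
        }
        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            runtime.GC()
        }
        runtime.KeepAlive(ptrs)
    })

    b.Run("pointer_free", func(b *testing.B) {
        vals := make([]int64, 1<<20) // no pointers: the GC skips it
        b.ResetTimer()
        for i := 0; i < b.N; i++ {
            runtime.GC()
        }
        runtime.KeepAlive(vals)
    })
}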