Meta Description: Master your System Design interview with this comprehensive guide of 40 essential questions and detailed answers. From Load Balancers to Distributed Transactions, we cover the scalability concepts you need to crack interviews at Google, Amazon, Meta, and more.
Introduction
System design interviews are the gatekeepers to senior engineering roles at FAANG and other top-tier tech companies. They don't just test your coding skills; they test your ability to build scalable, reliable, and maintainable systems under constraint.
In this guide, we break down 40 critical system design questions. We start with the core concepts and move into advanced architectural challenges, complete with real-world examples like designing Uber, Netflix, and TinyURL.
Part 1: Core Concepts & Fundamentals
Before diving into complex architectures, you must master the building blocks.
1. API Gateway vs. Load Balancer: What’s the difference?
Answer: While both manage traffic, they serve different purposes.
Load Balancer: Strictly distributes incoming network traffic across multiple servers to ensure high availability and prevent any single server from becoming a bottleneck. It operates at Layer 4 (Transport) or Layer 7 (Application).
API Gateway: Acts as a "front door" for microservices. It handles request routing, composition, protocol translation (e.g., HTTP to gRPC), rate limiting, and authentication.
Example: You might use NGINX as a Load Balancer to distribute traffic, but use Netflix Zuul or Amazon API Gateway to handle authentication and route requests to specific microservices.
2. Horizontal vs. Vertical Scaling?
Answer:
Vertical Scaling (Scale-up): Adding more power (CPU, RAM) to an existing machine. It’s easy to implement but has a hard limit (hardware capacity) and introduces a single point of failure.
Horizontal Scaling (Scale-out): Adding more machines to the pool of resources. This is preferred for distributed systems as it offers elasticity.
Example: A standard SQL database often uses vertical scaling, while web applications (like Instagram) use horizontal scaling, adding thousands of Kubernetes pods to handle traffic spikes.
3. Monolithic Architecture vs. Microservices?
Answer:
Monolith: The entire application is a single deployable unit. It is easier to develop and test initially but becomes hard to scale and maintain as the team grows.
Microservices: The application is broken into small, independent services that communicate over a network. This enables independent scaling and technology choices per team.
Trade-off: Microservices introduce network latency and distributed complexity (data consistency is harder) but allow for greater team autonomy.
4. What is Rate Limiting and how is it implemented?
Answer: Rate limiting controls the number of requests a user can send to a server within a specific timeframe to prevent DoS attacks or resource exhaustion.
Algorithms:
Token Bucket: Tokens are added to a bucket at a fixed rate; requests consume tokens.
Leaky Bucket: Requests are processed at a fixed rate, regardless of arrival speed.
Fixed Window: Counts requests in fixed time windows and rejects anything over the limit until the next window starts (e.g., 100 requests/hour).
Example: GitHub limits API requests to 5,000 per hour for authenticated users, often tracking counts via Redis.
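The Token Bucket algorithm above can be sketched in a few lines. This is a minimal, single-process illustration, not a production limiter: the class name `TokenBucket`, its parameters, and the injectable `clock` are assumptions made for the example, and a real deployment would usually keep the counters in a shared store such as Redis.

```python
import time


class TokenBucket:
    """Token bucket: refills at `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self.clock = clock              # injectable so the logic is easy to test
        self.tokens = float(capacity)   # start full, so an initial burst is allowed
        self.last_refill = clock()

    def allow(self, cost=1):
        """Return True and consume `cost` tokens if the request may proceed."""
        now = self.clock()
        elapsed = now - self.last_refill
        # Add the tokens earned since the last call, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last_refill = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False
```

A bucket with `rate=1` and `capacity=2` lets two requests through back to back, rejects a third, and accepts one more after a second has passed.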
5. Database Sharding vs. Partitioning?
Answer:
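Partitioning: Splitting a large table into smaller pieces. Vertical partitioning splits by columns, for example storing user profiles in one table and user login credentials in another.
Sharding (Horizontal Partitioning): Splitting a table by rows based on a "shard key" and spreading the pieces across separate servers. Each shard has the same schema but holds a different subset of the data.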
Example: YouTube might shard video metadata based on VideoID. If you have 500M videos, videos with IDs ending in 0-3 go to Server A, 4-7 to Server B, etc.
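The routing rule in the example can be written as a small lookup. This is only a sketch: the server names are placeholders, and the third server (for IDs ending in 8-9) is an assumption, since the example only spells out Servers A and B.

```python
# Last-digit ranges mapped to shards. "server-c" is a hypothetical third shard.
SHARD_RANGES = [
    (range(0, 4), "server-a"),
    (range(4, 8), "server-b"),
    (range(8, 10), "server-c"),
]


def shard_for(video_id: int) -> str:
    """Pick the shard that owns a video, based on the last digit of its ID."""
    last_digit = video_id % 10
    for digits, server in SHARD_RANGES:
        if last_digit in digits:
            return server
    raise ValueError(f"no shard configured for digit {last_digit}")
```

Every lookup for a given VideoID lands on the same server, which is the property the application relies on when reading the data back.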
Part 2: Key System Designs
Applying the fundamentals to build specific utilities.
6. How would you design a URL Shortener (like TinyURL)?
Answer:
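Core Logic: Convert a long URL into a unique short string. One common approach is to hash the URL (e.g., with MD5), take a fixed number of bits from the digest, and Base62-encode them (a-z, A-Z, 0-9) into a 6-7 character string. Because the hash is truncated, collisions are possible and must be detected.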
Storage: Use a NoSQL store like DynamoDB (Key: ShortCode, Value: LongURL) for fast read/writes.
Scalability: Use a high-performance caching layer (Redis) for popular links. If a hash collision occurs, append a sequence number or retry.
Scale: Handle 100M redirects/day using consistent hashing to distribute load.
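As a rough sketch of the hash-then-Base62 step, assuming we keep the first 6 bytes of the MD5 digest and salt the input with an `attempt` counter to retry after a collision (the function names and the salting scheme are illustrative choices, not a standard):

```python
import hashlib
import string

# 62 symbols: 0-9, a-z, A-Z
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase


def base62_encode(n: int) -> str:
    """Encode a non-negative integer using the Base62 alphabet."""
    if n == 0:
        return ALPHABET[0]
    chars = []
    while n > 0:
        n, rem = divmod(n, 62)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))


def short_code(long_url: str, length: int = 7, attempt: int = 0) -> str:
    """Derive a fixed-length code; bump `attempt` to retry after a collision."""
    digest = hashlib.md5(f"{long_url}#{attempt}".encode("utf-8")).digest()
    n = int.from_bytes(digest[:6], "big") % (62 ** length)
    # Left-pad so every code has exactly `length` characters.
    return base62_encode(n).rjust(length, ALPHABET[0])
```

The caller would still check the datastore before saving a code, and call `short_code(url, attempt=1)` (and so on) if the code is already taken by a different URL.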
7. Compare Caching Strategies (Write-Through vs. Write-Back)?
Answer:
Write-Through: Data is written to the cache and the database simultaneously. Pros: High data consistency. Cons: Higher write latency.
Write-Back (Write-Behind): Data is written to the cache first and asynchronously updated in the DB later. Pros: Extremely fast writes. Cons: Risk of data loss if the cache fails before syncing.
Example: Facebook runs a very large Memcached tier (with LRU eviction) in front of its databases so that most reads are served from cache rather than from the database.
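The difference between the two write strategies is easiest to see side by side. In this sketch a plain dict stands in for the database, and the class names and the explicit `flush()` call are assumptions made for illustration; a real write-back cache would flush on a timer or on eviction.

```python
class WriteThroughCache:
    """Every write goes to the database and the cache before returning."""

    def __init__(self, db):
        self.db = db
        self.cache = {}

    def put(self, key, value):
        self.db[key] = value      # slower write, but the DB is never behind
        self.cache[key] = value

    def get(self, key):
        return self.cache.get(key, self.db.get(key))


class WriteBackCache:
    """Writes land in the cache first; the database is updated on flush()."""

    def __init__(self, db):
        self.db = db
        self.cache = {}
        self.dirty = set()        # keys changed since the last flush

    def put(self, key, value):
        self.cache[key] = value   # fast: no database round trip
        self.dirty.add(key)

    def get(self, key):
        return self.cache.get(key, self.db.get(key))

    def flush(self):
        # If the cache process dies before this runs, the dirty writes are lost.
        for key in self.dirty:
            self.db[key] = self.cache[key]
        self.dirty.clear()
```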
8. Design a Pastebin (Text Sharing Service)?
Answer:
Requirements: High write throughput, unique URLs, expiration times.
Database: Cassandra or HBase is ideal for heavy write loads.
Key Generation: Pre-generate unique 7-character keys using a Key Generation Service (KGS) to avoid collision checks during runtime.
Storage: Store the text blob in an object store (like S3) if large, or directly in the DB if small (<1MB).
9. How does Single Sign-On (SSO) work?
Answer: SSO allows a user to log in once and access multiple applications.
Mechanism: Typically uses OpenID Connect (an identity layer on top of OAuth 2.0) or SAML.
User attempts to access App A.
App A redirects user to the Identity Provider (IdP) like Google or Okta.
User logs in at IdP.
IdP sends a signed token back to App A (a JWT ID token with OpenID Connect, or a signed XML assertion with SAML).
App A validates the token and grants access.
10. Kafka vs. RabbitMQ: When to use which?
Answer:
RabbitMQ: A traditional message broker. Best for complex routing logic and when you need per-message acknowledgments. It pushes messages to consumers.
Apache Kafka: A distributed streaming platform. Best for high-throughput, log-based storage where messages need to be replayed or stored for days. It uses a "pull" model.
Example: Uber uses Kafka to log terabytes of trip data per day for analytics, while they might use RabbitMQ for immediate, transactional task processing.
Part 3: Advanced Architectures
Designing complex, consumer-facing applications.
11. Design YouTube (Video Streaming)?
Answer:
Architecture:
Upload: Users upload raw video to object storage (AWS S3).
Processing: A queue triggers parallel workers (e.g., Celery) to transcode video into multiple formats (480p, 1080p, 4K) via FFmpeg.
Delivery: Videos are pushed to a CDN (Content Delivery Network) like Akamai or CloudFront.
Database: Metadata (titles, descriptions) in MySQL (sharded).
Optimization: Use Adaptive Bitrate Streaming (HLS/DASH) to adjust quality based on user bandwidth.
12. Design a News Feed (Facebook/Twitter)?
Answer:
Pull vs. Push:
Push (Fan-out on Write): When a user posts, deliver it immediately to all followers' pre-computed feed caches. Good for users with few followers.
Pull (Fan-out on Read): When a user loads their feed, query all followees' recent posts and merge them. Good for celebrities (Justin Bieber), because it avoids writing a single post into millions of follower feeds.
Ranking: Use Redis Sorted Sets to rank posts by timestamp or an engagement score.
13. Design Uber (Ride Hailing)?
Answer:
Geo-Location: The core challenge is matching riders to nearby drivers efficiently.
Spatial Indexing: Use QuadTrees or Google S2 libraries to divide the map into small cells. Drivers update their location in the cell every few seconds.
Communication: Use WebSockets (e.g., Socket.io) for real-time bi-directional communication between app and server.
Consistency: Use a strongly consistent database (PostgreSQL) for transactional ride data.
14. How do Distributed Caches work?
Answer:
Concept: A pool of RAM across multiple servers acting as a single store (e.g., Memcached or Redis Cluster).
Routing: Clients use Consistent Hashing to determine which server holds a specific key. This minimizes data movement when servers are added/removed.
Example: Twitter built Twemcache, its own fork of Memcached, to serve its large-scale caching tier.
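The Consistent Hashing routing described above can be sketched as a hash ring with virtual nodes. The class name `HashRing`, the use of MD5 as the ring hash, and the default of 100 virtual nodes per server are assumptions chosen for the example.

```python
import bisect
import hashlib


class HashRing:
    """Consistent-hash ring; each server is placed at `vnodes` points on the ring."""

    def __init__(self, nodes=(), vnodes=100):
        self.vnodes = vnodes
        self._ring = []  # sorted list of (position, node)
        for node in nodes:
            self.add(node)

    @staticmethod
    def _hash(key: str) -> int:
        digest = hashlib.md5(key.encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big")

    def add(self, node: str) -> None:
        for i in range(self.vnodes):
            bisect.insort(self._ring, (self._hash(f"{node}#{i}"), node))

    def remove(self, node: str) -> None:
        self._ring = [(pos, n) for pos, n in self._ring if n != node]

    def get(self, key: str) -> str:
        """Return the first node clockwise from the key's position."""
        if not self._ring:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._ring, (self._hash(key), ""))
        return self._ring[idx % len(self._ring)][1]
```

When a server is added, the only keys that change owner are the ones the new server takes over; every other key keeps its existing location, which is what keeps cache misses low during resizing.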
15. Design Google Docs (Collaborative Editing)?
Answer:
Concurrency Control: The hardest part is handling multiple users editing the same sentence instantly.
Algorithm: Operational Transformation (OT) or CRDTs (Conflict-free Replicated Data Types). OT transforms operations (like "insert 'A' at index 0") based on other concurrent operations so all users see the same final state.
Protocol: WebSockets for instant character-by-character updates.
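To make the idea of transforming operations concrete, here is a deliberately tiny sketch that handles only concurrent inserts. The `Insert` type, the `site` tie-breaker, and the function names are assumptions for illustration; real OT systems also handle deletes, operation histories, and server ordering.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Insert:
    index: int
    text: str
    site: int  # ID of the editing client, used only to break ties


def transform(op: Insert, other: Insert) -> Insert:
    """Rewrite `op` so it can be applied after the concurrent `other`."""
    other_goes_first = other.index < op.index or (
        other.index == op.index and other.site < op.site
    )
    if other_goes_first:
        # `other` inserted text before our position, so shift right.
        return replace(op, index=op.index + len(other.text))
    return op


def apply(doc: str, op: Insert) -> str:
    return doc[: op.index] + op.text + doc[op.index :]
```

If client 1 inserts "X" at index 1 and client 2 concurrently inserts "Y" at index 2 in "abc", each client applies its own operation first and the transformed remote operation second, and both end up with "aXbYc".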
Part 4: Scalability & Reliability
Addressing the challenges of massive scale.
16. What is a CDN and why is it critical?
Answer: A Content Delivery Network (CDN) is a network of geographically distributed servers.
Function: It caches static content (images, CSS, video) closer to the user to reduce latency (TTFB).
Mechanism: Uses Anycast routing or DNS-based geo-routing to direct users to a nearby "Edge Server."
Example: Netflix's own CDN, Open Connect, places caching servers inside ISP networks so that a user in India doesn't fetch video from a server in California.
17. Design Instagram (Image Heavy)?
Answer:
Storage: Photos in Amazon S3; Metadata (likes, comments) in Postgres.
Social Graph: Use a Graph Database (like Neo4j) or efficient adjacency lists in Cassandra to manage "Followers/Following."
Feed Generation: Pre-generate feeds (Push model).
Pagination: Use Cursor-based pagination instead of Offset-based for faster infinite scrolling performance.
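Cursor-based pagination can be illustrated with SQLite from the Python standard library. The `posts` table, its columns, and the `fetch_page` helper are hypothetical; the point is the `WHERE id < ?` filter, which lets the database seek directly to the next page instead of scanning and discarding `OFFSET` rows.

```python
import sqlite3


def fetch_page(conn, limit, cursor=None):
    """Return (rows, next_cursor); pass next_cursor back in to get the next page."""
    if cursor is None:
        rows = conn.execute(
            "SELECT id, caption FROM posts ORDER BY id DESC LIMIT ?", (limit,)
        ).fetchall()
    else:
        rows = conn.execute(
            "SELECT id, caption FROM posts WHERE id < ? ORDER BY id DESC LIMIT ?",
            (cursor, limit),
        ).fetchall()
    # A short page means there is nothing left to fetch.
    next_cursor = rows[-1][0] if len(rows) == limit else None
    return rows, next_cursor


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, caption TEXT)")
conn.executemany(
    "INSERT INTO posts (id, caption) VALUES (?, ?)",
    [(i, f"photo {i}") for i in range(1, 6)],
)
page1, cursor1 = fetch_page(conn, limit=2)
page2, cursor2 = fetch_page(conn, limit=2, cursor=cursor1)
page3, cursor3 = fetch_page(conn, limit=2, cursor=cursor2)
```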
18. Explain the CAP Theorem Trade-offs.
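Answer: The CAP theorem is often summarized as "pick two of Consistency, Availability, and Partition Tolerance." In practice, network partitions cannot be ruled out in a distributed system, so the real choice arises during a partition: stay consistent (and reject some requests) or stay available (and risk serving stale data).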
CP (Consistency + Partition Tolerance): Banking systems. If the network breaks, the system stops accepting writes to ensure data is accurate.
AP (Availability + Partition Tolerance): Social media feeds. If the network breaks, serve a slightly outdated feed rather than showing an error.
19. Design Ticket Booking (BookMyShow)?
Answer:
Concurrency: Handling the "double booking" problem.
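Solution: When a user selects a seat, place a temporary hold on it (a lock with a TTL) in Redis; if payment fails or times out, the hold is released. When confirming the booking in the database, use Optimistic Locking (e.g., a version column checked in the UPDATE) so that only one transaction can claim the seat.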
Queueing: If high demand (e.g., Coldplay concert), dump requests into a Kafka queue to process them sequentially rather than crashing the database.
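The database side of the double-booking check can be sketched with a version column, using SQLite from the standard library. The `seats` table, its columns, and the helper names are assumptions for the example; the Redis hold with a TTL is not shown here.

```python
import sqlite3


def read_seat(conn, seat_id):
    """Return (booked_by, version) as seen by the caller right now."""
    return conn.execute(
        "SELECT booked_by, version FROM seats WHERE id = ?", (seat_id,)
    ).fetchone()


def try_book(conn, seat_id, user, seen_version):
    """Book the seat only if nobody changed it since `seen_version` was read."""
    cur = conn.execute(
        "UPDATE seats SET booked_by = ?, version = version + 1 "
        "WHERE id = ? AND version = ? AND booked_by IS NULL",
        (user, seat_id, seen_version),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows updated means someone else won the race


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE seats (id TEXT PRIMARY KEY, booked_by TEXT, version INTEGER NOT NULL)"
)
conn.execute("INSERT INTO seats VALUES ('A1', NULL, 0)")

# Both users load the seat map before either one pays.
_, version_seen_by_alice = read_seat(conn, "A1")
_, version_seen_by_bob = read_seat(conn, "A1")
alice_ok = try_book(conn, "A1", "alice", version_seen_by_alice)
bob_ok = try_book(conn, "A1", "bob", version_seen_by_bob)
```

Alice's update succeeds and bumps the version, so Bob's update matches zero rows and he is told the seat is gone instead of being double booked.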
20. Leader Election in Distributed Systems?
Answer:
Problem: In a cluster (e.g., database replicas), who decides which node accepts writes?
Tools: ZooKeeper or etcd.
Algorithms: Paxos or Raft (etcd uses Raft; ZooKeeper uses its own protocol, ZAB). Nodes "vote" to elect a leader. If the leader fails (stops sending heartbeats), followers notice after a timeout and trigger a new election.
Part 5: 20 MORE Advanced System Design Questions (Deep Dive)
Take your preparation to the next level with these additional scenarios.
21. What is a Bloom Filter and where is it used?
Answer: A probabilistic data structure used to test whether an element is a member of a set. It is extremely memory efficient.
Property: It can tell you "Definitely No" or "Probably Yes." It never produces false negatives.
Use Case: Databases use it to avoid expensive disk lookups for non-existent rows (e.g., Cassandra and HBase check a Bloom Filter before reading a data file). Browsers have used it to check for malicious URLs.
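A Bloom Filter is small enough to sketch in full. The sizes, the use of SHA-256 with a per-hash prefix, and the method names are assumptions for the example; production implementations pick the bit-array size and hash count from the expected item count and the target false-positive rate.

```python
import hashlib


class BloomFilter:
    """Set-membership filter: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive `num_hashes` bit positions by prefixing the item with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item: str) -> bool:
        """False means "definitely not present"; True means "probably present"."""
        return all(
            self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item)
        )
```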
22. Explain the "Thundering Herd" problem.
Answer: This occurs when a large number of processes wake up simultaneously to process an event, but only one can handle it, causing a massive CPU spike.
Scenario: A cache key expires, and 10,000 users try to query the database simultaneously to regenerate it.
Solution: Use Cache Stampede protection (e.g., probabilistic early expiration or locking the cache key so only one process rebuilds it).
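The "lock the cache key" option can be sketched inside a single process with one lock per key. The class name `SingleFlightCache` and the `loader` callback are assumptions for the example; across multiple servers the same idea is usually implemented with a distributed lock rather than `threading.Lock`.

```python
import threading


class SingleFlightCache:
    """On a miss, only one caller per key runs the expensive loader."""

    def __init__(self):
        self._values = {}
        self._locks = {}
        self._guard = threading.Lock()  # protects the per-key lock table

    def get(self, key, loader):
        if key in self._values:
            return self._values[key]
        with self._guard:
            lock = self._locks.setdefault(key, threading.Lock())
        with lock:
            # Re-check: another thread may have filled the key while we waited.
            if key in self._values:
                return self._values[key]
            value = loader(key)  # e.g. the expensive database query
            self._values[key] = value
            return value
```

With ten threads asking for the same missing key, the loader runs once and the other nine callers reuse its result.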
23. Gossip Protocol vs. Centralized Coordination?
Answer:
Gossip Protocol: A peer-to-peer communication protocol where nodes randomly share state information with neighbors (like a virus spreading).
Use Case: Cassandra and Amazon's Dynamo (the system described in the Dynamo paper) use it for failure detection and membership management because it is decentralized and scalable.
24. Design a Web Crawler (Google Bot)?
Answer:
Components: URL Frontier (Queue), Fetcher, DNS Resolver, Content Deduplicator.
Challenge: Politeness (don't DDoS sites), looping (traps), and handling dynamic content.
Storage: Use a visited_urls table with a Bloom Filter to quickly check if a URL has already been crawled.
25. Strong Consistency vs. Eventual Consistency?
Answer:
Strong: A read always returns the latest write (e.g., SQL ACID). Critical for financial ledgers.
Eventual: A read might return stale data for a short period, but all nodes will eventually sync (e.g., DNS, YouTube view counts).
Trade-off: Strong consistency hurts availability and latency; eventual consistency improves them.
26. How to implement "Idempotency" in APIs?
Answer: Idempotency means making the same request multiple times yields the same result.
Implementation: The client generates a unique idempotency_key (UUID) and sends it with the request. The server checks a specialized store (like Redis) to see if this key was already processed. If yes, it returns the previous successful response without re-executing the logic, which prevents side effects such as charging a credit card twice.
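The server-side check can be sketched as follows. A dict stands in for the Redis store, and `PaymentService`, `charge`, and the response fields are hypothetical names used only for illustration.

```python
import uuid


class PaymentService:
    def __init__(self):
        self._responses = {}  # idempotency_key -> stored response (stand-in for Redis)
        self.charges = []     # record of real side effects, for demonstration

    def charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # A repeated key returns the saved response without charging again.
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]
        self.charges.append(amount_cents)  # the real side effect happens once
        response = {
            "status": "ok",
            "charge_id": len(self.charges),
            "amount_cents": amount_cents,
        }
        self._responses[idempotency_key] = response
        return response


service = PaymentService()
key = str(uuid.uuid4())                # generated once by the client
first = service.charge(key, 2500)
retry = service.charge(key, 2500)      # e.g. the client retried after a timeout
```

A concurrent duplicate could still slip through this simple check-then-write; a real store would use an atomic "set if absent" operation to close that gap.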
27. SQL vs. NoSQL: How to choose?
Answer:
SQL (Relational): Best for structured data, complex joins, and transactions (e.g., E-commerce orders).
NoSQL (Non-Relational):
Key-Value (Redis): Caching.
Document (MongoDB): Flexible schema (CMS, Catalogs).
Column-Family (Cassandra): High write throughput (IoT logs, Chat history).
Graph (Neo4j): Complex relationships (Social networks).
28. What is Backpressure?
Answer: A mechanism in stream processing where a consumer signals the producer to slow down because it cannot keep up with the data rate.
Without Backpressure: The consumer crashes (OOM) or queues explode.
Implementation: TCP flow control, Reactive Streams (RxJava), or limiting the queue size in message brokers.
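The "limit the queue size" option is the simplest form of backpressure to demonstrate. In this sketch the helper name `try_enqueue` and the tiny queue size are assumptions; a producer that receives `False` would slow down, retry later, or shed load.

```python
import queue


def try_enqueue(buffer: queue.Queue, item) -> bool:
    """Return False instead of growing without bound when the consumer lags."""
    try:
        buffer.put_nowait(item)
        return True
    except queue.Full:
        return False


buffer = queue.Queue(maxsize=2)
accepted = [try_enqueue(buffer, n) for n in range(3)]  # the third item is refused
buffer.get_nowait()                                    # the consumer catches up
accepted_after_drain = try_enqueue(buffer, 3)
```

Calling `buffer.put(item)` instead would block the producer until space frees up, which is another way of applying the same signal.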
29. Design a "Typeahead" (Autocomplete) System?
Answer:
Data Structure: Trie (Prefix Tree).
Optimization: Store the top 5 most searched terms at each node of the Trie.
Service: Typeahead service queries the Trie. Since the Trie is read-heavy, replicate it across memory in multiple servers.
Updates: Rebuild the Trie offline (e.g., hourly) with a batch job such as MapReduce over the search logs, rather than updating it in real time, to keep lookups fast.
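A Trie that caches the top suggestions at every node might look like the sketch below. The class names, the `build` method that stands in for the offline rebuild, and the tie-break by alphabetical order are assumptions for the example.

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.top = []  # cached (count, term) pairs for this prefix


class Typeahead:
    def __init__(self, k=5):
        self.k = k
        self.root = TrieNode()

    def build(self, term_counts: dict) -> None:
        """Offline rebuild from aggregated search counts."""
        self.root = TrieNode()
        for term, count in term_counts.items():
            node = self.root
            for ch in term:
                node = node.children.setdefault(ch, TrieNode())
                node.top.append((count, term))
        self._trim(self.root)

    def _trim(self, node: TrieNode) -> None:
        # Keep only the k most searched terms, highest count first.
        node.top = sorted(node.top, key=lambda ct: (-ct[0], ct[1]))[: self.k]
        for child in node.children.values():
            self._trim(child)

    def suggest(self, prefix: str) -> list:
        """Walk to the prefix node and return its precomputed suggestions."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return []
        return [term for _, term in node.top]
```

Because the ranking is precomputed, a lookup costs only one step per typed character, with no subtree traversal at query time.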
30. Explain Circuit Breaker Pattern.
Answer: Prevents an application from repeatedly trying to execute an operation that's likely to fail.
States: Closed (Normal), Open (Error threshold reached, block requests), Half-Open (Test if service is back).
Benefit: Prevents cascading failures in microservices. If Service A depends on Service B, and B is down, A shouldn't hang waiting for timeouts; it should fail fast.
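The three states can be captured in a small wrapper. This is a sketch under simplifying assumptions: it is not thread-safe, it counts consecutive failures only, and the names `CircuitBreaker`, `CircuitOpenError`, and the injectable `clock` are made up for the example.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected immediately."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock
        self.state = "closed"
        self.failures = 0
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            if self.clock() - self.opened_at >= self.reset_timeout:
                self.state = "half_open"  # let one trial request through
            else:
                raise CircuitOpenError("failing fast; dependency marked as down")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial, or too many failures in a row, opens the circuit.
            if self.state == "half_open" or self.failures >= self.failure_threshold:
                self.state = "open"
                self.opened_at = self.clock()
            raise
        self.failures = 0
        self.state = "closed"
        return result
```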
31. Server-Sent Events (SSE) vs. WebSockets vs. Long Polling?
Answer:
Long Polling: Client requests, server holds connection open until data is available. Old school, header heavy.
WebSockets: Bi-directional, full-duplex communication. Best for Chat apps and Gaming.
SSE: Uni-directional (Server to Client). Best for stock tickers, news feeds, or notifications where the client doesn't need to send data back.
32. Design a "Near Me" Service (Yelp/Google Maps)?
Answer:
Database: Needs efficient spatial queries. Postgres with PostGIS extension.
Geohashing: Convert 2D coordinates (lat/long) into a single alphanumeric string. Locations sharing a prefix are geographically close.
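Search: Query the database for points that match the user's Geohash prefix, and also check the neighboring cells, since two nearby points on opposite sides of a cell boundary may not share a prefix.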
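For reference, the standard Geohash encoding interleaves longitude and latitude bits and maps every 5 bits to one Base32 character. The sketch below follows that scheme; the function name and default precision are choices made for this example.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # Geohash alphabet (no a, i, l, o)


def geohash_encode(lat: float, lon: float, precision: int = 6) -> str:
    """Encode a coordinate by repeatedly halving the lon/lat ranges."""
    lat_range = [-90.0, 90.0]
    lon_range = [-180.0, 180.0]
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        rng, value = (lon_range, lon) if use_lon else (lat_range, lat)
        mid = (rng[0] + rng[1]) / 2
        if value >= mid:
            bits = (bits << 1) | 1
            rng[0] = mid
        else:
            bits <<= 1
            rng[1] = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one Base32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)
```

Each extra character narrows the cell, so a shorter prefix such as the first 5 characters covers a wider search area than the first 7.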
33. Blue-Green Deployment vs. Canary Deployment?
Answer:
Blue-Green: Two identical environments. Blue is live. Deploy to Green. Switch router to Green. Instant rollback possible.
Canary: Roll out the update to a small % of users (e.g., 1%). Monitor metrics. Gradually increase to 100%. Lower risk than Blue-Green but slower.
34. How to handle "Hot Partitions" in a distributed database?
Answer: A hot partition occurs when a specific shard key is accessed disproportionately (e.g., Justin Bieber's tweets).
Solution:
Salting: Append a random number to the key to distribute it across multiple shards (e.g., Bieber_1, Bieber_2).
Read Aggregation: On read, query all of the salted keys and merge the results.
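Salting on write and aggregating on read can be sketched with a counter. The class name `SaltedCounter`, the number of salts, and the dict that stands in for a sharded key-value store are assumptions for the example.

```python
import random
from collections import defaultdict


class SaltedCounter:
    """Spreads writes for one hot key across `num_salts` sub-keys."""

    def __init__(self, num_salts=4, rng=None):
        self.num_salts = num_salts
        self.rng = rng or random.Random()
        self.store = defaultdict(int)  # stand-in for a sharded key-value store

    def increment(self, key: str, amount: int = 1) -> None:
        # Each write picks a random salt, so no single sub-key takes all the load.
        salt = self.rng.randrange(self.num_salts)
        self.store[f"{key}_{salt}"] += amount

    def read(self, key: str) -> int:
        # Reads fan out to every salted sub-key and sum the partial values.
        return sum(
            self.store.get(f"{key}_{salt}", 0) for salt in range(self.num_salts)
        )
```

The trade-off is that reads become more expensive (one lookup per salt), which is usually acceptable when the key is write-heavy.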
35. Explain Event Sourcing.
Answer: Instead of storing just the current state (e.g., "Balance: $100"), store the sequence of events that led to it (e.g., "Deposit $50", "Withdraw $20", "Deposit $70").
Pros: Complete audit trail, ability to replay history to fix bugs, easy temporal queries.
Cons: Complexity, need snapshots to speed up state reconstruction.
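Using the deposit and withdrawal example above, replaying an event log looks roughly like this. The `Event` type and the `replay` helper (including its `snapshot` parameter) are names assumed for the sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    kind: str    # "deposit" or "withdraw"
    amount: int  # dollars


def apply(balance: int, event: Event) -> int:
    if event.kind == "deposit":
        return balance + event.amount
    if event.kind == "withdraw":
        return balance - event.amount
    raise ValueError(f"unknown event kind: {event.kind}")


def replay(events, snapshot: int = 0) -> int:
    """Rebuild the current balance, optionally starting from a saved snapshot."""
    balance = snapshot
    for event in events:
        balance = apply(balance, event)
    return balance


log = [Event("deposit", 50), Event("withdraw", 20), Event("deposit", 70)]
current_balance = replay(log)                        # full replay from zero
balance_after_two = replay(log[:2])                  # temporal query: state after 2 events
from_snapshot = replay(log[2:], snapshot=balance_after_two)
```

Replaying the full log yields the $100 balance, and resuming from the snapshot taken after the second event gives the same result while processing only one event.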
36. What is Distributed Tracing?
Answer: A method to track a request as it flows through multiple microservices.
Tooling: Jaeger, Zipkin.
Mechanism: Assign a unique Trace ID at the ingress gateway and pass it in headers to every downstream service. Helps debug latency issues (e.g., "Why did this request take 2 seconds?").
37. Design a Notification System?
Answer:
Components: Notification Service, User Preferences DB (Opt-in/out), Template Engine.
Queues: Use RabbitMQ/Kafka to decouple the trigger from the sender.
Workers: Separate workers for Email (SES), SMS (Twilio), and Push (FCM/APNS).
Retry Logic: Exponential backoff for failed sends.
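The retry logic can be sketched as a helper that doubles the wait after each failure. The function name, the default limits, and the added random jitter (a common refinement that the list above does not mention) are assumptions for the example.

```python
import random
import time


def send_with_retry(send, max_retries=5, base=1.0, cap=60.0,
                    sleep=time.sleep, rng=random.random):
    """Call `send()`; on failure wait up to base * 2**attempt seconds and retry."""
    for attempt in range(max_retries + 1):
        try:
            return send()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the error (or dead-letter it)
            delay = min(cap, base * 2 ** attempt)
            # Jitter: sleep a random fraction of the delay so that many failed
            # sends do not all retry at the same instant.
            sleep(delay * rng())
```

The `sleep` and `rng` parameters exist only so that the helper can be exercised without real waiting.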
38. Database Isolation Levels (ACID)?
Answer:
Read Uncommitted: Dirty reads allowed (fastest, unsafe).
Read Committed: No dirty reads (standard).
Repeatable Read: No non-repeatable reads (Phantom reads possible).
Serializable: Strict execution order (slowest, safest).
Interview Tip: Most scalable systems default to Read Committed or lower to maintain performance.
39. Design a Metrics/Monitoring System (like Datadog/Prometheus)?
Answer:
Data Model: Time-series database (e.g., InfluxDB or Prometheus TSDB).
Collection: Pull model (Prometheus scrapes endpoints) vs. Push model (App sends to agent).
Resolution: Keep high resolution (1 sec) for 24 hours, then downsample (1 min) for 30 days to save storage.
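Downsampling is a simple bucketing step. In this sketch, points are `(unix_timestamp_seconds, value)` pairs and each bucket is reduced to its average; the function name and the choice of averaging (rather than min, max, or percentiles) are assumptions.

```python
from collections import defaultdict


def downsample(points, bucket_seconds=60):
    """Reduce (timestamp, value) points to one averaged point per time bucket."""
    buckets = defaultdict(list)
    for ts, value in points:
        bucket_start = ts - ts % bucket_seconds  # align to the bucket boundary
        buckets[bucket_start].append(value)
    return [
        (start, sum(values) / len(values))
        for start, values in sorted(buckets.items())
    ]
```

Going from 1-second to 1-minute resolution this way stores one point instead of sixty, at the cost of hiding short spikes inside each minute.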
40. How to secure a Microservices Architecture?
Answer:
mTLS (Mutual TLS): Encrypt traffic between services and verify identity of both sides.
API Gateway: Centralized auth (JWT validation).
Service Mesh: (e.g., Istio) handles security logic like mTLS and access policies out of the application code.
Least Privilege: Services should only have network access to specific dependencies.
💬 Discussion
Which of these system design challenges have you faced in an interview?
Do you prefer Monoliths or Microservices for early-stage startups?
Drop your answer below 👇, LIKE if this guide helped your prep, and SHARE this post to help fellow developers ace their FAANG interviews!