Database Design Patterns: Choosing the Right Data Store

Day 4 of the 30-Day System Design Roadmap

"Just use MongoDB!" a junior developer once told me during a system design review for a banking application. "It's web scale!"

I smiled, knowing we were about to have an important conversation. Three months later, after countless headaches with inconsistent account balances and a near-disaster with lost transactions, that same developer came to my desk and said, "I think I understand why you were hesitant."

Here's the thing: choosing a database isn't about what's trendy or what your favorite tech influencer recommends. It's not about MongoDB vs PostgreSQL vs Redis in some abstract battle. It's about understanding your data's personality, your application's needs, and the trade-offs you're willing to make.

I've spent the last decade designing database architectures for everything from small startups to systems handling billions of transactions. And you know what? The database choice can make or break your application—not because one is inherently "better," but because each one is designed for specific problems.

Today, we're going to explore database design patterns with both theory and practical examples. By the end, you'll know how to choose the right data store for your specific situation.

The Database Selection Dilemma

Let me paint you a picture. You're building a social media platform—the next big thing. You sit down to design your data layer and realize you need to handle:

- User profiles with structured information (email, name, birthdate)
- Posts and comments that grow rapidly and have flexible content
- Friend connections and follower relationships
- User sessions that need lightning-fast access
- Analytics data that's write-heavy and time-based

    Your first instinct might be to pick one database and force everything into it. I've seen teams do this countless times. They choose PostgreSQL and struggle with graph queries for social connections. Or they choose MongoDB and fight to ensure data consistency for financial transactions.

    Here's the reality that took me years to accept: Modern applications don't use just one database. Netflix uses multiple types. So does Amazon. So does pretty much every company operating at scale.

    SQL vs NoSQL: Beyond the Hype

    The SQL versus NoSQL debate is often framed wrong. People treat it like choosing sides in a sports game. But it's more like choosing between a sedan and a pickup truck—both are vehicles, but they're designed for different jobs.

    When SQL Databases Make Perfect Sense

    SQL databases—PostgreSQL, MySQL, Oracle—are like that reliable friend who always keeps their promises. They're built around ACID properties and excel when you need absolute certainty about your data.

    I once worked on an e-commerce platform processing millions of dollars daily. When a customer places an order, we need to:

    1. Check if the product is in stock
    2. Reduce the inventory count
    3. Create an order record
    4. Charge the customer's payment method

    Imagine if step 3 succeeded but step 4 failed. You'd have an order in the system but no payment. This is where SQL databases shine—transactions ensure either everything succeeds or nothing does.

    Here's how it looks in practice:

    sql
    -- E-commerce order processing with ACID guarantees (PostgreSQL)
    BEGIN;

    DO $$
    DECLARE
        new_order_id INTEGER;
    BEGIN
        -- Step 1: Check and reduce inventory
        UPDATE inventory
        SET quantity = quantity - 1
        WHERE product_id = 'PROD-123'
          AND quantity > 0;

        -- If no rows were affected, the product is out of stock
        IF NOT FOUND THEN
            RAISE EXCEPTION 'Product out of stock';
        END IF;

        -- Step 2: Create the order
        INSERT INTO orders (user_id, product_id, amount, status)
        VALUES (42, 'PROD-123', 99.99, 'pending')
        RETURNING order_id INTO new_order_id;

        -- Step 3: Process payment
        INSERT INTO payments (order_id, amount, status)
        VALUES (new_order_id, 99.99, 'processing');
    END $$;

    -- All statements succeed together, or the error aborts the whole transaction
    COMMIT;

    SQL databases are your go-to when:

    - You're dealing with money (banking, payments, e-commerce)
    - Your data has complex relationships (orders with line items, users with addresses)
    - Data integrity is non-negotiable
    - You need complex queries across multiple entities
    - Your schema is relatively stable

    When NoSQL Databases Save the Day

    NoSQL databases trade some of SQL's guarantees for massive scalability and flexibility. Let me tell you about a real problem I faced.

    I consulted for a social media startup using PostgreSQL for their user feed. Everything worked fine until they hit 100,000 users. Then performance collapsed. The problem? They were doing complex joins across five tables to build each user's feed:

    sql
    -- This query became a nightmare at scale
    SELECT
        p.post_id,
        p.content,
        p.created_at,
        u.user_id,
        u.name,
        u.avatar_url,
        COUNT(DISTINCT l.like_id) as like_count,
        COUNT(DISTINCT c.comment_id) as comment_count,
        EXISTS(
            SELECT 1 FROM likes
            WHERE post_id = p.post_id AND user_id = @current_user_id
        ) as user_has_liked
    FROM posts p
    INNER JOIN users u ON p.author_id = u.user_id
    INNER JOIN followers f ON f.followed_id = p.author_id
    LEFT JOIN likes l ON l.post_id = p.post_id
    LEFT JOIN comments c ON c.post_id = p.post_id
    WHERE f.follower_id = @current_user_id
    GROUP BY p.post_id, u.user_id
    ORDER BY p.created_at DESC
    LIMIT 50;

    Each feed request hit five tables with multiple joins. At scale, this was a disaster.

    We migrated to MongoDB with a denormalized approach:

    javascript
    // Single document contains everything needed for display
    {
      "_id": "feed_user_12345",
      "user_id": 12345,
      "posts": [
        {
          "post_id": "post_789",
          "content": "Just deployed to production on Friday! 🚀",
          "created_at": ISODate("2025-01-15T14:30:00Z"),
          "author": {
            "user_id": "user_456",
            "name": "Jane Doe",
            "avatar_url": "https://cdn.example.com/jane.jpg"
          },
          "stats": {
            "likes": 42,
            "comments": 7
          },
          "user_has_liked": false
        }
        // ... more posts
      ],
      "last_updated": ISODate("2025-01-15T14:35:00Z")
    }
    
    // Fetching a feed becomes one simple query
    db.feeds.findOne({ user_id: 12345 })

    Feed load times dropped from 2 seconds to 50 milliseconds. Yes, we had to update more places when data changed. But the trade-off was worth it for the read performance.
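
    That write amplification looks roughly like this (a sketch against the feed documents above; `update_author_name` is a hypothetical helper): when an author changes their display name, every embedded copy of it has to be patched.

    python
    # When an author renames themselves, every embedded copy must be patched.
    # update_author_name is a hypothetical helper; "feeds" matches the document above.
    def update_author_name(mongo_db, author_user_id, new_name):
        mongo_db.feeds.update_many(
            {"posts.author.user_id": author_user_id},
            {"$set": {"posts.$[p].author.name": new_name}},
            # arrayFilters limits the $set to posts written by this author
            array_filters=[{"p.author.user_id": author_user_id}],
        )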

    NoSQL databases are your go-to when:

    - You need to scale horizontally (add more servers)
    - Your data model is flexible and evolving
    - You're dealing with documents or nested data
    - You have massive write volumes
    - Geographic distribution is critical

    The Honest Truth About Trade-offs

    Here's what most tutorials won't tell you: both approaches have painful limitations.

    SQL databases struggle with horizontal scaling. When you outgrow a single server, sharding is complex and you lose the ability to join across shards.
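
    To make the sharding pain concrete, here's a toy sketch of application-level sharding (the shard map and `shard_for` helper are illustrative, not a production router). Routing a single user is easy; the catch is that two users on different shards can no longer be joined in one SQL query.

    python
    import hashlib

    # Hypothetical shard map: user rows spread across three PostgreSQL servers
    SHARDS = {
        0: "postgresql://shard0.internal/app",
        1: "postgresql://shard1.internal/app",
        2: "postgresql://shard2.internal/app",
    }

    def shard_for(user_id: int) -> str:
        """Deterministically map a user to one shard's connection string."""
        digest = hashlib.md5(str(user_id).encode()).hexdigest()
        return SHARDS[int(digest, 16) % len(SHARDS)]

    # Users 42 and 1001 may land on different servers, so "their posts JOIN my
    # followers" has to be stitched together in application code instead of SQL.
    print(shard_for(42), shard_for(1001))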

    NoSQL databases struggle with consistency. When data is spread across servers, you might not see your own write immediately. Transactions that span multiple documents or partitions are limited, slower, or unavailable depending on the system.
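
    And here's a minimal PyMongo sketch of the consistency side (the replica-set URI and collection are made up): a write acknowledged by only the primary can be invisible to an immediate read from a secondary, and getting read-your-own-writes back costs you latency.

    python
    from pymongo import MongoClient, ReadPreference
    from pymongo.read_concern import ReadConcern
    from pymongo.write_concern import WriteConcern

    client = MongoClient("mongodb://node1,node2,node3/?replicaSet=rs0")
    db = client.social

    # Acknowledged by the primary only - fast, but secondaries may lag behind
    posts = db.get_collection("posts", write_concern=WriteConcern(w=1))
    posts.insert_one({"post_id": "post_789", "content": "hello"})

    # An immediate read routed to a secondary may not see the write above
    lagging = db.get_collection(
        "posts", read_preference=ReadPreference.SECONDARY_PREFERRED
    )
    maybe_stale = lagging.find_one({"post_id": "post_789"})

    # Causally consistent session + majority concerns restores read-your-own-writes,
    # at the cost of waiting on replication
    safe_posts = db.get_collection(
        "posts",
        write_concern=WriteConcern(w="majority"),
        read_concern=ReadConcern("majority"),
    )
    with client.start_session(causal_consistency=True) as session:
        safe_posts.insert_one({"post_id": "post_790", "content": "hi"}, session=session)
        definitely_there = safe_posts.find_one({"post_id": "post_790"}, session=session)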

    The key is understanding which limitations you can live with.

    ACID Properties: The Promise of Reliability

    ACID stands for Atomicity, Consistency, Isolation, and Durability. These aren't just buzzwords—they're fundamental guarantees that can save you from catastrophic failures.

    Atomicity: All or Nothing

    Atomicity means operations either all succeed or all fail. No partial success.

    python
    # Without atomicity - DISASTER waiting to happen
    def transfer_money_bad(from_account, to_account, amount):
        balance = get_balance(from_account)
        if balance >= amount:
            deduct(from_account, amount)  # What if this succeeds...
            add(to_account, amount)       # But this fails? Money disappears!
    
    # With atomicity - SAFE
    def transfer_money_good(from_account, to_account, amount):
        with database.transaction():
            balance = get_balance(from_account)
            if balance >= amount:
                deduct(from_account, amount)
                add(to_account, amount)
                # Both succeed or both fail - no middle ground

    I learned this the hard way consulting for a startup with a race condition in their wallet system. Money could be deducted but not credited due to network timeouts. They lost $50K before discovering the bug. Atomicity would have prevented this entirely.

    Consistency: Rules Are Rules

    Consistency means your database enforces rules, even when your application code has bugs.

    sql
    -- Database-level constraints ensure data integrity
    CREATE TABLE users (
        user_id SERIAL PRIMARY KEY,
        email VARCHAR(255) UNIQUE NOT NULL,  -- No duplicate emails, ever
        age INTEGER CHECK (age >= 18),        -- Must be adult
        balance DECIMAL(10,2) CHECK (balance >= 0),  -- Can't go negative
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    );
    
    CREATE TABLE orders (
        order_id SERIAL PRIMARY KEY,
        user_id INTEGER REFERENCES users(user_id),  -- Must be valid user
        amount DECIMAL(10,2) NOT NULL CHECK (amount > 0)
    );

    Think about an inventory system. You have a rule: quantity can never go below zero. Without database-level consistency, two customers could buy the last item simultaneously:

    python
    # Race condition without proper constraints
    # Customer A: Check quantity = 1 ✓
    # Customer B: Check quantity = 1 ✓
    # Customer A: Create order ✓
    # Customer B: Create order ✓
    # Result: Two orders, one item!
    
    # With proper isolation and constraints
    def purchase_item(user_id, product_id):
        with db.transaction(isolation_level='SERIALIZABLE'):
            product = db.execute(
                "SELECT * FROM inventory WHERE product_id = %s FOR UPDATE",
                (product_id,)
            )
    
            if product.quantity < 1:
                raise OutOfStockError()
    
            db.execute(
                "UPDATE inventory SET quantity = quantity - 1 "
                "WHERE product_id = %s AND quantity > 0",
                (product_id,)
            )
    
            db.execute(
                "INSERT INTO orders (user_id, product_id) VALUES (%s, %s)",
                (user_id, product_id)
            )

    Isolation: Don't Step on Each Other's Toes

    Isolation is about what happens when multiple operations occur simultaneously. Databases offer different isolation levels, and choosing the right one is a balancing act:

    Read Committed (default for most databases): You only see data that's been committed. Good for most applications.

    Repeatable Read: Same query returns same results throughout a transaction. Great for reports.

    Serializable: Complete isolation, as if transactions ran one at a time. Maximum safety, minimum concurrency.

    python
    # Example: Booking system with proper isolation
    from sqlalchemy import create_engine, text

    # Configure isolation level based on your needs
    engine = create_engine(
        'postgresql://localhost/bookings',
        isolation_level='REPEATABLE READ'  # Consistent reads within a transaction
    )

    def book_seat(user_id, seat_id):
        with engine.begin() as conn:
            # Lock the row for update
            seat = conn.execute(
                text("SELECT * FROM seats WHERE seat_id = :seat_id FOR UPDATE"),
                {"seat_id": seat_id}
            ).fetchone()

            if seat is None or not seat.available:
                raise SeatAlreadyBookedError()

            conn.execute(
                text("UPDATE seats SET available = false, booked_by = :user_id "
                     "WHERE seat_id = :seat_id"),
                {"user_id": user_id, "seat_id": seat_id}
            )

    Durability: Written Means Written

    Once a transaction commits, it survives crashes, power failures, everything. This is why database write operations are slower than pure memory operations—durability has a cost, but for critical data, it's worth it.
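
    That cost is also tunable. Here's a small psycopg2 sketch (connection string and tables are illustrative): PostgreSQL's synchronous_commit setting decides whether COMMIT waits for the write-ahead log to reach disk.

    python
    import psycopg2

    conn = psycopg2.connect("dbname=shop")  # illustrative connection string

    # Default: COMMIT returns only after the write-ahead log is flushed to disk,
    # so a committed payment survives a crash or power failure
    with conn, conn.cursor() as cur:
        cur.execute(
            "INSERT INTO payments (order_id, amount) VALUES (%s, %s)", (1, 99.99)
        )

    # For low-value data, durability can be relaxed for a single transaction:
    # commits return sooner, but a crash may lose the last few of them
    with conn, conn.cursor() as cur:
        cur.execute("SET LOCAL synchronous_commit TO OFF")
        cur.execute("INSERT INTO page_views (path) VALUES (%s)", ("/home",))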

    The CAP Theorem: Choose Your Battles

    The CAP theorem states you can only have 2 out of 3 in a distributed system:

    - Consistency: All nodes see the same data at the same time
    - Availability: Every request gets a response (not an error)
    - Partition Tolerance: System works despite network failures

    Why You Can't Have It All

    Imagine two database servers in different data centers. The network connection breaks (this happens regularly in real systems). You face a choice:

    Option A: Stop accepting writes until connection is restored. This maintains consistency but sacrifices availability.

    Option B: Continue accepting writes on both servers. This maintains availability but sacrifices consistency (servers now have different data).

    There's no third option. You must choose.

    CP Systems: Consistency + Partition Tolerance

    CP systems choose consistency over availability. MongoDB with majority write concern, HBase, and Consul are CP systems.

    python
    # MongoDB configured for strong consistency
    from pymongo import MongoClient, WriteConcern
    
    client = MongoClient('mongodb://localhost:27017/')
    db = client.banking
    
    # Write succeeds only if majority of replicas confirm
    # If replica set can't reach majority, write is rejected (less available)
    # But all replicas agree on the data (consistent)
    accounts = db.accounts.with_options(
        write_concern=WriteConcern(w='majority', j=True)
    )
    result = accounts.insert_one({
        'account_id': 'ACC-123',
        'balance': 1000.00,
        'currency': 'USD'
    })

    When to choose CP: Financial transactions, inventory systems, any scenario where wrong data is worse than slow data.

    AP Systems: Availability + Partition Tolerance

    AP systems choose availability over consistency. Cassandra, DynamoDB, and Riak are AP systems.

    python
    # Cassandra configured for high availability
    from cassandra import ConsistencyLevel
    from cassandra.cluster import Cluster
    from cassandra.query import SimpleStatement
    
    cluster = Cluster(['node1', 'node2', 'node3'])
    session = cluster.connect('social_media')
    
    # Write succeeds even if some nodes are down (available)
    # But might be temporarily inconsistent across nodes
    statement = SimpleStatement(
        "INSERT INTO user_sessions (user_id, session_data) VALUES (%s, %s)",
        consistency_level=ConsistencyLevel.ONE  # Only need one node to acknowledge
    )
    
    session.execute(statement, (user_id, session_data))

    When to choose AP: Social media feeds, shopping carts, user sessions—scenarios where availability trumps consistency.

    Real-World: Using Both

    At a previous company, we didn't choose one philosophy. We analyzed each data type:

    python
    # User accounts: PostgreSQL (CP)
    class UserService:
        def update_balance(self, user_id, amount):
            # Strong consistency required for money
            with postgres_db.transaction():
                return postgres_db.execute(
                    "UPDATE accounts SET balance = balance + %s "
                    "WHERE user_id = %s",
                    (amount, user_id)
                )
    
    # User sessions: Redis (AP)
    class SessionService:
        def create_session(self, user_id, data):
            # High availability, eventual consistency OK
            return redis_cluster.setex(
                f"session:{user_id}",
                3600,  # 1 hour
                json.dumps(data)
            )
    
    # Shopping cart: Hybrid approach
    class CartService:
        def add_item(self, user_id, item):
            # Optimistic update to the cache (AP) - Redis stores strings, so serialize
            raw = redis_client.get(f"cart:{user_id}")
            cart = json.loads(raw) if raw else []
            cart.append(item)
            redis_client.set(f"cart:{user_id}", json.dumps(cart))

            # Async sync to PostgreSQL for durability (CP)
            task_queue.enqueue(sync_cart_to_db, user_id, cart)

    Types of Databases: Choosing Your Weapon

    Let's explore different database types with practical examples.

    Relational Databases: The Structured Powerhouse

    Perfect for structured data with complex relationships.

    sql
    -- E-commerce schema showing the power of relationships
    CREATE TABLE users (
        user_id SERIAL PRIMARY KEY,
        email VARCHAR(255) UNIQUE NOT NULL,
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    );
    
    CREATE TABLE products (
        product_id SERIAL PRIMARY KEY,
        name VARCHAR(255) NOT NULL,
        price DECIMAL(10,2) NOT NULL,
        stock INTEGER NOT NULL CHECK (stock >= 0)
    );
    
    CREATE TABLE orders (
        order_id SERIAL PRIMARY KEY,
        user_id INTEGER REFERENCES users(user_id),
        total DECIMAL(10,2) NOT NULL,
        status VARCHAR(50) NOT NULL,
        created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    );
    
    CREATE TABLE order_items (
        order_id INTEGER REFERENCES orders(order_id),
        product_id INTEGER REFERENCES products(product_id),
        quantity INTEGER NOT NULL,
        price DECIMAL(10,2) NOT NULL,
        PRIMARY KEY (order_id, product_id)
    );
    
    -- Complex queries are natural and efficient
    SELECT
        u.email,
        COUNT(DISTINCT o.order_id) as total_orders,
        SUM(o.total) as total_spent,
        AVG(o.total) as avg_order_value
    FROM users u
    LEFT JOIN orders o ON u.user_id = o.user_id
    WHERE o.created_at >= NOW() - INTERVAL '30 days'
    GROUP BY u.email
    HAVING COUNT(DISTINCT o.order_id) > 5
    ORDER BY total_spent DESC;

    When to use: Financial systems, CRM, ERP, any application with complex relationships.

    Document Databases: The Flexible Adapter

    Perfect for hierarchical data and flexible schemas.

    javascript
    // Blog post in MongoDB - naturally nested structure
    {
      "_id": ObjectId("507f1f77bcf86cd799439011"),
      "title": "Database Design Patterns",
      "slug": "database-design-patterns",
      "author": {
        "user_id": 12345,
        "name": "Hrithik Dhakrey",
        "avatar": "https://cdn.example.com/avatar.jpg"
      },
      "content": "Today we're learning about databases...",
      "tags": ["databases", "system-design", "backend"],
      "published_at": ISODate("2025-01-15T10:00:00Z"),
      "stats": {
        "views": 15420,
        "likes": 342,
        "shares": 67
      },
      "comments": [
        {
          "comment_id": "cmt_001",
          "user": {
            "user_id": 67890,
            "name": "Jane Developer"
          },
          "text": "Great article!",
          "created_at": ISODate("2025-01-15T11:30:00Z")
        }
      ]
    }
    
    // One query gets everything - no joins needed
    const post = await db.posts.findOne({ slug: "database-design-patterns" });
    
    // Easy to evolve - add new fields without migration
    await db.posts.updateOne(
      { _id: post._id },
      {
        $set: {
          video_url: "https://youtube.com/...",
          reading_time: 15
        }
      }
    );

    When to use: Content management, product catalogs, user profiles with varying fields.

    Key-Value Stores: The Speed Demon

    Perfect for caching and simple, fast lookups.

    python
    import redis
    import json
    
    redis_client = redis.Redis(host='localhost', port=6379)
    
    # Session management
    def create_session(user_id, session_data):
        session_id = generate_session_id()
        redis_client.setex(
            f"session:{session_id}",
            3600,  # 1 hour expiry
            json.dumps(session_data)
        )
        return session_id
    
    # Rate limiting
    def check_rate_limit(user_id, max_requests=100, window=3600):
        key = f"rate_limit:{user_id}"
        current = redis_client.incr(key)
    
        if current == 1:
            redis_client.expire(key, window)
    
        return current <= max_requests
    
    # Caching expensive queries
    def get_user_profile(user_id):
        cache_key = f"user:{user_id}"
    
        # Try cache first
        cached = redis_client.get(cache_key)
        if cached:
            return json.loads(cached)
    
        # Cache miss - fetch from database
        user = postgres_db.query("SELECT * FROM users WHERE id = %s", user_id)
    
        # Cache for 5 minutes
        redis_client.setex(cache_key, 300, json.dumps(user))
        return user
    
    # Leaderboard with sorted sets
    def update_leaderboard(user_id, score):
        redis_client.zadd("game_leaderboard", {user_id: score})
    
    def get_top_players(limit=10):
        return redis_client.zrevrange("game_leaderboard", 0, limit-1, withscores=True)

    When to use: Caching, sessions, real-time features, rate limiting.

    Graph Databases: The Relationship Expert

    Perfect when relationships are as important as the data itself.

    cypher
    // Social network in Neo4j
    // Create users and relationships
    CREATE (alice:User {user_id: 1, name: 'Alice'})
    CREATE (bob:User {user_id: 2, name: 'Bob'})
    CREATE (carol:User {user_id: 3, name: 'Carol'})
    CREATE (dave:User {user_id: 4, name: 'Dave'})
    
    CREATE (alice)-[:FOLLOWS]->(bob)
    CREATE (alice)-[:FOLLOWS]->(carol)
    CREATE (bob)-[:FOLLOWS]->(carol)
    CREATE (bob)-[:FOLLOWS]->(dave)
    CREATE (carol)-[:FOLLOWS]->(dave)
    
    // Find users who follow the same people Alice follows (try this efficiently in SQL!)
    MATCH (alice:User {name: 'Alice'})-[:FOLLOWS]->(mutual)<-[:FOLLOWS]-(friend)
    WHERE alice <> friend
    RETURN DISTINCT friend.name
    
    // Friend recommendations (friends of friends you don't follow)
    MATCH (alice:User {name: 'Alice'})-[:FOLLOWS*2..2]->(recommendation)
    WHERE NOT (alice)-[:FOLLOWS]->(recommendation)
      AND alice <> recommendation
    RETURN recommendation.name, COUNT(*) as connection_strength
    ORDER BY connection_strength DESC
    LIMIT 5

    When to use: Social networks, recommendation engines, fraud detection, knowledge graphs.

    Time-Series Databases: The Metrics Collector

    Perfect for timestamped data and analytics.

    python
    from influxdb_client import InfluxDBClient, Point
    from influxdb_client.client.write_api import SYNCHRONOUS
    
    client = InfluxDBClient(url="http://localhost:8086", token="token", org="myorg")
    write_api = client.write_api(write_options=SYNCHRONOUS)
    
    # Recording application metrics
    def track_api_request(endpoint, response_time, status_code):
        point = Point("api_metrics") \
            .tag("endpoint", endpoint) \
            .tag("status_code", str(status_code)) \
            .field("response_time_ms", response_time) \
            .field("value", 1)
    
        write_api.write(bucket="monitoring", record=point)
    
    # Recording IoT sensor data
    def record_sensor_data(sensor_id, temperature, humidity):
        point = Point("sensor_readings") \
            .tag("sensor_id", sensor_id) \
            .tag("location", "warehouse_a") \
            .field("temperature", temperature) \
            .field("humidity", humidity)
    
        write_api.write(bucket="iot", record=point)
    
    # Querying time-series data
    query = '''
    from(bucket: "monitoring")
      |> range(start: -24h)
      |> filter(fn: (r) => r["_measurement"] == "api_metrics")
      |> filter(fn: (r) => r["endpoint"] == "/api/orders")
      |> aggregateWindow(every: 1h, fn: mean)
      |> yield(name: "mean_response_time")
    '''
    
    result = client.query_api().query(query)

    When to use: Application monitoring, IoT data, financial tick data, analytics.

    Practical Exercise: Social Media Platform Architecture

    Let's design a real system. Here's how I'd architect a social media platform using multiple databases.

    The Architecture Overview

    python
    """
    Database Strategy for Social Media Platform:
    
    1. PostgreSQL - User accounts and authentication (needs ACID)
    2. MongoDB - Posts and comments (flexible schema, high write volume)
    3. Neo4j - Social graph (optimized for relationship queries)
    4. Redis - Feed cache and sessions (speed critical)
    5. InfluxDB - Analytics and metrics (time-series data)
    """
    
    # 1. User Service (PostgreSQL)
    class UserService:
        def create_user(self, email, password, name):
            """
            User accounts need ACID guarantees.
            Email must be unique. Password must be hashed.
            All or nothing.
            """
            with postgres_db.transaction():
                user = postgres_db.execute(
                    "INSERT INTO users (email, password_hash, created_at) "
                    "VALUES (%s, %s, NOW()) RETURNING user_id",
                    (email, hash_password(password))
                )
    
                postgres_db.execute(
                    "INSERT INTO user_profiles (user_id, name) VALUES (%s, %s)",
                    (user.user_id, name)
                )
    
                return user.user_id
    
        def get_user(self, user_id):
            """
            Joining related tables - SQL's strength
            """
            return postgres_db.query(
                "SELECT u.*, p.name, p.bio, p.avatar_url "
                "FROM users u "
                "JOIN user_profiles p ON u.user_id = p.user_id "
                "WHERE u.user_id = %s",
                (user_id,)
            )
    
    # 2. Post Service (MongoDB)
    class PostService:
        def create_post(self, user_id, content, media_urls=None):
            """
            Posts have flexible schema - some have images, some videos, etc.
            MongoDB's document model fits naturally.
            """
            post = {
                'user_id': user_id,
                'content': content,
                'media_urls': media_urls or [],
                'created_at': datetime.now(),
                'stats': {'likes': 0, 'comments': 0, 'shares': 0},
                'comments': []  # Embedded for fast reads
            }
    
            result = mongo_db.posts.insert_one(post)
    
            # Async: update follower feeds
            task_queue.enqueue(update_feeds, user_id, str(result.inserted_id))
    
            return str(result.inserted_id)
    
        def add_comment(self, post_id, user_id, text):
            """
            Embedded comments - one query to get post with all comments
            """
            mongo_db.posts.update_one(
                {'_id': ObjectId(post_id)},
                {
                    '$push': {
                        'comments': {
                            'comment_id': generate_id(),
                            'user_id': user_id,
                            'text': text,
                            'created_at': datetime.now()
                        }
                    },
                    '$inc': {'stats.comments': 1}
                }
            )
    
    # 3. Social Graph Service (Neo4j)
    class SocialGraphService:
        def follow_user(self, follower_id, followee_id):
            """
            Neo4j excels at relationship operations
            """
            query = """
            MATCH (follower:User {user_id: $follower_id})
            MATCH (followee:User {user_id: $followee_id})
            CREATE (follower)-[r:FOLLOWS {created_at: datetime()}]->(followee)
            RETURN r
            """
            neo4j_db.run(query, follower_id=follower_id, followee_id=followee_id)
    
        def get_friend_suggestions(self, user_id, limit=10):
            """
            Friends of friends - natural in graph DB, nightmare in SQL
            """
            query = """
            MATCH (user:User {user_id: $user_id})-[:FOLLOWS*2..2]->(suggestion)
            WHERE NOT (user)-[:FOLLOWS]->(suggestion)
              AND user <> suggestion
            WITH suggestion, COUNT(*) as strength
            ORDER BY strength DESC
            LIMIT $limit
            RETURN suggestion.user_id as user_id, strength
            """
            return neo4j_db.run(query, user_id=user_id, limit=limit)
    
    # 4. Feed Service (Redis)
    class FeedService:
        def get_feed(self, user_id, page=1, per_page=20):
            """
            Feeds must be blazing fast - Redis cache to the rescue
            """
            cache_key = f"feed:{user_id}:page:{page}"
    
            # Try cache first
            cached = redis_client.get(cache_key)
            if cached:
                return json.loads(cached)
    
            # Cache miss - build feed from MongoDB
            following = social_graph.get_following(user_id)
    
            posts = mongo_db.posts.find(
                {'user_id': {'$in': following}}
            ).sort('created_at', -1).skip((page-1)*per_page).limit(per_page)
    
            feed = list(posts)
    
            # Cache for 5 minutes
            redis_client.setex(cache_key, 300, json.dumps(feed, default=str))
    
            return feed
    
        def invalidate_feed(self, user_id):
            """
            When a user posts, invalidate follower caches.
            Redis DEL doesn't accept wildcards, so scan for matching keys first.
            """
            followers = social_graph.get_followers(user_id)
            pipeline = redis_client.pipeline()
            for follower_id in followers:
                for key in redis_client.scan_iter(f"feed:{follower_id}:page:*"):
                    pipeline.delete(key)
            pipeline.execute()
    
    # 5. Analytics Service (InfluxDB)
    class AnalyticsService:
        def track_post_view(self, post_id, user_id):
            """
            Millions of events per day - time-series DB handles it
            """
            point = Point("post_views") \
                .tag("post_id", str(post_id)) \
                .tag("user_id", str(user_id)) \
                .field("value", 1)
    
            influx_write_api.write(bucket="analytics", record=point)
    
        def get_trending_posts(self, hours=1):
            """
            Time-based aggregation - InfluxDB's specialty
            """
            query = f'''
            from(bucket: "analytics")
              |> range(start: -{hours}h)
              |> filter(fn: (r) => r["_measurement"] == "post_views")
              |> group(columns: ["post_id"])
              |> count()
              |> sort(columns: ["_value"], desc: true)
              |> limit(n: 20)
            '''
            return influx_query_api.query(query)

    Why This Architecture Works

    Each database does what it's optimized for:

    - PostgreSQL ensures user account integrity with ACID transactions
    - MongoDB handles high-volume, flexible post data
    - Neo4j makes graph queries (friend suggestions) trivial
    - Redis provides sub-millisecond feed access
    - InfluxDB efficiently stores and queries millions of metric events

    Could we use just PostgreSQL? Technically yes. But we'd struggle with feed performance, graph queries would be complex, and analytics would slow down operational tables.

    Making the Right Choice: A Decision Framework

    Here's the framework I use:

    Step 1: Understand Your Data

    - Structured or flexible?
    - How are entities related?
    - How often does schema change?

    Step 2: Define Consistency Requirements

    - Can you tolerate eventual consistency?
    - Are there regulatory requirements?

    Step 3: Consider Scale

    - How much data?
    - Reads vs writes per second?
    - Need for global distribution?

    Step 4: Evaluate Your Team

    - What's your team familiar with?
    - Operational expertise available?

    Step 5: Start Simple, Scale Pragmatically

    - Don't optimize prematurely
    - Begin with what you know
    - Scale when you have real needs

    Common Pitfalls

    Choosing based on hype: Netflix's needs aren't your startup's needs.

    Premature optimization: PostgreSQL can handle millions of users if designed well.

    Ignoring operations: A database is only as good as your ability to back it up and recover from failures.

    Not planning for growth: Design your application with clear service boundaries so you can migrate later if needed.
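
    One way to keep that migration option open is to hide each data store behind a small interface. Here's a sketch (the names are illustrative, not a prescribed pattern): the rest of the application depends on the interface, so swapping PostgreSQL for MongoDB later stays a local change.

    python
    from abc import ABC, abstractmethod
    from typing import Any

    class PostStore(ABC):
        """Boundary the rest of the app talks to - no driver details leak out."""

        @abstractmethod
        def save_post(self, user_id: int, content: str) -> str: ...

        @abstractmethod
        def recent_posts(self, user_id: int, limit: int = 50) -> list[dict[str, Any]]: ...

    class InMemoryPostStore(PostStore):
        """Stand-in backend; a Postgres- or Mongo-backed store exposes the same shape."""

        def __init__(self) -> None:
            self._posts: list[dict[str, Any]] = []

        def save_post(self, user_id: int, content: str) -> str:
            post = {"id": str(len(self._posts) + 1), "user_id": user_id, "content": content}
            self._posts.append(post)
            return post["id"]

        def recent_posts(self, user_id: int, limit: int = 50) -> list[dict[str, Any]]:
            mine = [p for p in self._posts if p["user_id"] == user_id]
            return list(reversed(mine[-limit:]))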

    Key Takeaways

    1. No perfect database exists - every choice involves trade-offs
    2. Polyglot persistence is normal - use multiple database types
    3. Understand ACID and CAP - these have real implications
    4. Start simple, scale pragmatically - don't optimize prematurely
    5. Operations matter - consider backups, monitoring, recovery

    What's Next?

    Tomorrow we're diving into Caching Strategies (Day 5). You'll learn how to serve millions of requests without constantly hitting your database, understand cache invalidation patterns, and design multi-layer caching strategies.

    Understanding databases is crucial, but understanding how to avoid hitting them unnecessarily is just as important.

    ---

    Practice Exercise: Design database architecture for a ride-sharing app:

    - Driver and rider profiles
    - Real-time location tracking
    - Ride matching and pricing
    - Payment processing
    - Trip history and ratings

    For each data type, choose a database and justify why. What consistency level does it need? What's the read/write pattern? Share your design in the comments!
