Recommendation systems are the engines powering personalization across the web, from YouTube videos and Spotify playlists to Amazon products and TikTok feeds. Their goal is simple: connect users with items they'll love. But behind this simple goal lies a massive challenge: scale. Modern platforms deal with millions, even billions, of users and items. How can we efficiently find the few truly relevant items for a specific user from such a vast ocean of possibilities?
While traditional methods like Collaborative Filtering and Matrix Factorization laid the groundwork, they often struggle with the sheer scale and the need to incorporate rich user/item features. Deep learning opened new doors, but naively applying complex models to score every user-item pair is computationally infeasible at inference time.
Enter the Two-Tower Model. This deep learning architecture has become a cornerstone of large-scale industrial recommendation systems, particularly famous for its role in candidate generation. It elegantly balances powerful representation learning with the strict efficiency requirements of real-time recommendations.
This post provides a deep dive into the two-tower architecture. We'll cover:
- What the two-tower model is and how it works.
- Why it's so effective for scalable recommendations.
- Key design choices and variations (features, networks, training).
- Its primary role in candidate generation.
- Its advantages, limitations, and comparisons to other models.
- Current challenges and exciting future research directions.
Let's dive in!
What is the Two-Tower Model? Core Architecture Explained
At its heart, the two-tower model separates the computation for users and items into two distinct neural networks – the "towers."
- The User Tower: This network takes various user-related inputs (like user ID, demographics, historical interactions, device, context) and processes them through layers (embedding layers, MLPs, RNNs, etc.) to output a single vector: the user embedding u. This vector represents the user's preferences and characteristics in a dense, low-dimensional space.
u = Tower_User(User_Features)
- The Item Tower: Similarly, this network takes item-related inputs (item ID, category, description, image features, etc.) and processes them through its own set of layers to output an item embedding v. This vector represents the item's properties in the same embedding space.
v = Tower_Item(Item_Features)
- Scoring in the Embedding Space: The magic happens when we need to predict the affinity between a user and an item. Instead of feeding all features into one giant network, the two-tower model calculates the score directly from the two embeddings u and v, typically using a simple similarity function (a code sketch of the full setup follows this list):
- Dot Product (most common):
Score(u, v) = u ⋅ v
- Cosine Similarity:
Score(u, v) = (u ⋅ v) / (||u|| ||v||)
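To make this concrete, here is a minimal PyTorch-style sketch of the two towers and the dot-product score. The feature shapes, dimensions, and dummy inputs are illustrative assumptions, not a production setup:

```python
import torch
import torch.nn as nn

class Tower(nn.Module):
    """Generic tower: embed an ID, concatenate dense features, project into the shared space."""
    def __init__(self, num_ids: int, dense_dim: int, embed_dim: int = 64):
        super().__init__()
        self.id_embedding = nn.Embedding(num_ids, embed_dim)
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim + dense_dim, 128),
            nn.ReLU(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, ids: torch.Tensor, dense: torch.Tensor) -> torch.Tensor:
        x = torch.cat([self.id_embedding(ids), dense], dim=-1)
        return self.mlp(x)  # (batch, embed_dim)

# u = Tower_User(User_Features), v = Tower_Item(Item_Features)
user_tower = Tower(num_ids=100_000, dense_dim=8)     # e.g., age, activity counts
item_tower = Tower(num_ids=1_000_000, dense_dim=16)  # e.g., price, popularity stats

# Dummy batch of 4 (user, item) pairs
u = user_tower(torch.randint(0, 100_000, (4,)), torch.randn(4, 8))
v = item_tower(torch.randint(0, 1_000_000, (4,)), torch.randn(4, 16))

score = (u * v).sum(dim=-1)  # dot product: Score(u, v) = u ⋅ v
```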

The Scalability Breakthrough: Training vs. Serving
The true elegance of the two-tower model shines during serving (inference):
- Training: The two towers are trained jointly, end-to-end. The goal is to learn an embedding space in which the similarity score (e.g., dot product) between a user embedding u and the embeddings v of relevant items is high, and low for irrelevant items. This typically involves optimizing loss functions like log loss (pointwise), BPR loss (pairwise), or contrastive losses, often with negative sampling (including efficient in-batch negative sampling).
- Serving (Offline Item Computation): Because the item tower only depends on item features, you can pre-compute the embeddings v for all items in your corpus (potentially billions!) offline and store them.
- Serving (Online Retrieval): When a user request comes in:
- Compute the user embedding u in real-time using the user tower (fast, as it's one forward pass).
- Use this user embedding u to query the pre-computed item embeddings. Since calculating u ⋅ v for billions of items is still too slow, we use Approximate Nearest Neighbor (ANN) search techniques (e.g., Faiss, ScaNN, HNSW). ANN allows us to efficiently find the items whose embeddings v have the highest dot product (or cosine similarity) with the user embedding u, retrieving the top-K candidates in milliseconds.
This decoupling makes retrieving relevant candidates from massive catalogs feasible.
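As an illustration of this offline/online split, here is a small sketch using Faiss. Random arrays stand in for the tower outputs, and exact inner-product search is used for simplicity; a real deployment would use an approximate index such as IVF or HNSW:

```python
import numpy as np
import faiss  # pip install faiss-cpu

embed_dim = 64

# Offline: pre-compute embeddings v for every item with the item tower, then index them.
item_embeddings = np.random.rand(100_000, embed_dim).astype("float32")  # stand-in for Tower_Item outputs
index = faiss.IndexFlatIP(embed_dim)  # exact inner-product search; swap for an ANN index at real scale
index.add(item_embeddings)

# Online: one forward pass of the user tower, then a top-K nearest-neighbor query.
user_embedding = np.random.rand(1, embed_dim).astype("float32")  # stand-in for Tower_User output
scores, candidate_ids = index.search(user_embedding, 500)        # top-500 candidates by dot product
```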
Dissecting the Towers: Features, Architectures, and Training
The performance of a two-tower model heavily depends on how you build the towers and train the system.
Input Features are Key: Effectively representing users and items is crucial.
- IDs (User/Item): Learned via embedding layers.
- Categorical Features: Also use embedding layers (e.g., item category, user location).
- Numerical Features: Often normalized and fed into MLPs (e.g., user age, item price).
- Text Features: Processed using anything from simple TF-IDF to sophisticated Transformer embeddings (like BERT).
- Image Features: Often derived from pre-trained CNNs.
- Sequential Features (User History): Modeled using RNNs (LSTM/GRU), CNNs, or Attention/Transformers (e.g., BERT4Rec, SASRec) within the user tower.
Tower Network Architectures:
- MLPs: An easy starting point for combining various embedded and numerical features within each tower.
- Specialized Networks: Towers can incorporate CNNs, RNNs, or Transformers to handle specific modalities (text, image, sequence) before feeding into final MLP layers; a sequence-aware user tower is sketched after this list. GNNs (like in PinSage) can also be used if graph data is available.
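As one example of a specialized tower, here is a sketch of a user tower that encodes the interaction history with a GRU before the final MLP. Sizes and dummy inputs are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SequenceUserTower(nn.Module):
    """User tower that encodes the interaction history with a GRU before the final MLP."""
    def __init__(self, num_items: int, embed_dim: int = 64, hidden_dim: int = 128):
        super().__init__()
        self.item_embedding = nn.Embedding(num_items, embed_dim, padding_idx=0)
        self.gru = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim, 128), nn.ReLU(), nn.Linear(128, embed_dim)
        )

    def forward(self, history: torch.Tensor) -> torch.Tensor:
        # history: (batch, seq_len) of item IDs the user interacted with (0 = padding)
        seq = self.item_embedding(history)       # (batch, seq_len, embed_dim)
        _, last_hidden = self.gru(seq)           # last_hidden: (1, batch, hidden_dim)
        return self.mlp(last_hidden.squeeze(0))  # user embedding u

tower = SequenceUserTower(num_items=1_000_000)
history = torch.randint(1, 1_000_000, (4, 20))   # 4 users, 20 recent items each
u = tower(history)                               # (4, 64)
```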
Training Objectives and Strategies:
- Loss Functions: Pointwise (Log Loss, MSE), Pairwise (BPR), Listwise, or increasingly popular Contrastive Losses (like InfoNCE) using in-batch negatives.
- Negative Sampling: Essential for implicit feedback. Strategies range from random sampling to popularity-based (beware bias!) and hard negative mining (selecting challenging negatives). In-batch negatives are often highly effective and efficient (see the loss sketch after this list).
- Regularization & Optimization: Standard deep learning techniques (dropout, batch norm, Adam optimizer) apply. Temperature scaling in contrastive loss is an important hyperparameter.
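To illustrate the in-batch negative idea, here is a minimal sketch of an InfoNCE-style loss with temperature scaling. The cosine normalization and temperature value are illustrative choices, not the only valid setup:

```python
import torch
import torch.nn.functional as F

def in_batch_softmax_loss(user_emb: torch.Tensor, item_emb: torch.Tensor, temperature: float = 0.05):
    """InfoNCE-style loss: row i's positive item sits on the diagonal;
    every other item in the batch acts as a negative for user i."""
    user_emb = F.normalize(user_emb, dim=-1)
    item_emb = F.normalize(item_emb, dim=-1)
    logits = user_emb @ item_emb.T / temperature  # (batch, batch) similarity matrix
    labels = torch.arange(logits.size(0), device=logits.device)
    return F.cross_entropy(logits, labels)

# Embeddings produced by the two towers for a batch of (user, clicked item) pairs
loss = in_batch_softmax_loss(torch.randn(256, 64), torch.randn(256, 64))
```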
Where Two-Towers Shine: Candidate Generation and Beyond
The dominant application is candidate generation in multi-stage recommendation pipelines:
- Retrieval (Candidate Generation): The two-tower model + ANN rapidly narrows down the millions/billions of items to a manageable set of hundreds or thousands (the "candidates"). Recall and efficiency are paramount here.
- Ranking: A second, more complex model (the "ranker") takes these candidates and uses richer features, including explicit cross-features between user and item (which the two-tower model struggles with), to precisely re-rank them. Precision is the focus here. Examples of rankers include Wide & Deep, DCN, and DeepFM.
While candidate generation is the primary use case, variations can be used for:
- Related Item Recommendation: Finding nearest neighbors in the item embedding space, either reusing the trained item tower or training an item-only variant (sketched after this list).
- Direct Ranking (Smaller Catalogs): If the item set isn't massive, the two-tower score itself might suffice for ranking.
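Related-item lookup reuses the same machinery as user retrieval: index the item embeddings and query with an item instead of a user. A minimal sketch, with a random array standing in for the item tower's outputs:

```python
import numpy as np
import faiss

embed_dim = 64
item_embeddings = np.random.rand(100_000, embed_dim).astype("float32")  # from the item tower
faiss.normalize_L2(item_embeddings)          # normalize so inner product = cosine similarity
index = faiss.IndexFlatIP(embed_dim)
index.add(item_embeddings)

query_item_id = 1234
query = item_embeddings[query_item_id : query_item_id + 1]
_, neighbors = index.search(query, 11)       # first hit is the query item itself
related_items = neighbors[0][1:]             # 10 "users also liked" candidates
```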
Two-Tower Models: The Pros and Cons
Advantages:
- ✅ Highly Scalable: Handles enormous item catalogs efficiently via ANN search at serving time.
- ✅ Efficient Inference: Pre-computation of item embeddings drastically reduces online latency.
- ✅ Flexible Feature Integration: Easily incorporates diverse feature types within each tower.
- ✅ Effective Representation Learning: Learns meaningful user and item embeddings capturing complex patterns.
- ✅ Modular Design: User and item towers can often be iterated upon somewhat independently.
Disadvantages:
- ❌ Limited Feature Interactions: By design, it doesn't explicitly model interactions between user and item features until the final dot product. This limits its ability to capture fine-grained conditional preferences (e.g., "user likes this brand but only in that category"). This is why a separate ranker is usually needed.
- ❌ Cold-Start Challenges: Performance can suffer for new users/items with few features or interactions.
- ❌ Potential for Bias: Like any model learning from historical data, it can capture and even amplify biases (popularity, exposure, etc.). Negative sampling strategy heavily influences this.
- ❌ Simple Scoring Function: Dot product/cosine similarity might be too simplistic for complex user-item affinity.
Two-Towers vs. Other RecSys Models (MF, GNNs, Rankers)
- Matrix Factorization (MF): Think of two-towers as a powerful, non-linear generalization of MF that can incorporate rich side features. Basic MF is essentially a two-tower model whose towers are nothing more than ID embedding lookups scored with a dot product (see the sketch after this list).
- Factorization Machines (FMs) / Deep Rankers (DeepFM, DCN): These excel at modeling feature interactions but are too slow to score the entire item corpus. They are typically used after the two-tower model in the ranking stage.
- Graph Neural Networks (GNNs): GNNs directly model the user-item interaction graph. They can be part of a tower (e.g., PinSage used a GNN for item embeddings) or serve as an alternative architecture. GNNs inherently capture collaborative signals through message passing. The choice often depends on data structure and specific goals.
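To see the MF connection concretely, here is a toy sketch in which each "tower" degenerates to an ID embedding lookup. The dimensions and IDs are arbitrary:

```python
import torch
import torch.nn as nn

class MFAsTwoTowers(nn.Module):
    """Matrix factorization viewed as a degenerate two-tower model:
    each tower is just an ID embedding table, scored by a dot product."""
    def __init__(self, num_users: int, num_items: int, dim: int = 32):
        super().__init__()
        self.user_emb = nn.Embedding(num_users, dim)  # "user tower"
        self.item_emb = nn.Embedding(num_items, dim)  # "item tower"

    def forward(self, user_ids: torch.Tensor, item_ids: torch.Tensor) -> torch.Tensor:
        return (self.user_emb(user_ids) * self.item_emb(item_ids)).sum(-1)

model = MFAsTwoTowers(num_users=10_000, num_items=50_000)
scores = model(torch.tensor([1, 2]), torch.tensor([10, 20]))
```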
Two-Tower Models in Practice: An Example with Shaped
Platforms like Shaped make deploying powerful retrieval models like the Two-Tower architecture straightforward. You can configure it as an embedding_policy to handle the candidate generation stage efficiently.
Here's a simplified example of how you might configure a Two-Tower model within Shaped. The snippet below is an illustrative sketch built from the fields discussed next; check Shaped's documentation for the exact schema:
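```yaml
# Illustrative sketch only -- field names follow the discussion below;
# consult Shaped's documentation for the exact schema and defaults.
embedding_policy:
  policy_type: two-tower
  embedding_dims: 128         # size of the user vector u and item vector v
  negative_samples_count: 100
  n_epochs: 10
  batch_size: 1024
  lr: 0.001
scoring_policy:
  policy_type: lightgbm       # re-ranks the retrieved candidates
```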
In this configuration:
- We define an embedding_policy with policy_type: two-tower.
- embedding_dims: Sets the size of the user and item vectors (u and v).
- negative_samples_count: Controls the contrastive learning aspect during training.
- Standard deep learning hyperparameters like n_epochs, batch_size, and lr (learning rate) are specified.
- This Two-Tower model generates candidate embeddings, which are then typically passed to a scoring_policy (like lightgbm shown here) for the final ranking stage.
By using Shaped, you can leverage the scalability and effectiveness of the Two-Tower architecture for candidate retrieval without managing the complexities of ANN indexing, feature preprocessing pipelines, and distributed training infrastructure yourself.
Wrapping Up: The Enduring Impact of the Two-Tower Model
The two-tower model is more than just another deep learning architecture; it's a practical and powerful solution to a fundamental problem in large-scale recommendation: balancing relevance with efficiency. By smartly decoupling user and item representations and leveraging the speed of ANN search, it enables personalized recommendations over massive catalogs.
While it has limitations, particularly in modeling fine-grained feature interactions (often delegated to a subsequent ranking stage), its scalability, flexibility, and proven effectiveness have made it an indispensable tool for companies like Google, Facebook, LinkedIn, Pinterest, and many others. As research continues to address its challenges and explore new variations, the two-tower paradigm is set to remain a vital part of the recommendation system landscape for years to come.
Further Reading / References
- Covington et al. (2016). Deep Neural Networks for YouTube Recommendations. (Seminal paper)
- Ying et al. (2018). Graph Convolutional Neural Networks for Web-Scale Recommender Systems. (PinSage - GNNs in towers)
- Huang et al. (2020). Embedding-based Retrieval in Facebook Search. (Industrial perspective, negative sampling)
- Papers on ANN libraries like Faiss, ScaNN.
- Recent papers from conferences like RecSys, KDD, WSDM, TheWebConf focusing on retrieval, negative sampling, or contrastive learning for recommendations.