Building Real-Time Recommendation Systems at Scale with Jason Liu

An interview with Jason Liu on building scalable real-time recommendation systems. In this video interview, Jason, a seasoned expert in building recommendation systems, shares insights from his time working on recsys at Stitch Fix and Meta. Learn how he tackled the complexities of real-time recommendations and the lessons he gathered along the way.

Introduction to Jason Liu

Jason Liu's experience spans various high-profile tech companies, including Meta and Stitch Fix, where he led efforts to build and maintain scalable recommendation systems. He currently works as an independent consultant at the forefront of bringing RAG to production software. His work has driven significant business outcomes and given him a wealth of knowledge about the intricacies of real-time data processing, team management, and system optimization. Check out Jason’s blog here.

Q&A

What are the main challenges in building and maintaining recommendation systems at scale?

The biggest challenge lies in team dynamics. At Stitch Fix, we had multiple teams working on different subsystems. The risk of team members leaving or moving to other projects creates a knowledge gap. Having a consistent framework helps mitigate this risk, allowing seamless transitions between teams while maintaining performance. Additionally, observability is crucial. Monitoring end-to-end systems helps identify performance bottlenecks and allocate responsibilities.
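As a rough illustration of the observability point, the sketch below times each stage of a request so the slowest one can be identified and routed to the team that owns it. The stage names (retrieval, scoring, ranking) and the sleep calls are placeholders chosen for this example, not details from the interview, and a production system would more likely rely on a tracing or metrics library.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Durations in seconds, keyed by stage name (stage names are hypothetical).
stage_timings = defaultdict(list)


@contextmanager
def timed_stage(name):
    """Record the wall-clock duration of one pipeline stage."""
    start = time.perf_counter()
    try:
        yield
    finally:
        stage_timings[name].append(time.perf_counter() - start)


def slowest_stage(timings):
    """Return the stage name with the highest mean duration."""
    return max(timings, key=lambda s: sum(timings[s]) / len(timings[s]))


def handle_request():
    # The sleeps stand in for real work done by each subsystem.
    with timed_stage("retrieval"):
        time.sleep(0.001)
    with timed_stage("scoring"):
        time.sleep(0.05)
    with timed_stage("ranking"):
        time.sleep(0.001)


handle_request()
print("slowest stage:", slowest_stage(stage_timings))
```

Tagging each measurement with the owning team, rather than only the stage, is one way to connect a bottleneck to a responsibility.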

How large were the teams you worked with on recommendation systems, and how did that impact efficiency?

At Stitch Fix, we had about 100-200 data scientists. Teams ranged from three to four people working on specific models to larger groups managing different aspects like inventory. Coordination among these teams was complex. While specialization has its benefits, it also creates communication challenges, especially when integrating various subsystems.

What are the technical challenges outside of team dynamics when building recommendation systems from scratch?

Transitioning from batch processing to real-time recommendations is a significant challenge. At Stitch Fix, we had developed deep expertise in batch processing, which made the shift to real-time a prolonged effort of six to eight months. The key issue was syncing data in near real-time so that users received relevant recommendations immediately after signing up.

What does "real-time" mean in the context of recommendation systems?

Real-time means that users receive personalized recommendations almost instantly after interacting with the system. For example, at Stitch Fix, the goal was to show relevant items immediately after a user completed their profile. This level of immediacy enhances user engagement and satisfaction.

Have there been any catastrophic failures in your experience, and how were they handled?

Yes. We've faced issues such as embedding models producing NaNs, or missing features, that disrupted inventory retrieval. Debugging these problems required a robust pipeline to pinpoint where the breakdown occurred, whether in the inventory, the scoring algorithm, or another subsystem.
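One way to catch this class of failure early is a validation pass before items reach retrieval. The sketch below is a minimal example under assumed conditions: the record layout (an "embedding" list and a "features" dict), the feature names, and the sample items are all hypothetical and do not describe Stitch Fix's actual pipeline.

```python
import math


def find_bad_items(items, required_features):
    """Return {item_id: [reasons]} for items that fail basic sanity checks."""
    problems = {}
    for item_id, record in items.items():
        reasons = []
        embedding = record.get("embedding") or []
        if not embedding:
            reasons.append("missing embedding")
        elif any(math.isnan(x) or math.isinf(x) for x in embedding):
            reasons.append("non-finite embedding value")
        features = record.get("features") or {}
        for name in required_features:
            if features.get(name) is None:
                reasons.append(f"missing feature: {name}")
        if reasons:
            problems[item_id] = reasons
    return problems


# Toy inventory, invented for illustration only.
items = {
    "sku-1": {"embedding": [0.1, 0.2], "features": {"size": "M", "price": 40.0}},
    "sku-2": {"embedding": [float("nan"), 0.3], "features": {"size": "S", "price": 25.0}},
    "sku-3": {"embedding": [0.5, 0.6], "features": {"size": None, "price": 60.0}},
}

bad = find_bad_items(items, required_features=["size", "price"])
for item_id, reasons in bad.items():
    print(item_id, reasons)
```

Running a check like this at each subsystem boundary, and logging which boundary flagged the item, helps narrow down where a breakdown started.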

What are the hesitations you might have about using third-party solutions like Shaped for recommendation systems?

The main concerns are data ingestion and integration. Stitch Fix had its own data pipelines and warehouses, which made integrating new solutions challenging. However, if a third-party tool could seamlessly integrate and demonstrate superior performance, it would be worth considering as a feature in a stacked system.

What are some hidden costs or potential cost blowouts teams should be aware of?

Auto-scaling and infrastructure costs can escalate quickly, especially with real-time systems. We had unexpected marketing campaigns that overwhelmed servers, leading to downtime and lost revenue. Ensuring proper auto-scaling and communication between teams is crucial.

How do you handle data drift in a dynamic environment like fashion?

We managed it by aligning inventory with the right customer segments. For instance, a marketing campaign that attracts the wrong demographic can lead to a mismatch between inventory and user preferences. Regularly reviewing and adjusting strategies helps mitigate this issue.

From a business perspective, how do you know when your recommendation system is "good enough"?

It's about balancing offline evaluation metrics with business outcomes. For example, improvements in offline AUC should correlate with increased revenue. Experimentation and quick iterations help determine the effectiveness of new models and features.
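For readers less familiar with the offline side of that comparison, here is a small sketch of how AUC can be computed for two models on the same labeled data. The labels and scores are toy values invented for this example, and the pairwise loop is quadratic, so it is meant for illustration rather than production use. Whether a higher offline AUC actually translates into revenue still has to be checked with an online experiment, which this sketch does not cover.

```python
def auc(labels, scores):
    """Probability that a random positive outscores a random negative.

    Ties between a positive and a negative count as half a win.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data: 1 = the user engaged with the item, 0 = they did not.
labels = [1, 0, 1, 0, 1, 0]
baseline_scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.7]
candidate_scores = [0.9, 0.2, 0.7, 0.3, 0.6, 0.4]

print("baseline AUC: ", round(auc(labels, baseline_scores), 3))
print("candidate AUC:", round(auc(labels, candidate_scores), 3))
```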

Has your perspective on building vs. buying solutions changed over time?

Absolutely. Initially, we were a build-centric team, but now I lean towards buying solutions to save time and resources. The goal is to focus on business outcomes rather than the technical details. Using third-party tools can accelerate time to market and allow us to concentrate on core business problems.

Closing

Jason Liu's journey in building real-time recommendation systems highlights the importance of team dynamics, robust infrastructure, and a strategic approach to technology adoption. His insights provide valuable lessons for anyone looking to enhance their recommendation systems.

Takeaway #1: Team dynamics and consistent frameworks are crucial for maintaining scalable recommendation systems.

Takeaway #2: Real-time processing requires significant effort but offers substantial benefits in user engagement.

Takeaway #3: Leveraging third-party solutions can save time and resources, allowing teams to focus on core business objectives.
