These are my notes based on the paper WTF: The Who to Follow Service at Twitter.
TL;DR
The paper covers Twitter’s recommendation system designed to suggest new users for others to follow. Its goal is to grow and retain Twitter’s active user base by leveraging the large-scale social graph to make relevant user recommendations. They built an in-memory graph processing engine, Cassovary, which allowed the entire Twitter graph to be processed on a single server, significantly reducing complexity and improving performance.
They leveraged two key algorithms - random walks and SALSA - to analyze the graph and generate high-quality recommendations. Over time, the system hit its scaling limits, so Twitter re-architected the solution and distributed the graph across multiple servers. Three things I found interesting were
- in-memory graph processing on a single server
- the design of Cassovary - their in-memory graph processing engine
- combining random walks and SALSA for better recommendations
Three interesting things
Continuing the discussion from the above TL;DR, here are three things that I found interesting in this paper and some quick details about each.
In-Memory Graph Processing on a Single Server
Processing the entire Twitter graph in memory on a single server is an interesting design decision. It greatly simplifies the architecture and avoids the complexities of distributed systems, which allowed quicker development cycles and reduced the number of potential failure points. It also offered significant speed advantages over disk-based or distributed approaches. What stood out was how access to the graph data takes microseconds instead of milliseconds (or worse).
The trade-offs, of course, include scalability concerns. As the graph grew, the in-memory solution reached the limits of the hardware’s memory capacity, which required Twitter to transition to a distributed architecture in the second-generation system. This decision highlights a classic trade-off - simplicity vs scalability. It is a rare example of an extremely large-scale system opting for a centralized approach, and it worked well enough to scale Twitter through its critical growth phase.
Cassovary: In-Memory Graph Processing Engine
Cassovary was a custom-built solution that Twitter later open-sourced. It was designed for memory efficiency, using adjacency lists and compact node representations. Although built for WTF, Cassovary’s utility extended to search, discovery, and promoted products.
Random Walks + SALSA
Instead of relying solely on standard ML models, WTF leverages the structure of the graph itself to generate recommendations. This is a simple yet nuanced implementation. Random walks are well suited to exploring local graph neighborhoods: by simulating walks from a user’s node, WTF identifies other users who are closely related to that user. SALSA (Stochastic Approach for Link-Structure Analysis) helps identify influential users who are likely to be good follow recommendations.
By combining random walks with SALSA, Twitter was able to capture both the local (neighborhood-level) and the global (community-level) structure. I found this interesting because the combination addresses the shortcomings of each individual approach and, in the process, yields a good recommendation system.
Notes and a quick explanation
Cassovary
One of the most striking design decisions in building the WTF service was the choice to process the entire Twitter graph in memory on a single server. This decision was driven by several factors:
- significantly reduced architectural complexity
- simplified both development and deployment
- faster access to graph data through in-memory processing
- quicker recommendation computations
- no disk I/O on the serving path, which improves overall throughput
In Cassovary, nodes represent users and edges represent follow relationships. It is designed to handle massive scale through efficient data structures and algorithms that minimize memory usage while maximizing performance.
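To make the adjacency-list idea concrete, here is a minimal Python sketch of a follow graph. This is my own illustration, not Cassovary’s actual API - the real engine is a Scala/JVM library with far more compact node representations:

```python
from collections import defaultdict

class FollowGraph:
    """Toy adjacency-list graph: nodes are user IDs, edges are follows."""

    def __init__(self):
        self.out_edges = defaultdict(list)  # user -> users they follow
        self.in_edges = defaultdict(list)   # user -> users who follow them

    def add_follow(self, follower, followee):
        self.out_edges[follower].append(followee)
        self.in_edges[followee].append(follower)

    def following(self, user):
        return self.out_edges[user]

    def followers(self, user):
        return self.in_edges[user]


g = FollowGraph()
g.add_follow(1, 2)
g.add_follow(1, 3)
g.add_follow(2, 3)
print(g.following(1))  # [2, 3]
print(g.followers(3))  # [1, 2]
```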
Graph Recommendation Algorithms
Random Walks
Random walks are a fundamental technique in graph processing used to explore the connections between nodes. In WTF,
- the walk starts from a given user
- the algorithm randomly traverses the graph by moving from one user to another through their follow relationships
The goal is to identify nodes (users) that are closely connected to the starting user, either directly or through a small number of intermediaries. Note: because the walk is random, the process is non-deterministic, which introduces useful variety into the generated recommendations. The algorithm runs multiple random walks from a user’s node and collects statistics on the nodes visited. These statistics are then used to rank potential follow recommendations.
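Here is a toy Python sketch of that idea. It is not the paper’s exact algorithm - WTF’s egocentric walks behave more like personalized PageRank, with resets back to the starting user, and the walk counts and lengths below are made up - but it shows how visit statistics from repeated walks can rank candidates:

```python
import random
from collections import Counter

def random_walk_recommendations(following, start, num_walks=1000,
                                walk_length=5, top_k=5):
    """Run many short random walks from `start` along follow edges and
    rank the most frequently visited users as candidates.
    `following` maps a user ID to the list of users they follow."""
    visits = Counter()
    for _ in range(num_walks):
        node = start
        for _ in range(walk_length):
            neighbors = following.get(node, [])
            if not neighbors:
                break
            node = random.choice(neighbors)
            if node != start:
                visits[node] += 1
    # Exclude users the starting user already follows
    already_followed = set(following.get(start, []))
    ranked = [(u, c) for u, c in visits.most_common() if u not in already_followed]
    return ranked[:top_k]


# Tiny example graph: user 1 follows 2 and 3, who in turn follow 4 and 5
graph = {1: [2, 3], 2: [4], 3: [4, 5], 4: [5], 5: []}
print(random_walk_recommendations(graph, start=1))
```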
SALSA: Stochastic Approach for Link-Structure Analysis
SALSA is another algorithm employed in WTF, designed for ranking nodes in bipartite graphs. In the context of Twitter:
- the follow graph can be recast as a bipartite graph for this purpose
- one side holds “hubs” - the user’s circle of trust, obtained from the random walk - while the other side holds “authorities” - the accounts those hubs follow
SALSA operates by iterating between these two sets of nodes, assigning scores to nodes based on their connections. This helps identify influential users who make good recommendation candidates. WTF combines SALSA with random walks to refine recommendations further, leveraging the best of both worlds.
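Below is a rough sketch of the SALSA iteration itself. The degree-normalized back-and-forth scoring follows the standard SALSA description; the function name and the toy graph are mine, not from the paper:

```python
def salsa(hub_edges, num_iters=10):
    """Simplified SALSA power iteration on a bipartite graph.

    `hub_edges` maps each hub node to the authority nodes it points to.
    Scores flow hub -> authority -> hub, normalized by degree at each
    step, so well-connected authorities accumulate higher scores.
    This is an illustrative sketch, not the production implementation."""
    auth_edges = {}  # authority -> hubs pointing at it
    for hub, auths in hub_edges.items():
        for a in auths:
            auth_edges.setdefault(a, []).append(hub)

    hub_scores = {h: 1.0 for h in hub_edges}
    auth_scores = {a: 0.0 for a in auth_edges}

    for _ in range(num_iters):
        # hub -> authority: each hub splits its score over its out-edges
        auth_scores = {a: 0.0 for a in auth_edges}
        for hub, out in hub_edges.items():
            if not out:
                continue
            share = hub_scores[hub] / len(out)
            for a in out:
                auth_scores[a] += share
        # authority -> hub: each authority splits its score over its in-edges
        hub_scores = {h: 0.0 for h in hub_edges}
        for a, incoming in auth_edges.items():
            share = auth_scores[a] / len(incoming)
            for hub in incoming:
                hub_scores[hub] += share

    # Authorities ranked by score are the candidate recommendations
    return sorted(auth_scores.items(), key=lambda kv: -kv[1])


# Hubs stand in for a user's circle of trust; authorities are the
# accounts those hubs follow.
hubs = {"h1": ["a1", "a2"], "h2": ["a2", "a3"], "h3": ["a2"]}
print(salsa(hubs))  # a2 should rank highest
```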
Making things realtime
Real-time or near-real-time recommendations are crucial: users expect instant suggestions as they interact with the platform. This requires processing and analyzing large subgraphs of Twitter’s social network in milliseconds. Because Cassovary keeps the graph in memory and the algorithms are chosen to favor performance, Twitter can meet this demand.
System Design and Implementation
The initial version of WTF relied heavily on Cassovary for graph processing, which worked well for several years. However, as Twitter’s user base grew, the limitations of this architecture became apparent.
- the ever-growing graph strained the single machine’s memory and resources
- relying on a single server meant frequent hardware upgrades and made system failures harder to handle
Transition to a Distributed System
At this point, Twitter’s engineering team began developing a second-generation system. This new system moved away from a single-server, in-memory approach to a more scalable, distributed architecture. Key features of this new system include:
- the graph was partitioned across multiple servers, with each server responsible for a subset of users (see the sketch after this list)
- this provided greater scalability and resilience
- this helped the system handle server failures gracefully, with redundancy and failover mechanisms in place
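As a purely illustrative example of partitioning a graph by user, here is a toy hash-partitioning sketch. The paper does not prescribe this exact scheme; the partition count and hash choice below are assumptions of mine:

```python
import hashlib

NUM_PARTITIONS = 16  # hypothetical number of graph servers

def partition_for(user_id, num_partitions=NUM_PARTITIONS):
    """Map a user ID to the server that holds that user's adjacency lists.
    A toy hash-partitioning illustration, not Twitter's actual scheme."""
    digest = hashlib.md5(str(user_id).encode()).hexdigest()
    return int(digest, 16) % num_partitions

# Every read of a user's edges is routed to one partition, so a lookup
# of who a user follows only touches a single server.
print(partition_for(123456))
```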
Reducing Latency
With the graph now distributed, durable, and fault-tolerant, the challenge lay in delivering recommendations with minimal latency. The distributed approach lets the system scale out, adding more servers as needed to maintain performance and keep recommendation latency low. Even so, some corners had to be cut along the way.
Some key lessons
One of the key takeaways is the trade-off between simplicity and scalability. The initial in-memory approach simplified the design and allowed for rapid development, but it eventually hit scalability limits and forced a re-architecture. To me, the following things stood out:
- never over-engineer
- if it works for you, it works for you
- have a plan B and be ready to rewrite
The content presented here is a collection of my notes and explanations based on the paper. You can access the full paper WTF: The Who to Follow Service at Twitter. This is by no means an exhaustive explanation, and I strongly encourage you to read the actual paper for a comprehensive understanding. Any images you see are either taken directly from the paper or illustrated by me.