Authors: Chunlong Yu, Han Zheng, Jie Zhu, I-Hong Jhuo, Li Xia, Lin Zhu, Sawyer Shen, Yulan Yan
TL;DR
Most modern ranking stacks rely on large generative models as feature extractors, flattening their outputs into vectors that are then fed into downstream rankers. While effective, this pattern introduces additional pipeline complexity and often dilutes token‑level semantics. GenRec Direct Learning (DirL) explores a different direction: using a generative, token‑native sequential model as the ranking engine itself. In this formulation, ranking becomes an end‑to‑end sequence modeling problem over user behavior, context, and candidate items—without an explicit feature‑extraction stage.
Why revisit the classic L2 ranker design?
Large‑scale recommender systems have historically evolved as layered pipelines: more signals lead to more feature plumbing, which in turn introduces more special cases. In our previous L2 ranking architecture, signals were split into dense and sparse branches and merged late in the stack (Fig. 1). As the system matured, three recurring issues became increasingly apparent.
Figure 1: Traditional ranking DNN
1) Growing pipeline surface area
Each new signal expands the surrounding ecosystem—feature definitions, joins, normalization logic, validation, and offline/online parity checks. Over time, this ballooning surface area slows iteration, raises operational overhead, and increases the risk of subtle production inconsistencies.
2) Semantics diluted by flattening
Generative models naturally capture rich structure: token‑level interactions, compositional meaning, and contextual dependencies. However, when these representations are flattened into sparse or dense feature vectors, much of that structure is lost—undermining the very semantics that make generative representations powerful.
3) Sequence modeling is treated as an add-on
While traditional rankers can ingest history features, modeling long behavioral sequences and fine‑grained temporal interactions typically requires extensive manual feature engineering. As a result, sequence modeling is often bolted on rather than treated as a first‑class concern.
DirL goal: treat ranking as native sequence learning, not as “MLP over engineered features.”
What “Direct Learning” means in DirL
The core shift behind Direct Learning (DirL) is simple but fundamental.
Instead of the conventional pipeline:
generative model → embeddings → downstream ranker,
DirL adopts an end‑to‑end formulation:
tokenized sequence → generative sequential model → ranking score(s).
In DirL, user context, long‑term behavioral history, and candidate item information are all represented within a single, unified token sequence. Ranking is then performed directly by a generative, token‑native sequential model.
This design enables several key capabilities:
- Long‑term behavior modeling beyond short summary windows
The model operates over extended user histories, allowing it to capture long‑range dependencies and evolving interests that are difficult to represent with fixed‑size aggregates.
- Fine‑grained user–content interaction learning
By modeling interactions at the token level, DirL learns detailed behavioral and content patterns rather than relying on coarse, pre‑engineered features.
- Preserved cross‑token semantics within the ranking model
Semantic structure is maintained throughout the ranking process, instead of being collapsed into handcrafted dense or sparse vectors before scoring.
Architecture overview (from signals to ranking)
1) Unified Tokenization
All inputs in DirL are converted into a shared token embedding space, allowing heterogeneous signals to be modeled within a single sequential backbone. Conceptually, each input sequence consists of three token types:
- User / context tokens
These tokens encode user or request‑level information, such as age or cohort‑like attributes, request or canvas context, temporal signals (e.g., day or time), and user‑level statistics like historical CTR.
- History tokens
These represent prior user interactions over time, including signals such as engaged document IDs, semantic or embedding IDs, and topic‑like attributes. Each interaction is mapped to a token, preserving temporal order and enabling long‑range behavior modeling.
- Candidate tokens
Each candidate item to be scored is represented as a token constructed from document features and user–item interaction features. These features are concatenated and projected into a fixed‑dimensional vector via an MLP, yielding a token compatible with the shared embedding space.
Categorical features are embedded directly, while dense numerical signals are passed through MLP layers before being fused into their corresponding tokens. As a result, the model backbone consumes a sequence of the form:
[1 user/context token] + [N history tokens] + [1 candidate token]
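The assembly above can be sketched as follows. This is a minimal NumPy stand‑in, not the production implementation: the embedding table size, MLP weights, and feature values are all illustrative placeholders.

```python
import numpy as np

D = 16                      # shared token embedding dimension (illustrative)
rng = np.random.default_rng(0)

# Categorical features are embedded via lookup tables (toy sparse table here).
doc_id_table = rng.normal(size=(1000, D))

def mlp_project(dense, w, b):
    """Project dense numerical features into the shared token space."""
    return np.maximum(dense @ w + b, 0.0)   # single ReLU layer for brevity

w_user, b_user = rng.normal(size=(3, D)), np.zeros(D)
w_cand, b_cand = rng.normal(size=(4, D)), np.zeros(D)

# 1 user/context token: request-level signals (e.g., historical CTR, hour of day).
user_token = mlp_project(np.array([0.12, 14.0, 1.0]), w_user, b_user)

# N history tokens: one per prior engagement, kept in temporal order.
engaged_doc_ids = [42, 7, 913]
history_tokens = [doc_id_table[i] for i in engaged_doc_ids]

# 1 candidate token: document + user-item interaction features, MLP-projected.
candidate_token = mlp_project(np.array([0.3, 0.8, 0.05, 2.0]), w_cand, b_cand)

# Backbone input: [1 user/context token] + [N history tokens] + [1 candidate token]
sequence = np.stack([user_token, *history_tokens, candidate_token])
print(sequence.shape)   # (1 + N + 1, D) = (5, 16)
```

The key property is that all three token types land in the same D‑dimensional space, so the backbone can attend across them uniformly.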
2) Long-sequence modeling backbone (HSTU)
To model long input sequences, DirL adopts a sequential backbone designed to scale beyond naïve full attention. In the current setup, the backbone consists of stacked HSTU layers with multi‑head attention and dropout for regularization. The hidden state of the candidate token from the final HSTU layer is then fed into an MMoE module for scoring.
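To make the data flow concrete, here is one simplified self‑attention layer standing in for an HSTU layer (real HSTU additionally uses pointwise gating and other refinements; all weights below are random placeholders). The candidate token's hidden state from the last layer is what feeds the scoring head.

```python
import numpy as np

rng = np.random.default_rng(1)
D = 16
seq = rng.normal(size=(5, D))        # [user] + [3 history] + [candidate] tokens

# Random projections as placeholders for learned parameters.
Wq, Wk, Wv = (rng.normal(size=(D, D)) for _ in range(3))

def attention_layer(x):
    """One simplified self-attention layer; HSTU adds pointwise gating on top."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(D)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return x + weights @ v                           # residual connection

hidden = attention_layer(attention_layer(seq))       # two stacked layers
candidate_hidden = hidden[-1]                        # fed into the MMoE head
print(candidate_hidden.shape)   # (16,)
```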
3) Multi-task prediction head (MMoE)
Ranking typically optimizes multiple objectives (e.g., engagement‑related proxies). DirL employs a multi‑gate mixture‑of‑experts (MMoE) layer to support multi‑task prediction while sharing representation learning.
The MMoE layer consists of N shared experts and one task‑specific expert per task. For each task, a gating network produces a weighted combination of the shared experts and the task‑specific expert. The aggregated representation is then fed into a task‑specific MLP head to produce the final prediction.
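The gating scheme described above can be sketched like this (dimensions and the two example task names are illustrative; weights are random placeholders):

```python
import numpy as np

rng = np.random.default_rng(2)
D, H = 16, 8                        # input dim, expert hidden dim (illustrative)
N_SHARED, TASKS = 4, ["click", "dwell"]

def expert(x, w):                   # each expert: a one-layer MLP
    return np.maximum(x @ w, 0.0)

shared_w = [rng.normal(size=(D, H)) for _ in range(N_SHARED)]
task_w = {t: rng.normal(size=(D, H)) for t in TASKS}        # task-specific expert
gate_w = {t: rng.normal(size=(D, N_SHARED + 1)) for t in TASKS}
head_w = {t: rng.normal(size=(H,)) for t in TASKS}          # task-specific head

def mmoe(x):
    preds = {}
    for t in TASKS:
        experts = np.stack([expert(x, w) for w in shared_w] + [expert(x, task_w[t])])
        logits = x @ gate_w[t]
        gate = np.exp(logits - logits.max())
        gate /= gate.sum()                       # softmax over N_SHARED + 1 experts
        mixed = gate @ experts                   # weighted combination of experts
        preds[t] = 1 / (1 + np.exp(-(mixed @ head_w[t])))   # sigmoid task head
    return preds

scores = mmoe(rng.normal(size=D))   # e.g. {'click': ..., 'dwell': ...}
```

Each task gets its own gate, so tasks can lean on different mixtures of the shared experts while still sharing most parameters.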
Figure 2: DirL structure
Early experiments: what worked and what didn’t
What looked promising
Early results indicate that a token‑native setup improves both in‑house evaluation metrics and online engagement (time spent per UU), suggesting that modeling long behavior sequences in a unified token space is directionally beneficial.
The hard part: efficiency and scale
The same design choices that improve expressiveness also raise practical hurdles:
- Training velocity slows down: long-sequence modeling and larger components can turn iteration cycles from hours into days, making ablations expensive.
- Serving and training costs increase: large sparse embedding tables + deep sequence stacks can dominate memory and compute.
- Capacity constraints limit rollout speed: Hardware availability and cost ceilings become a gating factor for expanding traffic and experimentation.
In short: DirL’s main challenge isn’t “can it learn the right dependencies?”—it’s “can we make it cheap and fast enough to be a production workhorse?”
Path to production viability: exploratory directions
Our current work focuses on understanding how to keep the semantic benefits of token‑native modeling while exploring options that could help reduce overall cost.
1) Embedding tables
- consolidate and prune oversized sparse tables
- rely more on shared token representations where possible
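One common way to consolidate oversized sparse tables is the hashing trick: several raw ID spaces share a single, smaller table, trading a controlled collision rate for memory. A minimal sketch (table size and feature names are illustrative, not the production scheme):

```python
import hashlib
import numpy as np

rng = np.random.default_rng(3)
SHARED_ROWS, D = 10_000, 16          # far smaller than the raw ID space
shared_table = rng.normal(size=(SHARED_ROWS, D))

def lookup(feature_name, raw_id):
    """Map (feature, id) into one shared table via a stable hash."""
    key = f"{feature_name}:{raw_id}".encode()
    row = int(hashlib.sha256(key).hexdigest(), 16) % SHARED_ROWS
    return shared_table[row]

# Different features with huge raw IDs all resolve into the same 10k-row table.
doc_emb = lookup("doc_id", 987_654_321)
topic_emb = lookup("topic_id", 42)
print(doc_emb.shape, topic_emb.shape)   # (16,) (16,)
```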
2) Right-size the sequence model
- reduce backbone depth where marginal gains flatten
- evaluate minimal effective token sets to identify which tokens actually move metrics
- explore sequence length vs. performance curves to find the “knee”
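One simple heuristic for locating the "knee" of a sequence‑length vs. metric curve is the point farthest from the chord joining the curve's endpoints. A sketch (the diminishing‑returns curve below is synthetic, not measured data):

```python
import numpy as np

def find_knee(x, y):
    """Return the x where the curve is farthest from the line joining its endpoints."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    # Normalize both axes so the distance is scale-independent.
    xn = (x - x[0]) / (x[-1] - x[0])
    yn = (y - y[0]) / (y[-1] - y[0])
    dist = yn - xn      # signed distance to the chord (y = x after normalization)
    return x[np.argmax(np.abs(dist))]

# Synthetic curve: offline metric vs. history length, flattening past the knee.
lengths = [64, 128, 256, 512, 1024, 2048]
metric = [0.50, 0.62, 0.70, 0.74, 0.755, 0.76]
print(find_knee(lengths, metric))   # 512
```

Past the knee, each doubling of sequence length buys little metric gain while roughly doubling attention cost.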
3) Inference and systems optimization
- dynamic batching tuned for token-native inference
- kernel fusion and graph optimizations
- quantization strategies that preserve ranking model behavior
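For ranking, the key check on any quantization strategy is that candidate ordering survives. A minimal sketch of symmetric per‑tensor int8 weight quantization with a closeness check (toy weights; a production path would use the serving stack's quantization tooling rather than this hand‑rolled version):

```python
import numpy as np

rng = np.random.default_rng(4)

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: w is approximated by q * scale."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

w = rng.normal(size=(16, 1))                 # toy scoring-head weights
q, scale = quantize_int8(w)

candidates = rng.normal(size=(8, 16))        # 8 candidate hidden states
scores_fp = (candidates @ w).ravel()
scores_q = (candidates @ (q.astype(np.float32) * scale)).ravel()

# What matters for ranking is that scores (and hence ordering) barely move.
print(np.max(np.abs(scores_fp - scores_q)))
```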
Why this direction matters
DirL explores a broader shift in recommender systems—from feature‑heavy pipelines with shallow rankers toward foundation‑style sequential models that learn directly from user trajectories. If token‑native ranking can be made efficient, it unlocks several advantages:
- Simpler modeling interfaces, with fewer feature‑plumbing layers.
- Stronger semantic utilization, reducing information loss from aggressive flattening.
- A more natural path to long‑term behavior and intent modeling.
Early signals are encouraging. The next phase is about translating this promise into practice—making the approach scalable, cost‑efficient, and fast enough to iterate as a production system.
Using Microsoft Services to Enable Token‑Native Ranking Research
This work was developed and validated within Microsoft’s internal machine learning and experimentation ecosystem.
Training data was derived from seven days of MSN production logs and user behavior labels, encompassing thousands of features, including numerical, ID‑based, cross, and sequential features. Model training was performed using a PyTorch‑based deep learning framework built by the MSN infrastructure team and executed on Azure Machine Learning with a single A100 GPU.
For online serving, the trained model was deployed on DLIS, Microsoft’s internal inference platform. Evaluation was conducted through controlled online experiments on the Azure Exp platform, enabling validation of user engagement signals under real production traffic.
Although the implementation leverages Microsoft’s internal platforms, the core ideas behind DirL are broadly applicable. Practitioners interested in exploring similar approaches may consider the following high‑level steps:
- Construct a unified token space that captures user context, long‑term behavior sequences, and candidate items.
- Apply a long‑sequence modeling backbone to learn directly from extended user trajectories.
- Formulate ranking as a native sequence modeling problem, scoring candidates from token‑level representations.
- Evaluate both model effectiveness and system efficiency, balancing gains in expressiveness against training and serving cost.
Call to action
We encourage practitioners and researchers working on large‑scale recommender systems to experiment with token‑native ranking architectures alongside traditional feature‑heavy pipelines, compare trade‑offs in modeling power and system efficiency, and share insights on when direct sequence learning provides practical advantages in production environments.
Acknowledgement:
We would like to acknowledge the support and contributions from several colleagues who helped make this work possible.
We thank Gaoyuan Jiang and Lightning Huang for their assistance with model deployment, Jianfei Wang for support with the training platform, Gong Cheng for ranker monitoring, Peiyuan Xu for sequential feature logging, and Chunhui Han and Peng Hu for valuable discussions on model design.