Sequence modeling papers
Tight starter set — the arc from Markov next-click → RNN sessions → self-attention. Feeds Prediction and the starter track. OA legend: ✅ open · 🟡 free author copy · 🔒 paywalled. Read with How I read a paper.
🎯 Start here:
vaswani2017attention: every later model is a descendant; read it first and SASRec/BERT4Rec become legible. (Eng-background alternative: start at rendle2010fpmc, the Markov baseline that connects to Markov chains and HMMs.) Path: vaswani → rendle → hidasi → kang → sun → wang
⬜ P1 · Attention Is All You Need
Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser & Polosukhin (2017) · NeurIPS · vaswani2017attention · ✅ yes
🔗 https://arxiv.org/abs/1706.03762
Why: The Transformer — multi-head self-attention + positional encodings. The substrate under every modern sequential model.
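As a warm-up before reading, the core operation can be sketched in a few lines. This is a minimal single-head version of scaled dot-product attention, softmax(QKᵀ/√d)·V, with no learned projections, no multiple heads, and no positional encodings; the toy input `X` and the helper names are made up for illustration.

```python
import math

def softmax(xs):
    # Subtract the max for numerical stability before exponentiating.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Single-head scaled dot-product attention over lists of vectors."""
    d = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Toy self-attention: 3 positions, dimension 2 (illustrative values only).
X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
result = attention(X, X, X)
```

Each output row is a convex combination of the rows of `V`, which is why nothing here depends on order; that gap is what the paper's positional encodings fill.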
⬜ P2 · Factorizing Personalized Markov Chains for Next-Basket Recommendation (FPMC)
Rendle, Freudenthaler & Schmidt-Thieme (2010) · WWW 2010 · rendle2010fpmc · 🟡 author PDF
🔗 https://www.ismll.uni-hildesheim.de/pub/pdfs/RendleFreudenthaler2010-FPMC.pdf
Why: The canonical Markov-chain baseline. It factorizes a personalized (per-user) first-order transition tensor, merging a Markov chain with matrix factorization. Later neural sequential models routinely benchmark against it.
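A rough sketch of the FPMC scoring rule may help when reading the paper: the ranking score is a user-item factorization term plus an item-to-item transition term averaged over the previous basket. The factor-matrix names, sizes, and random values below are assumptions for illustration, and no training is shown (the paper fits the factors with a BPR-style criterion).

```python
import random

random.seed(0)
DIM, N_USERS, N_ITEMS = 4, 3, 5  # hypothetical sizes

def rand_matrix(rows, cols):
    return [[random.gauss(0, 0.1) for _ in range(cols)] for _ in range(rows)]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

# Untrained, randomly initialised factors (illustrative only).
V_UI = rand_matrix(N_USERS, DIM)  # user factors, user-item interaction
V_IU = rand_matrix(N_ITEMS, DIM)  # item factors, user-item interaction
V_IL = rand_matrix(N_ITEMS, DIM)  # candidate-item factors, transition term
V_LI = rand_matrix(N_ITEMS, DIM)  # last-basket-item factors, transition term

def fpmc_score(user, prev_basket, item):
    # Matrix-factorization part: long-term taste of the user.
    mf = dot(V_UI[user], V_IU[item])
    # Markov-chain part: average factorized transition from the last basket.
    mc = sum(dot(V_IL[item], V_LI[l]) for l in prev_basket) / len(prev_basket)
    return mf + mc

# Rank all items for user 0 given a previous basket of items 1 and 3.
ranking = sorted(range(N_ITEMS),
                 key=lambda i: fpmc_score(0, [1, 3], i), reverse=True)
```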
⬜ P3 · Session-based Recommendations with Recurrent Neural Networks (GRU4Rec)
Hidasi, Karatzoglou, Baltrunas & Tikk (2016) · ICLR · hidasi2016gru4rec · ✅ yes
🔗 https://arxiv.org/abs/1511.06939
Why: Brought RNNs to anonymous short-session prediction. Extract the session-parallel mini-batch training and the pairwise ranking losses (BPR, TOP1); both became standard.
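Session-parallel mini-batching is easier to follow with a toy generator. In this sketch each batch slot walks one session step by step; when a session ends, the slot is refilled with the next unused session and flagged so the caller can reset that slot's hidden state. The function name and the choice to shrink the batch once sessions run out are simplifications assumed here, not taken from the reference implementation.

```python
def session_parallel_batches(sessions, batch_size):
    """Yield (inputs, targets, reset_flags), one time step per batch."""
    sessions = [s for s in sessions if len(s) >= 2]  # need an (input, target) pair
    active = list(range(min(batch_size, len(sessions))))  # session id per slot
    pos = [0] * len(active)            # current position inside each session
    reset = [True] * len(active)       # True: slot starts a new session
    next_session = len(active)
    while active:
        inputs = [sessions[s][p] for s, p in zip(active, pos)]
        targets = [sessions[s][p + 1] for s, p in zip(active, pos)]
        yield inputs, targets, list(reset)
        reset = [False] * len(active)
        # Walk slots in reverse so deleting a slot keeps earlier indices valid.
        for slot in range(len(active) - 1, -1, -1):
            pos[slot] += 1
            if pos[slot] + 1 >= len(sessions[active[slot]]):  # session exhausted
                if next_session < len(sessions):
                    active[slot] = next_session
                    pos[slot] = 0
                    reset[slot] = True
                    next_session += 1
                else:
                    del active[slot], pos[slot], reset[slot]

# Toy sessions of item ids (made up for illustration).
toy_sessions = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
batches = list(session_parallel_batches(toy_sessions, batch_size=2))
```

Every consecutive (item, next item) pair in every session appears exactly once across the yielded batches, which is the property that makes this scheme usable with sessions of very different lengths.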
⬜ P4 · Self-Attentive Sequential Recommendation (SASRec)
Kang & McAuley (2018) · ICDM · kang2018sasrec · ✅ yes
🔗 https://arxiv.org/abs/1808.09781
Why: The inflection where self-attention replaced RNNs — a causal (unidirectional) Transformer over item sequences. Extract the masking + positional-embedding scheme.
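The causal masking idea can be sketched independently of any framework: position i may attend only to positions j ≤ i, so disallowed scores are set to negative infinity before the softmax. The helper names and the toy scores are illustrative assumptions; SASRec's learned positional embeddings are not shown.

```python
import math

NEG_INF = float("-inf")

def causal_mask(n):
    # mask[i][j] is True when position i is allowed to attend to position j.
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, mask_row):
    # Disallowed positions get -inf, which becomes weight 0 after softmax.
    masked = [s if keep else NEG_INF for s, keep in zip(scores, mask_row)]
    m = max(masked)  # finite, because position i always sees itself
    exps = [math.exp(s - m) for s in masked]
    total = sum(exps)
    return [e / total for e in exps]

mask = causal_mask(4)
# Attention weights for position 1: only positions 0 and 1 are visible.
weights = masked_softmax([0.5, 1.0, 2.0, 3.0], mask[1])
```

Positions 2 and 3 receive exactly zero weight even though their raw scores are the largest, which is the whole point of the mask: the model cannot peek at future items during training.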
⬜ P5 · BERT4Rec: Sequential Recommendation with Bidirectional Encoder Representations from Transformer
Sun, Liu, Wu, Pei, Lin, Ou & Jiang (2019) · CIKM · sun2019bert4rec · ✅ yes
🔗 https://arxiv.org/abs/1904.06690
Why: Bidirectional (Cloze) masking. The contrast with SASRec's causal model sharpens your thinking on train-time vs inference-time leakage (Leakage checklist).
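To make the train-time versus inference-time contrast concrete, here is a minimal sketch of Cloze-style example construction: during training, random positions are replaced by a mask token and become prediction targets; at inference, a single mask token is appended to the end of the history. The token string, the masking probability, and the function names are assumptions for illustration, not the paper's exact settings.

```python
import random

MASK = "[MASK]"  # hypothetical mask token

def cloze_example(seq, mask_prob, rng):
    """Randomly mask items; labels hold the hidden item, else None."""
    inputs, labels = [], []
    for item in seq:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            labels.append(item)   # the model must recover this item
        else:
            inputs.append(item)
            labels.append(None)   # no loss at unmasked positions
    return inputs, labels

def inference_input(seq):
    # Next-item prediction: mask only a new position after the history.
    return list(seq) + [MASK]

rng = random.Random(0)
history = ["a", "b", "c", "d", "e"]
train_inputs, train_labels = cloze_example(history, 0.4, rng)
test_input = inference_input(history)
```

A masked training position can see items on both sides of it, while the inference-time mask only ever has left context. That mismatch is the thing to keep in mind when comparing against SASRec's causal setup.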
⬜ P6 · Sequential Recommender Systems: Challenges, Progress and Prospects
Wang, Hu, Wang, Cao, Sheng & Orgun (2019) · IJCAI survey track · wang2019sequential · ✅ yes
🔗 https://arxiv.org/abs/2001.04830
Why: The best single-document map — taxonomy (MC / RNN / attention / graph), datasets, evaluation conventions, open problems.