橙子的短想法

00:49 · Apr 20, 2025 · Sun

https://arxiv.org/abs/2305.18290 #llm #ai

今天深入学习了 DPO，再次感叹扎实的数学功底对 AI/ML Research 的重要性……

原始的 RLHF 是用 pairwise human preference data（A 和 B 哪个更好）去训练一个 reward model，然后用 RL 来训练主 model，objective 是 maximize reward / minimize negative log likelihood 加上 regularization。比如 PPO 就是通过新旧 policy 之间的 KL Divergence 来做 regularization。而且还需要一个 critic model 来预测 reward。这套流程涉及多个模型，而 RL 又是出了名的难搞。

DPO 的思路是，观察到 RLHF 的 objective 本质上是 minimize loss over (latent) reward function，通过一番 reparameterization 等数学推导，重新设计了一个 minimize loss over policy 的 objective，直接绕过了中间这个 reward model，让 gradient update 直接增加 winner response 的概率并降低 loser response 的概率，大幅简化了流程。

拓展阅读：
- KTO: 更进一步，不需要 pairwise comparison，只用对 individual example 的 upvote/downvote 也可以学习到 preference。
- IPO: 解决 DPO 容易 overfit 的问题。

arXiv.org

Direct Preference Optimization: Your Language Model is Secretly a...

While large-scale unsupervised language models (LMs) learn broad world knowledge and some reasoning skills, achieving precise control of their behavior is difficult due to the completely...

llm ai

06:13 · Apr 6, 2025 · Sun

👍 https://www.oreilly.com/radar/what-we-learned-from-a-year-of-building-with-llms-part-i/ #llm

O’Reilly Media

What We Learned from a Year of Building with LLMs (Part I)

llm

06:07 · Apr 6, 2025 · Sun

https://jax-ml.github.io/scaling-book/
非常值得学习的分享，作者列表里好几个 Gemini 核心团队的人😃 Sholto、Jacob、Sharad 等人都是超一流的 research engineer 🙏
#llm

jax-ml.github.io

How To Scale Your Model

Training LLMs often feels like alchemy, but understanding and optimizing the performance of your models doesn't have to. This book aims to demystify the science of scaling language models: how TPUs (and GPUs) work and how they communicate with each other…

llm

06:06 · Apr 6, 2025 · Sun

前段时间准备 ML Interview (with a focus on LLMs)，浏览了不少学习资源，这里分享一些：

CMU 11-711 Advanced NLP

Language Modeling 综述。

The Transformer Blueprint: A Holistic Guide to the Transformer Neural Network Architecture

比较好的一篇 Transformer 综述。

3Blue1Brown: Attention in transformers, step-by-step

解释 Attention 最好的视频，没有之一。

Hugging Face: Mixture of Experts Explained

Hugging Face: RLHF

Hugging Face: Introduction to Deep Reinforcement Learning

Hugging Face: Multimodal Models

HF 这几个资源很适合快速查漏补缺相关的话题。

Lilian Weng: Agents

依然是最好的 Agents 综述之一。

Understanding Reasoning LLMs

一些 post-training 的细节，侧重分析了 DeepSeek R1 和 R1 Zero。

Designing Machine Learning Systems 笔记 by @tms_ur_way

适合快速查漏补缺 ML 实践中的要点。

Stable Diffusion Explained From Scratch

关于 Diffusion 基本原理的解释。

除此之外以下这几位的内容都很不错，可以针对话题有选择性地摄入。

- Andrej Karpathy 的 YouTube 视频
- Lilian Weng 的博客
- Chip Huyen 的博客

这里推荐的基本都比较入门 / high level，更多是为了查漏补缺。要深度挖掘具体话题还是得去看进一步的资源和论文等。 #ml #llm

ml llm

09:03 · Apr 5, 2025 · Sat

前段时间准备 ML Interview (with a focus on LLMs)，浏览了不少学习资源，这里分享一些：

CMU 11-711 Advanced NLP

Language Modeling 综述。

The Transformer Blueprint: A Holistic Guide to the Transformer Neural Network Architecture

比较好的一篇 Transformer 综述。

3Blue1Brown: Attention in transformers, step-by-step

解释 Attention 最好的视频，没有之一。

Hugging Face: Mixture of Experts Explained

Hugging Face: RLHF

Hugging Face: Introduction to Deep Reinforcement Learning

Hugging Face: Multimodal Models

HF 这几个资源很适合快速查漏补缺相关的话题。

Lilian Weng: Agents

依然是最好的 Agents 综述之一。

Understanding Reasoning LLMs

一些 post-training 的细节，侧重分析了 DeepSeek R1 和 R1 Zero。

Designing Machine Learning Systems 笔记 by @tms_ur_way

适合快速查漏补缺 ML 实践中的要点。

Stable Diffusion Explained From Scratch

关于 Diffusion 基本原理的解释。

除此之外以下这几位的内容都很不错，可以针对话题有选择性地摄入。

- Andrej Karpathy 的 YouTube 视频
- Lilian Weng 的博客
- Chip Huyen 的博客

这里推荐的基本都比较入门 / high level，更多是为了查漏补缺。要深度挖掘具体话题还是得去看进一步的资源和论文等。 #ml #llm

ml llm