pi0: Our First Generalist Policy
At its core, π0 proposes an architecture: it composes existing models to output robot poses. The task here is to understand that architecture and see how the VLA model is put together.
Introduction

This figure describes the basic flow of the whole policy. The core VLA model is assembled from a pre-trained VLM (which, as I understand it, supplies strong reasoning ability) and an action expert. It is trained on cross-embodiment data and fine-tuned on high-quality data.
Why build on a VLM
By basing our model on a VLM, we inherit the general knowledge, semantic reasoning, and problem-solving abilities of language and vision-language models.
The paper identifies three bottlenecks for current generalist robot policies:
- very large scale (of data)
- the right model architecture, one that can make use of diverse data sources and represent intricate and subtle behaviors
- the right training recipe
The paper's answers to each:
- Problem 1: build on an already pre-trained VLM
- Problem 2:
    - diverse data: VLM, cross-embodiment training
    - intricate and subtle behaviors: an action chunking architecture with flow matching
- Problem 3: pre-training/post-training separation
pre-training/post-training separation
Intuitively, training only on high-quality data does not teach the model how to recover from mistakes, since mistakes are rarely seen in such data. Training on only lower-quality pretraining data does not teach the model to act efficiently and robustly. Combining both provides the desired behavior: the model attempts insofar as possible to act in a manner similar to the high-quality data, but still has a repertoire of recoveries and corrections that it can deploy in the case of a mistake.
Related Work
Earlier models predicted actions with autoregressive discretization; π0 instead uses a variant of diffusion models called flow matching (the main selling point), which provides high precision and the ability to represent multimodal action distributions.
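To make the flow matching idea concrete, here is a minimal numpy sketch of how a training target is constructed. Time/sign conventions differ between papers, so this uses the common rectified-flow form (straight path from noise to data) rather than claiming to be π0's exact parameterization; the shapes and the network `v_theta` are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "action chunk": H future actions with d degrees of freedom each.
H, d = 50, 7
actions = rng.normal(size=(H, d))   # stands in for a ground-truth chunk A_t
noise = rng.normal(size=(H, d))     # epsilon ~ N(0, I)

# Straight-line interpolant between noise (tau = 0) and data (tau = 1).
# This is the common rectified-flow convention, not necessarily pi0's
# exact sign/time parameterization.
tau = 0.3
noisy = tau * actions + (1.0 - tau) * noise

# For a straight path, the regression target is its constant velocity;
# a network v_theta(noisy, obs, tau) would be trained on it with MSE.
target_velocity = actions - noise

# Sanity check: one full step of the true velocity from pure noise
# recovers the data, which iterative integration approximates at inference.
recovered = noise + 1.0 * target_velocity
assert np.allclose(recovered, actions)
```

Unlike autoregressive discretization, the whole chunk is denoised jointly, which is where the precision and multimodality claims come from.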
diffusion model for action generation
The paper cites two works here.
The training regime differs from prior work:
… we train our model via a diffusion-style (flow matching) loss applied on individual sequence elements, in lieu of the standard cross-entropy loss for decoder-only transformers.
… we use a separate set of weights for the tokens corresponding to diffusion.
Overview

Overview of pi0’s framework
We start with a pre-training mixture, which consists of both our own dexterous manipulation datasets and open-source data. We use this mixture to train our flow matching VLA model, which consists of a larger VLM backbone and a smaller action expert for processing robot states and actions. The VLM backbone weights are initialized from PaliGemma, providing representations learned from large-scale Internet pre-training. The resulting π0 model can be used to control multiple robot embodiments with differing action spaces to accomplish a wide variety of tasks.
Training proceeds in two stages: pre-training aims for a model with broad capabilities and good generalization, while post-training aims to make its actions more precise.
The model
The skeleton of the whole model is a VLM, whose core is a language-model transformer. On the robot side, the image encoder embeds the robot's camera observations into the same semantic space as the language tokens.
late fusion VLM recipe
VLMs are worth understanding properly; the paper cites three works here.
Extending this backbone for robotics means adding a robot-specific input, the proprioceptive state, and a robot-specific output, the robot action.
Conditional flow matching is used to model the continuous action distribution. The overall framework follows Transfusion: a flow matching loss supervises the "tokens corresponding to continuous outputs," while a cross-entropy loss supervises the "tokens corresponding to discrete outputs."
Transfusion
trains a single transformer using multiple objectives
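The Transfusion-style mixed objective can be sketched in a few lines: discrete positions get softmax cross-entropy, continuous positions get a flow matching regression loss, and the two are summed. Everything below is a toy stand-in (random "logits" and "velocities," and a weighting coefficient of 1.0 chosen arbitrarily), not the paper's actual values.

```python
import numpy as np

rng = np.random.default_rng(0)

def cross_entropy(logits, labels):
    # Standard softmax cross-entropy over a vocabulary.
    z = logits - logits.max(axis=-1, keepdims=True)
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(labels)), labels].mean()

# Discrete (text) positions get a cross-entropy loss...
text_logits = rng.normal(size=(5, 100))          # 5 tokens, vocab of 100
text_labels = rng.integers(0, 100, size=5)
ce = cross_entropy(text_logits, text_labels)

# ...while continuous (action) positions get a flow matching
# regression loss on the predicted velocity field.
pred_velocity = rng.normal(size=(50, 7))
true_velocity = rng.normal(size=(50, 7))
fm = np.mean((pred_velocity - true_velocity) ** 2)

# Hypothetical weighting; balancing the two objectives is a
# training hyperparameter, not something specified here.
loss = ce + 1.0 * fm
```

The point is that one transformer produces both kinds of outputs, and each token position is supervised by the loss appropriate to its modality.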
Using a separate set of weights for the robot's state and action tokens (i.e., splitting action prediction out of the VLM) gives better performance. This resembles a mixture of experts with two mixture elements: the first handles image and text inputs, the second handles robotics-specific inputs and outputs. The paper calls this second set of parameters the action expert.
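A rough numpy sketch of this MoE-style split: each token is routed through its own expert's projection weights, but all tokens share a single self-attention pattern. The weight shapes and the 6-token sequence are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Hypothetical per-expert projection weights: one set for image/text
# tokens (the VLM), one for robot state/action tokens (the action expert).
W_vlm = {k: rng.normal(size=(d_model, d_model)) for k in ("q", "k", "v")}
W_act = {k: rng.normal(size=(d_model, d_model)) for k in ("q", "k", "v")}

tokens = rng.normal(size=(6, d_model))        # 4 image/text + 2 robot tokens
is_robot = np.array([0, 0, 0, 0, 1, 1], bool)

def project(kind):
    # Route each token through its expert's weights for this projection.
    return np.where(is_robot[:, None],
                    tokens @ W_act[kind],
                    tokens @ W_vlm[kind])

q, k, v = project("q"), project("k"), project("v")

# One shared self-attention over ALL tokens: the experts share the
# attention pattern but not the projection parameters.
attn = softmax(q @ k.T / np.sqrt(d_model))
out = attn @ v
```

So the "two sets of mixture elements" never mix parameters, yet the robot tokens can still attend to the image/text context through the shared attention.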
Mathematical representation
The problem boils down to modeling a conditional probability density over actions, p(A_t | o_t), and sampling from it to predict the next actions.
Here the action chunk A_t = [a_t, a_{t+1}, …, a_{t+H−1}] denotes a sequence of future actions (the paper uses a horizon of H = 50).
The observation is o_t = [I_t^1, …, I_t^n, ℓ_t, q_t], where the I_t^i are RGB image inputs, ℓ_t is the sequence of language tokens, and q_t is a vector of joint angles.
Each modality's own encoder plus a linear projection layer maps the images and the state into the same embedding space as the language tokens.
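A minimal sketch of that projection step, with made-up feature dimensions (the 196 patch tokens, 64-dim features, and 7-dim state are illustrative, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 16                      # width of the language-token embedding space

# Hypothetical encoder outputs (shapes are illustrative only).
image_features = rng.normal(size=(196, 64))   # e.g. ViT patch features
state = rng.normal(size=(7,))                 # joint-angle vector q_t

# One linear projection per modality maps into the shared embedding space.
W_img = rng.normal(size=(64, d_model))
W_state = rng.normal(size=(7, d_model))

image_tokens = image_features @ W_img          # (196, d_model)
state_token = (state @ W_state)[None, :]       # (1, d_model)

# Everything now lives in the same space as language tokens, so it can be
# concatenated into one transformer input sequence.
lang_tokens = rng.normal(size=(12, d_model))
sequence = np.concatenate([image_tokens, lang_tokens, state_token], axis=0)
```

After this, the transformer sees one uniform token sequence regardless of which modality each token came from.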
The math that follows gets a bit involved.
For each action a_{t′} in A_t there is a corresponding action token fed into the action expert. These are supervised with a conditional flow matching loss (this one is wild):

L^τ(θ) = E_{p(A_t|o_t), q(A_t^τ|A_t)} ‖v_θ(A_t^τ, o_t) − u(A_t^τ | A_t)‖²

where A_t^τ is a noised version of the action chunk, u(A_t^τ | A_t) is the target denoising vector field, and v_θ is the model's prediction.
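At inference time the learned vector field is integrated from noise to an action chunk, typically with forward Euler steps. The sketch below substitutes a hand-written toy field that pulls samples toward a fixed target, so the integration loop can be checked end to end; the real v_θ is the action expert conditioned on the observation, and the step count here is an arbitrary choice.

```python
import numpy as np

rng = np.random.default_rng(0)
H, d = 50, 7
steps = 10
delta = 1.0 / steps

# Stand-in for the learned vector field v_theta(A^tau, o_t): for the demo
# it points from the current sample toward a fixed target chunk, rescaled
# by the remaining time (hypothetical, for illustration only).
target = np.ones((H, d))
def v_theta(a_tau, tau):
    return (target - a_tau) / max(1.0 - tau, delta)

# Forward Euler integration from pure noise (tau = 0) to an action chunk
# (tau = 1), the same scheme flow matching models use at inference.
a = rng.normal(size=(H, d))
for i in range(steps):
    tau = i * delta
    a = a + delta * v_theta(a, tau)

# The toy field drives every sample exactly onto the target.
assert np.allclose(a, target)
```

In the real model the field is multimodal (different noise samples flow to different valid action chunks), which is exactly what sampling from p(A_t | o_t) requires.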
