🗽 pi0: Our First Generalist Policy

💡
Essentially, pi0 proposes an architecture: it composes existing models to output robot actions (poses). What we need to do is understand this architecture and see how the VLA model is put together.

 

Introduction

This figure shows the basic flow of the whole policy. The core VLA model is assembled from a pre-trained VLM (which, as I understand it, supplies strong reasoning ability) and an action expert. It is trained on cross-embodiment data and fine-tuned on high-quality data.
Why build on a VLM
By basing our model on a VLM, we inherit the general knowledge, semantic reasoning, and problem-solving abilities of language and vision-language models.
 
The paper identifies three bottlenecks for current generalist robot policies:
  • very large scale (of data and of training)
  • the right model architecture: one that can make use of diverse data sources and represent intricate and subtle behaviors
  • the right training recipe
The paper's answers, respectively:
  • Problem 1: build on an already pre-trained VLM
  • Problem 2:
    • diverse data: the VLM itself, plus cross-embodiment training
    • intricate and subtle behaviors: an action chunking architecture with flow matching
  • Problem 3: pre-training/post-training separation
pre-training/post-training separation
Intuitively, training only on high-quality data does not teach the model how to recover from mistakes, since mistakes are rarely seen in such data. Training on only lower-quality pretraining data does not teach the model to act efficiently and robustly. Combining both provides the desired behavior: the model attempts insofar as possible to act in a manner similar to the high-quality data, but still has a repertoire of recoveries and corrections that it can deploy in the case of a mistake.

 

Related Work

Earlier models predicted actions via autoregressive discretization; here a variant of diffusion models called flow matching is used instead (the main selling point). Flow matching offers high precision and the ability to represent multimodal action distributions.
diffusion model for action generation
 
The training scheme also differs from prior work:
… we train our model via a diffusion-style (flow matching) loss applied on individual sequence elements, in lieu of the standard cross-entropy loss for decoder-only transformers.
… we use a separate set of weights for the tokens corresponding to diffusion.

 

Overview

Overview of pi0’s framework
We start with a pre-training mixture, which consists of both our own dexterous manipulation datasets and open-source data. We use this mixture to train our flow matching VLA model, which consists of a larger VLM backbone and a smaller action expert for processing robot states and actions. The VLM backbone weights are initialized from PaliGemma, providing representations learned from large-scale Internet pre-training. The resulting π0 model can be used to control multiple robot embodiments with differing action spaces to accomplish a wide variety of tasks.
Training proceeds in two stages: pre-training aims for a broadly capable, well-generalizing model; post-training aims to make the actions more precise.
 

The model

The backbone of the whole model is a VLM, whose core is a language-model transformer. On the robot side, the image encoder embeds the robot's camera observations into the same semantic space as the language tokens.
late fusion VLM recipe
Extending this backbone for robotics requires adding a robot-specific input, the proprioceptive state, and a robot-specific output, the robot action.
Conditional flow matching is used to model the continuous action distribution. The overall framework follows Transfusion: a flow matching loss supervises "tokens corresponding to continuous outputs", while a cross-entropy loss supervises "tokens corresponding to discrete outputs".
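As a concrete illustration of this Transfusion-style split, here is a minimal numpy sketch (names and shapes are my own, not from the pi0 codebase): the flow matching loss reduces to a regression on the continuous action tokens, while the discrete text tokens get the usual cross-entropy.

```python
import numpy as np

def combined_loss(cont_pred, cont_target, disc_logits, disc_target):
    """Transfusion-style loss split (illustrative sketch).

    cont_pred / cont_target: predicted vs. target vector fields for the
        continuous (action) tokens -- the flow matching loss is an MSE here.
    disc_logits / disc_target: logits and class indices for the discrete
        (text) tokens -- supervised with standard cross-entropy.
    """
    flow_loss = np.mean((cont_pred - cont_target) ** 2)
    # log-softmax over the vocabulary dimension
    log_probs = disc_logits - np.log(np.sum(np.exp(disc_logits), axis=-1, keepdims=True))
    ce_loss = -np.mean(log_probs[np.arange(len(disc_target)), disc_target])
    return flow_loss + ce_loss
```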
Transfusion
💡
Using a separate set of weights for the robot state and action (i.e., splitting action prediction out) yields better performance. This resembles a mixture of experts with two mixture elements: the first handles image and text inputs, the second handles robotics-specific inputs and outputs. The paper calls this second set of weights the action expert.
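A tiny sketch of the "two mixture elements" idea, under my simplifying assumption that only the per-token weights differ by token type while attention is shared: image/text tokens go through the VLM weights, robot state/action tokens through the action-expert weights.

```python
import numpy as np

def routed_ffn(tokens, is_robot_token, vlm_weights, expert_weights):
    """Route each token through one of two weight sets (illustrative).

    tokens:         (num_tokens, d) activations after shared attention
    is_robot_token: boolean mask, True for robot state/action tokens
    vlm_weights / expert_weights: (d, d) per-mixture-element matrices
    """
    out = np.empty_like(tokens)
    out[~is_robot_token] = tokens[~is_robot_token] @ vlm_weights
    out[is_robot_token] = tokens[is_robot_token] @ expert_weights
    return out
```

Attention would still mix information across all tokens; only the weights applied to each token differ, which is what lets the action expert stay much smaller than the VLM backbone.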
 

Mathematical representation

The essence of the problem is to model the conditional distribution over actions, $p(A_t \mid o_t)$, and sample from it to produce the next actions.
Here the action chunk $A_t = [a_t, a_{t+1}, \ldots, a_{t+H-1}]$ is a sequence of future actions;
the observation $o_t = [I_t^1, \ldots, I_t^n, \ell_t, q_t]$ consists of RGB images $I_t^i$, a sequence of language tokens $\ell_t$, and a vector of joint angles $q_t$.
Each modality's encoder, followed by a linear projection layer, maps the images and the state into the same embedding space as the language tokens.
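A minimal sketch of that projection for the proprioceptive state (dimensions and weights are placeholders; in the real model the projection is learned):

```python
import numpy as np

rng = np.random.default_rng(0)
d_model = 8   # embedding width of the language model (illustrative)
n_joints = 4  # e.g. 4 joint angles (illustrative)

def embed_state(q, W, b):
    """Map a proprioceptive state vector q into the language-token
    embedding space with a single linear projection layer."""
    return q @ W + b

q = rng.normal(size=n_joints)              # current joint angles
W = rng.normal(size=(n_joints, d_model))   # learned projection (random here)
b = np.zeros(d_model)
state_token = embed_state(q, W, b)         # one extra "token" for the model
```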
 
The math that follows gets somewhat involved.
For each action $a_{t'}$ in $A_t$, there is a corresponding action token fed into the action expert. These tokens are supervised with the conditional flow matching loss (which is wild):
$$L^\tau(\theta) = \mathbb{E}_{p(A_t \mid o_t),\, q(A_t^\tau \mid A_t)} \left\| v_\theta(A_t^\tau, o_t) - u(A_t^\tau \mid A_t) \right\|^2$$
where $A_t^\tau = \tau A_t + (1-\tau)\epsilon$ with $\epsilon \sim \mathcal{N}(0, I)$, and the target (denoising) vector field is $u(A_t^\tau \mid A_t) = \epsilon - A_t$.
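A minimal numpy sketch of that conditional flow matching objective, assuming the linear-interpolation probability path $A_t^\tau = \tau A_t + (1-\tau)\epsilon$ with target velocity $\epsilon - A_t$; `v_theta` here is a stand-in for the action expert.

```python
import numpy as np

def flow_matching_loss(A, v_theta, tau, eps):
    """Conditional flow matching loss on one action chunk A (sketch).

    A:       (H, d) chunk of future actions
    v_theta: callable (A_tau, tau) -> predicted vector field,
             stand-in for the action expert
    tau:     scalar noise level in (0, 1)
    eps:     (H, d) Gaussian noise sample
    """
    A_tau = tau * A + (1.0 - tau) * eps   # noised action chunk
    u = eps - A                           # target (denoising) vector field
    return np.mean((v_theta(A_tau, tau) - u) ** 2)

# usage: sample tau and eps, then regress the network output onto u
rng = np.random.default_rng(0)
A = rng.normal(size=(50, 7))              # H=50 actions, 7-DoF (illustrative)
eps = rng.normal(size=A.shape)
loss = flow_matching_loss(A, lambda a, t: np.zeros_like(a), 0.3, eps)
```

At inference time, actions are generated by integrating the learned vector field from pure noise, e.g. with a few Euler steps of the form $A^{\tau+\delta} = A^\tau + \delta\, v_\theta(A^\tau, o_t)$.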