GeoSim: Realistic Video Simulation via Geometry-Aware Composition for Self Driving

2021-07-13

<GeoSim> GeoSim: Realistic video simulation via geometry-aware composition for self-driving

CVPR2021 Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Yun Chen, Frieda Rong, Shivam Duggal, Shenlong Wang, Xinchen Yan, Sivabalan Manivasagam, Shangjie Xue, Ersin Yumer, Raquel Urtasun

Stanford, MIT, University of Toronto, Uber

auto-drive datasets (KITTI), camera lidar reconstruction, image-based neural rendering, image warping, post synthesis net


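Editor's notes

Dynamic part details: there are none (for example, the wheels do not turn); but this may not matter for self-driving perception?
Everything it does amounts to adding objects into existing observations:
- there is no new overall viewpoint of the scene;
- there are no entirely new scenes.
It is image-based neural rendering: rendering from images, not from physical materials.
- Its asset bank stores only the 3D surface plus the observations and image information from each view.
- ❌ The rendering process is essentially pixel transfer (which is why it is called image-based neural rendering)!
  - Using the depth map rendered from the mesh, corresponding points under the two views' projections are used to warp the old pixels to their new positions in the new view.
Other data / components required:
- Asset bank construction:
  - camera + LiDAR (C + L) self-driving dataset with 3D bbox annotations (weak supervision)
  - the SOTA object segmentation method PointRend is used to produce segmentation GT
- Image composition:
  - background scenes (C + L)
  - HD map, in the form of a lane graph
  - trajectory generation methods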
    - Interaction-Aware Probabilistic Behavior Prediction in Urban Environments. arXiv, 2018.
    - Trafficsim: Learning to simulate realistic multiagent behaviors. arXiv, 2021.

Motivation

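Problems with existing synthesis models:

Not photorealistic / no 3D: they lack physics and lack controllability

Task / what it actually does:

image-manipulation framework
large-scale sensor data simulation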
geometry-aware image composition process
- augmenting existing images with dynamic objects
  - object source: extracted from other scenes and rendered at novel poses;
  - builds a diverse asset bank of vehicles
creating pictures by replicating visual content
combines data-driven techniques with image-based rendering techniques
image-based neural rendering + neural inpainting to reconcile the differences between the original image and the inserted actor
geometry-aware means 3D-layout-aware, so controllable and realistic scene edits are possible

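Applications:

Long-range, multi-camera realistic image simulation
Synthetic data for data augmentation and downstream segmentation tasks

Three steps:

Plausible object placement
Render novel views of dynamic objects from the asset bank
Compose and blend the scene (similar to the neural renderer in GIRAFFE)
- 🤔 Thoughts:
  - In multi-object scene synthesis today, are there few methods that can render a realistic scene directly? Is an extra fusion pass with neural rendering always required? Is the main difficulty the handling of lighting across objects / the lighting environment?
  - Material still needs to be disentangled from the specific appearance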

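Two current approaches to self-driving simulation:

simulation engine
- pros
  - scene complexity can be increased; scenes can be composed
- cons
  - 3D models are designed by hand, which is costly; diversity is limited
  - sim2real gap
data-driven: a scalable alternative
- pros
  - augments existing recorded scenes
- cons
  - either focuses on LiDAR and needs CAD model registration, which limits the range of dynamic object types that can be simulated
  - or needs extra effort to scale to high-resolution images (e.g. GANs)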

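Image synthesis and manipulation

Current work mainly focuses on generating 2D images from intermediate representations, such as:
- scene graphs, surface normal maps, semantic segmentations, images with different styles
- cons
  - artifacts in materials and object shapes
CGAN-based approaches
- SpatialTransformer-GAN, 2018: iteratively adjusts the scene geometry to find a plausible placement of the foreground object within the background;
  - In fact, the foreground object and the background are separate images; what the transformer controls and updates is an affine transform (warp, translation, etc.) of the 2D foreground image
  - Affine transforms only; no relighting or similar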
- Learning Hierarchical Semantic Image Manipulation through Structured Representations. arXiv, 2018.
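  - generates images from a hierarchical representation, so objects can be added or removed
  - cons: purely network-based image generation (no explicit 3D object representation; every object is represented by the network) struggles with complex physical effects such as lighting changes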
- Cut-and-paste neural rendering. arXiv, 2020.
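  - brings graphics knowledge into a data-driven method

Video synthesis and manipulation (all 2D)

Image synthesis alone is not sufficient; -> the approach to extending image synthesis directly to video synthesis: add regularization to enforce temporal consistency
Conditional video generation: takes segmentation masks, depth maps, or pose trajectory data as input
Relate: Physically plausible multi-object scene synthesis using structured latent spaces. NeurIPS, 2020.
- read previously; GQN-based; to be revisited later
Automatic video manipulation methods: insert foreground objects into existing videos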
- Inserting videos into videos. In CVPR, 2019.
- Inserting virtual static object with geometry consistency into real video.

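3D reconstruction and view synthesis

image-based rendering methods
appearance flow
encode geometric info in latent representation & strong shape priors
recent differentiable rendering, and the open-source libraries for it, turn the traditional rendering pipeline into a differentiable, optimizable model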

Overview

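Realistic 3D assets creation: in other words, learning-based reconstruction from self-driving point cloud and image datasets

What it uses

Data
- camera + LiDAR (C + L) self-driving dataset with 3D bbox annotations (weak supervision)
Techniques
- the SOTA object segmentation method PointRend is used to produce segmentation GT

The asset bank is built directly from data collected by self-driving vehicles rather than designed by hand

Main starting point: many self-driving datasets are now open source, each containing thousands of assets that could potentially be reconstructed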
- Argoverse: 3d tracking and forecasting with rich maps. In CVPR, 2019.
- Scalability in perception for autonomous driving: Waymo open dataset. In CVPR, 2020.
- The kitti dataset. IJRR, 2013.

3D bbox weak supervision: its role is to pick out the LiDAR points that fall inside each vehicle's bbox
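The bbox cropping step can be sketched as below. This is a minimal pure-Python illustration, assuming each box is given as a center, a (length, width, height) size, and a yaw angle around the z axis; the function name `points_in_box` and this box parameterization are assumptions for illustration, not the paper's code.

```python
import math

def points_in_box(points, center, size, yaw):
    """Return the points that fall inside a yaw-rotated 3D box.

    points: list of (x, y, z); center: (cx, cy, cz);
    size: (length, width, height); yaw: rotation around z, in radians.
    The box format is an assumption made for this sketch.
    """
    cx, cy, cz = center
    length, width, height = size
    cos_y, sin_y = math.cos(yaw), math.sin(yaw)
    inside = []
    for x, y, z in points:
        dx, dy, dz = x - cx, y - cy, z - cz
        # Rotate the offset by -yaw to express it in the box frame.
        bx = cos_y * dx + sin_y * dy
        by = -sin_y * dx + cos_y * dy
        if abs(bx) <= length / 2 and abs(by) <= width / 2 and abs(dz) <= height / 2:
            inside.append((x, y, z))
    return inside
```

Running this for every frame in which the same object is annotated, and moving the cropped points into the box frame, would give the per-object multi-view point cloud that the reconstruction network consumes.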

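Network architecture:

- multi-view image features are aggregated directly with max pooling
- multi-view point clouds are merged in a common canonical frame and passed through PointNet to obtain features
- image features and point cloud features are concatenated and fed to an MLP that predicts vertex deformations of a mean-shape mesh
Training: self-supervised (in effect it is supervised)
- silhouette supervision from segmentation masks
  - the predicted mask comes from a neural rendering operator applied to the mesh, which yields a differentiable mask
  - the mask supervision signal comes from the SOTA segmentation method PointRend
    - 🤔 Thoughts
      - for a NeRF-based representation to render a differentiable silhouette, a $\Psi(\text{SDF})$ reformulation is still needed; otherwise there is no well-defined surface boundary
- point cloud Chamfer distance (CD)
  - the predicted point cloud is simply the mesh vertices, and producing it is also differentiable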
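The Chamfer distance term can be illustrated with a small sketch. It assumes squared Euclidean distances and a mean over each point set, which is one common convention; the notes do not say which variant or weighting the paper uses, and `chamfer_distance` is a name chosen here.

```python
def chamfer_distance(pred, target):
    """Symmetric Chamfer distance between two point sets.

    pred: predicted mesh vertices, target: LiDAR points, both lists of (x, y, z).
    Uses squared distances and a mean reduction (an assumption of this sketch).
    """
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    def one_way(src, dst):
        # Mean squared distance from each source point to its nearest neighbor.
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(pred, target) + one_way(target, pred)
```

This brute-force version is quadratic in the number of points; it is meant to show the quantity being minimized, not to be used for training at scale.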

Geometry-aware image simulation

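What it uses

Data
- background scenes (C + L)
- HD map, in the form of a lane graph
Techniques
- trajectory generation methods
  - Interaction-Aware Probabilistic Behavior Prediction in Urban Environments. arXiv, 2018.
  - Trafficsim: Learning to simulate realistic multiagent behaviors. arXiv, 2021.
- 3D object detection and tracking

Considerations for a newly added object:

Geometric occlusions with other actors and the background
Whether its location and motion are plausible
Interaction with other dynamic agents, avoiding collisions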

scenario generation

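Uses the HD map [which contains lane positions in the BEV (bird's-eye view)];
Then, within the lane region, the vehicle placement is expressed as $(x,y,\theta)$ and converted to a 6D pose using the local ground elevation, which gives the new actor's pose in the initial frame
- $(x,y)$ is sampled at random from the lane region that lies within the camera's field of view
- $\theta$ is taken from the lane direction
Then the Intelligent Driver Model (IDM), fitted to a kinematic model following [26], gives the actor's pose in the subsequent frames while respecting the traffic flow
- [26] Autonomous drifting with onboard sensors. 2016
- 3D detection and tracking information is presumably used here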
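The placement and motion update described above can be sketched as follows. The IDM acceleration formula is the standard one; everything else is an assumption for illustration: the parameter defaults (`v0`, `T`, `a_max`, `b`, `s0`, `delta`), the flat-ground simplification (roll = pitch = 0), the straight-line integration along the lane heading, and the function names.

```python
import math

def se2_to_pose6d(x, y, theta, ground_z):
    """Lift a BEV pose (x, y, theta) to (x, y, z, roll, pitch, yaw).

    Assumption: locally flat ground, so roll = pitch = 0.
    """
    return (x, y, ground_z, 0.0, 0.0, theta)

def idm_acceleration(v, gap, v_lead, v0=15.0, T=1.5, a_max=1.0, b=2.0, s0=2.0, delta=4):
    """Standard IDM acceleration for a follower at speed v behind a leader.

    gap: bumper-to-bumper distance in meters; all defaults are hypothetical.
    """
    dv = v - v_lead  # closing speed toward the leader
    s_star = s0 + max(0.0, v * T + v * dv / (2.0 * math.sqrt(a_max * b)))
    return a_max * (1.0 - (v / v0) ** delta - (s_star / gap) ** 2)

def step(x, y, theta, v, gap, v_lead, dt=0.1):
    """Advance the inserted actor by one frame along its lane heading."""
    a = idm_acceleration(v, gap, v_lead)
    v_next = max(0.0, v + a * dt)  # never drive backwards
    x_next = x + v_next * math.cos(theta) * dt
    y_next = y + v_next * math.sin(theta) * dt
    return x_next, y_next, theta, v_next
```

In a fuller version, `gap` and `v_lead` would come from the tracked actors ahead in the same lane, and the heading would follow the lane graph instead of staying fixed.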
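note
- When selecting from the asset bank, it picks assets whose observed pose is close to the current viewing pose (to avoid large unseen regions) and whose observation distance is similar (to avoid low-resolution sources)
  - the detailed scoring method is described in the supplementary material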

Occlusion-aware neural rendering

novel-view warping

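📍 Using the depth map rendered from the mesh (effectively treated as a point cloud), reproject between views to warp the old pixels to their new positions in the current view
- Render the mesh in the target view to obtain the target-view depth map
- Reproject those depths into the source view stored in the bank to find the corresponding source pixels
- The texture map / pixels in the target view are then the source-view colors at those reprojected locations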
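The warping steps above can be sketched as below. This is a minimal pure-Python illustration, assuming a pinhole camera whose intrinsics are shared by both views, a known relative pose, and nearest-neighbor sampling; the names `warp_to_target`, `K`, `R`, and `t` are chosen here and are not from the paper.

```python
def warp_to_target(src_img, tgt_depth, K, R, t):
    """Inverse-warp source colors into the target view using target depth.

    src_img: H x W nested list of colors from the asset bank's source view.
    tgt_depth: H x W nested list of mesh depth rendered in the target view
               (None where the mesh does not cover the pixel).
    K: (fx, fy, cx, cy) pinhole intrinsics, shared by both views (assumption).
    R, t: rotation (3x3 nested list) and translation that map target-camera
          coordinates to source-camera coordinates.
    """
    fx, fy, cx, cy = K
    h, w = len(tgt_depth), len(tgt_depth[0])
    out = [[None] * w for _ in range(h)]
    for v in range(h):
        for u in range(w):
            d = tgt_depth[v][u]
            if d is None:
                continue
            # Back-project the target pixel to a 3D point in the target camera.
            p = ((u - cx) / fx * d, (v - cy) / fy * d, d)
            # Move the point into the source camera frame.
            q = [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]
            if q[2] <= 0:
                continue  # behind the source camera
            # Project into the source image and sample the nearest pixel.
            us = round(fx * q[0] / q[2] + cx)
            vs = round(fy * q[1] / q[2] + cy)
            if 0 <= vs < len(src_img) and 0 <= us < len(src_img[0]):
                out[v][u] = src_img[vs][us]
    return out
```

The sketch does not check whether the reprojected point was actually visible in the source view, so self-occluded regions would pick up wrong colors; that is one reason to prefer source views close to the target pose.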

shadow generation

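Because the geometry is known, the shadow generation of a graphics engine is used directly
- In effect, a light source directly above the object is chosen; the resulting shadow weights are then used to adjust the pixel brightness at each location
- Instead of the waving-stick approach, a cloudy HDRI is used to cast shadows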
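The brightness adjustment can be illustrated with a very small sketch. It assumes the graphics engine has already produced a per-pixel shadow weight in [0, 1], and that shadowed pixels are darkened multiplicatively; the `strength` parameter and the function name are hypothetical.

```python
def apply_shadow(image, shadow, strength=0.5):
    """Darken a grayscale image by a per-pixel shadow weight in [0, 1].

    image, shadow: H x W nested lists. strength is a hypothetical knob:
    a pixel with shadow weight 1 keeps (1 - strength) of its brightness.
    """
    return [
        [pix * (1.0 - strength * s) for pix, s in zip(img_row, sh_row)]
        for img_row, sh_row in zip(image, shadow)
    ]
```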

occlusion reasoning

Done simply by comparing depth maps;
A depth completion network completes the scene's depth map starting from the sparse point cloud data
- Learning joint 2d-3d representations for depth completion. In ICCV, 2019.
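The depth comparison can be sketched as follows, assuming a dense background depth map (for instance the output of the depth completion network) and a rendered depth map for the inserted object. The function name and the use of `None` for pixels the object does not cover are choices made for this sketch.

```python
def composite_with_occlusion(background, bg_depth, obj_color, obj_depth):
    """Paste object pixels only where the object is closer than the scene.

    background, obj_color: H x W nested lists of colors.
    bg_depth: H x W completed scene depth.
    obj_depth: H x W rendered object depth, None where the object is absent.
    Returns (composited image, binary visibility mask).
    """
    h, w = len(background), len(background[0])
    out = [row[:] for row in background]
    mask = [[0] * w for _ in range(h)]
    for v in range(h):
        for u in range(w):
            d = obj_depth[v][u]
            # The object is visible only if it is in front of the scene.
            if d is not None and d < bg_depth[v][u]:
                out[v][u] = obj_color[v][u]
                mask[v][u] = 1
    return out, mask
```

The returned mask is the kind of binary object mask that a later synthesis stage can take as input alongside the background and the rendered object.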

post-composition synthesis

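A post-processing synthesis network is used:
- architecture similar to an in-painting network:
  - Free-Form Image Inpainting with Gated Convolution. arXiv, 2019.
- inputs:
  - background image
  - rendered target object
  - binary mask of the target object
- training data:
  - trained on data from the target scenes, with segmentation masks produced by PointRend
  - plus data augmentation such as random occlusion, color perturbation, random contrast, and saturation
- loss:
  - perceptual loss: overall fidelity
    - Perceptual losses for real-time style transfer and super-resolution.
  - GAN loss: improves realism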