Feedforward novel view synthesis

Resolving Representation Ambiguity in Feedforward Novel View Synthesis Transformer via Semantic-Spatial Decoupling

A decoupled Transformer representation that separates RGB appearance and Plücker-ray geometry while preserving shared attention routing.

Yihang Wu¹,²*, Yihang Sun³*, Shaofeng Zhang⁴, Zuxuan Wu¹,², Junchi Yan³, Xiaosong Jia¹,², Yu-gang Jiang¹,²

¹ Institute of Trustworthy Embodied Artificial Intelligence (TEAI), Fudan University
² Shanghai Key Laboratory of Multimodal Embodied AI
³ Sch. of Artificial Intelligence & Sch. of Computer Science, Shanghai Jiao Tong University
⁴ University of Science and Technology of China

* Equal contribution. Corresponding author.


Generation Process

Interactive generation explorer

Two posed RGB inputs form the semantic source set, while three target Plücker-ray queries define the output views. The explorer links 3D camera geometry, branch routing, and the rendered trajectory so the feedforward generation process is visible at a glance.

Input views: the two-view input set.

Rendered trajectory: the three target queries rendered as novel views, generated from the sparse posed source views.
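To make the target Plücker-ray queries concrete, here is a minimal NumPy sketch of how a per-pixel Plücker-ray map can be derived from a camera pose. It assumes a pinhole camera with intrinsics `K` and a camera-to-world matrix, and it orders the six channels as (direction, moment); the function name `plucker_rays` and this channel ordering are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def plucker_rays(K, cam_to_world, height, width):
    """Return an (H, W, 6) map of Plücker coordinates (d, o x d) per pixel.

    K: (3, 3) pinhole intrinsics; cam_to_world: (4, 4) camera-to-world pose.
    The (direction, moment) channel order is an assumption of this sketch.
    """
    # Pixel centers in homogeneous image coordinates.
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)        # (H, W, 3)

    # Back-project to camera-frame directions, then rotate into world frame.
    dirs_cam = pix @ np.linalg.inv(K).T
    dirs = dirs_cam @ cam_to_world[:3, :3].T
    dirs /= np.linalg.norm(dirs, axis=-1, keepdims=True)    # unit directions

    # Every ray of a pinhole camera starts at the camera center o.
    origin = np.broadcast_to(cam_to_world[:3, 3], dirs.shape)
    moment = np.cross(origin, dirs)                         # o x d
    return np.concatenate([dirs, moment], axis=-1)
```

Because the map depends only on the camera parameters, a target view can be described by its rays alone, without any RGB content, which is what lets a ray map act as a query for an unseen view.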

Abstract

Decoupling appearance and geometry

Transformer-based models have advanced feedforward novel view synthesis (NVS). Current architectures such as GS-LRM and LVSM mix semantic information (e.g., RGB) and spatial information (e.g., Plücker rays) into a shared feature space. Because Plücker rays naturally carry lattice-like spatial structure, this mixing can let the spatial bias interfere with appearance representation and degrade rendering fidelity. To address this, we propose to decouple the representation of feedforward NVS transformers into separate semantic and spatial tokens. The decoupled design keeps semantic and spatial information explicit in their own branches while preserving cross-branch interaction through shared attention routing. Built on this design, we introduce optional categorized supervision and bidirectional modulation: the former provides branch-specific training signals, while the latter improves interaction between the two branches. Notably, the base decoupled design adds virtually no inference latency. The proposed designs yield consistent improvements across both decoder-only and encoder-decoder feedforward NVS models.

Feature Analysis

Where grid-like artifacts come from

The visualizations isolate how Plücker-ray structure propagates through intermediate features and why separating semantic and spatial streams reduces representation ambiguity.

[Figure: RGB-zero Plücker-ray control feature visualization]
**RGB-zero control.** Setting the input RGB values to zero while keeping the Plücker rays unchanged still produces grid-like patterns in intermediate features, indicating that the spatial ray representation alone can induce the artifacts.

[Figure: Layer-wise semantic and spatial feature visualization]
**Layer-wise branch visualization.** Semantic and spatial branches remain separately organized across layers for both input and target views, making the feature roles easier to interpret than a fully entangled token stream.
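The RGB-zero control described above can be reproduced on the input side with a few lines. The sketch below assumes a baseline-style input in which the RGB image and the Plücker-ray map are concatenated channel-wise into a 9-channel tensor before patchification; the helper name `rgb_zero_control` and that input layout are assumptions made for illustration.

```python
import numpy as np

def rgb_zero_control(rgb, rays):
    """Build the control input: a black image paired with the original rays.

    rgb:  (H, W, 3) image.
    rays: (H, W, 6) Plücker-ray map for the same view.
    Returns the (H, W, 9) concatenation with the RGB channels set to zero.
    """
    return np.concatenate([np.zeros_like(rgb), rays], axis=-1)
```

Feeding both the original input and this control through the same network and comparing intermediate features is one way to check whether a pattern survives without any appearance signal.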

Method

Separate branches, shared routing

RGB patches initialize semantic tokens, while Plücker-ray patches initialize spatial tokens. Each decoupled Transformer block computes shared Q/K attention over the full token (semantic and spatial parts together) for cross-view coordination, but applies branch-specific value projections, output projections, layer normalization, and FFNs, so appearance and geometry are not forced into the same update channel.

[Figure: Semantic-Spatial Decoupling architecture diagram]
The base decoupled design keeps the feedforward inference path compact. Optional categorized supervision trains the semantic branch with DINOv3/iREPA features and the spatial branch with DA3-derived correspondences, while bidirectional modulation adds controlled cross-branch conditioning.
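One possible reading of the base decoupled block is sketched below in NumPy. It is a single-head, pre-norm simplification and not the authors' implementation: the choice to form Q/K from the concatenated semantic and spatial tokens, the ReLU FFN, the parameter names, and the helpers `init_params` and `decoupled_block` are all assumptions. Categorized supervision and bidirectional modulation are omitted.

```python
import numpy as np

def layer_norm(x, gain, eps=1e-6):
    x = x - x.mean(-1, keepdims=True)
    return gain * x / np.sqrt((x ** 2).mean(-1, keepdims=True) + eps)

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def init_params(dim, rng):
    """Hypothetical parameter layout: shared Q/K, per-branch everything else."""
    def w(n_in, n_out):
        return rng.standard_normal((n_in, n_out)) / np.sqrt(n_in)
    params = {"wq": w(2 * dim, dim), "wk": w(2 * dim, dim)}
    for branch in ("sem", "spa"):
        params[branch] = {
            "g1": np.ones(dim), "g2": np.ones(dim),      # branch-specific norms
            "wv": w(dim, dim), "wo": w(dim, dim),        # branch-specific V / out
            "w1": w(dim, 4 * dim), "w2": w(4 * dim, dim) # branch-specific FFN
        }
    return params

def decoupled_block(sem, spa, p):
    """sem, spa: (N, dim) semantic and spatial tokens for the same N patches."""
    sem_n = layer_norm(sem, p["sem"]["g1"])
    spa_n = layer_norm(spa, p["spa"]["g1"])

    # Shared routing: one attention map computed from the full token.
    full = np.concatenate([sem_n, spa_n], axis=-1)
    q, k = full @ p["wq"], full @ p["wk"]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))       # (N, N)

    outputs = []
    for name, x, x_n in (("sem", sem, sem_n), ("spa", spa, spa_n)):
        b = p[name]
        # Each branch mixes its own values with the shared attention map.
        x = x + attn @ (x_n @ b["wv"]) @ b["wo"]
        hidden = np.maximum(layer_norm(x, b["g2"]) @ b["w1"], 0.0)
        outputs.append(x + hidden @ b["w2"])
    return outputs[0], outputs[1]
```

In this sketch the attention map is the only place where the two branches meet inside a block: changing the semantic value projection leaves the spatial output of that block untouched, which is the sense in which the update channels stay separate.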

Controlled Results

Fixed-budget comparisons

Baselines and decoupled variants use the same codebase, data split, view sampling, 256×256 resolution, 50K-step schedule, and training budget. The numbers therefore isolate the effect of decoupling in a controlled reimplementation setting.

Main controlled reimplementation results

| Architecture | Dataset | Model | PSNR ↑ | SSIM ↑ | LPIPS ↓ |
| --- | --- | --- | --- | --- | --- |
| Decoder-only | RE10K | Baseline | 26.10 | 0.839 | 0.144 |
| Decoder-only | RE10K | Ours (full) | 27.21 | 0.869 | 0.125 |
| Decoder-only | Objaverse | Baseline | 23.75 | 0.864 | 0.150 |
| Decoder-only | Objaverse | Ours (full) | 26.46 | 0.899 | 0.101 |
| Encoder-decoder | RE10K | Baseline | 24.06 | 0.775 | 0.206 |
| Encoder-decoder | RE10K | Ours (full) | 25.31 | 0.806 | 0.154 |
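The per-setting gains implied by the table can be read off with a short script. The numbers below are copied from the table above; the dictionary layout and the helper name `gains` are only illustrative.

```python
# (architecture, dataset) -> {model: (PSNR, SSIM, LPIPS)}, copied from the table.
results = {
    ("Decoder-only", "RE10K"): {
        "Baseline": (26.10, 0.839, 0.144), "Ours (full)": (27.21, 0.869, 0.125)},
    ("Decoder-only", "Objaverse"): {
        "Baseline": (23.75, 0.864, 0.150), "Ours (full)": (26.46, 0.899, 0.101)},
    ("Encoder-decoder", "RE10K"): {
        "Baseline": (24.06, 0.775, 0.206), "Ours (full)": (25.31, 0.806, 0.154)},
}

def gains(table):
    """Return Ours minus Baseline for each setting, rounded to table precision."""
    out = {}
    for key, rows in table.items():
        base, ours = rows["Baseline"], rows["Ours (full)"]
        d_psnr, d_ssim, d_lpips = (o - b for o, b in zip(ours, base))
        out[key] = (round(d_psnr, 2), round(d_ssim, 3), round(d_lpips, 3))
    return out

for (arch, data), (dp, ds, dl) in gains(results).items():
    # Higher is better for PSNR and SSIM; lower is better for LPIPS.
    print(f"{arch:16s}{data:10s}PSNR {dp:+.2f}  SSIM {ds:+.3f}  LPIPS {dl:+.3f}")
```

For the decoder-only model this works out to a PSNR gain of 1.11 on RE10K and 2.71 on Objaverse, and for the encoder-decoder model to 1.25 on RE10K, with LPIPS decreasing in all three settings.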

Citation

BibTeX

Paper: https://arxiv.org/abs/2605.18599

@misc{wu2026resolvingrepresentationambiguityfeedforward,
      title={Resolving Representation Ambiguity in Feedforward Novel View Synthesis Transformer via Semantic-Spatial Decoupling},
      author={Yihang Wu and Yihang Sun and Shaofeng Zhang and Zuxuan Wu and Junchi Yan and Xiaosong Jia and Yu-gang Jiang},
      year={2026},
      eprint={2605.18599},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2605.18599},
}