SceneScript treats 3D reconstruction as a language problem rather than a geometry one. The model watches a video of a room and learns to write a script for it: it autoregressively spits out text commands like make_wall(...) or make_bbox(...) that define the scene. Stanford's new "Scene Language" paper goes a step further, adding CLIP embeddings to capture visual appearance too. The fact that language models already understand spatial relationships well enough to write out scene graphs is pretty wild.
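To make the "scene as a script" idea concrete, here is a rough sketch of how a downstream tool might turn emitted commands back into structured data. The command names make_wall and make_bbox come from the post; everything else is an assumption for illustration: the name(key=value, ...) syntax, the parameter names (a_x, height, w, and so on), and the helpers parse_command and parse_script are hypothetical and not SceneScript's actual format or API.

```python
import re

# Assumed (hypothetical) syntax: name(key=value, key=value, ...)
COMMAND_RE = re.compile(r"^\s*(\w+)\s*\((.*)\)\s*$")


def parse_command(line):
    """Parse one command line into (name, {param: float})."""
    match = COMMAND_RE.match(line)
    if match is None:
        raise ValueError(f"not a scene command: {line!r}")
    name, arg_text = match.groups()
    params = {}
    for part in arg_text.split(","):
        part = part.strip()
        if not part:
            continue
        key, _, value = part.partition("=")
        # All values are treated as floats in this sketch, ids included.
        params[key.strip()] = float(value)
    return name, params


def parse_script(text):
    """Parse a multi-line script, skipping blank lines."""
    return [parse_command(line) for line in text.splitlines() if line.strip()]


# Made-up example output; the parameter names are illustrative only.
script = """
make_wall(id=0, a_x=0.0, a_y=0.0, b_x=4.0, b_y=0.0, height=2.7)
make_bbox(id=1, x=1.0, y=2.0, z=0.5, w=0.8, d=0.8, h=1.0)
"""

scene = parse_script(script)
for name, params in scene:
    print(name, params)
```

The point of the sketch is that once the scene is plain text, ordinary string tooling is enough to read, diff, or edit it, which is a large part of the appeal of framing reconstruction as language generation.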