Neural Neural Textures Make Sim2Real Consistent

Ryan Burgert, Jinghuan Shang, Xiang Li, Michael S. Ryoo

Stony Brook University

To appear at CoRL 2022

[Paper] [Code-Soon]

We propose TRITON (Texture Recovering Image Translation Network): an unpaired image translation algorithm that takes the UV map and object labels of a 3D scene and renders a realistic image. TRITON combines differentiable rendering with image translation to achieve temporal consistency over indefinite timescales, using surface consistency losses and neural neural textures.

Check out the videos below to see how TRITON works!


Generate realistic images from unseen viewpoints

TRITON was trained using two views (the rows labeled "Camera1" and "Camera2"), but was also evaluated on an unseen camera angle ("Unseen Camera"). "Sim (UVL)" is the input image containing UV maps and labels, and "Real GT" shows real-world photographs of the robot arm in matching poses.

Robot policy trained by sim2real

TRITON enables a robot reacher task. In this sim2real experiment, we train a behavioral cloning policy that takes a single RGB image from a fixed camera in the simulator and deploy it directly on the real robot without further fine-tuning. The policy predicts the locations of all target objects simultaneously and is trained using only 2,000 photorealistic images generated by TRITON. Check out the demo video below.
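To make this setup concrete, here is a minimal sketch of an image-to-locations behavioral cloning policy in PyTorch. The architecture and the `ReacherPolicy` name are our own assumptions for illustration, not the paper's exact model:

```python
import torch
import torch.nn as nn

class ReacherPolicy(nn.Module):
    """Hypothetical sketch: a CNN that maps one RGB image to the 2D
    locations of N target objects, trainable by behavioral cloning on
    translator-generated images."""

    def __init__(self, num_objects: int = 3):
        super().__init__()
        self.num_objects = num_objects
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 16, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
        )
        # Regression head: one (x, y) pair per object, predicted jointly.
        self.head = nn.Linear(32, num_objects * 2)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W) -> (B, num_objects, 2)
        return self.head(self.backbone(image)).view(-1, self.num_objects, 2)

policy = ReacherPolicy(num_objects=3)
locations = policy(torch.randn(2, 3, 64, 64))  # batch of 2 images
```

Training would then be a standard regression loop, e.g. an MSE loss between predicted and ground-truth object locations on the generated images.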

Apply textures to objects from sim

TRITON makes simulated images realistic, while being more consistent than other image translation algorithms. The top row of the video shows input images, and the bottom row shows TRITON's output images. None of these object placements were seen in real life. Note how the object surfaces stay consistent throughout the video: the cubes keep slight shadows beneath them, and the apple and soda cans remain shiny.

Deformable object

TRITON also works on deformable objects like this American flag.

Comparison with other methods

We compare TRITON to other image translation algorithms by moving the objects around. Although each individual frame may look realistic, CycleGAN and CUT let the top of each cube shift randomly between frames, whereas it stays the same with TRITON.

TRITON recovered textures

From TRITON's outputs, we can recover a texture for each object. This image shows the three recovered texture sets corresponding to the video above.

Moreover, TRITON can apply any learned texture to any other object by assigning that texture's label to the object.


TRITON's goal is to turn simulated 3D renderings into realistic fake photographs, trained without any matched image pairs, while maintaining high surface consistency. It does this by simultaneously learning both an image translator and a set of realistic textures: TRITON adds a learnable neural neural texture and two novel surface consistency losses to an existing image translator.
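As a rough illustration of the surface consistency idea (not the paper's exact formulation), one such loss can penalize the translator's output wherever it disagrees with the learned texture at the visible surface points. The function name and tensor layouts below are assumptions:

```python
import torch

def surface_consistency_loss(translated, texture_fn, uv_map, mask):
    """Hypothetical sketch of a surface consistency loss.

    translated: (B, 3, H, W) image produced by the translator
    texture_fn: callable mapping UV coords (..., 2) -> RGB (..., 3)
    uv_map:     (B, H, W, 2) UV coordinate of the surface under each pixel
    mask:       (B, 1, H, W) 1.0 where a textured surface is visible
    """
    # Render the learned texture at each pixel's surface coordinate...
    rendered = texture_fn(uv_map).permute(0, 3, 1, 2)  # (B, 3, H, W)
    # ...and penalize disagreement with the translated image on visible surfaces.
    return (mask * (translated - rendered).abs()).mean()
```

Because the texture is queried through the UV map, the same surface point is compared against the same texture value in every frame, which is what drives temporal consistency.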

See the figure below for how each loss is applied, and please refer to our paper for more details.

The code will be released soon!

Neural Neural Textures

Previous works called such learnable textures "neural textures" and parameterized them with a discrete grid of differentiable texels. In contrast, we call ours neural neural textures, because the textures themselves are represented by a neural network, parameterized continuously over UV space. Using this representation instead of discrete texels lets TRITON learn faster and yields better results.
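To illustrate the idea, a texture can be represented as a small MLP over continuous UV coordinates, with a Fourier positional encoding so it can capture high-frequency detail. The architecture below is a hypothetical sketch, not the paper's exact network:

```python
import torch
import torch.nn as nn

class NeuralNeuralTexture(nn.Module):
    """Hypothetical sketch: a texture as a continuous function of UV
    coordinates, rather than a discrete grid of texels."""

    def __init__(self, num_freqs: int = 6, hidden: int = 64):
        super().__init__()
        self.num_freqs = num_freqs
        in_dim = 2 * 2 * num_freqs  # sin/cos for each of (u, v) per frequency
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 3), nn.Sigmoid(),  # RGB in [0, 1]
        )

    def encode(self, uv: torch.Tensor) -> torch.Tensor:
        # Fourier features let the MLP represent fine texture detail.
        freqs = 2.0 ** torch.arange(self.num_freqs, device=uv.device) * torch.pi
        angles = uv.unsqueeze(-1) * freqs              # (..., 2, F)
        enc = torch.cat([angles.sin(), angles.cos()], dim=-1)
        return enc.flatten(start_dim=-2)               # (..., 4F)

    def forward(self, uv: torch.Tensor) -> torch.Tensor:
        # uv: (..., 2) continuous coordinates in [0, 1] -> (..., 3) RGB
        return self.mlp(self.encode(uv))

texture = NeuralNeuralTexture()
uv = torch.rand(1024, 2)  # query at arbitrary continuous UV points
rgb = texture(uv)         # (1024, 3)
```

Because the function is defined everywhere on the UV square, it can be queried at any resolution without interpolation artifacts, unlike a fixed texel grid.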