Test-Time Canonicalization by Foundation Models for Robust Perception (ICML 2025)
- Utkarsh Singhal*, UC Berkeley
- Ryan Feng*, University of Michigan
- Stella X. Yu, University of Michigan
- Atul Prakash, University of Michigan
TL;DR
Test-time search makes models more robust to natural input variations by converting varied versions of an input into a 'typical', canonical version.
Abstract
Real-world visual perception requires invariance to diverse transformations, yet current methods rely heavily on specialized architectures or training on predefined augmentations, limiting generalization. We propose FOCAL, a test-time, data-driven framework that achieves robust perception by leveraging internet-scale visual priors from foundation models. By generating and optimizing candidate transformations toward visually typical, “canonical” views, FOCAL enhances robustness without retraining or architectural changes. Experiments demonstrate improved robustness of CLIP and SAM across challenging transformations, including 2D/3D rotations, illumination shifts (contrast and color), and day-night variations. We also highlight potential applications in active vision. Our approach challenges the assumption that transform-specific training is necessary, instead offering a scalable path to invariance.
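To make the idea concrete, below is a minimal sketch of test-time canonicalization over a small grid of 2D rotations. It assumes a user-supplied `typicality_score` as a stand-in for the foundation-model visual prior; the names `canonicalize` and `typicality_score` are illustrative, and the actual method optimizes candidate transformations rather than enumerating a fixed grid.

```python
# Minimal sketch of test-time canonicalization (hypothetical helper names; the
# candidate set here is a small grid of 2D rotations, and `typicality_score` is a
# placeholder for a foundation-model visual prior such as a CLIP-based score).
from typing import Callable, Iterable

import torch
from torchvision.transforms import functional as TF


def canonicalize(
    image: torch.Tensor,                                 # (C, H, W) input image
    angles: Iterable[float],                             # candidate rotation angles (degrees)
    typicality_score: Callable[[torch.Tensor], float],   # higher = more "typical" view
) -> torch.Tensor:
    """Return the candidate view that the visual prior rates as most typical."""
    best_view, best_score = image, typicality_score(image)
    for angle in angles:
        candidate = TF.rotate(image, angle)   # generate one candidate transformation
        score = typicality_score(candidate)   # score it with the visual prior
        if score > best_score:
            best_view, best_score = candidate, score
    return best_view


if __name__ == "__main__":
    # Toy usage with a dummy scorer; swap in a real foundation-model typicality
    # score to canonicalize before running a downstream model such as CLIP or SAM.
    img = torch.rand(3, 224, 224)
    dummy_score = lambda x: -float(x.std())   # placeholder prior, not the paper's objective
    canonical = canonicalize(img, angles=[90, 180, 270], typicality_score=dummy_score)
    print(canonical.shape)
```

The downstream model then only ever sees the selected canonical view, which is how robustness is gained without retraining or architectural changes.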
Citation
Acknowledgements
This project was supported, in part, by a Beyster Fellowship to R. Feng, by NSF 2215542, NSF 2313151, and by Bosch gift funds to S. Yu at UC Berkeley and the University of Michigan.
The website template was borrowed from the Fourier Feature Networks project page and Michaël Gharbi.