VLM-IRP

Input representations for spatial reasoning in vision-language models.