We use a language model (LM) to aggregate the outputs of two or more vision-language models (VLMs). Our model ensemble approach is named Cola (COordinative LAnguage model for visual reasoning). Cola is most ...
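As a rough illustration of the aggregation idea only, the sketch below collects each VLM's answer into one text prompt and passes it to an LM for a final answer. The function names (`build_aggregation_prompt`, `aggregate`), the prompt layout, and the stub model are all hypothetical and are not taken from the Cola paper or code.

```python
from typing import Callable, Dict


def build_aggregation_prompt(question: str, vlm_answers: Dict[str, str]) -> str:
    """Format a question and per-VLM answers as a single LM prompt.

    The layout here is an assumption for illustration, not Cola's actual template.
    """
    lines = [f"Question: {question}"]
    for name, answer in vlm_answers.items():
        lines.append(f"{name} answers: {answer}")
    lines.append("Final answer:")
    return "\n".join(lines)


def aggregate(
    question: str,
    vlm_answers: Dict[str, str],
    language_model: Callable[[str], str],
) -> str:
    """Ask the LM (any callable mapping prompt text to output text) to decide."""
    prompt = build_aggregation_prompt(question, vlm_answers)
    return language_model(prompt).strip()


def stub_language_model(prompt: str) -> str:
    """Hypothetical stand-in for a real LM: echoes the first VLM's answer."""
    for line in prompt.splitlines():
        if " answers: " in line:
            return line.split(" answers: ", 1)[1]
    return ""


if __name__ == "__main__":
    answers = {"VLM-A": "a red bus", "VLM-B": "a red truck"}
    print(aggregate("What vehicle is shown?", answers, stub_language_model))
```

In practice the stub would be replaced by a call to a real language model, and the VLM answers would come from running each VLM on the same image and question.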