Ferret is a multimodal large language model (MLLM) developed by Apple. It is designed to excel in both image understanding and language processing tasks, with a particular focus on spatial referring and grounding.
The key feature of Ferret is its ability to refer and ground anything anywhere at any granularity within an image. Referring means Ferret can interpret a region the user points to, whether it is given as a point, a bounding box, or a free-form shape; grounding means it can output the coordinates of the objects it mentions in its own answers. Together, these let Ferret identify and locate specific objects, regions, or even fine-grained details within an image based on natural language instructions.
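To make this concrete, here is a minimal sketch of what such an exchange can look like in text form. The prompt template and the pixel-coordinate convention below are illustrative assumptions, not the paper's exact format.

```python
# A minimal sketch of a refer-and-ground exchange expressed as text.
# The prompt template and coordinate convention are assumptions,
# not taken verbatim from the Ferret paper.

def format_refer_prompt(question: str, box: tuple[int, int, int, int]) -> str:
    """Embed a referred region as discrete coordinates inside the prompt."""
    x1, y1, x2, y2 = box
    return f"{question} [{x1}, {y1}, {x2}, {y2}]"

# Referring: the user points at a region of the image via coordinates.
prompt = format_refer_prompt("What is the animal in this region doing?",
                             (150, 80, 310, 260))
# -> "What is the animal in this region doing? [150, 80, 310, 260]"

# Grounding: the model's answer localizes the objects it mentions, e.g.
# "The dog [150, 80, 310, 260] is chasing a ball [320, 240, 380, 300]."
```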
To achieve this capability, Ferret employs a hybrid region representation that combines discrete coordinates with continuous features. The coordinates let a region be written directly into the text sequence, while the continuous features, extracted by a spatial-aware visual sampler, capture the visual content of regions of varying shape and sparsity, from single points to free-form outlines. This allows Ferret to represent and understand image regions in a unified manner.
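The following PyTorch sketch illustrates the idea under simplifying assumptions: coordinates are quantized into a fixed number of bins, and plain average pooling stands in for Ferret's learned spatial-aware visual sampler.

```python
import torch

def hybrid_region_representation(
    feature_map: torch.Tensor,  # (C, H, W) features from the image encoder
    box: tuple[float, float, float, float],  # normalized (x1, y1, x2, y2)
    num_bins: int = 1000,
) -> tuple[list[int], torch.Tensor]:
    """Sketch: pair discrete coordinate tokens with a continuous region feature.

    The bin count and the pooling choice are illustrative assumptions;
    Ferret itself uses a learned spatial-aware visual sampler rather than
    simple average pooling.
    """
    C, H, W = feature_map.shape
    x1, y1, x2, y2 = box

    # Discrete part: quantize each coordinate into one of `num_bins` tokens,
    # so the box can be written directly into the text sequence.
    coord_tokens = [int(v * (num_bins - 1)) for v in (x1, y1, x2, y2)]

    # Continuous part: pool the visual features that fall inside the region
    # (clamped so the slice is never empty).
    r1 = min(int(y1 * H), H - 1)
    r2 = max(int(y2 * H), r1 + 1)
    c1 = min(int(x1 * W), W - 1)
    c2 = max(int(x2 * W), c1 + 1)
    region_feature = feature_map[:, r1:r2, c1:c2].mean(dim=(1, 2))  # (C,)

    return coord_tokens, region_feature
```

The two parts are complementary: the discrete tokens give the language model an exact, text-native location, while the pooled feature conveys what the region actually looks like.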
Furthermore, Ferret is trained on a comprehensive refer-and-ground instruction tuning dataset called GRIT. This dataset contains roughly 1.1M samples spanning a hierarchy of spatial knowledge, from individual objects to relationships among objects, region descriptions, and region-based reasoning, along with hard negative examples that promote robustness, enabling Ferret to learn many different types of spatial references.
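As an illustration of that hierarchy, a training corpus in this style might organize samples roughly as follows; the schema and field names here are hypothetical, not GRIT's actual format.

```python
# Hypothetical sample layout, grouped by the level of spatial knowledge
# each sample exercises. The category names echo the hierarchy described
# for GRIT; the schema itself is an assumption.
grit_style_samples = [
    {"level": "object",  # ground a single noun phrase
     "instruction": "Where is the cat?",
     "answer": "The cat is at [40, 60, 200, 310]."},
    {"level": "relationship",  # reason about two referred regions
     "instruction": "What is the relation between [40, 60, 200, 310] "
                    "and [220, 90, 400, 330]?",
     "answer": "The cat is sitting next to the dog."},
    {"level": "region description",  # describe an arbitrary area
     "instruction": "Describe the area [0, 0, 256, 256].",
     "answer": "A sunny kitchen corner with a tiled floor."},
]
```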
Evaluations of Ferret demonstrate superior performance on classical referring and grounding tasks, and it outperforms existing MLLMs on region-based, localization-demanding multimodal chat. Moreover, Ferret shows notable improvements in describing image details and reduces object hallucination.
To learn more about Ferret and its capabilities in referring and grounding, see the full paper, "Ferret: Refer and Ground Anything Anywhere at Any Granularity" by Haoxuan You and collaborators, available on arXiv.