Inside Fuyu-8B: Adept's Super Innovative Multimodal Foundation Model for AI Agents

The model was designed for agent-based tasks and exhibits some unique capabilities for language and computer vision.

Adept.ai is part of the generation of newly minted AI unicorns. Initially incubated by some of the authors of the iconic transformer paper, Adept works in the area of autonomous AI agents. To date, Adept has raised over $415 million at a valuation exceeding $1 billion. The platform is dedicated to constructing agents that comprehend high-level objectives and convert them into actions, relying primarily on computer vision and language. Very little was known about the models behind Adept until now, when the company open-sourced Fuyu-8B, a smaller version of the model powering its platform.

Adept is on a mission to create a smart companion for knowledge workers, a digital copilot with a wide-ranging intellect. To achieve this goal, Adept places a strong emphasis on grasping user context and taking actions on the user’s behalf. A crucial part of this endeavor is robust image comprehension. In the world of knowledge work, users expect their copilot to seamlessly access what’s visible on their screens. Often, vital information is conveyed through images, be it charts, slides, or PDFs. Furthermore, executing actions often requires interacting with on-screen elements like buttons and menus. While it would be ideal if all these tasks could be accomplished via APIs, much business-oriented software lacks comprehensive APIs, making it necessary to navigate these applications through their graphical user interfaces (GUIs), just as users do.

In general, Fuyu-8B exhibits several characteristics that make it unique among the new generation of multimodal models:

1. Smaller and simpler than standard multimodal architectures.

2. Designed for the agent paradigm.

3. Fast.

4. Able to match larger models on standard benchmarks while surpassing them on agent-specific tasks.

The Architecture

When it comes to the latest generation of foundation models, most multimodal models share a common structure. They typically feature a distinct image encoder whose output is integrated into an existing Large Language Model (LLM) through cross-attention mechanisms or adapters. Examples are everywhere: models such as PaLM-E, PaLI-X, Qwen-VL, LLaVA 1.5, and Flamingo adhere to this paradigm. These models typically operate at fixed image resolutions. During inference, images exceeding this resolution must be downsized, while those with differing aspect ratios require padding or distortion.
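To make this conventional pattern concrete, here is a minimal, purely illustrative sketch in PyTorch; the module names, dimensions, and fixed resolution below are assumptions for exposition, not any specific model’s implementation:

```python
import torch
import torch.nn as nn

# Illustrative sketch of the conventional multimodal pattern: a separate
# image encoder feeds an adapter that projects visual features into the
# LLM's embedding space. All dimensions here are assumptions.
IMG_RES = 224                 # fixed input resolution; larger images must be resized or padded
ENC_DIM, LLM_DIM = 1024, 4096

image_encoder = nn.Sequential(            # stand-in for a pretrained ViT/CLIP-style encoder
    nn.Conv2d(3, ENC_DIM, kernel_size=14, stride=14),
    nn.Flatten(2),                        # -> (batch, ENC_DIM, num_patches)
)
adapter = nn.Linear(ENC_DIM, LLM_DIM)     # adapter/projection into the LLM

def visual_prefix(images: torch.Tensor) -> torch.Tensor:
    """images: (batch, 3, IMG_RES, IMG_RES) -> (batch, num_patches, LLM_DIM)."""
    feats = image_encoder(images).transpose(1, 2)    # (batch, num_patches, ENC_DIM)
    return adapter(feats)                            # later concatenated with text embeddings
```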

On the training front, many other multimodal models undergo a multi-step training process. The image encoder is trained separately from the LLM, often using contrastive training objectives, which can be intricate to implement and manage. Decisions must be made about when to freeze the weights of various components. Some models even include an additional high-resolution image phase to ensure competent handling of high-res images.

Scaling these models presents a challenge when determining how to proportionately scale each of these components. Questions arise about how to allocate additional parameters to the encoder versus the decoder and where to allocate computational resources during training. Adept, however, presents a model that sidesteps these complexities.

Architecturally, Fuyu is a straightforward, decoder-only transformer with the same specifications as Persimmon-8B, devoid of a dedicated image encoder. Image patches are directly projected into the transformer’s first layer, bypassing the embedding lookup. This approach treats the traditional transformer decoder as an image transformer, albeit without pooling and with causal attention. For more details, refer to the accompanying diagram.
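A minimal sketch of this input path may help; the patch size, hidden dimension, and vocabulary size below are illustrative assumptions rather than Adept’s released code:

```python
import torch
import torch.nn as nn

# Minimal sketch (not Adept's code) of Fuyu-style input handling:
# flattened image patches are linearly projected into the decoder's
# hidden dimension, bypassing the token-embedding lookup entirely.
# PATCH_SIZE, HIDDEN_DIM, and VOCAB_SIZE are illustrative assumptions.
PATCH_SIZE, HIDDEN_DIM, VOCAB_SIZE = 30, 4096, 262144

patch_projection = nn.Linear(3 * PATCH_SIZE * PATCH_SIZE, HIDDEN_DIM)
token_embedding = nn.Embedding(VOCAB_SIZE, HIDDEN_DIM)

def embed_inputs(image_patches: torch.Tensor, text_ids: torch.Tensor) -> torch.Tensor:
    """image_patches: (num_patches, 3*PATCH_SIZE*PATCH_SIZE); text_ids: (num_text_tokens,)."""
    image_embeds = patch_projection(image_patches)   # linear projection straight into layer 1
    text_embeds = token_embedding(text_ids)          # ordinary embedding lookup for text
    # Image "tokens" and text tokens form a single sequence for the causal decoder.
    return torch.cat([image_embeds, text_embeds], dim=0)
```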

This simplification offers the flexibility to handle images of varying resolutions effortlessly. To achieve this, image tokens are treated much like their textual counterparts. Image-specific position embeddings are removed, and image tokens are fed into the model in raster-scan order, with a special image-newline character indicating line breaks. The model can leverage its existing position embeddings to adapt to different image sizes. During training, images of any size can be used, obviating the need for separate high and low-resolution training stages.
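To illustrate the raster-scan ordering, here is a hedged sketch (again, not Adept’s implementation) of cutting an arbitrarily sized image into rows of patches so that a special image-newline embedding can be appended after each row; the patch size of 30 is an assumption:

```python
import torch

def image_to_patch_rows(image: torch.Tensor, patch_size: int = 30):
    """Split a (3, H, W) image into flattened patches in raster-scan order,
    grouped by row so an image-newline token can be inserted between rows.
    The patch size of 30 is an illustrative assumption."""
    _, h, w = image.shape
    rows = []
    for top in range(0, h - h % patch_size, patch_size):
        row = [
            image[:, top:top + patch_size, left:left + patch_size].reshape(-1)
            for left in range(0, w - w % patch_size, patch_size)
        ]
        rows.append(torch.stack(row))   # (patches_per_row, 3 * patch_size * patch_size)
    return rows

# When building the decoder's input sequence, each row of projected patches
# would be followed by the image-newline embedding, letting the model recover
# the 2-D layout from a 1-D token stream at any resolution.
```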

The Capabilities

In addition to the standard features expected in multimodal foundation models, Fuyu-8B exhibits an interesting set of unique capabilities:

QA in Images

Fuyu can answer intricate questions about the contents of an image, as demonstrated below:

Chart Comprehension

When faced with complex visual data, such as the chart depicted below, Fuyu excels at discerning intricate relationships, tracing connections between various elements, and counting to provide insightful answers:

Document Mastery

Fuyu’s competence extends to deciphering a wide range of documents, whether they involve intricate infographics or aging PDFs:

Diagram Interpretation

The model’s expertise also extends to deciphering complex scientific diagrams, addressing intricate relational queries with finesse:

OCR Proficiency

In addition to these skills, Adept has honed its internal models to excel in two essential tasks when presented with an image of a user interface (UI):

· bbox_to_text: Given a bounding box, the model returns the text contained within that bounding box.

· text_to_bbox: Conversely, when provided with a piece of text, the model returns the bounding box on the screen that contains that text.
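The released checkpoint can be tried directly. Below is a minimal sketch assuming the Hugging Face transformers integration (FuyuProcessor and FuyuForCausalLM) and the published adept/fuyu-8b checkpoint; the image URL and question are placeholders, and the exact prompt formats for the bounding-box tasks are documented in the model card rather than reproduced here.

```python
import requests
import torch
from PIL import Image
from transformers import FuyuForCausalLM, FuyuProcessor

# Load the open 8B checkpoint (requires a recent transformers release and
# enough memory for an 8B model; device_map="auto" needs accelerate installed).
processor = FuyuProcessor.from_pretrained("adept/fuyu-8b")
model = FuyuForCausalLM.from_pretrained("adept/fuyu-8b", device_map="auto")

# Placeholder screenshot and question, purely for illustration.
image = Image.open(requests.get("https://example.com/screenshot.png", stream=True).raw)
prompt = "What does the highlighted button do?\n"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, max_new_tokens=32)
answer = processor.batch_decode(
    output_ids[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```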

Fuyu-8B is certainly one of the most interesting recent releases among open source foundation models. The simplicity of its architecture and its unique set of capabilities make it one of the models to track in this space.