Published August 28th, 2024, Motiff Tech

MLLM by Motiff: Shaping the future of UI design

Wei Zhao
AI Lab Researcher
Jingming
AI Lab Researcher

Motiff aims to become a leading design tool in the AI era by focusing on two main areas: first, using AI to create innovative features that assist designers and their teams; second, ensuring that the AI technologies behind these features are robust enough to make the product truly effective.

Large language models have been evolving rapidly, demonstrating enhanced learning capabilities and greater generalization potential. These advancements offer fresh perspectives on AI applications, and the Motiff team is actively exploring these opportunities.

Reflecting on the past year, we have two key insights on the impact of large language models in AI product development:

  1. Language User Interfaces (LUI) are becoming crucial entry points for products, capable of handling complex tasks like AI-generated design drafts.
  2. Large language models make it possible to build AI features quickly and at low cost, unlike traditional domain-specific solutions that require high startup costs.

In practical applications, we've tried using general-purpose large models to tackle product challenges, but they often fall short in the specialized UI field. Unlike other companies, Motiff focuses on enhancing product capabilities, which has led us to develop specialized large models. The maturity of domain-specific models in fields like healthcare and law gives us confidence that a UI-specific model is achievable.

Through these efforts, we've confirmed the need for self-developed models. For instance, the Motiff team validated the AI-generated design draft feature with just 200 shots using a domain model. MLLM by Motiff reduces costs and enhances innovation efficiency in UI design.

We're excited to share our innovative progress in this area.

The Development of Multimodal Large Models

Multimodal large models like LLaVA and GPT-4V/4o have advanced rapidly, integrating diverse data types (text, images, videos) for improved understanding and accuracy. This progress results from collaboration between academia and industry. Initially focused on expanding input and output modalities, multimodal technologies are now moving into specialized domains, as seen in Microsoft's LLaVA-Med for medicine.

Despite the growth and availability of open-source models like LLaVA, applying multimodal models to specialized fields remains challenging, with many areas still unexplored. This context sets the stage for Motiff's self-developed UI multimodal large models (MLLM by Motiff).

The chart below illustrates the rapid development of multimodal large language models from 2022 to 2024.

General Framework of Multimodal Large Models

Training general multimodal models involves three stages, summarized in the sketch that follows this list:

  1. Independent Pre-training: Visual and language models are pre-trained separately, leveraging existing resources.
  2. Alignment Training: The visual and language models are aligned by fine-tuning a connector on "image-text pairs", translating visual features into natural language.
  3. Instruction Fine-tuning: The model is fine-tuned for specific tasks, with varied methods and data sources to enhance task performance.
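To make the division of labor between these stages concrete, here is a minimal Python sketch of which components are typically trainable in each stage. The module names are generic placeholders rather than any particular model's internals.

```python
import torch.nn as nn

def set_trainable(module: nn.Module, trainable: bool) -> None:
    """Freeze or unfreeze every parameter of a module."""
    for param in module.parameters():
        param.requires_grad = trainable

def configure_stage(stage: str, vision_encoder: nn.Module,
                    connector: nn.Module, llm: nn.Module) -> None:
    if stage == "alignment":
        # Stage 2: only the connector learns to map visual features
        # into the LLM's embedding space, using image-text pairs.
        set_trainable(vision_encoder, False)
        set_trainable(llm, False)
        set_trainable(connector, True)
    elif stage == "instruction_finetuning":
        # Stage 3: all components are updated end-to-end on task data.
        set_trainable(vision_encoder, True)
        set_trainable(connector, True)
        set_trainable(llm, True)
    else:
        # Stage 1 (independent pre-training) happens before the models
        # are combined, so there is nothing to configure here.
        raise ValueError(f"unknown stage: {stage}")
```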

MLLM by Motiff aims to innovate in UI design by leveraging these advancements.

MLLM Adaptation for the UI Domain

Training a multimodal large model (MLLM) specifically for the UI domain from scratch involves challenges such as limited domain-specific data and high costs. Instead of starting anew, we can adapt existing multimodal models to fit UI design needs.

Here's how we can refine and optimize:

  1. Adapt the visual and language models, initially trained on generic data, to incorporate domain-specific data.
  2. Enhance existing connectors with domain-specific data during alignment.
  3. Replace general instruction fine-tuning with UI-specific task fine-tuning.

Our experience shows that focusing on the later training stages yields better domain-specific performance, so our strategy prioritizes optimizing the latter two stages. Exploring how unimodal domain adaptation affects multimodal models remains an interesting avenue for future research at Motiff.

The Training Journey of MLLM by Motiff

Motiff's MLLM uses a classic expert model integration approach, linking a pre-trained Vision Encoder with a Large Language Model (LLM) via a Connector. As illustrated, images pass through the Vision Encoder and the Visual-Language Connector, which convert them into visual tokens that the LLM can process. These visual tokens, combined with text tokens, enable the LLM to generate a comprehensive text response, enhancing UI design interactions.
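For illustration, the following is a minimal PyTorch-style sketch of this pattern: a vision encoder produces patch features, a connector projects them into the LLM's embedding space, and the resulting visual tokens are concatenated with text tokens before decoding. The class, dimensions, and module interfaces are assumptions made for the example, not Motiff's implementation.

```python
import torch
import torch.nn as nn

class VisionLanguageModel(nn.Module):
    """Vision encoder + connector + LLM, wired as described above."""

    def __init__(self, vision_encoder: nn.Module, llm: nn.Module,
                 vision_dim: int, llm_dim: int):
        super().__init__()
        self.vision_encoder = vision_encoder        # image -> patch features
        self.connector = nn.Sequential(             # projects visual features
            nn.Linear(vision_dim, llm_dim),         # into the LLM's token
            nn.GELU(),                              # embedding space
            nn.Linear(llm_dim, llm_dim),
        )
        self.llm = llm                              # decoder-only language model

    def forward(self, image: torch.Tensor, text_embeds: torch.Tensor):
        # image: (batch, 3, H, W); text_embeds: (batch, text_len, llm_dim)
        patch_features = self.vision_encoder(image)     # (batch, n_patches, vision_dim)
        visual_tokens = self.connector(patch_features)  # (batch, n_patches, llm_dim)
        # Visual tokens are concatenated with the text tokens, and the
        # combined sequence is decoded into a text response by the LLM,
        # which here is assumed to accept embedding sequences directly.
        return self.llm(torch.cat([visual_tokens, text_embeds], dim=1))
```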

Data Collection in the UI Domain

High-quality UI domain data, especially for mobile platforms, is scarce. To address this, we used methods like manual annotation, pseudo-labeling, and domain knowledge distillation to gather high-quality UI data, categorized as follows:

Type 1: UI Screenshot Captions

This common multimodal training data is used during alignment and instruction fine-tuning.

Compared with natural scene images, UI screenshots contain more detail and require more reasoning to interpret.

Through a series of prompt-engineering steps, we generated descriptions similar to the following. These descriptions introduce each UI screenshot module by module from top to bottom, including layout styles, component names, key UI elements, and module functionalities. Finally, a comprehensive evaluation of the overall page design is provided.
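As a purely hypothetical illustration of the kind of prompt involved (Motiff's actual prompts are not disclosed in this post), a caption-generation instruction might look like this:

```python
# Hypothetical caption-generation prompt; the wording and structure are
# illustrative and not taken from Motiff's actual prompt engineering.
CAPTION_PROMPT = """You are an experienced UI designer. Describe the attached
UI screenshot module by module, from top to bottom. For each module, cover:
- its layout style (e.g. grid, list, card stack),
- the names of the components it contains,
- the key UI elements and any visible text,
- the module's function within the page.
Finish with a brief overall assessment of the page design."""
```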

Type 2: UI Screenshot Structured Captions

Influenced by the 2023 paper from Meta researchers, "LIMA: Less Is More for Alignment", we have moved away from the "more is better" approach.

Instead, when training the MLLM by Motiff, we incorporated a batch of high-quality, knowledge-intensive UI data. This data allows for precise localization and comprehensive understanding of every element on a UI interface, enabling the MLLM by Motiff to do the following (an illustrative structured caption is sketched after the list):

  1. Locate and identify more than 40 types of fine-grained UI components, such as buttons, chips, tabs, search bars, navigation bars, tab bars, lists, and cards.
  2. Locate and recognize all text.
  3. Locate each icon on the UI interface and describe its meaning and function in the current context.
  4. Locate all images on the UI interface and describe their content.
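To make this concrete, a structured caption in this spirit might look like the following sketch. The schema, field names, and values are hypothetical; they only illustrate the kind of information (component types, bounding boxes, text, icon meanings, image content) that such data captures.

```python
# Hypothetical structured caption for a single screenshot. The schema and
# field names are illustrative; Motiff's internal format is not public.
structured_caption = {
    "screen_size": [1080, 2340],          # pixel width, height
    "components": [
        {"type": "navigation_bar", "bbox": [0, 0, 1080, 180],
         "children": ["back_button", "page_title"]},
        {"type": "search_bar", "bbox": [40, 200, 1040, 320],
         "placeholder_text": "Search products"},
        {"type": "button", "bbox": [880, 210, 1030, 310], "text": "Filter"},
    ],
    "texts": [
        {"text": "My Orders", "bbox": [80, 40, 420, 120]},
    ],
    "icons": [
        {"name": "arrow_back", "bbox": [20, 40, 100, 120],
         "meaning": "Return to the previous screen"},
    ],
    "images": [
        {"bbox": [40, 360, 520, 840],
         "description": "Product photo of a pair of running shoes"},
    ],
}
```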

Type 3: UI Instruction Tuning Data

We leveraged successful experience from general domains to collect and construct a rich set of UI-related instruction-following data. This mainly covers tasks such as UI interface description, question answering based on UI interfaces, pixel-level interface element localization, fine-grained interface element description, and UI interaction guidance.

For the data generation task, we introduced several expert models, such as a component recognition expert model, an icon recognition expert model, and an OCR expert model. By combining the data generated by these expert models with our structured descriptions and feeding them into a private LLM, we are able to generate high-quality training data that is closer to real user scenarios.

Existing large language model solutions for the UI domain, such as Apple's Ferret-UI and Google's ScreenAI, typically rely on classifiers to generate static icon descriptions. In contrast, our approach merges the results from the icon recognition expert model with detailed structured text and feeds them into a private LLM. This integration allows the same icon to be described with different meanings depending on the context, thereby enhancing the accuracy and contextual relevance of the descriptions.
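The following sketch shows the general shape of such a data-generation step under our assumptions: expert-model outputs and the structured caption are merged into a prompt for a private LLM so that each icon is described in the context of its screen. All function names, object fields, and the `complete` call are placeholders, not an actual pipeline or API.

```python
# All names below (expert models, `screens`, `private_llm.complete`) are
# placeholders standing in for whatever components a pipeline like this uses.

def build_icon_description_prompt(structured_caption: dict, icon: dict,
                                  ocr_results: list[dict]) -> str:
    visible_text = [item["text"] for item in ocr_results]
    return (
        "Structured description of a UI screen:\n"
        f"{structured_caption}\n\n"
        f"Text detected on the screen: {visible_text}\n\n"
        f"The icon '{icon['name']}' appears at {icon['bbox']}. Describe what "
        "this icon means and does in the context of this specific screen, "
        "not just its generic meaning."
    )

def generate_icon_descriptions(screens, icon_expert, ocr_expert, private_llm):
    samples = []
    for screen in screens:
        icons = icon_expert(screen.image)   # icon recognition expert model
        ocr = ocr_expert(screen.image)      # OCR expert model
        for icon in icons:
            prompt = build_icon_description_prompt(
                screen.structured_caption, icon, ocr)
            samples.append({
                "image": screen.image,
                "instruction": f"What does the '{icon['name']}' icon do here?",
                "target": private_llm.complete(prompt),  # assumed completion API
            })
    return samples
```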

As illustrated in the image below, the upper section showcases data generated by ScreenAI, while the three sections below it display data generated by our method.

In addition to the three types of UI-related data mentioned above, we found that tasks such as chart Q&A, document Q&A, and OCR can also improve the understanding of UI interfaces.

Finally, to maintain general capabilities, we also included general-domain data such as natural scene image descriptions, natural scene image question answering, and generic text-based instructions.

In summary, we have collected tens of millions of multimodal training samples, including common app screenshots on the market and a large number of web screenshots, infusing the MLLM by Motiff with extensive UI expertise.

Domain Adaptation of MLLM by Motiff

In its foundational choices, MLLM by Motiff focuses specifically on the unique requirements of UI scenes. Unlike general natural scenes, UI interfaces contain a large number of fine-grained elements. We have therefore employed a visual encoder that supports high-resolution inputs.

This high-resolution processing capability enables the visual encoder to capture more details, significantly enhancing the model's ability to perceive the complex details of UI interfaces, thereby reducing the risk of blurring and misclassification caused by low-resolution images.

Through this series of optimizations, the model's accuracy and detail-handling capability when processing UI interfaces have been significantly improved.
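One common way open multimodal models support high-resolution input is to encode a downscaled global view together with fixed-size crops of the full-resolution image. This post does not describe Motiff's exact mechanism, so the sketch below should be read only as an illustration of that general idea.

```python
# Generic high-resolution handling: split the screenshot into fixed-size
# tiles plus a downscaled global view, and encode each crop separately.
# This is NOT a description of Motiff's actual encoder.
from PIL import Image

def tile_screenshot(image: Image.Image, tile_size: int = 336):
    """Return a low-resolution global view plus high-resolution tiles."""
    global_view = image.resize((tile_size, tile_size))
    tiles = []
    width, height = image.size
    for top in range(0, height, tile_size):
        for left in range(0, width, tile_size):
            box = (left, top, min(left + tile_size, width),
                   min(top + tile_size, height))
            tiles.append(image.crop(box).resize((tile_size, tile_size)))
    return [global_view] + tiles
```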

As previously mentioned, our domain migration training is currently applied in two stages:

Stage 1: Alignment Training — Introducing UI domain knowledge during the alignment training of visual models and large language models.

In this stage, we introduced two types of UI-related data. The first type is UI interfaces paired with their natural language descriptions, and the second is UI interfaces paired with their structured descriptions. The former is similar to descriptions of natural scene images, while the latter is unique to UI interfaces. For training stability, we trained only the connector at this stage and froze the visual model and the large language model.
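A minimal sketch of an alignment-stage training step under this freezing scheme is shown below: only the connector's parameters are handed to the optimizer, and the target is the screenshot's natural-language or structured description. The model interface, batch fields, and hyperparameters are assumptions made for illustration.

```python
import torch

def build_alignment_optimizer(model):
    # Vision encoder and LLM stay frozen; only the connector is optimized.
    return torch.optim.AdamW(model.connector.parameters(), lr=1e-3)

def alignment_step(model, batch, optimizer):
    # batch["image"]: screenshot tensor
    # batch["caption_ids"]: tokenized natural-language or structured description
    outputs = model(image=batch["image"], labels=batch["caption_ids"])
    loss = outputs.loss            # assumes an HF-style output with a .loss field
    loss.backward()
    optimizer.step()               # updates connector parameters only
    optimizer.zero_grad()
    return loss.item()
```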

Stage 2: Domain-Specific Instruction Fine-Tuning — Introducing UI domain knowledge through end-to-end training of the MLLM

In this stage, we trained on all task data, including general-domain text data, general-domain multimodal data, and UI-domain multimodal data. The goal was to enhance domain knowledge while maintaining general abilities.
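For illustration, mixing these three data sources can be as simple as weighted sampling, as in the sketch below. The mixing weights are invented for the example; the actual proportions used to train MLLM by Motiff are not disclosed here.

```python
import random

# Placeholder lists; each would hold training samples from that source.
DATASETS = {
    "general_text": [],        # general-domain text instructions
    "general_multimodal": [],  # general-domain image-text tasks
    "ui_multimodal": [],       # UI-domain data described in the previous section
}
# Invented mixing weights, for illustration only.
MIX_WEIGHTS = {"general_text": 0.2, "general_multimodal": 0.3, "ui_multimodal": 0.5}

def sample_batch(batch_size: int) -> list:
    names = list(DATASETS)
    weights = [MIX_WEIGHTS[name] for name in names]
    sources = random.choices(names, weights=weights, k=batch_size)
    return [random.choice(DATASETS[src]) for src in sources if DATASETS[src]]
```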

Performance Evaluation of MLLM by Motiff

We conducted a comprehensive evaluation of the MLLM by Motiff, comparing it with state-of-the-art (SOTA) models for interface-related tasks. The evaluation covered five common UI interface scenarios:

1. ScreenQA

ScreenQA [2] is a benchmark dataset for screen understanding proposed by Google DeepMind in 2022. This dataset aims to evaluate a model's understanding capabilities through question-answer pairs based on screenshots. The evaluation set includes approximately 8,419 manually annotated Q&A pairs, covering 3,489 screenshots from the Rico dataset.

As one of the most representative datasets available, ScreenQA not only provides rich visual information but also involves various elements and interaction methods within user interfaces.

Therefore, evaluating on the ScreenQA dataset effectively tests the model's overall ability to understand and answer interface-related questions.

2. Screen2Words

Screen2Words [3] is a screen summarization task designed specifically for mobile UI interfaces, proposed by researchers from the University of Toronto and Google Research. The main purpose of the task is to evaluate the model's ability to understand and describe a screen's important content and abstract functions.

The dataset contains screenshots from various application scenarios, along with corresponding textual descriptions. These descriptions include both explicit content of the interface (such as text and images) and abstract functions of the interface (such as the purpose of buttons and the main theme of the page).

By evaluating the Screen2Words dataset, we can gain deeper insights into the model's performance in generating natural language descriptions and inferring the functions of the interface.

3. RefExp

The RefExp [4] task evaluates the model's ability to precisely locate interface components. This task requires the model to accurately find the referenced component on the screen based on a given referring expression.

The evaluation dataset provides screenshots of mobile UI interfaces along with corresponding natural language descriptions that point to a specific interface element (such as a button, icon, input box, etc.).

The model needs to recognize and locate these elements within the screen image, which not only tests the model's understanding of natural language but also examines its capability in pixel-level visual parsing and precise localization.

The RefExp task has practical applications in voice control systems, such as smart assistants that can locate specific buttons or options on the screen based on the user's verbal instructions.

4. Widget Captioning

The Widget Captioning [5] task aims to evaluate the model's ability to generate natural language descriptions, specifically for various components within an interface. This task requires the model to produce brief and accurate descriptions of different UI components (such as buttons, icons, etc.).

The dataset includes common interface components from various applications along with their corresponding descriptive text. These descriptions need to precisely cover the visual characteristics of the components as well as reflect their functions and purposes.

This task helps test the model's ability to understand and generate semantically appropriate natural language descriptions, which is particularly valuable for practical applications in screen readers and assistive technologies.

5. MoTIF-Automation

The Mobile App Tasks with Iterative Feedback (MoTIF) [7] dataset is specifically designed to evaluate the model's ability to execute natural language instructions within mobile applications.

This task not only involves understanding natural language instructions but also requires the model to perform corresponding actions on the screen, such as clicking, typing, swiping, etc. These actions lead to changes in the interface state, thereby assessing the model's capability in dynamic interaction and feedback handling.

After providing a detailed introduction to each evaluation dataset, we will now showcase the evaluation results of the MLLM by Motiff on each dataset.

From the results, it is evident that in these five UI-related metrics, the general large language model (GPT-4) is noticeably weaker than the domain-specific models (Ferret-UI, ScreenAI, and MLLM by Motiff).

Additionally, the MLLM by Motiff significantly outperforms Apple's Ferret-UI model on these metrics, with its overall capabilities approaching those of Google's ScreenAI model, even surpassing ScreenAI in certain aspects.

Evaluation Results:

  1. MoTIF-Automation and RefExp: MLLM by Motiff scored 86.09 and 85.13 respectively, slightly below ScreenAI's 87.4 and 86.3 but still excellent.
  2. Screen2Words: MLLM by Motiff achieved a CIDEr score of 121.19, outperforming ScreenAI's 120.8 and showing superior performance in screen content description.
  3. ScreenQA Short: The model's F1 score was 93.03, just below ScreenAI's 94.6, demonstrating strong Q&A capability (the token-level F1 metric is sketched after this list).
  4. Widget Captioning: The model achieved a CIDEr score of 161.77, surpassing ScreenAI's 156.4, indicating a significant advantage in component description.
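For readers unfamiliar with the metrics: CIDEr measures how closely a generated description matches reference descriptions, while short-answer QA benchmarks are commonly scored with token-level F1. The sketch below shows the standard SQuAD-style token F1 computation; whether the cited ScreenQA Short score uses exactly this variant is an assumption.

```python
# SQuAD-style token-level F1, a common metric for short-answer QA benchmarks
# like ScreenQA Short. Shown only to clarify what an F1 score measures here.
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return float(pred_tokens == ref_tokens)
    common = Counter(pred_tokens) & Counter(ref_tokens)
    num_same = sum(common.values())
    if num_same == 0:
        return 0.0
    precision = num_same / len(pred_tokens)
    recall = num_same / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Example: partial overlap between prediction and reference.
print(token_f1("settings page", "the settings page"))  # 0.8
```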

Overall, MLLM by Motiff outperforms Apple's Ferret-UI and closely matches Google's ScreenAI, even surpassing it in some aspects.

Summary

The Motiff team is committed to leading in AI-related products, with the UI multimodal large language model being a pivotal step toward this goal.

The MLLM by Motiff allows us to quickly implement AI capabilities, integrate them into products, and gather user feedback. This feedback loop enhances the development of smarter, more efficient UI design tools in the AI era.

Human creativity arises from cognition and understanding. In the AI era, user interface creation will begin with large language models that fully comprehend these interfaces.

Looking forward, the Motiff team aims to utilize this model to make AI design tools more intelligent and efficient, enabling "unbounded creativity for designers".

References

[1] Yin S, Fu C, Zhao S, et al. A survey on multimodal large language models[J]. arXiv preprint arXiv:2306.13549, 2023.

[2] Hsiao Y C, Zubach F, Wang M. ScreenQA: Large-scale question-answer pairs over mobile app screenshots[J]. arXiv preprint arXiv:2209.08199, 2022.

[3] Wang B, Li G, Zhou X, et al. Screen2Words: Automatic mobile UI summarization with multimodal learning[C]//The 34th Annual ACM Symposium on User Interface Software and Technology. 2021: 498-510.

[4] Bai C, Zang X, Xu Y, et al. UIBert: Learning generic multimodal representations for UI understanding[J]. arXiv preprint arXiv:2107.13731, 2021.

[5] Li Y, Li G, He L, et al. Widget captioning: Generating natural language description for mobile user interface elements[J]. arXiv preprint arXiv:2010.04295, 2020.

[6] Burns A, Arsan D, Agrawal S, et al. A dataset for interactive vision-language navigation with unknown command feasibility[C]//European Conference on Computer Vision. Cham: Springer Nature Switzerland, 2022: 312-328.

[7] Burns A, Arsan D, Agrawal S, et al. Mobile app tasks with iterative feedback (MoTIF): Addressing task feasibility in interactive visual environments[J]. arXiv preprint arXiv:2104.08560, 2021.
