Agents centered around Large Language Models (LLMs) are now capable of
automating mobile device operations for users. After fine-tuning to learn a
user's mobile operations, these agents can follow high-level user instructions
online, performing goal decomposition, sub-goal sequencing, and interactive
environmental exploration until the final objective is achieved. However,
privacy concerns related to personalized user
data arise during mobile operations, requiring user confirmation. Moreover,
users’ real-world operations are exploratory, with action data being complex
and redundant, posing challenges for agent learning. To address these issues,
in our practical application, we have designed interactive tasks between agents
and humans to identify sensitive information and align with personalized user
needs. Additionally, we integrated Standard Operating Procedure (SOP)
information within the model’s in-context learning to enhance the agent’s
comprehension of complex task execution. Our approach is evaluated on the new
device control benchmark AitW, which encompasses 30K unique instructions across
multi-step tasks, including application operation, web searching, and web
shopping. Experimental results show that the SOP-based agent achieves
state-of-the-art performance without incurring additional inference costs,
with an overall action success rate of 66.92%.

The concept of automating mobile device operations using Large Language Models (LLMs) has gained significant attention in recent years. This article highlights the capabilities of LLM-based agents in executing complex tasks on mobile devices, such as decomposing goals and sequencing sub-goals until the final objective is achieved. However, it also acknowledges the privacy concerns associated with personalized user data, which necessitate user confirmation during mobile operations.

One of the key challenges in training these agents is the exploratory nature of users’ real-world operations. Action data can be complex and redundant, making it difficult for agents to learn effectively. To address these challenges, the article describes a practical application that incorporates interactive tasks between agents and humans. These interactive tasks help identify sensitive information and align with personalized user needs. This multi-disciplinary approach combines expertise from natural language processing, human-computer interaction, and privacy preservation.
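The paper does not publish its confirmation mechanism, but the idea of deferring sensitive operations to the user can be sketched as a simple gate: the agent checks each planned action against patterns for personal data and pauses for confirmation before executing. The pattern list and function names below are illustrative assumptions, not the authors' implementation.

```python
import re

# Illustrative patterns for sensitive data; a real agent would use a
# learned classifier or the LLM itself to flag such content.
SENSITIVE_PATTERNS = [
    r"\bpassword\b",
    r"credit.?card",
    r"\bhome address\b",
    r"\d{3}-\d{2}-\d{4}",  # SSN-like number
]

def requires_confirmation(action_text: str) -> bool:
    """Return True if the planned action appears to touch sensitive data."""
    lowered = action_text.lower()
    return any(re.search(p, lowered) for p in SENSITIVE_PATTERNS)

def execute_action(action_text: str, confirm) -> str:
    """Run an action, pausing for user confirmation when it is sensitive.

    `confirm` is a callback (e.g. a UI dialog) returning True to proceed.
    """
    if requires_confirmation(action_text):
        if not confirm(action_text):
            return "aborted"
    return "executed"
```

In this sketch, only actions that match a sensitive pattern interrupt the user; routine actions run without friction, which mirrors the article's point that confirmation is needed specifically for personalized data.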

Another significant aspect mentioned in the article is the integration of Standard Operating Procedure (SOP) information into the model's in-context learning. By leveraging SOPs, the agent gains a better understanding of complex task execution, which improves its performance on multi-step tasks on mobile devices.
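Since the SOP is consumed via in-context learning, one plausible realization is simply to prepend the procedure's steps to the prompt at each decision point, so the LLM conditions on the standard procedure when choosing the next action. The prompt layout and example SOP below are assumptions for illustration; the paper's exact format may differ.

```python
def build_prompt(instruction: str, sop_steps: list[str], history: list[str]) -> str:
    """Assemble an agent prompt with the SOP placed in-context."""
    sop_block = "\n".join(f"{i + 1}. {step}" for i, step in enumerate(sop_steps))
    history_block = "\n".join(history) if history else "(none)"
    return (
        f"Instruction: {instruction}\n"
        f"Standard Operating Procedure:\n{sop_block}\n"
        f"Actions taken so far:\n{history_block}\n"
        f"Next action:"
    )

# Hypothetical web-shopping task with its SOP and partial action history.
prompt = build_prompt(
    "Order a phone case online",
    ["Open the shopping app", "Search for the item", "Add to cart", "Check out"],
    ["Open the shopping app"],
)
```

Because the SOP is injected as plain text in the prompt rather than through extra model calls, this style of integration is consistent with the article's claim that the approach adds no extra inference cost.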

To evaluate the effectiveness of the proposed approach, the agent is tested on the AitW benchmark, which includes 30K unique instructions across various types of tasks, such as application operation, web searching, and web shopping. Experimental results demonstrate that the SOP-based agent achieves state-of-the-art performance without incurring additional inference costs. With an overall action success rate of 66.92%, this approach showcases the potential of LLM-based agents in automating mobile device operations.
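The reported 66.92% is an action-level metric: roughly, the fraction of steps where the predicted action matches the reference action, aggregated across episodes. A minimal sketch of that computation follows, using exact string equality as a simplified matching rule (AitW's actual matching criteria for gestures and typing are more involved).

```python
def action_success_rate(episodes):
    """Fraction of per-step predicted actions that match the reference.

    `episodes` is a list of (predicted_actions, reference_actions) pairs,
    each a list of action labels, compared step by step.
    """
    correct = 0
    total = 0
    for predicted, reference in episodes:
        for pred, ref in zip(predicted, reference):
            correct += pred == ref
            total += 1
    return correct / total if total else 0.0

# Hypothetical episode: the agent gets 2 of 3 steps right.
episodes = [
    (["tap_search", "type_query", "tap_result"],
     ["tap_search", "type_query", "tap_item"]),
]
rate = action_success_rate(episodes)  # 2 of 3 steps match
```

Averaging over steps rather than whole episodes means partial progress on a long task still contributes to the score, which is why action success rates run higher than full-task completion rates.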

This research not only highlights the advancements in natural language processing and agent-based technology but also emphasizes the importance of addressing privacy concerns and integrating human-computer interaction principles into the development of mobile automation systems. As the field continues to evolve, further improvements can be expected in terms of privacy protection, user experience, and the breadth of tasks that LLM-based agents can handle.
