Data processing is one of the fundamental steps in machine learning pipelines
to ensure data quality. Most applications adopt the user-defined
function (UDF) design pattern for data processing in databases. Although the
UDF design pattern introduces flexibility, reusability and scalability, the
increasing demand on machine learning pipelines brings three new challenges to
this design pattern: it is not low-code, not dependency-free and not
knowledge-aware. To address these challenges, we propose a new design pattern
in which large language models (LLMs) work as a generic data operator
(LLM-GDO) for reliable data cleansing, transformation and modeling with their
human-compatible performance. In the LLM-GDO design pattern, user-defined
prompts (UDPs) are used to represent the data processing logic rather than
implementations in a specific programming language. LLMs can be centrally
maintained, so users do not have to manage dependencies at runtime.
Fine-tuning LLMs with domain-specific data could enhance performance on
domain-specific tasks, which makes data processing knowledge-aware. We
illustrate these advantages with examples in different data processing tasks.
Furthermore, we summarize the challenges and opportunities introduced by LLMs
to provide a complete view of this design pattern for further discussion.

Data Processing with Large Language Models

Data processing plays a crucial role in machine learning pipelines because it ensures data quality. However, the existing user-defined function (UDF) design pattern for data processing in databases faces three new challenges as demand on machine learning pipelines grows: it is not low-code, not dependency-free, and not knowledge-aware.

To tackle these challenges, a new design pattern is proposed: the LLM as a generic data operator (LLM-GDO). In this design pattern, large language models (LLMs) are employed as generic data operators for reliable data cleansing, transformation, and modeling with human-compatible performance.

The LLM-GDO design pattern relies on user-defined prompts (UDPs) to represent the data processing logic, freeing users from implementing that logic in a specific programming language. This allows for greater flexibility, reusability, and scalability. Additionally, LLMs can be centrally maintained, eliminating the need for users to manage dependencies at runtime.
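To make the contrast between a UDF and a UDP concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the authors' implementation: the names `normalize_phone_udf`, `NORMALIZE_PHONE_UDP`, `apply_udp`, and `fake_llm` are hypothetical, and `fake_llm` is an offline stand-in for a centrally hosted model so that the example runs without any service.

```python
from typing import Callable, Iterable, List


# A conventional UDF: the processing logic is written in a programming language.
def normalize_phone_udf(raw: str) -> str:
    digits = "".join(ch for ch in raw if ch.isdigit())
    return digits[-10:]


# A UDP: the same intent expressed as a natural-language prompt template.
NORMALIZE_PHONE_UDP = (
    "Return only the last 10 digits of the phone number below, with no "
    "spaces or punctuation.\nPhone number: {value}"
)


def apply_udp(values: Iterable[str], udp: str, llm: Callable[[str], str]) -> List[str]:
    """Render the UDP for each value and pass it to the LLM callable."""
    return [llm(udp.format(value=v)).strip() for v in values]


# Hypothetical stand-in for a hosted LLM (assumption): it reads the value back
# out of the prompt and reuses the UDF, so the sketch is deterministic.
def fake_llm(prompt: str) -> str:
    value = prompt.rsplit("Phone number: ", 1)[1]
    return normalize_phone_udf(value)


cleaned = apply_udp(["(555) 123-4567", "+1 555 987 6543"], NORMALIZE_PHONE_UDP, fake_llm)
```

In a real deployment `fake_llm` would be replaced by a call to whatever model the organization maintains; the caller only ever handles the prompt string and the returned text.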

Fine-tuning LLMs with domain-specific data can further enhance their performance on domain-specific tasks, which makes the data processing knowledge-aware. By using LLMs as data operators, the LLM-GDO design pattern enables efficient and effective data processing across various tasks.
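One way to picture the fine-tuning step is as pairing a UDP with labeled, domain-specific examples. The sketch below builds JSON Lines records from such pairs. The `prompt`/`completion` field names are an assumption (fine-tuning services differ in the schema they expect), and the function name and the two example rows are made up purely for illustration.

```python
import json
from typing import Iterable, List, Tuple

# Hypothetical UDP for a classification task.
UDP = "Classify the product category of the item below.\nItem: {value}"


def build_finetune_records(udp: str, labeled: Iterable[Tuple[str, str]]) -> List[str]:
    """Turn (input, expected output) pairs into JSON Lines strings."""
    return [
        json.dumps({"prompt": udp.format(value=text), "completion": label})
        for text, label in labeled
    ]


# Made-up domain-specific examples (assumption, for illustration only).
examples = [
    ("USB-C charging cable, 2 m", "electronics"),
    ("Organic whole milk, 1 L", "grocery"),
]
records = build_finetune_records(UDP, examples)
```

Each string in `records` is one line of a training file; how that file is submitted for fine-tuning depends on the model provider and is outside this sketch.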

Advantages of LLM-GDO Design Pattern

There are several advantages to using LLM-GDO as a design pattern for data processing:

  • Flexibility: The use of UDPs allows for flexible representation of data processing logic without being tied to specific programming languages.
  • Reusability: LLMs can be fine-tuned and shared across different data processing tasks, promoting reusability and efficiency.
  • Scalability: With LLM-GDO, data processing can be scaled easily as LLMs are designed to handle large amounts of data.
  • Dependency Management: By centrally maintaining LLMs, users are relieved from managing dependencies at runtime, simplifying the deployment process.
  • Knowledge-Awareness: Fine-tuning LLMs with domain-specific data makes them knowledgeable about specific tasks, leading to improved performance.
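The reusability and dependency-management points can be illustrated with a small sketch in which a whole pipeline is stored as data rather than code. Everything here is hypothetical: the JSON layout, the `run_pipeline` helper, and the rule-based `fake_llm` that stands in for a centrally maintained model so the example runs offline.

```python
import json
from typing import Callable

# A pipeline described purely as UDP strings; the user maintains no step code.
PIPELINE_JSON = """
[
  {"name": "trim", "udp": "Remove leading and trailing whitespace from: {value}"},
  {"name": "lower", "udp": "Convert to lowercase: {value}"}
]
"""


def run_pipeline(value: str, pipeline_json: str, llm: Callable[[str], str]) -> str:
    """Apply each UDP in order, feeding the output of one step to the next."""
    for step in json.loads(pipeline_json):
        value = llm(step["udp"].format(value=value))
    return value


# Hypothetical stand-in for a hosted LLM (assumption): it recognizes the two
# instructions above and returns the value unchanged for anything else.
def fake_llm(prompt: str) -> str:
    instruction, _, value = prompt.partition(": ")
    if instruction.startswith("Remove leading"):
        return value.strip()
    if instruction.startswith("Convert to lowercase"):
        return value.lower()
    return value


result = run_pipeline("  Hello World  ", PIPELINE_JSON, fake_llm)
```

Because the steps are plain strings, the same pipeline definition can be shared across teams, and the only runtime requirement on the user's side is a way to reach the model.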

Challenges and Opportunities

While LLM-GDO offers significant advantages, it also introduces challenges and opportunities. One challenge is the need for large computational resources to train LLMs. Furthermore, fine-tuning LLMs requires domain-specific data, which might not always be readily available.

On the other hand, the availability of pre-trained LLMs and the potential for transfer learning provide opportunities for faster development and improved performance. Additionally, the multi-disciplinary nature of LLM-GDO opens doors for collaboration between linguists, domain experts, and machine learning practitioners.

In conclusion, the LLM-GDO design pattern offers a novel approach to data processing in machine learning pipelines. By leveraging LLMs as generic data operators and utilizing UDPs for flexible representation of data processing logic, this design pattern promises greater flexibility, reusability, scalability, and knowledge-awareness. Despite its challenges, the adoption of LLM-GDO presents numerous opportunities for advancing data processing techniques and fostering cross-disciplinary collaborations.
