Industrial cascade recommendation systems (RS) deliver relevant items to users at very large scale. However, the growing size and complexity of these systems have led to significant energy consumption and carbon emissions. To address this concern, a new paper introduces GreenFlow, a practical computation allocation framework for RS that takes into account both accuracy and carbon emissions during inference.
The framework optimizes the computation spent in each stage of a cascade RS, such as recall, pre-ranking, and ranking. When a user triggers a request, the framework chooses two actions per stage: which trained model instance to run, among instances of different computational complexity, and how many items to infer in that stage. The combination of actions across all stages forms an action chain, and a reward score is estimated for each chain. The framework then uses dynamic primal-dual optimization to select chains that balance the reward against the computation budget.
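The allocation loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not GreenFlow's actual implementation: the stage names, model costs, item counts, and the helper names `build_action_chains`, `select_chain`, and `update_dual` are all hypothetical, and the reward estimates are assumed to come from some external model.

```python
from itertools import product

# Hypothetical per-stage options: (model instance name, cost per inferred item).
STAGE_MODELS = {
    "recall": [("light", 1.0), ("heavy", 2.0)],
    "ranking": [("light", 5.0), ("heavy", 20.0)],
}
# Hypothetical choices for how many items each stage infers.
STAGE_ITEM_COUNTS = {
    "recall": [1000, 2000],
    "ranking": [50, 100],
}


def build_action_chains():
    """Enumerate every combination of (model instance, item count) across stages."""
    per_stage = []
    for stage, models in STAGE_MODELS.items():
        options = [
            (stage, name, n_items, unit_cost * n_items)
            for name, unit_cost in models
            for n_items in STAGE_ITEM_COUNTS[stage]
        ]
        per_stage.append(options)
    chains = []
    for combo in product(*per_stage):
        chains.append({
            "actions": tuple((s, m, n) for s, m, n, _ in combo),
            "cost": sum(cost for *_, cost in combo),
        })
    return chains


def select_chain(chains, rewards, lam):
    """Primal step: pick the chain maximizing estimated reward minus priced cost."""
    return max(range(len(chains)), key=lambda i: rewards[i] - lam * chains[i]["cost"])


def update_dual(lam, spent, budget, step):
    """Dual step: raise the price of computation when over budget, lower it otherwise."""
    return max(0.0, lam + step * (spent - budget))
```

With a price `lam` of zero the selection simply follows the reward estimate; as `lam` grows, cheaper chains win, which is how a single scalar can steer total computation toward the budget.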
The effectiveness of GreenFlow is demonstrated through extensive experiments. In an industrial mobile application, the framework reduces computation consumption by 41% without compromising commercial revenue. It also yields significant energy savings, saving approximately 5000 kWh of electricity and cutting carbon emissions by 3 tons per day.
Abstract: Given the enormous number of users and items, industrial cascade recommendation systems (RS) are continuously expanded in size and complexity to deliver relevant items, such as news, services, and commodities, to the appropriate users. In a real-world scenario with hundreds of thousands of requests per second, significant computation is required to infer personalized results for each request, resulting in massive energy consumption and carbon emissions that raise concern.
This paper proposes GreenFlow, a practical computation allocation framework for RS that considers both accuracy and carbon emissions during inference. For each stage (e.g., recall, pre-ranking, ranking) of a cascade RS, when a user triggers a request, we define two actions that determine the computation: (1) the trained instances of models with different computational complexity; and (2) the number of items to be inferred in the stage. We refer to the combinations of actions in all stages as action chains. A reward score is estimated for each action chain, followed by dynamic primal-dual optimization considering both the reward and computation budget. Extensive experiments verify the effectiveness of the framework, reducing computation consumption by 41% in an industrial mobile application while maintaining commercial revenue. Moreover, the proposed framework saves approximately 5000 kWh of electricity and reduces carbon emissions by 3 tons per day.
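One plausible way to write down the budgeted allocation the abstract describes is as a constrained assignment problem. The formulation below is an assumption for illustration, not the paper's stated objective, and every symbol in it is introduced here: $i$ indexes requests, $j$ indexes action chains, $R_{ij}$ is the estimated reward of chain $j$ for request $i$, $c_j$ is the computation cost of chain $j$, $C$ is the total computation budget, $\lambda$ is the dual price of computation, and $\eta$ is a step size.

```latex
% Primal problem: assign exactly one action chain to each request under a budget.
\begin{aligned}
\max_{x}\quad & \sum_{i}\sum_{j} R_{ij}\, x_{ij} \\
\text{s.t.}\quad & \sum_{i}\sum_{j} c_{j}\, x_{ij} \le C, \\
& \sum_{j} x_{ij} = 1 \quad \forall i, \qquad x_{ij} \in \{0, 1\}
\end{aligned}

% Relaxing the budget constraint with a multiplier \lambda \ge 0 decouples requests:
j_i^{*} = \arg\max_{j}\, \bigl( R_{ij} - \lambda\, c_{j} \bigr)

% Dual update (projected subgradient step) that tracks the budget over time:
\lambda \leftarrow \max\Bigl( 0,\; \lambda + \eta \Bigl( \sum_{i} c_{j_i^{*}} - C \Bigr) \Bigr)
```

Under this reading, each request is served by the chain with the best reward net of priced computation, and the price rises whenever realized computation exceeds the budget and falls otherwise.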