本文最后更新于：2023年11月2日下午

Paella: Low-latency Model Serving with Software-defined GPU Scheduling

github: https://github.com/eniac/paella/tree/sosp23_artifact
SOSP 2023

摘要

Model serving systems play a critical role in multiplexing machine learning inference jobs across shared GPU infras- tructure. These systems have traditionally sat at a high level of abstraction—receiving jobs from clients through a nar- row API and relying on black-box GPU scheduling mechanisms when dispatching them. Fundamental limitations in the built-in GPU hardware scheduler, in particular, can lead to inefficiency when executing concurrent jobs. The cur- rent abstraction level also incurs system overheads that are similarly most significant when the GPU is heavily shared.

In this paper, we argue for co-designing the model compiler, local clients, and the scheduler to bypass the built-in GPU scheduler and enable software control of kernel execu- tion order. Doing so enables the use of arbitrary scheduling algorithms and reduces system overheads throughout the critical path of inference.

传统的模型服务系统工作在较高抽象层次上 - 通过窄的API从客户端接收作业,并在调度作业到GPU时依赖黑盒式GPU调度机制。
内置GPU硬件调度器的基本局限性,特别是在执行并发作业时,会导致效率低下。
当前的抽象级别在GPU被大量共享时,也会引入显著的系统开销。
文章主张协同设计模型编译器、本地客户端和调度器,以跳过内置GPU调度器,并实现软件控制内核执行顺序。这可以实现任意调度算法并减少推理关键路径上的系统开销。

背景

NVIDIA 的 Triton、Clipper 和 Clockwork 等模型服务系统在此过程中发挥着关键作用，使多个用户和模型能够共享 GPU 基础设施。这些系统通常位于模型执行引擎（例如 Tensorflow、PyTorch、JAX 等）之上并与客户端应用程序分开，负责接收传入请求、选择模型/配置、调度并最终分派模型到底层执行引擎执行。
这些平台难以提供低延迟, 尤其是当 “when the GPU is heavily shared and models exhibit high dispersion” (GPU共享度高, 模型分散) 时, 原因在于:
- 服务平台的开销高 (序列化, 反序列化, 调度, 分派) 等
- 低效的 FIFO 调度策略 (已经融入了 CUDA 运行时, GPU驱动程序和硬件) 容易受到频繁的队头阻塞 (HoL) 的影响

Paella:

co-design of a request submission interface, model scheduler, and model execution logic
We introduce a compiler-service co-design that instruments kernels to offer visibility into and full control over black-box GPU scheduling decisions. (基于TVM)
我们提出了一种从GPU 中抽象出调度的服务架构。 Paella 使用它来最小化整个推理管道的延迟：接收请求，将它们分派到 GPU，然后返回结果。

GPU 调度

Baseline:

FIFO, and only single process & single stream

Today’s GPU features:

with Multi-Process Service (MPS), kernels can be submitted from different processes and receive similar benefits;
with Multi-Instance GPU (MIG), users can set up multiple virtual GPUs with strong iso- lation properties.

Fermi and earlier GPU:

GPU 仅仅支持单个硬件队列, 无论用户定义的 stream 数量如何
在这种情况下, 只有相邻模型的最后一个内核可以共享 GPU (only after all of a kernel’s blocks have been placed and the new head of the hardware queue has no running dependencies, in which case streams may allow the GPU to execute an independent kernel.)

Kepler and later GPU:

添加了多个硬件队列, 每个 stream 被映射到其中一个队列
这时 GPU 可以独立考虑每个硬件队列的头部, 从而增加了在资源可用时的独立调度 kernel 的机会
MPS 将这种抽象扩展到来自不同进程的 kernel

系统设计和组成

paella 的系统组成

一个基于 TVM 构建的编译器, 对内核进行检测, 并在运行时公开有关 GPU SM 的占用和利用率的详细信息
一个客户端库和通信协议, 使用进程间通信向 paella service 发送请求和响应
Paella 调度程序, 接受请求并监控 GPU 的资源使用情况, 做出细粒度的调度决策

实现

c++, & cuda, 基于 TVM 和 boost coroutine

paella 调度程序在指定的 linux 内核上运行, 并具有实时调度优先级

阅读

#paper #论文 #Deep Learning

论文略读: Paella

https://blog.roccoshi.top/posts/42284/

作者

Moreality

发布于

2023年11月2日

许可协议

论文略读: Tetris 上一篇

论文略读: MuxFlow 下一篇