Last updated: afternoon of November 2, 2023

TETRIS: Memory-efficient Serverless Inference through Tensor Sharing

Venue: ATC '22
Tianjin University, slides: https://www.usenix.org/sites/default/files/conference/protected-files/atc22_slides_li-jie.pdf

Abstract

Executing complex, memory-intensive deep learning inference services poses a major challenge for serverless computing frameworks, which would densely deploy and maintain inference models at high throughput. We observe the excessive memory consumption problem in serverless inference systems, due to the large-sized models and high data redundancy.

We present TETRIS, a serverless platform catered to inference services with an order of magnitude lower memory footprint. TETRIS’s design carefully considers the extensive memory sharing of runtime and tensors. It supports minimizing the runtime redundancy through a combined optimization of batching and concurrent execution and eliminates tensor redundancy across instances from either the same or different functions using a lightweight and safe tensor mapping mechanism. Our comprehensive evaluation demonstrates that TETRIS saves up to 93% memory footprint for inference services, and increases the function density by 30× without impairing the latency.

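Summary of the first paragraph: running complex, memory-intensive deep learning inference services is a major challenge for serverless frameworks, which need to deploy and maintain inference models densely and at high throughput. The authors observe that serverless inference systems consume excessive memory because models are large and data redundancy is high.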

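Summary of the second paragraph: TETRIS is a serverless platform for inference services whose memory footprint is an order of magnitude lower. Its design centers on extensive memory sharing of both the runtime and tensors. It minimizes runtime redundancy through a combined optimization of batching and concurrent execution, and it removes tensor redundancy between instances of the same or different functions with a lightweight, safe tensor mapping mechanism. The evaluation shows that TETRIS saves up to 93% of the memory footprint of inference services and raises function density by 30× without hurting latency.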

Introduction

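Problems identified

In the serverless setting, memory footprint is a bottleneck: inference itself may take only part of the time, but resource reservation and caching by the system keep memory occupied for long periods and introduce redundancy
The paper studies serverless inference applications and finds severe tensor redundancy, i.e. “tensors in the computational graphs of inference models are highly duplicated across function instances”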
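
To make the notion of tensor redundancy concrete, here is a minimal sketch that estimates what fraction of tensor bytes across instances are duplicates. Everything in it is an illustrative assumption rather than the paper's measurement method: the function names, the use of SHA-256 as the content hash, and the three toy instances (two of one function, one of another that shares a layer).

```python
import hashlib
from typing import Mapping


def tensor_digest(raw: bytes) -> str:
    """Content hash of a tensor's raw bytes (SHA-256 is an assumption)."""
    return hashlib.sha256(raw).hexdigest()


def redundancy_ratio(instances: Mapping[str, Mapping[str, bytes]]) -> float:
    """Fraction of all tensor bytes that duplicate an already-seen tensor."""
    total = 0
    unique = 0
    seen = set()
    for tensors in instances.values():
        for raw in tensors.values():
            total += len(raw)
            digest = tensor_digest(raw)
            if digest not in seen:
                seen.add(digest)
                unique += len(raw)
    return 0.0 if total == 0 else 1 - unique / total


# Hypothetical instances: byte strings stand in for real weight tensors.
conv = bytes(1024)        # layer shared by both functions
head_a = b"\x01" * 256    # head specific to function "fn-a"
head_b = b"\x02" * 256    # head specific to function "fn-b"
instances = {
    "fn-a/0": {"conv": conv, "head": head_a},
    "fn-a/1": {"conv": conv, "head": head_a},
    "fn-b/0": {"conv": conv, "head": head_b},
}
ratio = redundancy_ratio(instances)
print(f"redundant bytes: {ratio:.0%}")
```

In this toy setup 3840 bytes are held in total but only 1536 are unique, so 60% of the memory is redundant; the same content appears both across instances of one function and across different functions.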

Possible countermeasures

Tensor redundancy can be addressed in the following ways:

OS kernel-level page merging methods, but these incur scanning overhead, are complex to implement, and carry the risk of side-channel attacks

The approach explored in this paper:

Runtime sharing and tensor-level sharing
Characteristics: low overhead, safe

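Main contributions of the paper

• We observe the tensor redundancy problem in serverless inference systems and propose tensor sharing to improve memory efficiency.
• We design a lightweight sharing method based on user-space tensor mapping that eliminates tensor redundancy in serverless inference systems.
• We implement a TETRIS prototype built on the open-source OpenFaaS and TensorFlow Serving, supporting memory sharing, memory reclamation, and instance scheduling.
• We evaluate TETRIS extensively with a comprehensive set of benchmarks and production workloads. The results show that, compared with state-of-the-art approaches, TETRIS saves up to 93% of memory and increases function density by 30×.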

System Design

How tensors are loaded

Loading a tensor is essentially a lock-protected map lookup.
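
A minimal sketch of such a lock-protected lookup, under stated assumptions: the class name `TensorMap`, the method `get_or_load`, and the `loader` callback are all hypothetical, and an in-process `threading.Lock` stands in for whatever cross-process synchronization the real system uses.

```python
import threading
from typing import Callable, Dict


class TensorMap:
    """Hypothetical digest -> tensor buffer map guarded by a single lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._tensors: Dict[str, bytearray] = {}

    def get_or_load(self, digest: str, loader: Callable[[], bytearray]) -> bytearray:
        # Holding the lock across check-and-insert ensures that two callers
        # asking for the same digest never both load the tensor.
        with self._lock:
            buf = self._tensors.get(digest)
            if buf is None:
                buf = loader()            # first request pays the load cost
                self._tensors[digest] = buf
            return buf                    # later requests share the same buffer


load_calls = []


def load_weights() -> bytearray:
    load_calls.append(1)
    return bytearray(b"weights")


tensor_map = TensorMap()
first = tensor_map.get_or_load("abc123", load_weights)
second = tensor_map.get_or_load("abc123", load_weights)
print(first is second, len(load_calls))
```

Holding one lock while loading is the simplest correct choice but serializes all loads; a finer-grained scheme (for example, one lock per digest) would be a natural refinement.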

Tensor lifecycle

Implementation

Platform:

OpenFaaS: an open-source serverless platform
TensorFlow Serving: TensorFlow's official inference serving framework
A separate daemon process is also designed for the tensor reclaimer
System runtime: Kubernetes

Tensor store implementation

A shared tensor store is maintained on each server in the cluster (rather than one global store):

Since maintaining a cluster-level global tensor store is costly due to frequent tensor access during inference and high network latency, TETRIS maintains a shared tensor store on each server for performance guarantee while minimizing cluster memory consumption through instance scheduling.

Instead of Docker's --ipc=host option (which shares the host's IPC namespace), memory is shared through a tmpfs (in-memory file system) mounted into containers with docker -v:

Although shared memory can be enabled by Docker through setting the --ipc=host option at the container creation time, allowing all containers to share the host ipc namespace, this introduces significant risks of malicious activities or misoperations. Hence, we instead implement the shared memory by mounting a memory-based tmpfs, in which the tensor store is just a directory. Then, it could be mounted to each container during its creation time using command like docker -v. Tensors are stored as files under the mounted tmpfs directory and their hash values are set as the filenames. In this way, we can build multiple dedicated tensor stores flexibly, just by creating different mounted directories.
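
The "directory of files named by hash" idea can be sketched as follows. This is an illustration under assumptions, not TETRIS code: `put_tensor` is a hypothetical helper, SHA-256 is an assumed hash (the excerpt only says "hash values"), and an ordinary temporary directory stands in for the tmpfs mount that would be bind-mounted into each container.

```python
import hashlib
import os
import tempfile


def put_tensor(store_dir: str, raw: bytes) -> str:
    """Store raw tensor bytes as a file named by their content hash.

    Identical tensors map to the same filename, so a second writer finds
    the file already present and stores nothing new.
    """
    digest = hashlib.sha256(raw).hexdigest()  # assumption: SHA-256
    path = os.path.join(store_dir, digest)
    if not os.path.exists(path):
        tmp_path = f"{path}.{os.getpid()}.tmp"
        with open(tmp_path, "wb") as f:
            f.write(raw)
        # Rename is atomic within one filesystem, so readers never
        # observe a half-written tensor file.
        os.replace(tmp_path, path)
    return path


# Stand-in for the tmpfs-backed store directory (hypothetical location).
store_dir = tempfile.mkdtemp(prefix="tensor-store-")

p1 = put_tensor(store_dir, b"\x00" * 4096)  # written
p2 = put_tensor(store_dir, b"\x00" * 4096)  # same content: deduplicated
p3 = put_tensor(store_dir, b"\x01" * 4096)  # different content: new file
print(sorted(os.listdir(store_dir)))
```

Because the store is only a directory, a second, isolated store is simply another directory, which matches the excerpt's point about building multiple dedicated tensor stores.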

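Modifications to TensorFlow Serving and OpenFaaS

Modified TensorFlow Serving's RestoreOp interface to provide a new tensor allocator that uses the open and mmap system calls
Replaced TensorFlow Serving's malloc interface with tcmalloc
Kubernetes pods are created directly for instances, and the tmpfs directory is mounted during instance creation, to make runtime sharing and tensor sharing easy to use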
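
The open-plus-mmap pattern behind the new allocator can be illustrated in a few lines. This is a sketch of the general technique only, not the modified RestoreOp: `map_tensor` is a hypothetical helper, the file is a toy stand-in for a tensor in the store, and mapping read-only is an assumption about how sharing could be kept safe.

```python
import mmap
import os
import tempfile


def map_tensor(path: str) -> mmap.mmap:
    """Map a tensor file read-only instead of copying it into private memory."""
    fd = os.open(path, os.O_RDONLY)
    try:
        # Length 0 maps the whole file. Mappings of the same file are backed
        # by the same page-cache pages, so readers share physical memory.
        return mmap.mmap(fd, 0, access=mmap.ACCESS_READ)
    finally:
        os.close(fd)  # the mapping remains valid after the descriptor closes


# Toy tensor file (hypothetical; a real one would live in the tmpfs store).
path = os.path.join(tempfile.mkdtemp(prefix="tensor-store-"), "weights.bin")
with open(path, "wb") as f:
    f.write(bytes(range(16)))

view_a = map_tensor(path)  # e.g. instance A
view_b = map_tensor(path)  # e.g. instance B, same underlying pages
first = view_a[:4]
print(first, len(view_b))
```

With `ACCESS_READ`, an attempt to assign into the mapping raises `TypeError`, so one instance cannot corrupt weights that another instance is reading.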

Paper skim: TETRIS