Cake is a pure Rust implementation of LLama3 distributed inference, based on Candle. The goal of the project is to run big (70B+) models by repurposing consumer hardware into a heterogeneous cluster of iOS, macOS, Linux and Windows devices.
This is experimental code.
The idea is to shard the transformer blocks across multiple devices so that inference can run on models that wouldn't normally fit in the GPU memory of a single device. Inferences over contiguous transformer blocks on the same worker are batched in order to minimize the latency caused by data transfer.
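The batching idea above can be sketched in a few lines of Rust. This is an illustrative sketch, not Cake's actual API: the `Shard` type and `plan_shards` function are hypothetical names, showing how a per-block worker assignment can be grouped into contiguous runs so that each run becomes a single batched call to a worker instead of one network round-trip per block.

```rust
// Hypothetical sketch (names are illustrative, not the real Cake API):
// group consecutive transformer blocks assigned to the same worker into
// one shard, so a forward pass makes one batched call per shard.

#[derive(Debug, PartialEq)]
struct Shard {
    worker: String,
    // Inclusive range of transformer block indices handled by this worker.
    first_block: usize,
    last_block: usize,
}

/// Collapse a per-block worker assignment into contiguous shards.
fn plan_shards(assignment: &[&str]) -> Vec<Shard> {
    let mut shards: Vec<Shard> = Vec::new();
    for (i, w) in assignment.iter().enumerate() {
        match shards.last_mut() {
            // Extend the current shard if this block runs on the same
            // worker as the immediately preceding block.
            Some(s) if s.worker == *w && s.last_block + 1 == i => s.last_block = i,
            _ => shards.push(Shard {
                worker: w.to_string(),
                first_block: i,
                last_block: i,
            }),
        }
    }
    shards
}

fn main() {
    // 8 transformer blocks spread over 3 devices: blocks 0-2 on a Mac,
    // 3-5 on a Linux box, 6-7 on an iPhone (example topology).
    let assignment = ["mac", "mac", "mac", "linux", "linux", "linux", "ios", "ios"];
    let shards = plan_shards(&assignment);
    for s in &shards {
        println!("{}: blocks {}..={}", s.worker, s.first_block, s.last_block);
    }
    // 3 batched calls instead of 8 per-block transfers.
    assert_eq!(shards.len(), 3);
}
```

With this grouping, the hidden-state tensor crosses the network only once per shard boundary, which is what keeps data-transfer latency manageable on a heterogeneous cluster.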