PAI - Best Practices for Building Full-Pipeline Large Language Model (LLM) Services with PAI Lingjun Intelligent Computing.pdf

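PAI | LLM
Contents: 01 PAI · 02 Data · 03 Training · 04 Inference · 05 Stability

01 PAI: a one-stop intelligent computing platform for the full LLM pipeline (data, training, inference, stability)
XLAB, XPSP AI-TensorFlow, PAI-PyTorch, PAI-Studio, DLC, DSW, EAS
NLP/CV/
• Hundred-billion parameters
• ODL
• M6
• OFA
• Swin-Transformer
PAI AI 9 SLA

02 Data
High-quality text input yields better large language models.
- Data Deduplication from Google (2022/03)
- Text Deduplication from BigCode (2023/05)
- The RefinedWeb for Falcon LLM (2023/06)
Pipeline: jieba → MinHash → MinHashLSH → duplicate graph G (edges such as A-B) → connected components
• Power law 10
• Distributed union-find: 1. join 2.
Graph connected-components algorithm, example results:
Implementation        Samples  Duplication rate  Time      Precision (%)  Recall (%)  F1 (%)
PAI                   500M     50%               1h 34min  87             99          93
Other implementation  500M     50%               4h 10min  85             92          90
PAI                   1B       50%               3h 0min   82             99          90
Other implementation  1B       50%               6h 54min  80             90          85

03 Training
TorchAccelerator
• A general framework that dispatches operators to new backends (AICompiler) while providing a new Tensor expression that is swapped in under eager mode.
• An AI compiler that uses advanced optimization techniques to support high-performance code generation.
• Supports FSDP, TP, and other distributed strategies.
Scheduling built on the Kube Scheduler Framework, aware of the AI cluster network (ASW/DSW/PSW).
Choosing a placement that suits the network architecture unlocks more of the high-performance network's potential.

04 Inference
LLM serving on EAS: OPT/GPT/Bloom/GLM *
• Model compression: weight quantization, activation quantization, KV Cache quantization
• System optimization: compiler optimization, high-performance operator library
• Distributed execution: tensor parallelism, pipeline parallelism
• Nvidia GPU, AMD GPU
• Modeling: high-performance implementations of mainstream models, full compatibility with open-source models
[Figure: OPT-66B GPU chart, A100 (80GB) / V100 (32GB) / A10 (24GB), fp16 / int8 / int4]
[Figure: OPT-66B perplexity on wikitext2 / ptb / c4, fp16 / int8 / int4]
Service throughput up 1.7x to 3.8x; first-token latency down 8.7x to 13.8x.
BladeLLM: Model weights / config → Compression → Compiling → Serving (User, Platform)

05 Stability
High-performance Lingjun clusters make stability very challenging:
• ECC Error
• NCCL Timeout
• NCCL Hang
• PCIe slowdown
• NVLink Error
• ……
AIMaster: Hang
EasyCKPT: Checkpoint
• Multi-level storage with asynchronous, parallel saving
• Saves in as little as seconds, greatly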
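The deduplication pipeline in section 02 goes from tokenized text through MinHash signatures to MinHashLSH candidate pairs. Below is a minimal, stdlib-only sketch of that idea, not PAI's implementation: it assumes whitespace-split tokens stand in for jieba output, uses 3-token shingles, and builds the hash family from seeded `hashlib.blake2b` digests. `NUM_PERM`, `BANDS`, and `ROWS` are illustrative values, not settings from the slides.

```python
import hashlib
from collections import defaultdict
from itertools import combinations

NUM_PERM = 64  # hash functions per signature (illustrative choice)
BANDS = 16     # LSH bands; NUM_PERM == BANDS * ROWS
ROWS = 4

def shingles(tokens, k=3):
    """Contiguous k-token shingles; `tokens` stands in for jieba output."""
    return {" ".join(tokens[i:i + k]) for i in range(max(1, len(tokens) - k + 1))}

def _hash(seed, text):
    # Seeded 64-bit hash; each seed plays the role of one random permutation.
    digest = hashlib.blake2b(f"{seed}:{text}".encode("utf-8"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def minhash(shingle_set):
    """Signature = minimum hash value under each seeded hash function."""
    return [min(_hash(seed, s) for s in shingle_set) for seed in range(NUM_PERM)]

def jaccard_estimate(sig_a, sig_b):
    """Fraction of agreeing positions estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

def lsh_candidates(signatures):
    """Documents that agree on every row of at least one band become candidate pairs."""
    buckets = defaultdict(list)
    for doc_id, sig in signatures.items():
        for b in range(BANDS):
            buckets[(b, tuple(sig[b * ROWS:(b + 1) * ROWS]))].append(doc_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs
```

With b bands of r rows, two documents with Jaccard similarity s become candidates with probability 1 - (1 - s^r)^b, so raising `ROWS` trades recall for precision.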

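Candidate pairs form the graph G from section 02, and each connected component is one group of near-duplicates. The slides describe a distributed union-find built from join steps; the sketch below is only the single-machine version of the same data structure (path compression plus union by size), with a hypothetical rule that keeps the lexicographically smallest document id in each component.

```python
from collections import defaultdict

class UnionFind:
    """Disjoint-set forest with path compression and union by size."""

    def __init__(self):
        self.parent = {}
        self.size = {}

    def find(self, x):
        if x not in self.parent:          # lazily add unseen documents
            self.parent[x] = x
            self.size[x] = 1
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        while x != root:                  # path compression
            nxt = self.parent[x]
            self.parent[x] = root
            x = nxt
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return
        if self.size[ra] < self.size[rb]:  # attach the smaller tree under the larger
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]

def duplicate_groups(pairs):
    """Connected components of the duplicate-pair graph, as sorted lists."""
    uf = UnionFind()
    for a, b in pairs:
        uf.union(a, b)
    groups = defaultdict(list)
    for node in list(uf.parent):
        groups[uf.find(node)].append(node)
    return sorted(sorted(g) for g in groups.values())

def documents_to_drop(pairs):
    """Hypothetical rule: keep the first id of each component, drop the rest."""
    return {doc for group in duplicate_groups(pairs) for doc in group[1:]}
```

For example, pairs A-B, B-G, and C-D yield the components {A, B, G} and {C, D}, so B, G, and D are dropped.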
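Section 03 describes scheduling on the Kube Scheduler Framework that picks placements suited to the ASW/DSW/PSW network. The function below is a hypothetical illustration of that preference, not PAI's scheduler or a Kubernetes plugin: it assumes each idle node is labeled with the PSW and ASW it sits under, that ASWs are the lowest tier, and that keeping a job under one ASW, or failing that one PSW, shortens the communication path.

```python
from collections import defaultdict

def place(idle_nodes, needed):
    """Pick `needed` idle nodes, preferring the tightest network locality.

    idle_nodes maps node name -> (psw, asw). Assumed (hypothetical) hierarchy:
    nodes hang off an ASW, several ASWs share a PSW, and traffic between PSWs
    crosses a higher tier. Returns a list of node names, or None if capacity is short.
    """
    by_asw = defaultdict(list)
    by_psw = defaultdict(list)
    for name, (psw, asw) in sorted(idle_nodes.items()):
        by_asw[(psw, asw)].append(name)
        by_psw[psw].append(name)

    # 1) Best fit within a single ASW: smallest group that still holds the job.
    fits = [group for group in by_asw.values() if len(group) >= needed]
    if fits:
        return min(fits, key=len)[:needed]

    # 2) Stay under one PSW, filling from its largest ASW groups first.
    for psw, members in sorted(by_psw.items(), key=lambda kv: -len(kv[1])):
        if len(members) >= needed:
            groups = sorted((g for (p, _), g in by_asw.items() if p == psw), key=len, reverse=True)
            return [n for g in groups for n in g][:needed]

    # 3) Last resort: span PSWs, still taking the largest ASW groups first.
    groups = sorted(by_asw.values(), key=len, reverse=True)
    chosen = [n for g in groups for n in g]
    return chosen[:needed] if len(chosen) >= needed else None
```

In the Kubernetes scheduling framework, a preference like this would usually be split across Filter and Score plugins; the single function here only shows the order in which locality levels are tried.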

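Tensor parallelism (TP), listed under training (section 03) and distributed execution for inference (section 04), shards one layer's weight matrix across devices. This plain-Python sketch simulates devices as list slices and shows the two standard splits: column-parallel, whose partial outputs are concatenated (an all-gather), and row-parallel, whose partial outputs are summed (an all-reduce). It illustrates the technique only and is not TorchAccelerator's API.

```python
def matmul(x, w):
    """Dense (n x k) @ (k x m) on nested lists."""
    return [[sum(x[i][t] * w[t][j] for t in range(len(w))) for j in range(len(w[0]))]
            for i in range(len(x))]

def column_parallel(x, w, parts):
    """Each device holds a k x (m/parts) slice of W; outputs are concatenated (all-gather)."""
    m = len(w[0])
    assert m % parts == 0, "output dim must split evenly"
    step = m // parts
    shards = [[row[p * step:(p + 1) * step] for row in w] for p in range(parts)]
    partial = [matmul(x, shard) for shard in shards]  # independent, no communication
    return [sum((out[i] for out in partial), []) for i in range(len(x))]

def row_parallel(x, w, parts):
    """Each device holds k/parts rows of W and the matching columns of X; partials are summed (all-reduce)."""
    k = len(w)
    assert k % parts == 0, "input dim must split evenly"
    step = k // parts
    partial = [matmul([row[p * step:(p + 1) * step] for row in x], w[p * step:(p + 1) * step])
               for p in range(parts)]
    return [[sum(out[i][j] for out in partial) for j in range(len(w[0]))] for i in range(len(x))]
```

Megatron-LM pairs the two splits (column-parallel for the first linear layer of an MLP, row-parallel for the second) so that the block needs only one all-reduce in the forward pass.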
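The OPT-66B chart in section 04 compares A100 (80GB), V100 (32GB), and A10 (24GB) cards at fp16, int8, and int4. A back-of-the-envelope calculation shows why lower precision matters: weight memory is parameters × bits / 8 bytes. The sketch assumes a nominal 66e9 parameters and decimal gigabytes, and counts weights only (no KV cache, activations, or runtime overhead), so its GPU counts are lower bounds, not the chart's values.

```python
import math

PARAMS = 66e9  # nominal OPT-66B parameter count
CARD_GB = {"A100": 80, "V100": 32, "A10": 24}
BITS = {"fp16": 16, "int8": 8, "int4": 4}

def weight_gb(params, bits):
    """Weight storage in decimal GB: params * bits / 8 bytes."""
    return params * bits / 8 / 1e9

def min_gpus(params, bits, card_gb):
    """Weights-only lower bound; KV cache, activations and runtime overhead are ignored."""
    return math.ceil(weight_gb(params, bits) / card_gb)

table = {card: {fmt: min_gpus(PARAMS, bits, gb) for fmt, bits in BITS.items()}
         for card, gb in CARD_GB.items()}
```

Under these assumptions the weights take 132 GB at fp16, 66 GB at int8, and 33 GB at int4; int4 weights alone fit on one A100 but still need at least two V100s or two A10s.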

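Weight quantization, the first model-compression item in section 04, stores weights as low-bit integers plus a floating-point scale. The sketch below uses symmetric absmax quantization with one scale per row (per output channel), one common scheme among several; the slides do not say which scheme BladeLLM uses, and the helper names are hypothetical.

```python
def quantize_row(row, bits=8):
    """Symmetric absmax quantization of one output channel to signed `bits`-bit integers."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for int8, 7 for int4
    scale = max(abs(v) for v in row) / qmax or 1.0  # all-zero row: any scale works
    q = [max(-qmax, min(qmax, round(v / scale))) for v in row]
    return q, scale

def dequantize_row(q, scale):
    return [v * scale for v in q]

def quantize_matrix(w, bits=8):
    """Per-row (per-output-channel) scales, one float per row."""
    return [quantize_row(row, bits) for row in w]

def max_abs_error(w, bits=8):
    """Largest round-trip error over the whole matrix."""
    err = 0.0
    for row, (q, scale) in zip(w, quantize_matrix(w, bits)):
        err = max(err, max(abs(a - b) for a, b in zip(row, dequantize_row(q, scale))))
    return err
```

Each value's round-trip error is at most half its row's scale, so int4 (7 levels per sign) is much coarser than int8 (127 levels per sign). Activation and KV Cache quantization can apply a similar round-trip to runtime tensors instead of stored weights.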
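Section 05 lists NCCL Hang among the failures and pairs AIMaster with hangs. The slides do not describe AIMaster's mechanism; the class below is a hypothetical heartbeat watchdog showing one simple way a hang can be detected: the training loop calls `beat()` each step, and a background thread reports if no beat arrives within `timeout` seconds.

```python
import threading
import time

class HangWatchdog:
    """Hypothetical helper: reports a hang when no heartbeat arrives within `timeout` seconds."""

    def __init__(self, timeout, on_hang):
        self.timeout = timeout
        self.on_hang = on_hang          # called once with the idle time in seconds
        self._last = time.monotonic()
        self._lock = threading.Lock()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()
        return self

    def beat(self):
        """Call once per training step to signal progress."""
        with self._lock:
            self._last = time.monotonic()

    def stop(self):
        self._stop.set()
        self._thread.join()

    def _run(self):
        # Poll several times per timeout window until stopped or a hang is seen.
        while not self._stop.wait(self.timeout / 4):
            with self._lock:
                idle = time.monotonic() - self._last
            if idle > self.timeout:
                self.on_hang(idle)
                return
```

A real system would act on the report (for example by dumping stacks or restarting the job) and would set the timeout well above the slowest normal step.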

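EasyCKPT is described as multi-level storage with asynchronous, parallel saving. The sketch below shows only the two-level idea under stated assumptions, not EasyCKPT's API: the training state is a plain picklable dict, level one is an in-memory deep copy taken while training pauses briefly, and level two is a background thread that writes the copy to disk through a temporary file and an atomic `os.replace`. The class name and file naming are hypothetical.

```python
import copy
import os
import pickle
import tempfile
import threading

class AsyncCheckpointer:
    """Hypothetical two-level checkpointer: in-memory snapshot, then a background disk write."""

    def __init__(self, directory):
        self.directory = directory
        self._pending = None

    def save(self, state, step):
        # Level 1: deep-copy so training can keep mutating `state` right away.
        snapshot = copy.deepcopy(state)
        self.wait()  # allow at most one disk write in flight
        self._pending = threading.Thread(target=self._persist, args=(snapshot, step))
        self._pending.start()

    def _persist(self, snapshot, step):
        # Level 2: write to a temp file, then rename atomically so readers never see a partial file.
        final = os.path.join(self.directory, f"ckpt_{step}.pkl")
        fd, tmp = tempfile.mkstemp(dir=self.directory)
        with os.fdopen(fd, "wb") as f:
            pickle.dump(snapshot, f)
        os.replace(tmp, final)

    def wait(self):
        """Block until the in-flight disk write, if any, has finished."""
        if self._pending is not None:
            self._pending.join()
            self._pending = None
```

Training only waits for the deep copy and for any earlier write still in flight; this is the general reason asynchronous checkpointing can shrink the visible pause.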