The Crucial Role of NUMA Awareness in High-Performance Deep Learning · PyTorch Model Performance Analysis and Optimization — Part 10 · Jul 7
Pipelining AI/ML Training Workloads With CUDA Streams · PyTorch Model Performance Analysis and Optimization — Part 9 · Jun 21
A Caching Strategy for Identifying Bottlenecks on the Data Input Pipeline · PyTorch Model Performance Analysis and Optimization — Part 8 · Jun 6
The Case for Centralized AI Model Inference Serving · Optimizing Highly Parallel AI Algorithm Execution · Mar 18
Debugging the Dreaded NaN · Capturing and Reproducing Failures in PyTorch Training with Lightning · Feb 26
Streaming Data from Cloud Storage with Mountpoint for Amazon S3 · A First Look at a New Solution for Mounting Cloud-Based Data · Feb 10
Efficient Metric Collection in PyTorch: Avoiding the Performance Pitfalls of TorchMetrics · PyTorch Model Performance Analysis and Optimization — Part 7 · Feb 4
Optimizing Transformer Models for Variable-Length Input Sequences · How PyTorch NestedTensors, FlashAttention2, and xFormers Can Boost Performance and Reduce AI Costs · Published in TDS Archive · Nov 26, 2024
Increasing Transformer Model Efficiency Through Attention Layer Optimization · How Paying "Better" Attention Can Drive ML Cost Savings · Published in TDS Archive · Nov 18, 2024
On the Programmability of AWS Trainium and Inferentia · Accelerating AI/ML Model Training with Custom Operators — Part 4 · Published in TDS Archive · Nov 1, 2024