Speaker 1: Henry
Speaker 2: Jeremy
Speaker 3: Horace

Summary

0:00 | Introduction
0:49 | Shared Memory
12:00 | Shared Memory in Python
18:41 | Dynamic Shared Memory
20:19 | Shared Memory Kernel Runner
32:32 | Relationship between Blocks and Tile Size
34:33 | CUDA-like Approach
35:19 | Python Threads
42:13 | Auto-generating CUDA Code
48:43 | Dynamic vs. Static Shared Memory
52:52 | Future Message from Jeremy
56:10 | Numba Library
1:00:25 | CUDA Simulator
1:03:34 | Pros and Cons of Numba
1:04:01 | Discussion and Q&A
1:06:02 | Comparison to Qt Plus and PyTorch
1:07:38 | Future Directions
1:08:24 | Community Collaboration
1:09:18 | Hardware Features
1:09:49 | PyTorch Matrix Multiplication
1:10:41 | Importance of Fast Compilation
1:11:23 | ChatGPT Capabilities
1:12:03 | Shared Memory in ChatGPT
1:12:32 | Future of Software Development
1:13:57 | Comparison to Triton
1:15:53 | Avoiding Learning CUDA
1:16:43 | Triton Experience
1:17:14 | Conclusion