The Precision Problem
Modern GPUs contain specialized tensor cores designed for matrix operations. These cores do not activate automatically; they require specific conditions: FP16 or INT8 precision, and matrix dimensions that align with hardware tile sizes. Generic FP32 operations leave tensor cores dormant, achieving less than 1% of the hardware's theoretical tensor throughput.
THE ACTIVATION INSIGHT
Tensor cores are not engaged by default. They require precise conditions: data formatted correctly, dimensions aligned properly, precision reduced strategically. Understanding these conditions unlocks the hardware's true potential.
Performance Reality
Validation on a GTX 1650 illustrates the precision-performance relationship.
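A back-of-envelope utilization check makes the "less than 1% of theoretical performance" claim concrete. The sketch below uses the GTX 1650's public specifications (896 CUDA cores, ~1.665 GHz boost clock); the 25 GFLOPS "measured" figure is a hypothetical placeholder, not a result from this article.

```python
# Theoretical peak vs. achieved throughput (illustrative sketch).
# GPU specs are public GTX 1650 figures; measured_gflops is hypothetical.

def peak_fp32_gflops(cuda_cores: int, boost_clock_ghz: float) -> float:
    """Theoretical FP32 peak: 2 FLOPs (one FMA) per core per cycle."""
    return 2 * cuda_cores * boost_clock_ghz

def utilization(measured_gflops: float, peak_gflops: float) -> float:
    """Fraction of theoretical peak actually achieved."""
    return measured_gflops / peak_gflops

peak = peak_fp32_gflops(cuda_cores=896, boost_clock_ghz=1.665)  # ~2984 GFLOPS
print(f"theoretical FP32 peak: {peak:.0f} GFLOPS")
print(f"naive kernel at 25 GFLOPS -> {utilization(25, peak):.1%} of peak")
```

The same arithmetic applies to any card: until precision, alignment, and layout conditions are met, measured throughput sits at single-digit percentages of peak.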
Activation Requirements
Tensor core activation requires three specific conditions:
- Precision Reduction — FP16 or INT8 instead of FP32. Tensor cores are designed for reduced precision with higher throughput.
- Tile Alignment — Matrix dimensions must be multiples of the tensor core tile sizes (e.g., 8×8×16 for INT8 and 16×16×16 for FP16 in CUDA's MMA/WMMA instruction shapes).
- Memory Layout — Data must be arranged in memory to enable coalesced access patterns that tensor cores can process efficiently.
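The tile-alignment requirement above is typically satisfied by padding matrix dimensions up to the next tile multiple before dispatch. A minimal sketch, assuming a 16×16×16 tile (the FP16 WMMA fragment shape; the right value depends on the instruction shape a kernel targets):

```python
# Round GEMM dimensions up to tensor-core tile multiples (illustrative;
# TILE = 16 assumes the 16x16x16 FP16 WMMA shape).

TILE = 16

def padded_dim(n: int, tile: int = TILE) -> int:
    """Round n up to the next multiple of tile."""
    return ((n + tile - 1) // tile) * tile

def gemm_padding(m: int, n: int, k: int) -> tuple:
    """Padded (M, N, K) for an M x K @ K x N product."""
    return padded_dim(m), padded_dim(n), padded_dim(k)

print(gemm_padding(100, 50, 70))   # -> (112, 64, 80)
```

Padded regions are zero-filled, so they contribute nothing to the product; the cost of the extra multiply-accumulates is usually far smaller than the cost of leaving tensor cores idle.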
The Trinity Approach
Trinity architecture implements precision-driven routing:
- Dequantization — Q4_K/Q6_K weights expanded to FP16 on iGPU (AMD Vega) with unified memory
- Tensor Operations — FP16 matrix multiplication routed to dGPU (GTX 1650) for tensor core activation
- Accumulation — FP32 accumulation for numerical stability where precision matters
This mixed-precision pipeline activates tensor cores where beneficial while maintaining accuracy where required.
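The numeric behavior of this pipeline can be simulated on the CPU. The sketch below is not the Trinity kernels: it stands in for Q4_K/Q6_K with a single per-block scale, rounds products through IEEE half precision (via Python's `struct` half-float format), and accumulates in full precision, mirroring the dequantize → FP16 multiply → FP32 accumulate flow.

```python
import struct

def to_fp16(x: float) -> float:
    """Round a Python float through IEEE half precision."""
    return struct.unpack('e', struct.pack('e', x))[0]

def dequantize(qweights, scale):
    """Expand integer-quantized weights to FP16 values
    (simplified stand-in for Q4_K/Q6_K block dequantization)."""
    return [to_fp16(q * scale) for q in qweights]

def dot_mixed(a, b):
    """FP16 products, full-precision accumulation."""
    acc = 0.0  # wide accumulator, as in FP32 tensor-core accumulation
    for x, y in zip(a, b):
        acc += to_fp16(x * y)  # each product rounded to FP16
    return acc

w = dequantize([3, -2, 5, 1], scale=0.125)  # -> [0.375, -0.25, 0.625, 0.125]
print(dot_mixed(w, [1.0, 2.0, 3.0, 4.0]))   # -> 2.25
```

Keeping the accumulator wide is what preserves accuracy: individual FP16 products lose precision, but their rounding errors do not compound in the sum.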
What Remains Hidden
The exact kernel implementations that achieve tensor core activation (the specific WGSL shader sequences, the memory layout transformations, the tile scheduling algorithms) remain within the protected core. We present the principle: precision drives performance. The execution details are the secret sauce.