PTX and Thread Scheduler
Assignment 1. Please analyze the GPU PTX, SSE (or NEON) assembly, and CPU assembly instruction sequences (and Cambricon [1], optionally) of matrix operations, for instance matrix (vector) addition and multiplication.
- Analyze the reasons why the GPU is faster at matrix operations (and, optionally, why Cambricon is more efficient than the GPU in DNN computations).
- Please identify which instructions load data and which are SIMD operations, and compare them with traditional x86 instructions (in scalar x86 code, matrix operations are always organized as loops).
Assignment 2. Study the thread scheduler of a GPGPU by analyzing the warp scheduler.
- Read the relevant GPGPU-Sim warp-scheduler code and find where the scoreboarding algorithm is implemented. Please follow the algorithm and the warp controller structure (that is, draw the flow diagram and the structure diagram).
- Please illustrate the performance with and without the memory-access latency being hidden by the warp scheduler (the key to this problem is to construct sufficient operations for the scheduler).
Note: A LaTeX template has been uploaded to overleaf.com, URL: https://www.overleaf.com/read/vkyjvtnzrczh
Reference
[1] S. Liu, Z. Du, J. Tao, D. Han, T. Luo, Y. Xie, Y. Chen, and T. Chen, "Cambricon: An Instruction Set Architecture for Neural Networks," in Proc. 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), 2016, pp. 393-405.