

Hi,
We are proposing first-class type support for a new matrix type. This is a natural extension of the current vector type with an extra dimension. For example, this is what the IR for a matrix multiply would look like for a 4x4 matrix with element type float:

%0 = load <4 x 4 x float>, <4 x 4 x float>* %a, align 16
%1 = load <4 x 4 x float>, <4 x 4 x float>* %b, align 16
%2 = call <4 x 4 x float> @llvm.matrix.multiply.m4_4f32.m4_4f32.m4_4f32(<4 x 4 x float> %0, <4 x 4 x float> %1)
store <4 x 4 x float> %2, <4 x 4 x float>* %c, align 16
Currently we support element-wise binary operations, matrix multiply, matrix-scalar multiply, matrix transpose, and extract/insert of an element. Besides the regular full-matrix load and store, we also support loading and storing a matrix as a submatrix of a larger matrix in memory. We are also planning to implement vector-extract/insert and matrix-vector multiply.
All of these are currently implemented as intrinsics. Where applicable we also plan to support these operations with native IR instructions (e.g. add/fadd).
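To make the submatrix load concrete, here is a plain-C++ sketch of what loading a submatrix out of a larger column-major matrix in memory involves. This is illustrative only, not the proposed intrinsic; the function name and signature are invented for this example.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical illustration (not the proposed LLVM intrinsic): copy a
// rows x cols submatrix out of a larger column-major matrix whose
// column stride `ld` is the row count of the enclosing matrix.
std::vector<float> load_submatrix(const float *base, std::size_t ld,
                                  std::size_t row0, std::size_t col0,
                                  std::size_t rows, std::size_t cols) {
  std::vector<float> out(rows * cols);
  for (std::size_t j = 0; j < cols; ++j)    // walk destination columns
    for (std::size_t i = 0; i < rows; ++i)  // contiguous within a column
      out[j * rows + i] = base[(col0 + j) * ld + (row0 + i)];
  return out;
}
```

A strided store of a submatrix is the mirror image, writing back through the same `(col0 + j) * ld + (row0 + i)` addressing.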
These are exposed in clang via builtins. E.g., the above operations look like this in C/C++:
typedef float mf4x4_t __attribute__((matrix_type(4, 4)));
mf4x4_t multiply(mf4x4_t a, mf4x4_t b) { return __builtin_matrix_multiply(a, b); }
** Benefits **
Having matrices represented as IR values allows for the usual algebraic and redundancy optimizations. But most importantly, by lifting memory aliasing concerns, we can guarantee vectorization to target-specific vectors. Having a matrix-multiply intrinsic also allows using FMA regardless of the optimization level, which is the usual sticking point with adopting FP contraction.
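To illustrate the FMA point, here is what the scalar expansion of a 4x4 single-precision multiply looks like when each dot product is written with explicit fused multiply-adds. This is a sketch, not the actual lowering; `Mat4` and `multiply4x4` are invented names.

```cpp
#include <array>
#include <cmath>

// Illustrative scalar expansion (not the actual lowering) of a 4x4
// single-precision matrix multiply with column-major operands. Each
// result element is a chain of fused multiply-adds, which is what a
// matrix-multiply intrinsic licenses the backend to emit regardless
// of the optimization level or FP-contraction settings.
using Mat4 = std::array<float, 16>; // column-major: element (i,j) at j*4+i

Mat4 multiply4x4(const Mat4 &a, const Mat4 &b) {
  Mat4 c{};
  for (int j = 0; j < 4; ++j)
    for (int i = 0; i < 4; ++i) {
      float acc = 0.0f;
      for (int k = 0; k < 4; ++k)
        acc = std::fma(a[k * 4 + i], b[j * 4 + k], acc); // one rounding per step
      c[j * 4 + i] = acc;
    }
  return c;
}
```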
Adding a new dedicated first-class type has several advantages over mapping them directly to existing IR types like vectors in the front end. Matrices have the unique requirement that both rows and columns need to be accessed efficiently. By retaining the original type, we can analyze entire chains of operations (after inlining) and determine the most efficient intermediate layout for the matrix values between ABI-observable points (calls, memory operations).
The resulting intermediate layout could be something like a single vector spanning the entire matrix or a set of vectors and scalars representing individual rows/columns. This is beneficial for example because rows/columns would be aligned to the HW vector boundary (e.g. for a 3x3 matrix).
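As a concrete sketch of the 3x3 case: the matrix could be kept as three columns, each padded out to a 4-wide hardware vector so every column starts on a vector boundary. The index mapping below is a hypothetical illustration of such a layout, not something the proposal specifies.

```cpp
#include <cstddef>

// Hypothetical intermediate layout for a 3x3 float matrix: each of the
// three columns is padded to a 4-wide hardware vector, so column j
// starts at vector-aligned slot 4*j and element (i, j) lives at slot
// 4*j + i. Slots 3, 7 and 11 are padding.
constexpr std::size_t padded_index(std::size_t i, std::size_t j) {
  constexpr std::size_t padded_col = 4; // assumed HW vector width in floats
  return j * padded_col + i;
}
```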
The layout could also be made up of tiles/submatrices of the matrix. This is an important case for us to fight register pressure. Rather than loading entire matrices into registers it lets us load only parts of the input matrix at a time in order to compute some part of the output matrix. Since this transformation reorders memory operations, we may also need to emit runtime alias checks.
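A rough C++ sketch of the kind of tiling meant here (illustrative only; the tile and matrix sizes are hard-coded, whereas the real pass would pick them from the target): while producing one 2x2 output tile, only a 2x4 slice of A and a 4x2 slice of B are live, rather than both full matrices.

```cpp
#include <array>

// Illustrative tiled 4x4 column-major multiply: the output is produced
// one 2x2 tile at a time to bound how much of the inputs must be held
// in registers at once.
using Mat4 = std::array<float, 16>; // element (i,j) at j*4+i

Mat4 tiled_multiply(const Mat4 &a, const Mat4 &b) {
  constexpr int N = 4, T = 2; // matrix size, tile size
  Mat4 c{};
  for (int jt = 0; jt < N; jt += T)   // tile of output columns
    for (int it = 0; it < N; it += T) // tile of output rows
      for (int j = jt; j < jt + T; ++j)
        for (int i = it; i < it + T; ++i) {
          float acc = 0.0f;
          for (int k = 0; k < N; ++k)
            acc += a[k * N + i] * b[j * N + k];
          c[j * N + i] = acc;
        }
  return c;
}
```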
Having a dedicated first-class type also allows for dedicated target-specific ABIs for matrices. This is a pretty rich area for matrices: whether the matrix is stored in row-major or column-major order, whether there is padding between rows/columns, and when and how matrices are passed in argument registers. Providing flexibility on the ABI side was critical for the adoption of the new type at Apple.
Having all this knowledge at the IR level means that frontends are able to share the complexities of the implementation. They just map their matrix type to the IR type and the builtins to intrinsics.
At Apple, we also need to support interoperability between row-major and column-major layout. Since conversion between the two layouts is costly, they should be separate types requiring explicit instructions to convert between them. Extending the new type to include the order makes tracking the format easy and allows finding optimal conversion points.
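A sketch of what such an explicit conversion amounts to (plain C++ with invented names; the proposal would express this as a conversion instruction the optimizer can place and deduplicate, not as a loop the user writes):

```cpp
#include <array>

// Explicit layout conversion: turn a column-major 4x4 matrix into a
// row-major one. Element (i, j) moves from index j*4+i to index i*4+j.
// The data movement is a full transpose of the storage, which is why
// the conversion is costly enough to deserve its own instruction.
using Mat4 = std::array<float, 16>;

Mat4 column_to_row_major(const Mat4 &m) {
  Mat4 out{};
  for (int i = 0; i < 4; ++i)
    for (int j = 0; j < 4; ++j)
      out[i * 4 + j] = m[j * 4 + i];
  return out;
}
```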
** ABI **
We currently default to column-major order with no padding between the columns in memory. We have plans to also support row-major order, and we would probably have to support padding at some point for targets where unaligned accesses are slow. In order to make the IR self-contained I am planning to make the defaults explicit in the DataLayout string.
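Under that default, the address computation is straightforward; a hypothetical helper for an R-row float matrix (invented name, shown only to pin down the layout being described):

```cpp
#include <cstddef>

// Default in-memory layout as described: column-major, no padding.
// Element (i, j) of a matrix with `rows` rows then sits at this byte
// offset from the base pointer.
constexpr std::size_t elt_offset(std::size_t i, std::size_t j,
                                 std::size_t rows) {
  return (j * rows + i) * sizeof(float);
}
```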
For function parameters and return values, matrices are currently placed in memory. Moving forward, we should pass small matrices in vector registers. Treating matrices as structures of vectors seems a natural default. This works well for AArch64, since Homogeneous Short-Vector Aggregates (HVAs) can use all 8 SIMD argument registers. Thus we could pass, for example, two 4 x 4 x float matrices in registers. However, on X86, we can only pass “four eightbytes”, thus limiting us to two 2 x 2 x float matrices.
Alternatively, we could treat a matrix as if its rows/columns were passed as separate vector arguments. This would allow using all 8 vector argument registers on X86 too.
Alignment of the matrix type is the same as the alignment of its first row/column vector.
** Flow **
Clang at this point mostly just forwards everything to LLVM. Then in LLVM, we have an IR function pass that lowers the matrices to target-supported vectors. As with vectors, matrices can be of any static size with any of the primitive types as the element type.
After the lowering pass, we only have matrix function arguments and instructions building up and splitting matrix values from and to vectors. CodeGen then lowers the arguments and forwards the vector values. CodeGen is already capable of further lowering vectors of any size to scalars if the target does not support vectors.
The lowering pass is also run at O0, rather than legalizing the matrix type during CodeGen like it’s done for structure values or invalid vectors. I don’t really see big value in duplicating this logic across the IR and CodeGen. We just need a lighter mode in the pass at O0.
** Rollout and Maintenance **
Since this will be experimental for some time, I am planning to put this behind a flag: -fenable-experimental-matrix-type. ABI and intrinsic compatibility won’t be guaranteed until we lift the experimental status.
We are obviously interested in maintaining and improving this code in the future.
Looking forward to comments and suggestions.
Thanks,
Adam
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev


This sounds like it would be really useful for 3D graphics APIs, as SPIR-V (the Vulkan/OpenCL 2.1 intermediate representation) has matrices as first-class types. This will promote round-trip ability. I have also heard that RISC-V's V extension wants to support matrices (not sure if it's as an additional extension on top of V or as part of V proper).
For the ABI, I recommend considering compatibility with the layouts required for 3D Graphics APIs. For Vulkan, I think the relevant section in the spec is 14.5.4.
Jacob Lifshay
On Wed, Oct 10, 2018 at 11:10 PM Adam Nemet via llvm-dev < [hidden email]> wrote: [...]


Adam Nemet via cfe-dev < [hidden email]> writes:
> %0 = load <4 x 4 x float>, <4 x 4 x float>* %a, align 16
> %1 = load <4 x 4 x float>, <4 x 4 x float>* %b, align 16
> %2 = call <4 x 4 x float>
> @llvm.matrix.multiply.m4_4f32.m4_4f32.m4_4f32(<4 x 4 x float> %0,
> <4 x 4 x float> %1)
> store <4 x 4 x float> %2, <4 x 4 x float>* %c, align 16
This sounds very interesting. Would it make sense to later expand the
idea to allow an arbitrary number of dimensions? Maybe that doesn't
make sense if we're restricted to statically-known dimensions.
How would this relate to scalable vectors? Most of the time matrix
dimensions are not known statically. Would <n x m x float> be possible?
Do you have a prototype of this?
David


> On Oct 11, 2018, at 7:56 AM, David Greene < [hidden email]> wrote:
>
> Adam Nemet via cfe-dev < [hidden email]> writes:
>
>> %0 = load <4 x 4 x float>, <4 x 4 x float>* %a, align 16
>> %1 = load <4 x 4 x float>, <4 x 4 x float>* %b, align 16
>> %2 = call <4 x 4 x float>
>> @llvm.matrix.multiply.m4_4f32.m4_4f32.m4_4f32(<4 x 4 x float> %0,
>> <4 x 4 x float> %1)
>> store <4 x 4 x float> %2, <4 x 4 x float>* %c, align 16
>
> This sounds very interesting. Would it make sense to later expand the
> idea to allow an arbitrary number of dimensions? Maybe that doesn't
> make sense if we're restricted to statically-known dimensions.
Yes, that matches my current feeling too.
> How would this relate to scalable vectors?
Scalable vectors would be a possible lowering of the matrix type. I *believe* you'd need to generate loops, or at least some conditional code at run time, due to the unknown scale factor.
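A sketch of why that is: with a scalable vector the hardware vector length is only known at run time, so even a fixed-size element-wise matrix operation becomes a strip-mined loop. Plain C++, with `vl` standing in for the unknown scale factor; names are invented for illustration.

```cpp
#include <algorithm>
#include <cstddef>

// Strip-mined element-wise add over n floats, processed vl at a time.
// Each inner loop iteration stands in for one scalable-vector op; the
// std::min handles the tail when n is not a multiple of vl.
void elementwise_add(const float *a, const float *b, float *c,
                     std::size_t n, std::size_t vl) {
  for (std::size_t i = 0; i < n; i += vl) {
    std::size_t chunk = std::min(vl, n - i); // tail handling
    for (std::size_t k = 0; k < chunk; ++k)  // one "vector op"
      c[i + k] = a[i + k] + b[i + k];
  }
}
```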
> Most of the time matrix dimensions are not known statically. Would <n x m x float> be possible?
No, we only support staticallyknown dimensions.
> Do you have a prototype of this?
Yes. I can make it available if there is interest.
Adam
>
> David


On Wed, 10 Oct 2018 at 23:10, Adam Nemet via cfe-dev < [hidden email]> wrote: [...]
This seems to me to be proposing two distinct things (builtin matrix support in the frontend as a language extension, and support for matrices in LLVM IR), and I think it makes sense to discuss them at least somewhat separately.
I'm totally happy to assume that you can cover most of those points, but point 4 seems likely to be a potential sticking point. Have you talked to WG14 about adding a matrix extension to C? (I'd note that they don't even have a vector language extension yet, but as noted on that page, we should be driving the relevant standards, not diverging from them, so perhaps we should be pushing for that too). Have you talked to the people working on adding vector types to C++ about this (in particular, Matthias Kretz and Tim Shen; see http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0214r9.pdf and http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4454.pdf for current / prior work in this area)?
On the LLVM IR side, I'm personally unconvinced that we should model matrices in the IR directly as a new first-class type, unless there's some target out there that has matrix operations in hardware / matrix registers, but IR is not really my area of expertise so give that opinion as much or little weight as you see fit. However, I do wonder: how is this different from, say, complex numbers, for which we don't have a native IR representation? (Maybe the answer is that we should have native IR support for complex numbers too.) How would you expect the frontend to lower (eg) matrix multiplication for matrices of complex numbers?


On Thu, Oct 11, 2018, 15:43 Richard Smith via llvm-dev < [hidden email]> wrote:
> On the LLVM IR side, I'm personally unconvinced that we should model matrices in the IR directly as a new first-class type, unless there's some target out there that has matrix operations in hardware / matrix registers, but IR is not really my area of expertise so give that opinion as much or little weight as you see fit. However, I do wonder: how is this different from, say, complex numbers, for which we don't have a native IR representation? (Maybe the answer is that we should have native IR support for complex numbers too.) How would you expect the frontend to lower (eg) matrix multiplication for matrices of complex numbers?
I had mentioned it earlier: I believe RISC-V's V extension is planning on supporting matrices in hardware (maybe as an extension on top of V), so that satisfies there being a target with support for matrix registers. Additionally, SPIR-V has matrices as a first-class type, and I believe that there is support for targeting SPIR-V either in progress or planned.
Jacob


On Fri, 12 Oct 2018, 00:44 Richard Smith via cfe-dev < [hidden email]> wrote:
> On the LLVM IR side, I'm personally unconvinced that we should model matrices in the IR directly as a new first-class type, unless there's some target out there that has matrix operations in hardware / matrix registers, but IR is not really my area of expertise so give that opinion as much or little weight as you see fit.
Ciao,
Andrea


On Wed, 10 Oct 2018 at 23:10, Adam Nemet via cfe-dev < [hidden email]> wrote: [...]
> This seems to me to be proposing two distinct things (builtin matrix support in the frontend as a language extension, and support for matrices in LLVM IR), and I think it makes sense to discuss them at least somewhat separately.
> I'm totally happy to assume that you can cover most of those points, but point 4 seems likely to be a potential sticking point. Have you talked to WG14 about adding a matrix extension to C? (I'd note that they don't even have a vector language extension yet, but as noted on that page, we should be driving the relevant standards, not diverging from them, so perhaps we should be pushing for that too). Have you talked to the people working on adding vector types to C++ about this (in particular, Matthias Kretz and Tim Shen; see http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0214r9.pdf and http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4454.pdf for current / prior work in this area)?
Yes, we’re certainly interested in pushing this into the appropriate language standards. As I was saying in the proposal, right now this is experimental. We see great performance improvements with this, but we’d like to get feedback and data from the wider community and also continue to improve the implementation. Also, the surface area for this extension is pretty minimal and piggybacks on the existing vector support. As you’ve seen in the thread, there is a diverse set of people from GPU to CPU expressing interests and preferences. My goal was to put this out there and have people collaborate and then propose what’s actually working. Obviously ABI is a big consideration here and I don’t want to design in a vacuum. Would this model work for you?
I also strongly believe that the right approach for good matrix performance is gradual lowering rather than lowering all the way to loops and then trying to recover the original intent of the program, which is what the above approach takes.

> On the LLVM IR side, I'm personally unconvinced that we should model matrices in the IR directly as a new first-class type, unless there's some target out there that has matrix operations in hardware / matrix registers, but IR is not really my area of expertise so give that opinion as much or little weight as you see fit.

Since people already spoke up about direct HW needs, you’re probably satisfied here, but as I said there are other benefits. For a chain of matrix operations it is very beneficial to fuse the operations and then have it operate on tiles of the matrices at a time. You don’t need to go to very large matrices at all to start to hit this.

> However, I do wonder: how is this different from, say, complex numbers, for which we don't have a native IR representation? (Maybe the answer is that we should have native IR support for complex numbers too.)

That’s hard for me to answer as I didn’t look at complex-number performance.

> How would you expect the frontend to lower (eg) matrix multiplication for matrices of complex numbers?

As you say, complex numbers are not a first-class type, so this is not supported, just like it’s not for vectors, i.e. you’d have to use an array of complex numbers. That said, this may tip the scales to introduce complex numbers as well.
Adam
** Benefits **
Having matrices represented as IR values allows for the usual algebraic and redundancy optimizations. But most importantly, by lifting memory aliasing concerns, we can guarantee vectorization to targetspecific vectors. Having a matrixmultiply intrinsic also allows using FMA regardless of the optimization level which is the usual sticking point with adopting FPcontraction.
Adding a new dedicated firstclass type has several advantages over mapping them directly to existing IR types like vectors in the front end. Matrices have the unique requirement that both rows and columns need to be accessed efficiently. By retaining the original type, we can analyze entire chains of operations (after inlining) and determine the most efficient intermediate layout for the matrix values between ABI observable points (calls, memory operations).
The resulting intermediate layout could be something like a single vector spanning the entire matrix or a set of vectors and scalars representing individual rows/columns. This is beneficial for example because rows/columns would be aligned to the HW vector boundary (e.g. for a 3x3 matrix).
The layout could also be made up of tiles/submatrices of the matrix. This is an important case for us to fight register pressure. Rather than loading entire matrices into registers it lets us load only parts of the input matrix at a time in order to compute some part of the output matrix. Since this transformation reorders memory operations, we may also need to emit runtime alias checks.
Having a dedicated firstclass type also allows for dedicated targetspecific ABIs for matrixes. This is a pretty rich area for matrices. It includes whether the matrix is stored rowmajor or columnmajor order. Whether there is padding between rows/columns. When and how matrices are passed in argument registers. Providing flexibility on the ABI side was critical for the adoption of the new type at Apple.
Having all this knowledge at the IR level means that frontends are able to share the complexities of the implementation. They just map their matrix type to the IR type and the builtins to intrinsics.
At Apple, we also need to support interoperability between row-major and column-major layouts. Since conversion between the two layouts is costly, they should be separate types, requiring explicit instructions to convert between them. Extending the new type to include the order makes tracking the format easy and allows finding optimal conversion points.
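As an illustration of why the conversion deserves an explicit instruction (function name invented for this sketch): converting an R x C column-major buffer to row-major is a full copy with transposed indexing, i.e. real work the optimizer should be able to see and place.

```cpp
#include <array>
#include <cstddef>

// Illustrative layout conversion: element (r, c) moves from column-major
// index c * R + r to row-major index r * C + c. Every element is touched,
// which is why the two layouts should be distinct types with explicit casts.
template <std::size_t R, std::size_t C>
std::array<float, R * C>
columnMajorToRowMajor(const std::array<float, R * C> &colMajor) {
  std::array<float, R * C> rowMajor{};
  for (std::size_t r = 0; r < R; ++r)
    for (std::size_t c = 0; c < C; ++c)
      rowMajor[r * C + c] = colMajor[c * R + r];  // element (r, c)
  return rowMajor;
}
```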
** ABI **
We currently default to column-major order with no padding between the columns in memory. We have plans to also support row-major order, and we would probably have to support padding at some point for targets where unaligned accesses are slow. In order to make the IR self-contained, I am planning to make the defaults explicit in the DataLayout string.
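To make the default concrete (the helper below is invented for illustration): with column-major order and no padding, element (row, col) of an R-row matrix lives at element offset col * R + row, and any inter-column padding changes the stride, which is why the defaults need to be spelled out in the DataLayout string.

```cpp
#include <cstddef>

// Illustrative addressing rule implied by the default layout: column-major,
// with an optional per-column pad that widens the column stride. The offset
// is in elements; multiply by sizeof(element) for a byte offset.
constexpr std::size_t elemOffset(std::size_t row, std::size_t col,
                                 std::size_t rows, std::size_t padPerCol = 0) {
  return col * (rows + padPerCol) + row;
}
```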
For function parameters and return values, matrices are currently placed in memory. Moving forward, we should pass small matrices in vector registers. Treating matrices as structures of vectors seems a natural default. This works well for AArch64, since Homogeneous Short-Vector Aggregates (HVAs) can use all 8 SIMD argument registers. Thus we could pass, for example, two 4 x 4 x float matrices in registers. However, on X86 we can only pass “four eightbytes”, thus limiting us to two 2 x 2 x float matrices.
Alternatively, we could treat a matrix as if its rows/columns were passed as separate vector arguments. This would allow using all 8 vector argument registers on X86 too.
Alignment of the matrix type is the same as the alignment of its first row/column vector.
** Flow **
Clang at this point mostly just forwards everything to LLVM. Then, in LLVM, we have an IR function pass that lowers the matrices to target-supported vectors. As with vectors, matrices can be of any static size, with any of the primitive types as the element type.
After the lowering pass, we only have matrix function arguments and instructions building up and splitting matrix values from and to vectors. CodeGen then lowers the arguments and forwards the vector values. CodeGen is already capable of further lowering vectors of any size to scalars if the target does not support vectors.
The lowering pass is also run at O0, rather than legalizing the matrix type during CodeGen as is done for structure values or invalid vectors. I don’t really see big value in duplicating this logic across the IR and CodeGen; we just need a lighter mode in the pass at O0.
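A hypothetical picture of what the lowering pass conceptually does for an elementwise op (types and function invented for this sketch): an add on a <4 x 4 x float> becomes four <4 x float> vector adds, one per column.

```cpp
#include <array>
#include <cstddef>

// Illustrative lowering of an elementwise matrix add: the matrix is split
// into its column vectors and each column becomes one vector add. The inner
// loop stands in for a single target vector instruction.
using Vec4 = std::array<float, 4>;
using Mat4x4 = std::array<Vec4, 4>;  // four column vectors

Mat4x4 matrixAdd(const Mat4x4 &a, const Mat4x4 &b) {
  Mat4x4 result{};
  for (std::size_t col = 0; col < 4; ++col)      // one vector add per column
    for (std::size_t lane = 0; lane < 4; ++lane)
      result[col][lane] = a[col][lane] + b[col][lane];
  return result;
}
```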
** Rollout and Maintenance **
Since this will be experimental for some time, I am planning to put this behind a flag: -fenable-experimental-matrix-type. ABI and intrinsic compatibility won’t be guaranteed until we lift the experimental status.
We are obviously interested in maintaining and improving this code in the future.
Looking forward to comments and suggestions.
Thanks,
Adam

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev


On 10/11/2018 08:29 PM, Adam Nemet via cfe-dev wrote:
On Wed, 10 Oct 2018 at 23:10, Adam Nemet via cfe-dev <[hidden email]> wrote:
This seems to me to be proposing two distinct things (builtin matrix support in the frontend as a language extension, and support for matrices in LLVM IR), and I think it makes sense to discuss them at least somewhat separately.
I'm totally happy to assume that you can cover most of those points, but point 4 seems likely to be a potential sticking point. Have you talked to WG14 about adding a matrix extension to C? (I'd note that they don't even have a vector language extension yet, but as noted on that page, we should be driving the relevant standards, not diverging from them, so perhaps we should be pushing for that too). Have you talked to the people working on adding vector types to C++ about this (in particular, Matthias Kretz and Tim Shen; see http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0214r9.pdf and http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4454.pdf for current / prior work in this area)?
Yes, we’re certainly interested in pushing this into the appropriate language standards. As I was saying in the proposal, right now this is experimental. We see great performance improvements with this, but we’d like to get feedback and data from the wider community and also continue to improve the implementation. Also, the surface area for this extension is pretty minimal and piggybacks on vectors. As you’ve seen in the thread, there is a diverse set of people, from GPU to CPU, expressing interests and preferences. My goal was to put this out there, have people collaborate, and then propose what’s actually working. Obviously ABI is a big consideration here and I don’t want to design in a vacuum. Would this model work for you?
I also strongly believe that the right approach for good matrix performance is gradual lowering, rather than lowering all the way to loops and then trying to recover the original intent of the program, which is what the above approach takes.
On the LLVM IR side, I'm personally unconvinced that we should model matrices in the IR directly as a new first-class type, unless there's some target out there that has matrix operations in hardware / matrix registers. But IR is not really my area of expertise, so give that opinion as much or little weight as you see fit.
Since people already spoke up about direct HW needs, you’re probably satisfied here, but as I said there are other benefits. For a chain of matrix operations, it is very beneficial to fuse the operations and then have them operate on tiles of the matrices at a time. You don’t need to go to very large matrices at all to start to hit this.
Can you please write more about the cases you're seeing where this provides benefits? What kinds of InstCombine(like) rules have you implemented to go along with this? Is this something that we could get if we had better loop transformations (i.e., loop fusion)?
Thanks again,
Hal
However, I do wonder: how is this different from, say, complex numbers, for which we don't have a native IR representation? (Maybe the answer is that we should have native IR support for complex numbers too.)
That’s hard for me to answer as I didn’t look at complex-number performance.
How would you expect the frontend to lower (e.g.) matrix multiplication for matrices of complex numbers?
As you say, complex numbers are not a first-class type, so this is not supported, just like it’s not for vectors; i.e., you’d have to use an array of complex numbers. That said, this may tip the scales to introduce complex numbers as well.
Adam

Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory


On Thu, 11 Oct 2018 at 18:29, Adam Nemet via cfe-dev <[hidden email]> wrote:
This seems to me to be proposing two distinct things (builtin matrix support in the frontend as a language extension, and support for matrices in LLVM IR), and I think it makes sense to discuss them at least somewhat separately.
I'm totally happy to assume that you can cover most of those points, but point 4 seems likely to be a potential sticking point. Have you talked to WG14 about adding a matrix extension to C? (I'd note that they don't even have a vector language extension yet, but as noted on that page, we should be driving the relevant standards, not diverging from them, so perhaps we should be pushing for that too). Have you talked to the people working on adding vector types to C++ about this (in particular, Matthias Kretz and Tim Shen; see http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2018/p0214r9.pdf and http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2015/n4454.pdf for current / prior work in this area)?
Yes, we’re certainly interested in pushing this into the appropriate language standards. As I was saying in the proposal, right now this is experimental. We see great performance improvements with this, but we’d like to get feedback and data from the wider community and also continue to improve the implementation. Also, the surface area for this extension is pretty minimal and piggybacks on vectors. As you’ve seen in the thread, there is a diverse set of people, from GPU to CPU, expressing interests and preferences. My goal was to put this out there, have people collaborate, and then propose what’s actually working. Obviously ABI is a big consideration here and I don’t want to design in a vacuum. Would this model work for you?
Yes, I think that's OK, but I also think you should start talking to the relevant people from WG14 and WG21 early (i.e., right now) so that they can provide feedback to you and vice versa about the needs of the most likely eventual language model and of the optimizer. We don't want to find in a year or two that this somehow fundamentally fails to satisfy the frontend language needs, or conversely that WG21 has standardized something that's problematic to lower to our optimizable primitives.
If you've not done so already, it'd be a good idea to also talk to the Eigen folks to make sure that what you're adding is something they could use in the cases where it's applicable.
I also strongly believe that the right approach for good matrix performance is gradual lowering, rather than lowering all the way to loops and then trying to recover the original intent of the program, which is what the above approach takes. On the LLVM IR side, I'm personally unconvinced that we should model matrices in the IR directly as a new first-class type, unless there's some target out there that has matrix operations in hardware / matrix registers, but IR is not really my area of expertise so give that opinion as much or little weight as you see fit.
Since people already spoke up about direct HW needs, you’re probably satisfied here
Yes, I am. but as I said there are other benefits. For a chain of matrix operations it is very beneficial to fuse the operations and then have it operate on tiles of the matrices at a time. You don’t need to go to very large matrices at all to start to hit this.
Well, modern C++ matrix libraries deal with similar issues using expression templates, but such an approach is certainly not without drawbacks. However, I do wonder: how is this different from, say, complex numbers, for which we don't have a native IR representation? (Maybe the answer is that we should have native IR support for complex numbers too.)
That’s hard for me to answer as I didn’t look at complex-number performance. How would you expect the frontend to lower (e.g.) matrix multiplication for matrices of complex numbers?
As you say, complex numbers are not a first-class type, so this is not supported, just like it’s not for vectors; i.e., you’d have to use an array of complex numbers. That said, this may tip the scales to introduce complex numbers as well.
This seems worth thinking about early, even if you choose to defer implementing anything in this space.
Adam


On 11 Oct 2018, at 23:42, Richard Smith via cfe-dev <[hidden email]> wrote:
>
> On the LLVM IR side, I'm personally unconvinced that we should model matrices in the IR directly as a new first-class type, unless there's some target out there that has matrix operations in hardware / matrix registers, but IR is not really my area of expertise so give that opinion as much or little weight as you see fit. However, I do wonder: how is this different from, say, complex numbers, for which we don't have a native IR representation? (Maybe the answer is that we should have native IR support for complex numbers too.) How would you expect the frontend to lower (e.g.) matrix multiplication for matrices of complex numbers?
I think there are a few potential problems with adding these into LLVM IR as a first-class type:
- With vectors, we have a simple extension from scalar operations to vector ones. A scalar operation on a vector is exactly the same scalar operation applied pairwise to the elements. This is not the case for matrix types.
- A lot of analysis and transform passes need to have special cases for vectors. This is already a bit of a maintenance problem, but one that we put up with because of the big wins from vector types. Do matrix types have the same tradeoff?
- What does a matrix of pointers look like? What is a GEP on a matrix of pointers? If I do GEP on a matrix of pointers and a matrix of integers, and I replace it with ptrtoint, add, inttoptr, is that equivalent?
- Does the IR representation enforce an in-memory representation? If different targets represent matrices in different orders (e.g. column-major vs row-major) then how much do optimisers need to be aware of this?
- Can all of the more complex operations that you might want to implement on matrices be implemented in terms of a smaller number of primitives, without resorting to long extract- and insert-element sequences?
- Can the same optimisations be used on sparse matrix representations that are then lowered by something custom nearer the back end?
- As others have said, how does this interact with variable-sized matrices?
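The first point above can be made concrete with a hypothetical sketch (the 2x2 type and helpers are invented for illustration): a lane-wise `mul` on a matrix would compute the elementwise (Hadamard) product, which is not the linear-algebra product, so the scalar-to-aggregate extension that works for vectors is ambiguous for matrices.

```cpp
#include <array>
#include <cstddef>

using Mat2 = std::array<float, 4>;  // column-major 2x2

// What a lane-wise mul would give: the elementwise (Hadamard) product.
Mat2 hadamard(const Mat2 &a, const Mat2 &b) {
  Mat2 r{};
  for (std::size_t i = 0; i < 4; ++i) r[i] = a[i] * b[i];
  return r;
}

// The linear-algebra matrix product, which is a different operation.
Mat2 matmul(const Mat2 &a, const Mat2 &b) {
  Mat2 r{};
  for (std::size_t i = 0; i < 2; ++i)
    for (std::size_t j = 0; j < 2; ++j)
      for (std::size_t k = 0; k < 2; ++k)
        r[j * 2 + i] += a[k * 2 + i] * b[j * 2 + k];  // A(i,k) * B(k,j)
  return r;
}
```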
David


On 10/11/2018 08:29 PM, Adam Nemet via cfe-dev wrote:
On Wed, 10 Oct 2018 at 23:10, Adam Nemet via cfe-dev <[hidden email]> wrote:
Since people already spoke up about direct HW needs, you’re probably satisfied here, but as I said there are other benefits. For a chain of matrix operations, it is very beneficial to fuse the operations and then have them operate on tiles of the matrices at a time. You don’t need to go to very large matrices at all to start to hit this.
Can you please write more about the cases you're seeing where this provides benefits? What kinds of InstCombine(like) rules have you implemented to go along with this?
Sure. We have a case, for example, where the result of a matrix multiply is in-place added to a submatrix. The intermediate matrix values can’t fit in the vector registers, but even if they could, using the minimal number of registers is very beneficial to provide more register-renaming bandwidth to the OoO cores. So we just load enough from the input matrices to compute a single native-vector-register-size chunk of the multiply, add the matching chunk from the submatrix, and then store it back. What the optimal input chunks are depends on whether the inputs are transposed or not, which is where the InstCombine-like rules come in.
This case takes advantage of another semantic rule with matrices that the new representation allows. Because we know we are loading from the same submatrix that we store to, we know that the different columns between the load and the store can’t alias, which allows the loads and the stores to be reordered in order to enable the fusion above.
Is this something that we could get if we had better loop transformations (i.e., loop fusion)?
As we both know, aliasing usually makes this pretty difficult. For relatively small matrices, runtime memchecks can be costly. Also, we need to be able to evaluate these fusion opportunities in the inliner, so having a first-class type representation makes this cheap.
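A hypothetical scalar sketch of the fusion described above (the function and its shape are invented for illustration): C += A * B computed one output chunk at a time, loading each destination chunk, accumulating into it (each step would be an FMA in the real lowering), and storing it straight back. The column no-alias guarantee is what makes this reordering of loads and stores legal.

```cpp
#include <array>
#include <cstddef>

// Illustrative fused multiply-add into an in-place destination: only one
// output column of C and the strips of A and B feeding it are live at a time,
// instead of whole intermediate matrices.
template <std::size_t N>
void fusedMultiplyAdd(const std::array<float, N * N> &a,  // column-major
                      const std::array<float, N * N> &b,  // column-major
                      std::array<float, N * N> &c) {      // updated in place
  for (std::size_t j = 0; j < N; ++j)        // one output column per step
    for (std::size_t i = 0; i < N; ++i) {
      float acc = c[j * N + i];              // load the chunk of C
      for (std::size_t k = 0; k < N; ++k)
        acc += a[k * N + i] * b[j * N + k];  // would be an FMA per element
      c[j * N + i] = acc;                    // store it straight back
    }
}
```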
Adam
Thanks again,
Hal
However, I do wonder: how is this different from, say, complex numbers, for which we don't have a native IR representation? (Maybe the answer is that we should have native IR support for complex numbers too.)
That’s hard for me to answer as I didn’t look at complex-number performance.
How would you expect the frontend to lower (e.g.) matrix multiplication for matrices of complex numbers?
As you say, complex numbers are not a first-class type, so this is not supported, just like it’s not for vectors; i.e. you’d have to use an array of complex numbers. That said, this may tip the scales toward introducing complex numbers as well.
Adam
** Benefits **
Having matrices represented as IR values allows for the usual algebraic and redundancy optimizations. But most importantly, by lifting memory aliasing concerns, we can guarantee vectorization to target-specific vectors. Having a matrix-multiply intrinsic
also allows using FMA regardless of the optimization level, which is the usual sticking point with adopting FP contraction.
Adding a new dedicated first-class type has several advantages over mapping them directly to existing IR types like vectors in the front end. Matrices have the unique requirement that both rows and columns need to be accessed efficiently. By retaining the
original type, we can analyze entire chains of operations (after inlining) and determine the most efficient intermediate layout for the matrix values between ABI-observable points (calls, memory operations).
The resulting intermediate layout could be something like a single vector spanning the entire matrix or a set of vectors and scalars representing individual rows/columns. This is beneficial for example because rows/columns would be aligned to the HW vector
boundary (e.g. for a 3x3 matrix).
The layout could also be made up of tiles/submatrices of the matrix. This is an important case for us to fight register pressure. Rather than loading entire matrices into registers it lets us load only parts of the input matrix at a time in
order to compute some part of the output matrix. Since this transformation reorders memory operations, we may also need to emit runtime alias checks.
Having a dedicated first-class type also allows for dedicated target-specific ABIs for matrices. This is a pretty rich area for matrices. It includes whether the matrix is stored in row-major or column-major order, whether there is padding between
rows/columns, and when and how matrices are passed in argument registers. Providing flexibility on the ABI side was critical for the adoption of the new type at Apple.
Having all this knowledge at the IR level means that frontends are able to share the complexities of the implementation. They just map their matrix type to the IR type and the builtins to intrinsics.
At Apple, we also need to support interoperability between row-major and column-major layout. Since conversion between the two layouts is costly, they should be separate types requiring explicit instructions to convert between them. Extending
the new type to include the order makes tracking the format easy and allows finding optimal conversion points.
** ABI **
We currently default to column-major order with no padding between the columns in memory. We have plans to also support row-major order, and we would probably have to support padding at some point for targets where unaligned accesses are slow. In order to
make the IR self-contained I am planning to make the defaults explicit in the DataLayout string.
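The default layout can be stated as a simple addressing rule; the sketch below is my own illustration of the no-padding, column-major case (with padding, the stride would exceed the row count):

```c
#include <stddef.h>

/* Illustration of the default in-memory layout described above:
 * column-major, no padding, so element (row, col) lives at
 * col * rows + row. Not compiler code, just the addressing rule. */
static size_t matrix_index(size_t rows, size_t row, size_t col) {
  size_t stride = rows;  /* no padding between columns */
  return col * stride + row;
}
```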
For function parameters and return values, matrices are currently placed in memory. Moving forward, we should pass small matrices in vector registers. Treating matrices as structures of vectors seems a natural default. This works well for AArch64, since
Homogeneous Short-Vector Aggregates (HVA) can use all 8 SIMD argument registers. Thus we could pass for example two 4 x 4 x float matrices in registers. However on X86, we can only pass “four eightbytes”, thus limiting us to two 2 x 2 x float matrices.
Alternatively, we could treat a matrix as if its rows/columns were passed as separate vector arguments. This would allow using all 8 vector argument registers on X86 too.
Alignment of the matrix type is the same as the alignment of its first row/column vector.
** Flow **
Clang at this point mostly just forwards everything to LLVM. Then in LLVM, we have an IR function pass that lowers the matrices to target-supported vectors. As with vectors, matrices can be of any static size with any of the primitive types as the element
type.
After the lowering pass, we only have matrix function arguments and instructions building up and splitting matrix values from and to vectors. CodeGen then lowers the arguments and forwards the vector values. CodeGen is already capable of further
lowering vectors of any size to scalars if the target does not support vectors.
The lowering pass is also run at O0 rather than legalizing the matrix type during CodeGen, like it’s done for structure values or invalid vectors. I don’t really see a big value in duplicating this logic across the IR and CodeGen. We just
need a lighter mode in the pass at O0.
** Rollout and Maintenance **
Since this will be experimental for some time, I am planning to put this behind a flag: -fenable-experimental-matrix-type. ABI and intrinsic compatibility won’t be guaranteed initially until we lift the experimental status.
We are obviously interested in maintaining and improving this code in the future.
Looking forward to comments and suggestions.
Thanks,
Adam
_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Hal Finkel
Lead, Compiler Technology and Programming Languages
Leadership Computing Facility
Argonne National Laboratory


Richard Smith via cfe-dev <[hidden email]> writes:
> However, I do wonder: how is this different from, say, complex
> numbers, for which we don't have a native IR representation? (Maybe
> the answer is that we should have native IR support for complex
> numbers too.)
We've got at least one person here with decades of experience handling
complex operations in compilers who thinks LLVM IR should absolutely
have native support for complex, FWIW. I don't have enough experience
with it myself to make a definitive statement but I can certainly see
advantages to it.
Most of us here also think that LLVM IR should absolutely have native
support for predication/masking. :)
As for matrices, I again can see the advantages but have no practical
experience to draw upon. But some people at Apple seem to think it's
advantageous and I'm interested in learning more.
David


Adam Nemet via llvm-dev <[hidden email]> writes:
> Is this something that we could get if we had better loop
> transformations (i.e., loop fusion)?
>
> As we both know, aliasing usually makes this pretty difficult. For
> relatively small matrices runtime memory checks can be costly. Also we
> need to be able to evaluate these fusion opportunities in the inliner
> so having first-class type representation makes this cheap.
Fortran defines away many of these problems. Would better local
restrict support help with C-family languages? There are several old
patches from Hal that need some attention.
David


On Oct 12, 2018, at 10:36 AM, David Greene via cfe-dev <[hidden email]> wrote:
> Adam Nemet via llvm-dev <[hidden email]> writes:
>> Is this something that we could get if we had better loop transformations (i.e., loop fusion)?
>> As we both know, aliasing usually makes this pretty difficult. For relatively small matrices runtime memory checks can be costly. Also we need to be able to evaluate these fusion opportunities in the inliner so having first-class type representation makes this cheap.
> Fortran defines away many of these problems. Would better local restrict support help with C-family languages? There are several old patches from Hal that need some attention.
Agreed these patches would be neat to revive. I think we’d also want someone to pursue wg21.link/n4150. However, none of that seems like it should gate matrix support in the IR.


JF Bastien via llvm-dev <[hidden email]> writes:
> Agreed these patches would be neat to revive.
We've been trying but haven't got responses to Phab comments we've
posted.
> I think we’d also want someone to pursue wg21.link/n4150.
Definitely!
> However, none of that seems like it should gate matrix support in the
> IR.
Agreed. But we should at least be aware of alternatives so we have a
good sense of the pros/cons of the proposal.
David


> Recent NVIDIA GPUs do support some matrix operations natively in the hardware
The inputs to the operations in question are actually defined as opaque. You have to load the inputs using a special instruction, the semantics of which are not precisely specified. This instruction loads the matrix's values distributed across the threads in a warp, and you are not allowed to assume anything about which values are where. In fact they claim that the arrangement of values is architecture-dependent.
I suppose one might be able to spec the corresponding LLVM intrinsics for loading data in preparation for a matmul as taking a matrix pointer rather than a plain float*, but since as soon as you load it you no longer have a matrix in hand (each thread has some undefined subset of the matrix's elements), I'm not seeing any real use there other than a sort of hand-wavy "type-safety" argument.


On Fri, Oct 12, 2018, 10:33 David Greene via llvm-dev <[hidden email]> wrote:
Richard Smith via cfe-dev <[hidden email]> writes:
> However, I do wonder: how is this different from, say, complex
> numbers, for which we don't have a native IR representation? (Maybe
> the answer is that we should have native IR support for complex
> numbers too.)
> We've got at least one person here with decades of experience handling
> complex operations in compilers who thinks LLVM IR should absolutely
> have native support for complex, FWIW. I don't have enough experience
> with it myself to make a definitive statement but I can certainly see
> advantages to it.
> Most of us here also think that LLVM IR should absolutely have native
> support for predication/masking. :)
I can certainly agree.


> On Oct 12, 2018, at 2:02 AM, David Chisnall < [hidden email]> wrote:
>
> On 11 Oct 2018, at 23:42, Richard Smith via cfe-dev < [hidden email]> wrote:
>>
>> On the LLVM IR side, I'm personally unconvinced that we should model matrices in the IR directly as a new first-class type, unless there's some target out there that has matrix operations in hardware / matrix registers, but IR is not really my area of expertise so give that opinion as much or little weight as you see fit. However, I do wonder: how is this different from, say, complex numbers, for which we don't have a native IR representation? (Maybe the answer is that we should have native IR support for complex numbers too.) How would you expect the frontend to lower (e.g.) matrix multiplication for matrices of complex numbers?
>
> I think there are a few potential problems with adding these into LLVM IR as a first-class type:
>
>  With vectors, we have a simple extension from scalar operations to vector ones. A scalar operation on a vector is exactly the same as the same scalar operation applied pairwise to the elements. This is not the case for matrix types.
Vectors also have things like reductions, masked load/store and scatter/gather. Matrices also have elementwise operations and then some custom ones. I don’t think the situation is all that different.
>
>  A lot of analysis and transform passes need to have special cases for vectors. This is already a bit of a maintenance problem, but one that we put up with because of the big wins from vector types. Do matrix types have the same tradeoff?
I don’t think that’s the right way to look at this proposal. As I said, I think it’s better to look at matrices as an extension of vectors adding a different view on the same underlying data. Thus most of the special cases for vectors would also apply to matrices, considered either as SequentialTypes or via a new common base class. So effectively isVector becomes isVectorOrMatrix in many cases. This has definitely been the experience so far in the prototype.
With this extension we can cover a new class of applications that otherwise would require heroic frontend or user-level efforts. And since all these approaches perform premature lowering, enabler passes like the inliner can’t reason about the representation. Leaving matrices as first-class constructs makes this reasoning trivial.
>
>  What does a matrix of pointers look like? What is a GEP on a matrix of pointers? If I do GEP on a matrix of pointers and a matrix of integers and I replace it with ptrtoint, add, inttoptr, is that equivalent?
Unlike vectors, I don’t think that’s a case that we want to support unless one of the HW implementations requires this. Do you think this would be useful?
>
>  Does the IR representation enforce an in-memory representation? If different targets represent matrices in different orders (e.g. column-major vs row-major) then how much do optimisers need to be aware of this?
Most of this should be limited to the lowering pass. That is where we would reason about the layout, fusion, register pressure and vectorization.
How much InstCombine and similar passes would be affected is something we can decide based on the trade-offs of the specific cases we want to optimize. I don’t think that’s a priority or a requirement for now.
I feel that having this extension experimental would allow us to evaluate pros and cons in this regard.
>
>  Can all of the more complex operations that you might want to implement on matrices be implemented in terms of a smaller number of primitives, without resorting to long extract- and insert-element sequences?
I’d like to keep the canonical representation for these operations short so that highlevel optimizations like the inliner can easily reason about them.
>
>  Can the same optimisations be used on sparse matrix representations that are then lowered by something custom nearer the back end?
I feel this is related to your scatter/gather question but I am not sure I completely get it. Can you please elaborate?
>
>  As others have said, how does this interact with variable-sized matrices?
I’d like to experiment with lowering variable-sized matrix operations to this representation. I have some hand-wavy ideas about this: something where we would have a lambda expressing operations on fixed-size tiles and then a mapping function that applies/extends that to variable-sized matrices. An expression-template library may be able to gather all the required pieces for this.
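The tile-plus-mapping idea above can be sketched in plain C; everything here is my own hypothetical illustration of the shape of the lowering, not a proposed API, and it assumes column-major storage with dimensions that are multiples of the tile size:

```c
#include <stddef.h>

enum { TILE = 2 };

/* A fixed-size "lambda": a kernel operating on one TILE x TILE tile
 * embedded in a larger column-major matrix with the given column stride. */
typedef void (*tile_kernel)(float *tile, size_t stride);

static void scale_tile(float *tile, size_t stride) {
  for (size_t c = 0; c < TILE; ++c)
    for (size_t r = 0; r < TILE; ++r)
      tile[c * stride + r] *= 2.0f;
}

/* The "mapping function": applies the fixed-size kernel across a
 * variable-sized matrix, one tile at a time. */
static void for_each_tile(float *m, size_t rows, size_t cols, tile_kernel k) {
  for (size_t c = 0; c < cols; c += TILE)
    for (size_t r = 0; r < rows; r += TILE)
      k(&m[c * rows + r], rows);
}
```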
Adam
>
> David
>
_______________________________________________
cfedev mailing list
[hidden email]
http://lists.llvm.org/cgibin/mailman/listinfo/cfedev


On Oct 10, 2018, at 11:09 PM, Adam Nemet via cfe-dev <[hidden email]> wrote:
Hi,
We are proposing firstclass type support for a new matrix type.
Interesting! Here are some thoughts, I’m sorry but I haven’t read the responses downthread.
This is a natural extension of the current vector type with an extra dimension. For example, this is what the IR for a matrix multiply would look like for a 4x4 matrix with element type float:

%0 = load <4 x 4 x float>, <4 x 4 x float>* %a, align 16
%1 = load <4 x 4 x float>, <4 x 4 x float>* %b, align 16
%2 = call <4 x 4 x float> @llvm.matrix.multiply.m4_4f32.m4_4f32.m4_4f32(<4 x 4 x float> %0, <4 x 4 x float> %1)
store <4 x 4 x float> %2, <4 x 4 x float>* %c, align 16
LLVM already has a pretty general vector type (arbitrary number of elements). I’m aware of hardware that has rectangular vectors, e.g. NVIDIA tensor cores; Google has a proprietary in-house design with non-square vector registers, etc.
Currently we support elementwise binary operations, matrix multiply, matrix-scalar multiply, matrix transpose, and extract/insert of an element. Besides the regular full-matrix load and store, we also support loading and storing a matrix as a submatrix of a larger matrix in memory. We are also planning to implement vector extract/insert and matrix-vector multiply.
All of these are currently implemented as intrinsics. Where applicable we also plan to support these operations with native IR instructions (e.g. add/fadd).
Ok. Makes sense, I agree that supporting the existing pointwise vector operations makes sense.
These are exposed in clang via builtins. E.g. the above operations look like this in C/C++:
typedef float mf4x4_t __attribute__((matrix_type(4, 4)));
mf4x4_t add(mf4x4_t a, mf4x4_t b) { return __builtin_matrix_multiply(a, b); }
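For readers unfamiliar with the builtin, its semantics are just the usual row-by-column product. Here is a plain scalar C reference of what a 4x4 multiply computes, assuming column-major storage; this is my own sketch for illustration, not the builtin's implementation:

```c
enum { DIM = 4 };

/* Scalar reference for a 4x4 matrix multiply with column-major storage:
 * out(i,j) = sum_k a(i,k) * b(k,j). Purely illustrative; the real
 * builtin operates on the matrix_type values directly. */
static void matmul4x4(const float *a, const float *b, float *out) {
  for (int j = 0; j < DIM; ++j)
    for (int i = 0; i < DIM; ++i) {
      float sum = 0.0f;
      for (int k = 0; k < DIM; ++k)
        sum += a[k * DIM + i] * b[j * DIM + k];  /* a(i,k) * b(k,j) */
      out[j * DIM + i] = sum;
    }
}
```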
I’d recommend splitting the clang discussion from the LLVM discussion, they are completely different tradeoffs involved. I’ll focus on the LLVM IR side of things.
** Benefits **
Having matrices represented as IR values allows for the usual algebraic and redundancy optimizations. But most importantly, by lifting memory aliasing concerns, we can guarantee vectorization to target-specific vectors.
Right, it is basically the same benefit as having a vector type. You also get the ability to have specific alignments etc.
I think there are several options in the design space here:
1. Do nothing to the type system, but just use the existing vector types (<16 x float> in this case) with a new set of operations.
2. Introduce a “matrix” concept and associated operations.
3. Introduce N-dimensional vector registers and associated operations.
Personally, I’d be in favor of going with #1 followed by #3 followed distantly by #2.
The reason I’m opposed to a matrix *type* is that this is far too specific of a concept to put into LLVM. We don’t even have signedness of integers in the type system: the instruction set is the major load bearing part of the IR design, and the instruction set is extensible through intrinsics.
Arguing in favor of #1: AFAICT, you only need to add the new intrinsics to do matmul etc. You could just define them to take 1D vectors but apply math to them that interprets them as a 2D space. This is completely an IR level modeling issue, and would be a very noninvasive patch. You’re literally just adding a few intrinsics. All the pointwise operations and insert/extract/shuffles will “just work”. The frontend handles mapping 2d indices to 1D indices.
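The "just works" half of option #1 is easy to see with a sketch; the point is that a pointwise operation needs no 2-D awareness at all when the matrix is modeled as a flat vector (the 4x4/16-element choice here is illustrative):

```c
enum { ELTS = 16 };  /* a 4x4 matrix viewed as a flat 16-element vector */

/* Under option #1, an elementwise add on the flat vector IS the matrix
 * add, regardless of whether the 2-D interpretation is row-major or
 * column-major; only operations like matmul need the 2-D shape, which
 * the intrinsic would carry in its name/mangling. Illustrative only. */
static void matrix_add_flat(const float *a, const float *b, float *out) {
  for (int i = 0; i < ELTS; ++i)
    out[i] = a[i] + b[i];
}
```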
If you are interested in expanding the type system, I think that #3 is the way to go: extend VectorType to support multiple constant dimensions, like <4 x 32 x float>. I think that all of the existing vector intrinsics and instructions are still defined (pointwise). You’d then add matrix-specific intrinsics like your matmul above. If you are going to extend the type system, here are things to think about:
1) You will need to introduce legalization infra to lower all valid (as in, passing the verifier) cases to arbitrary targets, e.g. turning Nd elementwise add operations into <4 x float> adds or scalar operations. Legalization has a lot of the infra to do this, but it would be a big project.
2) Extending the type system has nonlocal effects because you have to audit (e.g.) the mid level optimizer to make sure that everything touching vector type, or “add” is doing the right thing for the new case.
3) Extending the IR like this is plausible, but would need to be carefully considered by the community.
Having a matrix-multiply intrinsic also allows using FMA regardless of the optimization level, which is the usual sticking point with adopting FP contraction.
Just to nitpick here, but I’d suggest following the existing precedent of the IR in terms of contraction etc. You shouldn’t assume that all matmul-adds are contractable just because your current clients want that. Consistency with the rest of the IR is far more important and we would WANT other clients to use this someday.
Adding a new dedicated first-class type has several advantages over mapping them directly to existing IR types like vectors in the front end. Matrices have the unique requirement that both rows and columns need to be accessed efficiently. By retaining the original type, we can analyze entire chains of operations (after inlining) and determine the most efficient intermediate layout for the matrix values between ABI-observable points (calls, memory operations).
I don’t understand this point at all.
The resulting intermediate layout could be something like a single vector spanning the entire matrix or a set of vectors and scalars representing individual rows/columns. This is beneficial for example because rows/columns would be aligned to the HW vector boundary (e.g. for a 3x3 matrix).
LLVM already has alignment support for memory objects, and the frontend can handle complex lowering like turning a 3x3 matrix into an LLVM IR <16 x float> (i.e. a 4x4 matrix) if that is more efficient for some reason. If there is a design to put this logic into the code generator and make it frontend-independent etc., then putting it in the LLVM IR type system is reasonable, so long as you “do it right” and implement full support for this (including legalization for all targets etc).
The layout could also be made up of tiles/submatrices of the matrix. This is an important case for us to fight register pressure. Rather than loading entire matrices into registers it lets us load only parts of the input matrix at a time in order to compute some part of the output matrix. Since this transformation reorders memory operations, we may also need to emit runtime alias checks.
LLVM isn’t really set up to do these sorts of optimizations on the IR: it treats SSA values as units, and tiling within an N-D vector will be challenging. If you're interested in tackling this, it would be great to start doing it for 1-D vectors, which have exactly the same issue: we generate atrocious code for chains of <256 x float> operations on machines that support <4 x float> because of register pressure etc.
Having a dedicated first-class type also allows for dedicated target-specific ABIs for matrices. This is a pretty rich area for matrices. It includes whether the matrix is stored in row-major or column-major order, whether there is padding between rows/columns, and when and how matrices are passed in argument registers. Providing flexibility on the ABI side was critical for the adoption of the new type at Apple.
There are other ways to model this. You can just add a new LLVM parameter attribute if necessary.
Having all this knowledge at the IR level means that frontends are able to share the complexities of the implementation. They just map their matrix type to the IR type and the builtins to intrinsics.
Agreed, but this only happens through a significant investment in making the mid-level IR general. I’m not opposed to this, but this is a significant engineering task.
At Apple, we also need to support interoperability between row-major and column-major layout. Since conversion between the two layouts is costly, they should be separate types requiring explicit instructions to convert between them. Extending the new type to include the order makes tracking the format easy and allows finding optimal conversion points.
I don’t see how your proposal helps with this. How do you represent layout in the IR?
** Rollout and Maintenance **
Since this will be experimental for some time, I am planning to put this behind a flag: -fenable-experimental-matrix-type. ABI and intrinsic compatibility won’t be guaranteed initially until we lift the experimental status.
I think it is really important to understand the design and get buyin from many folks before starting to land patches. I’d also love to chat about this at the devmtg if you’re available.
Chris