

Hello,
This is a Clang-focused follow-up to the original proposal on llvm-dev (http://lists.llvm.org/pipermail/llvm-dev/2019-October/136240.html). On the LLVM side, we recently landed the first commit adding matrix intrinsics as proposed.
On the Clang side, we would like to propose adding support for matrix math operations to Clang. This includes adding a new matrix type (similar to ext_vector_type) and a set of builtins to operate on values of the matrix type.
Our main motivation for the matrix support in Clang is to give users a way to
 • Guarantee generation of high-quality code for matrix operations. For isolated operations, we can guarantee vector code generation suitable for the target. For trees of operations, the proposed value type helps with eliminating temporary loads & stores.
 • Make use of specialized matrix ISA extensions, like the new matrix instructions in ARM v8.6 or various proprietary matrix accelerators, in their C/C++ code.
 • Move optimizations from matrix wrapper libraries into the compiler. We use it internally to simplify an Eigen-style matrix library, by relying on LLVM for generating tiled & fused loops for matrix operations.
The rest of this RFC is structured as follows: First we propose a draft specification for the matrix type and accompanying builtins. Next we show an example of how matrix operations will be lowered by Clang, followed by a discussion of the contributing criteria for new extensions. We wrap up the RFC by discussing possible extensions to the matrix type.

Draft Specification

Matrix Type Attribute

The attribute-token matrix_type is used to declare a matrix type. It shall appear at most once in each attribute-list. The attribute shall only appertain to a typedef-name of a typedef of a non-volatile type that is a signed integer type, an unsigned integer type, or a floating-point type. An attribute-argument-clause must be present and it shall have the form:
    (constant-expression, constant-expression)
Both constant-expressions shall be positive non-zero integral constant expressions. The maximum of the product of the constants is implementation-defined. If that implementation-defined limit is exceeded, the program is ill-formed.
An attribute of the form matrix_type(R, C) forms a matrix type with an element type of the cv-qualified type the attribute appertains to and R rows and C columns.
If a declaration of a typedef-name has a matrix_type attribute, then all declarations of that typedef-name shall have a matrix_type attribute with the same element type, number of rows, and number of columns.

Matrix Type

A matrix type has an underlying element type, a constant number of rows, and a constant number of columns. Matrix types with the same element type, rows, and columns are the same type. A value of a matrix type contains rows * columns values of the element type laid out in column-major order without padding in a way compatible with an array of at least that many elements of the underlying element type.
A matrix type is a scalar type with the same alignment as its underlying element type, but objects of matrix type are not usable in constant expressions.
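To make the column-major layout concrete, here is a minimal sketch in plain C. It is a model only and does not use the proposed attribute: the struct `mat2x3_model` and the helpers `mat_index`, `mat_insert` and `mat_extract` are hypothetical names chosen for illustration. It assumes a 2x3 float matrix and shows how a (row, column) pair maps to a position in the flat, unpadded storage, with insert returning a modified copy in the value style the draft uses for its element operations.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a 2x3 float matrix value; this is NOT the
   proposed matrix_type attribute, only a model of its storage layout. */
enum { ROWS = 2, COLS = 3 };
typedef struct { float data[ROWS * COLS]; } mat2x3_model;

/* Column-major without padding: the elements of one column are contiguous. */
static size_t mat_index(size_t row, size_t col) {
  assert(row < ROWS && col < COLS); /* indices must be in range */
  return col * ROWS + row;
}

/* Value semantics: takes the matrix by value and returns a modified copy. */
static mat2x3_model mat_insert(mat2x3_model m, size_t row, size_t col,
                               float elt) {
  m.data[mat_index(row, col)] = elt;
  return m;
}

/* Returns a copy of the element at (row, col). */
static float mat_extract(mat2x3_model m, size_t row, size_t col) {
  return m.data[mat_index(row, col)];
}
```

With this layout, element (1, 2) of a 2x3 matrix is the last of the six stored values, because columns 0 and 1 occupy the first four slots.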
TODO: Allow reinterpret_cast from pointer to element type. Make aliasing work.
Future Work: Initialization syntax.
Future Work: Access syntax. m[col][row].
Future Work: Conversions between matrix types with const-qualified and unqualified element types.
Future Work: Conversions between matrix types with different element types.

Matrix Type Builtin Operations

Each matrix type supports a collection of builtin expressions that look like function calls but do not form an overload set. Here they are described as function declarations with rules for how to construct the argument list types and return type and the library description elements from [library.description.structure.specifications]/3 in the C++ standard.
Definitions:
 • M, M1, M2, M3 - Matrix types
 • T - Element type
 • row, col - Row and column arguments respectively.
All operations on matrix types match the behavior of the underlying element type with respect to signed overflows.

Element Operations

Preconditions: row and col are in the ranges [0, rows in M) and [0, columns in M) respectively.
M __builtin_matrix_insert(M matrix, int row, int col, T elt)
Remarks: The return type and the type T are inferred from the cv-unqualified type of the matrix argument and its cv-unqualified element type respectively.
Returns: a copy of matrix with the element at the specified row and column set to elt.
T __builtin_matrix_extract(M matrix, int row, int col)
Remarks: The return type is inferred from the cv-unqualified type of the matrix argument’s element type.
Returns: a copy of the element at the specified row and column.

Simple Binary Operations

For the following binary operations matrix1 and matrix2 shall be matrix values of the same cv-unqualified type, and the return type is the cv-unqualified version of that type.
M __builtin_matrix_add(M matrix1, M matrix2)
Returns: A matrix Res equivalent to the code below, where col refers to the number of columns of M, row to the number of rows of M and EltTy to the element type of M.
    M Res;
    for (int C = 0; C < col; ++C) {
      for (int R = 0; R < row; ++R) {
        EltTy Elt = __builtin_matrix_extract(matrix1, R, C) +
                    __builtin_matrix_extract(matrix2, R, C);
        Res = __builtin_matrix_insert(Res, R, C, Elt);
      }
    }
M __builtin_matrix_sub(M matrix1, M matrix2)
Returns: A matrix Res equivalent to the code below, where col refers to the number of columns of M, row to the number of rows of M and EltTy to the element type of M.
    M Res;
    for (int C = 0; C < col; ++C) {
      for (int R = 0; R < row; ++R) {
        EltTy Elt = __builtin_matrix_extract(matrix1, R, C) -
                    __builtin_matrix_extract(matrix2, R, C);
        Res = __builtin_matrix_insert(Res, R, C, Elt);
      }
    }

Other Operations

M3 __builtin_matrix_multiply(M1 matrix1, M2 matrix2)
Mandates: M1 and M2 shall be matrix types with the same cv-unqualified element type and M1’s number of columns matching M2’s number of rows.
Remarks: The return type is a cv-unqualified matrix type with the same element type as M1 and M2 if both M1 and M2’s element types are const, or the cv-unqualified element type otherwise, and with the same number of rows as M1 and the same number of columns as M2.
Returns: A matrix Res equivalent to the code below, where col refers to the number of columns of M3, row to the number of rows of M3, EltTy to the element type of M3 and inner to the number of columns of M1.
    M3 Res;
    for (int C = 0; C < col; ++C) {
      for (int R = 0; R < row; ++R) {
        EltTy Elt = 0;
        for (int K = 0; K < inner; ++K) {
          Elt += __builtin_matrix_extract(matrix1, R, K) *
                 __builtin_matrix_extract(matrix2, K, C);
        }
        Res = __builtin_matrix_insert(Res, R, C, Elt);
      }
    }
Remark: With respect to rounding errors, the operation preserves the behavior of the separate multiply and add operations by default. We propose to provide a Clang option to override this behavior and allow contraction of those operations (e.g. -ffp-contract=matrix).
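For readers who want to check the multiply semantics without a compiler that implements the builtins, the loop nest can be modeled in plain C over flat column-major buffers. This is a sketch under assumptions: `matmul_model` and its parameter names are hypothetical, the element type is fixed to float, and dimensions are passed at run time here even though the proposal requires them to be compile-time constants.

```c
#include <assert.h>
#include <stddef.h>

/* Reference model only: multiplies a (rows x inner) matrix by an
   (inner x cols) matrix. All three buffers are column-major, unpadded. */
static void matmul_model(const float *a, const float *b, float *res,
                         size_t rows, size_t inner, size_t cols) {
  assert(rows > 0 && inner > 0 && cols > 0);
  for (size_t c = 0; c < cols; ++c) {
    for (size_t r = 0; r < rows; ++r) {
      float elt = 0.0f;
      for (size_t k = 0; k < inner; ++k) {
        /* a(r, k) * b(k, c), written as a separate multiply and add */
        elt += a[k * rows + r] * b[c * inner + k];
      }
      res[c * rows + r] = elt; /* res(r, c) */
    }
  }
}
```

Multiplying a 2x3 matrix by a 3x2 matrix this way yields a 2x2 result, matching the rule that the result takes its row count from the first operand and its column count from the second.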
M2 __builtin_matrix_transpose(M1 matrix)
Remarks: The return type is a cv-unqualified matrix type that has the same element type as M1 and has the same number of rows as M1 has columns and the same number of columns as M1 has rows.
Returns: A matrix Res equivalent to the code below, where col refers to the number of columns of M1, row to the number of rows of M1 and EltTy to the element type of M1.
    M2 Res;
    for (int C = 0; C < col; ++C) {
      for (int R = 0; R < row; ++R) {
        EltTy Elt = __builtin_matrix_extract(matrix, R, C);
        Res = __builtin_matrix_insert(Res, C, R, Elt);
      }
    }
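The dimension swap in the transpose is easiest to see on a non-square input. The following plain C sketch models it on flat column-major buffers; `transpose_model` is a hypothetical helper, not one of the proposed builtins, and it assumes float elements and run-time dimensions for brevity.

```c
#include <assert.h>
#include <stddef.h>

/* Reference model only: src is (rows x cols), dst is (cols x rows),
   both column-major without padding. */
static void transpose_model(const float *src, float *dst,
                            size_t rows, size_t cols) {
  assert(rows > 0 && cols > 0);
  for (size_t c = 0; c < cols; ++c) {
    for (size_t r = 0; r < rows; ++r) {
      /* dst(c, r) = src(r, c); dst has `cols` rows, so its column r
         starts at offset r * cols. */
      dst[r * cols + c] = src[c * rows + r];
    }
  }
}
```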
M __builtin_matrix_column_load(T *ptr, int row, int col, int stride)
Mandates: row and col shall be integral constants greater than 0.
Preconditions: stride >= row.
Remarks: The return type is a cv-unqualified matrix type with an element type of the cv-unqualified version of T and a number of rows and columns equal to row and col respectively.
Returns: A matrix Res equivalent to:
    M Res;
    for (int C = 0; C < col; ++C) {
      for (int R = 0; R < row; ++R)
        Res = __builtin_matrix_insert(Res, R, C, ptr[R]);
      ptr += stride;
    }
void __builtin_matrix_column_store(M matrix, T *ptr, int stride)
Preconditions: stride is greater than or equal to the number of rows in M.
Effects: Equivalent to:
    for (int C = 0; C < columns in M; ++C) {
      for (int R = 0; R < rows in M; ++R)
        ptr[R] = __builtin_matrix_extract(matrix, R, C);
      ptr += stride;
    }
Remarks: The type T is the const-unqualified version of the matrix argument’s element type.
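The role of the stride argument in the strided load and store can be illustrated with a plain C model. This is only a sketch: `column_load_model` and `column_store_model` are hypothetical names, float elements are assumed, and the code indexes with `c * stride + r` instead of advancing the pointer, which visits the same elements as the loops above.

```c
#include <assert.h>
#include <stddef.h>

/* Reference model only: reads a (rows x cols) matrix whose columns start
   `stride` elements apart in memory into a dense column-major buffer. */
static void column_load_model(const float *ptr, size_t rows, size_t cols,
                              size_t stride, float *res) {
  assert(stride >= rows); /* precondition from the draft */
  for (size_t c = 0; c < cols; ++c)
    for (size_t r = 0; r < rows; ++r)
      res[c * rows + r] = ptr[c * stride + r];
}

/* Reference model only: writes a dense column-major (rows x cols) matrix
   to memory with columns `stride` elements apart. Elements between the
   end of one column and the start of the next are left untouched. */
static void column_store_model(const float *mat, size_t rows, size_t cols,
                               float *ptr, size_t stride) {
  assert(stride >= rows); /* precondition from the draft */
  for (size_t c = 0; c < cols; ++c)
    for (size_t r = 0; r < rows; ++r)
      ptr[c * stride + r] = mat[c * rows + r];
}
```

A stride larger than the row count makes it possible to address a sub-block of a larger column-major matrix, for example the top two rows of the first two columns of a buffer with four rows per column.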
M __builtin_matrix_scalar_multiply(M matrix, T scalar)
Returns: A matrix Res equivalent to the code below, where col refers to the number of columns of M, row to the number of rows of M and EltTy to the element type of M.
    M Res;
    for (int C = 0; C < col; ++C) {
      for (int R = 0; R < row; ++R) {
        EltTy Elt = __builtin_matrix_extract(matrix, R, C) * scalar;
        Res = __builtin_matrix_insert(Res, R, C, Elt);
      }
    }
Remarks: The return type and the type T are the cv-unqualified type of the matrix argument and its cv-unqualified element type respectively.

Example

This code performs a matrix-multiply of two 4x4 matrices followed by a matrix addition:
    typedef float m4x4_t __attribute__((matrix_type(4, 4)));

    void f(m4x4_t *a, m4x4_t *b, m4x4_t *c, m4x4_t *r) {
      *r = __builtin_matrix_add(__builtin_matrix_multiply(*a, *b), *c);
    }
This will get lowered by Clang to the LLVM IR below. In our current implementation, we use LLVM’s array type as storage type for the matrix data. Before accessing the data, we cast the array to a vector type. This allows us to use the element width as alignment, without running into issues with LLVM’s large default alignment for vector types, which is problematic in structs.
    define void @f([16 x float]* %a, [16 x float]* %b, [16 x float]* %c, [16 x float]* %r) #0 {
    entry:
      %a.addr = alloca [16 x float]*, align 8
      %b.addr = alloca [16 x float]*, align 8
      %c.addr = alloca [16 x float]*, align 8
      %r.addr = alloca [16 x float]*, align 8
      store [16 x float]* %a, [16 x float]** %a.addr, align 8
      store [16 x float]* %b, [16 x float]** %b.addr, align 8
      store [16 x float]* %c, [16 x float]** %c.addr, align 8
      store [16 x float]* %r, [16 x float]** %r.addr, align 8
      %0 = load [16 x float]*, [16 x float]** %a.addr, align 8
      %1 = bitcast [16 x float]* %0 to <16 x float>*
      %2 = load <16 x float>, <16 x float>* %1, align 4
      %3 = load [16 x float]*, [16 x float]** %b.addr, align 8
      %4 = bitcast [16 x float]* %3 to <16 x float>*
      %5 = load <16 x float>, <16 x float>* %4, align 4
      %6 = call <16 x float> @llvm.matrix.multiply.v16f32.v16f32.v16f32(<16 x float> %2, <16 x float> %5, i32 4, i32 4, i32 4)
      %7 = load [16 x float]*, [16 x float]** %c.addr, align 8
      %8 = bitcast [16 x float]* %7 to <16 x float>*
      %9 = load <16 x float>, <16 x float>* %8, align 4
      %10 = fadd <16 x float> %6, %9
      %11 = load [16 x float]*, [16 x float]** %r.addr, align 8
      %12 = bitcast [16 x float]* %11 to <16 x float>*
      store <16 x float> %10, <16 x float>* %12, align 4
      ret void
    }

    declare <16 x float> @llvm.matrix.multiply.v16f32.v16f32.v16f32(<16 x float>, <16 x float>, i32 immarg, i32 immarg, i32 immarg)

Contributing Criteria

Evidence of a significant user community: This is based on a number of factors, including an existing user community, the perceived likelihood that users would adopt such a feature if it were available, and any secondary effects that come from, e.g., a library adopting the feature and providing benefits to its users.
Currently this is part of one of our compiler toolchains and used on a few large internal codebases. The matrix type can be used by matrix libraries like Eigen, to offload some of the optimization responsibility from the library to the compiler. It would also be a suitable target for implementing a standard matrix library. It also provides functionality similar to various libraries for matrix math on small matrixes, like https://developer.apple.com/documentation/accelerate/working_with_matrices, with more flexibility (it supports any combination of input dimensions).
A specific need to reside within the Clang tree: There are some extensions that would be better expressed as a separate tool, and should remain as separate tools even if they end up being hosted as part of the LLVM umbrella project.
We want to expose this feature at the C/C++ level. For that, it needs to be part of Clang.
A specification: The specification must be sufficient to understand the design of the feature as well as interpret the meaning of specific examples. The specification should be detailed enough that another compiler vendor could implement the feature.
We currently have the design above and will work on a more comprehensive spec.
Representation within the appropriate governing organization: For extensions to a language governed by a standards committee (C, C++, OpenCL), the extension itself must have an active proposal and proponent within that committee and have a reasonable chance of acceptance. Clang should drive the standard, not diverge from it. This criterion does not apply to all extensions, since some extensions fall outside of the realm of the standards bodies.
We think this extension would fall outside of the realm of the standards bodies. It is an implementation detail used to implement matrix math libraries and such, much like the vector extensions are an implementation detail for SIMD libraries.
A long-term support plan: Increasingly large or complex extensions to Clang need matching commitments to supporting them over time, including improving their implementation and specification as Clang evolves. The capacity of the contributor to make that commitment is as important as the commitment itself.
We are using this internally, and adding this feature to Clang upstream means we intend to support it as part of our ongoing Clang work.
A high-quality implementation: The implementation must fit well into Clang's architecture, follow LLVM's coding conventions, and meet Clang's quality standards, including diagnostics and complete AST representations. This is particularly important for language extensions, because users will learn how those extensions work through the behavior of the compiler.
We will provide a series of patches to implement the extension soon and look forward to any feedback to make sure the patches meet the quality requirement.
A test suite: Extensive testing is crucial to ensure that the language extension is not broken by ongoing maintenance in Clang. The test suite should be complete enough that another compiler vendor could conceivably validate their implementation of the feature against it.
We will provide this as part of Clang’s unit tests and test-suite.

Extensions

Initially we want to focus on 2D matrixes without padding in column-major layout as a concrete use case. This is similar to the defaults for the Matrix type in Eigen, for example. But our proposed type can be extended naturally to
 • Support N (known constant) dimensions by turning the matrix_type attribute into a variadic attribute.
 • Support column/row-wise padding, by adding a column_padding clause to the attribute.
   Dealing with the padding could be exclusively handled on the frontend side, by emitting additional shufflevector instructions to extract the data. If there is a desire to exploit the padding more on the LLVM side, we can add a set of intrinsics for that.
 • Support row & column major layouts, by adding a layout clause to the attribute.
   Again, this naively could be handled while lowering to LLVM IR in Clang using shufflevector to produce flattened vectors with the required layout. For better optimisations, the LLVM intrinsics relying on shape/layout information can be extended to take the layout as an additional argument. By propagating the layout information similarly to the dimensions, we should be able to optimise the points where we need to transform the layout of the underlying matrixes.
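To illustrate what such a layout transformation has to do, here is a plain C sketch of a row-major to column-major conversion. It is an illustration only: `row_major_to_column_major` is a hypothetical helper, and it uses scalar loops over flat float buffers instead of the shufflevector-based lowering mentioned above.

```c
#include <assert.h>
#include <stddef.h>

/* Sketch only: rewrites a (rows x cols) row-major buffer into the
   column-major, unpadded layout the proposed matrix type uses. */
static void row_major_to_column_major(const float *src, float *dst,
                                      size_t rows, size_t cols) {
  assert(rows > 0 && cols > 0);
  for (size_t r = 0; r < rows; ++r)
    for (size_t c = 0; c < cols; ++c)
      dst[c * rows + r] = src[r * cols + c]; /* same (r, c), new offset */
}
```

The element at a given (row, column) position is unchanged; only its offset in the flat storage moves, which is why the conversion can be expressed as a fixed permutation when the dimensions are known constants.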
In all cases, we require known integer constants as dimensions and we do not plan to support dynamic dimensions for now, as the main optimization potential comes from the fact that we know the dimensions. Supporting dynamic dimensions should be fairly straightforward, but it means we lose the ability to type check matrix expressions at compile time, and we also have to rely on dynamic dimensions during code generation.
Cheers,
Florian
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev


Ping.
I’ve also put up 2 patches on Phabricator to illustrate what the implementation could look like:
Cheers,
Florian

On Dec 20, 2019, at 18:31, Florian Hahn via cfe-dev <[hidden email]> wrote:
Hello,
This is a Clangfocused follow up to the original proposal on llvmdev (http://lists.llvm.org/pipermail/llvmdev/2019October/136240.html). On the LLVM side, we recently landed the first commit adding matrix intrinsics as proposed.
On the Clang side, we would like to propose adding support for matrix math operations to Clang. This includes adding a new matrix type (similar to ext_vector_type) and a set of builtins to operate on values of the matrix type.
Our main motivation for the matrix support in Clang is to give users a way to
 Guarantee generation of highquality code for matrix operations. For isolated operations, we can guarantee vector code generation suitable for the target. For trees of operations, the proposed value type helps with eliminating temporary loads & stores.
 Make use of specialized matrix ISA extensions, like the new matrix instructions in ARM v8.6 or various proprietary matrix accelerators, in their C/C++ code.
 Move optimizations from matrix wrapper libraries into the compiler. We use it internally to simplify an Eigenstyle matrix library, by relying on LLVM for generating tiled & fused loops for matrix operations.
The rest of this RFC is structured as follows: First we propose a draft specification for the matrix type and accompanying builtins. Next we show an example of how matrix operations will be lowered by Clang, followed by a discussion of the contributing criteria for new extensions. We wrap up the RFC by discussing possible extensions to the matrix type.Draft SpecificationMatrix TYPE AttributeThe attributetoken matrix_type is used to declare a matrix type. It shall appear at most once in each attributelist. The attribute shall only appertain to a typedefname of a typedef of a nonvolatile type that is a signed integer type, an unsigned integer type, or a floatingpoint type. An attributeargumentclause must be present and it shall have the form:
(constantexpression, constantexpression)
Both constantexpressions shall be a positive nonzero integral constant expressions. The maximum of the product of the constants is implementation defined. If that implementation defined limit is exceeded, the program is illformed.
An attribute of the form matrix_type( R , C ) forms a matrix type with an element type of the cvqualified type the attribute appertains to and R rows and C columns.
If a declaration of a typedefname has a matrix_type attribute, then all declaration of that typedefname shall have a matrix_type attribute with the same element type, number of rows, and number of columns.
Matrix TypeA matrix type has an underlying element type, a constant number of rows, and a constant number of columns. Matrix types with the same element type, rows, and columns are the same type. A value of a matrix type contains rows * columns values of the element type laid out in columnmajor order without padding in a way compatible with an array of at least that many elements of the underlying element type.
A matrix type is a scalar type with the same alignment as its underlying element type, but objects of matrix type are not usable in constant expressions.
TODO: Allow reinterpret_cast from pointer to element type. Make aliasing work. Future Work: Initialization syntax. Future Work: Access syntax. m[col][row] . Future Work: Conversions between matrix types with const qualified and unqualified element types. Future Work: Conversions between matrix types with different element types.
Matrix Type builtin OperationsEach matrix type supports a collection of builtin expressions that look like function calls but do not form an overload set. Here they are described as function declarations with rules for how to construct the argument list types and return type and the library description elements from [library.description.structure.specifications]/3 in the C++ standard.
Definitions:
 M, M1, M2, M3  Matrix types
 T  Element type
 row, col  Row and column arguments respectively.
All operations on matrix types match the behavior of the underlying element type with respect to signed overflows. Element Operations Preconditions: row and col are in the ranges [0, rows in M) and [0, columns in M) respectively.
M __builtin_matrix_insert(M matrix, int row, int col, T elt)
Remarks: The return type and the type T are inferred from the cvunqualified type of the matrix argument and its cvunqualified element type respectively.
Returns: a copy of matrix with the element at the specified row and column set to elt .
T __builtin_matrix_extract(M matrix, int row, int col)
The return type is inferred from the cvunqualified type of the matrix argument’s element type.
Returns: a copy of the element at the specified row and column. Simple Binary Operations For the following binary operations matrix1 and matrix2 shall be matrix values of the same cvunqualified type, and the return type is the cvunqualified version of that type.
M __builtin_matrix_add(M matrix1, M matrix2)
Returns: A matrix Res equivalent to the code below, where col refers to the number of columns of M, row to the number of rows of M and EltTy to the element type of M.
M Res; for (int C = 0; C < col; ++C) { for (int R = 0; R < row; ++R) { EltTy Elt = __builtin_matrix_extract(matrix1, R, C) + __builtin_matrix_extract(matrix2, R, C) Res = __builtin_matrix_insert(Res, R, C, Elt); } }
M __builtin_matrix_sub(M matrix1, M matrix2)
Returns: A matrix Res equivalent to the code below, where col refers to the number of columns of M, row to the number of rows of M and EltTy to the element type of M.
M Res; for (int C = 0; C < col; ++C) { for (int R = 0; R < row; ++R) { EltTy Elt = __builtin_matrix_extract(matrix1, R, C)  __builtin_matrix_extract(matrix2, R, C) Res = __builtin_matrix_insert(Res, R, C, Elt); } }
Other OperationsM3 __builtin_matrix_multiply(M1 matrix1, M2 matrix2)
Mandates: M1 and M2 shall be matrix types with the same cvunqualified element type and M1’s number of columns matching M2’s number of row.
Remarks: The return type is a cvunqualified matrix type with the same element type as M1 and M2 if both M1 and M2’s element type is const, or the cvunqualified element type otherwise, and with the same number of rows as M1 and the same number of columns as M2.
Returns: A matrix Res equivalent to the code below, where col refers to the number of columns of M, row to the number of rows of M, EltTy to the element type of M and inner refers to the number of columns of M1.
M Res; for (int C = 0; C < col; ++C) { for (int R = 0; R < row; ++R) { EltTy Elt = 0; for (int K = 0; K < inner; ++K) { Elt += __builtin_matrix_extract(matrix1, R, K) * __builtin_matrix_extract(matrix2, K, C) } Res = __builtin_matrix_insert(Res, R, C, Elt); } Remark: With respect to rounding errors, the operation preserves the behavior of the separate multiply and add operations by default. We propose to provide a Clang option to override this behavior and allow contraction of those operations (e.g. ffpcontract=matrix).
M2 __builtin_matrix_transpose(M1 matrix)
Remarks: The return type is a cvunqualified matrix type that has the same element type as M1 and has the the same number of rows as M1 has columns and the same number of columns as M1 has rows.
Returns: A matrix Res equivalent to the code below, where col refers to the number of columns of M, and row to the number of rows of M.
M Res; for (int C = 0; C < col; ++C) { for (int R = 0; R < row; ++R) { EltTy Elt = __builtin_matrix_extract(matrix, R, C); Res = __builtin_matrix_insert(Res, C, R, Elt); } }
M __builtin_matrix_column_load( T *ptr, int row, int col, int stride)
Mandates: row and col shall be integral constants greater than 0.
Preconditions: stride >= row .
Remarks: The return type is a cvunqualified matrix type with an element type of the cvunqualified version of T and a number of rows and columns equal to row and col respectively.
Returns: A matrix Res equivalent to:
M Res; for (int C = 0; C < col; ++C) { for (int R = 0; R < row; ++K) Res = __builtin_matrix_insert(Res, R, C, ptr[R]); ptr += stride }
void __builtin_matrix_column_store(M matrix, T *ptr, int stride)
Preconditions: stride is greater than or equal to the number of rows in M.
Effects: Equivalent to:
for (int C = 0; C < columns in M; ++C) { for (int R = 0; R < rows in M; ++K) ptr[R] = __builtin_matrix_extract(matrix, R, C); ptr += stride } Remarks: The type T is the constunqualified version of the matrix argument’s element type.
M __builtin_matrix_scalar_multiply(M matrix, T scalar)
Returns: A matrix Res equivalent to the code below, where col refers to the number of columns of M, and row to the number of rows of M.
M Res; for (int C = 0; C < col; ++C) { for (int R = 0; R < row; ++R) { EltTy Elt = __builtin_matrix_extract(matrix, R, C) * scalar; Res = __builtin_matrix_insert(Res, R, C, Elt); } } Remarks: The return type and the type T are the cvunqualified type of the matrix argument and its cvunqualified element type respectively.Example This code performs a matrixmultiply of two 4x4 matrices followed by an matrix addition:
typedef float m4x4_t __attribute__((matrix_type(4, 4))); void f(m4x4_t *a, m4x4_t *b, m4x4_t *c, m4x4_t *r) { *r = __builtin_matrix_add(__builtin_matrix_multiply(*a, *b), *c); } This will get lowered by Clang to the LLVM IR below. In our current implementation, we use LLVM’s array type as storage type for the matrix data. Before accessing the data, we cast the array to a vector type. This allows us to use the element width as alignment, without running into issues with LLVM’s large default alignment for vector types, which is problematic in structs.
define void @f([16 x float]* %a, [16 x float]* %b, [16 x float]* %c, [16 x float]* %r) #0 { entry: %a.addr = alloca [16 x float]*, align 8 %b.addr = alloca [16 x float]*, align 8 %c.addr = alloca [16 x float]*, align 8 %r.addr = alloca [16 x float]*, align 8 store [16 x float]* %a, [16 x float]** %a.addr, align 8 store [16 x float]* %b, [16 x float]** %b.addr, align 8 store [16 x float]* %c, [16 x float]** %c.addr, align 8 store [16 x float]* %r, [16 x float]** %r.addr, align 8 %0 = load [16 x float]*, [16 x float]** %a.addr, align 8 %1 = bitcast [16 x float]* %0 to <16 x float>* %2 = load <16 x float>, <16 x float>* %1, align 4 %3 = load [16 x float]*, [16 x float]** %b.addr, align 8 %4 = bitcast [16 x float]* %3 to <16 x float>* %5 = load <16 x float>, <16 x float>* %4, align 4 %6 = call <16 x float> @llvm.matrix.multiply.v16f32.v16f32.v16f32(<16 x float> %2, <16 x float> %5, i32 4, i32 4, i32 4) %7 = load [16 x float]*, [16 x float]** %c.addr, align 8 %8 = bitcast [16 x float]* %7 to <16 x float>* %9 = load <16 x float>, <16 x float>* %8, align 4 %10 = fadd <16 x float> %6, %9 %11 = load [16 x float]*, [16 x float]** %r.addr, align 8 %12 = bitcast [16 x float]* %11 to <16 x float>* store <16 x float> %10, <16 x float>* %12, align 4 ret void } declare <16 x float> @llvm.matrix.multiply.v16f32.v16f32.v16f32(<16 x float>, <16 x float>, i32 immarg, i32 immarg, i32 immarg) Contributing CriteriaEvidence of a significant user community: This is based on a number of factors, including an existing user community, the perceived likelihood that users would adopt such a feature if it were available, and any secondary effects that come from, e.g., a library adopting the feature and providing benefits to its users. Currently this is part of one of our compiler toolchains and used on a few large internal codebases. The matrix type can be used by matrix libraries like Eigen, to offload some of the optimization responsibility from the library to the compiler. 
It would also be suitable target for implementing a standard matrix library. It also provides functionality similar to various libraries for matrix math on small matrixes, like https://developer.apple.com/documentation/accelerate/working_with_matrices, with more flexibility (supports any combination of input dimensions).
A specific need to reside within the Clang tree: There are some extensions that would be better expressed as a separate tool, and should remain as separate tools even if they end up being hosted as part of the LLVM umbrella project. We want to expose this feature at the C/C++ level. For that, it needs to be part of Clang.
A specification: The specification must be sufficient to understand the design of the feature as well as interpret the meaning of specific examples. The specification should be detailed enough that another compiler vendor could implement the feature. We currently have the design above and will work on a more comprehensive spec.
Representation within the appropriate governing organization: For extensions to a language governed by a standards committee (C, C++, OpenCL), the extension itself must have an active proposal and proponent within that committee and have a reasonable chance of acceptance. Clang should drive the standard, not diverge from it. This criterion does not apply to all extensions, since some extensions fall outside of the realm of the standards bodies. We think this extension would fall outside of the realm of the standards bodies. It is an implementation detail used to implement matrix math libraries and such, much like the vector extensions are an implementation detail for SIMD libraries.
A longterm support plan: increasingly large or complex extensions to Clang need matching commitments to supporting them over time, including improving their implementation and specification as Clang evolves. The capacity of the contributor to make that commitment is as important as the commitment itself. We are using this internally and adding this feature to Clang upstream means we intend to support it as part of our ongoing Clang work.
A highquality implementation: The implementation must fit well into Clang's architecture, follow LLVM's coding conventions, and meet Clang's quality standards, including diagnostics and complete AST representations. This is particularly important for language extensions, because users will learn how those extensions work through the behavior of the compiler. Will we provide a series of patches to implement the extension soon and look forward to any feedback to make sure the patches meet the quality requirement.
A test suite: Extensive testing is crucial to ensure that the language extension is not broken by ongoing maintenance in Clang. The test suite should be complete enough that another compiler vendor could conceivably validate their implementation of the feature against it We will provide this as part of Clang’s unit tests and testsuite.
Extensions
Initially we want to focus on 2D matrixes without padding in column-major layout as a concrete use case. This is similar to the defaults for the Matrix type in Eigen, for example. But our proposed type can be extended naturally to
• Support N (known constant) dimensions by turning the matrix_type attribute into a variadic attribute.
• Support column/row-wise padding, by adding a column_padding clause to the attribute. Dealing with the padding could be handled exclusively on the frontend side, by emitting additional shufflevector instructions to extract the data. If there is a desire to exploit the padding more on the LLVM side, we can add a set of intrinsics for that.
• Support row & column major layouts, by adding a layout clause to the attribute. Again, this could naively be handled while lowering to LLVM IR in Clang, using shufflevector to produce flattened vectors with the required layout. For better optimisations, the LLVM intrinsics relying on shape/layout information can be extended to take the layout as an additional argument. By propagating the layout information similar to the dimensions, we should be able to optimise the points where we need to transform the layout of the underlying matrixes.
In all cases, we require known integer constants as dimensions and we do not plan to support dynamic dimensions for now, as the main optimization potential comes from the fact that we know the dimensions. Supporting dynamic dimensions should be fairly straightforward, but it means we lose the ability to type check matrix expressions at compile time and we also have to rely on dynamic dimensions during code generation.
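To make the layout options above concrete, here is a minimal sketch in plain C++ of how a (row, column) pair maps to a flat index in column-major storage, with and without padding, and how a column-major buffer could be rearranged into row-major order. The helper names (colMajorIndex, rowMajorIndex, toRowMajor) are hypothetical and are not part of the proposal; the proposal would do this rearrangement with shufflevector during lowering, not with loops.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Flat index of element (row, col) in a column-major buffer whose columns
// start `stride` elements apart. stride == rows means no padding.
constexpr std::size_t colMajorIndex(std::size_t row, std::size_t col,
                                    std::size_t stride) {
  return col * stride + row;
}

// Flat index of element (row, col) in an unpadded row-major buffer.
constexpr std::size_t rowMajorIndex(std::size_t row, std::size_t col,
                                    std::size_t cols) {
  return row * cols + col;
}

// Rearrange an unpadded column-major rows x cols buffer into row-major order.
std::vector<float> toRowMajor(const std::vector<float> &colMajor,
                              std::size_t rows, std::size_t cols) {
  assert(colMajor.size() == rows * cols);
  std::vector<float> out(rows * cols);
  for (std::size_t c = 0; c < cols; ++c)
    for (std::size_t r = 0; r < rows; ++r)
      out[rowMajorIndex(r, c, cols)] = colMajor[colMajorIndex(r, c, rows)];
  return out;
}
```

Because the dimensions are compile-time constants in the proposal, such an index mapping is fully known to the compiler, which is what would allow it to be expressed as a fixed shuffle mask.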
Cheers, Florian
_______________________________________________
cfedev mailing list
[hidden email]
https://lists.llvm.org/cgibin/mailman/listinfo/cfedev


Bump.
I’d like to share a bit more context/motivation for the proposal. We are still looking for feedback and would appreciate any feedback/concerns on the overall proposal or any of the details!
I’ve uploaded WIP patches that add the various proposed builtins and linked them to the initial commit: https://reviews.llvm.org/D72281
Besides that, I’ve prepared two examples to illustrate the use of the builtins. First, I’ve put up a patch for test-suite that adds a set of tests that check that the matrix builtins match the spec and a naive loop-based implementation: https://reviews.llvm.org/D72770
Second, I ran a few benchmarks comparing the performance of the matrix builtins and Eigen for operations on small matrixes (ranging from 3x3 to 16x16). The benchmarks compare the performance of a single matrix multiply, a matrix multiply-add and a larger matrix expression. I’ve shared the numbers below. For sizes smaller than 16x16, the matrix builtins comfortably beat Eigen (between 1.5x and 3x speedups).
Currently, Eigen still outperforms the matrix builtins for the following cases
• 3x3 matrixes (we are aware of the issue there and have a good idea of how to improve those cases without much effort)
• Larger matrixes (roughly 15x15+)
The regression on larger matrixes is not surprising at the moment, as we have not implemented any sort of tiling for the matrix builtins. But this is certainly something we are planning on implementing in the future, extending the number of cases that can be handled by the matrix builtins.
To summarize, we think that Eigen (and similar libraries) could get a nice speedup for operations on small matrixes by using the builtins, while also likely simplifying the implementation by offloading vector code generation to the compiler, rather than using target-specific intrinsics.
The benchmark code can be found here: https://gist.github.com/fhahn/03796abb21bfc242c083cf7333ac960c
The numbers were gathered with LLVM master as of today, with the Clang patches applied on top. The benchmarks were built with -O3.
Cheers, Florian
Benchmark numbers (CPU time in ns). Values < 1 in the (Matrix builtins / Eigen) column mean that the matrix builtin version outperforms Eigen.
X86 macOS
name                                              Matrix builtins  Eigen    Matrix builtins / Eigen
BM_GEMM_Mult_Square<float, 3, 3, 3, 3>            4.61             6.65     0.693
BM_GEMM_Mult_Square<double, 3, 3, 3, 3>           3.93             6.17     0.638
BM_GEMM_Mult_Square<float, 5, 5, 5, 5>            13.06            25.44    0.513
BM_GEMM_Mult_Square<double, 5, 5, 5, 5>           18.53            42.29    0.438
BM_GEMM_Mult_Square<float, 8, 8, 8, 8>            34.26            108.32   0.316
BM_GEMM_Mult_Square<double, 8, 8, 8, 8>           70.07            178.25   0.393
BM_GEMM_Mult_Square<float, 11, 11, 11, 11>        151.37           306.53   0.494
BM_GEMM_Mult_Square<double, 11, 11, 11, 11>       215.47           422.73   0.51
BM_GEMM_Mult_Square<float, 16, 16, 16, 16>        357.9            466.22   0.768
BM_GEMM_Mult_Square<double, 16, 16, 16, 16>       719.86           722.31   0.997
BM_GEMM_Mult_Add_Square<float, 3, 3, 3, 3>        6.44             7.43     0.867
BM_GEMM_Mult_Add_Square<double, 3, 3, 3, 3>       4.77             7.94     0.601
BM_GEMM_Mult_Add_Square<float, 5, 5, 5, 5>        13.7             27.09    0.506
BM_GEMM_Mult_Add_Square<double, 5, 5, 5, 5>       20.83            44.82    0.465
BM_GEMM_Mult_Add_Square<float, 8, 8, 8, 8>        39.36            113.99   0.345
BM_GEMM_Mult_Add_Square<double, 8, 8, 8, 8>       75.4             186.2    0.405
BM_GEMM_Mult_Add_Square<float, 11, 11, 11, 11>    169.78           310.76   0.546
BM_GEMM_Mult_Add_Square<double, 11, 11, 11, 11>   243.95           497.39   0.49
BM_GEMM_Mult_Add_Square<float, 16, 16, 16, 16>    430.09           487.65   0.882
BM_GEMM_Mult_Add_Square<double, 16, 16, 16, 16>   854.13           843.71   1.012
BM_GEMM_Expr_Square<float, 3, 3, 3, 3>            10.89            18.13    0.6
BM_GEMM_Expr_Square<double, 3, 3, 3, 3>           9.41             17.17    0.548
BM_GEMM_Expr_Square<float, 5, 5, 5, 5>            28.07            54.42    0.516
BM_GEMM_Expr_Square<double, 5, 5, 5, 5>           40.45            72.36    0.559
BM_GEMM_Expr_Square<float, 8, 8, 8, 8>            79.37            222.89   0.356
BM_GEMM_Expr_Square<double, 8, 8, 8, 8>           152.4            393.13   0.388
BM_GEMM_Expr_Square<float, 11, 11, 11, 11>        299.12           659.46   0.454
BM_GEMM_Expr_Square<double, 11, 11, 11, 11>       444.06           862.66   0.515
BM_GEMM_Expr_Square<float, 16, 16, 16, 16>        772.21           842.29   0.917
BM_GEMM_Expr_Square<double, 16, 16, 16, 16>       1580.45          1578.02  1.002

ARM64 Darwin
name                                              Matrix builtins  Eigen    Matrix builtins / Eigen
BM_GEMM_Mult_Square<float, 3, 3, 3, 3>            6.29             6        1.048
BM_GEMM_Mult_Square<double, 3, 3, 3, 3>           5.14             4.87     1.056
BM_GEMM_Mult_Square<float, 5, 5, 5, 5>            14.8             37.46    0.395
BM_GEMM_Mult_Square<double, 5, 5, 5, 5>           21               65.01    0.323
BM_GEMM_Mult_Square<float, 8, 8, 8, 8>            39.56            88.73    0.446
BM_GEMM_Mult_Square<double, 8, 8, 8, 8>           84.58            156.45   0.541
BM_GEMM_Mult_Square<float, 11, 11, 11, 11>        184.59           298.33   0.619
BM_GEMM_Mult_Square<double, 11, 11, 11, 11>       270.03           399.78   0.675
BM_GEMM_Mult_Square<float, 16, 16, 16, 16>        430.07           345.05   1.246
BM_GEMM_Mult_Square<double, 16, 16, 16, 16>       891.57           608.66   1.465
BM_GEMM_Mult_Add_Square<float, 3, 3, 3, 3>        8.87             6.77     1.31
BM_GEMM_Mult_Add_Square<double, 3, 3, 3, 3>       7.1              6.8      1.044
BM_GEMM_Mult_Add_Square<float, 5, 5, 5, 5>        16.23            37.89    0.428
BM_GEMM_Mult_Add_Square<double, 5, 5, 5, 5>       23.32            68.01    0.343
BM_GEMM_Mult_Add_Square<float, 8, 8, 8, 8>        42.61            91.3     0.467
BM_GEMM_Mult_Add_Square<double, 8, 8, 8, 8>       89.15            162.19   0.55
BM_GEMM_Mult_Add_Square<float, 11, 11, 11, 11>    216.04           304.33   0.71
BM_GEMM_Mult_Add_Square<double, 11, 11, 11, 11>   300.04           423.92   0.708
BM_GEMM_Mult_Add_Square<float, 16, 16, 16, 16>    440.06           374.63   1.175
BM_GEMM_Mult_Add_Square<double, 16, 16, 16, 16>   913              777.27   1.175
BM_GEMM_Expr_Square<float, 3, 3, 3, 3>            16.69            32.46    0.514
BM_GEMM_Expr_Square<double, 3, 3, 3, 3>           14.36            30.56    0.47
BM_GEMM_Expr_Square<float, 5, 5, 5, 5>            33.48            72.54    0.461
BM_GEMM_Expr_Square<double, 5, 5, 5, 5>           47.15            108.44   0.435
BM_GEMM_Expr_Square<float, 8, 8, 8, 8>            91.3             205.74   0.444
BM_GEMM_Expr_Square<double, 8, 8, 8, 8>           187.31           385.48   0.486
BM_GEMM_Expr_Square<float, 11, 11, 11, 11>        400.05           660.08   0.606
BM_GEMM_Expr_Square<double, 11, 11, 11, 11>       582.93           874.4    0.667
BM_GEMM_Expr_Square<float, 16, 16, 16, 16>        938.7            788.69   1.19
BM_GEMM_Expr_Square<double, 16, 16, 16, 16>       1900.25          1543.1   1.231
 Ping.
I’ve also put up 2 patches on Phabricator to illustrate what the implementation could look like:
1. [Matrix] Add matrix type to Clang (WIP). https://reviews.llvm.org/D72281
2. [Matrix] Add __builtin_matrix_insert to Clang (WIP). https://reviews.llvm.org/D72283
Cheers, Florian


On Fri, 20 Dec 2019 at 10:32, Florian Hahn via cfedev <[hidden email]> wrote:
Hello,
This is a Clang-focused follow up to the original proposal on llvm-dev (http://lists.llvm.org/pipermail/llvmdev/2019October/136240.html). On the LLVM side, we recently landed the first commit adding matrix intrinsics as proposed.
On the Clang side, we would like to propose adding support for matrix math operations to Clang. This includes adding a new matrix type (similar to ext_vector_type) and a set of builtins to operate on values of the matrix type.
Our main motivation for the matrix support in Clang is to give users a way to
• Guarantee generation of high-quality code for matrix operations. For isolated operations, we can guarantee vector code generation suitable for the target. For trees of operations, the proposed value type helps with eliminating temporary loads & stores.
• Make use of specialized matrix ISA extensions, like the new matrix instructions in ARM v8.6 or various proprietary matrix accelerators, in their C/C++ code.
• Move optimizations from matrix wrapper libraries into the compiler. We use it internally to simplify an Eigen-style matrix library, by relying on LLVM for generating tiled & fused loops for matrix operations.
The rest of this RFC is structured as follows: First we propose a draft specification for the matrix type and accompanying builtins. Next we show an example of how matrix operations will be lowered by Clang, followed by a discussion of the contributing criteria for new extensions. We wrap up the RFC by discussing possible extensions to the matrix type.
Draft Specification
Matrix TYPE Attribute
The attribute-token matrix_type is used to declare a matrix type. It shall appear at most once in each attribute-list. The attribute shall only appertain to a typedef-name of a typedef of a non-volatile type that is a signed integer type, an unsigned integer type, or a floating-point type. An attribute-argument-clause must be present and it shall have the form:
(constant-expression, constant-expression)
Both constant-expressions shall be positive non-zero integral constant expressions. The maximum of the product of the constants is implementation-defined. If that implementation-defined limit is exceeded, the program is ill-formed.
An attribute of the form matrix_type(R, C) forms a matrix type with an element type of the cv-qualified type the attribute appertains to and R rows and C columns.
If a declaration of a typedef-name has a matrix_type attribute, then all declarations of that typedef-name shall have a matrix_type attribute with the same element type, number of rows, and number of columns.
Matrix Type
A matrix type has an underlying element type, a constant number of rows, and a constant number of columns. Matrix types with the same element type, rows, and columns are the same type. A value of a matrix type contains rows * columns values of the element type laid out in column-major order without padding in a way compatible with an array of at least that many elements of the underlying element type.
A matrix type is a scalar type with the same alignment as its underlying element type, but objects of matrix type are not usable in constant expressions.
TODO: Allow reinterpret_cast from pointer to element type. Make aliasing work.
Future Work: Initialization syntax.
Future Work: Access syntax. m[col][row].
Future Work: Conversions between matrix types with const-qualified and unqualified element types.
Future Work: Conversions between matrix types with different element types.
Matrix Type builtin Operations
Each matrix type supports a collection of builtin expressions that look like function calls but do not form an overload set. Here they are described as function declarations with rules for how to construct the argument list types and return type and the library description elements from [library.description.structure.specifications]/3 in the C++ standard.
Definitions:
• M, M1, M2, M3 - Matrix types
• T - Element type
• row, col - Row and column arguments respectively.
All operations on matrix types match the behavior of the underlying element type with respect to signed overflows.
Do you anticipate providing builtin operators for matrices? If not, then the utility of a dedicated type and `matrix_type` attribute seems greatly diminished: the builtin matrix operators could instead, in principle, operate on a suitable vector type (either as a flat vector, matching the LLVM IR model, or as a vector of vectors, to support two-dimensional indexing). I think your proposal should express why those would be inferior choices (e.g., do matrix types have different calling conventions, alignment requirements, ... on some target? Do you intend to provide matrix x matrix multiplication and matrix x vector multiplication via the * operator in the future?). Adding *only* builtin functions and no new matrix types would be a substantial simplification in the proposal.
Element Operations
Preconditions: row and col are in the ranges [0, rows in M) and [0, columns in M) respectively.
M __builtin_matrix_insert(M matrix, int row, int col, T elt)
Remarks: The return type and the type T are inferred from the cv-unqualified type of the matrix argument and its cv-unqualified element type respectively.
Returns: a copy of matrix with the element at the specified row and column set to elt.
T __builtin_matrix_extract(M matrix, int row, int col)
The return type is inferred from the cv-unqualified type of the matrix argument’s element type.
Returns: a copy of the element at the specified row and column.
Simple Binary Operations
For the following binary operations matrix1 and matrix2 shall be matrix values of the same cv-unqualified type, and the return type is the cv-unqualified version of that type.
M __builtin_matrix_add(M matrix1, M matrix2)
Returns: A matrix Res equivalent to the code below, where col refers to the number of columns of M, row to the number of rows of M and EltTy to the element type of M.
  M Res;
  for (int C = 0; C < col; ++C) {
    for (int R = 0; R < row; ++R) {
      EltTy Elt = __builtin_matrix_extract(matrix1, R, C) +
                  __builtin_matrix_extract(matrix2, R, C);
      Res = __builtin_matrix_insert(Res, R, C, Elt);
    }
  }
M __builtin_matrix_sub(M matrix1, M matrix2)
Returns: A matrix Res equivalent to the code below, where col refers to the number of columns of M, row to the number of rows of M and EltTy to the element type of M.
  M Res;
  for (int C = 0; C < col; ++C) {
    for (int R = 0; R < row; ++R) {
      EltTy Elt = __builtin_matrix_extract(matrix1, R, C) -
                  __builtin_matrix_extract(matrix2, R, C);
      Res = __builtin_matrix_insert(Res, R, C, Elt);
    }
  }
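Since the proposed builtins are not available in a stock compiler, the element and binary operations quoted above can be approximated in portable C++ to see their value semantics. This is only a sketch under stated assumptions: the Matrix struct and the matrixInsert, matrixExtract, matrixAdd and matrixSub names are hypothetical stand-ins, not the proposed builtins, and the storage is column-major without padding as in the draft spec.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a matrix_type(Rows, Cols) value: Rows * Cols
// elements in column-major order without padding.
template <typename T, std::size_t Rows, std::size_t Cols>
struct Matrix {
  std::array<T, Rows * Cols> data{};
};

// Models __builtin_matrix_insert: the matrix is taken by value and the
// modified copy is returned, so the caller's matrix is left unchanged.
template <typename T, std::size_t R, std::size_t C>
Matrix<T, R, C> matrixInsert(Matrix<T, R, C> m, std::size_t row,
                             std::size_t col, T elt) {
  assert(row < R && col < C); // precondition from the spec
  m.data[col * R + row] = elt;
  return m;
}

// Models __builtin_matrix_extract: returns a copy of one element.
template <typename T, std::size_t R, std::size_t C>
T matrixExtract(const Matrix<T, R, C> &m, std::size_t row, std::size_t col) {
  assert(row < R && col < C);
  return m.data[col * R + row];
}

// Element-wise addition, written as the same loop nest as in the spec.
template <typename T, std::size_t R, std::size_t C>
Matrix<T, R, C> matrixAdd(const Matrix<T, R, C> &a, const Matrix<T, R, C> &b) {
  Matrix<T, R, C> res;
  for (std::size_t c = 0; c < C; ++c)
    for (std::size_t r = 0; r < R; ++r)
      res = matrixInsert(res, r, c,
                         static_cast<T>(matrixExtract(a, r, c) +
                                        matrixExtract(b, r, c)));
  return res;
}

// Element-wise subtraction, same structure as matrixAdd.
template <typename T, std::size_t R, std::size_t C>
Matrix<T, R, C> matrixSub(const Matrix<T, R, C> &a, const Matrix<T, R, C> &b) {
  Matrix<T, R, C> res;
  for (std::size_t c = 0; c < C; ++c)
    for (std::size_t r = 0; r < R; ++r)
      res = matrixInsert(res, r, c,
                         static_cast<T>(matrixExtract(a, r, c) -
                                        matrixExtract(b, r, c)));
  return res;
}
```

Both operands share one template type, so mixing matrices of different shapes or element types is rejected at compile time, which mirrors the requirement that matrix1 and matrix2 have the same cv-unqualified type.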
Other Operations
M3 __builtin_matrix_multiply(M1 matrix1, M2 matrix2)
Mandates: M1 and M2 shall be matrix types with the same cv-unqualified element type and M1’s number of columns matching M2’s number of rows.
Remarks: The return type is a cv-unqualified matrix type with the same element type as M1 and M2 if both M1 and M2’s element type is const, or the cv-unqualified element type otherwise, and with the same number of rows as M1 and the same number of columns as M2.
Returns: A matrix Res equivalent to the code below, where col refers to the number of columns of M3, row to the number of rows of M3, EltTy to the element type of M3 and inner refers to the number of columns of M1.
  M3 Res;
  for (int C = 0; C < col; ++C) {
    for (int R = 0; R < row; ++R) {
      EltTy Elt = 0;
      for (int K = 0; K < inner; ++K) {
        Elt += __builtin_matrix_extract(matrix1, R, K) *
               __builtin_matrix_extract(matrix2, K, C);
      }
      Res = __builtin_matrix_insert(Res, R, C, Elt);
    }
  }
Remark: With respect to rounding errors, the operation preserves the behavior of the separate multiply and add operations by default. We propose to provide a Clang option to override this behavior and allow contraction of those operations (e.g. -ffp-contract=matrix).
The above seem like they would be better if provided as operators rather than as builtin functions. We don't provide builtins for these kinds of operations for vector types, because we expect all code to use the operator syntax instead.
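For readers who want to check the shape rules of the quoted multiply, the loop nest can be sketched in portable C++. This is an illustration only: Matrix, at and matrixMultiply are hypothetical names, not the proposed builtin, and the sketch performs separate multiply and add steps without any contraction.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a matrix_type(Rows, Cols) value, stored in
// column-major order without padding.
template <typename T, std::size_t Rows, std::size_t Cols>
struct Matrix {
  std::array<T, Rows * Cols> data{};
  T &at(std::size_t row, std::size_t col) { return data[col * Rows + row]; }
  const T &at(std::size_t row, std::size_t col) const {
    return data[col * Rows + row];
  }
};

// (R x K) * (K x C) -> (R x C). The shared template parameter K enforces
// that the columns of the left operand match the rows of the right one.
template <typename T, std::size_t R, std::size_t K, std::size_t C>
Matrix<T, R, C> matrixMultiply(const Matrix<T, R, K> &a,
                               const Matrix<T, K, C> &b) {
  Matrix<T, R, C> res;
  for (std::size_t c = 0; c < C; ++c) {
    for (std::size_t r = 0; r < R; ++r) {
      T acc = 0;
      for (std::size_t k = 0; k < K; ++k)
        acc += a.at(r, k) * b.at(k, c);
      res.at(r, c) = acc;
    }
  }
  return res;
}
```

A call with mismatched inner dimensions fails template deduction, which plays the role of the "Mandates" clause: the error is reported at compile time rather than at run time.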
M2 __builtin_matrix_transpose(M1 matrix)
Remarks: The return type is a cv-unqualified matrix type that has the same element type as M1 and has the same number of rows as M1 has columns and the same number of columns as M1 has rows.
Returns: A matrix Res equivalent to the code below, where col refers to the number of columns of M1, row to the number of rows of M1, and EltTy to the element type of M1.
    M2 Res;
    for (int C = 0; C < col; ++C) {
      for (int R = 0; R < row; ++R) {
        EltTy Elt = __builtin_matrix_extract(matrix, R, C);
        Res = __builtin_matrix_insert(Res, C, R, Elt);
      }
    }
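The way transpose swaps the dimensions in the result type can be sketched in plain C++ as follows. Mat and transpose are hypothetical names used only for this illustration, assuming the column-major, unpadded layout from the draft.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical column-major matrix model (not the proposed Clang type).
template <typename T, std::size_t R, std::size_t C>
struct Mat {
  std::array<T, R * C> data{};

  T &at(std::size_t r, std::size_t c) {
    assert(r < R && c < C);
    return data[c * R + r];
  }
  const T &at(std::size_t r, std::size_t c) const {
    assert(r < R && c < C);
    return data[c * R + r];
  }
};

// An R x C input produces a C x R result; element (r, c) moves to (c, r).
template <typename T, std::size_t R, std::size_t C>
Mat<T, C, R> transpose(const Mat<T, R, C> &m) {
  Mat<T, C, R> res;
  for (std::size_t c = 0; c < C; ++c)
    for (std::size_t r = 0; r < R; ++r)
      res.at(c, r) = m.at(r, c);
  return res;
}
```

Because the dimensions are part of the type, assigning the transpose of a 2x3 value to anything other than a 3x2 value is rejected at compile time in this model.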
Maybe it's a bit cute, but have you considered using an operator such as prefix ~ for this, or perhaps a postfix .T? (This is in some sense a swizzle, and we use member-access-like syntax for those already.)
M __builtin_matrix_column_load( T *ptr, int row, int col, int stride)
Mandates: row and col shall be integral constants greater than 0.
Preconditions: stride >= row.
Remarks: The return type is a cv-unqualified matrix type with an element type of the cv-unqualified version of T and a number of rows and columns equal to row and col respectively.
Returns: A matrix Res equivalent to:
    M Res;
    for (int C = 0; C < col; ++C) {
      for (int R = 0; R < row; ++R)
        Res = __builtin_matrix_insert(Res, R, C, ptr[R]);
      ptr += stride;
    }
void __builtin_matrix_column_store(M matrix, T *ptr, int stride)
Preconditions: stride is greater than or equal to the number of rows in M.
Effects: Equivalent to:
    for (int C = 0; C < columns in M; ++C) {
      for (int R = 0; R < rows in M; ++R)
        ptr[R] = __builtin_matrix_extract(matrix, R, C);
      ptr += stride;
    }
Remarks: The type T is the const-unqualified version of the matrix argument’s element type.
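As an illustration of what the stride parameter buys, the following sketch models the strided column load and store in portable C++. The names column_load and column_store are hypothetical, and the dimensions are run-time values here purely for brevity; in the proposal they are compile-time constants.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reads a rows x cols matrix whose consecutive columns start `stride`
// elements apart and returns it as a dense column-major vector.
template <typename T>
std::vector<T> column_load(const T *ptr, std::size_t rows, std::size_t cols,
                           std::size_t stride) {
  assert(stride >= rows);  // precondition from the draft
  std::vector<T> res(rows * cols);
  for (std::size_t c = 0; c < cols; ++c)
    for (std::size_t r = 0; r < rows; ++r)
      res[c * rows + r] = ptr[c * stride + r];
  return res;
}

// Writes a dense column-major matrix back to strided memory.
template <typename T>
void column_store(const std::vector<T> &m, std::size_t rows, std::size_t cols,
                  T *ptr, std::size_t stride) {
  assert(stride >= rows);
  assert(m.size() == rows * cols);
  for (std::size_t c = 0; c < cols; ++c)
    for (std::size_t r = 0; r < rows; ++r)
      ptr[c * stride + r] = m[c * rows + r];
}
```

A stride larger than the row count lets a caller load or store a sub-block of a larger column-major buffer, for example a 2x2 tile out of a buffer whose columns are 4 elements tall.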
Presumably these would be unnecessary if we permitted casting between an M* and a T* and treating the M* as a suitably-sized array of T? (Again, we don't have anything like this for vector types, for which we do guarantee that you can cast a vector* to a T* and access the vector elements directly.)
M __builtin_matrix_scalar_multiply(M matrix, T scalar)
Returns: A matrix Res equivalent to the code below, where col refers to the number of columns of M, and row to the number of rows of M.
    M Res;
    for (int C = 0; C < col; ++C) {
      for (int R = 0; R < row; ++R) {
        EltTy Elt = __builtin_matrix_extract(matrix, R, C) * scalar;
        Res = __builtin_matrix_insert(Res, R, C, Elt);
      }
    }
Remarks: The return type and the type T are the cv-unqualified type of the matrix argument and its cv-unqualified element type respectively.
(As with the above operators, using the * operator for this seems more appropriate to me.)
Example
This code performs a matrix multiply of two 4x4 matrices followed by a matrix addition:
    typedef float m4x4_t __attribute__((matrix_type(4, 4)));

    void f(m4x4_t *a, m4x4_t *b, m4x4_t *c, m4x4_t *r) {
      *r = __builtin_matrix_add(__builtin_matrix_multiply(*a, *b), *c);
    }
This will get lowered by Clang to the LLVM IR below. In our current implementation, we use LLVM’s array type as storage type for the matrix data. Before accessing the data, we cast the array to a vector type. This allows us to use the element width as alignment, without running into issues with LLVM’s large default alignment for vector types, which is problematic in structs.
    define void @f([16 x float]* %a, [16 x float]* %b, [16 x float]* %c, [16 x float]* %r) #0 {
    entry:
      %a.addr = alloca [16 x float]*, align 8
      %b.addr = alloca [16 x float]*, align 8
      %c.addr = alloca [16 x float]*, align 8
      %r.addr = alloca [16 x float]*, align 8
      store [16 x float]* %a, [16 x float]** %a.addr, align 8
      store [16 x float]* %b, [16 x float]** %b.addr, align 8
      store [16 x float]* %c, [16 x float]** %c.addr, align 8
      store [16 x float]* %r, [16 x float]** %r.addr, align 8
      %0 = load [16 x float]*, [16 x float]** %a.addr, align 8
      %1 = bitcast [16 x float]* %0 to <16 x float>*
      %2 = load <16 x float>, <16 x float>* %1, align 4
      %3 = load [16 x float]*, [16 x float]** %b.addr, align 8
      %4 = bitcast [16 x float]* %3 to <16 x float>*
      %5 = load <16 x float>, <16 x float>* %4, align 4
      %6 = call <16 x float> @llvm.matrix.multiply.v16f32.v16f32.v16f32(<16 x float> %2, <16 x float> %5, i32 4, i32 4, i32 4)
      %7 = load [16 x float]*, [16 x float]** %c.addr, align 8
      %8 = bitcast [16 x float]* %7 to <16 x float>*
      %9 = load <16 x float>, <16 x float>* %8, align 4
      %10 = fadd <16 x float> %6, %9
      %11 = load [16 x float]*, [16 x float]** %r.addr, align 8
      %12 = bitcast [16 x float]* %11 to <16 x float>*
      store <16 x float> %10, <16 x float>* %12, align 4
      ret void
    }

    declare <16 x float> @llvm.matrix.multiply.v16f32.v16f32.v16f32(<16 x float>, <16 x float>, i32 immarg, i32 immarg, i32 immarg)
Contributing Criteria
Evidence of a significant user community: This is based on a number of factors, including an existing user community, the perceived likelihood that users would adopt such a feature if it were available, and any secondary effects that come from, e.g., a library adopting the feature and providing benefits to its users. Currently this is part of one of our compiler toolchains and used on a few large internal codebases. The matrix type can be used by matrix libraries like Eigen, to offload some of the optimization responsibility from the library to the compiler.
Have the Eigen developers indicated they would consider using this if it were available to them? Have you reached out to the GCC developers to see if they would also be likely to support this extension? We should be aiming to build critical mass behind this feature so that it gets adopted; it would be a waste of resources if a different technology ends up being adopted in this space and we're left maintaining a system that no one outside Apple uses.
It would also be a suitable target for implementing a standard matrix library. It also provides functionality similar to various libraries for matrix math on small matrixes, like https://developer.apple.com/documentation/accelerate/working_with_matrices, with more flexibility (supports any combination of input dimensions).
A specific need to reside within the Clang tree: There are some extensions that would be better expressed as a separate tool, and should remain as separate tools even if they end up being hosted as part of the LLVM umbrella project. We want to expose this feature at the C/C++ level. For that, it needs to be part of Clang.
A specification: The specification must be sufficient to understand the design of the feature as well as interpret the meaning of specific examples. The specification should be detailed enough that another compiler vendor could implement the feature. We currently have the design above and will work on a more comprehensive spec.
Do you anticipate the various psABIs being updated to specify the calling convention for matrix parameters and return values? If not, you'll need to include that in your specification too. Similarly, you will need to specify a mangling to use for these types in both the Itanium and MS C++ ABIs.
Representation within the appropriate governing organization: For extensions to a language governed by a standards committee (C, C++, OpenCL), the extension itself must have an active proposal and proponent within that committee and have a reasonable chance of acceptance. Clang should drive the standard, not diverge from it. This criterion does not apply to all extensions, since some extensions fall outside of the realm of the standards bodies. We think this extension would fall outside of the realm of the standards bodies. It is an implementation detail used to implement matrix math libraries and such, much like the vector extensions are an implementation detail for SIMD libraries.
A long-term support plan: increasingly large or complex extensions to Clang need matching commitments to supporting them over time, including improving their implementation and specification as Clang evolves. The capacity of the contributor to make that commitment is as important as the commitment itself. We are using this internally and adding this feature to Clang upstream means we intend to support it as part of our ongoing Clang work.
A high-quality implementation: The implementation must fit well into Clang's architecture, follow LLVM's coding conventions, and meet Clang's quality standards, including diagnostics and complete AST representations. This is particularly important for language extensions, because users will learn how those extensions work through the behavior of the compiler. We will provide a series of patches to implement the extension soon and look forward to any feedback to make sure the patches meet the quality requirement.
A test suite: Extensive testing is crucial to ensure that the language extension is not broken by ongoing maintenance in Clang. The test suite should be complete enough that another compiler vendor could conceivably validate their implementation of the feature against it. We will provide this as part of Clang’s unit tests and test-suite.
Extensions
Initially we want to focus on 2D matrixes without padding in column-major layout as a concrete use case. This is similar to the defaults for the Matrix type in Eigen, for example. But our proposed type can be extended naturally to
 Support N (known constant) dimensions by turning matrix_type attribute into a variadic attribute.
Hmm. "matrix" wouldn't really be the right name for the generalized attribute. Presumably matrix_type(N) would mean the same thing as ext_vector_type(N)? Are there realistic use cases for this? (I expect it's not worth planning for this eventuality until we actually have such a use case.)
 Support column/row-wise padding, by adding a column_padding clause to the attribute.
Dealing with the padding could be exclusively handled on the frontend side, by emitting additional shufflevector instructions to extract the data. If there is a desire to exploit the padding more on the LLVM side, we can add a set of intrinsics for that.
 Support row & column major layouts, by adding a layout clause to the attribute.
Again, this naively could be handled while lowering to LLVM IR in Clang using shufflevector to produce flattened vectors with the required layout. For better optimisations, the LLVM intrinsics relying on shape/layout information can be extended to take the layout as additional argument. Through propagating the layout information similar to the dimensions, we should be able to optimise the points where we need to transform the layout of the underlying matrixes.
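The naive layout conversion mentioned above amounts to a fixed permutation of the flattened elements. The sketch below computes such a permutation in plain C++ and applies it the way a single-input shuffle mask is read (result element i comes from source element mask[i]). Both function names are hypothetical and nothing here reflects an actual Clang or LLVM interface.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Builds a mask that reorders a flat row-major rows x cols matrix into
// column-major order: result element i is taken from source element mask[i].
std::vector<std::size_t> row_to_column_major_mask(std::size_t rows,
                                                  std::size_t cols) {
  std::vector<std::size_t> mask(rows * cols);
  for (std::size_t c = 0; c < cols; ++c)
    for (std::size_t r = 0; r < rows; ++r)
      mask[c * rows + r] = r * cols + c;
  return mask;
}

// Applies a mask to a flat vector, modelling a single-input shuffle.
template <typename T>
std::vector<T> apply_shuffle(const std::vector<T> &src,
                             const std::vector<std::size_t> &mask) {
  std::vector<T> dst;
  dst.reserve(mask.size());
  for (std::size_t idx : mask) {
    assert(idx < src.size());
    dst.push_back(src[idx]);
  }
  return dst;
}
```

For a 2x3 matrix the mask is {0, 3, 1, 4, 2, 5}. Since the dimensions are known constants, such a mask is itself a compile-time constant, which is what makes a shuffle-based lowering possible at all.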
In all cases, we require known integer constants as dimensions and we do not plan to support dynamic dimensions for now, as the main optimization potential comes from the fact that we know the dimensions. Supporting dynamic dimensions should be fairly straightforward, but means we lose the ability to type check matrix expressions at compile time and we also have to rely on dynamic dimensions during code generation.
Cheers,
Florian
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev


Thanks for the feedback! I’ve responded inline.
On Fri, 20 Dec 2019 at 10:32, Florian Hahn via cfe-dev <[hidden email]> wrote:
Hello,
This is a Clang-focused follow-up to the original proposal on llvm-dev (http://lists.llvm.org/pipermail/llvm-dev/2019-October/136240.html). On the LLVM side, we recently landed the first commit adding matrix intrinsics as proposed.
On the Clang side, we would like to propose adding support for matrix math operations to Clang. This includes adding a new matrix type (similar to ext_vector_type) and a set of builtins to operate on values of the matrix type.
Our main motivation for the matrix support in Clang is to give users a way to
 Guarantee generation of high-quality code for matrix operations. For isolated operations, we can guarantee vector code generation suitable for the target. For trees of operations, the proposed value type helps with eliminating temporary loads & stores.
 Make use of specialized matrix ISA extensions, like the new matrix instructions in ARM v8.6 or various proprietary matrix accelerators, in their C/C++ code.
 Move optimizations from matrix wrapper libraries into the compiler. We use it internally to simplify an Eigen-style matrix library, by relying on LLVM for generating tiled & fused loops for matrix operations.
The rest of this RFC is structured as follows: First we propose a draft specification for the matrix type and accompanying builtins. Next we show an example of how matrix operations will be lowered by Clang, followed by a discussion of the contributing criteria for new extensions. We wrap up the RFC by discussing possible extensions to the matrix type.
Draft Specification
Matrix TYPE Attribute
The attribute-token matrix_type is used to declare a matrix type. It shall appear at most once in each attribute-list. The attribute shall only appertain to a typedef-name of a typedef of a non-volatile type that is a signed integer type, an unsigned integer type, or a floating-point type. An attribute-argument-clause must be present and it shall have the form:
(constant-expression, constant-expression)
Both constant-expressions shall be positive non-zero integral constant expressions. The maximum of the product of the constants is implementation-defined. If that implementation-defined limit is exceeded, the program is ill-formed.
An attribute of the form matrix_type( R , C ) forms a matrix type with an element type of the cv-qualified type the attribute appertains to and R rows and C columns.
If a declaration of a typedef-name has a matrix_type attribute, then all declarations of that typedef-name shall have a matrix_type attribute with the same element type, number of rows, and number of columns.
Matrix Type
A matrix type has an underlying element type, a constant number of rows, and a constant number of columns. Matrix types with the same element type, rows, and columns are the same type. A value of a matrix type contains rows * columns values of the element type laid out in column-major order without padding in a way compatible with an array of at least that many elements of the underlying element type.
A matrix type is a scalar type with the same alignment as its underlying element type, but objects of matrix type are not usable in constant expressions.
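The layout rule above implies a simple index computation. As a minimal sketch in portable C++, assuming nothing beyond the column-major, unpadded layout stated in the draft, element (row, col) of an R x C matrix sits at flat index col * R + row. ColMajor, flat_index, extract and insert are hypothetical names that only mirror the value semantics of the proposed element builtins.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a matrix value: Rows * Cols elements stored
// column by column with no padding.
template <typename T, std::size_t Rows, std::size_t Cols>
struct ColMajor {
  T data[Rows * Cols];
};

// Flat index of element (row, col) in column-major order.
template <std::size_t Rows, std::size_t Cols>
constexpr std::size_t flat_index(std::size_t row, std::size_t col) {
  return col * Rows + row;
}

// Models extract: returns a copy of the addressed element.
template <typename T, std::size_t Rows, std::size_t Cols>
T extract(const ColMajor<T, Rows, Cols> &m, std::size_t row, std::size_t col) {
  assert(row < Rows && col < Cols);
  return m.data[flat_index<Rows, Cols>(row, col)];
}

// Models insert: takes the matrix by value and returns a modified copy,
// leaving the caller's original untouched.
template <typename T, std::size_t Rows, std::size_t Cols>
ColMajor<T, Rows, Cols> insert(ColMajor<T, Rows, Cols> m, std::size_t row,
                               std::size_t col, T elt) {
  assert(row < Rows && col < Cols);
  m.data[flat_index<Rows, Cols>(row, col)] = elt;
  return m;
}
```

In a 2x3 matrix, element (1, 2) therefore lands at flat index 5, the last slot of the third column.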
TODO: Allow reinterpret_cast from pointer to element type. Make aliasing work.
Future Work: Initialization syntax.
Future Work: Access syntax. m[col][row].
Future Work: Conversions between matrix types with const qualified and unqualified element types.
Future Work: Conversions between matrix types with different element types.
Matrix Type builtin Operations
Each matrix type supports a collection of builtin expressions that look like function calls but do not form an overload set. Here they are described as function declarations with rules for how to construct the argument list types and return type and the library description elements from [library.description.structure.specifications]/3 in the C++ standard.
Definitions:
 M, M1, M2, M3 - Matrix types
 T - Element type
 row, col - Row and column arguments respectively.
All operations on matrix types match the behavior of the underlying element type with respect to signed overflows.
Do you anticipate providing builtin operators for matrices? If not, then the utility of a dedicated type and `matrix_type` attribute seems greatly diminished: the builtin matrix operators could instead - in principle - operate on a suitable vector type (either as a flat vector, matching the LLVM IR model, or as a vector of vectors, to support two-dimensional indexing). I think your proposal should express why those would be inferior choices (e.g., do matrix types have different calling conventions, alignment requirements, ... on some target? Do you intend to provide matrix x matrix multiplication and matrix x vector multiplication via the * operator in the future?). Adding *only* builtin functions and no new matrix types would be a substantial simplification in the proposal.
I think it would make sense to provide builtin operators instead of the proposed builtins for math operations. Same for element insertion/extraction. However I am not sure how to provide the strided matrix load/store as operators. Would it be OK to just have builtins for those? The reason we went for builtins initially was that we thought that might make the proposal a bit more lightweight, but it sounds like builtin operators would be preferred with the type. I do not think ext_vector_type would be suitable for our proposal, as it matches LLVM’s vector alignment and the matrix type should match the alignment of the underlying data type, to allow easy interaction with existing matrix libraries.
A vector of vectors should work in principle, as long as we could fix both dimensions on a type level. Not having the dimensions guaranteed by the type would have a negative impact on the user experience I think, as we would, for example, lose the ability to type-check if the dimensions match the operators and users would have to provide the dimensions for certain operations. Also, it would make supporting 3+ dimensions a bit more tricky.
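The type-checking benefit of carrying both dimensions in the type can be sketched with ordinary C++ templates. Mat and product_of below are hypothetical illustrations, not part of the proposal: the partial specialization only matches when the column count of the left operand equals the row count of the right operand, so a mismatch is detectable at compile time.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical matrix model with both dimensions fixed in the type.
template <typename T, std::size_t R, std::size_t C>
struct Mat {
  T data[R * C];

  // Column-major element access with a run-time bounds check.
  T &at(std::size_t r, std::size_t c) {
    assert(r < R && c < C);
    return data[c * R + r];
  }
};

// Primary template: by default two types cannot be multiplied.
template <typename A, typename B>
struct product_of {
  static constexpr bool valid = false;
};

// Specialization: valid only when the inner dimension K agrees.
template <typename T, std::size_t R, std::size_t K, std::size_t C>
struct product_of<Mat<T, R, K>, Mat<T, K, C>> {
  static constexpr bool valid = true;
  using type = Mat<T, R, C>;
};

static_assert(product_of<Mat<float, 2, 3>, Mat<float, 3, 4>>::valid,
              "2x3 times 3x4 is well-formed");
static_assert(!product_of<Mat<float, 2, 3>, Mat<float, 2, 3>>::valid,
              "2x3 times 2x3 is rejected");
static_assert(std::is_same<product_of<Mat<float, 2, 3>, Mat<float, 3, 4>>::type,
                           Mat<float, 2, 4>>::value,
              "the result is 2x4");
```

With dimensions supplied only as run-time arguments, none of these checks could be expressed in the type system, and the result shape would have to be passed explicitly.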
Element Operations
Preconditions: row and col are in the ranges [0, rows in M) and [0, columns in M) respectively.
M __builtin_matrix_insert(M matrix, int row, int col, T elt)
Remarks: The return type and the type T are inferred from the cv-unqualified type of the matrix argument and its cv-unqualified element type respectively.
Returns: a copy of matrix with the element at the specified row and column set to elt.
T __builtin_matrix_extract(M matrix, int row, int col)
Remarks: The return type is inferred from the cv-unqualified type of the matrix argument’s element type.
Returns: a copy of the element at the specified row and column.
Simple Binary Operations
For the following binary operations matrix1 and matrix2 shall be matrix values of the same cv-unqualified type, and the return type is the cv-unqualified version of that type.
M __builtin_matrix_add(M matrix1, M matrix2)
Returns: A matrix Res equivalent to the code below, where col refers to the number of columns of M, row to the number of rows of M and EltTy to the element type of M.
    M Res;
    for (int C = 0; C < col; ++C) {
      for (int R = 0; R < row; ++R) {
        EltTy Elt = __builtin_matrix_extract(matrix1, R, C) +
                    __builtin_matrix_extract(matrix2, R, C);
        Res = __builtin_matrix_insert(Res, R, C, Elt);
      }
    }
M __builtin_matrix_sub(M matrix1, M matrix2)
Returns: A matrix Res equivalent to the code below, where col refers to the number of columns of M, row to the number of rows of M and EltTy to the element type of M.
    M Res;
    for (int C = 0; C < col; ++C) {
      for (int R = 0; R < row; ++R) {
        EltTy Elt = __builtin_matrix_extract(matrix1, R, C) -
                    __builtin_matrix_extract(matrix2, R, C);
        Res = __builtin_matrix_insert(Res, R, C, Elt);
      }
    }
Other Operations
M3 __builtin_matrix_multiply(M1 matrix1, M2 matrix2)
Mandates: M1 and M2 shall be matrix types with the same cv-unqualified element type and M1’s number of columns matching M2’s number of rows.
Remarks: The return type is a cv-unqualified matrix type with the same element type as M1 and M2 if both M1 and M2’s element type is const, or the cv-unqualified element type otherwise, and with the same number of rows as M1 and the same number of columns as M2.
Returns: A matrix Res equivalent to the code below, where col refers to the number of columns of M3, row to the number of rows of M3, EltTy to the element type of M3 and inner refers to the number of columns of M1.
    M3 Res;
    for (int C = 0; C < col; ++C) {
      for (int R = 0; R < row; ++R) {
        EltTy Elt = 0;
        for (int K = 0; K < inner; ++K) {
          Elt += __builtin_matrix_extract(matrix1, R, K) *
                 __builtin_matrix_extract(matrix2, K, C);
        }
        Res = __builtin_matrix_insert(Res, R, C, Elt);
      }
    }
Remark: With respect to rounding errors, the operation preserves the behavior of the separate multiply and add operations by default. We propose to provide a Clang option to override this behavior and allow contraction of those operations (e.g. -ffp-contract=matrix).
The above seem like they would be better if provided as operators rather than as builtin functions. We don't provide builtins for these kinds of operations for vector types, because we expect all code to use the operator syntax instead.
Agreed.
M2 __builtin_matrix_transpose(M1 matrix)
Remarks: The return type is a cv-unqualified matrix type that has the same element type as M1 and has the same number of rows as M1 has columns and the same number of columns as M1 has rows.
Returns: A matrix Res equivalent to the code below, where col refers to the number of columns of M1, row to the number of rows of M1, and EltTy to the element type of M1.
    M2 Res;
    for (int C = 0; C < col; ++C) {
      for (int R = 0; R < row; ++R) {
        EltTy Elt = __builtin_matrix_extract(matrix, R, C);
        Res = __builtin_matrix_insert(Res, C, R, Elt);
      }
    }
Maybe it's a bit cute, but have you considered using an operator such as prefix ~ for this, or perhaps a postfix .T? (This is in some sense a swizzle, and we use member-access-like syntax for those already.)
Something like .t()/T() would probably be quite convenient for the users.
M __builtin_matrix_column_load( T *ptr, int row, int col, int stride)
Mandates: row and col shall be integral constants greater than 0.
Preconditions: stride >= row.
Remarks: The return type is a cv-unqualified matrix type with an element type of the cv-unqualified version of T and a number of rows and columns equal to row and col respectively.
Returns: A matrix Res equivalent to:
    M Res;
    for (int C = 0; C < col; ++C) {
      for (int R = 0; R < row; ++R)
        Res = __builtin_matrix_insert(Res, R, C, ptr[R]);
      ptr += stride;
    }
void __builtin_matrix_column_store(M matrix, T *ptr, int stride)
Preconditions: stride is greater than or equal to the number of rows in M.
Effects: Equivalent to:
    for (int C = 0; C < columns in M; ++C) {
      for (int R = 0; R < rows in M; ++R)
        ptr[R] = __builtin_matrix_extract(matrix, R, C);
      ptr += stride;
    }
Remarks: The type T is the const-unqualified version of the matrix argument’s element type.
Presumably these would be unnecessary if we permitted casting between an M* and a T* and treating the M* as a suitably-sized array of T? (Again, we don't have anything like this for vector types, for which we do guarantee that you can cast a vector* to a T* and access the vector elements directly.)
Yes, they are not strictly necessary, but I think they are very convenient for users and help guarantee vector code generation for those loads/stores, rather than relying on vectorization of load/store loops (I think it would be good not to encourage people too much to use loops with matrix values). Having the loads/stores expressed on the whole matrix likely also helps with alias analysis, although we haven’t explored that direction so far.
M __builtin_matrix_scalar_multiply(M matrix, T scalar)
Returns: A matrix Res equivalent to the code below, where col refers to the number of columns of M, and row to the number of rows of M.
    M Res;
    for (int C = 0; C < col; ++C) {
      for (int R = 0; R < row; ++R) {
        EltTy Elt = __builtin_matrix_extract(matrix, R, C) * scalar;
        Res = __builtin_matrix_insert(Res, R, C, Elt);
      }
    }
Remarks: The return type and the type T are the cv-unqualified type of the matrix argument and its cv-unqualified element type respectively.
(As with the above operators, using the * operator for this seems more appropriate to me.)
Example
This code performs a matrix multiply of two 4x4 matrices followed by a matrix addition:
    typedef float m4x4_t __attribute__((matrix_type(4, 4)));

    void f(m4x4_t *a, m4x4_t *b, m4x4_t *c, m4x4_t *r) {
      *r = __builtin_matrix_add(__builtin_matrix_multiply(*a, *b), *c);
    }
This will get lowered by Clang to the LLVM IR below. In our current implementation, we use LLVM’s array type as storage type for the matrix data. Before accessing the data, we cast the array to a vector type. This allows us to use the element width as alignment, without running into issues with LLVM’s large default alignment for vector types, which is problematic in structs.
    define void @f([16 x float]* %a, [16 x float]* %b, [16 x float]* %c, [16 x float]* %r) #0 {
    entry:
      %a.addr = alloca [16 x float]*, align 8
      %b.addr = alloca [16 x float]*, align 8
      %c.addr = alloca [16 x float]*, align 8
      %r.addr = alloca [16 x float]*, align 8
      store [16 x float]* %a, [16 x float]** %a.addr, align 8
      store [16 x float]* %b, [16 x float]** %b.addr, align 8
      store [16 x float]* %c, [16 x float]** %c.addr, align 8
      store [16 x float]* %r, [16 x float]** %r.addr, align 8
      %0 = load [16 x float]*, [16 x float]** %a.addr, align 8
      %1 = bitcast [16 x float]* %0 to <16 x float>*
      %2 = load <16 x float>, <16 x float>* %1, align 4
      %3 = load [16 x float]*, [16 x float]** %b.addr, align 8
      %4 = bitcast [16 x float]* %3 to <16 x float>*
      %5 = load <16 x float>, <16 x float>* %4, align 4
      %6 = call <16 x float> @llvm.matrix.multiply.v16f32.v16f32.v16f32(<16 x float> %2, <16 x float> %5, i32 4, i32 4, i32 4)
      %7 = load [16 x float]*, [16 x float]** %c.addr, align 8
      %8 = bitcast [16 x float]* %7 to <16 x float>*
      %9 = load <16 x float>, <16 x float>* %8, align 4
      %10 = fadd <16 x float> %6, %9
      %11 = load [16 x float]*, [16 x float]** %r.addr, align 8
      %12 = bitcast [16 x float]* %11 to <16 x float>*
      store <16 x float> %10, <16 x float>* %12, align 4
      ret void
    }

    declare <16 x float> @llvm.matrix.multiply.v16f32.v16f32.v16f32(<16 x float>, <16 x float>, i32 immarg, i32 immarg, i32 immarg)
Contributing Criteria
Evidence of a significant user community: This is based on a number of factors, including an existing user community, the perceived likelihood that users would adopt such a feature if it were available, and any secondary effects that come from, e.g., a library adopting the feature and providing benefits to its users. Currently this is part of one of our compiler toolchains and used on a few large internal codebases. The matrix type can be used by matrix libraries like Eigen, to offload some of the optimization responsibility from the library to the compiler.
Have the Eigen developers indicated they would consider using this if it were available to them? Have you reached out to the GCC developers to see if they would also be likely to support this extension? We should be aiming to build critical mass behind this feature so that it gets adopted; it would be a waste of resources if a different technology ends up being adopted in this space and we're left maintaining a system that no one outside Apple uses.
We hoped to get some initial feedback before reaching out, to make sure the proposal is in reasonably good shape. I plan to reach out to them early next week.
It would also be a suitable target for implementing a standard matrix library. It also provides functionality similar to various libraries for matrix math on small matrixes, like https://developer.apple.com/documentation/accelerate/working_with_matrices, with more flexibility (supports any combination of input dimensions).
A specific need to reside within the Clang tree: There are some extensions that would be better expressed as a separate tool, and should remain as separate tools even if they end up being hosted as part of the LLVM umbrella project. We want to expose this feature at the C/C++ level. For that, it needs to be part of Clang.
A specification: The specification must be sufficient to understand the design of the feature as well as interpret the meaning of specific examples. The specification should be detailed enough that another compiler vendor could implement the feature. We currently have the design above and will work on a more comprehensive spec.
Do you anticipate the various psABIs being updated to specify the calling convention for matrix parameters and return values? If not, you'll need to include that in your specification too.
We don’t have any plans to update ABIs at the moment. Matrix values would be passed in memory. I thought we spelled that out in the proposal, but couldn’t find it while rereading.
Similarly, you will need to specify a mangling to use for these types in both the Itanium and MS C++ ABIs.
Yes we will have to update those. In our initial implementation we used Dm{NumRows}_{NumColumns} for Itanium.
Representation within the appropriate governing organization: For extensions to a language governed by a standards committee (C, C++, OpenCL), the extension itself must have an active proposal and proponent within that committee and have a reasonable chance of acceptance. Clang should drive the standard, not diverge from it. This criterion does not apply to all extensions, since some extensions fall outside of the realm of the standards bodies. We think this extension would fall outside of the realm of the standards bodies. It is an implementation detail used to implement matrix math libraries and such, much like the vector extensions are an implementation detail for SIMD libraries.
A long-term support plan: increasingly large or complex extensions to Clang need matching commitments to supporting them over time, including improving their implementation and specification as Clang evolves. The capacity of the contributor to make that commitment is as important as the commitment itself. We are using this internally and adding this feature to Clang upstream means we intend to support it as part of our ongoing Clang work.
A high-quality implementation: The implementation must fit well into Clang's architecture, follow LLVM's coding conventions, and meet Clang's quality standards, including diagnostics and complete AST representations. This is particularly important for language extensions, because users will learn how those extensions work through the behavior of the compiler. We will provide a series of patches to implement the extension soon and look forward to any feedback to make sure the patches meet the quality requirement.
A test suite: Extensive testing is crucial to ensure that the language extension is not broken by ongoing maintenance in Clang. The test suite should be complete enough that another compiler vendor could conceivably validate their implementation of the feature against it. We will provide this as part of Clang’s unit tests and test-suite.
Extensions
Initially we want to focus on 2D matrixes without padding in column-major layout as a concrete use case. This is similar to the defaults for the Matrix type in Eigen, for example. But our proposed type can be extended naturally to
 Support N (known constant) dimensions by turning matrix_type attribute into a variadic attribute.
Hmm. "matrix" wouldn't really be the right name for the generalized attribute. Presumably matrix_type(N) would mean the same thing as ext_vector_type(N)? Are there realistic use cases for this? (I expect it's not worth planning for this eventuality until we actually have such a use case.)
 Support column/row-wise padding, by adding a column_padding clause to the attribute.
Dealing with the padding could be exclusively handled on the frontend side, by emitting additional shufflevector instructions to extract the data. If there is a desire to exploit the padding more on the LLVM side, we can add a set of intrinsics for that.
 Support row & column major layouts, by adding a layout clause to the attribute.
Again, this naively could be handled while lowering to LLVM IR in Clang using shufflevector to produce flattened vectors with the required layout. For better optimisations, the LLVM intrinsics relying on shape/layout information can be extended to take the layout as additional argument. Through propagating the layout information similar to the dimensions, we should be able to optimise the points where we need to transform the layout of the underlying matrixes.
In all cases, we require known integer constants as dimensions and we do not plan to support dynamic dimensions for now, as the main optimization potential comes from the fact that we know the dimensions. Supporting dynamic dimensions should be fairly straightforward, but means we lose the ability to type check matrix expressions at compile time and we also have to rely on dynamic dimensions during code generation.
If I understand your question correctly, if N = 1, matrix_type(N) would be the same thing as ext_vector_type(N).
There might be interesting use cases for 3+ dimensions, but as you said, I think it would be best to plan for that once we have a concrete use case. The name itself might need extra generalization, but I think most of the proposal can be extended quite easily.
_______________________________________________
cfedev mailing list
[hidden email]
https://lists.llvm.org/cgibin/mailman/listinfo/cfedev


Hi,
We updated the proposal to use operators for most operations (with a lot of help from Michael Spencer!). I’ve added the updated section inline and the complete updated proposal can be found at the end of the mail.
On Jan 18, 2020, at 12:01, Florian Hahn via cfe-dev <[hidden email]> wrote:
Thanks for the feedback! I’ve responded inline.
Do you anticipate providing builtin operators for matrices? If not, then the utility of a dedicated type and `matrix_type` attribute seems greatly diminished: the builtin matrix operators could instead - in principle - operate on a suitable vector type (either as a flat vector, matching the LLVM IR model, or as a vector of vectors, to support two-dimensional indexing). I think your proposal should express why those would be inferior choices (e.g., do matrix types have different calling conventions, alignment requirements, ... on some target? Do you intend to provide matrix x matrix multiplication and matrix x vector multiplication via the * operator in the future?). Adding *only* builtin functions and no new matrix types would be a substantial simplification in the proposal.
I think it would make sense to provide builtin operators instead of the proposed builtins for math operations. Same for element insertion/extraction. However, I am not sure how to provide the strided matrix load/store as operators. Would it be OK to just have builtins for those? The reason we went for builtins initially was that we thought that might make the proposal a bit more lightweight, but it sounds like builtin operators would be preferred with the type. I do not think ext_vector_type would be suitable for our proposal, as it matches LLVM’s vector alignment and the matrix type should match the alignment of the underlying data type, to allow easy interaction with existing matrix libraries.
A vector of vectors should work in principle, as long as we could fix both dimensions on a type level. Not having the dimensions guaranteed by the type would have a negative impact on the user experience I think, as we would, for example, lose the ability to type-check whether the dimensions match the operators, and users would have to provide the dimensions for certain operations. Also, it would make supporting 3+ dimensions a bit more tricky.
Below is a proposal for matrix type element access & binary operators:
Matrix Type Element Access Operator
An expression of the form `postfix-expression [expression][expression]` where the postfix-expression is of matrix type is a matrix element access expression. expression shall not be a comma expression, and shall be a prvalue of unscoped enumeration or integral type. Given the expression E1[E2][E3], the result is an lvalue of the same type as the underlying element type of the matrix that refers to the value at row E2 and column E3 in the matrix. The expression E1 is sequenced before E2 and E3. The expressions E2 and E3 are unsequenced.
Note: We thought about providing an expression of the form `postfix-expression [expression]` to access columns of a matrix. We think that such an expression would be problematic once both column-major and row-major matrixes are supported: depending on the memory layout, either accessing columns or rows can be done efficiently, but not both. Instead, we propose to provide builtins to extract rows and columns from a matrix. This makes the operations more explicit.
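The element access rule can be illustrated with a small stand-in written in plain C++. This is only a sketch under assumptions: MatrixModel and its at() member are hypothetical names used for illustration, not the proposed type or Clang's implementation. It stores the elements in column-major order without padding, as the draft specification describes, and at(r, c) plays the role of E1[E2][E3] by returning an lvalue for the element at row r and column c.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a value of type matrix_type(Rows, Cols):
// Rows * Cols elements in column-major order without padding.
template <typename T, std::size_t Rows, std::size_t Cols>
struct MatrixModel {
  std::array<T, Rows * Cols> data{};

  // Models E1[E2][E3]: yields an lvalue for the element at row r, column c.
  T &at(std::size_t r, std::size_t c) { return data[c * Rows + r]; }
  const T &at(std::size_t r, std::size_t c) const { return data[c * Rows + r]; }
};
```

With this mapping, elements of one column are adjacent in memory while elements of one row are Rows apart, which is why a single-subscript access would favor one layout over the other.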
Matrix Type Binary Operators
Each matrix type supports the following binary operators: +, -, *, /, and %. The * operator provides matrix multiplication, while +, -, /, and % are performed elementwise. There are also scalar versions of the operators, which take a matrix type and the underlying element type. The operation is applied to all elements of the matrix using the scalar value.
The operands of +, -, *, and / shall have either matrix type, arithmetic or unscoped enumeration type. The operands of % shall have either matrix type with an element type of integral type, integral type or unscoped enumeration type. At least one of the operands shall be of matrix type.
For BIN_OP in +, -, *, /, and %, given the expression M1 BIN_OP M2 where, for *, one of M1 or M2 is of arithmetic type:
* The usual arithmetic conversions are applied to M1 and M2. [ Note: if M1 or M2 are of arithmetic type, they are broadcast to matrices here. — end note ]
* The matrix types of M1 and M2 shall have the same number of rows and columns.
* The result is equivalent to Res in the following where col is the number of columns and row is the number of rows in the matrix type:
    decltype(M1) Res;
    for (int C = 0; C < col; ++C)
      for (int R = 0; R < row; ++R)
        Res[R][C] = M1[R][C] BIN_OP M2[R][C];
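The elementwise rule, including the broadcast of a scalar operand, can be sketched in plain C++ as follows. MatrixModel, broadcast and elementwise are hypothetical names introduced for illustration; the sketch assumes both operands already have the same element type, that is, that the usual arithmetic conversions have been applied.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical column-major stand-in for the proposed matrix type.
template <typename T, std::size_t Rows, std::size_t Cols>
struct MatrixModel {
  std::array<T, Rows * Cols> data{};
  T &at(std::size_t r, std::size_t c) { return data[c * Rows + r]; }
  const T &at(std::size_t r, std::size_t c) const { return data[c * Rows + r]; }
};

// Broadcast a scalar operand to a matrix in which every element has its value.
template <typename T, std::size_t Rows, std::size_t Cols>
MatrixModel<T, Rows, Cols> broadcast(T value) {
  MatrixModel<T, Rows, Cols> res;
  res.data.fill(value);
  return res;
}

// Apply a binary operation element by element. Both operands must have the
// same shape, which the template parameters enforce at compile time.
template <typename T, std::size_t Rows, std::size_t Cols, typename Op>
MatrixModel<T, Rows, Cols> elementwise(const MatrixModel<T, Rows, Cols> &m1,
                                       const MatrixModel<T, Rows, Cols> &m2,
                                       Op op) {
  MatrixModel<T, Rows, Cols> res;
  for (std::size_t c = 0; c < Cols; ++c)
    for (std::size_t r = 0; r < Rows; ++r)
      res.at(r, c) = op(m1.at(r, c), m2.at(r, c));
  return res;
}
```

A matrix-plus-scalar expression then corresponds to elementwise(m, broadcast<T, Rows, Cols>(s), op), and a shape mismatch between two matrix operands fails to compile rather than failing at run time.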
Given the expression M1 * M2 where M1 and M2 are of matrix type:
* The usual arithmetic conversions are applied to M1 and M2.
* The type of M1 shall have the same number of columns as the type of M2 has rows.
* The resulting type, MTy, is the result of applying the usual arithmetic conversions to M1 and M2, but with the same number of rows as M1’s matrix type and the same number of columns as M2’s matrix type.
* The result is equivalent to Res in the following, where col is the number of columns and row is the number of rows in MTy, inner is the number of columns of M1, and EltTy is the element type of MTy:
    MTy Res;
    for (int C = 0; C < col; ++C) {
      for (int R = 0; R < row; ++R) {
        EltTy Elt = 0;
        for (int K = 0; K < inner; ++K) {
          Elt += M1[R][K] * M2[K][C];
        }
        Res[R][C] = Elt;
      }
    }
All operations on matrix types match the behavior of the underlying element type with respect to signed overflows.
With respect to rounding errors, the * operator preserves the behavior of the separate multiply and add operations by default. We propose to provide a Clang option to override this behavior and allow contraction of those operations (e.g. -ffp-contract=matrix).
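As a cross-check of the matrix multiplication rule above, the reference loop nest can be written as a plain C++ function. This is a sketch under assumptions, not the proposed operator: MatrixModel and multiply are hypothetical names, and both operands are assumed to share one element type already. The inner dimension is a shared template parameter, so a mismatch between the columns of the first operand and the rows of the second is rejected at compile time.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical column-major stand-in for the proposed matrix type.
template <typename T, std::size_t Rows, std::size_t Cols>
struct MatrixModel {
  std::array<T, Rows * Cols> data{};
  T &at(std::size_t r, std::size_t c) { return data[c * Rows + r]; }
  const T &at(std::size_t r, std::size_t c) const { return data[c * Rows + r]; }
};

// Reference matrix multiplication: (Rows x Inner) * (Inner x Cols).
template <typename T, std::size_t Rows, std::size_t Inner, std::size_t Cols>
MatrixModel<T, Rows, Cols> multiply(const MatrixModel<T, Rows, Inner> &m1,
                                    const MatrixModel<T, Inner, Cols> &m2) {
  MatrixModel<T, Rows, Cols> res;
  for (std::size_t c = 0; c < Cols; ++c) {
    for (std::size_t r = 0; r < Rows; ++r) {
      T elt = 0;
      // Written as a multiply followed by an add, mirroring the reference
      // loop in the text.
      for (std::size_t k = 0; k < Inner; ++k)
        elt += m1.at(r, k) * m2.at(k, c);
      res.at(r, c) = elt;
    }
  }
  return res;
}
```

For floating-point element types, whether a compiler fuses the multiply and add in this sketch depends on its own floating-point contraction settings; the sketch itself makes no claim about that.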
Have the Eigen developers indicated they would consider using this if it were available to them? Have you reached out to the GCC developers to see if they would also be likely to support this extension? We should be aiming to build critical mass behind this feature so that it gets adopted; it would be a waste of resources if a different technology ends up being adopted in this space and we're left maintaining a system that no one outside Apple uses.
We hoped to get some initial feedback before reaching out, to make sure the proposal is in reasonably good shape. I plan to reach out to them early next week.
I reached out on the Eigen mailing list and got an encouraging response: https://listengine.tuxfamily.org/lists.tuxfamily.org/eigen/2020/02/msg00000.html I’ve also added people involved in GCC to the WIP patch adding the type.
Cheers, Florian
Draft Specification
Matrix Type Attribute
The attribute-token matrix_type is used to declare a matrix type. It shall appear at most once in each attribute-list. The attribute shall only appertain to a typedef-name of a typedef of a non-volatile type that is a signed integer type, an unsigned integer type, or a floating-point type. An attribute-argument-clause must be present and it shall have the form:
(constant-expression, constant-expression)
Both constant-expressions shall be positive non-zero integral constant expressions. The maximum of the product of the constants is implementation-defined. If that implementation-defined limit is exceeded, the program is ill-formed.
An attribute of the form matrix_type(R, C) forms a matrix type with an element type of the cv-qualified type the attribute appertains to and R rows and C columns.
If a declaration of a typedef-name has a matrix_type attribute, then all declarations of that typedef-name shall have a matrix_type attribute with the same element type, number of rows, and number of columns.
Matrix Type
A matrix type has an underlying element type, a constant number of rows, and a constant number of columns. Matrix types with the same element type, rows, and columns are the same type. A value of a matrix type contains rows * columns values of the element type laid out in column-major order without padding in a way compatible with an array of at least that many elements of the underlying element type.
A matrix type is a scalar type with the same alignment as its underlying element type, but objects of matrix type are not usable in constant expressions.
TODO: Allow reinterpret_cast from pointer to element type. Make aliasing work.
TODO: Does it make sense to allow M::element_type, M::rows, and M::columns where M is a matrix type? We don’t support this anywhere else, but it’s convenient. The alternative is using template deduction to extract this information.
Future Work: Initialization syntax.
Future Work: Conversions between matrix types with const-qualified and unqualified element types.
Standard Conversions
The standard conversions are extended as follows. For integral promotions, floating-point promotion, integral conversions, floating-point conversions, and floating-integral conversions: apply the rules to the underlying type of the matrix type. The resulting type is a matrix type with that underlying element type. The resulting value is as follows:
* If the original value was of matrix type, each element is converted element by element.
* If the original value was not of matrix type, each element takes the value of the original value.
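The two value rules above can be sketched in plain C++. The names MatrixModel, convertElements and convertScalar are hypothetical and exist only for this illustration; static_cast stands in for whichever standard conversion applies to the element type.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the proposed matrix type (flattened storage only).
template <typename T, std::size_t Rows, std::size_t Cols>
struct MatrixModel {
  std::array<T, Rows * Cols> data{};
};

// First rule: a matrix operand is converted element by element.
template <typename To, typename From, std::size_t Rows, std::size_t Cols>
MatrixModel<To, Rows, Cols> convertElements(const MatrixModel<From, Rows, Cols> &m) {
  MatrixModel<To, Rows, Cols> res;
  for (std::size_t i = 0; i < Rows * Cols; ++i)
    res.data[i] = static_cast<To>(m.data[i]);
  return res;
}

// Second rule: a non-matrix operand is converted once and every element of
// the resulting matrix takes that value.
template <typename To, std::size_t Rows, std::size_t Cols, typename From>
MatrixModel<To, Rows, Cols> convertScalar(From value) {
  MatrixModel<To, Rows, Cols> res;
  res.data.fill(static_cast<To>(value));
  return res;
}
```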
Arithmetic Conversions
The usual arithmetic conversions are extended as follows.
Insert at the start:
* If either operand is of matrix type, apply the usual arithmetic conversions using its underlying element type. The resulting type is a matrix type with that underlying element type.
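How the result element type follows from the usual arithmetic conversions can be sketched in plain C++ by asking the compiler for the type of a scalar addition. The names MatrixModel, ArithResult and add are hypothetical; the sketch only covers a matrix operand combined with a scalar operand.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for the proposed matrix type (flattened storage only).
template <typename T, std::size_t Rows, std::size_t Cols>
struct MatrixModel {
  std::array<T, Rows * Cols> data{};
};

// Element type produced by the usual arithmetic conversions on A and B,
// obtained as the type of the scalar expression a + b.
template <typename A, typename B>
using ArithResult = decltype(std::declval<A>() + std::declval<B>());

// matrix + scalar: the result is a matrix of the same shape whose element
// type is the converted type.
template <typename T, typename S, std::size_t Rows, std::size_t Cols>
MatrixModel<ArithResult<T, S>, Rows, Cols> add(const MatrixModel<T, Rows, Cols> &m,
                                               S scalar) {
  MatrixModel<ArithResult<T, S>, Rows, Cols> res;
  for (std::size_t i = 0; i < Rows * Cols; ++i)
    res.data[i] = m.data[i] + scalar;
  return res;
}
```

For example, adding a double to a matrix of int yields a matrix of double in this model, and two short operands yield int because of integral promotion.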
Matrix Type Element Access Operator
An expression of the form `postfix-expression [expression][expression]` where the postfix-expression is of matrix type is a matrix element access expression. expression shall not be a comma expression, and shall be a prvalue of unscoped enumeration or integral type. Given the expression E1[E2][E3], the result is an lvalue of the same type as the underlying element type of the matrix that refers to the value at row E2 and column E3 in the matrix. The expression E1 is sequenced before E2 and E3. The expressions E2 and E3 are unsequenced.
Note: We thought about providing an expression of the form `postfix-expression [expression]` to access columns of a matrix. We think that such an expression would be problematic once both column-major and row-major matrixes are supported: depending on the memory layout, either accessing columns or rows can be done efficiently, but not both. Instead, we propose to provide builtins to extract rows and columns from a matrix. This makes the operations more explicit.
Matrix Type Binary Operators
Each matrix type supports the following binary operators: +, -, *, /, and %. The * operator provides matrix multiplication, while +, -, /, and % are performed elementwise. There are also scalar versions of the operators, which take a matrix type and the underlying element type. The operation is applied to all elements of the matrix using the scalar value.
The operands of +, -, *, and / shall have either matrix type, arithmetic or unscoped enumeration type. The operands of % shall have either matrix type with an element type of integral type, integral type or unscoped enumeration type. At least one of the operands shall be of matrix type.
For BIN_OP in +, -, *, /, and %, given the expression M1 BIN_OP M2 where, for *, one of M1 or M2 is of arithmetic type:
* The usual arithmetic conversions are applied to M1 and M2. [ Note: if M1 or M2 are of arithmetic type, they are broadcast to matrices here. — end note ]
* The matrix types of M1 and M2 shall have the same number of rows and columns.
* The result is equivalent to Res in the following where col is the number of columns and row is the number of rows in the matrix type:
    decltype(M1) Res;
    for (int C = 0; C < col; ++C)
      for (int R = 0; R < row; ++R)
        Res[R][C] = M1[R][C] BIN_OP M2[R][C];
Given the expression M1 * M2 where M1 and M2 are of matrix type:
* The usual arithmetic conversions are applied to M1 and M2.
* The type of M1 shall have the same number of columns as the type of M2 has rows.
* The resulting type, MTy, is the result of applying the usual arithmetic conversions to M1 and M2, but with the same number of rows as M1’s matrix type and the same number of columns as M2’s matrix type.
* The result is equivalent to Res in the following, where col is the number of columns and row is the number of rows in MTy, inner is the number of columns of M1, and EltTy is the element type of MTy:
    MTy Res;
    for (int C = 0; C < col; ++C) {
      for (int R = 0; R < row; ++R) {
        EltTy Elt = 0;
        for (int K = 0; K < inner; ++K) {
          Elt += M1[R][K] * M2[K][C];
        }
        Res[R][C] = Elt;
      }
    }
All operations on matrix types match the behavior of the underlying element type with respect to signed overflows.
With respect to rounding errors, the * operator preserves the behavior of the separate multiply and add operations by default. We propose to provide a Clang option to override this behavior and allow contraction of those operations (e.g. -ffp-contract=matrix).
Matrix Type Builtin Operations
Each matrix type supports a collection of builtin expressions that look like function calls but do not form an overload set. Here they are described as function declarations with rules for how to construct the argument list types and return type and the library description elements from [library.description.structure.specifications]/3 in the C++ standard.
Definitions:
* M, M1, M2, M3 - Matrix types
* T - Element type
* row, col - Row and column arguments respectively.
M2 __builtin_matrix_transpose(M1 matrix)
Remarks: The return type is a cv-unqualified matrix type that has the same element type as M1 and has the same number of rows as M1 has columns and the same number of columns as M1 has rows.
Returns: A matrix Res equivalent to the code below, where col refers to the number of columns of M1, and row to the number of rows of M1.
    M2 Res;
    for (int C = 0; C < col; ++C)
      for (int R = 0; R < row; ++R)
        Res[C][R] = matrix[R][C];
M __builtin_matrix_column_load(T *ptr, int row, int col, int stride)
Mandates: row and col shall be integral constants greater than 0.
Preconditions: stride >= row.
Remarks: The return type is a cv-unqualified matrix type with an element type of the cv-unqualified version of T and a number of rows and columns equal to row and col respectively.
Returns: A matrix Res equivalent to:
    M Res;
    for (int C = 0; C < col; ++C) {
      for (int R = 0; R < row; ++R)
        Res[R][C] = ptr[R];
      ptr += stride;
    }
void __builtin_matrix_column_store(M matrix, T *ptr, int stride)
Preconditions: stride is greater than or equal to the number of rows in M.
Effects: Equivalent to:
    for (int C = 0; C < columns in M; ++C) {
      for (int R = 0; R < rows in M; ++R)
        ptr[R] = matrix[R][C];
      ptr += stride;
    }
Remarks: The type T is the const-unqualified version of the matrix argument’s element type.
Example
This code performs a matrix multiply of two 4x4 matrices followed by a matrix addition:
    typedef float m4x4_t __attribute__((matrix_type(4, 4)));
    void f(m4x4_t *a, m4x4_t *b, m4x4_t *c, m4x4_t *r) {
      *r = (*a * *b) + *c;
    }
This will get lowered by Clang to the LLVM IR below. In our current implementation, we use LLVM’s array type as storage type for the matrix data. Before accessing the data, we cast the array to a vector type. This allows us to use the element width as alignment, without running into issues with LLVM’s large default alignment for vector types, which is problematic in structs.
define void @f([16 x float]* %a, [16 x float]* %b, [16 x float]* %c, [16 x float]* %r) #0 {
entry:
  %a.addr = alloca [16 x float]*, align 8
  %b.addr = alloca [16 x float]*, align 8
  %c.addr = alloca [16 x float]*, align 8
  %r.addr = alloca [16 x float]*, align 8
  store [16 x float]* %a, [16 x float]** %a.addr, align 8
  store [16 x float]* %b, [16 x float]** %b.addr, align 8
  store [16 x float]* %c, [16 x float]** %c.addr, align 8
  store [16 x float]* %r, [16 x float]** %r.addr, align 8
  %0 = load [16 x float]*, [16 x float]** %a.addr, align 8
  %1 = bitcast [16 x float]* %0 to <16 x float>*
  %2 = load <16 x float>, <16 x float>* %1, align 4
  %3 = load [16 x float]*, [16 x float]** %b.addr, align 8
  %4 = bitcast [16 x float]* %3 to <16 x float>*
  %5 = load <16 x float>, <16 x float>* %4, align 4
  %6 = call <16 x float> @llvm.matrix.multiply.v16f32.v16f32.v16f32(<16 x float> %2, <16 x float> %5, i32 4, i32 4, i32 4)
  %7 = load [16 x float]*, [16 x float]** %c.addr, align 8
  %8 = bitcast [16 x float]* %7 to <16 x float>*
  %9 = load <16 x float>, <16 x float>* %8, align 4
  %10 = fadd <16 x float> %6, %9
  %11 = load [16 x float]*, [16 x float]** %r.addr, align 8
  %12 = bitcast [16 x float]* %11 to <16 x float>*
  store <16 x float> %10, <16 x float>* %12, align 4
  ret void
}
; Function Attrs: nounwind readnone speculatable willreturn
declare <16 x float> @llvm.matrix.multiply.v16f32.v16f32.v16f32(<16 x float>, <16 x float>, i32 immarg, i32 immarg, i32 immarg)
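Returning to the builtin operations specified before the example, the strided column load and store and the transpose can be modeled in plain C++ as follows. This is a sketch under assumptions rather than the builtins themselves: MatrixModel, columnLoad, columnStore and transpose are hypothetical names, the dimensions are template parameters instead of constant arguments, and the stride precondition is checked with assert.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical column-major stand-in for the proposed matrix type.
template <typename T, std::size_t Rows, std::size_t Cols>
struct MatrixModel {
  std::array<T, Rows * Cols> data{};
  T &at(std::size_t r, std::size_t c) { return data[c * Rows + r]; }
  const T &at(std::size_t r, std::size_t c) const { return data[c * Rows + r]; }
};

// Models the column load: consecutive columns start stride elements apart.
template <typename T, std::size_t Rows, std::size_t Cols>
MatrixModel<T, Rows, Cols> columnLoad(const T *ptr, std::size_t stride) {
  assert(stride >= Rows); // precondition from the specification
  MatrixModel<T, Rows, Cols> res;
  for (std::size_t c = 0; c < Cols; ++c) {
    for (std::size_t r = 0; r < Rows; ++r)
      res.at(r, c) = ptr[r];
    ptr += stride;
  }
  return res;
}

// Models the column store: writes each column, then skips to the next one.
template <typename T, std::size_t Rows, std::size_t Cols>
void columnStore(const MatrixModel<T, Rows, Cols> &m, T *ptr, std::size_t stride) {
  assert(stride >= Rows); // precondition from the specification
  for (std::size_t c = 0; c < Cols; ++c) {
    for (std::size_t r = 0; r < Rows; ++r)
      ptr[r] = m.at(r, c);
    ptr += stride;
  }
}

// Models the transpose: a Rows x Cols input gives a Cols x Rows result.
template <typename T, std::size_t Rows, std::size_t Cols>
MatrixModel<T, Cols, Rows> transpose(const MatrixModel<T, Rows, Cols> &m) {
  MatrixModel<T, Cols, Rows> res;
  for (std::size_t c = 0; c < Cols; ++c)
    for (std::size_t r = 0; r < Rows; ++r)
      res.at(c, r) = m.at(r, c);
  return res;
}
```

A stride larger than the number of rows lets the load pick a sub-matrix out of a larger column-major buffer, which is the kind of interaction with existing matrix libraries that the strided builtins are meant to support.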


Ping.
Richard, are the proposed operators in the latest update along the lines you suggested?
Are there any other thoughts or comments on the proposal?
Cheers, Florian On Feb 25, 2020, at 21:02, Florian Hahn via cfedev < [hidden email]> wrote:
Hi, We updated the proposal to use operators for most operations (with a lot of help from Michael Spencer!).I’ve added the updated section inline and the complete updated proposal can be found at the end of the mail. On Jan 18, 2020, at 12:01, Florian Hahn via cfedev < [hidden email]> wrote:
Thanks for the feedback! I’ve responded inline.
Do you anticipate providing builtin operators for matrices? If not, then the utility of a dedicated type and `matrix_type` attribute seems greatly diminished: the builtin matrix operators could instead  in principle  operate on a suitable vector type (either as a flat vector, matching the LLVM IR model, or as a vector of vectors, to support twodimensional indexing). I think your proposal should express why those would be inferior choices (eg, do matrix types have different calling conventions, alignment requirements, ... on some target? Do you intend to provide matrix x matrix multiplication and matrix x vector multiplication via the * operator in the future?). Adding *only* builtin functions and no new matrix types would be a substantial simplification in the proposal.
I think it would make sense to provide builtin operators instead of the proposed builtins for math operations. Same for element insertion/extraction. However I am not sure how to provide the strided matrix load/store as operators. Would it be OK to just have builtins for those? The reason we went for builtins initially was that we thought that might make the proposal a bit more lightweight, but it sounds like builtin operators would be preferred with the type. I do not think ext_vector_type would be suitable for our proposal, as it matches LLVM’s vector alignment and the matrix type should match the alignment of the underlying data type, to allow easy interaction with existing matrix libraries.
A vector of vectors should work in principle, as long as we could fix both dimensions on a type level. Not having the dimensions guaranteed by the type would have a negative impact on the userexperience I think, as we would, for example, loose the ability to typecheck if the dimensions match the operators and users would have to provide the dimensions for certain operations. Also, it would make supporting 3+ dimensions a bit more tricky.
Below is a proposal for matrix type element access & binary operators:
Matrix Type Element Access Operator
An expression of the form `postfixexpression [expression][expression]` where the postfixexpression is of matrix type is a matrix element access expression. expression shall not be a comma expression, and shall be a prvalue of unscoped enumeration or integral type. Given the expression E1[E2][E3], the result is an lvalue of the same type as the underlying element type of the matrix that refers to the value at E2 row and E3 column in the matrix. The expression E1 is sequenced before E2 and E3. The expressions E2 and E3 are unsequenced.
Note: We thought about providing an expression of the form `postfixexpression [expression]` to access columns of a matrix. We think that such an expression would be problematic once both column and row major matrixes are supported: depending on the memory layout, either accessing columns or rows can be done efficiently, but not both. Instead, we propose to provide builtins to extract rows and columns from a matrix. This makes the operations more explicit.
Matrix Type Binary Operators
Each matrix type supports the following binary operators: +, , *, /, and %. The * operator provides matrix multiplication, while +, , /, and % are performed elementwise. There are also scalar versions of the operators, which take a matrix type and the underlying element type. The operation is applied to all elements of the matrix using the scalar value.
The operands of +, , *, and / shall have either matrix type, arithmetic or unscoped enumeration type. The operands of % shall have either matrix type with an element type of integral type, integral type or unscoped enumeration type. At least one of the operands shall be of matrix type.
For BIN_OP in +, , *, /, and %, given the expression M1 BIN_OP M2 where, for *, one of M1 or M2 is of arithmetic type:
* The usual arithmetic conversions are applied to M1 and M2. [ Note: if M1 or M2 are of arithmetic type, they are broadcast to matrices here. — end note ]
* The matrix types of M1 and M2 shall have the same number of rows and columns.
* The result is equivalent to Res in the following where col is the number of columns and row is the number of rows in the matrix type:
decltype(M1) Res; for (int C = 0; C < col; ++C) for (int R = 0; R < row; ++R) Res[R][C] = M1[R][C] BIN_OP M2[R][C];
Given the expression M1 * M2 where M1 and M2 are of matrix type:
* The usual arithmetic conversions are applied to M1 and M2
* The type of M1 shall have the same number of columns as the type of M2 has rows.
* The resulting type, MTy, is the result of applying the usual arithmetic conversions to M1 and M2, but with the same number of rows as M1’s matrix type and the same number of columns as M2’s matrix type.
* The result is equivalent to Res in the following where col is the number of columns and row is the number of rows in MTy: MTy Res; for (int C = 0; C < col; ++C) { for (int R = 0; R < row; ++R) { EltTy Elt = 0; for (int K = 0; K < inner; ++K) { Elt += M1[R][K] * M2[K][C]; } Res[R][C] = Elt; }
All operations on matrix types match the behavior of the underlying element type with respect to signed overflows.
With respect to rounding errors, the the * operator preserves the behavior of the separate multiply and add operations by default. We propose to provide a Clang option to override this behavior and allow contraction of those operations (e.g. ffpcontract=matrix).
Have the Eigen developers indicated they would consider using this if it were available to them? Have you reached out to the GCC developers to see if they would also be likely to support this extension? We should be aiming to build critical mass behind this feature so that it gets adopted; it would be a waste of resources if a different technology ends up being adopted in this space and we're left maintaining a system that noone outside Apple uses.
We hoped to get some initial feedback before reaching out, to make sure the proposal is in reasonably good shape. I plan to reach out to them early next week.
I reached out on the Eigen mailing list and got an encouraging response: https://listengine.tuxfamily.org/lists.tuxfamily.org/eigen/2020/02/msg00000.html I’ve also added people involved in GCC to the WIP patch adding the type.
Cheers, Florian
Draft Specification Matrix Type Attribute
The attributetoken matrix_type is used to declare a matrix type. It shall appear at most once in each attributelist. The attribute shall only appertain to a typedefname of a typedef of a nonvolatile type that is a signed integer type, an unsigned integer type, or a floatingpoint type. An attributeargumentclause must be present and it shall have the form: (constantexpression, constantexpression) Both constantexpressions shall be a positive nonzero integral constant expressions. The maximum of the product of the constants is implementation defined. If that implementation defined limit is exceeded, the program is illformed. An attribute of the form matrix_type(R, C) forms a matrix type with an element type of the cvqualified type the attribute appertains to and R rows and C columns. If a declaration of a typedefname has a matrix_type attribute, then all declaration of that typedefname shall have a matrix_type attribute with the same element type, number of rows, and number of columns. Matrix Type
A matrix type has an underlying element type, a constant number of rows, and a constant number of columns. Matrix types with the same element type, rows, and columns are the same type. A value of a matrix type contains rows * columns values of the element type laid out in columnmajor order without padding in a way compatible with an array of at least that many elements of the underlying element type. A matrix type is a scalar type with the same alignment as its underlying element type, but objects of matrix type are not usable in constant expressions. TODO: Allow reinterpret_cast from pointer to element type. Make aliasing work. TODO: Does it make sense to allow M::element_type, M::rows, and M::columns where M is a matrix type? We don’t support this anywhere else, but it’s convenient. The alternative is using template deduction to extract this information. Future Work: Initialization syntax. Future Work: Conversions between matrix types with const qualified and unqualified element types. Standard Conversions
The standard conversions are extended as follows. For integral promotions, floatingpoint promotion, integral conversions, floatingpoint conversions, and floatingintegral conversions: apply the rules to the underlying type of the matrix type. The resulting type is a matrix type with that underlying element type. The resulting value is as follows:
* If the original value was of matrix type, each element is converted element by element. * If the original value was not of matrix type, each element takes the value of the original value.
Arithmetic Conversions
The usual arithmetic conversions are extended as follows.
Insert at the start:
* If either operand is of matrix type, apply the usual arithmetic conversions using its underlying element type. The resulting type is a matrix type with that underlying element type.
Matrix Type Element Access Operator
An expression of the form `postfixexpression [expression][expression]` where the postfixexpression is of matrix type is a matrix element access expression. expression shall not be a comma expression, and shall be a prvalue of unscoped enumeration or integral type. Given the expression E1[E2][E3], the result is an lvalue of the same type as the underlying element type of the matrix that refers to the value at E2 row and E3 column in the matrix. The expression E1 is sequenced before E2 and E3. The expressions E2 and E3 are unsequenced.
Note: We thought about providing an expression of the form `postfixexpression [expression]` to access columns of a matrix. We think that such an expression would be problematic once both column and row major matrixes are supported: depending on the memory layout, either accessing columns or rows can be done efficiently, but not both. Instead, we propose to provide builtins to extract rows and columns from a matrix. This makes the operations more explicit.
Matrix Type Binary Operators
Each matrix type supports the following binary operators: +, , *, /, and %. The * operator provides matrix multiplication, while +, , /, and % are performed elementwise. There are also scalar versions of the operators, which take a matrix type and the underlying element type. The operation is applied to all elements of the matrix using the scalar value.
The operands of +, , *, and / shall have either matrix type, arithmetic or unscoped enumeration type. The operands of % shall have either matrix type with an element type of integral type, integral type or unscoped enumeration type. At least one of the operands shall be of matrix type.
For BIN_OP in +, , *, /, and %, given the expression M1 BIN_OP M2 where, for *, one of M1 or M2 is of arithmetic type:
* The usual arithmetic conversions are applied to M1 and M2. [ Note: if M1 or M2 are of arithmetic type, they are broadcast to matrices here. — end note ]
* The matrix types of M1 and M2 shall have the same number of rows and columns.
* The result is equivalent to Res in the following where col is the number of columns and row is the number of rows in the matrix type:
decltype(M1) Res; for (int C = 0; C < col; ++C) for (int R = 0; R < row; ++R) Res[R][C] = M1[R][C] BIN_OP M2[R][C];
Given the expression M1 * M2 where M1 and M2 are of matrix type:
* The usual arithmetic conversions are applied to M1 and M2
* The type of M1 shall have the same number of columns as the type of M2 has rows.
* The resulting type, MTy, is the result of applying the usual arithmetic conversions to M1 and M2, but with the same number of rows as M1’s matrix type and the same number of columns as M2’s matrix type.
* The result is equivalent to Res in the following where col is the number of columns and row is the number of rows in MTy: MTy Res; for (int C = 0; C < col; ++C) { for (int R = 0; R < row; ++R) { EltTy Elt = 0; for (int K = 0; K < inner; ++K) { Elt += M1[R][K] * M2[K][C]; } Res[R][C] = Elt; } All operations on matrix types match the behavior of the underlying element type with respect to signed overflows. With respect to rounding errors, the the * operator preserves the behavior of the separate multiply and add operations by default. We propose to provide a Clang option to override this behavior and allow contraction of those operations (e.g. ffpcontract=matrix). Matrix Type Builtin Operations
Each matrix type supports a collection of builtin expressions that look like function calls but do not form an overload set. Here they are described as function declarations with rules for how to construct the argument list types and return type and the library description elements from [library.description.structure.specifications]/3 in the C++ standard. Definitions: * M, M1, M2, M3  Matrix types * T  Element type * row, col  Row and column arguments respectively. M2 __builtin_matrix_transpose(M1 matrix)
Remarks: The return type is a cvunqualified matrix type that has the same element type as M1 and has the the same number of rows as M1 has columns and the same number of columns as M1 has rows. Returns: A matrix Res equivalent to the code below, where col refers to the number of columns of M, and row to the number of rows of M. M Res; for (int C = 0; C < col; ++C) for (int R = 0; R < row; ++R) Res[C][R] = matrix[R][C]; M __builtin_matrix_column_load(T *ptr, int row, int col, int stride)
Mandates: row and col shall be integral constants greater than 0. Preconditions: stride >= row. Remarks: The return type is a cvunqualified matrix type with an element type of the cvunqualified version of T and a number of rows and columns equal to row and col respectively. Returns: A matrix Res equivalent to: M Res; for (int C = 0; C < col; ++C) { for (int R = 0; R < row; ++K) Res[R][C] = ptr[R]; ptr += stride } void __builtin_matrix_column_store(M matrix, T *ptr, int stride)
Preconditions: stride is greater than or equal to the number of rows in M.

Effects: Equivalent to:

for (int C = 0; C < columns in M; ++C) {
  for (int R = 0; R < rows in M; ++R)
    ptr[R] = matrix[R][C];
  ptr += stride;
}
Remarks: The type T is the const-unqualified version of the matrix argument’s element type.

Example

This code performs a matrix multiply of two 4x4 matrices followed by a matrix addition:

typedef float m4x4_t __attribute__((matrix_type(4, 4)));

void f(m4x4_t *a, m4x4_t *b, m4x4_t *c, m4x4_t *r) {
  *r = *a + (*b * *c);
}
This will get lowered by Clang to the LLVM IR below. In our current implementation, we use LLVM’s array type as storage type for the matrix data. Before accessing the data, we cast the array to a vector type. This allows us to use the element width as alignment, without running into issues with LLVM’s large default alignment for vector types, which is problematic in structs.
define void @f([16 x float]* %a, [16 x float]* %b, [16 x float]* %c, [16 x float]* %r) #0 {
entry:
  %a.addr = alloca [16 x float]*, align 8
  %b.addr = alloca [16 x float]*, align 8
  %c.addr = alloca [16 x float]*, align 8
  %r.addr = alloca [16 x float]*, align 8
  store [16 x float]* %a, [16 x float]** %a.addr, align 8
  store [16 x float]* %b, [16 x float]** %b.addr, align 8
  store [16 x float]* %c, [16 x float]** %c.addr, align 8
  store [16 x float]* %r, [16 x float]** %r.addr, align 8
  %0 = load [16 x float]*, [16 x float]** %a.addr, align 8
  %1 = bitcast [16 x float]* %0 to <16 x float>*
  %2 = load <16 x float>, <16 x float>* %1, align 4
  %3 = load [16 x float]*, [16 x float]** %b.addr, align 8
  %4 = bitcast [16 x float]* %3 to <16 x float>*
  %5 = load <16 x float>, <16 x float>* %4, align 4
  %6 = call <16 x float> @llvm.matrix.multiply.v16f32.v16f32.v16f32(<16 x float> %2, <16 x float> %5, i32 4, i32 4, i32 4)
  %7 = load [16 x float]*, [16 x float]** %c.addr, align 8
  %8 = bitcast [16 x float]* %7 to <16 x float>*
  %9 = load <16 x float>, <16 x float>* %8, align 4
  %10 = fadd <16 x float> %6, %9
  %11 = load [16 x float]*, [16 x float]** %r.addr, align 8
  %12 = bitcast [16 x float]* %11 to <16 x float>*
  store <16 x float> %10, <16 x float>* %12, align 4
  ret void
}

; Function Attrs: nounwind readnone speculatable willreturn
declare <16 x float> @llvm.matrix.multiply.v16f32.v16f32.v16f32(<16 x float>, <16 x float>, i32 immarg, i32 immarg, i32 immarg)
_______________________________________________ cfedev mailing list [hidden email]https://lists.llvm.org/cgibin/mailman/listinfo/cfedev


Ping.
Richard, are the proposed operators in the latest update along the lines you suggested?
Are there any additional thoughts or comments on the proposal?
Cheers, Florian


On Tue, 25 Feb 2020 at 13:02, Florian Hahn via cfedev < [hidden email]> wrote: Hi,
We updated the proposal to use operators for most operations (with a lot of help from Michael Spencer!). I’ve added the updated section inline, and the complete updated proposal can be found at the end of the mail.
Thanks. I'm concerned about the inconsistency between * and the / and % operators, but other than that I think this proposal looks good.
On Jan 18, 2020, at 12:01, Florian Hahn via cfedev < [hidden email]> wrote:
Thanks for the feedback! I’ve responded inline.
Do you anticipate providing builtin operators for matrices? If not, then the utility of a dedicated type and `matrix_type` attribute seems greatly diminished: the builtin matrix operators could instead - in principle - operate on a suitable vector type (either as a flat vector, matching the LLVM IR model, or as a vector of vectors, to support two-dimensional indexing). I think your proposal should express why those would be inferior choices (e.g., do matrix types have different calling conventions, alignment requirements, ... on some target? Do you intend to provide matrix x matrix multiplication and matrix x vector multiplication via the * operator in the future?). Adding *only* builtin functions and no new matrix types would be a substantial simplification in the proposal.
I think it would make sense to provide builtin operators instead of the proposed builtins for math operations. Same for element insertion/extraction. However I am not sure how to provide the strided matrix load/store as operators. Would it be OK to just have builtins for those? The reason we went for builtins initially was that we thought that might make the proposal a bit more lightweight, but it sounds like builtin operators would be preferred with the type. I do not think ext_vector_type would be suitable for our proposal, as it matches LLVM’s vector alignment and the matrix type should match the alignment of the underlying data type, to allow easy interaction with existing matrix libraries.
A vector of vectors should work in principle, as long as we could fix both dimensions on a type level. Not having the dimensions guaranteed by the type would have a negative impact on the user experience, I think: we would, for example, lose the ability to type-check whether the dimensions match the operators, and users would have to provide the dimensions for certain operations. It would also make supporting 3+ dimensions a bit more tricky.
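The type-checking argument can be illustrated by analogy with an ordinary C++ template that carries both dimensions in its type. This is only a sketch of the idea, not the proposed extension: Mat, multiply and can_multiply are hypothetical names, and multiply is declared but deliberately left undefined because only its signature matters here.

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>

// Hypothetical matrix wrapper: both dimensions are part of the type.
template <typename T, std::size_t Rows, std::size_t Cols>
struct Mat {
  T elems[Rows * Cols];
};

// Only viable when the inner dimensions agree. The result dimensions follow
// from the operand types, so callers never pass them explicitly.
template <typename T, std::size_t R, std::size_t K, std::size_t C>
Mat<T, R, C> multiply(const Mat<T, R, K> &, const Mat<T, K, C> &);

// Detects at compile time whether multiply(A, B) is well-formed.
template <typename A, typename B, typename = void>
struct can_multiply : std::false_type {};

template <typename A, typename B>
struct can_multiply<
    A, B, decltype(void(multiply(std::declval<A>(), std::declval<B>())))>
    : std::true_type {};

static_assert(can_multiply<Mat<float, 2, 3>, Mat<float, 3, 4>>::value,
              "2x3 times 3x4 type-checks");
static_assert(!can_multiply<Mat<float, 2, 3>, Mat<float, 2, 3>>::value,
              "2x3 times 2x3 is rejected by the type system");
```

With a flat vector type, neither of these checks could be expressed, and the three dimensions would have to be passed as extra arguments at every call.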
Below is a proposal for matrix type element access & binary operators:
Matrix Type Element Access Operator
An expression of the form `postfix-expression [expression][expression]` where the postfix-expression is of matrix type is a matrix element access expression. expression shall not be a comma expression, and shall be a prvalue of unscoped enumeration or integral type. Given the expression E1[E2][E3], the result is an lvalue of the same type as the underlying element type of the matrix that refers to the value at row E2 and column E3 in the matrix. The expression E1 is sequenced before E2 and E3. The expressions E2 and E3 are unsequenced.
Note: We thought about providing an expression of the form `postfix-expression [expression]` to access columns of a matrix. We think that such an expression would be problematic once both column-major and row-major matrices are supported: depending on the memory layout, either accessing columns or rows can be done efficiently, but not both. Instead, we propose to provide builtins to extract rows and columns from a matrix. This makes the operations more explicit.
Are you aware of the OpenMP array sections extension? ( https://www.openmp.org/spechtml/5.0/openmpsu21.html) Inspired by that, you could use something like E1[:][E2] and E1[E2][:] to form column and row vectors. Just a thought; I think using builtins for these is also fine.
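To make the efficiency argument in the note concrete, here is a small sketch of element indexing on a flat column-major buffer. The helper names (flat_index, extract_column, extract_row, extract_demo) are hypothetical and are not the builtins the proposal intends to add; the sketch only shows that a column is a contiguous run while a row is a strided gather.

```cpp
#include <cstddef>

// Column-major storage: element (r, c) of a matrix with Rows rows lives at
// flat index c * Rows + r.
template <std::size_t Rows>
constexpr std::size_t flat_index(std::size_t r, std::size_t c) {
  return c * Rows + r;
}

// A column occupies Rows consecutive elements (stride 1).
template <typename T, std::size_t Rows, std::size_t Cols>
constexpr void extract_column(const T *m, std::size_t c, T *out) {
  for (std::size_t r = 0; r < Rows; ++r)
    out[r] = m[flat_index<Rows>(r, c)];
}

// A row touches one element per column (stride Rows).
template <typename T, std::size_t Rows, std::size_t Cols>
constexpr void extract_row(const T *m, std::size_t r, T *out) {
  for (std::size_t c = 0; c < Cols; ++c)
    out[c] = m[flat_index<Rows>(r, c)];
}

// Usage: the 2x3 matrix [[1, 3, 5], [2, 4, 6]] stored column by column.
constexpr bool extract_demo() {
  const int m[6] = {1, 2, 3, 4, 5, 6};
  int col[2] = {};
  int row[3] = {};
  extract_column<int, 2, 3>(m, 1, col); // reads flat indices 2, 3
  extract_row<int, 2, 3>(m, 1, row);    // reads flat indices 1, 3, 5
  return col[0] == 3 && col[1] == 4 && row[0] == 2 && row[1] == 4 &&
         row[2] == 6;
}
```

For a row-major layout the roles swap, which is why a single subscript that always means "column" (or always "row") would be cheap under one layout and costly under the other.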
Matrix Type Binary Operators
Each matrix type supports the following binary operators: +, -, *, /, and %. The * operator provides matrix multiplication, while +, -, /, and % are performed elementwise. There are also scalar versions of the operators, which take a matrix type and the underlying element type. The operation is applied to all elements of the matrix using the scalar value.
Supporting elementwise matrix1 / matrix2 seems likely to be confusing if * means matrix multiplication. Is this an important operation? (I'd imagine not, if you're willing to give up elementwise *.)
I think the following would all be OK:
1) +, -, and * all follow normal math conventions (* is matrix multiplication or matrix/vector or matrix/scalar multiplication, + and - are elementwise and require the same type on both sides); / and % are not provided
2) +, -, *, /, and % are elementwise; matrix multiplication is done a different way.
3) + and - are elementwise; *, /, and % are not provided and those operations are provided a different way
Of those, I think (1) is the cleanest approach, even though it's inconsistent with how we treat * for two operands of vector type. (2) would be consistent with our approach for vectors, and might be a nice approach if we can find a good alternative syntax for matrix multiplication (infix x is likely unambiguous, but would certainly be novel).
The operands of +, -, *, and / shall have either matrix type, arithmetic or unscoped enumeration type. The operands of % shall have either matrix type with an element type of integral type, integral type or unscoped enumeration type. At least one of the operands shall be of matrix type.
For BIN_OP in +, -, *, /, and %, given the expression M1 BIN_OP M2 where, for *, one of M1 or M2 is of arithmetic type:
* The usual arithmetic conversions are applied to M1 and M2. [ Note: if M1 or M2 are of arithmetic type, they are broadcast to matrices here. — end note ]
* The matrix types of M1 and M2 shall have the same number of rows and columns.
* The result is equivalent to Res in the following where col is the number of columns and row is the number of rows in the matrix type:
decltype(M1) Res;
for (int C = 0; C < col; ++C)
  for (int R = 0; R < row; ++R)
    Res[R][C] = M1[R][C] BIN_OP M2[R][C];
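The elementwise rule, including the broadcast of a scalar operand, can be sketched in plain C++ as below. Mat, broadcast, elementwise and elementwise_demo are hypothetical names used for illustration only; the element type is assumed to be the same on both sides already, i.e. the usual arithmetic conversions are taken as done.

```cpp
#include <cstddef>
#include <functional>

// Hypothetical matrix wrapper holding Rows * Cols elements.
template <typename T, std::size_t Rows, std::size_t Cols>
struct Mat {
  T elems[Rows * Cols];
};

// Broadcast: every element takes the value of the scalar.
template <typename T, std::size_t Rows, std::size_t Cols>
constexpr Mat<T, Rows, Cols> broadcast(T s) {
  Mat<T, Rows, Cols> res{};
  for (std::size_t i = 0; i < Rows * Cols; ++i)
    res.elems[i] = s;
  return res;
}

// Applies op to corresponding elements. Both operands must have the same
// dimensions, which the shared template parameters enforce.
template <typename T, std::size_t Rows, std::size_t Cols, typename Op>
constexpr Mat<T, Rows, Cols> elementwise(const Mat<T, Rows, Cols> &m1,
                                         const Mat<T, Rows, Cols> &m2, Op op) {
  Mat<T, Rows, Cols> res{};
  for (std::size_t i = 0; i < Rows * Cols; ++i)
    res.elems[i] = op(m1.elems[i], m2.elems[i]);
  return res;
}

// Usage: models "m + 10" and "m % 3" for a 2x2 integer matrix.
constexpr bool elementwise_demo() {
  const Mat<int, 2, 2> m{{1, 2, 3, 4}};
  Mat<int, 2, 2> sum =
      elementwise(m, broadcast<int, 2, 2>(10), std::plus<int>{});
  Mat<int, 2, 2> rem =
      elementwise(m, broadcast<int, 2, 2>(3), std::modulus<int>{});
  return sum.elems[0] == 11 && sum.elems[3] == 14 && rem.elems[2] == 0 &&
         rem.elems[3] == 1;
}
```

Because the operation never mixes elements at different positions, the loop can run over the flat storage and the result is independent of the memory layout.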
Given the expression M1 * M2 where M1 and M2 are of matrix type:
* The usual arithmetic conversions are applied to M1 and M2
* The type of M1 shall have the same number of columns as the type of M2 has rows.
* The resulting type, MTy, is the result of applying the usual arithmetic conversions to M1 and M2, but with the same number of rows as M1’s matrix type and the same number of columns as M2’s matrix type.
* The result is equivalent to Res in the following, where col is the number of columns and row is the number of rows in MTy, and inner is the number of columns of M1:

MTy Res;
for (int C = 0; C < col; ++C) {
  for (int R = 0; R < row; ++R) {
    EltTy Elt = 0;

Is EltTy the matrix element type or the promoted matrix element type? (Based on what you wrote below, I assume the former, just checking...)

    for (int K = 0; K < inner; ++K) {
      Elt += M1[R][K] * M2[K][C];
    }
    Res[R][C] = Elt;
  }
}
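For reference, the rewrite above can be written out as a compilable C++ function. This is a sketch under stated assumptions: Mat, multiply and multiply_demo are hypothetical names, storage is assumed column-major, and the accumulator uses the matrix element type T, which corresponds to reading EltTy as the (unpromoted) element type.

```cpp
#include <cstddef>

// Hypothetical matrix wrapper with column-major storage.
template <typename T, std::size_t Rows, std::size_t Cols>
struct Mat {
  T elems[Rows * Cols];

  constexpr T &at(std::size_t r, std::size_t c) { return elems[c * Rows + r]; }
  constexpr const T &at(std::size_t r, std::size_t c) const {
    return elems[c * Rows + r];
  }
};

// Same loop nest as the rewrite: columns, then rows, then the inner
// dimension K, accumulating in index order.
template <typename T, std::size_t R, std::size_t K, std::size_t C>
constexpr Mat<T, R, C> multiply(const Mat<T, R, K> &m1,
                                const Mat<T, K, C> &m2) {
  Mat<T, R, C> res{};
  for (std::size_t c = 0; c < C; ++c) {
    for (std::size_t r = 0; r < R; ++r) {
      T elt = 0;
      for (std::size_t k = 0; k < K; ++k)
        elt += m1.at(r, k) * m2.at(k, c);
      res.at(r, c) = elt;
    }
  }
  return res;
}

// Usage: [[1, 2, 3], [4, 5, 6]] * [[7, 8], [9, 10], [11, 12]]
//      = [[58, 64], [139, 154]]
constexpr bool multiply_demo() {
  const Mat<int, 2, 3> a{{1, 4, 2, 5, 3, 6}};
  const Mat<int, 3, 2> b{{7, 9, 11, 8, 10, 12}};
  Mat<int, 2, 2> p = multiply(a, b);
  return p.at(0, 0) == 58 && p.at(0, 1) == 64 && p.at(1, 0) == 139 &&
         p.at(1, 1) == 154;
}
```

The inner dimension K appears in both parameter types, so a mismatch between the columns of the first operand and the rows of the second fails to compile.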
All operations on matrix types match the behavior of the underlying element type with respect to signed overflows.
With respect to rounding errors, the * operator preserves the behavior of the separate multiply and add operations by default. We propose to provide a Clang option to override this behavior and allow contraction of those operations (e.g. -ffp-contract=matrix).
OK, so in particular you guarantee to perform the additions in the order given by the above rewrite (assuming no -ffp-contract option)? (So, for example, you get the precision and rounding behavior of doing the adds in that order, and for a signed integer type, you get signed overflow if and only if the intermediate Elt value overflows even if the final Elt value for any (R,C) is back in range.) Seems OK, if that's consistent with your goals.
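Why the accumulation order is observable can be seen from a small floating-point example. This sketch is not taken from the proposal: accumulate_in_order, order_a and order_b are hypothetical, and the stated results assume IEEE 754 binary32 arithmetic without excess intermediate precision.

```cpp
#include <cstddef>

// Accumulates in index order, mirroring the inner K loop of the rewrite.
template <std::size_t N>
constexpr float accumulate_in_order(const float (&terms)[N]) {
  float acc = 0.0f;
  for (std::size_t k = 0; k < N; ++k)
    acc += terms[k];
  return acc;
}

// The same three values in two different orders. Near 1e8 the spacing between
// adjacent floats is 8, so adding 1.0f to 1e8f rounds back to 1e8f.
constexpr float order_a[3] = {1e8f, 1.0f, -1e8f}; // sums to 0.0f
constexpr float order_b[3] = {1e8f, -1e8f, 1.0f}; // sums to 1.0f
```

A compiler that reassociated the additions could turn one result into the other, which is the freedom the default semantics rule out. The analogous integer case is an intermediate sum that overflows although the final value would be in range.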
Have the Eigen developers indicated they would consider using this if it were available to them? Have you reached out to the GCC developers to see if they would also be likely to support this extension? We should be aiming to build critical mass behind this feature so that it gets adopted; it would be a waste of resources if a different technology ends up being adopted in this space and we're left maintaining a system that no one outside Apple uses.
We hoped to get some initial feedback before reaching out, to make sure the proposal is in reasonably good shape. I plan to reach out to them early next week.
I reached out on the Eigen mailing list and got an encouraging response: https://listengine.tuxfamily.org/lists.tuxfamily.org/eigen/2020/02/msg00000.html I’ve also added people involved in GCC to the WIP patch adding the type.
Cheers, Florian
Draft Specification

Matrix Type Attribute
The attribute-token matrix_type is used to declare a matrix type. It shall appear at most once in each attribute-list. The attribute shall only appertain to a typedef-name of a typedef of a non-volatile type that is a signed integer type, an unsigned integer type, or a floating-point type. An attribute-argument-clause must be present and it shall have the form:

(constant-expression, constant-expression)

Both constant-expressions shall be positive non-zero integral constant expressions. The maximum of the product of the constants is implementation defined. If that implementation defined limit is exceeded, the program is ill-formed.

An attribute of the form matrix_type(R, C) forms a matrix type with an element type of the cv-qualified type the attribute appertains to and R rows and C columns.

If a declaration of a typedef-name has a matrix_type attribute, then all declarations of that typedef-name shall have a matrix_type attribute with the same element type, number of rows, and number of columns.

Matrix Type
A matrix type has an underlying element type, a constant number of rows, and a constant number of columns. Matrix types with the same element type, rows, and columns are the same type. A value of a matrix type contains rows * columns values of the element type laid out in column-major order without padding in a way compatible with an array of at least that many elements of the underlying element type.

A matrix type is a scalar type with the same alignment as its underlying element type, but objects of matrix type are not usable in constant expressions.

TODO: Allow reinterpret_cast from pointer to element type. Make aliasing work.

TODO: Does it make sense to allow M::element_type, M::rows, and M::columns where M is a matrix type? We don’t support this anywhere else, but it’s convenient. The alternative is using template deduction to extract this information.

Future Work: Initialization syntax.

Future Work: Conversions between matrix types with const qualified and unqualified element types.

Standard Conversions
The standard conversions are extended as follows. For integral promotions, floating-point promotion, integral conversions, floating-point conversions, and floating-integral conversions: apply the rules to the underlying type of the matrix type. The resulting type is a matrix type with that underlying element type. The resulting value is as follows:
* If the original value was of matrix type, each element is converted element by element.
* If the original value was not of matrix type, each element takes the value of the original value.
Arithmetic Conversions
The usual arithmetic conversions are extended as follows.
Insert at the start:
* If either operand is of matrix type, apply the usual arithmetic conversions using its underlying element type. The resulting type is a matrix type with that underlying element type.
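The conversion rules above can be sketched for the mixed case of a matrix and a scalar operand. Everything named here (Mat, convert, broadcast, arith_t, add) is a hypothetical illustration; the element type of the result is computed with decltype on a scalar addition, which is one way to obtain the type the usual arithmetic conversions yield.

```cpp
#include <cstddef>
#include <type_traits>
#include <utility>

// Hypothetical matrix wrapper holding Rows * Cols elements.
template <typename T, std::size_t Rows, std::size_t Cols>
struct Mat {
  T elems[Rows * Cols];
};

// Matrix to matrix: each element is converted element by element.
template <typename To, typename From, std::size_t Rows, std::size_t Cols>
constexpr Mat<To, Rows, Cols> convert(const Mat<From, Rows, Cols> &m) {
  Mat<To, Rows, Cols> res{};
  for (std::size_t i = 0; i < Rows * Cols; ++i)
    res.elems[i] = static_cast<To>(m.elems[i]);
  return res;
}

// Non-matrix to matrix: each element takes the value of the original value.
template <typename To, std::size_t Rows, std::size_t Cols, typename From>
constexpr Mat<To, Rows, Cols> broadcast(From s) {
  Mat<To, Rows, Cols> res{};
  for (std::size_t i = 0; i < Rows * Cols; ++i)
    res.elems[i] = static_cast<To>(s);
  return res;
}

// Element type produced by the usual arithmetic conversions on A and B.
template <typename A, typename B>
using arith_t = decltype(std::declval<A>() + std::declval<B>());

// Models "matrix + scalar": convert both operands to the common element
// type, then add elementwise.
template <typename A, typename B, std::size_t Rows, std::size_t Cols>
constexpr Mat<arith_t<A, B>, Rows, Cols> add(const Mat<A, Rows, Cols> &m,
                                             B s) {
  using E = arith_t<A, B>;
  Mat<E, Rows, Cols> lhs = convert<E>(m);
  Mat<E, Rows, Cols> rhs = broadcast<E, Rows, Cols>(s);
  Mat<E, Rows, Cols> res{};
  for (std::size_t i = 0; i < Rows * Cols; ++i)
    res.elems[i] = lhs.elems[i] + rhs.elems[i];
  return res;
}
```

For example, adding 0.5f to a Mat<int, 2, 2> produces a Mat<float, 2, 2>. Note that arith_t<short, short> is int because of integral promotion, so under this reading a matrix of short plus a short would yield a matrix of int.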
Matrix Type Element Access Operator
An expression of the form `postfix-expression [expression][expression]` where the postfix-expression is of matrix type is a matrix element access expression. expression shall not be a comma expression, and shall be a prvalue of unscoped enumeration or integral type. Given the expression E1[E2][E3], the result is an lvalue of the same type as the underlying element type of the matrix that refers to the value at E2 row and E3 column in the matrix. The expression E1 is sequenced before E2 and E3. The expressions E2 and E3 are unsequenced.
Note: We thought about providing an expression of the form `postfix-expression [expression]` to access columns of a matrix. We think that such an expression would be problematic once both column-major and row-major matrices are supported: depending on the memory layout, either accessing columns or rows can be done efficiently, but not both. Instead, we propose to provide builtins to extract rows and columns from a matrix. This makes the operations more explicit.
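To make the layout point concrete, here is a small sketch of how an element access could map to storage under the column-major layout described earlier. ColMajor and at are hypothetical names used only for illustration; the actual extension spells the access M[R][C].

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a Rows x Cols matrix value stored in
// column-major order without padding.
template <typename T, std::size_t Rows, std::size_t Cols>
struct ColMajor {
    T data[Rows * Cols] = {};

    // Mirrors M[r][c]: returns an lvalue for the element at row r, column c.
    // Elements of one column are contiguous; a row is strided by Rows.
    T& at(std::size_t r, std::size_t c) {
        assert(r < Rows && c < Cols);
        return data[c * Rows + r];
    }
};
```

With this mapping, walking down a column touches adjacent memory while walking along a row jumps Rows elements each step, which is the asymmetry the note refers to.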
Matrix Type Binary Operators
Each matrix type supports the following binary operators: +, -, *, /, and %. The * operator provides matrix multiplication, while +, -, /, and % are performed elementwise. There are also scalar versions of the operators, which take a matrix type and the underlying element type. The operation is applied to all elements of the matrix using the scalar value.
The operands of +, -, *, and / shall have either matrix type, arithmetic or unscoped enumeration type. The operands of % shall have either matrix type with an element type of integral type, integral type or unscoped enumeration type. At least one of the operands shall be of matrix type.
For BIN_OP in +, -, *, /, and %, given the expression M1 BIN_OP M2 where, for *, one of M1 or M2 is of arithmetic type:
* The usual arithmetic conversions are applied to M1 and M2. [ Note: if M1 or M2 are of arithmetic type, they are broadcast to matrices here. — end note ]
* The matrix types of M1 and M2 shall have the same number of rows and columns.
* The result is equivalent to Res in the following where col is the number of columns and row is the number of rows in the matrix type:
  decltype(M1) Res;
  for (int C = 0; C < col; ++C)
    for (int R = 0; R < row; ++R)
      Res[R][C] = M1[R][C] BIN_OP M2[R][C];
Given the expression M1 * M2 where M1 and M2 are of matrix type:
* The usual arithmetic conversions are applied to M1 and M2
* The type of M1 shall have the same number of columns as the type of M2 has rows.
* The resulting type, MTy, is the result of applying the usual arithmetic conversions to M1 and M2, but with the same number of rows as M1’s matrix type and the same number of columns as M2’s matrix type.
* The result is equivalent to Res in the following where col is the number of columns and row is the number of rows in MTy, and inner is the number of columns in M1’s matrix type:
  MTy Res;
  for (int C = 0; C < col; ++C) {
    for (int R = 0; R < row; ++R) {
      EltTy Elt = 0;
      for (int K = 0; K < inner; ++K) {
        Elt += M1[R][K] * M2[K][C];
      }
      Res[R][C] = Elt;
    }
  }
All operations on matrix types match the behavior of the underlying element type with respect to signed overflows.
With respect to rounding errors, the * operator preserves the behavior of the separate multiply and add operations by default. We propose to provide a Clang option to override this behavior and allow contraction of those operations (e.g. -ffp-contract=matrix).
Matrix Type Builtin Operations
Each matrix type supports a collection of builtin expressions that look like function calls but do not form an overload set. Here they are described as function declarations with rules for how to construct the argument list types and return type and the library description elements from [library.description.structure.specifications]/3 in the C++ standard.
Definitions:
* M, M1, M2, M3 - Matrix types
* T - Element type
* row, col - Row and column arguments respectively.

M2 __builtin_matrix_transpose(M1 matrix)
Remarks: The return type is a cv-unqualified matrix type that has the same element type as M1 and has the same number of rows as M1 has columns and the same number of columns as M1 has rows.
Returns: A matrix Res equivalent to the code below, where col refers to the number of columns of M1, and row to the number of rows of M1.
  M2 Res;
  for (int C = 0; C < col; ++C)
    for (int R = 0; R < row; ++R)
      Res[C][R] = matrix[R][C];

M __builtin_matrix_column_load(T *ptr, int row, int col, int stride)
Mandates: row and col shall be integral constants greater than 0.
Preconditions: stride >= row.
Remarks: The return type is a cv-unqualified matrix type with an element type of the cv-unqualified version of T and a number of rows and columns equal to row and col respectively.
Returns: A matrix Res equivalent to:
  M Res;
  for (int C = 0; C < col; ++C) {
    for (int R = 0; R < row; ++R)
      Res[R][C] = ptr[R];
    ptr += stride;
  }

void __builtin_matrix_column_store(M matrix, T *ptr, int stride)
Preconditions: stride is greater than or equal to the number of rows in M.
Effects: Equivalent to:
  for (int C = 0; C < columns in M; ++C) {
    for (int R = 0; R < rows in M; ++R)
      ptr[R] = matrix[R][C];
    ptr += stride;
  }
Remarks: The type T is the const-unqualified version of the matrix argument’s element type.

Example
This code performs a matrix multiply of two 4x4 matrices followed by a matrix addition:
  typedef float m4x4_t __attribute__((matrix_type(4, 4)));
  void f(m4x4_t *a, m4x4_t *b, m4x4_t *c, m4x4_t *r) {
    *r = *a + (*b * *c);
  }
This will get lowered by Clang to the LLVM IR below. In our current implementation, we use LLVM’s array type as storage type for the matrix data. Before accessing the data, we cast the array to a vector type. This allows us to use the element width as alignment, without running into issues with LLVM’s large default alignment for vector types, which is problematic in structs.
define void @f([16 x float]* %a, [16 x float]* %b, [16 x float]* %c, [16 x float]* %r) #0 {
entry:
  %a.addr = alloca [16 x float]*, align 8
  %b.addr = alloca [16 x float]*, align 8
  %c.addr = alloca [16 x float]*, align 8
  %r.addr = alloca [16 x float]*, align 8
  store [16 x float]* %a, [16 x float]** %a.addr, align 8
  store [16 x float]* %b, [16 x float]** %b.addr, align 8
  store [16 x float]* %c, [16 x float]** %c.addr, align 8
  store [16 x float]* %r, [16 x float]** %r.addr, align 8
  %0 = load [16 x float]*, [16 x float]** %a.addr, align 8
  %1 = bitcast [16 x float]* %0 to <16 x float>*
  %2 = load <16 x float>, <16 x float>* %1, align 4
  %3 = load [16 x float]*, [16 x float]** %b.addr, align 8
  %4 = bitcast [16 x float]* %3 to <16 x float>*
  %5 = load <16 x float>, <16 x float>* %4, align 4
  %6 = call <16 x float> @llvm.matrix.multiply.v16f32.v16f32.v16f32(<16 x float> %2, <16 x float> %5, i32 4, i32 4, i32 4)
  %7 = load [16 x float]*, [16 x float]** %c.addr, align 8
  %8 = bitcast [16 x float]* %7 to <16 x float>*
  %9 = load <16 x float>, <16 x float>* %8, align 4
  %10 = fadd <16 x float> %6, %9
  %11 = load [16 x float]*, [16 x float]** %r.addr, align 8
  %12 = bitcast [16 x float]* %11 to <16 x float>*
  store <16 x float> %10, <16 x float>* %12, align 4
  ret void
}
; Function Attrs: nounwind readnone speculatable willreturn
declare <16 x float> @llvm.matrix.multiply.v16f32.v16f32.v16f32(<16 x float>, <16 x float>, i32 immarg, i32 immarg, i32 immarg)
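For readers who want to experiment with the strided column load and store semantics described in the draft without a patched Clang, the following plain C++ sketch emulates them. The names column_load and column_store and the use of std::vector as the matrix value are assumptions made for this illustration only; they are not the proposed builtins.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Emulates the column load: reads a rows x cols matrix whose columns start
// `stride` elements apart in `ptr` and returns it packed in column-major order.
template <typename T>
std::vector<T> column_load(const T* ptr, std::size_t rows, std::size_t cols,
                           std::size_t stride) {
    assert(stride >= rows);  // precondition from the draft
    std::vector<T> res(rows * cols);
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            res[c * rows + r] = ptr[c * stride + r];
    return res;
}

// Emulates the column store: writes a packed column-major matrix back to
// `ptr`, again with consecutive columns `stride` elements apart.
template <typename T>
void column_store(const std::vector<T>& matrix, std::size_t rows,
                  std::size_t cols, T* ptr, std::size_t stride) {
    assert(stride >= rows);  // precondition from the draft
    assert(matrix.size() == rows * cols);
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            ptr[c * stride + r] = matrix[c * rows + r];
}
```

A stride larger than the row count lets the same calls address a sub-matrix of a larger column-major buffer; elements in the gap between columns are neither read nor written.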
_______________________________________________
cfedev mailing list
[hidden email]
https://lists.llvm.org/cgibin/mailman/listinfo/cfedev


Hi,
Thanks. I'm concerned about the inconsistency between * and the / and % operators, but other than that I think this proposal looks good.
Thank you very much for taking a look! I’ve responded inline.
As a next step I am planning on putting a patch up with the draft specification (that should make it easier to keep track of additional comments about formulation details) and getting the Clang operator patches ready.
Below is a proposal for matrix type element access & binary operators:
Matrix Type Element Access Operator
An expression of the form `postfix-expression [expression][expression]` where the postfix-expression is of matrix type is a matrix element access expression. expression shall not be a comma expression, and shall be a prvalue of unscoped enumeration or integral type. Given the expression E1[E2][E3], the result is an lvalue of the same type as the underlying element type of the matrix that refers to the value at E2 row and E3 column in the matrix. The expression E1 is sequenced before E2 and E3. The expressions E2 and E3 are unsequenced.
Note: We thought about providing an expression of the form `postfix-expression [expression]` to access columns of a matrix. We think that such an expression would be problematic once both column-major and row-major matrices are supported: depending on the memory layout, either accessing columns or rows can be done efficiently, but not both. Instead, we propose to provide builtins to extract rows and columns from a matrix. This makes the operations more explicit.
Are you aware of the OpenMP array sections extension? ( https://www.openmp.org/spechtml/5.0/openmpsu21.html) Inspired by that, you could use something like E1[:][E2] and E1[E2][:] to form column and row vectors. Just a thought; I think using builtins for these is also fine.
I was not aware of that, thanks for pointing it out!
It looks like it would provide a convenient syntax (similar to what some other languages use) for extracting rows and columns. However, given that the performance characteristics depend on the underlying layout, I think it would be preferable to provide only builtins initially, assuming it would be OK to add better syntax later on.
Matrix Type Binary Operators
Each matrix type supports the following binary operators: +, -, *, /, and %. The * operator provides matrix multiplication, while +, -, /, and % are performed elementwise. There are also scalar versions of the operators, which take a matrix type and the underlying element type. The operation is applied to all elements of the matrix using the scalar value.
Supporting elementwise matrix1 / matrix2 seems likely to be confusing if * means matrix multiplication. Is this an important operation? (I'd imagine not, if you're willing to give up elementwise *.)
I think the following would all be OK:
1) +, -, and * all follow normal math conventions (* is matrix multiplication or matrix/vector or matrix/scalar multiplication, + and - are elementwise and require the same type on both sides); / and % are not provided
2) +, -, *, /, and % are elementwise; matrix multiplication is done a different way.
3) + and - are elementwise; *, /, and % are not provided and those operations are provided a different way
Of those, I think (1) is the cleanest approach, even though it's inconsistent with how we treat * for two operands of vector type. (2) would be consistent with our approach for vectors, and might be a nice approach if we can find a good alternative syntax for matrix multiplication (infix x is likely unambiguous, but would certainly be novel).
Thanks for highlighting the potential for confusion. I think dropping support for / and % (option 1) would be best from a user perspective. The most important operations to cover by far are +, - and matrix multiplication. We mostly added elementwise / and % for completeness, but given the potential for confusion it seems better to exclude them.
The operands of +, -, *, and / shall have either matrix type, arithmetic or unscoped enumeration type. The operands of % shall have either matrix type with an element type of integral type, integral type or unscoped enumeration type. At least one of the operands shall be of matrix type.
For BIN_OP in +, -, *, /, and %, given the expression M1 BIN_OP M2 where, for *, one of M1 or M2 is of arithmetic type:
* The usual arithmetic conversions are applied to M1 and M2. [ Note: if M1 or M2 are of arithmetic type, they are broadcast to matrices here. — end note ]
* The matrix types of M1 and M2 shall have the same number of rows and columns.
* The result is equivalent to Res in the following where col is the number of columns and row is the number of rows in the matrix type:
  decltype(M1) Res;
  for (int C = 0; C < col; ++C)
    for (int R = 0; R < row; ++R)
      Res[R][C] = M1[R][C] BIN_OP M2[R][C];
Given the expression M1 * M2 where M1 and M2 are of matrix type:
* The usual arithmetic conversions are applied to M1 and M2
* The type of M1 shall have the same number of columns as the type of M2 has rows.
* The resulting type, MTy, is the result of applying the usual arithmetic conversions to M1 and M2, but with the same number of rows as M1’s matrix type and the same number of columns as M2’s matrix type.
* The result is equivalent to Res in the following where col is the number of columns and row is the number of rows in MTy:
  MTy Res;
  for (int C = 0; C < col; ++C) {
    for (int R = 0; R < row; ++R) {
      EltTy Elt = 0;
Is EltTy the matrix element type or the promoted matrix element type? (Based on what you wrote below, I assume the former, just checking...)
EltTy here refers to the element type of MTy (I’ll clarify the wording), hence it would be the result of the conversion/promotion based on the types of M1 and M2.
      for (int K = 0; K < inner; ++K) {
        Elt += M1[R][K] * M2[K][C];
      }
      Res[R][C] = Elt;
    }
  }
All operations on matrix types match the behavior of the underlying element type with respect to signed overflows.
With respect to rounding errors, the * operator preserves the behavior of the separate multiply and add operations by default. We propose to provide a Clang option to override this behavior and allow contraction of those operations (e.g. -ffp-contract=matrix).
OK, so in particular you guarantee to perform the additions in the order given by the above rewrite (assuming no -ffp-contract option)? (So, for example, you get the precision and rounding behavior of doing the adds in that order, and for a signed integer type, you get signed overflow if and only if the intermediate Elt value overflows even if the final Elt value for any (R,C) is back in range.) Seems OK, if that's consistent with your goals.
Yes, the order of the dependent operations should be equivalent to the rewrite, unless relaxed by the user, e.g. via options like -ffp-contract.
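As a small illustration of why the accumulation order is observable, here is a sketch in plain C++ that sums the same products in the order of the rewrite and in reverse. The function names are hypothetical, and the example assumes IEEE double arithmetic with no reassociation (for example, no -ffast-math).

```cpp
#include <cassert>
#include <cstddef>

// Accumulates a[k] * b[k] for k = 0, 1, ..., n-1, i.e. in the order of the
// inner K loop of the rewrite.
double dot_in_order(const double* a, const double* b, std::size_t n) {
    double elt = 0;
    for (std::size_t k = 0; k < n; ++k)
        elt += a[k] * b[k];
    return elt;
}

// Same products, accumulated from k = n-1 down to 0.
double dot_reversed(const double* a, const double* b, std::size_t n) {
    double elt = 0;
    for (std::size_t k = n; k-- > 0;)
        elt += a[k] * b[k];
    return elt;
}
```

For a = {1e20, -1e20, 1.0} and b = {1.0, 1.0, 1.0}, the in-order sum is 1.0 while the reversed sum is 0.0, because adding 1.0 to -1e20 is absorbed by rounding. A compiler that reordered the additions would therefore change results, which is why the order is part of the specified behavior.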
Cheers, Florian


> On Mar 19, 2020, at 18:52, Florian Hahn via cfedev < [hidden email]> wrote:
>
>> On Mar 18, 2020, at 20:38, Richard Smith < [hidden email]> wrote:
>>
>> Thanks. I'm concerned about the inconsistency between * and the / and % operators, but other than that I think this proposal looks good.
>
>
> Thank you very much for taking a look! I’ve responded inline.
>
> As next step I am planning on putting a patch up with the draft specification (that should make it easier to keep track of additional comments about formulation details) and get the clang operator patches ready.
I’ve put a set of patches up on Phabricator:
* [Matrix] Add draft specification for matrix support in Clang. https://reviews.llvm.org/D76612
* [Matrix] Implement matrix index expressions ([][]). https://reviews.llvm.org/D76791
* [Matrix] Implement + and - operators for MatrixType. https://reviews.llvm.org/D76793
* [Matrix] Implement * binary operator for MatrixType. https://reviews.llvm.org/D76794
And updated the original patch adding the matrix type support to Clang (https://reviews.llvm.org/D72281)
I’m looking forward to any feedback!
Cheers,
Florian


On 25 Feb 2020, at 16:02, Florian Hahn via cfedev wrote:
Hi,
We updated the proposal to use operators for most operations (with a lot of help from Michael Spencer!). I’ve added the updated section inline and the complete updated proposal can be found at the end of the mail.
On Jan 18, 2020, at 12:01, Florian Hahn via cfedev <[hidden email]> wrote:
Thanks for the feedback! I’ve responded inline.
On 16 Jan 2020, at 23:06, Richard Smith <[hidden email]> wrote:
Do you anticipate providing builtin operators for matrices? If not, then the utility of a dedicated type and `matrix_type` attribute seems greatly diminished: the builtin matrix operators could instead - in principle - operate on a suitable vector type (either as a flat vector, matching the LLVM IR model, or as a vector of vectors, to support two-dimensional indexing). I think your proposal should express why those would be inferior choices (eg, do matrix types have different calling conventions, alignment requirements, ... on some target? Do you intend to provide matrix x matrix multiplication and matrix x vector multiplication via the * operator in the future?). Adding *only* builtin functions and no new matrix types would be a substantial simplification in the proposal.
I think it would make sense to provide builtin operators instead of the proposed builtins for math operations. Same for element insertion/extraction. However I am not sure how to provide the strided matrix load/store as operators. Would it be OK to just have builtins for those? The reason we went for builtins initially was that we thought that might make the proposal a bit more lightweight, but it sounds like builtin operators would be preferred with the type.
I do not think ext_vector_type would be suitable for our proposal, as it matches LLVM’s vector alignment and the matrix type should match the alignment of the underlying data type, to allow easy interaction with existing matrix libraries.
A vector of vectors should work in principle, as long as we could fix both dimensions on a type level. Not having the dimensions guaranteed by the type would have a negative impact on the user experience I think, as we would, for example, lose the ability to type-check if the dimensions match the operators and users would have to provide the dimensions for certain operations. Also, it would make supporting 3+ dimensions a bit more tricky.
Below is a proposal for matrix type element access & binary operators:
Matrix Type Element Access Operator
An expression of the form `postfix-expression [expression][expression]` where the postfix-expression is of matrix type is a matrix element access expression. expression shall not be a comma expression, and shall be a prvalue of unscoped enumeration or integral type. Given the expression E1[E2][E3], the result is an lvalue of the same type as the underlying element type of the matrix that refers to the value at E2 row and E3 column in the matrix. The expression E1 is sequenced before E2 and E3. The expressions E2 and E3 are unsequenced.
Note: We thought about providing an expression of the form `postfix-expression [expression]` to access columns of a matrix. We think that such an expression would be problematic once both column-major and row-major matrices are supported: depending on the memory layout, either accessing columns or rows can be done efficiently, but not both. Instead, we propose to provide builtins to extract rows and columns from a matrix. This makes the operations more explicit.
Matrix Type Binary Operators
Each matrix type supports the following binary operators: +, -, *, /, and %. The * operator provides matrix multiplication, while +, -, /, and % are performed elementwise. There are also scalar versions of the operators, which take a matrix type and the underlying element type. The operation is applied to all elements of the matrix using the scalar value.
The operands of +, -, *, and / shall have either matrix type, arithmetic or unscoped enumeration type. The operands of % shall have either matrix type with an element type of integral type, integral type or unscoped enumeration type. At least one of the operands shall be of matrix type.
For BIN_OP in +, -, *, /, and %, given the expression M1 BIN_OP M2 where, for *, one of M1 or M2 is of arithmetic type:
* The usual arithmetic conversions are applied to M1 and M2. [ Note: if M1 or M2 are of arithmetic type, they are broadcast to matrices here. — end note ]
* The matrix types of M1 and M2 shall have the same number of rows and columns.
* The result is equivalent to Res in the following where col is the number of columns and row is the number of rows in the matrix type:
  decltype(M1) Res;
  for (int C = 0; C < col; ++C)
    for (int R = 0; R < row; ++R)
      Res[R][C] = M1[R][C] BIN_OP M2[R][C];
Given the expression M1 * M2 where M1 and M2 are of matrix type:
* The usual arithmetic conversions are applied to M1 and M2
* The type of M1 shall have the same number of columns as the type of M2 has rows.
* The resulting type, MTy, is the result of applying the usual arithmetic conversions to M1 and M2, but with the same number of rows as M1’s matrix type and the same number of columns as M2’s matrix type.
* The result is equivalent to Res in the following where col is the number of columns and row is the number of rows in MTy, and inner is the number of columns in M1’s matrix type:
  MTy Res;
  for (int C = 0; C < col; ++C) {
    for (int R = 0; R < row; ++R) {
      EltTy Elt = 0;
      for (int K = 0; K < inner; ++K) {
        Elt += M1[R][K] * M2[K][C];
      }
      Res[R][C] = Elt;
    }
  }
All operations on matrix types match the behavior of the underlying element type with respect to signed overflows.
With respect to rounding errors, the * operator preserves the behavior of the separate multiply and add operations by default. We propose to provide a Clang option to override this behavior and allow contraction of those operations (e.g. -ffp-contract=matrix).
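The multiplication rewrite above can be tried out in plain C++ with the dimensions carried in the type. This is only a sketch: Mat and multiply are hypothetical names, the struct imitates a column-major matrix value, and nothing here uses the proposed extension itself.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a matrix type: R x C elements, column-major.
template <typename T, std::size_t R, std::size_t C>
struct Mat {
    std::array<T, R * C> data{};
    T& operator()(std::size_t r, std::size_t c) {
        assert(r < R && c < C);
        return data[c * R + r];
    }
    const T& operator()(std::size_t r, std::size_t c) const {
        assert(r < R && c < C);
        return data[c * R + r];
    }
};

// Follows the loop nest of the rewrite: the inner dimension must match
// (checked by template deduction) and the result is Row x Col.
template <typename T, std::size_t Row, std::size_t Inner, std::size_t Col>
Mat<T, Row, Col> multiply(const Mat<T, Row, Inner>& m1,
                          const Mat<T, Inner, Col>& m2) {
    Mat<T, Row, Col> res;
    for (std::size_t c = 0; c < Col; ++c) {
        for (std::size_t r = 0; r < Row; ++r) {
            T elt = 0;
            for (std::size_t k = 0; k < Inner; ++k)
                elt += m1(r, k) * m2(k, c);
            res(r, c) = elt;
        }
    }
    return res;
}
```

Passing operands whose inner dimensions differ fails to compile in this sketch, which mirrors the kind of type checking that fixed dimensions in the matrix type make possible.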
Have the Eigen developers indicated they would consider using this if it were available to them?
Have you reached out to the GCC developers to see if they would also be likely to support this extension?
We should be aiming to build critical mass behind this feature so that it gets adopted; it would be a waste of resources if a different technology ends up being adopted in this space and we're left maintaining a system that no one outside Apple uses.
We hoped to get some initial feedback before reaching out, to make sure the proposal is in reasonably good shape. I plan to reach out to them early next week.
I reached out on the Eigen mailing list and got an encouraging response: https://listengine.tuxfamily.org/lists.tuxfamily.org/eigen/2020/02/msg00000.html
I’ve also added people involved in GCC to the WIP patch adding the type.
Cheers,
Florian
Draft Specification
Matrix Type Attribute
The attribute-token matrix_type is used to declare a matrix type. It shall appear at most once in each attribute-list. The attribute shall only appertain to a typedef-name of a typedef of a non-volatile type that is a signed integer type, an unsigned integer type, or a floating-point type.
You should require the element type to be unqualified and then apply qualifiers “outside” of the matrix type. That is, you can have a const float4x4, but not a const_float4x4.
An attribute-argument-clause must be present and it shall have the form:
(constant-expression, constant-expression)
Both constant-expressions shall be positive non-zero integral constant expressions. The maximum of the product of the constants is implementation defined. If that implementation defined limit is exceeded, the program is ill-formed.
An attribute of the form matrix_type(R, C) forms a matrix type with an element type of the cv-qualified type the attribute appertains to and R rows and C columns.
If a declaration of a typedef-name has a matrix_type attribute, then all declarations of that typedef-name shall have a matrix_type attribute with the same element type, number of rows, and number of columns.
Matrix Type
A matrix type has an underlying element type, a constant number of rows, and a constant number of columns. Matrix types with the same element type, rows, and columns are the same type. A value of a matrix type contains rows * columns values of the element type laid out in column-major order without padding in a way compatible with an array of at least that many elements of the underlying element type.
A matrix type is a scalar type with the same alignment as its underlying element type,
Why do you want to guarantee the same alignment as the element type? Isn’t that going to be seriously performance-limiting for most matrix types?
If your goal is to make it easy to load / store matrices from an existing buffer, I would recommend adding separate operations to do that instead of (I assume) expecting users to simply cast those buffers to a pointer-to-matrix type and dereference. That will also allow you to use non-packed representations for matrices. Furthermore, you can parameterize the operation by the row/column-majorness of the buffer and then simply use a canonical representation for matrix types.
but objects of matrix type are not usable in constant expressions.
This is contrary to the prevailing language directions of both C and C++. It might be an acceptable temporary language restriction, but it’s not something you should specify.
TODO: Allow reinterpret_cast from pointer to element type. Make aliasing work.
This seems like an anti-goal. Maybe you want a reinterpret_cast between matrix types with the same total storage size, though? Not sure if that’s a good idea.
TODO: Does it make sense to allow M::element_type, M::rows, and M::columns where M is a matrix type? We don’t support this anywhere else, but it’s convenient. The alternative is using template deduction to extract this information.
C will need a spelling for these, too.
Future Work: Initialization syntax.
Future Work: Conversions between matrix types with const qualified and unqualified element types.
Should be defined away by handling qualifiers correctly, per above.
Standard Conversions
The standard conversions are extended as follows.
For integral promotions, floating-point promotion, integral conversions, floating-point conversions, and floating-integral conversions: apply the rules to the underlying type of the matrix type. The resulting type is a matrix type with that underlying element type. The resulting value is as follows:
* If the original value was of matrix type, each element is converted element by element.
* If the original value was not of matrix type, each element takes the value of the original value.
This isn’t sufficient; you need to describe what happens if the element counts don’t match. I assume this is ill-formed.
Arithmetic Conversions
The usual arithmetic conversions are extended as follows.
Insert at the start:
* If either operand is of matrix type, apply the usual arithmetic conversions using its underlying element type. The resulting type is a matrix type with that underlying element type.
This seems like a mistake, mostly because of the integer promotions. You really want short_matrix + short_matrix to yield an int_matrix, or -short_matrix to be an int_matrix? I would recommend that you specify this as working without integer promotion, so that the result of an operation on two integer operands just has (1) the greater rank and (2) is unsigned if there’s a signedness mismatch and the signed type can’t express the full range of the unsigned type.
You also need to specify what happens if you pass a matrix as a variadic argument.
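To illustrate the integer promotion concern with ordinary scalars, the sketch below checks at compile time which types the usual arithmetic conversions produce today. The aliases sum_t and neg_t are hypothetical helpers for this example; applying the same rules to the element type is what would turn a short matrix into an int matrix.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Type of `a + b` for operands of types A and B under the usual arithmetic
// conversions (including the integer promotions).
template <typename A, typename B>
using sum_t = decltype(std::declval<A>() + std::declval<B>());

// Type of `-a` for an operand of type A (integer promotion applies).
template <typename A>
using neg_t = decltype(-std::declval<A>());

// Two short operands are promoted, so the sum has type int.
static_assert(std::is_same<sum_t<short, short>, int>::value,
              "short + short has type int");
// Negating a short also yields int.
static_assert(std::is_same<neg_t<short>, int>::value,
              "-short has type int");
// Mixed floating-point operands convert to the wider type.
static_assert(std::is_same<sum_t<float, double>, double>::value,
              "float + double has type double");
```

Under the alternative suggested above, an operation on two short-element matrices would keep short as the element type instead of widening to int.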
Matrix Type Element Access Operator
An expression of the form `postfix-expression [expression][expression]` where the postfix-expression is of matrix type is a matrix element access expression. expression shall not be a comma expression, and shall be a prvalue of unscoped enumeration or integral type. Given the expression E1[E2][E3], the result is an lvalue of the same type as the underlying element type of the matrix that refers to the value at E2 row and E3 column in the matrix. The expression E1 is sequenced before E2 and E3. The expressions E2 and E3 are unsequenced.
Note: We thought about providing an expression of the form `postfix-expression [expression]` to access columns of a matrix. We think that such an expression would be problematic once both column-major and row-major matrices are supported: depending on the memory layout, either accessing columns or rows can be done efficiently, but not both. Instead, we propose to provide builtins to extract rows and columns from a matrix. This makes the operations more explicit.
Okay, so what happens if you do write matrix[0] ?
John.


Thank you very much for your comments John! Responses inline and I’ve also updated
[snip]
Draft Specification
Matrix Type Attribute
The attribute-token matrix_type is used to declare a matrix type. It shall appear at most once in each attribute-list. The attribute shall only appertain to a typedef-name of a typedef of a non-volatile type that is a signed integer type, an unsigned integer type, or a floating-point type.
You should require the element type to be unqualified and then apply qualifiers “outside” of the matrix type. That is, you can have a const float4x4, but not a const_float4x4.
Agreed, thanks! I’ve added
"The element type must be unqualified and qualifiers are applied to the matrix type directly.”
An attribute-argument-clause must be present and it shall have the form:
(constant-expression, constant-expression)
Both constant-expressions shall be positive non-zero integral constant expressions. The maximum of the product of the constants is implementation defined. If that implementation defined limit is exceeded, the program is ill-formed.
An attribute of the form matrix_type(R, C) forms a matrix type with an element type of the cv-qualified type the attribute appertains to and R rows and C columns.
If a declaration of a typedef-name has a matrix_type attribute, then all declarations of that typedef-name shall have a matrix_type attribute with the same element type, number of rows, and number of columns.
Matrix Type
A matrix type has an underlying element type, a constant number of rows, and a constant number of columns. Matrix types with the same element type, rows, and columns are the same type. A value of a matrix type contains rows * columns values of the element type laid out in column-major order without padding in a way compatible with an array of at least that many elements of the underlying element type.
A matrix type is a scalar type with the same alignment as its underlying element type,
Why do you want to guarantee the same alignment as the element type? Isn’t that going to be seriously performance-limiting for most matrix types? If your goal is to make it easy to load / store matrices from an existing buffer, I would recommend adding separate operations to do that instead of (I assume) expecting users to simply cast those buffers to a pointer-to-matrix type and dereference. That will also allow you to use non-packed representations for matrices. Furthermore, you can parameterize the operation by the row/column-majorness of the buffer and then simply use a canonical representation for matrix types.
The main reason was allowing users to cast between element type/matrix pointers to load/store matrices from buffers. But I think it would indeed be better to disallow such casts and not limit ourselves to the element type alignment. There already are the __builtin_matrix_column_load/__builtin_matrix_column_store builtins, which can be used instead of going through casts. In the future, similar builtins could be added to load data in row-major layouts.
I think initially it would be best to go with implementation defined alignment. What do you think? (I’ve reworded the sentence to "A matrix type is a *scalar type* and its alignment is implementation defined.”)
but objects of matrix type are not usable in constant expressions.
This is contrary to the prevailing language directions of both C and C++. It might be an acceptable temporary language restriction, but it’s not something you should specify.
Sounds good, let’s drop this restriction.
TODO: Allow reinterpret_cast from pointer to element type. Make aliasing work.
This seems like an anti-goal. Maybe you want a reinterpret_cast between matrix types with the same total storage size, though? Not sure if that’s a good idea.
Agreed, now that we cannot cast element type pointers to matrix pointers I think we should disallow it here. It should simply be enough to drop the TODO, right?
TODO: Does it make sense to allow M::element_type, M::rows, and M::columns where M is a matrix type? We don’t support this anywhere else, but it’s convenient. The alternative is using template deduction to extract this information.
C will need a spelling for these, too.
I’ve added that to the TODO.
Future Work: Initialization syntax.
Future Work: Conversions between matrix types with const qualified and unqualified element types.
Should be defined away by handling qualifiers correctly, per above.
Thanks, I’ll drop this.
Standard Conversions
The standard conversions are extended as follows.
For integral promotions, floating-point promotion, integral conversions, floating-point conversions, and floating-integral conversions: apply the rules to the underlying type of the matrix type. The resulting type is a matrix type with that underlying element type. The resulting value is as follows:
* If the original value was of matrix type, each element is converted element by element.
* If the original value was not of matrix type, each element takes the value of the original value.
This isn’t sufficient; you need to describe what happens if the element counts don’t match. I assume this is ill-formed.
Yes, the dimensions should match.
I’ve added "If the number of rows or columns differs between the original and resulting type, the program is ill-formed. Otherwise the resulting value is as follows:"
Arithmetic Conversions
The usual arithmetic conversions are extended as follows.
Insert at the start:
* If either operand is of matrix type, apply the usual arithmetic conversions using its underlying element type. The resulting type is a matrix type with that underlying element type.
This seems like a mistake, mostly because of the integer promotions. You really want short_matrix + short_matrix to yield an int_matrix, or -short_matrix to be an int_matrix? I would recommend that you specify this as working without integer promotion, so that the result of an operation on two integer operands just has (1) the greater rank and (2) is unsigned if there’s a signedness mismatch and the signed type can’t express the full range of the unsigned type.
The intention of choosing the rules for the underlying type was consistency with the existing behavior, but maybe it would be beneficial to forbid them. I think with both there might be a source of confusion, but it might be simpler to not allow promotion.
You also need to specify what happens if you pass a matrix as a variadic argument.
Initially they would be passed similar to ext_vector values, but we should spell the rules out correctly here. I’ll add a TODO.
Matrix Type Element Access Operator
An expression of the form `postfix-expression [expression][expression]` where the postfix-expression is of matrix type is a matrix element access expression. expression shall not be a comma expression, and shall be a prvalue of unscoped enumeration or integral type. Given the expression E1[E2][E3], the result is an lvalue of the same type as the underlying element type of the matrix that refers to the value at row E2 and column E3 in the matrix. The expression E1 is sequenced before E2 and E3. The expressions E2 and E3 are unsequenced.
Note: We thought about providing an expression of the form `postfix-expression [expression]` to access columns of a matrix. We think that such an expression would be problematic once both column-major and row-major matrixes are supported: depending on the memory layout, either accessing columns or rows can be done efficiently, but not both. Instead, we propose to provide builtins to extract rows and columns from a matrix. This makes the operations more explicit.
Okay, so what happens if you do write matrix[0]?
That is not supported. We should probably state explicitly that single index operators on matrix values are ill-defined or something like that?
Cheers, Florian
_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev


On 26 Mar 2020, at 18:21, Florian Hahn wrote:
Thank you very much for your comments John! Responses inline and I’ve also updated
On Mar 25, 2020, at 20:39, John McCall <[hidden email]> wrote:
You should require the element type to be unqualified and then apply qualifiers “outside” of the matrix type. That is, you can have a const float4x4, but not a const_float4x4.
Agreed, thanks! I’ve added
"The element type must be unqualified and qualifiers are applied to the matrix type directly.”
Why do you want to guarantee the same alignment as the element type? Isn’t that going to be seriously performance-limiting for most matrix types?
If your goal is to make it easy to load / store matrices from an existing buffer, I would recommend adding separate operations to do that instead of (I assume) expecting users to simply cast those buffers to a pointer-to-matrix type and dereference. That will also allow you to use non-packed representations for matrices. Furthermore, you can parameterize the operation by the row/column-majorness of the buffer and then simply use a canonical representation for matrix types.
The main reason was to allow users to cast between element type/matrix pointers to load/store matrixes from buffers. But I think it would indeed be better to disallow such casts and not limit ourselves to the element type alignment. There already are the builtin_matrix_column_load/builtin_matrix_column_store which can be used instead of going through casts. In the future, similar builtins could be added to load data in row-major layouts.
I think initially it would be best to go with implementation-defined alignment. What do you think? (I’ve reworded the sentence to “A matrix type is a *scalar type* and its alignment is implementation defined.”)
Yeah, making the size and alignment implementation-defined is the right specese.
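To make the suggestion about separate load/store operations concrete, here is a minimal C++ sketch of a library-level analogue. It is not the proposed Clang builtin interface: `Matrix`, `column_major_load`, `column_major_store` and the `stride` parameter are hypothetical names chosen for illustration. The point it shows is that the buffer layout is a property of the operation, so the matrix value itself can keep whatever internal representation and alignment the implementation prefers.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a matrix value. Its internal layout is private to
// the "implementation"; in this sketch it happens to be packed column-major.
template <typename T, std::size_t Rows, std::size_t Cols>
struct Matrix {
    std::array<T, Rows * Cols> elems{};

    T &at(std::size_t r, std::size_t c) { return elems[c * Rows + r]; }
    const T &at(std::size_t r, std::size_t c) const { return elems[c * Rows + r]; }
};

// Load a Rows x Cols matrix from a column-major buffer. `stride` is the
// distance, in elements, between the starts of consecutive columns.
template <typename T, std::size_t Rows, std::size_t Cols>
Matrix<T, Rows, Cols> column_major_load(const T *buf, std::size_t stride) {
    assert(stride >= Rows && "columns must not overlap");
    Matrix<T, Rows, Cols> m;
    for (std::size_t c = 0; c < Cols; ++c)
        for (std::size_t r = 0; r < Rows; ++r)
            m.at(r, c) = buf[c * stride + r];
    return m;
}

// Store a matrix into a column-major buffer with the given column stride.
template <typename T, std::size_t Rows, std::size_t Cols>
void column_major_store(const Matrix<T, Rows, Cols> &m, T *buf, std::size_t stride) {
    assert(stride >= Rows && "columns must not overlap");
    for (std::size_t c = 0; c < Cols; ++c)
        for (std::size_t r = 0; r < Rows; ++r)
            buf[c * stride + r] = m.at(r, c);
}
```

A caller would write something like `column_major_load<float, 2, 3>(buf, 4)` rather than casting `buf` to a pointer-to-matrix type. A row-major variant would only need to swap the index computation on the buffer side.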
TODO: Allow reinterpret_cast from pointer to element type. Make aliasing work.
This seems like an anti-goal. Maybe you want a reinterpret_cast between matrix types with the same total storage size, though? Not sure if that’s a good idea.
Agreed, now that we cannot cast element type pointers to matrix pointers, I think we should disallow it here. It should simply be enough to drop the TODO, right?
Yeah, that would be fine.
Standard Conversions
The standard conversions are extended as follows.
For integral promotions, floating-point promotion, integral conversions, floating-point conversions, and floating-integral conversions: apply the rules to the underlying type of the matrix type. The resulting type is a matrix type with that underlying element type. The resulting value is as follows:
* If the original value was of matrix type, each element is converted element by element.
* If the original value was not of matrix type, each element takes the value of the original value.
This isn’t sufficient; you need to describe what happens if the element counts don’t match. I assume this is ill-formed.
Yes, the dimensions should match.
I’ve added "If the number of rows or columns differs between the original and resulting type, the program is ill-formed. Otherwise the resulting value is as follows:"
Arithmetic Conversions
The usual arithmetic conversions are extended as follows.
Insert at the start:
* If either operand is of matrix type, apply the usual arithmetic conversions using its underlying element type. The resulting type is a matrix type with that underlying element type.
This seems like a mistake, mostly because of the integer promotions. You really want short_matrix + short_matrix to yield an int_matrix, or -short_matrix to be an int_matrix? I would recommend that you specify this as working without integer promotion, so that the result of an operation on two integer operands just has (1) the greater rank and (2) is unsigned if there’s a signedness mismatch and the signed type can’t express the full range of the unsigned type.
The intention of choosing the rules for the underlying type was consistency with the existing behavior, but maybe it would be beneficial to forbid them. I think with both there might be a source of confusion, but it might be simpler to not allow promotion.
It’s an interesting question. In a totally different language, I would say that you shouldn’t allow mismatched binary operations or implicit conversions, but that explicit casts should be able to do arbitrary element-wise conversions. That’s not really consistent with C’s normal behavior, though. It also might be really pedantic around e.g. scalar multiplication by a literal: would you have to write short4x4 * (short) 2? On the other hand, presumably you wouldn’t want float4x4 * 2.0 to yield a double4x4.
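For readers less familiar with the scalar rules being discussed, the following C++ sketch checks at compile time what the ordinary integer promotions and usual arithmetic conversions do to scalars. It says nothing about matrix types themselves; `product_t` is a helper alias introduced only for this illustration. Applying these rules element-wise is what would turn short4x4 * 2 into an int matrix and float4x4 * 2.0 into a double matrix.

```cpp
#include <type_traits>
#include <utility>

// Type of `a * b` for operands of types A and B, after the usual
// arithmetic conversions have been applied. (Illustration-only helper.)
template <typename A, typename B>
using product_t = decltype(std::declval<A>() * std::declval<B>());

// Integer promotion: arithmetic on two shorts is carried out in int.
static_assert(std::is_same<product_t<short, short>, int>::value,
              "short * short yields int");

// Unary minus promotes its operand as well.
static_assert(std::is_same<decltype(-std::declval<short>()), int>::value,
              "-short yields int");

// An unsuffixed integer literal has type int, so short * 2 is an int.
static_assert(std::is_same<decltype(2), int>::value, "2 is an int");
static_assert(std::is_same<product_t<short, decltype(2)>, int>::value,
              "short * 2 yields int");

// An unsuffixed floating literal has type double, so float * 2.0 is a double.
static_assert(std::is_same<decltype(2.0), double>::value, "2.0 is a double");
static_assert(std::is_same<product_t<float, decltype(2.0)>, double>::value,
              "float * 2.0 yields double");
```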
You also need to specify what happens if you pass a matrix as a variadic argument.
Initially they would be passed similar to ext_vector values, but we should spell the rules out correctly here. I’ll add a TODO.
Yeah, this is definitely a longterm question.
Matrix Type Element Access Operator
An expression of the form `postfix-expression [expression][expression]` where the postfix-expression is of matrix type is a matrix element access expression. expression shall not be a comma expression, and shall be a prvalue of unscoped enumeration or integral type. Given the expression E1[E2][E3], the result is an lvalue of the same type as the underlying element type of the matrix that refers to the value at row E2 and column E3 in the matrix. The expression E1 is sequenced before E2 and E3. The expressions E2 and E3 are unsequenced.
Note: We thought about providing an expression of the form `postfix-expression [expression]` to access columns of a matrix. We think that such an expression would be problematic once both column-major and row-major matrixes are supported: depending on the memory layout, either accessing columns or rows can be done efficiently, but not both. Instead, we propose to provide builtins to extract rows and columns from a matrix. This makes the operations more explicit.
Okay, so what happens if you do write matrix[0]?
That is not supported. We should probably state explicitly that single index operators on matrix values are ill-defined or something like that?
Yeah, you can say that incomplete subscripts can only be used as the base operand of another subscript. There’s a similar rule with member-function accesses in C++, which can only be the function operand of a call. You can enforce that rule reliably in Clang with a placeholder type.
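As a rough library-level analogy for the "incomplete subscript" rule, the C++ sketch below returns a proxy from the first subscript whose only operation is the second subscript. This is not how Clang implements it (a placeholder type lives in the compiler's type system), and a library proxy cannot be fully airtight; `Matrix` and `PartialSubscript` are hypothetical names, and the column-major storage is just an assumption for the example.

```cpp
#include <cassert>
#include <cstddef>

template <typename T, std::size_t Rows, std::size_t Cols>
class Matrix {
    T elems[Rows * Cols] = {};  // assumed column-major storage

public:
    // Result of m[row]. It is neither a row nor a column value; the only
    // thing it supports is applying the second subscript right away.
    class PartialSubscript {
        Matrix &owner;
        std::size_t row;

    public:
        PartialSubscript(Matrix &m, std::size_t r) : owner(m), row(r) {}
        PartialSubscript(const PartialSubscript &) = delete;
        PartialSubscript &operator=(const PartialSubscript &) = delete;

        // Rvalue-qualified: callable on the temporary produced by m[row],
        // but not on a named object that tried to hold on to it.
        T &operator[](std::size_t col) && {
            assert(col < Cols);
            return owner.elems[col * Rows + row];
        }
    };

    PartialSubscript operator[](std::size_t row) {
        assert(row < Rows);
        return {*this, row};  // initializes the result in place, no copy
    }
};
```

With this, `m[1][2] = 7;` works and yields an lvalue of the element type, while binding `m[1]` to a name and subscripting that name later does not compile.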
John.


On 26 Mar 2020, at 18:21, Florian Hahn wrote:
Arithmetic Conversions
The usual arithmetic conversions are extended as follows.
Insert at the start:
* If either operand is of matrix type, apply the usual arithmetic conversions using its underlying element type. The resulting type is a matrix type with that underlying element type.
This seems like a mistake, mostly because of the integer promotions. You really want short_matrix + short_matrix to yield an int_matrix, or -short_matrix to be an int_matrix? I would recommend that you specify this as working without integer promotion, so that the result of an operation on two integer operands just has (1) the greater rank and (2) is unsigned if there’s a signedness mismatch and the signed type can’t express the full range of the unsigned type.
The intention of choosing the rules for the underlying type was consistency with the existing behavior, but maybe it would be beneficial to forbid them. I think with both there might be a source of confusion, but it might be simpler to not allow promotion.
It’s an interesting question. In a totally different language, I would say that you shouldn’t allow mismatched binary operations or implicit conversions, but that explicit casts should be able to do arbitrary element-wise conversions. That’s not really consistent with C’s normal behavior, though. It also might be really pedantic around e.g. scalar multiplication by a literal: would you have to write short4x4 * (short) 2? On the other hand, presumably you wouldn’t want float4x4 * 2.0 to yield a double4x4.
I agree that ideally we would not allow mismatched binary operations to avoid surprises. It looks like the existing vector type does not perform implicit conversion for binary operations with 2 vector operands (unless there is a type mismatch and the data size matches, then the LHS type is chosen). For binary ops with vector and scalar operands, the scalar operand is converted to the vector element type. So short4 + short4 -> short4, short4 + int -> short4, short4 + int4 -> invalid.
Given that precedent, I think we should opt for not providing conversions for binary operators with two matrix operands. For the matrix and scalar versions, we could convert to the element type automatically for convenience. But at that point, the gain is probably quite small and it would be simpler not to do conversions in any case. Does that sound reasonable?
If there are no concerns about maintaining consistency with C’s normal behavior, I think we should require the operand types to match and I’ll update the patch with the draft spec accordingly (https://reviews.llvm.org/D76612).
Matrix Type Element Access Operator
An expression of the form `postfix-expression [expression][expression]` where the postfix-expression is of matrix type is a matrix element access expression. expression shall not be a comma expression, and shall be a prvalue of unscoped enumeration or integral type. Given the expression E1[E2][E3], the result is an lvalue of the same type as the underlying element type of the matrix that refers to the value at row E2 and column E3 in the matrix. The expression E1 is sequenced before E2 and E3. The expressions E2 and E3 are unsequenced.
Note: We thought about providing an expression of the form `postfix-expression [expression]` to access columns of a matrix. We think that such an expression would be problematic once both column-major and row-major matrixes are supported: depending on the memory layout, either accessing columns or rows can be done efficiently, but not both. Instead, we propose to provide builtins to extract rows and columns from a matrix. This makes the operations more explicit.
Okay, so what happens if you do write matrix[0]?
That is not supported. We should probably state explicitly that single index operators on matrix values are ill-defined or something like that?
Yeah, you can say that incomplete subscripts can only be used as the base operand of another subscript. There’s a similar rule with member-function accesses in C++, which can only be the function operand of a call. You can enforce that rule reliably in Clang with a placeholder type.
Great, thanks for pointing me to placeholder types! I’ve updated the Clang patch adding matrix indexing expression to use a new placeholder type to display a proper error message.
Cheers, Florian


On 1 Apr 2020, at 13:15, Florian Hahn wrote:
On Mar 26, 2020, at 23:02, John McCall <[hidden email]> wrote:
On 26 Mar 2020, at 18:21, Florian Hahn wrote:
Arithmetic Conversions
The usual arithmetic conversions are extended as follows.
Insert at the start:
* If either operand is of matrix type, apply the usual arithmetic conversions using its underlying element type. The resulting type is a matrix type with that underlying element type.
This seems like a mistake, mostly because of the integer promotions. You really want short_matrix + short_matrix to yield an int_matrix, or -short_matrix to be an int_matrix? I would recommend that you specify this as working without integer promotion, so that the result of an operation on two integer operands just has (1) the greater rank and (2) is unsigned if there’s a signedness mismatch and the signed type can’t express the full range of the unsigned type.
The intention of choosing the rules for the underlying type was consistency with the existing behavior, but maybe it would be beneficial to forbid them. I think with both there might be a source of confusion, but it might be simpler to not allow promotion.
It’s an interesting question. In a totally different language, I would say that you shouldn’t allow mismatched binary operations or implicit conversions, but that explicit casts should be able to do arbitrary element-wise conversions. That’s not really consistent with C’s normal behavior, though. It also might be really pedantic around e.g. scalar multiplication by a literal: would you have to write short4x4 * (short) 2? On the other hand, presumably you wouldn’t want float4x4 * 2.0 to yield a double4x4.
I agree that ideally we would not allow mismatched binary operations to avoid surprises. It looks like the existing vector type does not perform implicit conversion for binary operations with 2 vector operands (unless there is a type mismatch and the data size matches, then the LHS type is chosen). For binary ops with vector and scalar operands, the scalar operand is converted to the vector element type. So short4 + short4 -> short4, short4 + int -> short4, short4 + int4 -> invalid.
Given that precedent, I think we should opt for not providing conversions for binary operators with two matrix operands. For the matrix and scalar versions, we could convert to the element type automatically for convenience. But at that point, the gain is probably quite small and it would be simpler not to do conversions in any case. Does that sound reasonable?
I think converting scalars is a necessary convenience in C given that e.g. literals are normally of type int, but yes, I think it’s fine to not implicitly convert matrices as long as you have a way to convert them explicitly.
John.
If there are no concerns about maintaining consistency with C’s normal behavior, I think we should require the operand types to match and I’ll update the patch with the draft spec accordingly (https://reviews.llvm.org/D76612)
Matrix Type Element Access Operator
An expression of the form `postfix-expression [expression][expression]` where the postfix-expression is of matrix type is a matrix element access expression. expression shall not be a comma expression, and shall be a prvalue of unscoped enumeration or integral type. Given the expression E1[E2][E3], the result is an lvalue of the same type as the underlying element type of the matrix that refers to the value at row E2 and column E3 in the matrix. The expression E1 is sequenced before E2 and E3. The expressions E2 and E3 are unsequenced.
Note: We thought about providing an expression of the form `postfix-expression [expression]` to access columns of a matrix. We think that such an expression would be problematic once both column-major and row-major matrixes are supported: depending on the memory layout, either accessing columns or rows can be done efficiently, but not both. Instead, we propose to provide builtins to extract rows and columns from a matrix. This makes the operations more explicit.
Okay, so what happens if you do write matrix[0]?
That is not supported. We should probably state explicitly that single index operators on matrix values are ill-defined or something like that?
Yeah, you can say that incomplete subscripts can only be used as the base operand of another subscript. There’s a similar rule with member-function accesses in C++, which can only be the function operand of a call. You can enforce that rule reliably in Clang with a placeholder type.
Great, thanks for pointing me to placeholder types! I’ve updated the Clang patch adding matrix indexing expression to use a new placeholder type to display a proper error message.
Cheers,
Florian


> On Apr 1, 2020, at 1:23 PM, John McCall via cfe-dev <[hidden email]> wrote:
>
>> On 1 Apr 2020, at 13:15, Florian Hahn wrote:
>> I agree that ideally we would not allow mismatched binary operations to avoid surprises. It looks like the existing vector type does not perform implicit conversion for binary operations with 2 vector operands (unless there is a type mismatch and the data size matches, then the LHS type is chosen). For binary ops with vector and scalar operands, the scalar operand is converted to the vector element type. So short4 + short4 -> short4, short4 + int -> short4, short4 + int4 -> invalid.
>>
>> Given that precedent, I think we should opt for not providing conversions for binary operators with two matrix operands. For the matrix and scalar versions, we could convert to the element type automatically for convenience. But at that point, the gain is probably quite small and it would be simpler not to do conversions in any case. Does that sound reasonable?
>
> I think converting scalars is a necessary convenience in C given that e.g. literals are normally of type int, but yes, I think it’s fine to not implicitly convert matrices as long as you have a way to convert them explicitly.
For matrix operations implicit scalar conversions will be less important than they are for vectors, but they turned out to be really critical to have for vector programming. It may make sense to have them for matrices as well.
The main issue for vectors is that there’s no way to get a sub-int literal in C (i.e. there’s no i8 suffix). This meant that simple vector code like, e.g.:
uchar16 a, b;
uchar16 c = a & 0xf | b << 4;
Had to be written as:
uchar16 c = a & (uchar16)0xf | b << (uchar16)4;
Which, well, it really sucked before we added the conversion rule. This sort of thing is less common for matrix math than vector math, but it probably makes sense to match the semantics for convenience and consistency; there haven’t really been a lot of drawbacks to the vector semantics.
There should be no implicit conversions for matrix operands (and certainly not the C usual arithmetic conversions; approximately no one wants that behavior).
– Steve


> On Apr 1, 2020, at 20:48, Stephen Canon < [hidden email]> wrote:
>
>> On Apr 1, 2020, at 1:23 PM, John McCall via cfe-dev <[hidden email]> wrote:
>>
>>> On 1 Apr 2020, at 13:15, Florian Hahn wrote:
>>> I agree that ideally we would not allow mismatched binary operations to avoid surprises. It looks like the existing vector type does not perform implicit conversion for binary operations with 2 vector operands (unless there is a type mismatch and the data size matches, then the LHS type is chosen). For binary ops with vector and scalar operands, the scalar operand is converted to the vector element type. So short4 + short4 -> short4, short4 + int -> short4, short4 + int4 -> invalid.
>>>
>>> Given that precedent, I think we should opt for not providing conversions for binary operators with two matrix operands. For the matrix and scalar versions, we could convert to the element type automatically for convenience. But at that point, the gain is probably quite small and it would be simpler not to do conversions in any case. Does that sound reasonable?
>>
>> I think converting scalars is a necessary convenience in C given that e.g. literals are normally of type int, but yes, I think it’s fine to not implicitly convert matrices as long as you have a way to convert them explicitly.
>
Converting them explicitly should be covered by the standard conversion rules, right?
> For matrix operations implicit scalar conversions will be less important than they are for vectors, but they turned out to be really critical to have for vector programming. It may make sense to have them for matrices as well.
>
> The main issue for vectors is that there’s no way to get a sub-int literal in C (i.e. there’s no i8 suffix). This meant that simple vector code like, e.g.:
>
> uchar16 a, b;
> uchar16 c = a & 0xf | b << 4;
>
> Had to be written as:
>
> uchar16 c = a & (uchar16)0xf | b << (uchar16)4;
>
> Which, well, it really sucked before we added the conversion rule. This sort of thing is less common for matrix math than vector math, but it probably makes sense to match the semantics for convenience and consistency; there haven’t really been a lot of drawbacks to the vector semantics.
>
> There should be no implicit conversions for matrix operands (and certainly not the C usual arithmetic conversions; approximately no one wants that behavior).
>
Thanks, I’ve updated the arithmetic conversion rules in https://reviews.llvm.org/D76612 with the text below:
* If both operands are of matrix type, no arithmetic conversion is performed.
* If one operand is of matrix type and the other operand is of an integer or
floating point type, convert the integer or floating point operand to the
underlying element type of the operand of matrix type.
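The two rules above can be mimicked with ordinary C++ overloads, which may help when reasoning about what they accept and reject. This is only a sketch under stated assumptions, not the Clang implementation: `Matrix` is a hypothetical stand-in type, and `operator+` and `operator*` are chosen as representative binary operators.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <type_traits>

// Hypothetical stand-in for a matrix value.
template <typename T, std::size_t Rows, std::size_t Cols>
struct Matrix {
    std::array<T, Rows * Cols> elems{};
};

// Rule 1: both operands are matrices. There is one deduced T, so the operand
// types must match exactly; mixing Matrix<short> and Matrix<int> does not compile.
template <typename T, std::size_t R, std::size_t C>
Matrix<T, R, C> operator+(const Matrix<T, R, C> &a, const Matrix<T, R, C> &b) {
    Matrix<T, R, C> out;
    for (std::size_t i = 0; i < R * C; ++i)
        // The scalar addition itself may promote (e.g. short to int); the
        // result is narrowed back so the matrix element type is preserved.
        out.elems[i] = static_cast<T>(a.elems[i] + b.elems[i]);
    return out;
}

// Rule 2: one matrix and one integer or floating-point scalar. The scalar is
// converted to the matrix element type before the operation.
template <typename T, std::size_t R, std::size_t C, typename S,
          typename = typename std::enable_if<std::is_arithmetic<S>::value>::type>
Matrix<T, R, C> operator*(const Matrix<T, R, C> &m, S scalar) {
    const T s = static_cast<T>(scalar);
    Matrix<T, R, C> out;
    for (std::size_t i = 0; i < R * C; ++i)
        out.elems[i] = static_cast<T>(m.elems[i] * s);
    return out;
}
```

Under this sketch a short matrix times the literal 2 stays a short matrix and a float matrix times 2.0 stays a float matrix, which matches the convenience argument made for vectors earlier in the thread.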
Cheers,
Florian


On 2 Apr 2020, at 9:47, Florian Hahn wrote:
On Apr 1, 2020, at 20:48, Stephen Canon <[hidden email]> wrote:
On Apr 1, 2020, at 1:23 PM, John McCall via cfe-dev <[hidden email]> wrote:
On 1 Apr 2020, at 13:15, Florian Hahn wrote:
I agree that ideally we would not allow mismatched binary operations to avoid surprises. It looks like the existing vector type does not perform implicit conversion for binary operations with 2 vector operands (unless there is a type mismatch and the data size matches, then the LHS type is chosen). For binary ops with vector and scalar operands, the scalar operand is converted to the vector element type. So short4 + short4 -> short4, short4 + int -> short4, short4 + int4 -> invalid.
Given that precedent, I think we should opt for not providing conversions for binary operators with two matrix operands. For the matrix and scalar versions, we could convert to the element type automatically for convenience. But at that point, the gain is probably quite small and it would be simpler not to do conversions in any case. Does that sound reasonable?
I think converting scalars is a necessary convenience in C given that e.g. literals are normally of type int, but yes, I think it’s fine to not implicitly convert matrices as long as you have a way to convert them explicitly.
Converting them explicitly should be covered by the standard conversion rules, right?
Do you want matrices to be implicitly convertible in non-operator contexts? Like, if you assign a float4x4 to an int4x4, should that implicitly convert the elements or be ill-formed?
John.
For matrix operations implicit scalar conversions will be less important than they are for vectors, but they turned out to be really critical to have for vector programming. It may make sense to have them for matrices as well.
The main issue for vectors is that there’s no way to get a sub-int literal in C (i.e. there’s no i8 suffix). This meant that simple vector code like, e.g.:
uchar16 a, b;
uchar16 c = a & 0xf | b << 4;
Had to be written as:
uchar16 c = a & (uchar16)0xf | b << (uchar16)4;
Which, well, it really sucked before we added the conversion rule. This sort of thing is less common for matrix math than vector math, but it probably makes sense to match the semantics for convenience and consistency; there haven’t really been a lot of drawbacks to the vector semantics.
There should be no implicit conversions for matrix operands (and certainly not the C usual arithmetic conversions; approximately no one wants that behavior).
Thanks, I’ve updated the arithmetic conversion rules in https://reviews.llvm.org/D76612 with the text below:
* If both operands are of matrix type, no arithmetic conversion is performed.
* If one operand is of matrix type and the other operand is of an integer or
floating point type, convert the integer or floating point operand to the
underlying element type of the operand of matrix type.
Cheers,
Florian


On 2 Apr 2020, at 9:47, Florian Hahn wrote:
On Apr 1, 2020, at 20:48, Stephen Canon <[hidden email]> wrote:
On Apr 1, 2020, at 1:23 PM, John McCall via cfe-dev <[hidden email]> wrote:
On 1 Apr 2020, at 13:15, Florian Hahn wrote:
I agree that ideally we would not allow mismatched binary operations to avoid surprises. It looks like the existing vector type does not perform implicit conversion for binary operations with 2 vector operands (unless there is a type mismatch and the data size matches, then the LHS type is chosen). For binary ops with vector and scalar operands, the scalar operand is converted to the vector element type. So short4 + short4 -> short4, short4 + int -> short4, short4 + int4 -> invalid.
Given that precedent, I think we should opt for not providing conversions for binary operators with two matrix operands. For the matrix and scalar versions, we could convert to the element type automatically for convenience. But at that point, the gain is probably quite small and it would be simpler not to do conversions in any case. Does that sound reasonable?
I think converting scalars is a necessary convenience in C given that e.g. literals are normally of type int, but yes, I think it’s fine to not implicitly convert matrices as long as you have a way to convert them explicitly.
Converting them explicitly should be covered by the standard conversion rules, right?
Do you want matrices to be implicitly convertible in non-operator contexts? Like, if you assign a float4x4 to an int4x4, should that implicitly convert the elements or be ill-formed?
I think the current formulation allows for implicit conversions for matrixes.
But given the recent changes in the arithmetic context, it might be better to only allow explicit conversions. Having implicit conversions for non-operator contexts and not in operator contexts seems a bit inconsistent. I think we should keep the conversion rules based on the element types, but limit them to explicit conversions. Does that make sense?
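One way to picture "element-type conversion rules, but explicit only" is a C++ class with an `explicit` converting constructor. The sketch below is an analogy, not the proposed specification text: `Matrix` and the aliases `float4x4`, `int4x4` and `int2x2` are hypothetical. It checks at compile time that an explicit cast between element types is accepted, that an implicit conversion is rejected, and that a dimension mismatch is rejected either way.

```cpp
#include <array>
#include <cstddef>
#include <type_traits>

// Hypothetical stand-in for a matrix value.
template <typename T, std::size_t Rows, std::size_t Cols>
struct Matrix {
    std::array<T, Rows * Cols> elems{};

    Matrix() = default;

    // Element-by-element conversion from a matrix with a different element
    // type but the same dimensions. `explicit` means it is only reachable
    // through a cast or direct initialization, never implicitly.
    template <typename U>
    explicit Matrix(const Matrix<U, Rows, Cols> &other) {
        for (std::size_t i = 0; i < Rows * Cols; ++i)
            elems[i] = static_cast<T>(other.elems[i]);
    }
};

using float4x4 = Matrix<float, 4, 4>;
using int4x4 = Matrix<int, 4, 4>;
using int2x2 = Matrix<int, 2, 2>;

// static_cast<int4x4>(some_float4x4) is allowed ...
static_assert(std::is_constructible<int4x4, const float4x4 &>::value,
              "explicit element-wise conversion is available");
// ... but `int4x4 m = some_float4x4;` is not.
static_assert(!std::is_convertible<float4x4, int4x4>::value,
              "no implicit conversion between matrix types");
// Differing row or column counts are rejected even with a cast.
static_assert(!std::is_constructible<int2x2, const float4x4 &>::value,
              "dimensions must match");
```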
Cheers, Florian

