How to generate Unique Module identifier


Vassil Vassilev via cfe-dev
Hi All,

There have been recent discussions about how to generate unique module identifiers which can be embedded in AIX static init function names.
 
On AIX, static init functions are sinit/sterm pairs looking like this:

__sinit<priority #>_<unique module identifier>
__sterm<priority #>_<unique module identifier>


There is one sinit/sterm pair per priority number for each module.
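As a minimal sketch of the naming scheme above, the pair of names for one priority could be built as follows. The helper name `makeStaticInitName` and the decimal rendering of the priority are assumptions made for illustration; the exact encoding of the priority field is not specified in this message.

```cpp
#include <string>

// Hypothetical helper: builds "__sinit<priority>_<id>" or
// "__sterm<priority>_<id>". Rendering the priority in decimal is an
// assumption for illustration, not the confirmed AIX encoding.
std::string makeStaticInitName(bool isInit, unsigned priority,
                               const std::string &moduleId) {
  std::string name = isInit ? "__sinit" : "__sterm";
  name += std::to_string(priority);
  name += '_';
  name += moduleId;
  return name;
}
```

Two modules that end up with the same `moduleId` would produce the same names for a given priority, which is exactly the collision the linker cannot resolve.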

The AIX linker collects static init functions purely by name, so each module must have its own uniquely named sinit/sterm pairs. To achieve that, we need a unique module identifier to use as the suffix of each static init function name.

Our thoughts so far are as follows:

1. `getUniqueModuleId` function to generate unique module identifier
https://llvm.org/doxygen/ModuleUtils_8cpp_source.html#l00255

“Produce unique identifier for a module by taking the MD5 sum of the names of the module's strong external symbols. However, if the module has no strong external symbols (such a module may still have a semantic effect if it performs global initialization), we cannot produce a unique identifier for this module, so we return the empty string.”
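The behaviour described in that comment can be modelled roughly as below. This is a simplified sketch, not the LLVM implementation: the `Symbol` and `Linkage` types and the function names are hypothetical stand-ins for LLVM's global values, and a 64-bit FNV-1a hash is swapped in for MD5 to keep the example self-contained.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-ins for LLVM's linkage kinds and global values.
enum class Linkage { External, Internal, WeakODR };

struct Symbol {
  std::string name;
  Linkage linkage;
  bool isDefinition;
};

// Feeds one name into a running FNV-1a state (stand-in for MD5).
std::uint64_t hashSymbolName(std::uint64_t state, const std::string &name) {
  const std::uint64_t prime = 1099511628211ULL;
  for (unsigned char c : name) {
    state ^= c;
    state *= prime;
  }
  state *= prime; // separator byte 0: XOR with 0 leaves the state unchanged
  return state;
}

// Returns a hex identifier derived from the names of defined symbols with
// external linkage, or the empty string if the module has none.
std::string uniqueModuleIdSketch(const std::vector<Symbol> &symbols) {
  std::uint64_t state = 14695981039346656037ULL; // FNV-1a offset basis
  bool exportsSymbols = false;
  for (const Symbol &s : symbols) {
    if (!s.isDefinition || s.linkage != Linkage::External)
      continue; // only strong external definitions contribute
    exportsSymbols = true;
    state = hashSymbolName(state, s.name);
  }
  if (!exportsSymbols)
    return "";
  static const char digits[] = "0123456789abcdef";
  std::string id(16, '0');
  for (int i = 15; i >= 0; --i) {
    id[i] = digits[state & 0xF];
    state >>= 4;
  }
  return id;
}
```

A module that defines only an internal-linkage variable yields the empty string under this model, which is the failure mode discussed next.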

Issues with this `getUniqueModuleId` function are:
(1) This function ignores global variables with `Internal` or `WeakODR` linkage, so it returns an empty string in cases like the following:
1)
class test {
public:
    test();
    ~test();
};

static test t;  // Internal linkage


2)
extern "C" int puts(const char *);

template <typename = void>
struct A {
  A() { puts("hello\n"); }
  ~A() { puts("bye\n"); }
  static A instance;
};

template <typename T> A<T> A<T>::instance;
template A<> A<>::instance;   // WeakODR linkage


(2) Even if we add our own version of `getUniqueModuleId` that takes the above linkage types into account, the biggest issue remains: content-based hashing cannot distinguish modules whose internal-linkage content is identical, so such modules would still get the same identifier.
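To illustrate issue (2), here is a minimal sketch of a hypothetical content-based variant that hashes every defined symbol name regardless of linkage. The function name `contentHashSketch` and the way names are combined are assumptions for illustration only.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Hypothetical content-based identifier: combines the hashes of all defined
// symbol names in order, regardless of linkage.
std::size_t contentHashSketch(const std::vector<std::string> &definedNames) {
  std::size_t h = 0;
  for (const std::string &name : definedNames)
    h = h * 31 + std::hash<std::string>{}(name);
  return h;
}
```

If two different source files each contain nothing but `static test t;`, both would pass the same list of names to such a function and get the same result, so hashing content alone cannot tell them apart.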


2. Source filename string as the module identifier
The `source filename` string is set to the original module identifier, which will be the name of the compiled source file when compiling from source through the clang front end. [ https://releases.llvm.org/10.0.0/docs/LangRef.html#source-filename ]

That means that if multiple objects are compiled with the same source file path on the command line, they get the same module identifier, so the static init function names are not guaranteed to be unique.

Also, there is the "Unique Names for Functions with Internal Linkage" patch, whose solution does not guarantee uniqueness either.

3. Using the information around the compilation process itself
Using information about the compilation process itself (PID, timestamp) can give us unique module identifiers, but it could be problematic for reproducibility.

4. Source file full path + output file name following the -o option
Another promising option is to use the source file full path plus the output file name following the -o option, either as something to hash on or directly as a suffix for static init functions on AIX.

So far we have not found any precedent for this in LLVM. It would also require us to pass the output file name given to -o from `FrontendOpts` to `llvm::Module`, just as we pass each `Input` from `FrontendOpts.Inputs` to `llvm::Module` as SourceFileName.
https://llvm.org/doxygen/Module_8cpp_source.html#l00073
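A minimal sketch of option 4, assuming the two paths are already available as strings. The function name `moduleIdFromPaths` is hypothetical, and a 64-bit FNV-1a hash is used here only to keep the example self-contained; the choice of hash is not something settled in this message.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: derives a 16-digit hex identifier from the source
// file path and the -o output file name. A 0 byte separates the two paths
// so that ("ab", "c") and ("a", "bc") are hashed as different inputs.
std::string moduleIdFromPaths(const std::string &sourcePath,
                              const std::string &outputPath) {
  std::uint64_t h = 14695981039346656037ULL; // FNV-1a offset basis
  const std::uint64_t prime = 1099511628211ULL;
  auto mix = [&](unsigned char byte) {
    h ^= byte;
    h *= prime;
  };
  for (unsigned char c : sourcePath)
    mix(c);
  mix(0);
  for (unsigned char c : outputPath)
    mix(c);
  static const char digits[] = "0123456789abcdef";
  std::string id(16, '0');
  for (int i = 15; i >= 0; --i) {
    id[i] = digits[h & 0xF];
    h >>= 4;
  }
  return id;
}
```

With this scheme, compiling the same source path twice with different -o names gives different identifiers, while repeating the same command gives the same identifier, which keeps builds reproducible.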



Any thoughts about what to hash on or encode into the unique ID we need?

Please let me know if there are any questions as well. Your feedback is appreciated.
 
Regards,
 
Xiangling Liao

_______________________________________________
cfe-dev mailing list
[hidden email]
https://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

Fwd: How to generate Unique Module identifier

Vassil Vassilev via cfe-dev
ping.

---------- Forwarded message ---------
From: Xiangling Liao <[hidden email]>
Date: Fri, May 29, 2020 at 3:15 PM
Subject: [cfe-dev] How to generate Unique Module identifier
To: <[hidden email]>
Cc: <[hidden email]>



Re: How to generate Unique Module identifier

Vassil Vassilev via cfe-dev


On Wed, Jun 3, 2020 at 8:30 AM Xiangling Liao <[hidden email]> wrote:
2. Source filename string as the module identifier
The `source filename` string is set to the original module identifier, which will be the name of the compiled source file when compiling from source through the clang front end. [ https://releases.llvm.org/10.0.0/docs/LangRef.html#source-filename ]

That means if we have multiple objects compiled with the same command-line source file path, we have same module identifiers. The static init functions are not guaranteed to be unique.

Also, there's Unique Names for Functions with Internal Linkage patch, whose solution does not guarantee uniqueness either.

I just have a few thoughts.  I worked on the unique names patch for internal linkage functions.

1)  Low probability of collisions:  I was only interested in reducing the probability of internal linkage functions getting the same names.  In the context of PGO/FDO, this is useful because the profile information can be attributed to the right instance of the function.  While the unique names solution does not guarantee uniqueness, it makes collisions very unlikely in practice.

2) Name stability:  We do not want the symbol names to constantly change either and there should be some amount of stability.  This is because we generate profiles with one version of the source and use it to optimize a later version.  Name changes across versions could make the profiles for those functions useless.  In your case, how important is stability?

3) Using the file system's attributes where possible:  Just spitballing here, how about including, say, the inode number in the hash for the symbol on Linux, and similar attributes on other file systems?  It looks like this could be kept stable and would handle the problem of identical source names.
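As a rough sketch of the inode idea, the POSIX `stat` call exposes both the device and inode numbers of a file. The struct and function names below are hypothetical, and how these values would be fed into the symbol hash is left open. Note that a fresh copy or checkout of the same source normally gets a different inode, so the stability of this input depends on the file itself staying in place.

```cpp
#include <sys/stat.h>

#include <cstdint>
#include <string>

// Hypothetical record of the file system attributes that could be hashed.
struct FileIdentitySketch {
  std::uint64_t device;
  std::uint64_t inode;
};

// Fills `out` with the device and inode numbers of `path` using POSIX stat.
// Returns false if the file cannot be examined.
bool queryFileIdentity(const std::string &path, FileIdentitySketch &out) {
  struct stat st;
  if (::stat(path.c_str(), &st) != 0)
    return false;
  out.device = static_cast<std::uint64_t>(st.st_dev);
  out.inode = static_cast<std::uint64_t>(st.st_ino);
  return true;
}
```

The device number is included because inode numbers are only unique within a single file system.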
 

Thanks
Sri


Re: How to generate Unique Module identifier

Vassil Vassilev via cfe-dev
On Wed, Jun 3, 2020 at 3:10 PM Sriraman Tallam <[hidden email]> wrote:


I just have a few thoughts.  I worked on the unique names patch for internal linkage functions.

1)  Low probability of collisions:  I was only interested in reducing the probability of internal linkage functions getting the same names.  In the context of PGO/FDO, this is useful because the profile information can be attributed to the right instance of the function.  While the unique names solution does not guarantee uniqueness, it makes collisions very unlikely in practice.
We would need a stronger uniqueness guarantee than this. It would not be good for packaged static libraries to have symbols that collide by accident with the user program or with other static libraries.
 

2) Name stability:  We do not want the symbol names to constantly change either and there should be some amount of stability.  This is because we generate profiles with one version of the source and use it to optimize a later version.  Name changes across versions could make the profiles for those functions useless.  In your case, how important is stability?
Stability is important for keeping the relative ordering of C++ initialization/destruction of non-local variables reasonably consistent between builds.
 

3) Using the file system's attributes where possible:  Just spitballing here, how about including, say, the inode number in the hash for the symbol on Linux, and similar attributes on other file systems?  It looks like this could be kept stable and would handle the problem of identical source names.
I believe we would want stability to hold even if the original source tree is moved to another directory and a different source tree is placed where the original one was in the directory structure.
 
 

Thanks
Sri
