RFC: Module file extensions

classic Classic list List threaded Threaded
3 messages Options
Reply | Threaded
Open this post in threaded view
|

RFC: Module file extensions

Don Hinton via cfe-dev
Hi all,

Modules provide a useful place to cache the results of parsing a library’s headers for later, efficient consumption. Clang does this for all of the information it gathers during parsing, including the ASTs, preprocessor state, lookup tables, comments, and so on, with the intent to minimize the amount of deserialization that occurs when a particular module is being used.

I’d like to extend this capability so that tools built on top of Clang can store their own information in the module file on disk. This mechanism would be used when there is information that can be computed at module-build time that would be expensive to recompute in every user of the module. For example, function summaries for the static analyzer, application-specific indexes that would require deserializing the entire module file to recompute. One could compute this information and put it into a separate file or database, but given that the information is naturally tied to modules—and generally needs to be invalidated at the same time the module file itself needs to be rebuilt—it makes more architectural sense to put that information directly in the module file when it is built, while the full AST is still efficiently accessible in memory.

module file extension is a bit of custom logic that can piggy-back data into a module file. Each module file extension is described by a unique block name identifying the extension, as well as other metadata (major/minor version, user information string) about the extension itself. Each module file extension gets to write into its own separate extension block in the resulting module file, separate from the rest of the module file contents and from other extensions. To do that, it provides both a writer (that writes bitstream records into the output file) and a reader (that can read back those bitstream records).

I’ve attached an implementation of module file extensions. It sketches out the interface to a module file extension (see below, or check out ModuleFileExtension.h for the full interface) and implements a module file extension for testing purposes so we can illustrate the round-tripping of data through the module file format, matching of extension blocks written to extension blocks read, and so on. 

/// Metadata for a module file extension.
struct ModuleFileExtensionMetadata {
  /// The name used to identify this particular extension block within
  /// the resulting module file. It should be unique to the particular
  /// extension, because this name will be used to match the name of
  /// an extension block to the appropriate reader.
  std::string BlockName;

  /// The major version of the extension data.
  unsigned MajorVersion;

  /// The minor version of the extension data.
  unsigned MinorVersion;

  /// A string containing additional user information that will be
  /// stored with the metadata.
  std::string UserInfo;
};

/// An abstract superclass that describes a custom extension to the
/// module/precompiled header file format.
///
/// A module file extension can introduce additional information into
/// compiled module files (.pcm) and precompiled headers (.pch) via a
/// custom writer that can then be accessed via a custom reader when
/// the module file or precompiled header is loaded.
class ModuleFileExtension : public llvm::RefCountedBase<ModuleFileExtension> {
public:
  virtual ~ModuleFileExtension();

  /// Retrieves the metadata for this module file extension.
  virtual ModuleFileExtensionMetadata getExtensionMetadata() const = 0;

  /// Hash information about the presence of this extension into the
  /// module hash code.
  ///
  /// The module hash code is used to distinguish different variants
  /// of a module that are incompatible. If the presence, absence, or
  /// version of the module file extension should force the creation
  /// of a separate set of module files, override this method to
  /// combine that distinguishing information into the module hash
  /// code.
  ///
  /// The default implementation of this function simply returns the
  /// hash code as given, so the presence/absence of this extension
  /// does not distinguish module files.
  virtual llvm::hash_code hashExtension(llvm::hash_code Code) const;

  /// Create a new module file extension writer, which will be
  /// responsible for writing the extension contents into a particular
  /// module file.
  virtual std::unique_ptr<ModuleFileExtensionWriter>
  createExtensionWriter() = 0;

  /// Create a new module file extension reader, given the
  /// metadata read from the block and the cursor into the extension
  /// block.
  ///
  /// May return null to indicate that an extension block with the
  /// given metadata cannot be read.
  virtual std::unique_ptr<ModuleFileExtensionReader>
  createExtensionReader(const ModuleFileExtensionMetadata &Metadata,
                        ASTReader &Reader, serialization::ModuleFile &Mod,
                        const llvm::BitstreamCursor &Stream) = 0;
};

/// Abstract base class that writes a module file extension block into
/// a module file.
class ModuleFileExtensionWriter {
  ModuleFileExtension *Extension;

protected:
  ModuleFileExtensionWriter(ModuleFileExtension *Extension)
    : Extension(Extension) { }

public:
  virtual ~ModuleFileExtensionWriter();

  /// Retrieve the module file extension with which this writer is
  /// associated.
  ModuleFileExtension *getExtension() const { return Extension; }

  /// Write the contents of the extension block into the given bitstream.
  ///
  /// Responsible for writing the contents of the extension into the
  /// given stream. All of the contents should be written into custom
  /// records with IDs >= FIRST_EXTENSION_RECORD_ID.
  virtual void writeExtensionContents(llvm::BitstreamWriter &Stream) = 0;
};

/// Abstract base class that reads a module file extension block from
/// a module file.
///
/// Subclasses 
class ModuleFileExtensionReader {
  ModuleFileExtension *Extension;

protected:
  ModuleFileExtensionReader(ModuleFileExtension *Extension)
    : Extension(Extension) { }

public:
  /// Retrieve the module file extension with which this reader is
  /// associated.
  ModuleFileExtension *getExtension() const { return Extension; }

  virtual ~ModuleFileExtensionReader();
};

I suspect that the Reader and Writer interfaces will grow somewhat as we get more clients, but this is a start.

Thoughts?

- Doug



_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

0001-Introduce-module-file-extensions-to-piggy-back-data-.patch (88K) Download Attachment
Reply | Threaded
Open this post in threaded view
|

Re: RFC: Module file extensions

Don Hinton via cfe-dev
I'd be curious to learn more about why you need to have it in one file. Invalidation is something build systems usually take care of, and for those it seems mostly the same whether they invalidate one or two files (they already need to invalidate a multitude of files, like .o files, module files, linked libraries and binaries, and generated code).
I could see an argument for deployment, but for deployment of libraries you'll need the module and its headers anyway.
Given that we already have a way to embed sources into the modules, I can see that you might be able to build full deployable bundles, where a module file is the only thing you need, and it includes the headers, potentially .o files to link in, and other information (like the one you cite).
So, all that said, I'm mainly curious whether that's what you're aiming for :)

On Tue, Oct 27, 2015 at 5:17 AM Douglas Gregor via cfe-dev <[hidden email]> wrote:
Hi all,

Modules provide a useful place to cache the results of parsing a library’s headers for later, efficient consumption. Clang does this for all of the information it gathers during parsing, including the ASTs, preprocessor state, lookup tables, comments, and so on, with the intent to minimize the amount of deserialization that occurs when a particular module is being used.

I’d like to extend this capability so that tools built on top of Clang can store their own information in the module file on disk. This mechanism would be used when there is information that can be computed at module-build time that would be expensive to recompute in every user of the module. For example, function summaries for the static analyzer, application-specific indexes that would require deserializing the entire module file to recompute. One could compute this information and put it into a separate file or database, but given that the information is naturally tied to modules—and generally needs to be invalidated at the same time the module file itself needs to be rebuilt—it makes more architectural sense to put that information directly in the module file when it is built, while the full AST is still efficiently accessible in memory.

module file extension is a bit of custom logic that can piggy-back data into a module file. Each module file extension is described by a unique block name identifying the extension, as well as other metadata (major/minor version, user information string) about the extension itself. Each module file extension gets to write into its own separate extension block in the resulting module file, separate from the rest of the module file contents and from other extensions. To do that, it provides both a writer (that writes bitstream records into the output file) and a reader (that can read back those bitstream records).

I’ve attached an implementation of module file extensions. It sketches out the interface to a module file extension (see below, or check out ModuleFileExtension.h for the full interface) and implements a module file extension for testing purposes so we can illustrate the round-tripping of data through the module file format, matching of extension blocks written to extension blocks read, and so on. 

/// Metadata for a module file extension.
struct ModuleFileExtensionMetadata {
  /// The name used to identify this particular extension block within
  /// the resulting module file. It should be unique to the particular
  /// extension, because this name will be used to match the name of
  /// an extension block to the appropriate reader.
  std::string BlockName;

  /// The major version of the extension data.
  unsigned MajorVersion;

  /// The minor version of the extension data.
  unsigned MinorVersion;

  /// A string containing additional user information that will be
  /// stored with the metadata.
  std::string UserInfo;
};

/// An abstract superclass that describes a custom extension to the
/// module/precompiled header file format.
///
/// A module file extension can introduce additional information into
/// compiled module files (.pcm) and precompiled headers (.pch) via a
/// custom writer that can then be accessed via a custom reader when
/// the module file or precompiled header is loaded.
class ModuleFileExtension : public llvm::RefCountedBase<ModuleFileExtension> {
public:
  virtual ~ModuleFileExtension();

  /// Retrieves the metadata for this module file extension.
  virtual ModuleFileExtensionMetadata getExtensionMetadata() const = 0;

  /// Hash information about the presence of this extension into the
  /// module hash code.
  ///
  /// The module hash code is used to distinguish different variants
  /// of a module that are incompatible. If the presence, absence, or
  /// version of the module file extension should force the creation
  /// of a separate set of module files, override this method to
  /// combine that distinguishing information into the module hash
  /// code.
  ///
  /// The default implementation of this function simply returns the
  /// hash code as given, so the presence/absence of this extension
  /// does not distinguish module files.
  virtual llvm::hash_code hashExtension(llvm::hash_code Code) const;

  /// Create a new module file extension writer, which will be
  /// responsible for writing the extension contents into a particular
  /// module file.
  virtual std::unique_ptr<ModuleFileExtensionWriter>
  createExtensionWriter() = 0;

  /// Create a new module file extension reader, given the
  /// metadata read from the block and the cursor into the extension
  /// block.
  ///
  /// May return null to indicate that an extension block with the
  /// given metadata cannot be read.
  virtual std::unique_ptr<ModuleFileExtensionReader>
  createExtensionReader(const ModuleFileExtensionMetadata &Metadata,
                        ASTReader &Reader, serialization::ModuleFile &Mod,
                        const llvm::BitstreamCursor &Stream) = 0;
};

/// Abstract base class that writes a module file extension block into
/// a module file.
class ModuleFileExtensionWriter {
  ModuleFileExtension *Extension;

protected:
  ModuleFileExtensionWriter(ModuleFileExtension *Extension)
    : Extension(Extension) { }

public:
  virtual ~ModuleFileExtensionWriter();

  /// Retrieve the module file extension with which this writer is
  /// associated.
  ModuleFileExtension *getExtension() const { return Extension; }

  /// Write the contents of the extension block into the given bitstream.
  ///
  /// Responsible for writing the contents of the extension into the
  /// given stream. All of the contents should be written into custom
  /// records with IDs >= FIRST_EXTENSION_RECORD_ID.
  virtual void writeExtensionContents(llvm::BitstreamWriter &Stream) = 0;
};

/// Abstract base class that reads a module file extension block from
/// a module file.
///
/// Subclasses 
class ModuleFileExtensionReader {
  ModuleFileExtension *Extension;

protected:
  ModuleFileExtensionReader(ModuleFileExtension *Extension)
    : Extension(Extension) { }

public:
  /// Retrieve the module file extension with which this reader is
  /// associated.
  ModuleFileExtension *getExtension() const { return Extension; }

  virtual ~ModuleFileExtensionReader();
};

I suspect that the Reader and Writer interfaces will grow somewhat as we get more clients, but this is a start.

Thoughts?

- Doug

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev
Reply | Threaded
Open this post in threaded view
|

Re: RFC: Module file extensions

Don Hinton via cfe-dev

On Oct 26, 2015, at 11:58 PM, Manuel Klimek <[hidden email]> wrote:

I'd be curious to learn more about why you need to have it in one file. Invalidation is something build systems usually take care of, and for those it seems mostly the same whether they invalidate one or two files (they already need to invalidate a multitude of files, like .o files, module files, linked libraries and binaries, and generated code).

With one exception that I know of, build systems don’t handle module files at all: Clang handles them all internally via file-level locking. Clang has logic to invalidate and rebuild module files when they go out of date, clean up stale module files that might be lying around in the module cache, etc. For an external tool to have to duplicate that work—managing another set of files that mirror the module files, with their own invalidation/cleanup rules—would be a buggy mess. Hence, putting that information directly into the module file let’s Clang handle the invalidation logic, and the tool just gets a scratch pad in there to store its per-module results.

I could see an argument for deployment, but for deployment of libraries you'll need the module and its headers anyway.
Given that we already have a way to embed sources into the modules, I can see that you might be able to build full deployable bundles, where a module file is the only thing you need, and it includes the headers, potentially .o files to link in, and other information (like the one you cite).

I’m not motivated by the deployment aspects, although you’re right that one could probably use module file extensions to help with some of them.

So, all that said, I'm mainly curious whether that's what you're aiming for :)

My primary motivation is that I need another kind of index stored within the module file. It naturally fits in the module file, because it cross-references the specific declarations stored within that module file (e.g., by their local declaration ID number) for fast lookup from an external tool. It’s completely natural to rebuild the index along with building the module file, since it’s indexing that module specifically.

- Doug


On Tue, Oct 27, 2015 at 5:17 AM Douglas Gregor via cfe-dev <[hidden email]> wrote:
Hi all,

Modules provide a useful place to cache the results of parsing a library’s headers for later, efficient consumption. Clang does this for all of the information it gathers during parsing, including the ASTs, preprocessor state, lookup tables, comments, and so on, with the intent to minimize the amount of deserialization that occurs when a particular module is being used.

I’d like to extend this capability so that tools built on top of Clang can store their own information in the module file on disk. This mechanism would be used when there is information that can be computed at module-build time that would be expensive to recompute in every user of the module. For example, function summaries for the static analyzer, application-specific indexes that would require deserializing the entire module file to recompute. One could compute this information and put it into a separate file or database, but given that the information is naturally tied to modules—and generally needs to be invalidated at the same time the module file itself needs to be rebuilt—it makes more architectural sense to put that information directly in the module file when it is built, while the full AST is still efficiently accessible in memory.

module file extension is a bit of custom logic that can piggy-back data into a module file. Each module file extension is described by a unique block name identifying the extension, as well as other metadata (major/minor version, user information string) about the extension itself. Each module file extension gets to write into its own separate extension block in the resulting module file, separate from the rest of the module file contents and from other extensions. To do that, it provides both a writer (that writes bitstream records into the output file) and a reader (that can read back those bitstream records).

I’ve attached an implementation of module file extensions. It sketches out the interface to a module file extension (see below, or check out ModuleFileExtension.h for the full interface) and implements a module file extension for testing purposes so we can illustrate the round-tripping of data through the module file format, matching of extension blocks written to extension blocks read, and so on. 

/// Metadata for a module file extension.
struct ModuleFileExtensionMetadata {
  /// The name used to identify this particular extension block within
  /// the resulting module file. It should be unique to the particular
  /// extension, because this name will be used to match the name of
  /// an extension block to the appropriate reader.
  std::string BlockName;

  /// The major version of the extension data.
  unsigned MajorVersion;

  /// The minor version of the extension data.
  unsigned MinorVersion;

  /// A string containing additional user information that will be
  /// stored with the metadata.
  std::string UserInfo;
};

/// An abstract superclass that describes a custom extension to the
/// module/precompiled header file format.
///
/// A module file extension can introduce additional information into
/// compiled module files (.pcm) and precompiled headers (.pch) via a
/// custom writer that can then be accessed via a custom reader when
/// the module file or precompiled header is loaded.
class ModuleFileExtension : public llvm::RefCountedBase<ModuleFileExtension> {
public:
  virtual ~ModuleFileExtension();

  /// Retrieves the metadata for this module file extension.
  virtual ModuleFileExtensionMetadata getExtensionMetadata() const = 0;

  /// Hash information about the presence of this extension into the
  /// module hash code.
  ///
  /// The module hash code is used to distinguish different variants
  /// of a module that are incompatible. If the presence, absence, or
  /// version of the module file extension should force the creation
  /// of a separate set of module files, override this method to
  /// combine that distinguishing information into the module hash
  /// code.
  ///
  /// The default implementation of this function simply returns the
  /// hash code as given, so the presence/absence of this extension
  /// does not distinguish module files.
  virtual llvm::hash_code hashExtension(llvm::hash_code Code) const;

  /// Create a new module file extension writer, which will be
  /// responsible for writing the extension contents into a particular
  /// module file.
  virtual std::unique_ptr<ModuleFileExtensionWriter>
  createExtensionWriter() = 0;

  /// Create a new module file extension reader, given the
  /// metadata read from the block and the cursor into the extension
  /// block.
  ///
  /// May return null to indicate that an extension block with the
  /// given metadata cannot be read.
  virtual std::unique_ptr<ModuleFileExtensionReader>
  createExtensionReader(const ModuleFileExtensionMetadata &Metadata,
                        ASTReader &Reader, serialization::ModuleFile &Mod,
                        const llvm::BitstreamCursor &Stream) = 0;
};

/// Abstract base class that writes a module file extension block into
/// a module file.
class ModuleFileExtensionWriter {
  ModuleFileExtension *Extension;

protected:
  ModuleFileExtensionWriter(ModuleFileExtension *Extension)
    : Extension(Extension) { }

public:
  virtual ~ModuleFileExtensionWriter();

  /// Retrieve the module file extension with which this writer is
  /// associated.
  ModuleFileExtension *getExtension() const { return Extension; }

  /// Write the contents of the extension block into the given bitstream.
  ///
  /// Responsible for writing the contents of the extension into the
  /// given stream. All of the contents should be written into custom
  /// records with IDs >= FIRST_EXTENSION_RECORD_ID.
  virtual void writeExtensionContents(llvm::BitstreamWriter &Stream) = 0;
};

/// Abstract base class that reads a module file extension block from
/// a module file.
///
/// Subclasses 
class ModuleFileExtensionReader {
  ModuleFileExtension *Extension;

protected:
  ModuleFileExtensionReader(ModuleFileExtension *Extension)
    : Extension(Extension) { }

public:
  /// Retrieve the module file extension with which this reader is
  /// associated.
  ModuleFileExtension *getExtension() const { return Extension; }

  virtual ~ModuleFileExtensionReader();
};

I suspect that the Reader and Writer interfaces will grow somewhat as we get more clients, but this is a start.

Thoughts?

- Doug

_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev


_______________________________________________
cfe-dev mailing list
[hidden email]
http://lists.llvm.org/cgi-bin/mailman/listinfo/cfe-dev