[TVM Tutorial] Bring Your Own Codegen to TVM

Published 2025-8-20 10:24

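Apache TVM is a deep learning compiler framework for CPUs, GPUs, and a variety of machine learning accelerators. More TVM documentation in Chinese is available at https://tvm.hyper.ai/

As the number of hardware devices targeted by deep learning workloads keeps growing, so does the knowledge users need to achieve high performance on each of them. To free data scientists from worrying about performance when developing a new model, hardware vendors either provide libraries such as MKLDNN or cuDNN that implement many commonly used deep learning operators, or provide frameworks such as TensorRT that let users describe their models in a prescribed way to achieve high performance.

However, users have to learn a new programming interface whenever they try a new library or device. A unified programming interface therefore becomes increasingly important: 1) to keep all users and hardware vendors on the same page, and 2) to provide a feasible solution that lets a specialized hardware device or library support only the widely used operators with extremely high performance, while unsupported operators fall back to general-purpose devices such as CPU/GPU.

This developer guide demonstrates how you, as a hardware vendor, can easily implement your own codegen and register it as a Relay backend compiler to support your hardware device or library. It covers two types of codegen, depending on the form of graph your backend needs:

1. You want to generate C code.

If your hardware already has a highly optimized C/C++ library, such as Intel CBLAS/MKL for CPUs or NVIDIA CUBLAS for GPUs, this is the section for you. Fortunately, the C source code module is fully compatible with the TVM runtime module, which means the generated code can be compiled by any C/C++ compiler with the proper compilation flags. The only tasks are to implement a codegen that generates C code for a subgraph, and to wrap that code in a C source module so it can be integrated into the TVM runtime module. The next section demonstrates in detail how to implement a C codegen for your hardware.

2. You want to generate any other graph representation.

Sometimes your hardware may require another form of graph representation, such as JSON. In that case you need to implement not only a codegen but also a customized TVM runtime module, so that the TVM runtime knows how to execute this graph representation. If your hardware already has a complete graph execution engine, such as TensorRT for GPUs, this is the solution to consider.

Once the codegen and runtime are in place, your customers can annotate their models with your customized tag to make use of them. A tutorial on how end users annotate a model and launch a specific codegen will be added later.

Implement a C Codegen

In this part, we demonstrate how to implement a codegen that generates C code using pre-implemented operator functions. To keep things simple, the example codegen does not depend on third-party libraries. Instead, we manually implement two macros in C: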

#define CSOURCE_BINARY_OP_1D(p_ID_, p_OP_, p_DIM1_)         \
    extern "C" void p_ID_(float* a, float* b, float* out) { \
        for (int64_t i = 0; i < p_DIM1_; ++i) {             \
            out[i] = a[i] p_OP_ b[i];                       \
        }                                                   \
    }

#define CSOURCE_BINARY_OP_2D(p_ID_, p_OP_, p_DIM1_, p_DIM2_)  \
    extern "C" void p_ID_(float* a, float* b, float* out) {   \
        for (int64_t i = 0; i < p_DIM1_; ++i) {               \
            for (int64_t j = 0; j < p_DIM2_; ++j) {           \
                int64_t k = i * p_DIM2_ + j;                  \
                out[k] = a[k] p_OP_ b[k];                     \
            }                                                 \
        }                                                     \
    }

With these two macros, we can generate binary operators for 1-D and 2-D tensors. For example, consider the subgraph below, and assume all inputs are 2-D tensors of shape (10, 10):

c_compiler_input0
       |
      add <-- c_compiler_input1
       |
    subtract <-- c_compiler_input2
       |
    multiply <-- c_compiler_input3
       |
      out

Our goal is to generate the following compilable code to execute the subgraph:

#include <tvm/runtime/c_runtime_api.h>
#include <tvm/runtime/packed_func.h>
#include <dlpack/dlpack.h>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <iostream>

#define GCC_BINARY_OP_1D(p_ID_, p_OP_, p_DIM1_)           \
  extern "C" void p_ID_(float* a, float* b, float* out) { \
    for (int64_t i = 0; i < p_DIM1_; ++i) {               \
      out[i] = a[i] p_OP_ b[i];                           \
    }                                                     \
  }

#define GCC_BINARY_OP_2D(p_ID_, p_OP_, p_DIM1_, p_DIM2_)  \
  extern "C" void p_ID_(float* a, float* b, float* out) { \
    for (int64_t i = 0; i < p_DIM1_; ++i) {               \
      for (int64_t j = 0; j < p_DIM2_; ++j) {             \
        int64_t k = i * p_DIM2_ + j;                      \
        out[k] = a[k] p_OP_ b[k];                         \
      }                                                   \
    }                                                     \
  }

// Note 1
GCC_BINARY_OP_2D(gcc_0_0, *, 10, 10);
GCC_BINARY_OP_2D(gcc_0_1, -, 10, 10);
GCC_BINARY_OP_2D(gcc_0_2, +, 10, 10);

// Note 2
extern "C" void gcc_0_(float* gcc_input0, float* gcc_input1,
                       float* gcc_input2, float* gcc_input3, float* out) {
  float* buf_0 = (float*)malloc(4 * 100);
  float* buf_1 = (float*)malloc(4 * 100);
  gcc_0_2(gcc_input0, gcc_input1, buf_0);
  gcc_0_1(buf_0, gcc_input2, buf_1);
  gcc_0_0(buf_1, gcc_input3, out);
  free(buf_0);
  free(buf_1);
}

// Note 3
extern "C" int gcc_0_wrapper(DLTensor* arg0, DLTensor* arg1, DLTensor* arg2,
                             DLTensor* arg3, DLTensor* out) {
  gcc_0_(static_cast<float*>(arg0->data), static_cast<float*>(arg1->data),
         static_cast<float*>(arg2->data), static_cast<float*>(arg3->data),
         static_cast<float*>(out->data));
  return 0;
}
TVM_DLL_EXPORT_TYPED_FUNC(gcc_0, gcc_0_wrapper);

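Here is a closer look at the notes in the code above:

Note 1: The function implementations for the three nodes in the subgraph.
Note 2: A function that executes the subgraph by allocating intermediate buffers and invoking the corresponding functions.
Note 3: A TVM runtime compatible wrapper function. It accepts a list of input tensors and one output tensor (the last argument), casts them to the right data type, and invokes the subgraph function described in Note 2. In addition, TVM_DLL_EXPORT_TYPED_FUNC is a TVM macro that generates another function, gcc_0, with unified function arguments by packing all tensors into TVMArgs. As a result, the TVM runtime can invoke gcc_0 directly to execute the subgraph without additional effort. With the above code generated, TVM is able to compile it along with the rest of the graph and export a single library for deployment.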
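To make the data flow of the generated code concrete, here is a minimal standalone sketch that evaluates the same subgraph, ((in0 + in1) - in2) * in3, on plain std::vector<float> buffers. The names BinaryOp2D and RunSubgraph are hypothetical and exist only for illustration; the sketch assumes row-major float tensors and deliberately avoids any TVM API.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <vector>

// Hypothetical helper: element-wise binary op over a row-major dim1 x dim2
// tensor, mirroring the loop body of the 2-D macro.
template <typename Op>
void BinaryOp2D(const std::vector<float>& a, const std::vector<float>& b,
                std::vector<float>* out, int64_t dim1, int64_t dim2, Op op) {
  assert(a.size() == b.size());
  assert(a.size() == static_cast<size_t>(dim1 * dim2));
  out->resize(a.size());
  for (int64_t i = 0; i < dim1; ++i) {
    for (int64_t j = 0; j < dim2; ++j) {
      const int64_t k = i * dim2 + j;
      (*out)[k] = op(a[k], b[k]);
    }
  }
}

// Hypothetical stand-in for the generated gcc_0_ function.
std::vector<float> RunSubgraph(const std::vector<float>& in0,
                               const std::vector<float>& in1,
                               const std::vector<float>& in2,
                               const std::vector<float>& in3,
                               int64_t dim1, int64_t dim2) {
  // Intermediate buffers; std::vector releases them automatically.
  std::vector<float> buf_0, buf_1, out;
  BinaryOp2D(in0, in1, &buf_0, dim1, dim2, std::plus<float>());       // add
  BinaryOp2D(buf_0, in2, &buf_1, dim1, dim2, std::minus<float>());    // subtract
  BinaryOp2D(buf_1, in3, &out, dim1, dim2, std::multiplies<float>()); // multiply
  return out;
}
```

Calling RunSubgraph with four 10 x 10 inputs returns a 100-element result, which is the value the generated code would write into the output tensor.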

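In the rest of this section, we build a codegen step by step to produce the code above. Your own codegen must be located at src/relay/backend/contrib/<your-codegen-name>/. In this example, we name the codegen "codegen_c" and put it under /src/relay/backend/contrib/codegen_c/. Feel free to consult this file for a complete implementation.

Specifically, we are going to implement two classes in this file. Their relationship is as follows: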

            subgraph                                subgraph
TVM backend -----------------------------> CSourceCodegen -------------> CodegenC
       ^                                       |    ^                       |
       |                                       |    |                       |
       ----------------------------------------      ------------------------
          generated C source runtime module              generated C code

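When the TVM backend finds a function (subgraph) in a Relay graph that is annotated with the registered compiler tag (ccompiler in this example), it invokes CSourceCodegen and passes the subgraph. The member function CreateCSourceModule of CSourceCodegen will:

1) generate C code for the subgraph; and

2) wrap the generated C code in a C source runtime module for the TVM backend to compile and deploy.

In particular, the C code generation itself is handled by the CodegenC class, which provides many useful utilities to ease the codegen implementation. The following sections implement these two classes in bottom-up order.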

Implement CodegenC

In src/relay/backend/contrib/codegen_c/codegen.cc, we first create a codegen class skeleton under the tvm.relay.contrib namespace:

#include <tvm/relay/expr_functor.h>
#include <tvm/relay/transform.h>
#include <tvm/relay/type.h>
#include <tvm/runtime/module.h>
#include <tvm/runtime/object.h>

#include <fstream>
#include <sstream>

#include "codegen_c.h"

namespace tvm {
namespace relay {
namespace contrib {

class CodegenC : public ExprVisitor, public CodegenCBase {
  public:
    explicit CodegenC(const std::string& id) { this->ext_func_id_ = id; }

    void VisitExpr_(const VarNode* node) { ; }
    void VisitExpr_(const CallNode* call) final { ; }
    std::string JIT() { ; }

  private:
    /*! \brief The function id that represents a C source function. */
    std::string ext_func_id_ = "";
    /*! \brief The index of a wrapped C function. */
    int func_idx = 0;
    /*! \brief The index of allocated buffers. */
    int buf_idx_ = 0;
    /*! \brief The arguments of a C compiler compatible function. */
    std::vector<std::string> ext_func_args_;
    /*! \brief The statements of a C compiler compatible function. */
    std::vector<std::string> ext_func_body;
    /*! \brief The declaration statements of a C compiler compatible function. */
    std::vector<std::string> func_decl_;
    /*! \brief The declaration statements of buffers. */
    std::vector<std::string> buf_decl_;
    /*! \brief The name and index pairs for output. */
    std::vector<std::pair<std::string, int>> out_;
};

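CodegenC inherits from two classes. ExprVisitor provides the ability to traverse the subgraph, collect the required information, and generate subgraph functions such as gcc_0_.

CodegenCBase provides the abilities and utilities to generate wrapper functions such as gcc_0 in the example above. As you can see, we only need to implement three functions in this codegen class.

Code Generation for Operators

We first implement VisitExpr_(const CallNode* call). This function visits every call node while traversing the subgraph. Each call node contains an operator that we want to offload to the hardware, so we need to generate the corresponding C code with the correct operators in topological order. We implement this function step by step as follows:

1. Generate the function declaration

Example result: GCC_BINARY_OP_2D(gcc_0_0, *, 10, 10);

To generate the function declaration shown above, we need:

1) a function name (e.g., gcc_0_0);

2) the type of the operator (e.g., *);

3) the input tensor shape (e.g., (10, 10)).

Fortunately, this information can easily be obtained from the CallNode: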

std::ostringstream macro_stream;
std::ostringstream decl_stream;
std::ostringstream buf_stream;

// Generate a unique function name you like.
std::string func_name = ext_func_id_ + "_" + std::to_string(func_idx++);

// Make function declaration string.
macro_stream << "CSOURCE_BINARY_OP_" << call->args.size() << "D(" << func_name << ", ";

// Check the operator type.
if (IsOp(call, "add")) {
  macro_stream << "+";
} else if (IsOp(call, "subtract")) {
  macro_stream << "-";
} else if (IsOp(call, "multiply")) {
  macro_stream << "*";
} else {
  LOG(FATAL) << "Unrecognized op";
}

// Extract the input tensor shape.
auto in_shape = GetShape(call->args[0]->checked_type());
for (size_t i = 0; i < in_shape.size(); ++i) {
  macro_stream << ", " << in_shape[i];
}
macro_stream << ");";
func_decl_.push_back(macro_stream.str());

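As you can see, we push the generated code to the class member variable func_decl_. This means that after traversing the entire subgraph we have collected all required function declarations, and the only remaining step is to have them compiled by GCC. The rest of the VisitExpr_(const CallNode* call) implementation follows the same concept.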
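The string assembly in this step can be tried out without TVM. The following is a minimal sketch under stated assumptions: MakeBinaryOpDecl is a hypothetical helper, the operator is identified by a plain string instead of a CallNode, and the macro suffix (1D or 2D) is derived from the rank of the shape, which is an assumption of this sketch rather than something taken from the codegen above.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: build one macro invocation for a binary operator.
// Assumption: the macro suffix is chosen from the tensor rank (shape.size()).
std::string MakeBinaryOpDecl(const std::string& func_name,
                             const std::string& op_name,
                             const std::vector<int64_t>& shape) {
  assert(!shape.empty());
  std::ostringstream macro_stream;
  macro_stream << "CSOURCE_BINARY_OP_" << shape.size() << "D(" << func_name << ", ";

  // Map the operator name to the C operator token.
  if (op_name == "add") {
    macro_stream << "+";
  } else if (op_name == "subtract") {
    macro_stream << "-";
  } else if (op_name == "multiply") {
    macro_stream << "*";
  } else {
    throw std::invalid_argument("Unrecognized op: " + op_name);
  }

  // Append every dimension of the input tensor shape.
  for (int64_t dim : shape) {
    macro_stream << ", " << dim;
  }
  macro_stream << ");";
  return macro_stream.str();
}
```

For a multiply on a (10, 10) tensor with the name gcc_0_0, this helper produces a string of the same form as the example result shown for this step, with the CSOURCE_ macro prefix.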

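2. Generate the function call

Example result: gcc_0_0(buf_1, gcc_input3, out);

After generating the function declaration, we need to generate a function call with the proper inputs and output. To know which inputs or buffers to pass when calling this function, we have to visit its arguments: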

bool first = true;
decl_stream << func_name << "(";
for (size_t i = 0; i < call->args.size(); ++i) {
  VisitExpr(call->args[i]); // Note 1
  for (auto out : out_) {
    if (!first) {
      decl_stream << ", ";
    }
    first = false;
    decl_stream << out.first;
  }
}
// Note 2

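Again, let's go through the notes in the code above:

Note 1: VisitExpr(call->args[i]) is a recursive call that visits the arguments of the current function. An argument can be the output of another node or an input tensor. In our example implementation, every node must update the class variable out_ before leaving the visitor. Here is an illustration: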

 arg_node                 arg_node <- Visit arg (Note 1)       arg_node
     |                        |                                    |
 curr_node <- Process      curr_node                            curr_node <- Put "buf_0" as an input buffer

(a) out_ = {}            (b) out_ = {}                   (c) out_ = {("buf_0", 20)}

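As the figure shows, the class variable out_ is empty before the argument node is visited, and is then filled with the name and size of the output buffer of arg_node. Once we finish visiting the argument node, we therefore know the proper input buffer to use by looking at out_. How out_ is updated is covered at the end of this section and in the next section.

Note 2: You may notice that we do not close the function call string in this step. The current function call string looks like: gcc_0_0(buf_1, gcc_input3. This is because we have not yet put the last argument (i.e., the output) into the call. The output of a function call can be either an allocated temporary buffer or the subgraph output tensor. For simplicity, in this example we allocate an output buffer for every call node (next step) and copy the result in the very last buffer to the output tensor.

3. Generate the output buffer

Example result: float* buf_0 = (float*)malloc(4 * 100);

As mentioned in the previous step, in addition to the subgraph input and output tensors, we also need buffers to hold intermediate results. To generate a buffer, we extract the shape information to determine the buffer type and size: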

// This example only supports a single output.
auto type_node = call->checked_type().as<TensorTypeNode>();
ICHECK(type_node != nullptr && runtime::TypeMatch(type_node->dtype, kDLFloat, 32))
      << "Only support single output tensor with float type";

// Generate a unique buffer name.
std::string out = "buf_" + std::to_string(buf_idx_++);

// Extract the shape to compute the buffer size.
auto out_shape = GetShape(call->checked_type());
int out_size = 1;
for (size_t i = 0; i < out_shape.size(); ++i) {
  out_size *= out_shape[i];
}

// Allocate the buffer and push the statement to the buffer declarations.
buf_stream << "float* " << out << " = (float*)std::malloc(4 * " << out_size << ");";
buf_decl_.push_back(buf_stream.str());

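After allocating the output buffer, we can now close the function call string and push the generated function call to the class member variable ext_func_body.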

decl_stream << ", " << out << ");";
ext_func_body.push_back(decl_stream.str());

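4. Update the output buffer

So that the next node, which takes the output of the current call node as its input, knows which buffer to use, we need to update the class variable out_ before leaving this visit function: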

out_.clear();
out_.push_back({out, out_size});

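Congratulations! We have finished the most difficult function in this class. In the next two sections, we fill in the remaining pieces that this function relies on.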
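The four steps can be exercised end to end on a miniature expression tree. The sketch below is a simplified model, not the TVM implementation: ToyNode, ToyCodegen, Var, and Call are hypothetical names, nodes are either variables or binary calls, and a single name/size pair plays the role of out_. It follows the same order as the walkthrough: name the function, visit the arguments, allocate a buffer, close the call, then publish the buffer.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical miniature expression tree: a node is either an input variable
// (op is empty) or a call to a binary operator with two arguments.
struct ToyNode {
  std::string name;  // name hint, used by variables only
  std::string op;    // "add", "subtract", "multiply"; empty for variables
  std::vector<std::shared_ptr<ToyNode>> args;
  int num_elements = 0;  // number of output elements, used by calls only
};

std::shared_ptr<ToyNode> Var(const std::string& name) {
  auto node = std::make_shared<ToyNode>();
  node->name = name;
  return node;
}

std::shared_ptr<ToyNode> Call(const std::string& op, std::shared_ptr<ToyNode> lhs,
                              std::shared_ptr<ToyNode> rhs, int num_elements) {
  auto node = std::make_shared<ToyNode>();
  node->op = op;
  node->args = {std::move(lhs), std::move(rhs)};
  node->num_elements = num_elements;
  return node;
}

class ToyCodegen {
 public:
  explicit ToyCodegen(std::string id) : ext_func_id_(std::move(id)) {}

  void Visit(const ToyNode& node) {
    if (node.op.empty()) {
      // Variable node: only publish its name for the caller to pick up.
      out_ = {node.name, 0};
      return;
    }
    assert(node.args.size() == 2);
    // Name the function before visiting the arguments.
    const std::string func_name = ext_func_id_ + "_" + std::to_string(func_idx_++);
    std::string call = func_name + "(";
    bool first = true;
    for (const auto& arg : node.args) {
      Visit(*arg);  // fills out_ with the buffer produced by this argument
      if (!first) call += ", ";
      first = false;
      call += out_.first;
    }
    // Allocate an output buffer, close the call, then publish the buffer.
    const std::string buf = "buf_" + std::to_string(buf_idx_++);
    buf_decl.push_back("float* " + buf + " = (float*)std::malloc(4 * " +
                       std::to_string(node.num_elements) + ");");
    body.push_back(call + ", " + buf + ");");
    out_ = {buf, node.num_elements};
  }

  std::vector<std::string> buf_decl;  // buffer allocation statements
  std::vector<std::string> body;      // function calls in execution order

 private:
  std::string ext_func_id_;
  int func_idx_ = 0;
  int buf_idx_ = 0;
  std::pair<std::string, int> out_;
};
```

Visiting the multiply node of the example subgraph with the id gcc_0 yields three calls in execution order (add, subtract, multiply). Because function names are assigned before the arguments are visited, the outermost operator receives index 0, which matches the numbering in the generated code shown earlier.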

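Code Generation for Input Variables

Recall that we collected the input buffer information by visiting the arguments of a call node (step 2 in the previous section), and handled the case where an argument is another call node (step 4). In this section, we use VarNode as an example to demonstrate how to handle other node types.

VarNode represents an input tensor in a model. The only piece of information it carries that matters here is its name hint (e.g., data, weight). When visiting a VarNode, we simply update the class variable out_ to pass the name hint, so that the descendant call nodes can generate the correct function call.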

void VisitExpr_(const VarNode* node) {
  ext_func_args_.push_back(node->name_hint());
  out_.clear();
  out_.push_back({node->name_hint(), 0});
}

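Note: In this example, we assume the subgraph to offload contains only call nodes and variable nodes. If your subgraphs contain other node types, such as TupleNode, you also need to visit them and pass the output buffer information through.

Code Emitting

The final part of the codegen class is the JIT function, which emits a C function for the subgraph and uses the C code we just generated as the function body. Remember, in addition to the subgraph function generated in the previous sections, we also need a wrapper function with unified arguments for the TVM runtime to invoke and pass data. Fortunately, the base class we inherit from already provides an implementation, JitImpl, to generate this function. For example, we can invoke JitImpl as follows: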

JitImpl("gcc_0" /* Subgraph symbol (ID) */,
        {"gcc_input0", "gcc_input1", "gcc_input2", "gcc_input3"} /* Input arguments */,
        {"float *buf_0 = (float*)malloc(4 * 20)", ...} /* Buffer allocations */,
        {"gcc_0_2(gcc_input0, gcc_input1, buf_0);"} /* Function body */,
        {"out"} /* Output */);

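The call above generates three functions (one of them comes from the TVM wrapper macro):

- The subgraph function gcc_0_ (with one more underscore at the end of the function name), which contains all the C code we generated to execute the subgraph.
- The wrapper function gcc_0__wrapper_, which takes a list of DLTensor arguments, casts the data to the right type, and invokes gcc_0_.
- The TVM runtime compatible function gcc_0, which has TVM unified function arguments, unpacks the TVM packed tensors, and invokes gcc_0__wrapper_.

Accordingly, the only thing left to do in the JIT implementation is to pass all the subgraph function code we generated to JitImpl: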

std::string JIT() {
  // Write function macros
  for (auto decl : func_decl_) {
    code_stream_ << decl << "\n";
  }
  return JitImpl(ext_func_id_, ext_func_args_, buf_decl_, ext_func_body, out_);
}

All the variables passed (ext_func_id_, etc.) are class variables that were filled in while traversing the subgraph.
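To give a feel for how the collected pieces turn into a subgraph function, here is a simplified sketch. EmitSubgraphFunction is a hypothetical helper and is not the JitImpl provided by the base class; it only assembles the first of the three functions, and it assumes that buffers are named buf_0, buf_1, and so on, that all arguments are float pointers, and that the last buffer is copied to the output with memcpy.

```cpp
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in for the subgraph-function part of a
// JitImpl-style helper. It is a sketch, not TVM's implementation.
std::string EmitSubgraphFunction(const std::string& id,
                                 const std::vector<std::string>& args,
                                 const std::vector<std::string>& buf_decl,
                                 const std::vector<std::string>& body,
                                 const std::string& last_buf, int num_elements) {
  assert(!args.empty());
  std::ostringstream os;

  // Signature: every input and the output are float pointers (assumption).
  os << "extern \"C\" void " << id << "_(";
  for (const auto& arg : args) os << "float* " << arg << ", ";
  os << "float* out) {\n";

  // Buffer allocations followed by the operator calls.
  for (const auto& decl : buf_decl) os << "  " << decl << "\n";
  for (const auto& stmt : body) os << "  " << stmt << "\n";

  // Copy the last intermediate buffer to the output tensor.
  os << "  std::memcpy(out, " << last_buf << ", 4 * " << num_elements << ");\n";

  // Release the buffers; assumes they are named buf_0, buf_1, ...
  for (size_t i = 0; i < buf_decl.size(); ++i) {
    os << "  std::free(buf_" << i << ");\n";
  }
  os << "}\n";
  return os.str();
}
```

The returned string would be appended to the code stream after the macro declarations, which is the role JIT plays in the codegen class.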

Implement CSourceCodegen

Again, we create a class skeleton and implement the required functions. Note that it inherits from CSourceModuleCodegenBase:

class CSourceCodegen : public CSourceModuleCodegenBase {
 public:
  // Pass a subgraph function, and generate the C code.
  void GenCFunc(const Function& func) { ; }

  // Use GenCFunc to generate the C code and wrap it as a C source module.
  runtime::Module CreateCSourceModule(const NodeRef& ref) override { ; }

 private:
  std::ostringstream code_stream_;
};

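Implement GenCFunc

GenCFunc simply uses the CodegenC class we just implemented to traverse a Relay function (subgraph) and obtain the generated C code. The built-in function GetExtSymbol retrieves a unique symbol name (e.g., gcc_0) from the Relay function. It must be used as the C function name, because this symbol is used for DSO runtime lookup.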

void GenCFunc(const Function& func) {
  ICHECK(func.defined()) << "Input error: expect a Relay function.";

  // Record the external symbol for runtime lookup.
  auto sid = GetExtSymbol(func);

  CodegenC builder(sid);
  builder.VisitExpr(func->body);
  code_stream_ << builder.JIT();
}

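Implement CreateCSourceModule

This function creates a runtime module for the external library. In this example, we create a CSourceModule that can be directly compiled and linked together with a TVM-generated DSOModule. With CodegenC implemented, this function is relatively straightforward: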

runtime::Module CreateCSourceModule(const NodeRef& ref) override {
  // Create headers
  code_stream_ << "#include <cstdint>\n";
  code_stream_ << "#include <iostream>\n";
  code_stream_ << "#include <cstdlib>\n";
  code_stream_ << "#include <stdio.h>\n";
  code_stream_ << "#include <cstring>\n";
  code_stream_ << "#include <tvm/runtime/c_runtime_api.h>\n";
  code_stream_ << "#include <dlpack/dlpack.h>\n";

  // Append some common macros for operator definitions.
  const char* operator_macro = R"op_macro(
  #define CSOURCE_BINARY_OP_1D(p_ID_, p_OP_, p_DIM1_)       \
    extern "C" void p_ID_(float* a, float* b, float* out) { \
      for (int64_t i = 0; i < p_DIM1_; ++i) {               \
        out[i] = a[i] p_OP_ b[i];                           \
      }                                                     \
    }

  #define CSOURCE_BINARY_OP_2D(p_ID_, p_OP_, p_DIM1_, p_DIM2_)  \
    extern "C" void p_ID_(float* a, float* b, float* out) {     \
      for (int64_t i = 0; i < p_DIM1_; ++i) {                   \
        for (int64_t j = 0; j < p_DIM2_; ++j) {                 \
          int64_t k = i * p_DIM2_ + j;                          \
          out[k] = a[k] p_OP_ b[k];                             \
        }                                                       \
      }                                                         \
    }
  )op_macro";

  code_stream_ << operator_macro << "\n\n";

  // Generate C code for the subgraph.
  if (ref->IsInstance<FunctionNode>()) {
    GenCFunc(Downcast<Function>(ref));
  } else if (ref->IsInstance<relay::ModuleNode>()) {
    relay::Module mod = Downcast<relay::Module>(ref);
    for (const auto& it : mod->functions) {
      GenCFunc(Downcast<Function>(it.second));
    }
  } else {
    LOG(FATAL) << "The input ref is expected to be a Relay function or module"
               << "\n";
  }

  // Create a CSourceModule
  const auto* pf = runtime::Registry::Get("module.csource_module_create");
  ICHECK(pf != nullptr) << "Cannot find csource module to create the external runtime module";
  return (*pf)(code_stream_.str(), "cc");
}

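Register Your Codegen

The last step is to register your codegen with the TVM backend. We first implement a simple function that invokes our codegen and generates a runtime module: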

runtime::Module CCompiler(const NodeRef& ref) {
  CSourceCodegen csource;
  return csource.CreateCSourceModule(ref);
}

Next, we register this function with the TVM backend:

TVM_REGISTER_GLOBAL("relay.ext.ccompiler").set_body_typed(CCompiler);

Here, ccompiler is a customized tag that tells TVM which codegen to use to generate code for and offload a subgraph when that subgraph is annotated with ccompiler.
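The idea of looking up a codegen by its tag can be illustrated with a tiny name-keyed registry. This is only a conceptual sketch: ToyRegistry, RegisterCodegen, and LookupCodegen are hypothetical names, the codegen is modeled as a function from a subgraph name to a code string, and the real TVM global registry is considerably more general.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical codegen signature: takes a subgraph name, returns generated code.
using CodegenFn = std::function<std::string(const std::string& subgraph)>;

// Hypothetical miniature registry keyed by the full name "relay.ext.<tag>".
std::unordered_map<std::string, CodegenFn>& ToyRegistry() {
  static std::unordered_map<std::string, CodegenFn> registry;
  return registry;
}

// Register a codegen under a compiler tag such as "ccompiler".
void RegisterCodegen(const std::string& tag, CodegenFn fn) {
  assert(!tag.empty());
  ToyRegistry()["relay.ext." + tag] = std::move(fn);
}

// Look up the codegen for a tag; returns nullptr when none is registered.
const CodegenFn* LookupCodegen(const std::string& tag) {
  auto it = ToyRegistry().find("relay.ext." + tag);
  return it == ToyRegistry().end() ? nullptr : &it->second;
}
```

In this model, a backend that sees a subgraph annotated with ccompiler would call LookupCodegen("ccompiler") and invoke the returned function on that subgraph.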

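Finally, a good practice is to set up a CMake configuration flag so that your compiler is built only when requested. We first create a CMake file, cmake/modules/contrib/CODEGENC.cmake: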

if(USE_CODEGENC)
  file(GLOB CSOURCE_RELAY_CONTRIB_SRC src/relay/backend/contrib/codegen_c/codegen.cc)
  list(APPEND COMPILER_SRCS ${CSOURCE_RELAY_CONTRIB_SRC})
endif(USE_CODEGENC)

Users can then decide whether to build your compiler when configuring TVM with config.cmake:

set(USE_CODEGENC ON)

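Implement a Codegen for Your Representation

Although we have demonstrated how to implement a C codegen, your hardware may require another form of graph representation, such as JSON. In that case, you can modify the CodegenC class we implemented to generate your own graph representation, and implement a customized runtime module that tells the TVM runtime how to execute it.

To keep things simple, we define a graph representation named "ExampleJSON" in this guide. ExampleJSON is not real JSON; it is just a simple representation for graphs without control flow. For example, assume we have the following subgraph named subgraph_0: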

input0
   |
  add <-- input1
   |
subtract <-- input2
   |
multiply <-- input3
   |
  out

Then the ExampleJSON of this subgraph looks like this:

subgraph_0
  input 0 10 10
  input 1 10 10
  input 2 10 10
  input 3 10 10
  add 4 inputs: 0 1 shape: 10 10
  sub 5 inputs: 4 2 shape: 10 10
  mul 6 inputs: 5 3 shape: 10 10

The input keyword declares an input tensor with its ID and shape, while the other statements describe computations using the syntax <op> <output ID> inputs: [input ID] shape: [shape].
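As a quick illustration of this format, the following sketch emits ExampleJSON text for a chain of binary operators. ToyOp and EmitExampleJson are hypothetical names; the sketch assumes that all tensors share one shape and that output IDs are assigned sequentially after the input IDs, as in the listing above.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical description of one binary operator: its ExampleJSON op name
// and the IDs of its two inputs.
struct ToyOp {
  std::string op;
  int lhs;
  int rhs;
};

// Emit ExampleJSON text. Assumptions: every tensor has the same shape, and
// output IDs continue from num_inputs in the order the ops are listed.
std::string EmitExampleJson(const std::string& subgraph_name, int num_inputs,
                            const std::vector<ToyOp>& ops,
                            const std::vector<int64_t>& shape) {
  std::string shape_str;
  for (int64_t dim : shape) shape_str += " " + std::to_string(dim);

  std::ostringstream os;
  os << subgraph_name << "\n";
  for (int i = 0; i < num_inputs; ++i) {
    os << "  input " << i << shape_str << "\n";
  }
  int next_id = num_inputs;
  for (const ToyOp& op : ops) {
    // An operator may only consume tensors that already exist.
    assert(op.lhs < next_id && op.rhs < next_id);
    os << "  " << op.op << " " << next_id << " inputs: " << op.lhs << " "
       << op.rhs << " shape:" << shape_str << "\n";
    ++next_id;
  }
  return os.str();
}
```

Passing four inputs of shape (10, 10) and the operators add(0, 1), sub(4, 2), and mul(5, 3) reproduces the subgraph_0 listing above.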

In this section, our goal is to implement the following customized TVM runtime module to execute ExampleJSON graphs.

runtime::Module ExampleJsonCompiler(const NodeRef& ref) {
    ExampleJsonCodeGen codegen;
    std::string code = codegen.gen(ref); // Note 1
    const auto* pf = runtime::Registry::Get("module.examplejson_module_create"); // Note 2
    ICHECK(pf != nullptr) << "Cannot find ExampleJson module to create the external runtime module";
    return (*pf)(code);
}
TVM_REGISTER_GLOBAL("relay.ext.examplejsoncompiler").set_body_typed(ExampleJsonCompiler);

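Note 1: We will implement a customized codegen later that generates an ExampleJSON code string from a subgraph.

Note 2: This line obtains a pointer to a function that creates the customized runtime module. As you can see, it takes the subgraph code in ExampleJSON format that we just generated and initializes a runtime module.

In the following sections, we introduce 1) how to implement ExampleJsonCodeGen and 2) how to implement and register examplejson_module_create.

Implement ExampleJsonCodeGen

Similar to the C codegen, we derive ExampleJsonCodeGen from ExprVisitor to traverse the subgraph with the visitor pattern. On the other hand, we do not have to inherit from CodegenCBase, because we do not need the TVM C++ wrappers. The codegen class is implemented as follows: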

#include <tvm/relay/expr_functor.h>
#include <tvm/relay/transform.h>
#include <tvm/relay/type.h>
#include <tvm/runtime/module.h>
#include <tvm/runtime/object.h>

#include <fstream>
#include <sstream>

namespace tvm {
namespace relay {
namespace contrib {

class ExampleJsonCodeGen : public ExprVisitor {
  public:
    ExampleJsonCodeGen() = default;

    // Note 1
    void VisitExpr_(const VarNode* node) { /* Skip in this example. */ }
    void VisitExpr_(const CallNode* call) final { /* Skip in this example. */ }

    // Note 2
    std::string gen(const NodeRef& ref) {
        this->code = "";
        if (ref->IsInstance<FunctionNode>()) {
            this->VisitExpr(Downcast<Function>(ref));
        } else if (ref->IsInstance<relay::ModuleNode>()) {
            relay::Module mod = Downcast<relay::Module>(ref);
            for (const auto& it : mod->functions) {
                this->VisitExpr(Downcast<Function>(it.second));
            }
        } else {
            LOG(FATAL) << "The input ref is expected to be a Relay function or module";
        }
        return this->code;
    }

  private:
    /*! \brief The generated ExampleJSON code. */
    std::string code;
};

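Note 1: We again implement the corresponding visitor functions to generate ExampleJSON code and store it in the class variable code (the visitor implementation is skipped here because it is essentially the same as in the C codegen). After the graph has been visited, code holds an ExampleJSON graph.

Note 2: We define an internal API, gen, that takes a subgraph and generates ExampleJSON code. You can name this API whatever you prefer.

The next step is to implement a customized runtime that makes use of the output of ExampleJsonCodeGen.

Implement a Customized Runtime

In this section, we implement a customized TVM runtime step by step and register it with the TVM runtime modules. The customized runtime should be located at src/runtime/contrib/<your-runtime-name>/. In our example, we name the runtime "example_ext_runtime".

First, we define a customized runtime class as shown below. The class has to be derived from TVM ModuleNode so that it is compatible with other TVM runtime modules.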

#include <dmlc/logging.h>
#include <tvm/runtime/c_runtime_api.h>
#include <tvm/runtime/memory.h>
#include <tvm/runtime/module.h>
#include <tvm/runtime/ndarray.h>
#include <tvm/runtime/object.h>
#include <tvm/runtime/packed_func.h>
#include <tvm/runtime/registry.h>

#include <fstream>
#include <cmath>
#include <map>
#include <sstream>
#include <string>
#include <vector>

namespace tvm {
namespace runtime {
class ExampleJsonModule : public ModuleNode {
 public:
  explicit ExampleJsonModule(std::string graph_json);

  PackedFunc GetFunction(const std::string& name,
                         const ObjectPtr<Object>& sptr_to_self) final;

  const char* type_key() const { return "examplejson"; }

  void SaveToBinary(dmlc::Stream* stream) final;

  static Module LoadFromBinary(void* strm);

  static Module Create(const std::string& path);

  std::string GetSource(const std::string& format = "");

  void Run(int id, const std::vector<int>& inputs, int output);

  void ParseJson(const std::string& json);

 private:
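  /*! \brief The JSON string that represents a computational graph. */
  std::string graph_json_;
  /*! \brief The subgraph that is being processed. */
  std::string curr_subgraph_;
  /*! \brief A simple graph from subgraph id to node entries. */
  std::map<std::string, std::vector<NodeEntry> > graph_;
  /*! \brief A simple pool that contains the tensor for each node in the graph. */
  std::vector<NDArray> data_entry_;
  /*! \brief A mapping from node id to op name. */
  std::vector<std::string> op_id_;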
};

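The following functions, derived from ModuleNode, must be implemented in ExampleJsonModule:

- Constructor: The constructor of this class should accept a subgraph (in your representation), and process and store it in any format you like. The saved subgraph can then be used by the following two functions.
- GetFunction: This is the most important function in this class. When the TVM runtime wants to execute a subgraph with your compiler tag, it invokes this function from your customized runtime module. It provides the function name as well as the runtime arguments, and GetFunction should return a packed function implementation for the TVM runtime to execute.
- SaveToBinary and LoadFromBinary: SaveToBinary serializes the runtime module to a binary format for later deployment. TVM calls this function when users use the export_library API. On the other hand, since we are now using our own graph representation, we have to make sure that LoadFromBinary can construct the same runtime module from the serialized binary generated by SaveToBinary.
- GetSource (optional): If you would like to inspect the generated ExampleJSON code, you can implement this function to dump it; otherwise you can skip the implementation.

Implement the Constructor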

explicit ExampleJsonModule(std::string graph_json) {
  this->graph_json_ = graph_json;
  ParseJson(this->graph_json_);
}

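Then we implement ParseJson to parse a subgraph in ExampleJSON format and construct a graph in memory for later use. Since we do not support subgraphs with branches in this example, we simply use an array to store every node of the subgraph in order.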

void ParseJson(const std::string& json) {
  std::string line;
  std::string curr_subgraph;
  std::stringstream ss(json);

  while (std::getline(ss, line, '\n')) {
    std::stringstream ss2(line);
    std::string token;
    int id = 0;

    ss2 >> token;
    if (token.find("subgraph_") != std::string::npos) {
      curr_subgraph = token;
      continue;
    }

    ss2 >> id;
    if (op_id_.size() <= static_cast<size_t>(id)) {
      op_id_.resize(id + 1);
      data_entry_.resize(id + 1);
    }

    int64_t total_elements = 1;
    std::vector<int64_t> shape;
    if (token == "input") {
      int64_t size = 0;
      while (ss2 >> size) {
        total_elements *= size;
        shape.push_back(size);
      }
    } else {
      op_id_[id] = token; // Note 1
      bool shape_data = false;
      NodeEntry entry;
      while (ss2 >> token) {
        if (token == "shape:") {
          shape_data = true;
        } else if (shape_data) {
          total_elements *= std::stoll(token);
          shape.push_back(std::stoll(token));
        } else if (token != "inputs:") {
          entry.inputs.push_back(std::stoi(token));
        }
      }
      entry.id = id;
      entry.output = id;
      graph_[curr_subgraph].push_back(entry); // Note 2
    }
    DLDevice dev;
    dev.device_type = kDLCPU;  // DLDeviceType value 1
    dev.device_id = 0;
    data_entry_[id] = NDArray::Empty(shape, DLDataType{kDLFloat, 32, 1}, dev); // Note 3
  }
}

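Note 1: We use the class variable op_id_ to map a subgraph node ID to its operator name (e.g., add), so that we can invoke the corresponding operator function at runtime.

Note 2: We use the class variable graph_ to map a subgraph name to an array of nodes. GetFunction queries the graph nodes by subgraph ID at runtime.

Note 3: We use the class variable data_entry_ to map a subgraph node ID to a tensor data placeholder. We put the inputs and outputs into the corresponding data entries at runtime.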
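The parsing logic can be tried in isolation with the standard library only. The sketch below is a simplified variant under stated assumptions: ToyEntry and ParseExampleJson are hypothetical names, every line (inputs included) becomes one entry in a flat vector, and no tensors are allocated, so it shows the tokenization rather than the full runtime bookkeeping.

```cpp
#include <cassert>
#include <cstdint>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical parsed form of one ExampleJSON line.
struct ToyEntry {
  std::string op;              // "input" or an operator name such as "add"
  int id = -1;                 // tensor ID declared or produced by this line
  std::vector<int> inputs;     // input IDs, empty for "input" lines
  std::vector<int64_t> shape;  // tensor shape
};

std::vector<ToyEntry> ParseExampleJson(const std::string& text) {
  std::vector<ToyEntry> entries;
  std::istringstream ss(text);
  std::string line;
  while (std::getline(ss, line)) {
    std::istringstream ls(line);
    std::string token;
    if (!(ls >> token)) continue;                                // blank line
    if (token.find("subgraph_") != std::string::npos) continue;  // header line

    ToyEntry entry;
    entry.op = token;
    ls >> entry.id;
    assert(entry.id >= 0);

    // For "input" lines everything after the ID is the shape; for operator
    // lines the "inputs:" and "shape:" keywords switch the parsing mode.
    bool in_shape = (entry.op == "input");
    while (ls >> token) {
      if (token == "inputs:") {
        in_shape = false;
      } else if (token == "shape:") {
        in_shape = true;
      } else if (in_shape) {
        entry.shape.push_back(std::stoll(token));
      } else {
        entry.inputs.push_back(std::stoi(token));
      }
    }
    entries.push_back(entry);
  }
  return entries;
}
```

A runtime built on top of this would still need to allocate one tensor per ID and group operator entries by subgraph name, as the notes above describe.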

Implement GetFunction

After construction, the class variables above are ready. We then implement GetFunction to provide executable subgraph functions to the TVM runtime:

PackedFunc GetFunction(const std::string& name,
                       const ObjectPtr<Object>& sptr_to_self) final {
  if (this->graph_.find(name) != this->graph_.end()) {
    this->curr_subgraph_ = name;
    return PackedFunc([sptr_to_self, this](TVMArgs args, TVMRetValue* rv) {

      // Copy input tensors to corresponding data entries.
      for (auto i = 0; i < args.size(); ++i) {
        ICHECK(args[i].type_code() == kNDArrayContainer || args[i].type_code() == kArrayHandle)
            << "Expect NDArray or DLTensor as inputs\n";
        if (args[i].type_code() == kArrayHandle) {
          DLTensor* arg = args[i];
          this->data_entry_[i].CopyFrom(arg);
        } else {
          NDArray arg = args[i];
          this->data_entry_[i].CopyFrom(arg);
        }
      }

      // Execute the subgraph.
      for (const auto& it : this->graph_[this->curr_subgraph_]) {
        this->Run(it.id, it.inputs, it.output);
      }
      ICHECK_GT(graph_.count(this->curr_subgraph_), 0U);

      // Copy the output from a data entry back to TVM runtime argument.
      auto out_idx = graph_[this->curr_subgraph_].back().output;
      if (args[args.size() - 1].type_code() == kArrayHandle) {
        DLTensor* arg = args[args.size() - 1];
        this->data_entry_[out_idx].CopyTo(arg);
      } else {
        NDArray arg = args[args.size() - 1];
        this->data_entry_[out_idx].CopyTo(arg);
      }
      *rv = data_entry_.back();
    });
  } else {
    LOG(FATAL) << "Unknown subgraph: " << name << "\n";
    return PackedFunc();
  }
}

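As you can see, GetFunction is composed of three major parts. The first part copies data from the TVM runtime arguments to the corresponding data entries assigned in the constructor. The second part executes the subgraph with the Run function (implemented later) and saves the results to another data entry. The third part copies the results from the output data entry back to the corresponding TVM runtime argument for output.

Implement Run

The Run function accepts 1) a subgraph ID, 2) a list of input data entry indexes, and 3) an output data entry index.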

void Run(int id, const std::vector<int>& inputs, int output) {
  // Make a list of data entry indexes.
  std::vector<int> args(inputs.begin(), inputs.end());
  args.push_back(output);

  // Initialize data holders.
  std::vector<TVMValue> values(args.size());
  std::vector<int> type_codes(args.size());

  // Initialize a TVM arg setter with TVMValue and its type code.
  TVMArgsSetter setter(values.data(), type_codes.data());

  // Set each argument to its corresponding data entry.
  if (op_id_[id] == "add" || op_id_[id] == "sub" || op_id_[id] == "mul") {
    for (size_t i = 0; i < args.size(); i++) {
      setter(i, data_entry_[args[i]]);
    }
  }

  // Invoke the corresponding operator function.
  if (op_id_[id] == "add") {
    Add(values.data(), type_codes.data(), args.size());
  } else if (op_id_[id] == "sub") {
    Sub(values.data(), type_codes.data(), args.size());
  } else if (op_id_[id] == "mul") {
    Mul(values.data(), type_codes.data(), args.size());
  } else {
    LOG(FATAL) << "Unknown op: " << op_id_[id] << "\n";
  }
}

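The Run function has two main parts. The first part allocates a list of TVMValue and maps the corresponding data entry blocks; these become the arguments of the operator functions. The second part invokes the operator functions. Although we use the same C functions as in the previous example, you can replace Add, Sub, and Mul with your own engine. You only need to make sure that your engine stores the results in the last argument so that they can be transferred back to the TVM runtime.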
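The dispatch pattern can be sketched without TVM by treating each data entry as a std::vector<float>. ToyNodeEntry, RunNode, and RunGraph are hypothetical names; the sketch assumes binary element-wise operators and that a node never writes to one of its own inputs.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical node entry: operator name, input entry indexes, output index.
struct ToyNodeEntry {
  std::string op;
  std::vector<int> inputs;
  int output;
};

// Execute one node. The result is stored in the output data entry, which
// plays the role of the "last argument" in the description above.
void RunNode(const ToyNodeEntry& node, std::vector<std::vector<float>>* data) {
  assert(node.inputs.size() == 2);
  assert(node.output != node.inputs[0] && node.output != node.inputs[1]);
  const std::vector<float>& a = (*data)[node.inputs[0]];
  const std::vector<float>& b = (*data)[node.inputs[1]];
  assert(a.size() == b.size());
  std::vector<float>& out = (*data)[node.output];
  out.resize(a.size());
  for (size_t i = 0; i < a.size(); ++i) {
    if (node.op == "add") {
      out[i] = a[i] + b[i];
    } else if (node.op == "sub") {
      out[i] = a[i] - b[i];
    } else if (node.op == "mul") {
      out[i] = a[i] * b[i];
    } else {
      throw std::invalid_argument("Unknown op: " + node.op);
    }
  }
}

// Execute all nodes of a branch-free subgraph in the order they were stored.
void RunGraph(const std::vector<ToyNodeEntry>& graph,
              std::vector<std::vector<float>>* data) {
  for (const ToyNodeEntry& node : graph) RunNode(node, data);
}
```

After RunGraph returns, the entry at the output index of the last node holds the subgraph result, which is the value a GetFunction-style wrapper would copy back to the caller.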

With the functions above implemented, our customized codegen and runtime can now execute subgraphs. The last step is to register an API (examplejson_module_create) to create this module:

TVM_REGISTER_GLOBAL("module.examplejson_module_create")
.set_body_typed([](std::string code){
    auto n = make_object<ExampleJsonModule>(code);
    return runtime::Module(n);
});

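Implement SaveToBinary and LoadFromBinary

So far we have implemented the main features of a customized runtime so that it can be used like any other TVM runtime. However, when users want to save the built runtime to disk for deployment, TVM has no idea how to save it. This is why we implement SaveToBinary and LoadFromBinary, which tell TVM how this customized runtime should be persisted and restored.

We first implement the SaveToBinary function, which allows users to save this module to disk.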

void SaveToBinary(dmlc::Stream* stream) final {
    stream->Write(this->graph_json_);
}

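This function is quite simple. Recall that the only argument we take in the constructor is a subgraph representation, meaning that a subgraph representation is all we need to construct or recover this customized runtime module. As a result, SaveToBinary simply writes the subgraph to an output DMLC stream. That is, when users use the export_library API to export the module, the customized module is stored as an ExampleJSON stream of a subgraph.

Similarly, LoadFromBinary reads the subgraph stream and reconstructs the customized runtime module: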

static Module LoadFromBinary(void* strm) {
  dmlc::Stream* stream = static_cast<dmlc::Stream*>(strm);
  std::string graph_json;
  stream->Read(&graph_json);
  auto n = tvm::runtime::make_object<ExampleJsonModule>(graph_json);
  return Module(n);
}

We also need to register the following function to enable the corresponding Python API:

TVM_REGISTER_GLOBAL("module.loadbinary_examplejson")
.set_body_typed(ExampleJsonModule::LoadFromBinary);

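The registration above means that when users call the tvm.runtime.load_module(lib_path) API and the exported library contains an ExampleJSON stream, LoadFromBinary is invoked to create the same customized runtime module.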
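The save/load contract, namely that whatever SaveToBinary writes must be enough for LoadFromBinary to rebuild the module, can be illustrated with a round trip through a standard stream. SaveString and LoadString are hypothetical helpers, and the length-prefixed layout used here is an assumption of this sketch; it is not claimed to match the byte format of dmlc::Stream.

```cpp
#include <cassert>
#include <cstdint>
#include <istream>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical helper: write a string as a 64-bit length followed by its bytes.
void SaveString(std::ostream& os, const std::string& s) {
  const uint64_t size = s.size();
  os.write(reinterpret_cast<const char*>(&size), sizeof(size));
  os.write(s.data(), static_cast<std::streamsize>(s.size()));
}

// Hypothetical helper: read back a string written by SaveString.
// Returns false if the stream ends before the whole string is read.
bool LoadString(std::istream& is, std::string* s) {
  assert(s != nullptr);
  uint64_t size = 0;
  if (!is.read(reinterpret_cast<char*>(&size), sizeof(size))) return false;
  s->resize(static_cast<size_t>(size));
  if (size > 0 && !is.read(&(*s)[0], static_cast<std::streamsize>(size))) {
    return false;
  }
  return true;
}
```

Since the graph string is the only state needed to construct the module, a round trip that restores the exact string is sufficient to rebuild an equivalent module on load.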

In addition, if you want to support creating a module directly from an ExampleJSON file, you can also implement a simple function and register a Python API as follows:

static Module Create(const std::string& path) {
    std::ifstream filep;
    filep.open(path, std::ios::in);
    std::string graph_json;
    std::string line;
    while (std::getline(filep, line)) {
        graph_json += line;
        graph_json += "\n";
    }
    filep.close();
    auto n = tvm::runtime::make_object<ExampleJsonModule>(graph_json);
    return Module(n);
}

TVM_REGISTER_GLOBAL("module.loadfile_examplejson")
.set_body([](TVMArgs args, TVMRetValue* rv) {
    *rv = ExampleJsonModule::Create(args[0]);
});

This means users can manually write or modify an ExampleJSON file and use the Python API tvm.runtime.load_module("mysubgraph.examplejson", "examplejson") to construct a customized module.

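Summary

In summary, here is a checklist for your reference:

A codegen class derived from ExprVisitor and CodegenCBase (only for C codegen) with the following functions:
- VisitExpr_(const CallNode* call) to collect call node information.
- Other visitor functions needed to collect subgraph information.
- JIT to generate the subgraph code.
- Register the codegen.
A function to create a CSourceModule (for C codegen).
A runtime module class derived from ModuleNode with the following functions (for your graph representation):
- Constructor.
- GetFunction to generate a TVM runtime compatible PackedFunc.
- Run to execute a subgraph.
- Register a runtime creation API.
- SaveToBinary and LoadFromBinary to serialize and deserialize the customized runtime module.
- Register the LoadFromBinary API to support tvm.runtime.load_module(your_module_lib_path).
- (Optional) Create to support constructing a customized runtime module from a subgraph file in your representation.
An annotator to annotate a user's Relay program so that it makes use of your compiler and runtime (TBA).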