(CSE 599W) Reverse Mode Autodiff

Posted by 111qqz on Monday, April 5, 2021


Background

How are derivatives computed? There are generally three approaches:

  • Symbolic Differentiation
  • Numerical Differentiation
  • Automatic Differentiation (auto diff)

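Before moving on, a minimal sketch of the numerical approach helps set a baseline: it approximates each partial derivative with a central difference, needing one pair of function evaluations per input dimension, which is why it is mostly useful as a correctness check rather than for training. The function and names below are my own illustration, not course code; the toy function mirrors the test used later in this post.

import numpy as np

def numerical_grad(f, x, eps=1e-6):
    # Central-difference approximation of the gradient of a scalar-valued f at x.
    # One pair of f evaluations per input dimension -> O(n) cost, plus rounding error.
    grad = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e.flat[i] = eps
        grad.flat[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return grad

# Toy check: f(x1) = (x1 + 3) * (x1 + 5) has derivative 2*x1 + 8, i.e. 10 at x1 = 1.
print(numerical_grad(lambda v: np.sum((v + 3) * (v + 5)), np.ones(1)))  # ~[10.]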

The two mainstream flavors of auto diff are forward mode and reverse mode. In forward mode, the cost of computing all the derivatives is O(n), where n is the number of input parameters; in reverse mode it is O(m), where m is the number of output nodes. In a DNN, n is typically huge and far larger than m, so this post focuses on reverse-mode auto diff.
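To see why, think of the Jacobian J of a function with n inputs and m outputs (an m-by-n matrix): one forward-mode pass computes a Jacobian-vector product J·v, i.e. one column of J, so recovering every derivative takes n passes; one reverse-mode pass computes a vector-Jacobian product vᵀ·J, i.e. one row of J, so it takes m passes. A scalar training loss has m = 1, so a single reverse pass yields the gradient with respect to all parameters.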

The difference between backprop and reverse-mode auto diff

Walking through the reverse-mode auto diff procedure, it feels like exactly the same thing as backprop. In fact, backprop refers to the whole process of training a neural network by updating the weights according to the gradient of the loss, while auto diff is just the technique backprop uses to compute those gradients:

Backpropagation refers to the whole process of training an artificial neural network using multiple backpropagation steps, each of which computes gradients and uses them to perform a Gradient Descent step. In contrast, reverse-mode auto diff is simply a technique used to compute gradients efficiently and it happens to be used by backpropagation.

chain rule

This is the chain rule from calculus, or more precisely the multivariable chain rule. In short: multiply along the same path, and sum across different paths.

See MULTI-VARIABLE CHAIN RULE for a reference.
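As a concrete instance (the same shape as the test later in this post): if y = x2 * x3 with x2 = x1 + 3 and x3 = x1 + 5, there are two paths from x1 to y, so

dy/dx1 = (∂y/∂x2)·(dx2/dx1) + (∂y/∂x3)·(dx3/dx1) = x3·1 + x2·1 = x2 + x3,

which at x1 = 1 evaluates to 4 + 6 = 10.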

Implementation

Here is assignment 1 from the CSE 599W Systems for ML course. It is fairly simple and only covers the add, mul, and matmul operators (the point is not the variety of operators supported). You first implement each operator's forward computation and gradient, which is straightforward (a rough sketch of one operator follows below), and then implement the executor's run function: do a topological sort and evaluate the nodes in that order.
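As a rough sketch of the operator side (the class layout and the compute/gradient signatures are modeled on the assignment's skeleton, so treat this as an illustration rather than the exact code): __call__ builds the graph node, compute does the numeric forward work, and gradient builds the symbolic gradient nodes for each input.

class MulOp(Op):
    """Element-wise multiplication of two nodes."""
    def __call__(self, node_A, node_B):
        # Build a new graph node that records its two inputs.
        new_node = Op.__call__(self)
        new_node.inputs = [node_A, node_B]
        new_node.name = "(%s*%s)" % (node_A.name, node_B.name)
        return new_node

    def compute(self, node, input_vals):
        # Forward pass: multiply the two concrete input values.
        return input_vals[0] * input_vals[1]

    def gradient(self, node, output_grad):
        # Reverse pass: d(a*b)/da = b and d(a*b)/db = a,
        # each multiplied by the gradient flowing in from downstream.
        return [output_grad * node.inputs[1], output_grad * node.inputs[0]]

And the executor's run function itself: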

    def run(self, feed_dict):
        """Computes values of nodes in eval_node_list given computation graph.
        Parameters
        ----------
        feed_dict: dict mapping variable nodes to the values supplied by the user.

        Returns
        -------
        A list of values for nodes in eval_node_list.
        """
        node_to_val_map = dict(feed_dict)
        print("self.eval_node_list={}".format(self.eval_node_list))
        # Traverse graph in topological sort order and compute values for all nodes.
        topo_order = find_topo_sort(self.eval_node_list)
        # Evaluating in topological order guarantees that every dependency of the
        # current node has already been computed.
        """TODO: Your code here"""
        for node in topo_order:
            if isinstance(node.op, PlaceholderOp):
                continue
            # Substitute each input node with its concrete value before computing.
            input_vals = [node_to_val_map[x] for x in node.inputs]
            res = node.op.compute(node, input_vals)
            node_to_val_map[node] = res

        # Collect node values. Computation followed topo order, so re-collect the
        # results in the order of eval_node_list before returning.
        node_val_results = [node_to_val_map[node] for node in self.eval_node_list]
        return node_val_results

Then comes the gradient computation. Note that node_to_output_grads_list is a dict whose key is a node and whose value is a list: it collects the gradient contributions that flow back into that node from each of the downstream nodes it feeds.

And input_grads is the list of gradient contributions that a node passes back to each of its input nodes, i.e. the node's accumulated gradient multiplied by the local derivative of the node with respect to each input.

This may sound confusing, so let's look at a concrete example.

For a computation graph of this kind, the relevant variables take the following values (the nose log that follows was produced by a custom test whose output node is (x1+((x2*x3)*x1)), but it prints the same quantities for each node).

import numpy as np
import autodiff as ad

def test_multi_var_chain_rule():
    x1 = ad.Variable(name="x1")
    x2 = x1 + 3
    x3 = x1 + 5
    y = x2 * x3

    grad_x1, grad_x2, grad_x3 = ad.gradients(y, [x1, x2, x3])

    executor = ad.Executor([y, grad_x1, grad_x2, grad_x3])
    x1_val = 1 * np.ones(3)
    # With x1 = 1: x2 = 4, x3 = 6, y = 24, and by the chain rule
    # dy/dx1 = x2 + x3 = 10, dy/dx2 = x3 = 6, dy/dx3 = x2 = 4.
    y_val, grad_x1_val, grad_x2_val, grad_x3_val = executor.run(feed_dict={x1: x1_val})


nosetests -s -v autodiff_test.py
autodiff_test.custom_test ... output_node=(x1+((x2*x3)*x1))
node.name=(x1+((x2*x3)*x1))
node_to_output_grads_list[node]=[Oneslike((x1+((x2*x3)*x1)))]
grad=Oneslike((x1+((x2*x3)*x1)))
input_grads=[Oneslike((x1+((x2*x3)*x1))), Oneslike((x1+((x2*x3)*x1)))]

node.name=((x2*x3)*x1)
node_to_output_grads_list[node]=[Oneslike((x1+((x2*x3)*x1)))]
grad=Oneslike((x1+((x2*x3)*x1)))
input_grads=[(Oneslike((x1+((x2*x3)*x1)))*x1), (Oneslike((x1+((x2*x3)*x1)))*(x2*x3))]

node.name=(x2*x3)
node_to_output_grads_list[node]=[(Oneslike((x1+((x2*x3)*x1)))*x1)]
grad=(Oneslike((x1+((x2*x3)*x1)))*x1)
input_grads=[((Oneslike((x1+((x2*x3)*x1)))*x1)*x3), ((Oneslike((x1+((x2*x3)*x1)))*x1)*x2)]

node.name=x3
node_to_output_grads_list[node]=[((Oneslike((x1+((x2*x3)*x1)))*x1)*x2)]
grad=((Oneslike((x1+((x2*x3)*x1)))*x1)*x2)
input_grads=None

node.name=x2
node_to_output_grads_list[node]=[((Oneslike((x1+((x2*x3)*x1)))*x1)*x3)]
grad=((Oneslike((x1+((x2*x3)*x1)))*x1)*x3)
input_grads=None

node.name=x1
node_to_output_grads_list[node]=[Oneslike((x1+((x2*x3)*x1))), (Oneslike((x1+((x2*x3)*x1)))*(x2*x3))]
grad=(Oneslike((x1+((x2*x3)*x1)))+(Oneslike((x1+((x2*x3)*x1)))*(x2*x3)))
input_grads=None

length of node_to_output_grads_list = {(x1+((x2*x3)*x1)): [Oneslike((x1+((x2*x3)*x1)))], x1: [Oneslike((x1+((x2*x3)*x1))), (Oneslike((x1+((x2*x3)*x1)))*(x2*x3))], ((x2*x3)*x1): [Oneslike((x1+((x2*x3)*x1)))], (x2*x3): [(Oneslike((x1+((x2*x3)*x1)))*x1)], x2: [((Oneslike((x1+((x2*x3)*x1)))*x1)*x3)], x3: [((Oneslike((x1+((x2*x3)*x1)))*x1)*x2)]}
self.eval_node_list=[(x1+((x2*x3)*x1)), (Oneslike((x1+((x2*x3)*x1)))+(Oneslike((x1+((x2*x3)*x1)))*(x2*x3))), ((Oneslike((x1+((x2*x3)*x1)))*x1)*x3), ((Oneslike((x1+((x2*x3)*x1)))*x1)*x2)]


The corresponding loop in gradients(), which produced the log above, walks the graph in reverse topological order:

    for node in reverse_topo_order:
        # print("node.name={} op={}".format(node.name, type(node.op)))
        # Sum the contributions from all downstream nodes to get this node's gradient.
        grad = sum_node_list(node_to_output_grads_list[node])
        # print("grad={}".format(grad))
        input_grads = node.op.gradient(node, grad)
        # input_grads: the gradient contributions this node passes back to each of its inputs.
        # print("input_grads={}".format(input_grads))
        node_to_output_grad[node] = grad
        for idx, inp in enumerate(node.inputs):
            node_to_output_grads_list[inp] = node_to_output_grads_list.get(inp, [])
            node_to_output_grads_list[inp].append(input_grads[idx])
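To place this loop in context, here is a rough sketch of what the surrounding gradients() function looks like end to end. The seeding with a ones-like node, the reverse topological order, and the final collection step are my reconstruction from the log above and the standard reverse-mode recipe, not necessarily the assignment's exact code; oneslike_op is an assumed helper corresponding to the Oneslike nodes in the log.

def gradients(output_node, node_list):
    """Build gradient nodes of output_node with respect to each node in node_list."""
    # Seed: the adjoint of the output with respect to itself is all ones.
    node_to_output_grads_list = {output_node: [oneslike_op(output_node)]}
    node_to_output_grad = {}
    # Reverse topological order guarantees every consumer of a node is
    # processed before the node itself.
    reverse_topo_order = reversed(find_topo_sort([output_node]))
    for node in reverse_topo_order:
        grad = sum_node_list(node_to_output_grads_list[node])
        node_to_output_grad[node] = grad
        input_grads = node.op.gradient(node, grad)
        for idx, inp in enumerate(node.inputs):
            node_to_output_grads_list.setdefault(inp, []).append(input_grads[idx])
    # Return the accumulated gradient node for each requested node.
    return [node_to_output_grad[node] for node in node_list]

A nice property of building the gradients as graph nodes rather than numeric values is that the same Executor.run can then evaluate the forward values and the gradients together, exactly as the test above does by passing [y, grad_x1, grad_x2, grad_x3].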

References