Introduction to Deep Reinforcement Learning: Building Your First Game AI with TensorFlow

Author: Medium, 2017-11-21 09:20:06

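Reinforcement learning can look daunting at first, but getting started is actually not hard. In this article, we will build a simple Keras-based bot that can play the game Catch.

Last year, DeepMind's AlphaGo beat world Go champion Lee Sedol 4-1. More than 200 million viewers watched reinforcement learning step onto the world stage. A few years earlier, DeepMind had caused a stir with a bot that could play Atari games. The company was acquired by Google soon afterwards.

Many researchers believe that reinforcement learning is our best shot at creating artificial general intelligence. It is an exciting field, with many unsolved challenges and huge potential.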

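The Catch game

[Figure: The original Catch game interface]

Catch is a very simple arcade game that you may have played as a child. The rules are as follows: fruit falls from the top of the screen, and the player has to catch it with a basket. Each fruit caught scores one point, and each fruit missed costs one point. The goal here is to let the computer play Catch by itself. We will not use the polished game interface, though. Instead, we will use a simplified version of the game to make the task easier:

[Figure: The simplified Catch game interface]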

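While playing Catch, the player chooses between three possible actions: move the basket left, move it right, or keep it where it is. The decision depends on the current state of the game, that is, on the position of the falling fruit and the position of the basket. Our goal is to create a model that, given the content of the game screen, chooses the action that leads to the highest score.

This task could be treated as a simple classification problem. We could have expert players play the game many times and record their actions. Then we could train a model to pick the "correct" action, the one an expert would choose.

But that is not really how humans learn. Humans can teach themselves a game like Catch without guidance. This is very useful. Imagine having to hire a group of experts to play a game thousands of times every time you wanted to learn something as simple as Catch! That would be expensive and slow.

In reinforcement learning, by contrast, the model does not train on labeled data but learns from its own past experience.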

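Deep reinforcement learning

Reinforcement learning is inspired by behavioral psychology. Instead of giving the model the "correct" action, we give it rewards and punishments. The model receives information about the current state of the environment (for example, the computer game screen). It then outputs an action, much like a game controller would. The environment responds to the action and returns the next state along with a reward or punishment.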
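The state, action, reward loop described above can be sketched with a toy environment. This is a hypothetical minimal version of Catch, not the article's actual implementation: the class name `Catch`, the 3-cell-wide basket, and the random policy are all assumptions made for illustration. Only the method names `reset`, `observe` and `act`, the action order (left, stay, right), and the +1 / -1 scoring follow the article.

```python
import numpy as np


class Catch:
    """Hypothetical minimal Catch environment on a grid_size x grid_size board."""

    def __init__(self, grid_size=10, seed=0):
        self.grid_size = grid_size
        self.rng = np.random.default_rng(seed)
        self.reset()

    def reset(self):
        # The fruit starts in the top row at a random column;
        # the basket is 3 cells wide (an assumption) and sits in the bottom row.
        self.fruit_row = 0
        self.fruit_col = int(self.rng.integers(0, self.grid_size))
        self.basket = int(self.rng.integers(1, self.grid_size - 1))  # basket centre

    def observe(self):
        # Current state: the screen flattened into a (1, grid_size**2) vector.
        canvas = np.zeros((self.grid_size, self.grid_size))
        canvas[self.fruit_row, self.fruit_col] = 1
        canvas[-1, self.basket - 1:self.basket + 2] = 1
        return canvas.reshape(1, -1)

    def act(self, action):
        # Actions: 0 = move left, 1 = stay, 2 = move right.
        self.basket = int(np.clip(self.basket + action - 1, 1, self.grid_size - 2))
        self.fruit_row += 1
        game_over = self.fruit_row == self.grid_size - 1
        reward = 0  # the reward stays 0 until the fruit reaches the bottom row
        if game_over:
            caught = abs(self.fruit_col - self.basket) <= 1
            reward = 1 if caught else -1
        return self.observe(), reward, game_over


env = Catch()
state = env.observe()
game_over = False
rewards = []
while not game_over:
    action = int(np.random.randint(0, 3))  # random policy, no learning yet
    state, reward, game_over = env.act(action)
    rewards.append(reward)
```

Every step returns a reward of 0 except the last one, which is the sparse-reward situation that Q-Learning has to cope with.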

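From this, the model learns to find the actions that maximize its reward.

There are many ways to do this in practice. Below, we will look at Q-Learning. Q-Learning caused a sensation when it was used to train a computer to play Atari games, and it remains an important concept today. Many modern reinforcement learning algorithms build on Q-Learning in some way.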

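Understanding Q-Learning

A good way to understand Q-Learning is to compare playing Catch with playing chess.

In both games you are given a state, S. In chess, this is the position of the pieces on the board. In Catch, it is the position of the fruit and the basket.

The player then takes an action, A. In chess, the player moves a piece. In Catch, the player moves the basket left or right, or keeps it in its current position. As a result, there is some reward R and a new state S'.

One thing Catch and chess have in common is that the reward does not appear immediately after the action.

In Catch, you only earn a reward when the fruit lands in the basket or hits the floor. In chess, you only earn a reward when you win or lose the whole game. In other words, rewards are sparsely distributed. Most of the time, R is zero.

When there is a reward, it is not always the result of the action taken just before it. An action taken long before may have been the key to winning. Working out which action is responsible for the final reward is usually called the credit assignment problem.

Because rewards are delayed, good chess players do not choose their moves based only on the immediate reward. Instead, they choose based on the expected future reward. For example, they do not only consider whether the next move can capture one of the opponent's pieces. They also consider moves that pay off in the long run.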

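In Q-Learning, we choose the action with the highest expected future reward. We compute this with a Q function, a mathematical function of two arguments: the current state of the game and a given action. We can therefore write it as Q(state, action). In state S, we estimate the future reward for each possible action A. We assume that after taking action A and moving to the next state S', everything works out perfectly.

For a given state S and action A, the expected future reward Q(S, A) is calculated as the immediate reward R plus the expected future reward thereafter, Q(S', A'). We assume the next action A' is optimal.

Because the future is uncertain, we discount Q(S', A') by a factor γ, taking the maximum over the possible next actions A':

Q(S, A) = R + γ * max Q(S', A')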
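To see what the discount does to a delayed reward, here is a small worked example. It assumes a toy chain of four steps with a single available action at each step (so the max is trivial) and γ = 0.9; the reward sequence is made up for illustration.

```python
GAMMA = 0.9  # discount factor (assumed value for illustration)

# A toy chain of 4 steps: the only reward (+1) arrives at the last step.
rewards = [0.0, 0.0, 0.0, 1.0]

# Work backwards: at the last step Q is just R; before that, Q = R + GAMMA * Q_next.
q_values = [0.0] * len(rewards)
q_next = 0.0
for t in reversed(range(len(rewards))):
    q_values[t] = rewards[t] + GAMMA * q_next
    q_next = q_values[t]

print([round(q, 3) for q in q_values])  # [0.729, 0.81, 0.9, 1.0]
```

The further an action is from the reward, the smaller its Q value, but it is still positive: credit for the final reward flows back to earlier steps.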

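Good chess players are very good at estimating future rewards in their heads. In other words, their Q function Q(S, A) is very accurate. Most chess practice revolves around developing a better Q function. Players study records of past games to learn how particular moves play out and how likely a given move is to lead to victory. But how can a machine estimate a good Q function? This is where neural networks come in.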

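Regression after all

While playing, we generate a lot of "experiences", each consisting of:

the initial state, S
the action taken, A
the reward earned, R
the state that followed, S'

These experiences are our training data. We can frame estimating Q(S, A) as a regression problem, and we can solve it with a neural network. Given an input vector consisting of S and A, the neural network should predict a value of Q(S, A) equal to the target: R + γ * max Q(S', A'). (In the code below, the network takes only S as input and outputs one Q value per possible action.)

If we predict Q(S, A) well for different states S and actions A, we have a good approximation of the Q function. Note that we estimate Q(S', A') with the same neural network as Q(S, A).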

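Training process

Given a batch of experiences <S, A, R, S'>, training proceeds as follows:

For each possible action A' (left, right, stay), use the neural network to predict the expected future reward Q(S', A');
Take the highest of the three predictions as max Q(S', A');
Calculate R + γ * max Q(S', A'). This is the target value for the neural network;
Train the neural network using a loss function, which measures how far the predicted value is from the target value. Here, we use 0.5 * (predicted_Q(S, A) - target)² as the loss function.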
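The steps above can be sketched for a single experience. The helper names `q_target` and `half_squared_error` and all numbers are made up for illustration; the `game_over` branch mirrors the article's replay code, where the target at the end of a game is just the final reward.

```python
import numpy as np


def q_target(reward, next_q_values, discount=0.9, game_over=False):
    """Target for Q(S, A): R at the end of a game, else R + discount * max Q(S', A')."""
    if game_over:
        return float(reward)
    return float(reward + discount * np.max(next_q_values))


def half_squared_error(predicted_q, target):
    """The loss named in the text: 0.5 * (predicted_Q(S, A) - target)^2."""
    return 0.5 * (predicted_q - target) ** 2


# Made-up network outputs Q(S', A') for the actions left, stay, right.
next_q = np.array([0.2, 0.5, -0.1])

target = q_target(reward=0.0, next_q_values=next_q)         # 0 + 0.9 * 0.5 = 0.45
loss = half_squared_error(predicted_q=0.3, target=target)   # 0.5 * 0.15^2 = 0.01125
```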

During gameplay, all experiences are stored in a replay memory. It acts as a simple buffer of <S, A, R, S'> tuples. The experience replay class also handles preparing the training data. Let's look at the code below:

import numpy as np

class ExperienceReplay(object):
    """ 
    During gameplay all the experiences < s, a, r, s’ > are stored in a replay memory.  
    In training, batches of randomly drawn experiences are used to generate the input and target for training. 
    """ 
    def __init__(self, max_memory=100, discount=.9): 
        """ 
        Setup 
        max_memory: the maximum number of experiences we want to store 
        memory: a list of experiences 
        discount: the discount factor for future experience 
         
        In the memory, whether the game ended at the state is stored separately in a nested array
        [... 
        [experience, game_over] 
        [experience, game_over] 
        ...] 
        """ 
        self.max_memory = max_memory 
        self.memory = list() 
        self.discount = discount 
 
    def remember(self, states, game_over):
        # Save one experience to memory; `states` is the list [s, a, r, s'] for a single step
        self.memory.append([states, game_over])
        # We don't want to store infinite memories, so if we have too many, we just delete the oldest one
        if len(self.memory) > self.max_memory:
            del self.memory[0]
 
    def get_batch(self, model, batch_size=10): 
         
        #How many experiences do we have? 
        len_memory = len(self.memory) 
         
        #Calculate the number of actions that can possibly be taken in the game 
        num_actions = model.output_shape[-1] 
         
        #Dimensions of the game field 
        env_dim = self.memory[0][0][0].shape[1] 
         
        #We want to return an input and target vector with inputs from an observed state... 
        inputs = np.zeros((min(len_memory, batch_size), env_dim)) 
         
        #...and the target r + gamma * max Q(s’,a’) 
        #Note that our target is a matrix, with fields not only for the action taken but also
        #for the other possible actions. The actions not taken keep the same value as the prediction so that training does not affect them
        targets = np.zeros((inputs.shape[0], num_actions)) 
         
        #We draw states to learn from randomly 
        for i, idx in enumerate(np.random.randint(0, len_memory, 
                                                  size=inputs.shape[0])): 
            """ 
            Here we load one transition <s, a, r, s’> from memory 
            state_t: initial state s 
            action_t: action taken a 
            reward_t: reward earned r 
            state_tp1: the state that followed s’ 
            """ 
            state_t, action_t, reward_t, state_tp1 = self.memory[idx][0] 
             
            #We also need to know whether the game ended at this state 
            game_over = self.memory[idx][1] 
 
            #add the state s to the input 
            inputs[i:i+1] = state_t 
             
            # First we fill the target values with the predictions of the model. 
            # They will not be affected by training (since the training loss for them is 0) 
            targets[i] = model.predict(state_t)[0] 
             
            """ 
            If the game ended, the expected reward Q(s,a) should be the final reward r. 
            Otherwise the target value is r + gamma * max Q(s’,a’) 
            """ 
            #  Here Q_sa is max_a'Q(s', a') 
            Q_sa = np.max(model.predict(state_tp1)[0]) 
             
            #if the game ended, the reward is the final reward 
            if game_over:  # if game_over is True 
                targets[i, action_t] = reward_t 
            else: 
                # r + gamma * max Q(s’,a’) 
                targets[i, action_t] = reward_t + self.discount * Q_sa 
        return inputs, targets
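The trick used in get_batch, where the targets start out as the model's own predictions and only the entry for the action taken is overwritten, can be checked with a few made-up numbers. This is a sketch with invented values, not output from the trained model.

```python
import numpy as np

predicted = np.array([[0.2, 0.5, -0.1]])  # made-up Q(s, a) for left, stay, right
action_taken = 2                          # the action stored in the experience
target_value = 0.45                       # made-up r + gamma * max Q(s', a')

# Start from the predictions, then overwrite only the action that was taken.
targets = predicted.copy()
targets[0, action_taken] = target_value

# Per-action squared error: zero everywhere except the action taken,
# so only that output contributes to the training loss.
squared_error = (predicted - targets) ** 2
```

Here squared_error is zero for the first two actions and about 0.3025 for the third, so the update only pushes on the Q value of the action that was actually played.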

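Defining the model

Now let's define the model that learns Catch with Q-Learning. We use Keras as a front end to TensorFlow. Our baseline model is a simple three-layer dense network. This model performs well on the simplified version of Catch. You can find the full implementation on GitHub.

You can also try more complex models and test whether they achieve better performance.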

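from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import SGD

num_actions = 3  # [move_left, stay, move_right]
hidden_size = 100  # Size of the hidden layers
grid_size = 10  # Size of the playing field

def baseline_model(grid_size, num_actions, hidden_size):
    # Set up the model with Keras: two hidden ReLU layers and a linear output layer
    model = Sequential()
    model.add(Dense(hidden_size, input_shape=(grid_size**2,), activation='relu'))
    model.add(Dense(hidden_size, activation='relu'))
    # One output per action: the predicted Q(S, A) for each possible move
    model.add(Dense(num_actions))
    # Plain SGD with mean squared error; older Keras releases spell this sgd(lr=.1)
    model.compile(SGD(learning_rate=.1), "mse")
    return model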
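To make the shapes concrete, here is a NumPy-only sketch of the same layer layout (100 inputs, two hidden layers of 100 units with ReLU, 3 linear outputs). The random weights and the `forward` helper are assumptions for illustration; this is not how Keras initializes or stores the model.

```python
import numpy as np

grid_size, hidden_size, num_actions = 10, 100, 3
rng = np.random.default_rng(0)

# Layer widths: flattened screen -> hidden -> hidden -> one Q value per action.
layer_sizes = [grid_size ** 2, hidden_size, hidden_size, num_actions]
weights = [rng.normal(0, 0.1, size=(n_in, n_out))
           for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:])]
biases = [np.zeros(n_out) for n_out in layer_sizes[1:]]


def forward(state):
    """Forward pass: ReLU on the two hidden layers, linear output layer."""
    x = state
    for i, (w, b) in enumerate(zip(weights, biases)):
        x = x @ w + b
        if i < len(weights) - 1:
            x = np.maximum(x, 0)
    return x


state = np.zeros((1, grid_size ** 2))  # a flattened 10 x 10 screen
state[0, 5] = 1                        # one lit cell, e.g. the fruit
q_values = forward(state)              # shape (1, 3): one value per action
num_params = sum(w.size + b.size for w, b in zip(weights, biases))  # 20503
```

The network maps a whole screen to three numbers, and picking the largest one gives the action.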

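Exploration

The last ingredient of Q-Learning is exploration. Everyday experience tells us that sometimes you have to do something odd or random to find out whether there is something better than your daily routine.

The same holds for Q-Learning. Always taking the best-known option means you may miss paths you have never explored. To avoid this, the learner sometimes takes a random action instead of the best one. We can define the training method as follows: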

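def train(model,epochs):
    # Train the model for `epochs` games
    # Relies on the globals env, exp_replay, epsilon, batch_size and num_actions being defined
    #Resetting the win counter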
    win_cnt = 0 
    # We want to keep track of the progress of the AI over time, so we save its win count history 
    win_hist = [] 
    #Epochs is the number of games we play 
    for e in range(epochs): 
        loss = 0. 
        #Resetting the game 
        env.reset() 
        game_over = False 
        # get initial input 
        input_t = env.observe() 
         
        while not game_over: 
            #The learner is acting on the last observed game screen 
            #input_t is a vector representing the game screen
            input_tm1 = input_t 
             
            #Take a random action with probability epsilon 
            if np.random.rand() <= epsilon: 
                #Explore: pick a random action (a plain integer, like the argmax branch below)
                action = np.random.randint(0, num_actions)
            else: 
                #Choose yourself 
                #q contains the expected rewards for the actions 
                q = model.predict(input_tm1) 
                #We pick the action with the highest expected reward 
                action = np.argmax(q[0]) 
 
            # apply action, get rewards and new state 
            input_t, reward, game_over = env.act(action) 
            #If we managed to catch the fruit we add 1 to our win counter 
            if reward == 1: 
                win_cnt += 1         
             
            #Uncomment this to render the game here 
            #display_screen(action,3000,inputs[0]) 
             
            """ 
            The experiences < s, a, r, s’ > we make during gameplay are our training data. 
            Here we first save the last experience, and then load a batch of experiences to train our model 
            """ 
             
            # store experience 
            exp_replay.remember([input_tm1, action, reward, input_t], game_over)     
             
            # Load batch of experiences 
            inputs, targets = exp_replay.get_batch(model, batch_size=batch_size) 
   
            # train model on experiences 
            batch_loss = model.train_on_batch(inputs, targets) 
             
            #sum up loss over all batches in an epoch 
            loss += batch_loss 
        win_hist.append(win_cnt) 
    return win_hist
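The exploration step inside the loop above can be isolated as a small function. This is a sketch: the name `epsilon_greedy`, the explicit random generator, and the Q values are made up for illustration.

```python
import numpy as np


def epsilon_greedy(q_values, epsilon, rng):
    """With probability epsilon pick a random action, otherwise the best-known one."""
    if rng.random() <= epsilon:
        return int(rng.integers(0, len(q_values)))
    return int(np.argmax(q_values))


rng = np.random.default_rng(0)
q_values = np.array([0.1, 0.7, 0.2])  # made-up Q values for left, stay, right

# epsilon = 0: always exploit, so the action with the highest Q value is chosen.
greedy_action = epsilon_greedy(q_values, epsilon=0.0, rng=rng)

# epsilon = 1: always explore, so every action shows up over many draws.
random_actions = [epsilon_greedy(q_values, epsilon=1.0, rng=rng) for _ in range(1000)]
```

A small epsilon, somewhere between these two extremes, keeps the bot mostly following its current Q estimates while still trying other moves now and then.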

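I trained this game bot for 5000 epochs, and it performs quite well!

[Figure: The Catch bot in action]

As the animation above shows, the bot can catch the apples falling from the sky. To visualize how the model learned, I plotted the moving average of wins over the epochs.

What next?

You now have a first intuition for reinforcement learning. I recommend reading through the tutorial's full code. You can also experiment with it.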
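A moving average of wins can be computed from the cumulative win counts that train returns. The helper name `moving_average_wins`, the window size, and the sample history below are assumptions for illustration, not the article's plotting code.

```python
import numpy as np


def moving_average_wins(win_hist, window=100):
    """Turn cumulative win counts into a moving average of wins per epoch."""
    # train() appends the running win total after each game, so the
    # difference between neighbours is 1 for a win and 0 otherwise.
    wins_per_epoch = np.diff(np.asarray(win_hist, dtype=float), prepend=0.0)
    kernel = np.ones(window) / window
    return np.convolve(wins_per_epoch, kernel, mode="valid")


# Made-up history of six games: loss, win, loss, win, win, win.
win_hist = [0, 1, 1, 2, 3, 4]
averages = moving_average_wins(win_hist, window=2)  # values: 0.5, 0.5, 0.5, 1.0, 1.0
```

A curve that climbs towards 1.0 means the bot is catching almost every fruit.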

Editor: 龐桂玉. Source: 36大數(shù)據(jù)