

AlphaGo Zero Taught Itself Go, and You Can Build One Too


Back then, not long after AlphaGo's Master version swept Ke Jie 9-dan, it was itself crushed by its successor, AlphaGo Zero.

Starting as an AI that knew nothing about Go, Zero needed only 21 days to beat Master.

What's more, it was never fed any human knowledge: it became a top player entirely by teaching itself.

If you could raise an AI like that, you could be quite proud of it even if you cannot play Go yourself.

So Dylan Djian, a young developer from Paris, set out to reproduce it by following the AlphaGo Zero paper.

He named his AI player SuperGo and published the code (link at the end of this article).

On top of that, he wrote a tutorial, summarized below.

One body, two heads

The agent is made up of three parts:

a feature extractor (Feature Extractor), a policy network (Policy Network), and a value network (Value Network).

That is why AlphaGo Zero is affectionately called a "two-headed monster": the feature extractor is the body, and the other two networks are its brains.

Feature extractor

The feature-extraction model is a residual network (ResNet): an ordinary CNN with skip connections (Skip Connection) added, which lets gradients propagate more smoothly.

Written as code, one such skip looks like this:

import torch.nn as nn
import torch.nn.functional as F


class BasicBlock(nn.Module):
    """
    Basic residual block with 2 convolutions and a skip connection
    before the last ReLU activation.
    """

    def __init__(self, inplanes, planes, stride=1, downsample=None):
        super(BasicBlock, self).__init__()

        self.conv1 = nn.Conv2d(inplanes, planes, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(planes)

        self.conv2 = nn.Conv2d(planes, planes, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(planes)

    def forward(self, x):
        residual = x

        out = self.conv1(x)
        out = F.relu(self.bn1(out))

        out = self.conv2(out)
        out = self.bn2(out)

        out += residual
        out = F.relu(out)

        return out

Then stack these blocks inside the feature-extraction model:

class Extractor(nn.Module):
    def __init__(self, inplanes, outplanes):
        super(Extractor, self).__init__()
        self.conv1 = nn.Conv2d(inplanes, outplanes, stride=1,
                               kernel_size=3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(outplanes)

        for block in range(BLOCKS):
            setattr(self, "res{}".format(block),
                    BasicBlock(outplanes, outplanes))

    def forward(self, x):
        x = F.relu(self.bn1(self.conv1(x)))
        for block in range(BLOCKS - 1):
            x = getattr(self, "res{}".format(block))(x)

        feature_maps = getattr(self, "res{}".format(BLOCKS - 1))(x)
        return feature_maps

Policy network

The policy network is an ordinary CNN: it contains a batch normalization (Batch Normalization) layer and a fully connected layer that outputs a probability distribution over moves.

class PolicyNet(nn.Module):
    def __init__(self, inplanes, outplanes):
        super(PolicyNet, self).__init__()
        self.outplanes = outplanes
        self.conv = nn.Conv2d(inplanes, 1, kernel_size=1)
        self.bn = nn.BatchNorm2d(1)
        self.logsoftmax = nn.LogSoftmax(dim=1)
        self.fc = nn.Linear(outplanes - 1, outplanes)

    def forward(self, x):
        x = F.relu(self.bn(self.conv(x)))
        x = x.view(-1, self.outplanes - 1)
        x = self.fc(x)
        probas = self.logsoftmax(x).exp()

        return probas

Value network

This network is slightly more involved. On top of the standard layers it adds one extra fully connected layer. At the end, a hyperbolic tangent (Hyperbolic Tangent) maps the output into (-1, 1), representing how favorable the current position is.

The code looks like this:

class ValueNet(nn.Module):
    def __init__(self, inplanes, outplanes):
        super(ValueNet, self).__init__()
        self.outplanes = outplanes
        self.conv = nn.Conv2d(inplanes, 1, kernel_size=1)
        self.bn = nn.BatchNorm2d(1)
        self.fc1 = nn.Linear(outplanes - 1, 256)
        self.fc2 = nn.Linear(256, 1)

    def forward(self, x):
        x = F.relu(self.bn(self.conv(x)))
        x = x.view(-1, self.outplanes - 1)
        x = F.relu(self.fc1(x))
        winning = F.tanh(self.fc2(x))
        return winning
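Putting the body and the two heads together: the sketch below is a hypothetical wrapper (the name SuperGoNet and its constructor arguments are mine, not taken from the repo) showing how the three modules above can be wired into a single two-headed network. For a 19x19 board the policy output has 19*19 + 1 = 362 entries (every point plus pass), which matches the `outplanes - 1` reshapes in the two heads.

import torch.nn as nn

class SuperGoNet(nn.Module):
    """Hypothetical wrapper: one shared body, two heads."""

    def __init__(self, inplanes, filters, board_size=19):
        super(SuperGoNet, self).__init__()
        outplanes = board_size * board_size + 1           # every point plus "pass"
        self.extractor = Extractor(inplanes, filters)     # the body
        self.policy_head = PolicyNet(filters, outplanes)  # head 1: move probabilities
        self.value_head = ValueNet(filters, outplanes)    # head 2: win estimate in (-1, 1)

    def forward(self, x):
        feature_maps = self.extractor(x)                  # shared features
        return self.policy_head(feature_maps), self.value_head(feature_maps)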

A tree that plans ahead

AlphaGo Zero has one more essential component: Monte Carlo Tree Search (MCTS).

It lets the AI player find, ahead of time, the move with the highest winning chances.

Inside the simulator it plays out the opponent's next move, and the move after that, and works out a response to each, so it looks ahead far more than a single move.

Node

Every node in the tree represents a different board position and carries its own statistics:

the number of times the node has been visited n, the total action value w, the prior probability of reaching this node p, the mean action value q (q = w / n), plus the move that was played to reach this node from its parent and all the possible moves that can follow from it.

class Node:
    def __init__(self, parent=None, proba=None, move=None):
        self.p = proba
        self.n = 0
        self.w = 0
        self.q = 0
        self.children = []
        self.parent = parent
        self.move = move

Rollout

The first step uses the PUCT (polynomial upper confidence trees) algorithm: pick the move that maximizes a variant (Variant) of the PUCT formula.
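Reading it off the select code below (my reconstruction, since the original figure did not survive), the variant appears to be U(a) = Q(a) + c_puct * P(a) * sqrt(Σ_b N(b)) / (1 + N(a)), where Q is the mean action value, P the prior probability, N the visit count, and the sum runs over all candidate moves at the current node.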

Written as code:

def select(nodes, c_puct=C_PUCT):
    " Optimized version of the selection based on the PUCT formula "

    total_count = 0
    for i in range(nodes.shape[0]):
        total_count += nodes[i][1]

    action_scores = np.zeros(nodes.shape[0])
    for i in range(nodes.shape[0]):
        action_scores[i] = nodes[i][0] + c_puct * nodes[i][2] * \
            (np.sqrt(total_count) / (1 + nodes[i][1]))

    equals = np.where(action_scores == np.max(action_scores))[0]
    if equals.shape[0] > 0:
        return np.random.choice(equals)
    return equals[0]

Ending

Selection keeps going until it reaches a leaf node (Leaf Node), one that has not yet grown any branches.

def is_leaf(self):
    """ Check whether a node is a leaf or not """

    return len(self.children) == 0
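A rough, unoptimized sketch of that descent, written directly against the Node class (the actual code uses the array-based select above; root_node is a hypothetical name for the root of the search tree):

current_node = root_node
while not current_node.is_leaf():
    # Step down to the child with the highest PUCT score (same formula as select);
    # current_node.n approximates the total visit count of its children.
    current_node = max(
        current_node.children,
        key=lambda child: child.q + C_PUCT * child.p
                          * np.sqrt(current_node.n) / (1 + child.n))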

Once a leaf is reached, a random version of the state at that node is evaluated, which yields a probability for every possible next move.

All forbidden points get probability zero, and the remaining probabilities are renormalized so they sum to 1.
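A minimal sketch of that masking step (the helper name and the legal_moves mask are hypothetical, not from the repo):

import numpy as np

def mask_illegal_moves(probas, legal_moves):
    """Zero out forbidden points and renormalize the rest to sum to 1."""
    probas = probas * legal_moves      # forbidden points get probability 0
    total = np.sum(probas)
    if total > 0:
        probas = probas / total        # renormalize the remaining moves
    return probas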

After that, the leaf node grows children, one for every playable point whose probability is non-zero. The code:

def expand(self, probas):
    self.children = [Node(parent=self, move=idx, proba=probas[idx])
                     for idx in range(probas.shape[0]) if probas[idx] > 0]

Updating

Once the children are in place, the statistics of the leaf node and of all its ancestors are updated, using the two snippets below.

def update(self, v):
    """ Update the node statistics after a rollout """

    self.w = self.w + v
    self.q = self.w / self.n if self.n > 0 else 0

# Walk back up the tree, updating every ancestor of the leaf
while current_node.parent:
    current_node.update(v)
    current_node = current_node.parent

Choosing where to play

With the simulator built, every possible next move now has its own statistics.

Based on those statistics, the algorithm picks the move that will actually be played.

There are two ways to choose. The first is to pick the move that was simulated the most times; this is used for evaluation and real games.
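For that deterministic variant, a one-line sketch (assuming, as in the snippet below, that action_scores holds the per-move visit counts) would be:

move = np.argmax(action_scores)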

The other is to choose stochastically (Stochastically): the visit counts are turned into a probability distribution and a move is sampled from it, with the following code:

total = np.sum(action_scores)
probas = action_scores / total
move = np.random.choice(action_scores.shape[0], p=probas)

The latter is used during training, so that AlphaGo explores more of the possible choices.

A three-part training regime

AlphaGo Zero's training is split into three processes that run asynchronously.

The first is self-play (Self-Play), which generates data.

def self_play():
    while True:
        new_player, checkpoint = load_player()
        if new_player:
            player = new_player

        ## Create the self-play match queue of processes
        results = create_matches(player, cores=PARALLEL_SELF_PLAY,
                                 match_number=SELF_PLAY_MATCH)
        for _ in range(SELF_PLAY_MATCH):
            result = results.get()
            db.insert({
                "game": result,
                "id": game_id
            })
            game_id += 1

The second is training (Training), which takes the freshly generated data and improves the current neural network.

def train():
    criterion = AlphaLoss()
    dataset = SelfPlayDataset()
    player, checkpoint = load_player(current_time, loaded_version)
    optimizer = create_optimizer(player, lr,
                                 param=checkpoint['optimizer'])
    best_player = deepcopy(player)
    dataloader = DataLoader(dataset, collate_fn=collate_fn,
                            batch_size=BATCH_SIZE, shuffle=True)

    while True:
        for batch_idx, (state, move, winner) in enumerate(dataloader):

            ## Evaluate a copy of the current network
            if total_ite % TRAIN_STEPS == 0:
                pending_player = deepcopy(player)
                result = evaluate(pending_player, best_player)

                if result:
                    best_player = pending_player

            example = {
                'state': state,
                'winner': winner,
                'move': move
            }
            optimizer.zero_grad()
            winner, probas = pending_player.predict(example['state'])

            loss = criterion(winner, example['winner'],
                             probas, example['move'])
            loss.backward()
            optimizer.step()

            ## Fetch new games
            if total_ite % REFRESH_TICK == 0:
                last_id = fetch_new_games(collection, dataset, last_id)

The loss function used for training is expressed as follows:

class AlphaLoss(torch.nn.Module):
    def __init__(self):
        super(AlphaLoss, self).__init__()

    def forward(self, pred_winner, winner, pred_probas, probas):
        value_error = (winner - pred_winner) ** 2
        policy_error = torch.sum((-probas *
                                  (1e-6 + pred_probas).log()), 1)
        total_error = (value_error.view(-1) + policy_error).mean()
        return total_error
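In equation form, the loss computed above is, per example, (z - v)^2 - π · log(p + 1e-6) summed over all moves, where z is the actual game winner, v the predicted value, π the target move distribution recorded during self-play, and p the predicted move probabilities; the batch mean of this quantity is what gets backpropagated.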

The third is evaluation (Evaluation), which checks whether the newly trained agent has become better than the agent currently generating data (whichever is better goes back to step one and keeps generating data).

def evaluate(player, new_player):
    results = play(player, opponent=new_player)
    black_wins = 0
    white_wins = 0

    for result in results:
        if result[0] == 1:
            white_wins += 1
        elif result[0] == 0:
            black_wins += 1

    ## Check if the trained player (black) is better than
    ## the current best player depending on the threshold
    if black_wins >= EVAL_THRESH * len(results):
        return True
    return False

This third part matters: only by repeatedly selecting the best network to generate high-quality data can the AI's playing strength keep improving.

Only by cycling through these three stages over and over can a strong player be grown.
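As a rough illustration (not necessarily how the author launches them), the asynchronous stages could be started side by side with Python's multiprocessing, with evaluation happening inside the training loop as in the train() code above:

import multiprocessing as mp

if __name__ == "__main__":
    processes = [
        mp.Process(target=self_play),  # keeps generating games with the current best player
        mp.Process(target=train),      # trains on fresh games and evaluates candidates
    ]
    for p in processes:
        p.start()
    for p in processes:
        p.join()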

If you are interested in AI for Go, you can try this PyTorch implementation yourself.

This article was adapted from QbitAI (量子位); the original author is Dylan Djian.

Code:

Web link

Tutorial:

Web link

AlphaGo Zero paper:

Web link