Preface

Aha, I'm back with another update. I had originally planned to wrap this series up, but I found a few more fun problems, so here is one more post, which also gives me a chance to pay off what I still owed from before. The full code is at https://github.com/zong4/AILearning.

Rock, Paper, Scissors

I don't know what strategy came to mind when you first saw this problem. In any case, reinforcement learning is not really usable here, because a pure game of chance has no optimal strategy to learn. Let me show you the three strategies I wrote instead.

Strategy 1

Play the move that beats whatever the opponent played last.

# Strategy 1: play the winning move against the opponent's last move
if prev_play == 'R':
    return 'P'
elif prev_play == 'P':
    return 'S'
elif prev_play == 'S':
    return 'R'
else:
    return random.choice(['R', 'P', 'S'])

Strategy 2

Play the move that beats the opponent's most frequent move so far.

# Strategy 2: play the winning move against the opponent's most frequent move
if opponent_history.count('R') > opponent_history.count('P') and opponent_history.count('R') > opponent_history.count('S'):
    return 'P'
elif opponent_history.count('P') > opponent_history.count('R') and opponent_history.count('P') > opponent_history.count('S'):
    return 'S'
elif opponent_history.count('S') > opponent_history.count('R') and opponent_history.count('S') > opponent_history.count('P'):
    return 'R'
else:
    return random.choice(['R', 'P', 'S'])

Strategy 3

Use a Markov chain to predict the opponent's most likely next move.

# Strategy 3: predict the opponent's next move with a first-order Markov chain
# and play the move that beats the prediction.
# Assumes opponent_history is a string of past moves (e.g. "RPSR"),
# so substring counts approximate transition counts.
r2r_count = opponent_history.count('RR')
r2p_count = opponent_history.count('RP')
r2s_count = opponent_history.count('RS')

p2p_count = opponent_history.count('PP')
p2r_count = opponent_history.count('PR')
p2s_count = opponent_history.count('PS')

s2s_count = opponent_history.count('SS')
s2r_count = opponent_history.count('SR')
s2p_count = opponent_history.count('SP')

if prev_play == 'R':
    # The most likely follow-up decides the counter move (R -> P, P -> S, S -> R)
    if r2r_count > r2p_count and r2r_count > r2s_count:
        return 'P'
    elif r2p_count > r2r_count and r2p_count > r2s_count:
        return 'S'
    elif r2s_count > r2r_count and r2s_count > r2p_count:
        return 'R'
    else:
        return random.choice(['R', 'P', 'S'])

elif prev_play == 'P':
    if p2p_count > p2r_count and p2p_count > p2s_count:
        return 'S'
    elif p2r_count > p2p_count and p2r_count > p2s_count:
        return 'P'
    elif p2s_count > p2p_count and p2s_count > p2r_count:
        return 'R'
    else:
        return random.choice(['R', 'P', 'S'])

elif prev_play == 'S':
    if s2s_count > s2r_count and s2s_count > s2p_count:
        return 'R'
    elif s2r_count > s2s_count and s2r_count > s2p_count:
        return 'P'
    elif s2p_count > s2s_count and s2p_count > s2r_count:
        return 'S'
    else:
        return random.choice(['R', 'P', 'S'])

else:
    return random.choice(['R', 'P', 'S'])
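
These three snippets are fragments of a larger player function that receives the opponent's previous move and keeps a running history; the post doesn't show that wrapper, so here is a minimal sketch of how they might be wired together (the function name, signature, and history handling are my assumptions, not the original code):

import random

def player(prev_play, opponent_history=[]):
    # The mutable default list keeps the opponent's moves across calls.
    if prev_play:
        opponent_history.append(prev_play)

    # For Strategy 3, join the history into a string such as "RPSR" first,
    # so that substring counts like history_str.count('RP') work:
    # history_str = "".join(opponent_history)

    # Strategy 1 as the example body: beat whatever the opponent played last.
    if prev_play == 'R':
        return 'P'
    if prev_play == 'P':
        return 'S'
    if prev_play == 'S':
        return 'R'
    return random.choice(['R', 'P', 'S'])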

All of the above should be fairly easy to follow, so feel free to just skim it. On to the next part.

Humans Are Essentially Repeaters

I don't know if you've heard the saying that "humans are essentially repeaters"; today we're going to simulate it.

Survival Game

Since we are simulating human society, the survival game needs both cooperation and competition, so the classic Prisoner's Dilemma will do nicely.

# Define the payoff matrix for the Prisoner's Dilemma
PAYOFF_MATRIX = {
    ('cooperate', 'cooperate'): (3, 3),
    ('cooperate', 'defect'): (-1, 3),
    ('defect', 'cooperate'): (3, -1),
    ('defect', 'defect'): (-3, -3)
}

Survival Strategies

For now there are four strategies. The third one is the so-called repeater, i.e. tit-for-tat: it simply repeats whatever the last player it met did. At the end an AI will probably be added to learn a strategy as well.

# Define different strategies
def always_cooperate(history):
    return 'cooperate'

def always_defect(history):
    return 'defect'

def tit_for_tat(history):
    # Repeat whatever the previous opponent did; cooperate on the first move
    if not history:
        return 'cooperate'
    return history[-1][1]

def random_choice(history):
    return random.choice(['cooperate', 'defect'])

# Player class
class Player:
    def __init__(self, strategy, score=10):
        # score can be passed explicitly so that copies (see Optimization 2) inherit it
        self.strategy = strategy
        self.score = score
        self.history = []

    def make_choice(self):
        return self.strategy(self.history)

    def update_score(self, payoff):
        self.score += payoff

    def update_history(self, own_choice, other_choice):
        self.history.append((own_choice, other_choice))

Simulation

Since this is a survival simulation, players whose score drops to 0 or below are of course eliminated.

# Function to run the simulation
def run_simulation(num_players_per_strategy, num_rounds):
    strategies = [always_cooperate, always_defect, tit_for_tat, random_choice]
    players = []

    # Create players with equal distribution of strategies
    for strategy in strategies:
        for _ in range(num_players_per_strategy):
            players.append(Player(strategy))

    for round_num in range(num_rounds):
        print("round_num: " + str(round_num))

        random.shuffle(players)
        for i in range(0, len(players), 2):
            if i + 1 < len(players):
                play_round(players[i], players[i + 1])

        # Remove players with score <= 0
        players = [player for player in players if player.score > 0]

        always_cooperate_count = sum(1 for player in players if player.strategy == always_cooperate)
        always_defect_count = sum(1 for player in players if player.strategy == always_defect)
        tit_for_tat_count = sum(1 for player in players if player.strategy == tit_for_tat)
        random_choice_count = sum(1 for player in players if player.strategy == random_choice)

        print("always_cooperate: " + str(always_cooperate_count))
        print("always_defect: " + str(always_defect_count))
        print("tit_for_tat: " + str(tit_for_tat_count))
        print("random_choice: " + str(random_choice_count))
        print()
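
run_simulation calls a play_round helper that the post doesn't show. Here is a minimal sketch, assuming it simply looks up PAYOFF_MATRIX and updates both players' scores and histories (my reconstruction, not the author's code):

def play_round(player1, player2):
    # Both players choose based on their own history, then scores and
    # histories are updated from the payoff matrix.
    choice1 = player1.make_choice()
    choice2 = player2.make_choice()
    payoff1, payoff2 = PAYOFF_MATRIX[(choice1, choice2)]
    player1.update_score(payoff1)
    player2.update_score(payoff2)
    player1.update_history(choice1, choice2)
    player2.update_history(choice2, choice1)

Judging by the output below (10 players per strategy, round numbers up to 9999), the run was presumably something like run_simulation(10, 10000).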

Optimization 1

Looking at the result, not a single always-cooperate player was eliminated, which suggests the Prisoner's Dilemma payoffs are badly tuned.

round_num: 9999
always_cooperate: 10
always_defect: 6
tit_for_tat: 10
random_choice: 8

We can adjust the matrix as below. With these values, the expected payoff of either move against a uniformly random opponent is 0: a cooperator gets (3 - 3) / 2 = 0 and a defector gets (3 - 3) / 2 = 0.

PAYOFF_MATRIX = {
    ('cooperate', 'cooperate'): (3, 3),
    ('cooperate', 'defect'): (-3, 3),
    ('defect', 'cooperate'): (3, -3),
    ('defect', 'defect'): (-3, -3)
}

Optimization 2

With that change almost everyone died off; only a single repeater was left.

round_num: 9999
always_cooperate: 0
always_defect: 0
tit_for_tat: 1
random_choice: 0

So each time players are eliminated, newborns have to be added; simply copying an existing player will do (asexual reproduction, haha).

# Refill the population whenever it falls below the starting size
if len(players) < len(strategies) * num_players_per_strategy:
    print("Copy players")
    players_copy = players.copy()
    for player in players_copy:
        # Relies on the Player constructor above accepting an initial score
        players.append(Player(player.strategy, player.score))

Optimization 3

I ran it twice and the results still differ quite a bit, which makes sense, since who you play against is completely random.

Round: 99
Always cooperate: 21.21212121212121%
Always defect: 31.818181818181817%
Tit for tat: 15.151515151515152%
Random choice: 31.818181818181817%
Round: 99
Always cooperate: 28.169014084507044%
Always defect: 22.535211267605636%
Tit for tat: 29.577464788732392%
Random choice: 19.718309859154928%

A better setup is to give each player a random starting position on a ring and have them play the Prisoner's Dilemma only with their neighbours. So we shuffle just once at the beginning, and after that each player only plays against the player next to them.

random.shuffle(players)

for round_num in range(num_rounds):
    print("Round: " + str(round_num))

    for i in range(0, len(players)):
        play_round(players[i], players[(i + 1) % len(players)])

Optimization 4

Reading the raw printout like the one below is a pain, so let's add some visualization (a sketch of possible plotting code follows the output).

Round: 99
Always cooperate: 7
Always defect: 2
Tit for tat: 7
Random choice: 5
<function random_choice at 0x1029860c0>
<function always_defect at 0x102987100>
<function always_cooperate at 0x102986f20>
<function tit_for_tat at 0x102986e80>
<function always_cooperate at 0x102986f20>
<function tit_for_tat at 0x102986e80>
<function always_defect at 0x102987100>
<function random_choice at 0x1029860c0>
<function tit_for_tat at 0x102986e80>
<function always_cooperate at 0x102986f20>
<function always_cooperate at 0x102986f20>
<function always_cooperate at 0x102986f20>
<function always_cooperate at 0x102986f20>
<function random_choice at 0x1029860c0>
<function random_choice at 0x1029860c0>
<function tit_for_tat at 0x102986e80>
<function tit_for_tat at 0x102986e80>
<function tit_for_tat at 0x102986e80>
<function random_choice at 0x1029860c0>
<function tit_for_tat at 0x102986e80>
<function always_cooperate at 0x102986f20>
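
The plotting code itself isn't shown in the post. A minimal sketch, assuming matplotlib and the ring of players from Optimization 3 (the colors and the plot_players helper are my own illustration, not the author's code):

import math
import matplotlib.pyplot as plt

# Map each strategy function to a color (the color choice is arbitrary)
STRATEGY_COLORS = {
    always_cooperate: 'green',
    always_defect: 'red',
    tit_for_tat: 'blue',
    random_choice: 'gray',
}

def plot_players(players, round_num):
    # Place the surviving players evenly on a circle and color them by strategy
    angles = [2 * math.pi * i / len(players) for i in range(len(players))]
    xs = [math.cos(a) for a in angles]
    ys = [math.sin(a) for a in angles]
    colors = [STRATEGY_COLORS[p.strategy] for p in players]
    plt.scatter(xs, ys, c=colors)
    plt.title("Round " + str(round_num))
    plt.axis('equal')
    plt.show()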

The resulting plot looks more or less fine, but to be safe, let's remove the random players first and observe again.

After that change everything worked. Judging from the results, the repeaters and the always-cooperate players survive most easily, since always-defect players just grind each other down. Next, let's try adding an AI.

Reinforcement Learning

After thinking about it more carefully, it seems there is no good way for an AI to learn a strategy here; whatever it learned would probably end up about as good as the random player, because there are only two states, and the most it could learn is an unequal probability of choosing each move.