Preface

To be honest, there isn't all that much to say here; it's all standard undergraduate probability and statistics, so I'll just hit the key points. The full code is at https://github.com/zong4/AILearning.

Markov Chains

A Markov chain, put simply, is just a set of state transitions, as shown in the figure below.

From such a transition-probability diagram we can derive a transition matrix, something like the figure below.

With this matrix we can do quite a lot of things, as the figure below suggests, but let's make it concrete with a hands-on example.
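As a minimal sketch of what "doing things with the matrix" means (the two-state weather chain and its probabilities below are made up purely for illustration), we can evolve a state distribution by repeatedly multiplying it with the transition matrix:

import numpy as np

# Hypothetical 2-state weather chain: rows are the current state,
# columns are the next state (Sunny, Rainy); each row sums to 1.
P = np.array([
    [0.9, 0.1],  # Sunny -> Sunny / Rainy
    [0.5, 0.5],  # Rainy -> Sunny / Rainy
])

# Start fully in the Sunny state and step the distribution forward.
state = np.array([1.0, 0.0])
for _ in range(50):
    state = state @ P

print(state)  # approaches the chain's stationary distribution

Repeated multiplication converges to the chain's stationary distribution, which is exactly the idea PageRank builds on.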

Page Rank

PageRank is a way to estimate how valuable a web page is (a page that more sites link to is generally more valuable).

We'll compute it in two ways here: with the random surfer model and with an iterative algorithm.

Random Surfer Model

This model is simply a simulation of how people actually browse the web.

Suppose we start from a random page. Once we've finished reading it, with a certain probability (DAMPING) we follow one of the links on that page, and with the remaining probability (1 - DAMPING) we pick a fresh page at random from the whole corpus.

That gives us the following code.

import random

DAMPING = 0.85
SAMPLES = 10000


def transition_model(corpus, page, damping_factor):
    """
    Return a probability distribution over which page to visit next,
    given a current page.

    With probability `damping_factor`, choose a link at random
    linked to by `page`. With probability `1 - damping_factor`, choose
    a link at random chosen from all pages in the corpus.
    """

    # If the page has no outgoing links, choose uniformly among all pages
    if len(corpus[page]) == 0:
        return {p: 1 / len(corpus) for p in corpus}

    # Every page gets the "random jump" share of probability
    probability_distribution = {p: (1 - damping_factor) / len(corpus) for p in corpus}

    # Pages linked from the current page additionally share the damping probability
    for linked_page in corpus[page]:
        probability_distribution[linked_page] += damping_factor / len(corpus[page])

    return probability_distribution


def sample_pagerank(corpus, damping_factor, n):
    """
    Return PageRank values for each page by sampling `n` pages
    according to transition model, starting with a page at random.

    Return a dictionary where keys are page names, and values are
    their estimated PageRank value (a value between 0 and 1). All
    PageRank values should sum to 1.
    """

    # Count how many times each page is visited
    page_visits = {page: 0 for page in corpus}

    # Choose a random page to start
    current_page = random.choice(list(corpus.keys()))

    # Visit n pages
    for _ in range(n):
        page_visits[current_page] += 1

        # Get the probability distribution for the current page
        probability_distribution = transition_model(corpus, current_page, damping_factor)

        # Choose the next page based on the probability distribution
        current_page = random.choices(
            list(probability_distribution.keys()),
            weights=list(probability_distribution.values()),
            k=1,
        )[0]

    # Normalize visit counts into PageRank estimates
    page_rank = {page: page_visits[page] / n for page in corpus}

    return page_rank

Here transition_model is our surfing logic, while sample_pagerank estimates each page's PageRank by recording which pages get visited over n surfing steps.
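A minimal usage sketch (the corpus dictionary below is made up; in the actual project it would be built by crawling the HTML files):

# Hypothetical corpus: each page maps to the set of pages it links to.
corpus = {
    "1.html": {"2.html"},
    "2.html": {"1.html", "3.html"},
    "3.html": {"2.html", "4.html"},
    "4.html": {"2.html"},
}

ranks = sample_pagerank(corpus, DAMPING, SAMPLES)
for page, rank in sorted(ranks.items()):
    print(f"{page}: {rank:.4f}")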

Iterative Algorithm

The iterative algorithm is the brute-force approach. Since we don't know the weights at the start, every page simply gets the same initial rank of 1/N. But we do know that each page's weight should equal the weight contributed by the pages that can transition to it plus the random-jump weight, so we keep recomputing the weights under that rule until they stop changing (here, until no value moves by more than 0.001); all PageRank values sum to 1. The update rule and the code follow.
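Written out, the update rule the loop applies (this is just the standard PageRank formula) is:

$$PR(p) = \frac{1 - d}{N} + d \sum_{i \in M(p)} \frac{PR(i)}{L(i)}$$

where d is the damping factor, N is the number of pages, M(p) is the set of pages that link to p, and L(i) is the number of links on page i.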

def iterate_pagerank(corpus, damping_factor):
    """
    Return PageRank values for each page by iteratively updating
    PageRank values until convergence.

    Return a dictionary where keys are page names, and values are
    their estimated PageRank value (a value between 0 and 1). All
    PageRank values should sum to 1.
    """

    # Start with a uniform distribution over all pages
    page_rank = {page: 1 / len(corpus) for page in corpus}

    # Create a dictionary to store the new PageRank values
    new_page_rank = {page: 0 for page in corpus}

    # Create a boolean variable to check if the PageRank values have converged
    converged = False

    # Iterate until the PageRank values converge
    while not converged:

        # Calculate the new PageRank values
        for page in corpus:
            new_page_rank[page] = (1 - damping_factor) / len(corpus)
            for linking_page in corpus:
                if page in corpus[linking_page]:
                    new_page_rank[page] += damping_factor * page_rank[linking_page] / len(corpus[linking_page])

        # Converged when no page's rank changes by more than 0.001
        if all(abs(new_page_rank[page] - page_rank[page]) < 0.001 for page in corpus):
            converged = True

        # Update the PageRank values
        page_rank = new_page_rank.copy()

    return page_rank
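For completeness, a usage sketch on the same made-up corpus as above (assuming the corpus dictionary defined earlier):

ranks = iterate_pagerank(corpus, DAMPING)
for page, rank in sorted(ranks.items()):
    print(f"{page}: {rank:.4f}")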

Example

Finally, let's look at an example. The state-transition diagram is shown below.

Here are the results. The difference is striking: the PageRank values from the random surfer model look far more spread out (a much higher variance) than those from the iterative algorithm.

PageRank Results from Sampling (damping factor = 0.85, samples = 10000)
1.html: 0.0393
2.html: 0.8808
3.html: 0.0396
4.html: 0.0403
PageRank Results from Iteration
1.html: 0.2198
2.html: 0.4294
3.html: 0.2198
4.html: 0.1311

So what causes this? Take a moment and think about it.

The reason is that in its first few passes the iterative algorithm has no meaningful weights from linking pages yet, so the random-jump probability gets applied across many iterations, which pulls its results closer to the average.

Postscript

The full code at https://github.com/zong4/AILearning includes a few more examples; feel free to run them, and also to tweak these two hyperparameters (strictly speaking, the number of iteration steps doesn't really count as a hyperparameter) to see how the results change.