Preface

To be honest, there isn't all that much to say here; it's all standard undergraduate probability and statistics, so I'll just hit the key points. The full code is at https://github.com/zong4/AILearning.

Markov Chains

A Markov chain, put simply, is just a set of state transitions, as shown in the figure below.

From such a transition-probability diagram we can derive a transition matrix, something like the figure below.

With this matrix we can do quite a lot of things, as the figure below suggests, but let's make it concrete with a hands-on example.
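As a minimal sketch of what "doing things with the matrix" means (the two-state weather chain and its probabilities below are made up purely for illustration), we can evolve a state distribution by repeatedly multiplying it with the transition matrix:

import numpy as np

# Hypothetical 2-state weather chain: rows are the current state,
# columns are the next state (Sunny, Rainy); each row sums to 1.
P = np.array([
    [0.9, 0.1],  # Sunny -> Sunny / Rainy
    [0.5, 0.5],  # Rainy -> Sunny / Rainy
])

# Start fully in the Sunny state and step the distribution forward.
state = np.array([1.0, 0.0])
for _ in range(50):
    state = state @ P

print(state)  # approaches the chain's stationary distribution

Repeated multiplication converges to the chain's stationary distribution, which is exactly the idea PageRank builds on.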

Page Rank

PageRank is a way to estimate how valuable a web page is (a page that more sites link to is generally more valuable).

We'll compute it in two ways here: with the random surfer model and with an iterative algorithm.

Random Surfer Model

This model is simply a simulation of how people actually browse the web.

Suppose we start from a random page. Once we've finished reading it, with a certain probability (DAMPING) we follow one of the links on that page, and with the remaining probability (1 - DAMPING) we pick a fresh page at random from the whole corpus.

That gives us the following code.

import random

DAMPING = 0.85
SAMPLES = 10000


def transition_model(corpus, page, damping_factor):
    """
    Return a probability distribution over which page to visit next,
    given a current page.

    With probability `damping_factor`, choose a link at random
    linked to by `page`. With probability `1 - damping_factor`, choose
    a link at random chosen from all pages in the corpus.
    """

    # If the page has no outgoing links, choose uniformly among all pages
    if len(corpus[page]) == 0:
        return {p: 1 / len(corpus) for p in corpus}

    # Every page gets the "random jump" share of probability
    probability_distribution = {p: (1 - damping_factor) / len(corpus) for p in corpus}

    # Pages linked from the current page additionally share the damping probability
    for linked_page in corpus[page]:
        probability_distribution[linked_page] += damping_factor / len(corpus[page])

    return probability_distribution


def sample_pagerank(corpus, damping_factor, n):
    """
    Return PageRank values for each page by sampling `n` pages
    according to transition model, starting with a page at random.

    Return a dictionary where keys are page names, and values are
    their estimated PageRank value (a value between 0 and 1). All
    PageRank values should sum to 1.
    """

    # Count how many times each page is visited
    page_visits = {page: 0 for page in corpus}

    # Choose a random page to start
    current_page = random.choice(list(corpus.keys()))

    # Visit n pages
    for _ in range(n):
        page_visits[current_page] += 1

        # Get the probability distribution for the current page
        probability_distribution = transition_model(corpus, current_page, damping_factor)

        # Choose the next page based on the probability distribution
        current_page = random.choices(
            list(probability_distribution.keys()),
            weights=list(probability_distribution.values()),
            k=1,
        )[0]

    # Normalize visit counts into PageRank estimates
    page_rank = {page: page_visits[page] / n for page in corpus}

    return page_rank

Here transition_model is our surfing logic, while sample_pagerank estimates each page's PageRank by recording which pages get visited over n surfing steps.
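A minimal usage sketch (the corpus dictionary below is made up; in the actual project it would be built by crawling the HTML files):

# Hypothetical corpus: each page maps to the set of pages it links to.
corpus = {
    "1.html": {"2.html"},
    "2.html": {"1.html", "3.html"},
    "3.html": {"2.html", "4.html"},
    "4.html": {"2.html"},
}

ranks = sample_pagerank(corpus, DAMPING, SAMPLES)
for page, rank in sorted(ranks.items()):
    print(f"{page}: {rank:.4f}")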

Iterative Algorithm

The iterative algorithm is the brute-force approach. Since we don't know the weights at the start, every page simply gets the same initial rank of 1/N. But we do know that each page's weight should equal the weight contributed by the pages that can transition to it plus the random-jump weight, so we keep recomputing the weights under that rule until they stop changing (here, until no value moves by more than 0.001); all PageRank values sum to 1. The update rule and the code follow.
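Written out, the update rule the loop applies (this is just the standard PageRank formula) is:

$$PR(p) = \frac{1 - d}{N} + d \sum_{i \in M(p)} \frac{PR(i)}{L(i)}$$

where d is the damping factor, N is the number of pages, M(p) is the set of pages that link to p, and L(i) is the number of links on page i.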

def iterate_pagerank(corpus, damping_factor):
    """
    Return PageRank values for each page by iteratively updating
    PageRank values until convergence.

    Return a dictionary where keys are page names, and values are
    their estimated PageRank value (a value between 0 and 1). All
    PageRank values should sum to 1.
    """

    # Start with a uniform distribution over all pages
    page_rank = {page: 1 / len(corpus) for page in corpus}

    # Create a dictionary to store the new PageRank values
    new_page_rank = {page: 0 for page in corpus}

    # Create a boolean variable to check if the PageRank values have converged
    converged = False

    # Iterate until the PageRank values converge
    while not converged:

        # Calculate the new PageRank values
        for page in corpus:
            new_page_rank[page] = (1 - damping_factor) / len(corpus)
            for linking_page in corpus:
                if page in corpus[linking_page]:
                    new_page_rank[page] += damping_factor * page_rank[linking_page] / len(corpus[linking_page])

        # Converged when no page's rank changes by more than 0.001
        if all(abs(new_page_rank[page] - page_rank[page]) < 0.001 for page in corpus):
            converged = True

        # Update the PageRank values
        page_rank = new_page_rank.copy()

    return page_rank
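For completeness, a usage sketch on the same made-up corpus as above (assuming the corpus dictionary defined earlier):

ranks = iterate_pagerank(corpus, DAMPING)
for page, rank in sorted(ranks.items()):
    print(f"{page}: {rank:.4f}")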

Example

Finally, let's look at an example. The state-transition diagram is shown below.

Here are the results. The difference is striking: the PageRank values from the random surfer model look far more spread out (a much higher variance) than those from the iterative algorithm.

PageRank Results from Sampling (damping factor = 0.85, samples = 10000)
1.html: 0.0393
2.html: 0.8808
3.html: 0.0396
4.html: 0.0403
PageRank Results from Iteration
1.html: 0.2198
2.html: 0.4294
3.html: 0.2198
4.html: 0.1311

So what causes this? Take a moment and think about it.

The reason is that in its first few passes the iterative algorithm has no meaningful weights from linking pages yet, so the random-jump probability gets applied across many iterations, which pulls its results closer to the average.

Postscript

The full code at https://github.com/zong4/AILearning includes a few more examples; feel free to run them, and also to tweak these two hyperparameters (strictly speaking, the number of iteration steps doesn't really count as a hyperparameter) to see how the results change.