Preface

Today we'll build a book recommendation engine; the complete code is at https://github.com/zong4/AILearning.

Data Processing

The dataset consists of three tables; let's take a quick look at them.

First, load the data. I'll skip the plots: the dataset is fairly large and froze my computer when I tried.

import pandas as pd

books_filename = './book_recommendation/book-crossings/BX-Books.csv'
ratings_filename = './book_recommendation/book-crossings/BX-Book-Ratings.csv'
users_filename = './book_recommendation/book-crossings/BX-Users.csv'

# import csv data into dataframes
# (year and publisher are read as well, since the later feature-based
# optimizations need them; year stays a string until it is cleaned)
df_books = pd.read_csv(
    books_filename,
    encoding="ISO-8859-1",
    sep=";",
    header=0,
    names=['isbn', 'title', 'author', 'year', 'publisher'],
    usecols=['isbn', 'title', 'author', 'year', 'publisher'],
    dtype={'isbn': 'str', 'title': 'str', 'author': 'str', 'year': 'str', 'publisher': 'str'})

df_ratings = pd.read_csv(
    ratings_filename,
    encoding="ISO-8859-1",
    sep=";",
    header=0,
    names=['user', 'isbn', 'rating'],
    usecols=['user', 'isbn', 'rating'],
    dtype={'user': 'int32', 'isbn': 'str', 'rating': 'float32'})

df_users = pd.read_csv(
    users_filename,
    encoding="ISO-8859-1",
    sep=";",
    header=0,
    names=['user', 'location', 'age'],
    usecols=['user', 'location', 'age'],
    dtype={'user': 'int32', 'location': 'str', 'age': 'float32'})
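
To actually take that quick look at the three tables, here is a minimal inspection sketch (plain pandas, nothing specific to this project):

# peek at the shape and first rows of each table
for name, frame in [('books', df_books), ('ratings', df_ratings), ('users', df_users)]:
    print(name, frame.shape)
    print(frame.head(3))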

Model Fitting

Build a pivot table, then use a KNN model to recommend books.

from sklearn.neighbors import NearestNeighbors

# function to return recommended books - this will be tested
def get_recommends(book = ""):
    # create a new dataframe with the books and their ratings
    df = df_books.set_index('isbn').join(df_ratings.set_index('isbn'))
    # print(df.iloc[:4])

    # create a pivot table: one row per title, one column per user
    df_pivot = df.pivot_table(index='title', columns='user', values='rating').fillna(0)
    print(df_pivot.iloc[:4])

    # create a nearest neighbors model
    model_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20)
    model_knn.fit(df_pivot)

    # get the row of the queried book and find its nearest neighbors
    query_index = df_pivot.index.get_loc(book)
    distances, indices = model_knn.kneighbors(df_pivot.iloc[query_index, :].values.reshape(1, -1), n_neighbors=6)

    return [book, list(df_pivot.index[indices.flatten()]), list(distances.flatten())]

The resulting pivot table looks roughly like the one below, and with it we can compute similarity between books from the users' ratings.
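
Since the real printout is far too wide to show, here is a toy illustration of the structure (made-up titles, users, and ratings):

import pandas as pd

# toy ratings: rows become titles, columns become users, cells are ratings
toy = pd.DataFrame({
    'title': ['Book A', 'Book A', 'Book B', 'Book C'],
    'user':  [1, 2, 1, 2],
    'rating': [8.0, 0.0, 5.0, 9.0],
})
print(toy.pivot_table(index='title', columns='user', values='rating').fillna(0))
# user      1    2
# title
# Book A  8.0  0.0
# Book B  5.0  0.0
# Book C  0.0  9.0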

Optimization 1: Feature Extraction

Running the model above, the process got killed outright:

zsh: killed     python book_recommendation/test.py

The full title-by-user pivot table is simply too big to fit in memory. But thinking about it, the individual users' ratings aren't really needed here: the input is a book title, not user information, so there is no need to compute user similarity at all. The book's own metadata plus its aggregate ratings should be enough.

So let's extract just those features and train the KNN model on them.

from sklearn.preprocessing import LabelEncoder

# average rating and number of ratings for each book
average_rating = df_ratings.groupby('isbn')['rating'].mean()
rating_count = df_ratings.groupby('isbn')['rating'].count()
df_ratings_avg = pd.DataFrame({'isbn': average_rating.index, 'avg_rating': average_rating.values, 'rating_count': rating_count.values})

# create a new dataframe with the books and their ratings
df = df_books.set_index('isbn').join(df_ratings_avg.set_index('isbn'))
# print(df.iloc[:4])

# label encoding (not one-hot): map each author/publisher to an integer id
author_encoder = LabelEncoder()
df['author_encoded'] = author_encoder.fit_transform(df['author'])

publisher_encoder = LabelEncoder()
df['publisher_encoded'] = publisher_encoder.fit_transform(df['publisher'])

features = df[['author_encoded', 'year', 'publisher_encoded', 'avg_rating', 'rating_count']]

# create a nearest neighbors model
model_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20)
model_knn.fit(features)

It promptly threw an error:

ValueError: could not convert string to float: 'John Peterman'

Optimization 2: Data Cleaning

A quick look showed that some book titles contain semicolons, so the field splitting went wrong and values shifted into the wrong columns (hence an author name showing up where a year should be).

Adding the following line didn't help; even the macOS built-in CSV viewer parses those rows wrong, so fixing them by hand seemed to be the only option.

quoting=csv.QUOTE_ALL  # passed to pd.read_csv (requires import csv)
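
If hand-fixing is off the table, one alternative worth trying (assuming pandas >= 1.3) is to let the parser skip the malformed rows outright; we end up dropping those rows anyway:

# re-read the books table, silently skipping rows the parser cannot split
# into the expected number of fields (pandas >= 1.3)
df_books = pd.read_csv(
    books_filename,
    encoding="ISO-8859-1",
    sep=";",
    header=0,
    names=['isbn', 'title', 'author', 'year', 'publisher'],
    usecols=['isbn', 'title', 'author', 'year', 'publisher'],
    dtype='str',
    on_bad_lines='skip')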

In the end there were just too many bad rows, so I decided to simply drop them:

# create a new dataframe with the books and their ratings
df = df_books.set_index('isbn').join(df_ratings_avg.set_index('isbn'))
df = df.dropna()
# keep only rows whose year is actually numeric, then convert it
df = df[df['year'].str.isnumeric()]
df['year'] = df['year'].astype(int)

Optimization 3: Data Encoding

Encode the string columns as integers.

# label-encode the string columns
author_encoder = LabelEncoder()
publisher_encoder = LabelEncoder()
df['author_encoded'] = author_encoder.fit_transform(df['author'])
df['publisher_encoded'] = publisher_encoder.fit_transform(df['publisher'])

Optimization 4: Data Standardization

With "Where the Heart Is (Oprah's Book Club (Paperback))" as input, the results are decent: the matches are all books by the same author.

{
    "Where the Heart Is (Oprah's Book Club (Paperback))": {
        "The Honk and Holler Opening Soon": 0.0004220216524539744,
        "Where the Heart Is: A Novel": 0.0004329866196861598,
        "Where the Heart Is": 0.00045208676755403854,
        "Shoot the Moon": 0.00047166440227597306
    }
}

That said, the rating itself probably shouldn't take part in the similarity computation; it seems better to find similar books first and then sort the matches by rating.
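
Presumably something like the following sketch (my reconstruction, not the exact code from the post; book is assumed to hold the query title): drop avg_rating from the feature matrix and use it only to order the neighbors afterwards.

# features without the average rating; it is only used for ordering below
features = df[['author_encoded', 'year', 'publisher_encoded', 'rating_count']]
model_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20)
model_knn.fit(features)

query_index = df.index.get_loc(df[df['title'] == book].index[0])
distances, indices = model_knn.kneighbors(
    features.iloc[query_index].values.reshape(1, -1), n_neighbors=6)

# sort the recommended books by their average rating, best first
neighbors = df.iloc[indices.flatten()[1:]]
neighbors = neighbors.sort_values('avg_rating', ascending=False)
print(neighbors[['title', 'avg_rating']])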

The results came out exactly the same, so the rating apparently contributes nothing. Still, these distances look suspiciously tiny.

{
    "Where the Heart Is (Oprah's Book Club (Paperback))": {
        "The Honk and Holler Opening Soon": 1.4160210781710703e-09,
        "Where the Heart Is": 5.6642790458028e-09,
        "Where the Heart Is: A Novel": 1.2744773680850585e-08,
        "Shoot the Moon": 5.097383926067067e-08
    }
}

Let's try standardization.

from sklearn.preprocessing import StandardScaler

# standardize each feature to zero mean and unit variance
scaler = StandardScaler()
features = df[['author_encoded', 'year', 'publisher_encoded', 'avg_rating', 'rating_count']]
features = scaler.fit_transform(features)

The results are still much the same.

{
    "Where the Heart Is (Oprah's Book Club (Paperback))": {
        "Christmas Words: See-And-Say Fun for the Very Young": 5.922528489854528e-08,
        "The Scold's Bridle": 7.66054157885776e-08,
        "The Void Captain's Tale": 3.051218466776362e-07,
        "An Album of Voyager": 8.988488962025087e-07,
        "This Old House : The Best of Ask Norm": 1.0828802619045064e-06
    }
}

Optimization 5: Feature Concatenation

It feels like all that optimizing achieved nothing. Thinking it over again: the user-book rating matrix lets the model learn which books, in the eyes of users, tend to be read together, so it should stay in the mix alongside the metadata, roughly like the sketch below.
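
A toy illustration of the idea (made-up data; the real code follows further down): put the user-rating columns and the metadata columns side by side, one row per book.

import pandas as pd

# toy user-book rating matrix: one row per isbn, one column per user
ratings = pd.DataFrame(
    {'user_1': [8.0, 0.0], 'user_2': [0.0, 9.0]},
    index=pd.Index(['isbn_a', 'isbn_b'], name='isbn'))

# toy per-book metadata features
meta = pd.DataFrame(
    {'year': [1999, 2003], 'rating_count': [12, 40]},
    index=pd.Index(['isbn_a', 'isbn_b'], name='isbn'))

# concatenate along columns so each book keeps both kinds of features
combined = pd.concat([ratings, meta], axis=1)
print(combined)
#         user_1  user_2  year  rating_count
# isbn
# isbn_a     8.0     0.0  1999            12
# isbn_b     0.0     9.0  2003            40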

The data really is large; it wouldn't fit on my machine, so I was forced to filter a lot of it out.

INFO:root:count    340556.000000
mean          3.376185
std          12.436252
min           1.000000
25%           1.000000
50%           1.000000
75%           2.000000
max        2502.000000
Name: rating, dtype: float64
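
The median book has exactly one rating, which is why the final code below keeps only books with at least 100 ratings.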

The other main change was trying one-hot encoding later on.

# one-hot encode author and publisher (one new 0/1 column per distinct value)
df_books = pd.get_dummies(df_books, columns=['author'])
df_books = pd.get_dummies(df_books, columns=['publisher'])
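
Note that in the final version below these two lines end up commented out again: get_dummies adds one column per distinct author and publisher, which blows the feature matrix up far beyond what 16 GB of RAM can hold.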

Quite a lot else changed too, so here is the full version for you to read through.

import logging

logging.basicConfig(level=logging.INFO)

# distribution of the number of ratings per user and per book
ratings_per_user = df_ratings.groupby('user')['rating'].count()
ratings_per_book = df_ratings.groupby('isbn')['rating'].count()
logging.info("The distribution of ratings per book: ")
logging.info(ratings_per_book.describe())

# average rating and number of ratings for each book
average_rating = df_ratings.groupby('isbn')['rating'].mean()
rating_count = df_ratings.groupby('isbn')['rating'].count()
df_ratings_avg = pd.DataFrame({'isbn': average_rating.index, 'avg_rating': average_rating.values, 'rating_count': rating_count.values})

# process df_books: keep numeric years only, then convert
df_books = df_books.dropna()
df_books = df_books[df_books['year'].str.isnumeric()]
df_books['year'] = df_books['year'].astype(int)

# One-hot encoding (disabled: the feature matrix gets far too wide)
# df_books = pd.get_dummies(df_books, columns=['author'])
# df_books = pd.get_dummies(df_books, columns=['publisher'])

# extract extra features
extra_features = df_books.drop(columns=['title', 'publisher', 'author'])
logging.info("Extra features: ")
logging.info(extra_features)

# join the dataframes, keeping only books with at least 100 ratings
df = df_books.set_index('isbn').join(df_ratings.set_index('isbn'))
df = df.groupby('isbn').filter(lambda x: len(x) >= 100)
user_book_matrix = df.pivot_table(index='isbn', columns='user', values='rating').fillna(0)
logging.info("User book matrix: ")
logging.info(user_book_matrix)

# concat by isbn
extra_features.set_index('isbn', inplace=True)
combined_data = pd.concat([user_book_matrix, extra_features], axis=1)
combined_data = combined_data.dropna()
combined_data = combined_data[combined_data['year'] != 0]
logging.info("Combined data: ")
logging.info(combined_data)

# train the knn model
model_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20)
model_knn.fit(combined_data.values)

# function to return recommended books - this will be tested
def get_recommends(book = ""):
    # get the isbn of the book
    book_isbn = df_books[df_books['title'] == book]['isbn'].values[0]

    # get the index of the book in the combined data
    logging.info("Book: " + book)
    logging.info("ISBN: " + book_isbn)
    if book_isbn not in combined_data.index:
        logging.info("Book not found.")
        return []
    book_index = combined_data.index.get_loc(book_isbn)

    distances, indices = model_knn.kneighbors(combined_data.iloc[book_index].values.reshape(1, -1), n_neighbors=5)

    # the format of the recommended books:
    # {
    #     "similar_book_1_isbn similar_book_1_title": distance_1,
    #     ...
    #     "similar_book_4_isbn similar_book_4_title": distance_4
    # }
    recommended_books = {}
    for i in range(1, len(indices[0])):
        recommended_book_isbn = combined_data.iloc[indices[0][i]].name
        recommended_book_title = df_books.loc[df_books['isbn'] == recommended_book_isbn, 'title'].values[0]
        recommended_books[recommended_book_isbn + " " + recommended_book_title] = distances[0][i]

    return recommended_books
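
To use it, call the function with a title that survived the filtering, e.g. (illustrative query, not necessarily the one that produced the output below):

print(get_recommends("Where the Heart Is (Oprah's Book Club (Paperback))"))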

Results

Not bad, and much better than before. Let's leave it at that.

{
    "042513699X Turtle Moon": 0.0026335611586064678,
    "0373825013 Whirlwind (Tyler, Book 1)": 0.00265084184906883,
    "0446365505 Pleading Guilty": 0.0026517030537067665,
    "0425150143 Tom Clancy's Op-Center: Mirror Image (Tom Clancy's Op Center (Paperback))": 0.0026543396062681524
}

Afterword

How to put it: I learned a great deal, but the data wrangling really was a hassle, and the volume is just too big. My machine only has 16 GB of RAM, after all.

One further idea: add user-similarity on top, so books can be recommended directly to a new user. Feel free to try that yourselves; a rough sketch follows.
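
A minimal sketch of that idea, assuming the user_book_matrix from above (books as rows, users as columns); all names here are illustrative, not from the original code. Transpose the matrix so each row is a user, find the most similar existing users, and recommend the books they rated highest.

from sklearn.neighbors import NearestNeighbors

# one row per user, one column per book
user_features = user_book_matrix.T

# fit a user-level KNN with the same cosine metric
user_knn = NearestNeighbors(metric='cosine', algorithm='brute')
user_knn.fit(user_features.values)

def recommend_for_user(user_id, n_books=5):
    # find the 5 most similar users (the first hit is the user itself)
    row = user_features.loc[user_id].values.reshape(1, -1)
    _, indices = user_knn.kneighbors(row, n_neighbors=6)
    similar_users = user_features.index[indices.flatten()[1:]]

    # average the similar users' ratings and take the top books
    mean_ratings = user_features.loc[similar_users].mean(axis=0)
    return mean_ratings.sort_values(ascending=False).head(n_books)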