Preface

Today we'll build a book recommendation engine; the complete code is at https://github.com/zong4/AILearning.

Data Processing

The dataset consists of three tables; let's take a quick look at them.

First, load the data. I'll skip the plots: the dataset is fairly large and froze my computer when I tried.

import pandas as pd

books_filename = './book_recommendation/book-crossings/BX-Books.csv'
ratings_filename = './book_recommendation/book-crossings/BX-Book-Ratings.csv'
users_filename = './book_recommendation/book-crossings/BX-Users.csv'

# import csv data into dataframes
# (year and publisher are read as well, since the later feature-based
# optimizations need them; year stays a string until it is cleaned)
df_books = pd.read_csv(
    books_filename,
    encoding="ISO-8859-1",
    sep=";",
    header=0,
    names=['isbn', 'title', 'author', 'year', 'publisher'],
    usecols=['isbn', 'title', 'author', 'year', 'publisher'],
    dtype={'isbn': 'str', 'title': 'str', 'author': 'str', 'year': 'str', 'publisher': 'str'})

df_ratings = pd.read_csv(
    ratings_filename,
    encoding="ISO-8859-1",
    sep=";",
    header=0,
    names=['user', 'isbn', 'rating'],
    usecols=['user', 'isbn', 'rating'],
    dtype={'user': 'int32', 'isbn': 'str', 'rating': 'float32'})

df_users = pd.read_csv(
    users_filename,
    encoding="ISO-8859-1",
    sep=";",
    header=0,
    names=['user', 'location', 'age'],
    usecols=['user', 'location', 'age'],
    dtype={'user': 'int32', 'location': 'str', 'age': 'float32'})
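
To actually take that quick look at the three tables, here is a minimal inspection sketch (plain pandas, nothing specific to this project):

# peek at the shape and first rows of each table
for name, frame in [('books', df_books), ('ratings', df_ratings), ('users', df_users)]:
    print(name, frame.shape)
    print(frame.head(3))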

Model Fitting

Build a pivot table, then use a KNN model to recommend books.

from sklearn.neighbors import NearestNeighbors

# function to return recommended books - this will be tested
def get_recommends(book = ""):
    # create a new dataframe with the books and their ratings
    df = df_books.set_index('isbn').join(df_ratings.set_index('isbn'))
    # print(df.iloc[:4])

    # create a pivot table: one row per title, one column per user
    df_pivot = df.pivot_table(index='title', columns='user', values='rating').fillna(0)
    print(df_pivot.iloc[:4])

    # create a nearest neighbors model
    model_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20)
    model_knn.fit(df_pivot)

    # get the row of the queried book and find its nearest neighbors
    query_index = df_pivot.index.get_loc(book)
    distances, indices = model_knn.kneighbors(df_pivot.iloc[query_index, :].values.reshape(1, -1), n_neighbors=6)

    return [book, list(df_pivot.index[indices.flatten()]), list(distances.flatten())]

The resulting pivot table looks roughly like the one below, and with it we can compute similarity between books from the users' ratings.
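
Since the real printout is far too wide to show, here is a toy illustration of the structure (made-up titles, users, and ratings):

import pandas as pd

# toy ratings: rows become titles, columns become users, cells are ratings
toy = pd.DataFrame({
    'title': ['Book A', 'Book A', 'Book B', 'Book C'],
    'user':  [1, 2, 1, 2],
    'rating': [8.0, 0.0, 5.0, 9.0],
})
print(toy.pivot_table(index='title', columns='user', values='rating').fillna(0))
# user      1    2
# title
# Book A  8.0  0.0
# Book B  5.0  0.0
# Book C  0.0  9.0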

Optimization 1: Feature Extraction

Running the model above, the process got killed outright:

zsh: killed     python book_recommendation/test.py

The full title-by-user pivot table is simply too big to fit in memory. But thinking about it, the individual users' ratings aren't really needed here: the input is a book title, not user information, so there is no need to compute user similarity at all. The book's own metadata plus its aggregate ratings should be enough.

So let's extract just those features and train the KNN model on them.

from sklearn.preprocessing import LabelEncoder

# average rating and number of ratings for each book
average_rating = df_ratings.groupby('isbn')['rating'].mean()
rating_count = df_ratings.groupby('isbn')['rating'].count()
df_ratings_avg = pd.DataFrame({'isbn': average_rating.index, 'avg_rating': average_rating.values, 'rating_count': rating_count.values})

# create a new dataframe with the books and their ratings
df = df_books.set_index('isbn').join(df_ratings_avg.set_index('isbn'))
# print(df.iloc[:4])

# label encoding (not one-hot): map each author/publisher to an integer id
author_encoder = LabelEncoder()
df['author_encoded'] = author_encoder.fit_transform(df['author'])

publisher_encoder = LabelEncoder()
df['publisher_encoded'] = publisher_encoder.fit_transform(df['publisher'])

features = df[['author_encoded', 'year', 'publisher_encoded', 'avg_rating', 'rating_count']]

# create a nearest neighbors model
model_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20)
model_knn.fit(features)

It promptly threw an error:

ValueError: could not convert string to float: 'John Peterman'

Optimization 2: Data Cleaning

A quick look showed that some book titles contain semicolons, so the field splitting went wrong and values shifted into the wrong columns (hence an author name showing up where a year should be).

Adding the following line didn't help; even the macOS built-in CSV viewer parses those rows wrong, so fixing them by hand seemed to be the only option.

quoting=csv.QUOTE_ALL  # passed to pd.read_csv (requires import csv)
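
If hand-fixing is off the table, one alternative worth trying (assuming pandas >= 1.3) is to let the parser skip the malformed rows outright; we end up dropping those rows anyway:

# re-read the books table, silently skipping rows the parser cannot split
# into the expected number of fields (pandas >= 1.3)
df_books = pd.read_csv(
    books_filename,
    encoding="ISO-8859-1",
    sep=";",
    header=0,
    names=['isbn', 'title', 'author', 'year', 'publisher'],
    usecols=['isbn', 'title', 'author', 'year', 'publisher'],
    dtype='str',
    on_bad_lines='skip')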

In the end there were just too many bad rows, so I decided to simply drop them:

# create a new dataframe with the books and their ratings
df = df_books.set_index('isbn').join(df_ratings_avg.set_index('isbn'))
df = df.dropna()
# keep only rows whose year is actually numeric, then convert it
df = df[df['year'].str.isnumeric()]
df['year'] = df['year'].astype(int)

Optimization 3: Data Encoding

Encode the string columns as integers.

# label-encode the string columns
author_encoder = LabelEncoder()
publisher_encoder = LabelEncoder()
df['author_encoded'] = author_encoder.fit_transform(df['author'])
df['publisher_encoded'] = publisher_encoder.fit_transform(df['publisher'])

Optimization 4: Data Standardization

With "Where the Heart Is (Oprah's Book Club (Paperback))" as input, the results are decent: the matches are all books by the same author.

{
    "Where the Heart Is (Oprah's Book Club (Paperback))": {
        "The Honk and Holler Opening Soon": 0.0004220216524539744,
        "Where the Heart Is: A Novel": 0.0004329866196861598,
        "Where the Heart Is": 0.00045208676755403854,
        "Shoot the Moon": 0.00047166440227597306
    }
}

That said, the rating itself probably shouldn't take part in the similarity computation; it seems better to find similar books first and then sort the matches by rating.
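
Presumably something like the following sketch (my reconstruction, not the exact code from the post; book is assumed to hold the query title): drop avg_rating from the feature matrix and use it only to order the neighbors afterwards.

# features without the average rating; it is only used for ordering below
features = df[['author_encoded', 'year', 'publisher_encoded', 'rating_count']]
model_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20)
model_knn.fit(features)

query_index = df.index.get_loc(df[df['title'] == book].index[0])
distances, indices = model_knn.kneighbors(
    features.iloc[query_index].values.reshape(1, -1), n_neighbors=6)

# sort the recommended books by their average rating, best first
neighbors = df.iloc[indices.flatten()[1:]]
neighbors = neighbors.sort_values('avg_rating', ascending=False)
print(neighbors[['title', 'avg_rating']])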

The results came out exactly the same, so the rating apparently contributes nothing. Still, these distances look suspiciously tiny.

{
    "Where the Heart Is (Oprah's Book Club (Paperback))": {
        "The Honk and Holler Opening Soon": 1.4160210781710703e-09,
        "Where the Heart Is": 5.6642790458028e-09,
        "Where the Heart Is: A Novel": 1.2744773680850585e-08,
        "Shoot the Moon": 5.097383926067067e-08
    }
}

Let's try standardization.

from sklearn.preprocessing import StandardScaler

# standardize each feature to zero mean and unit variance
scaler = StandardScaler()
features = df[['author_encoded', 'year', 'publisher_encoded', 'avg_rating', 'rating_count']]
features = scaler.fit_transform(features)

The results are still much the same.

{
    "Where the Heart Is (Oprah's Book Club (Paperback))": {
        "Christmas Words: See-And-Say Fun for the Very Young": 5.922528489854528e-08,
        "The Scold's Bridle": 7.66054157885776e-08,
        "The Void Captain's Tale": 3.051218466776362e-07,
        "An Album of Voyager": 8.988488962025087e-07,
        "This Old House : The Best of Ask Norm": 1.0828802619045064e-06
    }
}

Optimization 5: Feature Concatenation

It feels like all that optimizing achieved nothing. Thinking it over again: the user-book rating matrix lets the model learn which books, in the eyes of users, tend to be read together, so it should stay in the mix alongside the metadata, roughly like the sketch below.
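
A toy illustration of the idea (made-up data; the real code follows further down): put the user-rating columns and the metadata columns side by side, one row per book.

import pandas as pd

# toy user-book rating matrix: one row per isbn, one column per user
ratings = pd.DataFrame(
    {'user_1': [8.0, 0.0], 'user_2': [0.0, 9.0]},
    index=pd.Index(['isbn_a', 'isbn_b'], name='isbn'))

# toy per-book metadata features
meta = pd.DataFrame(
    {'year': [1999, 2003], 'rating_count': [12, 40]},
    index=pd.Index(['isbn_a', 'isbn_b'], name='isbn'))

# concatenate along columns so each book keeps both kinds of features
combined = pd.concat([ratings, meta], axis=1)
print(combined)
#         user_1  user_2  year  rating_count
# isbn
# isbn_a     8.0     0.0  1999            12
# isbn_b     0.0     9.0  2003            40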

The data really is large; it wouldn't fit on my machine, so I was forced to filter a lot of it out.

INFO:root:count    340556.000000
mean          3.376185
std          12.436252
min           1.000000
25%           1.000000
50%           1.000000
75%           2.000000
max        2502.000000
Name: rating, dtype: float64
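
The median book has exactly one rating, which is why the final code below keeps only books with at least 100 ratings.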

The other main change was trying one-hot encoding later on.

# one-hot encode author and publisher (one new 0/1 column per distinct value)
df_books = pd.get_dummies(df_books, columns=['author'])
df_books = pd.get_dummies(df_books, columns=['publisher'])
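
Note that in the final version below these two lines end up commented out again: get_dummies adds one column per distinct author and publisher, which blows the feature matrix up far beyond what 16 GB of RAM can hold.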

Quite a lot else changed too, so here is the full version for you to read through.

import logging

logging.basicConfig(level=logging.INFO)

# distribution of the number of ratings per user and per book
ratings_per_user = df_ratings.groupby('user')['rating'].count()
ratings_per_book = df_ratings.groupby('isbn')['rating'].count()
logging.info("The distribution of ratings per book: ")
logging.info(ratings_per_book.describe())

# average rating and number of ratings for each book
average_rating = df_ratings.groupby('isbn')['rating'].mean()
rating_count = df_ratings.groupby('isbn')['rating'].count()
df_ratings_avg = pd.DataFrame({'isbn': average_rating.index, 'avg_rating': average_rating.values, 'rating_count': rating_count.values})

# process df_books: keep numeric years only, then convert
df_books = df_books.dropna()
df_books = df_books[df_books['year'].str.isnumeric()]
df_books['year'] = df_books['year'].astype(int)

# One-hot encoding (disabled: the feature matrix gets far too wide)
# df_books = pd.get_dummies(df_books, columns=['author'])
# df_books = pd.get_dummies(df_books, columns=['publisher'])

# extract extra features
extra_features = df_books.drop(columns=['title', 'publisher', 'author'])
logging.info("Extra features: ")
logging.info(extra_features)

# join the dataframes, keeping only books with at least 100 ratings
df = df_books.set_index('isbn').join(df_ratings.set_index('isbn'))
df = df.groupby('isbn').filter(lambda x: len(x) >= 100)
user_book_matrix = df.pivot_table(index='isbn', columns='user', values='rating').fillna(0)
logging.info("User book matrix: ")
logging.info(user_book_matrix)

# concat by isbn
extra_features.set_index('isbn', inplace=True)
combined_data = pd.concat([user_book_matrix, extra_features], axis=1)
combined_data = combined_data.dropna()
combined_data = combined_data[combined_data['year'] != 0]
logging.info("Combined data: ")
logging.info(combined_data)

# train the knn model
model_knn = NearestNeighbors(metric='cosine', algorithm='brute', n_neighbors=20)
model_knn.fit(combined_data.values)

# function to return recommended books - this will be tested
def get_recommends(book = ""):
    # get the isbn of the book
    book_isbn = df_books[df_books['title'] == book]['isbn'].values[0]

    # get the index of the book in the combined data
    logging.info("Book: " + book)
    logging.info("ISBN: " + book_isbn)
    if book_isbn not in combined_data.index:
        logging.info("Book not found.")
        return []
    book_index = combined_data.index.get_loc(book_isbn)

    distances, indices = model_knn.kneighbors(combined_data.iloc[book_index].values.reshape(1, -1), n_neighbors=5)

    # the format of the recommended books:
    # {
    #     "similar_book_1_isbn similar_book_1_title": distance_1,
    #     ...
    #     "similar_book_4_isbn similar_book_4_title": distance_4
    # }
    recommended_books = {}
    for i in range(1, len(indices[0])):
        recommended_book_isbn = combined_data.iloc[indices[0][i]].name
        recommended_book_title = df_books.loc[df_books['isbn'] == recommended_book_isbn, 'title'].values[0]
        recommended_books[recommended_book_isbn + " " + recommended_book_title] = distances[0][i]

    return recommended_books
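
To use it, call the function with a title that survived the filtering, e.g. (illustrative query, not necessarily the one that produced the output below):

print(get_recommends("Where the Heart Is (Oprah's Book Club (Paperback))"))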

Results

Not bad, and much better than before. Let's leave it at that.

{
    "042513699X Turtle Moon": 0.0026335611586064678,
    "0373825013 Whirlwind (Tyler, Book 1)": 0.00265084184906883,
    "0446365505 Pleading Guilty": 0.0026517030537067665,
    "0425150143 Tom Clancy's Op-Center: Mirror Image (Tom Clancy's Op Center (Paperback))": 0.0026543396062681524
}

Afterword

How to put it: I learned a great deal, but the data wrangling really was a hassle, and the volume is just too big. My machine only has 16 GB of RAM, after all.

One further idea: add user-similarity on top, so books can be recommended directly to a new user. Feel free to try that yourselves; a rough sketch follows.
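
A minimal sketch of that idea, assuming the user_book_matrix from above (books as rows, users as columns); all names here are illustrative, not from the original code. Transpose the matrix so each row is a user, find the most similar existing users, and recommend the books they rated highest.

from sklearn.neighbors import NearestNeighbors

# one row per user, one column per book
user_features = user_book_matrix.T

# fit a user-level KNN with the same cosine metric
user_knn = NearestNeighbors(metric='cosine', algorithm='brute')
user_knn.fit(user_features.values)

def recommend_for_user(user_id, n_books=5):
    # find the 5 most similar users (the first hit is the user itself)
    row = user_features.loc[user_id].values.reshape(1, -1)
    _, indices = user_knn.kneighbors(row, n_neighbors=6)
    similar_users = user_features.index[indices.flatten()[1:]]

    # average the similar users' ratings and take the top books
    mean_ratings = user_features.loc[similar_users].mean(axis=0)
    return mean_ratings.sort_values(ascending=False).head(n_books)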