Recommender Systems in Practice (4): Collaborative Filtering for Movie Recommendation

The MovieLens dataset is available for download. All exercises in this post use ml-latest-small.zip, which is small enough to load and run on a single machine.

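Learning goals:

  • Using users' movie rating data, implement User-Based and Item-Based collaborative filtering, predict movie ratings, and then recommend movies to users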

Implementing collaborative filtering based on user similarity

First, import the required packages:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import os
from pprint import pprint

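Define a function that loads the data through a cache. When the dataset is large, caching the pivoted rating matrix makes subsequent loads much faster.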

CACHE_DIR = "datasets/cache/"
DATA_PATH = "datasets/ml-latest-small/ratings.csv"

def load_data(data_path):
    # Make sure the cache directory exists before reading from or writing to it
    os.makedirs(CACHE_DIR, exist_ok=True)
    cache_path = os.path.join(CACHE_DIR, "ratings_matrix.cache")
    print("-----Loading dataset-----")
    if os.path.exists(cache_path):
        print("-----Loading from cache-----")
        ratings_matrix = pd.read_pickle(cache_path)
        print("-----Dataset loaded from cache-----")
    else:
        print("-----Loading fresh data-----")
        # Set the dtype of each field to load
        dtype = {"userId": np.int32, "movieId": np.int32, "rating": np.float32}
        # Load the data; only the first three columns are used for now
        ratings = pd.read_csv(data_path, dtype=dtype, usecols=range(3))
        # Pivot into a user x movie rating matrix (unrated cells are NaN)
        ratings_matrix = ratings.pivot_table(index=["userId"], columns=["movieId"], values="rating")
        # Write the matrix to the cache file
        ratings_matrix.to_pickle(cache_path)
        print("-----Dataset loaded-----")
    return ratings_matrix
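
To make the pivot step concrete, here is a minimal sketch on a hypothetical toy table (the values are made up for illustration and are not taken from MovieLens). It shows how the long (userId, movieId, rating) format becomes a user x movie matrix in which unrated cells are NaN.

```python
import pandas as pd

# Hypothetical toy ratings in the same long format as ratings.csv
toy_ratings = pd.DataFrame({
    "userId":  [1, 1, 2, 2, 3],
    "movieId": [10, 20, 10, 30, 20],
    "rating":  [4.0, 5.0, 3.0, 2.0, 1.0],
})

# One row per user, one column per movie; missing ratings become NaN
toy_matrix = toy_ratings.pivot_table(index="userId", columns="movieId", values="rating")
print(toy_matrix)
```

Most cells of the real matrix are NaN because each user rates only a small share of the movies, which is why the later steps call dropna() so often.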

Next we compute user similarity. The same function also computes item similarity:

def compute_person_similarity(ratings_matrix, based="user"):
    # Computes Pearson similarity (DataFrame.corr defaults to the Pearson method)
    user_similarity_cache_path = os.path.join(CACHE_DIR, "user_similarity.cache")
    item_similarity_cache_path = os.path.join(CACHE_DIR, "item_similarity.cache")
    if based == "user":
        if os.path.exists(user_similarity_cache_path):
            print("-----Loading user similarity matrix from cache-----")
            similarity = pd.read_pickle(user_similarity_cache_path)
        else:
            print("-----Computing user similarity matrix-----")
            # corr() correlates columns, so transpose to put users in the columns
            similarity = ratings_matrix.T.corr()
            similarity.to_pickle(user_similarity_cache_path)
    elif based == "item":
        if os.path.exists(item_similarity_cache_path):
            print("-----Loading item similarity matrix from cache-----")
            similarity = pd.read_pickle(item_similarity_cache_path)
        else:
            print("-----Computing item similarity matrix-----")
            similarity = ratings_matrix.corr()
            similarity.to_pickle(item_similarity_cache_path)
    else:
        raise Exception("Unhandled 'based' value: %s" % based)
    print("-----Similarity matrix computed/loaded-----")
    return similarity
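
For intuition about what corr() computes for one pair of users, the following is a plain-Python sketch of the Pearson correlation restricted to the movies both users rated. pandas' DataFrame.corr likewise defaults to Pearson and skips missing values pair by pair. The helper name pearson and the sample users are hypothetical, and returning None for undefined cases is a choice made for this sketch (pandas reports NaN there).

```python
import math

def pearson(ratings_a, ratings_b):
    """Pearson correlation over the items both users have rated.

    ratings_a and ratings_b are dicts mapping item id -> rating.
    Returns None when fewer than two items are shared or a variance is zero.
    """
    common = sorted(set(ratings_a) & set(ratings_b))
    if len(common) < 2:
        return None
    xs = [ratings_a[i] for i in common]
    ys = [ratings_b[i] for i in common]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)

# Hypothetical users: item id -> rating
alice = {1: 5.0, 2: 3.0, 3: 4.0}
bob = {1: 4.0, 2: 2.0, 3: 3.0, 4: 5.0}
carol = {1: 1.0, 2: 5.0}

sim_alice_bob = pearson(alice, bob)      # same preference ordering
sim_alice_carol = pearson(alice, carol)  # opposite preference ordering
print(sim_alice_bob, sim_alice_carol)
```

Because Pearson subtracts each user's mean, a user who rates everything one point lower than another still counts as perfectly similar, as with alice and bob here.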

We first implement the rating prediction of a single user for a single movie:

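def predict(uid, iid, ratings_matrix, user_similar):
    print("-----Predicting the rating of user <%d> for movie <%d>-----" % (uid, iid))
    # Find the users similar to uid (positive correlation only)
    similar_users = user_similar[uid].drop([uid]).dropna()
    similar_users = similar_users.where(similar_users > 0).dropna()
    if similar_users.empty:
        raise Exception("User <%d> has no similar users" % uid)
    # Among the similar users, keep those who have a rating for iid
    ids = set(ratings_matrix[iid].dropna().index) & set(similar_users.index)
    finally_similar_users = similar_users.loc[list(ids)]
    if finally_similar_users.empty:
        # Without this check the division below would fail with 0/0
        raise Exception("No similar user of user <%d> has rated movie <%d>" % (uid, iid))
    # Predict uid's rating for iid from the neighbors' similarities and ratings
    sum_up = 0    # numerator of the rating prediction formula
    sum_down = 0  # denominator of the rating prediction formula
    # .loc and .items() replace .ix and .iteritems(), which newer pandas versions removed
    for sim_uid, similarity in finally_similar_users.items():
        # Ratings given by the neighbor
        sim_user_rated_movies = ratings_matrix.loc[sim_uid].dropna()
        # The neighbor's rating for iid
        sim_user_rating_for_item = sim_user_rated_movies[iid]
        # Accumulate the numerator
        sum_up += similarity * sim_user_rating_for_item
        # Accumulate the denominator
        sum_down += similarity

    predict_rating = sum_up / sum_down
    print("Predicted rating of user <%d> for movie <%d>: %0.2f" % (uid, iid, predict_rating))
    return round(predict_rating, 2)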
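
The numerator and denominator accumulated above amount to a similarity-weighted average of the neighbors' ratings: the predicted rating is the sum of similarity × rating divided by the sum of similarities. Below is a small worked sketch of that arithmetic. The function name weighted_average_rating and the neighbor values are hypothetical and chosen only so the result is easy to check by hand.

```python
def weighted_average_rating(neighbors):
    """neighbors: list of (similarity, rating) pairs for users who rated the item."""
    sum_up = sum(sim * rating for sim, rating in neighbors)   # numerator
    sum_down = sum(sim for sim, _ in neighbors)               # denominator
    if sum_down == 0:
        raise ValueError("no usable neighbors")
    return sum_up / sum_down

# Hypothetical neighbors: (similarity to the target user, rating for the movie)
neighbors = [(0.5, 4.0), (0.25, 2.0), (0.25, 5.0)]
predicted = weighted_average_rating(neighbors)
print(predicted)  # (0.5*4 + 0.25*2 + 0.25*5) / (0.5 + 0.25 + 0.25) = 3.75
```

The most similar neighbor pulls the prediction toward its own rating, which is why only neighbors with positive similarity are kept.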

Define a simple function that predicts a user's ratings for the movies in item_ids. item_ids can simply be the full set of movies.

def _predict_all(uid, item_ids, ratings_matrix, user_similar):
    # Predict a rating for every movie in item_ids, skipping movies that cannot be predicted
    for iid in item_ids:
        try:
            rating = predict(uid, iid, ratings_matrix, user_similar)
        except Exception as e:
            print(e)
        else:
            yield uid, iid, rating
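
The try/except/else pattern inside a generator may be easier to see in isolation. The sketch below uses a hypothetical stand-in predictor (fake_predict) that fails for some items. Failed items are reported and skipped, while successful ones are yielded lazily.

```python
def safe_predictions(uid, item_ids, predict_fn):
    """Yield (uid, iid, rating) for every item that predict_fn can handle."""
    for iid in item_ids:
        try:
            rating = predict_fn(uid, iid)
        except Exception as exc:
            # Report the failure and move on to the next item
            print("skipped item %s: %s" % (iid, exc))
        else:
            # Runs only when predict_fn did not raise
            yield uid, iid, rating

def fake_predict(uid, iid):
    # Hypothetical predictor: pretend even item ids have no usable neighbors
    if iid % 2 == 0:
        raise ValueError("no neighbor rated this item")
    return float(iid)

results = list(safe_predictions(1, [1, 2, 3], fake_predict))
print(results)
```

Since this is a generator, no prediction runs until the caller iterates over it, for example by passing it to list() or sorted().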

We can also add filter rules that exclude unpopular movies, movies the user has already rated, or both:

def predict_all(uid, ratings_matrix, user_similar, filter_rule=None):
    if not filter_rule:
        item_ids = ratings_matrix.columns
    elif isinstance(filter_rule, str) and filter_rule == "unhot":
        # Exclude unpopular movies: keep only movies with more than 10 ratings
        count = ratings_matrix.count()
        item_ids = count[count > 10].index
    elif isinstance(filter_rule, str) and filter_rule == "rated":
        # Exclude movies the user has already rated: keep only the NaN entries
        user_ratings = ratings_matrix.loc[uid]
        item_ids = user_ratings[user_ratings.isna()].index
    elif isinstance(filter_rule, list) and set(filter_rule) == set(["unhot", "rated"]):
        # Apply both rules and keep the movies that pass each of them
        count = ratings_matrix.count()
        ids1 = count[count > 10].index
        user_ratings = ratings_matrix.loc[uid]
        ids2 = user_ratings[user_ratings.isna()].index
        item_ids = ids1.intersection(ids2)
    else:
        raise Exception("Invalid filter rule")
    yield from _predict_all(uid, item_ids, ratings_matrix, user_similar)
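
The two filters can be checked on a tiny hypothetical matrix. In this sketch the popularity threshold min_ratings is set to 1 so the toy data produces a visible result (the function above uses 10 on the real dataset); the matrix values and variable names are assumptions for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical user x movie matrix; NaN means "not rated"
toy_matrix = pd.DataFrame(
    {10: [4.0, 3.0, 5.0], 20: [np.nan, 2.0, np.nan], 30: [np.nan, 4.0, 1.0]},
    index=pd.Index([1, 2, 3], name="userId"),
)
min_ratings = 1  # assumed threshold for this toy example only

# "unhot" rule: count() ignores NaN, so this is the number of ratings per movie
count = toy_matrix.count()
popular_ids = count[count > min_ratings].index

# "rated" rule: keep only the movies user 1 has not rated yet
user_ratings = toy_matrix.loc[1]
unrated_ids = user_ratings[user_ratings.isna()].index

# Combined rule: movies that are both popular and unrated by user 1
candidate_ids = popular_ids.intersection(unrated_ids)
print(list(candidate_ids))
```

Narrowing the candidates before predicting matters for speed, because each call to predict loops over the target user's neighbors.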

Finally, we define a function that returns the top-k recommendations (for example, k=20):

def top_k_rs_result(uid, k):
    ratings_matrix = load_data(DATA_PATH)
    user_similar = compute_person_similarity(ratings_matrix, based="user")
    results = predict_all(uid, ratings_matrix, user_similar, filter_rule=["unhot", "rated"])
    # Sort by predicted rating (third tuple element), highest first, and keep the first k
    return sorted(results, key=lambda x: x[2], reverse=True)[:k]
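
As a possible alternative to sorting every prediction and slicing, the standard library's heapq.nlargest selects the k best items directly from a generator. This is a sketch of that option, not the approach used in the post; the helper name top_k and the sample predictions are hypothetical.

```python
import heapq

def top_k(predictions, k):
    """Return the k (uid, iid, rating) tuples with the highest rating, best first."""
    return heapq.nlargest(k, predictions, key=lambda x: x[2])

# Hypothetical predictions for user 1, produced lazily like predict_all does
preds = ((1, iid, rating) for iid, rating in [(10, 3.5), (20, 4.8), (30, 2.1), (40, 4.2)])
best = top_k(preds, 2)
print(best)
```

For small k relative to the number of candidates, this avoids holding a fully sorted list, though on ml-latest-small the difference is unlikely to be noticeable.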
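
  • Copyright notice: unless otherwise stated, all articles on this blog are copyrighted by the author. Please credit the source when reposting.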

  • Copyrights © 2020 chenk