This article walks through the official baseline of the CCF "Book Recommendation System Competition" in detail and fixes a few of its errors. The corrected Jupyter notebook can be obtained from the WeChat official account "南极Python" by replying 图书推荐 (book recommendation).
Competition page: https://www.datafountain.cn/competitions/542
Problem Overview

Background

With the growth of the modern internet, we have entered an era of information explosion. The central problem for e-commerce platforms has shifted to helping users pick out what they actually want from a massive catalog; recommender systems are a product of this rapid growth. To help an e-commerce system identify user needs and surface content users are more interested in, the goal is to build a book recommendation system from a real-world book reading dataset using machine learning techniques, recommending books that users are likely to read. This creates business value while improving the reading experience and encouraging a broader culture of reading.
Task

Using real-world user-book interaction records and machine learning techniques, build an accurate and stable book recommendation system that predicts the 10 books each user is likely to read.
Data

The data consists of a training set, a test set, and a sample submission file.
The training set stores user-book interactions. For example, the first row, (user_id=0, item_id=257), means that the user with id 0 has read the book with id 257.
The test set contains only user ids. At prediction time, for every user in the test set, we must predict the 10 books they are likely to read and recommend them to the user.
Evaluation Metric

F1-Score.
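For reference, F1 is the harmonic mean of precision and recall; for a top-10 recommendation task, these are naturally computed over the recommended set versus the books a user actually reads (the exact aggregation across users is defined on the competition page):

$$
\mathrm{F1} = \frac{2PR}{P + R}, \qquad
P = \frac{|\text{recommended} \cap \text{read}|}{|\text{recommended}|}, \qquad
R = \frac{|\text{recommended} \cap \text{read}|}{|\text{read}|}
$$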
Building the Baseline

Imports and basic configuration:

```python
import numpy as np
import pandas as pd
import random
import torch
import torch.nn.functional as F
from torch.utils.data import Dataset, DataLoader
import os

seed = 114514
np.random.seed(seed)
random.seed(seed)

BATCH_SIZE = 512
hidden_dim = 16
epochs = 10
device = torch.device('cuda:0') if torch.cuda.is_available() else torch.device('cpu')
print(device)
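One thing worth noting: this cell seeds NumPy and Python's random module but not PyTorch itself, so the embedding initialization will still differ between runs. For full reproducibility you can also seed torch; a minimal addition (not part of the original baseline):

```python
# Not in the original baseline: also seed PyTorch (CPU and all GPUs),
# so embedding initialization and dropout masks are reproducible.
torch.manual_seed(seed)
torch.cuda.manual_seed_all(seed)
```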
Data Preparation

The raw training set carries no labels, so it cannot be used for training as-is; we first need to construct a usable training dataset from it.
First, read in the training set:
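A minimal read-in looks like the following, assuming the file is named train_dataset.csv with the two columns user_id and item_id (the path is an assumption, mirroring the test set path used later):

```python
# Assumed path/columns; mirrors './test_dataset.csv' used in the prediction section.
df = pd.read_csv('./train_dataset.csv')
print(df.head())  # expected columns: user_id, item_id
```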
Then build the dataset that can actually be used for training:
```python
import tqdm

class Goodbooks(Dataset):
    def __init__(self, df, mode='training', negs=99):
        super().__init__()
        self.df = df
        self.mode = mode
        self.negs = negs
        self.book_nums = max(df['item_id']) + 1
        self.user_nums = max(df['user_id']) + 1
        self._init_dataset()

    def _init_dataset(self):
        self.Xs = []

        # Map each user to the list of books that user has read.
        self.user_book_map = {}
        for i in range(self.user_nums):
            self.user_book_map[i] = []
        for index, row in self.df.iterrows():
            user_id, book_id = row
            self.user_book_map[user_id].append(book_id)

        if self.mode == 'training':
            for user, items in tqdm.tqdm(self.user_book_map.items()):
                # Use every interaction except the last one for training.
                for item in items[:-1]:
                    self.Xs.append((user, item, 1))
                    # Sample 3 negatives (unread books) per positive.
                    for _ in range(3):
                        while True:
                            neg_sample = random.randint(0, self.book_nums - 1)
                            if neg_sample not in self.user_book_map[user]:
                                self.Xs.append((user, neg_sample, 0))
                                break
        elif self.mode == 'validation':
            for user, items in tqdm.tqdm(self.user_book_map.items()):
                if len(items) == 0:
                    continue
                # Hold out the last interaction for validation.
                self.Xs.append((user, items[-1]))

    def __getitem__(self, index):
        if self.mode == 'training':
            user_id, book_id, label = self.Xs[index]
            return user_id, book_id, label
        elif self.mode == 'validation':
            user_id, book_id = self.Xs[index]
            # Sample `negs` books the user has never read.
            negs = list(random.sample(
                list(set(range(self.book_nums)) - set(self.user_book_map[user_id])),
                k=self.negs))
            return user_id, book_id, torch.LongTensor(negs)

    def __len__(self):
        return len(self.Xs)
```
In the _init_dataset method, two for loops first build a mapping from each user to the list of books that user has read, stored in self.user_book_map, i.e.:
```python
self.user_book_map = {user_1: [books read by user_1], user_2: [books read by user_2], ...}
```
Next come the two branches that build the training set and the validation set, respectively.
For each user's interaction history, the training set uses every item (book) except the last one, while the validation set uses only the last item, i.e. a leave-one-out split.
Books the user has read are treated as positive samples; books the user has not read are treated as negatives.
Let's look at the training set first.
In real-world scenarios, the books a user has read are only a small fraction of the whole catalog, so the training set is built with a positive-to-negative ratio of 1:3 to simulate this imbalance.
Each training sample has the structure (user_id, book_id, label), where label indicates whether the user has read the book: 1 if yes, 0 if no.
Now the validation set.
Each validation sample has the structure (user_id, book_id, negs): one user id paired with the id of one book the user has read (the held-out positive) and the ids of 99 books the user has not read.
The reason for this design is that validation needs a metric to gauge model quality: the model ranks the single positive against the 99 sampled negatives, and we check whether the positive lands in the top 10 (a hits@10 metric).
In short, with the Goodbooks class above we can now build the training and validation sets:
```python
traindataset = Goodbooks(df, 'training')
validdataset = Goodbooks(df, 'validation')

trainloader = DataLoader(traindataset, batch_size=BATCH_SIZE, shuffle=True,
                         drop_last=False, num_workers=0)
validloader = DataLoader(validdataset, batch_size=BATCH_SIZE, shuffle=True,
                         drop_last=False, num_workers=0)
```
Model Construction

Here we build a NeuralCF model:
We have covered the principles of this model in detail before; see our earlier article for a refresher.
Now let's build the network. Its components are as follows:
Embedding Layer: maps the sparse one-hot user/item vectors to dense low-dimensional vectors.

GMF Layer: generalized matrix factorization; takes the element-wise product of the user and item embeddings, effectively extracting shallow features.

MLP Layer: n fully connected layers that extract deep features.

Concatenation Layer: concatenates the GMF and MLP outputs, combining the shallow and deep information.

Output Layer: produces the final user-item score.
The PyTorch code is as follows:
```python
class NCFModel(torch.nn.Module):
    def __init__(self, hidden_dim, user_num, item_num, mlp_layer_num=4,
                 weight_decay=1e-5, dropout=0.5):
        super().__init__()
        self.hidden_dim = hidden_dim
        self.user_num = user_num
        self.item_num = item_num
        self.mlp_layer_num = mlp_layer_num
        self.weight_decay = weight_decay
        self.dropout = dropout

        # The MLP tower uses wider embeddings so that the concatenated input
        # matches its first linear layer: hidden_dim * 2**(mlp_layer_num - 1) each.
        self.mlp_user_embedding = torch.nn.Embedding(user_num, hidden_dim * (2 ** (self.mlp_layer_num - 1)))
        self.mlp_item_embedding = torch.nn.Embedding(item_num, hidden_dim * (2 ** (self.mlp_layer_num - 1)))

        self.gmf_user_embedding = torch.nn.Embedding(user_num, hidden_dim)
        self.gmf_item_embedding = torch.nn.Embedding(item_num, hidden_dim)

        mlp_Layers = []
        input_size = int(hidden_dim * (2 ** self.mlp_layer_num))
        for i in range(self.mlp_layer_num):
            mlp_Layers.append(torch.nn.Linear(int(input_size), int(input_size / 2)))
            mlp_Layers.append(torch.nn.Dropout(self.dropout))
            mlp_Layers.append(torch.nn.ReLU())
            input_size /= 2
        self.mlp_layers = torch.nn.Sequential(*mlp_Layers)
        """
        Sequential(
          (0): Linear(in_features=256, out_features=128, bias=True)
          (1): Dropout(p=0.5, inplace=False)
          (2): ReLU()
          (3): Linear(in_features=128, out_features=64, bias=True)
          (4): Dropout(p=0.5, inplace=False)
          (5): ReLU()
          (6): Linear(in_features=64, out_features=32, bias=True)
          (7): Dropout(p=0.5, inplace=False)
          (8): ReLU()
          (9): Linear(in_features=32, out_features=16, bias=True)
          (10): Dropout(p=0.5, inplace=False)
          (11): ReLU()
        )
        """

        # GMF output (hidden_dim) + MLP output (hidden_dim) -> one score.
        self.output_layer = torch.nn.Linear(2 * self.hidden_dim, 1)

    def forward(self, user, item):
        user_gmf_embedding = self.gmf_user_embedding(user)
        item_gmf_embedding = self.gmf_item_embedding(item)

        user_mlp_embedding = self.mlp_user_embedding(user)
        item_mlp_embedding = self.mlp_item_embedding(item)

        # GMF branch: element-wise product of user and item embeddings.
        gmf_output = user_gmf_embedding * item_gmf_embedding

        # MLP branch: concatenate embeddings and run the fully connected tower.
        mlp_input = torch.cat([user_mlp_embedding, item_mlp_embedding], dim=-1)
        mlp_output = self.mlp_layers(mlp_input)

        output = torch.sigmoid(self.output_layer(torch.cat([gmf_output, mlp_output], dim=-1))).squeeze(-1)
        return output

    def predict(self, user, item):
        # Batch scoring: `user` has shape (B,), `item` has shape (B, N);
        # returns a (B, N) tensor of scores.
        self.eval()
        with torch.no_grad():
            user_gmf_embedding = self.gmf_user_embedding(user)
            item_gmf_embedding = self.gmf_item_embedding(item)

            user_mlp_embedding = self.mlp_user_embedding(user)
            item_mlp_embedding = self.mlp_item_embedding(item)

            gmf_output = user_gmf_embedding.unsqueeze(1) * item_gmf_embedding

            user_mlp_embedding = user_mlp_embedding.unsqueeze(1).expand(-1, item_mlp_embedding.shape[1], -1)
            mlp_input = torch.cat([user_mlp_embedding, item_mlp_embedding], dim=-1)
            mlp_output = self.mlp_layers(mlp_input)

            output = torch.sigmoid(self.output_layer(torch.cat([gmf_output, mlp_output], dim=-1))).squeeze(-1)
        return output
```
The network takes a user id and an item id as input and outputs the predicted probability that the user will read the item.
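As a quick sanity check, here is a minimal sketch (with made-up user/item counts, not from the competition data) that runs a dummy batch through forward and predict to confirm the output shapes:

```python
# Toy sizes for illustration only.
toy_model = NCFModel(hidden_dim=16, user_num=100, item_num=500)

users = torch.randint(0, 100, (4,))       # a batch of 4 user ids
items = torch.randint(0, 500, (4,))       # one item per user
print(toy_model(users, items).shape)      # torch.Size([4]) -- one probability per pair

cands = torch.randint(0, 500, (4, 10))    # 10 candidate items per user
print(toy_model.predict(users, cands).shape)  # torch.Size([4, 10]) -- one score per candidate
```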
Model Training & Evaluation

Straight to the code:
```python
model = NCFModel(hidden_dim, traindataset.user_nums, traindataset.book_nums).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
crit = torch.nn.BCELoss()

loss_for_plot = []
hits_for_plot = []

for epoch in range(epochs):
    losses = []
    model.train()  # predict() switches to eval mode, so switch back each epoch
    for index, data in enumerate(tqdm.tqdm(trainloader)):
        user, item, label = data
        user, item, label = user.to(device), item.to(device), label.to(device).float()
        y_ = model(user, item).squeeze()

        loss = crit(y_, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        losses.append(loss.detach().cpu().item())

    hits = []
    for index, data in enumerate(validloader):
        user, pos, neg = data
        pos = pos.unsqueeze(1)
        # Column 0 is the positive item; columns 1..99 are negatives.
        all_data = torch.cat([pos, neg], dim=-1)
        output = model.predict(user.to(device), all_data.to(device)).detach().cpu()

        for batch in output:
            pred10 = batch.argsort(descending=True)[:10]
            if 0 not in pred10:
                hits.append(0)
            else:
                hits.append(1)

    print('Epoch {} finished, average loss {}, hits@10 {}'.format(
        epoch, sum(losses)/len(losses), sum(hits)/len(hits)))
    loss_for_plot.append(sum(losses)/len(losses))
    hits_for_plot.append(sum(hits)/len(hits))
```
The training loop is straightforward, so we won't dwell on it.
The validation code is a bit convoluted, so let's analyze it carefully.
During validation, we predict, for each of the batchsize (512) users, the probability of reading each of 100 books (recall the validation sample structure described above). Among these 100 books, the first (index 0) is the positive sample (read) and the remaining 99 are all negatives (unread).
This validation code is problematic to read: the indices in pred10 are only positions within the current user's 100 candidate books, so they range from 0 to 99 and are not real book ids. The real ids are the values stored at those positions in the current user's row of all_data. (Strictly speaking, the check "0 not in pred10" still computes the hit correctly, because the positive is always placed at position 0; but working in index space is fragile, and prediction later will need the real ids anyway.)
That description may not be very clear, so let's just modify the code:
```python
model = NCFModel(hidden_dim, traindataset.user_nums, traindataset.book_nums).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
crit = torch.nn.BCELoss()

loss_for_plot = []
hits_for_plot = []

for epoch in range(epochs):
    losses = []
    model.train()
    for index, data in enumerate(tqdm.tqdm(trainloader)):
        user, item, label = data
        user, item, label = user.to(device), item.to(device), label.to(device).float()
        y_ = model(user, item).squeeze()

        loss = crit(y_, label)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        losses.append(loss.detach().cpu().item())

    hits = []
    for index, data in enumerate(validloader):
        user, pos, neg = data
        pos = pos.unsqueeze(1)
        all_data = torch.cat([pos, neg], dim=-1)
        output = model.predict(user.to(device), all_data.to(device)).detach().cpu()

        for batch, batch_items in zip(output, all_data):
            pos_id = batch_items[0]                           # real id of the positive book
            pred10 = batch.argsort(descending=True)[:10]      # top-10 positions, 0..99
            pred10 = batch_items[pred10]                      # map positions back to real book ids
            if pos_id not in pred10:
                hits.append(0)
            else:
                hits.append(1)

    print('Epoch {} finished, average loss {}, hits@10 {}'.format(
        epoch, sum(losses)/len(losses), sum(hits)/len(hits)))
    loss_for_plot.append(sum(losses)/len(losses))
    hits_for_plot.append(sum(hits)/len(hits))
```
all_data stores the real item ids. Therefore, after obtaining the predicted positions (values from 0 to 99), we map them back to real item ids, and the hit computation then uses the real item ids:
```python
for batch, batch_items in zip(output, all_data):
    pos_id = batch_items[0]
    pred10 = batch.argsort(descending=True)[:10]
    pred10 = batch_items[pred10]
```
That completes the fix.
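To see the position-to-id mapping in isolation, here is a tiny standalone example with made-up scores and ids (not from the dataset):

```python
import torch

item_ids = torch.LongTensor([257, 13, 980, 4, 666])  # real book ids of 5 candidates
scores = torch.tensor([0.1, 0.9, 0.3, 0.7, 0.2])     # predicted probabilities

top2_pos = scores.argsort(descending=True)[:2]  # positions in the candidate list: tensor([1, 3])
top2_ids = item_ids[top2_pos]                   # mapped back to real ids: tensor([13, 4])
print(top2_pos, top2_ids)
```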
Run the code block above to start training. After training finishes, the model can be saved:
```python
torch.save(model.state_dict(), './model.h5')
```
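Despite the .h5 extension, this file is an ordinary PyTorch state dict. To restore it later, rebuild the model with the same arguments and load the weights back; a minimal sketch mirroring the construction above:

```python
# Rebuild the architecture, then load the saved weights onto the right device.
model = NCFModel(hidden_dim, traindataset.user_nums, traindataset.book_nums).to(device)
model.load_state_dict(torch.load('./model.h5', map_location=device))
```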
Model Prediction

First, read in the test set:
```python
df = pd.read_csv('./test_dataset.csv')
user_for_test = df['user_id'].tolist()
```
Then, for each test user, predict the books (items) they might click, this time considering every book the user has not yet read:
```python
def chunks(l, n):
    for i in range(0, len(l), n):
        yield l[i:i+n]

f = open('./submission.csv', 'w', encoding='utf-8')

for user in tqdm.tqdm(user_for_test):
    user_visited_items = traindataset.user_book_map[user]
    # Score only the books this user has not read yet.
    items_for_predict = list(set(range(traindataset.book_nums)) - set(user_visited_items))

    results = []
    user = torch.Tensor([user]).to(device).long()

    for item_batch in chunks(items_for_predict, 512):
        item_batch = torch.Tensor(item_batch).unsqueeze(0).to(device).long()
        result = model.predict(user, item_batch).view(-1).detach().cpu()
        results.append(result)

    results = torch.cat(results, dim=-1)
    # Positions of the top-10 scores within items_for_predict -- not real ids yet.
    predict_item_id = results.argsort(descending=True)[:10]

    # Map positions back to real item ids.
    res = []
    for i in predict_item_id:
        res.append(items_for_predict[i])

    # Write the real item ids (res), not the ranking positions.
    list(map(lambda x: f.write('{},{}\n'.format(user.cpu().item(), x)), res))

f.flush()
f.close()
```
We score every book the user has not read, sort the predictions in descending order, and take the positions of the top 10. Just as in the validation section, these positions must be mapped back to real item ids, which a single for loop handles (note that the original baseline wrote the raw positions to the submission file instead of the mapped ids; the block above writes res):
```python
res = []
for i in predict_item_id:
    res.append(items_for_predict[i])
```
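Equivalently, the loop can be replaced by fancy indexing (a minor alternative, not in the original; predict_item_id is already a LongTensor here):

```python
# Index the candidate list as a tensor, then convert back to a Python list of ids.
res = torch.LongTensor(items_for_predict)[predict_item_id].tolist()
```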
That's it.