Content-Based Recommendation: Building a Content-Based Recommender for Hotels
Published: 2022-02-05 22:03:41 | Category: Technical articles


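Project description:

Using the Seattle hotel dataset, take a hotel chosen by the user and recommend the 10 other hotels whose descriptions are most similar to it.

Dataset download link:

The dataset contains three fields: hotel name, address, and description.

Dataset preview:

Method steps:

1. Import the required packages and explore the data: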

import pandas as pd
import numpy as np
from nltk.corpus import stopwords
from sklearn.metrics.pairwise import linear_kernel
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
import re
import random
pd.options.display.max_columns = 30
import matplotlib.pyplot as plt
%matplotlib inline
# Font setting so that Chinese labels render correctly (requires the SimHei font)
plt.rcParams['font.sans-serif'] = ['SimHei']

df = pd.read_csv('Seattle_Hotels.csv', encoding="latin-1")
# Explore the data
print(df.head())
print('Number of hotels in the dataset:', len(df))

Output:

                             name  \
0  Hilton Garden Seattle Downtown
1          Sheraton Grand Seattle
2   Crowne Plaza Seattle Downtown
3    Kimpton Hotel Monaco Seattle
4              The Westin Seattle

                                           address  \
0  1821 Boren Avenue, Seattle Washington 98101 USA
1   1400 6th Avenue, Seattle, Washington 98101 USA
2                  1113 6th Ave, Seattle, WA 98101
3                   1101 4th Ave, Seattle, WA98101
4   1900 5th Avenue, Seattle, Washington 98101 USA

                                                desc
0  Located on the southern tip of Lake Union, the...
1  Located in the city's vibrant core, the Sherat...
2  Located in the heart of downtown Seattle, the ...
3  What?s near our hotel downtown Seattle locatio...
4  Situated amid incredible shopping and iconic a...
Number of hotels in the dataset: 152

# Print the description and name of the hotel at a given row index
def print_description(index):
    example = df[df.index == index][['desc', 'name']].values[0]
    if len(example) > 0:
        print(example[0])
        print('Name:', example[1])

print('Description of the hotel at index 10:')
print_description(10)

Output:

Description of the hotel at index 10:
Soak up the vibrant scene in the Living Room Bar and get in the mix with our live music and DJ series before heading to a memorable dinner at TRACE. Offering inspired seasonal fare in an award-winning atmosphere, it's a not-to-be-missed culinary experience in downtown Seattle. Work it all off the next morning at FIT®, our state-of-the-art fitness center before wandering out to explore many of the area's nearby attractions, including Pike Place Market, Pioneer Square and the Seattle Art Museum. As always, we've got you covered during your time at W Seattle with our signature Whatever/Whenever® service - your wish is truly our command.
Name: W Seattle

Use CountVectorizer to build a trigram bag-of-words model and find the 20 most frequent trigrams in the hotel descriptions.

# Return the top-k n-gram features found in the hotel descriptions
def get_top_n_words(corpus, n=1, k=None):
    # Build the n-gram count matrix
    vec = CountVectorizer(ngram_range=(n, n), stop_words='english').fit(corpus)
    bag_of_words = vec.transform(corpus)
    print('feature names:')
    # All n-gram features in the corpus (use get_feature_names() on scikit-learn < 1.0)
    print(vec.get_feature_names_out())
    print('bag of words:')
    print(bag_of_words.toarray())
    # Total count of each n-gram over all documents
    sum_words = bag_of_words.sum(axis=0)
    words_freq = [(word, sum_words[0, idx]) for word, idx in vec.vocabulary_.items()]
    # Sort by frequency, highest first
    words_freq = sorted(words_freq, key=lambda x: x[1], reverse=True)
    return words_freq[:k]

common_words = get_top_n_words(df['desc'], 3, 20)
print(common_words)
df1 = pd.DataFrame(common_words, columns=['desc', 'count'])
df1.groupby('desc').sum()['count'].sort_values().plot(kind='barh', title='Top 20 trigrams in hotel descriptions after removing stop words')
plt.show()
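
To make the trigram counting concrete, here is a minimal pure-Python sketch of roughly what CountVectorizer does with ngram_range=(3, 3) and stop-word removal. The toy corpus, the tiny STOP_WORDS set, and the helper name top_ngrams are assumptions made for illustration; scikit-learn's built-in 'english' list is much longer. Note that stop words are dropped before the n-grams are formed, so a trigram can span words that were not adjacent in the original sentence.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not scikit-learn's list)
STOP_WORDS = {"the", "in", "of", "a", "is", "and", "to"}

def top_ngrams(corpus, n=3, k=5):
    """Count word n-grams over a corpus and return the k most frequent."""
    counts = Counter()
    for doc in corpus:
        # Tokens of two or more word characters, lowercased, stop words removed
        tokens = [t for t in re.findall(r"\b\w\w+\b", doc.lower())
                  if t not in STOP_WORDS]
        # Slide a window of n tokens over the remaining words
        counts.update(" ".join(tokens[i:i + n])
                      for i in range(len(tokens) - n + 1))
    return counts.most_common(k)

toy_corpus = [
    "Located in the heart of downtown Seattle",
    "A boutique hotel in the heart of downtown Seattle",
    "Close to Pike Place Market in downtown Seattle",
]
top = top_ngrams(toy_corpus, n=3, k=3)
print(top)
```

The most frequent trigram here is "heart downtown seattle", even though "heart" and "downtown" are separated by "of" in the raw text.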

2. Preprocess the text

# Text preprocessing
# Requires the NLTK stop-word corpus: run nltk.download('stopwords') once beforehand
REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
STOPWORDS = set(stopwords.words('english'))

# Clean a piece of text
def clean_text(text):
    # Convert to lowercase
    text = text.lower()
    # Replace selected special symbols (such as punctuation) with a space
    text = REPLACE_BY_SPACE_RE.sub(' ', text)
    # Delete every character matched by BAD_SYMBOLS_RE
    text = BAD_SYMBOLS_RE.sub('', text)
    # Remove stop words
    text = ' '.join(word for word in text.split() if word not in STOPWORDS)
    return text

# Clean the desc field
df['desc_clean'] = df['desc'].apply(clean_text)
print(df['desc_clean'].head())
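
As a quick check of what this cleaning does to a single sentence, the sketch below applies the same two regular expressions to a sample string. To keep it self-contained, it swaps the NLTK stop-word list for a small hard-coded set, which is an assumption; the sample sentence is also made up for illustration.

```python
import re

REPLACE_BY_SPACE_RE = re.compile(r'[/(){}\[\]\|@,;]')
BAD_SYMBOLS_RE = re.compile(r'[^0-9a-z #+_]')
# Stand-in for NLTK's English stop words (assumption, much shorter than the real list)
STOPWORDS = {"the", "in", "our", "and", "at", "a", "to"}

def clean_text(text):
    text = text.lower()
    text = REPLACE_BY_SPACE_RE.sub(' ', text)   # these symbols become spaces
    text = BAD_SYMBOLS_RE.sub('', text)         # everything else is deleted outright
    return ' '.join(w for w in text.split() if w not in STOPWORDS)

sample = "Work it off at FIT®, our state-of-the-art fitness center; it's near Pike Place Market."
cleaned = clean_text(sample)
print(cleaned)
```

One side effect worth knowing: hyphens and apostrophes are not in the "replace by space" set, so they are deleted rather than split on. "state-of-the-art" collapses into the single token "stateoftheart", and "it's" becomes "its".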

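3. Extract text features with TF-IDF

# Modeling
df.set_index('name', inplace=True)
# Extract text features with TF-IDF over unigrams, bigrams, and trigrams
tf = TfidfVectorizer(analyzer='word', ngram_range=(1, 3), min_df=0.01, stop_words='english')
# Build the TF-IDF representation of the cleaned descriptions
tfidf_matrix = tf.fit_transform(df['desc_clean'])
print('TFIDF feature names:')
feature_names = tf.get_feature_names_out()  # use get_feature_names() on scikit-learn < 1.0
print(feature_names)
print(len(feature_names))
print('tfidf_matrix:')
print(tfidf_matrix)
print(tfidf_matrix.toarray())
print(tfidf_matrix.shape)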
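
The min_df parameter is easy to misread. When it is a float, scikit-learn treats it as a fraction of the documents and drops every term whose document frequency falls below that fraction times the number of documents. With 152 hotels, min_df=0.01 gives a threshold of 1.52, so any n-gram that occurs in only one description is discarded. The sketch below shows the same effect on a toy corpus; the four documents and the min_df=0.5 value are assumptions chosen to keep the result small.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

toy_docs = [
    "downtown seattle hotel",
    "downtown seattle inn",
    "airport hotel shuttle",
    "airport inn parking",
]

# min_df=0.5 on 4 documents keeps only terms that occur in at least 2 of them
vec = TfidfVectorizer(ngram_range=(1, 2), min_df=0.5)
matrix = vec.fit_transform(toy_docs)

# vocabulary_ maps each kept term to its column index
vocab = sorted(vec.vocabulary_)
print(vocab)
print(matrix.shape)
```

Only the bigram "downtown seattle" survives among the bigrams, because every other bigram appears in a single document.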

4. Compute the cosine similarity between hotels

# Compute the cosine similarity between hotels.
# TfidfVectorizer L2-normalizes each row by default, so the linear kernel (plain dot product) equals cosine similarity.
cosine_similarities = linear_kernel(tfidf_matrix, tfidf_matrix)
print(cosine_similarities)
print(cosine_similarities.shape)
indices = pd.Series(df.index)  # df.index holds the hotel names

Output:

[[1.         0.0391713  0.10519839 ... 0.04506191 0.01188579 0.02732358]
 [0.0391713  1.         0.06121291 ... 0.06131857 0.01508036 0.03706011]
 [0.10519839 0.06121291 1.         ... 0.09179925 0.04235642 0.05607314]
 ...
 [0.04506191 0.06131857 0.09179925 ... 1.         0.0579826  0.04145794]
 [0.01188579 0.01508036 0.04235642 ... 0.0579826  1.         0.0172546 ]
 [0.02732358 0.03706011 0.05607314 ... 0.04145794 0.0172546  1.        ]]
(152, 152)
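
Why a linear kernel can stand in for cosine similarity here: TfidfVectorizer uses norm='l2' by default, so every non-empty row of the TF-IDF matrix has unit length and the dot product of two rows is already their cosine. The following sketch checks this on a toy corpus; the three documents are assumptions made for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity, linear_kernel

docs = [
    "downtown seattle hotel near pike place market",
    "airport hotel with free shuttle",
    "bed and breakfast inn on capitol hill",
]

tfidf = TfidfVectorizer().fit_transform(docs)  # norm='l2' is the default

# Euclidean length of every row; each should be 1 after L2 normalization
row_norms = np.sqrt(np.asarray(tfidf.multiply(tfidf).sum(axis=1)).ravel())

# Dot products and cosine similarities should agree on unit-length rows
same = np.allclose(linear_kernel(tfidf, tfidf), cosine_similarity(tfidf, tfidf))
print(row_norms, same)
```

If the vectorizer were created with norm=None, the two results would differ, and cosine_similarity would be the safer call.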

5. Recommend the top 10 hotels based on similarity

# Given the similarity matrix and a hotel name, recommend the top 10 hotels
def recommendations(name, cosine_similarities=cosine_similarities):
    recommended_hotels = []
    # Find the row index of the queried hotel name
    idx = indices[indices == name].index[0]
    print('idx=', idx)
    # Sort this hotel's cosine similarity vector from highest to lowest
    score_series = pd.Series(cosine_similarities[idx]).sort_values(ascending=False)
    # Take the 10 highest scores, skipping position 0 (assumed to be the hotel itself)
    top_10_indexes = list(score_series.iloc[1:11].index)
    # Add the matching hotel names to the recommendation list
    for i in top_10_indexes:
        recommended_hotels.append(list(df.index)[i])
    return recommended_hotels

print(recommendations('Hilton Seattle Airport & Conference Center'))
print(recommendations('The Bacon Mansion Bed and Breakfast'))

The recommendations are as follows:

idx= 49
['Embassy Suites by Hilton Seattle Tacoma International Airport', 'DoubleTree by Hilton Hotel Seattle Airport', 'Seattle Airport Marriott', 'Motel 6 Seattle Sea-Tac Airport South', 'Knights Inn Tukwila', 'Four Points by Sheraton Downtown Seattle Center', 'Radisson Hotel Seattle Airport', 'Hampton Inn Seattle/Southcenter', 'Home2 Suites by Hilton Seattle Airport', 'Red Lion Hotel Seattle Airport Sea-Tac']
idx= 116
['11th Avenue Inn Bed and Breakfast', 'Shafer Baillie Mansion Bed & Breakfast', 'Gaslight Inn', 'Bed and Breakfast Inn Seattle', 'Chittenden House Bed and Breakfast', 'Hyatt House Seattle', 'Mozart Guest House', 'Silver Cloud Hotel - Seattle Broadway', 'WorldMark Seattle - The Camlin', 'Pensione Nichols Bed and Breakfast']
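
The function above skips the first sorted entry on the assumption that the queried hotel always ranks first against itself. That may not hold if two hotels share an identical description, or if a description ends up with an empty TF-IDF vector. A possible variant, sketched below with NumPy only, masks out the query row by index instead. The function name top_k_similar, the toy names, and the toy similarity matrix are all hypothetical.

```python
import numpy as np

def top_k_similar(sim, names, query, k=10):
    """Return the k (name, score) pairs most similar to `query`, excluding itself."""
    idx = names.index(query)
    scores = sim[idx].astype(float).copy()
    scores[idx] = -np.inf  # exclude the query by index, not by sorted position
    # Stable sort on negated scores gives descending order with deterministic ties
    order = np.argsort(-scores, kind="stable")[:k]
    return [(names[i], float(sim[idx, i])) for i in order]

# Toy symmetric similarity matrix (made-up values)
names = ["A", "B", "C", "D"]
sim = np.array([
    [1.0, 0.2, 0.9, 0.5],
    [0.2, 1.0, 0.1, 0.3],
    [0.9, 0.1, 1.0, 0.4],
    [0.5, 0.3, 0.4, 1.0],
])
result = top_k_similar(sim, names, "A", k=2)
print(result)
```

Returning the scores alongside the names also makes it easier to judge how close the recommendations really are.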

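Summary:

General steps for content-based hotel recommendation:

Step 1: Extract features from the hotel description (Desc).
N-Gram: use sequences of N consecutive words as features.
TF-IDF: select terms according to (min_df, max_df) and build the TF-IDF matrix.

Step 2: Compute the similarity matrix between hotels, using cosine similarity.

Step 3: For a given hotel, output the Top-K hotels with the highest similarity.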
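
For reference, the quantities behind Steps 1 and 2 can be written out as follows. This assumes scikit-learn's TfidfVectorizer defaults (smooth_idf=True, norm='l2'); here n is the number of documents, df(t) the number of documents containing term t, and tf(t, d) the raw count of t in document d. Other TF-IDF variants use slightly different formulas.

```latex
\begin{align}
\mathrm{idf}(t) &= \ln\frac{1 + n}{1 + \mathrm{df}(t)} + 1 \\
w_{t,d} &= \mathrm{tf}(t, d)\cdot \mathrm{idf}(t) \\
\mathbf{v}_d &= \frac{\mathbf{w}_d}{\lVert \mathbf{w}_d \rVert_2} \\
\mathrm{sim}(d_i, d_j) &= \frac{\mathbf{v}_{d_i}\cdot \mathbf{v}_{d_j}}{\lVert \mathbf{v}_{d_i} \rVert_2 \, \lVert \mathbf{v}_{d_j} \rVert_2} = \mathbf{v}_{d_i}\cdot \mathbf{v}_{d_j}
\end{align}
```

The last equality holds because each vector already has unit length, which is why the dot product alone is enough in Step 2.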

Reposted from: https://blog.csdn.net/lu_yunjie/article/details/108060364
