The problem comes from a recommendations project.
The data has ~300K users and ~200K items.
The user-item ratings matrix would be sparse and huge, much larger than what can fit in RAM.
I first want to get latent representations of the users with PCA, and then do similarity analyses of the users with the latent vectors using something like approximate nearest neighbors.
How can I approach this problem?
I have the option of using PySpark and/or sklearn.
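To make the intended pipeline concrete, here is a minimal sklearn sketch under several assumptions that are not part of the project itself: the ratings are available as (user, item, rating) triples with integer ids, the data is synthetic and tiny, and the function names are hypothetical. It uses `TruncatedSVD` rather than true PCA, because it accepts a scipy sparse matrix directly and does not center the data, and it uses the exact `NearestNeighbors` search as a stand-in for an approximate nearest neighbor library.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.decomposition import TruncatedSVD
from sklearn.neighbors import NearestNeighbors


def user_latent_vectors(user_ids, item_ids, ratings, n_users, n_items,
                        n_components=10, seed=0):
    """Build a sparse user-item matrix and reduce it to dense user vectors."""
    # CSR stores only the observed ratings, so memory grows with the number
    # of ratings rather than with n_users * n_items.
    matrix = csr_matrix((ratings, (user_ids, item_ids)),
                        shape=(n_users, n_items), dtype=np.float32)
    # TruncatedSVD is a PCA-like stand-in: no mean centering is applied.
    svd = TruncatedSVD(n_components=n_components, random_state=seed)
    return svd.fit_transform(matrix)


def similar_users(latent, k=5):
    """Return the k nearest users (by cosine distance) for every user."""
    # Exact search, used here only as a placeholder for an ANN index.
    index = NearestNeighbors(n_neighbors=k + 1, metric="cosine").fit(latent)
    distances, neighbors = index.kneighbors(latent)
    # Drop the first column, which is normally the user itself
    # (ties between identical vectors can change that).
    return neighbors[:, 1:], distances[:, 1:]


# Synthetic stand-in data (assumption): 200 users, 100 items, 2000 ratings.
rng = np.random.default_rng(0)
n_users, n_items, n_obs = 200, 100, 2000
user_ids = rng.integers(0, n_users, n_obs)
item_ids = rng.integers(0, n_items, n_obs)
ratings = rng.integers(1, 6, n_obs).astype(np.float32)

latent = user_latent_vectors(user_ids, item_ids, ratings, n_users, n_items)
neighbors, distances = similar_users(latent, k=5)
print(latent.shape, neighbors.shape)
```

Whether this scales to ~300K users and ~200K items depends on how many ratings are actually observed, which the numbers above do not say.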
Asked by candide; translated from Stack Overflow.