Recommendation recall system learning notes
Updated: Mar 14, 2019
The first concept to understand is information overload: there are too many items, or the system is too large, for a person to browse through manually. Recommendation addresses this situation: when information overload happens, the system selects items (objects) for users automatically, based on a set of rules.
Based on user actions
Based on the user profile
Based on vectors
Latent factor model
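The basic idea of this model is that we have a matrix (r) that stores whether each user has visited each item (object).
Based on the features of an item, such as category and size, we should be able to build a vector (q) storing the weights of these features. Similarly, we can build another vector (p) for the user's features. In a latent factor model these features are learned from the data rather than specified by hand. If we build the vectors in the right way, the dot product of a user vector and an item vector approximates the corresponding entry of the matrix above. Stacking all user vectors and all item vectors into two matrices and multiplying one by the transpose of the other approximates the whole matrix.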
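As a minimal sketch of the shapes involved, assuming NumPy and randomly filled factors (the names P, Q, r_hat and the sizes are illustrative, not from any particular library):

```python
import numpy as np

rng = np.random.default_rng(0)
n_users, n_items, n_factors = 4, 5, 2  # illustrative sizes

# Hypothetical latent factors: one row per user (P) and one row per item (Q).
P = rng.normal(size=(n_users, n_factors))
Q = rng.normal(size=(n_items, n_factors))

# Predicted user-item matrix: entry (u, i) is the dot product of p_u and q_i.
r_hat = P @ Q.T
```

Here r_hat has one row per user and one column per item, the same shape as the matrix r.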
Using these vectors, we can do many things:
Top like: for an item that has not been shown to a user, we can compute how likely the user is to like it. Moreover, we can compute a list of items that the user has not seen before but is most likely to be interested in.
Top similar: we can compute the distance or the cosine similarity between two items or two users, and build a list of the items/users most similar to the current item/user. Recommendations can be drawn from that list.
Topic: using a clustering algorithm such as k-means, we can find the topics of items or users. We can recommend items in the same topic, and we can recommend the same item to users in the same topic.
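The first two uses can be sketched as follows. This is only an illustration under assumptions: P and Q are user and item factor matrices with one row each, seen is a boolean user-item matrix, all vectors are non-zero, and the function names top_like and top_similar are made up for this note.

```python
import numpy as np

def top_like(P, Q, seen, user, k=3):
    """Return up to k unseen item indices with the highest predicted score."""
    scores = Q @ P[user]                            # one score per item
    scores = np.where(seen[user], -np.inf, scores)  # never re-recommend seen items
    n_unseen = int((~seen[user]).sum())
    return np.argsort(-scores)[:min(k, n_unseen)]

def top_similar(vectors, index, k=3):
    """Return the k rows most similar to vectors[index] by cosine similarity."""
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    sims = unit @ unit[index]
    sims[index] = -np.inf                           # exclude the query itself
    return np.argsort(-sims)[:k]
```

The same top_similar function works for users (pass P) and for items (pass Q).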
So with this system, we can predict users' behavior. We fit the system by minimizing a loss function:
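It is the difference (usually squared) between the predicted matrix and the real user-item matrix, plus two regularization terms, one for the user vectors and one for the item vectors. Normally the minimization is done with a gradient descent method such as stochastic gradient descent (SGD).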
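One common form of such a loss, assumed here because the original formula is not reproduced in this note, is the sum over all (u, i) of (r_ui - p_u . q_i)^2, plus reg * (|P|^2 + |Q|^2). A minimal sketch of fitting it with stochastic gradient descent follows; the function names, the hyperparameters, and the choice to treat every cell of r as observed are all assumptions for illustration.

```python
import numpy as np

def loss(r, P, Q, reg):
    """Squared error between r and P Q^T plus two L2 regularization terms."""
    return np.sum((r - P @ Q.T) ** 2) + reg * (np.sum(P ** 2) + np.sum(Q ** 2))

def fit_lfm(r, n_factors=2, lr=0.05, reg=0.01, epochs=200, seed=0):
    """Fit user factors P and item factors Q to r with plain SGD."""
    rng = np.random.default_rng(seed)
    n_users, n_items = r.shape
    P = rng.normal(scale=0.1, size=(n_users, n_factors))
    Q = rng.normal(scale=0.1, size=(n_items, n_factors))
    pairs = [(u, i) for u in range(n_users) for i in range(n_items)]
    for _ in range(epochs):
        for idx in rng.permutation(len(pairs)):    # visit cells in random order
            u, i = pairs[idx]
            err = r[u, i] - P[u] @ Q[i]
            p_u = P[u].copy()                      # keep old p_u for the q_i update
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_u - reg * Q[i])
    return P, Q
```

In practice, systems often train only on observed cells plus a sample of unobserved ones, which this sketch does not do.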
The problems of this system:
One of the biggest problems is that it is slow to react to changes in the system. Whenever a user takes a new action, the user-item matrix changes, and we may have to run the whole process again. The action can be as small as clicking an item that is not yet in the matrix.
Another problem is that we may not have much action information. The matrix is normally very sparse.
The idea of the vector-based model comes from word2vec. In the NLP domain, when we want to represent a word, Bag of Words (BoW) may be the first choice. But the corpus of any language contains millions or billions of words, and with BoW each word needs a vector as long as the vocabulary, which requires too much storage. We need to represent words with a method that requires less space. Moreover, a proper representation of the corpus can provide more information, such as the similarity between words, which improves the performance of a model.
So here comes Word to Vector (word2vec). The goal of this model is to find a shorter vector to represent each word. The mapping from words to vectors is stored as a dictionary, and we find this dictionary by training a matrix. Every vector in the dictionary can be seen as a coordinate, and similar words, such as Dad and Father, end up close to each other.
If we represent items using vectors in the same way as word2vec, we can recommend the items that are close to each other in the dictionary. So how can these vectors represent the similarity of two words? The similarity of words is defined by their context. If two words are used in the same circumstances, they are similar. We train the matrix using the context of every word. For items, the context can simply be defined as the other items that the user clicked before and after clicking this item. For other situations, the context may be different.
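The click-based context described above can be sketched as a sliding window over one user's ordered click session. The function name context_pairs and the window parameter are assumptions for illustration.

```python
def context_pairs(session, window=2):
    """Return (center, context) pairs from one user's ordered click session."""
    pairs = []
    for pos, center in enumerate(session):
        lo = max(0, pos - window)                  # clicks before this item
        hi = min(len(session), pos + window + 1)   # clicks after this item
        for j in range(lo, hi):
            if j != pos:
                pairs.append((center, session[j]))
    return pairs
```

For example, with window=1 the session ["a", "b", "c"] gives the pairs (a, b), (b, a), (b, c), (c, b).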
Normally there are two ways of doing this: CBOW and skip-gram.
They are mirror images of each other. CBOW uses the context items to predict the central item. Skip-gram uses the central item to predict the context items.
CBOW: the target w(t) is predicted from the sum of the context vectors, transposed and multiplied by the matrix (M).
Skip-gram: each context vector w(j) is predicted from the target vector, transposed and multiplied by the matrix (M).
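A rough sketch of the two forward passes, assuming the usual word2vec layout with an input embedding matrix W_in and an output matrix W_out (these names and the full softmax are assumptions; the note itself only refers to a single matrix M):

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    e = np.exp(z - z.max())
    return e / e.sum()

def cbow_probs(W_in, W_out, context_ids):
    """CBOW: sum the context vectors, then score every item as the target."""
    h = W_in[context_ids].sum(axis=0)
    return softmax(W_out @ h)

def skipgram_probs(W_in, W_out, center_id):
    """Skip-gram: take the center vector, then score every item as a context."""
    h = W_in[center_id]
    return softmax(W_out @ h)
```

Both functions return one probability per item; training adjusts W_in and W_out so that the true target (CBOW) or the true context items (skip-gram) receive high probability.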
During training we may use negative sampling: for each real (positive) target we also choose some negative targets, items that do not appear in the context, and feed them to the model. This makes the model more robust. We may also keep only the items that have been shown at least a certain number of times. This makes the training process quicker and improves the accuracy of the model.
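Both tricks can be sketched as below. This is a simplified illustration: negatives are drawn uniformly (word2vec itself uses a smoothed frequency distribution), and the function names and parameters are hypothetical.

```python
from collections import Counter
import numpy as np

def filter_rare(sessions, min_count=2):
    """Drop items that appear fewer than min_count times across all sessions."""
    counts = Counter(item for s in sessions for item in s)
    return [[item for item in s if counts[item] >= min_count] for s in sessions]

def sample_negatives(n_items, positive_ids, k, rng):
    """Draw up to k distinct item ids that are not among the positives."""
    positives = np.asarray(list(positive_ids), dtype=int)
    candidates = np.setdiff1d(np.arange(n_items), positives)
    return rng.choice(candidates, size=min(k, len(candidates)), replace=False)

def neg_sampling_loss(h, W_out, positive_id, negative_ids):
    """Push the positive score up and the sampled negative scores down."""
    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))
    pos = np.log(sigmoid(W_out[positive_id] @ h))
    neg = np.log(sigmoid(-(W_out[negative_ids] @ h))).sum()
    return -(pos + neg)
```

With this loss, each training step touches only the positive item and the few sampled negatives instead of every item, which is where the speed-up comes from.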
After training, we get a transformation matrix (M) and a vector for each item. Based on these vectors, we can use k-means or another clustering method to find similar items for recommendation, and t-SNE to visualize how the items group together.
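To make the clustering step concrete, here is a bare-bones k-means (Lloyd's algorithm) over item vectors in NumPy. It is only a sketch with an assumed function name and defaults; a library implementation would normally be used instead.

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Cluster the rows of X into k groups; returns (labels, centers)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct rows of X chosen at random.
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assign every vector to its nearest center.
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Move each non-empty center to the mean of its members.
        for c in range(k):
            members = X[labels == c]
            if len(members) > 0:
                centers[c] = members.mean(axis=0)
    # Final assignment against the updated centers.
    dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    return dists.argmin(axis=1), centers
```

Items that share a label can then be treated as candidates for recommending to each other.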