Statistical Learning Approaches to Information Filtering ~ Fakultät für Mathematik, Informatik und Statistik - Digitale Hochschulschriften der LMU

Enabling computer systems to understand human thinking and behavior
has long been an exciting challenge for computer scientists. In
recent years one such topic, information filtering, has emerged to
help users find desired information items (e.g. movies, books,
news) in large amounts of available data, and has become crucial
in many applications, such as product recommendation, image
retrieval, spam email filtering, news filtering, and web
navigation. An information filtering system must be able to
understand users' information needs. Existing approaches infer a
user's profile either by exploring his/her connections to other
users, i.e. collaborative filtering (CF), or by analyzing the
content descriptions of liked or disliked examples annotated by the
user, i.e. content-based filtering (CBF). These methods work well to
some extent, but face difficulties due to a lack of insight into
the problem. This thesis studies a wide range of information
filtering technologies in depth. Novel and principled machine
learning methods are proposed to model users' information needs.
The work demonstrates that the uncertainty of user profiles and the
connections between them can be effectively modelled using
probability theory and Bayes' rule. As one major contribution of
this thesis, the work clarifies the "structure" of information
filtering and gives rise to principled solutions. In summary, the
work of this thesis covers the following three aspects:

Collaborative filtering: We develop a probabilistic model for
memory-based collaborative filtering (PMCF), which has clear links
with classical memory-based CF. Various heuristics to improve
memory-based CF have been proposed in the literature. In contrast,
extensions based on PMCF can be made in a principled probabilistic
way. With PMCF, we describe a CF paradigm that interacts with
users, instead of passively receiving data from them as in
conventional CF, and actively chooses the most informative
patterns to learn, thereby greatly reducing user effort and
computational cost.

Content-based filtering: One major problem for
CBF is the deficiency and high dimensionality of
content-descriptive features. Information items (e.g. images or
articles) are typically described by high-dimensional features with
mixed types of attributes, which seem to have been developed
independently but are intrinsically related. We derive a generalized
principal component analysis to merge high-dimensional and
heterogeneous content features into a low-dimensional continuous
latent space. The derived features bring great convenience to CBF,
because most existing algorithms cope easily with low-dimensional,
continuous data and, more importantly, the extracted data highlight
the intrinsic semantics of the original content features.

Hybrid filtering:
How to combine CF and CBF in a "smart" way remains one of the
most challenging problems in information filtering. Little
principled work exists so far. This thesis reveals that people's
information needs can be naturally modelled with a hierarchical
Bayesian approach, where each individual's data are generated from
his/her own profile model, which is itself a sample from a
common distribution over the population of user profiles. Users are
thus connected to each other via this common distribution. Because
such a distribution is complex in real-world applications, the
commonly applied parametric models are too restrictive, and we
therefore introduce a nonparametric hierarchical Bayesian model
using a Dirichlet process. We derive effective and efficient
algorithms to learn the described model. In particular, the
resulting hybrid filtering methods are surprisingly simple and
intuitive, offering clear insights into previous work on pure
CF, pure CBF, and hybrid filtering.
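
The hierarchical view described above can be illustrated with a
minimal generative sketch. It is not the thesis's model or learning
algorithm: it assumes scalar user profiles, a Gaussian base
distribution, and Gaussian rating noise, and it draws profiles from
a Dirichlet process through its Chinese restaurant process
representation. All function and parameter names
(sample_crp_assignments, sample_user_profiles, sample_ratings,
alpha, base_std, noise_std) are hypothetical.

```python
import random


def sample_crp_assignments(num_users, alpha, rng):
    """Assign users to profile clusters via the Chinese restaurant process.

    User i joins an existing cluster with probability proportional to its
    size, or opens a new cluster with probability proportional to alpha.
    """
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    assignments = []  # cluster index for each user
    counts = []       # number of users in each cluster
    for _ in range(num_users):
        # Existing clusters are weighted by size, a new one by alpha.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(1)   # open a new cluster
        else:
            counts[k] += 1     # join an existing cluster
        assignments.append(k)
    return assignments, counts


def sample_user_profiles(num_users, alpha, rng, base_mean=0.0, base_std=1.0):
    """Draw one scalar profile per user from a Dirichlet process prior.

    Assumption: the base distribution is Gaussian; each cluster gets one
    draw from it, and every user in that cluster shares the value.
    """
    assignments, counts = sample_crp_assignments(num_users, alpha, rng)
    cluster_profiles = [rng.gauss(base_mean, base_std) for _ in counts]
    profiles = [cluster_profiles[k] for k in assignments]
    return profiles, assignments


def sample_ratings(profile, num_items, rng, noise_std=0.5):
    """Generate a user's data from his/her own profile (assumed Gaussian noise)."""
    return [rng.gauss(profile, noise_std) for _ in range(num_items)]


if __name__ == "__main__":
    rng = random.Random(0)
    profiles, assignments = sample_user_profiles(10, alpha=1.0, rng=rng)
    for user, (profile, cluster) in enumerate(zip(profiles, assignments)):
        ratings = sample_ratings(profile, 3, rng)
        print(user, cluster, round(profile, 3), [round(r, 2) for r in ratings])
```

Users who land in the same cluster share a profile, which is one way
to see how the common distribution connects users to each other. In
this sketch a smaller alpha tends to produce fewer clusters.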
