Scheme: Newton International Fellowships
Organisation: University of Cambridge
Dates: Feb 2012-Feb 2014
Summary: Approximately 1.5 billion people use the Internet. People write news articles, blogs, and reviews; people upload videos, audios, and photos. People become web content creators. This directly translates to the availability of half a Zettabyte of data. Synergistically with rapid progress in machine learning models and algorithms, as well as rapid rises in computing power and storage, the challenge of the 21st century consists of finding way to transform this complex massive yet noisy and sparse Internet data coming from a variety of sources into insights in support of knowledge creation. My research aims to address data to knowledge transformation in the context of machine learning.
Machine learning techniques have become prevalent for drawing inference and making prediction from massive scale data. Given input-output data pairs, the goal of learning is to infer a latent function that maps inputs to outputs. This function will then be used to predict an output for a given unseen input. Consider as an illustrative example, a task of categorising web videos (user-generated videos from video sharing websites). Here the inputs are the web videos and the outputs are the categories such as entertainment, music, news and politics, science and technology, among others.
The complex nature of Internet data manifests itself along both the input (feature) and output (label) dimensions. On the input dimensions, we deal not only with potentially millions of features but also the features might come from multiple modalities or data sources. Web videos admit the conventional representation of audio-visual features, the associated text (the filenames, titles, or descriptions) and even the intricate social network representation (the relationship among videos through the users, links, or recommendations). On the label dimensions, the information is sparse. Adding to the sparsity challenge, Internet data tend to have multiple sets of different labels.