Multimedia is data that combines different content forms such as images, video, text, audio, and interactive content. Multimedia collections are becoming a central information resource in a growing number of domains, which increases the need for fast and insightful multimedia analysis tools. Since today's multimedia collections are very large and ever-growing, these tools must also be applicable to large-scale data. For example, the data obtained from social media platforms is almost all multimedia, and the largest publicly available multimedia collection, YFCC100M, comprises 100 million images from Yahoo Flickr. However, there are many much larger multimedia collections that are not publicly available, such as Facebook's more than one hundred billion images.
But what is the best way to extract knowledge and insight from multimedia collections? The dominant approach revolves around search. Search is suitable only when the user has a clear information need and is able to formulate it as a precise query. Often, however, the analyst wants to explore the collection, looking for the question to ask, and structure or categorize the data herself. Thus, multimedia systems should support interactive, open-ended tasks where the objective is the analyst's knowledge gain. Below are a few example domains where this kind of interactive multimedia learning is important:
- Forensics: e.g. screening for child pornography
- Astronomy: e.g. to find new phenomena and relationships from the massive volumes of data originating from different directions of the orbit
- Marketing: e.g. to obtain insights into consumer behavior
- Medicine research: e.g. to analyze X-ray images to detect new patterns
- Art and culture: e.g. for cultural heritage analysis
Multimedia collections have the potential to significantly change the world, if we manage to extract the knowledge hidden in them.
A popular interactive method for analyzing multimedia collections is relevance feedback, an active learning technique in which the system attempts to learn what kind of images the user is looking for. The system starts by presenting example images to the user. The user can label images as close matches/interesting or as bad matches, or simply ignore them. Based on this feedback, the system attempts to learn a function to evaluate images (for example using SVM or k-NN), and uses that function to select the next set of example images to present. This continues until the user is satisfied, or until the system cannot find any more suitable images to present.
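The feedback loop above can be sketched in a few lines. This is a minimal toy version, not Blackthorn's actual implementation: it uses a simple k-NN-style scorer (mean distance to the nearest negatives minus mean distance to the nearest positives) as a stand-in for the SVM/k-NN models mentioned above, and all names and parameters here are illustrative.

```python
import numpy as np

def relevance_scores(features, pos_idx, neg_idx, k=3):
    """Toy k-NN relevance rule: images close to the labelled positives
    and far from the labelled negatives get high scores."""
    def knn_dist(idx):
        # distance from every image to each labelled example in idx
        d = np.linalg.norm(features[:, None, :] - features[None, idx, :], axis=2)
        k_eff = min(k, len(idx))
        return np.sort(d, axis=1)[:, :k_eff].mean(axis=1)
    return knn_dist(neg_idx) - knn_dist(pos_idx)  # higher = more relevant

def feedback_round(features, pos_idx, neg_idx, seen, n_suggest=5):
    """One round: score the whole collection, return the top unseen images."""
    scores = relevance_scores(features, pos_idx, neg_idx)
    scores[list(seen)] = -np.inf        # never re-suggest labelled/seen images
    return np.argsort(-scores)[:n_suggest]

# Toy demo: 2-D "features" forming two loose groups
rng = np.random.default_rng(0)
feats = np.vstack([rng.normal(0, 1, (50, 2)),    # images the user likes
                   rng.normal(5, 1, (50, 2))])   # images the user dislikes
pos, neg = [0, 1], [50, 51]                      # user-labelled examples
suggestions = feedback_round(feats, pos, neg, seen=set(pos + neg))
print(suggestions)  # indices of the next images to show
```

Note that `feedback_round` scores every image in the collection; this is exactly the per-round cost that becomes problematic at scale, as discussed next.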
Without further optimizations, however, this method is not suitable for large-scale data. For the system to be interactive, it has to learn very fast: each round needs to complete in around one second to satisfy the user. If the collection is huge, selecting the next set of relevant images can take very long, since the evaluation function must be applied to every image in the collection. Thus, there is a need for an effective, efficient multimedia analysis tool that works on large-scale data.
Blackthorn is an interactive learning system for large-scale multimedia that applies many optimizations to conventional relevance feedback, including effective data compression. It manages to learn user preferences over the YFCC100M collection in only 1.1 seconds. Even though that is excellent performance (around 77.5x faster than conventional relevance feedback), we would like to do even better: as mentioned before, there are much larger collections out there, and the interaction interface requires more rapid suggestions to satisfy users.
One possible way to speed up Blackthorn is to apply approximate high-dimensional clustering, which would allow dynamically selecting a small subset of likely images for Blackthorn to consider. The evaluation function would then not need to be applied to the whole collection (e.g. 100 million images) to select the next top images to present to the user; instead, it would only be applied to a small subset defined by the clustering method.
One such indexing method is the eCP algorithm, which corresponds roughly to the first step of k-means clustering. This algorithm has been shown to give good-quality results with very limited processing, and to scale very well to massive collections.
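To make the idea concrete, here is a rough sketch of this style of cluster-based pruning. It is not the actual eCP implementation; it only illustrates the principle under simple assumptions: representatives are picked at random and every image is assigned to its nearest representative (roughly the first step of k-means, with no refinement iterations), and at query time only the few clusters nearest to the query point are scored.

```python
import numpy as np

def build_index(features, n_clusters, rng):
    """Cluster-pruning sketch: pick random cluster representatives and
    assign every image to its nearest one (no k-means iterations)."""
    reps = features[rng.choice(len(features), n_clusters, replace=False)]
    assign = np.argmin(
        np.linalg.norm(features[:, None, :] - reps[None, :, :], axis=2), axis=1)
    clusters = {c: np.flatnonzero(assign == c) for c in range(n_clusters)}
    return reps, clusters

def candidate_subset(query, reps, clusters, n_probe=2):
    """Return only the images in the n_probe clusters nearest to the query;
    the evaluation function is applied to this subset instead of the
    whole collection."""
    order = np.argsort(np.linalg.norm(reps - query, axis=1))
    return np.concatenate([clusters[c] for c in order[:n_probe]])

rng = np.random.default_rng(42)
feats = rng.normal(size=(2_000, 16))             # stand-in for image features
reps, clusters = build_index(feats, n_clusters=50, rng=rng)
subset = candidate_subset(feats[0], reps, clusters)
print(len(subset), "of", len(feats), "images need scoring")
```

With a few probed clusters out of many, the evaluation function touches only a small fraction of the collection per round, which is the source of the hoped-for speedup.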
@thorhildurthorleiks and I are currently working on a research project with the goal of combining eCP and Blackthorn, in the hope of creating the world's fastest system for learning image preferences on web-scale data. Stay tuned for the results!
Gylfi Þór Guðmundsson, Laurent Amsaleg, Björn Þór Jónsson. Impact of Storage Technology on the Efficiency of Cluster-Based High-Dimensional Index Creation. Second International Workshop on Flash-based Database Systems (FlashDB), Busan, South Korea, April, 2012.
Jan Zahálka, Stevan Rudinac, Björn Þór Jónsson, Dennis C. Koelma, Marcel Worring. Interactive Multimodal Learning on 100 Million Images. ACM International Conference on Multimedia Retrieval (ICMR), New York, NY, USA, June, 2016.
Jan Zahálka, Stevan Rudinac, Björn Þór Jónsson, Dennis C. Koelma, Marcel Worring. Blackthorn: Large-Scale Interactive Multimodal Learning. Journal version of the ICMR 2016 paper above, accepted for publication in IEEE Transactions on Multimedia, 2017.
Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, Li-Jia Li. YFCC100M: The New Data in Multimedia Research. Communications of the ACM, 59(2):64-73, 2016.
Simon Tong, Edward Chang. Support Vector Machine Active Learning for Image Retrieval. Ninth ACM International Conference on Multimedia, New York, NY, USA, 107-118, 2001.
Daniel A. Keim, Florian Mansmann, Jörn Schneidewind, Jim Thomas, Hartmut Ziegler. Visual Analytics: Scope and Challenges. In: Simoff S.J., Böhlen M.H., Mazeika A. (eds) Visual Data Mining. Lecture Notes in Computer Science, vol 4404. Springer, Berlin, Heidelberg, 2008.