IR Summer School (June 20-25, 2011) in India (Bangalore) co-organized by Yahoo! and IISc, Bangalore
Monday, June 20, 2011 at 8:30 AM - Saturday, June 25, 2011 at 6:00 PM (IST)
IR Summer School
Information retrieval (IR) is an area of computer science that is concerned with the organization of, and access to, information items such as text documents, web pages, online catalogs, multimedia objects, and structured and semi-structured items. In terms of scope, IR has grown well beyond its early goals of indexing text and searching for relevant documents. Nowadays IR encompasses web, social and enterprise search, text classification and categorization, system architecture and platform, interface and visualization, user behavior and information seeking, ranking and learning models, evaluation and crowd-sourcing methodologies, and personalization and context, to name a few.
The objectives of this summer school are as follows:
- equip attendees with a variety of tools needed to conduct research in this area,
- cover the entire spectrum from theory to practice.
National and international leaders in the field will give introductory tutorials (first two days) as well as advanced seminars (following four days) and will assist in small-group discussions.
This school is suitable for all levels: both for people without previous knowledge of information retrieval and for those wishing to broaden their expertise in this area. Exchanges of students, joint publications and joint projects are expected to result from this collaboration.
For research students, the summer school provides a unique, high-quality, and intensive period of study. It is ideally suited for students currently pursuing, or intending to pursue, research in IR or related fields. Limited scholarships are available for students to cover travel, accommodation and boarding costs.
IT professionals who use IR will find that the summer school provides relevant knowledge and exposure to contemporary techniques. In addition, they will benefit from direct interaction with top-notch researchers and knowledge workers.
The school thus provides an ideal forum for networking and discussions. Academics will also benefit from interaction with IT professionals, which will lead to a deeper understanding of real-life problems.
Satish Dhawan Auditorium, CSIC
IISc Campus, Bangalore
This Summer School is co-organized by Yahoo! and CSA Department, IISc.
Monday, 20 June 2011
Morning: Intro to IR (Mounia Lalmas)
Afternoon: IR evaluation 1 (Mark Sanderson)
Tuesday, 21 June 2011
Morning: IR models (Mounia Lalmas)
Afternoon: Indexing (Ronny Lempel)
Wednesday, 22 June 2011
Morning: IR evaluation 2 (Mark Sanderson)
Afternoon: Searching (Ronny Lempel)
Thursday, 23 June 2011
Morning: Web retrieval (Ricardo Baeza-Yates)
Afternoon: Multimedia retrieval (Malcolm Slaney)
Friday, 24 June 2011
Morning: Web retrieval (Ricardo Baeza-Yates)
Afternoon: Multimedia retrieval (Malcolm Slaney)
Saturday, 25 June 2011
Morning: Learning to rank (Soumen Chakrabarti)
INTRODUCTION (Mounia Lalmas) (2 hours)
This lecture will introduce the information retrieval problem, introduce the terminology related to IR, and provide a history of IR. In particular, the history of the web and its impact on IR will be discussed. Special attention and emphasis will be given to the concept of relevance in IR and the critical role it has played in the development of the subject. The lecture will end with a conceptual explanation of the IR process, and its relationships with other domains as well as current research developments (e.g. searching Web 2.0).
IR MODELS (Mounia Lalmas) (4 hours)
This lecture will present the models that have been used to rank documents according to their estimated relevance to user-given queries, so that the most relevant documents are shown ahead of less relevant ones. These models form the basis of many of the ranking algorithms used in past and present search applications. The lecture will describe models of IR such as Boolean retrieval, vector space, probabilistic retrieval, language models, and logical models. Relevance feedback, a technique that either implicitly or explicitly modifies user queries in light of their interaction with retrieval results, will also be discussed, as this is particularly relevant to web search and personalization. The lecture will end with a taxonomy of IR models.
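As a rough illustration of one of the models named above, the vector space model, the following minimal sketch ranks documents by the cosine similarity between TF-IDF vectors. The whitespace tokenizer, the plain log(N/df) weighting, and the function names are assumptions made for this example and are not taken from the lecture material.

```python
import math
from collections import Counter

def tf_idf_vectors(docs):
    """Return one sparse TF-IDF vector (term -> weight) per document, plus the IDF table."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    vectors = [{t: tf * idf[t] for t, tf in Counter(tokens).items()}
               for tokens in tokenized]
    return vectors, idf

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank(query, docs):
    """Return document indices ordered by decreasing similarity to the query."""
    vectors, idf = tf_idf_vectors(docs)
    q_tf = Counter(query.lower().split())
    # Query terms unseen in the collection get weight 0.
    q_vec = {t: tf * idf.get(t, 0.0) for t, tf in q_tf.items()}
    scores = [(cosine(q_vec, d_vec), i) for i, d_vec in enumerate(vectors)]
    # Sort by score (descending); break ties by document index.
    return [i for _, i in sorted(scores, key=lambda p: (-p[0], p[1]))]
```

For example, with the documents "web search engines rank pages", "cooking pasta at home" and "search engines index the web", the query "pasta cooking" places the second document first.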
IR EVALUATION (Mark Sanderson) (6 hours)
Evaluating the effectiveness of information retrieval systems has its origins in work dating back to the early 1950s. Across the nearly 60 years since that work started, the use of test collections has been the de facto standard of evaluation; however, things are changing.
This session will first survey the history and traditional methods and measures devised for evaluation of retrieval systems, including a detailed look at the use of statistical significance testing in retrieval experimentation. This part of the session will focus on test collections. At its core, the modern-day test collection is little different from the structures that the pioneering researchers in the 1950s and 1960s conceived of. This tutorial shows that despite its age, this long-standing evaluation method is still a highly valued tool for retrieval research.
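To make the idea of a test-collection measure concrete, here is a minimal sketch of two widely used measures, precision at k and average precision, computed from a ranked list and a set of relevance judgments. The function names and the toy document identifiers are assumptions for this example; the session itself may use different measures or definitions.

```python
def precision_at_k(ranking, relevant, k):
    """Fraction of the top-k retrieved documents that are judged relevant."""
    top = ranking[:k]
    return sum(1 for doc in top if doc in relevant) / k

def average_precision(ranking, relevant):
    """Mean of the precision values at the rank of each relevant document.

    Relevant documents that are never retrieved contribute zero.
    """
    if not relevant:
        return 0.0
    hits = 0
    total = 0.0
    for position, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            total += hits / position
    return total / len(relevant)
```

With the ranking ["d3", "d1", "d7", "d2"] and relevant set {"d1", "d2"}, precision at 2 is 0.5 and average precision is (1/2 + 2/4) / 2 = 0.5.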
However, the second part of the session will also review more recent examinations of the validity of the test collection approach and of evaluation measures, and will outline the growth over the last 10 years of research that exploits other sources of information for evaluation. Here, the use of query logs, crowd-sourcing, and live labs in retrieval experimentation will all be described. The session will be taught using both lectures and demonstrations of experiments being conducted.
INDEXING / SEARCHING (Ronny Lempel) (6 hours)
This lecture will focus on the indexing process of text, which takes as input documents and outputs an inverted index. Inverted indices are the basic data structure that enables search engines to efficiently retrieve documents relevant to keyword queries. We will cover indexing on a single node as well as distributed indexing, and will touch on index compression techniques.
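A minimal single-node sketch of the data structure described above: each term maps to a postings list of (document id, term frequency) pairs sorted by document id. The whitespace tokenizer and the helper names are assumptions for illustration. The second helper shows gap encoding of document ids, a common first step before compressing postings, and is not necessarily the technique covered in the lecture.

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to a list of (doc_id, term_frequency) sorted by doc_id."""
    index = defaultdict(dict)
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            index[term][doc_id] = index[term].get(doc_id, 0) + 1
    return {term: sorted(postings.items()) for term, postings in index.items()}

def doc_id_gaps(postings):
    """Replace absolute doc ids by the differences between consecutive ids.

    Gaps are small for frequent terms, so they compress better than raw ids.
    """
    gaps = []
    previous = 0
    for doc_id, _tf in postings:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps
```

For the two documents "to be or not to be" and "to search is to find", the postings list for "to" is [(0, 2), (1, 2)] and the one for "search" is [(1, 1)].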
Next we will detail how, when faced with queries, search engines use the inverted index to retrieve and rank results. We will cover document-at-a-time and term-at-a-time retrieval schemes, discuss early termination strategies, and adapt retrieval strategies to distributed index formats.
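The term-at-a-time scheme mentioned above can be sketched as follows: postings lists are processed one query term at a time, partial scores are added into per-document accumulators, and the top k documents are selected at the end. The index layout (term -> list of (doc_id, term_frequency)) and the use of raw term frequency as the score contribution are simplifying assumptions for this example; real engines use more elaborate scoring and early termination.

```python
import heapq
from collections import defaultdict

def term_at_a_time(index, query_terms, k):
    """Return the k best (doc_id, score) pairs for the query.

    `index` maps each term to a list of (doc_id, term_frequency) postings.
    """
    accumulators = defaultdict(float)
    for term in query_terms:
        # Walk one full postings list before moving on to the next term.
        for doc_id, tf in index.get(term, []):
            accumulators[doc_id] += tf
    # Highest score first; ties are broken by the smaller doc_id.
    return heapq.nsmallest(k, accumulators.items(),
                           key=lambda item: (-item[1], item[0]))
```

In a document-at-a-time scheme, by contrast, all postings lists are traversed in parallel in document-id order, so each document's score is complete before the next document is considered.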
WEB RETRIEVAL (Ricardo Baeza-Yates) (6 hours)
This tutorial provides an introduction to the main concepts, issues, and techniques of web-based information retrieval. Topics covered include the differences between conventional and web IR, the evolution of web search technology, the overall system architecture, crawling and corpus construction, link analysis, and Web spam. In addition, we will look at current research trends including the use of query logs to improve search engines as well as distributed retrieval architectures.
MULTIMEDIA RETRIEVAL (Malcolm Slaney) (6 hours)
People consume much more multimedia than they do text. Video, images and audio are key components of most people's web experience. Yet how do we get information from these signals, and how do we help users find the media they care about? The semantic gap only makes this problem more challenging: a lot of inference is needed to learn that these orange pixels correspond to a cat. This session will talk about the unique signals and the feature vectors that are used to characterize multimedia signals. How can we pull information from audio waveforms and pixels? What problems can we solve? How can users find the information they desire in the exploding world of web-based multimedia content? We'll talk about features, useful strategies (using both metadata and content), and then applications. How can we deliver the right multimedia content to the user? That's our goal.
LEARNING TO RANK (Soumen Chakrabarti) (3 hours)
We will study machine learning and associated numerical techniques to automatically learn ranking functions for entities represented as feature vectors as well as nodes in a social network. In the feature-vector scenario, an entity, e.g., a document x, is mapped to a feature vector in a d-dimensional space, and we have to search for a weight vector. This case corresponds to Information Retrieval in the vector space model. Training data consists of a partial order of preference among entities. We will study recent Bayesian and maximum-margin approaches to solving this problem, including recent efficient linear-time approximate algorithms. We will consider various ranking performance measures and how some of them create complications for learning algorithms. In the graph node ranking scenario, we will briefly review PageRank, generalize it to arbitrary Markov conductance matrices, and consider the problem of learning conductance parameters from partial orders between nodes. In another class of formulations, the graph does not establish PageRank or prestige-flow relationships between nodes, but encourages a certain smoothness between the scores of neighboring nodes. Some of these techniques have been used by Web search companies with very large query logs. We will review some of the issues and experience with applying the theory to practical systems. If time permits, we will look at general connections between score stability and rank stability, and the connection between the stability of a score/rank-learning algorithm and its power to generalize to unforeseen test data.
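For readers unfamiliar with the graph node ranking scenario, the following is a minimal sketch of standard PageRank computed by power iteration, not of the learned-conductance generalization discussed in the lecture. The damping factor of 0.85, the fixed iteration count, the uniform handling of dangling nodes, and the function name are assumptions for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank scores by power iteration.

    `links` maps each node to the list of nodes it links to.
    """
    nodes = sorted(set(links) | {v for targets in links.values() for v in targets})
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iterations):
        # Every node receives the same teleportation share.
        new_rank = {u: (1.0 - damping) / n for u in nodes}
        for u in nodes:
            targets = links.get(u, [])
            if targets:
                share = damping * rank[u] / len(targets)
                for v in targets:
                    new_rank[v] += share
            else:
                # Dangling node: spread its score evenly over all nodes.
                share = damping * rank[u] / n
                for v in nodes:
                    new_rank[v] += share
        rank = new_rank
    return rank
```

On the small graph {"a": ["c"], "b": ["c"], "c": ["a"]}, node c receives links from both a and b and ends up with the highest score, followed by a and then b; the scores sum to 1.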
Mounia Lalmas (http://research.yahoo.com/Mounia_Lalmas) is a visiting principal researcher who joined Yahoo! Research in January 2011, where she works on models and measures of user engagement. Prior to this, she was a Microsoft Research/RAEng Research Professor at the University of Glasgow. She has a strong background in information retrieval and evaluation. She co-led the Evaluation Initiative for XML Retrieval (INEX), a large-scale international project responsible for defining the nature of XML retrieval and how it should be evaluated. While at Glasgow, she worked on applying quantum theory to model information retrieval. She also works on result presentation and evaluation for aggregated search, and technologies for bridging the digital divide.
Mark Sanderson (http://www.seg.rmit.edu.au/mark/) is a Professor and researcher in information retrieval (IR) (e.g. web search engines). He is particularly interested in the evaluation of search engines, but also works on geographic search, cross-language IR (CLIR), summarization, image retrieval by captions, and word sense ambiguity. He ran the Introduction to IR tutorial at ACM SIGIR 2000 and 2001. He is an associate editor of ACM TWeb (Transactions on the Web) and IP&M (Information Processing and Management), and is on the editorial board of FnTIR. He is also on the TREC advisory panel. With Paul Clough and Henning Muller, Mark started the successful ImageCLEF search evaluation track of CLEF, which has run for seven years and each year has tens of international research groups evaluating their image retrieval systems.
Ronny Lempel (http://research.yahoo.com/Ronny_Lempel) joined Yahoo! Research in October 2007 as the director of Yahoo! Israel Research Ltd., where he oversees R&D activities at the cutting edge of Web search. Prior to joining Yahoo! Research, Ronny spent 4.5 years at IBM's Haifa Research Lab with the Information Retrieval Group, where his duties included research and development in the area of enterprise search systems. Prior to joining IBM, Ronny received his BSc, MSc and PhD from the Faculty of Computer Science at Technion, Israel Institute of Technology in 1997, 1999 and 2003 respectively. Both his MSc and PhD focused on search engine technology. During his PhD studies, Ronny spent two summer internships at the AltaVista search engine.
Ronny has authored numerous papers on search engine technology in leading conferences and journals, and has received several awards from the prestigious WWW conference series. He also serves on the program committees of leading Information Retrieval and Search conferences such as WWW, SIGIR, CIKM and WSDM.
Ricardo Baeza-Yates (http://research.yahoo.com/Ricardo_Baeza-Yates) is VP of Research for Europe and Latin America, leading the Yahoo! Research labs at Barcelona, Spain and Santiago, Chile, and also supervising the lab in Haifa, Israel. Until 2005 he was the director of the Center for Web Research at the Department of Computer Science of the Engineering School of the University of Chile, and ICREA Professor and founder of the Web Research Group at the Dept. of Information and Communication Technologies of Univ. Pompeu Fabra in Barcelona, Spain. He maintains ties with both universities as a part-time professor for the Ph.D. program. His research interests include algorithms and data structures, information retrieval, web mining, text and multimedia databases, software and database visualization, and UI.
Malcolm Slaney (http://research.yahoo.com/Malcolm_Slaney) is a principal scientist at Yahoo! Research and a (consulting) Professor at Stanford CCRMA, where he has led the Hearing Seminar for the last 20 years. He is a Fellow of the IEEE and a (former) Associate Editor of IEEE Transactions on Audio, Speech and Signal Processing and IEEE Multimedia Magazine. He has given successful tutorials at ICASSP 1996 and 2009 on "Applications of Psychoacoustics to Signal Processing", on "Multimedia Information Retrieval" at SIGIR and ICASSP, and on "Web-Scale Multimedia Data" at ACM Multimedia 2010. He is a coauthor, with A. C. Kak, of the IEEE book "Principles of Computerized Tomographic Imaging." This book was recently republished by SIAM in their "Classics in Applied Mathematics" Series. He is coeditor, with Steven Greenberg, of the book "Computational Models of Auditory Function." Before Yahoo!, Dr. Slaney worked at Bell Laboratories, Schlumberger Palo Alto Research, Apple Computer, Interval Research and IBM's Almaden Research Center. For many years he has led the auditory group at the Telluride Neuromorphic (Cognition) Workshop. Dr. Slaney's recent work is on multimedia analysis and music- and image-retrieval algorithms in databases with billions of items.
Soumen Chakrabarti (http://www.cse.iitb.ac.in/~soumen/) received his B.Tech in Computer Science from the Indian Institute of Technology Kharagpur in 1991 and his M.S. and Ph.D. in Computer Science from the University of California, Berkeley in 1992 and 1996. At Berkeley he worked on compilers and runtime systems for running scalable parallel scientific software on message passing multiprocessors. He was a Research Staff Member at IBM Almaden Research Center from 1996 to 1999, where he worked on the CLEVER Web search project and led the Focused Crawling project. In 1999 he joined the Department of Computer Science and Engg. at the Indian Institute of Technology, Bombay, where he has been an Associate Professor since 2003. In Spring 2004 he was Visiting Associate Professor at Carnegie Mellon University.
He has published in the WWW, SIGIR, SIGKDD, SIGMOD, VLDB, ICDE, SODA, STOC, SPAA and other conferences as well as Scientific American, IEEE Computer, VLDB and other journals. He holds eight US patents on Web-related inventions. He has served as technical advisor to search-related companies and vice-chair or program committee member for WWW, SIGIR, SIGKDD, VLDB, ICDE, SODA and other conferences, and guest editor or editorial board member for DMKD and TKDE journals. He is also the author of a book on Web mining.
His current research interests include integrating, searching, and mining text and graph data models, exploiting types and relations in search, and Web graph and popularity analysis.