|
 |
Machine Learning for Information Retrieval
Machine Learning

|
Our Information Retrieval (IR) work was done in collaboration with the Juru team.
The Juru search engine retrieves documents from unstructured document collections, and is known to be highly accurate, as shown in the TREC competition.
Our work consists of several parts:
Client
Juru search engine.
Challenge
To develop a tool for estimating the quality of the results returned by a search engine.
Solution
The huge increase in the quantity of available information has made search engines an essential tool in extracting relevant data. The output of the search engine is usually a list of ranked documents that the search engine deems relevant to the user's query.
We developed two methods that give the user feedback about the confidence of the search engine in the returned results. This novel feature, which can be added to any search engine, makes a prediction of how difficult the query is for the given search engine.
Our methods work by first measuring both the contribution of each query term to the final document ranking and the frequency of the term. Then, this list is input to a novel machine learning algorithm, which then predicts the likelihood that the retrieved documents will answer the user's query.
Click to see full size
Benefits
This feature is useful not only as feedback for the user, but also as a means of tuning the search engine to the given query and as a way to merge the results of several search engines. For example, Query Expansion (QE) is known to be useful if the precision of the original query is high, but is detrimental if it is low. The predictor can be used to decide when to use QE, so that it will only be used when it will benefit the recall.
We have applied for a patent for the algorithms and methods developed for difficulty prediction.
Client
Juru search engine.
Challenge
It is well known that an expert user of a search engine can formulate queries much better than a novice, and thus the answers received will be superior to those received by a novice user. The challenge is to develop a tool for rephrasing a query so that it is similar to a query written by an expert.
Solution
Our group developed an algorithm for automatically parsing queries and rewriting them much as an expert user would have written them. This algorithm is based on natural language processing together with a machine learning component.
For example, consider the TREC query: "Nobel prize winners".
An expert would assign more weight to the fact that the interest is in the Nobel award, and thus a better way to write the query is "+Nobel prize winners". By parsing this query, the average precision improved by almost 20%.
Another example is the TREC query: "How many civilian non-combatants have been killed in the various civil wars in Africa?"
The parser rewrote this query as:
"How many civilian non -combatants have been killed in the various civil wars in +Africa?"
These changes were only slight but they improved the average precision by 140% and the precision at 10 by 200%!
Benefits
The parser improves the precision and the retrieval of the search engine significantly, as measured on standard benchmarks such as the TREC competition data.
Client
Juru search engine.
Challenge
Most current search engines rank documents using a model that assigns a high rank to documents where a query term appears frequently compared to its frequency in the index. This model is known to fail for about 30% of queries.
Solution
We proposed a new ranking model whereby each query term is submitted to the search engine, which returns a ranking of documents based on each term. Since the terms are different, each holds a unique "view" of the document space. Combining these "views" results in improved precision and retrieval.
In order to combine the results of the sub-query rankings (the rankings returned for each query term) a support-vector machine (SVM) is used. The SVM tries to find a non-linear mapping that maximizes the agreement between the sub-queries and the full query. Essentially, the support-vector machine finds a non-linear function that replaces the traditional scoring model.
Benefits
Our tests have shown that this new scoring model is superior to the traditional model, although it is computationally intensive.
| |
|
|