Information Retrieval & Search - Basic IR Models

2008-03-05

Our focus in the database world has primarily been on retrieving information from a structured source, through a structured query.

Relational DBs --> structured data, structured query

XML DBs --> semi-structured data, structured query

search --> unstructured data, unstructured query

Actually, search --> data with hidden structure, query with possibly hidden structure

Recent efforts on IR considers data with no explicit structure:

Indexing and and retrieval of information-containing objects (text documents, web-pages, image, audio, ..)
- Want only relevant objects
- Efficiently from a large set
- Lets focus mainly on text based documents in this post
Related areas
- DB Management
- Artificial intelligence
  - Knowledge representation and intelligent action
- NLP and linguistics
  - linguistic analysis of natural language text
- Machine learning
  - Systems that improve performance with experience

Traditional Preprocessing steps

Strip irrelevant markup, tags, punctuation
Tokenize
Stem tokens to root words
1. Porter stemmer: set or local rules to reduce word to its root.
2. Words with distinct meanings may reduce to the same token
3. Does not recognize all morphological derivations
4. (arm, army) --> arm, (police, policy) --> polic
Remove stop words: I, a, the, is..
Detect common phrases
Build Inverted index: keywords --> docs

Basic Retrieval Models

Set-theoretic
1. Boolean model
2. Extended Boolean
Algebraic
1. Vector Space
2. Extended Boolean
3. Generalized VS
4. Latent semantic, neural network, etc.
Probabilistic
1. Inference/Belief Network

Boolean Model

Document Representation: Set of terms

Queries: Exact matches on presence of terms using logical operators AND, NOT, OR,..

No partial matches or ranking

Reasonably efficient implementations - think set operations in relational DBs queries

Rigid: hard to control: # of docs retrieved, rankings, relevance feedback

Vector Space Model

Document Representation: Bag of terms (with frequency)

Queries: Set of desired terms with optional weights

t distinct terms remain after preprocessing step, forming a vector space:

Dimension = t = |vocabulary|

Weight assigned to terms in the document, query (i, j): w_i,j

d_j = (w_1j, w_2j, …, w_tj)

Graphically..

Term-document matrix:

Each real-valued weight corresponds to the degree of significance of that term in the document. 0=no significance I, we, the, ..) or that it does not exist.

Term frequency (tf): tf_ij = frequency of i in j divided by frequency of most common term in j

Document frequency (df): number of documents containing term

Inverse document frequency of i: log_2 ( total # documents / df_i) Log dampens the effect relative to tf

tf-idf weighing commonly used to determine term weight: w_ij = tf_ij idf_i = tf_ij log2 (N/ df_i)

A term common across all documents is given less importance, and rare terms occurring in a document are given higher weight

Query is also treated as a document vector. Now the problem is to find similar documents

Need a way to compute degree of similarity between two vectors: Dot product!

Problems: Favors long documents. Does not measure how many terms don't match.

Compute similarity based on Cosine of the angle between two vectors.

With V vocabulary and D documents, complexity is O( |V| * |D|) !!

Key points:

Query viewed as a document
Measure proximity to it (numerous methods with different optimizations have been proposed)
Measures score/ranking, not just logical matches

Overall problems with Vector Space:

Missing semantic, word-sense type information
Missing Syntactic information - phrase structure, word order, proximity
Does not support boolean queries as with the boolean model. Requiring presence of absence of terms
Scoring Phrases of words difficult
Wild-cards