Last week Greg and I spoke with Gerald Penn, who does Natural Language Processing research at the University of Toronto, and his graduate student Xiaodan Zhu about summarizing IRC logs. Gerald pointed me to several resources on the related research. I have started reading a book called “Automatic Summarization” by Inderjeet Mani, as well as several research papers on text summarization / extraction. I’m taking notes, because it’s very easy to get lost in the details. NLP combines many fields: Linguistics, Statistics, Math (mostly algebra / linear optimization), as well as Computer Science (data structures / algorithms).

I have to mention that I have never worked with NLP before, so all these ideas are very new to me. However, it’s not terribly difficult, because the material is sufficiently abstracted that anyone with some background in the fields mentioned above can follow it. The difficulty in my case arises when we talk about summarizing IRC logs. Due to the nature of these chats, it’s very much unlike summarizing ordinary text. Scientific papers are usually organized well enough that the Edmundsonian paradigm can be applied (i.e. the first and last sentences of a paragraph are probably more important than the sentences in the middle, so they should be assigned more weight based on location alone). The same is not true for IRC logs, so I always have to keep that in mind when I read about the different heuristics discussed in these papers.

When I see something new, I like to get a big picture first before I start looking at details. What follows below is a summary of what I’ve learned so far.

IRC summarization consists of the following parts:

  1. Message segmentation
  2. Clustering
  3. Adjacent response pairs identification
  4. Summary extraction

Message Segmentation

This is done using the TextTiling algorithm. It partitions the entire IRC log into multi-paragraph segments, which most likely have little overlap with each other in terms of content.
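To make the idea concrete for myself, here is a minimal sketch of segmentation by lexical similarity. It is a simplification, not the full TextTiling algorithm: it compares the words in a small window of messages on each side of every gap and cuts wherever the cosine similarity falls below a fixed threshold. The function names, the window size and the threshold are my own assumptions.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def segment(messages, window=2, threshold=0.1):
    """Split a list of messages into segments at low-similarity gaps.

    window and threshold are illustrative values, not tuned ones.
    """
    if not messages:
        return []
    bags = [Counter(m.lower().split()) for m in messages]
    segments, current = [], [messages[0]]
    for gap in range(1, len(messages)):
        # Words seen just before and just after this gap.
        left = sum(bags[max(0, gap - window):gap], Counter())
        right = sum(bags[gap:gap + window], Counter())
        if cosine(left, right) < threshold:
            segments.append(current)
            current = []
        current.append(messages[gap])
    segments.append(current)
    return segments
```

For example, calling segment on a short log that switches from talking about a broken build to talking about lunch should put the cut between the two topics, as long as the two halves share no vocabulary.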

Clustering

This is the most crucial part. Basically, using Ward’s agglomerative method, we can cluster segments based on their relevance to each other. Each cluster is represented by an L-dimensional vector of words. Each word is assigned a weight, which is (word frequency) / (document frequency). Word frequency is defined as “the number of times a word appears in the cluster” and document frequency is defined as “the number of segments the word appears in”. So, the higher the weight, the more valuable the word. For example, if a word appears a lot in one cluster and not in other clusters, then the topic discussed in that cluster was probably centered around this word.

The next step is to start merging clusters based on the principle of minimal variance. The in-cluster variance in this case is a function of the number of elements in the two clusters to be merged, as well as the Euclidean distance between the two clusters (based on the vectors mentioned earlier). So, if two clusters are very similar (i.e. they’re about the same topic), then the distance between their vectors is small compared to the other candidate pairs, and the marginal increase in variance is small as well. Merging stops when no remaining pair of clusters can be merged without increasing the in-cluster variance by more than a specified percentage threshold.
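Here is a small sketch of the two quantities described above, assuming each segment is already a list of words and each cluster is a list of segments taken from the same log. The merge cost uses the standard Ward formula, (n_a * n_b) / (n_a + n_b) times the squared Euclidean distance, and I am assuming the weight vector of a cluster plays the role of its centroid. The function names are mine, not from the papers.

```python
from collections import Counter

def word_weights(cluster, all_segments):
    """Weight of each word: (frequency in cluster) / (number of segments containing it).

    cluster is a list of segments drawn from all_segments;
    each segment is a list of words.
    """
    tf = Counter(w for seg in cluster for w in seg)
    df = Counter(w for seg in all_segments for w in set(seg))
    return {w: tf[w] / df[w] for w in tf}

def ward_cost(vec_a, n_a, vec_b, n_b):
    """Increase in in-cluster variance if clusters a and b were merged.

    vec_a and vec_b are word-weight dicts (treated as centroids, an assumption);
    n_a and n_b are the number of segments in each cluster.
    """
    words = set(vec_a) | set(vec_b)
    sq_dist = sum((vec_a.get(w, 0.0) - vec_b.get(w, 0.0)) ** 2 for w in words)
    return (n_a * n_b) / (n_a + n_b) * sq_dist
```

At each step the pair of clusters with the smallest ward_cost would be merged, and the loop would stop once even the cheapest merge exceeds the chosen threshold.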

Adjacent response pairs identification

This area is still very fuzzy to me, and I will be trying to understand it better. Basically, the idea is to identify the message that initiates a conversation and the responses to it. Since a conversation can have several threads going at once, the process of identifying these pairs is probabilistic.

Summary extraction

Now that we’ve grouped messages at the topic and sub-topic level, extracting summaries is pretty straightforward. It looks like the Edmundsonian paradigm, or something similar to it, would work.
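As a rough illustration of the location heuristic mentioned earlier (first and last sentences get more weight), here is a minimal sketch. It covers only the location part of the idea, the weights 2.0 and 1.0 are arbitrary values I picked, and whether this carries over to grouped IRC messages is something I still need to check.

```python
def location_scores(sentences, edge_weight=2.0, middle_weight=1.0):
    """Score sentences by position: first and last get edge_weight."""
    last = len(sentences) - 1
    return [edge_weight if i in (0, last) else middle_weight
            for i in range(len(sentences))]

def extract(sentences, k=2):
    """Return the k highest-scoring sentences in their original order."""
    scores = location_scores(sentences)
    # sorted() is stable, so ties keep their original order.
    ranked = sorted(range(len(sentences)), key=lambda i: -scores[i])
    return [sentences[i] for i in sorted(ranked[:k])]
```

In a real extractor the location score would be combined with other evidence, such as the word weights from the clustering step.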

Now that I have a general idea of the steps involved in summarizing IRC logs, I will start looking at the details of each step.
