
I’ve been looking at ways of organizing chat logs. A long-term project will accumulate thousands of messages, so we need some way of organizing the logs instead of keeping one long document. Here are some ideas, ordered from simplest to most complex:

Archives by day/month/year.
This is the simplest approach, and it is very similar to IRC/MSN logs.

Use a channel topic and tags to organize logs.
With this approach, the quality of organization depends on the users. The basic idea is to use the channel topic as the log name (and consequently, the wiki page name). For example, imagine students working on an assignment (or a software team working on a product). A channel operator sets the channel topic (e.g. Assignment 1, or DrProject Summer 2008), and all chats related to that topic are stored in the same document (Assignment1 or DrProjectSummer2008).

To allow a deeper level of organization, tags can be used to create sub-topics. For example, when discussing component A of Assignment 1, a user could tell the bot to attach a tag to all subsequent messages until the subtopic changes (e.g. DrProjectBot: tag ComponentA). This way, users can select conversations by tag (subtopic) within one potentially giant document. It also allows cross-topic selection, since a tag can link portions of conversations in different topics (e.g. if Component A is used in both Assignment 1 and Assignment 4).

An interface that displays logs in an organized, easy-to-use way would let users pick out only the messages they want to see by tag. I’m thinking it will be based on the existing wiki system in DrProject, but will be a separate component. Why a wiki? Because it’s the most popular tool for letting many users edit text on the web.
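To make the topic-plus-tags idea concrete, here is a minimal sketch in Python of how a bot might file messages. It is not DrProject code: the `ChannelLog` class, its method names, and the in-memory storage are all hypothetical, and only the bot name and the `tag` command come from the example above. Resetting the tag when the topic changes is also an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Optional

BOT_NICK = "DrProjectBot"  # bot name taken from the example above


@dataclass
class ChannelLog:
    """Hypothetical in-memory log: one page per topic, one tag per message."""
    topic: str = "Untitled"
    current_tag: Optional[str] = None
    # page name -> list of (tag, nick, text)
    pages: dict = field(default_factory=lambda: defaultdict(list))

    def page_name(self) -> str:
        # "Assignment 1" -> "Assignment1", as in the wiki page names above
        return "".join(self.topic.split())

    def set_topic(self, topic: str) -> None:
        self.topic = topic
        self.current_tag = None  # assumption: a new topic clears the subtopic

    def handle(self, nick: str, text: str) -> None:
        prefix = f"{BOT_NICK}: tag "
        if text.startswith(prefix):
            # A tag command changes the subtopic and is not logged itself.
            self.current_tag = text[len(prefix):].strip() or None
            return
        self.pages[self.page_name()].append((self.current_tag, nick, text))

    def select(self, tag: str) -> list:
        # Cross-topic selection: collect tagged messages from every page.
        return [
            (page, nick, text)
            for page, entries in self.pages.items()
            for entry_tag, nick, text in entries
            if entry_tag == tag
        ]


log = ChannelLog()
log.set_topic("Assignment 1")
log.handle("alice", "hi")
log.handle("bob", "DrProjectBot: tag ComponentA")
log.handle("alice", "A is broken")
log.set_topic("Assignment 4")
log.handle("bob", "DrProjectBot: tag ComponentA")
log.handle("bob", "reuse A")
component_a = log.select("ComponentA")
```

Here `component_a` gathers the Component A messages from both the Assignment1 and Assignment4 pages, which is the cross-topic selection described above.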

Natural language processing for topic detection and classification.
I’ve never done any machine learning or natural language processing, so I spent some time today reading papers. It seems to be a fairly experimental field that hasn’t been widely used in industry. IBM has something called WebFountain, which is described as follows:

“WebFountain is a platform for very large-scale text analytics applications. The platform allows uniform access to a wide variety of sources, scalable system-managed deployment of a variety of document-level “augmenters” and corpus-level “miners,” and finally creation of an extensible set of hosted Web services containing information that drives end-user applications. Analytical components can be authored remotely by partners using a collection of Web service APIs (application programming interfaces). The system is operational and supports live customers. This paper surveys the high-level decisions made in creating such a system.”

There are a few papers on doing something similar with IRC chat logs. From what I can tell, they break the log into overlapping documents (sliding-window style), extract the keywords, and project the documents onto the keywords. The result is a Gaussian distribution, and if enough documents contain the same keywords, those keywords form a topic. This is useful for topic extraction, and I guess it would be possible to select bits of conversations from these topics.
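As a rough illustration of the sliding-window idea, here is a much simplified Python sketch. It is not the method from those papers: it skips the projection and the Gaussian modelling entirely and only counts how many overlapping windows each keyword appears in. The stopword list, window size, step, and threshold are all made-up values.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "and", "you", "did", "yes", "for", "are", "was", "that"}


def windows(messages, size=4, step=2):
    """Yield overlapping slices of the log (the 'documents')."""
    last_start = max(len(messages) - size, 0)
    for start in range(0, last_start + 1, step):
        yield messages[start:start + size]


def keywords(window):
    """Lowercased words in a window, minus stopwords and very short words."""
    words = re.findall(r"[a-z']+", " ".join(window).lower())
    return {w for w in words if len(w) > 2 and w not in STOPWORDS}


def topic_keywords(messages, size=4, step=2, min_windows=2):
    """Keywords that occur in at least `min_windows` overlapping windows."""
    window_count = Counter()
    for window in windows(messages, size, step):
        window_count.update(keywords(window))
    return {word for word, n in window_count.items() if n >= min_windows}


chat = [
    "the parser crashes on empty input",
    "did you test the parser",
    "yes the parser test fails",
    "lunch anyone",
    "pizza sounds good",
    "ok pizza",
]
topics = topic_keywords(chat)
```

With these settings the log is cut into two overlapping windows. A word like "parser" shows up in both and is kept, while "pizza" only appears in the second window and is dropped. Even this toy version depends heavily on hand-picked parameters.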

It seems very interesting, but not feasible for DrProject because of its experimental nature. These algorithms have to be fine-tuned, either through machine learning or by humans. However, the CIA and Homeland Security seem very interested because, apparently, IRC is used by terrorists, and extracting topics from chat logs would make it easier to spot them.

So far, I think the most feasible solution is tagging. It seems to be the easiest way to organize chat logs, and DrProject already has a tag cloud and search by tags.

Any opinions are very welcome, especially improvements to my thoughts on organizing chat logs.



    • drprojectirc
    • Posted May 29, 2008 at 5:37 pm
    • Permalink

    Another cool idea was to use references to changesets, wikis, email, and tickets as tags. I think it’s better than making users tag conversations themselves, since those links provide a natural way of segmenting conversations…
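A small Python sketch of how such references could be pulled out of messages and used as tags. The link syntax here (`#12` for tickets, `r340` for changesets, `wiki:PageName` for wiki pages) is an assumed Trac-style convention, not something confirmed for DrProject, and the function names are hypothetical. Email references are left out because I don't know what they would look like in a chat message.

```python
import re

# Assumed Trac-style reference syntax; adjust to whatever the bot emits.
REFERENCE_PATTERNS = {
    "ticket": re.compile(r"(?<!\w)#(\d+)\b"),
    "changeset": re.compile(r"\br(\d+)\b"),
    "wiki": re.compile(r"\bwiki:(\w+)"),
}


def extract_tags(message):
    """Return tags such as 'ticket:12' for every reference in a message."""
    tags = set()
    for kind, pattern in REFERENCE_PATTERNS.items():
        for target in pattern.findall(message):
            tags.add(f"{kind}:{target}")
    return tags


def cluster_by_reference(messages):
    """Group messages under each reference they mention."""
    clusters = {}
    for message in messages:
        for tag in extract_tags(message):
            clusters.setdefault(tag, []).append(message)
    return clusters


chat = [
    "fixed #12 in r340, see wiki:ParserNotes",
    "does r340 also cover #15?",
    "no refs here",
]
clusters = cluster_by_reference(chat)
```

Messages that mention the same changeset or ticket end up in the same cluster without anyone tagging them by hand.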

    • drprojectirc
    • Posted May 29, 2008 at 5:47 pm
    • Permalink

    Yet another idea was to use keywords that don’t frequently appear in a dictionary as a way of segmenting logs. I have to do some research on how similar ideas were implemented elsewhere, but it seems feasible as well. I’m not sure how to handle conversations in languages other than English, though.

    • Liz
    • Posted May 30, 2008 at 2:12 am
    • Permalink

    I like comment 1. I don’t think users are going to want to bother with tagging themselves. However, if you’re going to have the bot spit out tickets or links to tickets, it can use that data, plus any other mentions of tickets by number, as one way to cluster data.

    I also think that using the topic is a neat idea for organizing data. The only thing is that it relies on two things which are not necessarily true of all people who would use this: 1) that they’re used to IRC enough that they’re going to want to be changing the topic and 2) that they’re going to be thinking “oh, I should change the topic so that it’s easier to find this in the logs later”, or be motivated of their own accord to maintain an up to date topic.
