I’ve been looking at ways of organizing chat logs. A long-term project will accumulate thousands of messages, so we need some way of organizing the logs instead of keeping one long document. Here are some ideas, ordered from simplest to most complex:
Archives by day/month/year.
This is the simplest approach, and it is very similar to how IRC/MSN logs are kept.
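To make the date-based option concrete, here is a minimal sketch of grouping messages into archives by day, month, or year. The message format (a tuple of timestamp, nick, and text) and the function names are my own assumptions for illustration, not anything that exists in DrProject.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical archive naming scheme: one archive per day, month, or year.
FORMATS = {"day": "%Y-%m-%d", "month": "%Y-%m", "year": "%Y"}


def archive_key(timestamp, granularity="day"):
    """Return the archive name a message with this timestamp belongs to."""
    return timestamp.strftime(FORMATS[granularity])


def archive_messages(messages, granularity="day"):
    """Group (timestamp, nick, text) tuples by archive name."""
    archives = defaultdict(list)
    for timestamp, nick, text in messages:
        archives[archive_key(timestamp, granularity)].append((timestamp, nick, text))
    return dict(archives)
```

Switching the granularity only changes the key, so the same log can be viewed by day, month, or year without being stored three times.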
Use a channel topic and tags to organize logs.
The quality of the organization will depend on the users, since they are the ones who set the topics and tags. The basic idea is to use the channel topic as the log name (and, consequently, the wiki page name). For example, imagine students working on an assignment (or a software team working on a product). The channel will need to have a topic (e.g. Assignment 1, or DrProject Summer 2008) set by a channel operator. All chats related to this topic will then be stored in the same document (Assignment1 or DrProjectSummer2008).
To allow a deeper level of organization, tags can be used to create subtopics. For example, when discussing component A of Assignment 1, a user could tell the bot to attach a tag to all subsequent messages until the subtopic is changed (e.g. DrProjectBot: tag ComponentA). This way, users can select conversations by tag (subtopic) within one potentially giant document. It also allows cross-topic selection, since tags can link portions of conversations in different topics (e.g. if Component A is used in both Assignment 1 and Assignment 4).
An interface that displays logs in an organized, easy-to-use way will let users see only the portions of the conversation they want by selecting tags. I’m thinking it will be based on the existing wiki system in DrProject, but will be a separate component. Why a wiki? Because it’s the most popular tool for many users to edit text together on the web.
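Here is a rough sketch of how the topic and tag bookkeeping could work. Everything in it is hypothetical: the TaggedLog class, its method names, and the decision to clear the current tag when the topic changes are my assumptions. Only the bot command format (DrProjectBot: tag ComponentA) and the page naming (Assignment 1 becomes Assignment1) come from the description above.

```python
BOT_NICK = "DrProjectBot"  # bot name taken from the example command above


class TaggedLog:
    """Hypothetical in-memory log; a real one would write to wiki pages."""

    def __init__(self):
        self.topic = None        # current wiki page name
        self.current_tag = None  # current subtopic, if any
        self.entries = []        # (topic, tag, nick, text) tuples

    def set_topic(self, topic):
        # "Assignment 1" -> "Assignment1", used as the wiki page name.
        self.topic = "".join(topic.split())
        # Assumption: a new topic starts with no subtopic selected.
        self.current_tag = None

    def handle(self, nick, text):
        prefix = BOT_NICK + ": tag "
        if text.startswith(prefix):
            # A tag command changes the subtopic and is not logged itself.
            self.current_tag = text[len(prefix):].strip() or None
            return
        self.entries.append((self.topic, self.current_tag, nick, text))

    def select(self, tag, topic=None):
        """Return entries with this tag, optionally limited to one topic."""
        return [
            entry for entry in self.entries
            if entry[1] == tag and (topic is None or entry[0] == topic)
        ]
```

Calling select("ComponentA") without a topic gives the cross-topic view, while passing topic="Assignment1" narrows it to a single document.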
Natural language processing for topic detection and classification.
I’ve never done any machine learning or natural language processing, so today I spent some time reading papers. It seems to be a fairly experimental field that hasn’t been widely used in industry. IBM has something called WebFountain, which is described as follows:
“WebFountain is a platform for very large-scale text analytics applications. The platform allows uniform access to a wide variety of sources, scalable system-managed deployment of a variety of document-level “augmenters” and corpus-level “miners,” and finally creation of an extensible set of hosted Web services containing information that drives end-user applications. Analytical components can be authored remotely by partners using a collection of Web service APIs (application programming interfaces). The system is operational and supports live customers. This paper surveys the high-level decisions made in creating such a system.”
There are a few papers on doing something similar with IRC chat logs. From what I can tell, they break the log into overlapping documents (sliding-window style), extract the keywords, and project the documents onto the keywords. The result is a Gaussian distribution, and if enough documents contain the same keywords, those keywords form a topic. This is useful for topic extraction, and I guess it would be possible to select bits of conversations based on these topics.
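To get a feel for the sliding-window idea, here is a deliberately naive sketch. It is not the method from those papers: it leaves out the projection and the Gaussian modelling entirely and simply counts how many overlapping windows each keyword appears in. The function names, the tiny stopword list, and the thresholds are all my own assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stopword list; a real one would be much longer.
STOPWORDS = {"the", "and", "for", "that", "this", "with", "you", "are", "was"}


def sliding_windows(messages, size=2, step=1):
    """Split a list of messages into overlapping windows ("documents")."""
    windows = []
    for start in range(0, len(messages), step):
        windows.append(messages[start:start + size])
        if start + size >= len(messages):
            break  # the last window already reaches the end of the log
    return windows


def keywords(text):
    """Lowercase words longer than two letters that are not stopwords."""
    words = re.findall(r"[a-z]+", text.lower())
    return {w for w in words if len(w) > 2 and w not in STOPWORDS}


def candidate_topics(messages, size=2, step=1, min_windows=3):
    """Keywords that occur in at least min_windows different windows."""
    window_counts = Counter()
    for window in sliding_windows(messages, size, step):
        seen = set()
        for message in window:
            seen |= keywords(message)
        window_counts.update(seen)  # count each keyword once per window
    return sorted(w for w, n in window_counts.items() if n >= min_windows)
```

Because the windows overlap, a single message can fall into several of them, so min_windows should be larger than that overlap; otherwise a word said once would already count as a topic. Even this toy version needs its thresholds tuned by hand.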
It seems very interesting, but not feasible for DrProject because of its experimental nature: these algorithms have to be fine-tuned, either through machine learning or by humans. However, the CIA and Homeland Security seem very interested because, apparently, IRC is used by terrorists, and extracting topics from chat logs would make it easier to spot them.
So far, I think the most feasible solution is tagging. It seems to be the easiest way to organize chat logs, and DrProject already has a tag cloud and search by tags.
Any opinions are very welcome, especially suggestions for improving these ideas on organizing chat logs.