Monthly Archives: May 2008

I’ve created a very basic screen mockup today. I’ve had a lot of feedback, which I’m going to integrate into the next version of the mockup. You can see it here: https://www.drproject.org/DrProject/wiki/DrProjectChatInterface

The next step is to design the database schema for storing messages, and to start adding a new component to DrProject. I haven’t settled on using the event log as a natural way of segmenting chat logs, but I think this is the easiest approach to take. The alternative would be some sort of NLP system, which seems like a project in itself. However, I’m open to any other ideas.

DrProject consists of components, or modules. For example, drproject/ticket contains a template and the model and controller code to handle ticket requests. When a request for a resource is received through the web, web/main.py searches for the right controller by looking at the URL pattern. The question I have is why do it like this? In ASP.NET, you simply tell the webpage what “code-behind” (or “code-beside”) classes will handle the request.
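
To make the URL-pattern lookup concrete, here is a minimal sketch of that style of dispatch. The route table, the patterns, and the function name `find_controller` are all made up for illustration; the real code in web/main.py may be organized quite differently.

```python
import re

# Hypothetical route table: (URL pattern, controller name).
# These patterns are illustrative assumptions, not DrProject's real ones.
ROUTES = [
    (re.compile(r"^/ticket/(?P<id>\d+)$"), "ticket"),
    (re.compile(r"^/wiki/(?P<page>\w+)$"), "wiki"),
]


def find_controller(path):
    """Return (controller_name, captured_args) for the first pattern
    that matches the request path, or None if nothing matches."""
    for pattern, name in ROUTES:
        match = pattern.match(path)
        if match:
            return name, match.groupdict()
    return None
```

One thing this style buys you is that a component only has to register its patterns in one place, rather than every page naming its handler.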

Anyway, creating a new component in DrProject is fairly simple. A component consists of the following files/directories:

/templates: This stores the HTML Kid templates. It’s pretty straightforward: you can inline Python code with HTML (which is very ugly, and one of the reasons why MVC for the web is very good), or you can do all the logic in the model/controller and use these templates as forms for getting input or as pages for displaying output.

/__init__.py: This is where imports go that will be used by other classes in the component.

/api.py: This is where “helper” classes and methods go, which will be used by other classes in the component.

/model.py: The guts of the component. All logic related to the backend is here.

/web_ui.py: The brain of the component. Controllers that handle user actions on the web page should be here.

Of course, in the real world, there is no perfect MVC. Some controllers seem to contain logic for modifying the backend, which I think should be in the model somewhere.

I’ve been looking at ways of organizing chat logs. In a long-term project, there will be thousands of messages, and we need some way of organizing these logs instead of having one long document. Here are some ideas, ordered by simplicity:

Archives by day/month/year.
The simplest way of doing this, which is very similar to IRC/MSN logs.

Use a channel topic and tags to organize logs.
The quality of organization will depend on users. Here is why. The basic idea is to use the channel topic as a log name (and consequently, wiki page name). For example, imagine students working on an assignment (or a software team working on a product). The channel will need to have a topic (e.g. Assignment 1, or DrProject Summer 2008) set by a channel operator. So, all chats related to this topic will be stored in the same document (Assignment1 or DrProjectSummer2008). To allow a deeper level of organization, tags can be used to create sub-topics. For example, when discussing component A of Assignment 1, a user could tell the bot to attach a tag to subsequent messages until the subtopic is changed (e.g. DrProjectBot: tag ComponentA). This way, users can select conversations by tags (subtopics) in one (potentially) giant document. This will also allow cross-topic selection, since the tags can link portions of conversations in different topics (e.g. if Component A is used in Assignment 1 and 4). An interface that displays logs in an organized and easy-to-use way will let users select only the portions of messages they want to see by using tags. I’m thinking it will be based on the existing wiki system in DrProject, but will be a separate component. Why wiki? Because it’s the most popular tool for editing text on the web by many users.

Natural language processing for topic detection and classification.
I’ve never done any machine learning or natural language computing stuff, and today I spent some time reading papers. It seems that it’s a fairly experimental field, and hasn’t been widely used in industry. IBM has something called WebFountain, which is described as follows:

“WebFountain is a platform for very large-scale text analytics applications. The platform allows uniform access to a wide variety of sources, scalable system-managed deployment of a variety of document-level “augmenters” and corpus-level “miners,” and finally creation of an extensible set of hosted Web services containing information that drives end-user applications. Analytical components can be authored remotely by partners using a collection of Web service APIs (application programming interfaces). The system is operational and supports live customers. This paper surveys the high-level decisions made in creating such a system.”

There are a few papers on doing something similar with IRC chat logs. From what I can tell, they break the log into overlapping documents (sliding window style), extract the keywords, and project the documents onto the keywords. The result is a Gaussian distribution, and if enough documents contain the same keywords, then these keywords form a topic. This is useful for topic extraction, and I guess it would be possible to select bits of conversations from these topics.

It seems very interesting, but not feasible for DrProject because of its experimental nature. These algorithms have to be fine-tuned either through machine learning or by humans. However, the CIA and Homeland Security seem very interested because, apparently, IRC is used by terrorists, and extracting topics from chat logs would make it easier to spot them.
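
To give a feel for the sliding-window step only, here is a rough sketch. It is a big simplification and not what the papers actually do: it just cuts the log into overlapping windows and keeps the words that show up in at least `min_windows` of them, with no projection or distribution fitting. All function names and parameters here are made up for illustration.

```python
from collections import Counter


def sliding_windows(messages, size, step):
    """Split messages into overlapping windows of `size` messages,
    advancing by `step`. Trailing messages that do not fill a whole
    step are left out in this simple version."""
    last_start = max(len(messages) - size, 0)
    return [messages[i:i + size] for i in range(0, last_start + 1, step)]


def keyword_window_counts(windows, stopwords=frozenset()):
    """For each word, count how many windows contain it at least once."""
    counts = Counter()
    for window in windows:
        words = {w.lower() for msg in window for w in msg.split()}
        counts.update(words - stopwords)
    return counts


def candidate_topics(messages, size=4, step=2, min_windows=2,
                     stopwords=frozenset()):
    """Return, sorted, the words that occur in at least `min_windows`
    windows. This is a crude stand-in for real topic extraction."""
    windows = sliding_windows(messages, size, step)
    counts = keyword_window_counts(windows, stopwords)
    return sorted(w for w, n in counts.items() if n >= min_windows)
```

Even this toy version shows the tuning problem: the window size, the step, the threshold, and the stopword list all have to be picked by someone.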

So far, I think the most feasible solution is to use tagging. It seems the easiest way to organize chat logs, and DrProject already has a tag cloud and search by tags.
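
Here is a minimal sketch of how the "tag subsequent messages until the subtopic changes" idea could work. The class name `TaggedLog`, its methods, and the exact command format are assumptions for illustration; nothing like this exists in DrProject or Supybot yet.

```python
from collections import defaultdict


class TaggedLog:
    """Attach the current tag to every message until the tag changes.

    The command format "<bot nick>: tag <name>" is an assumption based
    on the example above, not an existing bot command.
    """

    def __init__(self, bot_nick="DrProjectBot"):
        self.bot_nick = bot_nick
        self.current_tag = None           # None means "untagged"
        self.by_tag = defaultdict(list)   # tag -> [(nick, text), ...]

    def handle(self, nick, text):
        """Process one channel message: either switch tag or record it."""
        prefix = self.bot_nick + ": tag "
        if text.startswith(prefix):
            self.current_tag = text[len(prefix):].strip() or None
            return
        self.by_tag[self.current_tag].append((nick, text))

    def select(self, tag):
        """Return all messages recorded under the given tag."""
        return list(self.by_tag.get(tag, []))
```

For example, after `log.handle("bob", "DrProjectBot: tag ComponentA")`, every later message is filed under ComponentA until someone issues another tag command.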

Any opinions are very welcome, especially improvements to my thoughts on organizing chat logs.

Currently I’m working on integrating IRC logs with DrProject. There are two ways I can think of doing this, and both use wikis:

1. Save logs to the existing wiki system in DrProject. This seems the simplest solution, but there are some concerns with it. The biggest concern is that official wikis are meant to be formal documents, while logs are informal. The second concern is that logs will show up in the wiki index. So it seems that the better approach is to create a separate component, described in option 2 below.

2. Create a separate component for logs based on the current wiki system in DrProject. This will entail creating separate tables to store these wikis apart from the regular wikis on DrProject. It will be very similar to the existing wiki system in DrProject, but it gives more flexibility in what these wiki pages can parse and show, because we could tailor it specifically for logs (syntax highlighting, etc.).

In the first case, the bot could use DrProject RPC to edit wiki logs. In the second case, we could do it directly by inserting records into a table with messages, or use RPC as well.

Currently, I think the second option is the best.
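
As a rough idea of what "inserting records into a table with messages" might look like, here is a minimal sketch using SQLite. The table name `chat_message`, its columns, and the helper functions are all assumptions for illustration; the real schema still has to be designed, and DrProject's own database layer would be used instead of raw `sqlite3`.

```python
import sqlite3

# Hypothetical schema: one row per chat message, kept apart from the
# regular wiki tables.
SCHEMA = """
CREATE TABLE IF NOT EXISTS chat_message (
    id      INTEGER PRIMARY KEY,
    channel TEXT NOT NULL,
    nick    TEXT NOT NULL,
    sent_at TEXT NOT NULL,   -- ISO 8601 timestamp, e.g. 2008-05-20T10:00:00
    body    TEXT NOT NULL
);
"""


def open_store(path=":memory:"):
    """Open the database and create the table if it is missing."""
    conn = sqlite3.connect(path)
    conn.executescript(SCHEMA)
    return conn


def add_message(conn, channel, nick, sent_at, body):
    """Insert one message; the `with` block commits on success."""
    with conn:
        conn.execute(
            "INSERT INTO chat_message (channel, nick, sent_at, body) "
            "VALUES (?, ?, ?, ?)",
            (channel, nick, sent_at, body),
        )


def messages_for_day(conn, channel, day):
    """Return (nick, body) pairs for one channel on one day (YYYY-MM-DD)."""
    rows = conn.execute(
        "SELECT nick, body FROM chat_message "
        "WHERE channel = ? AND sent_at LIKE ? ORDER BY sent_at, id",
        (channel, day + "%"),
    )
    return rows.fetchall()
```

Storing one row per message, rather than one blob per page, is what would make archives by day and selection by tag cheap later on.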

Supybot comes with the ChannelLogger plugin. Basically, it implements event handlers for things like users joining the channel, users changing nicknames, messages, etc. It saves them in a text file #<channel_name>.log under <irc_network_name>/<channel_name>. Since I finished writing the RPC for editing wikis, I have now started writing another plugin for Supybot to save logs as wikis on DrProject. Here is how I’m planning on doing this: have a plugin run as a sort of background job, and periodically upload the logs created by ChannelLogger to the DrProject wiki. I’ll post more details tomorrow, but that’s the basic idea.
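
Here is a minimal sketch of the "periodically upload what is new" part. It deliberately avoids the Supybot plugin API: the `upload` argument is a placeholder callable standing in for whatever RPC call ends up editing the wiki, and the function names are made up for illustration.

```python
import os


def read_new_lines(log_path, offset):
    """Return (lines, new_offset) for text appended to the log file
    since byte position `offset`. Only complete lines are consumed, so
    a line that is still being written is picked up on the next run."""
    if not os.path.exists(log_path):
        return [], offset
    with open(log_path, "rb") as f:
        f.seek(offset)
        data = f.read()
    end = data.rfind(b"\n") + 1   # 0 if no complete line is available
    lines = data[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end


def upload_pending(log_path, offset, upload):
    """Upload any new lines and return the offset to use next time.
    If `upload` raises, the old offset is kept by the caller, so the
    same lines are retried on the next run."""
    lines, new_offset = read_new_lines(log_path, offset)
    if lines:
        upload(lines)
    return new_offset
```

The background job would then just call `upload_pending` on a timer and remember the returned offset between runs.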

This week I finished writing the RPC code to handle ticket views and added a command to the DrProject plugin for Supybot to handle ticket view requests. The ticketing system in DrProject will soon be revamped, so I expect the code to break. However, in my case, like Jeff said, it’s better to have something working now that can be fixed later, rather than preparing for the future. I don’t expect that it will be a big effort to fix the resulting errors.

See here.

Currently, I’m working on implementing a similar feature for change sets. I think a simple description of a change set, along with a link, should be sufficient. I’m also working on the design for integrating IRC and the DrProject wiki. We’ll want to save chat logs as wiki pages, and there are many things to consider. How should these wiki pages be organized? Use the channel topic as the name for the wiki page? Or just time stamp it? I also want to create a doc explaining how the bot talks to DrProject and how to set it up, configure it, run it, etc. In short, useful documentation for everyone.

Now seems like a good time to write a plan for the next few weeks on what I’m planning on doing with my project.

DrProject has RPC functions for things like getting a project’s roadmap, milestones, etc. So, wouldn’t it be cool if you could get information on your project straight from the IRC chat? For this, I’m planning on writing a plugin for Supybot. While searching around, I came across TracBot, which is a plugin written for Supybot. Users in the channel can ask the bot to give them a link to a ticket, change set, or wiki page, or to do a search. However, this is not good enough. It would be really cool if you could see all of this information in the IRC channel. Imagine a scenario where students are having a conversation, and somebody mentions a ticket or a wiki page. Instead of posting a link, they could just ask the bot to spit this information into the channel, so everyone can see it at the same time. Since every project will have its own channel, this is not spamming. The people in the channel are all working on the same thing, and are very likely discussing the same thing.

So, back to implementation. It seems that some things, such as tickets, don’t have RPC implemented. What I’ll be doing next is finishing these features in DrProject. Then, I’ll develop a plugin for Supybot, so it will relay user requests in the IRC channel to DrProject via RPC.
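
A minimal sketch of the relay idea, independent of both Supybot and the actual DrProject RPC interface: the `rpc_calls` mapping is a stand-in for real RPC functions, and the command words and reply strings are made up for illustration.

```python
def make_relay(rpc_calls):
    """Build a function that turns a channel request such as
    "ticket 12" into a reply string.

    `rpc_calls` maps a command word to a callable that takes the
    argument text and returns the text to post in the channel. In a
    real plugin these callables would wrap DrProject RPC calls."""
    def relay(message):
        parts = message.split(None, 1)
        if len(parts) != 2:
            return "usage: <command> <argument>"
        command, argument = parts
        handler = rpc_calls.get(command.lower())
        if handler is None:
            return "unknown command: " + command
        return handler(argument.strip())
    return relay
```

Keeping the RPC calls behind plain callables like this would also make the plugin easy to test without a running DrProject instance.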

We’re also hoping to get an IRC daemon installed on our server, so everyone working on projects this summer, including profs, will be able to start using this as soon as possible. It’s also a great way of collecting data, since the next step would be to “wikify” conversations on DrProject, so that users can see the archive. Feedback will also be invaluable.

I’m hoping to talk to my supervisor about the actual architecture when he gets back next week. What I really want to discuss is security. Right now, I’m guessing each project will have its own channel, and each channel will have its own bot, which has an account on DrProject with view rights. There are tons of ways in which this could be compromised. The IRC daemon alone could have a lot of security holes. What happens if somebody breaks into it and can view all channels? Then they can get information on projects from bots. Should users be authenticated with the bot before they can issue commands? Would that become too tedious to use (particularly for student projects)?

So the hard drive on my machine blew up and I had to change to an adjacent workstation, which runs Windows XP. Thanks to Nick Jamil and his guide, the transition was relatively smooth. The daemon and bot were installed OK too, so something good came from this: I inadvertently confirmed that both can run on Windows. There was one small problem with Eclipse though. Not sure if it’s because of the new SP3, but Eclipse randomly closed with EXCEPTION_ACCESS_VIOLATION whenever I tried to check something out. I tried the workaround proposed here and that solved the problem. If you experience this with Subclipse, just change the SVN interface from JavaHL to SVNKit.

I spent most of yesterday looking around for a cross-platform, community-supported IRC daemon to use for this project. After some research, I decided to go with InspIRCd. It seems to have decent documentation, and should fit our needs for an IRC daemon.

As for the bot, we decided to go with Supybot. It’s written in Python and seems to be easy to extend by writing custom plugins (in Python of course). It already comes with several useful plugins, in particular the RSS feed and chat logger.