This week I had a demo at DemoCamp and needed a snapshot of DrProject’s database. I had to borrow somebody’s laptop for the demo and didn’t want to install PostgreSQL on it, so I was faced with the task of converting a PostgreSQL database dump into a SQLite database. After some time surfing the web, I came to the conclusion that nobody has documented or blogged about this (and I guess that makes sense: if anything, people should be moving from SQLite to PostgreSQL).

Getting a dump from PostgreSQL is simple:

pg_dump -a -d db_name > dump.sql

This creates a text file of SQL statements: table definitions and INSERTs for the data in the tables. This is what I had to begin with. The next step is to sanitize the data. SQLite has no predefined boolean data type; it uses the integers 0 and 1 for false and true, respectively. Therefore, we should replace ‘false’ with 0 and ‘true’ with 1, for example with the ‘awk’ command on Linux. The next step is to connect to the SQLite database:

sqlite3 db_name.db

You will get a prompt like this:
sqlite>

Now, use the .read command to load the dump:
sqlite> .read dump.sql
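
For anyone who would rather script the sanitize-and-load steps, here is a rough Python sketch instead of awk plus the sqlite3 shell. It is not what I actually ran. It assumes one SQL statement per line, that the dump contains only statements SQLite accepts, and that booleans appear as bare true/false keywords on INSERT lines. The function names sanitize_booleans and load_dump are made up for this example.

```python
import re
import sqlite3

# Matches a single-quoted SQL string (with '' escapes) or a bare true/false keyword.
_TOKEN = re.compile(r"'(?:[^']|'')*'|\b(true|false)\b", re.IGNORECASE)


def sanitize_booleans(line):
    """Replace bare true/false with 1/0 on INSERT lines, leaving quoted strings alone."""
    if not line.lstrip().upper().startswith("INSERT"):
        return line

    def swap(match):
        word = match.group(1)
        if word is None:  # a quoted string: keep it unchanged
            return match.group(0)
        return "1" if word.lower() == "true" else "0"

    return _TOKEN.sub(swap, line)


def load_dump(dump_text, db_path):
    """Sanitize every line of the dump, then execute it against the SQLite file."""
    cleaned = "\n".join(sanitize_booleans(line) for line in dump_text.splitlines())
    conn = sqlite3.connect(db_path)
    try:
        conn.executescript(cleaned)
        conn.commit()
    finally:
        conn.close()
```

Matching quoted strings first is what keeps a value like 'true story' from being rewritten; a plain search-and-replace would corrupt it.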

That’s it!

So, I just spent about 3 hours at work and at home debugging a small problem. I added functionality for a service bot to identify with the DrProject bot. The service bot sends an IDENTIFY command to the DrProject bot and waits for the reply “The operation succeeded”. Very simple concept, right?

INFO 2008-07-17T21:06:26 Received message from DrProjectBot: The operation succeeded.
error: “error:” is not a valid command.
error: “error:” is not a valid command.
error: “error:” is not a valid command.
error: “error:” is not a valid command.
INFO 2008-07-17T21:06:29 Ignoring *!DrProjectBot@127.0.0.1 for 600 seconds due to
an apparent invalid command flood.
WARNING 2008-07-17T21:06:29 Apparent error loop with another Supybot
observed.  Consider ignoring this bot permanently.

Basically, I didn’t RTFM. Every message a bot receives is interpreted as a command. So, when the service bot received “The operation succeeded.”, it got very angry at DrProjectBot, because “The operation succeeded.” is not a valid command. It proceeded to yell at DrProjectBot with the message ‘error: “The” is not a valid command.’ Naturally, DrProjectBot also became angry, because that error message is not a valid command either, and the cycle started.

Thanks to the good people on the #supybot channel, I found a way for the bot to ignore non-command messages. Setting the supybot.reply.whenNotCommand variable to False does the trick.
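
To make the loop easier to see, here is a toy simulation in Python. It is only a model of the behaviour described above, not Supybot code: ToyBot, exchange, and the reply_when_not_command flag are hypothetical names standing in for the two bots and for the supybot.reply.whenNotCommand setting.

```python
class ToyBot:
    """Hypothetical stand-in for a bot that treats every message as a command."""

    def __init__(self, name, commands, reply_when_not_command=True):
        self.name = name
        self.commands = commands  # command word -> canned reply
        self.reply_when_not_command = reply_when_not_command

    def receive(self, message):
        """Return the reply to send back, or None to stay silent."""
        words = message.split()
        first_word = words[0] if words else ""
        if first_word in self.commands:
            return self.commands[first_word]
        if self.reply_when_not_command:
            return 'error: "%s" is not a valid command.' % first_word
        return None


def exchange(sender, receiver, message, max_rounds=10):
    """Bounce replies between two bots; return how many messages were sent."""
    sent = 0
    while message is not None and sent < max_rounds:
        sent += 1
        message = receiver.receive(message)
        sender, receiver = receiver, sender  # the replier becomes the next sender
    return sent
```

With both bots replying to non-commands, the exchange only stops at max_rounds. Once the service bot is created with reply_when_not_command=False, it stays silent after the success message and the exchange ends after two messages.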

So, after some deliberation, I came up with two solutions for making a DrProject admin’s life easier, and both have their pros and cons. I would like to get some opinions on which is better, por favor.

DrProject signal capturing method

This seems like the best way to automate the process of registering channels / users in IRC. DrProject provides signals whenever a new project is created or a user is assigned to a project. The idea is to create a handler for these signals. The next step is a little hazy. As I mentioned before, DrProject and Supybot are two separate processes. So, the idea is to have a separate bot for services. Whenever a new project is created, DrProject will modify a config file for this bot and start it up to do all the mundane IRC services tasks (creating a channel, adding a user to a channel, etc.), and the bot will automatically quit the network and die an honorable death after it’s done. The biggest problem with this approach is how DrProject would know whether the bot has completed its tasks successfully. Does it have to wait for the bot to finish its work and then parse the log file to see if everything was done?

Supybot plugin to simplify services management

This is the method I personally prefer as a developer, because it’s clean. It follows the Supybot plugin architecture (a plugin is just an extension that provides commands which Supybot executes). However, it puts more burden on the administrator, because not everything is automated. The idea here is to provide the admin with commands that automate the process of registering channels, adding users, etc. It could be as simple as !doMagic, which would create and protect a channel for each project, create and register every user in every project, and assign each user to the corresponding channel’s access list. Or it could be a set of smaller commands: !assignUserToChannel <user> <channel>, !createChannel <channel>, !registerUser <user> <password>, etc. Another reason I like this approach is that it gives the administrator finer control.
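
To give a feel for the command-based approach, here is a toy dispatcher in Python. It does not use the Supybot plugin API and does not talk to IRC services; it only records state in memory. The class ServicesCommands and its dispatch method are hypothetical, and the command names are the ones suggested above.

```python
class ServicesCommands:
    """Toy command handlers that record state instead of talking to IRC services."""

    COMMANDS = ("createChannel", "registerUser", "assignUserToChannel")

    def __init__(self):
        self.channels = {}  # channel name -> set of users on its access list
        self.users = {}     # user name -> password

    def createChannel(self, channel):
        self.channels.setdefault(channel, set())
        return "created %s" % channel

    def registerUser(self, user, password):
        self.users[user] = password
        return "registered %s" % user

    def assignUserToChannel(self, user, channel):
        if user not in self.users or channel not in self.channels:
            return "error: unknown user or channel"
        self.channels[channel].add(user)
        return "added %s to %s" % (user, channel)

    def dispatch(self, line):
        """Parse '!command arg1 arg2' and call the matching handler."""
        if not line.startswith("!"):
            return None  # not addressed to the bot as a command
        parts = line[1:].split()
        if not parts or parts[0] not in self.COMMANDS:
            return "error: unknown command"
        try:
            return getattr(self, parts[0])(*parts[1:])
        except TypeError:
            return "error: wrong number of arguments"
```

A !doMagic command would then just be a loop over projects and users that calls these smaller handlers in order.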

Even though I agree with Greg that there should be as much automation as possible (otherwise, users might not adopt a feature), there must be a balance between making someone’s life easier and making the software too complicated. In this case, the first approach is complicated because there are many weak points where something could go wrong. Also, as I mentioned before, the second solution gives more power to the admin. It’s possible to use both methods as well. I would like to hear some ideas, especially from my supervisor :)

I have now started looking in more detail at user/channel management in IRC and how to integrate it with DrProject. What we want is to automate the things that IRC operators usually do by hand (registering a channel, setting access levels, etc.). For example, when a project is created, it would be nice if some entity (a bot?) could create, register and restrict the corresponding channel. When a new user is added to a project, the same entity should register the user on the IRC network and add the user to the access list for the channel that corresponds to that project.

I should mention that for IRC services we are using Anope, because it supports Linux and Windows. I will not mention the pain of configuring InspIRCd and Anope to talk to each other (apparently, localhost is not recognized as a loopback address, even though it’s the default value in config…).

Registering and restricting a channel manually in IRC can be done in several ways.

Using an access list (short, preferred way)

First, we need to restrict access to the channel:

/msg ChanServ set #channel restricted on

Now, we need to add a user to the access list:

/msg ChanServ vop #channel add user

The first command restricts access to the channel, so only users on the access list can join it. The second adds a user to the access list with voice privileges. Anybody not on the access list who tries to join the channel will be automatically kicked and banned from it.
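
Since the same two commands have to be repeated for every project and every user, a bot could generate them. Below is a small Python sketch that only builds the command strings shown above; the function name chanserv_commands is made up, and actually sending the messages to ChanServ is left out.

```python
def chanserv_commands(channel, users):
    """Build the ChanServ messages that restrict a channel and voice its members."""
    commands = ["/msg ChanServ set %s restricted on" % channel]
    for user in users:
        commands.append("/msg ChanServ vop %s add %s" % (channel, user))
    return commands


if __name__ == "__main__":
    for command in chanserv_commands("#alpha", ["ann", "bob"]):
        print(command)
```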

I will not list the commands for the other methods, because they are more cumbersome. One of them uses the invite-only feature: the channel operator sets the channel to invite-only (+i) and then invites users to the channel. Later, when these users wish to rejoin the channel, they ask ChanServ to invite them. There are two problems with this. First, it’s not natural to ask ChanServ to invite you to a channel; most of the time, people simply /join channels. Second, if the user gets disconnected due to network problems, they will not be reconnected automatically. The last method is similar to what I’ve decided to use, but it’s more involved and produces essentially the same results. The idea is to give users who should have access to the channel an access level greater than 0, and set the NOJOIN level to 0. Thus, users not on the access list will not be able to join the channel, because their access levels are 0 or less.

So, the IRC services part is pretty clear. Now, how do we integrate this into DrProject? I think using a bot for this is ideal, but I’m not sure how to make the bot and DrProject speak to each other. Supybot is a process that runs separately from DrProject. If we use a push method, we need to somehow send messages from DrProject to Supybot, presumably via the IRC protocol. If we use a pull method, it’s not going to be real time: Supybot could periodically poll the projects and users tables and add these entities to IRC. Or, we could write a plugin and simplify some of these steps for a user. For example, allow a user to load a spreadsheet with a project-user mapping and create/restrict the respective channels. Ideas?
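
For the pull method, the core of each polling round would be a comparison between what the database says and what already exists on IRC. Here is a minimal Python sketch of that comparison. The dictionaries are stand-ins for the real tables and for the IRC state, and plan_irc_updates is a hypothetical name.

```python
def plan_irc_updates(db_memberships, irc_memberships):
    """Compare project -> users mappings and list what a polling bot still has to add.

    Returns (new_channels, new_members), where new_members is a list of
    (project, user) pairs missing from the IRC side.
    """
    new_channels = sorted(set(db_memberships) - set(irc_memberships))
    new_members = []
    for project in sorted(db_memberships):
        known = irc_memberships.get(project, set())
        for user in sorted(set(db_memberships[project]) - set(known)):
            new_members.append((project, user))
    return new_channels, new_members
```

The bot would run this on a timer, so a new project only shows up on IRC after the next poll, which is the "not real time" drawback mentioned above.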

Last week Greg and I spoke with Gerald Penn, who does Natural Language Processing research at the University of Toronto, and his graduate student Xiaodan Zhu about summarizing IRC logs. Gerald pointed me to several resources on related research. I started reading a book called “Automatic Summarization” by Inderjeet Mani, as well as several research papers on text summarization / extraction. I’m taking notes, because it’s very easy to get lost in the details. NLP combines many fields: Linguistics, Statistics, Math (mostly algebra / linear optimization), as well as Computer Science (data structures / algorithms). I have to mention that I have never worked with NLP before, so all these ideas are very new to me. However, it’s not terribly difficult, because the material is abstracted enough that anyone with some background in the above fields can follow it. The difficulty in my case arises when we talk about summarizing IRC logs. Due to the nature of these chats, it’s very much unlike summarizing ordinary text. Scientific papers are usually well organized, so the Edmundsonian paradigm can be applied (i.e. the first and last sentences of a paragraph are probably more important than the sentences in the middle, so they should be assigned more weight based on location alone). The same is not true for IRC logs. So, I always have to keep that in mind when I read about the different heuristics discussed in these papers.

When I see something new, I like to get a big picture first before I start looking at details. What follows below is a summary of what I’ve learned so far.

IRC summarization consists of the following parts:

  1. Message segmentation
  2. Clustering
  3. Adjacent response pairs identification
  4. Summary extraction

Message Segmentation

This is done using the TextTiling algorithm. It partitions the entire IRC log into multi-paragraph segments. These segments most likely have little overlap in terms of their content.

Clustering

This is the most crucial part. Basically, using Ward’s agglomerative method, we can cluster segments based on their relevance. Each cluster is represented by an L-dimensional vector of word weights. Each word’s weight is (word frequency) / (document frequency). Word frequency is defined as “the number of times a word appears in the cluster” and document frequency is defined as “the number of segments the word appears in”. So, the higher the weight, the more valuable the word. For example, if a word appears a lot in one cluster and not in the others, then the topic discussed in that cluster was centered around this word. The next step is to start merging clusters based on the principle of minimal variance. The increase in in-cluster variance is a function of the number of elements in the two clusters to be merged, as well as the Euclidean distance between them (based on the vectors mentioned earlier). So, if 2 clusters are very similar (i.e. they’re about the same topic), then the distance between their vectors is small (compared to other possibilities), and the marginal increase in variance is small too. Merging stops when no remaining pair of clusters can be merged without increasing the in-cluster variance by more than a specified percentage threshold.
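
To check my own understanding of the two quantities above, here is a small Python sketch. The weight follows the definition given in this post. The merge cost uses the standard Ward formula, which is my assumption about what the papers mean: the increase in within-cluster sum of squares is (n_a * n_b) / (n_a + n_b) times the squared Euclidean distance between the centroids. Segments are represented as plain lists of words, and the function names are made up.

```python
from collections import Counter


def word_weights(cluster_segments, all_segments):
    """Weight = (count of the word in the cluster) / (number of segments containing it).

    cluster_segments must be a subset of all_segments; each segment is a list of words.
    """
    word_freq = Counter(w for seg in cluster_segments for w in seg)
    doc_freq = Counter(w for seg in all_segments for w in set(seg))
    return {w: word_freq[w] / doc_freq[w] for w in word_freq}


def ward_cost(centroid_a, size_a, centroid_b, size_b):
    """Increase in within-cluster sum of squares if the two clusters are merged."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(centroid_a, centroid_b))
    return (size_a * size_b) / (size_a + size_b) * squared_distance
```

At each step, the pair with the smallest ward_cost is merged, which is why two clusters about the same topic (small distance) get merged first.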

Adjacent response pairs identification

This is a very fuzzy area to me, which I will be trying to understand more. Basically, the idea is to identify the conversation initiator and the responses to this initiation. Since the conversation could have several threads, the process of identifying these pairs is probabilistic.

Summary extraction

Now that we’ve grouped messages at the topic and sub-topic level, extracting summaries is pretty straightforward. It looks like the Edmundsonian paradigm, or something similar to it, would work.

Now that I have a general idea of the steps involved in summarizing IRC logs, I will start looking at the details of each step.

I spent most of last week adding an event log to the IRC log web page, as well as the ability to select message blurbs. Currently, the window is hardcoded to 30 minutes around the event, so when I click on an item in the event log, the messages within 30 minutes of the event are highlighted.
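
The highlighting rule itself is simple. Here is a Python sketch of the 30-minute window, assuming messages are (timestamp, text) tuples; the real code works on database rows, and the names below are made up for the example.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=30)  # hardcoded for now, as described above


def messages_near_event(messages, event_time, window=WINDOW):
    """Return the (timestamp, text) messages within `window` of the event, on either side."""
    return [m for m in messages if abs(m[0] - event_time) <= window]


if __name__ == "__main__":
    event = datetime(2008, 7, 17, 21, 0)
    log = [
        (datetime(2008, 7, 17, 20, 15), "too early"),
        (datetime(2008, 7, 17, 20, 45), "about to commit"),
        (datetime(2008, 7, 17, 21, 20), "looks good"),
    ]
    for timestamp, text in messages_near_event(log, event):
        print(timestamp, text)
```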


Here is a summary of what’s been done so far and what needs to be done in the next few weeks:

TO DATE:

  1. Created backend Elixir code to save IRC messages and events to the database.
  2. Created Supybot plugin to log messages to the database
  3. Created a new DrProject component for viewing IRC logs

So, now the functionality to view IRC logs by date is done.

TO DO:

  1. Add event log to the component, so the users can browse segments of IRC logs by events. First, I’ll create a screen mockup for this.
  2. Add search functionality for the logs.
  3. Add tags to conversations. The idea is to use words that don’t appear in an English dictionary (e.g. ComponentA) as tags. For example, users could then view conversation blobs around ComponentA by clicking on the tag.
  4. This Wednesday Greg and I are meeting with Gerald Penn, who does research in Natural Language Processing. Hopefully he’ll point us in the right direction on algorithms for dissecting IRC messages by subject or keywords.
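
The tagging idea in item 3 could start out as something like the Python sketch below. It is only a first guess: the dictionary is passed in as a plain set of lowercase words, and candidate_tags is a hypothetical name.

```python
import re


def candidate_tags(message, dictionary):
    """Return words in the message that are absent from the dictionary, in first-seen order."""
    tags = []
    for word in re.findall(r"[A-Za-z][A-Za-z0-9_]*", message):
        if word.lower() not in dictionary and word not in tags:
            tags.append(word)
    return tags
```

Nicknames and typos would also come out as tags with this rule, so some filtering would be needed on top of it.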

This is the first screencast made by yours truly about the progress of this project. The idea behind the screencast is to show you the current functionality without you having to check out the code and configure everything to see what’s happening. I’ve uploaded it to YouTube. Again, any comments are very welcome!

EDIT: here are screenshots with an updated interface:

P.S. That’s not real Greg :P

Jeremy Handcock, a CS Master’s student here at UofT, has sent me a paper that explains how full-text search indexing of source code repositories, bug descriptions, documents, etc. could help developers find the rationale behind a portion of code. The way it works is pretty simple. The entire system is represented by a directed graph, with nodes representing artifacts such as a URL, an e-mail, a bug number, an e-mail address, etc., and arcs specifying where each artifact is mentioned. For example, a bug can be mentioned in a blog, a bug tracker, an e-mail, or anything else, so it will link to those artifacts.

The crawler scans data sources (repos, bug trackers, etc.) for artifacts and then updates the directed graph with links between nodes. For example, when a new e-mail mentions an existing bug, the crawler adds that e-mail to the graph and creates a link between the bug and the e-mail.
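
As I understand it, the graph boils down to a mapping from each artifact to the places it is mentioned. Here is a minimal Python sketch of that structure. It is my own simplification and not the paper’s implementation; ArtifactGraph and the artifact labels are made up.

```python
from collections import defaultdict


class ArtifactGraph:
    """Directed graph: an arc from an artifact to every artifact that mentions it."""

    def __init__(self):
        self.mentions = defaultdict(set)  # artifact -> artifacts that mention it

    def add_mention(self, source, target):
        """Record that `source` (e.g. an e-mail) mentions `target` (e.g. a bug)."""
        self.mentions[target].add(source)

    def mentioned_in(self, artifact):
        """List every artifact that mentions the given one."""
        return sorted(self.mentions.get(artifact, set()))


if __name__ == "__main__":
    graph = ArtifactGraph()
    graph.add_mention("email:42", "bug:1234")
    graph.add_mention("blog:intro", "bug:1234")
    print(graph.mentioned_in("bug:1234"))
```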

The concept seems very simple, but the details are complicated. For example, the scanner for bug artifacts would have to capture messages like “bug#1234”, “the SQL query has been fixed in ticket 1234”, “see 1234 for the updated code”, etc. That’s kind of what I’m thinking about right now. For my project, I need to do just that. I need to find a way to look at the IRC log and say “messages 123 to 456 were about X and messages 789 to 1000 were about Y”. I already looked at several other papers that talk about this. It gets complicated really fast. For example, messages 123 to 456 could be about subjects X and Y, and Y could come up again in messages 789 to 1000. I’m going to have a chat with another grad student who has been doing natural language processing to find out how feasible topic extraction is for IM logs.

I’m also working on the UI for viewing logs. I created a DB schema for IRC messages and events, a plugin for Supybot to snoop on channels and update the DB with messages and events, and also a basic interface for viewing chat logs by date. The next step is to be able to search and navigate IRC messages using DrProject search and the event log, respectively, as I mentioned in the earlier post. I’m going to post screenshots with an explanation of the current UI this week, and any comments would be great. I’m planning to make browsing by event log as easy as possible for the user, and the search as powerful as possible, so maybe we won’t have to segment IRC messages by topic. The event log by itself is a pretty good way of segmenting messages. If A happened at time T (check-in, bug report, etc.), and B was discussed at the same time on IRC, then most likely A and B are related.
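
Here is a first attempt, in Python, at a scanner for the three example phrasings above. It is a rough sketch and nothing more: the pattern is my own guess, and the “see 1234” form will produce false positives (think “see 5 examples”), which is exactly the kind of detail that makes this harder than it looks.

```python
import re

# Matches "bug#1234", "bug 1234", "ticket 1234", "ticket #1234" and "see 1234".
BUG_REFERENCE = re.compile(r"\b(?:bug\s*#?|ticket\s*#?|see\s+#?)\s*(\d+)\b", re.IGNORECASE)


def find_bug_numbers(message):
    """Return the bug numbers referenced in a message, in order of appearance."""
    return [int(number) for number in BUG_REFERENCE.findall(message)]
```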

P.S. I rediscovered in practice that the words “Windows” and “open source” mentioned in the same sentence render that sentence meaningless. A piece of software as complicated as DrProject, with its many open source plugins, just can’t run 100% correctly on Windows. There are many reasons, but they’re all explained by the fact that it was coded with Unix/Linux in mind (although it does have checks for Windows).

P.P.S. I never used VIM for coding before. I tried earlier this week, and found that learning how to use it effectively is as useful as learning how to drive in manual when most of the time you’re using a car to get from A to B. Eclipse might not be perfect, but I don’t buy the argument that using Linux commands with VIM will make you a more effective programmer.

… proved to be much more tedious than I expected, but not too tedious *if* you know what you’re doing (i.e. there exists documentation).

First things first. DrProject needs to know which modules it has to load when it starts up. These are specified in the entry_points dictionary in setup.py:

'irc_model = drproject.irclog.model',
'irc_webui = drproject.irclog.web_ui',

I have to include both model and web_ui because model holds the Elixir code that creates a table for IRC messages, and web_ui holds the IRCLogController class that handles user requests. DrProject needs to know about both, and it loads these classes when it starts. In model.py, I have IRCMessage, which inherits from Elixir’s Entity class and creates a mapping for the irc_message table in the DB.

web.chrome provides a few useful interfaces. In my case, I had to implement ITemplateProvider in my IRCLogController class. This interface has only one method, get_templates_dirs. The dispatcher calls this method to add the directory in which the templates are stored to its list. Another way is to add the template to drproject/templates, but I think it’s better to keep the template in my component.

As I become more familiar with DrProject and its components, I’ll add more details and create a wiki page on DrProject listing the steps for adding new components. So far, I’ve created a schema for storing messages and a bare-bones template for the irclog component in DrProject. Next, I’ll modify the Supybot plugin to save messages to this table and add tests for the backend. After this, I’m going to start working on the IRC logs template based on the screen mockup I posted earlier.
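
Roughly, the shape of that controller is sketched below in Python. This is not the real DrProject code: ITemplateProvider here is a stand-in class so the example runs on its own, the component_dir constructor argument and the "templates" directory name are assumptions, and I’m assuming the method returns a list of directories.

```python
import os


class ITemplateProvider:
    """Stand-in for the interface from web.chrome; the real one lives in DrProject."""

    def get_templates_dirs(self):
        raise NotImplementedError


class IRCLogController(ITemplateProvider):
    """Sketch of a controller that exposes its own template directory."""

    def __init__(self, component_dir):
        # component_dir is a hypothetical argument, used to keep the sketch self-contained.
        self.component_dir = component_dir

    def get_templates_dirs(self):
        # Return a list so the dispatcher can add every entry to its search path.
        return [os.path.join(self.component_dir, "templates")]
```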