
This is the first screencast made by yours truly about the progress of this project. The idea behind the screencast is to show you the current functionality without you having to check out the code and configure everything to see what’s happening. I’ve uploaded it to YouTube. Again, any comments are very welcome!

EDIT: here are screenshots with an updated interface:

P.S. That’s not real Greg 😛


Jeremy Handcock, a CS Master’s student here at UofT, has sent me a paper that explains how full-text search indexing of source code repositories, bug descriptions, documents, etc. could help developers find the rationale behind a piece of code. The way it works is pretty simple. The entire system is represented by a directed graph, with nodes representing artifacts such as a URL, e-mail, bug number, e-mail address, etc., and arcs specifying where each artifact is mentioned. For example, a bug can be mentioned in a blog, bug tracker, e-mail, or anything else, so it will link to those artifacts.

The crawler scans data sources (repos, bug trackers, etc.) for artifacts. It then updates the directed graph with links between nodes. For example, if a new e-mail mentions an existing bug, the crawler adds that e-mail to the graph and creates a link between the bug and the e-mail.
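Here is a minimal sketch of the graph the crawler would maintain. The class name, the method names and the "kind:id" labels for artifacts are my own assumptions for illustration, not anything taken from the paper. I store each arc in both directions so that a bug can be asked where it is mentioned.

```python
from collections import defaultdict


class ArtifactGraph:
    """Directed graph of artifacts; an arc records that one artifact mentions another."""

    def __init__(self):
        self.mentions = defaultdict(set)      # source -> artifacts it mentions
        self.mentioned_in = defaultdict(set)  # target -> artifacts that mention it

    def add_mention(self, source, target):
        # Adding an arc also adds both nodes if they were not in the graph yet.
        self.mentions[source].add(target)
        self.mentioned_in[target].add(source)

    def where_mentioned(self, artifact):
        # Sorted so the result is stable regardless of set ordering.
        return sorted(self.mentioned_in.get(artifact, ()))


graph = ArtifactGraph()
graph.add_mention("email:42", "bug:1234")
graph.add_mention("blog:rewrite-notes", "bug:1234")
print(graph.where_mentioned("bug:1234"))
```

Asking for an artifact that was never mentioned simply returns an empty list.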

The concept seems very simple, but the details are complicated. For example, the scanner for bug artifacts would have to capture messages like “bug#1234”, “the SQL query has been fixed in ticket 1234”, “see 1234 for the updated code”, etc. That’s kind of what I’m thinking about right now. For my project, I need to do just that. I need to find a way to look at the IRC log and say “messages 123 to 456 were about X and messages 789 to 1000 were about Y”. I already looked at several other papers that talk about this. It gets complicated really fast. For example, messages 123 to 456 could be about subjects X and Y, and Y could be repeated in messages 789 to 1000. I’m going to have a chat with another grad student who has been doing natural language processing to find out how feasible topic extraction is for IM logs.
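To give a feeling for the bug-scanner part, here is a rough sketch that catches the three example phrasings above with one regular expression. The pattern and the function name are my own guesses, and it is deliberately naive: treating "see" as a trigger word will produce false positives (for example "see 3 examples"), which is exactly the kind of detail that makes this complicated.

```python
import re

# Hypothetical pattern: a trigger word, an optional "#", then the number.
BUG_REFERENCE = re.compile(r"\b(?:bug|ticket|see)\s*#?\s*(\d+)\b", re.IGNORECASE)


def find_bug_ids(text):
    """Return the bug numbers referenced in a piece of text, in order."""
    return [int(number) for number in BUG_REFERENCE.findall(text)]


for sample in (
    "bug#1234",
    "the SQL query has been fixed in ticket 1234",
    "see 1234 for the updated code",
):
    print(find_bug_ids(sample))
```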

I’m also working on the UI for viewing logs. I created a DB schema for IRC messages and events, a plugin for Supybot to snoop on channels and update the DB with messages and events, and also a basic interface for viewing chat logs by date. The next step is to be able to search and navigate IRC messages using the DrProject search and event log, respectively, as I mentioned in the earlier post. I’m going to post screenshots with an explanation of the current UI this week, and any comments would be great. I’m planning to make browsing by event log as easy as possible for the user, and the search as powerful as possible, so maybe we won’t have to segment IRC messages by topic. The event log by itself is a pretty good way of segmenting messages. If A happened at time T (a check-in, bug report, etc.), and B was discussed at the same time on IRC, then most likely A and B are related.
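The "A happened at time T" idea can be sketched in a few lines. This is only an illustration under my own assumptions: the function name, the ten-minute window and the sample messages are made up, and the real version would read both sides from the DB.

```python
from datetime import datetime, timedelta


def messages_near_event(event_time, messages, window=timedelta(minutes=10)):
    """Return the text of messages sent within `window` of the event, before or after."""
    return [text for stamp, text in messages if abs(stamp - event_time) <= window]


checkin = datetime(2008, 6, 2, 14, 0)  # hypothetical check-in time
log = [
    (datetime(2008, 6, 2, 13, 55), "about to commit the parser fix"),
    (datetime(2008, 6, 2, 14, 4), "committed, please update"),
    (datetime(2008, 6, 2, 16, 30), "anyone up for lunch?"),
]
print(messages_near_event(checkin, log))
```

The window size is the obvious knob: too small and related chatter is missed, too large and unrelated messages get attached to the event.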

P.S. I rediscovered in practice that the words “Windows” and “open source” mentioned in the same sentence render that sentence meaningless. A piece of software as complicated as DrProject, with its many open source plugins, just can’t run 100% correctly on Windows. There are many reasons, but they’re all explained by the fact that it was coded with Unix/Linux in mind (although it does have checks for Windows OS).

P.P.S. I never used VIM for coding before. I tried earlier this week, and found that learning to use it effectively is about as useful as learning to drive a manual when most of the time you’re just using a car to get from A to B. Eclipse might not be perfect, but I don’t buy the argument that using Linux commands with VIM will make you a more effective programmer.

… proved to be much more tedious than I expected, but not too tedious *if* you know what you’re doing (i.e. there exists documentation).

First things first. DrProject needs to know which modules it has to load when it starts up. These are specified in the entry_points dictionary in setup.py:

'irc_model = drproject.irclog.model'
'irc_webui = drproject.irclog.web_ui'

I have to include both model and web_ui because model contains the Elixir code that creates a table for IRC messages, and web_ui contains the IRCLogController class for handling user requests. So, DrProject needs to know about both. When it starts, it loads these modules. In model.py, I have IRCMessage, which inherits from Elixir’s Entity class and creates a mapping for the table irc_message in the DB.
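To make the table a bit more concrete, here is a sketch of what irc_message could look like. Note that this uses sqlite3 from the standard library instead of Elixir so that it runs on its own, and that every column name below is my assumption, not the actual schema.

```python
import sqlite3

# In-memory database, just for illustration.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE irc_message (
        id      INTEGER PRIMARY KEY,
        channel TEXT NOT NULL,   -- assumed column
        nick    TEXT NOT NULL,   -- assumed column
        sent_at TEXT NOT NULL,   -- assumed column, ISO 8601 timestamp
        body    TEXT NOT NULL    -- assumed column
    )
    """
)


def save_message(connection, channel, nick, sent_at, body):
    """Insert one IRC message; parameters are bound, never string-formatted."""
    connection.execute(
        "INSERT INTO irc_message (channel, nick, sent_at, body) VALUES (?, ?, ?, ?)",
        (channel, nick, sent_at, body),
    )


save_message(conn, "#drproject", "greg", "2008-06-02T14:00:00", "hello")
rows = conn.execute("SELECT nick, body FROM irc_message").fetchall()
print(rows)
```

With Elixir, the same columns would be declared as fields on the IRCMessage entity rather than written as SQL.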

web.chrome provides a few useful interfaces. In my case, I had to implement ITemplateProvider in my IRCLogController class. This interface has only one method, get_templates_dirs. The dispatcher calls this method to add the directory in which the templates are stored to its list. Another way is to add a template to drproject/templates, but I think it’s better to keep the template in my component.
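Roughly, the contract looks like the sketch below. It is a standalone stand-in that does not import anything from DrProject: only the names IRCLogController and get_templates_dirs come from the real code, while the component_dir argument and the collect_template_dirs helper are hypothetical and only there to show what the dispatcher does with the result.

```python
import os


class IRCLogController:
    """Stand-in for the real controller, which implements ITemplateProvider."""

    def __init__(self, component_dir):
        self.component_dir = component_dir

    def get_templates_dirs(self):
        # The templates live in a "templates" directory inside the component.
        return [os.path.join(self.component_dir, "templates")]


def collect_template_dirs(providers):
    """What the dispatcher does in spirit: ask every provider for its directories."""
    directories = []
    for provider in providers:
        directories.extend(provider.get_templates_dirs())
    return directories


print(collect_template_dirs([IRCLogController(os.path.join("drproject", "irclog"))]))
```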

As I become more familiar with DrProject and its components, I’ll add more details and create a wiki page on DrProject listing the steps for adding new components. So far, I’ve created a schema for storing messages and a bare-bones template for the irclog component in DrProject. Next, I’ll be modifying the Supybot plugin to save messages to this table and adding tests for the backend. After this, I’m going to start working on the IRC logs template based on the screen mockup I posted earlier.

I’ve created a very basic screen mockup today. I’ve had a lot of feedback, which I’m going to integrate into the next version of the mockup. You can see it here: https://www.drproject.org/DrProject/wiki/DrProjectChatInterface

The next step is to design the database schema for storing messages, and start adding a new component to DrProject. I haven’t nailed down the idea of using the event log as a natural way of segmenting chat logs, but I think this is the easiest approach to take. The other would be some sort of NLP system, and that seems like a project in itself. However, I’m open to any other ideas.

DrProject consists of components, or modules. For example, drproject/ticket contains a template and the model and controller code to handle ticket requests. When a request for a resource is received through the web, web/main.py searches for the right controller by looking at the URL pattern. The question I have is why do it like this? In ASP.NET, you simply tell the webpage what “code-behind” (or “code-beside”) classes will handle the request.
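For comparison, URL-pattern dispatch boils down to something like the sketch below. This is not DrProject’s actual code from web/main.py: the route table, the URL patterns and the handler functions are all made up to illustrate the idea of picking a controller by matching the request path.

```python
import re


def show_ticket(number):
    return "ticket %s" % number


def show_irc_log(date):
    return "irc log for %s" % date


# Hypothetical routes: each URL pattern is paired with the handler for it.
ROUTES = [
    (re.compile(r"^/ticket/(\d+)$"), show_ticket),
    (re.compile(r"^/irclog/(\d{4}-\d{2}-\d{2})$"), show_irc_log),
]


def dispatch(path):
    """Call the first handler whose pattern matches; return None if nothing matches."""
    for pattern, handler in ROUTES:
        match = pattern.match(path)
        if match:
            return handler(*match.groups())
    return None


print(dispatch("/ticket/42"))
print(dispatch("/irclog/2008-06-02"))
```

One thing this style buys you is that a new component only has to contribute its own patterns; nothing that already exists needs to be edited.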

Anyway, creating a new component in DrProject is fairly simple. A component consists of the following files/directories:

/templates: This stores the html Kid templates. It’s pretty straight-forward: you can in-line Python code with HTML (which is very ugly and one of the reasons why MVC for web is very good), or you can do all the logic in the model/controller and use these templates as forms for getting input, or a page for displaying stuff.

/__init__.py: This is where imports go that will be used by other classes in the component.

/api.py: This is where “helper” classes and methods go, which will be used by other classes in the component.

/model.py: The guts of the component. All logic related to backend is here.

/web_ui.py: The brain of the component. Controller to handle user actions with the web page should be here.

Of course, in the real world, there is no perfect MVC. Some controllers seem to contain logic for modifying the backend, which I think should be in the model somewhere.

I’ve been looking at the ways of organizing chat logs. In a long-term project, there will be thousands of messages, and we need some way of organizing these logs, instead of having one long document. Here are some ideas ordered by simplicity:

Archives by day/month/year.
The simplest way of doing this, which is very similar to IRC/MSN logs.

Use a channel topic and tags to organize logs.
The quality of organization will depend on users. Here is why. The basic idea is to use the channel topic as the log name (and consequently, the wiki page name). For example, imagine students working on an assignment (or a software team working on a product). The channel will need to have a topic (e.g. Assignment 1, or DrProject Summer 2008) set by a channel operator. So, all chats related to this topic will be stored in the same document (Assignment1 or DrProjectSummer2008). To allow a deeper level of organization, tags can be used to create sub-topics. For example, when discussing component A of Assignment 1, a user could tell the bot to attach a tag to subsequent messages until the subtopic is changed (e.g. DrProjectBot: tag ComponentA). This way, users can select conversations by tags (subtopics) in one (potentially) giant document. This will also allow cross-topic selection, since the tags can link portions of conversations in different topics (e.g. if Component A is used in Assignments 1 and 4). An interface that displays logs in an organized and easy-to-use way will let users select only the portions of messages they want to see by using tags. I’m thinking it will be based on the existing wiki system in DrProject, but will be a separate component. Why wiki? Because it’s the most popular tool for letting many users edit text on the web.

Natural language processing for topic detection and classification.
I’ve never done any machine learning or natural language computing stuff, and today I spent some time reading papers. It seems that it’s a fairly experimental field, and hasn’t been widely used in the industry. IBM has something called WebFountain, which is described as follows:

“WebFountain is a platform for very large-scale text analytics applications. The platform allows uniform access to a wide variety of sources, scalable system-managed deployment of a variety of document-level “augmenters” and corpus-level “miners,” and finally creation of an extensible set of hosted Web services containing information that drives end-user applications. Analytical components can be authored remotely by partners using a collection of Web service APIs (application programming interfaces). The system is operational and supports live customers. This paper surveys the high-level decisions made in creating such a system.”

There are a few papers on doing something similar with IRC chat logs. From what I can tell, they break the log into overlapping documents (sliding window style), get the keywords, and project documents on keywords. The result is a Gaussian distribution, and if enough documents contain keywords, then these keywords form a topic. This is useful for topic extraction, and I guess it would be possible to select bits of conversations from these topics.

It seems very interesting, but not feasible for DrProject because of its experimental nature. These algorithms have to be fine-tuned either through machine learning or by humans. However, the CIA and Homeland Security seem very interested because, apparently, IRC is used by terrorists, and extracting topics from chat logs would make it easier to spot them.
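To get a feel for the sliding-window part of those papers, here is a toy sketch. It is emphatically not their algorithm: it only cuts the log into overlapping windows and treats a word as a topic candidate when it shows up in at least two messages of a window. The stopword list, the window size, the step and the threshold are all arbitrary assumptions of mine.

```python
from collections import Counter

# Tiny, made-up stopword list; a real one would be much longer.
STOPWORDS = {"the", "a", "is", "to", "and", "of", "in", "it", "we", "for"}


def window_topics(messages, size=4, step=2, min_messages=2):
    """Return (start index, topic words) for each overlapping window of messages."""
    results = []
    for start in range(0, max(len(messages) - size, 0) + 1, step):
        window = messages[start:start + size]
        counts = Counter()
        for message in window:
            # Count each word once per message, ignoring stopwords.
            counts.update(set(message.lower().split()) - STOPWORDS)
        topics = sorted(word for word, n in counts.items() if n >= min_messages)
        results.append((start, topics))
    return results


chat = [
    "parser crashes on empty input",
    "parser fix is ready",
    "lunch anyone",
    "parser tests pass",
    "schema migration next",
    "schema looks fine",
]
print(window_topics(chat))
```

Even this toy version shows why tuning matters: change the window size or the threshold and the "topics" change with it.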

So far, I think the most feasible solution is to use tagging. It seems the easiest way to organize chat logs, and DrProject already has a tag cloud and searching by tags.
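The tagging idea is simple enough to sketch: the bot remembers the last tag it was given and attaches it to every message that follows. The "DrProjectBot: tag ComponentA" command comes from the description above; the function name and the data layout are my own assumptions.

```python
BOT_NICK = "DrProjectBot"


def tag_messages(lines):
    """Take (nick, text) pairs and return (nick, text, tag) triples.

    The tag is None until somebody sets one, and tag commands themselves
    are not stored as messages.
    """
    prefix = BOT_NICK + ": tag "
    current_tag = None
    tagged = []
    for nick, text in lines:
        if text.startswith(prefix):
            current_tag = text[len(prefix):].strip() or None
            continue
        tagged.append((nick, text, current_tag))
    return tagged


conversation = [
    ("alice", "morning all"),
    ("bob", "DrProjectBot: tag ComponentA"),
    ("bob", "the parser needs a rewrite"),
    ("alice", "DrProjectBot: tag ComponentB"),
    ("alice", "the schema is fine though"),
]
for entry in tag_messages(conversation):
    print(entry)
```

Selecting a subtopic is then just a filter on the third element of each triple.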

Any opinions are very welcome, especially improvements to my thoughts on organizing chat logs.

Currently I’m working on integrating IRC logs with DrProject. There are two ways I can think of doing this, and both use wikis:

1. Save logs to the existing wiki system in DrProject. This seems the simplest solution, but there are some concerns with it. The biggest concern is that official wikis are meant to be formal documents, while logs are informal. The second concern is that logs will show up in the wiki index. So it seems that the better approach is to create a separate component, described in the next section.

2. Create a separate component for logs based on the current wiki system in DrProject. This will entail creating separate tables to keep these wikis apart from the regular wikis on DrProject. It will be very similar to the existing wiki system in DrProject, but it gives more flexibility in what these wiki pages can parse and show, because we could tailor it specifically for logs (syntax highlighting, etc.).

In the first case, the bot could use DrProject RPC to edit wiki logs. In the second case, we could do it directly by inserting records into a table with messages, or use RPC as well.

Currently, I think the second option is the best.

Supybot comes with the ChannelLogger plugin. Basically, it implements event handlers for things like users joining the channel, users changing nicknames, messages, etc. It saves them in a text file #<channel_name>.log under <irc_network_name>/<channel_name>. Since I finished writing the RPC for editing wikis, I have now started writing another plugin for Supybot to save logs as wikis on DrProject. Here is how I’m planning on doing this: have a plugin run as a sort of background job, and periodically upload the logs created by ChannelLogger to the DrProject wiki. I’ll post more details tomorrow, but that’s the basic idea.
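One piece of the background job can be sketched already: each run should pick up only the lines that were appended since the previous run. The sketch below remembers a byte offset for that and holds back a half-written last line. The function name is hypothetical, it makes no assumption about the ChannelLogger line format, and the upload itself is left out.

```python
import os
import tempfile


def read_new_lines(log_path, offset):
    """Return (complete lines appended since `offset`, new offset)."""
    with open(log_path, "rb") as handle:
        handle.seek(offset)
        data = handle.read()
    # Only consume up to the last newline so a partially written line waits.
    end = data.rfind(b"\n") + 1
    lines = data[:end].decode("utf-8", errors="replace").splitlines()
    return lines, offset + end


with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "channel.log")  # stand-in for a ChannelLogger file
    with open(path, "wb") as log_file:
        log_file.write(b"first line\nsecond line\n")
    first_batch, position = read_new_lines(path, 0)

    with open(path, "ab") as log_file:
        log_file.write(b"third line\nhalf a li")
    second_batch, position = read_new_lines(path, position)

print(first_batch)
print(second_batch)
```

The offset would have to be stored somewhere between runs (a small state file would do), otherwise a bot restart re-uploads the whole log.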

This week I finished writing the RPC code to handle ticket views and added a command to the DrProject plugin for Supybot to handle ticket view requests. The ticketing system in DrProject will soon be revamped, so I expect the code to break. However, in my case, like Jeff said, it’s better to have something working now that could be fixed in the future, rather than preparing for the future. I don’t expect that it will be a big effort to fix the resulting errors.

See here.

Currently, I’m working on implementing a similar feature for change sets. I think a simple description of a change set, along with a link, should be sufficient. I’m also working on the design for integrating IRC and the DrProject wiki. We’ll want to save chat logs as wiki pages, and there are many things to consider. How to organize these wiki pages? Use the channel topic as the name for the wiki page? Or just time stamp it? I also want to create a doc explaining how the bot talks to DrProject, and how to set it up, configure it, run it, etc. In short, useful documentation for everyone.

Now seems like a good time to write a plan for the next few weeks on what I’m planning on doing with my project.

DrProject has RPC functions for things like getting a project’s roadmap, milestones, etc. So, wouldn’t it be cool if you could get information on your project straight from the IRC chat? For this, I’m planning on writing a plugin for Supybot. While searching around, I came across TracBot, which is a plugin written for Supybot. Users in the channel can ask the bot to give them a link to a ticket, change set, wiki, or do a search. However, this is not good enough. It would be really cool if you could see all of this information in the IRC channel. Imagine a scenario where students are having a conversation, and somebody mentions a ticket or a wiki. Instead of posting a link, they could just ask the bot to spit this information into the channel, so everyone can see it at the same time. Since every project will have its own channel, this is not spamming. The people in the channel are all working on the same thing, and are very likely discussing the same thing.

So, back to implementation. It seems that some things, such as tickets, don’t have RPC implemented. What I’ll be doing next is finishing these features in DrProject. Then, I’ll develop a plugin for Supybot, so it will relay user requests in the IRC channel to DrProject via RPC.
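The relay itself could look roughly like the sketch below. To keep it runnable without Supybot or a DrProject server, the RPC side is replaced by a dictionary of plain functions; the command names, the reply texts and handle_command are all hypothetical and are not the plugin’s real interface.

```python
def handle_command(text, backend):
    """Turn a request such as 'ticket 42' into a one-line reply for the channel."""
    parts = text.split()
    if len(parts) != 2 or parts[0] not in backend or not parts[1].isdigit():
        return "usage: " + " | ".join("%s <number>" % name for name in sorted(backend))
    kind, number = parts[0], int(parts[1])
    return backend[kind](number)


# Stand-ins for the real RPC calls; the real ones would query DrProject.
fake_backend = {
    "ticket": lambda number: "Ticket #%d: (summary fetched over RPC)" % number,
    "changeset": lambda number: "Changeset %d: (description fetched over RPC)" % number,
}

print(handle_command("ticket 42", fake_backend))
print(handle_command("roadmap", fake_backend))
```

Keeping the backend as a parameter also makes the plugin easy to test without a running server.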

We’re also hoping to get an IRC daemon installed on our server, so everyone working on projects this summer, including profs, will be able to start using this as soon as possible. It’s also a great way of collecting data, since the next step would be to “wikify” conversations on DrProject, so that users can see the archive. Feedback will also be invaluable.

I’m hoping to talk to my supervisor about the actual architecture when he gets back next week. What I really want to discuss is security. Right now, I’m guessing each project will have its own channel, and each channel will have its own bot, which has an account on DrProject with view rights. There are tons of ways in which this could be compromised. The IRC daemon alone could have a lot of security holes. What happens if somebody breaks into it and can view all channels? Then they can get information on projects from bots. Should the users be authenticated with the bot before they can issue commands? Would that become too tedious to be used (particularly for student projects)?