DrProject and IRC Integration

Django auth

Is also very very nice. Basically, it consists of two layers: permissions and groups. A permission is the most basic unit. A group is a collection of permissions (just like a role in DrProject right now). A user can have permissions or groups in their profile. This is all good, but there is one problem.

The problem is that each permission is tied to an application. In Pyrtl, we’ll want finer-grained access control: users should be able to have different permissions in different projects. One idea that I have is to use ${permission_name}_${project_name} to handle that. Basically, we’ll have a base set of groups and permissions. When a new project is created, Pyrtl will create the corresponding groups and permissions with the above schema. Then, it’s easy to add a role to the user for that project: we’ll simply assign a group ${group_name}_${project_name}, which will contain permissions from the base with the above schema.
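As a rough sketch of the naming scheme above (plain Python, no Django involved), the following derives the per-project group and permission names from a base set. The base groups and permission names here are made up for illustration; they are not Pyrtl’s actual roles.

```python
# Hypothetical base roles and the base permissions each one bundles.
BASE_GROUPS = {
    "developer": ["ticket_view", "ticket_edit"],
    "viewer": ["ticket_view"],
}


def scoped(name, project):
    """Apply the ${name}_${project_name} scheme to a single name."""
    return "%s_%s" % (name, project)


def names_for_project(project, base_groups=BASE_GROUPS):
    """Map each project-scoped group name to its project-scoped permission codenames."""
    return {
        scoped(group, project): [scoped(perm, project) for perm in perms]
        for group, perms in base_groups.items()
    }


if __name__ == "__main__":
    for group, perms in sorted(names_for_project("pyrtl").items()):
        print(group, perms)
```

One thing to watch with this scheme: if project names can themselves contain underscores, a scoped name can no longer be split back into its two parts unambiguously.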

There are two ways that I know of to create custom permissions in Django. The first way is to add a new model inside models.py as explained here. But we’ll want to do this dynamically. Here is a code snippet:

from django.contrib.auth.models import User, Permission, Group
from <project_name>.<app_name> import models as app
from django.contrib.contenttypes.models import ContentType

# get the application name from the module
appname = app.__name__.lower().split('.')[-2]

# get a ContentType of the application
ct, created = ContentType.objects.get_or_create(model='', app_label=appname, defaults={'name': appname})

# create a new permission for this application with codename 'foo' and human readable name 'foo permission'
p, created = Permission.objects.get_or_create(codename='foo', content_type__pk=ct.id, defaults={'name': 'foo permission', 'content_type': ct})
p.save()

# create a new group 
g = Group(name='bar')
g.save()

# add a permission to the group
g.permissions.add(p)
g.save()

# create a user and add a group to the user
user = User.objects.create_user('Kosta', 'kosta@foo.com', '123')
user.groups.add(g)
user.save()

Let’s go through it step by step.

from django.contrib.auth.models import User, Permission, Group
from <project_name>.<app_name> import models as app
from django.contrib.contenttypes.models import ContentType

We import the models module and give it the alias ‘app’. This will be necessary to get a reference to the ContentType object for the application this module belongs to. When we create a new permission dynamically, we must pass it this object so it knows which application it’s tied to.

# get a ContentType of the application
ct, created = ContentType.objects.get_or_create(model='', app_label=appname, defaults={'name': appname})

Here, we get a reference to the ContentType object that references our application.

# create a new permission for this application with codename 'foo' and human readable name 'foo permission'
p, created = Permission.objects.get_or_create(codename='foo', content_type__pk=ct.id, defaults={'name': 'foo permission', 'content_type': ct})
p.save()

Now, we create a new permission object and save it to the database (or fetch it if it already exists). We must give it a codename, which I called ‘foo’, and a human-readable name, which I called ‘foo permission’. We must also pass it the ContentType object that references the application.

# create a new group 
g = Group(name='bar')
g.save()

# add a permission to the group
g.permissions.add(p)
g.save()

Here, we create a new group, save it so it has a primary key, and add a permission to the group.

# create a user and add a group to the user
user = User.objects.create_user('Kosta', 'kosta@foo.com', '123')
user.groups.add(g)
user.save()

Now, we create a new user (we could also query a user based on the session, but for demo purposes I’ll just create a new one), and add a group to my profile.

This is going to work, but it’s just an idea that I have. To check if the user has a permission, we just do:

user.has_perm('<app_name>.foo')

And this returns True. So, in the controller we could use reflection to check if the user has a permission in the project that’s based on the base permission.
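A minimal sketch of what such a check in the controller could look like, assuming the ${permission_name}_${project_name} scheme described earlier. StubUser is only a stand-in so the example runs without Django; with a real Django User, the same has_perm call would be used. The helper name has_project_perm and the app label ‘tickets’ are made up for illustration.

```python
class StubUser:
    """Stand-in for a Django User; only has_perm is modelled here."""

    def __init__(self, perms):
        self._perms = set(perms)

    def has_perm(self, perm):
        return perm in self._perms


def has_project_perm(user, app_label, base_perm, project):
    """Check the project-scoped variant of a base permission."""
    codename = "%s_%s" % (base_perm, project)
    return user.has_perm("%s.%s" % (app_label, codename))


if __name__ == "__main__":
    user = StubUser({"tickets.foo_pyrtl"})
    print(has_project_perm(user, "tickets", "foo", "pyrtl"))
    print(has_project_perm(user, "tickets", "foo", "other"))
```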

Django MVC

In the previous entry I explained how to create a new project and app, and how to add an app to the project. In this entry, I’ll show how to implement MVC in Django.

Recall that our project directory tree looks as follows:

<project_name>/

<app_name>/
__init__.py
models.py
views.py

__init__.py
manage.py
settings.py
urls.py

The first thing we need to do is to tell Django which methods will handle requests. This is done in urls.py by adding a tuple of tuples named urlpatterns. The tuples are of the form:

(regular expression, Python callback function [, optional dictionary])

The first item is a regular expression that matches the URL we want to handle. The second item is the full path to the function that handles the request. The optional dictionary holds extra keyword arguments that are passed to that function. The methods that handle requests live in the views.py file.
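To make the shape of these tuples concrete, here is a toy resolver in plain Python. It is not Django’s URL resolver, only an illustration of how a (regex, callback, optional dictionary) entry can be matched against a path. The view functions, the URL layout and the resolve helper are all made up for the example.

```python
import re


def index(request):
    return "index page"


def project_detail(request, name, extra=None):
    return "project %s (%s)" % (name, extra)


# Same shape as the tuples described above:
# (regular expression, callback [, optional dictionary])
urlpatterns = (
    (r"^projects/(?P<name>\w+)/$", project_detail, {"extra": "demo"}),
    (r"^$", index),
)


def resolve(path, patterns):
    """Return (callback, kwargs) for the first matching pattern, else None."""
    for entry in patterns:
        regex, callback = entry[0], entry[1]
        extra = entry[2] if len(entry) > 2 else {}
        match = re.search(regex, path)
        if match:
            # Named groups from the regex plus the optional dictionary
            # become the keyword arguments of the callback.
            kwargs = dict(match.groupdict(), **extra)
            return callback, kwargs
    return None


if __name__ == "__main__":
    callback, kwargs = resolve("projects/pyrtl/", urlpatterns)
    print(callback(None, **kwargs))
```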

Inside the handler methods we can do anything that Python allows us to do. The one condition is that the method must return an object of type HttpResponse. The next step is to create templates to handle the presentation. The first thing we must do is decide where to put templates. The path to the templates is added to the TEMPLATE_DIRS tuple in settings.py. After we’ve added a template there, we must tie it to the method that handles the request. This is done by creating a django.template.Context object, which is just a mapping from the names used inside the template to the Python objects inside the method. Then, we must load the actual template by calling django.template.loader.get_template(<template_name>). Finally, we render the template by calling render on the loaded template and passing it the Context object, and return the rendered result wrapped in an HttpResponse object.

Django

Is really, really good. It’s an MVC web framework written in Python with content management in mind. The most general unit in Django is a project, which is a collection of applications. Applications are just plugins, and can be used in one or more projects. Remember my previous post on adding a new component to DrProject? Well, with Django it’s a lot simpler, as I’ll show soon. Django supports PostgreSQL, MySQL, SQLite 3, and Oracle. It also comes with a database API, which is very similar to SQLAlchemy.

To create a new project, we can use django-admin.py to create the directory structure so we don’t have to do it manually. When we invoke django-admin.py startproject <project_name>, Django will create the following directory tree:

<project_name>/
__init__.py
manage.py
settings.py
urls.py

For example, the settings.py file just contains setting fields (such as database-related stuff). The backend follows the DRY (Don’t Repeat Yourself) principle. This means that we define anything that we want to store in the database as a model. A model is essentially a class inheriting django.db.models.Model with fields, a concept similar to Elixir/SQLAlchemy. A model belongs to an application that uses that model. To create a new app in Django, we go to the directory in which the project was created and invoke python manage.py startapp <app_name>. This creates the following directory tree:

<app_name>/
__init__.py
models.py
views.py

This looks like MVC, right? In models.py we specify the models that we’ll want to store in the database. When we’ve added models to this file, we call python manage.py syncdb from the project root directory, which calls execute_manager from django.core.management with our settings and applications. One little caveat: we need to manually add our application to the INSTALLED_APPS tuple in settings.py using its full name (<project_name>.<application_name>). When syncdb runs, Django will create any tables related to our models.

This completes the intro to creating projects/apps and defining models. Next, I’ll show how to add controllers to handle requests and templates to show shiny stuff.

PostgreSQL to SQLite

This week I had a demo at DemoCamp, and had to get a snapshot of DrProject’s database. I had to borrow somebody’s laptop for the demo, and didn’t want to install PostgreSQL on it. So, I was challenged with the task of converting a PostgreSQL database dump into an SQLite database. After some time surfing the web, I came to the conclusion that nobody has ever documented/blogged this (and I guess that makes sense: if anything, people should be moving from SQLite to PostgreSQL).

Getting a dump from PostgreSQL is simple:

pg_dump -a -d db_name > dump.sql

This will create a text file of SQL statements. With the pg_dump of that time, -a dumps only the data (no table definitions) and -d writes the rows as INSERT statements instead of COPY, so the tables themselves must already exist on the SQLite side. This is what I had to begin with. The next step is to sanitize the data. SQLite has no predefined data type for booleans; it uses integers, with 0 for false and 1 for true. Therefore, we should replace ‘false’ with 0 and ‘true’ with 1, for example with the ‘awk’ command on Linux. The next step is to connect to the SQLite database:

sqlite3 db_name.db

You will get a prompt like this:
sqlite>

Now, use the .read command to load the dump:
sqlite> .read dump.sql

That’s it!
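For the boolean replacement step, here is a small sketch in Python instead of awk. It is a naive approach: it assumes one INSERT statement per line with unquoted true/false literals, and it does not parse quoted strings, so a text value that happens to contain ‘,true,’ would be rewritten too. The table and column names are made up for the demo.

```python
import re
import sqlite3

# A bare true/false literal standing as a whole value: preceded by "(" or ","
# and followed by "," or ")".
BOOL_VALUE = re.compile(r"(?<=[(,])\s*(true|false)\s*(?=[,)])")


def sanitize_line(line):
    """Rewrite boolean literals in an INSERT line as 1/0; leave other lines alone."""
    if not line.lstrip().upper().startswith("INSERT"):
        return line
    return BOOL_VALUE.sub(lambda m: "1" if m.group(1) == "true" else "0", line)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    # The table must already exist, since the dump holds data only.
    conn.execute("CREATE TABLE member (name TEXT, is_admin INTEGER)")
    conn.execute(sanitize_line("INSERT INTO member VALUES ('kosta', true);"))
    print(conn.execute("SELECT name, is_admin FROM member").fetchall())
```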

Wonders of RTFM’ing

So, I just spent about 3 hours at work and at home debugging a small problem. I added functionality for a service bot to identify with the DrProject bot. This is done by the service bot sending an IDENTIFY command to the DrProject bot, and waiting for the reply “The operation succeeded”. Very simple concept, right?

INFO 2008-07-17T21:06:26 Received message from DrProjectBot: The operation succeeded.
error: “error:” is not a valid command.
error: “error:” is not a valid command.
error: “error:” is not a valid command.
error: “error:” is not a valid command.
INFO 2008-07-17T21:06:29 Ignoring *!DrProjectBot@127.0.0.1 for 600 seconds due to
an apparent invalid command flood.
WARNING 2008-07-17T21:06:29 Apparent error loop with another Supybot
observed. Consider ignoring this bot permanently.

Basically, I didn’t RTFM. Every message received by a bot is interpreted as a command. So, when the service bot received “The operation succeeded.”, it was very angry at the DrProjectBot, because “The operation succeeded.” is not a valid command. It proceeded to yell at DrProjectBot with a message “error: “The” is not a valid command.”. Naturally, DrProjectBot also became angry, because “error: “The” is not a valid command.” is not a valid command, and the cycle started.

Thanks to the good people on the #supybot channel, I found a way for the bot to ignore non-command messages. Setting the supybot.reply.whenNotCommand variable to False does the trick.
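The loop is easy to reproduce with a toy model. The sketch below is not Supybot code; ToyBot and exchange are made-up names, and the reply_when_not_command flag only imitates what supybot.reply.whenNotCommand controls. Two bots that both answer every non-command with an error keep bouncing messages until the cap is hit, while a bot that stays quiet ends the exchange after the first reply.

```python
class ToyBot:
    """Treats the first word of every incoming message as a command."""

    def __init__(self, name, commands, reply_when_not_command=True):
        self.name = name
        self.commands = set(commands)
        self.reply_when_not_command = reply_when_not_command

    def receive(self, text):
        """Return the reply to send back, or None to stay silent."""
        word = text.split()[0]
        if word in self.commands:
            return "The operation succeeded."
        if self.reply_when_not_command:
            return 'error: "%s" is not a valid command.' % word
        return None


def exchange(a, b, first_message, max_turns=10):
    """a sends first_message to b; replies bounce until a bot stays silent.

    Returns the number of messages sent (capped at max_turns).
    """
    sender, receiver = a, b
    message = first_message
    sent = 0
    while message is not None and sent < max_turns:
        sent += 1
        reply = receiver.receive(message)
        sender, receiver = receiver, sender
        message = reply
    return sent


if __name__ == "__main__":
    drproject = ToyBot("DrProjectBot", {"IDENTIFY"})
    noisy = ToyBot("ServiceBot", set(), reply_when_not_command=True)
    quiet = ToyBot("ServiceBot", set(), reply_when_not_command=False)
    print(exchange(noisy, drproject, "IDENTIFY secret"))
    print(exchange(quiet, drproject, "IDENTIFY secret"))
```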

IRC Services (cont.)

So, after some deliberation, I came up with two solutions for making DrProject admin’s life easier and both have their pros and cons. I would like to get some opinions on which is better, por favor.

DrProject signal capturing method

This seems like the best way to automate the process of registering channels / users in IRC. DrProject provides signals whenever a new project is created or a user is assigned to a project. The idea is to create a handler for these signals. The next step is a little hazy. As I mentioned before, DrProject and Supybot are two separate processes. So, the idea is to have a separate bot for services. Whenever a new project is created, DrProject will modify a config file for this bot and start it up to do all the mundane IRC services tasks (creating a channel, adding a user to a channel, etc.), and the bot will automatically quit the network and die an honorable death after it’s done. The biggest problem with this approach is: how does DrProject know whether the bot has completed its tasks successfully? Does it have to wait for the bot to finish its work and then parse the log file to see if everything was done?

Supybot plugin to simplify services management

This is the method which I personally prefer as a developer because it’s clean. It follows Supybot plugin architecture (plugin is just an extension that provides commands which Supybot executes). However, it will put more burden on the administrator because not everything will be automated. The idea here is to provide the admin with commands which would automate the process of registering channels, adding users, etc. It could be as simple as !doMagic, which would then create and protect a channel for each project, create and register every user in every project, and assign each user to the corresponding channel’s access list. Or, !assignUserToChannel <user> <channel>, !createChannel <channel>, !registerUser <user> <password>, etc. Another reason why I like this approach is that it provides more fine-tuning to the administrator.

Even though I agree with Greg that there should be as much automation as possible (otherwise, users might not adopt a feature), there must be a balance between making someone’s life easier and making the software too complicated. In this case, the first approach is complicated because there are many weak points where something could go wrong. Also, as I mentioned before, the second solution gives more power to the admin. It’s possible to use both methods as well. I would like to hear some ideas, especially from my supervisor 🙂

IRC Services + Supybot + DrProject = ???

I have now started looking in more detail at user/channel management in IRC and how to integrate this with DrProject. What we want is to automate the things that IRC operators usually perform (registering a channel, setting access levels, etc.). For example, when a project is created, it would be nice if some entity (a bot?) could create, register and restrict the channel for it. When a new user is added to a project, the same entity should register the user on the IRC network, and add the user to the access list for the channel that corresponds to that project.

I should mention that for IRC services we are using Anope, because it supports Linux and Windows. I will not mention the pain of configuring InspIRCd and Anope to talk to each other (apparently, localhost is not recognized as a loopback address, even though it’s the default value in config…).

Registering and restricting a channel manually in IRC can be done in several ways.

Using access list (short, preferred way)

First, we need to restrict access to the channel:

/msg ChanServ set #channel restricted on

Now, we need to add a user to the access list:

/msg chanserv vop #channel add user

What the above 2 commands do is restrict access to the channel, so only the users on the access list can join the channel, as well as add a user to the access list with voice privileges. Anybody else not on the access list who tries to join the channel will be automatically kicked and banned from this channel.
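As a sketch of how these two commands could be generated for many projects at once, the helper below turns a project-to-users mapping into the list of ChanServ messages. It assumes each project’s channel is simply ‘#’ plus the project name and that the channel is already registered; both are assumptions for illustration, and chanserv_commands is a made-up name.

```python
def chanserv_commands(project_members):
    """Build the ChanServ messages for a {project: [users]} mapping."""
    commands = []
    for project, users in sorted(project_members.items()):
        channel = "#" + project  # assumed naming convention
        # Only users on the access list may join.
        commands.append("/msg ChanServ set %s restricted on" % channel)
        # Put each project member on the access list with voice privileges.
        for user in users:
            commands.append("/msg ChanServ vop %s add %s" % (channel, user))
    return commands


if __name__ == "__main__":
    for line in chanserv_commands({"pyrtl": ["kosta", "greg"]}):
        print(line)
```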

I will not mention commands for the other methods, because they are more cumbersome. Basically, one uses the invite-only feature. This means that the channel operator sets the channel to invite-only (+i). Then, users are invited to the channel by the op. Later, when these users wish to rejoin the channel, they ask ChanServ to invite them. There are two problems with this. First, it’s not natural to ask ChanServ to invite you to a channel; most of the time, people simply /join channels. Second, if the user gets disconnected due to network problems, they will not be reconnected automatically. The last method is similar to what I’ve decided to use, but it’s more involved and produces essentially the same results. The idea is to give users who will have access to the channel an access level greater than 0, and set the NOJOIN level to 0. Thus, users not on the access list will not be able to join the channel, because their access levels are 0 or less.

So, the IRC services part is pretty clear. Now, how do we integrate this into DrProject? I think using a bot for this is ideal, but I’m not sure how to make the bot and DrProject speak to each other. Supybot is a process that runs separately from DrProject. If we use a push method, we need to somehow send messages from DrProject to Supybot, presumably via the IRC protocol. If we use a pull method, it’s not going to be real time: Supybot could periodically poll the projects and users tables and add these entities to IRC. Or, we could write a plugin and simplify some of these steps for a user. For example, allow a user to load a spreadsheet with a project-user mapping and create/restrict the respective channels. Ideas?
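For the pull method, the core of each polling round would be working out what changed since the previous poll. A minimal sketch, assuming the bot can obtain the current project-to-users mapping as a dictionary (how it reads the tables is left out, and new_memberships is a made-up name):

```python
def new_memberships(previous, current):
    """Return {project: [users]} for users added since the previous poll."""
    added = {}
    for project, users in current.items():
        fresh = set(users) - set(previous.get(project, ()))
        if fresh:
            added[project] = sorted(fresh)
    return added


if __name__ == "__main__":
    before = {"pyrtl": ["kosta"]}
    after = {"pyrtl": ["kosta", "greg"], "drproject": ["greg"]}
    print(new_memberships(before, after))
```

Only the additions would then need to be registered on the IRC side; removals could be handled the same way with the arguments swapped.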

Reading…

Last week Greg and I spoke with Gerald Penn, who is doing Natural Language Processing research at the University of Toronto, and his graduate student Xiaodan Zhu about summarizing IRC logs. Gerald pointed me to several resources on the related research. I started reading a book called “Automatic Summarization” by Inderjeet Mani, as well as several research papers on the topic of text summarization / extraction. I’m taking notes, because it’s very easy to get lost in the details. NLP combines many fields: Linguistics, Statistics, Math (mostly algebra / linear optimization), as well as Computer Science (data structures / algorithms). I have to mention that I have never worked with NLP before, so all these ideas are very much new to me. However, it’s not terribly difficult, because it’s sufficiently abstracted that anyone with some background in the above-mentioned fields can understand these ideas. The difficulty in my case arises when we are talking about summarizing IRC logs. Due to the nature of these chats, it’s very much unlike summarizing ordinary text. Usually, scientific papers are sufficiently organized that the Edmundsonian paradigm can be applied (i.e. the first and last sentences of a paragraph are probably more important than the sentences in the middle, so they should be assigned more weight based just on location). The same is not true for IRC logs. So, I always have to keep that in mind when I read about the different heuristics discussed in these papers.

When I see something new, I like to get a big picture first before I start looking at details. What follows below is a summary of what I’ve learned so far.

IRC summarization consists of the following parts:

Message segmentation
Clustering
Adjacent response pairs identification
Summary extraction

Message Segmentation

This is done using the TextTiling algorithm. It partitions the entire IRC log into multi-paragraph segments. These segments most likely have little overlap in terms of their content.

Clustering

This is the most crucial part. Basically, using Ward’s agglomerative method, we can cluster segments based on their relevance. Each cluster is represented by an L-dimensional vector of words. Each word is assigned a weight, which is (word frequency) / (document frequency). Word frequency is defined as “the number of times a word appears in the cluster” and document frequency is defined as “the number of segments the word appears in”. So, the higher the weight, the more valuable the word. For example, if a word appears a lot in one cluster and not in other clusters, then the topic discussed in this cluster was centered around this word. The next step is to start merging clusters based on the principle of minimal variance. The increase in in-cluster variance caused by a merge is a function of the number of elements in the two clusters to be merged, as well as the Euclidean distance between them (based on the vector mentioned earlier). So, if 2 clusters are very similar (i.e. they’re about the same topic), then the distance between their vectors is small (compared to other possibilities), and the marginal increase in variance is pretty small. The process of merging clusters stops when there is no remaining pair whose merge would keep the increase in in-cluster variance below a specified % threshold.
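Here is a small sketch of the two quantities described above, as I understand them. word_weights follows the (word frequency) / (document frequency) definition given in the text, and assumes every cluster segment is also part of the full list of segments. ward_merge_cost uses the standard textbook form of Ward’s criterion (size_a * size_b / (size_a + size_b) times the squared Euclidean distance between the centroids); the papers may normalize it differently, so treat it as an assumption.

```python
from collections import Counter


def word_weights(cluster_segments, all_segments):
    """Weight of each word in a cluster: word frequency / document frequency.

    Both arguments are lists of segments; each segment is a list of words.
    """
    # Number of times each word appears in the cluster.
    word_freq = Counter(w for seg in cluster_segments for w in seg)
    # Number of segments (in the whole log) each word appears in.
    doc_freq = Counter(w for seg in all_segments for w in set(seg))
    return {w: word_freq[w] / doc_freq[w] for w in word_freq}


def ward_merge_cost(size_a, centroid_a, size_b, centroid_b):
    """Increase in within-cluster sum of squares if the two clusters merge."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(centroid_a, centroid_b))
    return size_a * size_b / (size_a + size_b) * sq_dist


if __name__ == "__main__":
    segments = [["parser", "bug", "parser"], ["lunch", "bug"]]
    print(word_weights([segments[0]], segments))
    print(ward_merge_cost(1, [0.0, 0.0], 1, [3.0, 4.0]))
```

In an agglomerative loop, the pair of clusters with the smallest ward_merge_cost would be merged first, and merging would stop once even the cheapest merge exceeds the threshold.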

Adjacent response pairs identification

This is a very fuzzy area to me, which I will be trying to understand more. Basically, the idea is to identify the conversation initiator and the responses to this initiation. Since the conversation could have several threads, the process of identifying these pairs is probabilistic.

Summary extraction

Now that we’ve grouped messages at the topic and sub-topic level, extracting summaries is pretty straightforward. It looks like the Edmundsonian paradigm, or something similar to it, would work.
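To illustrate only the location idea mentioned earlier (first and last sentences get more weight), here is a tiny sketch. The weights 2.0 and 1.0 are arbitrary placeholders, and a real extractor would combine location with other features.

```python
def location_scores(paragraph_sentences, edge_weight=2.0, middle_weight=1.0):
    """Pair each sentence with a weight based only on its position."""
    last = len(paragraph_sentences) - 1
    return [
        (edge_weight if i in (0, last) else middle_weight, sentence)
        for i, sentence in enumerate(paragraph_sentences)
    ]


if __name__ == "__main__":
    for weight, sentence in location_scores(["Intro.", "Detail.", "Conclusion."]):
        print(weight, sentence)
```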

Now that I have a general idea of the steps involved in summarizing IRC logs, I will start looking at the details of each step.

Segmenting IRC logs by DrProject events

Most of the last week I spent on adding an event log to the IRC log web page, as well as adding the ability to select message blurbs. Currently, the window is hardcoded to 30 minutes around the event. So, when I click on an item in the event log, the messages within 30 minutes of the event are highlighted.
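The selection itself boils down to a timestamp comparison. A minimal sketch, assuming messages are (timestamp, nick, text) tuples and that “within 30 minutes” means 30 minutes on either side of the event; the names and the tuple layout are made up, not the actual DrProject code.

```python
from datetime import datetime, timedelta

WINDOW = timedelta(minutes=30)  # hardcoded for now, as in the post


def messages_near(event_time, messages, window=WINDOW):
    """Return the messages whose timestamp is within `window` of the event."""
    return [m for m in messages if abs(m[0] - event_time) <= window]


if __name__ == "__main__":
    event = datetime(2008, 7, 17, 21, 0)
    log = [
        (datetime(2008, 7, 17, 20, 15), "kosta", "too early"),
        (datetime(2008, 7, 17, 20, 45), "greg", "before the event"),
        (datetime(2008, 7, 17, 21, 30), "kosta", "right at the edge"),
    ]
    for stamp, nick, text in messages_near(event, log):
        print(stamp, nick, text)
```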

So Far

Here is a summary of what’s been done so far and what needs to be done in the next few weeks:

TO DATE:

Created backend Elixir code to save IRC messages and events to the database.
Created Supybot plugin to log messages to the database
Created a new DrProject component for viewing IRC logs

So, now the functionality to view IRC logs by date is done.

TO DO:

Add event log to the component, so the users can browse segments of IRC logs by events. First, I’ll create a screen mockup for this.
Add search functionality for the logs.
Add tags to conversations. The idea is to use words that don’t appear in an English dictionary (i.e. ComponentA) as tags. For example, users could then view conversation blobs around ComponentA by clicking on the tag.
This Wednesday Greg and I are meeting with Gerald Penn, who is doing research in Natural Language Processing. Hopefully he’ll point us in the right direction on any algorithms for dissecting IRC messages by subject or keywords.
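The tagging idea from the list above could start out as simply as this sketch: split a message into word-like tokens and keep those that are not in a dictionary. The dictionary is passed in as a plain set of lowercase words; where it comes from (a system word list, for example) is left open, and candidate_tags is a made-up name.

```python
import re


def candidate_tags(message, dictionary):
    """Return the tokens of a message that are not dictionary words, in order."""
    words = re.findall(r"[A-Za-z][A-Za-z0-9_]*", message)
    tags = []
    for word in words:
        if word.lower() not in dictionary and word not in tags:
            tags.append(word)
    return tags


if __name__ == "__main__":
    english = {"the", "parser", "in", "is", "broken", "again"}
    print(candidate_tags("the parser in ComponentA is broken, ComponentA again", english))
```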
