Tag Archives: research

So What’s New?

It’s been a few weeks since I last updated this with a real post, so here it is.

I finished my research for the summer a few weeks back. I’m probably going to continue it again once the school year starts, but I decided I wanted at least some time to work on different things during the summer as well as catch a break. At the time that I ended, most of the major tasks assigned to me had been completed. That is, all but one program module had be written and polished and most of it had been used to generate at least first-run data. I can’t go very much into the details about what the modules did exactly, but the broad concepts are the same: write software that parses the millions of lines of log data efficiently and offers a way to gain greater insight into it

Despite the fact that some of this code will only be run once, I decided to nevertheless invest the time into making it nice and modular, etc. This is a time when I truly saw the importance of good code design. Code that I had written at the beginning of my research period, about two months back, really does look alien if you’ve lost touch with it. Among the people I was working with, I started first, so there was no code to start off from. I ended up making up my own standards for how to run certain modules – a certain combination of command-line flags and stdin/stdout redirects and all. In this process, of course a lot changed and later parameter paradigms greatly differed from the early ones. At some point, I realized that it was a major pain to remember how to run each program so rewrote lots of my previous code to accommodate for that. For example, there was a certain little script that I used to run the same program on many log files over a period of time. This little script started off first as an inline bit of Bash fu that I could start in the background and let it run through the day and into the night if necessary. It then evolved into a script of its own, became more standardized to accept the executable and as my knowledge of Bash improved, was written more efficiently. In fact, it even went on to become so commonly used that I included it in nearly module I wrote. The same goes for implementations of commonly used data structures – these were soon put in a common lib/ folder and linked to from various modules. Even commonly used functions that mapped to all elements of those data structures were placed in a specials utils.c file. Another useful little macro directive that I used to help with (possibly careless, irresponsible, and brash) memory management was a wrapper on GNU free that was double-free safe. So, every single free() call was replaced with nxfree() with the following implementation:

#include <stdlib.h>

 * Convenient wrapper for nfree().
#define nxfree(s) nfree((void**)&s)

 * Wrapper for GNU free. Checks for NULL and sets pointer to NULL
 * after freeing. Avoids accidental double frees.
#define nfree(p) {if(*(p)) free(*(p)); *(p) = NULL;}

So, pretty simple, but useful. I ended up using many such little snippets of code to make my life easier. In addition, I also rewrote my original Makefiles to be more modular. I ended up creating a standard template for all Makefiles, with only trivial changes to customize it for the module in question. The template is generic enough for use pretty much anywhere gcc works and is below:

CC = gcc
CFLAGS = -Wall
CDBGFLAGS = -Wall -g

LIBOBJS = ../../lib/*.o   #Shared Library Code Path

TARGET = <Release Target Name>
DBGTARGET = <Debug Target Name>

OBJS = <All Source Files Names, with .o Extensions>
DBGOBJS = <All Source Files Names, with Dbg.o Extensions>

        $(CC) -o $@ $^ $(CFLAGS) $(LIBS)

        $(CC) -o $@ $^ $(CDBGFLAGS) $(LIBS)

%.o : %.c
        $(CC) -c -o $@ $< $(CFLAGS)

%Dbg.o : %.c
        $(CC) -c -o $@ $< $(CDBGFLAGS)

all: $(TARGET)

debug : $(DBGTARGET)

        rm -f $(TARGET) $(DBGTARGET) $(OBJS) $(DBGOBJS)

        make clean && rm -f *~

Now after research was done, I got a bit more time to work on the Mozilla Crowdsourcing project I’ve mentioned before. We completed Phase 1 a few weeks back (check the Mozilla Labs’ blog for all sub-teams findings and here for just my sub-team’s). After that, we got started on Phase 2, which consists of designing a Design Challenge using our findings that took into account the various things we picked up along the way. The final result of the Crowdsourcing team’s designs will eventually replace the current model of the Lab’s Design Challenges. Last Sunday was the deadline for that and our sub-team had a pretty concrete set of guidelines. These will soon be written up and fleshed out into a blog post. Once this stage is complete, we will take our three sub-teams’ designs and combine them into one, which will be our official proposal. Then come the actual implementation of the proposal, which may include some actual development and/or extension of existing software. I’ll speak more of this as it comes up in the next few weeks.

Apart from that, I’m also just doing my usual things, such as testings and bug reporting for Minefield (Firefox Panorama officially ROCKS) and Chrome. I also decided on one Saturday to re-root my Droid from scratch since a couple things needed fixing. I was unable to install updates for certain applications such as Google Maps, doubleTwist, WordPress, and Facebook for Android. And Chrome-to-Phone was not working correctly either. In addition to the fact that my phone was still feeling rather sluggish. So, after many hours of figuring out what the hell was going on, I decided to format everything, install ClockworkMod Recovery, and flash a custom rooted Froyo ROM. This was actually the FRG01B build of Froyo, which replaced the earlier update that I had previously installed on my Droid. I also got SetCPU (I’m an xda-developers member, so I got the APK from there – if I continue using it, I’m going to consider donating to the developer). So far, my experience with that has been pretty much flawless. I set up a few failsafe profiles just to prevent spontaneous combustion of my device and the homescreen widget works great. I used a 1GHz kernel again and am currently running at a max of 1000MHz and a min of 500MHz on “performance” mode. My CPU temperature usually hovers around 33 to 35 degrees C, which is pretty normal and the phone is stable. Obviously, it feels a lot more snappy with the overclocked kernel and runs pretty smoothly on most things I’ve thrown at it. I’m pretty happy with that. There actually is another minor upgrade that was released after FRG01B that basically just enabled the download of the official Adobe Flash 10.1. Except that that build hasn’t been rooted yet and I side-loaded official 10.1 APK anyway, so I chose not to upgrade for now at least. From all accounts, the newer version is the last of the Froyo upgrade process for Droid and does NOT add any new features.

Well that’s what’s keeping me busy right now. I’ve got about two weeks before college starts again, so I’m prepping myself as necessary for that. I expect this year to be pretty challenging again, but at least I know I will enjoy the classes even more now, having had more flexibility in choosing them. In the meantime, I think I’ll finally take a look at Google Code University and see what they offer – been wanting to do that for quite a while.

Leave a comment

Posted by on September 1, 2010 in Uncategorized


Tags: , , , , , , , , , , , , ,

A Week Off

So I’m going to be taking next week off just so I have some more time to relax and refocus. I’ve felt that I never really got a chance to properly recharge between the end of finals and the start of my summer work. As I’ve said before, I just want some time free to think. More than anything, I enjoy having the time to do nothing in particular and not feel guilty about it either. My objectives for the week? I’d rather not specify exactly, but mostly I’m going to do lots of reading, some coding, some testing of Minefield, and clearing through all those links I’ve saved for reading at a later time.

I think now is also a good time to give an update on my research as I’ve completed about half of it or so I guess. So basically, the goal is to discuss the appliability of and to design a better resource allocation algorithm in a distributed computing system. Clearly the main problem is to figure out not only how to evenly distribute available resources in order to fully utilize the capabilities of the system, but also, on a local level, run each program as fast as possible. To accomplish this, it is not good enough to just give everyone an equal share – rather, you have to give each program the combination of resources that best fit its needs. Along a related thread of thought, we see as a natural consequence of this non-trivial allocation scheme that there exists some parts of the system – some nodes – are more heavily utilized, or more “popular.” But what is it that makes it so? These are the questions we are seeking to answer. Our approach is to look at the past for clues. So, we’re digging through many years’ worth of log files of usage in order to see how the popularity of nodes has evolved in the past. We are testing out several complex models of popularity metrics and seeing how well they match the data.

There are three of us working on this project and each of us are focusing on different aspects. For my part, I’ve been focusing specifically on how to extract pertinent data from the terabytes of textual log files in acceptable speed and how to best automate the task of actually finding relationships. I’ve decided to break up my work into many small programs that do different things and are tied together by scripts that automate as much as possible. Apart from modularity being good style, it also preserves sanity since this stuff can get pretty damn hairy.

There was one specific aspect that caught my interest in particular. The problem was tactfully posed by my professor as follows: let’s say we want to see how the temperature of water in a swimming pool and the height of the dive is related to how much pain the diver feels. Holding the temperature constant, we expect higher dives to hurt more but if the height isn’t too much, the relation may be weak and depend more on how you dive, etc. But conversely, how does temperature make a difference? At first glance, there’s no appreciable trend for a small range of temperature. But cross 32 degrees F and it’s a different ballgame. We strongly expect the height of the dive to affect pain felt and for it to hurt a lot more!

In this same, consider values of two seemingly independent data fields, one being the synthetic field of popularity and the other being a more tangible entity that deals with actual hardware. We seek to automatically find the ranges of the data that yield strong correlations.

In mathematical language, we seek to automatically perform a segmented regression on the data. Our final regression may be piecewise-linear, some crazy high order interpolated polynomial, or even non-parametric; but for our needs, this is at present not our concern. Thus, I chose to only look at the correlation coefficients. But there’s more than just one type: we are all taught Pearson’s coefficient, but other popular ones include Kendall’s Tau and Spearman’s Rho. Currently I’ve just implemented Pearson’s but this will almost certainly be replaced by Kendall’s. Why? Because linearity is not as importance as dependence in general. Having decided upon this, it was time to next decide how I was going to implement this in code. The main challenge was to automatically determine the relevant breakpoints that would yield the maximum correlations. There is a good deal of literature on this but in my preliminary search I only found papers that were either very marginally related to what I was doing or just too complicated, so I decided to implement a simple algorithm on my own.

I was writing this in C and so I grumbled about how I didn’t have my usual built-in data structures as in Java (grumble grumble) but I did have a decent ArrayList and HashMap implementation I could use (I miss my TreeMaps everyday…). Basically there are two parameters we wish to maximize: the correlation and the length of the segment. This need arises since a segment of one or two points will de facto have a correlation of unity and our data is certainly non-uniform. So, I thought of some kind of learning algorithm to accomplish the task and finally ended up with a simple algorithm that runs in ~2*N time (where ~ denotes tilde notation). Basically I start with some initial set of breakpoints evenly distributed along one axis. Then, I run a loop wherein I compare correlations of adjacent blocks. If their weighted sum of correlation and number of points in the block exceeds either block individually, I remove the middle breakpoint and coalesce the blocks. I recursively do this through all the blocks and then do the exact same thing in the reverse direction.

Now I have a set of unequal length blocks. I then copy all the blocks’ information into my ArrayList and sort by a custom comparator that compares the weight of the block, defined by: X*abs(correlation)+Y*(nelem_in_block), where X and Y are some constants.

In practice, this approach works reasonably well when the number of initial breakpoints is large enough. Besides, since we are only looking for heuristics, this does the job.

Now that that’s all done, I ran some tests on the data and the results are looking close to what we expected. When I get back, I’m going to tackle the task of tweaking and reworking the popularity metric to get even finer results.

But for right now, it is time to kick back and relax. Awesome.


Posted by on July 9, 2010 in Uncategorized


Tags: , , , , ,

Quick Update on Summer

I’ll write more later on but I just wanted to give a quick update on how my summer has been going. So I’ve started my research at Princeton about three and a half weeks ago and so far it’s going pretty well. Mostly it’s a lot of data analysis work and a fair bit of programming. That’s a very good thing. Honestly the more I think about it the more I realize that programming is what I love more than anything else. I’m effectively thinking about programming all the time anyway, always trying to look at the world through the lens of computer science (of what little I know of it) and so to have the opportunity to actually do it for the summer is quite cool. I’m working on servers running a variety of 64-bit Linux flavors, most of which have at least four cores and 8 gigs of RAM. The computational capacity of them is really amazing and the ability to run in parallel at acceptable speeds makes life easier. I’m extremely happy with the tools I’m using – straight manual Makefiles, gcc, gdb, and emacs/vim. As far as code is concerned, it’s mostly C, some Java, and very often Bash scripts with occasional AWK or Perl for quick things. For some reason I’ve found this setup more easy to work with than an actual high-powered IDE. Weird, perhaps, but I sure hope wherever I work in the future I will be afforded this same luxury.

The more I work with Linux, the more advantages I see. I was comfortable with Bash coming in this summer and I’ve learned even more after working exclusively with it. Linux feels like it was designed solely for the purpose of making a programmer’s easier and its done a great job at that. I’ve learned a lot more of Bash scripting now and I’m always amazed at how powerful it is. It’s an extremely useful language to know. For source control, I pretty much had the choice of deciding what I wanted to use and I chose SVN although I think I should’ve perhaps gone with Git. It’s been many months since I’ve used SVN but it was very easy to catch on again. It also forces me to be organized with my code and be more standardized so I’m not going to complain :).

For collaboration, I suggested we use Google Wave to keep a running to-do list and so far it’s working beautifully. It really is the best way to keep track of ideas (although I still maintain an ongoing to-do list on the whiteboard and on my Google Tasks because I’m weird like that). Checking off completed items is pretty damn thrilling – I love the feeling of making progress.

Apart from working on my current research, I’m also doing some light QA/Testing for Mozilla. I’m on the nightly builds (Minefield) on my Linux and Windows 7 partitions (though I use (Linux almost always). But I think that’s helpful here because there are far fewer Linux testers anyway. Mostly it’s just running Litmus tests, benchmarks, and giving feedback on UI/UX changes as they appear and/or break :). It’s a cool feeling to get daily updates and check out what new patches were made and to see the design process through the eyes of the developers. I’m also running on the dev branch of Google Chrome and there too it is interesting to witness the evolution of the interface and behavior. One day I imagine myself actually a developer at one of these companies, pushing out updates to millions of users… that’s something I like to think about everyday. I know it’s silly but I get a kick out of pretending that every commit I make has some tremendous influence. Maybe one day it will be so and all this is good preparation for that I guess :).

This summer my main goals are to of course continue my research but also to learn Python and master C++ properly. I have a somewhat adequate knowledge of C++ right now but I never learned it formally – I want to take this summer to actually learning the finer details of the language. Being that many, many applications are written in that language and that it’s widely the most popular language for programming contests too, it is certainly worth learning. As far as Python, I know that I won’t regret learning that. Seeing how much I’ve come to respect the power of quick and dirty Bash scripts, I know I’ll love the even greater power, flexibility, and ease of use of Python. A few of my friends are learning it too this summer, some for whom it is their first language and this further motivates me to learn it.

I’m going to be taking next week off from research for precisely these reasons among others. Princeton has really tired me out a great deal and I need some time to rest and regroup. I hope to take it a lot easier and to get started on learning those languages. I’ve also got to continue work on the Princeton Math Club blog. For that, I’m thinking of implementing the new Post Type feature that WordPress 3.0 brings. And for that, I need to polish up some PHP skills and get to work hacking away at the current theme. Looks rather straightforward so it shouldn’t take too long to do.

Okay so this post is freaking long and not quick at all. It also took quite a while to type out since I wrote it from the WordPress app on my Droid. More on that in an upcoming post. But for now, that’s about it.

1 Comment

Posted by on July 7, 2010 in Uncategorized


Tags: , , , , , , , , ,

Summer Research!

Big news! I finally got an offer to do research this summer in Computer Science with one of my professors! =] I also accepted the offer this morning, so my summer plans are pretty much finalized! It honestly feels like a huge weight has been lifted from my chest because I have spent the past several months looking for something to do this summer. My professor was kind enough to spend over a half hour explaining everything to me and showing me what I will be working on. I can’t say too much (I still have a lot to read up on!) but it’s related to the resource allocation problem in a distributed computing environment. So that means I’ll be spending most of my summer here at Princeton. It’s pretty much the highlight of my week and I can’t stop smiling about it :). I know it’s gonna be really challenging and all, but that’s perfectly fine by me. I am eager to begin this challenge and welcome the next stage of my life. I know that I’ll learn a whole lot; much more than what I could pick up just by reading a textbook. This is pretty much my first stab at doing formal research, so I’m really quite nervous. I hope all goes well!

Oh and I totally messed up on physics 😦 But it’s alright, the points I lost were almost exclusively due to minor errors in computation or slight misinterpretations; most of what I wrote was right (although I should have perhaps completed a greater proportion of the exam..). I’m not too worried about grades at this point so I know it’s gonna be okay.

As I get more information about my summer research, I’ll keep this blog updated. My hope is that it could be helpful down the line for someone in my same place (I was completely clueless on how to go about doing it, and am only marginally less clueless, but hell it could help).

Leave a comment

Posted by on April 1, 2010 in Uncategorized


Tags: , , , ,