Friday, December 24, 2010

LLVM IR Builder in Object Pascal

I'm about to graduate from my university (January 2011 if there's no more problems), and as a final assigment (though it's optional in my faculty, but it would be a great experience and honor to have one) I choose to implement a compiler. One of our labs, Formal Method in Software Engineering (or simple FMSE), is the lab where my supervisor gets involved. Therefore, the compiler I'm writing would probably based on a project they're (or have been) working on. Yep, I was given LinguSQL language to implement.

Previously, the language had a compiler, that generates Java code (bad choice IMO) which is then compiled by a Java compiler. The problem with this approach, as in normal Java application, is the HUGE runtime environment that must be distributed if someone wants to use the application. Furthermore, since Java uses interpreted bytecode (don't count GCJ, I even believe less than 10 persons in my campus know that thing exists), the performance is at maximum only 1/3 of native binaries (someone in OSDEV forum ever said). Last but not least, the execution isn't trivial. One must type "java xxx" in order to execute the program.

As a native application (deve)lov(|p)er (read: developer and lover), I decided to implement a compiler for this language that generates native binaries. However, I don't have enough experience in generating native binaries (or at least native assembly). So, remembering an option, I asked my supervisor what if the compiler generates LLVM assembly language (also called LLVM Intermediate Representation or IR)? Do you what he said? "What is LLVM?" (doh). OK, so after bla bla bla, he accepted my choice. The advantages of generating LLVM assembly instead of native assembly are:
  1. A LOT of optimizations for free
  2. It can be compiled to native assembly for MANY platforms
  3. Easy integration with existing libraries
Despite those advantages, there are also disadvantages:
  1. It uses SSA format
  2. Written in C++, more specifically, it officially only supports G++! (there are some hacks to use MSVC but... still it's not official)
  3. Most important one: it doesn't have Object Pascal frontend
The SSA format is not actually a disadvantage, but it's just a little harder to generate code for. But the last one is really a show stopper... or a challenge depending on how you look at it :)

So the work begins. I continue the previous research, the previous Java based compiler that generates Java code uses JavaCC, followed by JavaCUP, and finally UUAG. The first two are parser generators, with some differences, mainly JavaCC generates LL parsers, while JavaCUP generates LALR one. Both are BAD. I always find parser generators are bad since we have no idea whether it's correct or not, and the grammar can't be deduced from the code (except for recursive descent parser generators like Coco/R). The last one is a Haskell based product, which actually runs like recursive descent parser, only in functional languages they're called parser combinators. The last one is quite good, with one important problem: when a parsing error happens, the parser tries to find all possible corrections, therefore slows down the parsing and eats resources. This behavior can't be customized easily and that's what makes me writing the whole thing from scratch using classic approach: a true recursive descent parser. This is the best parser I've ever learned, since it's the most flexible one (there are tons of way to handle parsing error and that's totally up to you, with many methods possibly combined or used specifically for certain productions) and still shows the grammar in its code.

Come to the code generation part, the problem I stated above must be covered. I create my own LLVM IR Builder to generate LLVM assembly language. Due to the SSA structure, it's a bit difficult, but I managed to create it quite successful with beautiful modular architecture. It can now generate modules consisting of functions and global variables, where each functions can have local variables, labels (for branch and loop), arithmetic instructions, memory instructions, etc. It's not yet complete, but already capable of generating simple programs. I'll put it in my bitbucket account when I think it's quite production ready.

Wants some code? OK:

program llvmirbuildertest;

{$mode objfpc}{$H+}

uses
  llvmirbuilder;

var
  x,y,l,s,a,b: TLLVMSymbol;
  c: TLLVMConstant;
  cl: TLLVMCallInstruction;
begin
  x := TLLVMSymbol.Create('x',lltInteger,true);
  y := TLLVMSymbol.Create('y',lltInteger);
  c := TLLVMConstant.Create('255',lltInteger);
  l := TLLVMLoadInstruction.Create('tmp',lltInteger,x);
  s := TLLVMStoreInstruction.Create('tmp',lltInteger,y,x);
  cl:= TLLVMCallInstruction.Create('func',lltInteger);
  a := TLLVMAddInstruction.Create('a',lltInteger,x,c);
  b := TLLVMSubInstruction.Create('b',lltInteger,c,y);
  WriteLn(l.GenerateCode);
  WriteLn(s.GenerateCode);
  WriteLn(cl.GenerateCode);
  WriteLn(a.GenerateCode);
  WriteLn(b.GenerateCode);
  a.Free;
  b.Free;
  cl.Free;
  s.Free;
  l.Free;
  c.Free;
  y.Free;
  x.Free;
end.
and the generated LLVM IR:
%tmp = load i32 * @x
store i32 %y, i32 * @x
call i32 @func()
%a = add i32 @x, 255
%b = sub i32 255, %y
Note that it's a partial code, so compiling this with llvm-as would absolutely produce an error.

Saturday, December 18, 2010

Using Google Maps service in a Pascal based Web Application

My final team assignment of Information Retrieval (IR) class allows us to create any kind of IR system, and we decided to create a food ingredients and restaurant search system. It displays ingredients for a chosen food and then displays 10 top restaurants that provide the food. For the first one, it's simply a database approach. But the latter is a Geographic IR (GIR). Instead of creating it from scratch (which could take probably a whole life), we decided to use google maps service.

Due to the requirement that the system must be accessible from web, it's splitted into two parts: client and server. The client part is coded by my friend and the server part is mine. The client part is coded with ExtJS and we use JSON to communicate between the client and server. Since the server is a Service Oriented Application (SOA), it's no problem what language it's written in. And since I'm a Pascal geek (you can call me maniac if you want), I choose to write it in Pascal.

To create the server, I use fpWeb components available from standard Lazarus distribution. The server itself connects to Google Maps service to look for the coordinates. Hmm... how can the server do this? Luckily, there are two powerful networking components and libraries for Pascal, namely lNet and Synapse. I used to lNet due to its component based approach, however after struggling a bit, lNet seems to leak some features I need (esp. proxy support because my campus' internet access is protected by that). Then another Pascalian told me that it's very easy to do with Synapse. And so it goes:

with THTTPSend.Create do
  try
    ProxyHost := ConfigFile.ReadString('proxy','host','');
    ProxyPort := ConfigFile.ReadString('proxy','port','');
      URL := Format('http://maps.google.com/maps/geo?q=%s&output=json&oe=utf8&sensor=true' +
        '&key=%s',[QueryStr,GoogleAPIKey]);
      if not HTTPMethod('GET',URL) then begin
        AResponse.Content := Format('Error %d: %s',[ResultCode,ResultString]);
      end else begin
        AResponse.Contents.LoadFromStream(Document);
      end;
      AppendToLog(Headers.Text + AResponse.Content);
  finally
      Free;
    end;
  Handled := true;

The server uses TIniFile (as ConfigFile in above code) to store and retrieve proxy information so it can be adjusted without recompilation. QueryStr is the string that we want Google Maps to look for and GoogleAPIKey is the... err.. API key to access Google Maps. FYI, I'm using version 2 of the API which still requires a key, version 3 doesn't need it anymore.

It doesn't yet show anything useful, only the original JSON from Google Maps. Later, this JSON must be processed and merged with database result, and then sent to the client to be processed further. But after all... it's done in Pascal :)