Jikes RVM GSOC 2011: July 2011

I took the liberty this weekend of improving my thread pool before I move on to the next parallel compiler optimization strategies (Multi-Method and Multi-Phase). Due to time constraints, this will most likely be my last major revision to the thread pool before the summer ends. However, do not be surprised if you still see me working on this project after the summer, since I have a hard time leaving projects unfinished or unpolished.

Here is the inheritance hierarchy of my thread pool:

The thread pool source is easier to read here: http://pastebin.com/EfqSuDqF

SystemThread           OptExecutor                 OptQueue
     |||                   |||                        |||
OptCompilerThread <--> OptCompilerThreadPool <--> OptCompilerBlocking
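
To make the worker/pool/queue split above a little more concrete, here is a minimal sketch of the same three roles in plain Java. It is not the code from the pastebin link: the class name TinyPool and its methods are made up for illustration, and it assumes java.util.concurrent's LinkedBlockingQueue is an acceptable stand-in for a hand-written blocking queue.

```java
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical illustration only; not the OptCompilerThreadPool from this project.
final class TinyPool {
    // Sentinel task: a worker that dequeues it leaves its loop.
    private static final Runnable STOP = () -> { };

    // Plays the role of the blocking queue: take() parks idle workers.
    private final BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
    private final Thread[] workers;

    TinyPool(int nThreads) {
        workers = new Thread[nThreads];
        for (int i = 0; i < nThreads; i++) {
            workers[i] = new Thread(this::workLoop, "tiny-pool-" + i);
            workers[i].start();
        }
    }

    // Plays the role of the worker thread: run tasks until STOP arrives.
    private void workLoop() {
        try {
            for (Runnable task = queue.take(); task != STOP; task = queue.take()) {
                task.run();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    // Plays the role of the executor: hand a task to whichever worker is free.
    void execute(Runnable task) throws InterruptedException {
        queue.put(task);
    }

    // Lets already queued tasks finish, then stops and joins every worker.
    void shutdownAndWait() throws InterruptedException {
        for (int i = 0; i < workers.length; i++) {
            queue.put(STOP);
        }
        for (Thread w : workers) {
            w.join();
        }
    }
}
```

Because the STOP markers are queued behind the real tasks, every task submitted before shutdownAndWait() runs to completion. A task that throws would kill its worker in this sketch, so a real pool would need to catch and report that.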

My next blog post will showcase this thread pool's new speedups with the existing multithreaded optimizations.

To create a starting point for my work, I began by profiling the execution time of each optimization in Simple.java[1], listed below. I then started multithreading SortCommutativeRegisterUses, also shown below. It is currently in a Version 1 state in terms of performance, and it sees a 2x speedup for a certain method size.

Optimizations (in Simple.java)

// Compute defList, useList, useCount fields for each register.
DefUse.computeDU(ir);
// Recompute isSSA flags
DefUse.recomputeSSA(ir);
// Simple copy propagation.
// This pass incrementally updates the register list.
copyPropagation(ir);
// Simple type propagation.
// This pass uses the register list, but doesn't modify it.
if (typeProp) {
  typePropagation(ir);
}
// Perform simple bounds-check and arraylength elimination.
// This pass incrementally updates the register list
if (foldChecks) {
  arrayPropagation(ir);
}
// Simple dead code elimination.
// This pass incrementally updates the register list
eliminateDeadInstructions(ir);
// constant folding
// This pass usually doesn't modify the DU, but
// if it does it will recompute it.
foldConstants(ir);
// Simple local expression folding respecting DU
if (ir.options.LOCAL_EXPRESSION_FOLDING && ExpressionFolding.performLocal(ir)) {
  // constant folding again
  foldConstants(ir);
}
// Try to remove conditional branches with constant operands
// If it actually constant folds a branch,
// this pass will recompute the DU
if (foldBranches) {
  simplifyConstantBranches(ir);
}
// Should we sort commutative use operands?
// The sortRegisters flag guards both the parallel and the serial path.
if (sortRegisters) {
  if (parallel) {
    parallelsortCommutativeRegisterUses(ir);
  } else {
    sortCommutativeRegisterUses(ir);
  }
}
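
For readers who want to reproduce the per-pass profiling mentioned above, one simple way to collect such timings is sketched below. This is an assumption about how it could be done, not the harness I actually used: PassTimer and spin are hypothetical names, and the lambdas stand in for real calls such as copyPropagation(ir).

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper: accumulates wall-clock time per named pass.
final class PassTimer {
    // Total nanoseconds per pass name, kept in first-seen order.
    private final Map<String, Long> nanos = new LinkedHashMap<>();

    // Runs the pass once and adds its elapsed time to the running total.
    void time(String name, Runnable pass) {
        long start = System.nanoTime();
        pass.run();
        nanos.merge(name, System.nanoTime() - start, Long::sum);
    }

    Map<String, Long> totals() {
        return nanos;
    }

    public static void main(String[] args) {
        PassTimer timer = new PassTimer();
        // Stand-ins for real passes such as copyPropagation(ir).
        timer.time("passA", () -> spin(200_000));
        timer.time("passB", () -> spin(50_000));
        timer.totals().forEach((name, ns) ->
            System.out.println(name + ": " + ns / 1_000 + " us"));
    }

    // Dummy workload so the example has something to measure.
    private static long spin(int n) {
        long s = 0;
        for (int i = 0; i < n; i++) {
            s += i;
        }
        return s;
    }
}
```

Single measurements like this are noisy inside a JIT-compiled VM, so totals accumulated over many compiled methods are more meaningful than any one reading.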

Algorithm 1:

for all threads do // i = 0, i = 1, i = 2, i = 3, etc.
  N = NumberofInstructions; // number of instructions in the current method
  for (j = ((N/4)*(i) + 1); j < ((N/4)*(i+1) + 1); j++) {
    ExecuteOptimizationCode();
  }
end

What does this mean?

Here is an explanation.

In our parallel sortCommutativeRegisterUses we need to chop the work up into one sub-problem per thread, so NumofThreads sub-problems in total.

For example, let's say the user specified 4 threads in a command-line argument.

Each thread then takes methodSize/4 instructions.

E.g., method “getValue” has 100 instructions.

So we have:

Thread 0: iterate from instruction number (100/4 * (0) + 1), incrementing by 1, while the instruction number is less than (100/4 * (1) + 1)

which is simply for (i = 1; i < 26; i++)

Thread 1: iterate from instruction number (100/4 * (1) + 1), incrementing by 1, while the instruction number is less than (100/4 * (2) + 1)

which is simply for (i = 26; i < 51; i++)

Thread 2: iterate from instruction number (100/4 * (2) + 1), incrementing by 1, while the instruction number is less than (100/4 * (3) + 1)

which is simply for (i = 51; i < 76; i++)

Thread 3: iterate from instruction number (100/4 * (3) + 1), incrementing by 1, while the instruction number is less than (100/4 * (4) + 1)

which is simply for (i = 76; i < 101; i++)
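
The same bounds can be computed for any thread count with a small helper. The sketch below is only an illustration (ChunkRanges and split are hypothetical names) and uses integer arithmetic, multiplying before dividing, instead of the floating-point cast used in my code; for the getValue example it yields the same four ranges.

```java
// Hypothetical helper: splits instructions 1..N into one half-open range per thread.
final class ChunkRanges {
    // Returns {start, end} per thread, 1-based and end-exclusive, matching the loops above.
    static int[][] split(int nInstructions, int nThreads) {
        int[][] ranges = new int[nThreads][2];
        for (int t = 0; t < nThreads; t++) {
            // Multiply before dividing so a remainder is spread across the threads.
            ranges[t][0] = (int) ((long) nInstructions * t / nThreads) + 1;
            ranges[t][1] = (int) ((long) nInstructions * (t + 1) / nThreads) + 1;
        }
        return ranges;
    }

    public static void main(String[] args) {
        // 100 instructions over 4 threads, as in the getValue example.
        for (int[] r : split(100, 4)) {
            System.out.println("for (i = " + r[0] + "; i < " + r[1] + "; i++)");
        }
    }
}
```

Each range starts where the previous one ends, so every instruction is visited by exactly one thread even when the method size is not a multiple of the thread count.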

In code:

public static Runnable parallelsortCommutativeRegisterUsesrunPass(final int threadId, final IR ir) {
  return new Runnable() {
    // Pass over instructions
    public void run() {
      int nInstructions = ir.numberInstructions(); // Total number of instructions in the current method
      // Starting position for this thread:
      // computed in floating point, then cast back down to an integer
      int i = (int)(((double)nInstructions/4)*((double)threadId) + 1);
      // Ending position (exclusive) for this thread:
      // computed in floating point, then cast back down to an integer
      int end = (int)(((double)nInstructions/4)*((double)threadId + 1) + 1);
      // Stop early if the enumerator runs out before this thread's range ends
      for (Enumeration<Instruction> e = ir.forwardInstrEnumerator(i); e.hasMoreElements() && i < end; i++) {
        Instruction s = e.nextElement();
        // Sort most frequently defined operands onto lhs
        if (Binary.conforms(s) && s.operator.isCommutative() &&
            Binary.getVal1(s).isRegister() && Binary.getVal2(s).isRegister()) {
          RegisterOperand rop1 = Binary.getVal1(s).asRegister();
          RegisterOperand rop2 = Binary.getVal2(s).asRegister();
          // Simple SSA based test
          if (rop1.register.isSSA()) {
            if (rop2.register.isSSA()) {
              // ordering is arbitrary, ignore
            } else {
              // swap
              Binary.setVal1(s, rop2);
              Binary.setVal2(s, rop1);
            }
          } else if (rop2.register.isSSA()) {
            // already have preferred ordering
          } else {
            // neither register is SSA, so place registers used more on the RHS
            // (we don't have easy access to a count of the number of definitions)
            if (rop1.register.useCount > rop2.register.useCount) {
              // swap
              Binary.setVal1(s, rop2);
              Binary.setVal2(s, rop1);
            }
          }
        }
      }
    }
  };
}

/**
 * Parallel sort of commutative use operands so that those defined most are on the lhs
 * @param ir the IR to work on
 */
private static void parallelsortCommutativeRegisterUses(IR ir) {
  OptCompilerThread thread = new OptCompilerThread(4);
  for (int i = 0; i < 4; i++) {
    try {
      thread.execute(parallelsortCommutativeRegisterUsesrunPass(i, ir));
    } catch (InterruptedException e) {
      e.printStackTrace();
    }
  }
  // Shut the pool down only after all four passes have been submitted
  thread.shutdown();
}
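
One detail worth spelling out is that the caller should not move on to the next optimization until every chunk has finished. The sketch below shows one way to get that guarantee with the standard java.util.concurrent ExecutorService instead of my own pool; ParallelChunks and sortPairs are hypothetical names, and an array of integer pairs stands in for the IR's binary instructions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical stand-in: each "instruction" is a pair of operands to be ordered.
final class ParallelChunks {
    // Orders every pair so the smaller value comes first, one contiguous chunk per thread.
    static void sortPairs(final int[][] pairs, int nThreads)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(nThreads);
        try {
            List<Callable<Void>> tasks = new ArrayList<>();
            int n = pairs.length;
            for (int t = 0; t < nThreads; t++) {
                // Disjoint, 0-based, end-exclusive range for thread t.
                final int start = (int) ((long) n * t / nThreads);
                final int end = (int) ((long) n * (t + 1) / nThreads);
                tasks.add(() -> {
                    for (int i = start; i < end; i++) {
                        if (pairs[i][0] > pairs[i][1]) {
                            int tmp = pairs[i][0];
                            pairs[i][0] = pairs[i][1];
                            pairs[i][1] = tmp;
                        }
                    }
                    return null;
                });
            }
            // invokeAll blocks until every chunk is done; get() surfaces any failure.
            for (Future<Void> f : pool.invokeAll(tasks)) {
                f.get();
            }
        } finally {
            pool.shutdown();
        }
    }
}
```

No locking is needed inside the loop because the ranges never overlap, which is the same property the instruction-range split relies on.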

Jikes RVM GSOC 2011

Saturday, July 30, 2011

New Design Multi-Method Hot Method Recompilation

Thursday, July 21, 2011

New and Improved Thread Pool

Sunday, July 3, 2011

Version 1 SortCommutative RegisterUses Multithreading 2x speed up achieved