Sunday, July 3, 2011

Version 1 SortCommutative RegisterUses Multithreading 2x speed up achieved


To help create a starting point for my work I started with profiling the execution times of each optimization in Simple.java[1] as seen below, and I started mutlithreading SortCommutativeRegisterUses as seen below and currently it is in a Version 1 state in terms of performance and it sees a 2x speedup for a certain method size.

Optimizations (in Simple.java)
  1. // Compute defList, useList, useCount fields for each register.
        DefUse.computeDU(ir);
  2.    // Recompute isSSA flags
        DefUse.recomputeSSA(ir);
  3. / Simple copy propagation.
        // This pass incrementally updates the register list.
        copyPropagation(ir);
  4.  // Simple type propagation.
        // This pass uses the register list, but doesn't modify it.
       if (typeProp) {
         typePropagation(ir);
        }
  5. // Perform simple bounds-check and arraylength elimination.
        // This pass incrementally updates the register list
        if (foldChecks) {
          arrayPropagation(ir);
        }
  6. // Simple dead code elimination.
        // This pass incrementally updates the register list
        eliminateDeadInstructions(ir);
  7. // constant folding
        // This pass usually doesn't modify the DU, but
        // if it does it will recompute it.
        foldConstants(ir);
        // Simple local expression folding respecting DU
        if (ir.options.LOCAL_EXPRESSION_FOLDING && ExpressionFolding.performLocal(ir)) {
          // constant folding again
          foldConstants(ir);
        }
  8. // Try to remove conditional branches with constant operands
        // If it actually constant folds a branch,
        // this pass will recompute the DU
        if (foldBranches) {
          simplifyConstantBranches(ir);
        }
  9. // Should we sort commutative use operand
    if (parallel) {
        if (sortRegisters) {
          parallelsortCommutativeRegisterUses(ir);
        }
    } else {
    sortCommutativeRegisterUses(ir);
    }


Algorithm 1:

for all threads do // i = 0, i =1, i=3, i=4, etc....

N = NumberofInstructions; //number of Instructions for Current Method
for (j = ((N/4)*(i) + 1); j < ((N/4)*(i+1) + 1); j++) {
ExecuteOptimizationCode();
}

end


What Does this Mean??????

Here is an explanation


In our parallel sortCommutativeRegisterUses we need chop our work up into N sub problems where N = NumofThreads


For example lets say the user specified 4 threads in a command Line argument
We need to take each methodSize/4 for each thread.


EG. Method “getValue” has 100 Instructions

So we have:

Thread 0 : Iterate on Instruction number (100/4 * (0) + 1) increment by 1 where Instruction number is less than (100./4 * (1) + 1)

which is saying for(i = 1; i < 26; i++) in plain English

Thread 1 : Iterate on Instruction number (100/4 * (1) + 1) increment by 1 where Instruction number is less than (100./4 * (2) + 1)

which is saying for(i = 26; i < 51; i++) in plain English

Thread 2 : Iterate on Instruction number (100/4 * (2) + 1) increment by 1 where Instruction number is less than (100./4 * (3) + 1)

which is saying for(i = 51; i < 76; i++) in plain English

Thread 3 : Iterate on Instruction number (100/4 * (3) + 1) increment by 1 where Instruction number is less than (100./4 * (4) + 1)

which is saying for(i = 76; i < 101; i++) in plain English


IN Code:


public static Runnable parallelsortCommutativeRegisterUsesrunPass(final int threadId, final IR ir) {
return new Runnable() {
// Pass over instructions
public void run() {
int nInstructions = ir.numberInstructions(); //Total number Instructions in current Method
int i = (int)(((double)nInstructions/4)*((double)threadId) + 1); //Calculate starting position in each Thread
//starts with floating point then casts back down to Integer
int end = (int)(((double)nInstructions/4)*((double)threadId + 1) + 1);//Calculate ending position in each Thread
//starts with floating point then casts back down to Integer
for (Enumeration<Instruction> e = ir.forwardInstrEnumerator(i); i < end; i++) {
Instruction s = e.nextElement();
// Sort most frequently defined operands onto lhs
if (Binary.conforms(s) && s.operator.isCommutative() &&
Binary.getVal1(s).isRegister() && Binary.getVal2(s).isRegister()) {
RegisterOperand rop1 = Binary.getVal1(s).asRegister();
RegisterOperand rop2 = Binary.getVal2(s).asRegister();
// Simple SSA based test
if (rop1.register.isSSA()) {
if (rop2.register.isSSA()) {
// ordering is arbitrary, ignore
} else {
// swap
Binary.setVal1(s, rop2);
Binary.setVal2(s, rop1);
}
} else if (rop2.register.isSSA()) {
// already have prefered ordering
} else {
// neither registers are SSA so place registers used more on the RHS
// (we don't have easy access to a count of the number of definitions)
if (rop1.register.useCount > rop2.register.useCount) {
// swap
Binary.setVal1(s, rop2);
Binary.setVal2(s, rop1);
}
}
}
}
}
};
}




/**
* Parallel Sort commutative use operands so that those defined most are on the lhs
*
* @param ir the IR to work on
*/
private static void parallelsortCommutativeRegisterUses(IR ir) {
OptCompilerThread thread = new OptCompilerThread(4);
for (int i = 0; i < 4; i++) {
try {
thread.execute(parallelsortCommutativeRegisterUsesrunPass(i,ir));
} catch (InterruptedException e) {
e.printStackTrace();
}
}
thread.shutdown();
}




 

No comments:

Post a Comment