- Boot[1] – The main advantage to multi-threading the boot is because in the Jikes rvm the Boot code compiles around methods 100 every time. The beauty in this multi-threaded optimization is the speedup achieved by this optimization is independent of the benchmark or program the user is running. Unfortunately this optimization is still in a pre version 0 phase however I hope to polish it up in the near future as continued post GSOC work. Hopefully we can get speed up from the multithreading with no overhead from the threading.
Speed up graphs will be posted in the future
Jikes RVM GSOC 2011
Saturday, August 20, 2011
Parallel Boot
Parallel Precompile
Pre Compile[1] – Pre Compile's core method that controls compilation is embarrassingly parallel because it uses a For loop to iterate over each Method. Currently due to time constraints of Google Summer the multi-threaded PreCompile method is in an non optimized version 0 state and is currently seeing overhead but the overhead is minimal because it is causing neither a speed up nor a slow down in other words the running time is the same. However, that can be easily fixed by sending groups of data onto a thread at once preferably 1/2 methods for thread one and 1/2 methods for thread two on a two thread system which is an optimization that was used to help improve multi-threaded Single-Method[2,3,4,5,6].
Command to run and generate advice file: ./rvm -X:aos:enable_advice_generation=true -cp dacapo-2006-10-MR2.jar Harness xalan
Command to run and precompile advice file with N threads: ./rvm -X:aos:enable_precompile=true -X:aos:pre_compile_threads=2 -X:aos:cafi=aosadvice.ca -cp dacapo-2006-10-MR2.jar Harness xalan
One task I am going to work on post GSOC is optimizing this code to achieve speed up and polishing up this code to make it easy for the jikes community to use this code if they chose to.
One task I am going to work on post GSOC is optimizing this code to achieve speed up and polishing up this code to make it easy for the jikes community to use this code if they chose to.
Saturday, July 30, 2011
New Design MutliMethod Hot Method Recompilation
As part of my GSOC I proposed 3 parallel compilation strategies single-method, multi-method, multi-phase.
Single-method is done however it is only relevant for large methods and would easily be overshadowed by multi-phase in a match up. However, multi-method is relevant for all cases and even our worst case benchmarks show speedup.
Here are the results numerically
2 Threads
Best Case LU: 816.5870
AVG Case LU: 795.67
Original
Best Case LU : 805.98
AVG Case LU: 793.05
2 Threads
for I in {1..90}; do ./rvm -X:aos:num_threads=2 -cp scimark2lib.jar jnt.scimark2.commandline; done
Single-method is done however it is only relevant for large methods and would easily be overshadowed by multi-phase in a match up. However, multi-method is relevant for all cases and even our worst case benchmarks show speedup.
Here are the results numerically
2 Threads
Best Case LU: 816.5870
AVG Case LU: 795.67
Original
Best Case LU : 805.98
AVG Case LU: 793.05
2 Threads
for I in {1..90}; do ./rvm -X:aos:num_threads=2 -cp scimark2lib.jar jnt.scimark2.commandline; done
Thursday, July 21, 2011
New and Improved Thread Pool
I took the liberty this weekend to improve my thread pool before I move onto the next parallel compiler optimization strategy. (Multi-Method and Multi-Phase) Due to time constraints this will most likely be my last major revision to the thread pool before the summer ends. However, do not be surprised if you still see me working on this project after the summer ends since I have a hard time letting projects go unfinished or unpolished.
Here is the inheritance hierarchy of my thread pool
Thread Pool URL Easier to read http://pastebin.com/EfqSuDqF
SystemThread OptExecutor OptQueue
||| ||| |||
OptCompilerThread <-->OptCompilerThreadPool<--> OptCompilerBlocking
My Next Blog Post will showcase this ThreadPools New speedups with the existing multithreaded Optimizations
Here is the inheritance hierarchy of my thread pool
Thread Pool URL Easier to read http://pastebin.com/EfqSuDqF
SystemThread OptExecutor OptQueue
||| ||| |||
OptCompilerThread <-->OptCompilerThreadPool<--> OptCompilerBlocking
My Next Blog Post will showcase this ThreadPools New speedups with the existing multithreaded Optimizations
Sunday, July 3, 2011
Version 1 SortCommutative RegisterUses Multithreading 2x speed up achieved
To help create a starting point for my work I started with profiling the execution times of each optimization in Simple.java[1] as seen below, and I started mutlithreading SortCommutativeRegisterUses as seen below and currently it is in a Version 1 state in terms of performance and it sees a 2x speedup for a certain method size.
Optimizations (in Simple.java)
- // Compute defList, useList, useCount fields for each register.
DefUse.computeDU(ir); - // Recompute isSSA flags
DefUse.recomputeSSA(ir); - / Simple copy propagation.
// This pass incrementally updates the register list.
copyPropagation(ir); - // Simple type propagation.
// This pass uses the register list, but doesn't modify it.
if (typeProp) {
typePropagation(ir);
} - // Perform simple bounds-check and arraylength elimination.
// This pass incrementally updates the register list
if (foldChecks) {
arrayPropagation(ir);
} - // Simple dead code elimination.
// This pass incrementally updates the register list
eliminateDeadInstructions(ir); - // constant folding
// This pass usually doesn't modify the DU, but
// if it does it will recompute it.
foldConstants(ir);
// Simple local expression folding respecting DU
if (ir.options.LOCAL_EXPRESSION_FOLDING && ExpressionFolding.performLocal(ir)) {
// constant folding again
foldConstants(ir);
} - // Try to remove conditional branches with constant operands
// If it actually constant folds a branch,
// this pass will recompute the DU
if (foldBranches) {
simplifyConstantBranches(ir);
} - // Should we sort commutative use operand
if (parallel) {
if (sortRegisters) {
parallelsortCommutativeRegisterUses(ir);
}
} else {
sortCommutativeRegisterUses(ir);
}
Algorithm 1:
for all threads do // i = 0, i =1, i=3, i=4, etc....
N = NumberofInstructions; //number of Instructions for Current Method
for (j = ((N/4)*(i) + 1); j < ((N/4)*(i+1) + 1); j++) {
ExecuteOptimizationCode();
}
end
What Does this Mean??????
Here is an explanation
In our parallel sortCommutativeRegisterUses we need chop our work up into N sub problems where N = NumofThreads
For example lets say the user specified 4 threads in a command Line argument
We need to take each methodSize/4 for each thread.
EG. Method “getValue” has 100 Instructions
So we have:
Thread 0 : Iterate on Instruction number (100/4 * (0) + 1) increment by 1 where Instruction number is less than (100./4 * (1) + 1)
which is saying for(i = 1; i < 26; i++) in plain English
Thread 1 : Iterate on Instruction number (100/4 * (1) + 1) increment by 1 where Instruction number is less than (100./4 * (2) + 1)
which is saying for(i = 26; i < 51; i++) in plain English
Thread 2 : Iterate on Instruction number (100/4 * (2) + 1) increment by 1 where Instruction number is less than (100./4 * (3) + 1)
which is saying for(i = 51; i < 76; i++) in plain English
Thread 3 : Iterate on Instruction number (100/4 * (3) + 1) increment by 1 where Instruction number is less than (100./4 * (4) + 1)
which is saying for(i = 76; i < 101; i++) in plain English
IN Code:
public static Runnable parallelsortCommutativeRegisterUsesrunPass(final int threadId, final IR ir) {
return new Runnable() {
// Pass over instructions
public void run() {
int nInstructions = ir.numberInstructions(); //Total number Instructions in current Method
int i = (int)(((double)nInstructions/4)*((double)threadId) + 1); //Calculate starting position in each Thread
//starts with floating point then casts back down to Integer
int end = (int)(((double)nInstructions/4)*((double)threadId + 1) + 1);//Calculate ending position in each Thread
//starts with floating point then casts back down to Integer
for (Enumeration<Instruction> e = ir.forwardInstrEnumerator(i); i < end; i++) {
Instruction s = e.nextElement();
// Sort most frequently defined operands onto lhs
if (Binary.conforms(s) && s.operator.isCommutative() &&
Binary.getVal1(s).isRegister() && Binary.getVal2(s).isRegister()) {
RegisterOperand rop1 = Binary.getVal1(s).asRegister();
RegisterOperand rop2 = Binary.getVal2(s).asRegister();
// Simple SSA based test
if (rop1.register.isSSA()) {
if (rop2.register.isSSA()) {
// ordering is arbitrary, ignore
} else {
// swap
Binary.setVal1(s, rop2);
Binary.setVal2(s, rop1);
}
} else if (rop2.register.isSSA()) {
// already have prefered ordering
} else {
// neither registers are SSA so place registers used more on the RHS
// (we don't have easy access to a count of the number of definitions)
if (rop1.register.useCount > rop2.register.useCount) {
// swap
Binary.setVal1(s, rop2);
Binary.setVal2(s, rop1);
}
}
}
}
}
};
}
/**
* Parallel Sort commutative use operands so that those defined most are on the lhs
*
* @param ir the IR to work on
*/
private static void parallelsortCommutativeRegisterUses(IR ir) {
OptCompilerThread thread = new OptCompilerThread(4);
for (int i = 0; i < 4; i++) {
try {
thread.execute(parallelsortCommutativeRegisterUsesrunPass(i,ir));
} catch (InterruptedException e) {
e.printStackTrace();
}
}
thread.shutdown();
}
Thursday, June 16, 2011
Version 0 SortCommutativeRegisterUses Optimization Running Time / Speedups
To help create a starting point for my work I started with profiling the execution times of each optimization in Simple.java[1] as seen below,
and I started mutlithreading SortCommutativeRegisterUses as seen below.
Optimizations (in Simple.java)
In this code we multithread SortCommuativeRegisterUses optimization by forking a thread per Instruction clearly the granularity is to fine and we need to optimize the algorithm.
./rvm -X:aos:initial_compiler=opt -classpath dacapo-2006-10-MR2.jar Harness fop
Running Time
and I started mutlithreading SortCommutativeRegisterUses as seen below.
Optimizations (in Simple.java)
- // Compute defList, useList, useCount fields for each register.
DefUse.computeDU(ir); - // Recompute isSSA flags
DefUse.recomputeSSA(ir); - / Simple copy propagation.
// This pass incrementally updates the register list.
copyPropagation(ir); - // Simple type propagation.
// This pass uses the register list, but doesn't modify it.
if (typeProp) {
typePropagation(ir);
} - // Perform simple bounds-check and arraylength elimination.
// This pass incrementally updates the register list
if (foldChecks) {
arrayPropagation(ir);
} - // Simple dead code elimination.
// This pass incrementally updates the register list
eliminateDeadInstructions(ir); - // constant folding
// This pass usually doesn't modify the DU, but
// if it does it will recompute it.
foldConstants(ir);
// Simple local expression folding respecting DU
if (ir.options.LOCAL_EXPRESSION_FOLDING && ExpressionFolding.performLocal(ir)) {
// constant folding again
foldConstants(ir);
} - // Try to remove conditional branches with constant operands
// If it actually constant folds a branch,
// this pass will recompute the DU
if (foldBranches) {
simplifyConstantBranches(ir);
} - // Should we sort commutative use operands
if (sortRegisters) {
sortCommutativeRegisterUses(ir);
}
In this code we multithread SortCommuativeRegisterUses optimization by forking a thread per Instruction clearly the granularity is to fine and we need to optimize the algorithm.
./rvm -X:aos:initial_compiler=opt -classpath dacapo-2006-10-MR2.jar Harness fop
Running Time
SpeedUps
Tuesday, June 14, 2011
ComputeDU original vs Mutlithreaded
To help create a starting point for my work I started with profiling the execution times of each optimization in Simple.java[1]
I started multi threading DefUse.computeDU
Optimizations (in Simple.java)
Our ComputeDU Optimization Mutlithread's on a Register Level Unfortunately that Granularity is way to fine and we see overhead from runnable objects/ threads
Although we are faced with a few cases where we actually benefit (kinda) see 3/4 transition
Original Execution Time: 108715 nano Seconds Instructs: 821 Method: deleteTree
Execution Time: 556187 nano Seconds Instructs: 821 Method: deleteTree 2 Threads
Execution Time: 24728977 nano Seconds Instructs: 821 Method: deleteTree 3 Threads
Execution Time: 14231161 nano Seconds Instructs: 821 Method: deleteTree 4 Threads
for I in {1..3}; do ./rvm -X:aos:initial_compiler=opt -classpath dacapo-2006-10-MR2.jar Harness fop; done
Graph Axes will be fixed and dimensions will be lined up
I started multi threading DefUse.computeDU
Optimizations (in Simple.java)
- // Compute defList, useList, useCount fields for each register.
DefUse.computeDU(ir); - // Recompute isSSA flags
DefUse.recomputeSSA(ir); - / Simple copy propagation.
// This pass incrementally updates the register list.
copyPropagation(ir); - // Simple type propagation.
// This pass uses the register list, but doesn't modify it.
if (typeProp) {
typePropagation(ir);
} - // Perform simple bounds-check and arraylength elimination.
// This pass incrementally updates the register list
if (foldChecks) {
arrayPropagation(ir);
} - // Simple dead code elimination.
// This pass incrementally updates the register list
eliminateDeadInstructions(ir); - // constant folding
// This pass usually doesn't modify the DU, but
// if it does it will recompute it.
foldConstants(ir);
// Simple local expression folding respecting DU
if (ir.options.LOCAL_EXPRESSION_FOLDING && ExpressionFolding.performLocal(ir)) {
// constant folding again
foldConstants(ir);
} - // Try to remove conditional branches with constant operands
// If it actually constant folds a branch,
// this pass will recompute the DU
if (foldBranches) {
simplifyConstantBranches(ir);
} - // Should we sort commutative use operands
if (sortRegisters) {
sortCommutativeRegisterUses(ir);
}
Our ComputeDU Optimization Mutlithread's on a Register Level Unfortunately that Granularity is way to fine and we see overhead from runnable objects/ threads
Although we are faced with a few cases where we actually benefit (kinda) see 3/4 transition
Original Execution Time: 108715 nano Seconds Instructs: 821 Method: deleteTree
Execution Time: 556187 nano Seconds Instructs: 821 Method: deleteTree 2 Threads
Execution Time: 24728977 nano Seconds Instructs: 821 Method: deleteTree 3 Threads
Execution Time: 14231161 nano Seconds Instructs: 821 Method: deleteTree 4 Threads
for I in {1..3}; do ./rvm -X:aos:initial_compiler=opt -classpath dacapo-2006-10-MR2.jar Harness fop; done
Graph Axes will be fixed and dimensions will be lined up
Subscribe to:
Posts (Atom)