| [807] | 1 |
|
|---|
| 2 | ParGeant4: Geant4/TOP-C, a parallelization of Geant4
|
|---|
| 3 | (event-level parallelism)
|
|---|
| 4 |
|
|---|
| 5 | Gene Cooperman
|
|---|
| 6 | Northeastern University
|
|---|
| 7 | gene@ccs.neu.edu
|
|---|
| 8 |
|
|---|
| 9 | For the latest information on ParGeant4, see:
|
|---|
| 10 | http://www.ccs.neu.edu/home/gene/pargeant4.html
|
|---|
| 11 | Note that a version now exists that runs Geant4 over the Grid.
|
|---|
| 12 | Please write to gene@ccs.neu.edu for further information.
|
|---|
| 13 | To port other applications to a parallel version, read the
|
|---|
| 14 | files ../../info/PAR_INSTALL and ../../info/PAR_README.
|
|---|
| 15 |
|
|---|
| 16 | See the beginning of GNUmakefile for reasonable `make' targets to run it.
|
|---|
| 17 | To run it:
|
|---|
| 18 | 0. a. Follow the standard Geant4 installation procedure.
|
|---|
| 19 | b. Download and install TOP-C
|
|---|
| 20 | The TOP-C home page is at http://www.ccs.neu.edu/home/gene/topc.html
|
|---|
| 21 | cd <TOPC_INSTALL_DIR>
|
|---|
| 22 | gzip -dc topc.tar.gz | tar -xvf -
|
|---|
| 23 | cd topc
|
|---|
| 24 | ./configure
|
|---|
| 25 | make
|
|---|
| 26 | make check
|
|---|
| 27 | [ Copy bin/topc-config to your path ]
|
|---|
| 28 | c. Verify that the Geant4 example installs:
|
|---|
| 29 | cd $G4INSTALL/examples/extended/parallel/ParN04
|
|---|
| 30 | make
|
|---|
| 31 | $G4WORKDIR/bin/$G4SYSTEM/ParN04 ParN04.in
|
|---|
| 32 | 2. make run
|
|---|
| 33 | [ By default, the included `procgroup' file creates two slave processes
|
|---|
| 34 | on localhost. ]
|
|---|
| 35 | [ Note that in addition to output on master,
|
|---|
| 36 | $G4WORKDIR/bin/$G4SYSTEM/slave*.out contains slave output. ]
|
|---|
| 37 | [ To remove intermediate files and start over: make parclean ]
|
|---|
| 38 | 3. Try running it with slave processes on remote hosts.
|
|---|
| 39 | First, test that your local environment is set up correctly.
|
|---|
| 40 | Try:
|
|---|
| 41 | ssh <REMOTE_HOSTNAME> $G4WORKDIR/bin/$G4SYSTEM/ParN04 `pwd`/ParN04.in
|
|---|
| 42 | The above command needs to work without asking for a password.
|
|---|
| 43 | [ If you use dynamic libraries (*.so), make sure the LD_LIBRARY_PATH
|
|---|
| 44 | in your shell startup file (e.g. ~/.tcshrc) includes both:
|
|---|
| 45 | $G4INSTALL/lib/$G4SYSTEM and $CLHEP_BASE_DIR/lib
|
|---|
| 46 | If you use AFS, you may need to type 'klog' to renew your AFS token. ]
|
|---|
| 47 | In `procgroup' file, replace `localhost' by desired remote hosts;
|
|---|
| 48 | Add additional remote hosts (additional slaves) if you like.
|
|---|
| 49 | Then: make run
|
|---|
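The password-free check in step 3 can be wrapped in a small helper. This is only a sketch: `check_remote` is a hypothetical name, not part of ParGeant4 or TOP-C, and it assumes OpenSSH, whose `BatchMode=yes` option makes `ssh` fail rather than prompt for a password. `REMOTE_SHELL` mirrors the variable of the same name in ParGNUmakefile.

```shell
# Hypothetical helper (not part of ParGeant4): run a command on a remote host
# and fail instead of prompting if password-free login is not set up.
# REMOTE_SHELL follows the ParGNUmakefile convention and defaults to ssh.
check_remote() {
    host=$1
    shift
    if [ "${REMOTE_SHELL:-ssh}" = ssh ]; then
        # BatchMode=yes: OpenSSH returns an error rather than asking for a password.
        ssh -o BatchMode=yes "$host" "$@"
    else
        # Other remote shells are invoked as given; no prompt suppression is assumed.
        "$REMOTE_SHELL" "$host" "$@"
    fi
}

# Example use, mirroring the command in step 3:
#   check_remote <REMOTE_HOSTNAME> $G4WORKDIR/bin/$G4SYSTEM/ParN04 "$(pwd)/ParN04.in"
```

A non-zero exit status with a "Permission denied" message most likely means the remote login still needs a password, and `make run` with remote slaves would stall at the same point.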
| 50 |
|
|---|
| 51 | ============================================================================
|
|---|
| 52 | If you read ParGNUmakefile, you'll find other things that you can
|
|---|
| 53 | modify. For example, all TOP-C additions are in conditionals:
|
|---|
| 54 | remove -DG4USE_TOPC from ParGNUmakefile and:
|
|---|
| 55 | make parclean; make run
|
|---|
| 56 | in order to re-compile and rerun without TOP-C.
|
|---|
| 57 | Define REMOTE_SHELL differently if you don't use `ssh' for a remote shell.
|
|---|
| 58 | (If undefined, ParGNUmakefile defines it to be `ssh')
|
|---|
| 59 | Define MACROFILE differently to use a different set of input commands.
|
|---|
| 60 | Define MEM_MODEL=--seq
|
|---|
| 61 | to run with TOP-C, but using a single (sequential) process, suitable
|
|---|
| 62 | for easy debugging (via gdb, for example).
|
|---|
| 63 | Try: pushd $G4WORKDIR/bin/$G4SYSTEM/; ./ParN04 --TOPC-help
|
|---|
| 64 | to see TOP-C run-time options that can be invoked, such as
|
|---|
| 65 | pushd $G4WORKDIR/bin/$G4SYSTEM/; ./ParN04 --TOPC-num-slaves=5 ParN04.in
|
|---|
| 66 | Alternatively, modify TOPC_OPTIONS in ParGNUmakefile for the same effect.
|
|---|
| 67 |
|
|---|
| 68 | You can also try other targets: make run-debug
|
|---|
| 69 | This will run it under gdb, so you can single step to see what happens.
|
|---|
| 70 | make parclean - Start over with clean set of files.
|
|---|
| 71 |
|
|---|
| 72 | ============================================================================
|
|---|
| 73 | New or modified files:
|
|---|
| 74 | ParN04.cc - Adds one line: #include "ParN04.icc"
|
|---|
| 75 | ParExample.icc inserts: #include "topc.h"
|
|---|
| 76 | and causes main to call TOPC_init, TOPC_finalize,
|
|---|
| 77 | and to use: `new ParRunManager' instead of `new G4RunManager'
|
|---|
| 78 | GNUmakefile - Adds one line at beginning: include ParGNUmakefile
|
|---|
| 79 | ParGNUmakefile defines EXTRALIBS and CPPFLAGS so as to
|
|---|
| 80 | modify behavior of config/binmake.gmk
|
|---|
| 81 | in order to use TOP-C libraries and includes
|
|---|
| 82 | procgroup - Specifies which slave hosts to use, and where to put output
|
|---|
| 83 | For example: localhost 1 - > slave1.out
|
|---|
| 84 | host=`localhost', executable=`same as master',
|
|---|
| 85 | params of slave=`> slave1.out' (redirect output)
|
|---|
| 86 | If output not redirected, it goes to stdout on master.
|
|---|
| 87 | src/ParRunManager.cc - ParRunManager derived from G4RunManager
|
|---|
| 88 | replaces G4RunManager::DoEventLoop w/ TOP-C parallel loop,
|
|---|
| 89 | Adds certain local vars of DoEventLoop as ParRunManager members
|
|---|
| 90 | include/MarshaledObj.h - run-time utilities for marshaling
|
|---|
| 91 | include/MarshaledEx*Hit.h - marshals N04 hits (calorimeter hits)
|
|---|
| 92 | include/MarshaledG4*.h - marshaling routines for Geant4 data structures
|
|---|
| 93 |
|
|---|
| 94 | ~/slave*.out - Contains outputs of slave1, slave2, etc.
|
|---|
| 95 | Generated each time parallel ParN04 is executed.
|
|---|
| 96 | These files are specified in the file procgroup.
|
|---|
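When many remote hosts are involved, the slave lines of `procgroup' can be generated rather than typed by hand. The sketch below only reproduces the line format shown in the procgroup entry above (host, `1', `-' for the same executable as the master, then the output redirection). `procgroup_lines` is a hypothetical helper name; any other lines already present in the shipped procgroup file should be kept as they are.

```shell
# Hypothetical helper (not part of ParGeant4): print one procgroup slave line
# per host, in the format of the example above:
#   <host> 1 - > slaveN.out
procgroup_lines() {
    n=1
    for host in "$@"; do
        printf '%s 1 - > slave%d.out\n' "$host" "$n"
        n=$((n + 1))
    done
}

# Example use: print two slave lines, to be pasted over the `localhost' lines.
#   procgroup_lines node1 node2
```

Listing the same host twice yields two slaves on that host, each with its own numbered output file.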
| 97 |
|
|---|
| 98 | ====================================================================
|
|---|
| 99 | This version passes an event number to the slave and lets the
|
|---|
| 100 | slave generate the event. The slave passes back marshaled hits to
|
|---|
| 101 | the master.
|
|---|
| 102 |
|
|---|
| 103 | I will integrate the track level parallelism into this scenario at
|
|---|
| 104 | a later date. For the track level, I will generate several
|
|---|
| 105 | secondary tracks on the master, and then convert the secondary tracks
|
|---|
| 106 | to new events that can be passed to slaves. I will do this only if
|
|---|
| 107 | I detect that there are not enough initial events to fully occupy all
|
|---|
| 108 | the slaves. This scheme has the drawback that we are splitting an event
|
|---|
| 109 | into many events, which may make the summarization, histogram, and so
|
|---|
| 110 | on more difficult. However, track level parallelism will be triggered
|
|---|
| 111 | only when a very small number of events are generated.
|
|---|
| 112 |
|
|---|
| 113 | I also want to support postponing
|
|---|
| 114 | a track to the next event (G4ClassificationOfNewTrack::fPostpone).
|
|---|
| 115 | To do this, each slave will wait to retire an event until it knows that
|
|---|
| 116 | the previous event has been retired.
|
|---|
| 117 |
|
|---|
| 118 | In addition, I plan to have only the master read commands and pass
|
|---|
| 119 | them to the slaves. Currently, the master and slaves each read
|
|---|
| 120 | identical commands.
|
|---|
| 121 |
|
|---|
| 122 | ====================================================================
|
|---|
| 123 | If you are curious about some of the layers, the following
|
|---|
| 124 | stack trace [somewhat out of date now] gives some idea.
|
|---|
| 125 | [ This stack trace is from a run based on ParN02.]
|
|---|
| 126 |
|
|---|
| 127 | G4RunManager::BeamOn calls ParRunManager::DoEventLoop
|
|---|
| 128 | (since G4RunManager::DoEventLoop is virtual)
|
|---|
| 129 | ParRunManager::DoEventLoop calls TOPC_master_slave
|
|---|
| 130 | TOPC_master_slave calls submit_task_input
|
|---|
| 131 | submit_task_input eventually calls COMM_send_msg which calls MPI_Send
|
|---|
| 132 | (COMM_send_msg is the communication layer of TOPC;
|
|---|
| 133 | ParN04.cc was linked with the TOP-C MPI communication layer.
|
|---|
| 134 | The same source could have been linked with a POSIX threads layer,
|
|---|
| 135 | a communication layer, or some other communication layer.
|
|---|
| 136 | )
|
|---|
| 137 | MPI_Send calls send
|
|---|
| 138 | (where send is the socket system call of libc.so)
|
|---|
| 139 |
|
|---|
| 140 | (gdb) where
|
|---|
| 141 | #0 0x41946c62 in send () from /lib/libc.so.6
|
|---|
| 142 | #1 0x400839c1 in send () at wrapsyscall.c:186
|
|---|
| 143 | #2 0x805c547 in MPI_Send (buf=0x82690fc, count=4, datatype=3, dest=2, tag=1, comm=0) at sendrecv.c:236
|
|---|
| 144 | #3 0x805a0b5 in COMM_send_msg (msg=0x82690fc, msg_size=4, dst=2, tag=TASK_INPUT_TAG) at comm-mpi.c:224
|
|---|
| 145 | #4 0x805774e in send_task_input (slave=2, input={data = 0x82690fc, data_size = 4}, tag=TASK_INPUT_TAG) at topc.c:560
|
|---|
| 146 | #5 0x8057aa8 in submit_task_input (input={data = 0x82690fc, data_size = 4}) at topc.c:659
|
|---|
| 147 | #6 0x805813c in TOPC_master_slave (generate_task_input_=0x4003d2e4 <ParRunManager::GenerateEventInput(void)>,
|
|---|
| 148 | do_task_=0x4003d350 <ParRunManager::DoEvent(int *)>, check_task_result_=0x4003d420 <ParRunManager::CheckEventResult(int *, void *)>,
|
|---|
| 149 | update_shared_data_=0) at topc.c:922
|
|---|
| 150 | #7 0x4003d18c in ParRunManager::DoEventLoop (this=0x80c0bf0, n_event=1, macroFile=0x0, n_select=-1) at src/ParRunManager.cc:51
|
|---|
| 151 | #8 0x400b14d1 in G4RunManager::BeamOn () from /afs/cern.ch/user/c/cooperma/scratch-pcitapi07/geant4/lib/libG4run.so
|
|---|
| 152 | #9 0x400b870a in G4RunMessenger::SetNewValue () from /afs/cern.ch/user/c/cooperma/scratch-pcitapi07/geant4/lib/libG4run.so
|
|---|
| 153 | #10 0x4167157b in G4UIcommand::DoIt () from /afs/cern.ch/user/c/cooperma/scratch-pcitapi07/geant4/lib/libG4intercoms.so
|
|---|
| 154 | #11 0x416810a3 in G4UImanager::ApplyCommand () from /afs/cern.ch/user/c/cooperma/scratch-pcitapi07/geant4/lib/libG4intercoms.so
|
|---|
| 155 | #12 0x805db7e in G4UIterminal::ExecuteCommand () at /afs/cern.ch/sw/lhcxx/specific/redhat61/3.2.0/include/CLHEP/Random/Randomize.h:64
|
|---|
| 156 | #13 0x805d42d in G4UIterminal::SessionStart () at /afs/cern.ch/sw/lhcxx/specific/redhat61/3.2.0/include/CLHEP/Random/Randomize.h:64
|
|---|
| 157 | #14 0x8056a3d in main (argc=1, argv=0x80bfa00) at ParN02.cc:98
|
|---|