Chombo + EB + MF  3.2
CH_Attach.H
#ifdef CH_LANG_CC
/*
 *      _______              __
 *     / ___/ /  ___  __ _  / /  ___
 *    / /__/ _ \/ _ \/  V \/ _ \/ _ \
 *    \___/_//_/\___/_/_/_/_.__/\___/
 *    Please refer to Copyright.txt, in Chombo's root directory.
 */
#endif

#ifndef _CH_ATTACH_H_
#define _CH_ATTACH_H_
#include "BaseNamespaceHeader.H"

/// Call this function to register emacs-gdb to be invoked on abort.
/**
  After watching most members of the ANAG team suffer from parallel
  debugging efforts I resolved to offer something to help.

  There are two usual problems.

  1. Unless the code chokes on process rank 0, you either resort to
  printf debugging, or begin the adventure of hunting down which process
  is causing the problem and trying to use gdb 'attach' to debug the
  offending process. One is messy and painfully slow, the other is
  finicky and difficult and you forget how to do it between times you
  need it.

  2. If you are lucky enough to actually have your code abort on process
  rank 0, you are still stuck with regular tty gdb to decipher your problem.

  All of this also depends on running your parallel process all on a
  single machine, which defeats some of the point of parallel processing.

  To address these problems, you can insert a call to 'registerDebugger()'.
  You can call it anywhere in your program. It registers the function
  'AttachDebugger' with the ABORT signal handler. Now, when a process
  does something naughty and goes into abort (assert fail, MayDay, segfault,
  etc.) an emacs session is launched, gdb is invoked, your binary is found (for
  symbols) and gdb attaches to your process before it has a chance to
  completely die. The emacs window is named for the rank of the offending
  MPI process.

  Interaction with regular debug session:

  It is still perfectly fine to debug code that has called 'registerDebugger'
  in a regular gdb session, as gdb replaces signal handlers with its own
  when it starts up your program for you.

  X11 Forwarding:

  As stated, the offending process is going to open up an emacs terminal. In
  order to do this I read the process' environment variable DISPLAY. MPICH
  on our systems uses "ssh" to start other processes, and no amount of
  playing with mpich configure has allowed me to insert the -X option to
  enable X11 forwarding. In addition, ssh at ANAG defaults to NOT forward
  X11. Hence, the MPI processes with rank>0 don't have a valid DISPLAY.
  Fortunately there is an easy answer. Create the file ~/.ssh/config
  (or ~/.ssh2/config) and place the following lines in it:
  Host *
    ForwardAgent yes
    ForwardX11 yes

  This turns out to be pretty nice. If you log into your ANAG machine from
  home using ssh -X, and then run a parallel job with mpirun you will get
  your emacs debug session forwarded from the machine the process is actually
  running on, to the machine you logged into, to your machine at home.

  If you see some message about gdb finding fd0 closed, then the failure is
  with the DISPLAY environment. Make sure you have those three lines in
  your .ssh/config.

  Caveats:

  -Naturally, this approach assumes you have emacs and gdb and X11 forwarding
  on your system. If you don't, then this signal handler gracefully passes the
  abort signal on to the next handler function and you are none the
  wiser.

  -The current approach uses POSIX standard calls for its operations,
  except: in order to find the binary for the running file I snoop into the
  /proc filesystem on Linux. This naturally will only work on Linux
  operating systems. If we find the tool useful it shouldn't be too hard to
  make it work with other OS's on a demand basis. For those NOT running
  Linux, you are not completely without hope. When the debugger appears, you
  will have to tell gdb where the binary is (gdb's 'file' command reads a
  binary for its symbols) to get all your debug symbols.

  -I do not know if it will be possible to have X11 forwarding working with
  batch submission systems. It means compute nodes would need to have
  proper X client systems. Somebody might be able to give me some pointers
  on this one.

  Another use for this feature would be babysitting very large runs. If you
  think your big executable *might* crash then you can put this in. If the
  code runs to completion properly, then there is no need for a debugger
  and one never gets invoked. If your code does die, then you will find a
  debugger open on your desktop attached to your program at the point where
  it failed. Since we never seem to use core files, this might be a
  palatable option. In parallel runs core files are just not an option.

  Tue Jan 10 13:57:20 PST 2006

  new feature:

  Previously, the debugger could only be attached when the code had been
  signaled to abort, or when the MPI error handler had been called. You get
  a debugger, attached where you want, on the MPI proc you want, but execution
  has ended. You cannot 'cont'. Additionally, if you explicitly compiled the
  'AttachDebugger' call into your code it also meant the end of the line. No
  continuing.

  Now (through the POSIX, but still evil, pipe, popen, fork, etc.) you can continue
  from an explicit call to AttachDebugger. It is a two-step process. When the
  debugger window pops up, you need to call:

  (gdb) p DebugCont()

  This should return a (void) result and put you back at the prompt. Now you can
  use regular gdb commands to set break points, cont, up, fin, etc.

  You can use this feature to set an actual parallel breakpoint; you just have
  to have the patience to type your breakpoint into each window, and the fortitude
  to accept that if you make a mistake, then you have to kill all your debuggers
  and start the MPI job again.

  Until I really figure out how gdbserver works, and how p4 REALLY works, this
  will have to do for now.

  Currently this does not work:
  In the event that AttachDebugger cannot find a valid DISPLAY, it will
  still gdb attach to the process and read out a stacktrace to pout() so you
  can have some idea where the code died. Unfortunately, due to a known bug
  introduced in gdb as of version 5 or so, this fallback doesn't work. There
  is a known patch for gdb, so hopefully a version of gdb will be
  available soon that works properly. I'll still commit it as is.

  bvs
*/
int registerDebugger(); // you use this function up in main, after MPI_Init
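
As a rough illustration of the registration step described in the comment above, the sketch below installs a handler for SIGABRT with POSIX sigaction and keeps the previously installed action, so that a fuller handler could pass the signal on. This is not Chombo's implementation: the names sketchAbortHandler, sketchRegisterAbortHandler, g_abortSeen and g_previousAction are hypothetical, and the handler only records that it ran instead of launching emacs and gdb.

```cpp
#include <csignal>
#include <signal.h>  // POSIX sigaction

// Hypothetical names for illustration; none of these exist in Chombo.
static volatile std::sig_atomic_t g_abortSeen = 0;  // set by the handler
static struct sigaction g_previousAction;           // action that was installed before ours

// Stand-in for AttachDebugger: it only records that SIGABRT arrived.
extern "C" void sketchAbortHandler(int a_sig)
{
  (void)a_sig;
  g_abortSeen = 1;
}

// Install the handler for SIGABRT and save the old action.
// Returns 0 on success and -1 on failure, as sigaction does.
int sketchRegisterAbortHandler()
{
  struct sigaction action = {};
  action.sa_handler = sketchAbortHandler;
  sigemptyset(&action.sa_mask);
  action.sa_flags = 0;
  return sigaction(SIGABRT, &action, &g_previousAction);
}
```

After sketchRegisterAbortHandler() has been called once, raise(SIGABRT) runs the handler and then returns to the caller. A call to abort() also runs the handler, but still terminates the process once the handler returns.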

/**
  Yeah, registerDebugger() is great if one of the Chombo CH_asserts fails or
  we call abort or exit in the code, or trip an exception, but what about when some part
  of the code calls MPI_Abort? MPI_Abort never routes to abort but just runs to the
  exit() function after trying to take down the rest of the parallel job. In order to
  attach to MPI_Abort calls we need to register a function with the MPI error handler
  system, which is what this function does. Currently registerDebugger() calls this
  function for you.
*/
int setChomboMPIErrorHandler(); // you use this function up in main, after MPI_Init

/** Not for general consumption. You can insert this function call into
  your code and it will fire up a debugger wherever you put it. To continue
  from this point, use the two-step DebugCont() procedure described in the
  registerDebugger() notes above. */
void AttachDebugger(int a_sig = 4);
#include "BaseNamespaceFooter.H"
#endif
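
The notes in this header mention three things the attach step depends on: a usable DISPLAY, the path of the running binary (found through /proc on Linux), and a gdb invocation that attaches to the process. The sketch below shows one way each piece might look. It is an assumption-laden illustration, not the code behind AttachDebugger: all function names are hypothetical, and the emacs launch is left out entirely.

```cpp
#include <cstddef>      // std::size_t
#include <cstdlib>      // std::getenv
#include <string>
#include <sys/types.h>  // pid_t, ssize_t
#include <unistd.h>     // readlink

// Hypothetical helpers for illustration; none of these names exist in Chombo.

// True when DISPLAY is set to a non-empty value.
bool sketchHasDisplay()
{
  const char* display = std::getenv("DISPLAY");
  return display != nullptr && display[0] != '\0';
}

// Linux only: resolve the running executable through the /proc filesystem.
// Returns an empty string when the link cannot be read.
std::string sketchFindBinary()
{
  char buffer[4096];
  const ssize_t length = readlink("/proc/self/exe", buffer, sizeof(buffer));
  if (length <= 0)
  {
    return std::string();
  }
  // readlink does not null-terminate, so build the string from the length.
  return std::string(buffer, static_cast<std::size_t>(length));
}

// Build "gdb <binary> <pid>", the gdb form that attaches to a running process.
std::string sketchGdbAttachCommand(const std::string& a_binary, pid_t a_pid)
{
  return "gdb " + a_binary + " " + std::to_string(a_pid);
}
```

For example, sketchGdbAttachCommand(sketchFindBinary(), getpid()) yields a command line for the current process. On a system without /proc the binary path comes back empty, which matches the caveat above that non-Linux users must point gdb at the binary by hand.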