Chombo + EB 3.0
src/BaseTools/CH_Attach.H
#ifdef CH_LANG_CC
/*
 *      _______              __
 *     / ___/ /  ___  __ _  / /  ___
 *    / /__/ _ \/ _ \/  V \/ _ \/ _ \
 *    \___/_//_/\___/_/_/_/_.__/\___/
 *    Please refer to Copyright.txt, in Chombo's root directory.
 */
#endif

#ifndef _CH_ATTACH_H_
#define _CH_ATTACH_H_
#include "BaseNamespaceHeader.H"

/// Call this function to register emacs-gdb to be invoked on abort.
/**
  After watching most members of the ANAG team suffer through parallel
  debugging efforts, I resolved to offer something to help.

  There are two usual problems.

  1. Unless the code chokes on process rank 0, you either resort to
  printf debugging or begin the adventure of hunting down which process
  is causing the problem and trying to use gdb 'attach' to debug the
  offending process. The first is messy and painfully slow; the second is
  finicky and difficult, and you forget how to do it between the times you
  need it.

  2. If you are lucky enough to actually have your code abort on process
  rank 0, you are still stuck with regular tty gdb to decipher your problem.

  All of this also depends on running your parallel job entirely on a
  single machine, which defeats some of the point of parallel processing.

  To address these problems, you can insert a call to 'registerDebugger()'.
  You can call it anywhere in your program. It registers the function
  'AttachDebugger' with the ABORT signal handler. Now, when a process
  does something naughty and goes into abort (assert fail, MayDay, segfault,
  etc.), an emacs session is launched, gdb is invoked, your binary is found (for
  symbols), and gdb attaches to your process before it has a chance to
  completely die. The emacs window is named for the rank of the offending
  MPI process.

  Interaction with regular debug session:

  It is still perfectly fine to debug code that has called 'registerDebugger'
  in a regular gdb session, as gdb replaces signal handlers with its own
  when it starts up your program for you.

  X11 Forwarding:

  As stated, the offending process is going to open up an emacs window. In
  order to do this I read the process's DISPLAY environment variable. MPICH
  on our systems uses "ssh" to start other processes, and no amount of
  playing with the mpich configure has allowed me to insert the -X option to
  enable X11 forwarding. In addition, ssh at ANAG defaults to NOT forwarding
  X11. Hence, the MPI processes with rank > 0 don't have a valid DISPLAY.
  Fortunately there is an easy answer. Create the file ~/.ssh/config
  (or ~/.ssh2/config) and place the following lines in it:

    Host *
      ForwardAgent yes
      ForwardX11 yes

  This turns out to be pretty nice. If you log into your ANAG machine from
  home using ssh -X, and then run a parallel job with mpirun, you will get
  your emacs debug session forwarded from the machine the process is actually
  running on, to the machine you logged into, to your machine at home.

  If you see some message about gdb finding fd0 closed, then the failure is
  with the DISPLAY environment. Make sure you have those three lines in
  your .ssh/config.

  Caveats:

  - Naturally, this approach assumes you have emacs, gdb, and X11 forwarding
  on your system. If you don't, then this signal handler gracefully passes the
  abort signal to the next handler function and you are none the
  wiser.

  - The current approach uses POSIX standard calls for its operations,
  with one exception: in order to find the binary for the running process I
  snoop into the /proc filesystem on Linux. This naturally will only work on
  Linux operating systems. If we find the tool useful it shouldn't be too hard
  to make it work with other OSes on a demand basis. For those NOT running
  Linux, you are not completely without hope. When the debugger appears, you
  will have to tell gdb where the binary is and use the 'file' command to
  fetch it and get all your debug symbols.

  - I do not know if it will be possible to have X11 forwarding working with
  batch submission systems. It means compute nodes would need to have
  proper X client systems. Somebody might be able to give me some pointers
  on this one.

  Another use for this feature would be babysitting very large runs. If you
  think your big executable *might* crash, then you can put this in. If the
  code runs to completion properly, then there is no need for a debugger
  and one never gets invoked. If your code does die, then you will find a
  debugger open on your desktop, attached to your program at the point where
  it failed. Since we never seem to use core files, this might be a
  palatable option. In parallel runs core files are just not an option.

  Tue Jan 10 13:57:20 PST 2006

  New feature:

  Previously, the debugger could only be attached when the code had been
  signaled to abort, or when the MPI error handler had been called. You get
  a debugger, attached where you want, on the MPI proc you want, but execution
  has ended. You cannot 'cont'. Additionally, if you explicitly compiled a call
  to 'AttachDebugger' into your code, it also meant the end of the line. No
  continuing.

  Now (through the POSIX, but still evil, pipe, popen, fork, etc.) you can
  continue from an explicit call to AttachDebugger. It is a two-step process.
  When the debugger window pops up, you need to call:

  (gdb) p DebugCont()

  This should return a (void) result and put you back at the prompt. Now you can
  use regular gdb commands to set breakpoints, cont, up, fin, etc.

  You can use this feature to set an actual parallel breakpoint. You just have
  to have the patience to type your breakpoint into each window, and the fortitude
  to accept that if you make a mistake, you have to kill all your debuggers
  and start the MPI job again.

  Until I really figure out how gdbserver works, and how p4 REALLY works, this
  will have to do for now.

  Currently this does not work:
  In the event that AttachDebugger cannot find a valid DISPLAY, it will
  still gdb attach to the process and read out a stack trace to pout() so you
  can have some idea where the code died. Unfortunately, due to a known bug
  introduced in gdb as of version 5 or so, this fallback doesn't work. There
  is a known patch for gdb, so hopefully a version of gdb will be
  available soon that works properly. I'll still commit it as is.

  bvs
*/
int registerDebugger(); // you use this function up in main, after MPI_Init

/**
  Yeah, registerDebugger() is great if one of the Chombo CH_asserts fails, or
  we call abort or exit in the code, or trip an exception, but what about when
  some part of the code calls MPI_Abort? MPI_Abort never routes to abort but
  just runs to the exit() function after trying to take down the rest of the
  parallel job. In order to attach to MPI_Abort calls we need to register a
  function with the MPI error handler system, which is what this function does.
  Currently registerDebugger() calls this function for you.
*/
int setChomboMPIErrorHandler(); // you use this function up in main, after MPI_Init

/** Not for general consumption. You can insert this function call into
  your code and it will fire up a debugger wherever you put it. Earlier
  versions could not be continued from this point; see the DebugCont() note
  in the registerDebugger() documentation above. */
void AttachDebugger(int a_sig = 4);
#include "BaseNamespaceFooter.H"
#endif
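
The header describes what registerDebugger() does but not how it is wired up. As a rough illustration only, the sketch below shows one way such a hook could be built from POSIX sigaction: install a SIGABRT handler, look up the running binary through /proc/self/exe, launch gdb against the current pid, then hand the signal on to whatever handler was installed before. Every name starting with "sketch" is hypothetical and not part of Chombo, and xterm stands in here for the emacs session the header describes.

```cpp
// Hypothetical sketch; none of these names come from Chombo.
#include <cassert>
#include <cstdlib>     // std::getenv, std::system
#include <signal.h>    // sigaction, raise (POSIX)
#include <string>
#include <unistd.h>    // readlink, getpid (POSIX)

// Resolve the path of the running binary through /proc (Linux only).
// Returns an empty string where /proc/self/exe is unavailable.
std::string sketchExecutablePath()
{
  char buf[4096];
  ssize_t n = readlink("/proc/self/exe", buf, sizeof(buf) - 1);
  return (n > 0) ? std::string(buf, static_cast<size_t>(n)) : std::string();
}

// Build the shell command that opens a debugger window attached to `pid`.
// The window title carries the MPI rank, as the header describes.
std::string sketchAttachCommand(const std::string& exe, long pid, int rank)
{
  return "xterm -T 'rank " + std::to_string(rank) + "' -e gdb "
         + exe + " " + std::to_string(pid);
}

static struct sigaction g_sketchPreviousAbort; // handler installed before ours
static int g_sketchRank = 0;

// Not async-signal-safe (std::string, system); tolerable only because the
// process is already on its way down when this runs.
extern "C" void sketchOnAbort(int sig)
{
  if (std::getenv("DISPLAY") != nullptr)
  {
    std::string cmd = sketchAttachCommand(sketchExecutablePath(),
                                          static_cast<long>(getpid()),
                                          g_sketchRank);
    std::system(cmd.c_str()); // blocks until the debugger window is closed
  }
  // No usable DISPLAY, or debugger finished: pass the signal to the
  // previously installed handler.
  sigaction(sig, &g_sketchPreviousAbort, nullptr);
  raise(sig);
}

// Install the handler for SIGABRT. Returns 0 on success, -1 on failure.
int sketchRegisterDebugger(int rank)
{
  g_sketchRank = rank;
  struct sigaction sa = {};
  sa.sa_handler = sketchOnAbort;
  sigemptyset(&sa.sa_mask);
  sa.sa_flags = 0;
  return sigaction(SIGABRT, &sa, &g_sketchPreviousAbort);
}
```

A program would call sketchRegisterDebugger(rank) once, after MPI_Init. On Linux systems with the Yama ptrace_scope restriction enabled, gdb may be refused permission to attach to a process that is not its child; handling that is outside this sketch.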
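
The "(gdb) p DebugCont()" step in the header can be pictured with a simpler mechanism than the pipe/popen/fork machinery the author mentions: the process parks itself in a polling loop on a flag, and a function called from the gdb prompt clears that flag. This is a common generic technique, shown as an assumption about how such a hand-off can work, not as Chombo's implementation. The names sketchDebugCont and sketchWaitForDebugger are hypothetical.

```cpp
// Hypothetical sketch; not Chombo's DebugCont().
#include <cassert>
#include <unistd.h>   // sleep (POSIX)

// volatile so the loop re-reads the flag on every pass; the debugger
// changes it from outside the normal flow of the program.
static volatile int g_sketchHoldForDebugger = 1;

// Meant to be called from the debugger prompt, e.g.
//   (gdb) p sketchDebugCont()
extern "C" void sketchDebugCont()
{
  g_sketchHoldForDebugger = 0;
}

// Park the calling process until a debugger releases it.
// maxPolls < 0 waits indefinitely; otherwise give up after maxPolls
// one-second polls. Returns the number of polls that elapsed.
int sketchWaitForDebugger(int maxPolls)
{
  int polls = 0;
  while (g_sketchHoldForDebugger && (maxPolls < 0 || polls < maxPolls))
  {
    sleep(1);
    ++polls;
  }
  return polls;
}
```

With this arrangement the workflow matches the two steps in the header: attach, release the loop with the print command, set breakpoints in each window, then 'cont'.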