#ifdef CH_LANG_CC
/*
 *      _______              __
 *     / ___/ /  ___  __ _  / /  ___
 *    / /__/ _ \/ _ \/  V \/ _ \/ _ \
 *    \___/_//_/\___/_/_/_/_.__/\___/
 *    Please refer to Copyright.txt, in Chombo's root directory.
 */
#endif

#ifndef _CH_ATTACH_H_
#define _CH_ATTACH_H_
#include "BaseNamespaceHeader.H"

/// Call this function to register emacs-gdb to be invoked on abort.
/**
  After watching most members of the ANAG team suffer from parallel
  debugging efforts I resolved to offer something to help.

  There are two usual problems.

  1. Unless the code chokes on process rank 0, you either resort to
  printf debugging, or begin the adventure of hunting down which process
  is causing the problem and trying to use gdb 'attach' to debug the
  offending process.  One is messy and painfully slow, the other is
  finicky and difficult and you forget how to do it between times you
  need it.

  2. If you are lucky enough to actually have your code abort on process
  rank 0, you are still stuck with regular tty gdb to decipher your problem.

  All of this also depends on running your parallel process all on a
  single machine, which defeats some of the point of parallel processing.

  To address these problems, you can insert a call to 'registerDebugger()'.
  You can call it anywhere in your program.  It registers the function
  'AttachDebugger' with the ABORT signal handler.  Now, when a process
  does something naughty and goes into abort (assert fail, MayDay, segfault,
  etc.) an emacs session is launched, gdb is invoked, your binary is found (for
  symbols) and gdb attaches to your process before it has a chance to
  completely die.  The emacs window is named for the rank of the offending
  MPI process.
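
  The registration step described above can be sketched as follows.  This is
  a minimal sketch under stated assumptions, not the Chombo implementation:
  sketchRegisterDebugger, sketchAbortHandler and g_lastSignal are hypothetical
  names, the 0/1 return convention is assumed, and the handler only records
  the signal number where the real 'AttachDebugger' would launch emacs and gdb.

```cpp
#include <csignal>

// Hypothetical stand-in for AttachDebugger: it only records which signal
// arrived.  Launching emacs and gdb is left out of this sketch.
static volatile std::sig_atomic_t g_lastSignal = 0;

extern "C" void sketchAbortHandler(int a_sig)
{
  g_lastSignal = a_sig;
}

// Hypothetical stand-in for registerDebugger(): installs the handler on
// SIGABRT.  Returns 0 on success and 1 on failure (an assumed convention).
int sketchRegisterDebugger()
{
  if (std::signal(SIGABRT, sketchAbortHandler) == SIG_ERR)
    {
      return 1;
    }
  return 0;
}
```

  Once sketchRegisterDebugger() has returned 0, delivering SIGABRT to the
  process (for example with std::raise(SIGABRT)) runs sketchAbortHandler
  before the process has a chance to die, which is the hook the real tool
  relies on.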

  Interaction with regular debug session:

  It is still perfectly fine to debug code that has called 'registerDebugger'
  in a regular gdb session, as gdb replaces signal handlers with its own
  when it starts up your program for you.

  X11 Forwarding:

  As stated, the offending process is going to open up an emacs terminal.  In
  order to do this I read the process' environment variable DISPLAY.  MPICH
  on our systems uses "ssh" to start other processes, and no amount of
  playing with mpich configure has allowed me to insert the -X option to
  enable X11 forwarding.  In addition, ssh at ANAG defaults to NOT forward
  X11.  Hence, the MPI processes with rank>0 don't have a valid DISPLAY.
  Fortunately there is an easy answer.
  Create the file ~/.ssh/config (or ~/.ssh2/config) and place the following lines in it:
  Host *
    ForwardAgent yes
    ForwardX11 yes

  This turns out to be pretty nice.  If you log into your ANAG machine from
  home using ssh -X, and then run a parallel job with mpirun, you will get
  your emacs debug session forwarded from the machine the process is actually
  running on, to the machine you logged into, to your machine at home.

  If you see some message about gdb finding fd0 closed, then the failure is
  with the DISPLAY environment.  Make sure you have those three lines in
  your .ssh/config

  Caveats:

  -Naturally, this approach assumes you have emacs and gdb and X11 forwarding
  on your system.  If you don't, then this signal handler gracefully passes the
  abort signal to the next handler function and you are none the
  wiser.

  -The current approach uses POSIX standard calls for its operations,
  except: In order to find the binary for the running file I snoop into the
  /proc filesystem on Linux.
  This naturally will only work on Linux
  operating systems.  If we find the tool useful it shouldn't be too hard to
  make it work with other OSes on a demand basis.  For those NOT running
  Linux, you are not completely without hope.  When the debugger appears, you
  will have to tell gdb where the binary is and use the 'file' command to
  fetch it and get all your debug symbols.

  -I do not know if it will be possible to have X11 forwarding working with
  batch submission systems.  It means compute nodes would need to have
  proper X client systems.  Somebody might be able to give me some pointers
  on this one.

  Another use for this feature would be babysitting very large runs.  If you
  think your big executable *might* crash then you can put this in.  If the
  code runs to completion properly, then there is no need for a debugger
  and one never gets invoked.  If your code does die, then you will find a
  debugger open on your desktop attached to your program at the point where
  it failed.  Since we never seem to use core files, this might be a
  palatable option.  In parallel runs core files are just not an option.

  Tue Jan 10 13:57:20 PST 2006

  New feature:

  Previously, the debugger could only be attached when the code had been
  signaled to abort, or when the MPI error handler had been called.  You get
  a debugger, attached where you want, on the MPI proc you want, but execution
  has ended.  You cannot
  'cont'.  Additionally, if you explicitly compiled the 'AttachDebugger' command
  into your code it also meant the end of the line.  No continuing.

  Now (through the POSIX, but still evil, pipe, popen, fork, etc.) you can continue
  from an explicit call to AttachDebugger.  It is a two-step process.
  When the
  debugger window pops up, you need to call:

  (gdb) p DebugCont()

  This should return a (void) result and put you back at the prompt.  Now you can
  use regular gdb commands to set break points, cont, up, fin, etc.

  You can use this feature to set an actual parallel breakpoint; you just have
  to have the patience to type your breakpoint into each window, and the fortitude
  to accept that if you make a mistake, then you have to kill all your debuggers
  and start the MPI job again.

  Until I really figure out how gdbserver works, and how p4 REALLY works, this
  will have to do for now.

  Currently this does not work:
  In the event that AttachDebugger cannot find a valid DISPLAY, it will
  still gdb attach to the process and read out a stacktrace to pout() so you
  can have some idea where the code died.  Unfortunately, due to a known bug
  introduced in gdb as of version 5 or so, this fallback doesn't work.  There
  is a known patch for gdb, so hopefully a version of gdb will be
  available soon that works properly.  I'll still commit it as is.

  bvs
*/
int registerDebugger();  // you use this function up in main, after MPI_Init

/**
  Yeah, registerDebugger() is great if one of the Chombo CH_asserts fails or
  we call abort or exit in the code, or trip an exception, but what about when some part
  of the code calls MPI_Abort?  MPI_Abort never routes to abort but just runs to the
  exit() function after trying to take down the rest of the parallel job.  In order to
  attach to MPI_Abort calls we need to register a function with the MPI error handler
  system, which is what this function does.  Currently registerDebugger() calls this
  function for you.
*/
int setChomboMPIErrorHandler(); // you use this function up in main, after MPI_Init

/** Not for general consumption.
    You can insert this function call into
    your code and it will fire up a debugger wherever you put it.  I don't think
    the code can be continued from this point, however. */
void AttachDebugger(int a_sig = 4);
#include "BaseNamespaceFooter.H"
#endif