Chombo + EB: CH_Attach.H Source File

#ifdef CH_LANG_CC
/*
 *      _______              __
 *     / ___/ /  ___  __ _  / /  ___
 *    / /__/ _ \/ _ \/  V \/ _ \/ _ \
 *    \___/_//_/\___/_/_/_/_.__/\___/
 *    Please refer to Copyright.txt, in Chombo's root directory.
 */
#endif

#ifndef _CH_ATTACH_H_
#define _CH_ATTACH_H_
#include "BaseNamespaceHeader.H"

/// Call this function to register emacs-gdb to be invoked on abort.
/**
   After watching most members of the ANAG team suffer through parallel
   debugging efforts, I resolved to offer something to help.

   There are two usual problems.

   1. Unless the code chokes on process rank 0, you either resort to
   printf debugging, or begin the adventure of hunting down which process
   is causing the problem and trying to use gdb 'attach' to debug the
   offending process.  One is messy and painfully slow; the other is
   finicky and difficult, and you forget how to do it between the times
   you need it.

   2. If you are lucky enough to actually have your code abort on process
   rank 0, you are still stuck with regular tty gdb to decipher your problem.

   All of this also depends on running your parallel job entirely on a
   single machine, which defeats some of the point of parallel processing.

   To address these problems, you can insert a call to 'registerDebugger()'.
   You can call it anywhere in your program.  It registers the function
   'AttachDebugger' with the ABORT signal handler.  Now, when a process
   does something naughty and goes into abort (assert fail, MayDay, segfault,
   etc.), an emacs session is launched, gdb is invoked, your binary is found
   (for symbols), and gdb attaches to your process before it has a chance to
   completely die.  The emacs window is named for the rank of the offending
   MPI process.

   Interaction with regular debug session:

   It is still perfectly fine to debug code that has called 'registerDebugger'
   in a regular gdb session, as gdb replaces signal handlers with its own
   when it starts up your program for you.

   X11 Forwarding:

   As stated, the offending process is going to open up an emacs window.  In
   order to do this I read the process's DISPLAY environment variable.  MPICH
   on our systems uses "ssh" to start other processes, and no amount of
   playing with mpich configure has allowed me to insert the -X option to
   enable X11 forwarding.  In addition, ssh at ANAG defaults to NOT forwarding
   X11.  Hence, the MPI processes with rank>0 do not have a valid DISPLAY
   environment variable.  Fortunately there is an easy answer.
   Create the file ~/.ssh/config (or ~/.ssh2/config) and place the following
   lines in it:
      Host *
      ForwardAgent yes
      ForwardX11 yes

   This turns out to be pretty nice.  If you log into your ANAG machine from
   home using ssh -X, and then run a parallel job with mpirun, you will get
   your emacs debug session forwarded from the machine the process is actually
   running on, to the machine you logged into, to your machine at home.

   If you see some message about gdb finding fd0 closed, then the failure is
   with the DISPLAY environment.  Make sure you have those three lines in
   your .ssh/config.

   Caveats:

   -Naturally, this approach assumes you have emacs, gdb, and X11 forwarding
   on your system.  If you don't, then this signal handler gracefully passes
   the abort signal to the next handler function and you are none the
   wiser.

   -The current approach uses POSIX standard calls for its operations,
   with one exception: in order to find the binary for the running process
   I snoop into the /proc filesystem on Linux.  This naturally will only work
   on Linux operating systems.  If we find the tool useful it shouldn't be too
   hard to make it work with other OSes on a demand basis.  For those NOT
   running Linux, you are not completely without hope.  When the debugger
   appears, you will have to tell gdb where the binary is (gdb's 'file'
   command does this) so that it can read all your debug symbols.

   -I do not know if it will be possible to have X11 forwarding working with
   batch submission systems.  It means compute nodes would need to have
   proper X client systems.  Somebody might be able to give me some pointers
   on this one.

   Another use for this feature would be babysitting very large runs.  If you
   think your big executable *might* crash, then you can put this in.  If the
   code runs to completion properly, then there is no need for a debugger
   and one never gets invoked.  If your code does die, then you will find a
   debugger open on your desktop, attached to your program at the point where
   it failed.  Since we never seem to use core files, this might be a
   palatable option.  In parallel runs core files are just not an option.

   Tue Jan 10 13:57:20 PST 2006

   New feature:

   Previously, the debugger could only be attached when the code had been
   signaled to abort, or when the MPI error handler had been called.  You get
   a debugger, attached where you want, on the MPI proc you want, but
   execution has ended.  You cannot 'cont'.  Additionally, if you explicitly
   compiled an 'AttachDebugger' call into your code, it also meant the end of
   the line.  No continuing.

   Now (through the POSIX, but still evil, pipe, popen, fork, etc.) you can
   continue from an explicit call to AttachDebugger.  It is a two-step
   process.  When the debugger window pops up, you need to call:

       (gdb) p DebugCont()

   This should return a (void) result and put you back at the prompt.  Now you
   can use regular gdb commands to set breakpoints, cont, up, fin, etc.

   You can use this feature to set an actual parallel breakpoint; you just have
   to have the patience to type your breakpoint into each window, and the
   fortitude to accept that if you make a mistake, you have to kill all your
   debuggers and start the MPI job again.

   Until I really figure out how gdbserver works, and how p4 REALLY works, this
   will have to do for now.

   Currently this does not work:
   In the event that AttachDebugger cannot find a valid DISPLAY, it will
   still gdb attach to the process and read out a stack trace to pout() so you
   can have some idea where the code died.  Unfortunately, due to a known bug
   introduced in gdb as of version 5 or so, this fallback doesn't work.  There
   is a known patch for gdb, so hopefully a version of gdb will be
   available soon that works properly.  I'll still commit it as is.

   bvs
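
   The registration mechanism described above (a handler on the abort
   signal, the binary located through /proc, gdb attached by process id) can
   be sketched as follows.  This is a minimal illustration under stated
   assumptions, not the Chombo implementation: every name in it is
   hypothetical, the handler only records the signal instead of launching
   emacs and gdb, and the /proc lookup is Linux-only, as noted in the caveats.

```cpp
// Hypothetical sketch: every name below is invented for illustration
// and none of it is taken from the Chombo sources.
#include <cassert>
#include <signal.h>    // sigaction, raise (POSIX)
#include <string>
#include <sys/types.h> // ssize_t
#include <unistd.h>    // readlink (POSIX)

// Set by the handler so the rest of the program can tell it ran.
static volatile sig_atomic_t s_sketchSignalSeen = 0;

// Stand-in for a handler such as AttachDebugger.  A real handler would
// start the debugger here; this one only records the signal number.
extern "C" void sketchAbortHandler(int a_sig)
{
  s_sketchSignalSeen = a_sig;
}

// Install the handler for SIGABRT.  Returns 0 on success and -1 on
// failure, which is the return convention of sigaction itself.
int sketchRegisterHandler()
{
  struct sigaction action = {};
  action.sa_handler = sketchAbortHandler;
  sigemptyset(&action.sa_mask);
  action.sa_flags = 0;
  return sigaction(SIGABRT, &action, nullptr);
}

// Resolve the running binary through the Linux-only /proc/self/exe link.
// Returns an empty string where that link is unavailable.
std::string sketchExecutablePath()
{
  char buffer[4096];
  const ssize_t length = readlink("/proc/self/exe", buffer, sizeof(buffer));
  if (length <= 0)
  {
    return std::string();
  }
  return std::string(buffer, static_cast<std::string::size_type>(length));
}

// Build a "gdb <program> <pid>" command line, the form gdb accepts for
// attaching to a running process.
std::string sketchGdbAttachCommand(const std::string& a_exe, long a_pid)
{
  assert(a_pid > 0);
  return "gdb " + a_exe + " " + std::to_string(a_pid);
}
```

   A caller would run sketchRegisterHandler() once near the top of main and,
   inside a real handler, pass sketchExecutablePath() and the result of
   getpid() to sketchGdbAttachCommand().  Delivering the signal with
   raise(SIGABRT) lets the handler return to the caller, whereas abort()
   still terminates the process once the handler returns.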
*/
int registerDebugger();  // call this function up in main, after MPI_Init

/**
   Yeah, registerDebugger() is great if one of the Chombo CH_asserts fails, or
   we call abort or exit in the code, or trip an exception, but what about
   when some part of the code calls MPI_Abort?  MPI_Abort never routes to
   abort; it just runs to the exit() function after trying to take down the
   rest of the parallel job.  In order to attach to MPI_Abort calls we need to
   register a function with the MPI error handler system, which is what this
   function does.  Currently registerDebugger() calls this function for you.
*/
int setChomboMPIErrorHandler();  // call this function up in main, after MPI_Init

/**  Not for general consumption.  You can insert this function call into
     your code and it will fire up a debugger wherever you put it.  The code
     can be continued from this point only by first calling DebugCont() at the
     gdb prompt, as described in the new-feature note above. */
void AttachDebugger(int a_sig = 4);
#include "BaseNamespaceFooter.H"
#endif