Operation of DAQ and Slow Controls - TWIST
Restarting Run Software
If the DAQ stops taking data but everything seems to
be running (and the beam is on), the first thing to try is to stop the
run and start a new one. This will often get it going again.
Restarting MHTTPD
MHTTPD is the Midas HTTP Daemon, the program which provides the web
interface to the Midas controls. It dies sometimes, meaning your web
browser won't be able to connect to Midas, but the run
continues uninterrupted. No other parts of the DAQ system
care whether MHTTPD is running or not; it's just a front end.
If MHTTPD is not running (your browser has complained that it cannot
connect, or similar), you can type (as user twistonl on machine midtwist):
start_mhttpd
which checks to see if MHTTPD is running, and if not it starts it.
If MHTTPD is responding, but not doing what you expect (e.g. it's stuck
in an infinite loop, or it's not showing all the messages when you
request "last 8 hours", etc.), you can type (as user twistonl on machine midtwist):
restart_mhttpd
which kills the presently running MHTTPD and starts a new one.
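The choice between the two scripts can be summarised as a small decision function. This is only an illustrative sketch: the function name and its two boolean inputs are hypothetical (they stand for what the operator observes), and only the script names start_mhttpd and restart_mhttpd come from this page.

```python
from typing import Optional


def choose_mhttpd_command(browser_connects: bool, behaves_normally: bool) -> Optional[str]:
    """Return the script to run as twistonl on midtwist, or None if nothing is needed.

    Hypothetical helper: both arguments are the operator's own observations.
    """
    if not browser_connects:
        # MHTTPD appears to be dead: start_mhttpd starts it only if it is not running.
        return "start_mhttpd"
    if not behaves_normally:
        # MHTTPD answers but misbehaves: restart_mhttpd kills it and starts a new one.
        return "restart_mhttpd"
    return None


print(choose_mhttpd_command(browser_connects=False, behaves_normally=False))
```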
ODB
Restoring the ODB Terminal Window
Normally, on the "DAQ" desktop, there is a terminal window open to the
ODB. This window (with a prompt that usually looks something like
[local:twist:Running]/> or similar) is where the system messages,
errors, etc. appear (including what's read over the speakers). This
window is not the actual ODB, but rather an interface
to it. Losing this window should not affect data taking; it just
means you don't see the status or error messages.
To get the ODB window, open a terminal window, and as user twistonl on
machine midtwist type
odbedit
The prompt should change, and you're now in the ODB. (You might want
to make this window a little bigger.)
Recovering from an ODB Corruption
If the ODB is not responding, and you cannot reach Konstantin
or Renee, a rather complicated recovery procedure is available.
Restarting TWIST DAQ
This section includes restarting the Lazy Logger.
All commands are run as user twistonl on machine midtwist unless
otherwise noted.
General Problem
- Stop the run.
- Run start_daq from a command prompt. (This will check for all the
processes that should be running, and restart any which aren't.)
PPC (PowerPC) Problem
- Stop the run.
- Reboot the PowerPC and restart the associated frontend (see below).
- If that doesn't help, try to contact an expert (Renee or Konstantin).
- If that doesn't help (i.e. you can't reach them), do a kill_daq
followed by a start_daq (but only as a last resort). (You'll need to
reboot the PPC machines at this point, too.)
To Reboot PPCs:
From the Midas Status page, push the "Programs" button. Then push the
"stop" button for fbc1fe or fbc2fe, whichever PPC has the problem.
They should go red on the status page. Then reboot the PPC machine:
- Press Enter, then type reboot at the ">" prompt. If no go,
- press Ctrl-X. If no go,
- hard-reset the PPC by pressing the button on the PPC front panel. The
PPCs are in the right-most slots of the Fastbus crates.
The last things you should see are some pedestal things, followed by a
few TDC things. Watch that the error "Cannot init equipment record,
probably other FE is using it" does not occur. If the
error does show up, go to the programs page and stop the front end
again, then reboot the PPC again. The error shouldn't come up this
time.
If, when you press the "stop fbcxxx" button on the "Programs"
page, it complains that it cannot stop the frontend, continue with
rebooting the PPC.
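If the PPC console output has been copied into a text string, the check for the start-up error described above could be sketched as follows. The function names and the idea of capturing the console text are assumptions made for illustration; only the error message itself is taken from this page.

```python
# Error text quoted on this page; everything else here is a hypothetical sketch.
INIT_ERROR = "Cannot init equipment record, probably other FE is using it"


def console_has_init_error(console_text: str) -> bool:
    """True if the frontend start-up error appears in the captured console text."""
    # Collapse whitespace so the match survives line wrapping in the console window.
    normalised = " ".join(console_text.split())
    return INIT_ERROR in normalised


def next_action(console_text: str) -> str:
    """Suggest the next step after a PPC reboot, following the text above."""
    if console_has_init_error(console_text):
        return "stop the frontend on the Programs page, then reboot the PPC again"
    return "no init error seen"
```

For example, next_action applied to console text containing the error, even when it is wrapped across two lines, returns the advice to stop the frontend and reboot again.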
Restarting TWIST Slow Controls
When the Launcher is working:
If the slow controls programs are not running, go to the main Midas
page, and click on "SlowLauncher" (just below the row of buttons, near
the top of the page). This screen lets you stop and start the various
Slow Controls programs. ("test" does not need to be running. All
others should be running.)
You can also access the SlowLauncher directly at this address:
http://midtwist.triumf.ca:8082
In either case, if you get "connection refused" errors instead of the
"Slow Launcher" page, the launcher is probably not running. Start it by
running start_daq as twistonl@midtwist.
If there are still some red "FE Nodes" on the Midas page after trying
to start everything, try stopping the appropriate frontends
and restarting them.
Which one needs restarting can be determined by looking for which
frontend on the SlowLauncher page does not appear on the Midas page.
(This snapshot of the Midas page may also
be helpful.) The following shows some of the "StatusBar" items
controlled by Slow Controls frontends:
- "PostAmp": fepa
- "HV": fecamac
- "u_Beam" muon gas degrader: fecamac, fe1hp
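The comparison described above (a frontend listed on the SlowLauncher page but missing from the Midas page) is a set difference, and the StatusBar list is a small lookup table. A minimal sketch, assuming the frontend names have been typed in by hand from the two pages; the function names are hypothetical and the table holds only the three entries listed here.

```python
# StatusBar item -> Slow Controls frontends, copied from the list above.
STATUSBAR_FRONTENDS = {
    "PostAmp": ["fepa"],
    "HV": ["fecamac"],
    "u_Beam": ["fecamac", "fe1hp"],
}


def missing_frontends(launcher_names, midas_names):
    """Frontends listed on the SlowLauncher page but absent from the Midas page."""
    return sorted(set(launcher_names) - set(midas_names))


def frontends_for(statusbar_item):
    """Frontends behind a StatusBar item; empty list if the item is not in the table."""
    return STATUSBAR_FRONTENDS.get(statusbar_item, [])


print(missing_frontends(["fepa", "fecamac", "fe1hp"], ["fepa", "fe1hp"]))
```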
Make an Elog entry and restart the front ends when you have trouble
controlling them.
When the Launcher is hung:
Normally, the launcher is stopped by using the "shutdown" button on
its web interface. If the launcher is not responding you will have to
kill it by hand:
- try to find me; I want to debug launcher hang-ups.
- if I cannot be found, kill the launcher by hand:
- killall -KILL Launcher.exe
- kill all the slow controls frontends: "ps -ef | grep fe",
then manually kill all front-ends.
- run "netstat -a | grep 8082" to check that the server socket is
free. The new launcher will not start unless the above command
returns nothing. The socket may still be in use by a running
front-end. You will have to hunt it down and kill it by hand.
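A rough programmatic counterpart to the netstat check is to test whether anything still accepts connections on port 8082. This is a sketch under assumptions: the function name is hypothetical, it is meant to be run on midtwist itself, and it only detects a listening socket. netstat also shows connections in other states, so the netstat command above remains the authoritative check.

```python
import socket


def port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        probe.settimeout(timeout)
        # connect_ex returns 0 on success instead of raising an exception.
        return probe.connect_ex((host, port)) == 0


LAUNCHER_PORT = 8082  # port taken from the SlowLauncher address on this page

if __name__ == "__main__":
    if port_in_use(LAUNCHER_PORT):
        print("port 8082 still in use: find and kill the process holding it")
    else:
        print("port 8082 looks free")
```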
Possibly Special Case: fe3hp
If unable to restart fe3hp, try the following instructions from Konstantin:
fe3hp was not running because the Digi terminal server was refusing
connections. Here is the recovery procedure:
1) go to http://142.90.130.83/
2) login as "root", password "dbps"
3) go to the "admin"->"who" page
4) find the "tty03 | 142.90.101.76 | direct_tty01 | 0 | Kill Session" entry
5) go to the "kill session" link
6) go to the MIDAS Status->Programs page. "Stop fe3hp"
7) fe3hp should restart, "ubeam stale data" should clear within 2 minutes.
K.O.
After an e614slow reboot or crash
In general, a crash of e614slow does not prevent running the DAQ and
taking data. However, a number of services will be disrupted and may
have to be restarted manually:
- ser2net: a terminal server emulation program running on
e614slow. It controls the fe5hp serial port and the PPC1 and
PPC2 consoles. Restart it manually by running "start_daq" as
twistonl@midtwist.
- PPC1 and PPC2 consoles: will die when ser2net dies. Restart
them by running "start_daq".
- fecamac: high-voltage control, gas degrader control, p_beam and
scalers readout. fecamac should restart automatically after
e614slow is booted.
- feepics: beam line readout and B1/B2 regulators. feepics should
restart automatically.
- felas: alignment system readout. Presently, due to USB
configuration problems, this frontend will not restart after an
e614slow reboot. Recovery requires an expert armed with the
e614slow root password. The recovery procedure is documented in elog.
If an expert cannot be contacted, continue running without felas.
- fe5hp: beam line magnet voltages and temperatures. Presently,
due to problems with USB-to-serial converters, this frontend
will not restart after an e614slow reboot. The recovery procedure is documented in elog. If you do not feel brave enough to follow these instructions and an expert cannot be contacted, continue running without fe5hp.
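The list above can be condensed into a lookup table that separates services needing operator action from those that recover on their own. This is an illustrative sketch only: the table is transcribed from the list, while the dictionary layout, the action labels and the function name are hypothetical.

```python
# Service -> recovery after an e614slow reboot, transcribed from the list above.
RECOVERY = {
    "ser2net": "start_daq",
    "PPC consoles": "start_daq",
    "fecamac": "automatic",
    "feepics": "automatic",
    "felas": "elog procedure (expert with root password)",
    "fe5hp": "elog procedure",
}


def manual_steps(recovery=RECOVERY):
    """Services that need operator action, grouped by the action required."""
    grouped = {}
    for service, action in recovery.items():
        if action != "automatic":
            grouped.setdefault(action, []).append(service)
    return grouped


for action, services in manual_steps().items():
    print(action, "->", ", ".join(services))
```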
Handling Software Alarms and Error Messages
Remember to make an ELog entry whenever there's an
alarm.
Event Builder Alarm
If the Event Builder dies, you will see a large red
bar on the Status Page.
- In the Status page, press button Programs, press on "Start"
EBuilder.
- Press button Alarms, press button Reset at right of line for
EBuilder.
- Press button Status to return to the Status page.
Synch Error
Make sure the run has successfully stopped ("End run" lines in both
PPC windows). If it hasn't, stop it manually from the Status screen.
Try to start a new run. If it dies, try again. If that dies too,
call an expert.
Once a new run has started, reset the alarm: from the Status page,
press button "Alarms", then press button "Reset" at right of line for
"fbc1fe" or "fbc2fe" (whichever tripped). Press button Status to
return to status page.
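The retry rule for a synch error (try a new run, try once more, then call an expert) can be written as a short loop. A minimal sketch: start_run is a hypothetical callable that returns True if the new run stays alive. This page does not define such a function, and in practice the run is started by hand from the Status page.

```python
from typing import Callable


def start_run_with_retry(start_run: Callable[[], bool], max_attempts: int = 2) -> str:
    """Try to start a run up to max_attempts times, then escalate.

    start_run is a hypothetical stand-in for the operator starting a run
    and observing whether it survives.
    """
    for attempt in range(1, max_attempts + 1):
        if start_run():
            return f"run started on attempt {attempt}; now reset the fbc1fe/fbc2fe alarm"
    return "call an expert"
```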
lazy_file_exists file runNNNNN.ybs doesn't exists
If the Messages screen (or the ODB window) is filling up with messages
like the above, check to see if the mentioned run file does exist:
ls -l /twist/data_onl/current/runNNNNN.ybs
It probably exists, but has zero size. If that's the case, just
delete the file (with rm), and see that the lazy logger stops
complaining.
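The same check-then-delete step can be sketched in Python. The function name is hypothetical; the only safeguard it adds is that a file is removed only when its size is exactly zero, so a run file that holds data is never touched.

```python
import os


def remove_if_empty(path: str) -> str:
    """Delete path only if it exists and has zero size; report what was done."""
    if not os.path.exists(path):
        return "missing"
    if os.path.getsize(path) == 0:
        os.remove(path)
        return "removed"
    # A non-empty file may hold real data, so it is left alone.
    return "kept"
```

It would be called with the full path of the file named in the message, for example the /twist/data_onl/current/runNNNNN.ybs path shown above with the actual run number filled in.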
General DAQ/Midas Operation
The DAQ consists of two sets of components: DAQ proper and Slow Controls.
Slow Controls is documented elsewhere on this page.
DAQ proper components are running on the host machine MIDTWIST
and on the Power PCs located in each of the Fastbus crates.
DAQ Computers
The TWIST DAQ uses several computers:
- E614VW 1 - A VxWorks Power PC located in the Fastbus crate 1 which
collects the FBC1 data bank but is not readily accessible to the
user.
- E614VW 2 - A VxWorks Power PC located in the Fastbus crate 2 which
collects the FBC2 data bank but is not readily accessible to the
user.
- E614SLOW - a Linux Pentium located on the floor
of the experiment hall. The alignment system USB video cameras
are connected to the e614slow USB port.
- Emulex terminal server - the server to which the HP DVMs are
connected.
- MIDTWIST - a Linux Pentium located in the counting room. This is
where the user interacts with the DAQ and analyses data. midtwist is
on the TWIST cluster.
- M13BEAM - holds the "split" data files (every Nth event is
written here as well as to the main data files).
- E614DB - holds the experiment databases (hardware etc.).
PPC tasks:
The tasks running on the PowerPCs are called "fbc1fe" and "fbc2fe"
respectively.
Outputs from these tasks are seen in the "PPC windows" PPC1 and PPC2.
These windows are not essential to the acquisition of data, but they are
the consoles of the PPCs and give useful information.
If power is lost on the Fastbus crates, PPC1 in crate 1 reboots
automatically when power returns. PPC2 usually loses its boot parameters,
which have to be reloaded from the PPC2 window.
If anything goes wrong in the PPCs, it is best to shut down the programs
from netscape and to issue a reboot command in the PPCx window.
If the PPCx window does not respond to a <CR>, you have to go to the
Fastbus crate and press the "reset" button on the PPC. If you do not know
where that is, turn the power off at the PPC power supply control and
turn it back on.
MIDTWIST tasks:
The essential tasks running on midtwist are
- Logger - the task that writes the events to disk or tape and that
keeps the history information provided by the slow controls.
- EventB - the event builder task, which receives the events from the
parallel PPCs in the Fastbus crates.
- StatusBar - the TCL program that reports slow control status
The other tasks running on midtwist are
- fedaq - a utility task that fills the runlog
- Speaker - the task that synthesizes important messages to the speakers
General Slow Controls Operation
The Slow Control frontend (SCfe) programs access slow control readout
devices or Epics to set and/or monitor variables.
The SCfe programs report information in two ways:
- Data events are sent to the MIDAS event pool. If "logging" is
enabled, these events are written in the run data files mixed with the
Fastbus TDC events. Later the files are spooled to tape. The SC events
contain the RAW data and the CALIBRATED data. These SC events are
generated:
- When a BEGIN_RUN command is received. This kind of
trigger condition will build events from the data
already in the memory from the last read.
- On a periodic interval, the programs read sequentially
all the values they are monitoring. A data event will be
generated if the new value read differs from the
previous value by a certain threshold. This threshold
can be zero and it is zero for certain programs. We do
not have the value of this threshold in the above
mentioned database. Most of the time the threshold value
is hardcoded in the program. It is possible to make this
an ODB variable (easy to change).
- Data is written in the ODB (Online DataBase) which is a
snapshot of the status of DAQ. When data is sent to the ODB, it
replaces previous values. There is much more information in the ODB
than just the values of variables. At the end of each run, the ODB is
written in an ASCII file called runxxxxxx.odb, kept on disks and
backed up regularly. It is intended to keep all these files forever on
the disk. Only the CALIBRATED values are sent to the ODB. Also,
there is a setting called "LOG_HISTORY" for each "equipment" type. If
this value is >0, it is interpreted as the interval for logging the
variables from the given equipment to the HISTORY files. This is done
by the program logger all the time regardless of the DAQ state:
RUNNING or STOPPED. If the program logger is NOT RUNNING, there is no
history file created. We usually turn off this program during long
shutdowns. At the moment, all our SC equipment has LOG_HISTORY set
to 60 seconds, except for LAS, which reads out whenever it feels
like it. (This should be fixed.) The values logged in the history files
are taken directly from the ODB. There is no knowledge as to when the
value was read last. It is just a snapshot of the values every minute.
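The periodic-read rule in the first bullet above (a data event is generated only when a value changes by more than a threshold) can be sketched as follows. Two points are assumptions rather than facts from this page: an event is produced when the absolute difference strictly exceeds the threshold, so a zero threshold reports every change, and each reading is compared with the previous reading rather than with the last value sent. The function names are hypothetical, and the real frontends hardcode their own thresholds.

```python
def changed_enough(new: float, previous: float, threshold: float) -> bool:
    """True if the new reading should produce a data event.

    Assumption: an event is produced when |new - previous| exceeds the
    threshold, so a zero threshold reports every change in value.
    """
    return abs(new - previous) > threshold


def events_from_readings(readings, threshold):
    """Return the readings that would generate a data event."""
    events = []
    previous = None
    for value in readings:
        if previous is not None and changed_enough(value, previous, threshold):
            events.append(value)
        previous = value  # compare against the previous read, as the text states
    return events


print(events_from_readings([10.0, 10.5, 12.0, 12.0], threshold=1.0))
```

With these readings and a threshold of 1.0, only the jump from 10.5 to 12.0 produces an event. With a threshold of zero, both changes do, and the repeated 12.0 does not.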
Shows, among other things, which Slow Controls frontends handle which
variables.
B1, B2 NMR regulators
The B1 and B2 NMR regulators are designed to compensate for slow drifts
of magnet currents and fields. The regulators adjust the B1 and B2 DACs
to force the NMR values to be as close as possible to user-set NMR
setpoint values.
The regulator code is in feepics and it uses the NMR measurements
from fenmr. The regulator activity can be seen in the "history" panels
"B1_B2_regulator" (NMR deviations from the setpoints) and
"B1/B2_regulator" (raw NMR and averaged NMR vs DAC values).
During normal running, the B1 and B2 regulators should be turned
on. If they are off, for example after feepics restart, the "B1/B2
regulator is OFF" alarms will trigger. These alarms can only be
cleared by restarting the regulators or by disabling the alarms.
The regulators are started by entering the NMR setpoints. First,
make sure the NMRs are locked. (If NMRs lose the
lock, the "NMR" status goes "yellow" and a midas alarm should trigger,
but it is not 100% reliable.) Also, B1 and B2 should be set to
within 5 Gauss of the desired values. Then go to the
MIDAS status page and click on the "Reg_B1" or "Reg_B2" aliases. This
will open the ODB "alias" page. Click on the "Reg_B1" or "Reg_B2" NMR
setpoints and enter the desired NMR values. (This activates the
regulators.) Go back to the midas status page. You should see "Start
the B1/B2 regulator with NMR setpoint: xxx" midas messages. Clear the
"B1/B2 regulator" alarms by going from the midas status page to the
"alarms" page and clicking the "reset alarm" buttons. If the alarms
refuse to clear, try clearing them again after 5 minutes. If they
still do not clear, call the expert.
Once running, the regulators can fail. If they do, the "B1/B2
regulator FAILED" alarms should trigger. There are several failure
reasons: changing DAC values from EPICS, NMRs losing their signal lock
for more than 10 minutes and NMRs drifting too far away too fast.
To Restart a Regulator
- Make sure that the NMR is locked.
- Adjust the DACs (from EPICS) to bring the NMR readback within 5
Gauss of the desired NMR setpoint.
- Restart the regulator by reentering the NMR setpoint.
- Clear the alarm from the "alarms" page. If the alarm does not
clear, try clearing it again after 5 minutes. If it still
does not clear, call the expert.
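The two preconditions for (re)starting a regulator, NMR locked and readback within 5 Gauss of the setpoint, can be expressed as a single check. This is a sketch with a hypothetical function name; treating exactly 5 Gauss as acceptable is an assumption, and the field values in the example are invented for illustration.

```python
def ready_to_start_regulator(nmr_locked: bool, readback_gauss: float,
                             setpoint_gauss: float,
                             tolerance_gauss: float = 5.0) -> bool:
    """Preconditions from this page: NMR locked and readback within 5 Gauss of setpoint."""
    # Assumption: a deviation of exactly the tolerance still counts as "within".
    return nmr_locked and abs(readback_gauss - setpoint_gauss) <= tolerance_gauss


# Illustrative numbers only; they are not real B1/B2 settings.
print(ready_to_start_regulator(True, readback_gauss=1003.0, setpoint_gauss=1000.0))
```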
Using the StatusBar
The StatusBar is a button bar which is normally started by the
start_daq script. Some useful information about the StatusBar:
- Clicking a button will bring up a control or display, depending
on the button.
- Right-clicking on a button will bring up a window with a status
message, explaining why the button is the colour it is.
- Button colours mean the following:
- Green
- Everything is (apparently) OK.
- Dark Grey
- Some of the status information is stale/unavailable.
- Yellow
- A warning about something. (Right-click to find out
what.)
- Red
- Something is wrong. (Note that sometimes a red
condition can appear because of bad readback, especially
for buttons like PostAmp; if these buttons turn red for
a particular channel, keep an eye on it to see if it
clears itself. Otherwise, investigate or call an
expert.)
- White
- Status is non-standard, but this is apparently
deliberate. e.g. The DAQ button is white when the run
is stopped; HV button is white when one or more HV
channels are turned off.
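For quick reference, the colour legend above can be held in a lookup table. The meanings are paraphrased from the list; the dictionary and function names are hypothetical and are not part of the StatusBar program.

```python
# Button colour -> meaning, paraphrased from the list above.
STATUSBAR_COLOURS = {
    "green": "everything is (apparently) OK",
    "dark grey": "some status information is stale or unavailable",
    "yellow": "warning; right-click the button for details",
    "red": "something is wrong, or the readback is bad",
    "white": "non-standard but deliberate state",
}


def explain_colour(colour: str) -> str:
    """Return the meaning of a button colour, ignoring case and outer spaces."""
    return STATUSBAR_COLOURS.get(colour.strip().lower(), "unknown colour")


print(explain_colour("Dark Grey"))
```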
Other DAQ-Related Software
Last updated 1 December 2005 by Robert MacDonald.