
HawkEye

A Monitoring and Management Tool for Distributed Systems
Todd Tannenbaum
Department of Computer Sciences
University of Wisconsin-Madison
http://www.cs.wisc.edu/condor
condor-admin@cs.wisc.edu

www.cs.wisc.edu/condor 1
What does Condor have?
› …lots of core technology for building a
distributed system
› …lots of core technology for monitoring the
status of a machine
› …lots of core technology for managing a
work load of tasks
› …lots of really, truly, skilled and
experienced developers and researchers at
building distributed systems. Some of the
best. Standout state employees. Honest.
➢ Email for Wisconsin Gov Scott McCallum:
wisgov@gov.state.wi.us

One day an avid Condor user asked:
“Say, could Condor technology be used for distributed system administration?”

Time to think…
› We gathered up our experiences with our
own management tasks, looked at the
mature Condor technology available to
us, and the HawkEye effort was born.
› Completely separate from Condor
from the end user's perspective.
➢ Can install HawkEye, or Condor, or both

First Component:
MONITORING
› Sysadmins first need information
about what is happening on the
machines they are responsible for.
➢ Both current and past
➢ Information must be consolidated and
easily accessible
➢ Information must be dynamic

Condor ClassAds
› Technology for an entity to describe
itself
› Simple attribute-value pairs
[
load_average = 1.3
free_Swap_space_mb = 140
number_of_processes = 92
keyboard_idle_secs = 6
ram = 128
total_swap = 512
total_memory = ram + total_swap
busy = load_average > 1.0
]
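To make the idea of attributes that hold expressions concrete, here is a minimal sketch in Python. It is not the Condor ClassAd library; the `evaluate` function and the dictionary representation are assumptions made for illustration, and it handles only numeric literals, attribute references, arithmetic, and single comparisons. Real ClassAds have richer semantics that this sketch ignores.

```python
import ast
import operator

# The ad from the slide, stored as attribute name -> expression text.
AD = {
    "load_average": "1.3",
    "free_Swap_space_mb": "140",
    "number_of_processes": "92",
    "keyboard_idle_secs": "6",
    "ram": "128",
    "total_swap": "512",
    "total_memory": "ram + total_swap",
    "busy": "load_average > 1.0",
}

_BINARY = {ast.Add: operator.add, ast.Sub: operator.sub,
           ast.Mult: operator.mul, ast.Div: operator.truediv}
_COMPARE = {ast.Gt: operator.gt, ast.Lt: operator.lt,
            ast.GtE: operator.ge, ast.LtE: operator.le,
            ast.Eq: operator.eq, ast.NotEq: operator.ne}


def evaluate(ad, name, _seen=()):
    """Evaluate one attribute, resolving references to other attributes."""
    if name in _seen:
        raise ValueError(f"circular reference through {name}")
    tree = ast.parse(ad[name], mode="eval")
    return _eval_node(tree.body, ad, _seen + (name,))


def _eval_node(node, ad, seen):
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.Name):
        # A bare name refers to another attribute in the same ad.
        return evaluate(ad, node.id, seen)
    if isinstance(node, ast.BinOp) and type(node.op) in _BINARY:
        return _BINARY[type(node.op)](_eval_node(node.left, ad, seen),
                                      _eval_node(node.right, ad, seen))
    if (isinstance(node, ast.Compare) and len(node.ops) == 1
            and type(node.ops[0]) in _COMPARE):
        return _COMPARE[type(node.ops[0])](
            _eval_node(node.left, ad, seen),
            _eval_node(node.comparators[0], ad, seen))
    raise ValueError("expression form not supported by this sketch")


print(evaluate(AD, "total_memory"))  # 640
print(evaluate(AD, "busy"))          # True
```

Because `total_memory` and `busy` are stored as expressions, their values follow the other attributes whenever the ad is updated.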

Condor ClassAds, cont.
› No fixed schema
› Attributes can contain values or
expressions
› Ads can be serialized as XML
› Open source libraries in C++ and Java to:
➢ Manipulate Ads and Ad attributes
➢ Store Ads
➢ Query collections of Ads
› Bindings for Perl and others on the way…

[Diagram: two HawkEye Monitoring Agents send ClassAd updates via secure UDP to the HawkEye Manager]

[Diagram: one HawkEye Manager surrounded by five HawkEye Monitoring Agents]

[Diagram: inside a HawkEye Monitoring Agent. Components shown: Hawkeye_Startup_Agent, Hawkeye_Monitor, and data sources /proc, kstat… The agent sends ClassAd updates via secure UDP to the HawkEye Manager.]

Monitor Agent, cont.
› Updates are sent periodically
➢ Information does not get stale
› Updates also serve as a heartbeat monitor
➢ Know when a machine is down
› Out of the box, the update ClassAd has
many attributes about the machine of
interest for system administration
➢ Current prototype: 184 attributes
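The heartbeat idea can be sketched as follows: the manager remembers when each machine last sent an update and treats a machine as down once too much time has passed. The class name, the update interval, and the number of missed updates tolerated are all assumptions for illustration; the slides do not give actual values or describe how HawkEye implements this.

```python
import time


class HeartbeatTable:
    """Tracks the arrival time of the latest update from each machine."""

    def __init__(self, update_interval_secs, missed_updates_allowed=3):
        # Both parameters are hypothetical; the slides state no real values.
        self.timeout = update_interval_secs * missed_updates_allowed
        self.last_seen = {}

    def record_update(self, machine, now=None):
        """Call whenever a periodic update arrives from a machine."""
        self.last_seen[machine] = time.time() if now is None else now

    def machines_down(self, now=None):
        """Machines whose last update is older than the timeout."""
        now = time.time() if now is None else now
        return sorted(m for m, seen in self.last_seen.items()
                      if now - seen > self.timeout)


table = HeartbeatTable(update_interval_secs=60)
table.record_update("node1", now=0)
table.record_update("node2", now=170)
print(table.machines_down(now=200))  # ['node1']
```

No separate "are you alive?" probe is needed: silence from an agent is itself the signal.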

“What if I want to monitor something you didn’t think about?”

Custom Attributes

[Diagram: HawkEye Manager and a HawkEye Monitoring Agent containing Hawkeye_Startup_Agent and Hawkeye_Monitor, with data sources /proc, kstat… Two additional inputs feed the agent:]
› Create your own HawkEye plugins, or share plugins with others
› Data from the hawkeye_update_attribute command line tool
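As a rough illustration of what a custom plugin might look like, the sketch below gathers a few disk values and prints them as `name = value` lines in the style of the ClassAd slide. The slides do not specify the plugin interface or the arguments of hawkeye_update_attribute, so the output format, the function names, and the attribute names here are all assumptions.

```python
import shutil


def collect_attributes(path="/"):
    """Gather custom values; the attribute names are made up for this sketch."""
    usage = shutil.disk_usage(path)
    megabyte = 1024 * 1024
    return {
        "custom_free_disk_mb": usage.free // megabyte,
        "custom_total_disk_mb": usage.total // megabyte,
        # Hypothetical threshold: flag the disk when under 5% is free.
        "custom_disk_nearly_full": usage.free < 0.05 * usage.total,
    }


def format_attributes(attrs):
    """Render attributes as 'name = value' lines (assumed output format)."""
    lines = []
    for name, value in sorted(attrs.items()):
        if isinstance(value, bool):
            value = "TRUE" if value else "FALSE"
        lines.append(f"{name} = {value}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(format_attributes(collect_attributes()))
```

Since ClassAds have no fixed schema, new attributes like these could be added without changing anything on the manager side.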

Role of HawkEye Manager

› Store all incoming ClassAds in an indexed
resident data structure
➢ Fast response to client tool queries about
current state
▪ “Show me all machines with a load average > 10”
› Periodically store ClassAd attributes into a
Round Robin Database
➢ Store information over time
▪ “Show me a graph with the load average for this
machine over the past week”
› Speak to clients via CEDAR, HTTP
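The two storage roles above can be sketched together: a dictionary of the latest ad per machine answers current-state queries, and a fixed-size ring per attribute keeps recent history. This is a simplified illustration, not the Manager's implementation. The `ManagerStore` class is hypothetical, it records history on every update rather than periodically, it scans all ads instead of using an index, and a real round robin database typically also consolidates older samples.

```python
from collections import deque


class ManagerStore:
    def __init__(self, history_slots=4):
        self.current = {}   # machine name -> latest ad (a plain dict)
        self.history = {}   # (machine, attribute) -> ring of (time, value)
        self.history_slots = history_slots

    def update(self, machine, ad, timestamp):
        """Replace the machine's current ad and append numeric history."""
        self.current[machine] = dict(ad)
        for attr, value in ad.items():
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                ring = self.history.setdefault(
                    (machine, attr), deque(maxlen=self.history_slots))
                ring.append((timestamp, value))  # oldest sample drops off

    def query(self, predicate):
        """Names of machines whose current ad satisfies the predicate."""
        return sorted(m for m, ad in self.current.items() if predicate(ad))

    def series(self, machine, attr):
        """Recent (time, value) samples, oldest first, for graphing."""
        return list(self.history.get((machine, attr), ()))


store = ManagerStore(history_slots=3)
for t, load in enumerate([2.0, 11.5, 12.0, 9.0]):
    store.update("node1", {"load_average": load}, t)
store.update("node2", {"load_average": 14.2}, 0)

print(store.query(lambda ad: ad["load_average"] > 10))  # ['node2']
print(store.series("node1", "load_average"))
# [(1, 11.5), (2, 12.0), (3, 9.0)]
```

The ring never grows past its slot count, which is the property that makes a round robin store attractive for long-running monitoring.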

Several different clients
› Command-line, GUI, Web-based

But sysadmins also sometimes have to do work…
› Task: copy a new library onto the
local disk of each machine.
➢ Just a script to copy via rcp/scp to
every machine… or is it?

Running tasks on behalf of
the sysadmin
› Submit your sysadmin tasks to HawkEye
➢ Tasks are stored in a persistent queue by the
Manager
➢ Tasks can leave the queue upon completion, or
repeat after specified intervals
➢ Tasks can have complex interdependencies via
DAGMan
➢ Records are kept of which task ran where
› Sounds like Condor, eh?
➢ Yes, but simpler…
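The once-only versus repeating behavior of queued tasks can be sketched with a small priority queue. Everything here is an assumption for illustration: the `TaskQueue` class, its method names, and the in-memory storage are hypothetical, and the sketch models neither persistence nor DAGMan dependencies.

```python
import heapq
import itertools


class TaskQueue:
    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker for equal run times
        self.log = []                      # (time, task name, machine)

    def submit(self, name, action, run_at=0, repeat_every=None):
        """Queue a task; repeat_every=None means run once, then leave."""
        heapq.heappush(
            self._heap,
            (run_at, next(self._counter), name, action, repeat_every))

    def run_due(self, now, machine):
        """Run every task that is due, recording which task ran where."""
        due = []
        while self._heap and self._heap[0][0] <= now:
            due.append(heapq.heappop(self._heap))
        for _, _, name, action, repeat_every in due:
            action(machine)
            self.log.append((now, name, machine))
            if repeat_every is not None:
                # Repeating tasks re-enter the queue for a later run.
                self.submit(name, action, now + repeat_every, repeat_every)


ran = []
queue = TaskQueue()
queue.submit("copy_library", lambda m: ran.append(("copy", m)))
queue.submit("check_disk", lambda m: ran.append(("check", m)), repeat_every=60)

queue.run_due(now=0, machine="node1")
queue.run_due(now=30, machine="node1")   # nothing is due yet
queue.run_due(now=60, machine="node1")
print(queue.log)
# [(0, 'copy_library', 'node1'), (0, 'check_disk', 'node1'), (60, 'check_disk', 'node1')]
```

The log list stands in for the record of which task ran where.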

Run Tasks in response to
monitoring information
› ClassAd “Requirements” attribute
› Example: Send email if a machine is low on
disk space or low on swap space
➢ Submit an email task with an attribute:
Requirements = free_disk < 5 || free_swap < 5
› Example w/ task interdependency: If load
average is high and OS=Linux and console is
idle, submit a task which runs “top”; if top
sees Netscape, submit a task to kill Netscape
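A minimal sketch of how a Requirements expression could select the machines a task should run for. It translates the `||` and `&&` operators into Python and evaluates the result against each machine's attributes. The function names and sample ads are hypothetical, a missing attribute is simply treated as "no match" (real ClassAds have their own rules for undefined values), and `eval` is used only for brevity, so this is not suitable for untrusted input.

```python
def requirements_met(requirements, machine_ad):
    """Evaluate a Requirements expression against one machine's attributes."""
    expr = requirements.replace("||", " or ").replace("&&", " and ")
    try:
        # Builtins are emptied so only the ad's attributes are visible.
        return bool(eval(expr, {"__builtins__": {}}, dict(machine_ad)))
    except NameError:
        # Assumption: an ad lacking a referenced attribute does not match.
        return False


def machines_to_alert(requirements, ads):
    """Names of machines whose ads satisfy the task's Requirements."""
    return sorted(name for name, ad in ads.items()
                  if requirements_met(requirements, ad))


# Hypothetical ads as the Manager might currently hold them.
ads = {
    "node1": {"free_disk": 40, "free_swap": 3},
    "node2": {"free_disk": 120, "free_swap": 200},
    "node3": {"free_disk": 2, "free_swap": 90},
}

print(machines_to_alert("free_disk < 5 || free_swap < 5", ads))
# ['node1', 'node3']
```

The email task itself stays generic; the Requirements attribute alone decides when and for which machine it fires.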

HawkEye Design Goals
› Monitoring
➢ Reliable presence
➢ Get data off the node in an extensible, consistent
manner
› Run Tasks
➢ In response to probe information
➢ Repeat or once-only semantics
➢ Audit log
› Independent and self-contained
› Cross-platform

Current Status
› Just beginning this project
› Initial release early summer
› Prototypes already running:
stop in and see initial HawkEye work,
Rm 3385 on Weds 9am – 12pm

Thank you!

“I was an overworked sysadmin. Now I have more free time thanks to HawkEye!”
