This session walks the user through the process of developing a custom service dashboard for the Blackboard Learn platform, which was deployed in production December 2011. It displays metrics such as disk space, database tables, processor load, connections, the number of users, etc., automatically updating.
The author shares the measures used and how they were obtained, the APIs used to display the content and manage access, and the use of ajax and Google charts to provide live updates. Some time is spent explaining the design philosophy so that viewers aren’t dazzled by an array of blinking lights.
It concludes by showing how we have incorporated other monitoring tools into the dashboard and our plans for the future.
Delivered at BbWorld 2012 in New Orleans
19. A lot of shell scripts
cd /local/bboard/blackboard/content
# The ? characters are placeholders for single quotes; the final sed swaps them in.
df -P . | grep -v '1024-blocks' | awk '{print "insert into dur_dashboard_data (when, name, space, capacity, used, available) values (sysdate,?duocontent?,?"$1"?,?"$2"?,?"$3"?,?"$4"?);"}' | sed "s^?^'^g" >> /local/home/bbuser/sc/intotable.sql
Filesystem              1024-blocks      Used Available Capacity Mounted on
/dev/sda1                  16246428   3231304  12176772      21% /
tmpfs                      32995944   2401084  30594860       8% /dev/shm
/dev/mapper/vg0-s01        51606140   3914360  45070340       8% /s01
/dev/mapper/vg0-s02        51606140  14266484  34718216      30% /s02
/dev/mapper/vg0-data01    258030980 181684776  63239004      75% /data01
/dev/mapper/vg0-data02    258030980  88587208 156336572      37% /data02
/dev/mapper/vg0-data03    309637120 284903840   9005280      97% /data03
ssh bbuser@duoapp1 'w'
10:23:05 up 3 days, 23:44, 0 users, load average: 0.02, 0.04, 0.00
USER TTY FROM LOGIN@ IDLE JCPU PCPU WHAT
20. Database Tablespaces
spool /local/home/bbuser/sc/tablespacereturn;
SELECT Total.name "Tablespace Name",
       nvl(Free_space, 0) Free_space,
       -- a full tablespace has no free-space rows, so treat NULL as 0
       total_space - nvl(Free_space, 0) Used_space,
       total_space
FROM
  (select tablespace_name, sum(bytes/1024) Free_Space
   from sys.dba_free_space dfs
   group by tablespace_name
  ) Free,
  (select b.name, sum(bytes/1024) TOTAL_SPACE
   from sys.v_$datafile a, sys.v_$tablespace B
   where a.ts# = b.ts#
   group by b.name
  ) Total
WHERE Free.Tablespace_name(+) = Total.name
ORDER BY Total.name
/
spool off;
46. What next?
Better integration with the F5
More on NetApps disk usage
Java Memory Utilization
Number of downloads per user
Decide what to make public!
49. Summary
Keep it simple
Light touch on system being monitored
Allow dynamic reconfiguration
Manage access using tabs & roles
Learn from others
Slides available at: http://db.tt/rp2D88Nt
50. @malcolmmurray
malcolm.murray@durham.ac.uk
malcolm.murray@gmail.com
Editor's Notes
Developing a Service Dashboard: keeping an eye on things. This session walks the user through the process of developing a custom service dashboard for the Blackboard Learn platform, which was deployed in production December 2011. It displays metrics such as disk space, database tables, processor load, connections, the number of users, etc., automatically updating. Malcolm Murray will share the measures used and how they were obtained, the APIs used to display the content and manage access, and the use of ajax and Google charts to provide live updates. Some time is spent explaining the design philosophy so that viewers aren’t dazzled by an array of blinking lights. It will conclude showing how we have incorporated other monitoring tools into the dashboard and our plans for the future. Audience: Sys Admins? Managers? Developers? Eyeball from: http://www.clker.com/cliparts/q/K/E/M/8/C/green-eye-md.png
The key issue here is that Blackboard is a complicated service to manage. There’s a lot to it, and some parts are essentially black boxes (closed systems), but we need to keep it running smoothly. Out of the box, there aren’t many management tools (though the Admin Console is a good start). Image source: http://myhometheaterbuild.wordpress.com/2011/08/03/feeling-ocd-how-to-clean-your-car-engine/
The team managing a service don’t need a report from every switch or transaction – they don’t need to know everything, but do want to quickly check all seems well, or be alerted to any problem. What’s more, this needs to be done in such a way that it doesn’t get in the way of their other activities. We can’t assume that all the team are Linux/database/Java/whatever gurus, so the tool needs to indicate where the concern is in plain English. Dashboard icon: http://www.veryicon.com/icons/system/smoothicons-5/dashboard-17.html
So what should the Blackboard dashboard look like? The borrowing of terms from automotive design is appropriate – the “dashboard” needs to convey the data we need, without distracting us from the road ahead. The second photo is an example of a poorly designed dashboard – your focus is on the steering column, not the windscreen. Image source: http://tutsplus.com/tutorial/creating-a-car-dashboard-using-the-brush-tool/ Photo source: http://static.stomp.com.sg/site/servlet/linkableblob/stomp/1148306/data/a98211_d13jpg1338954675095-data.jpg
So just why is it so complicated?
For most institutions, Blackboard is not a stand-alone service running on a single box. From our own experience, as demand increased, we brought in a load balancer; we currently run on four virtual servers, have a dedicated collab server and a dedicated database, store data on a filestore (a NetApps appliance), etc. Each server has different partitions, and the database spans multiple volumes, all with their own quota. There are a lot of interdependencies and it is essentially a meta-service composed of lots of discrete units.
When we started, each part had (if we were lucky) its own monitoring interface. Each had a different URL, some required a password, some were only accessible on-site, etc. Some of the pages were definitely not designed for quick inspection: key information such as server load may take careful scrutiny. Bringing these together introduces a further tension – the need to keep things simple, yet provide access to lots of interfaces in one place. Image icon source: http://www.caradvice.com.au/20229/cadillac-introduces-2010-srx-crossover/
At this point there is a real danger of scope-creep – what do we need the dashboard to do? The initial aim was some form of (semi) real time monitoring, which can provide the metrics needed for service measurement if we persist them somewhere. Some hoped it would provide alerts – e.g. triggering emails if a measure exceeded a critical threshold. Personally I think this is in the wrong place – don’t expect a failing server to email you that it is going down. It is hard to escape the ever-growing demand for KPIs – could this provide some? Should it? Could KPIs help shape what we measure? The scoping stage needs a lot of discipline or you end up trying to spec the impossible. My advice is to start small, and be prepared to learn, borrow, share and sometimes even start again from scratch (but wiser)! Image: http://img.ehowcdn.com/article-new/ehow/images/a05/ms/99/wire-switch-panel-race-cars-800x800.jpg
A dashboard needs to provide some measurements
The obvious candidates were used and available disk space, ditto for database tables, some measure of processor load and an indicator of users to help us understand any sudden changes in the figures. We started with the easy things to measure and added more as we could. An important activity is verifying the results – do your numbers add up? If you say a disk is approaching 75% capacity, is there really 25% free? Image source: http://www.moates.net/innovate-auxbox-lma-3.html
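One way to run this kind of "do the numbers add up?" check is to recompute the percentage from the raw columns and compare it with what the tool reports. The sketch below does that for `df -P` output. It assumes the Capacity column is used / (used + available), rounded up, which is what the sample figures on slide 19 are consistent with; the function name `check_df_capacity` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: recompute each filesystem's percentage from the
# Used and Available columns of `df -P` output (read from stdin) and
# compare it with the Capacity column that df itself reports.
check_df_capacity() {
    awk 'NR > 1 {
        used = $3; avail = $4
        reported = $5; sub(/%/, "", reported)
        denom = used + avail
        if (denom == 0) next                          # skip empty pseudo-filesystems
        pct = int((used * 100 + denom - 1) / denom)   # integer ceiling
        status = (pct == reported + 0) ? "OK" : "MISMATCH"
        printf "%s %d%% %s\n", $6, pct, status
    }'
}

# Two rows copied from the sample output on slide 19.
check_df_capacity <<'EOF'
Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/sda1 16246428 3231304 12176772 21% /
/dev/mapper/vg0-data03 309637120 284903840 9005280 97% /data03
EOF
```

Note that Used plus Available need not equal the 1024-blocks total: for /dev/sda1 in the sample they sum to 15408076 against a total of 16246428, so a "percentage free" worked out from the total column would not match the figure df shows.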
Our database is split over several logical volumes. Although the tables are set to autogrow, in the past we have sometimes come very close to running out of space. Thus an early task was to get these figures onto the screen. Here the figures for disk data03 are shown in red – indicating they had exceeded a warning threshold. (Panic not, new data are no longer written to data03 now.) As these figures only change relatively slowly (and there is a non-trivial overhead in calculating them) we decided they should be updated when the page is refreshed, drawing on data updated hourly. Image source: http://thinkinginrails.com/wp-content/uploads/2010/05/database-integration.jpg
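The threshold rule itself is simple enough to sketch. The example below reads `df -P` style rows and labels each mount point grey (normal) or red (at or over the warning level). The 90% default and the names `WARN_PCT` and `colour_disks` are assumptions for illustration; the talk does not state the threshold actually used.

```shell
#!/bin/sh
# Hypothetical warning threshold (percent full); the real value is a
# configurable setting and is not given in the slides.
WARN_PCT=${WARN_PCT:-90}

# Print "mountpoint colour" for each row of `df -P` output read from stdin.
colour_disks() {
    awk -v warn="$WARN_PCT" 'NR > 1 {
        pct = $5; sub(/%/, "", pct)
        colour = (pct + 0 >= warn + 0) ? "red" : "grey"
        print $6, colour
    }'
}

# Two rows copied from the sample output on slide 19.
colour_disks <<'EOF'
Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/mapper/vg0-data02 258030980 88587208 156336572 37% /data02
/dev/mapper/vg0-data03 309637120 284903840 9005280 97% /data03
EOF
```

Keeping the threshold in one variable mirrors the idea of making it reconfigurable rather than hard-coding it into each report.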
If you want more detail about the database – e.g. to try and understand why one volume is filling up – click on the database tab. This lists the volumes, tables and indexes, complete with sparklines showing the measures for the last 48 hours. These trends help to stop panics if a temp folder suddenly starts filling up. Clicking on the sparkline opens a new window…
Here the disk usage is shown in two graphs, generated using the Google chart APIs. The left hand graph (red) shows relative usage – how near 100% of the disk space allocated to the system did we get? The right hand graph (blue) shows disk space in absolute terms – the horizontal grey line along the top shows that this disk has been sitting at 250GB throughout the monitoring period. The highlighted section along the bottom allows you to zoom in on a particular section of the graph – all thanks to Google’s code!
The performance of the app servers was another early candidate that made it to the dashboard. We wanted similar measures of disk usage and capacity, but also processor load and the number of connections to the load balancer. We also wanted some JVM stats, but that has to wait until version 2.01. As the load and connections are volatile, we needed these data to be regularly refreshed. Image source: http://www.7l.com/images/large-SL2600-Multi-Servers-icon.png
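The processor load figure can be taken from the header line of `w`, as in the sample on slide 19. A minimal sketch of that extraction follows; it assumes the Linux wording `load average: a, b, c`, and the function name `parse_load` is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: pull the 1-, 5- and 15-minute load averages out of
# the first line printed by `w` (or `uptime`) on stdin.
parse_load() {
    sed -n '1s/.*load average: *\([0-9.]*\), *\([0-9.]*\), *\([0-9.]*\).*/\1 \2 \3/p'
}

# Header line copied from the sample on slide 19.
echo '10:23:05 up 3 days, 23:44, 0 users, load average: 0.02, 0.04, 0.00' | parse_load
```

In use, the input would come from something like `ssh bbuser@duoapp1 'w' | parse_load`, which keeps the work done on the monitored server down to a single cheap command.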
For users, we can harvest the existing session data stored in the Blackboard database. Given that not everyone in the sessions table is likely to be an active user – we don’t all log out – we also plotted the number of logins. This tab provides a graph that the user can query, to see whether current usage is high, low or normal.
What happened at 4.30 yesterday? Did someone try and download the entire content collection? Did something go viral? Or was there a denial of service attack? On the front page we plot a summary of current users, updated using a Prototype query every minute. A more detailed version of these data is available from the Online Now tab…
The logic used to gather these data is based on Santo’s SENECA Who’s Online building block. Knowing who is actually online doing something can help diagnose strange events (load spikes) or provide a list of people to inform if things are about to go wrong!
I’ve sort of skipped ahead, showing you the end result without really explaining how. The next section gives you a flavour of the scripts we use to generate and record the data. Note they may not be the best way of doing them, or the most efficient. They are in-house solutions that work, and that’s good enough for us just now!
Most of the data are gathered using a cron job that triggers a set of shell scripts running on one server – we chose the collab server as it is under-used. They follow a set pattern: (1) invoke a Linux command to generate a set of measures (e.g. df -P or netstat), possibly redirecting the output to a file; (2) massage the file using commands such as grep and awk to get just the bits we need, in the format we want; (3) append these to a text file in the form of SQL insert statements; (4) once all the measures are done, run the SQL to persist the data. Credit here goes to my colleague Stephen Applegarth.
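The middle steps of that pattern can be sketched as a small reusable function. This is a variant of the slide 19 one-liner, not the script actually in use: it passes the single quote into awk as a variable instead of using `?` placeholders and a later `sed`. The table and column names come from slide 19; the function name `df_to_sql` is hypothetical.

```shell
#!/bin/sh
# Turn `df -P` output on stdin into one SQL insert statement per filesystem.
# $1 is the label stored in the name column (for example duocontent).
df_to_sql() {
    awk -v label="$1" -v q="'" 'NR > 1 {
        printf "insert into dur_dashboard_data (when, name, space, capacity, used, available) values (sysdate,%s%s%s,%s%s%s,%s%s%s,%s%s%s,%s%s%s);\n",
            q, label, q, q, $1, q, q, $2, q, q, $3, q, q, $4, q
    }'
}

# One row copied from the sample output on slide 19. In real use the input
# would be `df -P .` and the output would be appended to the SQL file.
df_to_sql duocontent <<'EOF'
Filesystem 1024-blocks Used Available Capacity Mounted on
/dev/sda1 16246428 3231304 12176772 21% /
EOF
```

Skipping the header with `NR > 1` replaces the `grep -v '1024-blocks'` step, so each measure needs one awk call rather than three processes.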
This example shows the query used to generate free, used and total disk space figures for the database tables.
The query on the previous slide generates output like this.
This project had zero budget and limited time. At the developer’s conference this year, Noriaki Tatsumi from Blackboard gave a great presentation showing how to use the free (GPL2 license) tool Zabbix for system monitoring, which links to a custom building block, allowing JMX calls, security checks and lots of other goodness. I am sure this is the way to go – we will be investigating this when I get home! His presentation was recorded and should be available with all the other devcon2012 materials when they are released.
OK – time to think more about the UI design decisions we made when developing the system dashboard.
Hard to argue with Einstein (and win). But what does this mean for our dashboard? Image source: http://www.wallchan.com/wallpaper/19591/
One of the design constraints/requirements was that the product needed to fit on this old 30” (75 cm) monitor (running 1360 x 768) that we had lying around the office. I am a firm believer that the front page of a dashboard shouldn’t need to scroll.
This wasn’t something that was going to be right under my nose all the time. It has to work from a distance.
We now look at a selection of dashboard designs culled from the internet – there are many more, just google ‘system dashboard’. http://www.dashboardinsight.com/dashboards/product-demos/altosoft-insight-dashboard-for-system-center.aspx
http://dashboardspy.wordpress.com/2010/12/08/excel-dashboard-tutorial/ Interesting example created using Excel – lots of VBA so no good for Mac users.
Two of the more classic analogue control panel designs – do you like these? http://www.designvsart.com/blog/2008/08/14/designing-information-dashboards/#.T_YAi3BzWw8
Red, amber, green lights. http://dashboardspy.wordpress.com/2006/11/02/business-analysis-monitoring-dashboards-bam-rolling-up-application-kpis-for-a-system-status-dashboard/
Which is your favourite? No right answer!
This example shows the way this dashboard appears to the 5% of men who suffer from deuteranopia (most common form of colour blindness). Can you tell which services are now in a state of alert? http://dashboardspy.wordpress.com/2006/11/02/business-analysis-monitoring-dashboards-bam-rolling-up-application-kpis-for-a-system-status-dashboard/
My thinking has been informed by reading around the subject. I have found these two authors particularly informative and thought provoking. N.B. That is not the same as saying that I agree with everything they say! Stephen Few has over 20 years of experience as an innovator, consultant, and educator in the fields of business intelligence (a.k.a. data warehousing and decision support) and information design. Through his company, Perceptual Edge, he focuses on the effective analysis and presentation of quantitative business information. Stephen is recognized as a world leader in the field of data visualization. He teaches regularly at conferences such as those presented by The Data Warehousing Institute (TDWI) and DCI, and also in the MBA program at the Haas School of Business at U. C. Berkeley. He is also the author of the book "Show Me the Numbers: Designing Tables and Graphs to Enlighten" (Analytics Press). Edward Tufte is an American statistician and professor emeritus of political science, statistics, and computer science at Yale University. He is noted for his writings on information design and as a pioneer in the field of data visualization.
When designing our dashboard we considered the phrase “Don’t bother me I am busy!” It is designed so that if after a quick glance all is grey, that means we can go back to our day job, all is well. If something is amiss it appears red and in bold font – drawing your attention to the issue. Consider how this would look if we had used red and green…
Sparklines are clever data-rich graphics that manage to impose a low cognitive load on the viewer. Tufte’s examples list start and end values in the sequence, plus low and high points. In our case we simply show the last 48 results, plotting the start and end values. We felt that was enough for this application.
They are rendered using a delightfully simple bit of javascript – simply enclosing the sequence of numbers in a custom span. This means that the browser builds the graph on the fly – delayed auto build. It doesn’t work in a grumpy browser like IE that won’t support the canvas element – not a problem for our implementation – Safari, Firefox or Chrome would do nicely.
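To show what "a sequence of numbers in a custom span" might look like in practice, here is a small sketch that keeps the most recent 48 readings and wraps them in a span for the javascript to pick up. It is an illustration only: the class name `sparkline` and the function name `sparkline_span` are placeholders, and the talk does not show how the dashboard itself emits this markup.

```shell
#!/bin/sh
# Hypothetical helper: read one measurement per line on stdin, keep the
# most recent 48, and wrap them in a span as a comma-separated list.
# The class name "sparkline" is a placeholder, not the one used in the talk.
sparkline_span() {
    values=$(tail -n 48 | paste -s -d, -)
    printf '<span class="sparkline">%s</span>\n' "$values"
}

# Made-up readings, purely to show the shape of the output.
printf '%s\n' 21 21 22 25 24 | sparkline_span
```

Because the page carries only the raw numbers, the drawing cost falls on the viewer's browser rather than on the server being monitored.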
Now let’s turn our attention to the deployment process.
2,996,403 visits, 1st Oct to 31 Dec 2011 (Google Analytics). The system is busy. We had to make sure that our monitoring wouldn’t tip it over the edge during busy periods. Image source: http://misterysnake.homepage24.de/bilder/indieluftgehen.jpg
An AJAX query updates key data every minute: the number of live sessions, the app server sparklines and the number of connections. Database and app server disk space figures are pulled from the database when the page is first loaded, then cached for this user until a refresh is forced. The bottom performance graphic is read from a file, which is generated automatically by another building block.
The rest of the graphing uses the Google chart API – this makes the browser do the work – all the data is stored on the page, the API selects which portion to graph – no JSP refresh needed. Initial graph: LHS relative, RHS actual – steps indicate physical growth in available disk space. Then overlain with an interactive graph where we can change the metric displayed.
Standard building block interface – a degree of future proofing. 1. Thresholds – match our current risk appetite – how full should a disk get before we colour it red? 2. URLs – allows us to change the location of external tools at will. 3. Access control – use institutional roles and tabs to manage who sees what – some of the pages that allow dynamic reconfiguration are visible to sys admins but not senior managers!
Standard building block interface – a degree of future proofing. Provide information here, displayed on individual server reports, to help understand any differences. Server diagram generated on the fly using Google chart APIs – helps to ensure categorisation is correct. Can edit these data from here without a restart.
This page allows you to provide a friendly name for the various tables and disks – so everyone knows who to contact and what to ask about if they see an alert. This information is echoed on graphs of individual disks/servers.
Use tabs to easily switch views – incl. links to pages shown earlier. Key information is on the front page, but useful links are collected in one place on the others.
Image source: http://www.tuvie.com/wp-content/uploads/solid-future-car-concept1.jpg This is only the start…
This slide shows our first attempt at replacing some of the old Apache load balancer pages with custom reports generated by querying the F5 appliance. Still very much a work in progress. This page updates automatically.
So what have we learned?
Keep it simple: one screen, grey is good, red is bad news. Lightweight: use ajax calls to update the screen, display content pulled from other sites. Dynamic reconfiguration: important it stays up to date and has all the information you need. Use tabs to organise content and control access. Learn from others: plenty of books, look at other dashboards around – which ones do you like/hate? It is doable – have a go yourself!