doc/en_US.ISO8859-1/captions/2009/dcbsdcon/davis-isolatingcluster.sbv

0:00:15.749,0:00:18.960
I do apologize for the (other)

0:00:18.960,0:00:22.130
for the EuroBSDCon slides.  I've redone the

0:00:22.130,0:00:23.890
title page and redone the

0:00:23.890,0:00:27.380
and made some changes to the slides
and they didn't make it through for approval

0:00:27.380,0:00:33.130
by this afternoon so

0:00:33.130,0:00:34.640
okay so

0:00:34.640,0:00:36.390
I'm gonna be  talking about

0:00:36.390,0:00:38.430
doing

0:00:38.430,0:00:42.889
about isolating jobs for performance and predictability
in clusters

0:00:42.889,0:00:43.970
before I get into that

0:00:43.970,0:00:46.010
I want to talk a little bit about

0:00:46.010,0:00:47.229
who we are and

0:00:47.229,0:00:49.520
what our problem space is like because that

0:00:49.520,0:00:54.760
dictates that… has an effect
on our solutions base

0:00:54.760,0:00:57.079
I work for the aerospace corporation.

0:00:57.079,0:00:58.609
We work;

0:00:58.609,0:01:02.480
we operate a federally-funded
research and development center

0:01:02.480,0:01:05.400
in the area national security space

0:01:05.400,0:01:09.310
and in particular we work with the air force
space and missile command

0:01:09.310,0:01:13.090
and with the national reconnaissance
office

0:01:13.090,0:01:16.670
and our engineers support a wide variety

0:01:16.670,0:01:20.550
of  activities within that area

0:01:20.550,0:01:21.830
so we have

0:01:21.830,0:01:23.740
a bit over fourteen hundred to correct

0:01:23.740,0:01:25.860
sorry twenty four hundred engineers

0:01:25.860,0:01:28.820
in virtually every discipline we have

0:01:28.820,0:01:33.520
as you would expect we have our rocket scientists,
we have people who build satellites

0:01:33.520,0:01:37.439
we have people who build sensors that go on
satellites, people who study these sort of things

0:01:37.439,0:01:38.130
that you

0:01:38.130,0:01:39.590
see when you

0:01:39.590,0:01:40.819
use those sensors

0:01:40.819,0:01:42.040
that sort of thing.

0:01:42.040,0:01:44.180
We also have civil engineers and

0:01:44.180,0:01:45.680
electronic engineers

0:01:45.680,0:01:46.649
and process,

0:01:46.649,0:01:49.170
computer process people

0:01:49.170,0:01:53.120
so we literally do everything related to space
and all sorts of things that you might not

0:01:53.120,0:01:55.270
expect to be related to space,

0:01:55.270,0:01:58.820
since we also for instance help build ground
systems ‘cause satellites aren’t very useful if

0:01:58.820,0:02:00.680
there isn't anything to talk to them;

0:02:02.540,0:02:04.090
and these engineers

0:02:04.090,0:02:07.420
since they're solving all these different problems we have

0:02:07.420,0:02:11.499
engineering applications in you know
virtually every size you can think of

0:02:11.499,0:02:15.539
ranging from you know little spreadsheet things that
you might not think of as an engineering

0:02:15.539,0:02:17.229
application but they are

0:02:17.229,0:02:22.249
to Matlab programs or a lot of C code

0:02:22.249,0:02:23.960
or one of traditional parallel for us

0:02:23.960,0:02:25.159
serial code

0:02:25.159,0:02:26.049
and then

0:02:26.049,0:02:30.949
large parallel applications either in house;
genetic algorithms and that sort

0:02:30.949,0:02:31.769
of thing,

0:02:31.769,0:02:32.900
or traditional

0:02:32.900,0:02:34.749
the classic parallel code

0:02:34.749,0:02:37.599
like you work around a crate or something material simulation
0:02:40.119,0:02:41.459
or that or food flow

0:02:41.459,0:02:43.869
or that sort of thing

0:02:43.869,0:02:44.240
so

0:02:44.240,0:02:46.349
so we have this big application space

0:02:46.349,0:02:49.029
just want to give a little introduction to that because
it

0:02:49.029,0:02:51.529
does come back and influence what we

0:02:51.529,0:02:55.999
the sort of solutions we look at

0:02:55.999,0:03:00.499
so the rest of the talk I’m gonna talk about rese…

0:03:00.499,0:03:05.259
we skipped a slide, there we are, that’s a little better.

0:03:05.259,0:03:08.940
Now, what I'm interested in is I do high
performance computing

0:03:08.940,0:03:10.109
at company

0:03:10.109,0:03:13.949
and I provide high performance computing resources
to our users

0:03:13.949,0:03:19.949
as part of my role in our technical
computing services organization

0:03:19.949,0:03:20.370
so

0:03:20.370,0:03:23.120
our primary resource at this point is

0:03:23.120,0:03:25.429
the fellowship cluster

0:03:25.429,0:03:26.540
it's a for the

0:03:26.540,0:03:29.569
named for the fellowship of the ring

0:03:29.569,0:03:30.449
so it's a…

0:03:30.449,0:03:32.520
… eleven axel nodes

0:03:32.520,0:03:33.930
wrap the core systems

0:03:33.930,0:03:35.909
over here there's a

0:03:35.909,0:03:39.659
Cisco a large Cisco switch. Actually today
there are around two sixty five oh nines if

0:03:39.659,0:03:40.899
you  assess them

0:03:40.899,0:03:46.149
and because we couldn’t get the port density we wanted otherwise

0:03:46.149,0:03:50.219
and primarily the Gigabit Ethernet system runs
FreeBSD currently 6.0 ‘cause we haven’t upgraded

0:03:50.219,0:03:51.089
it yet

0:03:51.089,0:03:55.639
planning to move probably to 7.1
or maybe slightly past 7.1

0:03:55.639,0:04:01.029
if we want to get the latest HWPMC changes in

0:04:01.029,0:04:05.900
we use the Sun Grid Engine scheduler was one of
the two main options for open source

0:04:05.900,0:04:08.949
resource managers on clusters the other one being
the…

0:04:09.959,0:04:11.499
… the TORQUE

0:04:11.499,0:04:15.939
and now recombination from cluster resources

0:04:15.939,0:04:17.389
so we also have

0:04:17.389,0:04:18.079
that's actually

0:04:18.079,0:04:22.090
40 TB that’s really the raw number on a sun thumper and
0:04:23.219,0:04:26.290
that’s thirty two usable once you start using RAID-Z2

0:04:26.290,0:04:30.939
since you might actually like to have your data
should a disk fail

0:04:30.939,0:04:32.969
and with today's discs RAID…

0:04:32.969,0:04:34.009
RAID five

0:04:34.009,0:04:35.249
doesn't really cut it,

0:04:37.379,0:04:40.220
And then we also have some other resources coming on but I’m going to be (concentrating on)

0:04:40.220,0:04:43.530
two smaller clusters unfortunately probably running Linux and

0:04:43.530,0:04:45.900
some SMPs but

0:04:45.900,0:04:49.990
I’m going to be concentrating here on the work we're
doing on our other

0:04:49.990,0:04:54.259
FreeBSD based cluster.

0:04:54.259,0:04:55.060
So, first of all

0:04:55.060,0:04:59.410
first of all I want to talk about why we want to
share resources. Should be fairly obvious

0:04:59.410,0:05:00.610
but I'll talk about it in a little bit

0:05:00.610,0:05:04.900
and then what goes wrong when you start sharing resources

0:05:04.900,0:05:08.449
after that I'll talk about some different solutions
to those problems

0:05:08.449,0:05:09.759
and

0:05:09.759,0:05:13.399
some fairly trivial experiments that we've done
so far in terms of enhancing the schedule or

0:05:13.399,0:05:15.860
using operating system features

0:05:15.860,0:05:17.730
so you mitigate those problems

0:05:19.349,0:05:20.110
and

0:05:20.110,0:05:25.110
then conclude with some feature work.

0:05:25.110,0:05:29.289
So, obviously if you have a resource the size…
the size of our cluster, fourteen hundred

0:05:29.289,0:05:30.970
cores roughly

0:05:30.970,0:05:32.819
you probably want to share it unless you

0:05:32.819,0:05:35.080
purpose built it for a single application

0:05:35.080,0:05:37.340
you're going to want to have your users

0:05:37.340,0:05:39.440
sharing it

0:05:39.440,0:05:42.909
and you don't want to just say you know, you get on Monday

0:05:42.909,0:05:45.330
probably not going to be a very effective
option

0:05:45.330,0:05:49.270
especially not when we have as many users as we
do

0:05:49.270,0:05:53.849
we also can't just afford to buy another one
every time a user shows up

0:05:53.849,0:05:54.959
so one of our

0:05:54.959,0:05:57.339
senior VPs said a while back

0:05:57.339,0:05:57.969
you know

0:05:57.969,0:06:02.349
we could probably afford to buy just about
anything we could need once

0:06:02.349,0:06:03.800
 we can't just

0:06:03.800,0:06:06.359
buy ten of them though

0:06:06.359,0:06:08.939
if we really, really needed it

0:06:08.939,0:06:09.680
dropping

0:06:09.680,0:06:11.460
small numbers of millions of dollars on

0:06:11.460,0:06:13.349
computing resources wouldn’t be

0:06:13.349,0:06:15.039
impossible

0:06:15.039,0:06:20.829
but we can't go to you know just have every engineer
who wants one just call up Dell and say ship me ten racks

0:06:20.829,0:06:24.030
it's not going to work

0:06:24.030,0:06:25.580
and the other thing is that we can’t

0:06:25.580,0:06:28.360
we need to also provide quick turnaround

0:06:28.360,0:06:29.390
for some users

0:06:29.390,0:06:33.229
so we can't have one user hogging the system and
hogging it until they are done

0:06:33.229,0:06:34.720
because we have some users

0:06:34.720,0:06:37.099
and then the next one can run

0:06:37.099,0:06:40.949
because we have some users who'll
come in and say well I need to run

0:06:40.949,0:06:43.159
for three months

0:06:43.159,0:06:43.690
and

0:06:43.690,0:06:46.810
we've had users come in and literally run

0:06:46.810,0:06:49.740
pretty much using the entire system for three months

0:06:49.740,0:06:53.839
well so we've had to provide some ability for other
users to still get their work done

0:06:53.839,0:06:58.300
so we can't just… so we do have to have some sharing

0:06:58.300,0:07:00.619
however when you start to share any resource

0:07:00.619,0:07:01.610
like this

0:07:01.610,0:07:03.509
you start getting contention

0:07:03.509,0:07:06.300
users need the same thing at the same time

0:07:06.300,0:07:09.700
and so they fight back and forth for it and they
can't get what they want

0:07:09.700,0:07:11.639
so you have to balance them a bit

0:07:12.999,0:07:14.529
you know also

0:07:14.529,0:07:17.869
some jobs lie when they

0:07:17.869,0:07:20.870
request resources and they actually need
more than they ask for

0:07:20.870,0:07:23.279
which can cause problems

0:07:23.279,0:07:27.229
so we schedule them. We say you're going to fit
here fine and they run off and use

0:07:27.229,0:07:28.580
more than they said

0:07:28.580,0:07:31.000
and if we don't have a mechanism to constrain
them

0:07:31.000,0:07:32.389
we have problems.

0:07:32.389,0:07:34.270
Likewise

0:07:34.270,0:07:37.109
once these users start to contend

0:07:37.109,0:07:39.029
that doesn't just result in

0:07:39.029,0:07:40.439
the jobs taking,

0:07:40.439,0:07:43.360
taking longer in terms of wall clock time

0:07:43.360,0:07:44.659
because they are extremely slow

0:07:44.659,0:07:48.430
but there's overhead related to that contention;
they get swapped out due to pressure on

0:07:49.219,0:07:51.509
various systems

0:07:51.509,0:07:52.550
if you really

0:07:52.550,0:07:57.039
for instance run out of memory then you go into
swap and you end up wasting all your cycles

0:07:57.039,0:07:58.710
pulling junk in and out of disc

0:07:58.710,0:08:00.830
wasting your bandwidth on that

0:08:00.830,0:08:03.530
so there are

0:08:03.530,0:08:04.219
resource

0:08:04.219,0:08:08.139
there are resource costs to the contention not merely

0:08:08.139,0:08:11.979
a delay in returning results.

0:08:11.979,0:08:16.590
So now I'm going to switch gears and start talk… so I'm
going to talk a little bit about different

0:08:16.590,0:08:18.270
solutions to these


0:08:18.270,0:08:20.610
to the

0:08:20.610,0:08:22.339
these contention issues

0:08:23.710,0:08:27.840
and look at different ways of solving the
problem. Most of these are things that have

0:08:27.840,0:08:29.440
already been done

0:08:29.440,0:08:30.620
but I just want to talk about

0:08:30.620,0:08:32.990
the different ways and then

0:08:32.990,0:08:35.710
evaluate them in our context.

0:08:35.710,0:08:38.119
So a classic solution to the problem is

0:08:38.119,0:08:39.280
Gang Scheduling

0:08:39.280,0:08:44.139
It's basically conventional Unix process
context switching

0:08:44.139,0:08:46.560
written really big

0:08:46.560,0:08:50.339
you what you do is you have your parallel
job that’s running

0:08:50.339,0:08:51.390
on a system

0:08:51.390,0:08:52.839
and it runs for a while

0:08:52.839,0:08:57.920
and then after a certain amount of time you basically
shove it all; you kick it off of all the nodes

0:08:57.920,0:08:59.940
and let the next one come in

0:08:59.940,0:09:04.030
and typically when  people do this they do it on
on the order of hours because the context switch

0:09:04.030,0:09:09.270
time is extremely large is extremely high

0:09:09.270,0:09:10.639
for example

0:09:10.639,0:09:14.530
because it's not just like swapping a process
internet. You suddenly have to co-ordinate

0:09:14.530,0:09:17.470
the this context which across to all your processes

0:09:17.470,0:09:19.280
if you're running say

0:09:19.280,0:09:21.190
MPI over TCP

0:09:21.190,0:09:25.910
you actually need to tear down the TCP sessions
because you can't just have TCP timers sitting

0:09:25.910,0:09:26.570
around

0:09:26.570,0:09:28.260
or that sort of thing so

0:09:28.260,0:09:29.950
there there's a there's a lot of overhead

0:09:29.950,0:09:34.340
associated with this. You take a long context switch

0:09:34.340,0:09:36.820
if all of your infrastructure supports this

0:09:36.820,0:09:39.420
 it's fairly effective

0:09:39.420,0:09:43.300
and it does allow jobs to avoid interfering
with each other which is nice

0:09:43.300,0:09:46.100
so you can't you don't have issues

0:09:46.100,0:09:47.689
because you're typically allocating

0:09:47.689,0:09:50.950
whole swaps of the system

0:09:50.950,0:09:53.390
and for properly written applications

0:09:55.000,0:09:59.690
partial results can be returned which for some of
our users is really important where you're doing a

0:09:59.690,0:10:00.490
refinement

0:10:00.490,0:10:04.350
users would want to look at the results and
say okay

0:10:04.350,0:10:06.130
you know is this just going off into the weeds

0:10:06.130,0:10:10.860
or does it look like it's actually converging on
some sort of useful solution

0:10:10.860,0:10:13.980
as they don't want to just wait till the end.

0:10:13.980,0:10:19.270
Down side of course is that this context
switches costs are very high

0:10:19.270,0:10:22.460
and most importantly there's really a lack
of useful implementations

0:10:22.460,0:10:25.340
a number of platforms have implemented this in the past

0:10:25.340,0:10:29.840
but in practice on modern clusters which are
built on commodity hardware

0:10:29.840,0:10:32.340
with you know

0:10:32.340,0:10:35.530
communication libraries written on standard protocols

0:10:35.530,0:10:37.050
the tools just aren’t there

0:10:37.050,0:10:39.100
and so

0:10:39.100,0:10:40.860
it's not very practical.

0:10:40.860,0:10:44.010
Also it doesn't really make a lot of sense with small jobs

0:10:44.010,0:10:47.789
and one of the things that we found is we have users who have

0:10:47.789,0:10:50.860
embarrassingly parallel problems for they need to look at

0:10:50.860,0:10:53.450
you know twenty thousand studies

0:10:53.450,0:10:57.400
and they could write something that looked more like a
conventional parallel application where they

0:10:57.400,0:11:01.930
you know wrote a Scheduler and set up an MPI a Message Passing Interface

0:11:01.930,0:11:05.400
and handed out tasks to pieces of their job and then you
could do this

0:11:05.400,0:11:09.280
but then they would be running a Scheduler and they would
probably do a bad job of it turns out it's actually

0:11:09.280,0:11:10.820
fairly difficult to do right

0:11:10.820,0:11:13.740
even a trivial case

0:11:13.740,0:11:16.189
and so what they do instead is they just select twenty

0:11:16.189,0:11:18.730
twenty thousand jobs to grid engine and say okay

0:11:18.730,0:11:21.330
whatever I'll deal with it

0:11:21.330,0:11:23.140
earlier versions that might have been a problem

0:11:23.140,0:11:24.730
current versions of the code

0:11:24.730,0:11:27.060
 handle easily a million jobs that

0:11:27.060,0:11:29.370
so not really a big deal

0:11:29.370,0:11:31.610
but those sort of users wouldn't fit well

0:11:31.610,0:11:34.190
into the gang scheduled environment

0:11:34.190,0:11:35.690
at least not in a

0:11:35.690,0:11:39.149
conventional gang scheduled environment where
you do gang scheduling on the granularity of

0:11:39.149,0:11:40.940
jobs

0:11:40.940,0:11:44.140
so from that perspective it wouldn’t work very well.

0:11:44.140,0:11:48.380
If you have all the pieces in place and you are
doing a big parallel applications it is in fact

0:11:48.380,0:11:53.770
an extremely effective approach.

0:11:53.770,0:11:56.290
Another option which is sort of related

0:11:56.290,0:11:57.420
it's in fact

0:11:57.420,0:12:00.079
take taking an even courser granularity

0:12:00.079,0:12:04.360
is single application or single project
clusters or sub-clusters.

0:12:04.360,0:12:07.590
For instance this is used some national labs

0:12:07.590,0:12:11.910
where you're given a cycle allocation for a
year based on your grant proposals

0:12:11.910,0:12:14.779
and what your cycle allocation actually comes to you as is

0:12:14.779,0:12:16.580
here's your cluster

0:12:16.580,0:12:17.489
here's a frontend

0:12:17.489,0:12:19.840
here's this chunk of notes, they're yours, go to it.

0:12:19.840,0:12:21.930
Install your own OS, whatever you want

0:12:21.930,0:12:25.580
it's yours

0:12:25.580,0:12:30.310
and then and at a sort of finer scale there's things such as

0:12:30.310,0:12:31.800
you could use Emulab

0:12:31.800,0:12:36.300
which is the network emulation system but also does a OS install and configuration

0:12:36.300,0:12:39.300
so you could do dynamic allocation that way

0:12:39.300,0:12:40.540
Sun's

0:12:40.540,0:12:44.040
Project Hedeby now actually I think it's
called service domain manager

0:12:44.040,0:12:46.500
is the productised version

0:12:46.500,0:12:50.010
or some Clusters on Demand

0:12:50.010,0:12:54.450
they were actually talking about web hosting clusters but

0:12:54.450,0:12:57.780
things that allow rapid deployment unless you
do that a little

0:12:57.780,0:12:59.510
little

0:12:59.510,0:13:02.810
a more granular level than the

0:13:02.810,0:13:05.580
the allocate them once a year approach

0:13:05.580,0:13:07.720
nonetheless

0:13:07.720,0:13:11.220
let’s you give people whole clusters to work with

0:13:11.220,0:13:12.920
nice one nice thing about it is

0:13:12.920,0:13:15.450
the isolation between the processes

0:13:15.450,0:13:16.890
is complete


0:13:16.890,0:13:20.800
so you don’t have to worry about users stomping on each other.
It’s their own system, they can trash it all they

0:13:20.800,0:13:22.230
want

0:13:22.230,0:13:24.709
if they flood the network or they

0:13:24.709,0:13:26.180
run the nodes into swap

0:13:26.180,0:13:28.480
well that's their problem

0:13:28.480,0:13:32.120
but it also has the advantage that you can tailor the images

0:13:32.120,0:13:36.980
on the nodes of the operative systems to
meet the exact needs of the application

0:13:36.980,0:13:40.560
down side of course is its coarse granularity, in our environment that doesn't work

0:13:40.560,0:13:41.500
very well

0:13:41.500,0:13:46.800
 since we do have all of these all these different types of jobs

0:13:46.800,0:13:51.710
context switches are also pretty expensive. Certainly on the order of minutes

0:13:51.710,0:13:54.690
Emulab typically claim something like ten minutes

0:13:54.690,0:13:57.970
there are some systems out there

0:13:57.970,0:14:03.320
for instance if you use I think it’s Open Boot that
they're calling it today. It used to be 1xBIOS

0:14:03.320,0:14:06.790
where you can actually deploy a system in

0:14:06.790,0:14:08.700
tens of seconds

0:14:08.700,0:14:11.520
mostly by getting rid of all that junk the BIOS writers wrote

0:14:11.520,0:14:12.890
and

0:14:12.890,0:14:17.770
the OS boots pretty fast if you don’t have all
that stuff to waylay you,

0:14:17.770,0:14:19.940
but in practice on sort of

0:14:19.940,0:14:21.660
off the shelf hardware

0:14:21.660,0:14:24.400
the context switches times’ are quite high

0:14:24.400,0:14:26.930
users of course can interfere with themselves

0:14:26.930,0:14:29.200
you can argue it's not a problem but

0:14:29.200,0:14:31.660
ideally you would like to prevent
that

0:14:31.660,0:14:35.350
one of the things that I have to deal with
is that my users are

0:14:35.350,0:14:37.830
almost universally

0:14:37.830,0:14:40.410
not trained as computer scientists or programmers

0:14:40.410,0:14:42.550
you know they’re trained in their domain area

0:14:42.550,0:14:44.780
they're really good in that area

0:14:44.780,0:14:48.389
but their concepts of the way hardware works in the
way software works

0:14:48.389,0:14:55.389
don’t match reality in many cases

0:15:01.269,0:15:02.830
(inaudible question)
It’s pretty rare in practice

0:15:02.830,0:15:06.700
well I've heard one lab that does it significantly

0:15:06.700,0:15:09.839
but it's like they do it on sort of a yearly
allocation basis

0:15:09.839,0:15:12.790
and throw the hardware away after two or three years

0:15:12.790,0:15:15.999
and you do typically have some sort of the deployment

0:15:15.999,0:15:18.340
system in place

0:15:18.340,0:15:20.680
or in those types of cases actually

0:15:20.680,0:15:22.359
usually your application comes with

0:15:22.359,0:15:26.500
and here's what we're going to spend on this many people

0:15:26.500,0:15:27.730
on this project so this is

0:15:27.730,0:15:34.730
big resource allocation

0:15:36.000,0:15:39.780
And yeah I guess one other issue with this is there's no real easy

0:15:39.780,0:15:43.320
way to capture underutilized resources
for example

0:15:43.320,0:15:44.389
if you have

0:15:44.389,0:15:49.190
an application which you know say single-threaded
and uses a ton of memory

0:15:49.190,0:15:51.210
and is running on a machine

0:15:51.210,0:15:55.040
the machines we're buying these days are eight core so

0:15:55.040,0:16:00.040
that’s wasting a lot of CPU cycles you're just
generating a lot of heat doing nothing

0:16:00.040,0:16:03.890
so ideally you would like a scheduler that
said okay so you're using

0:16:03.890,0:16:08.040
using eight or seven of the eight Gigabytes of
RAM but we've got these jobs

0:16:08.040,0:16:10.080
sitting here that

0:16:10.080,0:16:11.560
need next to know need

0:16:11.560,0:16:15.910
a hundred megabytes so we slap seven of
those in along with the big job

0:16:15.910,0:16:18.580
and backfill and in this

0:16:18.580,0:16:19.600
mechanism there's no

0:16:19.600,0:16:21.810
there's no good way to do that

0:16:21.810,0:16:26.820
obviously if the users have that application
next they can do it themselves

0:16:26.820,0:16:30.510
but it's not something where we can easily
bring in

0:16:30.510,0:16:35.090
bring in more jobs and have a mix to
take advantage of the different

0:16:35.090,0:16:37.300
resources.

0:16:37.300,0:16:39.940
A related approach is to

0:16:39.940,0:16:43.950
to install virtualization software on the
equipment and this is this is

0:16:44.980,0:16:46.379
this is the essence of

0:16:46.379,0:16:49.800
what Cloud Computing is at the moment

0:16:49.800,0:16:53.520
it's Amazon providing Zen

0:16:53.520,0:16:55.129
Zen hosting for

0:16:55.129,0:16:56.769
relatively arbitrary

0:16:56.769,0:16:59.710
OS images

0:16:59.710,0:17:02.720
it does have the advantage that it allows rapid deployment

0:17:02.720,0:17:06.510
in theory if your application is scalable provides for

0:17:06.510,0:17:08.259
extremely high scalability

0:17:08.259,0:17:10.110
particularly if you

0:17:10.110,0:17:14.470
aren’t us and therefore can possibly use somebody else's hardware

0:17:14.470,0:17:16.520
in our application's case that’s

0:17:16.520,0:17:18.790
not very practical so

0:17:18.790,0:17:20.360
we can't do that

0:17:20.360,0:17:20.870
and

0:17:20.870,0:17:23.790
it also has the advantage that you can run

0:17:23.790,0:17:26.470
you can have people with their own image in there

0:17:26.470,0:17:30.000
which is tightly resource constrained but you
can run more than one of them on a node. So for instance

0:17:30.000,0:17:31.170
you can give

0:17:31.170,0:17:32.730
one job

0:17:32.730,0:17:35.489
four cores and another job two cores another

0:17:35.489,0:17:37.500
you know and have a couple single core

0:17:37.500,0:17:38.860
jobs in theory

0:17:38.860,0:17:43.340
you can get fairly strong isolation there
obviously there are shared resources underneath

0:17:43.340,0:17:44.710
and you

0:17:44.710,0:17:45.570
probably can't

0:17:45.570,0:17:48.370
afford to completely isolate say network bandwidth

0:17:48.370,0:17:49.520
at the bottom layer

0:17:49.520,0:17:51.580
you can do some but

0:17:51.580,0:17:56.170
if you go overboard you can spend all your time on accounting

0:17:56.170,0:17:58.830
you also can again

0:17:58.830,0:18:01.410
tailor the images to the job

0:18:01.410,0:18:05.030
and in this environment actually you can
do that even more strongly than that

0:18:05.030,0:18:07.030
the sub-cluster approach

0:18:07.030,0:18:09.860
in that you can often do run

0:18:09.860,0:18:16.360
a five-year-old operating system or ten-year-old
operating system if you're using full virtualization

0:18:16.360,0:18:19.030
and that can allow

0:18:19.030,0:18:23.820
allow obsolete code with weird baselines to work which is
important in our space because

0:18:23.820,0:18:27.390
the average program runs ten years or more

0:18:27.390,0:18:30.860
our average project runs ten years or more

0:18:30.860,0:18:32.530
and as a result

0:18:32.530,0:18:36.010
you might have to go rerun this program that was written

0:18:36.010,0:18:37.320
way back on

0:18:37.320,0:18:40.550
some ancient version of windows or whatever

0:18:40.550,0:18:41.890
it also does provide

0:18:41.890,0:18:43.840
the ability to recover resources

0:18:43.840,0:18:45.290
as I was talking about before

0:18:45.290,0:18:49.530
but you can't do easily with sub-clusters because you can’t just slip

0:18:49.530,0:18:50.360
another image

0:18:50.360,0:18:52.910
on the on there and say are you can use anything and

0:18:52.910,0:18:56.730
you know give that image idle priority essentially

0:18:56.730,0:19:00.480
down side of course is that it is in complete
isolation and that there is a shared

0:19:00.480,0:19:02.340
hardware

0:19:02.340,0:19:06.490
you're not likely to find I don't think
any the virtualization systems out there

0:19:06.490,0:19:08.890
right now

0:19:08.890,0:19:09.890
virtualize

0:19:09.890,0:19:11.470
your segment of

0:19:11.470,0:19:13.540
memory bandwidth

0:19:13.540,0:19:15.159
or your segment

0:19:15.159,0:19:16.390
of cache

0:19:16.390,0:19:18.390
of cache space

0:19:18.390,0:19:24.809
so users can’t in fact interfere with themselves and each other in this
environment

0:19:24.809,0:19:25.589
it's also

0:19:25.589,0:19:30.479
not really efficient for small jobs; the cost of running an
entire OS for every

0:19:30.479,0:19:33.020
job is fairly high

0:19:33.020,0:19:34.020
even with

0:19:34.020,0:19:34.710
relatively light

0:19:34.710,0:19:38.250
Unix like OSes is you're still looking

0:19:38.250,0:19:40.900
couple hundred megabytes in practice

0:19:40.900,0:19:46.240
once you get everything up and running unless you run something
totally stripped down

0:19:47.230,0:19:49.460
there’s significant overhead

0:19:49.460,0:19:52.240
there’s CPU slowdown typically in the

0:19:52.240,0:19:55.360
you know typical estimates are in the twenty
percent range

0:19:55.360,0:20:00.450
numbers really range from fifty percent to
five percent depending on what exactly you're doing

0:20:00.450,0:20:02.100
possibly even lower

0:20:02.100,0:20:04.830
or higher

0:20:04.830,0:20:05.870
and just

0:20:05.870,0:20:09.920
you know the overhead because you have the whole OS there's a lot of a lot

0:20:09.920,0:20:11.420
of duplicate

0:20:11.420,0:20:13.320
stuff

0:20:13.320,0:20:15.010
the various vendors

0:20:15.010,0:20:17.090
have their answers they claim you know we can

0:20:17.090,0:20:21.430
we can merge that and say oh you're running the same kernel so we'll keep your memory

0:20:21.430,0:20:24.120
we use the same memory but

0:20:24.120,0:20:25.220
at some level

0:20:25.220,0:20:29.309
it's all going to get duplicated.

0:20:29.309,0:20:30.590
A related option

0:20:30.590,0:20:34.820
comes from sort of the internet havesting
industry which is to use virtual private

0:20:34.820,0:20:38.130
which is the technology from virtual private servers

0:20:38.130,0:20:42.110
the example that everyone here is probably familiar with is Jails where

0:20:42.110,0:20:44.130
you can provide

0:20:44.130,0:20:46.720
your own file system root

0:20:46.720,0:20:49.060
your own network interface

0:20:49.060,0:20:50.620
and what not

0:20:50.620,0:20:51.500
and

0:20:51.500,0:20:53.129
the nice thing about this is

0:20:53.129,0:20:56.210
that unlike full virtualization

0:20:56.210,0:20:58.680
the overhead is very small

0:20:58.680,0:21:01.030
basically it costs you


0:21:01.030,0:21:02.820
an entry in your process table

0:21:02.820,0:21:05.570
or an entry in few structures

0:21:05.570,0:21:08.760
there's some extra tests in their kernel but otherwise

0:21:10.220,0:21:14.900
there's not a huge overhead for virtualization you don't need
an extra kernel for every

0:21:14.900,0:21:15.460
image

0:21:15.460,0:21:18.390
so you get the difference here
between

0:21:18.390,0:21:21.620
be able to run maybe

0:21:21.620,0:21:25.250
you might be able to squeeze two hundred VMWare images onto a machine

0:21:25.250,0:21:29.620
VMWare people say no no don't do that but we have machines that are running

0:21:29.620,0:21:30.509
nearly that many.

0:21:34.790,0:21:38.289
On the other hand there are people out there who run thousands of

0:21:38.289,0:21:40.730
virtual hosts

0:21:40.730,0:21:43.170
using this technique on a single machine so

0:21:43.170,0:21:45.200
big difference in resource use

0:21:45.200,0:21:46.400
on especially with light

0:21:46.400,0:21:48.070
in the lightly loaded use

0:21:48.070,0:21:52.400
in our environment we're looking more running a very small number of them but still

0:21:52.400,0:21:55.880
that overhead is significant

0:21:55.880,0:21:59.440
you still do have some ability to tailor the

0:21:59.440,0:22:01.670
images to a job’s needs

0:22:01.670,0:22:03.309
you could have a

0:22:03.309,0:22:05.400
custom root that for instance you could be running

0:22:05.400,0:22:07.380
FreeBSD 6.0 in one

0:22:07.380,0:22:08.650
in one

0:22:08.650,0:22:11.040
virtual server and 7.0 in another

0:22:11.040,0:22:15.090
you have to be running of course 7.0 kernel or 8.0 kernel to make
that work

0:22:15.090,0:22:16.330
but it allows you to do that

0:22:16.330,0:22:18.500
we also in principle can do

0:22:18.500,0:22:23.080
evil things like our 64-bit kernel and then 32-bit
user spaces because

0:22:23.080,0:22:26.400
say you have applications that you can't find the source to anymore

0:22:26.400,0:22:31.830
or libraries you don't
have the source to any more

0:22:31.830,0:22:32.990
an answer

0:22:32.990,0:22:34.150
interesting things there

0:22:34.150,0:22:36.680
and the other nice thing is since you're

0:22:36.680,0:22:39.629
you're doing a very lightweight and incomplete
virtualization

0:22:39.629,0:22:43.269
you don't have to virtualize things you don't
care about so you don’t have the overhead  of

0:22:43.269,0:22:45.520
virtualizing everything.

0:22:45.520,0:22:48.070
Downsides of course are incomplete isolation

0:22:48.070,0:22:50.690
you are running processes that on the same kernel

0:22:50.690,0:22:52.770
and they can interfere with each other

0:22:52.770,0:22:55.320
and there's dubious flexibility obviously

0:22:55.320,0:22:57.900
I don't think anyone

0:22:57.900,0:23:01.850
should have the ability to run Windows in a jail.

0:23:01.850,0:23:02.860
There’s some

0:23:02.860,0:23:04.960
Net BSD support but

0:23:04.960,0:23:10.510
and I don’t think it's really gotten to that point.

0:23:10.510,0:23:12.420
One final area

0:23:12.420,0:23:14.350
that sort of diverges from this

0:23:14.350,0:23:16.159
is the classic

0:23:16.159,0:23:18.400
Unix solution to the problem

0:23:18.400,0:23:20.580
on this on single

0:23:20.580,0:23:22.070
in a single machine

0:23:22.070,0:23:22.800
which is

0:23:22.800,0:23:28.950
to use existing resource limits and resource partitioning techniques

0:23:28.950,0:23:33.430
you know for example all Unix like our Unix systems have to process
resource limits

0:23:33.430,0:23:36.240
a resource and typically

0:23:36.240,0:23:36.999
scheduler a


0:23:38.340,0:23:41.510
cluster schedulers support the common ones

0:23:41.510,0:23:43.150
so you can set a

0:23:43.150,0:23:47.230
memory limit on your process or a CPU time limit on your process

0:23:47.230,0:23:49.830
and the schedulers typically provide

0:23:49.830,0:23:51.350
at least

0:23:51.350,0:23:54.740
launch support for

0:23:54.740,0:23:56.850
the limits on

0:23:56.850,0:24:01.900
a given set of process, that’s part of the job

0:24:01.900,0:24:02.850
also the most

0:24:02.850,0:24:05.640
you know there are a number of forms of resource
partitioning that

0:24:05.640,0:24:07.170
are available

0:24:08.100,0:24:09.700
as a standard feature

0:24:09.700,0:24:12.000
on so memory discs are one of them so

0:24:12.000,0:24:16.800
if you want to create a file system space that’s
limited in size, create a memory disc

0:24:16.800,0:24:17.969
and back it

0:24:17.969,0:24:21.130
and back it with a NMAP file

0:24:21.130,0:24:22.520
or swap

0:24:22.520,0:24:24.570
of partitioning

0:24:24.570,0:24:26.330
disc use

0:24:26.330,0:24:30.330
and then there are techniques like CPU affinities that you can walk
processes to it

0:24:30.330,0:24:32.010
a single process

0:24:32.010,0:24:34.540
processor or a set of processors

0:24:34.540,0:24:39.310
and so they can't interfere with each other
with processes running on other processors


0:24:39.310,0:24:44.280
the nice thing about this first is that you're using existing
facilities so you don’t have to rewrite

0:24:44.280,0:24:46.170
lots of new features

0:24:46.170,0:24:49.590
for a niche application

0:24:49.590,0:24:52.790
and they tend to integrate well with existing schedulers
in many cases

0:24:52.790,0:24:55.940
parts of them are already implemented

0:24:55.940,0:24:59.650
and in fact the experiments that I'll talk about later are all using
this type of

0:24:59.650,0:25:02.160
technique.

0:25:02.160,0:25:02.830
Cons are of course

0:25:02.830,0:25:04.850
incomplete isolation again

0:25:04.850,0:25:08.270
and there’s typically no unified framework

0:25:08.270,0:25:12.310
for the concept of a job when a job is composed of the center processes

0:25:12.310,0:25:16.710
yeah there are a number of data structures within the kernel for
instance the session

0:25:16.710,0:25:18.120
which

0:25:18.120,0:25:19.499
sort of aggregate processes

0:25:19.499,0:25:20.990
but there isn’t one

0:25:22.230,0:25:24.800
in BSD or Linux at this point

0:25:24.800,0:25:29.020
which allows you to place resource limits on those in a way that you can a process

0:25:29.020,0:25:32.520
IREX did have support like that

0:25:32.520,0:25:34.160
where they have a job ID

0:25:34.160,0:25:36.210
and there could be a job limit

0:25:36.210,0:25:38.280
and selected projects

0:25:38.280,0:25:41.320
are sort of similar but not quite the same

0:25:41.320,0:25:43.149
processes or part of a project but

0:25:43.149,0:25:46.770
it's not quite the same inherited relationship

0:25:47.720,0:25:49.500
and typically

0:25:49.500,0:25:50.900
there aren’t

0:25:50.900,0:25:55.390
limits on things like bandwidth. There was

0:25:55.390,0:25:56.430
a sort of a

0:25:56.430,0:25:58.350
bandwidth limiting

0:25:58.350,0:26:00.630
nice type interface

0:26:00.630,0:26:01.950
on that I saw

0:26:01.950,0:26:03.720
posted as a research project

0:26:03.720,0:26:07.150
many years ago I think in the 2.x days

0:26:07.150,0:26:09.880
where you could say this process can have

0:26:09.880,0:26:11.580
you know five megabits

0:26:11.580,0:26:12.530
or whatever

0:26:12.530,0:26:14.380
but I haven't really seen anything take off

0:26:14.380,0:26:16.940
that would be a pretty neat thing to have

0:26:16.940,0:26:19.309
actually one other exception there

0:26:19.309,0:26:22.230
is on IREX again

0:26:22.230,0:26:28.210
the XFS file system supported guaranteed data rates on file handles
you could say

0:26:28.210,0:26:30.140
you could open a file and say I need

0:26:30.140,0:26:32.940
ten megabits read or ten megabits write

0:26:32.940,0:26:34.029
or whatever and it would say

0:26:34.029,0:26:35.529
okay or no

0:26:35.529,0:26:39.279
and then you could read and write and
it would do evil things at the file system layer

0:26:39.279,0:26:40.600
in some cases

0:26:40.600,0:26:43.940
all to ensure that you could get that streaming data rate

0:26:44.900,0:26:49.710
by keeping the file.


0:26:49.710,0:26:53.620
So now I’m going to talk about what we've done

0:26:53.620,0:26:59.510
what we needed was a solution to handle
a wide range of job types

0:26:59.510,0:27:01.570
So of the options we looked at for instance

0:27:01.570,0:27:04.990
single application clusters or
project clusters

0:27:04.990,0:27:11.990
I think that the isolation they
provide is essentially unparalleled

0:27:12.590,0:27:16.630
and in our environment we probably have to
virtualize in order to be

0:27:16.630,0:27:18.179
efficient in terms of

0:27:18.179,0:27:22.060
being able to handle our job mix and what not and handle
the fact that our users

0:27:22.060,0:27:23.740
tend to have

0:27:23.740,0:27:27.730
spikes in their use

0:27:27.730,0:27:32.799
on a large scale so for instance we get GPS we’ll show up and say
we need to run for a month

0:27:32.799,0:27:33.780
on and then

0:27:33.780,0:27:38.460
some indeterminate number of months later
they'll do it again

0:27:38.460,0:27:40.840
for that sort of quick

0:27:40.840,0:27:41.480
demands

0:27:42.240,0:27:44.850
we really need the virtuals something
virtualized

0:27:44.850,0:27:47.120
and then we have to pay the price of

0:27:47.120,0:27:48.380
of the overhead

0:27:48.380,0:27:51.590
and again it doesn't handle small jobs well and that is a

0:27:51.590,0:27:54.050
large portion of our job mix so

0:27:54.050,0:27:55.180
of the

0:27:55.180,0:27:58.070
quarter million or something jobs we’ve run

0:27:58.070,0:27:59.700
on our cluster

0:27:59.700,0:28:02.490
I would guess that

0:28:02.490,0:28:04.730
more than half of those were submitted

0:28:04.730,0:28:05.890
in

0:28:05.890,0:28:09.660
batches of more than ten thousand

0:28:09.660,0:28:11.400
so they'll just pop up

0:28:11.400,0:28:14.030
the other method to have looked at

0:28:14.800,0:28:16.750
are using resource limits

0:28:16.750,0:28:19.060
the nice thing of course is they're achievable
with

0:28:19.060,0:28:21.429
they achieve useful isolation

0:28:21.429,0:28:26.289
and they’re implementable with either existing functionality or small
extensions so that's what we’ve

0:28:26.289,0:28:27.230
concentrating on.

0:28:27.230,0:28:29.740
We’ve also been doing some thinking about

0:28:29.740,0:28:31.809
could we use the techniques there

0:28:31.809,0:28:33.940
and combine them with jails

0:28:33.940,0:28:36.170
or related features

0:28:36.170,0:28:40.019
it may be bulking up jails to be more like zones in Solaris

0:28:40.019,0:28:44.150
or containers I think they're calling them this
week

0:28:44.150,0:28:44.840
and

0:28:44.840,0:28:46.770
so we're looking at that as well

0:28:46.770,0:28:50.840
to be able to provide


0:28:50.840,0:28:54.250
to be able to provide pretty user operating environments

0:28:54.250,0:28:59.200
potentially isolating users from upgrades so for instance as we upgrade the kernel

0:28:59.200,0:29:03.469
and users can continue using it all the
images they don't have time to rebuild their

0:29:03.469,0:29:04.330
application in

0:29:04.330,0:29:09.970
and handle the updates in libraries and what not

0:29:09.970,0:29:13.840
they also have the potential to provide strong isolation for security
purposes

0:29:13.840,0:29:18.740
which could be useful in the future.

0:29:18.740,0:29:20.159
We do think that

0:29:20.159,0:29:24.040
of these mechanisms the nice thing is that
resource limit

0:29:24.040,0:29:26.150
the resource limits and partitioning scheme

0:29:26.150,0:29:29.860
as well as virtual private service are very
similar implementation requirements

0:29:29.860,0:29:33.090
set up a fair bit more expensive

0:29:33.090,0:29:34.620
in the VPS case

0:29:34.620,0:29:38.780
but nonetheless they're fairly similar.

0:29:38.780,0:29:42.610
So, what we've been doing is we've taken the Sun Grid Engine

0:29:42.610,0:29:46.880
and we were originally intended to actually
extend Sun Grid Engine and modify its daemons

0:29:46.880,0:29:48.480
to do the work

0:29:48.480,0:29:51.150
on what we ended up doing instead is realize
that well

0:29:51.150,0:29:54.910
we can actually specify an alternate program
to run instead of the shepherd

0:29:54.910,0:29:57.990
The shepherd is the process

0:29:57.990,0:30:00.580
that starts all

0:30:00.580,0:30:02.250
starts the script that

0:30:02.250,0:30:03.380
can for each job

0:30:03.380,0:30:04.920
on a given node

0:30:04.920,0:30:08.559
it collects usage and forwards signals to the
children

0:30:08.559,0:30:12.620
and also is responsible for starting remote
components

0:30:12.620,0:30:14.560
so a shepherd is started and then

0:30:14.560,0:30:17.640
traditionally in Sun grid engine it starts out

0:30:17.640,0:30:19.910
its own RShell Daemon

0:30:19.910,0:30:20.800
and

0:30:20.800,0:30:22.010
jobs connect over

0:30:22.010,0:30:23.670
these days that for their own

0:30:23.670,0:30:25.870
mechanism which is

0:30:25.870,0:30:26.950
secure

0:30:26.950,0:30:28.000
not using the

0:30:28.840,0:30:30.530
crafty old RShell code.

0:30:35.370,0:30:37.970
So what we've done is we've implemented a wrapper script

0:30:37.970,0:30:40.139
which allows a pre-command hook

0:30:40.139,0:30:42.559
to run before the shepherd starts

0:30:42.559,0:30:47.170
the command wrapper so before we start shepherd we can run like the N program

0:30:47.170,0:30:49.150
or we can run

0:30:49.150,0:30:50.430
TRUE to whatever

0:30:50.430,0:30:54.040
to set up the environment that it runs in or CPU

0:30:54.040,0:30:56.600
setters I’ll show later

0:30:56.600,0:30:58.750
and a post command hook for cleanup

0:30:58.750,0:31:03.940
it's implemented in Ruby because I felt like it.

0:31:03.940,0:31:07.830
The first thing we implemented was memory backed temporary directories. The motivation for

0:31:07.830,0:31:08.700
this

0:31:08.700,0:31:09.640
is that

0:31:09.640,0:31:12.180
we've had problems for users will you know

0:31:12.180,0:31:15.510
run slash temp out on the nodes

0:31:15.510,0:31:19.059
where we have the nodes configured is that they do have discs

0:31:19.059,0:31:22.960
and most of the disc is available as slash temp

0:31:22.960,0:31:25.049
we had some cases

0:31:25.049,0:31:27.840
particularly early on where users would fill up the discs and not delete it

0:31:27.840,0:31:32.300
their job would crash or they would forget to add clean up code or whatever

0:31:32.300,0:31:35.100
and then other jobs would fail strangely

0:31:35.100,0:31:39.029
you might expect that you just get a nice error message

0:31:39.029,0:31:42.040
programmers being programmers

0:31:42.040,0:31:42.909
people would not do their

0:31:42.909,0:31:44.630
error handling correctly.

0:31:44.630,0:31:47.380
A number of libraries do have issues like for instance

0:31:47.380,0:31:49.600
the PVM library

0:31:49.600,0:31:52.600
unexpectedly fails and reports a completely strange error

0:31:52.600,0:31:54.759
if it can't create a file in temp

0:31:54.759,0:32:01.669
because it needs to create a UNIX domain socket
so it can talk to itself.

0:32:01.669,0:32:03.360
So, what we’ve done here

0:32:03.360,0:32:08.059
is it turns out that Sun Grid Engine actually creates a temporary
directory often the

0:32:08.059,0:32:11.730
typically /TEMP but you can change
that

0:32:11.730,0:32:14.490
and points temp dir to that

0:32:14.490,0:32:15.370
location

0:32:15.370,0:32:17.499
we've educated most of all users now

0:32:17.499,0:32:21.360
to use that location correctly
so they’ll use that variable

0:32:21.360,0:32:23.279
they treat their files under temp dir

0:32:23.279,0:32:24.950
and then when the job exits

0:32:24.950,0:32:26.569
the Grid Engine deletes the directory

0:32:26.569,0:32:28.510
and that all gets cleaned up

0:32:28.510,0:32:32.720
the problem of course being that of multiple
are also running on the same node at the same time

0:32:32.720,0:32:35.290
one of them could still fill temp

0:32:35.290,0:32:38.759
so the solution was pretty simple
we created a

0:32:38.759,0:32:41.420
wrapper script at the beginning of the job

0:32:41.420,0:32:42.760
creates a

0:32:42.760,0:32:43.940
a

0:32:43.940,0:32:47.260
memory file to swap back to MD file system

0:32:47.260,0:32:50.790
of a user requestable size with the default

0:32:50.790,0:32:53.310
and

0:32:53.310,0:32:56.520
this has a number of advantages the biggest one of course is that

0:32:56.520,0:32:58.320
it's fixed size so we get

0:32:58.320,0:32:59.449
you know

0:32:59.449,0:33:01.000
the user gets

0:33:01.000,0:33:03.420
what they asked for

0:33:03.420,0:33:05.930
and once they run of space, they run out of space well

0:33:05.930,0:33:09.300
and too bad they ran out of space

0:33:09.300,0:33:12.760
they should have asked for more

0:33:12.760,0:33:16.350
the other

0:33:16.350,0:33:18.770
the other advantage is the side-effect that

0:33:18.770,0:33:21.619
now that we're running swap back memory files systems for temp

0:33:21.619,0:33:24.560
the users who only use a fairly small amount of temp

0:33:24.560,0:33:28.190
should see vastly improved performance
because they're running in memory

0:33:28.190,0:33:32.980
rather than writing to disc

0:33:32.980,0:33:34.690
quick example

0:33:34.690,0:33:38.270
we've a little job script here

0:33:38.270,0:33:39.830
prints temp dir and

0:33:39.830,0:33:41.950
prints the

0:33:41.950,0:33:43.080
amount of space

0:33:43.080,0:33:46.210
we submit our job request saying that we want

0:33:46.210,0:33:51.539
this is what we want hundred megabytes of
temp space

0:33:51.539,0:33:53.580
the same that's why if this

0:33:53.580,0:33:55.230
so the program doesn't

0:33:55.230,0:33:57.620
so the program ends at the end of it

0:33:57.620,0:33:58.709
for doing it

0:33:58.709,0:34:00.510
here's a live demo

0:34:00.510,0:34:01.840
all and then

0:34:01.840,0:34:03.389
you look at the output

0:34:03.389,0:34:04.280
you can see it

0:34:04.280,0:34:07.549
does in fact it creates a memory file system

0:34:07.549,0:34:10.449
I attempted to do great code

0:34:10.449,0:34:13.409
having a variable space

0:34:13.409,0:34:15.839
that is roughly what the user asked for

0:34:15.839,0:34:17.089
the version that I had

0:34:17.089,0:34:20.739
when I was attempting this was not entirely
accurate

0:34:20.739,0:34:24.710
trying to guess what all the
UFS overhead would be

0:34:24.710,0:34:25.889
as the result was

0:34:25.889,0:34:28.399
not quite consistent

0:34:30.790,0:34:33.899
I couldn't figure out easy function so

0:34:33.899,0:34:39.589
it does a better job than it did to start with, it’s not perfect

0:34:39.589,0:34:40.600
sometimes however

0:34:40.600,0:34:42.329
today that that's a good fix

0:34:42.329,0:34:43.550
we're coming to

0:34:43.550,0:34:45.359
Deploy it pretty soon

0:34:45.359,0:34:47.159
it works pretty easily

0:34:47.159,0:34:48.570
well sometimes it's not enough

0:34:48.570,0:34:51.390
the biggest issue is that they were badly designed programs all

0:34:51.390,0:34:52.720
all over the world

0:34:52.720,0:34:54.919
don't use temp dir like they're supposed to

0:34:54.919,0:34:59.319
in fact

0:35:10.099,0:35:12.759
(inaudible question)
so there are all these applications

0:35:12.759,0:35:17.979
there are all these applications still that need
temp say during start up

0:35:17.979,0:35:19.230
that sort of thing

0:35:19.230,0:35:20.809
so

0:35:20.809,0:35:22.599
all

0:35:22.599,0:35:25.829
so we have problems with these

0:35:25.829,0:35:26.290
realistically

0:35:26.290,0:35:27.799
we can’t change all of them

0:35:27.799,0:35:30.019
it's just not going to happen

0:35:30.019,0:35:31.950
so we still have problems with people

0:35:31.950,0:35:34.509
running out of resources

0:35:34.509,0:35:35.819
so we probably

0:35:35.819,0:35:37.489
feel that


0:35:37.489,0:35:41.240
the most general solution is to write a per job slash temp

0:35:41.240,0:35:44.880
and virtualize that portion of the files system
in memory space

0:35:44.880,0:35:47.119
and variate symlinks can do that

0:35:47.119,0:35:52.539
and so we said okay let's give it a shot

0:35:52.539,0:35:56.969
just to introduce the concept of variate symlinks for people who aren’t familiar with them

0:35:56.969,0:36:00.280
variate symlinks are basically symlinks that
contain variables

0:36:00.280,0:36:02.389
which are expanded at run time

0:36:02.389,0:36:05.549
it allows paths to be different for different
processes

0:36:05.549,0:36:06.969
for example

0:36:06.969,0:36:08.689
you create some files

0:36:08.689,0:36:10.069
you create

0:36:10.069,0:36:12.459
a symlink whose contents are

0:36:12.459,0:36:18.329
this variable which has the default shell value

0:36:18.329,0:36:18.990
and you

0:36:18.990,0:36:24.949
get different results with different
variable sets.

0:36:24.949,0:36:27.170
So, to talk about the implementation we’ve done,

0:36:27.170,0:36:32.389
it's derived from direct implementation, most of
the data structures are identical

0:36:32.389,0:36:33.869
however, I’ve made a number of changes

0:36:33.869,0:36:39.649
the biggest one is that we took the concept
of scopes and we turned them entirely around

0:36:40.409,0:36:45.329
in there is a system scope which
is over overridden by a user scope and by a

0:36:45.329,0:36:47.259
process scope

0:36:49.819,0:36:53.449
problem with that is if you

0:36:53.449,0:36:56.099
only think about say the systems scope

0:36:56.099,0:36:57.079
and

0:36:57.079,0:36:59.459
you decide you want to do something clever like have

0:36:59.459,0:37:02.219
a root file system which

0:37:02.219,0:37:06.109
where slash lib points to different things
for different

0:37:06.109,0:37:08.249
different architectures

0:37:08.249,0:37:11.849
well, works quite nicely until the users come along
and

0:37:11.849,0:37:14.189
set their arch variable

0:37:14.189,0:37:15.629
up for you

0:37:15.629,0:37:18.900
if you have say a Set UID program and you don't
defensively

0:37:18.900,0:37:22.319
and you don't implement correctly

0:37:22.319,0:37:24.900
the obvious bad things happen. Obviously you would

0:37:24.900,0:37:28.599
write your code to not do that I believe they
did, but

0:37:28.599,0:37:31.700
there's a whole class of problems where

0:37:31.700,0:37:33.449
it's easy to screw up

0:37:33.449,0:37:36.219
add and do something wrong there

0:37:36.219,0:37:37.270
so by

0:37:37.270,0:37:38.509
reversing the order

0:37:38.509,0:37:41.849
we can reduce the risks

0:37:41.849,0:37:43.329
at the moment we don't

0:37:43.329,0:37:44.309
have a user scope

0:37:44.309,0:37:47.530
I just don't like the idea of the users scope
to be honest

0:37:47.530,0:37:50.900
problem being that then you have to have
per user state in kernel

0:37:50.900,0:37:55.509
that just sort of sits around forever
you can never garbage collect it except the

0:37:55.509,0:37:57.059
Administrator way

0:37:57.059,0:37:59.489
just doesn't seem like a great idea to me

0:37:59.489,0:38:00.700
And jail scope

0:38:00.700,0:38:04.609
just hasn't been implemented

0:38:04.609,0:38:09.809
because it wasn't entirely clear what the semantics should be

0:38:11.010,0:38:14.719
I also added default variable support variable
also shell style

0:38:14.719,0:38:16.999
variable support

0:38:16.999,0:38:19.169
to some extent undoes the scope

0:38:19.169,0:38:20.870
the scope change

0:38:20.870,0:38:21.779
in that

0:38:21.779,0:38:24.749
the default variable becomes a system scope

0:38:24.749,0:38:26.540
which is overridden by everything

0:38:26.540,0:38:30.890
but there are cases where we need to do that
in particular who wants implement their

0:38:30.890,0:38:33.380
slashed temp which varies

0:38:33.380,0:38:36.240
we have to do something like this because temp needs to work

0:38:37.209,0:38:42.059
if we don't have the job values set

0:38:42.059,0:38:45.829
I also decided to use

0:38:45.829,0:38:49.839
percent instead of dollar sign to avoid
confusion with shell variables because these

0:38:49.839,0:38:50.379
are

0:38:50.379,0:38:52.620
a separate namespace in the kernel

0:38:52.620,0:38:56.669
we can't do it to main OS and do all the evaluation in the
user space

0:38:56.669,0:38:59.269
it's classic vulnerability

0:38:59.269,0:39:02.739
in the CVE database for instance

0:39:02.739,0:39:08.109
and we’re not using @ and avoid confusion
with AFS

0:39:08.109,0:39:09.819
or the Net BSD implementation

0:39:09.819,0:39:11.019
which does not allow

0:39:11.019,0:39:14.879
user or administratively settable values

0:39:14.879,0:39:17.019
that support

0:39:17.019,0:39:20.359
I don't have any automated variables such
as

0:39:20.359,0:39:25.789
the percent sys value which is universally
set in the Net BDS implementation

0:39:25.789,0:39:26.750
or

0:39:28.039,0:39:32.579
a UID variable which they also have
0:39:32.579,0:39:34.909
and currently and it allows

0:39:34.909,0:39:40.880
setting of values in other processes,
you can only set them in your own and inherit it

0:39:40.880,0:39:42.699
that may change but

0:39:42.699,0:39:47.339
one of my goals here is because they were
subtle ways to make dumb mistakes and

0:39:47.339,0:39:48.930
cause security vulnerabilities

0:39:48.930,0:39:52.479
I've attempted to slim the feature set
down to the point where you

0:39:52.479,0:39:54.909
have some reasonable chance of not

0:39:54.909,0:39:56.339
doing that

0:39:56.339,0:40:03.339
if you start building systems on them for deployment.

0:40:04.419,0:40:06.909
The final area that we've worked on

0:40:06.909,0:40:09.499
is moving away from the final system space

0:40:09.499,0:40:12.559
and into CPU sets

0:40:12.559,0:40:16.379
Jeff Roberts implemented a program

0:40:16.379,0:40:20.699
implemented a CPU set functionality which
allows you to

0:40:20.699,0:40:23.489
create… put a process into a CPU set

0:40:23.489,0:40:24.879
and then set the affinity of that

0:40:24.879,0:40:26.269
CPU set

0:40:26.269,0:40:29.189
by default every process has an anonymous

0:40:29.189,0:40:33.059
CPU set that was stuffed into
one that was created by this

0:40:33.059,0:40:37.269
in a parent

0:40:37.269,0:40:38.619
so for a little background here

0:40:38.619,0:40:40.740
in a typical SGE configuration

0:40:40.740,0:40:42.769
every node has one slot

0:40:42.769,0:40:44.429
per CPU

0:40:44.429,0:40:48.639
There are a number of other ways you
can configure it, basically a slot is something

0:40:48.639,0:40:50.019
a job can run in

0:40:50.019,0:40:56.719
and a parallel job crosses slots
and can be in more than one slot

0:40:56.719,0:41:01.359
for instance in many applications where
code tends to spend a fair bit of time

0:41:01.359,0:41:02.380
waiting for IO

0:41:02.380,0:41:06.209
you are looking at more than one slot per CPU so two slots per

0:41:06.209,0:41:08.089
core is not uncommon

0:41:08.089,0:41:10.869
but probably the most common configuration
and the one that

0:41:10.869,0:41:13.719
you get out of the box is you just install a Grid Engine

0:41:13.719,0:41:16.739
is one slot for each CPU

0:41:16.739,0:41:19.830
and that's how that's how we run because we
want users to have

0:41:19.830,0:41:23.699
that whole CPU for whatever they want to do with
it

0:41:23.699,0:41:26.130
so jobs are allocated one or more slots

0:41:26.130,0:41:27.599
if they're

0:41:27.599,0:41:33.189
depending on whether they're sequential or parallel jobs
and how many they ask for

0:41:33.189,0:41:37.239
but this is just a convention
there's no actual connection between slots

0:41:37.239,0:41:39.119
and CPUs

0:41:39.119,0:41:40.829
so it's quite possible to

0:41:40.829,0:41:42.819
submit a non-parallel job

0:41:42.819,0:41:45.019
that goes off and spawns a zillion threads

0:41:45.019,0:41:48.369
and sucks up all the CPUs on the whole system

0:41:48.369,0:41:50.800
in some early versions of grid engine

0:41:50.800,0:41:53.569
there actually was

0:41:53.569,0:41:55.729
support for tying slots

0:41:55.729,0:41:58.669
to CPUs if you set it up that
way

0:41:58.669,0:42:02.979
there is a sensible implementation for IREX
and then things got weirder and weirder is

0:42:02.979,0:42:06.010
people tried to implement it on other platforms
which had

0:42:06.010,0:42:07.030
vastly different

0:42:07.030,0:42:09.839
CPU binding semantics

0:42:09.839,0:42:12.359
and at this point it’s entirely broken

0:42:12.359,0:42:14.959
on every platform as far as I can tell

0:42:14.959,0:42:18.759
so we decided okay we've got this wrapper
let's see what we can do

0:42:18.759,0:42:21.009
in terms of making things work.

0:42:21.659,0:42:27.119
We now have the wrapper store allocations in the final system

0:42:27.119,0:42:31.239
we have a not yet recursive allocation algorithm

0:42:31.239,0:42:33.369
well we try to do is

0:42:33.369,0:42:34.690
find the best fit

0:42:34.690,0:42:35.779
fitting set of

0:42:35.779,0:42:39.539
adjacent cores

0:42:39.539,0:42:42.329
and then if that doesn't work we take the largest
to repeat

0:42:43.519,0:42:45.180
and until we fix

0:42:45.180,0:42:47.300
or until we've got enough slots

0:42:47.300,0:42:50.800
the goal is to minimize new fragments we haven't
done any analysis

0:42:50.800,0:42:52.269
to determine whether that's actually

0:42:52.269,0:42:55.179
an appropriate algorithm

0:42:55.179,0:42:56.289
but off hand it seems

0:42:56.289,0:43:00.519
fine given I’ve thought about it over lunch.

0:43:00.519,0:43:02.810
Should 40’s lay down their OSes

0:43:02.810,0:43:09.649
turns out that FreeBSD, CPU setting, API
and the Linux one

0:43:09.649,0:43:12.519
differ only in the very small details

0:43:12.519,0:43:13.599
They’re

0:43:13.599,0:43:15.479
essentially exactly

0:43:15.479,0:43:17.569
identical which is

0:43:17.569,0:43:20.489
convenient semantically,
so converting between then is pretty straight forward

0:43:20.489,0:43:24.869
so converting between then is pretty straight forward,
so I did a set of benchmarks

0:43:24.869,0:43:27.019
to demonstrate the

0:43:28.089,0:43:29.359
effectiveness of CPU set,
they also happen to demonstrate the wrapper

0:43:29.359,0:43:33.319
but don’t really have any relevance

0:43:33.319,0:43:35.229
used a little eight core Intel Xeon box

0:43:38.289,0:43:40.749
7.1 pre-release that had

0:43:40.749,0:43:43.239
John Bjorkman backported

0:43:43.239,0:43:46.640
CPU set

0:43:46.640,0:43:49.039
from 8.0 shortly before release

0:43:49.039,0:43:53.450
well not so shortly, it's supposed to be shortly
before

0:43:53.450,0:43:55.579
and the SG 6.2

0:43:55.579,0:43:59.739
we used the simple integer benchmarks

0:43:59.739,0:44:02.519
end Queens program were tested

0:44:02.519,0:44:03.349
for instance an 8 x 8 board

0:44:03.349,0:44:05.360
placed

0:44:05.360,0:44:08.069
the 8 queens so they can’t capture each other

0:44:08.069,0:44:09.289
on the board

0:44:11.039,0:44:13.680
so it's a simple load benchmark

0:44:13.680,0:44:18.800
that we ran a small version of the problem
as our measure command to generate

0:44:19.599,0:44:24.439
load we ran a larger version that we ran for much longer

0:44:24.439,0:44:28.149
some results

0:44:28.149,0:44:30.129
so for baseline,

0:44:30.129,0:44:33.170
the most interesting thing is to do
a baseline run

0:44:33.170,0:44:34.279
you see this

0:44:34.279,0:44:36.410
some variance it's not really very high

0:44:36.410,0:44:38.979
not surprising it doesn't really do anything

0:44:38.979,0:44:40.979
except suck CPU see here

0:44:40.979,0:44:41.729
Really not much

0:44:41.729,0:44:45.229
going on

0:44:45.229,0:44:50.029
in this case we’ve got seven
load processes and a single

0:44:50.029,0:44:52.789
a single test process running

0:44:52.789,0:44:55.160
we see things slow down slightly

0:44:55.160,0:44:55.890
and

0:44:55.890,0:44:58.389
the standard deviation goes up a bit

0:44:58.389,0:45:00.829
it’s a little bit of deviation from baseline

0:45:00.829,0:45:03.659
 the obvious explanation is clearly

0:45:03.659,0:45:07.339
we’re just content switching
a bit more

0:45:08.840,0:45:10.349
because we don't have

0:45:10.349,0:45:12.410
CPUs that are doing nothing at all

0:45:12.410,0:45:15.559
there some extra load from the system
as well

0:45:15.559,0:45:20.049
since the kernel has to run and
background tests have to run

0:45:20.049,0:45:23.150
you know in this case we have a badly behaved application

0:45:23.150,0:45:26.579
we now have 8 load processes which would suck up all the CPU

0:45:26.579,0:45:28.879
and then we try to run our measurement process

0:45:28.879,0:45:30.639
we see a you know

0:45:30.639,0:45:32.739
substantial performance decrease

0:45:32.739,0:45:35.570
you know about in the range we would expect

0:45:35.570,0:45:37.289
see if we had any

0:45:37.289,0:45:40.140
decrease

0:45:40.140,0:45:43.220
we fired up with CPU set

0:45:43.220,0:45:44.249
quite obviously

0:45:44.249,0:45:46.190
the interesting thing here is to see it

0:45:46.190,0:45:49.429
we’re getting no statistically significant difference

0:45:49.429,0:45:52.819
between the baseline case with

0:45:52.819,0:45:56.539
7 processors if we use CPU sets
we don't see this variance

0:45:56.539,0:45:58.520
which is nice to know that this shows

0:45:58.520,0:45:59.509
that's it

0:45:59.509,0:46:02.869
we actually see a slight performance
improvement

0:46:02.869,0:46:04.179
and

0:46:04.179,0:46:05.579
we

0:46:05.579,0:46:07.589
we see a reduction in variance

0:46:07.589,0:46:11.569
so CPU set is actually improving performance
even if we’re not overloaded

0:46:11.569,0:46:13.510
and we see in the overloaded case

0:46:13.510,0:46:15.589
it's the same

0:46:15.589,0:46:20.319
for the other processes
they’re stuck on other CPUs

0:46:20.319,0:46:22.820
one interesting side note actually is that

0:46:22.820,0:46:26.719
when I was doing some tests early on

0:46:26.719,0:46:27.869
we actually saw

0:46:27.869,0:46:32.359
I tried doing the base line and
the baseline with CPU set and if you just fired off with the original

0:46:32.359,0:46:33.869
algorithm

0:46:33.869,0:46:34.540
which

0:46:34.540,0:46:36.489
grabbed CPU0

0:46:36.489,0:46:39.339
you saw a significant performance decline

0:46:39.339,0:46:42.319
because there's a lot of stuff that ends up
running on CPU0

0:46:42.319,0:46:43.819
which

0:46:43.819,0:46:45.100
what led to the

0:46:45.100,0:46:49.890
quick observation you want to allocate
from the large numbers down

0:46:49.890,0:46:50.569
so that you use

0:46:50.569,0:46:55.069
the CPUs which are not running the random processes
that get stuck on zero

0:46:55.069,0:46:57.880
or get all the interrupts in some architectures

0:46:57.880,0:47:02.199
and avoid Core0 in particular.

0:47:02.199,0:47:04.029
so some conclusions

0:47:04.029,0:47:07.530
I think we have useful proof of concept
of going to be deploying

0:47:07.530,0:47:09.880
certainly the

0:47:09.880,0:47:11.000
memory stuff soon

0:47:11.000,0:47:13.329
once we upgrade to seven we’ll

0:47:13.329,0:47:15.959
definitely be deploying the CPU sets

0:47:15.959,0:47:16.849
so it's

0:47:16.849,0:47:18.509
both improves performance

0:47:18.509,0:47:22.009
in the contended case and in the and uncontended case

0:47:22.009,0:47:26.299
we would like in the future to do some more work
with virtual private server stuff

0:47:26.299,0:47:28.979
Particularly it would be really interesting

0:47:28.979,0:47:30.759
to be able to run different

0:47:30.759,0:47:32.540
different FreeBSD versions in jails

0:47:32.540,0:47:37.660
for to run up for instance CentOS images
in jail since we’re running CentOS

0:47:37.660,0:47:40.649
on our Linux based systems

0:47:40.649,0:47:43.240
there could actually be some really interesting
things there

0:47:43.240,0:47:45.759
in that for instance we can run

0:47:45.759,0:47:50.989
we could potentially detrace Linux applications
it's never going to happen on native Linux

0:47:50.989,0:47:53.069
there's also another example where

0:47:53.069,0:47:56.269
Paul Sub who’s doing some benchmarking recently

0:47:56.269,0:48:01.039
and relative to Linux on the same hardware

0:48:01.039,0:48:04.900
he was seeing a three and a half times improvement
0:48:04.900,0:48:07.230
in basic matrix multiplication

0:48:07.230,0:48:08.549
relative to current

0:48:08.549,0:48:11.849
because previously super-pegged functionality

0:48:11.849,0:48:14.499
where you vastly reduce the number of TLV entries

0:48:14.499,0:48:16.150
in the page table

0:48:16.150,0:48:17.229
and so

0:48:17.229,0:48:21.109
that sort of thing can apply even to apply
to our Linux using population

0:48:21.109,0:48:23.969
could give FreeBSD some real wins there

0:48:26.309,0:48:27.579
I’d like to look at

0:48:27.579,0:48:30.859
more on the point of isolating users from kernel upgrades

0:48:30.859,0:48:32.620
one of the issues we've had is that

0:48:32.620,0:48:34.019
when you do a new bump

0:48:34.019,0:48:38.399
we have users who depend on all sorts of libraries
immediate which

0:48:38.399,0:48:41.380
you know the vendors like to rev them to
do

0:48:41.380,0:48:44.640
stupid API breaking changes is fairly
regularly so

0:48:44.640,0:48:48.380
it’d be nice for users if we can get all the
benefits to kernel upgrades

0:48:48.380,0:48:51.699
and they could upgrade at their leisure

0:48:51.699,0:48:54.459
so we're hoping to do that in future as well

0:48:54.459,0:48:57.809
we’d would like to see more limits
on bandwidth type resources

0:48:59.219,0:49:01.199
for instance say limiting the amount of

0:49:02.910,0:49:05.649
it's fairly easy to know the amount
of sockets I own

0:49:05.649,0:49:10.279
but it’s hard to place a total limit on
network bandwidth

0:49:10.279,0:49:11.819
by a particular process

0:49:11.819,0:49:16.979
when almost all of our storage is on NFS
how do you classify that traffic

0:49:17.649,0:49:21.259
without a fair bit of change to the kernel
and somehow tagging that

0:49:21.259,0:49:23.799
it's an interesting challenge.

0:49:23.799,0:49:28.309
we'd also like to see it could be needed some
you implement something like

0:49:28.309,0:49:30.089
the IRIX job ID

0:49:30.089,0:49:34.099
to allow the scheduler to just
tag processes as part of a job

0:49:34.099,0:49:36.309
currently

0:49:36.309,0:49:38.939
I've grid engine uses a clever but evil hack

0:49:38.939,0:49:40.010
where they add

0:49:40.010,0:49:42.509
an extra group to the process

0:49:42.509,0:49:44.819
and they just have a range of groups

0:49:44.819,0:49:48.209
available so they get inherited in the users
can’t drop them so

0:49:48.209,0:49:51.889
that allows them to track the process
but it’s an ugly hack

0:49:51.889,0:49:57.499
and with the current limits on the number of groups
it can become a real problem

0:49:57.499,0:49:59.529
actually before I take questions

0:49:59.529,0:49:59.980
I do want to put in

0:49:59.980,0:50:01.119
one quick point

0:50:01.119,0:50:05.100
the think it's not interesting you live in
the area and if you're looking for

0:50:05.100,0:50:06.430
looking for a job

0:50:06.430,0:50:09.780
we are trying to hire a few people it's difficult
to hire good

0:50:09.780,0:50:13.069
we do have some openings and we're looking
for

0:50:13.069,0:50:17.409
BSD people in general system
Admin people

0:50:17.409,0:50:24.409
so questions?

0:50:38.419,0:50:40.989
Yes
(inaudible question)

0:50:40.989,0:50:45.719
I would expect that to happen
but it's not something I’ve attempted to test

0:50:45.719,0:50:50.570
what I would really like is to have a topology aware allocator

0:50:50.570,0:50:53.179
so that you can request that you know I want

0:50:53.179,0:50:56.229
I want to share cache or I don't want to share cache

0:50:56.229,0:51:00.170
I want to share memory band width or not share memory bandwidth

0:51:00.170,0:51:02.459
open MPI 1.3

0:51:02.459,0:51:08.469
on the Linux side have a topology where a wrapper for their CPU

0:51:08.469,0:51:10.159
functionality

0:51:10.159,0:51:12.249
makes it something called

0:51:12.249,0:51:14.139
the PLAP

0:51:14.139,0:51:15.259
portable Linux

0:51:16.519,0:51:19.599
CPU allocator. Is that what
it's actually been

0:51:19.599,0:51:21.959
what the acronym is

0:51:21.959,0:51:25.400
in essence they have to work around the fact
that there were three standard

0:51:25.400,0:51:27.809
there were three different

0:51:27.809,0:51:31.759
kernel APIs for the same syscall

0:51:31.759,0:51:38.759
for CPU allocation because all the vendors
did it themselves somehow

0:51:38.769,0:51:44.969
they're the same number but
they’re completely incompatible

0:51:44.969,0:51:48.749
when you first load the application it calls
the syscall and it tries to figure out which

0:51:48.749,0:51:50.579
one it is

0:51:50.579,0:51:52.719
by what errors it returns depending on what

0:51:52.719,0:51:56.139
are you missing and completely evil

0:51:56.139,0:52:00.859
I think people should port their API
and have their library work but

0:52:00.859,0:52:05.650
we don’t need to do that junk
because we did not make that mistake

0:52:05.650,0:52:12.650
so I would like to see the
topology aware stuff in particular

0:52:30.710,0:52:32.529
(inaudible question)

0:52:32.529,0:52:37.180
the trick is it’s easy to limit application bandwidth

0:52:39.500,0:52:42.269
fairly easy to limit application bandwidth

0:52:42.269,0:52:44.329
it becomes more difficult when you have to

0:52:44.329,0:52:45.430
if your

0:52:45.430,0:52:49.759
interfaces are shared between application traffic

0:52:49.759,0:52:50.880
and

0:52:50.880,0:52:53.049
say NFS

0:52:53.049,0:52:57.399
getting classifying that is going to be trickier
you have to tag you’d have to add a fair bit of code

0:52:57.399,0:53:04.399
to trace that down through the kernel
certainly doable

0:53:12.069,0:53:15.499
(inaudible question)

0:53:15.499,0:53:18.389
I have contemplated doing just that

0:53:18.389,0:53:22.059
or in fact the other thing we consider
doing

0:53:22.059,0:53:24.829
more as a research project than is a practical thing

0:53:24.829,0:53:26.719
would be actually how

0:53:26.719,0:53:28.619
would be

0:53:28.619,0:53:30.029
independent VLANs

0:53:30.029,0:53:31.839
because then we could do

0:53:31.839,0:53:32.459
things like

0:53:32.459,0:53:35.489
give each process a VLAN they couldn't even

0:53:35.489,0:53:37.979
share at the internet layer

0:53:37.979,0:53:41.259
once the images’ in place for instance we will
be able to do that

0:53:41.259,0:53:45.049
and that say you know you've got your interfaces
it’s yours whatever

0:53:45.049,0:53:46.479
but then we could limit it

0:53:46.479,0:53:49.959
we could rate limit that at the kernel
we can also have

0:53:49.959,0:53:54.729
we’d have a physically isolated
we’d have a logically isolated network as well

0:53:54.729,0:53:57.589
with some of the latest switches we could actually
rate limit

0:53:57.589,0:54:04.589
at the switch as well

0:54:19.939,0:54:22.369
(inaudible questions)
so to the first question

0:54:22.369,0:54:26.190
we don’t run multiple

0:54:26.190,0:54:27.639
sensitivity data on these clusters

0:54:27.639,0:54:28.709
unclassified cluster

0:54:28.709,0:54:30.460
we've avoided that problem by

0:54:30.460,0:54:32.299
not allowing it

0:54:32.299,0:54:34.929
But it is a real issue

0:54:34.929,0:54:36.939
it's just not one we've had to deal with

0:54:39.559,0:54:42.109
in practice with stuff that’s sensitive

0:54:43.059,0:54:47.579
has handling requirements that you can't touch
the same hardware without a scrub

0:54:47.579,0:54:49.859
you need a pretty

0:54:49.859,0:54:51.739
ridiculously aggressive

0:54:51.739,0:54:53.770
you need a very coarse granularity

0:54:53.770,0:54:57.240
a ridiculous remote imaging process that you
moved all of the data

0:54:57.240,0:55:00.959
so if I were to do that I would
probably get rid of the discs

0:55:00.959,0:55:01.389
just

0:55:01.389,0:55:02.400
go disc less

0:55:02.400,0:55:04.910
that would get rid of my number-one failure case
of

0:55:04.910,0:55:07.839
that would be pretty good but

0:55:07.839,0:55:09.419
but haven’t done it

0:55:10.609,0:55:13.819
NFS failures we've had occasional problems of NFS overloading


0:55:13.819,0:55:15.679
we haven't had real problem

0:55:15.679,0:55:19.279
we're all local network it’s fairly tightly
contained so we haven't had problems with

0:55:19.279,0:55:20.539
things

0:55:20.539,0:55:21.819
with

0:55:21.819,0:55:26.039
you know the server going down for extended
periods and causing everything to hang

0:55:26.039,0:55:27.819
it's been more an issue of

0:55:27.819,0:55:33.189
I mean there isn't there's a problem
that Panasas is described as in cast

0:55:33.189,0:55:36.109
you can take out any NFS server

0:55:36.109,0:55:40.809
I mean we have the bluearc guys come in and the
PGA based stuff with multiple ten-gig links I said

0:55:40.809,0:55:42.049
you know I've got

0:55:42.049,0:55:46.779
to do this and they said can we not try this with your whole cluster

0:55:46.779,0:55:47.950
because if you got

0:55:47.950,0:55:49.370
three hundred and fifty

0:55:49.370,0:55:52.599
gigabit ethernet interfaces going into
the system

0:55:52.599,0:55:56.589
Even ten gig you can saturate pretty trivially

0:55:56.589,0:55:57.120
so that level

0:55:57.120,0:55:58.930
there's an inherent problem

0:55:58.930,0:56:01.969
on we need to handle that kind of bandwidth
we've

0:56:01.969,0:56:04.459
got to get it a parallel file system

0:56:04.459,0:56:06.069
get a cluster

0:56:06.069,0:56:12.289
before doing streaming stuff we could go via SWAN or something

0:56:12.289,0:56:14.949
anyone else?

0:56:14.949,0:56:15.429
thank you, everyone
(applause and end)