doc/en_US.ISO8859-1/captions/2009/dcbsdcon/davis-isolatingcluster.sbv


0:00:15.749,0:00:18.960
I do apologize for the (other)
0:00:18.960,0:00:22.130
for the EuroBSDCon slides. I've redone the
0:00:22.130,0:00:23.890
title page and redone the
0:00:23.890,0:00:27.380
and made some changes to the slides
and they didn't make it through for approval
0:00:27.380,0:00:33.130
by this afternoon so
0:00:33.130,0:00:34.640
okay so
0:00:34.640,0:00:36.390
I'm gonna be talking about
0:00:36.390,0:00:38.430
doing
0:00:38.430,0:00:42.889
about isolating jobs for performance and predictability
in clusters
0:00:42.889,0:00:43.970
before I get into that
0:00:43.970,0:00:46.010
I want to talk a little bit about
0:00:46.010,0:00:47.229
who we are and
0:00:47.229,0:00:49.520
what our problem space is like because that
0:00:49.520,0:00:54.760
dictates that… has an effect
on our solution space
0:00:54.760,0:00:57.079
I work for The Aerospace Corporation.
0:00:57.079,0:00:58.609
We work;
0:00:58.609,0:01:02.480
we operate a federally-funded
research and development center
0:01:02.480,0:01:05.400
in the area of national security space
0:01:05.400,0:01:09.310
and in particular we work with the Air Force
Space and Missile Command
0:01:09.310,0:01:13.090
and with the National Reconnaissance
Office
0:01:13.090,0:01:16.670
and our engineers support a wide variety
0:01:16.670,0:01:20.550
of activities within that area
0:01:20.550,0:01:21.830
so we have
0:01:21.830,0:01:23.740
a bit over fourteen hundred to correct
0:01:23.740,0:01:25.860
sorry twenty four hundred engineers
0:01:25.860,0:01:28.820
in virtually every discipline we have
0:01:28.820,0:01:33.520
as you would expect we have our rocket scientists,
we have people who build satellites
0:01:33.520,0:01:37.439
we have people who build sensors that go on
satellites, people who study these sort of things
0:01:37.439,0:01:38.130
that you
0:01:38.130,0:01:39.590
see when you
0:01:39.590,0:01:40.819
use those sensors
0:01:40.819,0:01:42.040
that sort of thing.
0:01:42.040,0:01:44.180
We also have civil engineers and
0:01:44.180,0:01:45.680
electronic engineers
0:01:45.680,0:01:46.649
and process,
0:01:46.649,0:01:49.170
computer process people
0:01:49.170,0:01:53.120
so we literally do everything related to space
and all sorts of things that you might not
0:01:53.120,0:01:55.270
expect to be related to space,
0:01:55.270,0:01:58.820
since we also for instance help build ground
systems because satellites aren't very useful if
0:01:58.820,0:02:00.680
there isn't anything to talk to them;
0:02:02.540,0:02:04.090
and these engineers
0:02:04.090,0:02:07.420
since they're solving all these different problems we have
0:02:07.420,0:02:11.499
engineering applications in you know
virtually every size you can think of
0:02:11.499,0:02:15.539
ranging from you know little spreadsheet things that
you might not think of as an engineering
0:02:15.539,0:02:17.229
application but they are
0:02:17.229,0:02:22.249
to Matlab programs or a lot of C code
0:02:22.249,0:02:23.960
or, what's traditional for us,
0:02:23.960,0:02:25.159
serial code
0:02:25.159,0:02:26.049
and then
0:02:26.049,0:02:30.949
large parallel applications either in house;
genetic algorithms and that sort
0:02:30.949,0:02:31.769
of thing,
0:02:31.769,0:02:32.900
or traditional
0:02:32.900,0:02:34.749
the classic parallel code
0:02:34.749,0:02:37.599
like you'd run on a Cray or something, material simulation
0:02:40.119,0:02:41.459
or fluid flow
0:02:41.459,0:02:43.869
or that sort of thing
0:02:43.869,0:02:44.240
so
0:02:44.240,0:02:46.349
so we have this big application space
0:02:46.349,0:02:49.029
just want to give a little introduction to that because
it
0:02:49.029,0:02:51.529
does come back and influence what we
0:02:51.529,0:02:55.999
the sort of solutions we look at
0:02:55.999,0:03:00.499
so the rest of the talk I'm gonna talk about rese…
0:03:00.499,0:03:05.259
we skipped a slide, there we are, that's a little better.
0:03:05.259,0:03:08.940
Now, what I'm interested in is I do high
performance computing
0:03:08.940,0:03:10.109
at the company
0:03:10.109,0:03:13.949
and I provide high performance computing resources
to our users
0:03:13.949,0:03:19.949
as part of my role in our technical
computing services organization
0:03:19.949,0:03:20.370
so
0:03:20.370,0:03:23.120
our primary resource at this point is
0:03:23.120,0:03:25.429
the fellowship cluster
0:03:25.429,0:03:26.540
it's a for the
0:03:26.540,0:03:29.569
named for the Fellowship of the Ring
0:03:29.569,0:03:30.449
so it's a…
0:03:30.449,0:03:32.520
… eleven axel nodes
0:03:32.520,0:03:33.930
wrap the core systems
0:03:33.930,0:03:35.909
over here there's a
0:03:35.909,0:03:39.659
Cisco… a large Cisco switch. Actually today
there are around two 6509s if
0:03:39.659,0:03:40.899
you assess them
0:03:40.899,0:03:46.149
and because we couldn't get the port density we wanted otherwise
0:03:46.149,0:03:50.219
and it's primarily a Gigabit Ethernet system; it runs
FreeBSD, currently 6.0 because we haven't upgraded
0:03:50.219,0:03:51.089
it yet
0:03:51.089,0:03:55.639
planning to move probably to 7.1
or maybe slightly past 7.1
0:03:55.639,0:04:01.029
if we want to get the latest HWPMC changes in
0:04:01.029,0:04:05.900
we use the Sun Grid Engine scheduler, which is one of
the two main options for open source
0:04:05.900,0:04:08.949
resource managers on clusters the other one being
the…
0:04:09.959,0:04:11.499
… the TORQUE
0:04:11.499,0:04:15.939
and Maui combination from Cluster Resources
0:04:15.939,0:04:17.389
so we also have
0:04:17.389,0:04:18.079
that's actually
0:04:18.079,0:04:22.090
40 TB, that's really the raw number, on a Sun Thumper and
0:04:23.219,0:04:26.290
that's thirty-two usable once you start using RAID-Z2
0:04:26.290,0:04:30.939
since you might actually like to have your data
should a disk fail
0:04:30.939,0:04:32.969
and with today's disks RAID…
0:04:32.969,0:04:34.009
RAID five
0:04:34.009,0:04:35.249
doesn't really cut it,
0:04:37.379,0:04:40.220
And then we also have some other resources coming on but I'm going to be (concentrating on)
0:04:40.220,0:04:43.530
two smaller clusters unfortunately probably running Linux and
0:04:43.530,0:04:45.900
some SMPs but
0:04:45.900,0:04:49.990
I'm going to be concentrating here on the work we're
doing on our other
0:04:49.990,0:04:54.259
FreeBSD-based cluster.
0:04:54.259,0:04:55.060
So, first of all
0:04:55.060,0:04:59.410
first of all I want to talk about why we want to
share resources. Should be fairly obvious
0:04:59.410,0:05:00.610
but I'll talk about it in a little bit
0:05:00.610,0:05:04.900
and then what goes wrong when you start sharing resources
0:05:04.900,0:05:08.449
after that I'll talk about some different solutions
to those problems
0:05:08.449,0:05:09.759
and
0:05:09.759,0:05:13.399
some fairly trivial experiments that we've done
so far in terms of enhancing the scheduler or
0:05:13.399,0:05:15.860
using operating system features
0:05:15.860,0:05:17.730
to mitigate those problems
0:05:19.349,0:05:20.110
and
0:05:20.110,0:05:25.110
then conclude with some future work.
0:05:25.110,0:05:29.289
So, obviously if you have a resource the size…
the size of our cluster, fourteen hundred
0:05:29.289,0:05:30.970
cores roughly
0:05:30.970,0:05:32.819
you probably want to share it unless you
0:05:32.819,0:05:35.080
purpose built it for a single application
0:05:35.080,0:05:37.340
you're going to want to have your users
0:05:37.340,0:05:39.440
sharing it
0:05:39.440,0:05:42.909
and you don't want to just say you know, you get on Monday
0:05:42.909,0:05:45.330
probably not going to be a very effective
option
0:05:45.330,0:05:49.270
especially not when we have as many users as we
do
0:05:49.270,0:05:53.849
we also can't just afford to buy another one
every time a user shows up
0:05:53.849,0:05:54.959
so one of our
0:05:54.959,0:05:57.339
senior VPs said a while back
0:05:57.339,0:05:57.969
you know
0:05:57.969,0:06:02.349
we could probably afford to buy just about
anything we could need once
0:06:02.349,0:06:03.800
we can't just
0:06:03.800,0:06:06.359
buy ten of them though
0:06:06.359,0:06:08.939
if we really, really needed it
0:06:08.939,0:06:09.680
dropping
0:06:09.680,0:06:11.460
small numbers of millions of dollars on
0:06:11.460,0:06:13.349
computing resources wouldn't be
0:06:13.349,0:06:15.039
impossible
0:06:15.039,0:06:20.829
but we can't go to you know just have every engineer
who wants one just call up Dell and say ship me ten racks
0:06:20.829,0:06:24.030
it's not going to work
0:06:24.030,0:06:25.580
and the other thing is that we can't
0:06:25.580,0:06:28.360
we need to also provide quick turnaround
0:06:28.360,0:06:29.390
for some users
0:06:29.390,0:06:33.229
so we can't have one user hogging the system and
hogging it until they are done
0:06:33.229,0:06:34.720
because we have some users
0:06:34.720,0:06:37.099
and then the next one can run
0:06:37.099,0:06:40.949
because we have some users who'll
come in and say well I need to run
0:06:40.949,0:06:43.159
for three months
0:06:43.159,0:06:43.690
and
0:06:43.690,0:06:46.810
we've had users come in and literally run
0:06:46.810,0:06:49.740
pretty much using the entire system for three months
0:06:49.740,0:06:53.839
well so we've had to provide some ability for other
users to still get their work done
0:06:53.839,0:06:58.300
so we can't just… so we do have to have some sharing
0:06:58.300,0:07:00.619
however when you start to share any resource
0:07:00.619,0:07:01.610
like this
0:07:01.610,0:07:03.509
you start getting contention
0:07:03.509,0:07:06.300
users need the same thing at the same time
0:07:06.300,0:07:09.700
and so they fight back and forth for it and they
can't get what they want
0:07:09.700,0:07:11.639
so you have to balance them a bit
0:07:12.999,0:07:14.529
you know also
0:07:14.529,0:07:17.869
some jobs lie when they
0:07:17.869,0:07:20.870
request resources and they actually need
more than they ask for
0:07:20.870,0:07:23.279
which can cause problems
0:07:23.279,0:07:27.229
so we schedule them. We say you're going to fit
here fine and they run off and use
0:07:27.229,0:07:28.580
more than they said
0:07:28.580,0:07:31.000
and if we don't have a mechanism to constrain
them
0:07:31.000,0:07:32.389
we have problems.
0:07:32.389,0:07:34.270
Likewise
0:07:34.270,0:07:37.109
once these users start to contend
0:07:37.109,0:07:39.029
that doesn't just result in
0:07:39.029,0:07:40.439
the jobs taking,
0:07:40.439,0:07:43.360
taking longer in terms of wall clock time
0:07:43.360,0:07:44.659
because they are extremely slow
0:07:44.659,0:07:48.430
but there's overhead related to that contention;
they get swapped out due to pressure on
0:07:49.219,0:07:51.509
various systems
0:07:51.509,0:07:52.550
if you really
0:07:52.550,0:07:57.039
for instance run out of memory then you go into
swap and you end up wasting all your cycles
0:07:57.039,0:07:58.710
pulling junk in and out of disk
0:07:58.710,0:08:00.830
wasting your bandwidth on that
0:08:00.830,0:08:03.530
so there are
0:08:03.530,0:08:04.219
resource
0:08:04.219,0:08:08.139
there are resource costs to the contention not merely
0:08:08.139,0:08:11.979
a delay in returning results.
0:08:11.979,0:08:16.590
So now I'm going to switch gears and start talk… so I'm
going to talk a little bit about different
0:08:16.590,0:08:18.270
solutions to these
0:08:18.270,0:08:20.610
to the
0:08:20.610,0:08:22.339
these contention issues
0:08:23.710,0:08:27.840
and look at different ways of solving the
problem. Most of these are things that have
0:08:27.840,0:08:29.440
already been done
0:08:29.440,0:08:30.620
but I just want to talk about
0:08:30.620,0:08:32.990
the different ways and then
0:08:32.990,0:08:35.710
evaluate them in our context.
0:08:35.710,0:08:38.119
So a classic solution to the problem is
0:08:38.119,0:08:39.280
Gang Scheduling
0:08:39.280,0:08:44.139
It's basically conventional Unix process
context switching
0:08:44.139,0:08:46.560
written really big
0:08:46.560,0:08:50.339
what you do is you have your parallel
job that's running
0:08:50.339,0:08:51.390
on a system
0:08:51.390,0:08:52.839
and it runs for a while
0:08:52.839,0:08:57.920
and then after a certain amount of time you basically
shove it all; you kick it off of all the nodes
0:08:57.920,0:08:59.940
and let the next one come in
0:08:59.940,0:09:04.030
and typically when people do this they do it
on the order of hours because the context switch
0:09:04.030,0:09:09.270
time is extremely high
0:09:09.270,0:09:10.639
for example
0:09:10.639,0:09:14.530
because it's not just like swapping a process
in and out. You suddenly have to coordinate
0:09:14.530,0:09:17.470
this context switch across all your processes
0:09:17.470,0:09:19.280
if you're running say
0:09:19.280,0:09:21.190
MPI over TCP
0:09:21.190,0:09:25.910
you actually need to tear down the TCP sessions
because you can't just have TCP timers sitting
0:09:25.910,0:09:26.570
around
0:09:26.570,0:09:28.260
or that sort of thing so
0:09:28.260,0:09:29.950
there's a lot of overhead
0:09:29.950,0:09:34.340
associated with this. You take a long context switch
0:09:34.340,0:09:36.820
if all of your infrastructure supports this
0:09:36.820,0:09:39.420
it's fairly effective
0:09:39.420,0:09:43.300
and it does allow jobs to avoid interfering
with each other which is nice
0:09:43.300,0:09:46.100
so you don't have issues
0:09:46.100,0:09:47.689
because you're typically allocating
0:09:47.689,0:09:50.950
whole swaths of the system
0:09:50.950,0:09:53.390
and for properly written applications
0:09:55.000,0:09:59.690
partial results can be returned which for some of
our users is really important where you're doing a
0:09:59.690,0:10:00.490
refinement
0:10:00.490,0:10:04.350
users would want to look at the results and
say okay
0:10:04.350,0:10:06.130
you know is this just going off into the weeds
0:10:06.130,0:10:10.860
or does it look like it's actually converging on
some sort of useful solution
0:10:10.860,0:10:13.980
as they don't want to just wait till the end.
0:10:13.980,0:10:19.270
Downside of course is that the context
switch costs are very high
0:10:19.270,0:10:22.460
and most importantly there's really a lack
of useful implementations
0:10:22.460,0:10:25.340
a number of platforms have implemented this in the past
0:10:25.340,0:10:29.840
but in practice on modern clusters which are
built on commodity hardware
0:10:29.840,0:10:32.340
with you know
0:10:32.340,0:10:35.530
communication libraries written on standard protocols
0:10:35.530,0:10:37.050
the tools just aren't there
0:10:37.050,0:10:39.100
and so
0:10:39.100,0:10:40.860
it's not very practical.
0:10:40.860,0:10:44.010
Also it doesn't really make a lot of sense with small jobs
0:10:44.010,0:10:47.789
and one of the things that we found is we have users who have
0:10:47.789,0:10:50.860
embarrassingly parallel problems where they need to look at
0:10:50.860,0:10:53.450
you know twenty thousand studies
0:10:53.450,0:10:57.400
and they could write something that looked more like a
conventional parallel application where they
0:10:57.400,0:11:01.930
you know wrote a Scheduler and set up an MPI a Message Passing Interface
0:11:01.930,0:11:05.400
and handed out tasks to pieces of their job and then you
could do this
0:11:05.400,0:11:09.280
but then they would be running a scheduler and they would
probably do a bad job of it; it turns out it's actually
0:11:09.280,0:11:10.820
fairly difficult to do right
0:11:10.820,0:11:13.740
even a trivial case
0:11:13.740,0:11:16.189
and so what they do instead is they just submit
0:11:16.189,0:11:18.730
twenty thousand jobs to grid engine and say okay
0:11:18.730,0:11:21.330
whatever I'll deal with it
0:11:21.330,0:11:23.140
earlier versions that might have been a problem
0:11:23.140,0:11:24.730
current versions of the code
0:11:24.730,0:11:27.060
easily handle a million jobs
0:11:27.060,0:11:29.370
so not really a big deal
0:11:29.370,0:11:31.610
but those sort of users wouldn't fit well
0:11:31.610,0:11:34.190
into the gang scheduled environment
0:11:34.190,0:11:35.690
at least not in a
0:11:35.690,0:11:39.149
conventional gang scheduled environment where
you do gang scheduling on the granularity of
0:11:39.149,0:11:40.940
jobs
0:11:40.940,0:11:44.140
so from that perspective it wouldn't work very well.
0:11:44.140,0:11:48.380
If you have all the pieces in place and you are
doing a big parallel application it is in fact
0:11:48.380,0:11:53.770
an extremely effective approach.
0:11:53.770,0:11:56.290
Another option which is sort of related
0:11:56.290,0:11:57.420
it's in fact
0:11:57.420,0:12:00.079
taking an even coarser granularity
0:12:00.079,0:12:04.360
is single application or single project
clusters or sub-clusters.
0:12:04.360,0:12:07.590
For instance this is used at some national labs
0:12:07.590,0:12:11.910
where you're given a cycle allocation for a
year based on your grant proposals
0:12:11.910,0:12:14.779
and what your cycle allocation actually comes to you as is
0:12:14.779,0:12:16.580
here's your cluster
0:12:16.580,0:12:17.489
here's a frontend
0:12:17.489,0:12:19.840
here's this chunk of nodes, they're yours, go to it.
0:12:19.840,0:12:21.930
Install your own OS, whatever you want
0:12:21.930,0:12:25.580
it's yours
0:12:25.580,0:12:30.310
and then and at a sort of finer scale there's things such as
0:12:30.310,0:12:31.800
you could use Emulab
0:12:31.800,0:12:36.300
which is the network emulation system but also does a OS install and configuration
0:12:36.300,0:12:39.300
so you could do dynamic allocation that way
0:12:39.300,0:12:40.540
Sun's
0:12:40.540,0:12:44.040
Project Hedeby now actually I think it's
called service domain manager
0:12:44.040,0:12:46.500
is the productised version
0:12:46.500,0:12:50.010
or some Clusters on Demand
0:12:50.010,0:12:54.450
they were actually talking about web hosting clusters but
0:12:54.450,0:12:57.780
things that allow rapid deployment let you
do that at a little
0:12:57.780,0:12:59.510
little
0:12:59.510,0:13:02.810
a more granular level than the
0:13:02.810,0:13:05.580
allocate-them-once-a-year approach
0:13:05.580,0:13:07.720
nonetheless
0:13:07.720,0:13:11.220
lets you give people whole clusters to work with
0:13:11.220,0:13:12.920
one nice thing about it is
0:13:12.920,0:13:15.450
the isolation between the processes
0:13:15.450,0:13:16.890
is complete
0:13:16.890,0:13:20.800
so you don't have to worry about users stomping on each other.
It's their own system, they can trash it all they
0:13:20.800,0:13:22.230
want
0:13:22.230,0:13:24.709
if they flood the network or they
0:13:24.709,0:13:26.180
run the nodes into swap
0:13:26.180,0:13:28.480
well that's their problem
0:13:28.480,0:13:32.120
but it also has the advantage that you can tailor the images
0:13:32.120,0:13:36.980
on the nodes, the operating systems, to
meet the exact needs of the application
0:13:36.980,0:13:40.560
down side of course is its coarse granularity, in our environment that doesn't work
0:13:40.560,0:13:41.500
very well
0:13:41.500,0:13:46.800
since we do have all of these all these different types of jobs
0:13:46.800,0:13:51.710
context switches are also pretty expensive. Certainly on the order of minutes
0:13:51.710,0:13:54.690
Emulab typically claim something like ten minutes
0:13:54.690,0:13:57.970
there are some systems out there
0:13:57.970,0:14:03.320
for instance if you use I think it's Open Boot that
they're calling it today. It used to be LinuxBIOS
0:14:03.320,0:14:06.790
where you can actually deploy a system in
0:14:06.790,0:14:08.700
tens of seconds
0:14:08.700,0:14:11.520
mostly by getting rid of all that junk the BIOS writers wrote
0:14:11.520,0:14:12.890
and
0:14:12.890,0:14:17.770
the OS boots pretty fast if you don't have all
that stuff to waylay you,
0:14:17.770,0:14:19.940
but in practice on sort of
0:14:19.940,0:14:21.660
off the shelf hardware
0:14:21.660,0:14:24.400
the context switches times are quite high
0:14:24.400,0:14:26.930
users of course can interfere with themselves
0:14:26.930,0:14:29.200
you can argue it's not a problem but
0:14:29.200,0:14:31.660
ideally you would like to prevent
that
0:14:31.660,0:14:35.350
one of the things that I have to deal with
is that my users are
0:14:35.350,0:14:37.830
almost universally
0:14:37.830,0:14:40.410
not trained as computer scientists or programmers
0:14:40.410,0:14:42.550
you know they're trained in their domain area
0:14:42.550,0:14:44.780
they're really good in that area
0:14:44.780,0:14:48.389
but their concepts of the way hardware works and the
way software works
0:14:48.389,0:14:55.389
don't match reality in many cases
0:15:01.269,0:15:02.830
(inaudible question)
It's pretty rare in practice
0:15:02.830,0:15:06.700
well I've heard one lab that does it significantly
0:15:06.700,0:15:09.839
but it's like they do it on sort of a yearly
allocation basis
0:15:09.839,0:15:12.790
and throw the hardware away after two or three years
0:15:12.790,0:15:15.999
and you do typically have some sort of the deployment
0:15:15.999,0:15:18.340
system in place
0:15:18.340,0:15:20.680
or in those types of cases actually
0:15:20.680,0:15:22.359
usually your application comes with
0:15:22.359,0:15:26.500
and here's what we're going to spend on this many people
0:15:26.500,0:15:27.730
on this project so this is
0:15:27.730,0:15:34.730
big resource allocation
0:15:36.000,0:15:39.780
And yeah I guess one other issue with this is there's no real easy
0:15:39.780,0:15:43.320
way to capture underutilized resources
for example
0:15:43.320,0:15:44.389
if you have
0:15:44.389,0:15:49.190
an application which you know say single-threaded
and uses a ton of memory
0:15:49.190,0:15:51.210
and is running on a machine
0:15:51.210,0:15:55.040
the machines we're buying these days are eight core so
0:15:55.040,0:16:00.040
thats wasting a lot of CPU cycles you're just
generating a lot of heat doing nothing
0:16:00.040,0:16:03.890
so ideally you would like a scheduler that
said okay so you're using
0:16:03.890,0:16:08.040
using eight or seven of the eight Gigabytes of
RAM but we've got these jobs
0:16:08.040,0:16:10.080
sitting here that
0:16:10.080,0:16:11.560
need next to nothing, need
0:16:11.560,0:16:15.910
a hundred megabytes so we slap seven of
those in along with the big job
0:16:15.910,0:16:18.580
and backfill and in this
0:16:18.580,0:16:19.600
mechanism there's no
0:16:19.600,0:16:21.810
there's no good way to do that
0:16:21.810,0:16:26.820
obviously if the users have that application
mix they can do it themselves
0:16:26.820,0:16:30.510
but it's not something where we can easily
bring in
0:16:30.510,0:16:35.090
bring in more jobs and have a mix to
take advantage of the different
0:16:35.090,0:16:37.300
resources.
0:16:37.300,0:16:39.940
A related approach is to
0:16:39.940,0:16:43.950
install virtualization software on the
equipment, and this is…
0:16:44.980,0:16:46.379
this is the essence of
0:16:46.379,0:16:49.800
what Cloud Computing is at the moment
0:16:49.800,0:16:53.520
it's Amazon providing Xen
0:16:53.520,0:16:55.129
Xen hosting for
0:16:55.129,0:16:56.769
relatively arbitrary
0:16:56.769,0:16:59.710
OS images
0:16:59.710,0:17:02.720
it does have the advantage that it allows rapid deployment
0:17:02.720,0:17:06.510
in theory if your application is scalable provides for
0:17:06.510,0:17:08.259
extremely high scalability
0:17:08.259,0:17:10.110
particularly if you
0:17:10.110,0:17:14.470
aren't us and therefore can possibly use somebody else's hardware
0:17:14.470,0:17:16.520
in our application's case that's
0:17:16.520,0:17:18.790
not very practical so
0:17:18.790,0:17:20.360
we can't do that
0:17:20.360,0:17:20.870
and
0:17:20.870,0:17:23.790
it also has the advantage that you can run
0:17:23.790,0:17:26.470
you can have people with their own image in there
0:17:26.470,0:17:30.000
which is tightly resource constrained but you
can run more than one of them on a node. So for instance
0:17:30.000,0:17:31.170
you can give
0:17:31.170,0:17:32.730
one job
0:17:32.730,0:17:35.489
four cores and another job two cores another
0:17:35.489,0:17:37.500
you know and have a couple single core
0:17:37.500,0:17:38.860
jobs in theory
0:17:38.860,0:17:43.340
you can get fairly strong isolation there
obviously there are shared resources underneath
0:17:43.340,0:17:44.710
and you
0:17:44.710,0:17:45.570
probably can't
0:17:45.570,0:17:48.370
afford to completely isolate say network bandwidth
0:17:48.370,0:17:49.520
at the bottom layer
0:17:49.520,0:17:51.580
you can do some but
0:17:51.580,0:17:56.170
if you go overboard you can spend all your time on accounting
0:17:56.170,0:17:58.830
you also can again
0:17:58.830,0:18:01.410
tailor the images to the job
0:18:01.410,0:18:05.030
and in this environment actually you can
do that even more strongly than in
0:18:05.030,0:18:07.030
the sub-cluster approach
0:18:07.030,0:18:09.860
in that you can often do run
0:18:09.860,0:18:16.360
a five-year-old operating system or ten-year-old
operating system if you're using full virtualization
0:18:16.360,0:18:19.030
and that can allow
0:18:19.030,0:18:23.820
allow obsolete code with weird baselines to work which is
important in our space because
0:18:23.820,0:18:27.390
the average program runs ten years or more
0:18:27.390,0:18:30.860
our average project runs ten years or more
0:18:30.860,0:18:32.530
and as a result
0:18:32.530,0:18:36.010
you might have to go rerun this program that was written
0:18:36.010,0:18:37.320
way back on
0:18:37.320,0:18:40.550
some ancient version of Windows or whatever
0:18:40.550,0:18:41.890
it also does provide
0:18:41.890,0:18:43.840
the ability to recover resources
0:18:43.840,0:18:45.290
as I was talking about before
0:18:45.290,0:18:49.530
which you can't do easily with sub-clusters because you can't just slip
0:18:49.530,0:18:50.360
another image
0:18:50.360,0:18:52.910
on the on there and say are you can use anything and
0:18:52.910,0:18:56.730
you know give that image idle priority essentially
0:18:56.730,0:19:00.480
downside of course is that it is incomplete
isolation in that there is shared
0:19:00.480,0:19:02.340
hardware
0:19:02.340,0:19:06.490
you're not likely to find, I don't think,
any of the virtualization systems out there
0:19:06.490,0:19:08.890
right now
0:19:08.890,0:19:09.890
virtualize
0:19:09.890,0:19:11.470
your segment of
0:19:11.470,0:19:13.540
memory bandwidth
0:19:13.540,0:19:15.159
or your segment
0:19:15.159,0:19:16.390
of cache
0:19:16.390,0:19:18.390
of cache space
0:19:18.390,0:19:24.809
so users can in fact interfere with themselves and each other in this
environment
0:19:24.809,0:19:25.589
it's also
0:19:25.589,0:19:30.479
not really efficient for small jobs; the cost of running an
entire OS for every
0:19:30.479,0:19:33.020
job is fairly high
0:19:33.020,0:19:34.020
even with
0:19:34.020,0:19:34.710
relatively light
0:19:34.710,0:19:38.250
Unix-like OSes you're still looking at a
0:19:38.250,0:19:40.900
couple hundred megabytes in practice
0:19:40.900,0:19:46.240
once you get everything up and running unless you run something
totally stripped down
0:19:47.230,0:19:49.460
there's significant overhead
0:19:49.460,0:19:52.240
there's CPU slowdown typically in the
0:19:52.240,0:19:55.360
you know typical estimates are in the twenty
percent range
0:19:55.360,0:20:00.450
numbers really range from fifty percent to
five percent depending on what exactly you're doing
0:20:00.450,0:20:02.100
possibly even lower
0:20:02.100,0:20:04.830
or higher
0:20:04.830,0:20:05.870
and just
0:20:05.870,0:20:09.920
you know the overhead because you have the whole OS there's a lot of a lot
0:20:09.920,0:20:11.420
of duplicate
0:20:11.420,0:20:13.320
stuff
0:20:13.320,0:20:15.010
the various vendors
0:20:15.010,0:20:17.090
have their answers they claim you know we can
0:20:17.090,0:20:21.430
we can merge that and say oh you're running the same kernel so we'll keep your memory
0:20:21.430,0:20:24.120
we use the same memory but
0:20:24.120,0:20:25.220
at some level
0:20:25.220,0:20:29.309
it's all going to get duplicated.
0:20:29.309,0:20:30.590
A related option
0:20:30.590,0:20:34.820
comes from sort of the internet hosting
industry which is to use virtual private
0:20:34.820,0:20:38.130
which is the technology from virtual private servers
0:20:38.130,0:20:42.110
the example that everyone here is probably familiar with is Jails where
0:20:42.110,0:20:44.130
you can provide
0:20:44.130,0:20:46.720
your own file system root
0:20:46.720,0:20:49.060
your own network interface
0:20:49.060,0:20:50.620
and what not
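For anyone who has not used jails before, a minimal sketch of starting one with its own file system root and address, using the classic jail(8) invocation of the FreeBSD 6/7 era; the paths, hostname, and address are made-up examples, not the cluster's configuration:

    import subprocess

    # Start a shell inside a jail that has its own file system root and
    # its own IP address (all values below are illustrative only).
    subprocess.run([
        "jail",
        "/jails/job42",   # the jail's private file system root
        "job42",          # hostname seen inside the jail
        "10.0.0.42",      # address the jail's network interface answers on
        "/bin/sh",        # command to run inside the jail
    ], check=True)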
0:20:50.620,0:20:51.500
and
0:20:51.500,0:20:53.129
the nice thing about this is
0:20:53.129,0:20:56.210
that unlike full virtualization
0:20:56.210,0:20:58.680
the overhead is very small
0:20:58.680,0:21:01.030
basically it costs you
0:21:01.030,0:21:02.820
an entry in your process table
0:21:02.820,0:21:05.570
or an entry in a few structures
0:21:05.570,0:21:08.760
there's some extra tests in their kernel but otherwise
0:21:10.220,0:21:14.900
there's not a huge overhead for virtualization you don't need
an extra kernel for every
0:21:14.900,0:21:15.460
image
0:21:15.460,0:21:18.390
so you get the difference here
between
0:21:18.390,0:21:21.620
be able to run maybe
0:21:21.620,0:21:25.250
you might be able to squeeze two hundred VMware images onto a machine
0:21:25.250,0:21:29.620
VMware people say no, no, don't do that, but we have machines that are running
0:21:29.620,0:21:30.509
nearly that many.
0:21:34.790,0:21:38.289
On the other hand there are people out there who run thousands of
0:21:38.289,0:21:40.730
virtual hosts
0:21:40.730,0:21:43.170
using this technique on a single machine so
0:21:43.170,0:21:45.200
big difference in resource use
0:21:45.200,0:21:46.400
on especially with light
0:21:46.400,0:21:48.070
in the lightly loaded use
0:21:48.070,0:21:52.400
in our environment we're looking at running a very small number of them but still
0:21:52.400,0:21:55.880
that overhead is significant
0:21:55.880,0:21:59.440
you still do have some ability to tailor the
0:21:59.440,0:22:01.670
images to a job's needs
0:22:01.670,0:22:03.309
you could have a
0:22:03.309,0:22:05.400
custom root that for instance you could be running
0:22:05.400,0:22:07.380
FreeBSD 6.0 in one
0:22:07.380,0:22:08.650
in one
0:22:08.650,0:22:11.040
virtual server and 7.0 in another
0:22:11.040,0:22:15.090
you have to be running of course 7.0 kernel or 8.0 kernel to make
that work
0:22:15.090,0:22:16.330
but it allows you to do that
0:22:16.330,0:22:18.500
we also in principle can do
0:22:18.500,0:22:23.080
evil things like our 64-bit kernel and then 32-bit
user spaces because
0:22:23.080,0:22:26.400
say you have applications that you can't find the source to anymore
0:22:26.400,0:22:31.830
or libraries you don't
have the source to any more
0:22:31.830,0:22:32.990
an answer
0:22:32.990,0:22:34.150
interesting things there
0:22:34.150,0:22:36.680
and the other nice thing is since you're
0:22:36.680,0:22:39.629
you're doing a very lightweight and incomplete
virtualization
0:22:39.629,0:22:43.269
you don't have to virtualize things you don't
care about so you don't have the overhead of
0:22:43.269,0:22:45.520
virtualizing everything.
0:22:45.520,0:22:48.070
Downsides of course are incomplete isolation
0:22:48.070,0:22:50.690
you are running processes on the same kernel
0:22:50.690,0:22:52.770
and they can interfere with each other
0:22:52.770,0:22:55.320
and there's dubious flexibility obviously
0:22:55.320,0:22:57.900
I don't think anyone
0:22:57.900,0:23:01.850
should have the ability to run Windows in a jail.
0:23:01.850,0:23:02.860
There's some
0:23:02.860,0:23:04.960
NetBSD support but
0:23:04.960,0:23:10.510
and I don't think it's really gotten to that point.
0:23:10.510,0:23:12.420
One final area
0:23:12.420,0:23:14.350
that sort of diverges from this
0:23:14.350,0:23:16.159
is the classic
0:23:16.159,0:23:18.400
Unix solution to the problem
0:23:18.400,0:23:20.580
on this on single
0:23:20.580,0:23:22.070
in a single machine
0:23:22.070,0:23:22.800
which is
0:23:22.800,0:23:28.950
to use existing resource limits and resource partitioning techniques
0:23:28.950,0:23:33.430
you know for example all Unix-like, all our Unix systems have per-process
resource limits
0:23:33.430,0:23:36.240
a resource and typically
0:23:36.240,0:23:36.999
scheduler a
0:23:38.340,0:23:41.510
cluster schedulers support the common ones
0:23:41.510,0:23:43.150
so you can set a
0:23:43.150,0:23:47.230
memory limit on your process or a CPU time limit on your process
0:23:47.230,0:23:49.830
and the schedulers typically provide
0:23:49.830,0:23:51.350
at least
0:23:51.350,0:23:54.740
launch support for
0:23:54.740,0:23:56.850
the limits on
0:23:56.850,0:24:01.900
a given set of processes that's part of the job
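To make that concrete, a minimal sketch of the kind of per-process limits a scheduler can apply when it launches a job; the job path and the particular numbers are invented for illustration:

    import resource
    import subprocess

    def apply_job_limits():
        # Runs in the child before exec: cap CPU time and address space
        # for this job's process (example values only).
        resource.setrlimit(resource.RLIMIT_CPU, (3600, 3600))        # one hour of CPU
        resource.setrlimit(resource.RLIMIT_AS, (1 << 30, 1 << 30))   # 1 GiB of memory

    subprocess.run(["./my_job"], preexec_fn=apply_job_limits)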
0:24:01.900,0:24:02.850
also the most
0:24:02.850,0:24:05.640
you know there are a number of forms of resource
partitioning that
0:24:05.640,0:24:07.170
are available
0:24:08.100,0:24:09.700
as a standard feature
0:24:09.700,0:24:12.000
so memory disks are one of them, so
0:24:12.000,0:24:16.800
if you want to create a file system space that's
limited in size, create a memory disk
0:24:16.800,0:24:17.969
and back it
0:24:17.969,0:24:21.130
and back it with an mmapped file
0:24:21.130,0:24:22.520
or swap
0:24:22.520,0:24:24.570
of partitioning
0:24:24.570,0:24:26.330
disk use
0:24:26.330,0:24:30.330
and then there are techniques like CPU affinities so you can lock
processes to
0:24:30.330,0:24:32.010
a single process
0:24:32.010,0:24:34.540
processor or a set of processors
0:24:34.540,0:24:39.310
and so they can't interfere with each other
with processes running on other processors
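A small example of that affinity idea using FreeBSD's cpuset(1) from a script; the core lists, pid, and worker command are placeholders, not the cluster's real setup:

    import subprocess

    # Run a worker confined to cores 0-3 so it cannot take cycles from
    # jobs pinned to the other cores on the node.
    subprocess.run(["cpuset", "-l", "0-3", "./worker"], check=True)

    # Or pin an already-running process (pid 12345 is a placeholder).
    subprocess.run(["cpuset", "-l", "4-7", "-p", "12345"], check=True)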
0:24:39.310,0:24:44.280
the nice thing about this first is that you're using existing
facilities so you don't have to rewrite
0:24:44.280,0:24:46.170
lots of new features
0:24:46.170,0:24:49.590
for a niche application
0:24:49.590,0:24:52.790
and they tend to integrate well with existing schedulers
in many cases
0:24:52.790,0:24:55.940
parts of them are already implemented
0:24:55.940,0:24:59.650
and in fact the experiments that I'll talk about later are all using
this type of
0:24:59.650,0:25:02.160
technique.
0:25:02.160,0:25:02.830
Cons are of course
0:25:02.830,0:25:04.850
incomplete isolation again
0:25:04.850,0:25:08.270
and there's typically no unified framework
0:25:08.270,0:25:12.310
for the concept of a job, where a job is composed of a set of processes
0:25:12.310,0:25:16.710
yeah there are a number of data structures within the kernel for
instance the session
0:25:16.710,0:25:18.120
which
0:25:18.120,0:25:19.499
sort of aggregate processes
0:25:19.499,0:25:20.990
but there isn't one
0:25:22.230,0:25:24.800
in BSD or Linux at this point
0:25:24.800,0:25:29.020
which allows you to place resource limits on those in the way that you can on a process
0:25:29.020,0:25:32.520
IRIX did have support like that
0:25:32.520,0:25:34.160
where they have a job ID
0:25:34.160,0:25:36.210
and there could be a job limit
0:25:36.210,0:25:38.280
and Solaris projects
0:25:38.280,0:25:41.320
are sort of similar but not quite the same
0:25:41.320,0:25:43.149
processes are part of a project but
0:25:43.149,0:25:46.770
it's not quite the same inherited relationship
0:25:47.720,0:25:49.500
and typically
0:25:49.500,0:25:50.900
there arent
0:25:50.900,0:25:55.390
limits on things like bandwidth. There was
0:25:55.390,0:25:56.430
a sort of a
0:25:56.430,0:25:58.350
bandwidth limiting
0:25:58.350,0:26:00.630
nice type interface
0:26:00.630,0:26:01.950
on that I saw
0:26:01.950,0:26:03.720
posted as a research project
0:26:03.720,0:26:07.150
many years ago I think in the 2.x days
0:26:07.150,0:26:09.880
where you could say this process can have
0:26:09.880,0:26:11.580
you know five megabits
0:26:11.580,0:26:12.530
or whatever
0:26:12.530,0:26:14.380
but I haven't really seen anything take off
0:26:14.380,0:26:16.940
that would be a pretty neat thing to have
0:26:16.940,0:26:19.309
actually one other exception there
0:26:19.309,0:26:22.230
is on IRIX again
0:26:22.230,0:26:28.210
the XFS file system supported guaranteed data rates on file handles
you could say
0:26:28.210,0:26:30.140
you could open a file and say I need
0:26:30.140,0:26:32.940
ten megabits read or ten megabits write
0:26:32.940,0:26:34.029
or whatever and it would say
0:26:34.029,0:26:35.529
okay or no
0:26:35.529,0:26:39.279
and then you could read and write and
it would do evil things at the file system layer
0:26:39.279,0:26:40.600
in some cases
0:26:40.600,0:26:43.940
all to ensure that you could get that streaming data rate
0:26:44.900,0:26:49.710
by keeping the file.
0:26:49.710,0:26:53.620
So now I'm going to talk about what we've done
0:26:53.620,0:26:59.510
what we needed was a solution to handle
a wide range of job types
0:26:59.510,0:27:01.570
So of the options we looked at for instance
0:27:01.570,0:27:04.990
single application clusters or
project clusters
0:27:04.990,0:27:11.990
I think that the isolation they
provide is essentially unparalleled
0:27:12.590,0:27:16.630
and in our environment we probably have to
virtualize in order to be
0:27:16.630,0:27:18.179
efficient in terms of
0:27:18.179,0:27:22.060
being able to handle our job mix and what not and handle
the fact that our users
0:27:22.060,0:27:23.740
tend to have
0:27:23.740,0:27:27.730
spikes in their use
0:27:27.730,0:27:32.799
on a large scale so for instance we'll get GPS people show up and say
we need to run for a month
0:27:32.799,0:27:33.780
on and then
0:27:33.780,0:27:38.460
some indeterminate number of months later
they'll do it again
0:27:38.460,0:27:40.840
for that sort of quick
0:27:40.840,0:27:41.480
demands
0:27:42.240,0:27:44.850
we really need something
virtualized
0:27:44.850,0:27:47.120
and then we have to pay the price of
0:27:47.120,0:27:48.380
of the overhead
0:27:48.380,0:27:51.590
and again it doesn't handle small jobs well and that is a
0:27:51.590,0:27:54.050
large portion of our job mix so
0:27:54.050,0:27:55.180
of the
0:27:55.180,0:27:58.070
quarter million or something jobs we've run
0:27:58.070,0:27:59.700
on our cluster
0:27:59.700,0:28:02.490
I would guess that
0:28:02.490,0:28:04.730
more than half of those were submitted
0:28:04.730,0:28:05.890
in
0:28:05.890,0:28:09.660
batches of more than ten thousand
0:28:09.660,0:28:11.400
so they'll just pop up
0:28:11.400,0:28:14.030
the other method to have looked at
0:28:14.800,0:28:16.750
are using resource limits
0:28:16.750,0:28:19.060
the nice thing of course is they're achievable
with
0:28:19.060,0:28:21.429
they achieve useful isolation
0:28:21.429,0:28:26.289
and they're implementable with either existing functionality or small
extensions so that's what we've been
0:28:26.289,0:28:27.230
concentrating on.
0:28:27.230,0:28:29.740
We've also been doing some thinking about
0:28:29.740,0:28:31.809
could we use the techniques there
0:28:31.809,0:28:33.940
and combine them with jails
0:28:33.940,0:28:36.170
or related features
0:28:36.170,0:28:40.019
maybe bulking up jails to be more like Zones in Solaris
0:28:40.019,0:28:44.150
or containers I think they're calling them this
week
0:28:44.150,0:28:44.840
and
0:28:44.840,0:28:46.770
so we're looking at that as well
0:28:46.770,0:28:50.840
to be able to provide
0:28:50.840,0:28:54.250
to be able to provide per-user operating environments
0:28:54.250,0:28:59.200
potentially isolating users from upgrades so for instance as we upgrade the kernel
0:28:59.200,0:29:03.469
users can continue using the old
images if they don't have time to rebuild their
0:29:03.469,0:29:04.330
application in
0:29:04.330,0:29:09.970
and handle the updates in libraries and what not
0:29:09.970,0:29:13.840
they also have the potential to provide strong isolation for security
purposes
0:29:13.840,0:29:18.740
which could be useful in the future.
0:29:18.740,0:29:20.159
We do think that
0:29:20.159,0:29:24.040
of these mechanisms the nice thing is that
resource limit
0:29:24.040,0:29:26.150
the resource limits and partitioning scheme
0:29:26.150,0:29:29.860
as well as virtual private servers have very
similar implementation requirements
0:29:29.860,0:29:33.090
setup is a fair bit more expensive
0:29:33.090,0:29:34.620
in the VPS case
0:29:34.620,0:29:38.780
but nonetheless they're fairly similar.
0:29:38.780,0:29:42.610
So, what we've been doing is we've taken the Sun Grid Engine
0:29:42.610,0:29:46.880
and we originally intended to actually
extend Sun Grid Engine and modify its daemons
0:29:46.880,0:29:48.480
to do the work
0:29:48.480,0:29:51.150
but what we ended up doing instead is realizing
that, well,
0:29:51.150,0:29:54.910
we can actually specify an alternate program
to run instead of the shepherd
0:29:54.910,0:29:57.990
The shepherd is the process
0:29:57.990,0:30:00.580
that starts all
0:30:00.580,0:30:02.250
starts the script that
0:30:02.250,0:30:03.380
can for each job
0:30:03.380,0:30:04.920
on a given node
0:30:04.920,0:30:08.559
it collects usage and forwards signals to the
children
0:30:08.559,0:30:12.620
and also is responsible for starting remote
components
0:30:12.620,0:30:14.560
so a shepherd is started and then
0:30:14.560,0:30:17.640
traditionally in Sun grid engine it starts out
0:30:17.640,0:30:19.910
its own RShell Daemon
0:30:19.910,0:30:20.800
and
0:30:20.800,0:30:22.010
jobs connect over
0:30:22.010,0:30:23.670
these days that for their own
0:30:23.670,0:30:25.870
mechanism which is
0:30:25.870,0:30:26.950
secure
0:30:26.950,0:30:28.000
not using the
0:30:28.840,0:30:30.530
crufty old rshell code.
0:30:35.370,0:30:37.970
So what we've done is we've implemented a wrapper script
0:30:37.970,0:30:40.139
which allows a pre-command hook
0:30:40.139,0:30:42.559
to run before the shepherd starts
0:30:42.559,0:30:47.170
the command wrapper, so before we start the shepherd we can run, like, the env program
0:30:47.170,0:30:49.150
or we can run
0:30:49.150,0:30:50.430
TRUE to whatever
0:30:50.430,0:30:54.040
to set up the environment that it runs in or CPU
0:30:54.040,0:30:56.600
sets as I'll show later
0:30:56.600,0:30:58.750
and a post command hook for cleanup
0:30:58.750,0:31:03.940
it's implemented in Ruby because I felt like it.
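The real wrapper is the Ruby script just mentioned; what follows is only a rough sketch of the same shape, with invented hook names and an assumed shepherd path, to show where the pre- and post-command hooks sit around the real sge_shepherd:

    import os
    import subprocess
    import sys

    REAL_SHEPHERD = "/opt/sge/bin/sge_shepherd"   # path is an example

    def pre_command_hook(env):
        # e.g. mount a memory-backed TMPDIR or carve out a CPU set here
        pass

    def post_command_hook(env):
        # e.g. unmount the memory disk or release the CPU set here
        pass

    if __name__ == "__main__":
        pre_command_hook(os.environ)
        try:
            status = subprocess.call([REAL_SHEPHERD] + sys.argv[1:])
        finally:
            post_command_hook(os.environ)
        sys.exit(status)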
0:31:03.940,0:31:07.830
The first thing we implemented was memory backed temporary directories. The motivation for
0:31:07.830,0:31:08.700
this
0:31:08.700,0:31:09.640
is that
0:31:09.640,0:31:12.180
we've had problems where users will you know
0:31:12.180,0:31:15.510
run /tmp out on the nodes
0:31:15.510,0:31:19.059
the way we have the nodes configured is that they do have disks
0:31:19.059,0:31:22.960
and most of the disk is available as /tmp
0:31:22.960,0:31:25.049
we had some cases
0:31:25.049,0:31:27.840
particularly early on where users would fill up the disks and not delete it
0:31:27.840,0:31:32.300
their job would crash or they would forget to add clean up code or whatever
0:31:32.300,0:31:35.100
and then other jobs would fail strangely
0:31:35.100,0:31:39.029
you might expect that you just get a nice error message
0:31:39.029,0:31:42.040
programmers being programmers
0:31:42.040,0:31:42.909
people would not do their
0:31:42.909,0:31:44.630
error handling correctly.
0:31:44.630,0:31:47.380
A number of libraries do have issues like for instance
0:31:47.380,0:31:49.600
the PVM library
0:31:49.600,0:31:52.600
unexpectedly fails and reports a completely strange error
0:31:52.600,0:31:54.759
if it can't create a file in temp
0:31:54.759,0:32:01.669
because it needs to create a UNIX domain socket
so it can talk to itself.
0:32:01.669,0:32:03.360
So, what weve done here
0:32:03.360,0:32:08.059
is it turns out that Sun Grid Engine actually creates a temporary
directory often the
0:32:08.059,0:32:11.730
typically /tmp but you can change
that
0:32:11.730,0:32:14.490
and points TMPDIR to that
0:32:14.490,0:32:15.370
location
0:32:15.370,0:32:17.499
we've educated most of our users now
0:32:17.499,0:32:21.360
to use that location correctly
so they'll use that variable
0:32:21.360,0:32:23.279
they create their files under TMPDIR
0:32:23.279,0:32:24.950
and then when the job exits
0:32:24.950,0:32:26.569
the Grid Engine deletes the directory
0:32:26.569,0:32:28.510
and that all gets cleaned up
0:32:28.510,0:32:32.720
the problem of course being that if multiple jobs
are running on the same node at the same time
0:32:32.720,0:32:35.290
one of them could still fill temp
0:32:35.290,0:32:38.759
so the solution was pretty simple
we created a
0:32:38.759,0:32:41.420
wrapper script at the beginning of the job
0:32:41.420,0:32:42.760
creates a
0:32:42.760,0:32:43.940
a
0:32:43.940,0:32:47.260
swap-backed md memory file system
0:32:47.260,0:32:50.790
of a user-requestable size with a default
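Creating such a file system is essentially a one-liner with FreeBSD's mdmfs(8); a sketch of what the pre- and post-command hooks would run, with an example size and mount point rather than the wrapper's real values:

    import os
    import subprocess

    tmpdir = "/tmp/sge_job_42"          # per-job directory (example path)
    os.makedirs(tmpdir, exist_ok=True)

    # Swap-backed memory file system, capped at the requested 100 MB.
    subprocess.run(["mdmfs", "-s", "100m", "md", tmpdir], check=True)

    # ... the job runs with TMPDIR pointing at tmpdir ...

    # Cleanup hook: unmount it (and detach the md device if needed).
    subprocess.run(["umount", tmpdir], check=True)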
0:32:50.790,0:32:53.310
and
0:32:53.310,0:32:56.520
this has a number of advantages the biggest one of course is that
0:32:56.520,0:32:58.320
it's fixed size so we get
0:32:58.320,0:32:59.449
you know
0:32:59.449,0:33:01.000
the user gets
0:33:01.000,0:33:03.420
what they asked for
0:33:03.420,0:33:05.930
and once they run out of space, they run out of space, well
0:33:05.930,0:33:09.300
and too bad they ran out of space
0:33:09.300,0:33:12.760
they should have asked for more
0:33:12.760,0:33:16.350
the other
0:33:16.350,0:33:18.770
the other advantage is the side-effect that
0:33:18.770,0:33:21.619
now that we're running swap-backed memory file systems for temp
0:33:21.619,0:33:24.560
the users who only use a fairly small amount of temp
0:33:24.560,0:33:28.190
should see vastly improved performance
because they're running in memory
0:33:28.190,0:33:32.980
rather than writing to disk
0:33:32.980,0:33:34.690
quick example
0:33:34.690,0:33:38.270
we've a little job script here
0:33:38.270,0:33:39.830
prints TMPDIR and
0:33:39.830,0:33:41.950
prints the
0:33:41.950,0:33:43.080
amount of space
0:33:43.080,0:33:46.210
we submit our job request saying that we want
0:33:46.210,0:33:51.539
this is what we want, a hundred megabytes of
temp space
0:33:51.539,0:33:53.580
the same that's why if this
0:33:53.580,0:33:55.230
so the program doesn't
0:33:55.230,0:33:57.620
so the program ends at the end of it
0:33:57.620,0:33:58.709
for doing it
0:33:58.709,0:34:00.510
here's a live demo
0:34:00.510,0:34:01.840
all and then
0:34:01.840,0:34:03.389
you look at the output
0:34:03.389,0:34:04.280
you can see it
0:34:04.280,0:34:07.549
does in fact create a memory file system
0:34:07.549,0:34:10.449
I attempted to do great code
0:34:10.449,0:34:13.409
having a variable space
0:34:13.409,0:34:15.839
that is roughly what the user asked for
0:34:15.839,0:34:17.089
the version that I had
0:34:17.089,0:34:20.739
when I was attempting this was not entirely
accurate
0:34:20.739,0:34:24.710
trying to guess what all the
UFS overhead would be
0:34:24.710,0:34:25.889
as the result was
0:34:25.889,0:34:28.399
not quite consistent
0:34:30.790,0:34:33.899
I couldn't figure out an easy function so
0:34:33.899,0:34:39.589
it does a better job than it did to start with, its not perfect
0:34:39.589,0:34:40.600
sometimes however
0:34:40.600,0:34:42.329
today that that's a good fix
0:34:42.329,0:34:43.550
we're coming to
0:34:43.550,0:34:45.359
Deploy it pretty soon
0:34:45.359,0:34:47.159
it works pretty easily
0:34:47.159,0:34:48.570
well sometimes it's not enough
0:34:48.570,0:34:51.390
the biggest issue is that there are badly designed programs all
0:34:51.390,0:34:52.720
all over the world
0:34:52.720,0:34:54.919
don't use TMPDIR like they're supposed to
0:34:54.919,0:34:59.319
in fact
0:35:10.099,0:35:12.759
(inaudible question)
so there are all these applications
0:35:12.759,0:35:17.979
there are all these applications still that need
temp say during start up
0:35:17.979,0:35:19.230
that sort of thing
0:35:19.230,0:35:20.809
so
0:35:20.809,0:35:22.599
all
0:35:22.599,0:35:25.829
so we have problems with these
0:35:25.829,0:35:26.290
realistically
0:35:26.290,0:35:27.799
we can't change all of them
0:35:27.799,0:35:30.019
it's just not going to happen
0:35:30.019,0:35:31.950
so we still have problems with people
0:35:31.950,0:35:34.509
running out of resources
0:35:34.509,0:35:35.819
so we probably
0:35:35.819,0:35:37.489
feel that
0:35:37.489,0:35:41.240
the most general solution is to provide a per-job /tmp
0:35:41.240,0:35:44.880
and virtualize that portion of the file system
namespace
0:35:44.880,0:35:47.119
and variant symlinks can do that
0:35:47.119,0:35:52.539
and so we said okay let's give it a shot
0:35:52.539,0:35:56.969
just to introduce the concept of variant symlinks for people who aren't familiar with them
0:35:56.969,0:36:00.280
variant symlinks are basically symlinks that
contain variables
0:36:00.280,0:36:02.389
which are expanded at run time
0:36:02.389,0:36:05.549
it allows paths to be different for different
processes
0:36:05.549,0:36:06.969
for example
0:36:06.969,0:36:08.689
you create some files
0:36:08.689,0:36:10.069
you create
0:36:10.069,0:36:12.459
a symlink whose contents are
0:36:12.459,0:36:18.329
this variable which has the default shell value
0:36:18.329,0:36:18.990
and you
0:36:18.990,0:36:24.949
get different results with different
variable sets.
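The kernel does the expansion in the implementation discussed here, but the expansion rule itself is easy to illustrate in user space; the sketch below assumes a %{NAME:-default} spelling, which is only a guess at the exact syntax:

    import re

    def expand_variant(target, variables):
        # Expand %{NAME:-default} in a symlink target: a per-process
        # variable wins, otherwise the default is used.
        def repl(match):
            name, default = match.group(1), match.group(2) or ""
            return variables.get(name, default)
        return re.sub(r"%\{(\w+)(?::-([^}]*))?\}", repl, target)

    # Same link target, different result depending on the variable set.
    print(expand_variant("%{JOBTMP:-/tmp}", {}))                       # /tmp
    print(expand_variant("%{JOBTMP:-/tmp}", {"JOBTMP": "/tmp/job42"})) # /tmp/job42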
0:36:24.949,0:36:27.170
So, to talk about the implementation we've done,
0:36:27.170,0:36:32.389
it's derived from the NetBSD implementation, most of
the data structures are identical
0:36:32.389,0:36:33.869
however, I've made a number of changes
0:36:33.869,0:36:39.649
the biggest one is that we took the concept
of scopes and we turned them entirely around
0:36:40.409,0:36:45.329
in theirs there is a system scope which
is overridden by a user scope and by a
0:36:45.329,0:36:47.259
process scope
0:36:49.819,0:36:53.449
problem with that is if you
0:36:53.449,0:36:56.099
only think about say the systems scope
0:36:56.099,0:36:57.079
and
0:36:57.079,0:36:59.459
you decide you want to do something clever like have
0:36:59.459,0:37:02.219
a root file system which
0:37:02.219,0:37:06.109
where /lib points to different things
for different
0:37:06.109,0:37:08.249
different architectures
0:37:08.249,0:37:11.849
well, works quite nicely until the users come along
and
0:37:11.849,0:37:14.189
set their arch variable
0:37:14.189,0:37:15.629
up for you
0:37:15.629,0:37:18.900
if you have say a setuid program and you don't
defensively
0:37:18.900,0:37:22.319
and you don't implement correctly
0:37:22.319,0:37:24.900
the obvious bad things happen. Obviously you would
0:37:24.900,0:37:28.599
write your code to not do that I believe they
did, but
0:37:28.599,0:37:31.700
there's a whole class of problems where
0:37:31.700,0:37:33.449
it's easy to screw up
0:37:33.449,0:37:36.219
and do something wrong there
0:37:36.219,0:37:37.270
so by
0:37:37.270,0:37:38.509
reversing the order
0:37:38.509,0:37:41.849
we can reduce the risks
0:37:41.849,0:37:43.329
at the moment we don't
0:37:43.329,0:37:44.309
have a user scope
0:37:44.309,0:37:47.530
I just don't like the idea of the user scope
to be honest
0:37:47.530,0:37:50.900
problem being that then you have to have
per-user state in the kernel
0:37:50.900,0:37:55.509
that just sort of sits around forever
you can never garbage collect it except the
0:37:55.509,0:37:57.059
Administrator way
0:37:57.059,0:37:59.489
just doesn't seem like a great idea to me
0:37:59.489,0:38:00.700
And jail scope
0:38:00.700,0:38:04.609
just hasn't been implemented
0:38:04.609,0:38:09.809
because it wasn't entirely clear what the semantics should be
0:38:11.010,0:38:14.719
I also added default variable support, shell-style
0:38:14.719,0:38:16.999
variable support
0:38:16.999,0:38:19.169
to some extent undoes the scope
0:38:19.169,0:38:20.870
the scope change
0:38:20.870,0:38:21.779
in that
0:38:21.779,0:38:24.749
the default variable becomes a system scope
0:38:24.749,0:38:26.540
which is overridden by everything
0:38:26.540,0:38:30.890
but there are cases where we need to do that
in particular if we want to implement our
0:38:30.890,0:38:33.380
/tmp which varies
0:38:33.380,0:38:36.240
we have to do something like this because /tmp needs to work
0:38:37.209,0:38:42.059
if we don't have the job values set
0:38:42.059,0:38:45.829
I also decided to use
0:38:45.829,0:38:49.839
percent instead of dollar sign to avoid
confusion with shell variables because these
0:38:49.839,0:38:50.379
are
0:38:50.379,0:38:52.620
a separate namespace in the kernel
0:38:52.620,0:38:56.669
we can't do it the Domain/OS way and do all the evaluation in
user space
0:38:56.669,0:38:59.269
it's a classic vulnerability
0:38:59.269,0:39:02.739
in the CVE database for instance
0:39:02.739,0:39:08.109
and we're not using @ to avoid confusion
with AFS
0:39:08.109,0:39:09.819
or the NetBSD implementation
0:39:09.819,0:39:11.019
which does not allow
0:39:11.019,0:39:14.879
user or administratively settable values
0:39:14.879,0:39:17.019
that support
0:39:17.019,0:39:20.359
I don't have any automated variables such
as
0:39:20.359,0:39:25.789
the percent sys value which is universally
set in the NetBSD implementation
0:39:25.789,0:39:26.750
or
0:39:28.039,0:39:32.579
a UID variable which they also have
0:39:32.579,0:39:34.909
and currently it doesn't allow
0:39:34.909,0:39:40.880
setting of values in other processes,
you can only set them in your own and inherit it
0:39:40.880,0:39:42.699
that may change but
0:39:42.699,0:39:47.339
one of my goals here, because there were
subtle ways to make dumb mistakes and
0:39:47.339,0:39:48.930
cause security vulnerabilities
0:39:48.930,0:39:52.479
I've attempted to slim the feature set
down to the point where you
0:39:52.479,0:39:54.909
have some reasonable chance of not
0:39:54.909,0:39:56.339
doing that
0:39:56.339,0:40:03.339
if you start building systems on them for deployment.
0:40:04.419,0:40:06.909
The final area that we've worked on
0:40:06.909,0:40:09.499
is moving away from the file system space
0:40:09.499,0:40:12.559
and into CPU sets
0:40:12.559,0:40:16.379
Jeff Roberson
0:40:16.379,0:40:20.699
implemented a CPU set functionality which
allows you to
0:40:20.699,0:40:23.489
create… put a process into a CPU set
0:40:23.489,0:40:24.879
and then set the affinity of that
0:40:24.879,0:40:26.269
CPU set
0:40:26.269,0:40:29.189
by default every process has an anonymous
0:40:29.189,0:40:33.059
CPU set or was stuffed into
one that was created by
0:40:33.059,0:40:37.269
a parent
0:40:37.269,0:40:38.619
so for a little background here
0:40:38.619,0:40:40.740
in a typical SGE configuration
0:40:40.740,0:40:42.769
every node has one slot
0:40:42.769,0:40:44.429
per CPU
0:40:44.429,0:40:48.639
There are a number of other ways you
can configure it, basically a slot is something
0:40:48.639,0:40:50.019
a job can run in
0:40:50.019,0:40:56.719
and a parallel job crosses slots
and can be in more than one slot
0:40:56.719,0:41:01.359
for instance in many applications where
code tends to spend a fair bit of time
0:41:01.359,0:41:02.380
waiting for IO
0:41:02.380,0:41:06.209
you are looking at more than one slot per CPU so two slots per
0:41:06.209,0:41:08.089
core is not uncommon
0:41:08.089,0:41:10.869
but probably the most common configuration
and the one that
0:41:10.869,0:41:13.719
you get out of the box if you just install Grid Engine
0:41:13.719,0:41:16.739
is one slot for each CPU
0:41:16.739,0:41:19.830
and that's how we run because we
want users to have
0:41:19.830,0:41:23.699
that whole CPU for whatever they want to do with
it
0:41:23.699,0:41:26.130
so jobs are allocated one or more slots
0:41:26.130,0:41:27.599
if they're
0:41:27.599,0:41:33.189
depending on whether they're sequential or parallel jobs
and how many they ask for
0:41:33.189,0:41:37.239
but this is just a convention
there's no actual connection between slots
0:41:37.239,0:41:39.119
and CPUs
0:41:39.119,0:41:40.829
so it's quite possible to
0:41:40.829,0:41:42.819
submit a non-parallel job
0:41:42.819,0:41:45.019
that goes off and spawns a zillion threads
0:41:45.019,0:41:48.369
and sucks up all the CPUs on the whole system
0:41:48.369,0:41:50.800
in some early versions of grid engine
0:41:50.800,0:41:53.569
there actually was
0:41:53.569,0:41:55.729
support for tying slots
0:41:55.729,0:41:58.669
to CPUs if you set it up that
way
0:41:58.669,0:42:02.979
there was a sensible implementation for IRIX
and then things got weirder and weirder as
0:42:02.979,0:42:06.010
people tried to implement it on other platforms
which had
0:42:06.010,0:42:07.030
vastly different
0:42:07.030,0:42:09.839
CPU binding semantics
0:42:09.839,0:42:12.359
and at this point it's entirely broken
0:42:12.359,0:42:14.959
on every platform as far as I can tell
0:42:14.959,0:42:18.759
so we decided okay we've got this wrapper
let's see what we can do
0:42:18.759,0:42:21.009
in terms of making things work.
0:42:21.659,0:42:27.119
We now have the wrapper store allocations in the file system
0:42:27.119,0:42:31.239
we have a not yet recursive allocation algorithm
0:42:31.239,0:42:33.369
what we try to do is
0:42:33.369,0:42:34.690
find the best
0:42:34.690,0:42:35.779
fitting set of
0:42:35.779,0:42:39.539
adjacent cores
0:42:39.539,0:42:42.329
and then if that doesn't work we take the largest
and repeat
0:42:43.519,0:42:45.180
until we've fit the job,
0:42:45.180,0:42:47.300
or until we've got enough slots
0:42:47.300,0:42:50.800
the goal is to minimize fragmentation; we haven't
done any analysis
0:42:50.800,0:42:52.269
to determine whether that's actually
0:42:52.269,0:42:55.179
an appropriate algorithm
0:42:55.179,0:42:56.289
but offhand it seems
0:42:56.289,0:43:00.519
fine, given I've thought about it over lunch.
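
A rough sketch of that kind of greedy pass, purely illustrative and not the wrapper's actual code (it also allocates from the high-numbered cores down, for the reason that comes up in the results later):

#include <stdbool.h>
#include <stdio.h>

#define NCPUS 8

/* Pick one run of free cores: the smallest run that still fits `want`
 * slots, or failing that the largest fragment available. */
static int
take_run(bool *freecpu, int ncpu, int want, int *out, int nout)
{
    int best_start = -1, best_len = 0;
    int i = ncpu - 1;

    while (i >= 0) {
        if (!freecpu[i]) {
            i--;
            continue;
        }
        int end = i;
        while (i >= 0 && freecpu[i])
            i--;
        int len = end - i;
        bool better;
        if (len >= want)
            better = (best_len < want || len < best_len);
        else
            better = (best_len < want && len > best_len);
        if (better) {
            best_start = i + 1;
            best_len = len;
        }
    }
    if (best_start < 0)
        return nout;
    int grab = best_len < want ? best_len : want;
    for (int j = 0; j < grab; j++) {
        /* Take cores from the high end of the run downward. */
        int core = best_start + best_len - 1 - j;
        freecpu[core] = false;
        out[nout++] = core;
    }
    return nout;
}

int
main(void)
{
    /* Cores 0 and 3 already taken; we want 4 adjacent cores. */
    bool freecpu[NCPUS] = { false, true, true, false, true, true, true, true };
    int cores[NCPUS], n = 0, want = 4;

    while (n < want) {
        int before = n;
        n = take_run(freecpu, NCPUS, want - n, cores, n);
        if (n == before)    /* nothing left to hand out */
            break;
    }
    for (int j = 0; j < n; j++)
        printf("allocated core %d\n", cores[j]);
    return 0;
}
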
0:43:00.519,0:43:02.810
So what about other OSes?
0:43:02.810,0:43:09.649
it turns out that the FreeBSD CPU set API
and the Linux one
0:43:09.649,0:43:12.519
differ only in very small details
0:43:12.519,0:43:13.599
They're
0:43:13.599,0:43:15.479
essentially
0:43:15.479,0:43:17.569
identical semantically, which is
0:43:17.569,0:43:20.489
convenient,
so converting between them is pretty straightforward
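
A minimal sketch of what that conversion looks like in practice, binding the current process to cores 4 through 7 with either API (assuming the glibc sched_setaffinity wrapper on the Linux side):

#ifdef __FreeBSD__
#include <sys/param.h>
#include <sys/cpuset.h>
typedef cpuset_t cpu_mask_t;
#else
#define _GNU_SOURCE
#include <sched.h>
typedef cpu_set_t cpu_mask_t;
#endif

/* The CPU_ZERO/CPU_SET macros are the same on both systems; only the
 * final call and its way of saying "this process" differ. */
static int
bind_to_cpus(void)
{
    cpu_mask_t mask;
    int cpu;

    CPU_ZERO(&mask);
    for (cpu = 4; cpu <= 7; cpu++)
        CPU_SET(cpu, &mask);
#ifdef __FreeBSD__
    return cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_PID, -1,
        sizeof(mask), &mask);
#else
    return sched_setaffinity(0, sizeof(mask), &mask);
#endif
}

int
main(void)
{
    return bind_to_cpus() == 0 ? 0 : 1;
}
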
0:43:20.489,0:43:24.869
so I did a set of benchmarks
0:43:24.869,0:43:27.019
to demonstrate the
0:43:28.089,0:43:29.359
effectiveness of CPU set,
they also happen to demonstrate the wrapper
0:43:29.359,0:43:33.319
but that isn't really relevant here
0:43:33.319,0:43:35.229
We used a little eight-core Intel Xeon box
0:43:38.289,0:43:40.749
7.1 pre-release that had
0:43:40.749,0:43:43.239
John Bjorkman's backport of
0:43:43.239,0:43:46.640
CPU set
0:43:46.640,0:43:49.039
from 8.0 shortly before release
0:43:49.039,0:43:53.450
well, not so shortly; it was supposed to be shortly
before
0:43:53.450,0:43:55.579
and SGE 6.2
0:43:55.579,0:43:59.739
we used a simple integer benchmark,
0:43:59.739,0:44:02.519
an N-Queens program; we tested
0:44:02.519,0:44:03.349
for instance an 8 x 8 board,
0:44:03.349,0:44:05.360
placing
0:44:05.360,0:44:08.069
the 8 queens so they can't capture each other
0:44:08.069,0:44:09.289
on the board
0:44:11.039,0:44:13.680
so it's a simple load benchmark
0:44:13.680,0:44:18.800
we ran a small version of the problem
as our measurement command; to generate
0:44:19.599,0:44:24.439
load we ran a larger version that ran for much longer
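
For reference, the guts of such a benchmark fit in a few lines; a bitboard N-Queens counter along these lines (our sketch, not the exact program used) does nothing but integer work until the final print:

#include <stdio.h>

/* Count placements row by row; cols/diag1/diag2 are bitmasks of
 * attacked columns and diagonals. */
static long
solve(int n, int row, unsigned cols, unsigned diag1, unsigned diag2)
{
    if (row == n)
        return 1;
    long count = 0;
    unsigned open = ~(cols | diag1 | diag2) & ((1u << n) - 1);
    while (open) {
        unsigned bit = open & -open;    /* lowest open column */
        open -= bit;
        count += solve(n, row + 1, cols | bit,
            (diag1 | bit) << 1, (diag2 | bit) >> 1);
    }
    return count;
}

int
main(void)
{
    int n = 8;    /* a larger n runs for much longer */
    printf("%d-queens solutions: %ld\n", n, solve(n, 0, 0, 0, 0));
    return 0;
}
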
0:44:24.439,0:44:28.149
some results
0:44:28.149,0:44:30.129
so for baseline,
0:44:30.129,0:44:33.170
the most interesting thing is to do
a baseline run
0:44:33.170,0:44:34.279
you see this
0:44:34.279,0:44:36.410
some variance; it's not really very high
0:44:36.410,0:44:38.979
not surprising it doesn't really do anything
0:44:38.979,0:44:40.979
except suck CPU, as you see here
0:44:40.979,0:44:41.729
Really not much
0:44:41.729,0:44:45.229
going on
0:44:45.229,0:44:50.029
in this case we've got seven
load processes and a single
0:44:50.029,0:44:52.789
a single test process running
0:44:52.789,0:44:55.160
we see things slow down slightly
0:44:55.160,0:44:55.890
and
0:44:55.890,0:44:58.389
the standard deviation goes up a bit
0:44:58.389,0:45:00.829
it's a little bit of deviation from baseline
0:45:00.829,0:45:03.659
the obvious explanation, clearly, is that
0:45:03.659,0:45:07.339
we're just context switching
a bit more
0:45:08.840,0:45:10.349
because we don't have
0:45:10.349,0:45:12.410
CPUs that are doing nothing at all
0:45:12.410,0:45:15.559
there's some extra load from the system
as well
0:45:15.559,0:45:20.049
since the kernel has to run and
background tasks have to run
0:45:20.049,0:45:23.150
you know in this case we have a badly behaved application
0:45:23.150,0:45:26.579
we now have 8 load processes which would suck up all the CPU
0:45:26.579,0:45:28.879
and then we try to run our measurement process
0:45:28.879,0:45:30.639
we see a you know
0:45:30.639,0:45:32.739
substantial performance decrease
0:45:32.739,0:45:35.570
you know about in the range we would expect
0:45:35.570,0:45:37.289
then, to see if we had any
0:45:37.289,0:45:40.140
decrease
0:45:40.140,0:45:43.220
we fired it up with CPU sets
0:45:43.220,0:45:44.249
quite obviously
0:45:44.249,0:45:46.190
the interesting thing here is to see that
0:45:46.190,0:45:49.429
we're getting no statistically significant difference
0:45:49.429,0:45:52.819
from the baseline case: with
0:45:52.819,0:45:56.539
7 processes, if we use CPU sets
we don't see this variance
0:45:56.539,0:45:58.520
which is nice to know, and it shows
0:45:58.520,0:45:59.509
that
0:45:59.509,0:46:02.869
we actually see a slight performance
improvement
0:46:02.869,0:46:04.179
and
0:46:04.179,0:46:05.579
we
0:46:05.579,0:46:07.589
we see a reduction in variance
0:46:07.589,0:46:11.569
so CPU set is actually improving performance
even if we're not overloaded
0:46:11.569,0:46:13.510
and we see in the overloaded case
0:46:13.510,0:46:15.589
it's the same
0:46:15.589,0:46:20.319
for the other processes;
they're stuck on the other CPUs
0:46:20.319,0:46:22.820
one interesting side note actually is that
0:46:22.820,0:46:26.719
when I was doing some tests early on
0:46:26.719,0:46:27.869
we actually saw
0:46:27.869,0:46:32.359
I tried doing the baseline and
the baseline with CPU set, and if you just fired it off with the original
0:46:32.359,0:46:33.869
algorithm
0:46:33.869,0:46:34.540
which
0:46:34.540,0:46:36.489
grabbed CPU0
0:46:36.489,0:46:39.339
you saw a significant performance decline
0:46:39.339,0:46:42.319
because there's a lot of stuff that ends up
running on CPU0
0:46:42.319,0:46:43.819
which
0:46:43.819,0:46:45.100
which led to the
0:46:45.100,0:46:49.890
quick observation that you want to allocate
from the large numbers down
0:46:49.890,0:46:50.569
so that you use
0:46:50.569,0:46:55.069
the CPUs which are not running the random processes
that get stuck on zero
0:46:55.069,0:46:57.880
or get all the interrupts in some architectures
0:46:57.880,0:47:02.199
and avoid Core0 in particular.
0:47:02.199,0:47:04.029
so some conclusions
0:47:04.029,0:47:07.530
I think we have a useful proof of concept
we're going to be deploying
0:47:07.530,0:47:09.880
certainly the
0:47:09.880,0:47:11.000
memory stuff soon
0:47:11.000,0:47:13.329
once we upgrade to seven we'll
0:47:13.329,0:47:15.959
definitely be deploying the CPU sets
0:47:15.959,0:47:16.849
since it
0:47:16.849,0:47:18.509
improves performance both
0:47:18.509,0:47:22.009
in the contended case and in the uncontended case
0:47:22.009,0:47:26.299
we would like in the future to do some more work
with virtual private server stuff
0:47:26.299,0:47:28.979
Particularly it would be really interesting
0:47:28.979,0:47:30.759
to be able to run
0:47:30.759,0:47:32.540
different FreeBSD versions in jails,
0:47:32.540,0:47:37.660
or to run, for instance, CentOS images
in jails, since we're running CentOS
0:47:37.660,0:47:40.649
on our Linux-based systems
0:47:40.649,0:47:43.240
there could actually be some really interesting
things there
0:47:43.240,0:47:45.759
in that, for instance,
0:47:45.759,0:47:50.989
we could potentially DTrace Linux applications,
which is never going to happen on native Linux
0:47:50.989,0:47:53.069
there's also another example:
0:47:53.069,0:47:56.269
Paul Saab was doing some benchmarking recently
0:47:56.269,0:48:01.039
and relative to Linux on the same hardware
0:48:01.039,0:48:04.900
he was seeing a three and a half times improvement
0:48:04.900,0:48:07.230
in basic matrix multiplication
0:48:07.230,0:48:08.549
relative to current
0:48:08.549,0:48:11.849
because of the recently added superpage functionality
0:48:08.549,0:48:11.849
where you vastly reduce the number of TLB entries
0:48:11.849,0:48:14.499
and entries in the page table
0:48:16.150,0:48:17.229
and so
0:48:17.229,0:48:21.109
that sort of thing can appeal even
to our Linux-using population
0:48:21.109,0:48:23.969
could give FreeBSD some real wins there
0:48:26.309,0:48:27.579
I'd like to look at
0:48:27.579,0:48:30.859
more on the point of isolating users from kernel upgrades
0:48:30.859,0:48:32.620
one of the issues we've had is that
0:48:32.620,0:48:34.019
when you do a new bump
0:48:34.019,0:48:38.399
we have users who depend on all sorts of libraries
which
0:48:38.399,0:48:41.380
you know, the vendors like to rev to
do
0:48:41.380,0:48:44.640
stupid API-breaking changes fairly
regularly, so
0:48:44.640,0:48:48.380
it'd be nice for users if we can get all the
benefits of kernel upgrades
0:48:48.380,0:48:51.699
and they could upgrade at their leisure
0:48:51.699,0:48:54.459
so we're hoping to do that in future as well
0:48:54.459,0:48:57.809
we'd like to see more limits
on bandwidth-type resources
0:48:59.219,0:49:01.199
for instance say limiting the amount of
0:49:02.910,0:49:05.649
it's fairly easy to know the number
of sockets I own
0:49:05.649,0:49:10.279
but it's hard to place a total limit on
network bandwidth
0:49:10.279,0:49:11.819
used by a particular process
0:49:11.819,0:49:16.979
when almost all of our storage is on NFS
how do you classify that traffic
0:49:17.649,0:49:21.259
without a fair bit of change to the kernel
and somehow tagging that
0:49:21.259,0:49:23.799
it's an interesting challenge.
0:49:23.799,0:49:28.309
we'd also like to see somebody
implement something like
0:49:28.309,0:49:30.089
the IRIX job ID
0:49:30.089,0:49:34.099
to allow the scheduler to just
tag processes as part of a job
0:49:34.099,0:49:36.309
currently
0:49:36.309,0:49:38.939
Grid Engine uses a clever but evil hack
0:49:38.939,0:49:40.010
where they add
0:49:40.010,0:49:42.509
an extra group to the process
0:49:42.509,0:49:44.819
and they just have a range of groups
0:49:44.819,0:49:48.209
available, so they get inherited and the users
can't drop them, so
0:49:48.209,0:49:51.889
that allows them to track the process
but it's an ugly hack
0:49:51.889,0:49:57.499
and with the current limits on the number of groups
it can become a real problem
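
The trick looks roughly like this; a sketch of the idea rather than Grid Engine's code, with a made-up reserved GID range, and it has to run with privilege before dropping to the user:

#include <sys/types.h>
#include <grp.h>
#include <unistd.h>
#include <stdio.h>

#define JOB_GID_BASE 20000    /* hypothetical range reserved for job tags */

int
main(int argc, char **argv)
{
    gid_t groups[64];
    int ngroups;

    if (argc < 2) {
        fprintf(stderr, "usage: %s command [args]\n", argv[0]);
        return 1;
    }
    ngroups = getgroups(64, groups);
    if (ngroups < 0 || ngroups >= 64) {
        perror("getgroups");
        return 1;
    }
    /* Append one extra supplementary group naming this job (id 42).
     * Children inherit it across fork/exec, and an unprivileged user
     * cannot call setgroups() to shed it. */
    groups[ngroups++] = JOB_GID_BASE + 42;
    if (setgroups(ngroups, groups) != 0) {
        perror("setgroups");
        return 1;
    }
    /* ...drop privileges here, then run the actual job... */
    execvp(argv[1], argv + 1);
    perror("execvp");
    return 1;
}

The accounting side can then find every process belonging to the job just by looking at group memberships.
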
0:49:57.499,0:49:59.529
actually before I take questions
0:49:59.529,0:49:59.980
I do want to put in
0:49:59.980,0:50:01.119
one quick point
0:50:01.119,0:50:05.100
if you think this is interesting and you live in
the area and you're
0:50:05.100,0:50:06.430
looking for a job
0:50:06.430,0:50:09.780
we are trying to hire a few people; it's difficult
to find good people
0:50:09.780,0:50:13.069
we do have some openings and we're looking
for
0:50:13.069,0:50:17.409
BSD people and general system
admin people
0:50:17.409,0:50:24.409
so questions?
0:50:38.419,0:50:40.989
Yes
(inaudible question)
0:50:40.989,0:50:45.719
I would expect that to happen
but it's not something I've attempted to test
0:50:45.719,0:50:50.570
what I would really like is to have a topology aware allocator
0:50:50.570,0:50:53.179
so that you can request that you know I want
0:50:53.179,0:50:56.229
I want to share cache or I don't want to share cache
0:50:56.229,0:51:00.170
I want to share memory bandwidth or not share memory bandwidth
0:51:00.170,0:51:02.459
Open MPI 1.3
0:51:02.459,0:51:08.469
on the Linux side has a topology-aware wrapper for their CPU
0:51:08.469,0:51:10.159
affinity functionality
0:51:10.159,0:51:12.249
it uses something called
0:51:12.249,0:51:14.139
the PLPA
0:51:14.139,0:51:15.259
Portable Linux
0:51:16.519,0:51:19.599
Processor Affinity, if that's what
it actually is,
0:51:19.599,0:51:21.959
what the acronym stands for
0:51:21.959,0:51:25.400
in essence they have to work around the fact
that there were three standard
0:51:25.400,0:51:27.809
there were three different
0:51:27.809,0:51:31.759
kernel APIs for the same syscall
0:51:31.759,0:51:38.759
for CPU allocation because all the vendors
did it themselves somehow
0:51:38.769,0:51:44.969
they're the same number but
they're completely incompatible
0:51:44.969,0:51:48.749
when you first load the application it calls
the syscall and it tries to figure out which
0:51:48.749,0:51:50.579
one it is
0:51:50.579,0:51:52.719
by what errors it returns depending on what
you're missing; completely evil
are you missing and completely evil
0:51:56.139,0:52:00.859
I think people should port their API
and have their library work but
0:52:00.859,0:52:05.650
we don't need to do that junk
because we did not make that mistake
0:52:05.650,0:52:12.650
so I would like to see the
topology aware stuff in particular
0:52:30.710,0:52:32.529
(inaudible question)
0:52:32.529,0:52:37.180
the trick is it's easy to limit application bandwidth
0:52:39.500,0:52:42.269
fairly easy to limit application bandwidth
0:52:42.269,0:52:44.329
it becomes more difficult when you have to
0:52:44.329,0:52:45.430
if your
0:52:45.430,0:52:49.759
interfaces are shared between application traffic
0:52:49.759,0:52:50.880
and
0:52:50.880,0:52:53.049
say NFS
0:52:53.049,0:52:57.399
classifying that is going to be trickier;
to tag it you'd have to add a fair bit of code
0:52:57.399,0:53:04.399
to trace that down through the kernel
certainly doable
0:53:12.069,0:53:15.499
(inaudible question)
0:53:15.499,0:53:18.389
I have contemplated doing just that
0:53:18.389,0:53:22.059
or in fact the other thing we've considered
doing
0:53:22.059,0:53:24.829
more as a research project than as a practical thing
0:53:24.829,0:53:26.719
would be, actually,
0:53:26.719,0:53:28.619
would be
0:53:28.619,0:53:30.029
independent VLANs
0:53:30.029,0:53:31.839
because then we could do
0:53:31.839,0:53:32.459
things like
0:53:32.459,0:53:35.489
give each process a VLAN so they couldn't even
0:53:35.489,0:53:37.979
share at the internet layer
0:53:37.979,0:53:41.259
once the vimage stuff is in place, for instance, we will
be able to do that
0:53:41.259,0:53:45.049
and say, you know, you've got your interface,
it's yours, whatever
0:53:45.049,0:53:46.479
but then we could limit it
0:53:46.479,0:53:49.959
we could rate limit that at the kernel
we can also have
0:53:49.959,0:53:54.729
we'd have a physically isolated,
we'd have a logically isolated network as well
0:53:54.729,0:53:57.589
with some of the latest switches we could actually
rate limit
0:53:57.589,0:54:04.589
at the switch as well
0:54:19.939,0:54:22.369
(inaudible questions)
so to the first question
0:54:22.369,0:54:26.190
we don't run multiple
0:54:26.190,0:54:27.639
sensitivities of data on these clusters
0:54:27.639,0:54:28.709
it's an unclassified cluster
0:54:28.709,0:54:30.460
we've avoided that problem by
0:54:30.460,0:54:32.299
not allowing it
0:54:32.299,0:54:34.929
But it is a real issue
0:54:34.929,0:54:36.939
it's just not one we've had to deal with
0:54:39.559,0:54:42.109
in practice, stuff that's sensitive
0:54:43.059,0:54:47.579
has handling requirements that you can't touch
the same hardware without a scrub
0:54:47.579,0:54:49.859
you need a pretty
0:54:49.859,0:54:51.739
ridiculously aggressive
0:54:51.739,0:54:53.770
you need a very coarse granularity
0:54:53.770,0:54:57.240
a ridiculous remote imaging process where you
move all of the data
0:54:57.240,0:55:00.959
so if I were to do that I would
probably get rid of the disks
0:55:00.959,0:55:01.389
just
0:55:01.389,0:55:02.400
go diskless
0:55:02.400,0:55:04.910
that would get rid of my number-one failure case
of
0:55:04.910,0:55:07.839
that would be pretty good but
0:55:07.839,0:55:09.419
but we haven't done it
0:55:10.609,0:55:13.819
NFS failures: we've had occasional problems with NFS overloading
0:55:13.819,0:55:15.679
we haven't had real problems
0:55:15.679,0:55:19.279
it's all a local network, it's fairly tightly
contained, so we haven't had problems with
0:55:19.279,0:55:20.539
things
0:55:20.539,0:55:21.819
with
0:55:21.819,0:55:26.039
you know the server going down for extended
periods and causing everything to hang
0:55:26.039,0:55:27.819
it's been more an issue of
0:55:27.819,0:55:33.189
I mean there is, there's a problem
that Panasas has described as incast
0:55:33.189,0:55:36.109
you can take out any NFS server
0:55:36.109,0:55:40.809
I mean, we had the BlueArc guys come in with their
FPGA-based stuff with multiple ten-gig links, and I said
0:55:40.809,0:55:42.049
you know I've got
0:55:42.049,0:55:46.779
to do this and they said can we not try this with your whole cluster
0:55:46.779,0:55:47.950
because if you've got
0:55:47.950,0:55:49.370
three hundred and fifty
0:55:49.370,0:55:52.599
gigabit Ethernet interfaces going into
the system
0:55:52.599,0:55:56.589
Even ten gig you can saturate pretty trivially
0:55:56.589,0:55:57.120
so at that level
0:55:57.120,0:55:58.930
there's an inherent problem
0:55:58.930,0:56:01.969
if we need to handle that kind of bandwidth
we've
0:56:01.969,0:56:04.459
got to get a parallel file system
0:56:04.459,0:56:06.069
get a cluster
0:56:06.069,0:56:12.289
before doing streaming stuff we could go via SWAN or something
0:56:12.289,0:56:14.949
anyone else?
0:56:14.949,0:56:15.429
thank you, everyone
(applause and end)