0:00:15.749,0:00:33.130
I do apologize for the EuroBSDCon slides. I'd redone the title page and made some changes to the slides, and they didn't make it through for approval by this afternoon.

0:00:33.130,0:00:54.760
Okay, so I'm going to be talking about isolating jobs for performance and predictability in clusters. Before I get into that, I want to talk a little bit about who we are and what our problem space is like, because that has an effect on our solution space.

0:00:54.760,0:02:00.680
I work for The Aerospace Corporation. We operate a federally funded research and development center in the area of national security space, and in particular we work with the Air Force Space and Missile Command and with the National Reconnaissance Office, and our engineers support a wide variety of activities within that area. We have a bit over fourteen hundred (sorry, twenty-four hundred) engineers in virtually every discipline. As you would expect, we have our rocket scientists; we have people who build satellites; we have people who build sensors that go on satellites; people who study the sort of things that you see when you use those sensors; that sort of thing. We also have civil engineers and electronic engineers and computing people. So we literally do everything related to space, and all sorts of things that you might not expect to be related to space, since we also, for instance, help build ground systems, 'cause satellites aren't very useful if there isn't anything to talk to them.

0:02:02.540,0:02:55.999
And since these engineers are solving all these different problems, we have engineering applications in virtually every size you can think of, ranging from little spreadsheet things that you might not think of as an engineering application (but they are), to Matlab programs, to a lot of C code (for us, traditionally serial code), and then large parallel applications: either in-house things, genetic algorithms and that sort of thing, or the classic parallel codes, like a car wrapped around a tree or some other material simulation, or fluid flow, that sort of thing. So we have this big application space. I just want to give a little introduction to that, because it does come back and influence the sort of solutions we look at.

0:02:55.999,0:03:29.569
So for the rest of the talk I'm going to talk about... we skipped a slide; there we are, that's a little better. Now, what I'm interested in is high performance computing: I do high performance computing at the company, and I provide high performance computing resources to our users as part of my role in our technical computing services organization. Our primary resource at this point is the Fellowship cluster, named for the Fellowship of the Ring.

0:03:29.569,0:04:15.939
So it's eleven racks of nodes plus the core systems. Over here there's a large Cisco switch; actually today there are around two 6509s, if you count them, because we couldn't get the port density we wanted otherwise, and it's primarily Gigabit Ethernet. The system runs FreeBSD, currently 6.0, 'cause we haven't upgraded it yet; we're planning to move probably to 7.1, or maybe slightly past 7.1 if we want to get the latest HWPMC changes in. We use the Sun Grid Engine scheduler, which was one of the two main options for open source resource managers on clusters, the other one being the TORQUE and Maui combination from Cluster Resources.

0:04:15.939,0:04:35.249
We also have storage: that's actually 40 TB (that's really the raw number) on a Sun Thumper, and that's thirty-two usable once you start using RAID-Z2, since you might actually like to have your data should a disk fail, and with today's disks RAID 5 doesn't really cut it.

0:04:37.379,0:04:54.259
And then we also have some other resources coming online, two smaller clusters (unfortunately probably running Linux) and some SMPs, but I'm going to be concentrating here on the work we're doing on our FreeBSD-based cluster.

0:04:54.259,0:05:25.110
So, first of all I want to talk about why we want to share resources; it should be fairly obvious, but I'll talk about it a little bit, and then about what goes wrong when you start sharing resources. After that I'll talk about some different solutions to those problems, and some fairly trivial experiments that we've done so far in terms of enhancing the scheduler or using operating system features to mitigate those problems, and then conclude with some future work.

0:05:25.110,0:06:24.030
So, obviously, if you have a resource the size of our cluster, fourteen hundred cores roughly, you probably want to share it. Unless you purpose-built it for a single application, you're going to want to have your users sharing it, and you don't want to just say, you know, "you get it on Monday"; that's probably not going to be a very effective option, especially not when we have as many users as we do. We also can't afford to just buy another one every time a user shows up. One of our senior VPs said a while back that we could probably afford to buy just about anything we could need, once; we can't just buy ten of them, though. If we really, really needed it, dropping small numbers of millions of dollars on computing resources wouldn't be impossible, but we can't just have every engineer who wants one call up Dell and say "ship me ten racks"; it's not going to work.

0:06:24.030,0:06:58.300
And the other thing is that we also need to provide quick turnaround for some users, so we can't have one user hogging the system until they're done before the next one can run, because we have some users who'll come in and say "well, I need to run for three months," and we've had users come in and literally run, pretty much using the entire system, for three months. So we've had to provide some ability for other users to still get their work done. So we do have to have some sharing.

0:06:58.300,0:07:32.389
However, when you start to share any resource like this, you start getting contention: users need the same thing at the same time, so they fight back and forth for it and can't get what they want, and you have to balance them a bit. Also, some jobs lie when they request resources and actually need more than they ask for, which can cause problems: we schedule them, we say "you're going to fit here, fine," and they run off and use more than they said, and if we don't have a mechanism to constrain them, we have problems.

0:07:32.389,0:08:11.979
Likewise, once these users start to contend, that doesn't just result in the jobs taking longer in terms of wall-clock time because they're running extremely slowly; there's overhead related to that contention. They get swapped out due to pressure on various systems: if you, for instance, run out of memory, then you go into swap and you end up wasting all your cycles pulling junk in and out of disk, wasting your bandwidth on that. So there are resource costs to the contention, not merely a delay in returning results.

0:08:11.979,0:08:35.710
So now I'm going to switch gears and talk a little bit about different solutions to these contention issues, and look at different ways of solving the problem. Most of these are things that have already been done, but I just want to go through the different approaches and then evaluate them in our context.

0:08:35.710,0:09:34.340
So, a classic solution to the problem is gang scheduling. It's basically conventional Unix process context switching written really big: you have your parallel job running on a system, it runs for a while, and then after a certain amount of time you basically kick it off of all the nodes and let the next one come in. Typically when people do this they do it on the order of hours, because the context switch time is extremely high. It's not just like swapping a process in and out: you suddenly have to coordinate this context switch across all your processes. If you're running, say, MPI over TCP, you actually need to tear down the TCP sessions, because you can't just have TCP timers sitting around, that sort of thing. So there's a lot of overhead associated with this; you take a long context switch.

0:09:34.340,0:10:13.980
If all of your infrastructure supports this, it's fairly effective, and it does allow jobs to avoid interfering with each other, which is nice; you don't have issues, because you're typically allocating whole swaths of the system. And for properly written applications, partial results can be returned, which for some of our users is really important: where you're doing a refinement, users want to look at the results and say, okay, is this just going off into the weeds, or does it look like it's actually converging on some sort of useful solution? They don't want to just wait till the end.

0:10:13.980,0:10:40.860
The downside, of course, is that the context switch costs are very high, and most importantly there's really a lack of useful implementations. A number of platforms have implemented this in the past, but in practice on modern clusters, which are built on commodity hardware with communication libraries written on standard protocols, the tools just aren't there, and so it's not very practical.

0:10:40.860,0:11:53.770
Also, it doesn't really make a lot of sense for small jobs. One of the things we found is that we have users with embarrassingly parallel problems where they need to look at, you know, twenty thousand studies. They could write something that looked more like a conventional parallel application, where they wrote a scheduler, set up MPI (the Message Passing Interface), and handed out tasks to pieces of their job, but then they would be running a scheduler, and they would probably do a bad job of it; it turns out scheduling is actually fairly difficult to do right, even in a trivial case. So what they do instead is just submit twenty thousand jobs to Grid Engine and say "okay, deal with it." With earlier versions that might have been a problem; current versions of the code easily handle a million jobs, so it's not really a big deal. But those sorts of users wouldn't fit well into a gang-scheduled environment, at least not a conventional one where you do gang scheduling at the granularity of jobs, so from that perspective it wouldn't work very well. If you have all the pieces in place and you are doing big parallel applications, it is in fact an extremely effective approach.

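A minimal sketch of that "submit twenty thousand jobs" pattern, using Grid Engine's array-job support; this assumes qsub is on the PATH and a hypothetical study.sh that picks its inputs from $SGE_TASK_ID, the task number Grid Engine sets for each array element:

    #!/usr/bin/env ruby
    # Submit a 20,000-task parameter study as one SGE array job rather
    # than writing a homegrown scheduler inside the application.
    ntasks = 20_000
    system("qsub", "-t", "1-#{ntasks}", "study.sh") or abort "qsub failed"
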
0:11:53.770,0:12:25.580
Another option, which is sort of related (it's in fact taking an even coarser granularity), is single-application or single-project clusters, or sub-clusters. For instance, this is used at some national labs, where you're given a cycle allocation for a year based on your grant proposals, and what your cycle allocation actually comes to you as is: here's your cluster, here's a frontend, here's this chunk of nodes; they're yours, go to it. Install your own OS, whatever you want; it's yours.

0:12:25.580,0:13:11.220
And then, at a sort of finer scale, there are things like Emulab, which is a network emulation system but also does OS install and configuration, so you could do dynamic allocation that way; Sun's Project Hedeby (actually I think it's called Service Domain Manager now; that's the productised version); or systems like Cluster on Demand, which was actually talking about web hosting clusters. These are things that allow rapid deployment, letting you do this at a little more granular level than the allocate-them-once-a-year approach, but that nonetheless let you give people whole clusters to work with.

0:13:11.220,0:13:36.980
One nice thing about it is that the isolation between the processes is complete, so you don't have to worry about users stomping on each other. It's their own system; they can trash it all they want. If they flood the network or they run the nodes into swap, well, that's their problem. It also has the advantage that you can tailor the images on the nodes, the operating systems, to meet the exact needs of the application.

0:13:36.980,0:14:24.400
The downside, of course, is its coarse granularity; in our environment that doesn't work very well, since we do have all these different types of jobs. Context switches are also pretty expensive, certainly on the order of minutes; Emulab typically claims something like ten minutes. There are some systems out there, for instance if you use coreboot (I think that's what they're calling it today; it used to be LinuxBIOS), where you can actually deploy a system in tens of seconds, mostly by getting rid of all that junk the BIOS writers wrote; the OS boots pretty fast if you don't have all that stuff to waylay you. But in practice, on sort of off-the-shelf hardware, the context switch times are quite high.

0:14:24.400,0:14:55.389
Users, of course, can still interfere with themselves. You can argue that's not a problem, but ideally you would like to prevent it. One of the things that I have to deal with is that my users are almost universally not trained as computer scientists or programmers; they're trained in their domain area, and they're really good in that area, but their concepts of the way hardware works and the way software works don't match reality in many cases.

0:15:01.269,0:15:34.730
(inaudible question) It's pretty rare in practice. Well, I've heard of one lab that does it significantly, but they do it on sort of a yearly allocation basis and throw the hardware away after two or three years, and you do typically have some sort of deployment system in place; or in those types of cases, usually your application comes with "here's what we're going to spend on this many people on this project," so it's a big resource allocation.

0:15:36.000,0:16:37.300
And, yeah, I guess one other issue with this is that there's no real easy way to capture underutilized resources. For example, if you have an application which is, say, single-threaded and uses a ton of memory, and it's running on a machine (the machines we're buying these days are eight-core), that's wasting a lot of CPU cycles; you're just generating a lot of heat doing nothing. So ideally you would like a scheduler that said: okay, you're using seven of the eight gigabytes of RAM, but we've got these jobs sitting here that need next to nothing, say a hundred megabytes, so we'll slap seven of those in alongside the big job and backfill. In this mechanism there's no good way to do that. Obviously, if the users have that application mix, they can do it themselves, but it's not something where we can easily bring in more jobs and have a mix that takes advantage of the different resources.

0:16:37.300,0:17:20.360
A related approach is to install virtualization software on the equipment. This is the essence of what cloud computing is at the moment: it's Amazon providing Xen hosting for relatively arbitrary OS images. It does have the advantage that it allows rapid deployment and, in theory, if your application is scalable, provides for extremely high scalability, particularly if you aren't us and therefore can possibly use somebody else's hardware; in our applications' case that's not very practical, so we can't do that.

0:17:20.360,0:17:56.170
It also has the advantage that people can have their own image in there, which is tightly resource-constrained, but you can run more than one of them on a node. So, for instance, you can give one job four cores and another job two cores, and have a couple of single-core jobs. In theory you can get fairly strong isolation there. Obviously there are shared resources underneath, and you probably can't afford to completely isolate, say, network bandwidth at the bottom layer; you can do some, but if you go overboard you can spend all your time on accounting.

0:17:56.170,0:18:40.550
You also can, again, tailor the images to the job, and in this environment you can actually do that even more strongly than in the sub-cluster approach, in that you can often run a five-year-old or ten-year-old operating system if you're using full virtualization. That can allow obsolete code with weird baselines to work, which is important in our space, because our average project runs ten years or more, and as a result you might have to go rerun this program that was written way back on some ancient version of Windows or whatever.

0:18:40.550,0:18:56.730
It also provides the ability to recover resources, as I was talking about before, which you can't do easily with sub-clusters, because there you can't just slip another image on and say "you can use anything," essentially giving that image idle priority.

0:18:56.730,0:19:24.809
The downside, of course, is that the isolation is incomplete, in that there is shared hardware. You're not likely to find, I don't think, any of the virtualization systems out there right now virtualizing your segment of memory bandwidth, or your segment of cache space, so users can in fact interfere with themselves and each other in this environment.

0:19:24.809,0:20:29.309
It's also not really efficient for small jobs; the cost of running an entire OS for every job is fairly high. Even with relatively light Unix-like OSes, you're still looking at a couple hundred megabytes in practice once you get everything up and running, unless you run something totally stripped down. There's significant overhead: there's CPU slowdown, typically (you know, typical estimates are in the twenty percent range; numbers really range from fifty percent to five percent depending on what exactly you're doing, possibly even lower, or higher), and, because you have the whole OS, there's a lot of duplicate stuff. The various vendors have their answers; they claim, you know, "we can merge that," and say "oh, you're running the same kernel, so we'll use the same memory," but at some level it's all going to get duplicated.

0:20:29.309,0:20:50.620
A related option comes from sort of the Internet hosting industry, which is to use the technology from virtual private servers. The example that everyone here is probably familiar with is jails, where you can provide your own file system root, your own network interface, and whatnot.

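As a concrete sketch, starting such a virtual private server with FreeBSD's jail(8) in its 6.x/7.x positional form (path, hostname, IP, command); the root directory and address here are invented, and the jail root is assumed to be populated beforehand:

    #!/usr/bin/env ruby
    # Start a jail with its own file system root, hostname, and IP.
    root = "/jails/j1"    # hypothetical, pre-populated jail root
    system("jail", root, "j1.example.org", "10.1.1.10", "/bin/sh", "/etc/rc") or
      abort "jail failed"
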
0:20:50.620,0:21:30.509
And the nice thing about this is that, unlike full virtualization, the overhead is very small. Basically it costs you an entry in your process table, or an entry in a few structures; there are some extra tests in the kernel, but otherwise there's not a huge overhead for virtualization, and you don't need an extra kernel for every image. So you see the difference here: you might be able to squeeze two hundred VMware images onto a machine (the VMware people say "no, no, don't do that," but we have machines that are running nearly that many).

0:21:34.790,0:21:55.880
On the other hand, there are people out there who run thousands of virtual hosts using this technique on a single machine, so there's a big difference in resource use, especially in the lightly loaded case. In our environment we're looking more at running a very small number of them, but still, that overhead difference is significant.

0:21:55.880,0:22:34.150
You still do have some ability to tailor the images to a job's needs: you could have a custom root, so that, for instance, you could be running FreeBSD 6.0 in one virtual server and 7.0 in another (you have to be running a 7.0 or 8.0 kernel to make that work, but it allows you to do that). We can also, in principle, do evil things like run a 64-bit kernel and 32-bit user spaces, because, say, you have applications or libraries that you can't find the source to anymore, and do interesting things there.

0:22:34.150,0:23:10.510
And the other nice thing is that, since you're doing a very lightweight and incomplete virtualization, you don't have to virtualize things you don't care about, so you don't have the overhead of virtualizing everything. The downsides, of course, are incomplete isolation (you are running processes on the same kernel, and they can interfere with each other) and dubious flexibility; obviously, I don't think anyone should have the ability to run Windows in a jail. There's some NetBSD support, but I don't think it's really gotten to that point.

0:23:10.510,0:24:01.900
One final area, which sort of diverges from this, is the classic Unix solution to the problem on a single machine, which is to use existing resource limits and resource partitioning techniques. For example, all Unix-like systems have per-process resource limits, and cluster schedulers typically support the common ones: you can set a memory limit on your process, or a CPU time limit on your process, and the schedulers typically provide at least launch-time support for those limits on a given set of processes that's part of the job.

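A minimal sketch of those per-process limits, using Ruby's wrapper around setrlimit(2) (Ruby being the language the wrapper script described later is written in); the values here are invented:

    # Limits set here are inherited across exec, so they follow the job.
    Process.setrlimit(:CPU, 3600)                # CPU seconds (invented value)
    Process.setrlimit(:DATA, 512 * 1024 * 1024)  # data segment bytes (invented)
    exec("./my_job")                             # hypothetical job binary
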
0:24:01.900,0:24:26.330
Also, there are a number of forms of resource partitioning available as standard features. Memory disks are one of them: if you want to create a file system space that's limited in size, create a memory disk and back it with an mmap'd file or swap, thereby partitioning disk use.

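On FreeBSD that is nearly a one-liner with mdmfs(8); a sketch, with an invented size and mount point:

    # A size-capped, swap-backed scratch file system: a runaway job can
    # fill only its own 512 MB rather than the node's real disk.
    dir = "/scratch/job42"    # hypothetical mount point
    system("mkdir", "-p", dir)
    system("mdmfs", "-s", "512m", "md", dir) or abort "mdmfs failed"
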
0:24:26.330,0:24:39.310
And then there are techniques like CPU affinities, where you can lock processes to a single processor or a set of processors, so they can't interfere with processes running on other processors.

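With FreeBSD's cpuset(1), which arrived around the 7.1 timeframe mentioned earlier, pinning a job looks roughly like this sketch (the core list is invented):

    # Confine a job (and its children) to cores 0-1 so it cannot steal
    # cycles from jobs pinned to the other cores on the node.
    system("cpuset", "-l", "0-1", "./my_job") or abort "cpuset failed"
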
0:24:39.310,0:25:02.160
The nice thing about this approach, first, is that you're using existing facilities, so you don't have to write lots of new features for a niche application, and they tend to integrate well with existing schedulers; in many cases parts of them are already implemented. In fact, the experiments that I'll talk about later all use this type of technique.

0:25:02.160,0:25:46.770
The cons are, of course, incomplete isolation again, and that there's typically no unified framework for the concept of a job, where a job is composed of a set of processes. There are a number of data structures within the kernel, for instance the session, which sort of aggregate processes, but there isn't one in BSD or Linux at this point which allows you to place resource limits on them the way you can on a process. IRIX did have support like that, where they had a job ID and there could be job limits; Solaris projects are sort of similar, but not quite the same: processes are part of a project, but it's not quite the same inherited relationship.

0:25:47.720,0:26:16.940
And typically there aren't limits on things like bandwidth. There was a sort of bandwidth-limiting, nice(1)-style interface that I saw posted as a research project many years ago, I think in the 2.x days, where you could say "this process can have, you know, five megabits" or whatever, but I haven't really seen anything take off; that would be a pretty neat thing to have.

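No such per-process rlimit became standard, but as a rough approximation on FreeBSD you can cap a user's bandwidth with ipfw's dummynet pipes; a sketch, with the rule number, rate, and uid all invented:

    # Rate-limit all IP traffic owned by uid 1001 to about 5 Mbit/s.
    system("ipfw", "pipe", "1", "config", "bw", "5Mbit/s")
    system("ipfw", "add", "100", "pipe", "1",
           "ip", "from", "any", "to", "any", "uid", "1001")
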
0:26:16.940,0:26:49.710
Actually, one other exception there is on IRIX again: the XFS file system supported guaranteed data rates on file handles. You could open a file and say "I need ten megabits read" or "ten megabits write" or whatever, and it would say okay or no, and then you could read and write, and it would do evil things at the file system layer in some cases, all to ensure that you could get that streaming data rate from the file.

0:26:49.710,0:27:48.380
So now I'm going to talk about what we've done. What we needed was a solution that handles a wide range of job types. Of the options we looked at: single-application or single-project clusters, I think the isolation they provide is essentially unparalleled, but in our environment we would probably have to virtualize in order to be efficient, to be able to handle our job mix and the fact that our users tend to have spikes in their use on a large scale. For instance, the GPS people will show up and say "we need to run for a month," and then some indeterminate number of months later they'll do it again. For that sort of spiky demand we really need something virtualized, and then we have to pay the price of the overhead.

0:27:48.380,0:28:11.400
And again, it doesn't handle small jobs well, and that is a large portion of our job mix: of the quarter million or so jobs we've run on our cluster, I would guess that more than half were submitted in batches of more than ten thousand. They'll just pop up.

0:28:11.400,0:28:27.230
The other methods we've looked at use resource limits. The nice thing, of course, is that they achieve useful isolation, and they're implementable with either existing functionality or small extensions, so that's what we've concentrated on.

0:28:27.230,0:29:18.740
We've also been doing some thinking about whether we could take those techniques and combine them with jails or related features, maybe bulking up jails to be more like zones in Solaris (or "containers," I think they're calling them this week). So we're looking at that as well, to be able to provide per-user operating environments, potentially isolating users from upgrades: as we upgrade the kernel, users can continue using the old images if they don't have time to rebuild their applications and handle the updates in libraries and whatnot. They also have the potential to provide strong isolation for security purposes, which could be useful in the future.

0:29:18.740,0:29:38.780
We do think that, of these mechanisms, the nice thing is that the resource limits and partitioning schemes, as well as virtual private servers, have very similar implementation requirements; setup is a fair bit more expensive in the VPS case, but nonetheless they're fairly similar.

0:29:38.780,0:30:30.530
So, what we've been doing: we've taken the Sun Grid Engine, and we originally intended to actually extend Sun Grid Engine and modify its daemons to do the work, but what we ended up doing instead is realizing that we can specify an alternate program to run instead of the shepherd. The shepherd is the process that starts the script for each job on a given node; it collects usage and forwards signals to its children, and it is also responsible for starting remote components. So a shepherd is started, and traditionally in Sun Grid Engine it starts its own rsh daemon and jobs connect over that; these days they have their own mechanism, which is secure, not using the crufty old rsh code.

0:30:35.370,0:31:03.940
So what we've done is implement a wrapper script which allows a pre-command hook to run before the shepherd starts, so before we start the shepherd we can run something like env or chroot or whatever to set up the environment that it runs in, or cpuset, as I'll show later, plus a post-command hook for cleanup. It's implemented in Ruby, because I felt like it.

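The overall shape is roughly the following; this is a sketch rather than the actual Aerospace script, and the hook variables and shepherd path are assumptions:

    #!/usr/bin/env ruby
    # Stand-in shepherd: run a setup hook, hand off to the real
    # sge_shepherd, then run a cleanup hook, preserving exit status.
    pre  = ENV["PRE_CMD"]     # hypothetical hook variables
    post = ENV["POST_CMD"]
    system(pre) if pre
    ok = system("/usr/local/sge/bin/sge_shepherd", *ARGV)  # assumed path
    system(post) if post
    exit(ok ? 0 : 1)
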
0:31:03.940,0:31:44.630
The first thing we implemented was memory-backed temporary directories. The motivation for this is that we've had problems where users would run /tmp out of space on the nodes. The way we have the nodes configured, they do have disks, and most of the disk is available as /tmp. We had some cases, particularly early on, where users would fill up the disks and not delete their files (their job would crash, or they would forget to add cleanup code, or whatever) and then other jobs would fail strangely. You might expect that you'd just get a nice error message, but programmers being programmers, people would not do their error handling correctly.

0:31:44.630,0:32:01.669
A number of libraries do have issues here; for instance, the PVM library unexpectedly fails and reports a completely strange error if it can't create a file in /tmp, because it needs to create a Unix domain socket there so it can talk to itself.

0:32:01.669,0:32:28.510
So, what we've done here: it turns out that Sun Grid Engine actually creates a per-job temporary directory, typically under /tmp, but you can change that, and points TMPDIR at that location. We've educated most of our users now to use that location correctly, so they'll use that variable and keep their files under $TMPDIR; then, when the job exits, Grid Engine deletes the directory and it all gets cleaned up.

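In a job, correct usage is as simple as this sketch (the file name is invented):

    # Keep scratch data under the per-job $TMPDIR that Grid Engine
    # creates at job start and deletes at job exit.
    scratch = File.join(ENV.fetch("TMPDIR"), "intermediate.dat")
    File.write(scratch, "intermediate results\n")
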
0:32:28.510,0:32:53.310
The problem, of course, is that if multiple jobs are running on the same node at the same time, one of them can still fill /tmp. So the solution was pretty simple: we created a wrapper script that, at the beginning of the job, creates a swap-backed md file system of a user-requestable size, with a default.

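The core of that pre-command hook is something like this sketch (the size-request variable is invented; $TMPDIR is the directory Grid Engine already created):

    # Mount a swap-backed memory disk of the requested size over the
    # job's $TMPDIR; unmount it again when the wrapper exits.
    tmp_mb = (ENV["JOB_TMP_MB"] || "512").to_i  # hypothetical request variable
    tmpdir = ENV.fetch("TMPDIR")
    system("mdmfs", "-s", "#{tmp_mb}m", "md", tmpdir) or abort "mdmfs failed"
    at_exit { system("umount", tmpdir) }        # post-command cleanup analogue
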
0:32:53.310,0:32:56.520
|
||
this has a number of advantages the biggest one of course is that
|
||
|
||
0:32:56.520,0:32:58.320
|
||
it's fixed size so we get
|
||
|
||
0:32:58.320,0:32:59.449
|
||
you know
|
||
|
||
0:32:59.449,0:33:01.000
|
||
the user gets
|
||
|
||
0:33:01.000,0:33:03.420
|
||
what they asked for
|
||
|
||
0:33:03.420,0:33:05.930
|
||
and once they run of space, they run out of space well
|
||
|
||
0:33:05.930,0:33:09.300
|
||
and too bad they ran out of space
|
||
|
||
0:33:09.300,0:33:12.760
|
||
they should have asked for more
|
||
|
||
0:33:12.760,0:33:16.350
|
||
the other
|
||
|
||
0:33:16.350,0:33:18.770
|
||
the other advantage is the side-effect that
|
||
|
||
0:33:18.770,0:33:21.619
|
||
now that we're running swap-backed memory file systems for /tmp
|
||
|
||
0:33:21.619,0:33:24.560
|
||
the users who only use a fairly small amount of temp space
|
||
|
||
0:33:24.560,0:33:28.190
|
||
should see vastly improved performance
|
||
because they're running in memory
|
||
|
||
0:33:28.190,0:33:32.980
|
||
rather than writing to disk
|
||
|
||
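A minimal sketch of such a prolog, assuming mdmfs(8) and a made-up TMP_MB resource variable for the requested size (this is an illustration, not the actual wrapper):

    #!/bin/sh
    # Job prolog sketch: back the Grid Engine per-job TMPDIR with a
    # swap-backed md(4) memory file system of the requested size.
    # TMP_MB is a hypothetical user-requestable resource; default 64 MB.
    tmp_mb=${TMP_MB:-64}

    # mdmfs(8) allocates a swap-backed memory disk and mounts it in one
    # step; -s caps the size, so a runaway job only fills its own space.
    mdmfs -s "${tmp_mb}m" md "$TMPDIR" || exit 1

    # a matching epilog would umount "$TMPDIR" to release the swap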
0:33:32.980,0:33:34.690
|
||
quick example
|
||
|
||
0:33:34.690,0:33:38.270
|
||
we've got a little job script here
|
||
|
||
0:33:38.270,0:33:39.830
|
||
prints TMPDIR and
|
||
|
||
0:33:39.830,0:33:41.950
|
||
prints the
|
||
|
||
0:33:41.950,0:33:43.080
|
||
amount of space
|
||
|
||
0:33:43.080,0:33:46.210
|
||
we submit our job request saying that we want
|
||
|
||
0:33:46.210,0:33:51.539
|
||
this is what we want: a hundred megabytes of
|
||
temp space
|
||
|
||
0:33:51.539,0:33:53.580
|
||
the same that's why if this
|
||
|
||
0:33:53.580,0:33:55.230
|
||
so the program doesn't
|
||
|
||
0:33:55.230,0:33:57.620
|
||
so the program ends at the end of it
|
||
|
||
0:33:57.620,0:33:58.709
|
||
for doing it
|
||
|
||
0:33:58.709,0:34:00.510
|
||
here's a live demo
|
||
|
||
0:34:00.510,0:34:01.840
|
||
all and then
|
||
|
||
0:34:01.840,0:34:03.389
|
||
you look at the output
|
||
|
||
0:34:03.389,0:34:04.280
|
||
you can see it
|
||
|
||
0:34:04.280,0:34:07.549
|
||
does in fact create a memory file system
|
||
|
||
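The demo is roughly this shape (a reconstruction; the script contents and the resource name tmpspace are assumptions):

    #!/bin/sh
    # toy job: show where the per-job temp directory is and how big it is
    echo "TMPDIR is $TMPDIR"
    df -h "$TMPDIR"

submitted with the requested temp size, something like:

    qsub -l tmpspace=100M tmpdemo.sh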
0:34:07.549,0:34:10.449
|
||
I attempted to write code
|
||
|
||
0:34:10.449,0:34:13.409
|
||
giving an available space
|
||
|
||
0:34:13.409,0:34:15.839
|
||
that is roughly what the user asked for
|
||
|
||
0:34:15.839,0:34:17.089
|
||
the version that I had
|
||
|
||
0:34:17.089,0:34:20.739
|
||
when I was attempting this was not entirely
|
||
accurate
|
||
|
||
0:34:20.739,0:34:24.710
|
||
trying to guess what all the
|
||
UFS overhead would be
|
||
|
||
0:34:24.710,0:34:25.889
|
||
and the result was
|
||
|
||
0:34:25.889,0:34:28.399
|
||
not quite consistent
|
||
|
||
0:34:30.790,0:34:33.899
|
||
I couldn't figure out an easy function, so
|
||
|
||
0:34:33.899,0:34:39.589
|
||
it does a better job than it did to start with, but it's not perfect
|
||
|
||
0:34:39.589,0:34:40.600
|
||
sometimes however
|
||
|
||
0:34:40.600,0:34:42.329
|
||
we think that's a good fix
|
||
|
||
0:34:42.329,0:34:43.550
|
||
we're coming to
|
||
|
||
0:34:43.550,0:34:45.359
|
||
deploy it pretty soon
|
||
|
||
0:34:45.359,0:34:47.159
|
||
it works pretty easily
|
||
|
||
0:34:47.159,0:34:48.570
|
||
well sometimes it's not enough
|
||
|
||
0:34:48.570,0:34:51.390
|
||
the biggest issue is that there are badly designed programs all
|
||
|
||
0:34:51.390,0:34:52.720
|
||
all over the world
|
||
|
||
0:34:52.720,0:34:54.919
|
||
that don't use TMPDIR like they're supposed to
|
||
|
||
0:34:54.919,0:34:59.319
|
||
in fact
|
||
|
||
0:35:10.099,0:35:12.759
|
||
(inaudible question)
|
||
so there are all these applications
|
||
|
||
0:35:12.759,0:35:17.979
|
||
there are all these applications still that need
|
||
/tmp, say, during start-up
|
||
|
||
0:35:17.979,0:35:19.230
|
||
that sort of thing
|
||
|
||
0:35:19.230,0:35:20.809
|
||
so
|
||
|
||
0:35:20.809,0:35:22.599
|
||
all
|
||
|
||
0:35:22.599,0:35:25.829
|
||
so we have problems with these
|
||
|
||
0:35:25.829,0:35:26.290
|
||
realistically
|
||
|
||
0:35:26.290,0:35:27.799
|
||
we can’t change all of them
|
||
|
||
0:35:27.799,0:35:30.019
|
||
it's just not going to happen
|
||
|
||
0:35:30.019,0:35:31.950
|
||
so we still have problems with people
|
||
|
||
0:35:31.950,0:35:34.509
|
||
running out of resources
|
||
|
||
0:35:34.509,0:35:35.819
|
||
so we probably
|
||
|
||
0:35:35.819,0:35:37.489
|
||
feel that
|
||
|
||
|
||
0:35:37.489,0:35:41.240
|
||
the most general solution is to provide a per-job /tmp
|
||
|
||
0:35:41.240,0:35:44.880
|
||
and virtualize that portion of the file system
|
||
in memory space
|
||
|
||
0:35:44.880,0:35:47.119
|
||
and variant symlinks can do that
|
||
|
||
0:35:47.119,0:35:52.539
|
||
and so we said okay let's give it a shot
|
||
|
||
0:35:52.539,0:35:56.969
|
||
just to introduce the concept of variant symlinks for people who aren't familiar with them
|
||
|
||
0:35:56.969,0:36:00.280
|
||
variant symlinks are basically symlinks that
|
||
contain variables
|
||
|
||
0:36:00.280,0:36:02.389
|
||
which are expanded at run time
|
||
|
||
0:36:02.389,0:36:05.549
|
||
it allows paths to be different for different
|
||
processes
|
||
|
||
0:36:05.549,0:36:06.969
|
||
for example
|
||
|
||
0:36:06.969,0:36:08.689
|
||
you create some files
|
||
|
||
0:36:08.689,0:36:10.069
|
||
you create
|
||
|
||
0:36:10.069,0:36:12.459
|
||
a symlink whose contents are
|
||
|
||
0:36:12.459,0:36:18.329
|
||
a variable which has a shell-style default value
|
||
|
||
0:36:18.329,0:36:18.990
|
||
and you
|
||
|
||
0:36:18.990,0:36:24.949
|
||
get different results with different
|
||
variable sets.
|
||
|
||
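Spelled out, the example looks something like this (illustrative only, using the %-with-default syntax this implementation settles on later in the talk):

    mkdir -p /data/i386 /data/amd64          # create some files
    ln -s '/data/%{arch:-i386}' /data/cur    # a symlink containing a variable
    # a process with arch set to amd64 resolves /data/cur to /data/amd64;
    # a process with arch unset gets the default and sees /data/i386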
0:36:24.949,0:36:27.170
|
||
So, to talk about the implementation we’ve done,
|
||
|
||
0:36:27.170,0:36:32.389
|
||
it's derived from the NetBSD implementation; most of
|
||
the data structures are identical
|
||
|
||
0:36:32.389,0:36:33.869
|
||
however, I’ve made a number of changes
|
||
|
||
0:36:33.869,0:36:39.649
|
||
the biggest one is that we took the concept
|
||
of scopes and we turned them entirely around
|
||
|
||
0:36:40.409,0:36:45.329
|
||
in NetBSD there is a system scope which
|
||
is overridden by a user scope and by a
|
||
|
||
0:36:45.329,0:36:47.259
|
||
process scope
|
||
|
||
0:36:49.819,0:36:53.449
|
||
problem with that is if you
|
||
|
||
0:36:53.449,0:36:56.099
|
||
only think about, say, the system scope
|
||
|
||
0:36:56.099,0:36:57.079
|
||
and
|
||
|
||
0:36:57.079,0:36:59.459
|
||
you decide you want to do something clever like have
|
||
|
||
0:36:59.459,0:37:02.219
|
||
a root file system which
|
||
|
||
0:37:02.219,0:37:06.109
|
||
where /lib points to different things
|
||
for different
|
||
|
||
0:37:06.109,0:37:08.249
|
||
different architectures
|
||
|
||
0:37:08.249,0:37:11.849
|
||
well, works quite nicely until the users come along
|
||
and
|
||
|
||
0:37:11.849,0:37:14.189
|
||
set their arch variable
|
||
|
||
0:37:14.189,0:37:15.629
|
||
out from under you
|
||
|
||
0:37:15.629,0:37:18.900
|
||
if you have say a setuid program and you don't
|
||
code defensively
|
||
|
||
0:37:18.900,0:37:22.319
|
||
and you don't implement it correctly
|
||
|
||
0:37:22.319,0:37:24.900
|
||
the obvious bad things happen. Obviously you would
|
||
|
||
0:37:24.900,0:37:28.599
|
||
write your code to not do that I believe they
|
||
did, but
|
||
|
||
0:37:28.599,0:37:31.700
|
||
there's a whole class of problems where
|
||
|
||
0:37:31.700,0:37:33.449
|
||
it's easy to screw up
|
||
|
||
0:37:33.449,0:37:36.219
|
||
and do something wrong there
|
||
|
||
0:37:36.219,0:37:37.270
|
||
so by
|
||
|
||
0:37:37.270,0:37:38.509
|
||
reversing the order
|
||
|
||
0:37:38.509,0:37:41.849
|
||
we can reduce the risks
|
||
|
||
0:37:41.849,0:37:43.329
|
||
at the moment we don't
|
||
|
||
0:37:43.329,0:37:44.309
|
||
have a user scope
|
||
|
||
0:37:44.309,0:37:47.530
|
||
I just don't like the idea of the user scope
|
||
to be honest
|
||
|
||
0:37:47.530,0:37:50.900
|
||
problem being that then you have to have
|
||
per-user state in the kernel
|
||
|
||
0:37:50.900,0:37:55.509
|
||
that just sort of sits around forever
|
||
you can never garbage collect it except
|
||
|
||
0:37:55.509,0:37:57.059
|
||
administratively
|
||
|
||
0:37:57.059,0:37:59.489
|
||
just doesn't seem like a great idea to me
|
||
|
||
0:37:59.489,0:38:00.700
|
||
And jail scope
|
||
|
||
0:38:00.700,0:38:04.609
|
||
just hasn't been implemented
|
||
|
||
0:38:04.609,0:38:09.809
|
||
because it wasn't entirely clear what the semantics should be
|
||
|
||
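A sketch of the reversed precedence, reusing the /lib-per-architecture example and a hypothetical varsym(1)-style tool (the tool name and flags are assumptions, not a real interface):

    # /lib is a variant symlink whose target contains %{arch}
    varsym -s system arch=i386    # system scope: the administrator's value
    varsym arch=amd64             # process scope: a user's value
    ls /lib                       # with the reversed order the system value
                                  # wins, so a setuid program can't have
                                  # /lib yanked out from under it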
0:38:11.010,0:38:14.719
|
||
I also added default variable support,
|
||
shell-style
|
||
|
||
0:38:14.719,0:38:16.999
|
||
variable support
|
||
|
||
0:38:16.999,0:38:19.169
|
||
which to some extent undoes the scope
|
||
|
||
0:38:19.169,0:38:20.870
|
||
the scope change
|
||
|
||
0:38:20.870,0:38:21.779
|
||
in that
|
||
|
||
0:38:21.779,0:38:24.749
|
||
the default value effectively becomes a system scope
|
||
|
||
0:38:24.749,0:38:26.540
|
||
which is overridden by everything
|
||
|
||
0:38:26.540,0:38:30.890
|
||
but there are cases where we need to do that
|
||
in particular if you want to implement a
|
||
|
||
0:38:30.890,0:38:33.380
|
||
/tmp which varies
|
||
|
||
0:38:33.380,0:38:36.240
|
||
we have to do something like this because /tmp needs to work
|
||
|
||
0:38:37.209,0:38:42.059
|
||
if we don't have the job values set
|
||
|
||
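For instance, /tmp itself could become a variant symlink with a default (same illustrative syntax; the directory layout is an assumption):

    # the real directories live on the node, one per job plus a shared one
    mkdir -p /var/jobtmp/shared
    # after migrating the old /tmp out of the way:
    ln -s '/var/jobtmp/%{jobid:-shared}' /tmp
    # a process with jobid set gets its own /tmp; everything else,
    # including daemons started at boot, falls back to the shared one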
0:38:42.059,0:38:45.829
|
||
I also decided to use
|
||
|
||
0:38:45.829,0:38:49.839
|
||
percent instead of dollar sign to avoid
|
||
confusion with shell variables because these
|
||
|
||
0:38:49.839,0:38:50.379
|
||
are
|
||
|
||
0:38:50.379,0:38:52.620
|
||
a separate namespace in the kernel
|
||
|
||
0:38:52.620,0:38:56.669
|
||
we can't do it the Domain/OS way and do all the evaluation in
|
||
user space
|
||
|
||
0:38:56.669,0:38:59.269
|
||
it's a classic vulnerability
|
||
|
||
0:38:59.269,0:39:02.739
|
||
in the CVE database for instance
|
||
|
||
0:39:02.739,0:39:08.109
|
||
and we're not using @, to avoid confusion
|
||
with AFS
|
||
|
||
0:39:08.109,0:39:09.819
|
||
or the NetBSD implementation
|
||
|
||
0:39:09.819,0:39:11.019
|
||
which does not allow
|
||
|
||
0:39:11.019,0:39:14.879
|
||
user or administratively settable values
|
||
|
||
0:39:14.879,0:39:17.019
|
||
beyond that
|
||
|
||
0:39:17.019,0:39:20.359
|
||
I don't have any automated variables such
|
||
as
|
||
|
||
0:39:20.359,0:39:25.789
|
||
the %sys value which is universally
|
||
set in the NetBSD implementation
|
||
|
||
0:39:25.789,0:39:26.750
|
||
or
|
||
|
||
0:39:28.039,0:39:32.579
|
||
a UID variable which they also have
|
||
0:39:32.579,0:39:34.909
|
||
and currently it doesn't allow
|
||
|
||
0:39:34.909,0:39:40.880
|
||
setting of values in other processes,
|
||
you can only set them in your own process, and they're inherited
|
||
|
||
0:39:40.880,0:39:42.699
|
||
that may change but
|
||
|
||
0:39:42.699,0:39:47.339
|
||
one of my goals here, because there were
|
||
subtle ways to make dumb mistakes and
|
||
|
||
0:39:47.339,0:39:48.930
|
||
cause security vulnerabilities
|
||
|
||
0:39:48.930,0:39:52.479
|
||
I've attempted to slim the feature set
|
||
down to the point where you
|
||
|
||
0:39:52.479,0:39:54.909
|
||
have some reasonable chance of not
|
||
|
||
0:39:54.909,0:39:56.339
|
||
doing that
|
||
|
||
0:39:56.339,0:40:03.339
|
||
if you start building systems on them for deployment.
|
||
|
||
0:40:04.419,0:40:06.909
|
||
The final area that we've worked on
|
||
|
||
0:40:06.909,0:40:09.499
|
||
is moving away from the file system space
|
||
|
||
0:40:09.499,0:40:12.559
|
||
and into CPU sets
|
||
|
||
0:40:12.559,0:40:16.379
|
||
Jeff Roberson
|
||
|
||
0:40:16.379,0:40:20.699
|
||
implemented a CPU set functionality which
|
||
allows you to
|
||
|
||
0:40:20.699,0:40:23.489
|
||
create… put a process into a CPU set
|
||
|
||
0:40:23.489,0:40:24.879
|
||
and then set the affinity of that
|
||
|
||
0:40:24.879,0:40:26.269
|
||
CPU set
|
||
|
||
0:40:26.269,0:40:29.189
|
||
by default every process has an anonymous
|
||
|
||
0:40:29.189,0:40:33.059
|
||
CPU set, or was stuffed into
|
||
one that was created by
|
||
|
||
0:40:33.059,0:40:37.269
|
||
its parent
|
||
|
||
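On the command line this functionality surfaces as cpuset(1); for instance (the program name is made up):

    # run a solver pinned to a newly created set containing CPUs 4-7
    cpuset -l 4-7 ./my_solver input.dat
    # report which set and CPUs the current shell is allowed to use
    cpuset -g -p $$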
0:40:37.269,0:40:38.619
|
||
so for a little background here
|
||
|
||
0:40:38.619,0:40:40.740
|
||
in a typical SGE configuration
|
||
|
||
0:40:40.740,0:40:42.769
|
||
every node has one slot
|
||
|
||
0:40:42.769,0:40:44.429
|
||
per CPU
|
||
|
||
0:40:44.429,0:40:48.639
|
||
There are a number of other ways you
|
||
can configure it, basically a slot is something
|
||
|
||
0:40:48.639,0:40:50.019
|
||
a job can run in
|
||
|
||
0:40:50.019,0:40:56.719
|
||
and a parallel job crosses slots
|
||
and can be in more than one slot
|
||
|
||
0:40:56.719,0:41:01.359
|
||
for instance in many applications where
|
||
code tends to spend a fair bit of time
|
||
|
||
0:41:01.359,0:41:02.380
|
||
waiting for I/O
|
||
|
||
0:41:02.380,0:41:06.209
|
||
you are looking at more than one slot per CPU so two slots per
|
||
|
||
0:41:06.209,0:41:08.089
|
||
core is not uncommon
|
||
|
||
0:41:08.089,0:41:10.869
|
||
but probably the most common configuration
|
||
and the one that
|
||
|
||
0:41:10.869,0:41:13.719
|
||
you get out of the box if you just install Grid Engine
|
||
|
||
0:41:13.719,0:41:16.739
|
||
is one slot for each CPU
|
||
|
||
0:41:16.739,0:41:19.830
|
||
and that's how we run, because we
|
||
want users to have
|
||
|
||
0:41:19.830,0:41:23.699
|
||
that whole CPU for whatever they want to do with
|
||
it
|
||
|
||
0:41:23.699,0:41:26.130
|
||
so jobs are allocated one or more slots
|
||
|
||
0:41:26.130,0:41:27.599
|
||
if they're
|
||
|
||
0:41:27.599,0:41:33.189
|
||
depending on whether they're sequential or parallel jobs
|
||
and how many they ask for
|
||
|
||
0:41:33.189,0:41:37.239
|
||
but this is just a convention
|
||
there's no actual connection between slots
|
||
|
||
0:41:37.239,0:41:39.119
|
||
and CPUs
|
||
|
||
0:41:39.119,0:41:40.829
|
||
so it's quite possible to
|
||
|
||
0:41:40.829,0:41:42.819
|
||
submit a non-parallel job
|
||
|
||
0:41:42.819,0:41:45.019
|
||
that goes off and spawns a zillion threads
|
||
|
||
0:41:45.019,0:41:48.369
|
||
and sucks up all the CPUs on the whole system
|
||
|
||
0:41:48.369,0:41:50.800
|
||
in some early versions of Grid Engine
|
||
|
||
0:41:50.800,0:41:53.569
|
||
there actually was
|
||
|
||
0:41:53.569,0:41:55.729
|
||
support for tying slots
|
||
|
||
0:41:55.729,0:41:58.669
|
||
to CPUs if you set it up that
|
||
way
|
||
|
||
0:41:58.669,0:42:02.979
|
||
there was a sensible implementation for IRIX
|
||
and then things got weirder and weirder as
|
||
|
||
0:42:02.979,0:42:06.010
|
||
people tried to implement it on other platforms
|
||
which had
|
||
|
||
0:42:06.010,0:42:07.030
|
||
vastly different
|
||
|
||
0:42:07.030,0:42:09.839
|
||
CPU binding semantics
|
||
|
||
0:42:09.839,0:42:12.359
|
||
and at this point it’s entirely broken
|
||
|
||
0:42:12.359,0:42:14.959
|
||
on every platform as far as I can tell
|
||
|
||
0:42:14.959,0:42:18.759
|
||
so we decided okay we've got this wrapper
|
||
let's see what we can do
|
||
|
||
0:42:18.759,0:42:21.009
|
||
in terms of making things work.
|
||
|
||
0:42:21.659,0:42:27.119
|
||
We now have the wrapper store allocations in the file system
|
||
|
||
0:42:27.119,0:42:31.239
|
||
we have a not-yet-recursive allocation algorithm
|
||
|
||
0:42:31.239,0:42:33.369
|
||
what we try to do is
|
||
|
||
0:42:33.369,0:42:34.690
|
||
find the best
|
||
|
||
0:42:34.690,0:42:35.779
|
||
fitting set of
|
||
|
||
0:42:35.779,0:42:39.539
|
||
adjacent cores
|
||
|
||
0:42:39.539,0:42:42.329
|
||
and then if that doesn't work we take the largest
|
||
fragment and repeat
|
||
|
||
0:42:43.519,0:42:45.180
|
||
until it fits
|
||
|
||
0:42:45.180,0:42:47.300
|
||
or until we've got enough slots
|
||
|
||
0:42:47.300,0:42:50.800
|
||
the goal is to minimize new fragments; we haven't
|
||
done any analysis
|
||
|
||
0:42:50.800,0:42:52.269
|
||
to determine whether that's actually
|
||
|
||
0:42:52.269,0:42:55.179
|
||
an appropriate algorithm
|
||
|
||
0:42:55.179,0:42:56.289
|
||
but offhand it seems
|
||
|
||
0:42:56.289,0:43:00.519
|
||
fine given I’ve thought about it over lunch.
|
||
|
||
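A sketch of the run-finding half of that idea (a reconstruction, not the real wrapper code); a caller would pick the smallest run that fits the request, falling back to the largest run and repeating:

    #!/bin/sh
    # print "start count" for each run of adjacent ids in a sorted list
    runs() {
        prev=-2 start=0 count=0
        for cpu in "$@"; do
            if [ "$cpu" -ne $((prev + 1)) ]; then
                [ "$count" -gt 0 ] && echo "$start $count"
                start=$cpu count=0
            fi
            count=$((count + 1))
            prev=$cpu
        done
        [ "$count" -gt 0 ] && echo "$start $count"
    }
    runs 1 2 3 5 6 7 8 10    # prints "1 3", "5 4", "10 1"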
0:43:00.519,0:43:02.810
|
||
So how do the OSes lay this out?
|
||
|
||
0:43:02.810,0:43:09.649
|
||
turns out that the FreeBSD CPU set API
|
||
and the Linux one
|
||
|
||
0:43:09.649,0:43:12.519
|
||
differ only in the very small details
|
||
|
||
0:43:12.519,0:43:13.599
|
||
They’re
|
||
|
||
0:43:13.599,0:43:15.479
|
||
essentially exactly
|
||
|
||
0:43:15.479,0:43:17.569
|
||
identical which is
|
||
|
||
0:43:17.569,0:43:20.489
|
||
convenient semantically,
|
||
so converting between them is pretty straightforward
|
||
|
||
0:43:20.489,0:43:24.869
|
||
so I did a set of benchmarks
|
||
|
||
0:43:24.869,0:43:27.019
|
||
to demonstrate the
|
||
|
||
0:43:28.089,0:43:29.359
|
||
effectiveness of CPU set,
|
||
they also happen to demonstrate the wrapper
|
||
|
||
0:43:29.359,0:43:33.319
|
||
but don’t really have any relevance
|
||
|
||
0:43:33.319,0:43:35.229
|
||
used a little eight-core Intel Xeon box
|
||
|
||
0:43:38.289,0:43:40.749
|
||
7.1 pre-release that had
|
||
|
||
0:43:40.749,0:43:43.239
|
||
John Bjorkman backported
|
||
|
||
0:43:43.239,0:43:46.640
|
||
CPU set
|
||
|
||
0:43:46.640,0:43:49.039
|
||
from 8.0 shortly before release
|
||
|
||
0:43:49.039,0:43:53.450
|
||
well not so shortly, it's supposed to be shortly
|
||
before
|
||
|
||
0:43:53.450,0:43:55.579
|
||
and SGE 6.2
|
||
|
||
0:43:55.579,0:43:59.739
|
||
we used a simple integer benchmark
|
||
|
||
0:43:59.739,0:44:02.519
|
||
an n-queens program. We tested
|
||
|
||
0:44:02.519,0:44:03.349
|
||
for instance an 8 x 8 board
|
||
|
||
0:44:03.349,0:44:05.360
|
||
placing
|
||
|
||
0:44:05.360,0:44:08.069
|
||
the 8 queens so they can’t capture each other
|
||
|
||
0:44:08.069,0:44:09.289
|
||
on the board
|
||
|
||
0:44:11.039,0:44:13.680
|
||
so it's a simple load benchmark
|
||
|
||
0:44:13.680,0:44:18.800
|
||
that we ran a small version of the problem
|
||
as our measured command; to generate
|
||
|
||
0:44:19.599,0:44:24.439
|
||
load, we ran a larger version that ran for much longer
|
||
|
||
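The runs look roughly like this (a reconstruction; the program name and problem sizes are assumptions):

    #!/bin/sh
    # eight long-running load processes, one per core on the test box
    i=0
    while [ "$i" -lt 8 ]; do
        ./nqueens 14 > /dev/null &
        i=$((i + 1))
    done
    # the measured run: the smaller problem, pinned with a CPU set
    time cpuset -l 7 ./nqueens 12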
0:44:24.439,0:44:28.149
|
||
some results
|
||
|
||
0:44:28.149,0:44:30.129
|
||
so for a baseline,
|
||
|
||
0:44:30.129,0:44:33.170
|
||
the most interesting thing is to do
|
||
a baseline run
|
||
|
||
0:44:33.170,0:44:34.279
|
||
you see this
|
||
|
||
0:44:34.279,0:44:36.410
|
||
some variance; it's not really very high
|
||
|
||
0:44:36.410,0:44:38.979
|
||
not surprising it doesn't really do anything
|
||
|
||
0:44:38.979,0:44:40.979
|
||
except suck CPU, as you see here
|
||
|
||
0:44:40.979,0:44:41.729
|
||
Really not much
|
||
|
||
0:44:41.729,0:44:45.229
|
||
going on
|
||
|
||
0:44:45.229,0:44:50.029
|
||
in this case we’ve got seven
|
||
load processes and a single
|
||
|
||
0:44:50.029,0:44:52.789
|
||
a single test process running
|
||
|
||
0:44:52.789,0:44:55.160
|
||
we see things slow down slightly
|
||
|
||
0:44:55.160,0:44:55.890
|
||
and
|
||
|
||
0:44:55.890,0:44:58.389
|
||
the standard deviation goes up a bit
|
||
|
||
0:44:58.389,0:45:00.829
|
||
it’s a little bit of deviation from baseline
|
||
|
||
0:45:00.829,0:45:03.659
|
||
the obvious explanation is clearly
|
||
|
||
0:45:03.659,0:45:07.339
|
||
we're just context switching
|
||
a bit more
|
||
|
||
0:45:08.840,0:45:10.349
|
||
because we don't have
|
||
|
||
0:45:10.349,0:45:12.410
|
||
CPUs that are doing nothing at all
|
||
|
||
0:45:12.410,0:45:15.559
|
||
there's some extra load from the system
|
||
as well
|
||
|
||
0:45:15.559,0:45:20.049
|
||
since the kernel has to run and
|
||
background tasks have to run
|
||
|
||
0:45:20.049,0:45:23.150
|
||
you know in this case we have a badly behaved application
|
||
|
||
0:45:23.150,0:45:26.579
|
||
we now have 8 load processes which would suck up all the CPU
|
||
|
||
0:45:26.579,0:45:28.879
|
||
and then we try to run our measurement process
|
||
|
||
0:45:28.879,0:45:30.639
|
||
we see a you know
|
||
|
||
0:45:30.639,0:45:32.739
|
||
substantial performance decrease
|
||
|
||
0:45:32.739,0:45:35.570
|
||
you know about in the range we would expect
|
||
|
||
0:45:35.570,0:45:37.289
|
||
to see if we had an even
|
||
|
||
0:45:37.289,0:45:40.140
|
||
decrease
|
||
|
||
0:45:40.140,0:45:43.220
|
||
then we fired it up with CPU sets
|
||
|
||
0:45:43.220,0:45:44.249
|
||
quite obviously
|
||
|
||
0:45:44.249,0:45:46.190
|
||
the interesting thing here is to see that
|
||
|
||
0:45:46.190,0:45:49.429
|
||
we’re getting no statistically significant difference
|
||
|
||
0:45:49.429,0:45:52.819
|
||
between the baseline case with
|
||
|
||
0:45:52.819,0:45:56.539
|
||
7 processes: if we use CPU sets
|
||
we don't see this variance
|
||
|
||
0:45:56.539,0:45:58.520
|
||
which is nice to know, and this shows
|
||
|
||
0:45:58.520,0:45:59.509
|
||
that, in fact,
|
||
|
||
0:45:59.509,0:46:02.869
|
||
we actually see a slight performance
|
||
improvement
|
||
|
||
0:46:02.869,0:46:04.179
|
||
and
|
||
|
||
0:46:04.179,0:46:05.579
|
||
we
|
||
|
||
0:46:05.579,0:46:07.589
|
||
we see a reduction in variance
|
||
|
||
0:46:07.589,0:46:11.569
|
||
so CPU set is actually improving performance
|
||
even if we’re not overloaded
|
||
|
||
0:46:11.569,0:46:13.510
|
||
and we see in the overloaded case
|
||
|
||
0:46:13.510,0:46:15.589
|
||
it's the same
|
||
|
||
0:46:15.589,0:46:20.319
|
||
for the other processes
|
||
they’re stuck on other CPUs
|
||
|
||
0:46:20.319,0:46:22.820
|
||
one interesting side note actually is that
|
||
|
||
0:46:22.820,0:46:26.719
|
||
when I was doing some tests early on
|
||
|
||
0:46:26.719,0:46:27.869
|
||
we actually saw
|
||
|
||
0:46:27.869,0:46:32.359
|
||
I tried doing the baseline and
|
||
the baseline with CPU set and if you just fired off with the original
|
||
|
||
0:46:32.359,0:46:33.869
|
||
algorithm
|
||
|
||
0:46:33.869,0:46:34.540
|
||
which
|
||
|
||
0:46:34.540,0:46:36.489
|
||
grabbed CPU0
|
||
|
||
0:46:36.489,0:46:39.339
|
||
you saw a significant performance decline
|
||
|
||
0:46:39.339,0:46:42.319
|
||
because there's a lot of stuff that ends up
|
||
running on CPU0
|
||
|
||
0:46:42.319,0:46:43.819
|
||
which
|
||
|
||
0:46:43.819,0:46:45.100
|
||
led to the
|
||
|
||
0:46:45.100,0:46:49.890
|
||
quick observation that you want to allocate
|
||
from the large numbers down
|
||
|
||
0:46:49.890,0:46:50.569
|
||
so that you use
|
||
|
||
0:46:50.569,0:46:55.069
|
||
the CPUs which are not running the random processes
|
||
that get stuck on zero
|
||
|
||
0:46:55.069,0:46:57.880
|
||
or get all the interrupts on some architectures
|
||
|
||
0:46:57.880,0:47:02.199
|
||
and avoid core 0 in particular.
|
||
|
||
0:47:02.199,0:47:04.029
|
||
so some conclusions
|
||
|
||
0:47:04.029,0:47:07.530
|
||
I think we have a useful proof of concept
|
||
we're going to be deploying
|
||
|
||
0:47:07.530,0:47:09.880
|
||
certainly the
|
||
|
||
0:47:09.880,0:47:11.000
|
||
memory stuff soon
|
||
|
||
0:47:11.000,0:47:13.329
|
||
once we upgrade to seven we’ll
|
||
|
||
0:47:13.329,0:47:15.959
|
||
definitely be deploying the CPU sets
|
||
|
||
0:47:15.959,0:47:16.849
|
||
so it
|
||
|
||
0:47:16.849,0:47:18.509
|
||
both improves performance
|
||
|
||
0:47:18.509,0:47:22.009
|
||
in the contended case and in the uncontended case
|
||
|
||
0:47:22.009,0:47:26.299
|
||
we would like in the future to do some more work
|
||
with virtual private server stuff
|
||
|
||
0:47:26.299,0:47:28.979
|
||
Particularly it would be really interesting
|
||
|
||
0:47:28.979,0:47:30.759
|
||
to be able to run different
|
||
|
||
0:47:30.759,0:47:32.540
|
||
different FreeBSD versions in jails
|
||
|
||
0:47:32.540,0:47:37.660
|
||
or to run, for instance, CentOS images
|
||
in jails, since we're running CentOS
|
||
|
||
0:47:37.660,0:47:40.649
|
||
on our Linux-based systems
|
||
|
||
0:47:40.649,0:47:43.240
|
||
there could actually be some really interesting
|
||
things there
|
||
|
||
0:47:43.240,0:47:45.759
|
||
in that for instance we can run
|
||
|
||
0:47:45.759,0:47:50.989
|
||
we could potentially DTrace Linux applications
|
||
it's never going to happen on native Linux
|
||
|
||
0:47:50.989,0:47:53.069
|
||
there's also another example where
|
||
|
||
0:47:53.069,0:47:56.269
|
||
Paul Saab, who was doing some benchmarking recently
|
||
|
||
0:47:56.269,0:48:01.039
|
||
and relative to Linux on the same hardware
|
||
|
||
0:48:01.039,0:48:04.900
|
||
he was seeing a three and a half times improvement
|
||
0:48:04.900,0:48:07.230
|
||
in basic matrix multiplication
|
||
|
||
0:48:07.230,0:48:08.549
|
||
on -CURRENT
|
||
|
||
0:48:08.549,0:48:11.849
|
||
because of the new superpages functionality
|
||
|
||
0:48:11.849,0:48:14.499
|
||
where you vastly reduce the number of TLB entries
|
||
|
||
0:48:14.499,0:48:16.150
|
||
in the page table
|
||
|
||
0:48:16.150,0:48:17.229
|
||
and so
|
||
|
||
0:48:17.229,0:48:21.109
|
||
that sort of thing can apply even
|
||
to our Linux-using population
|
||
|
||
0:48:21.109,0:48:23.969
|
||
could give FreeBSD some real wins there
|
||
|
||
0:48:26.309,0:48:27.579
|
||
I’d like to look at
|
||
|
||
0:48:27.579,0:48:30.859
|
||
more on the point of isolating users from kernel upgrades
|
||
|
||
0:48:30.859,0:48:32.620
|
||
one of the issues we've had is that
|
||
|
||
0:48:32.620,0:48:34.019
|
||
when you do a version bump
|
||
|
||
0:48:34.019,0:48:38.399
|
||
we have users who depend on all sorts of libraries
|
||
and
|
||
|
||
0:48:38.399,0:48:41.380
|
||
you know the vendors like to rev them to
|
||
do
|
||
|
||
0:48:41.380,0:48:44.640
|
||
stupid API-breaking changes fairly
|
||
regularly, so
|
||
|
||
0:48:44.640,0:48:48.380
|
||
it’d be nice for users if we can get all the
|
||
benefits of kernel upgrades
|
||
|
||
0:48:48.380,0:48:51.699
|
||
and they could upgrade at their leisure
|
||
|
||
0:48:51.699,0:48:54.459
|
||
so we're hoping to do that in the future as well
|
||
|
||
0:48:54.459,0:48:57.809
|
||
we'd also like to see more limits
|
||
on bandwidth-type resources
|
||
|
||
0:48:59.219,0:49:01.199
|
||
for instance say limiting the amount of
|
||
|
||
0:49:02.910,0:49:05.649
|
||
it's fairly easy to limit the number
|
||
of sockets I own
|
||
|
||
0:49:05.649,0:49:10.279
|
||
but it’s hard to place a total limit on
|
||
network bandwidth
|
||
|
||
0:49:10.279,0:49:11.819
|
||
used by a particular process
|
||
|
||
0:49:11.819,0:49:16.979
|
||
when almost all of our storage is on NFS
|
||
how do you classify that traffic
|
||
|
||
0:49:17.649,0:49:21.259
|
||
without a fair bit of change to the kernel
|
||
and somehow tagging that
|
||
|
||
0:49:21.259,0:49:23.799
|
||
it's an interesting challenge.
|
||
|
||
0:49:23.799,0:49:28.309
|
||
we'd also like to see someone
|
||
implement something like
|
||
|
||
0:49:28.309,0:49:30.089
|
||
the IRIX job ID
|
||
|
||
0:49:30.089,0:49:34.099
|
||
to allow the scheduler to just
|
||
tag processes as part of a job
|
||
|
||
0:49:34.099,0:49:36.309
|
||
currently
|
||
|
||
0:49:36.309,0:49:38.939
|
||
Grid Engine uses a clever but evil hack
|
||
|
||
0:49:38.939,0:49:40.010
|
||
where they add
|
||
|
||
0:49:40.010,0:49:42.509
|
||
an extra group to the process
|
||
|
||
0:49:42.509,0:49:44.819
|
||
and they just have a range of groups
|
||
|
||
0:49:44.819,0:49:48.209
|
||
available, so they get inherited and the users
|
||
can't drop them, so
|
||
|
||
0:49:48.209,0:49:51.889
|
||
that allows them to track the process
|
||
but it’s an ugly hack
|
||
|
||
0:49:51.889,0:49:57.499
|
||
and with the current limits on the number of groups
|
||
it can become a real problem
|
||
|
||
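In outline the hack works like this (a sketch, not Grid Engine's actual code; the gid and range are arbitrary):

    # The shepherd, while still privileged, picks an unused gid from an
    # administrator-reserved range and adds it to the job's supplementary
    # groups (the moral equivalent of a setgroups(2) call), then drops
    # privileges and execs the job. Children inherit the gid and an
    # unprivileged process can't shed it, so inside the job:
    id -G    # ... 20042 ... the tag gid marks every process in the job
    # and anything carrying that gid can be accounted or killed as a unit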
0:49:57.499,0:49:59.529
|
||
actually before I take questions
|
||
|
||
0:49:59.529,0:49:59.980
|
||
I do want to put in
|
||
|
||
0:49:59.980,0:50:01.119
|
||
one quick point
|
||
|
||
0:50:01.119,0:50:05.100
|
||
if you think this is interesting and you live in
|
||
the area and you're looking for
|
||
|
||
0:50:05.100,0:50:06.430
|
||
looking for a job
|
||
|
||
0:50:06.430,0:50:09.780
|
||
we are trying to hire a few people; it's difficult
|
||
to hire good people
|
||
|
||
0:50:09.780,0:50:13.069
|
||
we do have some openings and we're looking
|
||
for
|
||
|
||
0:50:13.069,0:50:17.409
|
||
BSD people in general, system
|
||
admin people
|
||
|
||
0:50:17.409,0:50:24.409
|
||
so questions?
|
||
|
||
0:50:38.419,0:50:40.989
|
||
Yes
|
||
(inaudible question)
|
||
|
||
0:50:40.989,0:50:45.719
|
||
I would expect that to happen
|
||
but it's not something I’ve attempted to test
|
||
|
||
0:50:45.719,0:50:50.570
|
||
what I would really like is to have a topology-aware allocator
|
||
|
||
0:50:50.570,0:50:53.179
|
||
so that you can request that you know I want
|
||
|
||
0:50:53.179,0:50:56.229
|
||
I want to share cache or I don't want to share cache
|
||
|
||
0:50:56.229,0:51:00.170
|
||
I want to share memory bandwidth or not share memory bandwidth
|
||
|
||
0:51:00.170,0:51:02.459
|
||
Open MPI 1.3
|
||
|
||
0:51:02.459,0:51:08.469
|
||
on the Linux side has a topology-aware wrapper for their CPU
|
||
|
||
0:51:08.469,0:51:10.159
|
||
functionality
|
||
|
||
0:51:10.159,0:51:12.249
|
||
it's something called
|
||
|
||
0:51:12.249,0:51:14.139
|
||
PLPA
|
||
|
||
0:51:14.139,0:51:15.259
|
||
portable Linux
|
||
|
||
0:51:16.519,0:51:19.599
|
||
processor affinity. I think that's
|
||
actually
|
||
|
||
0:51:19.599,0:51:21.959
|
||
what the acronym is
|
||
|
||
0:51:21.959,0:51:25.400
|
||
in essence they have to work around the fact
|
||
that there were three standard
|
||
|
||
0:51:25.400,0:51:27.809
|
||
there were three different
|
||
|
||
0:51:27.809,0:51:31.759
|
||
kernel APIs for the same syscall
|
||
|
||
0:51:31.759,0:51:38.759
|
||
for CPU allocation because all the vendors
|
||
did it themselves somehow
|
||
|
||
0:51:38.769,0:51:44.969
|
||
they're the same number but
|
||
they’re completely incompatible
|
||
|
||
0:51:44.969,0:51:48.749
|
||
when you first load the application it calls
|
||
the syscall and it tries to figure out which
|
||
|
||
0:51:48.749,0:51:50.579
|
||
one it is
|
||
|
||
0:51:50.579,0:51:52.719
|
||
by what errors it returns depending on what
|
||
|
||
0:51:52.719,0:51:56.139
|
||
arguments are missing, which is completely evil
|
||
|
||
0:51:56.139,0:52:00.859
|
||
I think people should port their API
|
||
and have their library work but
|
||
|
||
0:52:00.859,0:52:05.650
|
||
we don’t need to do that junk
|
||
because we did not make that mistake
|
||
|
||
0:52:05.650,0:52:12.650
|
||
so I would like to see the
|
||
topology aware stuff in particular
|
||
|
||
0:52:30.710,0:52:32.529
|
||
(inaudible question)
|
||
|
||
0:52:32.529,0:52:37.180
|
||
the trick is it’s easy to limit application bandwidth
|
||
|
||
0:52:39.500,0:52:42.269
|
||
fairly easy to limit application bandwidth
|
||
|
||
0:52:42.269,0:52:44.329
|
||
it becomes more difficult when you have to
|
||
|
||
0:52:44.329,0:52:45.430
|
||
if your
|
||
|
||
0:52:45.430,0:52:49.759
|
||
interfaces are shared between application traffic
|
||
|
||
0:52:49.759,0:52:50.880
|
||
and
|
||
|
||
0:52:50.880,0:52:53.049
|
||
say NFS
|
||
|
||
0:52:53.049,0:52:57.399
|
||
classifying that is going to be trickier
|
||
you'd have to add a fair bit of code
|
||
|
||
0:52:57.399,0:53:04.399
|
||
to trace that down through the kernel
|
||
certainly doable
|
||
|
||
0:53:12.069,0:53:15.499
|
||
(inaudible question)
|
||
|
||
0:53:15.499,0:53:18.389
|
||
I have contemplated doing just that
|
||
|
||
0:53:18.389,0:53:22.059
|
||
or in fact the other thing we've considered
|
||
doing
|
||
|
||
0:53:22.059,0:53:24.829
|
||
more as a research project than as a practical thing
|
||
|
||
0:53:24.829,0:53:26.719
|
||
would be actually how
|
||
|
||
0:53:26.719,0:53:28.619
|
||
would be
|
||
|
||
0:53:28.619,0:53:30.029
|
||
independent VLANs
|
||
|
||
0:53:30.029,0:53:31.839
|
||
because then we could do
|
||
|
||
0:53:31.839,0:53:32.459
|
||
things like
|
||
|
||
0:53:32.459,0:53:35.489
|
||
give each process a VLAN so they couldn't even
|
||
|
||
0:53:35.489,0:53:37.979
|
||
share at the Ethernet layer
|
||
|
||
0:53:37.979,0:53:41.259
|
||
once VIMAGE is in place for instance we will
|
||
be able to do that
|
||
|
||
0:53:41.259,0:53:45.049
|
||
and then say you know you've got your interface,
|
||
it's yours, whatever
|
||
|
||
0:53:45.049,0:53:46.479
|
||
but then we could limit it
|
||
|
||
0:53:46.479,0:53:49.959
|
||
we could rate limit that at the kernel
|
||
we can also have
|
||
|
||
0:53:49.959,0:53:54.729
|
||
we’d have a physically isolated
|
||
we’d have a logically isolated network as well
|
||
|
||
0:53:54.729,0:53:57.589
|
||
with some of the latest switches we could actually
|
||
rate limit
|
||
|
||
0:53:57.589,0:54:04.589
|
||
at the switch as well
|
||
|
||
0:54:19.939,0:54:22.369
|
||
(inaudible questions)
|
||
so to the first question
|
||
|
||
0:54:22.369,0:54:26.190
|
||
we don’t run multiple
|
||
|
||
0:54:26.190,0:54:27.639
|
||
sensitivities of data on these clusters
|
||
|
||
0:54:27.639,0:54:28.709
|
||
it's an unclassified cluster
|
||
|
||
0:54:28.709,0:54:30.460
|
||
we've avoided that problem by
|
||
|
||
0:54:30.460,0:54:32.299
|
||
not allowing it
|
||
|
||
0:54:32.299,0:54:34.929
|
||
But it is a real issue
|
||
|
||
0:54:34.929,0:54:36.939
|
||
it's just not one we've had to deal with
|
||
|
||
0:54:39.559,0:54:42.109
|
||
in practice with stuff that’s sensitive
|
||
|
||
0:54:43.059,0:54:47.579
|
||
has handling requirements that you can't touch
|
||
the same hardware without a scrub
|
||
|
||
0:54:47.579,0:54:49.859
|
||
you need a pretty
|
||
|
||
0:54:49.859,0:54:51.739
|
||
ridiculously aggressive
|
||
|
||
0:54:51.739,0:54:53.770
|
||
you need a very coarse granularity
|
||
|
||
0:54:53.770,0:54:57.240
|
||
a ridiculous re-imaging process where you
|
||
remove all of the data
|
||
|
||
0:54:57.240,0:55:00.959
|
||
so if I were to do that I would
|
||
probably get rid of the disks
|
||
|
||
0:55:00.959,0:55:01.389
|
||
just
|
||
|
||
0:55:01.389,0:55:02.400
|
||
go diskless
|
||
|
||
0:55:02.400,0:55:04.910
|
||
that would get rid of my number-one failure case
|
||
of
|
||
|
||
0:55:04.910,0:55:07.839
|
||
that would be pretty good but
|
||
|
||
0:55:07.839,0:55:09.419
|
||
but we haven't done it
|
||
|
||
0:55:10.609,0:55:13.819
|
||
As for NFS failures, we've had occasional problems with NFS overloading
|
||
|
||
|
||
0:55:13.819,0:55:15.679
|
||
we haven't had real problems
|
||
|
||
0:55:15.679,0:55:19.279
|
||
we're all on a local network, it's fairly tightly
|
||
contained so we haven't had problems with
|
||
|
||
0:55:19.279,0:55:20.539
|
||
things
|
||
|
||
0:55:20.539,0:55:21.819
|
||
with
|
||
|
||
0:55:21.819,0:55:26.039
|
||
you know the server going down for extended
|
||
periods and causing everything to hang
|
||
|
||
0:55:26.039,0:55:27.819
|
||
it's been more an issue of
|
||
|
||
0:55:27.819,0:55:33.189
|
||
I mean there's a problem
|
||
that Panasas has described as incast
|
||
|
||
0:55:33.189,0:55:36.109
|
||
you can take out any NFS server
|
||
|
||
0:55:36.109,0:55:40.809
|
||
I mean we had the BlueArc guys come in with their
|
||
FPGA-based stuff with multiple ten-gig links and I said
|
||
|
||
0:55:40.809,0:55:42.049
|
||
you know I've got
|
||
|
||
0:55:42.049,0:55:46.779
|
||
to do this and they said can we not try this with your whole cluster
|
||
|
||
0:55:46.779,0:55:47.950
|
||
because if you've got
|
||
|
||
0:55:47.950,0:55:49.370
|
||
three hundred and fifty
|
||
|
||
0:55:49.370,0:55:52.599
|
||
gigabit ethernet interfaces going into
|
||
the system
|
||
|
||
0:55:52.599,0:55:56.589
|
||
Even ten gig you can saturate pretty trivially
|
||
|
||
0:55:56.589,0:55:57.120
|
||
so at that level
|
||
|
||
0:55:57.120,0:55:58.930
|
||
there's an inherent problem
|
||
|
||
0:55:58.930,0:56:01.969
|
||
if we need to handle that kind of bandwidth
|
||
we've
|
||
|
||
0:56:01.969,0:56:04.459
|
||
got to get a parallel file system
|
||
|
||
0:56:04.459,0:56:06.069
|
||
or a cluster file system
|
||
|
||
0:56:06.069,0:56:12.289
|
||
before doing streaming stuff we could go via SWAN or something
|
||
|
||
0:56:12.289,0:56:14.949
|
||
anyone else?
|
||
|
||
0:56:14.949,0:56:15.429
|
||
thank you, everyone
|
||
(applause and end)
|