| 0:00:15.749,0:00:18.960
 | ||
| I do apologize for the (other)
 | ||
| 
 | ||
| 0:00:18.960,0:00:22.130
 | ||
| for the EuroBSDCon slides.  I've redone the
 | ||
| 
 | ||
| 0:00:22.130,0:00:23.890
 | ||
| title page and redone the
 | ||
| 
 | ||
| 0:00:23.890,0:00:27.380
 | ||
| and made some changes to the slides
 | ||
| and they didn't make it through for approval
 | ||
| 
 | ||
| 0:00:27.380,0:00:33.130
 | ||
| by this afternoon so
 | ||
| 
 | ||
| 0:00:33.130,0:00:34.640
 | ||
| okay so
 | ||
| 
 | ||
| 0:00:34.640,0:00:36.390
 | ||
| I'm gonna be  talking about
 | ||
| 
 | ||
| 0:00:36.390,0:00:38.430
 | ||
| doing
 | ||
| 
 | ||
| 0:00:38.430,0:00:42.889
 | ||
| about isolating jobs for performance and predictability
 | ||
| in clusters
 | ||
| 
 | ||
| 0:00:42.889,0:00:43.970
 | ||
| before I get into that
 | ||
| 
 | ||
| 0:00:43.970,0:00:46.010
 | ||
| I want to talk a little bit about
 | ||
| 
 | ||
| 0:00:46.010,0:00:47.229
 | ||
| who we are and
 | ||
| 
 | ||
| 0:00:47.229,0:00:49.520
 | ||
| what our problem space is like because that
 | ||
| 
 | ||
| 0:00:49.520,0:00:54.760
 | ||
| dictates that… has an effect
 | ||
| on our solution space
 | ||
| 
 | ||
| 0:00:54.760,0:00:57.079
 | ||
| I work for The Aerospace Corporation.
 | ||
| 
 | ||
| 0:00:57.079,0:00:58.609
 | ||
| We work;
 | ||
| 
 | ||
| 0:00:58.609,0:01:02.480
 | ||
| we operate a federally-funded
 | ||
| research and development center
 | ||
| 
 | ||
| 0:01:02.480,0:01:05.400
 | ||
| in the area of national security space
 | ||
| 
 | ||
| 0:01:05.400,0:01:09.310
 | ||
| and in particular we work with the air force
 | ||
| space and missile command
 | ||
| 
 | ||
| 0:01:09.310,0:01:13.090
 | ||
| and with the national reconnaissance
 | ||
| office
 | ||
| 
 | ||
| 0:01:13.090,0:01:16.670
 | ||
| and our engineers support a wide variety
 | ||
| 
 | ||
| 0:01:16.670,0:01:20.550
 | ||
| of  activities within that area
 | ||
| 
 | ||
| 0:01:20.550,0:01:21.830
 | ||
| so we have 
 | ||
| 
 | ||
| 0:01:21.830,0:01:23.740
 | ||
| a bit over fourteen hundred to correct
 | ||
| 
 | ||
| 0:01:23.740,0:01:25.860
 | ||
| sorry twenty four hundred engineers
 | ||
| 
 | ||
| 0:01:25.860,0:01:28.820
 | ||
| in virtually every discipline we have 
 | ||
| 
 | ||
| 0:01:28.820,0:01:33.520
 | ||
| as you would expect we have our rocket scientists,
 | ||
| we have people who build satellites
 | ||
| 
 | ||
| 0:01:33.520,0:01:37.439
 | ||
| we have people who build sensors that go on
 | ||
| satellites, people who study these sort of things
 | ||
| 
 | ||
| 0:01:37.439,0:01:38.130
 | ||
| that you
 | ||
| 
 | ||
| 0:01:38.130,0:01:39.590
 | ||
| see when you 
 | ||
| 
 | ||
| 0:01:39.590,0:01:40.819
 | ||
| use those sensors
 | ||
| 
 | ||
| 0:01:40.819,0:01:42.040
 | ||
| that sort of thing.
 | ||
| 
 | ||
| 0:01:42.040,0:01:44.180
 | ||
| We also have civil engineers and
 | ||
| 
 | ||
| 0:01:44.180,0:01:45.680
 | ||
| electronic engineers
 | ||
| 
 | ||
| 0:01:45.680,0:01:46.649
 | ||
| and process,
 | ||
| 
 | ||
| 0:01:46.649,0:01:49.170
 | ||
| computer process people
 | ||
| 
 | ||
| 0:01:49.170,0:01:53.120
 | ||
| so we literally do everything related to space
 | ||
| and all sorts of things that you might not
 | ||
| 
 | ||
| 0:01:53.120,0:01:55.270
 | ||
| expect to be related to space,
 | ||
| 
 | ||
| 0:01:55.270,0:01:58.820
 | ||
| since we also for instance help build ground
 | ||
| systems ‘cause satellites aren’t very useful if
 | ||
| 
 | ||
| 0:01:58.820,0:02:00.680
 | ||
| there isn't anything to talk to them;
 | ||
| 
 | ||
| 0:02:02.540,0:02:04.090
 | ||
| and these engineers
 | ||
| 
 | ||
| 0:02:04.090,0:02:07.420
 | ||
| since they're solving all these different problems we have
 | ||
| 
 | ||
| 0:02:07.420,0:02:11.499
 | ||
| engineering applications in you know
 | ||
| virtually every size you can think of
 | ||
| 
 | ||
| 0:02:11.499,0:02:15.539
 | ||
| ranging from you know little spreadsheet things that
 | ||
| you might not think of as an engineering
 | ||
| 
 | ||
| 0:02:15.539,0:02:17.229
 | ||
| application but they are
 | ||
| 
 | ||
| 0:02:17.229,0:02:22.249
 | ||
| to Matlab programs or a lot of C code
 | ||
| 
 | ||
| 0:02:22.249,0:02:23.960
 | ||
| or one of traditional parallel for us
 | ||
| 
 | ||
| 0:02:23.960,0:02:25.159
 | ||
| serial code
 | ||
| 
 | ||
| 0:02:25.159,0:02:26.049
 | ||
| and then
 | ||
| 
 | ||
| 0:02:26.049,0:02:30.949
 | ||
| large parallel applications either in house;
 | ||
| genetic algorithms and that sort
 | ||
| 
 | ||
| 0:02:30.949,0:02:31.769
 | ||
| of thing,
 | ||
| 
 | ||
| 0:02:31.769,0:02:32.900
 | ||
| or traditional
 | ||
| 
 | ||
| 0:02:32.900,0:02:34.749
 | ||
| the classic parallel code
 | ||
| 
 | ||
| 0:02:34.749,0:02:37.599
 | ||
| like you work around a crate or something material simulation
 | ||
| 0:02:40.119,0:02:41.459
 | ||
| or that, or fluid flow
 | ||
| 
 | ||
| 0:02:41.459,0:02:43.869
 | ||
| or that sort of thing
 | ||
| 
 | ||
| 0:02:43.869,0:02:44.240
 | ||
| so
 | ||
| 
 | ||
| 0:02:44.240,0:02:46.349
 | ||
| so we have this big application space
 | ||
| 
 | ||
| 0:02:46.349,0:02:49.029
 | ||
| just want to give a little introduction to that because
 | ||
| it
 | ||
| 
 | ||
| 0:02:49.029,0:02:51.529
 | ||
| does come back and influence what we 
 | ||
| 
 | ||
| 0:02:51.529,0:02:55.999
 | ||
| the sort of solutions we look at
 | ||
| 
 | ||
| 0:02:55.999,0:03:00.499
 | ||
| so the rest of the talk I’m gonna talk about rese…
 | ||
| 
 | ||
| 0:03:00.499,0:03:05.259
 | ||
| we skipped a slide, there we are, that’s a little better.
 | ||
| 
 | ||
| 0:03:05.259,0:03:08.940
 | ||
| Now, what I'm interested in is I do high
 | ||
| performance computing
 | ||
| 
 | ||
| 0:03:08.940,0:03:10.109
 | ||
| at the company
 | ||
| 
 | ||
| 0:03:10.109,0:03:13.949
 | ||
| and I provide high performance computing resources
 | ||
| to our users
 | ||
| 
 | ||
| 0:03:13.949,0:03:19.949
 | ||
| as part of my role in our technical
 | ||
| computing services organization
 | ||
| 
 | ||
| 0:03:19.949,0:03:20.370
 | ||
| so
 | ||
| 
 | ||
| 0:03:20.370,0:03:23.120
 | ||
| our primary resource at this point is
 | ||
| 
 | ||
| 0:03:23.120,0:03:25.429
 | ||
| the Fellowship cluster
 | ||
| 
 | ||
| 0:03:25.429,0:03:26.540
 | ||
| it's a for the
 | ||
| 
 | ||
| 0:03:26.540,0:03:29.569
 | ||
| named for the Fellowship of the Ring
 | ||
| 
 | ||
| 0:03:29.569,0:03:30.449
 | ||
| so it's a…
 | ||
| 
 | ||
| 0:03:30.449,0:03:32.520
 | ||
| … eleven axel nodes
 | ||
| 
 | ||
| 0:03:32.520,0:03:33.930
 | ||
| wrap the core systems
 | ||
| 
 | ||
| 0:03:33.930,0:03:35.909
 | ||
| over here there's a
 | ||
| 
 | ||
| 0:03:35.909,0:03:39.659
 | ||
| Cisco a large Cisco switch. Actually today
 | ||
| there are around two 6509s if 
 | ||
| 
 | ||
| 0:03:39.659,0:03:40.899
 | ||
| you  assess them
 | ||
| 
 | ||
| 0:03:40.899,0:03:46.149
 | ||
| and because we couldn’t get the port density we wanted otherwise
 | ||
| 
 | ||
| 0:03:46.149,0:03:50.219
 | ||
| and primarily Gigabit Ethernet. The system runs
 | ||
| FreeBSD currently 6.0 ‘cause we haven’t upgraded 
 | ||
| 
 | ||
| 0:03:50.219,0:03:51.089
 | ||
| it yet
 | ||
| 
 | ||
| 0:03:51.089,0:03:55.639
 | ||
| planning to move probably to 7.1
 | ||
| or maybe slightly past 7.1
 | ||
| 
 | ||
| 0:03:55.639,0:04:01.029
 | ||
| if we want to get the latest HWPMC changes in 
 | ||
| 
 | ||
| 0:04:01.029,0:04:05.900
 | ||
| we use the Sun Grid Engine scheduler, which is one of
 | ||
| the two main options for open source
 | ||
| 
 | ||
| 0:04:05.900,0:04:08.949
 | ||
| resource managers on clusters the other one being
 | ||
| the…
 | ||
| 
 | ||
| 0:04:09.959,0:04:11.499
 | ||
| … the TORQUE
 | ||
| 
 | ||
| 0:04:11.499,0:04:15.939
 | ||
| and Maui combination from Cluster Resources
 | ||
| 
 | ||
| 0:04:15.939,0:04:17.389
 | ||
| so we also have
 | ||
| 
 | ||
| 0:04:17.389,0:04:18.079
 | ||
| that's actually
 | ||
| 
 | ||
| 0:04:18.079,0:04:22.090
 | ||
| 40 TB, that’s really the raw number, on a Sun Thumper and
 | ||
| 0:04:23.219,0:04:26.290
 | ||
| that’s thirty two usable once you start using RAID-Z2
 | ||
| 
 | ||
| 0:04:26.290,0:04:30.939
 | ||
| since you might actually like to have your data
 | ||
| should a disk fail
 | ||
| 
 | ||
| 0:04:30.939,0:04:32.969
 | ||
| and with today's discs RAID…
 | ||
| 
 | ||
| 0:04:32.969,0:04:34.009
 | ||
| RAID five
 | ||
| 
 | ||
| 0:04:34.009,0:04:35.249
 | ||
| doesn't really cut it,
 | ||
| 
 | ||
| 0:04:37.379,0:04:40.220
 | ||
| And then we also have some other resources coming on but I’m going to be (concentrating on)
 | ||
| 
 | ||
| 0:04:40.220,0:04:43.530
 | ||
| two smaller clusters unfortunately probably running Linux and
 | ||
| 
 | ||
| 0:04:43.530,0:04:45.900
 | ||
| some SMPs but
 | ||
| 
 | ||
| 0:04:45.900,0:04:49.990
 | ||
| I’m going to be concentrating here on the work we're
 | ||
| doing on our other
 | ||
| 
 | ||
| 0:04:49.990,0:04:54.259
 | ||
| FreeBSD based cluster.
 | ||
| 
 | ||
| 0:04:54.259,0:04:55.060
 | ||
| So, first of all
 | ||
| 
 | ||
| 0:04:55.060,0:04:59.410
 | ||
| first of all I want to talk about why we want to
 | ||
| share resources. Should be fairly obvious
 | ||
| 
 | ||
| 0:04:59.410,0:05:00.610
 | ||
| but I'll talk about it in a little bit
 | ||
| 
 | ||
| 0:05:00.610,0:05:04.900
 | ||
| and then what goes wrong when you start sharing resources
 | ||
| 
 | ||
| 0:05:04.900,0:05:08.449
 | ||
| after that I'll talk about some different solutions
 | ||
| to those problems
 | ||
| 
 | ||
| 0:05:08.449,0:05:09.759
 | ||
| and
 | ||
| 
 | ||
| 0:05:09.759,0:05:13.399
 | ||
| some fairly trivial experiments that we've done
 | ||
| so far in terms of enhancing the scheduler or
 | ||
| 
 | ||
| 0:05:13.399,0:05:15.860
 | ||
| using operating system features
 | ||
| 
 | ||
| 0:05:15.860,0:05:17.730
 | ||
| to mitigate those problems
 | ||
| 
 | ||
| 0:05:19.349,0:05:20.110
 | ||
| and
 | ||
| 
 | ||
| 0:05:20.110,0:05:25.110
 | ||
| then conclude with some future work.
 | ||
| 
 | ||
| 0:05:25.110,0:05:29.289
 | ||
| So, obviously if you have a resource the size…
 | ||
| the size of our cluster, fourteen hundred
 | ||
| 
 | ||
| 0:05:29.289,0:05:30.970
 | ||
| cores roughly
 | ||
| 
 | ||
| 0:05:30.970,0:05:32.819
 | ||
| you probably want to share it unless you
 | ||
| 
 | ||
| 0:05:32.819,0:05:35.080
 | ||
| purpose built it for a single application
 | ||
| 
 | ||
| 0:05:35.080,0:05:37.340
 | ||
| you're going to want to have your users
 | ||
| 
 | ||
| 0:05:37.340,0:05:39.440
 | ||
| sharing it
 | ||
| 
 | ||
| 0:05:39.440,0:05:42.909
 | ||
| and you don't want to just say you know, you get on Monday
 | ||
| 
 | ||
| 0:05:42.909,0:05:45.330
 | ||
| probably not going to be a very effective
 | ||
| option
 | ||
| 
 | ||
| 0:05:45.330,0:05:49.270
 | ||
| especially not when we have as many users as we
 | ||
| do
 | ||
| 
 | ||
| 0:05:49.270,0:05:53.849
 | ||
| we also can't just afford to buy another one
 | ||
| every time a user shows up
 | ||
| 
 | ||
| 0:05:53.849,0:05:54.959
 | ||
| so one of our
 | ||
| 
 | ||
| 0:05:54.959,0:05:57.339
 | ||
| senior VPs said a while back
 | ||
| 
 | ||
| 0:05:57.339,0:05:57.969
 | ||
| you know
 | ||
| 
 | ||
| 0:05:57.969,0:06:02.349
 | ||
| we could probably afford to buy just about
 | ||
| anything we could need once
 | ||
| 
 | ||
| 0:06:02.349,0:06:03.800
 | ||
|  we can't just
 | ||
| 
 | ||
| 0:06:03.800,0:06:06.359
 | ||
| buy ten of them though
 | ||
| 
 | ||
| 0:06:06.359,0:06:08.939
 | ||
| if we really, really needed it
 | ||
| 
 | ||
| 0:06:08.939,0:06:09.680
 | ||
| dropping
 | ||
| 
 | ||
| 0:06:09.680,0:06:11.460
 | ||
| small numbers of millions of dollars on
 | ||
| 
 | ||
| 0:06:11.460,0:06:13.349
 | ||
| computing resources wouldn’t be
 | ||
| 
 | ||
| 0:06:13.349,0:06:15.039
 | ||
| impossible
 | ||
| 
 | ||
| 0:06:15.039,0:06:20.829
 | ||
| but we can't go to you know just have every engineer
 | ||
| who wants one just call up Dell and say ship me ten racks 
 | ||
| 
 | ||
| 0:06:20.829,0:06:24.030
 | ||
| it's not going to work
 | ||
| 
 | ||
| 0:06:24.030,0:06:25.580
 | ||
| and the other thing is that we can’t
 | ||
| 
 | ||
| 0:06:25.580,0:06:28.360
 | ||
| we need to also provide quick turnaround
 | ||
| 
 | ||
| 0:06:28.360,0:06:29.390
 | ||
| for some users
 | ||
| 
 | ||
| 0:06:29.390,0:06:33.229
 | ||
| so we can't have one user hogging the system and
 | ||
| hogging it until they are done
 | ||
| 
 | ||
| 0:06:33.229,0:06:34.720
 | ||
| because we have some users
 | ||
| 
 | ||
| 0:06:34.720,0:06:37.099
 | ||
| and then the next one can run
 | ||
| 
 | ||
| 0:06:37.099,0:06:40.949
 | ||
| because we have some users who'll
 | ||
| come in and say well I need to run
 | ||
| 
 | ||
| 0:06:40.949,0:06:43.159
 | ||
| for three months
 | ||
| 
 | ||
| 0:06:43.159,0:06:43.690
 | ||
| and 
 | ||
| 
 | ||
| 0:06:43.690,0:06:46.810
 | ||
| we've had users come in and literally run 
 | ||
| 
 | ||
| 0:06:46.810,0:06:49.740
 | ||
| pretty much using the entire system for three months
 | ||
| 
 | ||
| 0:06:49.740,0:06:53.839
 | ||
| well so we've had to provide some ability for other
 | ||
| users to still get their work done
 | ||
| 
 | ||
| 0:06:53.839,0:06:58.300
 | ||
| so we can't just… so we do have to have some sharing
 | ||
| 
 | ||
| 0:06:58.300,0:07:00.619
 | ||
| however when you start to share any resource
 | ||
| 
 | ||
| 0:07:00.619,0:07:01.610
 | ||
| like this
 | ||
| 
 | ||
| 0:07:01.610,0:07:03.509
 | ||
| you start getting contention
 | ||
| 
 | ||
| 0:07:03.509,0:07:06.300
 | ||
| users need the same thing at the same time
 | ||
| 
 | ||
| 0:07:06.300,0:07:09.700
 | ||
| and so they fight back and forth for it and they
 | ||
| can't get what they want
 | ||
| 
 | ||
| 0:07:09.700,0:07:11.639
 | ||
| so you have to balance them a bit
 | ||
| 
 | ||
| 0:07:12.999,0:07:14.529
 | ||
| you know also
 | ||
| 
 | ||
| 0:07:14.529,0:07:17.869
 | ||
| some jobs lie when they
 | ||
| 
 | ||
| 0:07:17.869,0:07:20.870
 | ||
| request resources and they actually need
 | ||
| more than they ask for
 | ||
| 
 | ||
| 0:07:20.870,0:07:23.279
 | ||
| which can cause problems
 | ||
| 
 | ||
| 0:07:23.279,0:07:27.229
 | ||
| so we schedule them. We say you're going to fit
 | ||
| here fine and they run off and use
 | ||
| 
 | ||
| 0:07:27.229,0:07:28.580
 | ||
| more than they said
 | ||
| 
 | ||
| 0:07:28.580,0:07:31.000
 | ||
| and if we don't have a mechanism to constrain
 | ||
| them
 | ||
| 
 | ||
| 0:07:31.000,0:07:32.389
 | ||
| we have problems.
 | ||
| 
 | ||
| 0:07:32.389,0:07:34.270
 | ||
| Likewise
 | ||
| 
 | ||
| 0:07:34.270,0:07:37.109
 | ||
| once these users start to contend
 | ||
| 
 | ||
| 0:07:37.109,0:07:39.029
 | ||
| that doesn't just result in
 | ||
| 
 | ||
| 0:07:39.029,0:07:40.439
 | ||
| the jobs taking,
 | ||
| 
 | ||
| 0:07:40.439,0:07:43.360
 | ||
| taking longer in terms of wall clock time
 | ||
| 
 | ||
| 0:07:43.360,0:07:44.659
 | ||
| because they are extremely slow
 | ||
| 
 | ||
| 0:07:44.659,0:07:48.430
 | ||
| but there's overhead related to that contention;
 | ||
| they get swapped out due to pressure on
 | ||
| 
 | ||
| 0:07:49.219,0:07:51.509
 | ||
| various systems
 | ||
| 
 | ||
| 0:07:51.509,0:07:52.550
 | ||
| if you really
 | ||
| 
 | ||
| 0:07:52.550,0:07:57.039
 | ||
| for instance run out of memory then you go into
 | ||
| swap and you end up wasting all your cycles 
 | ||
| 
 | ||
| 0:07:57.039,0:07:58.710
 | ||
| pulling junk in and out of disc
 | ||
| 
 | ||
| 0:07:58.710,0:08:00.830
 | ||
| wasting your bandwidth on that
 | ||
| 
 | ||
| 0:08:00.830,0:08:03.530
 | ||
| so there are
 | ||
| 
 | ||
| 0:08:03.530,0:08:04.219
 | ||
| resource
 | ||
| 
 | ||
| 0:08:04.219,0:08:08.139
 | ||
| there are resource costs to the contention not merely
 | ||
| 
 | ||
| 0:08:08.139,0:08:11.979
 | ||
| a delay in returning results.
 | ||
| 
 | ||
| 0:08:11.979,0:08:16.590
 | ||
| So now I'm going to switch gears and start talk… so I'm
 | ||
| going to talk a little bit about different
 | ||
| 
 | ||
| 0:08:16.590,0:08:18.270
 | ||
| solutions to these
 | ||
| 
 | ||
| 
 | ||
| 0:08:18.270,0:08:20.610
 | ||
| to the 
 | ||
| 
 | ||
| 0:08:20.610,0:08:22.339
 | ||
| these contention issues
 | ||
| 
 | ||
| 0:08:23.710,0:08:27.840
 | ||
| and look at different ways of solving the
 | ||
| problem. Most of these are things that have
 | ||
| 
 | ||
| 0:08:27.840,0:08:29.440
 | ||
| already been done
 | ||
| 
 | ||
| 0:08:29.440,0:08:30.620
 | ||
| but I just want to talk about
 | ||
| 
 | ||
| 0:08:30.620,0:08:32.990
 | ||
| the different ways and then
 | ||
| 
 | ||
| 0:08:32.990,0:08:35.710
 | ||
| evaluate them in our context.
 | ||
| 
 | ||
| 0:08:35.710,0:08:38.119
 | ||
| So a classic solution to the problem is
 | ||
| 
 | ||
| 0:08:38.119,0:08:39.280
 | ||
| Gang Scheduling
 | ||
| 
 | ||
| 0:08:39.280,0:08:44.139
 | ||
| It's basically conventional Unix process
 | ||
| context switching
 | ||
| 
 | ||
| 0:08:44.139,0:08:46.560
 | ||
| written really big
 | ||
| 
 | ||
| 0:08:46.560,0:08:50.339
 | ||
| you what you do is you have your parallel
 | ||
| job that’s running 
 | ||
| 
 | ||
| 0:08:50.339,0:08:51.390
 | ||
| on a system
 | ||
| 
 | ||
| 0:08:51.390,0:08:52.839
 | ||
| and it runs for a while
 | ||
| 
 | ||
| 0:08:52.839,0:08:57.920
 | ||
| and then after a certain amount of time you basically
 | ||
| shove it all; you kick it off of all the nodes
 | ||
| 
 | ||
| 0:08:57.920,0:08:59.940
 | ||
| and let the next one come in
 | ||
| 
 | ||
| 0:08:59.940,0:09:04.030
 | ||
| and typically when  people do this they do it on
 | ||
| on the order of hours because the context switch
 | ||
| 
 | ||
| 0:09:04.030,0:09:09.270
 | ||
| time is extremely large is extremely high
 | ||
| 
 | ||
| 0:09:09.270,0:09:10.639
 | ||
| for example
 | ||
| 
 | ||
| 0:09:10.639,0:09:14.530
 | ||
| because it's not just like swapping a process
 | ||
| in and out. You suddenly have to coordinate 
 | ||
| 
 | ||
| 0:09:14.530,0:09:17.470
 | ||
| this context switch across all of your processes
 | ||
| 
 | ||
| 0:09:17.470,0:09:19.280
 | ||
| if you're running say
 | ||
| 
 | ||
| 0:09:19.280,0:09:21.190
 | ||
| MPI over TCP
 | ||
| 
 | ||
| 0:09:21.190,0:09:25.910
 | ||
| you actually need to tear down the TCP sessions
 | ||
| because you can't just have TCP timers sitting
 | ||
| 
 | ||
| 0:09:25.910,0:09:26.570
 | ||
| around
 | ||
| 
 | ||
| 0:09:26.570,0:09:28.260
 | ||
| or that sort of thing so
 | ||
| 
 | ||
| 0:09:28.260,0:09:29.950
 | ||
| there there's a there's a lot of overhead
 | ||
| 
 | ||
| 0:09:29.950,0:09:34.340
 | ||
| associated with this. You take a long context switch
 | ||
| 
 | ||
| 0:09:34.340,0:09:36.820
 | ||
| if all of your infrastructure supports this
 | ||
| 
 | ||
| 0:09:36.820,0:09:39.420
 | ||
|  it's fairly effective
 | ||
| 
 | ||
| 0:09:39.420,0:09:43.300
 | ||
| and it does allow jobs to avoid interfering
 | ||
| with each other which is nice
 | ||
| 
 | ||
| 0:09:43.300,0:09:46.100
 | ||
| so you can't you don't have issues
 | ||
| 
 | ||
| 0:09:46.100,0:09:47.689
 | ||
| because you're typically allocating
 | ||
| 
 | ||
| 0:09:47.689,0:09:50.950
 | ||
| whole swaths of the system
 | ||
| 
 | ||
| 0:09:50.950,0:09:53.390
 | ||
| and for properly written applications
 | ||
| 
 | ||
| 0:09:55.000,0:09:59.690
 | ||
| partial results can be returned which for some of
 | ||
| our users is really important where you're doing a 
 | ||
| 
 | ||
| 0:09:59.690,0:10:00.490
 | ||
| refinement
 | ||
| 
 | ||
| 0:10:00.490,0:10:04.350
 | ||
| users would want to look at the results and
 | ||
| say okay
 | ||
| 
 | ||
| 0:10:04.350,0:10:06.130
 | ||
| you know is this just going off into the weeds
 | ||
| 
 | ||
| 0:10:06.130,0:10:10.860
 | ||
| or does it look like it's actually converging on
 | ||
| some sort of useful solution
 | ||
| 
 | ||
| 0:10:10.860,0:10:13.980
 | ||
| as they don't want to just wait till the end.
 | ||
| 
 | ||
| 0:10:13.980,0:10:19.270
 | ||
| Down side of course is that this context
 | ||
| switches costs are very high
 | ||
| 
 | ||
| 0:10:19.270,0:10:22.460
 | ||
| and most importantly there's really a lack
 | ||
| of useful implementations
 | ||
| 
 | ||
| 0:10:22.460,0:10:25.340
 | ||
| a number of platforms have implemented this in the past
 | ||
| 
 | ||
| 0:10:25.340,0:10:29.840
 | ||
| but in practice on modern clusters which are
 | ||
| built on commodity hardware
 | ||
| 
 | ||
| 0:10:29.840,0:10:32.340
 | ||
| with you know
 | ||
| 
 | ||
| 0:10:32.340,0:10:35.530
 | ||
| communication libraries written on standard protocols
 | ||
| 
 | ||
| 0:10:35.530,0:10:37.050
 | ||
| the tools just aren’t there
 | ||
| 
 | ||
| 0:10:37.050,0:10:39.100
 | ||
| and so
 | ||
| 
 | ||
| 0:10:39.100,0:10:40.860
 | ||
| it's not very practical.
 | ||
| 
 | ||
| 0:10:40.860,0:10:44.010
 | ||
| Also it doesn't really make a lot of sense with small jobs
 | ||
| 
 | ||
| 0:10:44.010,0:10:47.789
 | ||
| and one of the things that we found is we have users who have
 | ||
| 
 | ||
| 0:10:47.789,0:10:50.860
 | ||
| embarrassingly parallel problems where they need to look at 
 | ||
| 
 | ||
| 0:10:50.860,0:10:53.450
 | ||
| you know twenty thousand studies
 | ||
| 
 | ||
| 0:10:53.450,0:10:57.400
 | ||
| and they could write something that looked more like a
 | ||
| conventional parallel application where they
 | ||
| 
 | ||
| 0:10:57.400,0:11:01.930
 | ||
| you know wrote a Scheduler and set up an MPI a Message Passing Interface 
 | ||
| 
 | ||
| 0:11:01.930,0:11:05.400
 | ||
| and handed out tasks to pieces of their job and then you
 | ||
| could do this
 | ||
| 
 | ||
| 0:11:05.400,0:11:09.280
 | ||
| but then they would be running a Scheduler and they would
 | ||
| probably do a bad job of it turns out it's actually
 | ||
| 
 | ||
| 0:11:09.280,0:11:10.820
 | ||
| fairly difficult to do right
 | ||
| 
 | ||
| 0:11:10.820,0:11:13.740
 | ||
| even a trivial case
 | ||
| 
 | ||
| 0:11:13.740,0:11:16.189
 | ||
| and so what they do instead is they just submit twenty
 | ||
| 
 | ||
| 0:11:16.189,0:11:18.730
 | ||
| twenty thousand jobs to grid engine and say okay
 | ||
| 
 | ||
| 0:11:18.730,0:11:21.330
 | ||
| whatever I'll deal with it
 | ||
| 
 | ||
| 0:11:21.330,0:11:23.140
 | ||
| earlier versions that might have been a problem
 | ||
| 
 | ||
| 0:11:23.140,0:11:24.730
 | ||
| current versions of the code
 | ||
| 
 | ||
| 0:11:24.730,0:11:27.060
 | ||
|  handle easily a million jobs that
 | ||
| 
 | ||
| 0:11:27.060,0:11:29.370
 | ||
| so not really a big deal
 | ||
| 
 | ||
| 0:11:29.370,0:11:31.610
 | ||
| but those sort of users wouldn't fit well
 | ||
| 
 | ||
| 0:11:31.610,0:11:34.190
 | ||
| into the gang scheduled environment 
 | ||
| 
 | ||
| 0:11:34.190,0:11:35.690
 | ||
| at least not in a 
 | ||
| 
 | ||
| 0:11:35.690,0:11:39.149
 | ||
| conventional gang scheduled environment where
 | ||
| you do gang scheduling on the granularity of 
 | ||
| 
 | ||
| 0:11:39.149,0:11:40.940
 | ||
| jobs
 | ||
| 
 | ||
| 0:11:40.940,0:11:44.140
 | ||
| so from that perspective it wouldn’t work very well.
 | ||
| 
 | ||
| 0:11:44.140,0:11:48.380
 | ||
| If you have all the pieces in place and you are
 | ||
| doing a big parallel applications it is in fact
 | ||
| 
 | ||
| 0:11:48.380,0:11:53.770
 | ||
| an extremely effective approach.
 | ||
| 
 | ||
| 0:11:53.770,0:11:56.290
 | ||
| Another option which is sort of related
 | ||
| 
 | ||
| 0:11:56.290,0:11:57.420
 | ||
| it's in fact
 | ||
| 
 | ||
| 0:11:57.420,0:12:00.079
 | ||
| taking an even coarser granularity
 | ||
| 
 | ||
| 0:12:00.079,0:12:04.360
 | ||
| is single application or single project
 | ||
| clusters or sub-clusters.
 | ||
| 
 | ||
| 0:12:04.360,0:12:07.590
 | ||
| For instance this is used at some national labs
 | ||
| 
 | ||
| 0:12:07.590,0:12:11.910
 | ||
| where you're given a cycle allocation for a
 | ||
| year based on your grant proposals
 | ||
| 
 | ||
| 0:12:11.910,0:12:14.779
 | ||
| and what your cycle allocation actually comes to you as is
 | ||
| 
 | ||
| 0:12:14.779,0:12:16.580
 | ||
| here's your cluster
 | ||
| 
 | ||
| 0:12:16.580,0:12:17.489
 | ||
| here's a frontend
 | ||
| 
 | ||
| 0:12:17.489,0:12:19.840
 | ||
| here's this chunk of nodes, they're yours, go to it.
 | ||
| 
 | ||
| 0:12:19.840,0:12:21.930
 | ||
| Install your own OS, whatever you want
 | ||
| 
 | ||
| 0:12:21.930,0:12:25.580
 | ||
| it's yours
 | ||
| 
 | ||
| 0:12:25.580,0:12:30.310
 | ||
| and then and at a sort of finer scale there's things such as 
 | ||
| 
 | ||
| 0:12:30.310,0:12:31.800
 | ||
| you could use Emulab 
 | ||
| 
 | ||
| 0:12:31.800,0:12:36.300
 | ||
| which is the network emulation system but also does a OS install and configuration 
 | ||
| 
 | ||
| 0:12:36.300,0:12:39.300
 | ||
| so you could do dynamic allocation that way
 | ||
| 
 | ||
| 0:12:39.300,0:12:40.540
 | ||
| Sun's
 | ||
| 
 | ||
| 0:12:40.540,0:12:44.040
 | ||
| Project Hedeby now actually I think it's
 | ||
| called service domain manager
 | ||
| 
 | ||
| 0:12:44.040,0:12:46.500
 | ||
| is the productised version
 | ||
| 
 | ||
| 0:12:46.500,0:12:50.010
 | ||
| or some Clusters on Demand
 | ||
| 
 | ||
| 0:12:50.010,0:12:54.450
 | ||
| they were actually talking about web hosting clusters but
 | ||
| 
 | ||
| 0:12:54.450,0:12:57.780
 | ||
| things that allow rapid deployment let you
 | ||
| do that at a little
 | ||
| 
 | ||
| 0:12:57.780,0:12:59.510
 | ||
| little
 | ||
| 
 | ||
| 0:12:59.510,0:13:02.810
 | ||
| a more granular level than the 
 | ||
| 
 | ||
| 0:13:02.810,0:13:05.580
 | ||
| the allocate them once a year approach
 | ||
| 
 | ||
| 0:13:05.580,0:13:07.720
 | ||
| nonetheless
 | ||
| 
 | ||
| 0:13:07.720,0:13:11.220
 | ||
| lets you give people whole clusters to work with
 | ||
| 
 | ||
| 0:13:11.220,0:13:12.920
 | ||
| nice one nice thing about it is
 | ||
| 
 | ||
| 0:13:12.920,0:13:15.450
 | ||
| the isolation between the processes
 | ||
| 
 | ||
| 0:13:15.450,0:13:16.890
 | ||
| is complete
 | ||
| 
 | ||
| 
 | ||
| 0:13:16.890,0:13:20.800
 | ||
| so you don’t have to worry about users stomping on each other.
 | ||
| It’s their own system, they can trash it all they
 | ||
| 
 | ||
| 0:13:20.800,0:13:22.230
 | ||
| want
 | ||
| 
 | ||
| 0:13:22.230,0:13:24.709
 | ||
| if they flood the network or they 
 | ||
| 
 | ||
| 0:13:24.709,0:13:26.180
 | ||
| run the nodes into swap
 | ||
| 
 | ||
| 0:13:26.180,0:13:28.480
 | ||
| well that's their problem
 | ||
| 
 | ||
| 0:13:28.480,0:13:32.120
 | ||
| but it also has the advantage that you can tailor the images
 | ||
| 
 | ||
| 0:13:32.120,0:13:36.980
 | ||
| on the nodes, of the operating systems, to
 | ||
| meet the exact needs of the application
 | ||
| 
 | ||
| 0:13:36.980,0:13:40.560
 | ||
| down side of course is its coarse granularity, in our environment that doesn't work
 | ||
| 
 | ||
| 0:13:40.560,0:13:41.500
 | ||
| very well
 | ||
| 
 | ||
| 0:13:41.500,0:13:46.800
 | ||
|  since we do have all of these all these different types of jobs
 | ||
| 
 | ||
| 0:13:46.800,0:13:51.710
 | ||
| context switches are also pretty expensive. Certainly on the order of minutes 
 | ||
| 
 | ||
| 0:13:51.710,0:13:54.690
 | ||
| Emulab typically claim something like ten minutes
 | ||
| 
 | ||
| 0:13:54.690,0:13:57.970
 | ||
| there are some systems out there
 | ||
| 
 | ||
| 0:13:57.970,0:14:03.320
 | ||
| for instance if you use I think it’s Open Boot that
 | ||
| they're calling it today. It used to be LinuxBIOS 
 | ||
| 
 | ||
| 0:14:03.320,0:14:06.790
 | ||
| where you can actually deploy a system in
 | ||
| 
 | ||
| 0:14:06.790,0:14:08.700
 | ||
| tens of seconds
 | ||
| 
 | ||
| 0:14:08.700,0:14:11.520
 | ||
| mostly by getting rid of all that junk the BIOS writers wrote
 | ||
| 
 | ||
| 0:14:11.520,0:14:12.890
 | ||
| and
 | ||
| 
 | ||
| 0:14:12.890,0:14:17.770
 | ||
| the OS boots pretty fast if you don’t have all
 | ||
| that stuff to waylay you,
 | ||
| 
 | ||
| 0:14:17.770,0:14:19.940
 | ||
| but in practice on sort of
 | ||
| 
 | ||
| 0:14:19.940,0:14:21.660
 | ||
| off the shelf hardware
 | ||
| 
 | ||
| 0:14:21.660,0:14:24.400
 | ||
| the context switch times are quite high
 | ||
| 
 | ||
| 0:14:24.400,0:14:26.930
 | ||
| users of course can interfere with themselves
 | ||
| 
 | ||
| 0:14:26.930,0:14:29.200
 | ||
| you can argue it's not a problem but
 | ||
| 
 | ||
| 0:14:29.200,0:14:31.660
 | ||
| ideally you would like to prevent
 | ||
| that
 | ||
| 
 | ||
| 0:14:31.660,0:14:35.350
 | ||
| one of the things that I have to deal with
 | ||
| is that my users are
 | ||
| 
 | ||
| 0:14:35.350,0:14:37.830
 | ||
| almost universally
 | ||
| 
 | ||
| 0:14:37.830,0:14:40.410
 | ||
| not trained as computer scientists or programmers
 | ||
| 
 | ||
| 0:14:40.410,0:14:42.550
 | ||
| you know they’re trained in their domain area
 | ||
| 
 | ||
| 0:14:42.550,0:14:44.780
 | ||
| they're really good in that area
 | ||
| 
 | ||
| 0:14:44.780,0:14:48.389
 | ||
| but their concepts of the way hardware works and the
 | ||
| way software works
 | ||
| 
 | ||
| 0:14:48.389,0:14:55.389
 | ||
| don’t match reality in many cases
 | ||
| 
 | ||
| 0:15:01.269,0:15:02.830
 | ||
| (inaudible question)
 | ||
| It’s pretty rare in practice
 | ||
| 
 | ||
| 0:15:02.830,0:15:06.700
 | ||
| well I've heard one lab that does it significantly
 | ||
| 
 | ||
| 0:15:06.700,0:15:09.839
 | ||
| but it's like they do it on sort of a yearly
 | ||
| allocation basis
 | ||
| 
 | ||
| 0:15:09.839,0:15:12.790
 | ||
| and throw the hardware away after two or three years
 | ||
| 
 | ||
| 0:15:12.790,0:15:15.999
 | ||
| and you do typically have some sort of the deployment
 | ||
| 
 | ||
| 0:15:15.999,0:15:18.340
 | ||
| system in place
 | ||
| 
 | ||
| 0:15:18.340,0:15:20.680
 | ||
| or in those types of cases actually
 | ||
| 
 | ||
| 0:15:20.680,0:15:22.359
 | ||
| usually your application comes with
 | ||
| 
 | ||
| 0:15:22.359,0:15:26.500
 | ||
| and here's what we're going to spend on this many people 
 | ||
| 
 | ||
| 0:15:26.500,0:15:27.730
 | ||
| on this project so this is
 | ||
| 
 | ||
| 0:15:27.730,0:15:34.730
 | ||
| big resource allocation
 | ||
| 
 | ||
| 0:15:36.000,0:15:39.780
 | ||
| And yeah I guess one other issue with this is there's no real easy
 | ||
| 
 | ||
| 0:15:39.780,0:15:43.320
 | ||
| way to capture underutilized resources
 | ||
| for example
 | ||
| 
 | ||
| 0:15:43.320,0:15:44.389
 | ||
| if you have
 | ||
| 
 | ||
| 0:15:44.389,0:15:49.190
 | ||
| an application which you know say single-threaded
 | ||
| and uses a ton of memory
 | ||
| 
 | ||
| 0:15:49.190,0:15:51.210
 | ||
| and is running on a machine
 | ||
| 
 | ||
| 0:15:51.210,0:15:55.040
 | ||
| the machines we're buying these days are eight core so
 | ||
| 
 | ||
| 0:15:55.040,0:16:00.040
 | ||
| that’s wasting a lot of CPU cycles you're just
 | ||
| generating a lot of heat doing nothing
 | ||
| 
 | ||
| 0:16:00.040,0:16:03.890
 | ||
| so ideally you would like a scheduler that
 | ||
| said okay so you're using
 | ||
| 
 | ||
| 0:16:03.890,0:16:08.040
 | ||
| using eight or seven of the eight Gigabytes of
 | ||
| RAM but we've got these jobs
 | ||
| 
 | ||
| 0:16:08.040,0:16:10.080
 | ||
| sitting here that
 | ||
| 
 | ||
| 0:16:10.080,0:16:11.560
 | ||
| need next to nothing, need
 | ||
| 
 | ||
| 0:16:11.560,0:16:15.910
 | ||
| a hundred megabytes so we slap seven of
 | ||
| those in along with the big job
 | ||
| 
 | ||
| 0:16:15.910,0:16:18.580
 | ||
| and backfill and in this
 | ||
| 
 | ||
| 0:16:18.580,0:16:19.600
 | ||
| mechanism there's no
 | ||
| 
 | ||
| 0:16:19.600,0:16:21.810
 | ||
| there's no good way to do that
 | ||
| 
 | ||
| 0:16:21.810,0:16:26.820
 | ||
| obviously if the users have that application
 | ||
| mix they can do it themselves
 | ||
| 
 | ||
| 0:16:26.820,0:16:30.510
 | ||
| but it's not something where we can easily
 | ||
| bring in 
 | ||
| 
 | ||
| 0:16:30.510,0:16:35.090
 | ||
| bring in more jobs and have a mix to
 | ||
| take advantage of the different
 | ||
| 
 | ||
| 0:16:35.090,0:16:37.300
 | ||
| resources.
 | ||
| 
 | ||
| 0:16:37.300,0:16:39.940
 | ||
| A related approach is to 
 | ||
| 
 | ||
| 0:16:39.940,0:16:43.950
 | ||
| to install virtualization software on the
 | ||
| equipment and this is this is
 | ||
| 
 | ||
| 0:16:44.980,0:16:46.379
 | ||
| this is the essence of 
 | ||
| 
 | ||
| 0:16:46.379,0:16:49.800
 | ||
| what Cloud Computing is at the moment
 | ||
| 
 | ||
| 0:16:49.800,0:16:53.520
 | ||
| it's Amazon providing Xen
 | ||
| 
 | ||
| 0:16:53.520,0:16:55.129
 | ||
| Xen hosting for
 | ||
| 
 | ||
| 0:16:55.129,0:16:56.769
 | ||
| relatively arbitrary
 | ||
| 
 | ||
| 0:16:56.769,0:16:59.710
 | ||
| OS images
 | ||
| 
 | ||
| 0:16:59.710,0:17:02.720
 | ||
| it does have the advantage that it allows rapid deployment
 | ||
| 
 | ||
| 0:17:02.720,0:17:06.510
 | ||
| in theory if your application is scalable provides for 
 | ||
| 
 | ||
| 0:17:06.510,0:17:08.259
 | ||
| extremely high scalability
 | ||
| 
 | ||
| 0:17:08.259,0:17:10.110
 | ||
| particularly if you
 | ||
| 
 | ||
| 0:17:10.110,0:17:14.470
 | ||
| aren’t us and therefore can possibly use somebody else's hardware
 | ||
| 
 | ||
| 0:17:14.470,0:17:16.520
 | ||
| in our application's case that’s
 | ||
| 
 | ||
| 0:17:16.520,0:17:18.790
 | ||
| not very practical so
 | ||
| 
 | ||
| 0:17:18.790,0:17:20.360
 | ||
| we can't do that
 | ||
| 
 | ||
| 0:17:20.360,0:17:20.870
 | ||
| and
 | ||
| 
 | ||
| 0:17:20.870,0:17:23.790
 | ||
| it also has the advantage that you can run
 | ||
| 
 | ||
| 0:17:23.790,0:17:26.470
 | ||
| you can have people with their own image in there
 | ||
| 
 | ||
| 0:17:26.470,0:17:30.000
 | ||
| which is tightly resource constrained but you
 | ||
| can run more than one of them on a node. So for instance
 | ||
| 
 | ||
| 0:17:30.000,0:17:31.170
 | ||
| you can give
 | ||
| 
 | ||
| 0:17:31.170,0:17:32.730
 | ||
| one job
 | ||
| 
 | ||
| 0:17:32.730,0:17:35.489
 | ||
| four cores and another job two cores another
 | ||
| 
 | ||
| 0:17:35.489,0:17:37.500
 | ||
| you know and have a couple single core
 | ||
| 
 | ||
| 0:17:37.500,0:17:38.860
 | ||
| jobs in theory
 | ||
| 
 | ||
| 0:17:38.860,0:17:43.340
 | ||
| you can get fairly strong isolation there
 | ||
| obviously there are shared resources underneath 
 | ||
| 
 | ||
| 0:17:43.340,0:17:44.710
 | ||
| and you
 | ||
| 
 | ||
| 0:17:44.710,0:17:45.570
 | ||
| probably can't
 | ||
| 
 | ||
| 0:17:45.570,0:17:48.370
 | ||
| afford to completely isolate say network bandwidth
 | ||
| 
 | ||
| 0:17:48.370,0:17:49.520
 | ||
| at the bottom layer
 | ||
| 
 | ||
| 0:17:49.520,0:17:51.580
 | ||
| you can do some but
 | ||
| 
 | ||
| 0:17:51.580,0:17:56.170
 | ||
| if you go overboard you can spend all your time on accounting
 | ||
| 
 | ||
| 0:17:56.170,0:17:58.830
 | ||
| you also can again 
 | ||
| 
 | ||
| 0:17:58.830,0:18:01.410
 | ||
| tailor the images to the job
 | ||
| 
 | ||
| 0:18:01.410,0:18:05.030
 | ||
| and in this environment actually you can
 | ||
| do that even more strongly than that
 | ||
| 
 | ||
| 0:18:05.030,0:18:07.030
 | ||
| the sub-cluster approach 
 | ||
| 
 | ||
| 0:18:07.030,0:18:09.860
 | ||
| in that you can often do run
 | ||
| 
 | ||
| 0:18:09.860,0:18:16.360
 | ||
| a five-year-old operating system or ten-year-old
 | ||
| operating system if you're using full virtualization
 | ||
| 
 | ||
| 0:18:16.360,0:18:19.030
 | ||
| and that can allow
 | ||
| 
 | ||
| 0:18:19.030,0:18:23.820
 | ||
| allow obsolete code with weird baselines to work which is
 | ||
| important in our space because
 | ||
| 
 | ||
| 0:18:23.820,0:18:27.390
 | ||
| the average program runs ten years or more
 | ||
| 
 | ||
| 0:18:27.390,0:18:30.860
 | ||
| our average project runs ten years or more
 | ||
| 
 | ||
| 0:18:30.860,0:18:32.530
 | ||
| and as a result
 | ||
| 
 | ||
| 0:18:32.530,0:18:36.010
 | ||
| you might have to go rerun this program that was written
 | ||
| 
 | ||
| 0:18:36.010,0:18:37.320
 | ||
| way back on
 | ||
| 
 | ||
| 0:18:37.320,0:18:40.550
 | ||
| some ancient version of windows or whatever
 | ||
| 
 | ||
| 0:18:40.550,0:18:41.890
 | ||
| it also does provide 
 | ||
| 
 | ||
| 0:18:41.890,0:18:43.840
 | ||
| the ability to recover resources
 | ||
| 
 | ||
| 0:18:43.840,0:18:45.290
 | ||
| as I was talking about before
 | ||
| 
 | ||
| 0:18:45.290,0:18:49.530
 | ||
| but you can't do easily with sub-clusters because you can’t just slip 
 | ||
| 
 | ||
| 0:18:49.530,0:18:50.360
 | ||
| another image
 | ||
| 
 | ||
| 0:18:50.360,0:18:52.910
 | ||
| on there and say you can use anything and
 | ||
| 
 | ||
| 0:18:52.910,0:18:56.730
 | ||
| you know give that image idle priority essentially
 | ||
| 
 | ||
| 0:18:56.730,0:19:00.480
 | ||
| down side of course is that it is incomplete
 | ||
| isolation and that there is a shared
 | ||
| 
 | ||
| 0:19:00.480,0:19:02.340
 | ||
| hardware
 | ||
| 
 | ||
| 0:19:02.340,0:19:06.490
 | ||
| you're not likely to find I don't think
 | ||
| any the virtualization systems out there
 | ||
| 
 | ||
| 0:19:06.490,0:19:08.890
 | ||
| right now
 | ||
| 
 | ||
| 0:19:08.890,0:19:09.890
 | ||
| virtualize
 | ||
| 
 | ||
| 0:19:09.890,0:19:11.470
 | ||
| your segment of
 | ||
| 
 | ||
| 0:19:11.470,0:19:13.540
 | ||
| memory bandwidth
 | ||
| 
 | ||
| 0:19:13.540,0:19:15.159
 | ||
| or your segment
 | ||
| 
 | ||
| 0:19:15.159,0:19:16.390
 | ||
| of cache
 | ||
| 
 | ||
| 0:19:16.390,0:19:18.390
 | ||
| of cache space
 | ||
| 
 | ||
| 0:19:18.390,0:19:24.809
 | ||
| so users can in fact interfere with themselves and each other in this
 | ||
| environment
 | ||
| 
 | ||
| 0:19:24.809,0:19:25.589
 | ||
| it's also
 | ||
| 
 | ||
| 0:19:25.589,0:19:30.479
 | ||
| not really efficient for small jobs; the cost of running an
 | ||
| entire OS for every
 | ||
| 
 | ||
| 0:19:30.479,0:19:33.020
 | ||
| job is fairly high
 | ||
| 
 | ||
| 0:19:33.020,0:19:34.020
 | ||
| even with
 | ||
| 
 | ||
| 0:19:34.020,0:19:34.710
 | ||
| relatively light
 | ||
| 
 | ||
| 0:19:34.710,0:19:38.250
 | ||
| Unix-like OSes you're still looking at
 | ||
| 
 | ||
| 0:19:38.250,0:19:40.900
 | ||
| couple hundred megabytes in practice
 | ||
| 
 | ||
| 0:19:40.900,0:19:46.240
 | ||
| once you get everything up and running unless you run something
 | ||
| totally stripped down
 | ||
| 
 | ||
| 0:19:47.230,0:19:49.460
 | ||
| there’s significant overhead 
 | ||
| 
 | ||
| 0:19:49.460,0:19:52.240
 | ||
| there’s CPU slowdown typically in the
 | ||
| 
 | ||
| 0:19:52.240,0:19:55.360
 | ||
| you know typical estimates are in the twenty
 | ||
| percent range
 | ||
| 
 | ||
| 0:19:55.360,0:20:00.450
 | ||
| numbers really range from fifty percent to
 | ||
| five percent depending on what exactly you're doing
 | ||
| 
 | ||
| 0:20:00.450,0:20:02.100
 | ||
| possibly even lower
 | ||
| 
 | ||
| 0:20:02.100,0:20:04.830
 | ||
| or higher
 | ||
| 
 | ||
| 0:20:04.830,0:20:05.870
 | ||
| and just
 | ||
| 
 | ||
| 0:20:05.870,0:20:09.920
 | ||
| you know the overhead because you have the whole OS there's a lot of a lot
 | ||
| 
 | ||
| 0:20:09.920,0:20:11.420
 | ||
| of duplicate
 | ||
| 
 | ||
| 0:20:11.420,0:20:13.320
 | ||
| stuff
 | ||
| 
 | ||
| 0:20:13.320,0:20:15.010
 | ||
| the various vendors
 | ||
| 
 | ||
| 0:20:15.010,0:20:17.090
 | ||
| have their answers they claim you know we can
 | ||
| 
 | ||
| 0:20:17.090,0:20:21.430
 | ||
| we can merge that and say oh you're running the same kernel so we'll keep your memory
 | ||
| 
 | ||
| 0:20:21.430,0:20:24.120
 | ||
| we use the same memory but
 | ||
| 
 | ||
| 0:20:24.120,0:20:25.220
 | ||
| at some level
 | ||
| 
 | ||
| 0:20:25.220,0:20:29.309
 | ||
| it's all going to get duplicated.
 | ||
| 
 | ||
| 0:20:29.309,0:20:30.590
 | ||
| A related option  
 | ||
| 
 | ||
| 0:20:30.590,0:20:34.820
 | ||
| comes from sort of the internet hosting
 | ||
| industry which is to use virtual private
 | ||
| 
 | ||
| 0:20:34.820,0:20:38.130
 | ||
| which is the technology from virtual private servers
 | ||
| 
 | ||
| 0:20:38.130,0:20:42.110
 | ||
| the example that everyone here is probably familiar with is Jails where
 | ||
| 
 | ||
| 0:20:42.110,0:20:44.130
 | ||
| you can provide
 | ||
| 
 | ||
| 0:20:44.130,0:20:46.720
 | ||
| your own file system root
 | ||
| 
 | ||
| 0:20:46.720,0:20:49.060
 | ||
| your own network interface
 | ||
| 
 | ||
| 0:20:49.060,0:20:50.620
 | ||
| and what not
 | ||
| 
 | ||
| 0:20:50.620,0:20:51.500
 | ||
| and 
 | ||
| 
 | ||
| 0:20:51.500,0:20:53.129
 | ||
| the nice thing about this is
 | ||
| 
 | ||
| 0:20:53.129,0:20:56.210
 | ||
| that unlike full virtualization
 | ||
| 
 | ||
| 0:20:56.210,0:20:58.680
 | ||
| the overhead is very small
 | ||
| 
 | ||
| 0:20:58.680,0:21:01.030
 | ||
| basically it costs you
 | ||
| 
 | ||
| 
 | ||
| 0:21:01.030,0:21:02.820
 | ||
| an entry in your process table
 | ||
| 
 | ||
| 0:21:02.820,0:21:05.570
 | ||
| or an entry in a few structures
 | ||
| 
 | ||
| 0:21:05.570,0:21:08.760
 | ||
| there's some extra tests in the kernel but otherwise
 | ||
| 
 | ||
| 0:21:10.220,0:21:14.900
 | ||
| there's not a huge overhead for virtualization you don't need
 | ||
| an extra kernel for every
 | ||
| 
 | ||
| 0:21:14.900,0:21:15.460
 | ||
| image
 | ||
| 
 | ||
| 0:21:15.460,0:21:18.390
 | ||
| so you get the difference here
 | ||
| between
 | ||
| 
 | ||
| 0:21:18.390,0:21:21.620
 | ||
| be able to run maybe
 | ||
| 
 | ||
| 0:21:21.620,0:21:25.250
 | ||
| you might be able to squeeze two hundred VMWare images onto a machine
 | ||
| 
 | ||
| 0:21:25.250,0:21:29.620
 | ||
| VMWare people say no no don't do that but we have machines that are running
 | ||
| 
 | ||
| 0:21:29.620,0:21:30.509
 | ||
| nearly that many.
 | ||
| 
 | ||
| 0:21:34.790,0:21:38.289
 | ||
| On the other hand there are people out there who run thousands of
 | ||
| 
 | ||
| 0:21:38.289,0:21:40.730
 | ||
| virtual hosts
 | ||
| 
 | ||
| 0:21:40.730,0:21:43.170
 | ||
| using this technique on a single machine so
 | ||
| 
 | ||
| 0:21:43.170,0:21:45.200
 | ||
| big difference in resource use
 | ||
| 
 | ||
| 0:21:45.200,0:21:46.400
 | ||
| on especially with light
 | ||
| 
 | ||
| 0:21:46.400,0:21:48.070
 | ||
| in the lightly loaded use
 | ||
| 
 | ||
| 0:21:48.070,0:21:52.400
 | ||
| in our environment we're looking more running a very small number of them but still
 | ||
| 
 | ||
| 0:21:52.400,0:21:55.880
 | ||
| that overhead is significant
 | ||
| 
 | ||
| 0:21:55.880,0:21:59.440
 | ||
| you still do have some ability to tailor the 
 | ||
| 
 | ||
| 0:21:59.440,0:22:01.670
 | ||
| images to a job’s needs
 | ||
| 
 | ||
| 0:22:01.670,0:22:03.309
 | ||
| you could have a
 | ||
| 
 | ||
| 0:22:03.309,0:22:05.400
 | ||
| custom root that for instance you could be running
 | ||
| 
 | ||
| 0:22:05.400,0:22:07.380
 | ||
| FreeBSD 6.0 in one 
 | ||
| 
 | ||
| 0:22:07.380,0:22:08.650
 | ||
| in one 
 | ||
| 
 | ||
| 0:22:08.650,0:22:11.040
 | ||
| virtual server and 7.0 in another
 | ||
| 
 | ||
| 0:22:11.040,0:22:15.090
 | ||
| you have to be running of course 7.0 kernel or 8.0 kernel to make
 | ||
| that work
 | ||
| 
 | ||
| 0:22:15.090,0:22:16.330
 | ||
| but it allows you to do that
 | ||
| 
 | ||
| 0:22:16.330,0:22:18.500
 | ||
| we also in principle can do
 | ||
| 
 | ||
| 0:22:18.500,0:22:23.080
 | ||
| evil things like our 64-bit kernel and then 32-bit
 | ||
| user spaces because
 | ||
| 
 | ||
| 0:22:23.080,0:22:26.400
 | ||
| say you have applications that you can't find the source to anymore
 | ||
| 
 | ||
| 0:22:26.400,0:22:31.830
 | ||
| or libraries you don't
 | ||
| have the source to any more
 | ||
| 
 | ||
| 0:22:31.830,0:22:32.990
 | ||
| an answer
 | ||
| 
 | ||
| 0:22:32.990,0:22:34.150
 | ||
| interesting things there
 | ||
| 
 | ||
| 0:22:34.150,0:22:36.680
 | ||
| and the other nice thing is since you're 
 | ||
| 
 | ||
| 0:22:36.680,0:22:39.629
 | ||
| you're doing a very lightweight and incomplete
 | ||
| virtualization
 | ||
| 
 | ||
| 0:22:39.629,0:22:43.269
 | ||
| you don't have to virtualize things you don't
 | ||
| care about so you don’t have the overhead  of
 | ||
| 
 | ||
| 0:22:43.269,0:22:45.520
 | ||
| virtualizing everything.
 | ||
| 
 | ||
| 0:22:45.520,0:22:48.070
 | ||
| Downsides of course are incomplete isolation
 | ||
| 
 | ||
| 0:22:48.070,0:22:50.690
 | ||
| you are running processes on the same kernel
 | ||
| 
 | ||
| 0:22:50.690,0:22:52.770
 | ||
| and they can interfere with each other
 | ||
| 
 | ||
| 0:22:52.770,0:22:55.320
 | ||
| and there's dubious flexibility obviously 
 | ||
| 
 | ||
| 0:22:55.320,0:22:57.900
 | ||
| I don't think anyone
 | ||
| 
 | ||
| 0:22:57.900,0:23:01.850
 | ||
| should have the ability to run Windows in a jail.
 | ||
| 
 | ||
| 0:23:01.850,0:23:02.860
 | ||
| There’s some 
 | ||
| 
 | ||
| 0:23:02.860,0:23:04.960
 | ||
| NetBSD support but
 | ||
| 
 | ||
| 0:23:04.960,0:23:10.510
 | ||
| and I don’t think it's really gotten to that point.
 | ||
| 
 | ||
| 0:23:10.510,0:23:12.420
 | ||
| One final area
 | ||
| 
 | ||
| 0:23:12.420,0:23:14.350
 | ||
| that sort of diverges from this
 | ||
| 
 | ||
| 0:23:14.350,0:23:16.159
 | ||
| is the classic
 | ||
| 
 | ||
| 0:23:16.159,0:23:18.400
 | ||
| Unix solution to the problem
 | ||
| 
 | ||
| 0:23:18.400,0:23:20.580
 | ||
| on this on single
 | ||
| 
 | ||
| 0:23:20.580,0:23:22.070
 | ||
| in a single machine
 | ||
| 
 | ||
| 0:23:22.070,0:23:22.800
 | ||
| which is
 | ||
| 
 | ||
| 0:23:22.800,0:23:28.950
 | ||
| to use existing resource limits and resource partitioning techniques
 | ||
| 
 | ||
| 0:23:28.950,0:23:33.430
 | ||
| you know for example all Unix-like or Unix systems have per-process
 | ||
| resource limits
 | ||
| 
 | ||
| 0:23:33.430,0:23:36.240
 | ||
| a resource and typically
 | ||
| 
 | ||
| 0:23:36.240,0:23:36.999
 | ||
| scheduler a
 | ||
| 
 | ||
| 
 | ||
| 0:23:38.340,0:23:41.510
 | ||
| cluster schedulers support the common ones
 | ||
| 
 | ||
| 0:23:41.510,0:23:43.150
 | ||
| so you can set a
 | ||
| 
 | ||
| 0:23:43.150,0:23:47.230
 | ||
| memory limit on your process or a CPU time limit on your process
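
(A rough illustrative sketch, not code from the talk: per-process limits like these can be applied by a small launcher with setrlimit(2) before exec'ing the job. The limit values and the wrapper idea here are assumptions for illustration only.)

    /* Hedged sketch: cap CPU seconds and address space, then exec the job.
     * The values are arbitrary; a real scheduler would take them from the
     * job's resource request. Limits are inherited across exec. */
    #include <sys/types.h>
    #include <sys/resource.h>
    #include <stdio.h>
    #include <unistd.h>

    int main(int argc, char *argv[])
    {
        struct rlimit cpu = { 3600, 3600 };              /* one hour of CPU time */
        struct rlimit mem = { 1UL << 30, 1UL << 30 };    /* 1 GiB of address space */

        if (argc < 2) {
            fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
            return 1;
        }
        if (setrlimit(RLIMIT_CPU, &cpu) != 0 || setrlimit(RLIMIT_AS, &mem) != 0) {
            perror("setrlimit");
            return 1;
        }
        execvp(argv[1], &argv[1]);   /* the job now runs under the limits */
        perror("execvp");
        return 1;
    }
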
 | ||
| 
 | ||
| 0:23:47.230,0:23:49.830
 | ||
| and the schedulers typically provide
 | ||
| 
 | ||
| 0:23:49.830,0:23:51.350
 | ||
| at least
 | ||
| 
 | ||
| 0:23:51.350,0:23:54.740
 | ||
| launch support for 
 | ||
| 
 | ||
| 0:23:54.740,0:23:56.850
 | ||
| the limits on
 | ||
| 
 | ||
| 0:23:56.850,0:24:01.900
 | ||
| a given set of processes that's part of the job
 | ||
| 
 | ||
| 0:24:01.900,0:24:02.850
 | ||
| also the most
 | ||
| 
 | ||
| 0:24:02.850,0:24:05.640
 | ||
| you know there are a number of forms of resource
 | ||
| partitioning that
 | ||
| 
 | ||
| 0:24:05.640,0:24:07.170
 | ||
| are available
 | ||
| 
 | ||
| 0:24:08.100,0:24:09.700
 | ||
| as a standard feature
 | ||
| 
 | ||
| 0:24:09.700,0:24:12.000
 | ||
| on so memory discs are one of them so
 | ||
| 
 | ||
| 0:24:12.000,0:24:16.800
 | ||
| if you want to create a file system space that’s
 | ||
| limited in size, create a memory disc
 | ||
| 
 | ||
| 0:24:16.800,0:24:17.969
 | ||
| and back it 
 | ||
| 
 | ||
| 0:24:17.969,0:24:21.130
 | ||
| and back it with an mmapped file
 | ||
| 
 | ||
| 0:24:21.130,0:24:22.520
 | ||
| or swap
 | ||
| 
 | ||
| 0:24:22.520,0:24:24.570
 | ||
| as a way of partitioning
 | ||
| 
 | ||
| 0:24:24.570,0:24:26.330
 | ||
| disc use
 | ||
| 
 | ||
| 0:24:26.330,0:24:30.330
 | ||
| and then there are techniques like CPU affinities where you can lock
 | ||
| processes to
 | ||
| 
 | ||
| 0:24:30.330,0:24:32.010
 | ||
| a single process
 | ||
| 
 | ||
| 0:24:32.010,0:24:34.540
 | ||
| processor or a set of processors
 | ||
| 
 | ||
| 0:24:34.540,0:24:39.310
 | ||
| and so they can't interfere with each other
 | ||
| with processes running on other processors
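
(Again only as a hedged sketch of the technique being described, and assuming FreeBSD 7.1's cpuset_setaffinity(2) rather than anything shown in the talk: pinning a process to two specific cores looks roughly like this.)

    /* Hedged sketch: restrict the calling process to CPUs 0 and 1 so it does
     * not compete with jobs placed on the other cores (FreeBSD 7.1 or later). */
    #include <sys/param.h>
    #include <sys/cpuset.h>
    #include <stdio.h>

    int main(void)
    {
        cpuset_t mask;

        CPU_ZERO(&mask);
        CPU_SET(0, &mask);
        CPU_SET(1, &mask);

        /* id == -1 means "apply to the calling process". */
        if (cpuset_setaffinity(CPU_LEVEL_WHICH, CPU_WHICH_PID, -1,
                sizeof(mask), &mask) != 0) {
            perror("cpuset_setaffinity");
            return 1;
        }
        /* This process, and anything it execs, now runs only on those two CPUs. */
        return 0;
    }
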
 | ||
| 
 | ||
| 
 | ||
| 0:24:39.310,0:24:44.280
 | ||
| the nice thing about this first is that you're using existing
 | ||
| facilities so you don’t have to rewrite
 | ||
| 
 | ||
| 0:24:44.280,0:24:46.170
 | ||
| lots of new features
 | ||
| 
 | ||
| 0:24:46.170,0:24:49.590
 | ||
| for a niche application
 | ||
| 
 | ||
| 0:24:49.590,0:24:52.790
 | ||
| and they tend to integrate well with existing schedulers
 | ||
| in many cases
 | ||
| 
 | ||
| 0:24:52.790,0:24:55.940
 | ||
| parts of them are already implemented
 | ||
| 
 | ||
| 0:24:55.940,0:24:59.650
 | ||
| and in fact the experiments that I'll talk about later are all using
 | ||
| this type of
 | ||
| 
 | ||
| 0:24:59.650,0:25:02.160
 | ||
| technique.
 | ||
| 
 | ||
| 0:25:02.160,0:25:02.830
 | ||
| Cons are of course
 | ||
| 
 | ||
| 0:25:02.830,0:25:04.850
 | ||
| incomplete isolation again
 | ||
| 
 | ||
| 0:25:04.850,0:25:08.270
 | ||
| and there’s typically no unified framework
 | ||
| 
 | ||
| 0:25:08.270,0:25:12.310
 | ||
| for the concept of a job, when a job is composed of a set of processes
 | ||
| 
 | ||
| 0:25:12.310,0:25:16.710
 | ||
| yeah there are a number of data structures within the kernel for
 | ||
| instance the session
 | ||
| 
 | ||
| 0:25:16.710,0:25:18.120
 | ||
| which
 | ||
| 
 | ||
| 0:25:18.120,0:25:19.499
 | ||
| sort of aggregate processes
 | ||
| 
 | ||
| 0:25:19.499,0:25:20.990
 | ||
| but there isn’t one
 | ||
| 
 | ||
| 0:25:22.230,0:25:24.800
 | ||
| in BSD or Linux at this point
 | ||
| 
 | ||
| 0:25:24.800,0:25:29.020
 | ||
| which allows you to place resource limits on those in the way that you can on a process
 | ||
| 
 | ||
| 0:25:29.020,0:25:32.520
 | ||
| IRIX did have support like that
 | ||
| 
 | ||
| 0:25:32.520,0:25:34.160
 | ||
| where they have a job ID
 | ||
| 
 | ||
| 0:25:34.160,0:25:36.210
 | ||
| and there could be a job limit
 | ||
| 
 | ||
| 0:25:36.210,0:25:38.280
 | ||
| and Solaris projects
 | ||
| 
 | ||
| 0:25:38.280,0:25:41.320
 | ||
| are sort of similar but not quite the same
 | ||
| 
 | ||
| 0:25:41.320,0:25:43.149
 | ||
| processes or part of a project but
 | ||
| 
 | ||
| 0:25:43.149,0:25:46.770
 | ||
| it's not quite the same inherited relationship
 | ||
| 
 | ||
| 0:25:47.720,0:25:49.500
 | ||
| and typically
 | ||
| 
 | ||
| 0:25:49.500,0:25:50.900
 | ||
| there aren’t
 | ||
| 
 | ||
| 0:25:50.900,0:25:55.390
 | ||
| limits on things like bandwidth. There was
 | ||
| 
 | ||
| 0:25:55.390,0:25:56.430
 | ||
| a sort of a
 | ||
| 
 | ||
| 0:25:56.430,0:25:58.350
 | ||
| bandwidth limiting 
 | ||
| 
 | ||
| 0:25:58.350,0:26:00.630
 | ||
| nice type interface
 | ||
| 
 | ||
| 0:26:00.630,0:26:01.950
 | ||
| on that I saw 
 | ||
| 
 | ||
| 0:26:01.950,0:26:03.720
 | ||
| posted as a research project
 | ||
| 
 | ||
| 0:26:03.720,0:26:07.150
 | ||
| many years ago I think in the 2.x days
 | ||
| 
 | ||
| 0:26:07.150,0:26:09.880
 | ||
| where you could say this process can have
 | ||
| 
 | ||
| 0:26:09.880,0:26:11.580
 | ||
| you know five megabits
 | ||
| 
 | ||
| 0:26:11.580,0:26:12.530
 | ||
| or whatever
 | ||
| 
 | ||
| 0:26:12.530,0:26:14.380
 | ||
| but I haven't really seen anything take off
 | ||
| 
 | ||
| 0:26:14.380,0:26:16.940
 | ||
| that would be a pretty neat thing to have
 | ||
| 
 | ||
| 0:26:16.940,0:26:19.309
 | ||
| actually one other exception there
 | ||
| 
 | ||
| 0:26:19.309,0:26:22.230
 | ||
| is on IRIX again
 | ||
| 
 | ||
| 0:26:22.230,0:26:28.210
 | ||
| the XFS file system supported guaranteed data rates on file handles
 | ||
| you could say
 | ||
| 
 | ||
| 0:26:28.210,0:26:30.140
 | ||
| you could open a file and say I need
 | ||
| 
 | ||
| 0:26:30.140,0:26:32.940
 | ||
| ten megabits read or ten megabits write 
 | ||
| 
 | ||
| 0:26:32.940,0:26:34.029
 | ||
| or whatever and it would say
 | ||
| 
 | ||
| 0:26:34.029,0:26:35.529
 | ||
| okay or no
 | ||
| 
 | ||
| 0:26:35.529,0:26:39.279
 | ||
| and then you could read and write and
 | ||
| it would do evil things at the file system layer 
 | ||
| 
 | ||
| 0:26:39.279,0:26:40.600
 | ||
| in some cases
 | ||
| 
 | ||
| 0:26:40.600,0:26:43.940
 | ||
| all to ensure that you could get that streaming data rate
 | ||
| 
 | ||
| 0:26:44.900,0:26:49.710
 | ||
| by keeping the file.
 | ||
| 
 | ||
| 
 | ||
| 0:26:49.710,0:26:53.620
 | ||
| So now I’m going to talk about what we've done
 | ||
| 
 | ||
| 0:26:53.620,0:26:59.510
 | ||
| what we needed was a solution to handle
 | ||
| a wide range of job types
 | ||
| 
 | ||
| 0:26:59.510,0:27:01.570
 | ||
| So of the options we looked at for instance
 | ||
| 
 | ||
| 0:27:01.570,0:27:04.990
 | ||
| single application clusters or
 | ||
| project clusters
 | ||
| 
 | ||
| 0:27:04.990,0:27:11.990
 | ||
| I think that the isolation they
 | ||
| provide is essentially unparalleled
 | ||
| 
 | ||
| 0:27:12.590,0:27:16.630
 | ||
| and in our environment we probably have to
 | ||
| virtualize in order to be
 | ||
| 
 | ||
| 0:27:16.630,0:27:18.179
 | ||
| efficient in terms of 
 | ||
| 
 | ||
| 0:27:18.179,0:27:22.060
 | ||
| being able to handle our job mix and what not and handle
 | ||
| the fact that our users
 | ||
| 
 | ||
| 0:27:22.060,0:27:23.740
 | ||
| tend to have
 | ||
| 
 | ||
| 0:27:23.740,0:27:27.730
 | ||
| spikes in their use
 | ||
| 
 | ||
| 0:27:27.730,0:27:32.799
 | ||
| on a large scale so for instance we get GPS we’ll show up and say
 | ||
| we need to run for a month
 | ||
| 
 | ||
| 0:27:32.799,0:27:33.780
 | ||
| on and then
 | ||
| 
 | ||
| 0:27:33.780,0:27:38.460
 | ||
| some indeterminate number of months later
 | ||
| they'll do it again
 | ||
| 
 | ||
| 0:27:38.460,0:27:40.840
 | ||
| for that sort of quick
 | ||
| 
 | ||
| 0:27:40.840,0:27:41.480
 | ||
| demands
 | ||
| 
 | ||
| 0:27:42.240,0:27:44.850
 | ||
| we really need the virtuals something
 | ||
| virtualized
 | ||
| 
 | ||
| 0:27:44.850,0:27:47.120
 | ||
| and then we have to pay the price of
 | ||
| 
 | ||
| 0:27:47.120,0:27:48.380
 | ||
| of the overhead
 | ||
| 
 | ||
| 0:27:48.380,0:27:51.590
 | ||
| and again it doesn't handle small jobs well and that is a
 | ||
| 
 | ||
| 0:27:51.590,0:27:54.050
 | ||
| large portion of our job mix so
 | ||
| 
 | ||
| 0:27:54.050,0:27:55.180
 | ||
| of the
 | ||
| 
 | ||
| 0:27:55.180,0:27:58.070
 | ||
| quarter million or something jobs we’ve run
 | ||
| 
 | ||
| 0:27:58.070,0:27:59.700
 | ||
| on our cluster
 | ||
| 
 | ||
| 0:27:59.700,0:28:02.490
 | ||
| I would guess that
 | ||
| 
 | ||
| 0:28:02.490,0:28:04.730
 | ||
| more than half of those were submitted
 | ||
| 
 | ||
| 0:28:04.730,0:28:05.890
 | ||
| in
 | ||
| 
 | ||
| 0:28:05.890,0:28:09.660
 | ||
| batches of more than ten thousand
 | ||
| 
 | ||
| 0:28:09.660,0:28:11.400
 | ||
| so they'll just pop up
 | ||
| 
 | ||
| 0:28:11.400,0:28:14.030
 | ||
| the other method we have looked at
 | ||
| 
 | ||
| 0:28:14.800,0:28:16.750
 | ||
| are using resource limits
 | ||
| 
 | ||
| 0:28:16.750,0:28:19.060
 | ||
| the nice thing of course is they're achievable
 | ||
| with
 | ||
| 
 | ||
| 0:28:19.060,0:28:21.429
 | ||
| they achieve useful isolation
 | ||
| 
 | ||
| 0:28:21.429,0:28:26.289
 | ||
| and they’re implementable with either existing functionality or small
 | ||
| extensions so that's what we’ve
 | ||
| 
 | ||
| 0:28:26.289,0:28:27.230
 | ||
| been concentrating on.
 | ||
| 
 | ||
| 0:28:27.230,0:28:29.740
 | ||
| We’ve also been doing some thinking about
 | ||
| 
 | ||
| 0:28:29.740,0:28:31.809
 | ||
| could we use the techniques there
 | ||
| 
 | ||
| 0:28:31.809,0:28:33.940
 | ||
| and combine them with jails
 | ||
| 
 | ||
| 0:28:33.940,0:28:36.170
 | ||
| or related features
 | ||
| 
 | ||
| 0:28:36.170,0:28:40.019
 | ||
| maybe bulking up jails to be more like zones in Solaris
 | ||
| 
 | ||
| 0:28:40.019,0:28:44.150
 | ||
| or containers I think they're calling them this
 | ||
| week
 | ||
| 
 | ||
| 0:28:44.150,0:28:44.840
 | ||
| and
 | ||
| 
 | ||
| 0:28:44.840,0:28:46.770
 | ||
| so we're looking at that as well
 | ||
| 
 | ||
| 0:28:46.770,0:28:50.840
 | ||
| to be able to provide
 | ||
| 
 | ||
| 
 | ||
| 0:28:50.840,0:28:54.250
 | ||
| to be able to provide per-user operating environments
 | ||
| 
 | ||
| 0:28:54.250,0:28:59.200
 | ||
| potentially isolating users from upgrades so for instance as we upgrade the kernel
 | ||
| 
 | ||
| 0:28:59.200,0:29:03.469
 | ||
| and users can continue using the old
 | ||
| images they don't have time to rebuild their
 | ||
| 
 | ||
| 0:29:03.469,0:29:04.330
 | ||
| application in
 | ||
| 
 | ||
| 0:29:04.330,0:29:09.970
 | ||
| and handle the updates in libraries and what not
 | ||
| 
 | ||
| 0:29:09.970,0:29:13.840
 | ||
| they also have the potential to provide strong isolation for security
 | ||
| purposes
 | ||
| 
 | ||
| 0:29:13.840,0:29:18.740
 | ||
| which could be useful in the future.
 | ||
| 
 | ||
| 0:29:18.740,0:29:20.159
 | ||
| We do think that
 | ||
| 
 | ||
| 0:29:20.159,0:29:24.040
 | ||
| of these mechanisms the nice thing is that
 | ||
| resource limit
 | ||
| 
 | ||
| 0:29:24.040,0:29:26.150
 | ||
| the resource limits and partitioning scheme
 | ||
| 
 | ||
| 0:29:26.150,0:29:29.860
 | ||
| as well as virtual private servers have very
 | ||
| similar implementation requirements
 | ||
| 
 | ||
| 0:29:29.860,0:29:33.090
 | ||
| setup is a fair bit more expensive
 | ||
| 
 | ||
| 0:29:33.090,0:29:34.620
 | ||
| in the VPS case
 | ||
| 
 | ||
| 0:29:34.620,0:29:38.780
 | ||
| but nonetheless they're fairly similar.
 | ||
| 
 | ||
| 0:29:38.780,0:29:42.610
 | ||
| So, what we've been doing is we've taken the Sun Grid Engine
 | ||
| 
 | ||
| 0:29:42.610,0:29:46.880
 | ||
| and we originally intended to actually
 | ||
| extend Sun Grid Engine and modify its daemons
 | ||
| 
 | ||
| 0:29:46.880,0:29:48.480
 | ||
| to do the work
 | ||
| 
 | ||
| 0:29:48.480,0:29:51.150
 | ||
| what we ended up doing instead is realizing
 | ||
| that well
 | ||
| 
 | ||
| 0:29:51.150,0:29:54.910
 | ||
| we can actually specify an alternate program
 | ||
| to run instead of the shepherd
 | ||
| 
 | ||
| 0:29:54.910,0:29:57.990
 | ||
| The shepherd is the process
 | ||
| 
 | ||
| 0:29:57.990,0:30:00.580
 | ||
| that starts all
 | ||
| 
 | ||
| 0:30:00.580,0:30:02.250
 | ||
| starts the script that
 | ||
| 
 | ||
| 0:30:02.250,0:30:03.380
 | ||
| runs for each job
 | ||
| 
 | ||
| 0:30:03.380,0:30:04.920
 | ||
| on a given node
 | ||
| 
 | ||
| 0:30:04.920,0:30:08.559
 | ||
| it collects usage and forwards signals to the
 | ||
| children
 | ||
| 
 | ||
| 0:30:08.559,0:30:12.620
 | ||
| and also is responsible for starting remote
 | ||
| components
 | ||
| 
 | ||
| 0:30:12.620,0:30:14.560
 | ||
| so a shepherd is started and then
 | ||
| 
 | ||
| 0:30:14.560,0:30:17.640
 | ||
| traditionally in Sun grid engine it starts out
 | ||
| 
 | ||
| 0:30:17.640,0:30:19.910
 | ||
| its own RShell Daemon
 | ||
| 
 | ||
| 0:30:19.910,0:30:20.800
 | ||
| and 
 | ||
| 
 | ||
| 0:30:20.800,0:30:22.010
 | ||
| jobs connect over
 | ||
| 
 | ||
| 0:30:22.010,0:30:23.670
 | ||
| these days they have their own
 | ||
| 
 | ||
| 0:30:23.670,0:30:25.870
 | ||
| mechanism which is
 | ||
| 
 | ||
| 0:30:25.870,0:30:26.950
 | ||
| secure
 | ||
| 
 | ||
| 0:30:26.950,0:30:28.000
 | ||
| not using the
 | ||
| 
 | ||
| 0:30:28.840,0:30:30.530
 | ||
| crafty old RShell code.
 | ||
| 
 | ||
| 0:30:35.370,0:30:37.970
 | ||
| So what we've done is we've implemented a wrapper script
 | ||
| 
 | ||
| 0:30:37.970,0:30:40.139
 | ||
| which allows a pre-command hook
 | ||
| 
 | ||
| 0:30:40.139,0:30:42.559
 | ||
| to run before the shepherd starts
 | ||
| 
 | ||
| 0:30:42.559,0:30:47.170
 | ||
| the command wrapper so before we start shepherd we can run like the N program
 | ||
| 
 | ||
| 0:30:47.170,0:30:49.150
 | ||
| or we can run
 | ||
| 
 | ||
| 0:30:49.150,0:30:50.430
 | ||
| TRUE to whatever
 | ||
| 
 | ||
| 0:30:50.430,0:30:54.040
 | ||
| to set up the environment that it runs in or CPU
 | ||
| 
 | ||
| 0:30:54.040,0:30:56.600
 | ||
| sets as I'll show later
 | ||
| 
 | ||
| 0:30:56.600,0:30:58.750
 | ||
| and a post command hook for cleanup
 | ||
| 
 | ||
| 0:30:58.750,0:31:03.940
 | ||
| it's implemented in Ruby because I felt like it.
 | ||
| 
 | ||
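To make the wrapper idea concrete, here is a minimal sketch in Python rather than the Ruby we actually use; the hook and shepherd paths below are illustrative assumptions, not our production layout.

    #!/usr/bin/env python
    # Minimal shepherd wrapper sketch: run a site pre-command hook, run the
    # real shepherd, then run a cleanup hook.  All paths are made up.
    import os
    import subprocess
    import sys

    REAL_SHEPHERD = "/opt/sge/bin/sge_shepherd"     # assumed install location
    PRE_HOOK = "/usr/local/libexec/job-pre-hook"    # e.g. mount a memory /tmp, set up a CPU set
    POST_HOOK = "/usr/local/libexec/job-post-hook"  # e.g. tear it all down again

    def run_hook(path):
        # Hooks are optional and see the same environment the shepherd sees.
        if os.path.exists(path):
            subprocess.check_call([path])

    def main():
        run_hook(PRE_HOOK)
        try:
            status = subprocess.call([REAL_SHEPHERD] + sys.argv[1:])
        finally:
            run_hook(POST_HOOK)
        sys.exit(status)

    if __name__ == "__main__":
        main()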
| 0:31:03.940,0:31:07.830
 | ||
| The first thing we implemented was memory backed temporary directories. The motivation for
 | ||
| 
 | ||
| 0:31:07.830,0:31:08.700
 | ||
| this
 | ||
| 
 | ||
| 0:31:08.700,0:31:09.640
 | ||
| is that
 | ||
| 
 | ||
| 0:31:09.640,0:31:12.180
 | ||
| we've had problems where users will you know
 | ||
| 
 | ||
| 0:31:12.180,0:31:15.510
 | ||
| run /tmp out of space on the nodes
 | ||
| 
 | ||
| 0:31:15.510,0:31:19.059
 | ||
| the way we have the nodes configured is that they do have disks
 | ||
| 
 | ||
| 0:31:19.059,0:31:22.960
 | ||
| and most of the disk is available as /tmp
 | ||
| 
 | ||
| 0:31:22.960,0:31:25.049
 | ||
| we had some cases
 | ||
| 
 | ||
| 0:31:25.049,0:31:27.840
 | ||
| particularly early on where users would fill up the disks and not delete their files
 | ||
| 
 | ||
| 0:31:27.840,0:31:32.300
 | ||
| their job would crash or they would forget to add clean up code or whatever 
 | ||
| 
 | ||
| 0:31:32.300,0:31:35.100
 | ||
| and then other jobs would fail strangely
 | ||
| 
 | ||
| 0:31:35.100,0:31:39.029
 | ||
| you might expect that you just get a nice error message
 | ||
| 
 | ||
| 0:31:39.029,0:31:42.040
 | ||
| programmers being programmers
 | ||
| 
 | ||
| 0:31:42.040,0:31:42.909
 | ||
| people would not do their
 | ||
| 
 | ||
| 0:31:42.909,0:31:44.630
 | ||
| error handling correctly.
 | ||
| 
 | ||
| 0:31:44.630,0:31:47.380
 | ||
| A number of libraries do have issues like for instance
 | ||
| 
 | ||
| 0:31:47.380,0:31:49.600
 | ||
| the PVM library 
 | ||
| 
 | ||
| 0:31:49.600,0:31:52.600
 | ||
| unexpectedly fails and reports a completely strange error
 | ||
| 
 | ||
| 0:31:52.600,0:31:54.759
 | ||
| if it can't create a file in temp
 | ||
| 
 | ||
| 0:31:54.759,0:32:01.669
 | ||
| because it needs to create a UNIX domain socket
 | ||
| so it can talk to itself.
 | ||
| 
 | ||
| 0:32:01.669,0:32:03.360
 | ||
| So, what we’ve done here
 | ||
| 
 | ||
| 0:32:03.360,0:32:08.059
 | ||
| is it turns out that Sun Grid Engine actually creates a temporary
 | ||
| directory off of the
 | ||
| 
 | ||
| 0:32:08.059,0:32:11.730
 | ||
| typically /tmp but you can change
 | ||
| that
 | ||
| 
 | ||
| 0:32:11.730,0:32:14.490
 | ||
| and points TMPDIR to that
 | ||
| 
 | ||
| 0:32:14.490,0:32:15.370
 | ||
| location
 | ||
| 
 | ||
| 0:32:15.370,0:32:17.499
 | ||
| we've educated most of our users now
 | ||
| 
 | ||
| 0:32:17.499,0:32:21.360
 | ||
| to use that location correctly
 | ||
| so they’ll use that variable
 | ||
| 
 | ||
| 0:32:21.360,0:32:23.279
 | ||
| they create their files under TMPDIR
 | ||
| 
 | ||
| 0:32:23.279,0:32:24.950
 | ||
| and then when the job exits
 | ||
| 
 | ||
| 0:32:24.950,0:32:26.569
 | ||
| the Grid Engine deletes the directory
 | ||
| 
 | ||
| 0:32:26.569,0:32:28.510
 | ||
| and that all gets cleaned up
 | ||
| 
 | ||
| 0:32:28.510,0:32:32.720
 | ||
| the problem of course being that if multiple jobs
 | ||
| are running on the same node at the same time
 | ||
| 
 | ||
| 0:32:32.720,0:32:35.290
 | ||
| one of them could still fill temp
 | ||
| 
 | ||
| 0:32:35.290,0:32:38.759
 | ||
| so the solution was pretty simple 
 | ||
| we created a
 | ||
| 
 | ||
| 0:32:38.759,0:32:41.420
 | ||
| wrapper script at the beginning of the job 
 | ||
| 
 | ||
| 0:32:41.420,0:32:42.760
 | ||
| creates a 
 | ||
| 
 | ||
| 0:32:42.760,0:32:43.940
 | ||
| a 
 | ||
| 
 | ||
| 0:32:43.940,0:32:47.260
 | ||
| swap-backed md memory file system
 | ||
| 
 | ||
| 0:32:47.260,0:32:50.790
 | ||
| of a user-requestable size with a default
 | ||
| 
 | ||
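As a rough sketch of what the pre-command hook does for the temporary directory (the default size and the exact mdmfs invocation here are from memory and should be treated as an approximation):

    import subprocess

    def mount_job_tmpdir(tmpdir, size_mb=512):
        # mdmfs creates a swap-backed memory disk by default and mounts a
        # file system of roughly the requested size on the job's TMPDIR,
        # which Grid Engine has already created for us.
        subprocess.check_call(["mdmfs", "-s", "%dm" % size_mb, "md", tmpdir])

    def umount_job_tmpdir(tmpdir):
        # Post-command hook: unmount; Grid Engine then removes the directory.
        subprocess.check_call(["umount", tmpdir])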
| 0:32:50.790,0:32:53.310
 | ||
| and 
 | ||
| 
 | ||
| 0:32:53.310,0:32:56.520
 | ||
| this has a number of advantages the biggest one of course is that 
 | ||
| 
 | ||
| 0:32:56.520,0:32:58.320
 | ||
| it's fixed size so we get
 | ||
| 
 | ||
| 0:32:58.320,0:32:59.449
 | ||
| you know
 | ||
| 
 | ||
| 0:32:59.449,0:33:01.000
 | ||
| the user gets 
 | ||
| 
 | ||
| 0:33:01.000,0:33:03.420
 | ||
| what they asked for
 | ||
| 
 | ||
| 0:33:03.420,0:33:05.930
 | ||
| and once they run out of space, they run out of space, well
 | ||
| 
 | ||
| 0:33:05.930,0:33:09.300
 | ||
| and too bad they ran out of space
 | ||
| 
 | ||
| 0:33:09.300,0:33:12.760
 | ||
| they should have asked for more
 | ||
| 
 | ||
| 0:33:12.760,0:33:16.350
 | ||
| the other
 | ||
| 
 | ||
| 0:33:16.350,0:33:18.770
 | ||
| the other advantage is the side-effect that
 | ||
| 
 | ||
| 0:33:18.770,0:33:21.619
 | ||
| now that we're running swap-backed memory file systems for /tmp
 | ||
| 
 | ||
| 0:33:21.619,0:33:24.560
 | ||
| the users who only use a fairly small amount of temp
 | ||
| 
 | ||
| 0:33:24.560,0:33:28.190
 | ||
| should see vastly improved performance
 | ||
| because they're running in memory
 | ||
| 
 | ||
| 0:33:28.190,0:33:32.980
 | ||
| rather than writing to disc
 | ||
| 
 | ||
| 0:33:32.980,0:33:34.690
 | ||
| quick example
 | ||
| 
 | ||
| 0:33:34.690,0:33:38.270
 | ||
| we've a little job script here
 | ||
| 
 | ||
| 0:33:38.270,0:33:39.830
 | ||
| prints temp dir and
 | ||
| 
 | ||
| 0:33:39.830,0:33:41.950
 | ||
| prints the
 | ||
| 
 | ||
| 0:33:41.950,0:33:43.080
 | ||
| amount of space
 | ||
| 
 | ||
| 0:33:43.080,0:33:46.210
 | ||
| we submit our job request saying that we want
 | ||
| 
 | ||
| 0:33:46.210,0:33:51.539
 | ||
| this is what we want hundred megabytes of
 | ||
| temp space
 | ||
| 
 | ||
| 0:33:51.539,0:33:53.580
 | ||
| the same that's why if this
 | ||
| 
 | ||
| 0:33:53.580,0:33:55.230
 | ||
| so the program doesn't
 | ||
| 
 | ||
| 0:33:55.230,0:33:57.620
 | ||
| so the program ends at the end of it
 | ||
| 
 | ||
| 0:33:57.620,0:33:58.709
 | ||
| for doing it
 | ||
| 
 | ||
| 0:33:58.709,0:34:00.510
 | ||
| here's a live demo
 | ||
| 
 | ||
| 0:34:00.510,0:34:01.840
 | ||
| all and then
 | ||
| 
 | ||
| 0:34:01.840,0:34:03.389
 | ||
| you look at the output
 | ||
| 
 | ||
| 0:34:03.389,0:34:04.280
 | ||
| you can see it
 | ||
| 
 | ||
| 0:34:04.280,0:34:07.549
 | ||
| does in fact it creates a memory file system 
 | ||
| 
 | ||
| 0:34:07.549,0:34:10.449
 | ||
| I attempted to write code
 | ||
| 
 | ||
| 0:34:10.449,0:34:13.409
 | ||
| having a variable space
 | ||
| 
 | ||
| 0:34:13.409,0:34:15.839
 | ||
| that is roughly what the user asked for
 | ||
| 
 | ||
| 0:34:15.839,0:34:17.089
 | ||
| the version that I had
 | ||
| 
 | ||
| 0:34:17.089,0:34:20.739
 | ||
| when I was attempting this was not entirely
 | ||
| accurate
 | ||
| 
 | ||
| 0:34:20.739,0:34:24.710
 | ||
| trying to guess what all the
 | ||
| UFS overhead would be 
 | ||
| 
 | ||
| 0:34:24.710,0:34:25.889
 | ||
| as the result was
 | ||
| 
 | ||
| 0:34:25.889,0:34:28.399
 | ||
| not quite consistent
 | ||
| 
 | ||
| 0:34:30.790,0:34:33.899
 | ||
| I couldn't figure out an easy function so
 | ||
| 
 | ||
| 0:34:33.899,0:34:39.589
 | ||
| it does a better job than it did to start with, it’s not perfect
 | ||
| 
 | ||
| 0:34:39.589,0:34:40.600
 | ||
| sometimes however
 | ||
| 
 | ||
| 0:34:40.600,0:34:42.329
 | ||
| today that that's a good fix
 | ||
| 
 | ||
| 0:34:42.329,0:34:43.550
 | ||
| we're going to
 | ||
| 
 | ||
| 0:34:43.550,0:34:45.359
 | ||
| deploy it pretty soon
 | ||
| 
 | ||
| 0:34:45.359,0:34:47.159
 | ||
| it works pretty easily
 | ||
| 
 | ||
| 0:34:47.159,0:34:48.570
 | ||
| well sometimes it's not enough
 | ||
| 
 | ||
| 0:34:48.570,0:34:51.390
 | ||
| the biggest issue is that there are badly designed programs all
 | ||
| 
 | ||
| 0:34:51.390,0:34:52.720
 | ||
| all over the world
 | ||
| 
 | ||
| 0:34:52.720,0:34:54.919
 | ||
| don't use TMPDIR like they're supposed to
 | ||
| 
 | ||
| 0:34:54.919,0:34:59.319
 | ||
| in fact
 | ||
| 
 | ||
| 0:35:10.099,0:35:12.759
 | ||
| (inaudible question)
 | ||
| so there are all these applications
 | ||
| 
 | ||
| 0:35:12.759,0:35:17.979
 | ||
| there are all these applications still that need
 | ||
| temp say during start up
 | ||
| 
 | ||
| 0:35:17.979,0:35:19.230
 | ||
| that sort of thing
 | ||
| 
 | ||
| 0:35:19.230,0:35:20.809
 | ||
| so
 | ||
| 
 | ||
| 0:35:20.809,0:35:22.599
 | ||
| all
 | ||
| 
 | ||
| 0:35:22.599,0:35:25.829
 | ||
| so we have problems with these
 | ||
| 
 | ||
| 0:35:25.829,0:35:26.290
 | ||
| realistically
 | ||
| 
 | ||
| 0:35:26.290,0:35:27.799
 | ||
| we can’t change all of them
 | ||
| 
 | ||
| 0:35:27.799,0:35:30.019
 | ||
| it's just not going to happen
 | ||
| 
 | ||
| 0:35:30.019,0:35:31.950
 | ||
| so we still have problems with people
 | ||
| 
 | ||
| 0:35:31.950,0:35:34.509
 | ||
| running out of resources
 | ||
| 
 | ||
| 0:35:34.509,0:35:35.819
 | ||
| so we probably  
 | ||
| 
 | ||
| 0:35:35.819,0:35:37.489
 | ||
| feel that
 | ||
| 
 | ||
| 
 | ||
| 0:35:37.489,0:35:41.240
 | ||
| the most general solution is to have a per-job /tmp
 | ||
| 
 | ||
| 0:35:41.240,0:35:44.880
 | ||
| and virtualize that portion of the file system
 | ||
| in memory space
 | ||
| 
 | ||
| 0:35:44.880,0:35:47.119
 | ||
| and variant symlinks can do that
 | ||
| 
 | ||
| 0:35:47.119,0:35:52.539
 | ||
| and so we said okay let's give it a shot
 | ||
| 
 | ||
| 0:35:52.539,0:35:56.969
 | ||
| just to introduce the concept of variant symlinks for people who aren't familiar with them
 | ||
| 
 | ||
| 0:35:56.969,0:36:00.280
 | ||
| variant symlinks are basically symlinks that
 | ||
| contain variables
 | ||
| 
 | ||
| 0:36:00.280,0:36:02.389
 | ||
| which are expanded at run time
 | ||
| 
 | ||
| 0:36:02.389,0:36:05.549
 | ||
| it allows paths to be different for different
 | ||
| processes
 | ||
| 
 | ||
| 0:36:05.549,0:36:06.969
 | ||
| for example
 | ||
| 
 | ||
| 0:36:06.969,0:36:08.689
 | ||
| you create some files
 | ||
| 
 | ||
| 0:36:08.689,0:36:10.069
 | ||
| you create
 | ||
| 
 | ||
| 0:36:10.069,0:36:12.459
 | ||
| a symlink whose contents are
 | ||
| 
 | ||
| 0:36:12.459,0:36:18.329
 | ||
| this variable which has the default shell value
 | ||
| 
 | ||
| 0:36:18.329,0:36:18.990
 | ||
| and you
 | ||
| 
 | ||
| 0:36:18.990,0:36:24.949
 | ||
| get different results with different
 | ||
| variable sets.
 | ||
| 
 | ||
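A small sketch of the idea; the %{...} spelling and the way the per-process variable gets set are only illustrative here, since the exact syntax differs between implementations:

    import os

    # With variant symlinks the *target* of the link contains a variable that
    # the kernel expands when the path is resolved, so two processes with
    # different values of "arch" see different directories through one name.
    os.mkdir("lib-amd64")
    os.mkdir("lib-i386")
    os.symlink("lib-%{arch}", "lib")   # expanded at lookup time, not here

    # A process whose "arch" variable is amd64 opens lib/libm.so and gets
    # lib-amd64/libm.so; one with arch=i386 gets lib-i386/libm.so.  Setting
    # the variable is done through whatever interface the implementation
    # exposes (elided here).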
| 0:36:24.949,0:36:27.170
 | ||
| So, to talk about the implementation we’ve done,
 | ||
| 
 | ||
| 0:36:27.170,0:36:32.389
 | ||
| it's derived from direct implementation, most of
 | ||
| the data structures are identical
 | ||
| 
 | ||
| 0:36:32.389,0:36:33.869
 | ||
| however, I’ve made a number of changes
 | ||
| 
 | ||
| 0:36:33.869,0:36:39.649
 | ||
| the biggest one is that we took the concept
 | ||
| of scopes and we turned them entirely around
 | ||
| 
 | ||
| 0:36:40.409,0:36:45.329
 | ||
| in theirs there is a system scope which
 | ||
| is overridden by a user scope and by a
 | ||
| 
 | ||
| 0:36:45.329,0:36:47.259
 | ||
| process scope
 | ||
| 
 | ||
| 0:36:49.819,0:36:53.449
 | ||
| problem with that is if you
 | ||
| 
 | ||
| 0:36:53.449,0:36:56.099
 | ||
| only think about say the systems scope
 | ||
| 
 | ||
| 0:36:56.099,0:36:57.079
 | ||
| and
 | ||
| 
 | ||
| 0:36:57.079,0:36:59.459
 | ||
| you decide you want to do something clever like have 
 | ||
| 
 | ||
| 0:36:59.459,0:37:02.219
 | ||
| a root file system which
 | ||
| 
 | ||
| 0:37:02.219,0:37:06.109
 | ||
| where slash lib points to different things
 | ||
| for different
 | ||
| 
 | ||
| 0:37:06.109,0:37:08.249
 | ||
| different architectures
 | ||
| 
 | ||
| 0:37:08.249,0:37:11.849
 | ||
| well, works quite nicely until the users come along
 | ||
| and
 | ||
| 
 | ||
| 0:37:11.849,0:37:14.189
 | ||
| set their arch variable
 | ||
| 
 | ||
| 0:37:14.189,0:37:15.629
 | ||
| out from under you
 | ||
| 
 | ||
| 0:37:15.629,0:37:18.900
 | ||
| if you have say a Set UID program and you don't
 | ||
| defensively
 | ||
| 
 | ||
| 0:37:18.900,0:37:22.319
 | ||
| and you don't implement correctly
 | ||
| 
 | ||
| 0:37:22.319,0:37:24.900
 | ||
| the obvious bad things happen. Obviously you would
 | ||
| 
 | ||
| 0:37:24.900,0:37:28.599
 | ||
| write your code to not do that I believe they
 | ||
| did, but
 | ||
| 
 | ||
| 0:37:28.599,0:37:31.700
 | ||
| there's a whole class of problems where
 | ||
| 
 | ||
| 0:37:31.700,0:37:33.449
 | ||
| it's easy to screw up
 | ||
| 
 | ||
| 0:37:33.449,0:37:36.219
 | ||
| add and do something wrong there
 | ||
| 
 | ||
| 0:37:36.219,0:37:37.270
 | ||
| so by
 | ||
| 
 | ||
| 0:37:37.270,0:37:38.509
 | ||
| reversing the order
 | ||
| 
 | ||
| 0:37:38.509,0:37:41.849
 | ||
| we can reduce the risks
 | ||
| 
 | ||
| 0:37:41.849,0:37:43.329
 | ||
| at the moment we don't
 | ||
| 
 | ||
| 0:37:43.329,0:37:44.309
 | ||
| have a user scope
 | ||
| 
 | ||
| 0:37:44.309,0:37:47.530
 | ||
| I just don't like the idea of the users scope 
 | ||
| to be honest
 | ||
| 
 | ||
| 0:37:47.530,0:37:50.900
 | ||
| problem being that then you have to have
 | ||
| per user state in kernel
 | ||
| 
 | ||
| 0:37:50.900,0:37:55.509
 | ||
| that just sort of sits around forever
 | ||
| you can never garbage collect it except the
 | ||
| 
 | ||
| 0:37:55.509,0:37:57.059
 | ||
| Administrator way
 | ||
| 
 | ||
| 0:37:57.059,0:37:59.489
 | ||
| just doesn't seem like a great idea to me
 | ||
| 
 | ||
| 0:37:59.489,0:38:00.700
 | ||
| And jail scope
 | ||
| 
 | ||
| 0:38:00.700,0:38:04.609
 | ||
| just hasn't been implemented
 | ||
| 
 | ||
| 0:38:04.609,0:38:09.809
 | ||
| because it wasn't entirely clear what the semantics should be
 | ||
| 
 | ||
| 0:38:11.010,0:38:14.719
 | ||
| I also added default variable support,
 | ||
| shell-style
 | ||
| 
 | ||
| 0:38:14.719,0:38:16.999
 | ||
| variable support
 | ||
| 
 | ||
| 0:38:16.999,0:38:19.169
 | ||
| to some extent undoes the scope
 | ||
| 
 | ||
| 0:38:19.169,0:38:20.870
 | ||
| the scope change
 | ||
| 
 | ||
| 0:38:20.870,0:38:21.779
 | ||
| in that
 | ||
| 
 | ||
| 0:38:21.779,0:38:24.749
 | ||
| the default variable becomes a system scope 
 | ||
| 
 | ||
| 0:38:24.749,0:38:26.540
 | ||
| which is overridden by everything
 | ||
| 
 | ||
| 0:38:26.540,0:38:30.890
 | ||
| but there are cases where we need to do that
 | ||
| in particular if you want to implement a
 | ||
| 
 | ||
| 0:38:30.890,0:38:33.380
 | ||
| /tmp which varies
 | ||
| 
 | ||
| 0:38:33.380,0:38:36.240
 | ||
| we have to do something like this because temp needs to work
 | ||
| 
 | ||
| 0:38:37.209,0:38:42.059
 | ||
| if we don't have the job values set
 | ||
| 
 | ||
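For example, a per-job /tmp with a fallback might look roughly like this; the ":-" default spelling is borrowed from the shell only to suggest the feature, not necessarily the implemented syntax, and the jobtmp variable name is made up:

    import os

    # Illustrative only, on a root where /tmp does not already exist: jobs
    # that have the jobtmp variable set see their own memory-backed
    # directory, everything else quietly falls back to /var/tmp.
    os.symlink("%{jobtmp:-/var/tmp}", "/tmp")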
| 0:38:42.059,0:38:45.829
 | ||
| I also decided to use 
 | ||
| 
 | ||
| 0:38:45.829,0:38:49.839
 | ||
| percent instead of dollar sign to avoid
 | ||
| confusion with shell variables because these
 | ||
| 
 | ||
| 0:38:49.839,0:38:50.379
 | ||
| are
 | ||
| 
 | ||
| 0:38:50.379,0:38:52.620
 | ||
| a separate namespace in the kernel
 | ||
| 
 | ||
| 0:38:52.620,0:38:56.669
 | ||
| we can't do it to main OS and do all the evaluation in the
 | ||
| user space
 | ||
| 
 | ||
| 0:38:56.669,0:38:59.269
 | ||
| it's a classic vulnerability
 | ||
| 
 | ||
| 0:38:59.269,0:39:02.739
 | ||
| in the CVE database for instance
 | ||
| 
 | ||
| 0:39:02.739,0:39:08.109
 | ||
| and we're not using @ to avoid confusion
 | ||
| with AFS
 | ||
| 
 | ||
| 0:39:08.109,0:39:09.819
 | ||
| or the NetBSD implementation
 | ||
| 
 | ||
| 0:39:09.819,0:39:11.019
 | ||
| which does not allow
 | ||
| 
 | ||
| 0:39:11.019,0:39:14.879
 | ||
| user or administratively settable values
 | ||
| 
 | ||
| 0:39:14.879,0:39:17.019
 | ||
| that support
 | ||
| 
 | ||
| 0:39:17.019,0:39:20.359
 | ||
| I don't have any automated variables such
 | ||
| as
 | ||
| 
 | ||
| 0:39:20.359,0:39:25.789
 | ||
| the percent sys value which is universally
 | ||
| set in the NetBSD implementation
 | ||
| 
 | ||
| 0:39:25.789,0:39:26.750
 | ||
| or
 | ||
| 
 | ||
| 0:39:28.039,0:39:32.579
 | ||
| a UID variable which they also have
 | ||
| 0:39:32.579,0:39:34.909
 | ||
| and currently it does not allow
 | ||
| 
 | ||
| 0:39:34.909,0:39:40.880
 | ||
| setting of values in other processes,
 | ||
| you can only set them in your own and inherit it
 | ||
| 
 | ||
| 0:39:40.880,0:39:42.699
 | ||
| that may change but
 | ||
| 
 | ||
| 0:39:42.699,0:39:47.339
 | ||
| one of my goals here, because there were
 | ||
| subtle ways to make dumb mistakes and
 | ||
| 
 | ||
| 0:39:47.339,0:39:48.930
 | ||
| cause security vulnerabilities
 | ||
| 
 | ||
| 0:39:48.930,0:39:52.479
 | ||
| I've attempted to slim the feature set
 | ||
| down to the point where you
 | ||
| 
 | ||
| 0:39:52.479,0:39:54.909
 | ||
| have some reasonable chance of not
 | ||
| 
 | ||
| 0:39:54.909,0:39:56.339
 | ||
| doing that
 | ||
| 
 | ||
| 0:39:56.339,0:40:03.339
 | ||
| if you start building systems on them for deployment.
 | ||
| 
 | ||
| 0:40:04.419,0:40:06.909
 | ||
| The final area that we've worked on
 | ||
| 
 | ||
| 0:40:06.909,0:40:09.499
 | ||
| is moving away from the file system space
 | ||
| 
 | ||
| 0:40:09.499,0:40:12.559
 | ||
| and into CPU sets
 | ||
| 
 | ||
| 0:40:12.559,0:40:16.379
 | ||
| Jeff Roberson implemented a program
 | ||
| 
 | ||
| 0:40:16.379,0:40:20.699
 | ||
| implemented a CPU set functionality which
 | ||
| allows you to
 | ||
| 
 | ||
| 0:40:20.699,0:40:23.489
 | ||
| create… put a process into a CPU set 
 | ||
| 
 | ||
| 0:40:23.489,0:40:24.879
 | ||
| and then set the affinity of that
 | ||
| 
 | ||
| 0:40:24.879,0:40:26.269
 | ||
| CPU set
 | ||
| 
 | ||
| 0:40:26.269,0:40:29.189
 | ||
| by default every process has an anonymous
 | ||
| 
 | ||
| 0:40:29.189,0:40:33.059
 | ||
| CPU set that was stuffed into
 | ||
| one that was created by
 | ||
| 
 | ||
| 0:40:33.059,0:40:37.269
 | ||
| its parent
 | ||
| 
 | ||
| 0:40:37.269,0:40:38.619
 | ||
| so for a little background here
 | ||
| 
 | ||
| 0:40:38.619,0:40:40.740
 | ||
| in a typical SGE configuration
 | ||
| 
 | ||
| 0:40:40.740,0:40:42.769
 | ||
| every node has one slot
 | ||
| 
 | ||
| 0:40:42.769,0:40:44.429
 | ||
| per CPU
 | ||
| 
 | ||
| 0:40:44.429,0:40:48.639
 | ||
| There are a number of other ways you
 | ||
| can configure it, basically a slot is something
 | ||
| 
 | ||
| 0:40:48.639,0:40:50.019
 | ||
| a job can run in
 | ||
| 
 | ||
| 0:40:50.019,0:40:56.719
 | ||
| and a parallel job crosses slots
 | ||
| and can be in more than one slot
 | ||
| 
 | ||
| 0:40:56.719,0:41:01.359
 | ||
| for instance in many applications where
 | ||
| code tends to spend a fair bit of time
 | ||
| 
 | ||
| 0:41:01.359,0:41:02.380
 | ||
| waiting for IO
 | ||
| 
 | ||
| 0:41:02.380,0:41:06.209
 | ||
| you are looking at more than one slot per CPU so two slots per
 | ||
| 
 | ||
| 0:41:06.209,0:41:08.089
 | ||
| core is not uncommon
 | ||
| 
 | ||
| 0:41:08.089,0:41:10.869
 | ||
| but probably the most common configuration
 | ||
| and the one that
 | ||
| 
 | ||
| 0:41:10.869,0:41:13.719
 | ||
| you get out of the box is you just install a Grid Engine
 | ||
| 
 | ||
| 0:41:13.719,0:41:16.739
 | ||
| is one slot for each CPU
 | ||
| 
 | ||
| 0:41:16.739,0:41:19.830
 | ||
| and that's how that's how we run because we
 | ||
| want users to have
 | ||
| 
 | ||
| 0:41:19.830,0:41:23.699
 | ||
| that whole CPU for whatever they want to do with
 | ||
| it
 | ||
| 
 | ||
| 0:41:23.699,0:41:26.130
 | ||
| so jobs are allocated one or more slots
 | ||
| 
 | ||
| 0:41:26.130,0:41:27.599
 | ||
| if they're 
 | ||
| 
 | ||
| 0:41:27.599,0:41:33.189
 | ||
| depending on whether they're sequential or parallel jobs
 | ||
| and how many they ask for
 | ||
| 
 | ||
| 0:41:33.189,0:41:37.239
 | ||
| but this is just a convention
 | ||
| there's no actual connection between slots
 | ||
| 
 | ||
| 0:41:37.239,0:41:39.119
 | ||
| and CPUs
 | ||
| 
 | ||
| 0:41:39.119,0:41:40.829
 | ||
| so it's quite possible to
 | ||
| 
 | ||
| 0:41:40.829,0:41:42.819
 | ||
| submit a non-parallel job
 | ||
| 
 | ||
| 0:41:42.819,0:41:45.019
 | ||
| that goes off and spawns a zillion threads
 | ||
| 
 | ||
| 0:41:45.019,0:41:48.369
 | ||
| and sucks up all the CPUs on the whole system
 | ||
| 
 | ||
| 0:41:48.369,0:41:50.800
 | ||
| in some early versions of grid engine
 | ||
| 
 | ||
| 0:41:50.800,0:41:53.569
 | ||
| there actually was
 | ||
| 
 | ||
| 0:41:53.569,0:41:55.729
 | ||
| support for tying slots
 | ||
| 
 | ||
| 0:41:55.729,0:41:58.669
 | ||
| to CPUs if you set it up that
 | ||
| way
 | ||
| 
 | ||
| 0:41:58.669,0:42:02.979
 | ||
| there is a sensible implementation for IRIX
 | ||
| and then things got weirder and weirder as
 | ||
| 
 | ||
| 0:42:02.979,0:42:06.010
 | ||
| people tried to implement it on other platforms
 | ||
| which had
 | ||
| 
 | ||
| 0:42:06.010,0:42:07.030
 | ||
| vastly different
 | ||
| 
 | ||
| 0:42:07.030,0:42:09.839
 | ||
| CPU binding semantics
 | ||
| 
 | ||
| 0:42:09.839,0:42:12.359
 | ||
| and at this point it’s entirely broken
 | ||
| 
 | ||
| 0:42:12.359,0:42:14.959
 | ||
| on every platform as far as I can tell
 | ||
| 
 | ||
| 0:42:14.959,0:42:18.759
 | ||
| so we decided okay we've got this wrapper
 | ||
| let's see what we can do
 | ||
| 
 | ||
| 0:42:18.759,0:42:21.009
 | ||
| in terms of making things work.
 | ||
| 
 | ||
| 0:42:21.659,0:42:27.119
 | ||
| We now have the wrapper store allocations in the file system
 | ||
| 
 | ||
| 0:42:27.119,0:42:31.239
 | ||
| we have a not yet recursive allocation algorithm
 | ||
| 
 | ||
| 0:42:31.239,0:42:33.369
 | ||
| what we try to do is
 | ||
| 
 | ||
| 0:42:33.369,0:42:34.690
 | ||
| find the best
 | ||
| 
 | ||
| 0:42:34.690,0:42:35.779
 | ||
| fitting set of
 | ||
| 
 | ||
| 0:42:35.779,0:42:39.539
 | ||
| adjacent cores
 | ||
| 
 | ||
| 0:42:39.539,0:42:42.329
 | ||
| and then if that doesn't work we take the largest
 | ||
| and repeat
 | ||
| 
 | ||
| 0:42:43.519,0:42:45.180
 | ||
| until it fits
 | ||
| 
 | ||
| 0:42:45.180,0:42:47.300
 | ||
| or until we've got enough slots
 | ||
| 
 | ||
| 0:42:47.300,0:42:50.800
 | ||
| the goal is to minimize new fragments we haven't
 | ||
| done any analysis
 | ||
| 
 | ||
| 0:42:50.800,0:42:52.269
 | ||
| to determine whether that's actually
 | ||
| 
 | ||
| 0:42:52.269,0:42:55.179
 | ||
| an appropriate algorithm
 | ||
| 
 | ||
| 0:42:55.179,0:42:56.289
 | ||
| but off hand it seems
 | ||
| 
 | ||
| 0:42:56.289,0:43:00.519
 | ||
| fine given I’ve thought about it over lunch.
 | ||
| 
 | ||
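A minimal sketch of that strategy (the data structures and names are mine, not the production wrapper): take the best-fitting run of adjacent free cores, otherwise the largest run, and repeat until the job has enough slots.

    def runs(free_cpus):
        """Group free CPU ids into runs of adjacent ids."""
        grouped = []
        for cpu in sorted(free_cpus):
            if grouped and cpu == grouped[-1][-1] + 1:
                grouped[-1].append(cpu)
            else:
                grouped.append([cpu])
        return grouped

    def allocate(free_cpus, want):
        """Return `want` CPU ids, trying to minimize new fragmentation."""
        chosen = []
        free = set(free_cpus)
        while len(chosen) < want and free:
            need = want - len(chosen)
            fitting = [r for r in runs(free) if len(r) >= need]
            if fitting:
                take = min(fitting, key=len)[:need]  # best fit: smallest run that is big enough
            else:
                take = max(runs(free), key=len)      # otherwise the largest run, then repeat
            chosen.extend(take)
            free.difference_update(take)
        return chosen

    # allocate([0, 1, 4, 5, 6, 7], 2) -> [0, 1]; allocate([1, 3, 5, 6], 3) -> [5, 6, 1]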
| 0:43:00.519,0:43:02.810
 | ||
| should we need to port to other OSes
 | ||
| 
 | ||
| 0:43:02.810,0:43:09.649
 | ||
| turns out that the FreeBSD CPU set API
 | ||
| and the Linux one
 | ||
| 
 | ||
| 0:43:09.649,0:43:12.519
 | ||
| differ only in the very small details
 | ||
| 
 | ||
| 0:43:12.519,0:43:13.599
 | ||
| They’re
 | ||
| 
 | ||
| 0:43:13.599,0:43:15.479
 | ||
| essentially exactly 
 | ||
| 
 | ||
| 0:43:15.479,0:43:17.569
 | ||
| identical which is
 | ||
| 
 | ||
| 0:43:17.569,0:43:20.489
 | ||
| convenient semantically,
 | ||
| so converting between them is pretty straightforward
 | ||
| 
 | ||
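Since the binding is driven from the wrapper, the remaining differences can be hidden behind the command-line tools; a hedged sketch using cpuset(1) on FreeBSD and taskset(1) on Linux, with flag spellings as I recall them from the manual pages:

    import platform
    import subprocess

    def bind_and_run(cpus, argv):
        """Run argv restricted to the given CPU ids using the native tool."""
        cpulist = ",".join(str(c) for c in cpus)
        if platform.system() == "FreeBSD":
            cmd = ["cpuset", "-l", cpulist] + argv   # FreeBSD cpuset(1)
        else:
            cmd = ["taskset", "-c", cpulist] + argv  # Linux taskset(1)
        return subprocess.call(cmd)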
| 0:43:20.489,0:43:24.869
 | ||
| so I did a set of benchmarks
 | ||
| 
 | ||
| 0:43:24.869,0:43:27.019
 | ||
| to demonstrate the
 | ||
| 
 | ||
| 0:43:28.089,0:43:29.359
 | ||
| effectiveness of CPU set,
 | ||
| they also happen to demonstrate the wrapper
 | ||
| 
 | ||
| 0:43:29.359,0:43:33.319
 | ||
| but don’t really have any relevance
 | ||
| 
 | ||
| 0:43:33.319,0:43:35.229
 | ||
| used a little eight core Intel Xeon box
 | ||
| 
 | ||
| 0:43:38.289,0:43:40.749
 | ||
| 7.1 pre-release that had
 | ||
| 
 | ||
| 0:43:40.749,0:43:43.239
 | ||
| John Bjorkman backported
 | ||
| 
 | ||
| 0:43:43.239,0:43:46.640
 | ||
| CPU set
 | ||
| 
 | ||
| 0:43:46.640,0:43:49.039
 | ||
| from 8.0 shortly before release
 | ||
| 
 | ||
| 0:43:49.039,0:43:53.450
 | ||
| well not so shortly, it's supposed to be shortly
 | ||
| before
 | ||
| 
 | ||
| 0:43:53.450,0:43:55.579
 | ||
| and SGE 6.2
 | ||
| 
 | ||
| 0:43:55.579,0:43:59.739
 | ||
| we used a simple integer benchmark
 | ||
| 
 | ||
| 0:43:59.739,0:44:02.519
 | ||
| an n-queens program; we tested
 | ||
| 
 | ||
| 0:44:02.519,0:44:03.349
 | ||
| for instance an 8 x 8 board
 | ||
| 
 | ||
| 0:44:03.349,0:44:05.360
 | ||
| placed
 | ||
| 
 | ||
| 0:44:05.360,0:44:08.069
 | ||
| the 8 queens so they can’t capture each other
 | ||
| 
 | ||
| 0:44:08.069,0:44:09.289
 | ||
| on the board
 | ||
| 
 | ||
| 0:44:11.039,0:44:13.680
 | ||
| so it's a simple load benchmark
 | ||
| 
 | ||
| 0:44:13.680,0:44:18.800
 | ||
| we ran a small version of the problem
 | ||
| as our measured command; to generate
 | ||
| 
 | ||
| 0:44:19.599,0:44:24.439
 | ||
| load, we ran a larger version that ran for much longer
 | ||
| 
 | ||
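The benchmark itself is nothing exotic; a small counting solver along these lines is enough (this is a reconstruction for illustration, not the exact program we ran):

    import sys

    def nqueens(n, row=0, cols=0, diag1=0, diag2=0):
        """Count placements of n non-attacking queens (bitmask backtracking)."""
        if row == n:
            return 1
        count = 0
        free = ~(cols | diag1 | diag2) & ((1 << n) - 1)
        while free:
            bit = free & -free
            free -= bit
            count += nqueens(n, row + 1, cols | bit,
                             (diag1 | bit) << 1, (diag2 | bit) >> 1)
        return count

    if __name__ == "__main__":
        # Small board as the measured command, a bigger one as the long-running load.
        print(nqueens(int(sys.argv[1]) if len(sys.argv) > 1 else 8))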
| 0:44:24.439,0:44:28.149
 | ||
| some results
 | ||
| 
 | ||
| 0:44:28.149,0:44:30.129
 | ||
| so for baseline,
 | ||
| 
 | ||
| 0:44:30.129,0:44:33.170
 | ||
| the most interesting thing is to do
 | ||
| a baseline run
 | ||
| 
 | ||
| 0:44:33.170,0:44:34.279
 | ||
| you see this
 | ||
| 
 | ||
| 0:44:34.279,0:44:36.410
 | ||
| some variance it's not really very high
 | ||
| 
 | ||
| 0:44:36.410,0:44:38.979
 | ||
| not surprising it doesn't really do anything
 | ||
| 
 | ||
| 0:44:38.979,0:44:40.979
 | ||
| except suck CPU, as you see here
 | ||
| 
 | ||
| 0:44:40.979,0:44:41.729
 | ||
| Really not much
 | ||
| 
 | ||
| 0:44:41.729,0:44:45.229
 | ||
| going on
 | ||
| 
 | ||
| 0:44:45.229,0:44:50.029
 | ||
| in this case we’ve got seven
 | ||
| load processes and a single
 | ||
| 
 | ||
| 0:44:50.029,0:44:52.789
 | ||
| a single test process running
 | ||
| 
 | ||
| 0:44:52.789,0:44:55.160
 | ||
| we see things slow down slightly
 | ||
| 
 | ||
| 0:44:55.160,0:44:55.890
 | ||
| and
 | ||
| 
 | ||
| 0:44:55.890,0:44:58.389
 | ||
| the standard deviation goes up a bit
 | ||
| 
 | ||
| 0:44:58.389,0:45:00.829
 | ||
| it’s a little bit of deviation from baseline
 | ||
| 
 | ||
| 0:45:00.829,0:45:03.659
 | ||
|  the obvious explanation is clearly
 | ||
| 
 | ||
| 0:45:03.659,0:45:07.339
 | ||
| we're just context switching
 | ||
| a bit more
 | ||
| 
 | ||
| 0:45:08.840,0:45:10.349
 | ||
| because we don't have
 | ||
| 
 | ||
| 0:45:10.349,0:45:12.410
 | ||
| CPUs that are doing nothing at all
 | ||
| 
 | ||
| 0:45:12.410,0:45:15.559
 | ||
| there's some extra load from the system
 | ||
| as well
 | ||
| 
 | ||
| 0:45:15.559,0:45:20.049
 | ||
| since the kernel has to run and
 | ||
| background tasks have to run
 | ||
| 
 | ||
| 0:45:20.049,0:45:23.150
 | ||
| you know in this case we have a badly behaved application
 | ||
| 
 | ||
| 0:45:23.150,0:45:26.579
 | ||
| we now have 8 load processes which would suck up all the CPU
 | ||
| 
 | ||
| 0:45:26.579,0:45:28.879
 | ||
| and then we try to run our measurement process
 | ||
| 
 | ||
| 0:45:28.879,0:45:30.639
 | ||
| we see a you know
 | ||
| 
 | ||
| 0:45:30.639,0:45:32.739
 | ||
| substantial performance decrease
 | ||
| 
 | ||
| 0:45:32.739,0:45:35.570
 | ||
| you know about in the range we would expect
 | ||
| 
 | ||
| 0:45:35.570,0:45:37.289
 | ||
| see if we had any
 | ||
| 
 | ||
| 0:45:37.289,0:45:40.140
 | ||
| decrease
 | ||
| 
 | ||
| 0:45:40.140,0:45:43.220
 | ||
| we fired up with CPU set
 | ||
| 
 | ||
| 0:45:43.220,0:45:44.249
 | ||
| quite obviously
 | ||
| 
 | ||
| 0:45:44.249,0:45:46.190
 | ||
| the interesting thing here is to see it
 | ||
| 
 | ||
| 0:45:46.190,0:45:49.429
 | ||
| we’re getting no statistically significant difference
 | ||
| 
 | ||
| 0:45:49.429,0:45:52.819
 | ||
| between the baseline case with
 | ||
| 
 | ||
| 0:45:52.819,0:45:56.539
 | ||
| 7 processes if we use CPU sets
 | ||
| we don't see this variance
 | ||
| 
 | ||
| 0:45:56.539,0:45:58.520
 | ||
| which is nice to know that this shows
 | ||
| 
 | ||
| 0:45:58.520,0:45:59.509
 | ||
| that's it
 | ||
| 
 | ||
| 0:45:59.509,0:46:02.869
 | ||
| we actually see a slight performance
 | ||
| improvement
 | ||
| 
 | ||
| 0:46:02.869,0:46:04.179
 | ||
| and
 | ||
| 
 | ||
| 0:46:04.179,0:46:05.579
 | ||
| we
 | ||
| 
 | ||
| 0:46:05.579,0:46:07.589
 | ||
| we see a reduction in variance
 | ||
| 
 | ||
| 0:46:07.589,0:46:11.569
 | ||
| so CPU set is actually improving performance
 | ||
| even if we’re not overloaded
 | ||
| 
 | ||
| 0:46:11.569,0:46:13.510
 | ||
| and we see in the overloaded case
 | ||
| 
 | ||
| 0:46:13.510,0:46:15.589
 | ||
| it's the same
 | ||
| 
 | ||
| 0:46:15.589,0:46:20.319
 | ||
| for the other processes
 | ||
| they’re stuck on other CPUs
 | ||
| 
 | ||
| 0:46:20.319,0:46:22.820
 | ||
| one interesting side note actually is that
 | ||
| 
 | ||
| 0:46:22.820,0:46:26.719
 | ||
| when I was doing some tests early on
 | ||
| 
 | ||
| 0:46:26.719,0:46:27.869
 | ||
| we actually saw
 | ||
| 
 | ||
| 0:46:27.869,0:46:32.359
 | ||
| I tried doing the base line and
 | ||
| the baseline with CPU set and if you just fired off with the original
 | ||
| 
 | ||
| 0:46:32.359,0:46:33.869
 | ||
| algorithm
 | ||
| 
 | ||
| 0:46:33.869,0:46:34.540
 | ||
| which
 | ||
| 
 | ||
| 0:46:34.540,0:46:36.489
 | ||
| grabbed CPU0
 | ||
| 
 | ||
| 0:46:36.489,0:46:39.339
 | ||
| you saw a significant performance decline
 | ||
| 
 | ||
| 0:46:39.339,0:46:42.319
 | ||
| because there's a lot of stuff that ends up
 | ||
| running on CPU0
 | ||
| 
 | ||
| 0:46:42.319,0:46:43.819
 | ||
| which
 | ||
| 
 | ||
| 0:46:43.819,0:46:45.100
 | ||
| which led to the
 | ||
| 
 | ||
| 0:46:45.100,0:46:49.890
 | ||
| quick observation you want to allocate
 | ||
| from the large numbers down
 | ||
| 
 | ||
| 0:46:49.890,0:46:50.569
 | ||
| so that you use
 | ||
| 
 | ||
| 0:46:50.569,0:46:55.069
 | ||
| the CPUs which are not running the random processes
 | ||
| that get stuck on zero
 | ||
| 
 | ||
| 0:46:55.069,0:46:57.880
 | ||
| or get all the interrupts in some architectures
 | ||
| 
 | ||
| 0:46:57.880,0:47:02.199
 | ||
| and avoid Core0 in particular.
 | ||
| 
 | ||
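In the allocator sketched earlier that observation is a one-line tweak: break ties toward the high-numbered cores so CPU 0 is the last one handed out.

    # In allocate() above, prefer runs containing the highest CPU numbers:
    take = min(fitting, key=lambda r: (len(r), -max(r)))[:need]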
| 0:47:02.199,0:47:04.029
 | ||
| so some conclusions
 | ||
| 
 | ||
| 0:47:04.029,0:47:07.530
 | ||
| I think we have a useful proof of concept
 | ||
| we're going to be deploying
 | ||
| 
 | ||
| 0:47:07.530,0:47:09.880
 | ||
| certainly the 
 | ||
| 
 | ||
| 0:47:09.880,0:47:11.000
 | ||
| memory stuff soon
 | ||
| 
 | ||
| 0:47:11.000,0:47:13.329
 | ||
| once we upgrade to seven we’ll
 | ||
| 
 | ||
| 0:47:13.329,0:47:15.959
 | ||
| definitely be deploying the CPU sets
 | ||
| 
 | ||
| 0:47:15.959,0:47:16.849
 | ||
| so it's
 | ||
| 
 | ||
| 0:47:16.849,0:47:18.509
 | ||
| both improves performance
 | ||
| 
 | ||
| 0:47:18.509,0:47:22.009
 | ||
| in the contended case and in the uncontended case
 | ||
| 
 | ||
| 0:47:22.009,0:47:26.299
 | ||
| we would like in the future to do some more work
 | ||
| with virtual private server stuff
 | ||
| 
 | ||
| 0:47:26.299,0:47:28.979
 | ||
| Particularly it would be really interesting
 | ||
| 
 | ||
| 0:47:28.979,0:47:30.759
 | ||
| to be able to run different
 | ||
| 
 | ||
| 0:47:30.759,0:47:32.540
 | ||
| different FreeBSD versions in jails
 | ||
| 
 | ||
| 0:47:32.540,0:47:37.660
 | ||
| or to run for instance CentOS images
 | ||
| in jail since we’re running CentOS
 | ||
| 
 | ||
| 0:47:37.660,0:47:40.649
 | ||
| on our Linux based systems
 | ||
| 
 | ||
| 0:47:40.649,0:47:43.240
 | ||
| there could actually be some really interesting
 | ||
| things there
 | ||
| 
 | ||
| 0:47:43.240,0:47:45.759
 | ||
| in that for instance we can run
 | ||
| 
 | ||
| 0:47:45.759,0:47:50.989
 | ||
| we could potentially DTrace Linux applications
 | ||
| it's never going to happen on native Linux
 | ||
| 
 | ||
| 0:47:50.989,0:47:53.069
 | ||
| there's also another example where
 | ||
| 
 | ||
| 0:47:53.069,0:47:56.269
 | ||
| Paul Saab who's been doing some benchmarking recently
 | ||
| 
 | ||
| 0:47:56.269,0:48:01.039
 | ||
| and relative to Linux on the same hardware
 | ||
| 
 | ||
| 0:48:01.039,0:48:04.900
 | ||
| he was seeing a three and a half times improvement
 | ||
| 0:48:04.900,0:48:07.230
 | ||
| in basic matrix multiplication
 | ||
| 
 | ||
| 0:48:07.230,0:48:08.549
 | ||
| relative to current
 | ||
| 
 | ||
| 0:48:08.549,0:48:11.849
 | ||
| because of the new superpages functionality
 | ||
| 
 | ||
| 0:48:11.849,0:48:14.499
 | ||
| where you vastly reduce the number of TLB entries
 | ||
| 
 | ||
| 0:48:14.499,0:48:16.150
 | ||
| in the page table
 | ||
| 
 | ||
| 0:48:16.150,0:48:17.229
 | ||
| and so
 | ||
| 
 | ||
| 0:48:17.229,0:48:21.109
 | ||
| that sort of thing can apply even
 | ||
| to our Linux using population
 | ||
| 
 | ||
| 0:48:21.109,0:48:23.969
 | ||
| could give FreeBSD some real wins there
 | ||
| 
 | ||
| 0:48:26.309,0:48:27.579
 | ||
| I’d like to look at
 | ||
| 
 | ||
| 0:48:27.579,0:48:30.859
 | ||
| more on the point of isolating users from kernel upgrades
 | ||
| 
 | ||
| 0:48:30.859,0:48:32.620
 | ||
| one of the issues we've had is that
 | ||
| 
 | ||
| 0:48:32.620,0:48:34.019
 | ||
| when you do a new bump
 | ||
| 
 | ||
| 0:48:34.019,0:48:38.399
 | ||
| we have users who depend on all sorts of libraries
 | ||
| immediate which
 | ||
| 
 | ||
| 0:48:38.399,0:48:41.380
 | ||
| you know the vendors like to rev them to
 | ||
| do
 | ||
| 
 | ||
| 0:48:41.380,0:48:44.640
 | ||
| stupid API-breaking changes fairly
 | ||
| regularly so
 | ||
| 
 | ||
| 0:48:44.640,0:48:48.380
 | ||
| it’d be nice for users if we can get all the
 | ||
| benefits of kernel upgrades
 | ||
| 
 | ||
| 0:48:48.380,0:48:51.699
 | ||
| and they could upgrade at their leisure
 | ||
| 
 | ||
| 0:48:51.699,0:48:54.459
 | ||
| so we're hoping to do that in future as well
 | ||
| 
 | ||
| 0:48:54.459,0:48:57.809
 | ||
| we'd like to see more limits
 | ||
| on bandwidth type resources
 | ||
| 
 | ||
| 0:48:59.219,0:49:01.199
 | ||
| for instance say limiting the amount of
 | ||
| 
 | ||
| 0:49:02.910,0:49:05.649
 | ||
| it's fairly easy to know the amount
 | ||
| of sockets I own
 | ||
| 
 | ||
| 0:49:05.649,0:49:10.279
 | ||
| but it’s hard to place a total limit on
 | ||
| network bandwidth
 | ||
| 
 | ||
| 0:49:10.279,0:49:11.819
 | ||
| by a particular process
 | ||
| 
 | ||
| 0:49:11.819,0:49:16.979
 | ||
| when almost all of our storage is on NFS
 | ||
| how do you classify that traffic
 | ||
| 
 | ||
| 0:49:17.649,0:49:21.259
 | ||
| without a fair bit of change to the kernel
 | ||
| and somehow tagging that
 | ||
| 
 | ||
| 0:49:21.259,0:49:23.799
 | ||
| it's an interesting challenge.
 | ||
| 
 | ||
| 0:49:23.799,0:49:28.309
 | ||
| we'd also like to see, it would be nice if someone
 | ||
| implemented something like
 | ||
| 
 | ||
| 0:49:28.309,0:49:30.089
 | ||
| the IRIX job ID
 | ||
| 
 | ||
| 0:49:30.089,0:49:34.099
 | ||
| to allow the scheduler to just
 | ||
| tag processes as part of a job
 | ||
| 
 | ||
| 0:49:34.099,0:49:36.309
 | ||
| currently
 | ||
| 
 | ||
| 0:49:36.309,0:49:38.939
 | ||
| Grid Engine uses a clever but evil hack
 | ||
| 
 | ||
| 0:49:38.939,0:49:40.010
 | ||
| where they add
 | ||
| 
 | ||
| 0:49:40.010,0:49:42.509
 | ||
| an extra group to the process
 | ||
| 
 | ||
| 0:49:42.509,0:49:44.819
 | ||
| and they just have a range of groups
 | ||
| 
 | ||
| 0:49:44.819,0:49:48.209
 | ||
| available so they get inherited and the users
 | ||
| can’t drop them so
 | ||
| 
 | ||
| 0:49:48.209,0:49:51.889
 | ||
| that allows them to track the process
 | ||
| but it’s an ugly hack
 | ||
| 
 | ||
| 0:49:51.889,0:49:57.499
 | ||
| and with the current limits on the number of groups
 | ||
| it can become a real problem
 | ||
| 
 | ||
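For completeness, the trick amounts to something like the following; the particular GID is made up, and the real daemon picks one out of a configured range before it forks the job:

    import os

    JOB_GID = 65123   # one gid out of a range reserved for job tracking

    def tag_and_exec(argv):
        # Add an extra, otherwise meaningless supplementary group before
        # exec'ing the job.  Children inherit it and cannot drop it, so
        # "processes in this job" is just "processes carrying JOB_GID".
        os.setgroups(os.getgroups() + [JOB_GID])   # requires privilege
        os.execvp(argv[0], argv)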
| 0:49:57.499,0:49:59.529
 | ||
| actually before I take questions
 | ||
| 
 | ||
| 0:49:59.529,0:49:59.980
 | ||
| I do want to put in
 | ||
| 
 | ||
| 0:49:59.980,0:50:01.119
 | ||
| one quick point
 | ||
| 
 | ||
| 0:50:01.119,0:50:05.100
 | ||
| if you think this stuff is interesting and you live in
 | ||
| the area and you're looking for
 | ||
| 
 | ||
| 0:50:05.100,0:50:06.430
 | ||
| looking for a job
 | ||
| 
 | ||
| 0:50:06.430,0:50:09.780
 | ||
| we are trying to hire a few people it's difficult
 | ||
| to hire good
 | ||
| 
 | ||
| 0:50:09.780,0:50:13.069
 | ||
| we do have some openings and we're looking
 | ||
| for
 | ||
| 
 | ||
| 0:50:13.069,0:50:17.409
 | ||
| BSD people in general system
 | ||
| Admin people
 | ||
| 
 | ||
| 0:50:17.409,0:50:24.409
 | ||
| so questions?
 | ||
| 
 | ||
| 0:50:38.419,0:50:40.989
 | ||
| Yes
 | ||
| (inaudible question)
 | ||
| 
 | ||
| 0:50:40.989,0:50:45.719
 | ||
| I would expect that to happen
 | ||
| but it's not something I’ve attempted to test
 | ||
| 
 | ||
| 0:50:45.719,0:50:50.570
 | ||
| what I would really like is to have a topology aware allocator
 | ||
| 
 | ||
| 0:50:50.570,0:50:53.179
 | ||
| so that you can request that you know I want
 | ||
| 
 | ||
| 0:50:53.179,0:50:56.229
 | ||
| I want to share cache or I don't want to share cache
 | ||
| 
 | ||
| 0:50:56.229,0:51:00.170
 | ||
| I want to share memory band width or not share memory bandwidth
 | ||
| 
 | ||
| 0:51:00.170,0:51:02.459
 | ||
| open MPI 1.3
 | ||
| 
 | ||
| 0:51:02.459,0:51:08.469
 | ||
| on the Linux side has a topology-aware wrapper for their CPU
 | ||
| 
 | ||
| 0:51:08.469,0:51:10.159
 | ||
| functionality
 | ||
| 
 | ||
| 0:51:10.159,0:51:12.249
 | ||
| makes it something called
 | ||
| 
 | ||
| 0:51:12.249,0:51:14.139
 | ||
| PLPA
 | ||
| 
 | ||
| 0:51:14.139,0:51:15.259
 | ||
| portable Linux
 | ||
| 
 | ||
| 0:51:16.519,0:51:19.599
 | ||
| CPU allocator. Is that what
 | ||
| it's actually been
 | ||
| 
 | ||
| 0:51:19.599,0:51:21.959
 | ||
| what the acronym is
 | ||
| 
 | ||
| 0:51:21.959,0:51:25.400
 | ||
| in essence they have to work around the fact
 | ||
| that there were three standard
 | ||
| 
 | ||
| 0:51:25.400,0:51:27.809
 | ||
| there were three different
 | ||
| 
 | ||
| 0:51:27.809,0:51:31.759
 | ||
| kernel APIs for the same syscall
 | ||
| 
 | ||
| 0:51:31.759,0:51:38.759
 | ||
| for CPU allocation because all the vendors 
 | ||
| did it themselves somehow
 | ||
| 
 | ||
| 0:51:38.769,0:51:44.969
 | ||
| they're the same number but
 | ||
| they’re completely incompatible
 | ||
| 
 | ||
| 0:51:44.969,0:51:48.749
 | ||
| when you first load the application it calls
 | ||
| the syscall and it tries to figure out which
 | ||
| 
 | ||
| 0:51:48.749,0:51:50.579
 | ||
| one it is
 | ||
| 
 | ||
| 0:51:50.579,0:51:52.719
 | ||
| by what errors it returns depending on what
 | ||
| 
 | ||
| 0:51:52.719,0:51:56.139
 | ||
| arguments are missing, and it's completely evil
 | ||
| 
 | ||
| 0:51:56.139,0:52:00.859
 | ||
| I think people should port their API
 | ||
| and have their library work but
 | ||
| 
 | ||
| 0:52:00.859,0:52:05.650
 | ||
| we don’t need to do that junk
 | ||
| because we did not make that mistake 
 | ||
| 
 | ||
| 0:52:05.650,0:52:12.650
 | ||
| so I would like to see the
 | ||
| topology aware stuff in particular
 | ||
| 
 | ||
| 0:52:30.710,0:52:32.529
 | ||
| (inaudible question)
 | ||
| 
 | ||
| 0:52:32.529,0:52:37.180
 | ||
| the trick is it’s easy to limit application bandwidth
 | ||
| 
 | ||
| 0:52:39.500,0:52:42.269
 | ||
| fairly easy to limit application bandwidth
 | ||
| 
 | ||
| 0:52:42.269,0:52:44.329
 | ||
| it becomes more difficult when you have to
 | ||
| 
 | ||
| 0:52:44.329,0:52:45.430
 | ||
| if your
 | ||
| 
 | ||
| 0:52:45.430,0:52:49.759
 | ||
| interfaces are shared between application traffic
 | ||
| 
 | ||
| 0:52:49.759,0:52:50.880
 | ||
| and
 | ||
| 
 | ||
| 0:52:50.880,0:52:53.049
 | ||
| say NFS
 | ||
| 
 | ||
| 0:52:53.049,0:52:57.399
 | ||
| classifying that is going to be trickier
 | ||
| you have to tag it; you'd have to add a fair bit of code
 | ||
| 
 | ||
| 0:52:57.399,0:53:04.399
 | ||
| to trace that down through the kernel
 | ||
| certainly doable
 | ||
| 
 | ||
| 0:53:12.069,0:53:15.499
 | ||
| (inaudible question)
 | ||
| 
 | ||
| 0:53:15.499,0:53:18.389
 | ||
| I have contemplated doing just that
 | ||
| 
 | ||
| 0:53:18.389,0:53:22.059
 | ||
| or in fact the other thing we consider
 | ||
| doing
 | ||
| 
 | ||
| 0:53:22.059,0:53:24.829
 | ||
| more as a research project than is a practical thing
 | ||
| 
 | ||
| 0:53:24.829,0:53:26.719
 | ||
| would be actually how
 | ||
| 
 | ||
| 0:53:26.719,0:53:28.619
 | ||
| would be
 | ||
| 
 | ||
| 0:53:28.619,0:53:30.029
 | ||
| independent VLANs
 | ||
| 
 | ||
| 0:53:30.029,0:53:31.839
 | ||
| because then we could do
 | ||
| 
 | ||
| 0:53:31.839,0:53:32.459
 | ||
| things like
 | ||
| 
 | ||
| 0:53:32.459,0:53:35.489
 | ||
| give each process a VLAN they couldn't even
 | ||
| 
 | ||
| 0:53:35.489,0:53:37.979
 | ||
| share at the internet layer
 | ||
| 
 | ||
| 0:53:37.979,0:53:41.259
 | ||
| once the VIMAGE stuff is in place for instance we will
 | ||
| be able to do that
 | ||
| 
 | ||
| 0:53:41.259,0:53:45.049
 | ||
| and that say you know you've got your interfaces
 | ||
| it’s yours whatever
 | ||
| 
 | ||
| 0:53:45.049,0:53:46.479
 | ||
| but then we could limit it
 | ||
| 
 | ||
| 0:53:46.479,0:53:49.959
 | ||
| we could rate limit that at the kernel
 | ||
| we can also have
 | ||
| 
 | ||
| 0:53:49.959,0:53:54.729
 | ||
| we’d have a physically isolated
 | ||
| we’d have a logically isolated network as well
 | ||
| 
 | ||
| 0:53:54.729,0:53:57.589
 | ||
| with some of the latest switches we could actually
 | ||
| rate limit
 | ||
| 
 | ||
| 0:53:57.589,0:54:04.589
 | ||
| at the switch as well
 | ||
| 
 | ||
| 0:54:19.939,0:54:22.369
 | ||
| (inaudible questions)
 | ||
| so to the first question
 | ||
| 
 | ||
| 0:54:22.369,0:54:26.190
 | ||
| we don’t run multiple
 | ||
| 
 | ||
| 0:54:26.190,0:54:27.639
 | ||
| sensitivity levels of data on these clusters
 | ||
| 
 | ||
| 0:54:27.639,0:54:28.709
 | ||
| it's an unclassified cluster
 | ||
| 
 | ||
| 0:54:28.709,0:54:30.460
 | ||
| we've avoided that problem by
 | ||
| 
 | ||
| 0:54:30.460,0:54:32.299
 | ||
| not allowing it
 | ||
| 
 | ||
| 0:54:32.299,0:54:34.929
 | ||
| But it is a real issue
 | ||
| 
 | ||
| 0:54:34.929,0:54:36.939
 | ||
| it's just not one we've had to deal with 
 | ||
| 
 | ||
| 0:54:39.559,0:54:42.109
 | ||
| in practice with stuff that’s sensitive
 | ||
| 
 | ||
| 0:54:43.059,0:54:47.579
 | ||
| has handling requirements that you can't touch
 | ||
| the same hardware without a scrub
 | ||
| 
 | ||
| 0:54:47.579,0:54:49.859
 | ||
| you need a pretty
 | ||
| 
 | ||
| 0:54:49.859,0:54:51.739
 | ||
| ridiculously aggressive
 | ||
| 
 | ||
| 0:54:51.739,0:54:53.770
 | ||
| you need a very coarse granularity
 | ||
| 
 | ||
| 0:54:53.770,0:54:57.240
 | ||
| a ridiculous re-imaging process where you
 | ||
| remove all of the data
 | ||
| 
 | ||
| 0:54:57.240,0:55:00.959
 | ||
| so if I were to do that I would
 | ||
| probably get rid of the disks
 | ||
| 
 | ||
| 0:55:00.959,0:55:01.389
 | ||
| just
 | ||
| 
 | ||
| 0:55:01.389,0:55:02.400
 | ||
| go diskless
 | ||
| 
 | ||
| 0:55:02.400,0:55:04.910
 | ||
| that would get rid of my number-one failure case
 | ||
| of
 | ||
| 
 | ||
| 0:55:04.910,0:55:07.839
 | ||
| that would be pretty good but
 | ||
| 
 | ||
| 0:55:07.839,0:55:09.419
 | ||
| but haven’t done it
 | ||
| 
 | ||
| 0:55:10.609,0:55:13.819
 | ||
| NFS failures we've had occasional problems of NFS overloading
 | ||
| 
 | ||
| 
 | ||
| 0:55:13.819,0:55:15.679
 | ||
| we haven't had real problems
 | ||
| 
 | ||
| 0:55:15.679,0:55:19.279
 | ||
| we're all local network it’s fairly tightly
 | ||
| contained so we haven't had problems with 
 | ||
| 
 | ||
| 0:55:19.279,0:55:20.539
 | ||
| things
 | ||
| 
 | ||
| 0:55:20.539,0:55:21.819
 | ||
| with
 | ||
| 
 | ||
| 0:55:21.819,0:55:26.039
 | ||
| you know the server going down for extended
 | ||
| periods and causing everything to hang
 | ||
| 
 | ||
| 0:55:26.039,0:55:27.819
 | ||
| it's been more an issue of
 | ||
| 
 | ||
| 0:55:27.819,0:55:33.189
 | ||
| I mean there is, there's a problem
 | ||
| that Panasas has described as incast
 | ||
| 
 | ||
| 0:55:33.189,0:55:36.109
 | ||
| you can take out any NFS server
 | ||
| 
 | ||
| 0:55:36.109,0:55:40.809
 | ||
| I mean we had the BlueArc guys come in with their
 | ||
| FPGA-based stuff with multiple ten-gig links and I said
 | ||
| 
 | ||
| 0:55:40.809,0:55:42.049
 | ||
| you know I've got
 | ||
| 
 | ||
| 0:55:42.049,0:55:46.779
 | ||
| to do this and they said can we not try this with your whole cluster
 | ||
| 
 | ||
| 0:55:46.779,0:55:47.950
 | ||
| because if you got
 | ||
| 
 | ||
| 0:55:47.950,0:55:49.370
 | ||
| three hundred and fifty
 | ||
| 
 | ||
| 0:55:49.370,0:55:52.599
 | ||
| gigabit ethernet interfaces going into
 | ||
| the system
 | ||
| 
 | ||
| 0:55:52.599,0:55:56.589
 | ||
| Even ten gig you can saturate pretty trivially
 | ||
| 
 | ||
| 0:55:56.589,0:55:57.120
 | ||
| so that level
 | ||
| 
 | ||
| 0:55:57.120,0:55:58.930
 | ||
| there's an inherent problem
 | ||
| 
 | ||
| 0:55:58.930,0:56:01.969
 | ||
| if we need to handle that kind of bandwidth
 | ||
| we've 
 | ||
| 
 | ||
| 0:56:01.969,0:56:04.459
 | ||
| got to get a parallel file system
 | ||
| 
 | ||
| 0:56:04.459,0:56:06.069
 | ||
| get a cluster
 | ||
| 
 | ||
| 0:56:06.069,0:56:12.289
 | ||
| before doing streaming stuff we could go via SWAN or something
 | ||
| 
 | ||
| 0:56:12.289,0:56:14.949
 | ||
| anyone else?
 | ||
| 
 | ||
| 0:56:14.949,0:56:15.429
 | ||
| thank you, everyone
 | ||
| (applause and end)
 |