Second round of human edits made for hire through Amazon Mechanical
Turk to improve machine generated transcript.

Sponsored by:	FreeBSD Foundation
Murray Stokely 2010-02-15 19:49:09 +00:00
parent 461847500d
commit b29753b3b5
Notes: svn2git 2020-12-08 03:00:23 +00:00
svn path=/head/; revision=35367


@ -1,8 +1,8 @@
0:00:00.530,0:00:01.590
So, basically,
So basically,
0:00:04.590,0:00:10.029
we are going to look mainly in this second part
we are going to look, mainly in this second part,
at how to
0:00:10.029,0:00:11.519
@ -15,32 +15,32 @@ locking problems
that can be categorized in the kernel.
0:00:17.910,0:00:24.410
Here there are described two kinds of problems
you can get with locks, that are pretty much common
Here are described two kinds of problems
you can get with locks, which are pretty common.
0:00:24.410,0:00:27.859
The first one is called Lock Order Reversal
The first one is called Lock Order Reversal (LOR).
0:00:27.859,0:00:30.140
when you have for example a thread A
When you have for example a thread A,
0:00:30.140,0:00:32.340
which owns
0:00:32.340,0:00:35.870
a lock code, for example, L1
a lock called, for example, L1
0:00:35.870,0:00:37.920
and another thread B
0:00:37.920,0:00:40.070
which owns another lock, L2
which owns another lock, L2
0:00:40.070,0:00:43.150
Then thread A tries to
Then thread A tries to...
0:00:43.150,0:00:44.730
Right, it's wrong.
Right... it's wrong.
0:00:44.730,0:00:46.220
The slide is wrong.
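To make the AB-BA case above concrete, here is a minimal C sketch
using the FreeBSD mutex(9) API (the lock and thread names are
hypothetical, not from the talk):

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>

    static struct mtx l1, l2;   /* both set up with mtx_init() */

    /* Thread A */
    mtx_lock(&l1);
    mtx_lock(&l2);      /* blocks: thread B already holds L2 */

    /* Thread B */
    mtx_lock(&l2);
    mtx_lock(&l1);      /* blocks: thread A holds L1 -> LOR deadlock */

With WITNESS enabled, the reversed acquisition order would be
reported as a LOR before the actual deadlock bites.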
@ -96,7 +96,7 @@ knows that
locks should maintain
0:01:32.910,0:01:34.319
an ordering regard of each other.
an ordering with regard to each other.
0:01:34.319,0:01:38.859
That's not very simple when
@ -110,25 +110,26 @@ that there are 3 kinds of classes of locks
0:01:44.850,0:01:49.180
is going to count because you can
never mix two different kinds of locks
never freely mix two different kinds of locks.
0:01:49.180,0:01:50.680
for example
For example
0:01:50.680,0:01:51.610
a spin lock
a spinlock
0:01:51.610,0:01:53.770
and a mutex.
and a mutex
0:01:53.770,0:01:59.120
You can mix in this way.
can be mixed in this way.
0:01:59.120,0:02:01.720
You can have the mutex first and the spinlock later,
while the opposite is not actually true.
0:02:01.720,0:02:07.060
So, you will see that this kind
So, you will see that these kinds
of deadlocks are possible
0:02:07.060,0:02:09.290
@ -141,26 +142,26 @@ like for example 2 mutex or 2 spin mutex,
or such.
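As a hedged sketch of that ordering rule with the mutex(9) API
(the lock variables are made up): a sleep mutex may be followed by
a spin mutex, but not the other way around.

    mtx_lock(&sleep_m);         /* sleep mutex first... */
    mtx_lock_spin(&spin_m);     /* ...spin mutex after: allowed */
    mtx_unlock_spin(&spin_m);
    mtx_unlock(&sleep_m);

    mtx_lock_spin(&spin_m);
    mtx_lock(&sleep_m);         /* sleep mutex under a spin lock:
                                   forbidden, can deadlock */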
0:02:14.569,0:02:16.090
Um...
0:02:16.090,0:02:17.409
Also
Also,
0:02:17.409,0:02:19.949
Even if it's not very well documented,
even if it's not very well documented,
0:02:19.949,0:02:22.880
for example, spin locks,
for example, spinlocks
0:02:22.880,0:02:26.599
in previous deep, as a way to
in FreeBSD have a way to
identify such kinds of deadlocks.
0:02:26.599,0:02:27.619
And it's pretty much implemented...
And it's pretty much implemented.
0:02:27.619,0:02:29.709
a very much in it would
0:02:29.709,0:02:32.449
It's a feature enabled in the code.
@ -178,7 +179,7 @@ if it exceeds
an exaggerated threshold,
0:02:41.379,0:02:47.870
it means that they are probable
it means that they are probably
under a deadlock and the system panics.
0:02:47.870,0:02:52.489
@ -224,7 +225,7 @@ you can end up having some threads
sleeping on this wait channel
0:03:30.569,0:03:34.589
and nobody is going to wake up them again.
and nobody is going to wake them up again.
0:03:34.589,0:03:37.629
This is usually called a missed wakeup
@ -240,10 +241,10 @@ The problem is that
0:03:46.719,0:03:52.109
it's very difficult to differentiate
between missed wakeup and,
between missed wakeup and
0:03:52.109,0:03:53.480
for example,
for example
0:03:53.480,0:03:56.189
forever sleep
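A small C sketch of the race, assuming the standard msleep(9) and
wakeup(9) primitives (the flag and lock names are hypothetical):

    /* Sleeper: msleep() drops the mutex atomically with going to
       sleep, so the wakeup below cannot slip in between. */
    mtx_lock(&state_m);
    while (!done)
        msleep(&done, &state_m, PRIBIO, "wdone", 0);
    mtx_unlock(&state_m);

    /* Waker */
    mtx_lock(&state_m);
    done = 1;
    wakeup(&done);
    mtx_unlock(&state_m);

If the sleeper tested the flag without holding the mutex, the wakeup
could fire after the test but before the sleep: a missed wakeup, and
the thread sleeps forever.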
@ -256,7 +257,7 @@ that is not likely to be awaken.
0:04:01.859,0:04:07.109
So these kinds of deadlocks are
very difficult to be discovered
very very difficult to be discovered
0:04:07.109,0:04:11.669
and will require some bit of
@ -278,7 +279,7 @@ kernel systems
and some things integrated into the debugger.
0:04:22.240,0:04:22.979
Um,
0:04:22.979,0:04:25.520
In FreeBSD,
@ -292,7 +293,7 @@ with kernel problems.
0:04:32.080,0:04:36.539
The first one (and the most important)
is called witness.
is called WITNESS.
0:04:36.539,0:04:39.169
It was introduced
@ -301,7 +302,7 @@ It was introduced
in the context of SMPng
0:04:42.080,0:04:44.979
and has been written in the recent past,
and has been rewritten in the recent past,
0:04:44.979,0:04:47.919
mainly by a contribution of
@ -313,7 +314,7 @@ Isilon systems.
They contributed back then
0:04:52.270,0:04:54.989
to the writing of witness.
to the rewriting of WITNESS.
0:04:54.989,0:04:57.389
This subsystem is very important
@ -322,7 +323,7 @@ This subsystem is very important
because it tracks down exactly every order
0:05:02.730,0:05:03.949
of the locks
of the locks.
0:05:03.949,0:05:07.810
So that, if there is an ordering violation like a LOR,
@ -356,7 +357,7 @@ Doing that,
we can identify
0:05:36.539,0:05:38.419
deadlocks possible
deadlocks, possibly
0:05:38.419,0:05:39.500
even
@ -368,7 +369,7 @@ on the
reader's path.
0:05:45.529,0:05:49.609
We could say that witness is pretty big,
We could say that WITNESS is pretty big,
0:05:49.609,0:05:52.289
so activating it
@ -381,10 +382,10 @@ It's mainly used when you are going to
develop a new feature in the kernel
0:05:59.929,0:06:02.110
and you are going to test it heavily,
and you are going to test it heavily.
0:06:02.110,0:06:05.479
in particular if it has
In particular if it has
0:06:05.479,0:06:06.819
some
@ -393,7 +394,7 @@ some
relation to locking.
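For reference, these are the kernel config options involved; a
hedged sketch, see witness(4) and NOTES for the authoritative list:

    options INVARIANTS          # runtime consistency checks
    options INVARIANT_SUPPORT
    options WITNESS             # track lock orderings
    options WITNESS_SKIPSPIN    # skip spin mutexes to cut overhead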
0:06:10.509,0:06:13.089
Uh,
0:06:13.089,0:06:17.840
We could also say that with the new code
@ -403,10 +404,10 @@ provided by Isilon and Nokia,
basically
0:06:19.150,0:06:21.689
the overhead
0:06:21.689,0:06:25.479
offered by witness is greatly reduced to about
offered by WITNESS is greatly reduced to about
0:06:25.479,0:06:27.699
the 10th part of
@ -415,7 +416,7 @@ the 10th part of
what we had before.
0:06:30.240,0:06:36.150
Witness is very good at tracking LOR,
WITNESS is very good at tracking LORs,
0:06:36.150,0:06:37.849
but
@ -448,7 +449,7 @@ and
basically
0:06:58.879,0:07:04.159
and basically, it's in the 8th release,
it's in the 8th release,
we have new features
0:07:04.159,0:07:05.759
@ -467,18 +468,18 @@ and their orderings
and
0:07:17.549,0:07:23.550
and shows some graphs of the relations.
Even from the user space,
it shows some graphs of the relations
even from the user space.
0:07:23.550,0:07:28.550
you don't have to go into the kernel
degubber to look at it's output.
You don't have to go into the kernel
debugger to look at its output.
0:07:28.550,0:07:35.550
Well
0:07:35.620,0:07:37.380
...
0:07:37.380,0:07:42.250
Well, I see that sometimes when
@ -507,36 +508,36 @@ when a deadlock
is in the kernel.
0:07:59.590,0:08:02.490
...
0:08:02.490,0:08:03.389
Usually,
0:08:03.389,0:08:07.939
if you want to find a deadlock
that's happening in the kernel
that's happening in the kernel,
0:08:07.939,0:08:10.909
your first line of analysis start from the DDB
your first line of analysis starts from the DDB
0:08:10.909,0:08:13.919
instead of a post-mortem analysis,
0:08:13.919,0:08:16.839
which is even more important,
which is even more important.
0:08:16.839,0:08:22.330
but using DDB you will get more
But, using DDB you will get more
precise and better information.
0:08:22.330,0:08:24.970
Uh,
0:08:24.970,0:08:28.499
The most important hints in order to find the deadlock
0:08:28.499,0:08:34.389
are the LORs reported by witness in order
are the LORs reported by WITNESS in order
to see if there is something strange
0:08:34.389,0:08:36.690
@ -547,10 +548,10 @@ You want to know the state of all the threads
that are running on the system that is deadlocking.
0:08:41.700,0:08:42.900
...
0:08:42.900,0:08:47.050
You can see that you're deadlocking if you see that
You can see that you're deadlocking if you see that
0:08:47.050,0:08:48.070
on the runqueue
@ -570,21 +571,21 @@ and you have all the threads sleeping
in their own containers.
0:09:02.450,0:09:07.850
You need to know which are exactly locks
You need to know which are the exact locks
that are acquired
0:09:07.850,0:09:11.270
in the system
0:09:11.270,0:09:15.570
and that's something that witness provides,
and that's something that WITNESS provides
0:09:15.570,0:09:20.720
and the very important things is
and the very important thing is
to know why the threads are stopping.
0:09:20.720,0:09:24.250
So one on the most important things is
So one of the most important things is
retrieving what the threads were doing
0:09:24.250,0:09:26.320
@ -594,13 +595,13 @@ when
they were put to sleep.
0:09:28.960,0:09:30.070
...
0:09:30.070,0:09:33.009
The backtraces of all the threads involved
0:09:33.009,0:09:37.130
are so printed out in order to identify deadlocks.
are printed out in order to identify deadlocks.
0:09:37.130,0:09:38.589
In the case that
@ -609,13 +610,13 @@ In the case that
the buffer cache and VFS are
0:09:42.830,0:09:45.910
probably parts of the deadlocking
probably part of the deadlock,
0:09:45.910,0:09:50.790
you should also print out
0:09:50.790,0:09:53.420
the info about vnodes
the information about vnodes
0:09:53.420,0:09:58.250
and what we're interested in is which vnodes are locked,
@ -630,10 +631,10 @@ are actually referenced
and
0:10:03.530,0:10:10.530
which way they were called.
in which way they were locked.
0:10:11.030,0:10:13.380
So
So,
0:10:13.380,0:10:15.770
this is an example
@ -648,7 +649,7 @@ thread states
in the case of a deadlock.
0:10:20.760,0:10:27.480
This is an example of a real deadlock
This is a real example of a deadlock
0:10:27.480,0:10:28.900
but you can see
@ -663,10 +664,10 @@ this is not totally complete.
But you can see that all the threads are sleeping.
0:10:38.450,0:10:39.870
Uh...
0:10:39.870,0:10:43.580
And this one is the message
This one is the message
0:10:43.580,0:10:44.790
used by the wait channel
@ -678,7 +679,7 @@ on which they're sleeping on
or used by
0:10:48.710,0:10:54.480
the container like the turnsile or the sleep queue.
the container like the turnstile or the sleepqueue.
0:10:54.480,0:10:59.410
If I recall correctly, it's a forced unmount
@ -688,7 +689,7 @@ that does deadlocking at some point.
I'm not really sure
0:11:01.290,0:11:04.190
because I have to take a look at it.
because I'd have to take a look at it.
0:11:04.190,0:11:08.810
You can see that the relevant command here
@ -698,7 +699,7 @@ is -ps
that DDB supports.
0:11:11.220,0:11:14.220
Um,
0:11:14.220,0:11:17.520
Another important thing
@ -716,43 +717,39 @@ As you can see there,
usually
0:11:25.210,0:11:31.600
is because you can add some data structures corrupted
it's because you can have some data structures corrupted
0:11:31.600,0:11:34.320
in the per-CPU data.
0:11:34.320,0:11:38.830
That's a very common situation when you get deadlocks,
That's a very common situation where you can get deadlocks,
0:11:38.830,0:11:40.280
because, for example,
because, for example,
0:11:40.280,0:11:43.149
leaving a corrupted LPD will lead
0:11:43.149,0:11:48.750
I loved you too much review shellacking double
falls and things like that about that
0:11:48.750,0:11:55.290
to you having a bigger massive breakage like
double-faults. In general. it's a good idea to
look at all the CPUs involved in the system.
double-faults and things like that. It's usually a
good idea to look at all the CPUs involved in the system.
0:11:55.290,0:11:57.310
The command
0:11:57.310,0:12:00.120
is ""show allpcpu""
is """"-show allpcpu"".
0:12:00.120,0:12:04.960
Um,
0:12:04.960,0:12:06.959
This one
0:12:06.959,0:12:12.009
this one is a witness specific command -show alllocks
is a WITNESS-specific command, "show alllocks",
and it's going to show all the locks,
0:12:12.009,0:12:13.130
@ -771,7 +768,7 @@ a mount,
and the thread is this one,
0:12:21.270,0:12:23.660
what lock is holding,
what lock it is holding,
0:12:23.660,0:12:24.970
that's the address
@ -783,7 +780,7 @@ and where it was acquired.
It gives you line and file.
0:12:31.140,0:12:32.770
...
0:12:32.770,0:12:34.730
Actually,
@ -792,7 +789,7 @@ Actually,
that's just possible
0:12:37.620,0:12:40.859
with witness, because otherwise,
with WITNESS, because otherwise,
0:12:40.859,0:12:44.410
trying to keep all this information
@ -806,13 +803,13 @@ Then, the most important thing is
the backtrace for any thread.
0:12:59.730,0:13:01.150
...
0:13:01.150,0:13:03.390
It's going to show the backtrace
0:13:03.390,0:13:05.700
for old threads.
for all the threads.
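The DDB commands used in this part of the talk are roughly these; a
hedged summary, see ddb(4) for the exact set:

    db> ps                  (state of all the threads)
    db> show allpcpu        (per-CPU data for every CPU)
    db> show alllocks       (held locks, with file and line, WITNESS)
    db> alltrace            (backtraces for all the threads)
    db> show lockedvnods    (the locked vnodes)
    db> show witness        (the whole lock-order table)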
0:13:05.700,0:13:08.380
You can see that
@ -827,7 +824,7 @@ the thread with these addresses TID and PID
basically got sleeping
0:13:15.350,0:13:17.140
on a pnode.
on a vnode.
0:13:17.140,0:13:22.020
You will see that the backtrace in this case is FFF
@ -839,7 +836,7 @@ and
that's the context switching function,
0:13:25.729,0:13:26.900
next
0:13:26.900,0:13:32.220
those are the sleepqueues of the container
@ -867,10 +864,11 @@ you will have a lot of these kinds of traces,
but they are very important
0:13:53.079,0:13:59.270
so as developers to understand what is going on.
for the developers in order to understand
what is going on.
0:13:59.270,0:14:02.590
This ones are the locked vnodes
These are the locked vnodes
0:14:02.590,0:14:05.830
that are also very important when
@ -891,10 +889,10 @@ they are specific
to some handling of the vnodes such as recycling,
0:14:23.850,0:14:26.020
and completely freeing
and completely freeing.
0:14:26.020,0:14:27.290
that's the mount point
That's the mount point
0:14:27.290,0:14:28.770
where the vnodes
@ -927,7 +925,7 @@ For example, it tells you that
the lock
0:14:52.830,0:14:55.040
the lock is in exclusive mode
is in exclusive mode
0:14:55.040,0:14:56.280
and
@ -942,13 +940,13 @@ on its queues.
That's also
0:15:04.090,0:15:06.370
the node number
the inode number.
0:15:06.370,0:15:09.140
...
0:15:09.140,0:15:13.880
there are also under information you could receive
There is also other information you could retrieve
from DDB related to, for example,
0:15:13.880,0:15:16.980
@ -961,18 +959,18 @@ like sleep chains,
for any
0:15:19.310,0:15:24.250
wait channel if you have the address
wait channel, if you have the address
0:15:24.250,0:15:27.150
and, for example,
and for example,
0:15:27.150,0:15:32.650
you can also print the whole table of
the lock relations from witness
the lock relations from WITNESS
0:15:32.650,0:15:38.010
but it's almost never useful
because you already know that.
because you should already know that.
0:15:38.010,0:15:41.100
So you will just need to know which is the one
@ -981,14 +979,12 @@ So you will just need to know which is the one
that
0:15:41.980,0:15:43.019
um,
0:15:43.019,0:15:47.750
can give trouble.
can give trouble.
0:15:47.750,0:15:51.640
if you're going to say I mean some problem
is a problem
0:15:51.640,0:15:53.980
So if you are going to submit some problems
@ -1020,8 +1016,8 @@ I think that
it is a very good thing to talk about it.
0:16:25.670,0:16:31.420
Along with the witness, we have another
mechanism that could help us,
Along with WITNESS, we have another
important mechanism that could help us with deadlocks
0:16:31.420,0:16:34.620
and it's called KTR.
@ -1030,17 +1026,17 @@ and it's called KTR.
KTR is
0:16:36.100,0:16:40.630
basically a logger of events,
basically a logger, a kernel logger, of events.
0:16:40.630,0:16:42.550
it's
It's
0:16:42.550,0:16:45.090
highly configurable,
0:16:45.090,0:16:48.280
as you can, for example,
handle different classes of events,
handle different classes of events.
0:16:48.280,0:16:53.940
In FreeBSD we have
@ -1071,7 +1067,7 @@ enable several classes,
like
0:17:13.610,0:17:16.470
the time class of the KTR
the ten classes of the KTR
0:17:16.470,0:17:21.000
and then you are just interested in three of them
@ -1114,7 +1110,7 @@ and not the information,
for example,
0:17:48.390,0:17:50.160
a string,
the strings,
0:17:50.160,0:17:55.000
it doesn't make copies, you need to just pass
@ -1139,10 +1135,10 @@ you can also look at it from the user space
through the ktrdump interface.
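As a sketch of what logging an event looks like with the CTR*
macros from sys/ktr.h (the event text here is made up):

    #include <sys/ktr.h>

    CTR2(KTR_PROC, "mi_switch: old td %p new td %p", td, newtd);

The format string and the arguments are stored as-is, with no
copies, so whatever you pass must stay valid. From user space the
buffer can then be read back with something like:

    ktrdump -c -t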
0:18:13.430,0:18:15.820
...
0:18:15.820,0:18:17.030
And now...
0:18:17.030,0:18:19.669
Why is that important for locking?
@ -1163,8 +1159,8 @@ on which CPU branches,
and the order it happened in.
0:18:30.020,0:18:34.580
and this is very important when you're
going to track down for example races,
This is very important when you're
going to track down, for example, races,
0:18:34.580,0:18:37.720
when you are not sure about the order of operations and
@ -1185,14 +1181,14 @@ a typical trace of KTR,
where you have
0:18:52.410,0:18:56.890
the CPU where the event happened, the index,
the CPU where the event happened, that's the index,
0:18:56.890,0:18:58.620
that's a timestamp,
0:18:58.620,0:19:03.400
I think it's retrieved directly from the TSC,
but i'm not actually sure.
but I'm actually not sure.
0:19:03.400,0:19:04.889
In this case,
@ -1202,7 +1198,7 @@ i was tracking down the scheduler class,
0:19:10.210,0:19:16.100
so I was interested mainly in scheduler
workloads and i could see
workloads and I could see
0:19:16.100,0:19:19.210
for example
@ -1238,10 +1234,10 @@ this one
and other things.
0:19:46.420,0:19:48.770
Well
0:19:48.770,0:19:50.820
you can enable it
You can enable
0:19:50.820,0:19:55.280
the option KTR, but you must handle it carefully.
@ -1266,7 +1262,7 @@ enough entries
to have a reliable tracking.
0:20:11.500,0:20:13.580
...
0:20:13.580,0:20:16.780
For example, if you are going to track a lot of events,
@ -1293,23 +1289,23 @@ let you compile some classes,
or mask them,
0:20:38.370,0:20:43.770
or mask a CPU,
or even mask the CPU.
0:20:43.770,0:20:46.289
if you have a big SMP environment,
If you have a big SMP environment,
0:20:46.289,0:20:50.160
so that you can selectively enable some of them.
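A hedged example of those knobs in a kernel config (the values are
arbitrary; see ktr(4) for the real constraints):

    options KTR
    options KTR_ENTRIES=262144    # buffer size, must be a power of 2
    options KTR_COMPILE=(KTR_PROC|KTR_SCHED)  # classes compiled in
    options KTR_MASK=KTR_SCHED    # classes enabled at boot
    options KTR_CPUMASK=0x3       # only trace CPUs 0 and 1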
0:20:50.160,0:20:54.700
For example, this is very good for
tracking down races in the sleeping queue.
tracking down races in the sleepqueue.
0:20:54.700,0:21:01.700
You can find references here.
0:21:02.770,0:21:04.820
...
0:21:04.820,0:21:06.220
So,
@ -1332,7 +1328,7 @@ I think that actually our locking system
is pretty complete,
0:21:21.750,0:21:26.919
but it's also confusing for newcomers,
but it's also pretty confusing for newcomers,
it's not widely documented.
0:21:26.919,0:21:32.280
@ -1361,14 +1357,14 @@ who just need to do simple tasks.
For example, I saw a lot of guys coming from the Linux world
0:21:56.660,0:22:00.620
who wanted to actually use spinlocks for time
who wanted to actually use spinlocks all the time.
0:22:00.620,0:22:05.720
it's obvious they are missing something from our
architecture
It's obvious they are missing something from our
architecture.
0:22:05.720,0:22:07.250
From, uh, ...
From
0:22:07.250,0:22:11.010
just a technical point of view,
@ -1384,7 +1380,7 @@ we have lockmgr and sxlog
0:22:22.900,0:22:27.990
which are both read/write locks and
are both serverd by sleep queues.
are both served by sleepqueues.
0:22:27.990,0:22:31.800
They have some differences, obviously,
@ -1394,21 +1390,20 @@ but, mainly,
0:22:32.660,0:22:38.920
we could merge the missing bits and
just one of the 2 interfaces.
just use one of the two interfaces.
0:22:38.920,0:22:42.059
on the scene where he hasn't told you before
visiting all
0:22:42.059,0:22:43.920
In the same way, as I told you before,
0:22:43.920,0:22:47.340
the sleeping point, true end sleep,
the sleeping points, tsleep,
rw_sleep and sx_sleep
0:22:47.340,0:22:50.350
should probably be merged with cond vars
should probably be merged with condvars
0:22:50.350,0:22:52.930
and dropped from our kernel
@ -1418,10 +1413,10 @@ and we should probably drop sema,
0:22:55.070,0:23:00.290
because it is obsolete, and can be
replaced by condova and mutex.
replaced by condvars and mutex.
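A minimal C sketch of that replacement, assuming the condvar(9) and
mutex(9) APIs (the counter and names are hypothetical):

    #include <sys/param.h>
    #include <sys/lock.h>
    #include <sys/mutex.h>
    #include <sys/condvar.h>

    static struct mtx m;
    static struct cv resource_cv;
    static int resources;

    /* what sema_wait() did */
    mtx_lock(&m);
    while (resources == 0)
        cv_wait(&resource_cv, &m);
    resources--;
    mtx_unlock(&m);

    /* what sema_post() did */
    mtx_lock(&m);
    resources++;
    cv_signal(&resource_cv);
    mtx_unlock(&m);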
0:23:00.290,0:23:02.620
...
0:23:02.620,0:23:03.830
From
@ -1500,7 +1495,7 @@ Instead, the other one
0:24:10.290,0:24:16.180
uses spinning on a local variable
which is not shared by the threads
which is not shared by the threads,
0:24:16.180,0:24:18.030
and the time spent
@ -1512,10 +1507,10 @@ on that
local variable increases
0:24:22.780,0:24:25.440
um...
0:24:25.440,0:24:28.320
...
0:24:28.320,0:24:31.780
with the passing of time.
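A hedged sketch of that idea in C (try_acquire() and BACKOFF_MAX
are hypothetical placeholders; cpu_spinwait() is the real busy-wait
hint):

    int backoff = 1, i;

    while (!try_acquire(&lk)) {
        /* spin only on local state, not on the shared lock word */
        for (i = 0; i < backoff; i++)
            cpu_spinwait();
        if (backoff < BACKOFF_MAX)
            backoff <<= 1;      /* spin longer as time passes */
    }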
@ -1527,10 +1522,10 @@ Another interesting thing would be benchmarking
different wake-up algorithms for blocking primitives.
0:24:37.930,0:24:42.390
We have an algorithms that has proven to be
We have an algorithm that has proven to be
0:24:42.390,0:24:42.910
...
0:24:42.910,0:24:45.200
quite good
@ -1549,10 +1544,10 @@ a higher overhead but could give time improvements
on a big SMP environment.
0:24:59.760,0:25:02.500
Um, ...
0:25:02.500,0:25:07.000
another thing that would be very interesting
Another thing that would be very interesting
to fix is the priority inversion problem
0:25:07.000,0:25:08.670
@ -1584,7 +1579,7 @@ is often just a single atomic operation,
and
0:25:30.010,0:25:33.770
if it fails
if it fails,
0:25:33.770,0:25:36.900
it falls back and the hard path tries to do
@ -1595,7 +1590,7 @@ in this case the owner of record technic
was going to make the fastpath too simple
0:25:40.210,0:25:42.640
Basically
Basically,
0:25:42.640,0:25:46.310
it just considers
@ -1611,7 +1606,7 @@ And it practically lands the priority to this
owner of record, which holds the write lock.
0:26:02.210,0:26:06.900
...
0:26:06.900,0:26:11.900
Another important thing obviously is improving locking
@ -1647,4 +1642,4 @@ like the one we saw before with the malloc command,
that needs to sleep.
0:26:44.070,0:26:44.320
Questions?
Any questions?