doc/en_US.ISO8859-1/captions/2009/asiabsdcon/rao-kernellocking-2.sbv

0:00:00.530,0:00:01.590
So basically,

0:00:04.590,0:00:10.029
we are going to look, mainly in this second part,
at how to

0:00:10.029,0:00:11.519
handle some

0:00:11.519,0:00:12.560
locking problems

0:00:12.560,0:00:17.910
that categorize in the kernel.

0:00:17.910,0:00:24.410
Here, there are described two kinds of problems
you can get with locks, that are pretty much common.

0:00:24.410,0:00:27.859
The first one is called Lock Order Reversal (LOR).

0:00:27.859,0:00:30.140
When you have for example a thread A,

0:00:30.140,0:00:32.340
which owns

0:00:32.340,0:00:35.870
a lock code, for example L1

0:00:35.870,0:00:37.920
and another thread B

0:00:37.920,0:00:40.070
which owns the lock, L2

0:00:40.070,0:00:43.150
Then thread A tries to..

0:00:43.150,0:00:44.730
Right.. it's wrong.

0:00:44.730,0:00:46.220
The slide is wrong.

0:00:46.220,0:00:48.020
The slide is wrong.

0:00:48.020,0:00:51.910
Thread A tries to acquire L2,

0:00:51.910,0:00:55.670
but obviously it sleeps because
it is owned by thread B

0:00:55.670,0:00:58.530
and then thread B tries to acquire

0:00:58.530,0:01:00.030
the lock L1

0:01:00.030,0:01:02.240
and it sleeps because

0:01:02.240,0:01:06.440
it's owned by thread B.

0:01:06.440,0:01:11.410
This is a situation that never ends and
it's pretty much well documented in Cormen

0:01:11.410,0:01:16.150
and in traditional literature.

0:01:16.150,0:01:18.650
It's a classical deadlock.

0:01:18.650,0:01:19.960
This means that,

0:01:19.960,0:01:21.950
as everybody who

0:01:21.950,0:01:24.940
has ever read a book

0:01:24.940,0:01:25.899
about an operating system,

0:01:25.899,0:01:30.420
knows that

0:01:30.420,0:01:32.910
locks should maintain

0:01:32.910,0:01:34.319
an ordering in regard of each other.

0:01:34.319,0:01:38.859
That's not very simple when

0:01:38.859,0:01:40.100
you speak about a kernel.

0:01:40.100,0:01:44.850
From this point of view, the fact
that there are 3 kinds of classes of locks

0:01:44.850,0:01:49.180
is going to count because you can
never mix two different kinds of locks.

0:01:49.180,0:01:50.680
For example

0:01:50.680,0:01:51.610
a spinlock

0:01:51.610,0:01:53.770
and a mutex

0:01:53.770,0:01:59.120
can be mixed in this way.

0:01:59.120,0:02:01.720
You can have the mutex first and the spinlock later,
while the opposite is not actually true.

0:02:01.720,0:02:07.060
So, you will see that these kind
of deadlocks are possible

0:02:07.060,0:02:09.290
only in the same class of locks,

0:02:09.290,0:02:13.019
like for example 2 mutex or 2 spin mutex,

0:02:13.019,0:02:14.569
or such.

0:02:14.569,0:02:16.090


0:02:16.090,0:02:17.409
Also,

0:02:17.409,0:02:19.949
even if it's not very well documented,

0:02:19.949,0:02:22.880
for example spinlocks

0:02:22.880,0:02:26.599
in FreeBSD, have a way to
identify such kind of deadlocks.

0:02:26.599,0:02:27.619
And it's pretty much implemented.

0:02:27.619,0:02:29.709


0:02:29.709,0:02:32.449
It's a feature enabled in the code.

0:02:32.449,0:02:34.949
They just count how many times they are spinning

0:02:34.949,0:02:36.010
and

0:02:36.010,0:02:39.169
if it exceeds

0:02:39.169,0:02:41.379
an exaggerated result,

0:02:41.379,0:02:47.870
it means that they are probably
under a deadlock and the system panics.

0:02:47.870,0:02:52.489
Another common problem about locking
is when you have

0:02:52.489,0:02:55.189
a wait channel

0:02:55.189,0:02:56.659
like a condition variable (cond var),

0:02:56.659,0:02:58.849
which it is protected by a lock.

0:02:58.849,0:03:03.629
If there are some ratio chances that
this condition variable encounters,

0:03:03.629,0:03:05.489
for example,

0:03:05.489,0:03:07.219
when cond var are chased

0:03:07.219,0:03:12.649
with some preliminary conditions,
like a waiter's counter

0:03:12.649,0:03:16.359
that has to be updated

0:03:16.359,0:03:21.249
anytime that any thread tries to sleep
on the counter or flags to be set.

0:03:21.249,0:03:24.909
If these conditions are not protected
by the same lock,

0:03:24.909,0:03:30.569
you can end up having some threads
sleeping on this wait channel

0:03:30.569,0:03:34.589
and nobody is going to wake them up again.

0:03:34.589,0:03:37.629
This is usually called missed wakeup

0:03:37.629,0:03:41.249
and it's a pretty common mistake

0:03:41.249,0:03:44.799
that leads to a deadlock.

0:03:44.799,0:03:46.719
The problem is that

0:03:46.719,0:03:52.109
it's very difficult to differentiate
between missed wakeup and

0:03:52.109,0:03:53.480
for example

0:03:53.480,0:03:56.189
forever sleep

0:03:56.189,0:03:58.419
of a thread

0:03:58.419,0:04:01.859
that is not likely to be awaken.

0:04:01.859,0:04:07.109
So these kind of deadlocks are
very very difficult to be discovered

0:04:07.109,0:04:11.669
and will require some bit of
work that we will see right now.

0:04:11.669,0:04:14.509
For example,

0:04:14.509,0:04:15.270
using

0:04:15.270,0:04:16.219
some

0:04:16.219,0:04:18.179
kernel systems

0:04:18.179,0:04:22.240
and some things integrated into the debugger.

0:04:22.240,0:04:22.979


0:04:22.979,0:04:25.520
In FreeBSD,

0:04:25.520,0:04:29.859
we have quite a lot of good mechanisms
we can use to cope

0:04:29.859,0:04:32.080
with kernel problems.

0:04:32.080,0:04:36.539
The first one (and the most important)
is called WITNESS.

0:04:36.539,0:04:39.169
It was introduced

0:04:39.169,0:04:42.080
in the context of SMPng

0:04:42.080,0:04:44.979
and has been rewritten in the recent past,

0:04:44.979,0:04:47.919
mainly by a contribution of

0:04:47.919,0:04:51.360
Isilon systems.

0:04:51.360,0:04:52.270
They contributed back then

0:04:52.270,0:04:54.989
to the writing of WITNESS.

0:04:54.989,0:04:57.389
This subsystem is very important

0:04:57.389,0:05:02.730
because it tracks down exactly every order

0:05:02.730,0:05:03.949
of the locks.

0:05:03.949,0:05:07.810
So that, if there is an ordering violation like a LOR,

0:05:07.810,0:05:09.439
it's going to

0:05:09.439,0:05:12.150
tell the system.

0:05:12.150,0:05:18.029
You can even set it to directly panic if
it finds some deadlocks,

0:05:18.029,0:05:19.879
or only

0:05:19.879,0:05:22.729
some possible deadlocks.

0:05:22.729,0:05:27.260
Another important feature with it is

0:05:27.260,0:05:32.569
that it can keep track of read/write locks.

0:05:32.569,0:05:33.690
Doing that,

0:05:33.690,0:05:36.539
we can identify

0:05:36.539,0:05:38.419
deadlocks, possibly

0:05:38.419,0:05:39.500
even

0:05:39.500,0:05:40.690
on the

0:05:40.690,0:05:45.529
reader's path.

0:05:45.529,0:05:49.609
We could say that WITNESS is pretty big,

0:05:49.609,0:05:52.289
so activating it

0:05:52.289,0:05:55.039
in your production system is never an option.

0:05:55.039,0:05:59.929
It's mainly used when you are going to
develop a new feature in the kernel

0:05:59.929,0:06:02.110
and you are going to test it heavily.

0:06:02.110,0:06:05.479
In particular if it has

0:06:05.479,0:06:06.819
some

0:06:06.819,0:06:10.509
relation to locking.

0:06:10.509,0:06:13.089


0:06:13.089,0:06:17.840
We could also tell that with the new code
provided by Isilon and Nokia,

0:06:17.840,0:06:19.150
basically

0:06:19.150,0:06:21.689


0:06:21.689,0:06:25.479
offered by WITNESS is greatly reduced to about

0:06:25.479,0:06:27.699
the 10th part of

0:06:27.699,0:06:30.240
what we had before.

0:06:30.240,0:06:36.150
WITNESS is very good at tracking LOR,

0:06:36.150,0:06:37.849
but

0:06:37.849,0:06:40.009
it's not very good at, for example,

0:06:40.009,0:06:42.449
trying to

0:06:42.449,0:06:44.060
help you

0:06:44.060,0:06:47.479
in the case of lost wakeups,

0:06:47.479,0:06:49.519
because of its nature,

0:06:49.519,0:06:52.090
mainly.

0:06:52.090,0:06:55.889
It has a very good integration with the DDB debugger

0:06:55.889,0:06:57.740
and

0:06:57.740,0:06:58.879
basically

0:06:58.879,0:07:04.159
it's in the 8th release,
we have new features

0:07:04.159,0:07:05.759
that help you

0:07:05.759,0:07:08.389
print out backtraces

0:07:08.389,0:07:11.150
of the contesting threads on the LORs

0:07:11.150,0:07:16.039
and their orderings

0:07:16.039,0:07:17.549
and

0:07:17.549,0:07:23.550
it shows some graphs of the relations
even from the user space.

0:07:23.550,0:07:28.550
You don't have to go into the kernel
debugger to look at it's output.

0:07:28.550,0:07:35.550


0:07:35.620,0:07:37.380


0:07:37.380,0:07:42.250
Well, I see that sometimes when
they are released there is a confusion

0:07:42.250,0:07:44.250
about the information reports

0:07:44.250,0:07:48.440
in regard of deadlocks conditions and what help

0:07:48.440,0:07:50.020
users can provide to developers

0:07:50.020,0:07:52.039
about that.

0:07:52.039,0:07:54.020
So we are going to see

0:07:54.020,0:07:58.700
all the relevant information
when a deadlock

0:07:58.700,0:07:59.590
is in the kernel.

0:07:59.590,0:08:02.490


0:08:02.490,0:08:03.389
Usually,

0:08:03.389,0:08:07.939
if you want to find a deadlock
that's happening in the kernel,

0:08:07.939,0:08:10.909
your first line of analysis starts from the DDB

0:08:10.909,0:08:13.919
instead of a post-mortem analysis,

0:08:13.919,0:08:16.839
which is even more important.

0:08:16.839,0:08:22.330
But, using DDB you will get more
processes and better information.

0:08:22.330,0:08:24.970


0:08:24.970,0:08:28.499
The most important unit in order to find the deadlock

0:08:28.499,0:08:34.389
are the LORs reported by WITNESS in order
to see if there is something strange

0:08:34.389,0:08:36.690
that can be happening.

0:08:36.690,0:08:41.700
You want to know the state of all the threads
that are running on the system that is deadlocking.

0:08:41.700,0:08:42.900


0:08:42.900,0:08:47.050
You can see that you're deadlocking, if you see that

0:08:47.050,0:08:48.070
on the runqueue

0:08:48.070,0:08:48.540
there are

0:08:48.540,0:08:51.850
just idle threads.

0:08:51.850,0:08:56.640
Because it's like saying that
runqueues are complete flushed

0:08:56.640,0:09:02.450
and you have all the threads sleeping
in their own containers.

0:09:02.450,0:09:07.850
You need to know which are the exact locks
that are acquired

0:09:07.850,0:09:11.270
in the system

0:09:11.270,0:09:15.570
and that's something that WITNESS provides

0:09:15.570,0:09:20.720
and the very important thing is
to know why the threads are stopping.

0:09:20.720,0:09:24.250
So one of the most important things is
retrieving what the threads were doing

0:09:24.250,0:09:26.320
when

0:09:26.320,0:09:28.960
they were put asleep.

0:09:28.960,0:09:30.070


0:09:30.070,0:09:33.009
The backtraces of all the threads involved

0:09:33.009,0:09:37.130
are printed out in order to identify deadlocks.

0:09:37.130,0:09:38.589
In the case that

0:09:38.589,0:09:42.830
buffered cache and VFS are

0:09:42.830,0:09:45.910
probably parts of the deadlocking,

0:09:45.910,0:09:50.790
you should also print out

0:09:50.790,0:09:53.420
the information about vnodes

0:09:53.420,0:09:58.250
and what we're interested in is which vnodes are called,

0:09:58.250,0:09:59.320
which

0:09:59.320,0:10:01.370
are actually referenced

0:10:01.370,0:10:03.530
and

0:10:03.530,0:10:10.530
in which way they were called.

0:10:11.030,0:10:13.380
So,

0:10:13.380,0:10:15.770
this is an example

0:10:15.770,0:10:17.430
of the

0:10:17.430,0:10:18.880
thread states

0:10:18.880,0:10:20.760
in the case of a deadlock.

0:10:20.760,0:10:27.480
This is an real example of a deadlock

0:10:27.480,0:10:28.900
but you can see

0:10:28.900,0:10:31.890
that

0:10:31.890,0:10:35.650
this is not totally complete.

0:10:35.650,0:10:38.450
But you can see that all the threads are sleeping.

0:10:38.450,0:10:39.870


0:10:39.870,0:10:43.580
This one is the message

0:10:43.580,0:10:44.790
used by the wait channel

0:10:44.790,0:10:47.550
on which they're sleeping on

0:10:47.550,0:10:48.710
or used by

0:10:48.710,0:10:54.480
the container like the turnstile or the sleepqueue.

0:10:54.480,0:10:59.410
If I recall correctly, it's a forced amount
that does deadlocking at some point.

0:10:59.410,0:11:01.290
I'm not really sure

0:11:01.290,0:11:04.190
because I should have looked at it.

0:11:04.190,0:11:08.810
You can see that the revelant command here
is -ps

0:11:08.810,0:11:11.220
that DDB supports.

0:11:11.220,0:11:14.220


0:11:14.220,0:11:17.520
Another important thing

0:11:17.520,0:11:18.820
is getting

0:11:18.820,0:11:21.680
the situation of all CPUs.

0:11:21.680,0:11:24.100
As you can see there,

0:11:24.100,0:11:25.210
usually

0:11:25.210,0:11:31.600
its because you can add some data structures corrupted

0:11:31.600,0:11:34.320
in the per-CPU datas.

0:11:34.320,0:11:38.830
That's a very common situation where you can get deadlocks,

0:11:38.830,0:11:40.280
because, for example,

0:11:40.280,0:11:43.149
leaving a corrupted LPD will lead

0:11:48.750,0:11:55.290
to you having a bigger massive breakage like
double-faults and things like that. Usually it's always a
good idea to look at all the CPUs involved in the system.

0:11:55.290,0:11:57.310
The command

0:11:57.310,0:12:00.120
is """"-show allpcpu"".

0:12:00.120,0:12:04.960


0:12:04.960,0:12:06.959
This one

0:12:06.959,0:12:12.009
is a WITNESS specific command ""-show alllocks""
and it's going to show all the locks,

0:12:12.009,0:12:13.130
in the system,

0:12:13.130,0:12:15.070
who is the owner,

0:12:15.070,0:12:15.850
like in this case,

0:12:15.850,0:12:17.690
a mount,

0:12:17.690,0:12:21.270
and the thread is this one,

0:12:21.270,0:12:23.660
what the lock is holding,

0:12:23.660,0:12:24.970
that's the address

0:12:24.970,0:12:27.360
and where it was acquired.

0:12:27.360,0:12:31.140
It gives you lines and file.

0:12:31.140,0:12:32.770


0:12:32.770,0:12:34.730
Actually,

0:12:34.730,0:12:37.620
that's just possible

0:12:37.620,0:12:40.859
with WITNESS, because otherwise,

0:12:40.859,0:12:44.410
trying to keep the oldest information

0:12:44.410,0:12:51.410
in a general purpose kernel will be
very expensive for our logging subsystem.

0:12:55.330,0:12:59.730
Then, the most important thing is
the backtrace for any thread.

0:12:59.730,0:13:01.150


0:13:01.150,0:13:03.390
It's going to show the backtrace

0:13:03.390,0:13:05.700
for all the threads.

0:13:05.700,0:13:08.380
the seas

0:13:08.380,0:13:09.169
In this case,

0:13:09.169,0:13:13.010
the thread with these addresses TID and PID

0:13:13.010,0:13:15.350
basically got sleeping

0:13:15.350,0:13:17.140
on a vnode.

0:13:17.140,0:13:22.020
You will see that the backend in this case is FFF

0:13:22.020,0:13:24.000
and

0:13:24.000,0:13:25.729
that's the context switching function,

0:13:25.729,0:13:26.900


0:13:26.900,0:13:32.220
those are the sleepqueues of the containter
that is containing the threads,

0:13:32.220,0:13:34.230
and this one

0:13:34.230,0:13:36.370
is what it was going to do

0:13:36.370,0:13:37.910
before,

0:13:37.910,0:13:41.810
in this case mounting the filesystems.

0:13:41.810,0:13:47.220
You will see that on a complete feeding,

0:13:47.220,0:13:50.310
you will have a lot of these kinds of traces,

0:13:50.310,0:13:53.079
but they are very important

0:13:53.079,0:13:59.270
for the developers in order to understand
what is going on.

0:13:59.270,0:14:02.590
These ones are the locked vnodes

0:14:02.590,0:14:05.830
that are also very important when

0:14:05.830,0:14:11.780
a deadlock happens in VFS or in the buffer cache.

0:14:11.780,0:14:13.700
You will see

0:14:13.700,0:14:18.580
that these are the ref counts linked to vnodes,

0:14:18.580,0:14:20.980
they are specific

0:14:20.980,0:14:23.850
to some handling of the vnodes such as recycling,

0:14:23.850,0:14:26.020
and completely freeing.

0:14:26.020,0:14:27.290
That's the mount point

0:14:27.290,0:14:28.770
where the vnodes

0:14:28.770,0:14:31.740
belong

0:14:31.740,0:14:33.930
and

0:14:33.930,0:14:35.910
that is the backtrace

0:14:35.910,0:14:39.760
of what happened when the vnode

0:14:39.760,0:14:41.060
was acquired.

0:14:41.060,0:14:46.600
You can see that this comment also gives you information

0:14:46.600,0:14:49.000
about the lock linked to the vnode.

0:14:49.000,0:14:51.640
For example, it tells you that

0:14:51.640,0:14:52.830
the lock

0:14:52.830,0:14:55.040
is in exclusive mode

0:14:55.040,0:14:56.280
and

0:14:56.280,0:14:59.320
it does some shared waits

0:14:59.320,0:15:03.260
on its queues.

0:15:03.260,0:15:04.090
That's also

0:15:04.090,0:15:06.370
the node number.

0:15:06.370,0:15:09.140


0:15:09.140,0:15:13.880
There is also other information you could receive
from the DDB linked to, for example,

0:15:13.880,0:15:16.980
the bugging deadlocks,

0:15:16.980,0:15:18.100
like sleep chains,

0:15:18.100,0:15:19.310
for any

0:15:19.310,0:15:24.250
wait channel, if you have the address

0:15:24.250,0:15:27.150
and for example,

0:15:27.150,0:15:32.650
you can also print the wall table of
the lock relations from WITNESS

0:15:32.650,0:15:38.010
but it's mostly never useful
because you should already know that.

0:15:38.010,0:15:41.100
So you will just need to know which is the one

0:15:41.100,0:15:41.980
that

0:15:41.980,0:15:43.019


0:15:43.019,0:15:47.750
can give the trouble.

0:15:47.750,0:15:51.640

0:15:51.640,0:15:53.980
So if you are going to submit some problems

0:15:53.980,0:15:57.180
usually called NGAP that are probably
deadlocked in the kernel space,

0:15:57.180,0:16:04.130
that ones

0:16:04.130,0:16:11.130
are the information that we actually need.

0:16:11.650,0:16:14.970
Now,

0:16:14.970,0:16:18.950
it's very difficult to see very good reports
about deadlocks,

0:16:18.950,0:16:20.020
so

0:16:20.020,0:16:22.569
I think that

0:16:22.569,0:16:25.670
it is a very good thing to talk about it.

0:16:25.670,0:16:31.420
Along with the WITNESS, we have another
important mechanism that could help us with deadlocks

0:16:31.420,0:16:34.620
and it's called KTR.

0:16:34.620,0:16:36.100
KTR is

0:16:36.100,0:16:40.630
basically a logger, a kernel logger, of events.

0:16:40.630,0:16:42.550
It's

0:16:42.550,0:16:45.090
highly configurable,

0:16:45.090,0:16:48.280
as you can, for example,
handle different classes of events.

0:16:48.280,0:16:53.940
In FreeBSD we have

0:16:53.940,0:16:55.130
classes linked to the scheduler,

0:16:55.130,0:16:56.290
to the locking,

0:16:56.290,0:16:58.520
to the VFS, to callouts,

0:16:58.520,0:17:05.030
and they are all packed in the same class.

0:17:05.030,0:17:08.880
So they can be selectively enabled or masked.

0:17:08.880,0:17:10.030
For example

0:17:10.030,0:17:12.190
the difference is that you can
enable several classes,

0:17:12.190,0:17:13.610
like

0:17:13.610,0:17:16.470
the ten classes of the KTR

0:17:16.470,0:17:21.000
and then you are just interested in three of them
even if all ten of them are

0:17:21.000,0:17:23.030
actually tracked.

0:17:23.030,0:17:24.240
You will are just going to

0:17:24.240,0:17:26.839
see three of them.

0:17:26.839,0:17:28.439
Um,

0:17:28.439,0:17:31.940
an important thing is that KTR

0:17:31.940,0:17:34.520
doesn't handle,

0:17:34.520,0:17:37.770
for example

0:17:37.770,0:17:38.300
pointers,

0:17:38.300,0:17:40.340
doesn't store the information

0:17:40.340,0:17:45.450
passed to init, it just stores the pointer

0:17:45.450,0:17:46.880
and not the information,

0:17:46.880,0:17:48.390
for example,

0:17:48.390,0:17:50.160
the strings,

0:17:50.160,0:17:55.000
it doesn't make copies, you need to just pass
the pointers

0:17:55.000,0:17:57.610
which are persistent in the memory.

0:17:57.610,0:18:00.340
Otherwise,

0:18:00.340,0:18:05.500
KTR won't be able to access them.

0:18:05.500,0:18:09.760
The good thing about KTR is that

0:18:09.760,0:18:11.600
you can also look at it from the user space

0:18:11.600,0:18:13.430
through the ktrdump interface.

0:18:13.430,0:18:15.820


0:18:15.820,0:18:17.030


0:18:17.030,0:18:19.669
Why is that important for locking?

0:18:19.669,0:18:20.279
Because

0:18:20.279,0:18:21.090
after,

0:18:21.090,0:18:24.350
it can tell you what happed,

0:18:24.350,0:18:27.260
on which CPU branches,

0:18:27.260,0:18:30.020
and the order it happened in.

0:18:30.020,0:18:34.580
This is very important when you're
going to track down for example traces,

0:18:34.580,0:18:37.720
when you are not sure about the order of operations and

0:18:37.720,0:18:44.710
how they happened. It's going to tell you.

0:18:44.710,0:18:46.290
For example

0:18:46.290,0:18:48.650
that is

0:18:48.650,0:18:51.090
a typical trace of KTR,

0:18:51.090,0:18:52.410
where you have

0:18:52.410,0:18:56.890
the CPU where the event happened, thats the index,

0:18:56.890,0:18:58.620
that's a timestamp,

0:18:58.620,0:19:03.400
I think it's retrieved directly from the TSC,
but i'm actually not sure.

0:19:03.400,0:19:04.889
In this case,

0:19:04.889,0:19:10.210
i was tracking down the scheduler class,

0:19:10.210,0:19:16.100
so I was interested mainly in scheduler
workloads and I could see

0:19:16.100,0:19:19.210
for example

0:19:19.210,0:19:21.100
that

0:19:21.100,0:19:24.870
a contact switch happened

0:19:24.870,0:19:26.919
scheduling

0:19:26.919,0:19:28.010
the idle CPU

0:19:28.010,0:19:30.270
and then other information,

0:19:30.270,0:19:34.370
like for example the load of the CPU 1

0:19:34.370,0:19:37.190
and scan priority boost,

0:19:37.190,0:19:38.870
like

0:19:38.870,0:19:40.310
this one

0:19:40.310,0:19:46.420
and other things.

0:19:46.420,0:19:48.770


0:19:48.770,0:19:50.820
You can enable

0:19:50.820,0:19:55.280
the option KTR, but you must handle it carefully.

0:19:55.280,0:19:57.130
For example

0:19:57.130,0:20:01.990
use an option i didn't include here,

0:20:01.990,0:20:07.410
which is the length of the buffer it uses
to store the pointers in, it's called KTR_ENTRIES,

0:20:07.410,0:20:08.360
and you should specify

0:20:08.360,0:20:09.590
enough entries

0:20:09.590,0:20:11.500
to have a reliable tracking.

0:20:11.500,0:20:13.580


0:20:13.580,0:20:16.780
For example, if you are going to track a lot of events,

0:20:16.780,0:20:19.100
a short queue is not an option,

0:20:19.100,0:20:22.120
because you are going to lose some information.

0:20:22.120,0:20:26.780
A typical queue is of length 2K (2 kilobytes)

0:20:26.780,0:20:29.710
of entries.

0:20:29.710,0:20:32.520
These other options

0:20:32.520,0:20:36.190
let you compile some classes,

0:20:36.190,0:20:38.370
or mask them,

0:20:38.370,0:20:43.770
or even mask the CPU.

0:20:43.770,0:20:46.289
If you have a big SMP environment,

0:20:46.289,0:20:50.160
so that you can selectively enable some of them.

0:20:50.160,0:20:54.700
For example, this is very good for
tracking down traces in the sleeping queue.

0:20:54.700,0:21:01.700
You can find referrals here.

0:21:02.770,0:21:04.820


0:21:04.820,0:21:06.220
So,

0:21:06.220,0:21:09.020
I will spend the last time of the speech
speaking about possible improvements

0:21:09.020,0:21:10.500
to

0:21:10.500,0:21:15.670
our locking system, which is not very bad.

0:21:15.670,0:21:16.500
Well,

0:21:16.500,0:21:21.750
I think that actually our locking system
is pretty complete,

0:21:21.750,0:21:26.919
but it's also pretty confusing for newcomers,
it's not widely documented.

0:21:26.919,0:21:32.280
so maybe we will spend a good amount of time
on documentation.

0:21:32.280,0:21:32.799
As you can see,

0:21:32.799,0:21:38.120
even in this presentation, which is not very huge,

0:21:38.120,0:21:42.540
there are many things to say

0:21:42.540,0:21:46.700
and that are not very simple to understand in particular

0:21:46.700,0:21:48.240
for people

0:21:48.240,0:21:50.280
who just need to do simple tasks.

0:21:50.280,0:21:56.660
For example, I saw a lot of guys coming from Linux World

0:21:56.660,0:22:00.620
who wanted to actually use spinlocks for time.

0:22:00.620,0:22:05.720
It's obvious they are missing something from our
architecture.

0:22:05.720,0:22:07.250
From

0:22:07.250,0:22:11.010
just a technical point of view,

0:22:11.010,0:22:14.530
it would be very good if we could remove

0:22:14.530,0:22:20.440
legacy support and overriding support. For example,

0:22:20.440,0:22:22.900
we have lockmgr and sxlog

0:22:22.900,0:22:27.990
which are both read/write locks and
are both servered by sleep queues.

0:22:27.990,0:22:31.800
They have some differences, obviously,

0:22:31.800,0:22:32.660
but, mainly,

0:22:32.660,0:22:38.920
we could manage the missing bits and
just use one of the two interfaces.

0:22:38.920,0:22:42.059


0:22:42.059,0:22:43.920
In the same way, as i told you before,

0:22:43.920,0:22:47.340
the sleeping point, true-end sleep,
read/write sleep and sxsleep

0:22:47.340,0:22:50.350
should probably be managed with cond vars

0:22:50.350,0:22:52.930
and superdoff our kernel

0:22:52.930,0:22:55.070
and we should probably drop sema,

0:22:55.070,0:23:00.290
because it is obsolete, and can be
replaced by condvars and mutex.

0:23:00.290,0:23:02.620


0:23:02.620,0:23:03.830
From

0:23:03.830,0:23:05.639
a strong technical point of view,

0:23:05.639,0:23:07.350
as you can see,

0:23:07.350,0:23:09.680
we spent a lot of time

0:23:09.680,0:23:12.109
on optimization on our blocking primitives,

0:23:12.109,0:23:16.770
but very few on our spinning primitives.

0:23:16.770,0:23:21.810
That's because obviously blocking
primitives are our first choice,

0:23:21.810,0:23:22.700
but

0:23:22.700,0:23:25.210
spinlocks could be improved too,

0:23:25.210,0:23:30.130
using technics such as the so-called
back-off algorithms and

0:23:30.130,0:23:31.429
queued spinlocks.

0:23:31.429,0:23:33.560
I have a patch for that but

0:23:33.560,0:23:35.499
I would really need

0:23:35.499,0:23:39.999
to test it and tune it on a big SMP environment.

0:23:39.999,0:23:42.270
I don't think

0:23:42.270,0:23:43.540
that, for now,

0:23:43.540,0:23:45.730
they can handle such an environment.

0:23:45.730,0:23:47.820
In addition,

0:23:47.820,0:23:51.500
I'm not sure you are familiar with
queued spinlocks algorithms.

0:23:51.500,0:23:55.580
Basically a back-off algorithms try

0:23:55.580,0:24:02.250
to reduce the cache pressure on the

0:24:02.250,0:24:06.090
threads contesting on the lock

0:24:06.090,0:24:08.270
by giving a time.

0:24:08.270,0:24:10.290
Instead, the other one

0:24:10.290,0:24:16.180
uses spinning on a local variable
which is not shared by the threads.

0:24:16.180,0:24:18.030
and the time spent

0:24:18.030,0:24:20.140
on that

0:24:20.140,0:24:22.780
local variable increases

0:24:22.780,0:24:25.440


0:24:25.440,0:24:28.320


0:24:28.320,0:24:31.780
with the passing of time.

0:24:31.780,0:24:35.950
Another interesting thing would be benchmarking

0:24:35.950,0:24:37.930
different wake-up algorithms for blocking primitives.

0:24:37.930,0:24:42.390
We have an algorithm that has proven to be

0:24:42.390,0:24:42.910


0:24:42.910,0:24:45.200
quite good

0:24:45.200,0:24:47.440
but

0:24:47.440,0:24:51.330
we are not confronted with other kind of
wake-ups that could have

0:24:51.330,0:24:56.450
a higher overhead but could give time improvements

0:24:56.450,0:24:59.760
on a big SMP environment.

0:24:59.760,0:25:02.500


0:25:02.500,0:25:07.000
Another thing that would be very interesting
to fix is the priority inversion problem

0:25:07.000,0:25:08.670
in the case of read locks.

0:25:08.670,0:25:09.790
There is an approach

0:25:09.790,0:25:13.950
called dominant record implemented in FreeBSD,

0:25:13.950,0:25:16.360
but i saw that it's pretty

0:25:16.360,0:25:22.160
slow for our fastpack. In FreeBSD all our locking
primitives are broken and add to path

0:25:22.160,0:25:23.290
where

0:25:23.290,0:25:25.820
the fastpack is

0:25:25.820,0:25:28.620
is often just a single atomic operation,

0:25:28.620,0:25:30.010
and

0:25:30.010,0:25:33.770
if it fails,

0:25:33.770,0:25:36.900
it falls down and the art pattern tries to do
all the end-work of the sleep queues.

0:25:36.900,0:25:40.210
in this case the owner of record technic
was going to make the fastpack too simple

0:25:40.210,0:25:42.640
Basically,

0:25:42.640,0:25:46.310
it just considers

0:25:46.310,0:25:50.940
the readets as one

0:25:50.940,0:25:55.380
and can switch in and out.

0:25:55.380,0:26:02.210
And it practically lands the priority to this
owner of record which does it's right log.

0:26:02.210,0:26:06.900


0:26:06.900,0:26:11.900
Another important thing obviously is improving locking

0:26:11.900,0:26:13.420
where the

0:26:13.420,0:26:15.649
optimum approach is not chosen.

0:26:15.649,0:26:21.039
I see a lot of parts in which the
primitives chosen by the developers

0:26:21.039,0:26:23.320
are not the most suitable ones

0:26:23.320,0:26:27.690
and we should switch to the right one.

0:26:27.690,0:26:33.120
Like, for example, in discriminated usage
of the spinlocks

0:26:33.120,0:26:34.990
or the blocking primitives,

0:26:34.990,0:26:40.430
just to handle cases like that,
like the one we saw before with the malloc command,

0:26:40.430,0:26:44.070
that needs to sleep.

0:26:44.070,0:26:44.320
Any questions?