1
00:00:06,320 --> 00:00:11,499
[Music]

2
00:00:15,280 --> 00:00:19,760
welcome back everyone

3
00:00:16,800 --> 00:00:21,480
next up we have muhammad falak

4
00:00:19,760 --> 00:00:24,800
talking about uh

5
00:00:21,480 --> 00:00:26,960
ebpf 101 muhammad falak is a software

6
00:00:24,800 --> 00:00:29,039
engineer who is interested in how the

7
00:00:26,960 --> 00:00:31,760
ebpf subsystem

8
00:00:29,039 --> 00:00:33,600
can be leveraged in novel ways to enable

9
00:00:31,760 --> 00:00:37,120
new use cases

10
00:00:33,600 --> 00:00:37,120
please welcome muhammad falak

11
00:00:38,160 --> 00:00:45,680
hello so this is the ebpf 101 talk i hope

12
00:00:43,120 --> 00:00:48,160
everybody can hear me all right so the

13
00:00:45,680 --> 00:00:51,280
idea is that we only have 20 minutes 20

14
00:00:48,160 --> 00:00:52,480
or 25 odd minutes and plus questions so

15
00:00:51,280 --> 00:00:54,480
what we're going to do is we're not

16
00:00:52,480 --> 00:00:55,440
going to do any programming sort of

17
00:00:54,480 --> 00:00:57,680
learning

18
00:00:55,440 --> 00:00:59,280
how to program the ebpf subsystem

19
00:00:57,680 --> 00:01:01,440
but it's an overview from the

20
00:00:59,280 --> 00:01:03,359
perspective of a non-kernel programmer of

21
00:01:01,440 --> 00:01:04,879
what the ebpf subsystem means because

22
00:01:03,359 --> 00:01:06,799
there's been lately a lot of buzz around

23
00:01:04,879 --> 00:01:08,880
the ebpf subsystem

24
00:01:06,799 --> 00:01:11,600
so

25
00:01:08,880 --> 00:01:13,280
a little bit about me who am i i

26
00:01:11,600 --> 00:01:15,200
recently finished this cool book

27
00:01:13,280 --> 00:01:16,799
pretending to be a linux kernel expert

28
00:01:15,200 --> 00:01:18,720
pun intended

29
00:01:16,799 --> 00:01:20,400
i come from a very beautiful part of the

30
00:01:18,720 --> 00:01:21,680
world called srinagar

31
00:01:20,400 --> 00:01:22,560
kashmir

32
00:01:21,680 --> 00:01:25,840
and

33
00:01:22,560 --> 00:01:27,759
yeah i am in no way an expert in all of

34
00:01:25,840 --> 00:01:30,159
this i just am reading it for fun and

35
00:01:27,759 --> 00:01:34,000
trying to you know leverage it and use

36
00:01:30,159 --> 00:01:36,400
it in my own day job and try and do

37
00:01:34,000 --> 00:01:37,759
more experiments with it

38
00:01:36,400 --> 00:01:40,400
so

39
00:01:37,759 --> 00:01:41,360
yeah let's look at the agenda so first

40
00:01:40,400 --> 00:01:44,720
of all

41
00:01:41,360 --> 00:01:47,680
we are going to look at the history of

42
00:01:44,720 --> 00:01:49,040
what this bpf thingy is

43
00:01:47,680 --> 00:01:51,600
and then we are going to move to the

44
00:01:49,040 --> 00:01:54,880
more recent the current parts of the

45
00:01:51,600 --> 00:01:56,159
ebpf where the e comes from

46
00:01:54,880 --> 00:01:58,880
so

47
00:01:56,159 --> 00:02:02,159
without further ado let's get started

48
00:01:58,880 --> 00:02:05,119
so let's go back in time probably to the

49
00:02:02,159 --> 00:02:08,160
90s and assume that there's no

50
00:02:05,119 --> 00:02:10,399
tcpdump and we want to design a packet

51
00:02:08,160 --> 00:02:12,959
filter so

52
00:02:10,399 --> 00:02:13,920
what could be our sort of

53
00:02:12,959 --> 00:02:16,640
design

54
00:02:13,920 --> 00:02:18,959
goals we want to design a packet filter

55
00:02:16,640 --> 00:02:20,720
which copies or

56
00:02:18,959 --> 00:02:23,840
gives us a way to look at every packet

57
00:02:20,720 --> 00:02:25,200
that goes out through the wire and

58
00:02:23,840 --> 00:02:27,840
inspect it

59
00:02:25,200 --> 00:02:29,040
so how could it be implemented there are

60
00:02:27,840 --> 00:02:29,840
generally

61
00:02:29,040 --> 00:02:31,519
two

62
00:02:29,840 --> 00:02:33,440
ways these are not exhaustive ways but

63
00:02:31,519 --> 00:02:34,879
generally there are two ways

64
00:02:33,440 --> 00:02:36,959
one could be

65
00:02:34,879 --> 00:02:40,239
you copy everything that goes through

66
00:02:36,959 --> 00:02:41,840
the wire to user space and then apply a

67
00:02:40,239 --> 00:02:44,160
filter on that of whatever is

68
00:02:41,840 --> 00:02:46,640
interesting to you whatever is not

69
00:02:44,160 --> 00:02:48,400
the other way could be more optimal

70
00:02:46,640 --> 00:02:52,879
where you write a kernel module and you

71
00:02:48,400 --> 00:02:55,599
say if x is the destination port

72
00:02:52,879 --> 00:02:57,599
x is the source y is the source port

73
00:02:55,599 --> 00:03:00,720
blah blah blah you load that module in

74
00:02:57,599 --> 00:03:02,400
the kernel it only copies the packets

75
00:03:00,720 --> 00:03:05,599
that you wanted to look at and that were

76
00:03:02,400 --> 00:03:07,440
interesting to you but again we we are

77
00:03:05,599 --> 00:03:10,879
going to talk about the trade-offs

78
00:03:07,440 --> 00:03:13,120
involved so what's the problem here

79
00:03:10,879 --> 00:03:15,120
if we have the user space implementation

80
00:03:13,120 --> 00:03:15,840
we copy everything it's it's

81
00:03:15,120 --> 00:03:18,080
like

82
00:03:15,840 --> 00:03:21,760
a no-brainer you copy everything that

83
00:03:18,080 --> 00:03:22,959
goes on to the wire to the user space

84
00:03:21,760 --> 00:03:24,720
and then

85
00:03:22,959 --> 00:03:26,319
you do what you need to do with that in

86
00:03:24,720 --> 00:03:28,400
the kernel space if you implemented that

87
00:03:26,319 --> 00:03:31,440
module thing what would have happened

88
00:03:28,400 --> 00:03:33,120
you hard code whatever you wanted

89
00:03:31,440 --> 00:03:35,599
to look at what was interesting for your

90
00:03:33,120 --> 00:03:37,040
use case and then only that thing gets

91
00:03:35,599 --> 00:03:38,480
copied to the user space and you take a

92
00:03:37,040 --> 00:03:40,400
look at it at the end of the day you

93
00:03:38,480 --> 00:03:42,159
have to copy stuff to the user space the

94
00:03:40,400 --> 00:03:45,120
only optimization that we are looking at

95
00:03:42,159 --> 00:03:47,280
is how much do you copy

96
00:03:45,120 --> 00:03:49,920
and uh if you look at if you look at the

97
00:03:47,280 --> 00:03:52,159
trade-offs here in the user space thingy

98
00:03:49,920 --> 00:03:53,680
it is not optimal

99
00:03:52,159 --> 00:03:54,959
and the kernel

100
00:03:53,680 --> 00:03:56,720
module thing

101
00:03:54,959 --> 00:03:59,200
it is pretty optimal because it just

102
00:03:56,720 --> 00:04:00,560
copies what you want and you're not

103
00:03:59,200 --> 00:04:02,959
doing more than

104
00:04:00,560 --> 00:04:04,720
more work than is required the user

105
00:04:02,959 --> 00:04:07,040
space implementation is a generic

106
00:04:04,720 --> 00:04:08,720
solution you can implement it once sort

107
00:04:07,040 --> 00:04:10,239
of have that switch in the driver or

108
00:04:08,720 --> 00:04:12,879
whatever

109
00:04:10,239 --> 00:04:14,959
and just just be done with it whenever

110
00:04:12,879 --> 00:04:17,120
you want to do some tcpdump thingy you

111
00:04:14,959 --> 00:04:19,519
just nudge the driver it copies

112
00:04:17,120 --> 00:04:21,120
everything to your user space and then

113
00:04:19,519 --> 00:04:22,639
you do the packet processing and then

114
00:04:21,120 --> 00:04:24,160
apply filters for example there are 100

115
00:04:22,639 --> 00:04:25,919
packets copied

116
00:04:24,160 --> 00:04:28,479
but the interesting

117
00:04:25,919 --> 00:04:30,400
ones could be only three or four in the

118
00:04:28,479 --> 00:04:32,320
kernel side you only copy the three

119
00:04:30,400 --> 00:04:33,520
or four but this solution is not generic

120
00:04:32,320 --> 00:04:36,800
because you had

121
00:04:33,520 --> 00:04:37,840
hard coded or you had written the module

122
00:04:36,800 --> 00:04:40,960
and

123
00:04:37,840 --> 00:04:43,120
it only would apply for that destination

124
00:04:40,960 --> 00:04:44,960
and that destination port

125
00:04:43,120 --> 00:04:46,400
and whatever protocol you're

126
00:04:44,960 --> 00:04:48,240
looking at
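The copy trade-off described here can be put into numbers with a small sketch. This is not kernel code: the `Packet` type, the `wanted` predicate and both filter functions are hypothetical names made up for illustration, and "copying" is modelled as building a Python list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    proto: str
    dst_port: int


def wanted(pkt):
    # The hard-coded interest: tcp traffic to port 80 (an assumed example).
    return pkt.proto == "tcp" and pkt.dst_port == 80


def userspace_filter(packets):
    # Generic approach: every packet is copied up, filtering happens afterwards.
    copied = list(packets)
    return len(copied), [p for p in copied if wanted(p)]


def in_kernel_filter(packets):
    # Kernel-module approach: only matching packets are copied up.
    copied = [p for p in packets if wanted(p)]
    return len(copied), copied


# 100 packets on the wire, only 3 of them interesting.
traffic = [Packet("udp", 53)] * 97 + [Packet("tcp", 80)] * 3
```

Both functions return the same three interesting packets; the first copies 100 packets to get them and the second copies 3.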

127
00:04:46,400 --> 00:04:50,479
there's one more interesting case here

128
00:04:48,240 --> 00:04:52,960
is the user space solution is a little

129
00:04:50,479 --> 00:04:55,040
safer when we say safe i mean if there's

130
00:04:52,960 --> 00:04:56,880
a bug in your code what worse could go i

131
00:04:55,040 --> 00:04:58,400
mean what could go wrong

132
00:04:56,880 --> 00:05:00,639
you basically could have a segfault in

133
00:04:58,400 --> 00:05:03,199
the user space that's not too bad but

134
00:05:00,639 --> 00:05:06,240
nothing is going to go too bad but if

135
00:05:03,199 --> 00:05:08,320
the kernel module had a bug

136
00:05:06,240 --> 00:05:10,000
the whole system is going down

137
00:05:08,320 --> 00:05:11,680
probably i mean the sanity of the system

138
00:05:10,000 --> 00:05:12,720
is in question

139
00:05:11,680 --> 00:05:14,400
so

140
00:05:12,720 --> 00:05:15,520
what would be the right way to do things

141
00:05:14,400 --> 00:05:18,000
here

142
00:05:15,520 --> 00:05:19,919
yeah i mean what if we had the best of

143
00:05:18,000 --> 00:05:21,600
both worlds like we had a

144
00:05:19,919 --> 00:05:24,320
generic solution

145
00:05:21,600 --> 00:05:26,479
and an optimal performance so what would

146
00:05:24,320 --> 00:05:29,440
that kind of a design be i mean how

147
00:05:26,479 --> 00:05:30,160
would you how would you do it

148
00:05:29,440 --> 00:05:33,280
so

149
00:05:30,160 --> 00:05:35,919
in 1992 the bsd packet filter this paper

150
00:05:33,280 --> 00:05:36,800
it's a seminal paper that was published

151
00:05:35,919 --> 00:05:39,120
where

152
00:05:36,800 --> 00:05:41,680
the folks designed a novel architecture

153
00:05:39,120 --> 00:05:44,400
to do exactly the same things of how do

154
00:05:41,680 --> 00:05:45,840
we implement tcpdump and do it

155
00:05:44,400 --> 00:05:46,639
optimally

156
00:05:45,840 --> 00:05:48,639
so

157
00:05:46,639 --> 00:05:50,160
they had an interesting design

158
00:05:48,639 --> 00:05:54,080
choice what they did is

159
00:05:50,160 --> 00:05:56,400
they implemented a vm a simple vm which

160
00:05:54,080 --> 00:05:58,880
resides in the kernel and

161
00:05:56,400 --> 00:06:01,840
it just could not do much

162
00:05:58,880 --> 00:06:03,360
it just does a bunch of loads

163
00:06:01,840 --> 00:06:05,199
a bunch of stores

164
00:06:03,360 --> 00:06:07,199
a little bit of jumping

165
00:06:05,199 --> 00:06:10,160
jumping around very basic arithmetic

166
00:06:07,199 --> 00:06:12,560
operations returns and it had like an

167
00:06:10,160 --> 00:06:15,440
accumulator register and one more like x

168
00:06:12,560 --> 00:06:17,520
register so you could do transfer one

169
00:06:15,440 --> 00:06:20,240
instruction

170
00:06:17,520 --> 00:06:22,400
i mean data from the accumulator to the

171
00:06:20,240 --> 00:06:24,240
x register or maybe from the x register

172
00:06:22,400 --> 00:06:27,360
to the accumulator register nothing

173
00:06:24,240 --> 00:06:30,000
fancy at all so you have

174
00:06:27,360 --> 00:06:31,039
basically a vm that resides in the

175
00:06:30,000 --> 00:06:33,919
kernel

176
00:06:31,039 --> 00:06:35,600
and you you operate that and you do

177
00:06:33,919 --> 00:06:37,600
something with that which we'll come to

178
00:06:35,600 --> 00:06:38,720
shortly

179
00:06:37,600 --> 00:06:41,280
so

180
00:06:38,720 --> 00:06:43,120
okay now you you what you've done is

181
00:06:41,280 --> 00:06:45,840
you've created a vm you've dumped it

182
00:06:43,120 --> 00:06:47,360
inside the kernel but how do you run

183
00:06:45,840 --> 00:06:49,039
a bpf program

184
00:06:47,360 --> 00:06:50,800
okay let's take a digression first i

185
00:06:49,039 --> 00:06:53,840
mean how

186
00:06:50,800 --> 00:06:55,520
do user space programs work

187
00:06:53,840 --> 00:06:57,360
you could have a compiled program you

188
00:06:55,520 --> 00:06:59,759
could have an interpreted program for

189
00:06:57,360 --> 00:07:01,360
compiled programs you write the piece of

190
00:06:59,759 --> 00:07:03,599
code

191
00:07:01,360 --> 00:07:05,759
then you give it to the compiler

192
00:07:03,599 --> 00:07:07,599
plus the linker you get a binary out of

193
00:07:05,759 --> 00:07:09,039
it and then you go ahead and you run the

194
00:07:07,599 --> 00:07:11,039
binary

195
00:07:09,039 --> 00:07:12,800
and everything sort of you get the

196
00:07:11,039 --> 00:07:15,520
output you wanted to write a hello world

197
00:07:12,800 --> 00:07:18,319
program you write the source code

198
00:07:15,520 --> 00:07:21,039
compile plus link it and do an a dot out

199
00:07:18,319 --> 00:07:22,560
dot slash a dot out whatever and it just

200
00:07:21,039 --> 00:07:24,720
prints it on the screen

201
00:07:22,560 --> 00:07:27,039
you as a programmer or whoever wants to

202
00:07:24,720 --> 00:07:28,560
use the program has the control on when

203
00:07:27,039 --> 00:07:30,880
the program runs

204
00:07:28,560 --> 00:07:32,400
in the interpreted case similar thing

205
00:07:30,880 --> 00:07:34,479
you write the code you hand it over to

206
00:07:32,400 --> 00:07:37,199
the interpreter it spits out the

207
00:07:34,479 --> 00:07:40,319
instructions and then executes them

208
00:07:37,199 --> 00:07:42,960
but how do you do it on the bpf side

209
00:07:40,319 --> 00:07:45,120
because the bpf vm is inside the kernel i

210
00:07:42,960 --> 00:07:46,879
mean you don't have control over the

211
00:07:45,120 --> 00:07:49,840
kernel you can't really just go

212
00:07:46,879 --> 00:07:51,759
and run whatever you want in the kernel

213
00:07:49,840 --> 00:07:52,639
the only interfaces you have with the

214
00:07:51,759 --> 00:07:55,360
kernel

215
00:07:52,639 --> 00:07:57,520
is the syscall or maybe some interrupts

216
00:07:55,360 --> 00:07:58,960
but generally it's just the syscall so

217
00:07:57,520 --> 00:08:00,800
do you have a direct way of nudging the

218
00:07:58,960 --> 00:08:01,840
kernel to run a program

219
00:08:00,800 --> 00:08:03,919
probably

220
00:08:01,840 --> 00:08:05,599
i mean would that make sense

221
00:08:03,919 --> 00:08:06,639
we'll we'll look at it

222
00:08:05,599 --> 00:08:09,199
so

223
00:08:06,639 --> 00:08:10,639
how does a bpf program

224
00:08:09,199 --> 00:08:13,360
run

225
00:08:10,639 --> 00:08:15,840
bpf programs it's very important to note

226
00:08:13,360 --> 00:08:18,639
that bpf programs unlike the normal

227
00:08:15,840 --> 00:08:21,039
programs are not

228
00:08:18,639 --> 00:08:24,560
dependent upon the programmer or whoever

229
00:08:21,039 --> 00:08:25,759
wants to use them they do not run on his

230
00:08:24,560 --> 00:08:28,639
wish

231
00:08:25,759 --> 00:08:31,199
they are mostly event driven so there

232
00:08:28,639 --> 00:08:33,039
are a bunch of events in the kernel that

233
00:08:31,199 --> 00:08:34,880
are placed or a bunch of hook points in

234
00:08:33,039 --> 00:08:37,599
the kernel and

235
00:08:34,880 --> 00:08:39,680
whenever you write an ebpf program or

236
00:08:37,599 --> 00:08:41,839
a bpf program

237
00:08:39,680 --> 00:08:43,680
you write that in the instruction set

238
00:08:41,839 --> 00:08:45,519
that simple instruction set you

239
00:08:43,680 --> 00:08:47,920
target for that vm

240
00:08:45,519 --> 00:08:51,440
then you load the

241
00:08:47,920 --> 00:08:53,360
program via a syscall into the kernel

242
00:08:51,440 --> 00:08:55,279
it still will not run because you just

243
00:08:53,360 --> 00:08:56,160
have loaded the program

244
00:08:55,279 --> 00:08:59,200
then

245
00:08:56,160 --> 00:09:00,240
you attach it to a particular hook point

246
00:08:59,200 --> 00:09:02,640
so

247
00:09:00,240 --> 00:09:05,040
let's let's actually look at it the bpf

248
00:09:02,640 --> 00:09:07,600
programs are stateless that's an

249
00:09:05,040 --> 00:09:11,360
interesting point to note that these

250
00:09:07,600 --> 00:09:14,320
programs are very simple they run to

251
00:09:11,360 --> 00:09:15,200
completion and you load them to the

252
00:09:14,320 --> 00:09:16,800
kernel

253
00:09:15,200 --> 00:09:18,399
you first of all write the filter

254
00:09:16,800 --> 00:09:21,120
expression the filter expression is

255
00:09:18,399 --> 00:09:22,399
pretty simple it's it's like here we are

256
00:09:21,120 --> 00:09:24,959
going to go through this in a short

257
00:09:22,399 --> 00:09:27,279
while but you write like

258
00:09:24,959 --> 00:09:29,279
eventual answer should be true or false

259
00:09:27,279 --> 00:09:31,920
you load the byte code in the kernel you

260
00:09:29,279 --> 00:09:33,279
attach the loaded program to a hook

261
00:09:31,920 --> 00:09:35,519
a hook could be for example every

262
00:09:33,279 --> 00:09:37,839
received packet is a hook whenever

263
00:09:35,519 --> 00:09:39,200
the kernel receives a packet run this bpf

264
00:09:37,839 --> 00:09:41,440
program

265
00:09:39,200 --> 00:09:43,680
and the programs are event driven

266
00:09:41,440 --> 00:09:45,360
and they run to completion there's no

267
00:09:43,680 --> 00:09:49,760
sort of preemption or anything that

268
00:09:45,360 --> 00:09:52,160
happens with it a bpf event occurs the

269
00:09:49,760 --> 00:09:55,600
bpf program runs to completion and at

270
00:09:52,160 --> 00:09:57,760
the end it tells you yes or no i mean

271
00:09:55,600 --> 00:10:00,399
and that boolean instruction i mean that

272
00:09:57,760 --> 00:10:02,720
boolean return value can be used to do

273
00:10:00,399 --> 00:10:05,279
very interesting sort of things
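The load, attach, fire sequence described here can be sketched as a toy model. It is only an illustration of the control flow: `ToyKernel`, its methods and the `"packet_rx"` hook name are hypothetical, and the "program" is an ordinary Python callable rather than bpf byte code.

```python
from collections import defaultdict


class ToyKernel:
    """A toy stand-in for the kernel side: programs are loaded, then attached."""

    def __init__(self):
        self._loaded = {}                # handle -> program
        self._hooks = defaultdict(list)  # hook name -> attached programs
        self._next_fd = 3

    def load(self, program):
        # Loading only stores the program; nothing runs yet.
        fd = self._next_fd
        self._next_fd += 1
        self._loaded[fd] = program
        return fd

    def attach(self, fd, hook):
        # Attaching ties an already loaded program to an event.
        self._hooks[hook].append(self._loaded[fd])

    def fire(self, hook, event):
        # The kernel, not the user, decides when programs run: once per event,
        # each one to completion, each one producing a yes/no answer.
        return [bool(prog(event)) for prog in self._hooks[hook]]


kernel = ToyKernel()
fd = kernel.load(lambda pkt: pkt["dst_port"] == 80)
before_attach = kernel.fire("packet_rx", {"dst_port": 80})
kernel.attach(fd, "packet_rx")
after_attach = kernel.fire("packet_rx", {"dst_port": 80})
```

Before the attach step the event produces no verdicts at all, which is the point made in the talk: a loaded program does nothing until it is tied to a hook.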

274
00:10:02,720 --> 00:10:08,640
so let's just take a small example of

275
00:10:05,279 --> 00:10:10,480
a very very simple bpf program that

276
00:10:08,640 --> 00:10:12,720
comes from the paper i've just

277
00:10:10,480 --> 00:10:14,160
copied it from the paper and this

278
00:10:12,720 --> 00:10:15,839
program here

279
00:10:14,160 --> 00:10:19,040
is basically

280
00:10:15,839 --> 00:10:20,800
a program that just filters out

281
00:10:19,040 --> 00:10:23,200
ip packets

282
00:10:20,800 --> 00:10:25,200
and which go to a particular destination

283
00:10:23,200 --> 00:10:28,640
port so if you see

284
00:10:25,200 --> 00:10:31,040
i'm doing a load half word

285
00:10:28,640 --> 00:10:32,480
of the 12th offset in the packet so

286
00:10:31,040 --> 00:10:34,480
every packet when it comes it has a

287
00:10:32,480 --> 00:10:37,600
particular like the standard it has a

288
00:10:34,480 --> 00:10:39,200
particular format and what you go is you

289
00:10:37,600 --> 00:10:41,440
start with the packet starting and then

290
00:10:39,200 --> 00:10:43,760
go to the 12th offset and see

291
00:10:41,440 --> 00:10:46,880
is the ethertype ip

292
00:10:43,760 --> 00:10:48,399
and if it is then jump to l1 otherwise

293
00:10:46,880 --> 00:10:50,240
l5

294
00:10:48,399 --> 00:10:51,680
which returns 0 i mean i don't want to do anything

295
00:10:50,240 --> 00:10:54,240
with it because it's not ip i

296
00:10:51,680 --> 00:10:56,959
don't know what it is so get out

297
00:10:54,240 --> 00:10:58,880
then you do a load byte on the 23rd

298
00:10:56,959 --> 00:11:01,120
offset of the

299
00:10:58,880 --> 00:11:04,399
whatever packet you had and then see if

300
00:11:01,120 --> 00:11:05,920
it's tcp if it is tcp you go to the

301
00:11:04,399 --> 00:11:07,519
this part of the

302
00:11:05,920 --> 00:11:09,519
program otherwise you just drop the

303
00:11:07,519 --> 00:11:12,880
packet on the floor and then you keep on

304
00:11:09,519 --> 00:11:15,440
munching a bunch of these small small

305
00:11:12,880 --> 00:11:17,360
instructions that you do with it so here

306
00:11:15,440 --> 00:11:18,399
for example the interesting one is you

307
00:11:17,360 --> 00:11:20,880
load

308
00:11:18,399 --> 00:11:23,279
if this is an ip packet like the ip

309
00:11:20,880 --> 00:11:24,959
header the length of the ip header here

310
00:11:23,279 --> 00:11:27,040
this part is the length of the ip

311
00:11:24,959 --> 00:11:30,160
header i mean it's not too interesting

312
00:11:27,040 --> 00:11:32,000
of how exactly this happens but it's

313
00:11:30,160 --> 00:11:33,519
it's just as a

314
00:11:32,000 --> 00:11:35,519
notion here that these are very simple

315
00:11:33,519 --> 00:11:37,680
instructions that you can do on a packet

316
00:11:35,519 --> 00:11:39,200
receive and then

317
00:11:37,680 --> 00:11:41,360
you know you return true or false and

318
00:11:39,200 --> 00:11:43,760
depending on that result you could have

319
00:11:41,360 --> 00:11:46,079
that if it's true i'm going to copy it

320
00:11:43,760 --> 00:11:48,000
to user space if it's not i'm not going

321
00:11:46,079 --> 00:11:51,600
to copy it to user space so this is

322
00:11:48,000 --> 00:11:53,360
basically what's the vm and

323
00:11:51,600 --> 00:11:54,800
how do you how do you write a bpf

324
00:11:53,360 --> 00:11:58,240
program and then load it inside the

325
00:11:54,800 --> 00:11:58,240
kernel and then make it run
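The walk-through above can be tried out with a toy interpreter. This is a sketch under stated assumptions, not the kernel's implementation: the mnemonics (`ldh`, `ldb`, `ldh_ind`, `ldxmsh`, `jeq`, `jset`, `ret`) are simplified names chosen here, jumps are relative offsets from the next instruction, and `make_frame` builds a made-up ethernet + ipv4 + tcp frame just for testing. The filter accepts tcp packets to one destination port, following the steps described in the talk.

```python
import struct


def run_filter(program, pkt):
    """Run a tiny classic-bpf-like program over a packet; nonzero means accept."""
    a = x = pc = 0  # accumulator, index register, program counter
    while pc < len(program):
        op, *args = program[pc]
        pc += 1
        if op in ("ldh", "ldb", "ldh_ind"):
            size = 1 if op == "ldb" else 2
            off = args[0] + (x if op == "ldh_ind" else 0)
            if off + size > len(pkt):
                return 0  # reading past the packet rejects it
            a = int.from_bytes(pkt[off:off + size], "big")
        elif op == "ldxmsh":
            if args[0] >= len(pkt):
                return 0
            x = 4 * (pkt[args[0]] & 0x0F)  # ip header length in bytes
        elif op == "jeq":
            k, jt, jf = args
            pc += jt if a == k else jf
        elif op == "jset":
            k, jt, jf = args
            pc += jt if a & k else jf
        elif op == "ret":
            return args[0]
        else:
            raise ValueError(f"unknown instruction {op!r}")
    return 0


def tcp_dst_port_filter(port):
    return [
        ("ldh", 12),              # half word at offset 12: the ethertype
        ("jeq", 0x0800, 0, 8),    # not ip -> reject
        ("ldb", 23),              # byte at offset 23: the ip protocol
        ("jeq", 6, 0, 6),         # not tcp -> reject
        ("ldh", 20),              # flags and fragment offset
        ("jset", 0x1FFF, 4, 0),   # a later fragment has no tcp header -> reject
        ("ldxmsh", 14),           # x = length of the ip header
        ("ldh_ind", 16),          # half word at x + 16: tcp destination port
        ("jeq", port, 0, 1),      # wrong port -> reject
        ("ret", 0xFFFF),          # accept
        ("ret", 0),               # reject
    ]


def make_frame(ethertype=0x0800, proto=6, dst_port=80):
    # Hypothetical test frame: 14-byte ethernet, 20-byte ip, 20-byte tcp header.
    eth = bytes(12) + struct.pack("!H", ethertype)
    ip = bytearray(20)
    ip[0] = 0x45   # version 4, header length 5 words
    ip[9] = proto
    tcp = struct.pack("!HH", 12345, dst_port) + bytes(16)
    return eth + bytes(ip) + tcp
```

With this, `run_filter(tcp_dst_port_filter(80), make_frame())` returns a nonzero value, and changing the port, the protocol or the ethertype of the frame makes it return 0.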

326
00:11:58,399 --> 00:12:03,040
uh before we move any further to the e

327
00:12:01,040 --> 00:12:04,959
part of the bpf where does the e come

328
00:12:03,040 --> 00:12:07,680
from we we need to look at some of the

329
00:12:04,959 --> 00:12:10,399
ideas that are similar to bpf

330
00:12:07,680 --> 00:12:12,800
this is interesting and of course it's

331
00:12:10,399 --> 00:12:15,760
it's a little orthogonal but it helps me

332
00:12:12,800 --> 00:12:17,200
to make sense of why we need bpf

333
00:12:15,760 --> 00:12:19,839
in the first place

334
00:12:17,200 --> 00:12:21,760
so think of it as the embedded lua vm in

335
00:12:19,839 --> 00:12:23,680
nginx to modify behavior for example to

336
00:12:21,760 --> 00:12:26,800
check certain headers if there is a

337
00:12:23,680 --> 00:12:29,760
certain header in the http request

338
00:12:26,800 --> 00:12:31,120
uh allow it and if there is none just

339
00:12:29,760 --> 00:12:33,839
drop it on the floor

340
00:12:31,120 --> 00:12:36,720
now there were two ways of doing this

341
00:12:33,839 --> 00:12:40,000
kind of thing either you could pull the

342
00:12:36,720 --> 00:12:42,240
nginx source code and then

343
00:12:40,000 --> 00:12:44,639
add that piece of code in the c language

344
00:12:42,240 --> 00:12:47,279
whatever language nginx is written in

345
00:12:44,639 --> 00:12:48,880
and then compile it and whenever you

346
00:12:47,279 --> 00:12:52,160
have to do anything you have to modify

347
00:12:48,880 --> 00:12:54,320
that rule you always have to recompile

348
00:12:52,160 --> 00:12:57,279
change and recompile

349
00:12:54,320 --> 00:13:00,000
i i think that's a little unwieldy if

350
00:12:57,279 --> 00:13:02,560
you want to do these kind of things and

351
00:13:00,000 --> 00:13:05,360
having a lua vm embedded inside the

352
00:13:02,560 --> 00:13:08,959
nginx module it helps you a lot

353
00:13:05,360 --> 00:13:11,120
or maybe neovim i don't think if i

354
00:13:08,959 --> 00:13:12,320
want to extend the functionality of my

355
00:13:11,120 --> 00:13:15,120
editor

356
00:13:12,320 --> 00:13:18,079
i would want to pull the source code in

357
00:13:15,120 --> 00:13:20,240
add that bunch of functionality inside

358
00:13:18,079 --> 00:13:23,360
the source code i would rather write a

359
00:13:20,240 --> 00:13:24,880
lua plug-in and then load it in the editor

360
00:13:23,360 --> 00:13:26,399
so these are some of the

361
00:13:24,880 --> 00:13:29,040
things where it tells us that it's

362
00:13:26,399 --> 00:13:30,079
probably easier to use the vm based

363
00:13:29,040 --> 00:13:32,240
approach where you don't have to

364
00:13:30,079 --> 00:13:33,120
recompile the whole program and start from

365
00:13:32,240 --> 00:13:34,560
zero

366
00:13:33,120 --> 00:13:36,240
and

367
00:13:34,560 --> 00:13:38,160
it's working out

368
00:13:36,240 --> 00:13:40,320
great i mean i haven't forgotten about

369
00:13:38,160 --> 00:13:42,240
emacs folks emacs

370
00:13:40,320 --> 00:13:43,199
there are two ways to modify emacs

371
00:13:42,240 --> 00:13:45,360
either

372
00:13:43,199 --> 00:13:49,440
you modify the source code or you write

373
00:13:45,360 --> 00:13:51,279
emacs lisp your init.el whatever

374
00:13:49,440 --> 00:13:52,880
so this sort of notion

375
00:13:51,279 --> 00:13:55,199
is pretty

376
00:13:52,880 --> 00:13:57,519
prevalent these days and it's pretty

377
00:13:55,199 --> 00:13:59,199
easy to use

378
00:13:57,519 --> 00:14:00,800
it sometimes becomes a little difficult

379
00:13:59,199 --> 00:14:02,240
to think in the sense of kernel that

380
00:14:00,800 --> 00:14:03,920
okay why are we not changing the kernel

381
00:14:02,240 --> 00:14:06,320
but we are

382
00:14:03,920 --> 00:14:07,519
introducing a certain vm inside the

383
00:14:06,320 --> 00:14:09,360
kernel and

384
00:14:07,519 --> 00:14:11,680
increasing complexity but if you look at

385
00:14:09,360 --> 00:14:14,160
it if you compare it with other things

386
00:14:11,680 --> 00:14:16,399
it's pretty normal that you have your

387
00:14:14,160 --> 00:14:18,560
own plugins and everything written in a

388
00:14:16,399 --> 00:14:20,399
different language than what the

389
00:14:18,560 --> 00:14:22,079
original editor or whatever target you

390
00:14:20,399 --> 00:14:24,079
were planning to use it on

391
00:14:22,079 --> 00:14:25,920
was and it gives you a lot of

392
00:14:24,079 --> 00:14:29,480
flexibility and the turnaround

393
00:14:25,920 --> 00:14:29,480
time is pretty quick

394
00:14:30,639 --> 00:14:35,680
so now let's move on to ebpf so

395
00:14:34,320 --> 00:14:37,760
absolutely we need to introduce

396
00:14:35,680 --> 00:14:40,480
ourselves to the ebpf mascot which is

397
00:14:37,760 --> 00:14:40,480
the cute bee

398
00:14:41,600 --> 00:14:48,560
so where did this extended or e in the

399
00:14:45,839 --> 00:14:52,160
bpf thingy come from alexei

400
00:14:48,560 --> 00:14:54,079
sent a patch close to 2013 2014-ish

401
00:14:52,160 --> 00:14:56,079
where he improved the existing bpf

402
00:14:54,079 --> 00:14:57,440
infrastructure in the kernel the bpf

403
00:14:56,079 --> 00:15:00,160
infrastructure was already in the kernel

404
00:14:57,440 --> 00:15:02,880
and the prime user for that was

405
00:15:00,160 --> 00:15:04,320
tcpdump because that's why it was sort of

406
00:15:02,880 --> 00:15:06,800
gotten into the kernel it started off

407
00:15:04,320 --> 00:15:09,120
from bsd but then it was very soon

408
00:15:06,800 --> 00:15:11,279
ported to linux

409
00:15:09,120 --> 00:15:12,639
and

410
00:15:11,279 --> 00:15:15,360
if you recall

411
00:15:12,639 --> 00:15:18,959
the bsd packet filter

412
00:15:15,360 --> 00:15:21,279
just had hook points inside the network

413
00:15:18,959 --> 00:15:23,360
stack only i mean all the hook points

414
00:15:21,279 --> 00:15:24,480
were embedded there just for the network

415
00:15:23,360 --> 00:15:25,680
stack

416
00:15:24,480 --> 00:15:28,560
nothing else

417
00:15:25,680 --> 00:15:30,720
what alexei's patch did is first of all

418
00:15:28,560 --> 00:15:32,320
improved the vm quality i mean earlier

419
00:15:30,720 --> 00:15:34,000
if you remember we looked at the

420
00:15:32,320 --> 00:15:36,480
instructions there were just a few

421
00:15:34,000 --> 00:15:38,320
instructions a very small set of

422
00:15:36,480 --> 00:15:43,120
registers to work with

423
00:15:38,320 --> 00:15:44,480
this patch vastly improved on that

424
00:15:43,120 --> 00:15:46,800
by making

425
00:15:44,480 --> 00:15:48,399
uh improvements in the number of

426
00:15:46,800 --> 00:15:51,120
registers you have the number of

427
00:15:48,399 --> 00:15:53,839
instructions you could implement and

428
00:15:51,120 --> 00:15:56,800
sort of write and overall performance so

429
00:15:53,839 --> 00:15:59,600
this this makes it e this this adds the

430
00:15:56,800 --> 00:16:02,399
e and then

431
00:15:59,600 --> 00:16:04,560
you have hook points spread throughout

432
00:16:02,399 --> 00:16:06,880
the linux kernel it is not only the

433
00:16:04,560 --> 00:16:10,079
network stack that has the hook points

434
00:16:06,880 --> 00:16:12,160
there is a bunch of other places where

435
00:16:10,079 --> 00:16:14,399
the hook points are

436
00:16:12,160 --> 00:16:14,399
so

437
00:16:14,800 --> 00:16:18,480
yeah by the way if i if i was not clear

438
00:16:16,959 --> 00:16:21,120
if if there are any questions please

439
00:16:18,480 --> 00:16:23,120
feel free to ask them

440
00:16:21,120 --> 00:16:26,000
uh and

441
00:16:23,120 --> 00:16:26,800
now we have to talk a little about

442
00:16:26,000 --> 00:16:29,839
uh

443
00:16:26,800 --> 00:16:32,399
the cbpf or the old style bpf which is

444
00:16:29,839 --> 00:16:34,079
also called classic bpf and

445
00:16:32,399 --> 00:16:36,240
extended bpf so there are

446
00:16:34,079 --> 00:16:37,360
differences between cbpf the

447
00:16:36,240 --> 00:16:39,920
classical

448
00:16:37,360 --> 00:16:42,720
style of bpf and the extended style of

449
00:16:39,920 --> 00:16:46,079
bpf cbpf is typically a very

450
00:16:42,720 --> 00:16:48,240
constrained vm that will not let you do

451
00:16:46,079 --> 00:16:50,959
very interesting things

452
00:16:48,240 --> 00:16:52,720
whereas ebpf is the extended one which

453
00:16:50,959 --> 00:16:54,399
has hook points

454
00:16:52,720 --> 00:16:56,639
throughout the kernel you could do a

455
00:16:54,399 --> 00:16:58,320
bunch of things with it earlier you

456
00:16:56,639 --> 00:16:59,759
could only do sort of packet processing

457
00:16:58,320 --> 00:17:02,320
decisions

458
00:16:59,759 --> 00:17:04,559
with classic bpf but with ebpf you

459
00:17:02,320 --> 00:17:06,319
could do much more for example

460
00:17:04,559 --> 00:17:08,880
you could have a

461
00:17:06,319 --> 00:17:12,400
hook point on every system call that's

462
00:17:08,880 --> 00:17:15,199
executed and what one could do is

463
00:17:12,400 --> 00:17:17,120
whenever a certain program does a system

464
00:17:15,199 --> 00:17:19,439
call you could have a hook point and a

465
00:17:17,120 --> 00:17:21,919
bpf program attached to that hook point

466
00:17:19,439 --> 00:17:23,520
a program did a syscall

467
00:17:21,919 --> 00:17:26,400
since the hook point is attached to the

468
00:17:23,520 --> 00:17:28,640
syscall event it gets fired you check

469
00:17:26,400 --> 00:17:30,400
whether this particular program

470
00:17:28,640 --> 00:17:33,600
is allowed to make the syscall for

471
00:17:30,400 --> 00:17:36,640
example a small application like a cat

472
00:17:33,600 --> 00:17:37,679
like application that just has work to

473
00:17:36,640 --> 00:17:40,080
get

474
00:17:37,679 --> 00:17:41,520
you know file data which is the read

475
00:17:40,080 --> 00:17:43,039
system call

476
00:17:41,520 --> 00:17:45,280
and dump it on the screen or probably

477
00:17:43,039 --> 00:17:47,440
redirect to a file does not have to do

478
00:17:45,280 --> 00:17:48,480
anything with the network socket i o

479
00:17:47,440 --> 00:17:49,360
calls

480
00:17:48,480 --> 00:17:51,360
and

481
00:17:49,360 --> 00:17:53,360
if for example for some reason you could

482
00:17:51,360 --> 00:17:55,039
see malicious behavior you could deny

483
00:17:53,360 --> 00:17:57,280
that so

484
00:17:55,039 --> 00:18:00,320
the security aspect of it is very

485
00:17:57,280 --> 00:18:02,000
interesting and you could also have hook

486
00:18:00,320 --> 00:18:04,160
points in

487
00:18:02,000 --> 00:18:06,160
there's there is that there are certain

488
00:18:04,160 --> 00:18:08,160
points in the kernel called k probes

489
00:18:06,160 --> 00:18:09,039
kretprobes trace points these are

490
00:18:08,160 --> 00:18:11,919
again

491
00:18:09,039 --> 00:18:14,160
uh hooks where a certain function

492
00:18:11,919 --> 00:18:16,000
executes there's a there's an event

493
00:18:14,160 --> 00:18:18,720
attached to that so you could get a lot

494
00:18:16,000 --> 00:18:19,840
of telemetry data from it and if you if

495
00:18:18,720 --> 00:18:22,240
you look very

496
00:18:19,840 --> 00:18:25,679
closely you could see there's this

497
00:18:22,240 --> 00:18:27,440
penguin as well as the windows signs so

498
00:18:25,679 --> 00:18:30,400
ebpf nowadays

499
00:18:27,440 --> 00:18:32,400
runs generally on most of the platforms

500
00:18:30,400 --> 00:18:36,000
that are available and windows was the

501
00:18:32,400 --> 00:18:37,440
most recent addition like 2020ish

502
00:18:36,000 --> 00:18:39,840
was the

503
00:18:37,440 --> 00:18:41,600
thing that it started to come up and

504
00:18:39,840 --> 00:18:43,919
it can run on windows of course you

505
00:18:41,600 --> 00:18:47,200
can't run linux specific things

506
00:18:43,919 --> 00:18:49,520
on like ebpf things on windows but

507
00:18:47,200 --> 00:18:51,760
generally the idea is there the vm is

508
00:18:49,520 --> 00:18:51,760
there

509
00:18:52,320 --> 00:18:57,039
so again the capabilities networking

510
00:18:55,200 --> 00:18:58,400
absolutely it's it's where it started

511
00:18:57,039 --> 00:18:59,440
from you could do a lot of networking

512
00:18:58,400 --> 00:19:00,640
stuff with

513
00:18:59,440 --> 00:19:02,480
ebpf

514
00:19:00,640 --> 00:19:04,000
you could do a lot of security stuff you

515
00:19:02,480 --> 00:19:04,880
could like we talked about you could

516
00:19:04,000 --> 00:19:07,679
have a

517
00:19:04,880 --> 00:19:10,400
bpf program which has a hook point and

518
00:19:07,679 --> 00:19:12,320
just looks at a policy of whether a

519
00:19:10,400 --> 00:19:13,919
program is allowed to do a certain

520
00:19:12,320 --> 00:19:15,760
syscall or not
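
The policy check described here can be sketched as a small userspace model. This is plain Python, not real eBPF code; the names `ALLOWED` and `on_syscall` and the allow-list contents are assumptions made up for illustration, and a real program would be attached to a kernel hook rather than called directly.

```python
# Userspace model of a per-syscall policy hook; not real eBPF code.
# ALLOWED and on_syscall are hypothetical names for illustration.

ALLOWED = {
    # program name -> syscalls it is expected to make
    "cat": {"openat", "read", "write", "close"},
}

def on_syscall(program: str, syscall: str) -> bool:
    """Runs each time the hook fires; returns True to allow the call."""
    allowed = ALLOWED.get(program)
    if allowed is None:
        return True  # no policy for this program: allow by default
    return syscall in allowed

# a cat-like program reading a file is allowed
print(on_syscall("cat", "read"))
# the same program opening a network socket is denied
print(on_syscall("cat", "socket"))
```

The allow-by-default branch is a design choice of this sketch only; a stricter policy could deny anything it has no entry for.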

521
00:19:13,919 --> 00:19:18,400
you could do a lot of observability like

522
00:19:15,760 --> 00:19:20,160
if you have those hook points where

523
00:19:18,400 --> 00:19:22,400
the trace points and the k probes you

524
00:19:20,160 --> 00:19:24,720
could get you could gather a bunch of

525
00:19:22,400 --> 00:19:26,960
information from the system dynamically

526
00:19:24,720 --> 00:19:28,799
by attaching the ebpf program whenever

527
00:19:26,960 --> 00:19:31,120
the hook fires you get that information

528
00:19:28,799 --> 00:19:32,720
all the processing is done inside the

529
00:19:31,120 --> 00:19:34,240
kernel and then finally you get the

530
00:19:32,720 --> 00:19:35,520
answer now

531
00:19:34,240 --> 00:19:38,240
why this is

532
00:19:35,520 --> 00:19:40,400
fast because since you have a vm inside

533
00:19:38,240 --> 00:19:44,000
the kernel you want to count how many

534
00:19:40,400 --> 00:19:46,960
syscalls happened at a point x or i mean

535
00:19:44,000 --> 00:19:49,600
from a to b time what you could do is

536
00:19:46,960 --> 00:19:52,480
you could write that code and then have

537
00:19:49,600 --> 00:19:54,320
internally the ebpf program gather all

538
00:19:52,480 --> 00:19:56,160
the details do all the number crunching

539
00:19:54,320 --> 00:19:57,919
and then finally when you're done it

540
00:19:56,160 --> 00:19:59,280
just spits out the answer back to the

541
00:19:57,919 --> 00:20:01,360
user space

542
00:19:59,280 --> 00:20:03,919
uh copying data from

543
00:20:01,360 --> 00:20:06,640
across the user space and kernel space

544
00:20:03,919 --> 00:20:08,799
boundaries is costly so we want to do it

545
00:20:06,640 --> 00:20:11,440
as little as possible and want to keep

546
00:20:08,799 --> 00:20:13,200
the number crunching as local to where

547
00:20:11,440 --> 00:20:16,000
the data actually is and we just only

548
00:20:13,200 --> 00:20:16,000
want to look at the data

549
00:20:16,559 --> 00:20:19,840
so

550
00:20:17,520 --> 00:20:21,200
an interesting thing of the ebpf

551
00:20:19,840 --> 00:20:22,880
verifier

552
00:20:21,200 --> 00:20:24,799
and jit is

553
00:20:22,880 --> 00:20:25,840
you can't really willy nilly run any

554
00:20:24,799 --> 00:20:27,360
program

555
00:20:25,840 --> 00:20:30,320
inside the

556
00:20:27,360 --> 00:20:32,000
ebpf vm or the linux kernel i mean when

557
00:20:30,320 --> 00:20:32,960
we talk about arbitrary program we talk

558
00:20:32,000 --> 00:20:36,480
about like

559
00:20:32,960 --> 00:20:38,880
we do that that very restricted set of

560
00:20:36,480 --> 00:20:41,200
instructions that we're given

561
00:20:38,880 --> 00:20:43,520
that the ebpf vm runs but you really

562
00:20:41,200 --> 00:20:46,000
cannot do anything because

563
00:20:43,520 --> 00:20:48,640
the ebpf program runs to completion now

564
00:20:46,000 --> 00:20:51,120
what if a malicious user put an infinite

565
00:20:48,640 --> 00:20:53,520
loop in that whenever an event fires the

566
00:20:51,120 --> 00:20:55,440
ebpf program is going to go and then

567
00:20:53,520 --> 00:20:58,080
just infinitely loop and since the

568
00:20:55,440 --> 00:21:00,480
program runs to completion you basically

569
00:20:58,080 --> 00:21:03,039
just starve the cpu you did a denial of

570
00:21:00,480 --> 00:21:05,360
service because now there's no way for

571
00:21:03,039 --> 00:21:09,600
the system to yield the program it just

572
00:21:05,360 --> 00:21:10,799
runs in a loop so bpf verifier looks at

573
00:21:09,600 --> 00:21:12,480
whenever you

574
00:21:10,799 --> 00:21:14,159
load

575
00:21:12,480 --> 00:21:16,240
the

576
00:21:14,159 --> 00:21:19,600
ebpf program that you've written inside

577
00:21:16,240 --> 00:21:22,080
the kernel via the bpf system call

578
00:21:19,600 --> 00:21:24,960
it first of all goes to the verifier

579
00:21:22,080 --> 00:21:26,799
the verifier looks at all possible

580
00:21:24,960 --> 00:21:28,799
branches and whatever you have done in

581
00:21:26,799 --> 00:21:30,799
your code and

582
00:21:28,799 --> 00:21:33,200
first of all verifies the sanity of the

583
00:21:30,799 --> 00:21:35,440
program if the program according to the

584
00:21:33,200 --> 00:21:37,520
ebpf verifier is sane then only it

585
00:21:35,440 --> 00:21:40,159
gets handed over to the jit compiler

586
00:21:37,520 --> 00:21:42,480
which then emits out

587
00:21:40,159 --> 00:21:44,400
native instructions for whatever

588
00:21:42,480 --> 00:21:46,080
architecture you're running on and then

589
00:21:44,400 --> 00:21:47,200
it moves along and does whatever it

590
00:21:46,080 --> 00:21:48,640
needs to do
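
The "look at all possible branches" step can be sketched with a toy check that rejects any program whose control flow can loop. This is a simplified Python model under stated assumptions: the program is represented as a hypothetical dict from instruction index to possible next indices, and `has_loop` and `load_program` are illustrative names. The real verifier checks far more than this (memory accesses, register types), and newer kernels accept some bounded loops.

```python
# Toy model of one verifier check: reject any program whose control flow
# can loop. The program is a hypothetical dict mapping an instruction
# index to the indices it can jump or fall through to.

def has_loop(cfg: dict[int, list[int]], entry: int = 0) -> bool:
    """Depth-first walk of every branch; a back edge means a loop."""
    on_path: set[int] = set()   # instructions on the current path
    done: set[int] = set()      # instructions fully explored

    def visit(node: int) -> bool:
        if node in on_path:
            return True          # jumped back to an instruction still on the path
        if node in done:
            return False
        on_path.add(node)
        for nxt in cfg.get(node, []):
            if visit(nxt):
                return True
        on_path.remove(node)
        done.add(node)
        return False

    return visit(entry)

def load_program(cfg: dict[int, list[int]]) -> str:
    """Verify first; only a sane program would be handed on to the jit."""
    if has_loop(cfg):
        raise ValueError("rejected by verifier: loop detected")
    return "verified"

straight_line = {0: [1], 1: [2, 3], 2: [3], 3: []}   # branches, but always ends
infinite_loop = {0: [1], 1: [0]}                     # instruction 1 jumps back to 0

print(load_program(straight_line))
print(has_loop(infinite_loop))
```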

591
00:21:47,200 --> 00:21:50,000
then you properly attach it to a

592
00:21:48,640 --> 00:21:51,520
particular hook point because just

593
00:21:50,000 --> 00:21:53,440
loading the program in the kernel is not

594
00:21:51,520 --> 00:21:55,120
going to do anything when you load a

595
00:21:53,440 --> 00:21:57,440
program in the kernel you have to attach

596
00:21:55,120 --> 00:21:58,480
it to a certain

597
00:21:57,440 --> 00:22:01,440
point

598
00:21:58,480 --> 00:22:02,799
and uh then you do whatever you want to

599
00:22:01,440 --> 00:22:03,760
do with it so let me look at the

600
00:22:02,799 --> 00:22:06,000
questions

601
00:22:03,760 --> 00:22:07,840
since bpf verifier evolves can we expect

602
00:22:06,000 --> 00:22:12,880
that a bpf program written today for

603
00:22:07,840 --> 00:22:15,760
5.16 will work in a few years oh okay

604
00:22:12,880 --> 00:22:18,720
so i think

605
00:22:15,760 --> 00:22:20,880
if we uh if if we look at it if

606
00:22:18,720 --> 00:22:23,840
we look at the architecture of the bpf

607
00:22:20,880 --> 00:22:24,840
vm it's generally very simple

608
00:22:23,840 --> 00:22:27,760
so

609
00:22:24,840 --> 00:22:29,280
if by what you mean is if you've written

610
00:22:27,760 --> 00:22:31,919
a program today

611
00:22:29,280 --> 00:22:34,159
and you have used

612
00:22:31,919 --> 00:22:36,080
trace points and you have not you have

613
00:22:34,159 --> 00:22:37,919
not relied yourself on

614
00:22:36,080 --> 00:22:39,039
api in the kernel that changes for

615
00:22:37,919 --> 00:22:40,960
example

616
00:22:39,039 --> 00:22:42,799
we talked about k probes and trace

617
00:22:40,960 --> 00:22:44,880
points unfortunately i do not have

618
00:22:42,799 --> 00:22:46,799
enough time to talk about what those are

619
00:22:44,880 --> 00:22:49,280
but trace points have a guarantee of

620
00:22:46,799 --> 00:22:50,880
being more rigid like system calls

621
00:22:49,280 --> 00:22:53,840
they're not going to they're going to

622
00:22:50,880 --> 00:22:55,280
survive multiple kernel versions but k

623
00:22:53,840 --> 00:22:57,200
probes don't give you that because

624
00:22:55,280 --> 00:22:58,159
that's the internal kernel functions and

625
00:22:57,200 --> 00:23:00,159
whenever the name of the function

626
00:22:58,159 --> 00:23:01,360
changes the k probe changes so if you if

627
00:23:00,159 --> 00:23:04,240
you you

628
00:23:01,360 --> 00:23:06,080
use or leverage those kind of techniques

629
00:23:04,240 --> 00:23:09,280
where you use trace points instead of

630
00:23:06,080 --> 00:23:12,400
k probes probably uh the ebpf program

631
00:23:09,280 --> 00:23:14,960
should survive multiple kernel versions

632
00:23:12,400 --> 00:23:17,440
can ebpf store state between system

633
00:23:14,960 --> 00:23:20,320
calls okay yeah sure that's an excellent

634
00:23:17,440 --> 00:23:21,360
question i'm going to come to it

635
00:23:20,320 --> 00:23:23,520
so

636
00:23:21,360 --> 00:23:27,039
what are the different types of ebpf

637
00:23:23,520 --> 00:23:30,320
programs recall in the classical bpf

638
00:23:27,039 --> 00:23:32,799
sense it was only sort of relatively

639
00:23:30,320 --> 00:23:34,480
constrained towards the socket filter

640
00:23:32,799 --> 00:23:36,640
or the network stack

641
00:23:34,480 --> 00:23:37,840
but in the extended sense

642
00:23:36,640 --> 00:23:39,520
it's just

643
00:23:37,840 --> 00:23:41,600
spread throughout the kernel so we're

644
00:23:39,520 --> 00:23:43,440
going to talk about a little

645
00:23:41,600 --> 00:23:44,640
interesting of those

646
00:23:43,440 --> 00:23:46,159
some of the interesting this is not an

647
00:23:44,640 --> 00:23:49,120
exhaustive list

648
00:23:46,159 --> 00:23:50,720
so this bpf prog type sock filter what

649
00:23:49,120 --> 00:23:51,600
is this this is basically a packet

650
00:23:50,720 --> 00:23:54,559
filter

651
00:23:51,600 --> 00:23:57,120
the thing that the original bpf vm

652
00:23:54,559 --> 00:23:59,120
started with you apply a filter at a

653
00:23:57,120 --> 00:24:01,600
particular socket that okay if this is

654
00:23:59,120 --> 00:24:03,200
this destination port this source

655
00:24:01,600 --> 00:24:05,200
address this

656
00:24:03,200 --> 00:24:06,480
source port blah blah blah

657
00:24:05,200 --> 00:24:08,320
do something

658
00:24:06,480 --> 00:24:11,120
there's another one called

659
00:24:08,320 --> 00:24:13,360
xdp the bpf prog type xdp this is an

660
00:24:11,120 --> 00:24:15,120
interesting one

661
00:24:13,360 --> 00:24:16,640
i'm sorry this is an interesting one

662
00:24:15,120 --> 00:24:19,919
because

663
00:24:16,640 --> 00:24:22,159
xdp is express data path and this is a

664
00:24:19,919 --> 00:24:24,240
bpf program which is attached at a hook

665
00:24:22,159 --> 00:24:26,400
point which is as close to the device

666
00:24:24,240 --> 00:24:28,000
driver as possible now

667
00:24:26,400 --> 00:24:29,919
we have to talk a little about it so i'm

668
00:24:28,000 --> 00:24:32,000
going to give one minute to it so what

669
00:24:29,919 --> 00:24:34,080
happens here is whenever the packet

670
00:24:32,000 --> 00:24:37,919
comes to your nic card

671
00:24:34,080 --> 00:24:40,960
it first of all gets stored on your nics

672
00:24:37,919 --> 00:24:42,960
memory and then from the nic and nic

673
00:24:40,960 --> 00:24:45,120
network interface card and from the

674
00:24:42,960 --> 00:24:47,840
network interface card it gets dma'd

675
00:24:45,120 --> 00:24:50,720
inside the linux kernel main memory and

676
00:24:47,840 --> 00:24:52,480
then you raise a soft irq an interrupt

677
00:24:50,720 --> 00:24:54,480
and then you you

678
00:24:52,480 --> 00:24:56,480
tell the linux kernel that hey i got a

679
00:24:54,480 --> 00:24:59,120
packet for you start processing it you

680
00:24:56,480 --> 00:25:01,600
know take it into your levels of

681
00:24:59,120 --> 00:25:04,400
bureaucracy of the network stack where

682
00:25:01,600 --> 00:25:06,640
you first of all you know plug the

683
00:25:04,400 --> 00:25:08,799
l2 header then the l3 header and then

684
00:25:06,640 --> 00:25:12,159
you keep on moving it up the stack and

685
00:25:08,799 --> 00:25:13,600
finally you do all sorts of you know

686
00:25:12,159 --> 00:25:15,360
net filter

687
00:25:13,600 --> 00:25:17,120
that kind of table manipulations and if

688
00:25:15,360 --> 00:25:20,559
all is green you

689
00:25:17,120 --> 00:25:22,640
give it to if it's destined for your

690
00:25:20,559 --> 00:25:23,679
process or if it was destined for some

691
00:25:22,640 --> 00:25:26,000
routing

692
00:25:23,679 --> 00:25:27,039
this xdp is interesting because the

693
00:25:26,000 --> 00:25:28,880
moment

694
00:25:27,039 --> 00:25:30,799
a packet arrives on your nic card and

695
00:25:28,880 --> 00:25:32,799
it's dma'd

696
00:25:30,799 --> 00:25:34,720
you raise a soft irq request with the

697
00:25:32,799 --> 00:25:36,640
napi interface like the internal

698
00:25:34,720 --> 00:25:38,799
network stack

699
00:25:36,640 --> 00:25:40,799
you have a capability to run a program

700
00:25:38,799 --> 00:25:42,880
now at that point in time

701
00:25:40,799 --> 00:25:46,559
your packet is just a buffer it's not

702
00:25:42,880 --> 00:25:48,159
even an sk_buff yet it's just a buffer

703
00:25:46,559 --> 00:25:49,840
you can do all sorts of crazy things

704
00:25:48,159 --> 00:25:50,640
there you could redirect a packet you

705
00:25:49,840 --> 00:25:53,200
could

706
00:25:50,640 --> 00:25:54,960
drop it on the floor and you'd say i

707
00:25:53,200 --> 00:25:56,480
could do that with iptables why do i

708
00:25:54,960 --> 00:25:58,080
need to do something there

709
00:25:56,480 --> 00:26:01,200
well if you do that with iptables it

710
00:25:58,080 --> 00:26:03,279
happens much higher up the lane i mean

711
00:26:01,200 --> 00:26:05,600
it happens like after you've done the l2

712
00:26:03,279 --> 00:26:07,760
and the l3 things so you have to

713
00:26:05,600 --> 00:26:10,240
allocate a lot of space for it for a

714
00:26:07,760 --> 00:26:11,679
packet that you probably did not want so

715
00:26:10,240 --> 00:26:13,679
you could do that

716
00:26:11,679 --> 00:26:15,679
right when you receive that packet and

717
00:26:13,679 --> 00:26:17,840
if you were acting as a router or a

718
00:26:15,679 --> 00:26:19,840
forwarder you could straight away ask

719
00:26:17,840 --> 00:26:22,720
the bpf program

720
00:26:19,840 --> 00:26:24,559
in this case xdp to you know

721
00:26:22,720 --> 00:26:26,240
do something with it forward it via some

722
00:26:24,559 --> 00:26:27,840
other interface
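
The early drop decision described here can be sketched as a function over a raw byte buffer. This is a userspace Python model, not a real xdp program: `xdp_filter`, `make_frame` and the blocked address are assumptions for illustration. The two action values match the `XDP_DROP` and `XDP_PASS` codes in the linux uapi header `linux/bpf.h`.

```python
import struct

# Action codes as defined in the Linux UAPI header linux/bpf.h
XDP_DROP = 1
XDP_PASS = 2

ETH_HLEN = 14          # Ethernet header length in bytes
ETH_P_IP = 0x0800      # EtherType for IPv4
BLOCKED_SRC = bytes([10, 0, 0, 99])   # hypothetical source address to drop

def xdp_filter(frame: bytes) -> int:
    """Decide the fate of a raw frame before any stack processing."""
    # bounds check first: never read past the end of the buffer
    if len(frame) < ETH_HLEN + 20:
        return XDP_PASS
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    if ethertype != ETH_P_IP:
        return XDP_PASS
    src = frame[ETH_HLEN + 12 : ETH_HLEN + 16]   # IPv4 source address field
    return XDP_DROP if src == BLOCKED_SRC else XDP_PASS

def make_frame(src: bytes) -> bytes:
    """Build a minimal Ethernet + IPv4 frame for the demo."""
    eth = bytes(6) + bytes(6) + struct.pack("!H", ETH_P_IP)
    ip = bytes([0x45]) + bytes(11) + src + bytes([10, 0, 0, 1])
    return eth + ip

print(xdp_filter(make_frame(bytes([10, 0, 0, 99]))))
print(xdp_filter(make_frame(bytes([10, 0, 0, 5]))))
```

The point of the sketch is that the decision only needs the raw bytes, so an unwanted packet can be dropped before any larger per-packet structure is allocated for it.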

723
00:26:26,240 --> 00:26:30,320
one interesting other thing that could

724
00:26:27,840 --> 00:26:32,159
be done here at this point is you could

725
00:26:30,320 --> 00:26:34,400
do

726
00:26:32,159 --> 00:26:35,919
a kernel bypass like you've got the

727
00:26:34,400 --> 00:26:37,200
buffer just

728
00:26:35,919 --> 00:26:38,320
leave the kernel i don't want to go

729
00:26:37,200 --> 00:26:40,080
through the network stack just straight

730
00:26:38,320 --> 00:26:41,600
away deliver it to the user space and i

731
00:26:40,080 --> 00:26:43,840
want to do whatever i want to do with it

732
00:26:41,600 --> 00:26:46,000
like the raw socket thingy

733
00:26:43,840 --> 00:26:48,720
the k probe trace point sock ops are

734
00:26:46,000 --> 00:26:49,440
similar and there are much more

735
00:26:48,720 --> 00:26:51,840
now

736
00:26:49,440 --> 00:26:54,480
recall the classical bpf was entirely

737
00:26:51,840 --> 00:26:56,480
stateless ebpf as well is stateless you

738
00:26:54,480 --> 00:26:58,880
can't really store state but it has the

739
00:26:56,480 --> 00:27:00,880
capability to access storage which is

740
00:26:58,880 --> 00:27:03,039
called bpf maps now

741
00:27:00,880 --> 00:27:05,279
these maps are not not like actually a

742
00:27:03,039 --> 00:27:08,159
key value pair but whenever we say ebpf

743
00:27:05,279 --> 00:27:10,640
map think of it as storage ebpf storage

744
00:27:08,159 --> 00:27:12,880
so a bpf map is basically a generic data

745
00:27:10,640 --> 00:27:15,440
structure that allows you

746
00:27:12,880 --> 00:27:18,000
to pass data to and fro from the user

747
00:27:15,440 --> 00:27:20,799
to the kernel and inside the kernel so

748
00:27:18,000 --> 00:27:22,880
you create a bpf map by using the same

749
00:27:20,799 --> 00:27:24,720
bpf syscall which is a multi-tool which

750
00:27:22,880 --> 00:27:26,480
lets you do a lot of things lets you

751
00:27:24,720 --> 00:27:29,039
load a program attach a program to a

752
00:27:26,480 --> 00:27:31,279
particular hook point create a map

753
00:27:29,039 --> 00:27:32,960
attach a map to a certain place do all

754
00:27:31,279 --> 00:27:34,720
sorts of things with the map and a few

755
00:27:32,960 --> 00:27:36,960
interesting map types are

756
00:27:34,720 --> 00:27:39,679
a map type hash which is actually like a

757
00:27:36,960 --> 00:27:41,840
key value store a map type array which

758
00:27:39,679 --> 00:27:44,320
is just like a normal array a map type

759
00:27:41,840 --> 00:27:46,000
prog array which stores file descriptors

760
00:27:44,320 --> 00:27:48,480
of a bunch of ebpf programs that you've

761
00:27:46,000 --> 00:27:50,240
loaded and a bunch of other maps

762
00:27:48,480 --> 00:27:52,640
recently there was a map type bloom

763
00:27:50,240 --> 00:27:54,799
filter that was added so you could do a

764
00:27:52,640 --> 00:27:56,799
bunch of state state

765
00:27:54,799 --> 00:27:58,000
stuff inside the kernel but this state

766
00:27:56,799 --> 00:28:00,480
is global like

767
00:27:58,000 --> 00:28:02,640
since the bpf program comes in it can

768
00:28:00,480 --> 00:28:04,480
access that map do whatever it needs to

769
00:28:02,640 --> 00:28:05,919
do it does not store any state of its

770
00:28:04,480 --> 00:28:07,600
own but it can do whatever it needs to

771
00:28:05,919 --> 00:28:09,360
do in the state and then finally just

772
00:28:07,600 --> 00:28:12,960
die
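
The relationship between a stateless program and a shared map can be sketched as follows. This is a userspace Python model only: `syscall_counts` stands in for a hash-type map keyed by pid and `on_event` for the attached program, both hypothetical names. In a real setup the map would be created through the bpf syscall and read from user space through it.

```python
# Userspace model: the "program" keeps nothing between runs; all state
# lives in a shared map. syscall_counts and on_event are hypothetical.

syscall_counts: dict[int, int] = {}   # stands in for a hash map keyed by pid

def on_event(pid: int) -> None:
    """Runs once per event, updates the map, then exits."""
    syscall_counts[pid] = syscall_counts.get(pid, 0) + 1

# three events fire; each invocation starts with no local state
for pid in (100, 200, 100):
    on_event(pid)

# "user space" reads the aggregated answer from the same map
print(syscall_counts)
```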

773
00:28:09,360 --> 00:28:12,960
so i hope that answers the question

774
00:28:14,399 --> 00:28:20,720
can ebpf store state between calls

775
00:28:16,399 --> 00:28:20,720
smoothing the input from pen uh i

776
00:28:20,880 --> 00:28:24,799
i think i i talked about how how does it

777
00:28:23,279 --> 00:28:26,080
store state but

778
00:28:24,799 --> 00:28:29,279
probably we can take that offline

779
00:28:26,080 --> 00:28:32,640
because we're pretty short on time

780
00:28:29,279 --> 00:28:33,679
and yeah what's the conclusion ebpf

781
00:28:32,640 --> 00:28:36,000
programs

782
00:28:33,679 --> 00:28:39,200
are not controlled by the programmer but

783
00:28:36,000 --> 00:28:41,679
they run in response to events ebpf

784
00:28:39,200 --> 00:28:43,600
programs run to completion now an

785
00:28:41,679 --> 00:28:46,080
interesting thing here was i had put a

786
00:28:43,600 --> 00:28:48,399
bracket here and i'd said that they're

787
00:28:46,080 --> 00:28:50,240
not preemptive but then one friend of

788
00:28:48,399 --> 00:28:52,559
mine his name is kartik he's one of the

789
00:28:50,240 --> 00:28:55,200
ebpf developers he sort of sends patches

790
00:28:52,559 --> 00:28:57,279
regularly he corrected me and said that

791
00:28:55,200 --> 00:28:59,919
probably they can be preempted but not

792
00:28:57,279 --> 00:29:01,600
migrated so i i just left that

793
00:28:59,919 --> 00:29:03,120
and that's food for thought i myself

794
00:29:01,600 --> 00:29:04,159
don't know much about it but just just

795
00:29:03,120 --> 00:29:05,919
saying it

796
00:29:04,159 --> 00:29:07,360
running an ebpf program is much safer

797
00:29:05,919 --> 00:29:08,640
than running and maintaining a kernel

798
00:29:07,360 --> 00:29:11,279
module now

799
00:29:08,640 --> 00:29:12,000
what does that mean if somebody gave me

800
00:29:11,279 --> 00:29:14,399
a

801
00:29:12,000 --> 00:29:16,720
kernel module and said that yeah this

802
00:29:14,399 --> 00:29:18,240
does an amazing solves an amazing

803
00:29:16,720 --> 00:29:19,440
problem

804
00:29:18,240 --> 00:29:22,000
can you run it in your production

805
00:29:19,440 --> 00:29:24,559
environment i'd be very skeptical

806
00:29:22,000 --> 00:29:27,039
because me running a kernel module in

807
00:29:24,559 --> 00:29:29,279
production given by somebody it's it's

808
00:29:27,039 --> 00:29:31,440
it's a little dangerous but on the other

809
00:29:29,279 --> 00:29:33,600
hand if somebody gave me an ebpf program

810
00:29:31,440 --> 00:29:35,120
i would very well just try it out and

811
00:29:33,600 --> 00:29:37,360
probably not in production but at least

812
00:29:35,120 --> 00:29:39,279
i'll not be that hesitant because i know

813
00:29:37,360 --> 00:29:42,240
the ebpf verifier is going to help me and

814
00:29:39,279 --> 00:29:43,760
not cause any problems the entry bar to

815
00:29:42,240 --> 00:29:45,679
get useful information from the kernel

816
00:29:43,760 --> 00:29:47,520
is significantly reduced people like me

817
00:29:45,679 --> 00:29:49,760
who don't know anything about the kernel

818
00:29:47,520 --> 00:29:52,000
or probably are pretending to be kernel

819
00:29:49,760 --> 00:29:53,919
experts can can sort of

820
00:29:52,000 --> 00:29:56,159
know a lot about how the kernel works

821
00:29:53,919 --> 00:29:57,840
and the overhead is just pay as you go

822
00:29:56,159 --> 00:30:00,480
zero cost abstraction style i mean if

823
00:29:57,840 --> 00:30:02,480
you're using it that's the only time

824
00:30:00,480 --> 00:30:04,880
you pay for it it's it's minimal but

825
00:30:02,480 --> 00:30:07,440
still you have to pay for it and

826
00:30:04,880 --> 00:30:08,399
the bpf vm is already there

827
00:30:07,440 --> 00:30:10,640
so

828
00:30:08,399 --> 00:30:13,840
yeah

829
00:30:10,640 --> 00:30:16,240
thank you and i think we are right on

830
00:30:13,840 --> 00:30:16,240
time

831
00:30:17,600 --> 00:30:21,039
thank you very much

832
00:30:19,039 --> 00:30:22,320
um

833
00:30:21,039 --> 00:30:24,559
luck

834
00:30:22,320 --> 00:30:26,399
for that excellent introduction um

835
00:30:24,559 --> 00:30:28,240
unfortunately we are at the end of the

836
00:30:26,399 --> 00:30:30,399
time slot so we don't have time for more

837
00:30:28,240 --> 00:30:32,480
questions uh if you do have more

838
00:30:30,399 --> 00:30:34,399
questions feel free to uh contact for

839
00:30:32,480 --> 00:30:36,240
luck offline

840
00:30:34,399 --> 00:30:37,200
outside of the session and i'm sure

841
00:30:36,240 --> 00:30:39,919
he'll be

842
00:30:37,200 --> 00:30:41,919
more than happy to keep talking about

843
00:30:39,919 --> 00:30:44,919
ebpf

844
00:30:41,919 --> 00:30:44,919
um

