1
00:00:06,320 --> 00:00:11,499
[Music]

2
00:00:15,360 --> 00:00:19,520
hi everyone welcome to the first session

3
00:00:17,119 --> 00:00:21,119
of the sysadmin miniconf our first talk

4
00:00:19,520 --> 00:00:23,199
is from alan

5
00:00:21,119 --> 00:00:24,800
shone and it is on the importance of

6
00:00:23,199 --> 00:00:26,800
visibility

7
00:00:24,800 --> 00:00:28,720
so here down excellent

8
00:00:26,800 --> 00:00:31,359
thank you very much simon

9
00:00:28,720 --> 00:00:34,320
so hi everyone uh today i want to talk a

10
00:00:31,359 --> 00:00:37,760
bit about visibility and

11
00:00:34,320 --> 00:00:41,120
it's really a very uh simple

12
00:00:37,760 --> 00:00:42,840
concept um for me uh it's it's a very

13
00:00:41,120 --> 00:00:45,200
basic thing

14
00:00:42,840 --> 00:00:46,399
and uh

15
00:00:45,200 --> 00:00:48,719
like the

16
00:00:46,399 --> 00:00:51,760
the sort of first steps that you take

17
00:00:48,719 --> 00:00:54,800
for anything system related really

18
00:00:51,760 --> 00:00:57,920
and so i've got three particular things

19
00:00:54,800 --> 00:00:59,120
that i'm going to cover uh first off uh

20
00:00:57,920 --> 00:01:01,760
the basics

21
00:00:59,120 --> 00:01:03,760
which is really just the basics uh those

22
00:01:01,760 --> 00:01:04,960
first bits and pieces uh

23
00:01:03,760 --> 00:01:07,360
a bit about

24
00:01:04,960 --> 00:01:09,760
what i'm actually trying to talk about

25
00:01:07,360 --> 00:01:11,280
uh not necessarily what

26
00:01:09,760 --> 00:01:13,280
you might expect

27
00:01:11,280 --> 00:01:15,600
and a bit about why

28
00:01:13,280 --> 00:01:17,920
the second will be a particular

29
00:01:15,600 --> 00:01:19,920
situation a bit of a story

30
00:01:17,920 --> 00:01:21,520
to go along with this

31
00:01:19,920 --> 00:01:23,600
something that has

32
00:01:21,520 --> 00:01:25,280
a bit of background and

33
00:01:23,600 --> 00:01:26,799
hopefully either some lessons or

34
00:01:25,280 --> 00:01:29,119
takeaways

35
00:01:26,799 --> 00:01:32,960
for everyone that's watching

36
00:01:29,119 --> 00:01:35,040
and the third uh incidents which really

37
00:01:32,960 --> 00:01:38,240
is another situation

38
00:01:35,040 --> 00:01:39,439
but a little bit different and hopefully

39
00:01:38,240 --> 00:01:42,240
a bit more

40
00:01:39,439 --> 00:01:44,240
color on the the actual topic of

41
00:01:42,240 --> 00:01:46,079
visibility in general

42
00:01:44,240 --> 00:01:48,799
so to start

43
00:01:46,079 --> 00:01:50,640
we have visibility

44
00:01:48,799 --> 00:01:51,920
what is it

45
00:01:50,640 --> 00:01:54,079
depending on

46
00:01:51,920 --> 00:01:55,200
who you talk to or what you're thinking

47
00:01:54,079 --> 00:01:57,680
about

48
00:01:55,200 --> 00:01:59,840
it's a buzzword really uh

49
00:01:57,680 --> 00:02:02,000
there's a lot of

50
00:01:59,840 --> 00:02:03,840
different interpretations of the one

51
00:02:02,000 --> 00:02:06,159
word and

52
00:02:03,840 --> 00:02:07,119
it doesn't necessarily mean

53
00:02:06,159 --> 00:02:09,599
what

54
00:02:07,119 --> 00:02:11,840
either i'm trying to talk about or what

55
00:02:09,599 --> 00:02:13,440
you might think it means uh

56
00:02:11,840 --> 00:02:15,200
so it's it's not

57
00:02:13,440 --> 00:02:17,520
uh necessarily

58
00:02:15,200 --> 00:02:20,239
what you might see talked about in an

59
00:02:17,520 --> 00:02:22,879
open source or or a public space uh from

60
00:02:20,239 --> 00:02:24,720
from companies and and uh tooling

61
00:02:22,879 --> 00:02:26,400
options

62
00:02:24,720 --> 00:02:29,440
what it should be though is is the bare

63
00:02:26,400 --> 00:02:30,400
minimum it's it's that first step

64
00:02:29,440 --> 00:02:31,200
that

65
00:02:30,400 --> 00:02:33,760
just

66
00:02:31,200 --> 00:02:34,720
knowing that something is there or

67
00:02:33,760 --> 00:02:37,519
knowing

68
00:02:34,720 --> 00:02:39,680
something is happening

69
00:02:37,519 --> 00:02:43,760
depending on the systems that you use as

70
00:02:39,680 --> 00:02:45,920
well it's it's often already available

71
00:02:43,760 --> 00:02:48,560
you might see the AWS console for

72
00:02:45,920 --> 00:02:50,480
instance or other cloud providers

73
00:02:48,560 --> 00:02:52,400
you can see every service that's up and

74
00:02:50,480 --> 00:02:53,440
running and so that's

75
00:02:52,400 --> 00:02:57,120
that's

76
00:02:53,440 --> 00:02:59,760
visibility in in a very basic sense so

77
00:02:57,120 --> 00:03:00,560
for for what i'm trying to talk about is

78
00:02:59,760 --> 00:03:02,720
just

79
00:03:00,560 --> 00:03:04,720
that bare definition

80
00:03:02,720 --> 00:03:06,400
the the fact state or degree of being

81
00:03:04,720 --> 00:03:07,599
visible

82
00:03:06,400 --> 00:03:10,159
and this was

83
00:03:07,599 --> 00:03:12,080
purely from the dictionary uh

84
00:03:10,159 --> 00:03:13,200
and it really highlights

85
00:03:12,080 --> 00:03:14,640
just

86
00:03:13,200 --> 00:03:16,400
what visibility

87
00:03:14,640 --> 00:03:18,800
is at least to me

88
00:03:16,400 --> 00:03:19,840
and the way that i think about it

89
00:03:18,800 --> 00:03:22,560
um

90
00:03:19,840 --> 00:03:25,200
distinctly from from other terms

91
00:03:22,560 --> 00:03:27,200
and other concepts

92
00:03:25,200 --> 00:03:29,760
which then means that

93
00:03:27,200 --> 00:03:31,840
it's not observability uh

94
00:03:29,760 --> 00:03:33,519
it's not

95
00:03:31,840 --> 00:03:35,519
the monitoring side of things nor the

96
00:03:33,519 --> 00:03:37,120
tooling side of things

97
00:03:35,519 --> 00:03:39,840
depending on exactly what you're

98
00:03:37,120 --> 00:03:42,640
thinking about or

99
00:03:39,840 --> 00:03:45,040
what you're talking about or trying to

100
00:03:42,640 --> 00:03:46,959
do it might encompass

101
00:03:45,040 --> 00:03:50,959
some of these things but visibility in a

102
00:03:46,959 --> 00:03:53,280
strict sense is not these things

103
00:03:50,959 --> 00:03:55,280
like again things like the the AWS

104
00:03:53,280 --> 00:03:56,319
console that's that's a tool of some

105
00:03:55,280 --> 00:03:59,120
sort

106
00:03:56,319 --> 00:04:02,720
which gives that visibility but it

107
00:03:59,120 --> 00:04:03,680
doesn't mean that the tool is visibility

108
00:04:02,720 --> 00:04:05,439
uh

109
00:04:03,680 --> 00:04:08,480
and really it's

110
00:04:05,439 --> 00:04:11,439
it comes back to visibility being

111
00:04:08,480 --> 00:04:13,599
just the ability to know that something

112
00:04:11,439 --> 00:04:16,079
is there and that

113
00:04:13,599 --> 00:04:18,239
it's theoretically doing

114
00:04:16,079 --> 00:04:22,000
what it should be doing

115
00:04:18,239 --> 00:04:24,080
at a very bare minimum level

116
00:04:22,000 --> 00:04:26,479
and so with that

117
00:04:24,080 --> 00:04:30,479
we have our situation um

118
00:04:26,479 --> 00:04:33,040
and i i picked this picture here because

119
00:04:30,479 --> 00:04:34,800
visibility is like an onion and

120
00:04:33,040 --> 00:04:36,639
in that same sense visibility is a bit

121
00:04:34,800 --> 00:04:39,280
like an ogre uh

122
00:04:36,639 --> 00:04:42,240
there's a lot of layers and

123
00:04:39,280 --> 00:04:44,479
once you get beyond those first steps

124
00:04:42,240 --> 00:04:46,720
and start digging in deep that's

125
00:04:44,479 --> 00:04:48,560
where you get into those other terms the

126
00:04:46,720 --> 00:04:50,160
the observability the

127
00:04:48,560 --> 00:04:52,240
the monitoring and all the rest of it

128
00:04:50,160 --> 00:04:54,160
and then you get into a lot more

129
00:04:52,240 --> 00:04:55,759
complexity

130
00:04:54,160 --> 00:04:58,000
but really

131
00:04:55,759 --> 00:05:00,479
being the first step visibility just

132
00:04:58,000 --> 00:05:02,800
gives those those very base

133
00:05:00,479 --> 00:05:06,400
uh insights and understandings that

134
00:05:02,800 --> 00:05:08,160
something is available and is doing

135
00:05:06,400 --> 00:05:10,320
at least

136
00:05:08,160 --> 00:05:12,000
at face value what it's supposed to be

137
00:05:10,320 --> 00:05:13,759
doing

138
00:05:12,000 --> 00:05:15,759
and so for this situation we had a

139
00:05:13,759 --> 00:05:17,600
problem uh

140
00:05:15,759 --> 00:05:19,919
we were already lacking

141
00:05:17,600 --> 00:05:21,759
uh certain

142
00:05:19,919 --> 00:05:23,759
observances and and

143
00:05:21,759 --> 00:05:26,000
visible aspects to

144
00:05:23,759 --> 00:05:27,759
some of our systems

145
00:05:26,000 --> 00:05:29,919
and

146
00:05:27,759 --> 00:05:32,560
what was available

147
00:05:29,919 --> 00:05:34,160
for us to readily implement

148
00:05:32,560 --> 00:05:36,960
was

149
00:05:34,160 --> 00:05:39,199
incredibly expensive

150
00:05:36,960 --> 00:05:41,199
and even though

151
00:05:39,199 --> 00:05:43,759
we could have just implemented it

152
00:05:41,199 --> 00:05:45,520
uh it would have been

153
00:05:43,759 --> 00:05:47,039
something that would have turned a lot

154
00:05:45,520 --> 00:05:49,199
of heads and

155
00:05:47,039 --> 00:05:52,160
probably gotten quite a few people into

156
00:05:49,199 --> 00:05:54,160
a lot of hot water if it had been

157
00:05:52,160 --> 00:05:56,400
enabled straight away

158
00:05:54,160 --> 00:05:57,360
but what we did have

159
00:05:56,400 --> 00:05:59,520
is

160
00:05:57,360 --> 00:06:02,400
a set of scenarios that

161
00:05:59,520 --> 00:06:04,560
we knew we needed to be able to

162
00:06:02,400 --> 00:06:05,600
to make visible so

163
00:06:04,560 --> 00:06:08,160
we knew

164
00:06:05,600 --> 00:06:10,800
certain things

165
00:06:08,160 --> 00:06:12,720
were potentially going to happen if they

166
00:06:10,800 --> 00:06:14,960
weren't already happening and we just

167
00:06:12,720 --> 00:06:18,080
needed to make them visible

168
00:06:14,960 --> 00:06:19,520
and it was really that that thing

169
00:06:18,080 --> 00:06:22,080
of

170
00:06:19,520 --> 00:06:24,400
we know that these are possible but we

171
00:06:22,080 --> 00:06:27,680
don't know that they're not happening

172
00:06:24,400 --> 00:06:29,120
and that was our visibility dilemma of

173
00:06:27,680 --> 00:06:30,960
the time

174
00:06:29,120 --> 00:06:33,600
and so

175
00:06:30,960 --> 00:06:36,240
we we had a solution um

176
00:06:33,600 --> 00:06:38,319
we can build our own tool

177
00:06:36,240 --> 00:06:40,720
which is

178
00:06:38,319 --> 00:06:43,440
always the best idea um

179
00:06:40,720 --> 00:06:45,600
but also not please don't uh

180
00:06:43,440 --> 00:06:46,479
we could make it composable though

181
00:06:45,600 --> 00:06:47,199
where

182
00:06:46,479 --> 00:06:49,759
we

183
00:06:47,199 --> 00:06:51,120
have our certain set of rules that we

184
00:06:49,759 --> 00:06:53,199
already know

185
00:06:51,120 --> 00:06:56,000
but we might have more in the future for

186
00:06:53,199 --> 00:06:59,120
instance um especially as

187
00:06:56,000 --> 00:07:01,759
the makeup of our platform changes

188
00:06:59,120 --> 00:07:05,599
we we need to

189
00:07:01,759 --> 00:07:06,880
make different components more visible

190
00:07:05,599 --> 00:07:08,960
and

191
00:07:06,880 --> 00:07:13,520
by making this particular tool

192
00:07:08,960 --> 00:07:13,520
composable we can split it up and

193
00:07:13,840 --> 00:07:18,319
i guess try a different paradigm for how

194
00:07:16,720 --> 00:07:19,440
we run

195
00:07:18,319 --> 00:07:22,160
tooling

196
00:07:19,440 --> 00:07:24,000
so this particular situation uses a

197
00:07:22,160 --> 00:07:26,560
series of Lambda functions

198
00:07:24,000 --> 00:07:28,400
that interact with different

199
00:07:26,560 --> 00:07:31,520
components to

200
00:07:28,400 --> 00:07:33,120
retrieve data massage it and then look

201
00:07:31,520 --> 00:07:35,440
for certain things

202
00:07:33,120 --> 00:07:37,360
at different points in time based on the

203
00:07:35,440 --> 00:07:39,120
different rules for each from each of

204
00:07:37,360 --> 00:07:40,880
the different functions

205
00:07:39,120 --> 00:07:42,560
and this also meant that we could

206
00:07:40,880 --> 00:07:46,560
integrate it with all of our existing

207
00:07:42,560 --> 00:07:48,639
workflows all of our existing tools and

208
00:07:46,560 --> 00:07:49,599
within the rest of our

209
00:07:48,639 --> 00:07:51,919
uh

210
00:07:49,599 --> 00:07:52,960
infrastructure in general

211
00:07:51,919 --> 00:07:55,199
and

212
00:07:52,960 --> 00:07:56,160
what it looked like then was a Slack

213
00:07:55,199 --> 00:07:58,479
message

214
00:07:56,160 --> 00:08:01,360
so we would have this

215
00:07:58,479 --> 00:08:03,520
uh set of Lambda functions that would do

216
00:08:01,360 --> 00:08:07,840
all of this work and

217
00:08:03,520 --> 00:08:10,240
if anything was found a message would be

218
00:08:07,840 --> 00:08:13,280
dropped into this particular channel

219
00:08:10,240 --> 00:08:15,039
so that way one of the team were able to

220
00:08:13,280 --> 00:08:16,800
investigate it

221
00:08:15,039 --> 00:08:19,120
and so one of the

222
00:08:16,800 --> 00:08:22,319
interesting components was we we kept

223
00:08:19,120 --> 00:08:25,360
track of uh known IP addresses

224
00:08:22,319 --> 00:08:26,879
and these were from infrastructure these

225
00:08:25,360 --> 00:08:30,080
were from

226
00:08:26,879 --> 00:08:31,120
internal usage so like my IP address for

227
00:08:30,080 --> 00:08:32,880
instance

228
00:08:31,120 --> 00:08:35,680
as a member of the team

229
00:08:32,880 --> 00:08:37,760
and we we stored all of these

230
00:08:35,680 --> 00:08:39,519
uh via one of these particular Lambda

231
00:08:37,760 --> 00:08:41,440
functions so that way

232
00:08:39,519 --> 00:08:42,560
when we were looking at different

233
00:08:41,440 --> 00:08:44,720
activity

234
00:08:42,560 --> 00:08:46,399
from one of the other functions it

235
00:08:44,720 --> 00:08:48,240
it could request

236
00:08:46,399 --> 00:08:51,279
uh whether or not

237
00:08:48,240 --> 00:08:54,560
an IP address was known and it would

238
00:08:51,279 --> 00:08:56,800
use this other other function to do that
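
A minimal sketch of the "known IP address" lookup described above, assuming the addresses are held as a simple list of networks. The talk does not say how the list was stored or what the function was called, so `KNOWN_NETWORKS`, `is_known_ip`, and `handler` are hypothetical names and the addresses are documentation-range placeholders.

```python
import ipaddress

# Hypothetical allow-list: the talk does not describe the real storage,
# so a hard-coded list stands in for it here.
KNOWN_NETWORKS = [
    ipaddress.ip_network("10.0.0.0/8"),      # e.g. internal infrastructure
    ipaddress.ip_network("203.0.113.7/32"),  # e.g. one team member's address
]


def is_known_ip(address, networks=KNOWN_NETWORKS):
    """Return True if the address falls inside any known network."""
    ip = ipaddress.ip_address(address)
    return any(ip in network for network in networks)


def handler(event, context=None):
    """Lambda-style entry point: a caller passes {"ip": "..."}."""
    return {"ip": event["ip"], "known": is_known_ip(event["ip"])}
```

Keeping the lookup behind its own small function is what would let the other functions reuse it without knowing where the list lives.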

239
00:08:54,560 --> 00:08:58,080
and so this sort of composable

240
00:08:56,800 --> 00:09:00,560
architecture

241
00:08:58,080 --> 00:09:03,519
it makes sense uh because we could reuse

242
00:09:00,560 --> 00:09:05,440
these things in different ways and

243
00:09:03,519 --> 00:09:08,959
for very different purposes than than

244
00:09:05,440 --> 00:09:12,320
what they were built for

245
00:09:08,959 --> 00:09:13,519
and this was good it worked for us

246
00:09:12,320 --> 00:09:15,760
it

247
00:09:13,519 --> 00:09:17,040
allowed us to have our own rules we

248
00:09:15,760 --> 00:09:19,600
could

249
00:09:17,040 --> 00:09:20,800
work our own way

250
00:09:19,600 --> 00:09:22,399
and it also

251
00:09:20,800 --> 00:09:24,000
meant that we only implemented the

252
00:09:22,399 --> 00:09:26,480
things that we needed

253
00:09:24,000 --> 00:09:28,880
we we didn't end up with

254
00:09:26,480 --> 00:09:31,680
all of this extra stuff

255
00:09:28,880 --> 00:09:33,040
that we just weren't going to use

256
00:09:31,680 --> 00:09:35,360
it wasn't

257
00:09:33,040 --> 00:09:37,120
it was more of a a toolbox of sorts

258
00:09:35,360 --> 00:09:38,160
rather than a swiss army knife for

259
00:09:37,120 --> 00:09:39,200
instance

260
00:09:38,160 --> 00:09:40,880
um

261
00:09:39,200 --> 00:09:42,160
like i've never used a saw on a swiss

262
00:09:40,880 --> 00:09:43,519
army knife

263
00:09:42,160 --> 00:09:45,120
because i'm not sure what kind of wood

264
00:09:43,519 --> 00:09:47,120
it's going to cut but

265
00:09:45,120 --> 00:09:49,360
it's probably not going to be useful for

266
00:09:47,120 --> 00:09:50,720
anything that i would try to try to use

267
00:09:49,360 --> 00:09:51,680
it for

268
00:09:50,720 --> 00:09:53,839
um

269
00:09:51,680 --> 00:09:55,440
there were some additional problems with

270
00:09:53,839 --> 00:09:56,959
this though um

271
00:09:55,440 --> 00:09:58,880
which is always the way when you build

272
00:09:56,959 --> 00:10:00,880
things yourself um how do we know it's

273
00:09:58,880 --> 00:10:03,680
working um

274
00:10:00,880 --> 00:10:04,720
when it comes to things like

275
00:10:03,680 --> 00:10:06,720
external

276
00:10:04,720 --> 00:10:09,519
or external to the business

277
00:10:06,720 --> 00:10:12,320
third-party integrations these sorts of

278
00:10:09,519 --> 00:10:14,320
SaaS-type solutions

279
00:10:12,320 --> 00:10:15,920
there's usually

280
00:10:14,320 --> 00:10:17,920
feedback loops and

281
00:10:15,920 --> 00:10:20,320
there's there's ways to get information

282
00:10:17,920 --> 00:10:23,360
about the current state

283
00:10:20,320 --> 00:10:25,839
of of the uh platform

284
00:10:23,360 --> 00:10:27,760
and so we added some very basic

285
00:10:25,839 --> 00:10:28,720
visibility uh

286
00:10:27,760 --> 00:10:31,360
which

287
00:10:28,720 --> 00:10:33,360
as as you might guess was another Lambda

288
00:10:31,360 --> 00:10:34,800
function um or at least a couple of

289
00:10:33,360 --> 00:10:36,720
Lambda functions

290
00:10:34,800 --> 00:10:37,680
and these

291
00:10:36,720 --> 00:10:38,480
all

292
00:10:37,680 --> 00:10:40,640
uh

293
00:10:38,480 --> 00:10:43,279
worked within the the AWS environment

294
00:10:40,640 --> 00:10:46,640
that we we managed all of this

295
00:10:43,279 --> 00:10:48,480
and would follow the same steps as the

296
00:10:46,640 --> 00:10:50,480
actual tool itself

297
00:10:48,480 --> 00:10:52,480
it would reuse the

298
00:10:50,480 --> 00:10:54,240
the function that sent the message into

299
00:10:52,480 --> 00:10:55,760
Slack it would

300
00:10:54,240 --> 00:10:56,640
look at the

301
00:10:55,760 --> 00:10:58,880
known

302
00:10:56,640 --> 00:11:02,720
uh CloudWatch information that we would

303
00:10:58,880 --> 00:11:04,640
generate and do all the processing um IP

304
00:11:02,720 --> 00:11:05,519
addresses and and all that sort of stuff

305
00:11:04,640 --> 00:11:07,600
as well

306
00:11:05,519 --> 00:11:10,560
and so it was great it

307
00:11:07,600 --> 00:11:11,680
was was an afterthought for sure

308
00:11:10,560 --> 00:11:15,980
but

309
00:11:11,680 --> 00:11:18,880
it was done and so that was fine

310
00:11:15,980 --> 00:11:20,560
[Music]

311
00:11:18,880 --> 00:11:22,480
except

312
00:11:20,560 --> 00:11:24,880
we didn't know when things weren't

313
00:11:22,480 --> 00:11:27,120
working

314
00:11:24,880 --> 00:11:29,360
we we knew when things were working and

315
00:11:27,120 --> 00:11:31,760
we could test that we could

316
00:11:29,360 --> 00:11:33,600
take our own actions

317
00:11:31,760 --> 00:11:36,560
to essentially canary

318
00:11:33,600 --> 00:11:39,120
the the processing pipeline and we could

319
00:11:36,560 --> 00:11:40,480
do things that we know would definitely

320
00:11:39,120 --> 00:11:42,480
trigger

321
00:11:40,480 --> 00:11:44,720
something happening

322
00:11:42,480 --> 00:11:44,720
and

323
00:11:45,120 --> 00:11:49,519
yeah we

324
00:11:46,560 --> 00:11:52,800
oh like this this visibility was very

325
00:11:49,519 --> 00:11:56,240
basic uh it it didn't really

326
00:11:52,800 --> 00:11:58,399
test everything um but it worked for

327
00:11:56,240 --> 00:11:59,600
just letting us know

328
00:11:58,399 --> 00:12:00,959
um

329
00:11:59,600 --> 00:12:03,519
and

330
00:12:00,959 --> 00:12:05,360
that was that was all good

331
00:12:03,519 --> 00:12:07,600
except um

332
00:12:05,360 --> 00:12:09,920
yeah we

333
00:12:07,600 --> 00:12:11,680
we didn't test things like the different

334
00:12:09,920 --> 00:12:13,680
integrations themselves

335
00:12:11,680 --> 00:12:15,120
so

336
00:12:13,680 --> 00:12:17,040
as as

337
00:12:15,120 --> 00:12:19,279
this image shows um

338
00:12:17,040 --> 00:12:21,600
we didn't test to make sure that the

339
00:12:19,279 --> 00:12:22,880
Slack integration worked fine

340
00:12:21,600 --> 00:12:24,399
and

341
00:12:22,880 --> 00:12:26,959
we didn't know if it ever stopped

342
00:12:24,399 --> 00:12:29,760
working because the only messages we got

343
00:12:26,959 --> 00:12:31,279
from the system were into Slack

344
00:12:29,760 --> 00:12:34,000
which is also a problem if Slack is

345
00:12:31,279 --> 00:12:36,959
unavailable and that sort of thing but

346
00:12:34,000 --> 00:12:39,360
if the integration itself breaks um

347
00:12:36,959 --> 00:12:40,720
we don't know about that

348
00:12:39,360 --> 00:12:43,040
and so one day

349
00:12:40,720 --> 00:12:43,839
uh one of the team members

350
00:12:43,040 --> 00:12:45,440
uh

351
00:12:43,839 --> 00:12:47,360
noticed this

352
00:12:45,440 --> 00:12:49,519
like hey

353
00:12:47,360 --> 00:12:50,399
why haven't we had anything for a while

354
00:12:49,519 --> 00:12:51,519
like

355
00:12:50,399 --> 00:12:53,680
surely

356
00:12:51,519 --> 00:12:56,560
nothing is running so good that there

357
00:12:53,680 --> 00:12:59,120
are no problems uh that's

358
00:12:56,560 --> 00:13:00,880
usually a good sign that something is

359
00:12:59,120 --> 00:13:02,480
not actually

360
00:13:00,880 --> 00:13:03,680
working

361
00:13:02,480 --> 00:13:06,639
and

362
00:13:03,680 --> 00:13:08,000
that was a problem uh because you don't

363
00:13:06,639 --> 00:13:09,120
really think about it unless there's an

364
00:13:08,000 --> 00:13:12,560
alert

365
00:13:09,120 --> 00:13:14,959
when you look in a channel or

366
00:13:12,560 --> 00:13:16,320
when you're used to having notifications

367
00:13:14,959 --> 00:13:18,480
about things

368
00:13:16,320 --> 00:13:20,079
the absence of a notification doesn't

369
00:13:18,480 --> 00:13:21,600
really

370
00:13:20,079 --> 00:13:24,880
trigger anything you don't really think

371
00:13:21,600 --> 00:13:26,880
about it until this sort of time where

372
00:13:24,880 --> 00:13:28,639
i think it had been about a month a

373
00:13:26,880 --> 00:13:30,720
month and a half

374
00:13:28,639 --> 00:13:33,040
since we had actually

375
00:13:30,720 --> 00:13:35,600
seen anything come from this
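
One way to catch this kind of silence is a staleness check: record when the last message was delivered and flag it when that time gets too old. The sketch below is an assumption rather than what the team built. The name `is_silent_too_long` and the seven-day default are hypothetical, and any alert it drives would need to go out through a channel other than the one being checked.

```python
from datetime import datetime, timedelta, timezone


def is_silent_too_long(last_message_at, now=None, max_silence=timedelta(days=7)):
    """Return True if nothing has been delivered within max_silence.

    last_message_at: timezone-aware time of the last delivered message.
    max_silence: hypothetical threshold; tune it to how often messages
    are normally expected.
    """
    if now is None:
        now = datetime.now(timezone.utc)
    return now - last_message_at > max_silence
```

With a check like this running on a schedule, a month of silence becomes a signal rather than something a team member has to happen to notice.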

376
00:13:33,040 --> 00:13:36,800
and so we did what we did before we we

377
00:13:35,600 --> 00:13:39,360
ran through

378
00:13:36,800 --> 00:13:40,320
uh one of our canary tests and sure

379
00:13:39,360 --> 00:13:41,680
enough

380
00:13:40,320 --> 00:13:43,519
nothing happened

381
00:13:41,680 --> 00:13:44,560
and so

382
00:13:43,519 --> 00:13:46,079
that was

383
00:13:44,560 --> 00:13:47,040
the problem

384
00:13:46,079 --> 00:13:49,120
so

385
00:13:47,040 --> 00:13:51,680
we started digging in we started looking

386
00:13:49,120 --> 00:13:53,040
around we went back through

387
00:13:51,680 --> 00:13:54,240
how everything was set up in the first

388
00:13:53,040 --> 00:13:56,639
place

389
00:13:54,240 --> 00:13:58,720
and we we looked

390
00:13:56,639 --> 00:14:00,720
at all the functions we looked at the

391
00:13:58,720 --> 00:14:03,199
code for the functions we looked at the

392
00:14:00,720 --> 00:14:04,639
infrastructure and

393
00:14:03,199 --> 00:14:05,920
everything was there

394
00:14:04,639 --> 00:14:07,519
the config

395
00:14:05,920 --> 00:14:10,240
hadn't drifted

396
00:14:07,519 --> 00:14:13,600
we we managed everything with Terraform

397
00:14:10,240 --> 00:14:15,279
and all of the code was in Git along

398
00:14:13,600 --> 00:14:17,920
with that Terraform

399
00:14:15,279 --> 00:14:19,680
there hadn't been changes for months uh

400
00:14:17,920 --> 00:14:21,199
everything looks fine we could see that

401
00:14:19,680 --> 00:14:23,120
the functions were running

402
00:14:21,199 --> 00:14:25,600
we could validate that

403
00:14:23,120 --> 00:14:27,600
they executed without errors

404
00:14:25,600 --> 00:14:30,800
and

405
00:14:27,600 --> 00:14:30,800
it looked weird

406
00:14:31,360 --> 00:14:33,839
and

407
00:14:32,639 --> 00:14:35,519
what we

408
00:14:33,839 --> 00:14:37,040
sort of got to the point of was that

409
00:14:35,519 --> 00:14:38,959
well

410
00:14:37,040 --> 00:14:41,199
we know that all of this is there

411
00:14:38,959 --> 00:14:42,959
it's all very disparate but

412
00:14:41,199 --> 00:14:45,120
we know that things are working fine

413
00:14:42,959 --> 00:14:46,480
just that messages aren't coming through

414
00:14:45,120 --> 00:14:48,399
and so

415
00:14:46,480 --> 00:14:50,959
we checked okay well if that's the only

416
00:14:48,399 --> 00:14:52,959
thing that we're not seeing then

417
00:14:50,959 --> 00:14:54,639
maybe we need to update how that

418
00:14:52,959 --> 00:14:56,320
integration is working

419
00:14:54,639 --> 00:14:59,120
and suddenly

420
00:14:56,320 --> 00:15:00,639
it works fine again um

421
00:14:59,120 --> 00:15:03,040
we

422
00:15:00,639 --> 00:15:05,199
had to do all this digging but

423
00:15:03,040 --> 00:15:06,480
we just didn't know what we didn't know

424
00:15:05,199 --> 00:15:09,120
and

425
00:15:06,480 --> 00:15:12,399
a lot of this was was built in isolation

426
00:15:09,120 --> 00:15:15,199
uh away from the rest of the team so we

427
00:15:12,399 --> 00:15:17,519
had to dig we we didn't have

428
00:15:15,199 --> 00:15:20,560
any way to just know where certain

429
00:15:17,519 --> 00:15:22,639
things are or be able to see

430
00:15:20,560 --> 00:15:26,160
how things are progressing

431
00:15:22,639 --> 00:15:27,600
and updating the details was was all we

432
00:15:26,160 --> 00:15:28,399
needed to do

433
00:15:27,600 --> 00:15:31,360
and

434
00:15:28,399 --> 00:15:32,560
then we were back on track

435
00:15:31,360 --> 00:15:35,839
so

436
00:15:32,560 --> 00:15:37,920
the problem was that the the Slack uh

437
00:15:35,839 --> 00:15:40,480
integration which was just a webhook to

438
00:15:37,920 --> 00:15:41,519
send the message uh it had been disabled

439
00:15:40,480 --> 00:15:44,079
uh

440
00:15:41,519 --> 00:15:45,600
which is a a wonderful feature it may

441
00:15:44,079 --> 00:15:47,680
have changed actually but it was a

442
00:15:45,600 --> 00:15:50,399
wonderful feature from Slack that

443
00:15:47,680 --> 00:15:52,480
whenever a user

444
00:15:50,399 --> 00:15:54,720
adds an integration or sets up an

445
00:15:52,480 --> 00:15:57,440
application or anything like this it's

446
00:15:54,720 --> 00:16:00,480
owned by that user themself

447
00:15:57,440 --> 00:16:02,639
and that also means if that user is

448
00:16:00,480 --> 00:16:04,560
disabled or removed

449
00:16:02,639 --> 00:16:07,199
or otherwise

450
00:16:04,560 --> 00:16:09,199
unavailable to manage the integration

451
00:16:07,199 --> 00:16:11,040
anymore then

452
00:16:09,199 --> 00:16:12,480
that integration is disabled

453
00:16:11,040 --> 00:16:15,279
and so

454
00:16:12,480 --> 00:16:18,399
this webhook had

455
00:16:15,279 --> 00:16:22,320
become unavailable to be used and that

456
00:16:18,399 --> 00:16:24,639
meant that no messages were ever sent

457
00:16:22,320 --> 00:16:25,839
but yeah once we created a new one and

458
00:16:24,639 --> 00:16:27,920
we used

459
00:16:25,839 --> 00:16:29,440
a headless user to do so

460
00:16:27,920 --> 00:16:32,399
so that way there was

461
00:16:29,440 --> 00:16:33,519
a much lower risk of it being removed in

462
00:16:32,399 --> 00:16:35,040
the future

463
00:16:33,519 --> 00:16:36,800
we could update the function and it

464
00:16:35,040 --> 00:16:39,360
would work fine again

465
00:16:36,800 --> 00:16:42,079
and then retesting

466
00:16:39,360 --> 00:16:42,079
and it's all good

467
00:16:42,480 --> 00:16:46,560
i think this is this is another a good

468
00:16:44,800 --> 00:16:48,560
thing to think about as well

469
00:16:46,560 --> 00:16:51,279
when it comes to

470
00:16:48,560 --> 00:16:54,240
how these sorts of things fit uh

471
00:16:51,279 --> 00:16:56,240
how they fit together and i'm sure that

472
00:16:54,240 --> 00:16:57,360
if that original

473
00:16:56,240 --> 00:16:58,480
broken

474
00:16:57,360 --> 00:17:01,279
webhook

475
00:16:58,480 --> 00:17:02,399
had been attempted to be used

476
00:17:01,279 --> 00:17:05,919
there would have been an error message

477
00:17:02,399 --> 00:17:07,679
of some sort and

478
00:17:05,919 --> 00:17:09,919
that would have been a good thing to

479
00:17:07,679 --> 00:17:12,160
return from that Lambda function

480
00:17:09,919 --> 00:17:13,760
if we had been able to see that there

481
00:17:12,160 --> 00:17:16,160
was that error message we would have

482
00:17:13,760 --> 00:17:17,839
known very quickly and it would have

483
00:17:16,160 --> 00:17:19,039
been something that we could have done

484
00:17:17,839 --> 00:17:20,880
something about

485
00:17:19,039 --> 00:17:23,039
and we

486
00:17:20,880 --> 00:17:24,799
wouldn't have missed whatever messages

487
00:17:23,039 --> 00:17:26,799
we had missed along the way up until

488
00:17:24,799 --> 00:17:27,679
that point
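
A minimal sketch of surfacing that error instead of swallowing it, assuming the message is posted as JSON to an incoming webhook URL. `send_message`, `WebhookDeliveryError`, and the `opener` parameter are hypothetical names for illustration, and the talk does not show the original function.

```python
import json
import urllib.error
import urllib.request


class WebhookDeliveryError(Exception):
    """Raised when the webhook does not accept the message."""


def send_message(webhook_url, text, opener=urllib.request.urlopen):
    """Post text to the webhook and raise if delivery fails.

    opener is injectable only so the function can be exercised
    without a real network call.
    """
    payload = json.dumps({"text": text}).encode("utf-8")
    request = urllib.request.Request(
        webhook_url,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    try:
        with opener(request, timeout=10) as response:
            status = response.status
    except urllib.error.HTTPError as err:
        # urlopen raises HTTPError for 4xx/5xx, e.g. a disabled webhook.
        raise WebhookDeliveryError(f"webhook returned HTTP {err.code}") from err
    except urllib.error.URLError as err:
        raise WebhookDeliveryError(f"webhook unreachable: {err.reason}") from err
    if status != 200:
        raise WebhookDeliveryError(f"webhook returned HTTP {status}")
    return status
```

If the calling function lets this exception propagate, the invocation is recorded as a failure rather than a clean run, which is the signal that was missing here.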

489
00:17:26,799 --> 00:17:29,600
but

490
00:17:27,679 --> 00:17:33,200
after a period of a couple of days we

491
00:17:29,600 --> 00:17:35,200
were all back on track again

492
00:17:33,200 --> 00:17:38,000
and so the third topic i wanted to cover

493
00:17:35,200 --> 00:17:40,400
was around incidents and

494
00:17:38,000 --> 00:17:41,840
the incidents uh the the sort of typical

495
00:17:40,400 --> 00:17:43,679
things from

496
00:17:41,840 --> 00:17:46,000
production infrastructure or just

497
00:17:43,679 --> 00:17:47,919
infrastructure in general the the things

498
00:17:46,000 --> 00:17:50,880
where you say hey there's

499
00:17:47,919 --> 00:17:53,280
like something is down or unavailable

500
00:17:50,880 --> 00:17:55,520
uh those those types of things and

501
00:17:53,280 --> 00:17:58,400
the the analogy here is

502
00:17:55,520 --> 00:18:00,320
if a tree falls in the woods

503
00:17:58,400 --> 00:18:01,679
and no one's around to hear it does it

504
00:18:00,320 --> 00:18:03,360
make a sound

505
00:18:01,679 --> 00:18:05,360
uh and

506
00:18:03,360 --> 00:18:06,960
that's where incidents come in

507
00:18:05,360 --> 00:18:07,919
for for us

508
00:18:06,960 --> 00:18:10,720
um

509
00:18:07,919 --> 00:18:11,760
so a couple of years ago um

510
00:18:10,720 --> 00:18:14,000
all

511
00:18:11,760 --> 00:18:16,480
sort of interruptions and incidents were

512
00:18:14,000 --> 00:18:19,840
very chaotic they were

513
00:18:16,480 --> 00:18:20,880
all over the place they were confusing

514
00:18:19,840 --> 00:18:23,919
and

515
00:18:20,880 --> 00:18:25,440
they typically happened in private

516
00:18:23,919 --> 00:18:26,720
channels

517
00:18:25,440 --> 00:18:29,200
they were

518
00:18:26,720 --> 00:18:30,480
typically between the same couple of

519
00:18:29,200 --> 00:18:32,160
people

520
00:18:30,480 --> 00:18:34,400
and

521
00:18:32,160 --> 00:18:37,520
often afterwards

522
00:18:34,400 --> 00:18:41,200
there weren't clear actions to come out

523
00:18:37,520 --> 00:18:43,360
from from handling that incident so

524
00:18:41,200 --> 00:18:45,360
the sorts of things like

525
00:18:43,360 --> 00:18:48,960
when there's a problem

526
00:18:45,360 --> 00:18:50,400
and we know that there is a fundamental

527
00:18:48,960 --> 00:18:53,039
uh issue

528
00:18:50,400 --> 00:18:54,000
whether it's design or architecture or

529
00:18:53,039 --> 00:18:56,720
even just

530
00:18:54,000 --> 00:18:58,400
bugs in code

531
00:18:56,720 --> 00:18:59,200
there was

532
00:18:58,400 --> 00:19:01,120
not

533
00:18:59,200 --> 00:19:03,520
always a very clear

534
00:19:01,120 --> 00:19:05,440
outcome to say okay we need to

535
00:19:03,520 --> 00:19:07,760
improve or shift

536
00:19:05,440 --> 00:19:09,600
these things so that way we can prevent

537
00:19:07,760 --> 00:19:10,720
this in the future

538
00:19:09,600 --> 00:19:13,600
and

539
00:19:10,720 --> 00:19:15,120
it was because of those things that we

540
00:19:13,600 --> 00:19:17,200
we decided okay we needed to do

541
00:19:15,120 --> 00:19:19,840
something about this we needed to

542
00:19:17,200 --> 00:19:22,000
make it visible

543
00:19:19,840 --> 00:19:23,760
and so we created our own

544
00:19:22,000 --> 00:19:25,679
incident management framework um and

545
00:19:23,760 --> 00:19:26,880
when i say created our own i don't mean

546
00:19:25,679 --> 00:19:28,880
we

547
00:19:26,880 --> 00:19:32,160
took a blank slate and started

548
00:19:28,880 --> 00:19:34,320
from scratch um we borrowed a lot from a

549
00:19:32,160 --> 00:19:35,840
lot of the common uh publicly known

550
00:19:34,320 --> 00:19:39,679
frameworks such as

551
00:19:35,840 --> 00:19:42,720
uh the one from etsy and what a lot of

552
00:19:39,679 --> 00:19:44,960
places like i think dropbox and netflix

553
00:19:42,720 --> 00:19:45,919
and some of the other uh big software

554
00:19:44,960 --> 00:19:47,679
companies

555
00:19:45,919 --> 00:19:49,919
their frameworks that they've published

556
00:19:47,679 --> 00:19:53,039
and talked about

557
00:19:49,919 --> 00:19:55,120
the intention here was to to streamline

558
00:19:53,039 --> 00:19:56,480
handling uh those

559
00:19:55,120 --> 00:19:57,280
interruptions

560
00:19:56,480 --> 00:19:59,840
um

561
00:19:57,280 --> 00:20:02,320
but also distribute the workload so make

562
00:19:59,840 --> 00:20:04,640
sure that it's not always just those

563
00:20:02,320 --> 00:20:07,919
same handful of people

564
00:20:04,640 --> 00:20:10,080
make it more visible as well uh

565
00:20:07,919 --> 00:20:11,840
for two reasons i guess um

566
00:20:10,080 --> 00:20:13,039
one was

567
00:20:11,840 --> 00:20:15,760
to sort of

568
00:20:13,039 --> 00:20:17,600
show just how much of an impact it makes

569
00:20:15,760 --> 00:20:19,600
to the team when there are these sorts

570
00:20:17,600 --> 00:20:21,760
of interruptions where

571
00:20:19,600 --> 00:20:24,240
groups of people have to drop whatever

572
00:20:21,760 --> 00:20:25,760
work they're doing to to look at a

573
00:20:24,240 --> 00:20:27,679
critical problem

574
00:20:25,760 --> 00:20:30,159
because if they don't then

575
00:20:27,679 --> 00:20:32,799
there's going to be customer problems or

576
00:20:30,159 --> 00:20:32,799
that sort of thing

577
00:20:33,120 --> 00:20:36,159
but the other thing

578
00:20:34,559 --> 00:20:39,200
that was a really great outcome from

579
00:20:36,159 --> 00:20:41,440
this was was the visibility and

580
00:20:39,200 --> 00:20:43,679
visibility in this case is

581
00:20:41,440 --> 00:20:46,080
uh both an internal to the company and

582
00:20:43,679 --> 00:20:48,960
an external thing to customers

583
00:20:46,080 --> 00:20:51,360
so not only making

584
00:20:48,960 --> 00:20:52,400
say a product manager aware that the

585
00:20:51,360 --> 00:20:54,320
reason

586
00:20:52,400 --> 00:20:56,799
their squad of engineers hasn't been

587
00:20:54,320 --> 00:20:58,720
able to progress is because of the

588
00:20:56,799 --> 00:21:01,200
incidents that have been happening it's

589
00:20:58,720 --> 00:21:02,960
also to communicate with customers to

590
00:21:01,200 --> 00:21:04,799
make them aware of

591
00:21:02,960 --> 00:21:06,480
like the reasons they're unable to use

592
00:21:04,799 --> 00:21:08,080
the platform right now are because of

593
00:21:06,480 --> 00:21:09,200
these issues happening

594
00:21:08,080 --> 00:21:11,840
and

595
00:21:09,200 --> 00:21:13,760
then giving more of an eta and keeping

596
00:21:11,840 --> 00:21:17,120
them in the loop on how everything's

597
00:21:13,760 --> 00:21:19,200
going and that visibility was was really

598
00:21:17,120 --> 00:21:20,880
one of the biggest impactful changes to

599
00:21:19,200 --> 00:21:23,120
come from the incident management

600
00:21:20,880 --> 00:21:25,120
framework

601
00:21:23,120 --> 00:21:26,000
and so the framework itself

602
00:21:25,120 --> 00:21:28,159
um

603
00:21:26,000 --> 00:21:28,159
we

604
00:21:28,240 --> 00:21:33,280
we have it um

605
00:21:30,400 --> 00:21:34,080
in a way that we give a lot of structure

606
00:21:33,280 --> 00:21:36,480
to

607
00:21:34,080 --> 00:21:38,080
i guess live problem solving

608
00:21:36,480 --> 00:21:40,640
so we we have a set of roles and

609
00:21:38,080 --> 00:21:43,840
responsibilities for those roles uh we

610
00:21:40,640 --> 00:21:45,919
have a hierarchy of um

611
00:21:43,840 --> 00:21:49,200
of the way that those roles interact

612
00:21:45,919 --> 00:21:51,039
with each other and we train people

613
00:21:49,200 --> 00:21:52,000
for each of those roles in particular

614
00:21:51,039 --> 00:21:53,840
because

615
00:21:52,000 --> 00:21:55,919
they have different requirements they

616
00:21:53,840 --> 00:21:57,600
they have a different skill set

617
00:21:55,919 --> 00:22:00,320
and

618
00:21:57,600 --> 00:22:01,840
what might work for one role and the way

619
00:22:00,320 --> 00:22:04,720
that you approach

620
00:22:01,840 --> 00:22:07,840
uh working in that role may not work at

621
00:22:04,720 --> 00:22:09,200
all in another role so we we provide

622
00:22:07,840 --> 00:22:10,960
that training

623
00:22:09,200 --> 00:22:12,880
for those individuals when they would

624
00:22:10,960 --> 00:22:16,080
like to or depending on role

625
00:22:12,880 --> 00:22:16,880
uh some roles are more suited to certain

626
00:22:16,080 --> 00:22:18,640
uh

627
00:22:16,880 --> 00:22:21,360
some roles in the business are more

628
00:22:18,640 --> 00:22:23,039
suited to different roles inside the

629
00:22:21,360 --> 00:22:24,720
framework

630
00:22:23,039 --> 00:22:27,840
and that structure is is really

631
00:22:24,720 --> 00:22:29,919
important because it means when there's

632
00:22:27,840 --> 00:22:32,000
like a critical issue where the entire

633
00:22:29,919 --> 00:22:33,600
platform is unavailable

634
00:22:32,000 --> 00:22:37,039
we don't have

635
00:22:33,600 --> 00:22:38,400
a series of executives or or managers or

636
00:22:37,039 --> 00:22:39,679
a group of people

637
00:22:38,400 --> 00:22:42,640
uh

638
00:22:39,679 --> 00:22:46,320
worrying that the world is on fire

639
00:22:42,640 --> 00:22:48,000
we have a structure that gives people a

640
00:22:46,320 --> 00:22:50,799
grounding to

641
00:22:48,000 --> 00:22:52,480
keep them calmer and actually focus on

642
00:22:50,799 --> 00:22:53,520
solving the problem

643
00:22:52,480 --> 00:22:55,200
without

644
00:22:53,520 --> 00:22:56,400
worrying as much

645
00:22:55,200 --> 00:22:58,000
um

646
00:22:56,400 --> 00:22:59,919
and it's a great way to facilitate the

647
00:22:58,000 --> 00:23:01,200
communication it's it's the back and

648
00:22:59,919 --> 00:23:03,679
forth the

649
00:23:01,200 --> 00:23:07,120
making sure that people are aware making

650
00:23:03,679 --> 00:23:09,600
sure that stakeholders know that

651
00:23:07,120 --> 00:23:12,640
this problem we know about it and we're

652
00:23:09,600 --> 00:23:15,039
actively working on it um and

653
00:23:12,640 --> 00:23:16,640
they can get updates uh through through

654
00:23:15,039 --> 00:23:18,960
that process and

655
00:23:16,640 --> 00:23:21,520
uh it's same for customers like uh for

656
00:23:18,960 --> 00:23:22,880
our incident management we publish to

657
00:23:21,520 --> 00:23:26,559
status page

658
00:23:22,880 --> 00:23:29,440
and people can subscribe to email or

659
00:23:26,559 --> 00:23:30,960
rss updates that sort of stuff

660
00:23:29,440 --> 00:23:33,120
and the framework also means that we

661
00:23:30,960 --> 00:23:35,200
involve the right people so because we

662
00:23:33,120 --> 00:23:36,880
have those roles we

663
00:23:35,200 --> 00:23:38,880
have people who are trained in those

664
00:23:36,880 --> 00:23:41,440
roles we can

665
00:23:38,880 --> 00:23:44,240
have very particular say pagerduty

666
00:23:41,440 --> 00:23:47,279
groups that we can page people

667
00:23:44,240 --> 00:23:48,880
for those roles in particular

668
00:23:47,279 --> 00:23:51,679
and

669
00:23:48,880 --> 00:23:53,600
we also capture actions as a part of the

670
00:23:51,679 --> 00:23:56,799
process and as a part of the

671
00:23:53,600 --> 00:23:58,720
retrospective afterwards we look at

672
00:23:56,799 --> 00:24:01,679
what the problems were that caused the

673
00:23:58,720 --> 00:24:04,400
incident and then we bring those into

674
00:24:01,679 --> 00:24:06,640
our development workflows and we make

675
00:24:04,400 --> 00:24:08,159
sure that we resolve those long term so

676
00:24:06,640 --> 00:24:11,279
that way we don't have those incidents

677
00:24:08,159 --> 00:24:11,279
again in the future

678
00:24:11,360 --> 00:24:15,919
and so

679
00:24:13,120 --> 00:24:16,880
the visibility aspect of this was all

680
00:24:15,919 --> 00:24:19,279
from

681
00:24:16,880 --> 00:24:21,200
the actual implementation

682
00:24:19,279 --> 00:24:24,080
when we first started the incident

683
00:24:21,200 --> 00:24:26,400
management framework it was mayhem and

684
00:24:24,080 --> 00:24:29,679
there was just a lot of confusion

685
00:24:26,400 --> 00:24:29,679
for about three weeks

686
00:24:30,240 --> 00:24:34,960
mostly in the systems team or actually

687
00:24:32,400 --> 00:24:37,279
in a lot of the engineering teams

688
00:24:34,960 --> 00:24:39,679
there was not really a lot of planned

689
00:24:37,279 --> 00:24:40,799
work getting done and that was because

690
00:24:39,679 --> 00:24:42,880
we would

691
00:24:40,799 --> 00:24:43,919
get an incident called we would all jump

692
00:24:42,880 --> 00:24:45,360
in

693
00:24:43,919 --> 00:24:47,760
as we needed to

694
00:24:45,360 --> 00:24:49,520
and as we resolved it there would be

695
00:24:47,760 --> 00:24:51,039
another incident called for something

696
00:24:49,520 --> 00:24:52,080
completely different

697
00:24:51,039 --> 00:24:54,640
and

698
00:24:52,080 --> 00:24:56,559
when you factor in that and writing the

699
00:24:54,640 --> 00:24:58,080
documentation afterwards and then

700
00:24:56,559 --> 00:25:01,279
meeting to discuss

701
00:24:58,080 --> 00:25:03,360
how it went um and then continuing to

702
00:25:01,279 --> 00:25:06,080
train everybody it was it was very

703
00:25:03,360 --> 00:25:07,279
taxing in the beginning um and it also

704
00:25:06,080 --> 00:25:09,679
gave

705
00:25:07,279 --> 00:25:12,400
uh quite a bit of a a hit to the

706
00:25:09,679 --> 00:25:14,400
confidence across the leadership group

707
00:25:12,400 --> 00:25:15,440
because it was just hey

708
00:25:14,400 --> 00:25:18,080
what's going on there's all these

709
00:25:15,440 --> 00:25:19,919
incidents um it looks really really bad

710
00:25:18,080 --> 00:25:21,919
uh and that was purely because it's

711
00:25:19,919 --> 00:25:23,919
visible now um

712
00:25:21,919 --> 00:25:26,080
originally these things were happening

713
00:25:23,919 --> 00:25:26,799
anyway it's just that nobody knew about

714
00:25:26,080 --> 00:25:29,039
it

715
00:25:26,799 --> 00:25:30,880
and so once we knew about it but we

716
00:25:29,039 --> 00:25:33,360
actually had a plan to do something

717
00:25:30,880 --> 00:25:37,039
about it we could handle those things

718
00:25:33,360 --> 00:25:38,159
and then we we get better we continually

719
00:25:37,039 --> 00:25:40,880
improve

720
00:25:38,159 --> 00:25:42,640
we also update the the framework if we

721
00:25:40,880 --> 00:25:44,400
need to if we find

722
00:25:42,640 --> 00:25:45,840
that we have different types of

723
00:25:44,400 --> 00:25:47,360
incidents that don't quite lend

724
00:25:45,840 --> 00:25:48,080
themselves to the way that we're doing

725
00:25:47,360 --> 00:25:50,640
it

726
00:25:48,080 --> 00:25:54,240
we can change that and that's fine

727
00:25:50,640 --> 00:25:57,360
and it works really well now um it's

728
00:25:54,240 --> 00:25:58,480
not really mayhem at all anymore

729
00:25:57,360 --> 00:26:00,240
and

730
00:25:58,480 --> 00:26:02,400
we have a really good feel for the way

731
00:26:00,240 --> 00:26:05,760
that the different roles interact

732
00:26:02,400 --> 00:26:08,559
and uh we have a pool of people that we

733
00:26:05,760 --> 00:26:12,640
can lean on to be able to help

734
00:26:08,559 --> 00:26:12,640
if and when anything goes wrong

735
00:26:13,279 --> 00:26:16,240
and so i want to

736
00:26:14,799 --> 00:26:19,200
talk a little bit about some takeaways

737
00:26:16,240 --> 00:26:21,039
then from these three uh particular

738
00:26:19,200 --> 00:26:22,400
topics in general

739
00:26:21,039 --> 00:26:24,159
um

740
00:26:22,400 --> 00:26:26,080
the first side is the technical side

741
00:26:24,159 --> 00:26:28,080
of things um

742
00:26:26,080 --> 00:26:30,640
the

743
00:26:28,080 --> 00:26:33,200
the key point of visibility is just that

744
00:26:30,640 --> 00:26:36,080
it's being visible uh and it should be

745
00:26:33,200 --> 00:26:37,600
very basic it shouldn't be complex there

746
00:26:36,080 --> 00:26:38,559
shouldn't be

747
00:26:37,600 --> 00:26:41,279
any

748
00:26:38,559 --> 00:26:44,799
rules engine or anything like that

749
00:26:41,279 --> 00:26:46,640
it's just the bare minimum

750
00:26:44,799 --> 00:26:50,720
making sure that something is running

751
00:26:46,640 --> 00:26:52,559
like having a health check on a service

752
00:26:50,720 --> 00:26:54,080
whatever it is wherever it's running it

753
00:26:52,559 --> 00:26:56,720
doesn't really matter but

754
00:26:54,080 --> 00:26:57,679
having a service be able to

755
00:26:56,720 --> 00:26:58,720
to

756
00:26:57,679 --> 00:27:01,039
alert

757
00:26:58,720 --> 00:27:02,799
when it's not able to do what it's

758
00:27:01,039 --> 00:27:05,360
supposed to do

759
00:27:02,799 --> 00:27:07,600
and you can see like a dashboard for

760
00:27:05,360 --> 00:27:11,360
instance to say like the service is

761
00:27:07,600 --> 00:27:11,360
healthy it's it's working fine
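
[editor's sketch] The basic health check described above could look something like the following. This is a minimal illustration and not the speaker's implementation: the names `HealthStatus` and `check_service`, and the idea of passing in a `probe` callable, are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class HealthStatus:
    """Result of one health check, suitable for showing on a dashboard."""
    name: str
    healthy: bool
    detail: str


def check_service(name: str, probe: Callable[[], bool]) -> HealthStatus:
    """Run a hypothetical probe and report whether the service is healthy.

    The probe is any callable that returns True when the service can do
    what it is supposed to do. A probe that raises counts as unhealthy.
    """
    try:
        ok = bool(probe())
    except Exception as exc:  # a failing probe must never crash the checker
        return HealthStatus(name, False, f"probe raised {exc!r}")
    return HealthStatus(name, ok, "ok" if ok else "probe returned False")
```

The point is the bare minimum: one yes/no answer per service, with no rules engine behind it.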

762
00:27:11,840 --> 00:27:15,039
the next thing is to think about the way

763
00:27:13,360 --> 00:27:16,000
that things can fail

764
00:27:15,039 --> 00:27:18,559
so

765
00:27:16,000 --> 00:27:22,000
with our

766
00:27:18,559 --> 00:27:24,559
situation we we thought about okay how

767
00:27:22,000 --> 00:27:26,960
do we know that something's running and

768
00:27:24,559 --> 00:27:29,520
we added some some basic visibility for

769
00:27:26,960 --> 00:27:31,600
that and that was fine but

770
00:27:29,520 --> 00:27:32,960
what we didn't consider was what things

771
00:27:31,600 --> 00:27:35,679
could go wrong

772
00:27:32,960 --> 00:27:36,480
and that was where we missed that aspect

773
00:27:35,679 --> 00:27:38,559
of

774
00:27:36,480 --> 00:27:40,240
the actual message that gets sent to let

775
00:27:38,559 --> 00:27:42,559
us know that something is wrong we

776
00:27:40,240 --> 00:27:43,279
didn't know when that broke

777
00:27:42,559 --> 00:27:45,200
so

778
00:27:43,279 --> 00:27:46,000
being able to send the message was sort

779
00:27:45,200 --> 00:27:49,120
of

780
00:27:46,000 --> 00:27:50,399
the most fundamental part of all of it

781
00:27:49,120 --> 00:27:53,039
if

782
00:27:50,399 --> 00:27:55,039
if the messages can't be sent then

783
00:27:53,039 --> 00:27:56,880
the whole rest of the pipeline doesn't

784
00:27:55,039 --> 00:27:58,480
matter because we will never get any

785
00:27:56,880 --> 00:28:00,559
output from it
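
[editor's sketch] One common way to notice that the alerting messages themselves have stopped is a heartbeat check: treat silence as a failure. The sketch below is an assumption about how that could be done, not what the speaker's team built; `HeartbeatMonitor` and its methods are hypothetical names.

```python
from typing import Optional


class HeartbeatMonitor:
    """Flags a problem when no message has been seen for too long."""

    def __init__(self, max_silence_seconds: float) -> None:
        self.max_silence_seconds = max_silence_seconds
        self.last_seen: Optional[float] = None

    def record(self, now: float) -> None:
        """Call whenever a message (or periodic heartbeat) arrives."""
        self.last_seen = now

    def is_silent(self, now: float) -> bool:
        """True if nothing was ever seen, or the last message is too old."""
        if self.last_seen is None:
            return True
        return (now - self.last_seen) > self.max_silence_seconds
```

Timestamps are passed in explicitly so the check is easy to test; in practice `time.time()` would supply `now`.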

786
00:27:58,480 --> 00:28:01,600
so thinking about that is is something

787
00:28:00,559 --> 00:28:03,600
that's

788
00:28:01,600 --> 00:28:06,000
really good at informing

789
00:28:03,600 --> 00:28:08,960
how the rest of the visibility and

790
00:28:06,000 --> 00:28:11,279
beyond journey goes

791
00:28:08,960 --> 00:28:14,240
it's also good to have redundancy

792
00:28:11,279 --> 00:28:16,640
in the visibility side of things so

793
00:28:14,240 --> 00:28:19,039
adding the observability in the metrics

794
00:28:16,640 --> 00:28:21,200
means that you not only say get the

795
00:28:19,039 --> 00:28:22,960
health check for the visibility but then

796
00:28:21,200 --> 00:28:24,880
you can say okay

797
00:28:22,960 --> 00:28:26,159
the service is reporting that it's

798
00:28:24,880 --> 00:28:27,279
working fine

799
00:28:26,159 --> 00:28:29,440
but now

800
00:28:27,279 --> 00:28:31,279
let's check the the business logic to

801
00:28:29,440 --> 00:28:33,200
make sure that it's still

802
00:28:31,279 --> 00:28:36,960
when it takes an action it's actually

803
00:28:33,200 --> 00:28:38,399
doing the action we expect it to do
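
[editor's sketch] The redundancy described here, a health check plus a check on the business logic, can be illustrated as two independent signals combined into one status. The function names and the status strings below are assumptions made for the example.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def business_logic_ok(action: Callable[[], T], expected: Callable[[T], bool]) -> bool:
    """Perform a test action and verify it did what we expect."""
    try:
        return bool(expected(action()))
    except Exception:
        return False


def overall_status(health_reported_ok: bool, logic_ok: bool) -> str:
    """Combine the self-reported health with the business logic check."""
    if health_reported_ok and logic_ok:
        return "healthy"
    if health_reported_ok:
        # The service says it is fine, but its output is wrong.
        return "degraded"
    return "down"
```

The second signal catches the case the first one cannot: a service that reports it is working while doing the wrong thing.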

804
00:28:36,960 --> 00:28:41,440
and

805
00:28:38,399 --> 00:28:42,720
cost isn't always about money

806
00:28:41,440 --> 00:28:45,520
so

807
00:28:42,720 --> 00:28:47,600
while the tool that we didn't use was

808
00:28:45,520 --> 00:28:49,360
prohibitively expensive

809
00:28:47,600 --> 00:28:52,399
that doesn't mean that building our own

810
00:28:49,360 --> 00:28:55,039
was necessarily a good idea uh when you

811
00:28:52,399 --> 00:28:58,080
factor in things like uh the hours it

812
00:28:55,039 --> 00:28:59,440
takes for people to maintain or the

813
00:28:58,080 --> 00:29:02,159
hours it took to write in the first

814
00:28:59,440 --> 00:29:04,159
place um and then the hours for

815
00:29:02,159 --> 00:29:05,520
the few of us that dug into why it

816
00:29:04,159 --> 00:29:07,520
wasn't working

817
00:29:05,520 --> 00:29:09,919
they all add up and

818
00:29:07,520 --> 00:29:13,200
those hours and

819
00:29:09,919 --> 00:29:15,200
the time can be converted into some

820
00:29:13,200 --> 00:29:18,320
monetary value that would be comparable

821
00:29:15,200 --> 00:29:19,840
to the actual tool cost itself

822
00:29:18,320 --> 00:29:21,919
and the final technical point is

823
00:29:19,840 --> 00:29:22,799
that sometimes standards can be good

824
00:29:21,919 --> 00:29:25,440
we

825
00:29:22,799 --> 00:29:27,760
have ways of running services and

826
00:29:25,440 --> 00:29:29,760
processes across the platform and they

827
00:29:27,760 --> 00:29:30,880
weren't used for this which meant that

828
00:29:29,760 --> 00:29:33,039
the team

829
00:29:30,880 --> 00:29:35,120
that already had a good understanding of

830
00:29:33,039 --> 00:29:36,880
how things function

831
00:29:35,120 --> 00:29:38,840
were left a bit

832
00:29:36,880 --> 00:29:42,559
uh in the dark

833
00:29:38,840 --> 00:29:43,440
really and generally for visibility

834
00:29:42,559 --> 00:29:44,480
uh

835
00:29:43,440 --> 00:29:47,520
it might

836
00:29:44,480 --> 00:29:48,559
be horrifying to make something visible

837
00:29:47,520 --> 00:29:50,480
and

838
00:29:48,559 --> 00:29:52,720
sometimes that's exactly what you need

839
00:29:50,480 --> 00:29:54,559
so with the incidents

840
00:29:52,720 --> 00:29:56,240
making them visible was really important

841
00:29:54,559 --> 00:29:57,919
but it meant that we could actually do

842
00:29:56,240 --> 00:29:59,840
something about it

843
00:29:57,919 --> 00:30:01,360
and with the data that you get from

844
00:29:59,840 --> 00:30:03,360
these types of things

845
00:30:01,360 --> 00:30:04,320
you can make improvements things can get

846
00:30:03,360 --> 00:30:06,240
better

847
00:30:04,320 --> 00:30:09,120
and it's good to understand who the

848
00:30:06,240 --> 00:30:11,360
target audience is and

849
00:30:09,120 --> 00:30:12,640
you can use that visibility then to

850
00:30:11,360 --> 00:30:15,760
build trust

851
00:30:12,640 --> 00:30:15,760
and and get better

852
00:30:16,559 --> 00:30:19,600
cool that so that was

853
00:30:18,399 --> 00:30:21,679
uh

854
00:30:19,600 --> 00:30:22,960
visibility i'm al

855
00:30:21,679 --> 00:30:25,360
thank you

856
00:30:22,960 --> 00:30:27,120
okay thank you very much alan um we

857
00:30:25,360 --> 00:30:28,960
don't have any questions but there's a

858
00:30:27,120 --> 00:30:30,080
few people sharing their stories in the

859
00:30:28,960 --> 00:30:32,399
chat

860
00:30:30,080 --> 00:30:35,559
all right thanks

861
00:30:32,399 --> 00:30:35,559
thank you

