BSP code is fairly straight forward really, the complications come from precision.
See the pointcontents code for a primer - it just goes down through the tree until it finds the leaf.
Well, traces are basically the same, except they have two points.
Go down the tree, favouring the start side every time, so that you can early out properly. When you've got a node between the points, it has to go down both. If its solid on the other side of the node (quake actually uses a pointcontentsalike for that) then you have an impact, figure out the fraction and return up the tree to the caller.
There's quite a few things that make this costly.
1: Its a tree, which has hopefully been balanced somewhat to try to limit the maximum cost. On huge maps, this can still get quite high.
2: the balancing probably left node positions with somewhat random ordering so each node will give you a new cache miss... or more... cache misses are slow, and they get slower (proportionally) the faster the cpu is.
3: Trees suffer from the same inefficincy as lists, that is ptr = ptr->next; foo = ptr->field; contains a memory fetch in the first statement with the second directly depending upon its result. The cpu has to wait on the result of the first before it can evaluate the second. When iterating a list, the index is already in a register, thus idx = idx + 1; has no memory fetch, thus the second instruction no longer needs to wait for as long. X86 CPUs can generally reorder instructions with some small tolerances, but its not magic.
4: quake uses pointcontents to determine solidity on the other side of the node! this means that for every node, its walking down its immediate children twice... 5 nodes means that its walking down to the same leaf 32 times! A couple of engines have a fix for this, but the precision aspect makes it clumsy. Really though this issue isn't quite as bad as it sounds - those nodes will at least be cached.

A couple of engines have a fix for this, but its still fairly expensive.
5: It uses floating point. Floating point operations are more expensive than integer operations. An x86 has multiple floating point units nowadays and can crunch multiple numbers simultaneously to reduce the effective cost down to only a couple of cycles, assuming the compiler can interleave enough integer operations in there to give the float operations time to finish. An ARM chip often doesn't actually have an FPU and thus quite a lot of programs for arm are built to emulate floats with ints... which is rather slow (but faster than compiling for an FPU and having the operating system emulate it whenever it detects an illegal instruction...).
anyway, tracelines are slow. 1000 a frame and you'll notice, and have people shouting at you, even with a fixed/optimised traceline.
[ontopic]
tracelines are slow, you can add some timer and recalculate it only every halfsecond or so, but that gets awkward in tracking *which* player can see each marker, which requires at least two .floats for quakeworld (only one for NQ's traditional max of 16 players, which kinda highlights the awkwardness of tracking players with a bitmask).
[/ontopic]
.