I'd be surprised if that were the legal view. If I take a picture of someone next to the Mona Lisa and then crop the Mona Lisa out of it, that doesn't make the picture of the person a derivative work of the Mona Lisa.
They don't mean the end result. The point is that the process of detecting something requires you to have a copy of that work, so your software needs the copyrighted material in order to accurately detect, and then remove, copyrighted songs.
I don't know if image cropping is a good metaphor. The algorithm for removing copyrighted music while preserving other sound presumably "subtracts" the copyrighted music. Thus, the copyrighted music is used to create the music-less video.
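To make the "subtraction" idea concrete, here is a toy sketch. It assumes the simplest possible case: the video's audio is just the other sound plus the song, sample-aligned and at a known volume. Real removal systems are presumably far more involved, and `subtract_reference` and its `gain` parameter are made-up names for illustration, not anyone's actual API.

```python
def subtract_reference(mixed, reference, gain=1.0):
    """Remove a known reference track from a mix, sample by sample.

    Hypothetical helper: assumes the mix is exactly other_sound + gain * reference,
    with both signals already aligned and the same length.
    """
    if len(mixed) != len(reference):
        raise ValueError("mixed and reference must be the same length")
    return [m - gain * r for m, r in zip(mixed, reference)]


# Toy signals: a few samples of "speech" and of the copyrighted "music".
speech = [0.5, -0.25, 0.0, 0.75]
music = [0.25, 0.25, -0.5, 0.125]

# What the video's audio track would contain.
mixed = [s + m for s, m in zip(speech, music)]

# Recovering the speech requires the music samples themselves as an input.
recovered = subtract_reference(mixed, music)
```

The thing to notice is the second argument: under this model the function cannot produce the music-less audio without being handed a copy of the song.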