In the original CAM, as described in the post, the architecture uses global average pooling (GAP) followed by a single fully connected (FC) layer. You are not required to remove any particular convolutional layers, although you can remove some if you have a reason to. The general structure of a network that supports CAM is: convolutional layers, then GAP, then one FC layer. The weights of that final FC layer are the w1, w2, and w3 in the example. The predictions do not come directly from the GAP output; they come from the FC layer applied to it. I hope this helps.
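
In case a concrete sketch is useful, here is a minimal NumPy illustration of that structure, under a few assumptions of mine: the last convolutional layer produces 3 feature maps of size 4x4, there are 2 classes, and the values are random placeholders rather than a trained model. The names `feature_maps`, `fc_weights`, `fc_bias`, and `class_activation_map` are hypothetical and only for illustration.

```python
import numpy as np

def class_activation_map(feature_maps, fc_weights, class_idx):
    """Weighted sum of feature maps using the FC weights of one class.

    feature_maps: array of shape (K, H, W), output of the last conv layer
    fc_weights:   array of shape (num_classes, K), weights of the FC layer after GAP
    """
    # Contract the channel axis: sum_k w_k * feature_maps[k] -> shape (H, W)
    return np.tensordot(fc_weights[class_idx], feature_maps, axes=([0], [0]))

rng = np.random.default_rng(0)
feature_maps = rng.random((3, 4, 4))       # placeholder conv output, K = 3
fc_weights = rng.standard_normal((2, 3))   # each row plays the role of w1, w2, w3
fc_bias = np.zeros(2)

# Forward pass: GAP gives one number per feature map, then the FC layer gives scores
gap = feature_maps.mean(axis=(1, 2))       # shape (3,)
logits = fc_weights @ gap + fc_bias        # shape (2,)

# CAM for the predicted class reuses the same FC weights on the un-pooled maps
class_idx = int(np.argmax(logits))
cam = class_activation_map(feature_maps, fc_weights, class_idx)
```

Because GAP and the FC layer are both linear, the spatial mean of `cam` equals the class score (minus the bias), which is why the FC weights can be read as per-feature-map importances.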