If we represent an image as a two-dimensional array of pixels, convolutions can be applied to detect edges and other, more complex features.

The image is thereby transformed from an array of pixels into an array of feature maps, which are in turn fed into further convolutional layers. This enables the network to learn complex shapes that the final fully connected layer can then use to classify the image.

Intuitively, one can imagine the process going like this:

An image of a car is received

Features are extracted with convolutional layers

The resulting vector of features is then classified

i.e. this object has four wheels, windows, a license plate and seats --> it must be a car.
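The three steps above can be sketched end to end in NumPy. This is a minimal illustration, not a trained network: the kernels, the fully connected weights, the input image size, and the number of classes are all hypothetical placeholders initialized at random.

```python
import numpy as np

rng = np.random.default_rng(0)

def convolve2d_valid(image, kernel):
    """Slide `kernel` over every position where it fully fits."""
    kh, kw = kernel.shape
    out = np.zeros((image.shape[0] - kh + 1, image.shape[1] - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

# 1. An image is received (here: a random 8x8 grayscale stand-in).
image = rng.standard_normal((8, 8))

# 2. Features are extracted with convolutional kernels
#    (4 hypothetical 3x3 filters, untrained and random).
kernels = rng.standard_normal((4, 3, 3))
feature_maps = np.stack([convolve2d_valid(image, k) for k in kernels])

# 3. The resulting feature vector is classified by a fully connected
#    layer (random weights, 10 hypothetical classes).
features = feature_maps.reshape(-1)                  # 4 * 6 * 6 = 144 values
weights = rng.standard_normal((10, features.size))
scores = weights @ features
predicted_class = int(np.argmax(scores))
print(feature_maps.shape, predicted_class)
```

In a real network the kernels and classifier weights would be learned by backpropagation, and nonlinearities and pooling would sit between the layers; the data flow, however, is exactly this.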

In mathematical terms, a convolution applies a small kernel (usually an odd size such as 3x3 or 5x5) to an input matrix (the pixels of an image).

A 3x3 kernel is slid over every 3x3 window of the input image; at each position the overlapping pixel values are multiplied element-wise with the kernel and summed. Without padding, this shrinks each image dimension by two (kernel size minus one) while extracting local information in the process.