Sound propagates by displacing the medium it travels through. So at any single point in time, all we can sample is that displacement (the amplitude, which we perceive as volume). Our cochlea does something roughly like a Fourier transform: hair cells at different positions respond to different frequencies, decomposing the sound into component frequencies for the brain to analyze.
Similarly, an audio stream is just a sequence of amplitude samples over time, and we need to do additional processing to extract the frequencies.
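To make that "additional processing" concrete, here is a minimal sketch of a naive discrete Fourier transform over a list of amplitude samples. It is an illustration only: the names `dft_magnitudes` and `dominant_frequency`, the 800 Hz sample rate, and the synthetic two-tone signal are all assumptions made for this example, and real code would normally use an FFT library rather than this O(n²) loop.

```python
import cmath
import math


def dft_magnitudes(samples):
    """Naive O(n^2) DFT of real samples; returns one magnitude per frequency bin.

    Only bins 0..n/2 are computed, since the rest mirror them for real input.
    """
    n = len(samples)
    magnitudes = []
    for k in range(n // 2 + 1):
        acc = 0j
        for t, x in enumerate(samples):
            # Correlate the signal with a complex sinusoid at bin k.
            acc += x * cmath.exp(-2j * math.pi * k * t / n)
        magnitudes.append(abs(acc))
    return magnitudes


def dominant_frequency(samples, sample_rate):
    """Return the frequency (Hz) of the strongest non-DC bin."""
    mags = dft_magnitudes(samples)
    # Skip bin 0, which only measures the constant (DC) offset.
    peak_bin = max(range(1, len(mags)), key=lambda k: mags[k])
    return peak_bin * sample_rate / len(samples)


# Hypothetical input: a 50 Hz tone plus a quieter 120 Hz tone,
# sampled at an assumed 800 Hz for half a second (400 samples).
SAMPLE_RATE = 800
N = 400
tone = [
    math.sin(2 * math.pi * 50 * t / SAMPLE_RATE)
    + 0.5 * math.sin(2 * math.pi * 120 * t / SAMPLE_RATE)
    for t in range(N)
]

print(dominant_frequency(tone, SAMPLE_RATE))  # prints 50.0
```

The input `tone` is nothing but amplitudes over time, yet the transform recovers both frequencies: with 400 samples at 800 Hz each bin is 2 Hz wide, so the 50 Hz tone shows up in bin 25 and the 120 Hz tone in bin 60, with the louder tone producing the larger magnitude.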