Audio Analysis
Audio import and analysis — pitch tracking (YIN), transcription, tempo estimation, chord detection, and drum detection. See the Listening — Microphone In guide for a tutorial.
Audio import — turn a recording back into a Score.
Hum a melody into your phone, whistle a hook, play a bass line — then load the WAV and get notes you can edit, harmonize, export to MIDI, or print as sheet music:
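from pytheory import Score
# Transcribe the recording into a Score.
score = Score.from_wav("hum.wav")
# Inspect the notes detected for the "melody" part.
print(score.parts["melody"].notes)
# Export the transcription as a MIDI file.
score.save_midi("hum.mid")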
Transcription is monophonic: one note at a time (voice, whistle, a single instrument line). Chords and polyphonic recordings are a much harder problem — run melody and bass as separate takes.
The pitch tracker is the YIN algorithm (de Cheveigné & Kawahara, 2002) — the classic autocorrelation-with-a-twist method that powers most monophonic tuners — implemented in pure numpy.
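To make the "autocorrelation with a twist" concrete, here is a minimal single-frame sketch of the YIN steps from the paper (difference function, cumulative mean normalized difference, absolute threshold, parabolic interpolation). It is written in plain Python for readability and is not pytheory's numpy implementation; the name `yin_pitch` and its structure are assumptions made for illustration.

```python
import math

def yin_pitch(frame, sample_rate, fmin=50.0, fmax=1500.0, threshold=0.12):
    """Estimate the pitch of one frame in Hz; return 0.0 if it looks unvoiced."""
    tau_min = max(2, int(sample_rate / fmax))
    tau_max = min(int(sample_rate / fmin), len(frame) // 2)
    window = len(frame) - tau_max  # samples compared at every lag

    # Step 1: difference function d(tau) = sum_j (x[j] - x[j + tau])^2
    diff = [0.0] * (tau_max + 1)
    for tau in range(1, tau_max + 1):
        diff[tau] = sum((frame[j] - frame[j + tau]) ** 2 for j in range(window))

    # Step 2: cumulative mean normalized difference (the "twist")
    cmnd = [1.0] * (tau_max + 1)
    running = 0.0
    for tau in range(1, tau_max + 1):
        running += diff[tau]
        cmnd[tau] = diff[tau] * tau / running if running > 0 else 1.0

    # Step 3: take the first dip below the threshold and follow it to its minimum
    tau = tau_min
    while tau < tau_max:
        if cmnd[tau] < threshold:
            while tau + 1 < tau_max and cmnd[tau + 1] < cmnd[tau]:
                tau += 1
            # Step 4: parabolic interpolation around the minimum
            a, b, c = cmnd[tau - 1], cmnd[tau], cmnd[tau + 1]
            denom = a - 2 * b + c
            shift = 0.5 * (a - c) / denom if denom != 0 else 0.0
            return sample_rate / (tau + shift)
        tau += 1
    return 0.0

# A 220 Hz sine sampled at 8 kHz should come back close to 220 Hz.
rate = 8000
sine = [math.sin(2 * math.pi * 220.0 * n / rate) for n in range(1024)]
estimate = yin_pitch(sine, rate)
```

Lowering `threshold` makes step 3 accept only deeper dips, so fewer noisy frames are treated as pitched.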
- pytheory.audio.load_wav(path)
Load an audio file as mono float64 in [-1, 1].
WAV files (8/16/32-bit PCM and float) are read directly; stereo is mixed down. Anything else (.m4a voice memos, .mp3, .aiff) is converted on the fly through afconvert (built into macOS) or ffmpeg, whichever is on your PATH.
- Returns:
(samples, sample_rate) tuple.
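As a rough illustration of what "mono float in [-1, 1]" means, the sketch below reads a 16-bit PCM WAV with the standard library and averages the channels. It covers only one of the formats listed above, and `read_pcm16_mono` is a hypothetical helper, not pytheory's loader.

```python
import os
import struct
import tempfile
import wave

def read_pcm16_mono(path):
    """Read a 16-bit PCM WAV; return (samples, sample_rate) with samples in [-1, 1]."""
    with wave.open(path, "rb") as wav:
        if wav.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        channels = wav.getnchannels()
        sample_rate = wav.getframerate()
        raw = wav.readframes(wav.getnframes())

    ints = struct.unpack("<%dh" % (len(raw) // 2), raw)
    # Channels are interleaved: average each frame's channels to mix down to mono.
    samples = [
        sum(ints[i:i + channels]) / (channels * 32768.0)
        for i in range(0, len(ints), channels)
    ]
    return samples, sample_rate

# Round trip: write two stereo frames to a temporary file, then read them back.
demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
with wave.open(demo_path, "wb") as out:
    out.setnchannels(2)
    out.setsampwidth(2)
    out.setframerate(8000)
    out.writeframes(struct.pack("<4h", 16384, 0, -32768, -32768))

samples, sample_rate = read_pcm16_mono(demo_path)
```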
- pytheory.audio.hpss(samples, sample_rate=44100, *, kernel=31)
Harmonic-percussive source separation.
Drums and notes look completely different on a spectrogram: a held note is a horizontal line (steady frequency over time), a drum hit is a vertical line (all frequencies at one instant). Median-filter the spectrogram along time and you keep the horizontals; along frequency and you keep the verticals. Soft masks built from the two estimates split the signal.
- Returns:
(harmonic, percussive) sample arrays, same length as input.
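The median-filtering idea can be tried on a toy magnitude spectrogram without any audio at all. The sketch below uses a 3-wide kernel on a 5x5 grid (the documented default is 31) and squared-ratio soft masks. The mask formula and the names `median_filter` and `soft_masks` are assumptions, not necessarily what `hpss` does internally.

```python
from statistics import median

def median_filter(values, kernel):
    """Sliding median with the window truncated at the edges."""
    half = kernel // 2
    return [median(values[max(0, i - half):i + half + 1]) for i in range(len(values))]

def soft_masks(spec, kernel=3):
    """spec[f][t] holds magnitudes; return (harmonic_mask, percussive_mask)."""
    n_freq, n_time = len(spec), len(spec[0])
    # Median along time (within each frequency row) keeps horizontal lines.
    harm = [median_filter(row, kernel) for row in spec]
    # Median along frequency (within each time column) keeps vertical lines.
    columns = [median_filter([spec[f][t] for f in range(n_freq)], kernel)
               for t in range(n_time)]
    perc = [[columns[t][f] for t in range(n_time)] for f in range(n_freq)]

    eps = 1e-12  # avoids 0/0 where both estimates are silent
    h_mask = [[0.0] * n_time for _ in range(n_freq)]
    p_mask = [[0.0] * n_time for _ in range(n_freq)]
    for f in range(n_freq):
        for t in range(n_time):
            h2, p2 = harm[f][t] ** 2, perc[f][t] ** 2
            h_mask[f][t] = h2 / (h2 + p2 + eps)
            p_mask[f][t] = p2 / (h2 + p2 + eps)
    return h_mask, p_mask

# Toy spectrogram: a held note on frequency row 2, a drum hit on time column 2.
spec = [[1.0 if (f == 2 or t == 2) else 0.0 for t in range(5)] for f in range(5)]
h_mask, p_mask = soft_masks(spec)
```

Away from the crossing point the note's row goes almost entirely to the harmonic mask and the hit's column to the percussive mask. The cell where they cross is shared equally.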
- pytheory.audio.estimate_tempo(samples, sample_rate=44100, *, bpm_min=60, bpm_max=200)
Estimate tempo from the onset pattern.
Builds an onset-strength envelope (spectral flux — how much new energy appears frame to frame), autocorrelates it, and finds the beat period that explains the recording best, gently preferring tempos near 120.
- Returns:
Estimated BPM as an int, or None if the recording doesn’t have a confident pulse (e.g. rubato humming).
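The autocorrelation step can be sketched on a ready-made onset envelope. This simplified version takes the strongest lag inside the allowed BPM range and leaves out both the spectral-flux computation and the preference for tempos near 120. `tempo_from_envelope` and its `frame_rate` parameter are hypothetical names for illustration.

```python
def tempo_from_envelope(envelope, frame_rate, bpm_min=60, bpm_max=200):
    """envelope: onset strength per frame; frame_rate: envelope frames per second."""
    lag_min = max(1, int(round(frame_rate * 60.0 / bpm_max)))
    lag_max = min(int(round(frame_rate * 60.0 / bpm_min)), len(envelope) - 1)

    best_lag, best_score = None, 0.0
    for lag in range(lag_min, lag_max + 1):
        # Autocorrelation at this lag: how well the envelope matches itself shifted.
        score = sum(envelope[i] * envelope[i + lag] for i in range(len(envelope) - lag))
        if score > best_score:
            best_lag, best_score = lag, score

    if best_lag is None:
        return None  # no lag produced any correlation, so there is no pulse to report
    return int(round(60.0 * frame_rate / best_lag))

# An onset every 0.5 s at 100 envelope frames per second is a 120 BPM pulse.
clicks = [1.0 if i % 50 == 0 else 0.0 for i in range(1000)]
bpm = tempo_from_envelope(clicks, frame_rate=100)
```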
- pytheory.audio.detect_pitch(samples, sample_rate=44100, *, frame_size=2048, hop=512, fmin=50.0, fmax=1500.0, threshold=0.12)
Track pitch over time with the YIN algorithm.
- Parameters:
samples – Mono float array.
sample_rate – Sample rate in Hz.
frame_size – Analysis window (default 2048 ≈ 46ms).
hop – Samples between frames (default 512 ≈ 12ms).
fmin/fmax – Pitch search range in Hz.
threshold – YIN aperiodicity threshold — lower is stricter about what counts as a pitched sound.
- Returns:
(times, freqs, voiced) arrays, one entry per frame. freqs is 0 where voiced is False.
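A common next step is turning the per-frame frequencies into note names. The sketch below assumes A4 = 440 Hz, equal temperament, and scientific pitch notation, and works on plain lists shaped like the `freqs` and `voiced` results. The helper names are illustrative and not part of pytheory.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]

def hz_to_midi(freq):
    """Nearest MIDI note number, assuming A4 = 440 Hz = MIDI 69."""
    return int(round(69 + 12 * math.log2(freq / 440.0)))

def midi_to_name(midi):
    """Scientific pitch notation, where MIDI 60 is C4."""
    return NOTE_NAMES[midi % 12] + str(midi // 12 - 1)

def frame_notes(freqs, voiced):
    """One note name per frame, or None where the frame is unvoiced."""
    return [midi_to_name(hz_to_midi(f)) if v else None for f, v in zip(freqs, voiced)]

names = frame_notes([440.0, 0.0, 261.63], [True, False, True])
```

Checking `voiced` before taking the logarithm matters, because unvoiced frames carry a frequency of 0.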
- pytheory.audio.identify_chord(samples, sample_rate=44100, *, min_confidence=0.7)
Identify the chord sounding in a buffer of audio.
The one-shot, “what am I strumming right now?” version of detect_chords(): fold the buffer into a chromagram and match it against the same major/minor/sus/7th templates on all twelve roots.
Harmonics are discounted before matching: each chord tone’s 3rd, 5th, and 7th partials land a fifth, major third, and flat seventh above it in pitch-class space, which is what makes a bright C major read as Cmaj7 if you match the raw chromagram. A polyphony gate rejects single notes (whose energy concentrates on too few pitch classes) rather than misreading a melody note as a chord. Coefficients are calibrated against pytheory’s own guitar/piano/rhodes renders: 93% on an 81-case battery.
- Parameters:
samples – Mono float array, ideally ~0.5–1.5 s of audio.
sample_rate – Sample rate in Hz.
min_confidence – Template match score (0–1) below which None is returned.
- Returns:
Dict with symbol (e.g. "Am"), confidence, and notes (the chord tones, low to high from the root), or None if no chord is confidently sounding.
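Template matching on a chromagram can be sketched as cosine similarity between a 12-bin chroma vector and binary chord templates. This toy version knows only major and minor triads and skips the harmonic discounting and polyphony gate described above. The scoring formula and the name `match_chord` are assumptions, not pytheory's calibrated matcher.

```python
import math

NOTE_NAMES = ["C", "C#", "D", "D#", "E", "F", "F#", "G", "G#", "A", "A#", "B"]
TEMPLATES = {"": (0, 4, 7), "m": (0, 3, 7)}  # semitone intervals above the root

def match_chord(chroma, min_confidence=0.7):
    """chroma: 12 energies indexed by pitch class (C = 0). Returns a dict or None."""
    norm = math.sqrt(sum(c * c for c in chroma))
    if norm == 0:
        return None
    best_score, best_symbol = -1.0, None
    for root in range(12):
        for suffix, intervals in TEMPLATES.items():
            pitch_classes = [(root + i) % 12 for i in intervals]
            # Cosine similarity between the chroma and a 0/1 template vector.
            score = sum(chroma[pc] for pc in pitch_classes) / (norm * math.sqrt(len(intervals)))
            if score > best_score:
                best_score, best_symbol = score, NOTE_NAMES[root] + suffix
    if best_score < min_confidence:
        return None
    return {"symbol": best_symbol, "confidence": best_score}

# Equal energy on C, E, and A is the A minor triad.
a_minor = match_chord([1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0])
# A single note scores about 0.58 against any triad, below the 0.7 cutoff.
single_note = match_chord([1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0])
```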
- pytheory.audio.detect_chords(samples, sample_rate=44100, *, bpm=120, beats_per_chord=2.0)
Detect a chord progression from audio.
Folds the harmonic content into pitch classes (a chromagram), averages it over chord-sized windows on a beat grid aligned to the music’s own onsets, and matches each window against major/minor/sus triad and 7th-chord templates on all twelve roots. When the bass clearly sits on a chord tone other than the root, the chord is reported as a slash chord ("C/E").
- Returns:
List of (start_beat, duration_beats, symbol) tuples, with consecutive identical chords merged, e.g. [(0.0, 8.0, "Am"), (8.0, 4.0, "F")].
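The merging of consecutive identical chords is easy to picture in isolation. The sketch below assumes one chord symbol per window of `beats_per_chord` beats and collapses runs into the tuple shape shown above. `merge_chords` is a hypothetical helper, not a pytheory function.

```python
def merge_chords(symbols, beats_per_chord=2.0):
    """Collapse per-window chord symbols into (start_beat, duration_beats, symbol)."""
    merged = []
    for index, symbol in enumerate(symbols):
        if merged and merged[-1][2] == symbol:
            # Same chord as the previous window: extend its duration.
            start, duration, _ = merged[-1]
            merged[-1] = (start, duration + beats_per_chord, symbol)
        else:
            merged.append((index * beats_per_chord, beats_per_chord, symbol))
    return merged

progression = merge_chords(["Am", "Am", "Am", "Am", "F", "F"])
# progression == [(0.0, 8.0, "Am"), (8.0, 4.0, "F")]
```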
- pytheory.audio.detect_drums(samples, sample_rate=44100, *, bpm=120, quantize=0.25)
Detect drum hits from (ideally percussive) audio.
Finds onsets in the energy envelope, then classifies each by where its energy lives: kicks are bottom-heavy, hats are all sizzle, snares are the broadband middle.
- Returns:
List of (beat_position, sound_name, velocity) tuples, where sound_name is "kick", "snare", or "closed_hat".
- pytheory.audio.transcribe(path, *, bpm=None, quantize=None, split=False, part_name='melody', synth='piano_synth', fmin=50.0, fmax=1500.0)
Transcribe an audio recording into a Score.
- Parameters:
path – Audio file path — WAV directly, anything else (.m4a, .mp3) via afconvert/ffmpeg. Or a (samples, sample_rate) tuple.
bpm – Tempo to interpret the timing against. Default None estimates it from the recording’s onset pattern, falling back to 120 when there’s no confident pulse (rubato humming). Pass a number to pin it.
quantize – Optional grid in beats (e.g. 0.25 snaps note starts and lengths to sixteenths). Default: no snapping, so you get the timing as performed.
split – If True, run harmonic-percussive separation first and transcribe two parts from the harmonic signal: a "bass" part (40–200 Hz) and a "melody" part (200 Hz up, with the bass filtered out). Use this on full mixes; expect the bassline to come out well and the melody to come out only as well as it dominates the mix.
part_name – Name for the created part (non-split mode).
synth – Synth for playback of the transcription.
fmin/fmax – Pitch range to search, in Hz (non-split mode). Tighten these for better results (e.g. fmin=60, fmax=350 for a bass).
- Returns:
A Score holding the detected notes, rests, and velocities.
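Between pitch tracking and a finished Score sits a segmentation step: runs of frames with the same pitch become notes, and runs of unvoiced frames become rests. The sketch below shows that idea on per-frame MIDI numbers with no smoothing or quantization. `frames_to_events` and its output format are assumptions, not how `transcribe` represents notes.

```python
def frames_to_events(midi_frames, hop_seconds, bpm=120):
    """Group per-frame pitches into (pitch_or_None, duration_in_beats) events.

    midi_frames holds one MIDI number per frame, or None for an unvoiced frame.
    """
    beats_per_frame = hop_seconds * bpm / 60.0
    runs = []
    for pitch in midi_frames:
        if runs and runs[-1][0] == pitch:
            runs[-1][1] += 1          # same pitch (or still silent): extend the run
        else:
            runs.append([pitch, 1])   # pitch changed: start a new note or rest
    return [(pitch, count * beats_per_frame) for pitch, count in runs]

events = frames_to_events([60, 60, 60, 60, None, None, 62, 62], hop_seconds=0.125)
# events == [(60, 1.0), (None, 0.5), (62, 0.5)]
```

A real transcriber would also need to absorb one-frame glitches and decide when a repeated pitch is a new attack, which this sketch ignores.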