Audio Analysis¶

Audio import and analysis — pitch tracking (YIN), transcription, tempo estimation, chord detection, and drum detection. See the Listening — Microphone In guide for a tutorial.

Audio import — turn a recording back into a Score.

Hum a melody into your phone, whistle a hook, play a bass line — then load the WAV and get notes you can edit, harmonize, export to MIDI, or print as sheet music:

from pytheory import Score

score = Score.from_wav("hum.wav")
print(score.parts["melody"].notes)
score.save_midi("hum.mid")

Transcription is monophonic: one note at a time (voice, whistle, a single instrument line). Chords and polyphonic recordings are a much harder problem — run melody and bass as separate takes.

The pitch tracker is the YIN algorithm (de Cheveigné & Kawahara, 2002) — the classic autocorrelation-with-a-twist method that powers most monophonic tuners — implemented in pure numpy.

pytheory.audio.load_wav(path)[source]¶

Load an audio file as mono float64 in [-1, 1].

WAV files (8/16/32-bit PCM and float) are read directly; stereo is mixed down. Anything else — .m4a voice memos, .mp3, .aiff — is converted on the fly through afconvert (built into macOS) or ffmpeg, whichever is on your PATH.

Returns:: (samples, sample_rate) tuple.

pytheory.audio.hpss(samples, sample_rate=44100, *, kernel=31)[source]¶

Harmonic-percussive source separation.

Drums and notes look completely different on a spectrogram: a held note is a horizontal line (steady frequency over time), a drum hit is a vertical line (all frequencies at one instant). Median-filter the spectrogram along time and you keep the horizontals; along frequency and you keep the verticals. Soft masks built from the two estimates split the signal.

Returns:: (harmonic, percussive) sample arrays, same length as input.

pytheory.audio.estimate_tempo(samples, sample_rate=44100, *, bpm_min=60, bpm_max=200)[source]¶

Estimate tempo from the onset pattern.

Builds an onset-strength envelope (spectral flux — how much new energy appears frame to frame), autocorrelates it, and finds the beat period that explains the recording best, gently preferring tempos near 120.

Returns:: Estimated BPM as an int, or None if the recording doesn’t have a confident pulse (e.g. rubato humming).

pytheory.audio.detect_pitch(samples, sample_rate=44100, *, frame_size=2048, hop=512, fmin=50.0, fmax=1500.0, threshold=0.12)[source]¶

Track pitch over time with the YIN algorithm.

Parameters:

samples – Mono float array.
sample_rate – Sample rate in Hz.
frame_size – Analysis window (default 2048 ≈ 46ms).
hop – Samples between frames (default 512 ≈ 12ms).
fmin/fmax – Pitch search range in Hz.
threshold – YIN aperiodicity threshold — lower is stricter about what counts as a pitched sound.

Returns:

(times, freqs, voiced) arrays — one entry per frame. freqs is 0 where voiced is False.

pytheory.audio.identify_chord(samples, sample_rate=44100, *, min_confidence=0.7)[source]¶

Identify the chord sounding in a buffer of audio.

The one-shot, “what am I strumming right now?” version of detect_chords() — fold the buffer into a chromagram and match it against the same major/minor/sus/7th templates on all twelve roots.

Harmonics are discounted before matching — each chord tone’s 3rd, 5th, and 7th partials land a fifth, major third, and flat seventh above it in pitch-class space, which is what makes a bright C major read as Cmaj7 if you match the raw chromagram. A polyphony gate rejects single notes (whose energy concentrates on too few pitch classes) rather than misreading a melody note as a chord. Coefficients are calibrated against pytheory’s own guitar/piano/rhodes renders: 93% on an 81-case battery.

Parameters:

samples – Mono float array, ideally ~0.5–1.5 s of audio.
sample_rate – Sample rate in Hz.
min_confidence – Template match score (0–1) below which None is returned.

Returns:

Dict with symbol (e.g. "Am"), confidence, and notes (the chord tones, low to high from the root) — or None if no chord is confidently sounding.

pytheory.audio.detect_chords(samples, sample_rate=44100, *, bpm=120, beats_per_chord=2.0)[source]¶

Detect a chord progression from audio.

Folds the harmonic content into pitch classes (a chromagram), averages it over chord-sized windows on a beat grid aligned to the music’s own onsets, and matches each window against major/minor/sus triad and 7th-chord templates on all twelve roots. When the bass clearly sits on a chord tone other than the root, the chord is reported as a slash chord ("C/E").

Returns:: List of (start_beat, duration_beats, symbol) tuples, with consecutive identical chords merged — e.g. [(0.0, 8.0, "Am"), (8.0, 4.0, "F")].

pytheory.audio.detect_drums(samples, sample_rate=44100, *, bpm=120, quantize=0.25)[source]¶

Detect drum hits from (ideally percussive) audio.

Finds onsets in the energy envelope, then classifies each by where its energy lives: kicks are bottom-heavy, hats are all sizzle, snares are the broadband middle.

Returns:: List of (beat_position, sound_name, velocity) tuples, where sound_name is "kick", "snare", or "closed_hat".

pytheory.audio.transcribe(path, *, bpm=None, quantize=None, split=False, part_name='melody', synth='piano_synth', fmin=50.0, fmax=1500.0)[source]¶

Transcribe an audio recording into a Score.

Parameters:

path – Audio file path — WAV directly, anything else (.m4a, .mp3) via afconvert/ffmpeg. Or a (samples, sample_rate) tuple.
bpm – Tempo to interpret the timing against. Default None estimates it from the recording’s onset pattern, falling back to 120 when there’s no confident pulse (rubato humming). Pass a number to pin it.
quantize – Optional grid in beats (e.g. 0.25 snaps note starts and lengths to sixteenths). Default: no snapping — you get the timing as performed.
split – If True, run harmonic-percussive separation first and transcribe two parts from the harmonic signal — a "bass" part (40-200 Hz) and a "melody" part (200 Hz up, with the bass filtered out). Use this on full mixes; expect the bassline to come out well and the melody to come out only as well as it dominates the mix.
part_name – Name for the created part (non-split mode).
synth – Synth for playback of the transcription.
fmin/fmax – Pitch range to search, in Hz (non-split mode). Tighten these for better results (e.g. fmin=60, fmax=350 for a bass).

Returns:

A Score holding the detected notes, rests, and velocities.

Audio Analysis¶

PyTheory

Navigation

Related Topics