Statistical Analysis of Bach's Cello Suite No. 2 in D minor, BWV 1008

I. Prelude:
As I have begun to study the cello suite number 2 in D minor (adapted for the viola), my teacher has recommended that I go through the sheet music and identify chromatic sequences in the music. Here is a sample of that analysis, performed by hand, on the Prelude to the cello suite number 1 in G major.

Inspired by the work of Tom Collins (and his article in Significance magazine), I began to think about whether (and how) it would be possible to run that same analysis using statistical software - specifically, using SAS. Though many cheaper and more popular statistical packages exist, SAS remains a powerhouse in the industry, and it is the only software that I have both access to and training in, which is why I used it for this project.
I split up the project into the following parts:
  1. Importing the score
  2. Transforming the MIDI notes into absolute values
  3. Identifying patterns

Importing the Score

A quick Google search led me to Dave's J.S. Bach Page, where I was able to download the MIDI files for the entire second suite. I used the free score-editing software MuseScore to open the MIDI file and export it as an XML document. The resulting XML file contained a lot of information that is useless here (pitch values for playback, note locations for page rendering, voice information, etc.), but it also contained an efficient listing of all the notes, their accidentals, the note name, and the ordinal location in notes and measures of the particular note.
Using the SAS XML Mapper, I was able to open the MuseScore export file in SAS as a library, and copy over the various tables contained within. The most valuable of those (for this project) were the tables "note" (containing the ordinal locations and accidentals of the notes) and "pitch" (containing the note location, the note name (A-G), and the MIDI octave number (-5 to +5)). A quick data step merge produced a seemingly ideal set - the locations, pitch, note names, accidentals, and octaves of each note in the prelude. Here is a sample:

Many of the fields above are useless for this analysis, but I let them stay in case they came in handy later.

Transforming the MIDI notes into absolute values

Let's take a closer look at the sample above. A quick examination finds some problems immediately. The key signature in this data has three sharps: F, C, and G (a standard D minor score is instead written with one flat, B-flat), but in row 5 of the data above, we have an F without a natural accidental (modifier), which is incorrect according to the score. This is because in music, if a note is modified with an accidental, all following instances of that note within a measure also carry that modification, though it is not explicitly stated. Because we downloaded this data in MIDI score form, not absolute value form, it was assumed that we would be reading the notes as a musician, not a computer, would. Notice also that the Alter field of the example above is useless to us, as it seems to catch only some of the accidental modifications.
Obviously, I needed a more objective way of measuring pitches, so I went back and re-examined the MIDI system. The MIDI format records notes over roughly 10 octaves, numbered here from -5 to 5, with middle C starting octave 0. Starting from the C of octave -5, every half-tone is given a number. As there are 12 half-tones in an octave, and 5 octaves below middle C, middle C has a MIDI note value of 5 × 12 = 60. In our cello suite, most of the notes are in octaves 5 and 6. To get some idea of the MIDI note values, I ran the following data step on the notes dataset:
data bach.notes_midi;
set bach.notes; *a single input dataset, so SET is the idiomatic statement here;
format NoteNum 3.;
format AccidentalAdj 2.;
if step eq "C" then NoteNum = 0; *step is the note letter name;
if step eq "D" then NoteNum = 2;
if step eq "E" then NoteNum = 4;
if step eq "F" then NoteNum = 5;
if step eq "G" then NoteNum = 7;
if step eq "A" then NoteNum = 9;
if step eq "B" then NoteNum = 11;
NoteNum = NoteNum + ((octave+5)*12); *shift by whole octaves, counting from octave -5;
run;
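
For readers without SAS, the same arithmetic can be cross-checked with a minimal Python sketch. It assumes the octave numbering described above (octave -5 starts at 0, so middle C in octave 0 comes out as 60); the names STEP_OFFSETS and note_number are illustrative and not part of the original project.

```python
# Half-tone offset of each natural note letter above C
# (the same table as the IF statements in the data step).
STEP_OFFSETS = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def note_number(step: str, octave: int) -> int:
    """Number of a natural note, counting half-tones up from the C of octave -5."""
    return STEP_OFFSETS[step] + (octave + 5) * 12


# Middle C under the convention used in this article.
print(note_number("C", 0))
```

Accidentals are deliberately ignored at this stage, just as in the data step; they are applied later.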

Then, to correct for naturals, accidentals, and key signatures in a measure, I ran the following data step on the result:

data bach.dmprelude; *now correct for naturals and sharps in a measure;
set bach.notes_midi;
format C $14. D $14. E $14. F $14. G $14. A $14. B $14. accsym $4.;
retain mnum C D E F G A B; *these are flags - they will hold accidentals that are triggered for a measure;
drop C D E F G A B;
if(mnum eq measure_ORDINAL) then do; *essentially, if this is the same measure number as the previous note;
if accidental ne '' then do; *if there is a modification to be made, then set the appropriate flag;
     *the most recent explicit accidental always replaces the flag for that note letter;
     if (step eq "C") then C=accidental;
     if (step eq "D") then D=accidental;
     if (step eq "E") then E=accidental;
     if (step eq "F") then F=accidental;
     if (step eq "G") then G=accidental;
     if (step eq "A") then A=accidental;
     if (step eq "B") then B=accidental;
end; *endif accidental exists;
else do; *if no accidental exists on this note, and this is the same measure, then make the modification indicated in the flag;
     if(step eq "C") then accidental=C;
     if(step eq "D") then accidental=D;
     if(step eq "E") then accidental=E;
     if(step eq "F") then accidental=F;
     if(step eq "G") then accidental=G;
     if(step eq "A") then accidental=A;
     if(step eq "B") then accidental=B;
end;*endif accidental doesn't exist;
end;*endif in same measure;
else do; *if not in same measure as last note;
    *reset our trackers;
    C="sharp"; D=""; E=""; F="sharp"; G="sharp"; A=""; B=""; *Note the key signature is hard-coded here!;
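    if accidental eq '' then do; *if no accidental exists on this note, apply the key-signature flag;
         if(step eq "C") then accidental=C;
         if(step eq "D") then accidental=D;
         if(step eq "E") then accidental=E;
         if(step eq "F") then accidental=F;
         if(step eq "G") then accidental=G;
         if(step eq "A") then accidental=A;
         if(step eq "B") then accidental=B;
    end;
    else do; *an explicit accidental on the first note sets the flag for the rest of the measure;
         if(step eq "C") then C=accidental;
         if(step eq "D") then D=accidental;
         if(step eq "E") then E=accidental;
         if(step eq "F") then F=accidental;
         if(step eq "G") then G=accidental;
         if(step eq "A") then A=accidental;
         if(step eq "B") then B=accidental;
    end; *endif accidental on first note of measure;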
end; *endif first note in new measure;
mnum = measure_ORDINAL; *reset this for the next iteration;
if(accidental eq "flat") then accsym = "b"; *add the symbols in for easier recognition;
if(accidental eq "sharp") then accsym = "#";
if(accidental eq "natural") then accsym = "n"; *the natural symbol won't work without unicode, so I substituted 'n';
if(accidental eq "double-flat") then accsym = "bb";
if(accidental eq "double-sharp") then accsym = "##";
drop mnum;
run;
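
The rule this data step implements can be summarised in a short Python sketch. It is an illustration under stated assumptions, not the code used for the project: notes arrive as (measure, step, accidental) tuples with an empty string where nothing is written, the key signature is the same hard-coded three sharps, and the most recent explicit accidental on a note letter holds until the end of its measure. The names KEY_SIGNATURE and fill_accidentals are hypothetical.

```python
# Hard-coded key signature, mirroring the flags reset at the start of each measure.
KEY_SIGNATURE = {"C": "sharp", "F": "sharp", "G": "sharp"}


def fill_accidentals(notes, key_signature=KEY_SIGNATURE):
    """Return the effective accidental of every note.

    notes: iterable of (measure, step, accidental); accidental is '' when
    nothing is written on the note itself.
    """
    effective = []
    current_measure = None
    active = {}
    for measure, step, accidental in notes:
        if measure != current_measure:
            # New measure: forget earlier accidentals, fall back to the key signature.
            active = dict(key_signature)
            current_measure = measure
        if accidental:
            # An explicit accidental holds for the rest of the measure.
            active[step] = accidental
        else:
            # Nothing written: inherit whatever is currently in force for this letter.
            accidental = active.get(step, "")
        effective.append(accidental)
    return effective


sample = [(1, "F", ""), (1, "F", "natural"), (1, "F", ""), (2, "F", ""), (2, "D", "")]
print(fill_accidentals(sample))
# ['sharp', 'natural', 'natural', 'sharp', '']
```

Keeping the flags in a dictionary keyed by note letter avoids one variable per letter, at the cost of straying from the data step's layout.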

Absolute values at last! Here is a sample of the resulting dataset:

Note that our MIDI values only correspond to the letter note, without any accidentals applied. To apply them, we simply compute the "accidental adjustment" value - how many half-tones to add to or subtract from the MIDI note number for our accidental - and add it to the value. To save time later, I also made another column - "scale" - containing a number from 1 to 7 for the note's letter name (A=1, G=7). This will come in handy for pattern analysis later.

data bach.dmprelude;
*octaves start on Cs, and c3 is 96. Half tones are whole numbers;
set bach.dmprelude (rename=(NoteNum = MIDIN));
if accidental eq "natural" then AccidentalAdj = 0;
else if accidental eq "flat" then AccidentalAdj = -1;
else if accidental eq "sharp" then AccidentalAdj = 1;
else if accidental eq "double-sharp" then AccidentalAdj = 2;
else if accidental eq "double-flat" then AccidentalAdj = -2;
else AccidentalAdj=0;
NoteNum = MIDIN + AccidentalAdj;
if step eq "A" then scale=1;
if step eq "B" then scale=2;
if step eq "C" then scale=3;
if step eq "D" then scale=4;
if step eq "E" then scale=5;
if step eq "F" then scale=6;
if step eq "G" then scale=7;
run;
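
As a cross-check outside SAS, the adjustment amounts to a lookup followed by an addition. The Python sketch below assumes the same accidental names that appear in the data step; ACCIDENTAL_ADJ and adjusted_note_number are illustrative names only.

```python
# Half-tone shift for each accidental name used in the data step.
ACCIDENTAL_ADJ = {
    "double-flat": -2,
    "flat": -1,
    "natural": 0,
    "sharp": 1,
    "double-sharp": 2,
}


def adjusted_note_number(natural_number: int, accidental: str) -> int:
    """Apply the accidental to the number of the natural note.

    Unknown or empty accidentals leave the number unchanged, like the
    final ELSE branch of the data step.
    """
    return natural_number + ACCIDENTAL_ADJ.get(accidental, 0)


# A natural note numbered 65 raised by a sharp.
print(adjusted_note_number(65, "sharp"))  # 66
```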

And now we can finally see what the first few measures look like, graphed in equal half-tones:

symbol1 v=dot i=join c=black;
proc gplot data=bach.dmprelude;
   title1 "Bach Cello Suite in D Minor";
   title2 "Values Graphed in MIDI note numbers";
   label NoteNum = "MIDI Note #";
   label note_ORDINAL = "Note Order";
   plot NoteNum * note_ORDINAL = 1 /grid;
   where measure_ORDINAL < 5;
run;
quit;

Still, there are some problems - measures don't always contain the same number of notes (although with Bach it's very close), and tied notes are represented as two notes of the same pitch (although this will work in our favor for sequence recognition). But I considered these drawbacks to be unavoidable and inconsequential, and so I moved on to pattern recognition.

Identifying patterns

If you'll recall, my teacher started all this by asking me to identify ascending and descending sequences - where each note is one scale step above or below its preceding note. This leads to some difficulty, as a descending five-note pattern could be interpreted as 3 notes and 2 notes, or 2 notes and 3 notes, or 5 notes together. These distinctions are matters of interpretation and personal taste, so I set out instead to simply identify all instances where notes were descending and ascending, and not assign sequence patterns within them. To that end, I applied the following data step:
data bach.sequences;
set bach.dmprelude;
prevscale_up = lag1(scale)+1;
prevscale_down = lag1(scale)-1;
if(prevscale_up = 8) then prevscale_up = 1;
if(prevscale_down = 0) then prevscale_down = 7;
if (scale = prevscale_up) then seq=1;
else if (scale = prevscale_down) then seq=1;
else seq=0;
run;
The conditional lines relating to the prevscale variables handle 'looping' around the scale. If the previous note was a G (with scale value 7) and this note is an A (with scale value 1), increasing the 7 to 8 without resetting it to 1 fails to identify a pattern where one exists. The same problem happens with 1 to 0. What we end up with is a Seq boolean value (1 or 0) which tells us if the note is part of a sequence or not. However, because it takes at least two notes to identify a sequence, the first note of a sequence will always have Seq=0 when it should have Seq=1.
To fix this, I reversed the dataset using a proc sort, and then effectively looped through it backwards, setting the Seq= value to 1 if the note after it was part of a sequence, as shown in the code below:
proc sort data=bach.sequences; by descending note_ORDINAL; run; *reverse the ordering;

data bach.sequences; *correct for the sequence error;
set bach.sequences;
if (lag1(seq) eq 1) then seq = 1; *lag1 returns the flag of the note that follows in the score;
run;

proc sort data=bach.sequences; by note_ORDINAL; run; *and reset the ordering to normal;
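
The two passes above (flag each note that is one scale step from its predecessor, then back-fill the first note of each run) can be sketched in Python as follows. This is an illustration under the assumption that scale values run from 1 to 7 and wrap around; stepwise_flags is a hypothetical name, and the list indexing stands in for the lag function and the reverse sort.

```python
def stepwise_flags(scale):
    """Flag notes that belong to a stepwise run.

    scale: list of scale values 1-7, in score order.
    Returns a list of 0/1 flags of the same length.
    """
    n = len(scale)
    seq = [0] * n
    # Forward pass: compare each note with its predecessor, wrapping 7 -> 1 and 1 -> 7.
    for i in range(1, n):
        up = scale[i - 1] % 7 + 1
        down = (scale[i - 1] - 2) % 7 + 1
        if scale[i] in (up, down):
            seq[i] = 1
    # Back-fill pass: a note is also in a run when the note after it was flagged.
    # The copy ensures only the forward-pass flags are consulted, so the
    # correction reaches back exactly one note.
    forward = list(seq)
    for i in range(n - 1):
        if forward[i + 1] == 1:
            seq[i] = 1
    return seq


print(stepwise_flags([4, 5, 6, 4, 7, 1, 3]))
# [1, 1, 1, 0, 1, 1, 0]
```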

All that's left now is to transpose that data into two columns, in order to graph them separately (with an overlay):
proc transpose data=bach.sequences out=bach.seq_data;
id seq;
by note_ORDINAL;
var NoteNum;
run;
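
The effect of the transpose is to spread the note numbers over two columns, one per Seq value, leaving a gap wherever a note belongs to the other column. A minimal Python sketch of that reshaping is given here, with None standing in for a missing value; split_by_sequence_flag is an illustrative name.

```python
def split_by_sequence_flag(note_nums, seq):
    """Split note numbers into (non_sequence, sequence) columns.

    Each column keeps the original length; None marks positions that
    belong to the other column, so the two can be overlaid on one plot.
    """
    non_sequence = [n if s == 0 else None for n, s in zip(note_nums, seq)]
    sequence = [n if s == 1 else None for n, s in zip(note_nums, seq)]
    return non_sequence, sequence


plain, runs = split_by_sequence_flag([62, 64, 65, 69], [0, 1, 1, 0])
print(plain)  # [62, None, None, 69]
print(runs)   # [None, 64, 65, None]
```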

The resulting set looks like this:

Now it's a simple matter to graph the sequenced and non-sequenced notes with an overlay to produce our final result:

For a large image containing the whole piece, click here.

Questions? Contact Jacob Warwick at