Saturday, October 3, 2015

{Awareness} Excel: Extracting text with FIND and MID Functions

 





Data analysis begins with usable data, and that means every piece organized nicely into its own field where we can count, sort and otherwise test it out.
What if you get a spreadsheet where the pieces of data are all packed in one field? Say, something like this (which I cobbled together from Major League Baseball data in honor of the Nationals' first playoff appearance):
    NAME: Sean Burnett POS: RP AGE: 30 WT: 200 BORN: Dunedin, FL SALARY: 2350000  NAME: Tyler Clippard POS: RP AGE: 27 WT: 200 BORN: Lexington, KY SALARY: 1650000  NAME: Ross Detwiler POS: SP AGE: 26 WT: 174 BORN: St. Louis, MO SALARY: 485000  NAME: Christian Garcia POS: RP AGE: 27 WT: 215 BORN: Miami, FL SALARY: N/A  NAME: Gio Gonzalez POS: SP AGE: 27 WT: 205 BORN: Hialeah, FL SALARY: 3335000  NAME: Mike Gonzalez POS: RP AGE: 34 WT: 215 BORN: Robstown, TX SALARY: N/A  NAME: Ryan Mattheus POS: RP AGE: 28 WT: 215 BORN: Sacramento, CA SALARY: 481000  NAME: Craig Stammen POS: RP AGE: 28 WT: 200 BORN: Coldwater, OH SALARY: 485000  NAME: Drew Storen POS: RP AGE: 25 WT: 180 BORN: Indianapolis, IN SALARY: 498750  NAME: Jordan Zimmermann POS: SP AGE: 26 WT: 218 BORN: Auburndale, WI SALARY: 2300000
Let's say you want to extract the city of each player's birth into a separate field. The varying length of each player's name means the birth place isn't always in the same position in the string, so a typical text-to-columns operation won't work. So, how to do it?
The answer lies in two very handy Excel functions: FIND and MID.
FIND locates characters you specify and returns its numeric place in the string.
MID returns X characters from a string beginning at a location X you specify.
For example, we can locate the position where each city name begins by using FIND to locate the string "BORN:" in each cell. The city name itself always starts six characters after the position of that string, so we add six to the result:
    =FIND("BORN:",A2)+6
In the first row above, the functions returns 50. In the second row, 52. We'll feed that value to the MID function as the starting point for our extraction.
MID takes three arguments: Text or cell, position to start, number of characters to return. So, we use the above FIND function as the second argument and, for now, extract 10 characters:
    =MID(A2,FIND("BORN:",A2)+6,10)
That gets us part of the way there. We're starting in the right spot, but 10 characters isn't always the length of the city and state, so it leads to choppy results:
    Dunedin, F  Lexington,  St. Louis,  Miami, FL   Hialeah, F  Robstown,   Sacramento  Coldwater,  Indianapol  Auburndale
What we need to do is tell MID the exact number of characters to return each time even though the length of the city varies. We can figure this out using FIND again.

The city name is always followed by the word "SALARY". So, if we search for the position of that word and subtract the position of "BORN," we'll get the length of what's between the two. The ultimate formula looks like spaghetti but works just fine:
    =MID(A2,FIND("BORN:",A2)+6,(FIND("SALARY",A2)-(FIND("BORN:",A2)+6)))
Used on the example text, it returns:
    Dunedin, FL   Lexington, KY   St. Louis, MO   Miami, FL   Hialeah, FL   Robstown, TX   Sacramento, CA   Coldwater, OH   Indianapolis, IN   Auburndale, WI
That's it. Fairly handy and further proof that Excel is a versatile part of the data journalists' tool kit.
  • If the substring is on the right side of the data, use the RIGHT function to extract it.
  • If the substring is on the left side of the data, use the LEFT function to extract it.
  • If the substring has unwanted characters on both sides of the desired data, use the MID function to extract it.

__._,_.___

Posted by: Junaid Tahir <mjunaidtahir@gmail.com>
Reply via web post Reply to sender Reply to group Start a New Topic Messages in this topic (1)

.

__,_._,___

No comments:

Post a Comment