Introduction
This is a short post on how to recognize numbers such as simple integers, real numbers, and special codes such as zip codes and credit card numbers, and how to extract these numbers from unstructured text, in the popular bash (Bourne Again Shell) shell and scripting language. Bash is the default shell in the Terminal application on Macintosh computers. It is also used by Cygwin and by many variants of Unix and Linux.
Regular expressions are a compact, efficient way of representing patterns of characters, including the letters in the English alphabet and digits. There is extensive information on the web on regular expressions. Interested readers can start with the Wikipedia page on regular expressions. The goal of this post is to illustrate specifically how to recognize common types of numbers using regular expressions in bash. The regular expressions in the examples will also work in some other environments.
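As a first taste of what this looks like in bash, here is a minimal sketch of the =~ matching operator. The names integer_re and is_integer are made up for this illustration and are not part of the scripts in this post.

```bash
#!/bin/bash
# Minimal sketch: classify a few sample strings with the =~ operator.
# The pattern is kept in a variable and expanded unquoted, because
# quoting the right-hand side of =~ makes bash treat it as a literal string.

integer_re='^[0-9]+$'   # one or more digits and nothing else

is_integer() {
    [[ $1 =~ $integer_re ]]
}

for sample in 42 007 4x2 ""; do
    if is_integer "$sample"; then
        echo "'$sample' is an integer"
    else
        echo "'$sample' is not an integer"
    fi
done
```

The function returns exit code 0 on a match and 1 otherwise, which is the same success/failure convention the scripts in this post use.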
The example scripts below were tested on a MacBook Air with Mac OS X 10.10.3 (Yosemite) and this version of the bash shell:
GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin14)
Copyright (C) 2007 Free Software Foundation, Inc.
NOTE: If you are new to programming or regular expressions, be aware that regular expressions are not derived from the English language or the basic arithmetic taught in schools. Generally, aspects of programming that are not close to English or standard mathematics are harder to learn and require more practice, drilling, and continued use to master. Programming languages such as BASIC or Python that are closer to English are easier to learn than languages such as C or C++ that make heavy use of custom computer notations and terms not found in standard English or standard school mathematics. Regular expressions are way out there in the cryptic computer notation wilderness.
The example bash scripts are listed below. They are also available at GitHub. The HTTPS URL for the Git repository at GitHub is:
https://github.com/jmcgowan79/mathbash.git
To get the scripts, you must have Git installed and configured on your computer. Then:
$ git clone https://github.com/jmcgowan79/mathbash.git
$ cd mathbash
$ chmod ugo+x *.sh       # make scripts executable
$ ./test_isnumber.sh     # to test the installation
Regular Expressions for Numbers
This script recognizes various types of numbers including integers, real numbers, complex numbers, and special codes including zip codes, telephone numbers, and credit card numbers.
isnumber.sh
#!/bin/bash
# test if a string is a number, report type of the number (e.g. INTEGER, REAL, etc.)
# exit code is 0 for success -- string is a NUMBER
# exit code is 1 for failure -- string is NOT A NUMBER
#
# illustrates regular expressions for recognizing number strings
#
# bash is sometimes in /usr/local/bin/bash
#
# (C) 2015 John F. McGowan, Ph.D.

if [[ "$#" -ne 1 || "$1" == "-h" || "$1" == "-?" ]]; then
    echo "Usage: `basename $0` <possible_number_string>"
    echo " -- reports number type of possible_number_string "
    echo " -- POSITIVE INTEGER, NON-NEGATIVE INTEGER, SIGNED INTEGER"
    echo " -- HEXADECIMAL, REAL NUMBER, VECTOR, ZIP CODE, TELEPHONE NUMBER "
    echo " -- CREDIT CARD NUMBER"
    echo " "
    echo " -- ILLUSTRATES REGULAR EXPRESSIONS FOR RECOGNIZING NUMBER STRINGS"
    echo " "
    echo ' -- use bash$ echo $? to test exit code'
    echo " -- use enclosing quotes for string with spaces such as credit card numbers"
    echo " "
    echo " Author: John F. McGowan, Ph.D. (jmcgowan79@gmail.com)"
    echo " "
    exit 0
fi

# Unix/bash exit code of 0 means success (is a number in this case)
is_number=1 # start with no number found
number_string=$1

# regular expressions match patterns of characters
#
# caret ^ represents the start of a string or line outside of brackets
# dollar $ represents the end of a string or line
# square brackets [1-9] represent all characters in the brackets
# [abc] for example can be "a," "b," or "c"
# hyphen inside brackets indicates a range of characters
# typically digits or letters
# [1-9] represents the digits in range 1,2,3,...9
# [a-c] represents the letters a,b,c
#
# inside brackets caret ^ negates the list of characters
# for example [^0-9] represents all characters EXCEPT 0,1,2,...9
#
# ? indicates 0 or 1 of preceding pattern
# * indicates 0 or more of preceding pattern
# + indicates 1 or more of preceding pattern
# . matches any single character
#
# (...) is a group
# for example, (ab)? matches (nothing) or ab
# for example, (ab)* matches (nothing), ab, abab, ...
# for example, (ab)+ matches ab, abab, ababab, ...
# (...){n,m} indicates from n to m repetitions of the pattern
# for example, (ab){2,3} matches only abab and ababab
# the backslash is used to escape the characters with special meanings
# \^ \$ \( \) \[ \] \{ \} \* \. \? \+
#
# =~ is the regular expression pattern matching operator in bash

# positive integers/counting numbers (1,2,3,...)
if [[ $number_string =~ ^[1-9][0-9]*$ ]]; then
    echo "POSITIVE INTEGER"
    is_number=0
fi

# add zero to numbers
# zero was remarkably difficult to invent
# the ancient Babylonians had a place-value
# number system based on 60 (not 10) which
# included an implicit zero, but the explicit
# symbol for zero took many more centuries to
# invent

# non-negative integers (0,1,2,...)
if [[ $number_string =~ ^[0-9]+$ ]]; then
    echo "NON-NEGATIVE INTEGER"
    is_number=0
fi

# negative numbers are even less obvious
# signed integers (..., -2, -1, 0, 1, 2,...)
if [[ $number_string =~ ^[+-][0-9]+$ ]]; then
    echo "SIGNED INTEGER"
    is_number=0
fi

# hexadecimal numbers are used with computers
# and low-level programming of computers
# hexadecimal (base 16) numbers such as AA12 or 0x12ab etc.
if [[ $number_string =~ ^(0[xX])?[0-9a-fA-F]+$ ]]; then
    # also recognize C format hex numbers such as 0xaf12
    echo "HEXADECIMAL NUMBER (INTEGER)"
    is_number=0
fi

# fractions such as 1/2, 1/3 date to antiquity but the
# concept of real numbers such as square root of 2
# proved difficult to grasp. The ancient Greeks
# knew a proof that the square root of 2 could not
# be a ratio of two integers, but were apparently
# unable to make the leap to real numbers.

# real numbers/decimal numbers (0.0, ..., 0.5, ..., 1.0, ..., 3.1415...,...)
real_regexp="[+-]?([0-9]+|[0-9]+\.[0-9]*|\.[0-9]+)"
if [[ $number_string =~ ^$real_regexp$ ]]; then
    echo "REAL NUMBER"
    is_number=0
fi

# vectors are usually used to represent a magnitude
# with a direction such as the direction and speed
# of the wind or an ocean current (early uses of
# the vector concept)

# vector with enclosing parentheses, e.g. (1, 2, 3)
vector_regexp="\(( *$real_regexp, *)+$real_regexp *\)"
if [[ $number_string =~ ^$vector_regexp$ ]]; then
    echo "VECTOR"
    is_number=0
fi

# vector with enclosing brackets, e.g. [1, 2, 3]
vector_regexp="\[( *$real_regexp, *)+$real_regexp *\]"
if [[ $number_string =~ ^$vector_regexp$ ]]; then
    echo "VECTOR"
    is_number=0
fi

# vector with enclosing curly braces, e.g. {1, 2, 3}
vector_regexp="\{( *$real_regexp, *)+$real_regexp *\}"
if [[ $number_string =~ ^$vector_regexp$ ]]; then
    echo "VECTOR"
    is_number=0
fi

# the imaginary numbers turned up in roots of
# polynomials and are now used in everything from
# electrical engineering and cryptography to
# quantum mechanics, but remain mysterious.

# pure imaginary numbers i = square root(-1)
if [[ $number_string =~ ^$real_regexp[iI]$ ]]; then
    echo "PURE IMAGINARY NUMBER"
    is_number=0
fi

# complex numbers (1.1 + 2i, -1 + 2.1i, ...)
#
complex_regexp="$real_regexp( *[+-] *($real_regexp)?[iI])?"
if [[ $number_string =~ ^$complex_regexp$ ]]; then
    echo "COMPLEX NUMBER"
    is_number=0
fi

# large integers are frequently used as unique identifiers

# zip code (United States)
if [[ $number_string =~ ^[0-9]{5,5}(-[0-9]{4,4})?$ ]]; then
    echo "ZIP CODE (USA)"
    is_number=0
fi

# telephone number (USA)
if [[ $number_string =~ ^(([0-9]( |-))?[0-9]{3,3} +|([0-9]( |-))?\([0-9]{3,3}\) *)?[0-9]{3,3}( |-)[0-9]{4,4}$ ]]; then
    echo "TELEPHONE NUMBER (USA)"
    is_number=0
fi

# credit card number (16 digits)
# the alternation is grouped so that ^ and $ apply to both alternatives
if [[ $number_string =~ ^([0-9]{16,16}|([0-9]{4,4} ?){4,4})$ ]]; then
    echo "CREDIT CARD NUMBER"
    # remove spaces from credit card number
    number_string_cleaned=${number_string// /}
    # echo ""
    if [[ $number_string_cleaned =~ ^4[0-9]{6,}$ ]]; then
        echo "PROBABLE VISA CARD (VISA CARDS START WITH 4)"
    fi
    if [[ $number_string_cleaned =~ ^5[1-5][0-9]{5,}$ ]]; then
        echo "PROBABLE MASTER CARD"
    fi
    is_number=0
fi

# report if string is not a number
#
if [[ $is_number == 1 ]]; then
    echo "NOT A NUMBER"
fi

exit $is_number
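One detail in patterns like the credit card test deserves a closer look: alternation (|) has the lowest precedence in a regular expression, so ^ and $ only anchor the alternative they sit next to unless the alternatives are wrapped in a group. The sketch below compares an ungrouped and a grouped form; loose_re, strict_re and matches are names invented for this illustration.

```bash
#!/bin/bash
# Sketch: alternation (|) has the lowest precedence, so anchoring both
# alternatives requires a group around the alternation.

loose_re='^[0-9]{16}|([0-9]{4} ?){4}$'      # ^ covers only the left branch, $ only the right
strict_re='^([0-9]{16}|([0-9]{4} ?){4})$'   # both anchors cover both branches

matches() {   # usage: matches <regex> <string>
    [[ $2 =~ $1 ]]
}

candidate="1234567890123456789"   # 19 digits, too long for a 16-digit card number

if matches "$loose_re" "$candidate"; then
    echo "loose pattern:  match"
else
    echo "loose pattern:  no match"
fi

if matches "$strict_re" "$candidate"; then
    echo "strict pattern: match"
else
    echo "strict pattern: no match"
fi
```

The ungrouped pattern accepts the 19-digit string because its first 16 digits satisfy the left branch, while the grouped pattern rejects it.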
How to Extract Numbers from Text using Regular Expressions in BASH
This short script demonstrates how to use the regular expressions for numbers to extract numbers and numeric data from unstructured text, a common problem in this age of the Internet. Note that bash stores the first matched sub-pattern, indicated by enclosing parentheses in the regular expression, in the special variable BASH_REMATCH[1].
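To make the role of BASH_REMATCH concrete, here is a small sketch. The pattern decimal_re, the function split_decimal and the sample text are assumptions made for this example only.

```bash
#!/bin/bash
# Sketch: inspect BASH_REMATCH after a successful match.
# BASH_REMATCH[0] holds the whole match; BASH_REMATCH[n] holds the n-th
# parenthesized group, counted by the position of its opening parenthesis.

decimal_re='([0-9]+)\.([0-9]+)'   # hypothetical pattern: digits, a dot, digits

split_decimal() {
    if [[ $1 =~ $decimal_re ]]; then
        echo "whole=${BASH_REMATCH[0]} integer=${BASH_REMATCH[1]} fraction=${BASH_REMATCH[2]}"
    else
        return 1   # no decimal number found
    fi
}

split_decimal "total due: 19.95 USD"
```

Because the pattern is not anchored with ^ and $, it finds the number in the middle of the text, which is the same idea the extraction script relies on.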
extract_number.sh
#!/bin/bash
#
# example of extracting number from text using regular expressions in bash
# -- we frequently want to extract numerical data from unstructured text
#
# illustrates regular expressions for recognizing number strings
#
# bash is sometimes in /usr/local/bin/bash
#
# (C) 2015 John F. McGowan, Ph.D. (jmcgowan79@gmail.com)

if [[ "$1" == "-h" || "$1" == "-?" ]]; then
    echo "Usage: `basename $0` extract number from string "
    echo " -- exit code 1 if no number found"
    echo " -- exit code 0 if a number is found"
    echo " -- reports number if found"
    echo " "
    echo " Author: John F. McGowan, Ph.D. (jmcgowan79@gmail.com)"
    echo " "
    exit 0
fi

found_number=1 # haven't found number yet

#
# inside brackets, caret negates the list of characters
# [^0-9] matches all characters except for 0,1,2...9
#
# real numbers/decimal numbers (0.0, ..., 0.5, ..., 1.0, ..., 3.1415...,...)
real_regexp="[+-]?([0-9]+|[0-9]+\.[0-9]*|\.[0-9]+)"
complex_regexp="$real_regexp( *[+-] *($real_regexp)?[iI])?"

if [[ $1 =~ [^0-9\-]*($complex_regexp) ]]; then
    echo ${BASH_REMATCH[1]}
    found_number=0
fi

exit $found_number
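The script above reports only the first number it finds. One possible way to pull out every number in a string is to match, print the match, cut the string off just past the match, and repeat. The sketch below does this with the real-number pattern from this post; the function name extract_all_numbers and the sample text are made up for the example, and it handles real numbers only, not complex ones.

```bash
#!/bin/bash
# Sketch: report every real number in a string, one per line.

# real-number pattern as used in the scripts in this post
real_regexp='[+-]?([0-9]+|[0-9]+\.[0-9]*|\.[0-9]+)'

extract_all_numbers() {
    local rest=$1
    # the pattern needs at least one digit, so each match is non-empty
    # and the loop always makes progress
    while [[ $rest =~ $real_regexp ]]; do
        echo "${BASH_REMATCH[0]}"
        # drop everything up to and including the match just reported
        rest=${rest#*"${BASH_REMATCH[0]}"}
    done
}

extract_all_numbers "temp -3.5 C, wind 12 kts, gust factor .75"
```

Note that a number directly followed by a sentence-ending period, as in "at 12.", would be reported with the period attached, because the pattern allows a trailing decimal point.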
Tester for Is Number Script
This is a script to run a series of tests on the isnumber.sh script to verify that it is installed and working correctly. Note that bash is not always located at /bin/bash. It may also be at /usr/local/bin/bash or other locations on your computer’s file system.
#!/bin/bash
#
# test script for isnumber.sh
#
# Author: John F. McGowan Ph.D. (jmcgowan79@gmail.com)
# (C) 2015 John F. McGowan
#

# test non-numbers
#
ntests=0
nfails=0

echo "NOT A NUMBER TESTS"

report=`./isnumber.sh dog`
result=$? # need to assign this to result immediately after isnumber.sh exits
ntests=`expr $ntests + 1`
if [[ $result == "1" ]]; then
    echo "PASSED"
else
    echo "FAILED"
    nfails=`expr $nfails + 1`
fi

report=`./isnumber.sh 123x`
result=$?
ntests=`expr $ntests + 1`
if [[ $result == "1" ]]; then
    echo "PASSED"
else
    echo "FAILED"
    nfails=`expr $nfails + 1`
fi

report=`./isnumber.sh 1.2.3`
result=$?
ntests=`expr $ntests + 1`
if [[ $result == "1" ]]; then
    echo "PASSED"
else
    echo "FAILED"
    nfails=`expr $nfails + 1`
fi

report=`./isnumber.sh 1.2.i`
result=$?
ntests=`expr $ntests + 1`
if [[ $result == "1" ]]; then
    echo "PASSED"
else
    echo "FAILED"
    nfails=`expr $nfails + 1`
fi

# test numbers
#
echo "NUMBER TESTS"

# integer
report=`./isnumber.sh 1`
result=$?
ntests=`expr $ntests + 1`
if [[ $result == "1" ]]; then
    echo "FAILED"
    nfails=`expr $nfails + 1`
else
    echo "PASSED"
fi

# zero
report=`./isnumber.sh 0`
result=$?
ntests=`expr $ntests + 1`
if [[ $result == "1" ]]; then
    echo "FAILED"
    nfails=`expr $nfails + 1`
else
    echo "PASSED"
fi

# negative integer
report=`./isnumber.sh -1`
result=$?
ntests=`expr $ntests + 1`
if [[ $result == "1" ]]; then
    echo "FAILED"
    nfails=`expr $nfails + 1`
else
    echo "PASSED"
fi

report=`./isnumber.sh 1.23`
result=$?
ntests=`expr $ntests + 1`
if [[ $result == "1" ]]; then
    echo "FAILED"
    nfails=`expr $nfails + 1`
else
    echo "PASSED"
fi

report=`./isnumber.sh .1`
result=$?
ntests=`expr $ntests + 1`
if [[ $result == "1" ]]; then
    echo "FAILED"
    nfails=`expr $nfails + 1`
else
    echo "PASSED"
fi

report=`./isnumber.sh 0.1`
result=$?
ntests=`expr $ntests + 1`
if [[ $result == "1" ]]; then
    echo "FAILED"
    nfails=`expr $nfails + 1`
else
    echo "PASSED"
fi

report=`./isnumber.sh af`
result=$?
ntests=`expr $ntests + 1`
if [[ $result == "1" ]]; then
    echo "FAILED"
    nfails=`expr $nfails + 1`
else
    echo "PASSED"
fi

report=`./isnumber.sh 0xaf`
result=$?
ntests=`expr $ntests + 1`
if [[ $result == "1" ]]; then
    echo "FAILED"
    nfails=`expr $nfails + 1`
else
    echo "PASSED"
fi

report=`./isnumber.sh "(1,2,3)"`
result=$?
ntests=`expr $ntests + 1`
if [[ $result == "1" ]]; then
    echo "FAILED"
    nfails=`expr $nfails + 1`
else
    echo "PASSED"
fi

# quoted so the shell does not treat the brackets as a filename pattern
report=`./isnumber.sh "[1,2,3]"`
result=$?
ntests=`expr $ntests + 1`
if [[ $result == "1" ]]; then
    echo "FAILED"
    nfails=`expr $nfails + 1`
else
    echo "PASSED"
fi

# quoted so the string with a space is passed as a single argument
report=`./isnumber.sh "{1,2, 3}"`
result=$?
ntests=`expr $ntests + 1`
if [[ $result == "1" ]]; then
    echo "FAILED"
    nfails=`expr $nfails + 1`
else
    echo "PASSED"
fi

report=`./isnumber.sh 12i`
result=$?
ntests=`expr $ntests + 1`
if [[ $result == "1" ]]; then
    echo "FAILED"
    nfails=`expr $nfails + 1`
else
    echo "PASSED"
fi

report=`./isnumber.sh 12.i`
result=$?
ntests=`expr $ntests + 1`
if [[ $result == "1" ]]; then
    echo "FAILED"
    nfails=`expr $nfails + 1`
else
    echo "PASSED"
fi

report=`./isnumber.sh .12i`
result=$?
ntests=`expr $ntests + 1`
if [[ $result == "1" ]]; then
    echo "FAILED"
    nfails=`expr $nfails + 1`
else
    echo "PASSED"
fi

report=`./isnumber.sh 1.2i`
result=$?
ntests=`expr $ntests + 1`
if [[ $result == "1" ]]; then
    echo "FAILED"
    nfails=`expr $nfails + 1`
else
    echo "PASSED"
fi

echo " "
echo "SUMMARY"
echo "---------------------------------"
echo "FAILED $nfails OF $ntests TESTS"
if [[ $nfails == 0 ]]; then
    echo "PASSED ALL TESTS!!!!"
fi

# the end
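The tester repeats the same few lines for every case. As a possible alternative, not the script in the repository, the repeated stanza can be folded into a helper function. The name check_exit is hypothetical, and the calls below use the standard true and false commands as stand-ins so that the sketch runs without isnumber.sh.

```bash
#!/bin/bash
# Sketch: fold the repeated test stanza into one helper.
# check_exit (a hypothetical name) runs a command, compares its exit code
# with the expected one, and keeps the running totals.

ntests=0
nfails=0

check_exit() {   # usage: check_exit <expected_exit_code> <command> [args...]
    local expected=$1
    local result
    shift
    "$@" > /dev/null 2>&1
    result=$?    # capture immediately, before any other command runs
    ntests=$((ntests + 1))
    if [[ $result -eq $expected ]]; then
        echo "PASSED"
    else
        echo "FAILED: $* (exit code $result, expected $expected)"
        nfails=$((nfails + 1))
    fi
}

# Stand-in commands; with the real script a call would look like
#   check_exit 0 ./isnumber.sh "{1,2, 3}"
check_exit 0 true
check_exit 1 false
check_exit 0 false    # deliberately wrong expectation, reported as FAILED

echo "FAILED $nfails OF $ntests TESTS"
```

Comparing against an explicit expected code also means a failing case names the command that failed, which the bare PASSED/FAILED lines do not.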
Example Successful Output from Tester
This is an example of the output when the tests are all passed.
$ ./test_isnumber.sh
NOT A NUMBER TESTS
PASSED
PASSED
PASSED
PASSED
NUMBER TESTS
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED
PASSED

SUMMARY
---------------------------------
FAILED 0 OF 19 TESTS
PASSED ALL TESTS!!!!
© 2015 John F. McGowan
About the Author
John F. McGowan, Ph.D. solves problems using mathematics and mathematical software, including developing gesture recognition for touch devices, video compression and speech recognition technologies. He has extensive experience developing software in C, C++, MATLAB, Python, Visual Basic and many other programming languages. He has been a Visiting Scholar at HP Labs developing computer vision algorithms and software for mobile devices. He has worked as a contractor at NASA Ames Research Center involved in the research and development of image and video processing algorithms and technology. He has published articles on the origin and evolution of life, the exploration of Mars (anticipating the discovery of methane on Mars), and cheap access to space. He has a Ph.D. in physics from the University of Illinois at Urbana-Champaign and a B.S. in physics from the California Institute of Technology (Caltech). He can be reached at jmcgowan11@earthlink.net.
Don’t overlook the use of “j” as the square root of -1 by engineers. Sometimes the j precedes the number, as in 3 + j4.
Very long script.
I wrote about Linux regex before. Regex reduces time and effort.
Thanks for the article.
Regards,